# Python Assignment: Statistical vs. Neural Models for Time Series Forecasting

This assignment challenges you to compare the performance of a traditional statistical time series model (SARIMA) against a deep learning model (LSTM) on the same dataset. You will go through data preparation, model building, training, and evaluation for both approaches, finally providing a comparative analysis of their strengths and weaknesses.

## Part 1: Data Acquisition and Preparation (30 points)

We will use the 'Daily Minimum Temperatures in Melbourne, Australia, 1981-1990' dataset. This is a classic univariate time series with a strong annual seasonal cycle, which makes it a convenient benchmark for both types of models.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
import tensorflow as tf  # needed for tf.random.set_seed below
from tensorflow import keras
from tensorflow.keras import layers
import warnings

warnings.filterwarnings('ignore') # Suppress warnings for cleaner output
np.random.seed(42) # for reproducibility
tf.random.set_seed(42)

# 1.1 Load and Inspect Data
#    The dataset can be found at: https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-minimum-temperatures.csv
#    Load it directly, parse dates, and set 'Date' as the index.
#    Make sure the 'Temp' column is numeric.

url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-minimum-temperatures.csv'

print("\n--- Loading Daily Minimum Temperatures Dataset ---")
# TODO: Load the dataset
# df = pd.read_csv(url, header=0, index_col=0, parse_dates=True).squeeze("columns")  # read_csv's squeeze= argument was removed in pandas 2.0
# df.index.name = 'Date'
# df.name = 'Temperature'
# df = df.astype(float)

print("Data Head:\n", df.head())
print("\nData Info:")
df.info()

# 1.2 Visualize the Time Series
#    Plot the entire time series to observe trends, seasonality, and any anomalies.

plt.figure(figsize=(15, 6))
# TODO: Plot the time series
# df.plot()
# plt.title('Daily Minimum Temperatures (1981-1990)')
# plt.xlabel('Date')
# plt.ylabel('Temperature (Celsius)')
# plt.grid(True)
# plt.show()

# 1.3 Split Data into Training and Testing Sets
#    Split the data into training and testing sets, maintaining temporal order.
#    Use the first ~8 years (e.g., up to '1988-12-31') for training and the remaining for testing.

train_end_date = '1988-12-31'
train_data = df.loc[:train_end_date].copy()
test_data = df.loc[df.index > train_end_date].copy() # Starts at 1989-01-01, so it does not overlap train_data

print(f"\nTraining data points: {len(train_data)}")
print(f"Test data points: {len(test_data)}")

# 1.4 Data Preparation for Neural Network (Scaling and Sequencing)
#    - **Scaling:** Apply `MinMaxScaler` to the *training data only* and transform both training and testing data.
#    - **Sequencing:** Create input-output sequences for the LSTM.
#      Use `n_steps_in` past observations to predict `n_steps_out` future observations.

print("\n--- Preparing Data for Neural Network ---")
# TODO: Scale data
# scaler = MinMaxScaler(feature_range=(0, 1))
# scaled_train_data = scaler.fit_transform(train_data.values.reshape(-1, 1))
# scaled_test_data = scaler.transform(test_data.values.reshape(-1, 1))

def create_sequences(data, n_steps_in, n_steps_out):
    X, y = [], []
    for i in range(len(data) - n_steps_in - n_steps_out + 1):
        # TODO: Define input sequence (X_seq)
        # X_seq = data[i:(i + n_steps_in), 0]
        # TODO: Define output sequence (y_seq)
        # y_seq = data[(i + n_steps_in):(i + n_steps_in + n_steps_out), 0]

        # X.append(X_seq)
        # y.append(y_seq)
    return np.array(X), np.array(y)

n_steps_in = 30 # Use 30 past days to predict
n_steps_out = 1 # Predict next 1 day

# Create sequences for training and testing
X_train_seq, y_train_seq = create_sequences(scaled_train_data, n_steps_in, n_steps_out)
X_test_seq, y_test_seq = create_sequences(scaled_test_data, n_steps_in, n_steps_out)

# Reshape X for LSTM input: (samples, timesteps, features)
X_train_seq = X_train_seq.reshape(X_train_seq.shape[0], X_train_seq.shape[1], 1)
X_test_seq = X_test_seq.reshape(X_test_seq.shape[0], X_test_seq.shape[1], 1)

print(f"NN Train set: {X_train_seq.shape}, {y_train_seq.shape}")
print(f"NN Test set: {X_test_seq.shape}, {y_test_seq.shape}")
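
The sequencing step in 1.4 can be checked on a toy array before applying it to the temperature data. The following is a minimal sketch, assuming a hypothetical helper named `make_windows` that follows the same sliding-window logic as `create_sequences`; the toy values are illustrative only.

```python
import numpy as np

def make_windows(series, n_steps_in, n_steps_out):
    """Slice a 2-D (n_samples, 1) array into overlapping input/target windows."""
    X, y = [], []
    for i in range(len(series) - n_steps_in - n_steps_out + 1):
        X.append(series[i:i + n_steps_in, 0])                               # past observations
        y.append(series[i + n_steps_in:i + n_steps_in + n_steps_out, 0])    # values to predict
    return np.array(X), np.array(y)

# Toy series 0.0 ... 9.0, shaped like the output of a scaler: (n_samples, 1)
toy = np.arange(10, dtype=float).reshape(-1, 1)
X_toy, y_toy = make_windows(toy, n_steps_in=3, n_steps_out=1)

# Add the trailing feature axis expected by an LSTM: (samples, timesteps, features)
X_toy_lstm = X_toy.reshape(X_toy.shape[0], X_toy.shape[1], 1)

print(X_toy.shape, y_toy.shape, X_toy_lstm.shape)
print(X_toy[0], y_toy[0])
```

With 10 observations, 3 input steps, and 1 output step, the loop yields 10 - 3 - 1 + 1 = 7 windows, and the first target is the value that immediately follows the first input window.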


## Part 2: Statistical Model: SARIMA (30 points)

You will build, train, and forecast using a Seasonal Autoregressive Integrated Moving Average (SARIMA) model.

In [None]:
print("\n--- Training SARIMA Model ---")

# 2.1 Stationarity Test (Optional but good practice)
#    Perform an Augmented Dickey-Fuller test on the training data to check for stationarity.
#    (You might need to difference the data if it's not stationary, though SARIMA handles differencing internally with 'd' and 'D').

# result = adfuller(train_data)
# print('ADF Statistic: %f' % result[0])
# print('p-value: %f' % result[1])
# print('Critical Values:')
# for key, value in result[4].items():
#     print('\t%s: %.3f' % (key, value))

# 2.2 Determine SARIMA Orders (p,d,q)(P,D,Q,s)
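#    The data are daily with an annual cycle, so the natural seasonal period is s = 365.
#    For simplicity, we provide typical orders; in a real scenario you would inspect ACF/PACF plots
#    or run an automated search such as `auto_arima` from the `pmdarima` package.
#    A common starting point for this dataset is (1,1,1)(1,1,1,365).
#    Note: with s = 365 the state-space model behind SARIMAX becomes very large, so fitting may be
#    slow and memory-hungry. If it is impractical on your machine, one option is to resample to
#    weekly or monthly means (s = 52 or s = 12) and document that choice in your analysis.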

order = (1, 1, 1)      # (p, d, q)
seasonal_order = (1, 1, 1, 365) # (P, D, Q, s)

# 2.3 Initialize and Fit SARIMA Model
#    Fit the SARIMA model on the `train_data`.

# TODO: Initialize and fit SARIMA model
# sarima_model = SARIMAX(train_data, order=order, seasonal_order=seasonal_order, enforce_stationarity=False, enforce_invertibility=False)
# sarima_results = sarima_model.fit(disp=False)
# print(sarima_results.summary())

# 2.4 Make Forecasts
#    Generate predictions for the entire `test_data` period.

start_index = len(train_data)
end_index = len(df) - 1 # Forecast up to the end of the original DataFrame

# TODO: Make predictions
# sarima_predictions = sarima_results.predict(start=start_index, end=end_index)

# Align predictions with test_data index
# sarima_predictions.index = test_data.index
print("SARIMA Predictions Head:\n", sarima_predictions.head())

# 2.5 Evaluate SARIMA Performance
#    Calculate RMSE and MAE for the SARIMA model on the test set.

rmse_sarima = np.sqrt(mean_squared_error(test_data, sarima_predictions))
mae_sarima = mean_absolute_error(test_data, sarima_predictions)

print(f"\nSARIMA RMSE: {rmse_sarima:.4f}")
print(f"SARIMA MAE: {mae_sarima:.4f}")

# 2.6 Plot SARIMA Forecast
#    Plot the actual test data alongside the SARIMA predictions.

plt.figure(figsize=(15, 7))
# TODO: Plot SARIMA forecast
# plt.plot(train_data.index, train_data, label='Training Data', color='gray')
# plt.plot(test_data.index, test_data, label='Actual Test Data', color='blue')
# plt.plot(sarima_predictions.index, sarima_predictions, label='SARIMA Predictions', color='red', linestyle='--')
# plt.title('SARIMA Model: Actual vs. Predicted Temperatures')
# plt.xlabel('Date')
# plt.ylabel('Temperature (Celsius)')
# plt.legend()
# plt.grid(True)
# plt.show()
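
Section 2.1 notes that SARIMA applies differencing internally through `d` and `D`. As a rough illustration of what those two operations do, the sketch below differences a synthetic series by hand with NumPy. The series, the toy seasonal period `s = 7`, and the variable names are assumptions made for the example; the assignment itself uses `s = 365` on the temperature data.

```python
import numpy as np

s = 7                      # toy seasonal period (assumption; the assignment uses 365)
t = np.arange(10 * s)      # 70 time steps
series = 0.5 * t + 3.0 * np.sin(2 * np.pi * t / s)   # linear trend + seasonal cycle

# D = 1: seasonal differencing, y_t - y_{t-s}. The periodic term cancels and
# the linear trend collapses to the constant 0.5 * s.
seasonal_diff = series[s:] - series[:-s]

# d = 1 applied on top: ordinary first differencing, y_t - y_{t-1}.
# The remaining constant disappears as well.
both_diff = np.diff(seasonal_diff)

print(np.allclose(seasonal_diff, 0.5 * s))
print(np.allclose(both_diff, 0.0))
```

On real data the differenced series will not be exactly constant, which is why the ADF test in 2.1 is a useful check on whether the chosen `d` and `D` are sufficient.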


## Part 3: Neural Model: LSTM (30 points)

You will build, train, and forecast using an LSTM neural network on the same dataset.

In [None]:
print("\n--- Training LSTM Model ---")

# 3.1 Define the LSTM Architecture
#    Create a `tf.keras.Sequential` model whose first layer takes `input_shape=(X_train_seq.shape[1], 1)`, i.e. (timesteps, features).
#    Include at least one `LSTM` layer, potentially `Dropout`, and a final `Dense` layer.

# TODO: Build the Sequential LSTM model
# model_lstm = keras.Sequential([
#     layers.LSTM(50, activation='relu', input_shape=(X_train_seq.shape[1], 1)),
#     layers.Dropout(0.2),
#     layers.Dense(n_steps_out)
# ])

# 3.2 Compile the Model
#    Use `'adam'` optimizer and `'mse'` loss, monitoring `['mae']`.

# TODO: Compile the model
# model_lstm.compile(optimizer='adam', loss='mse', metrics=['mae'])

# 3.3 Display Model Summary
model_lstm.summary()

# 3.4 Train the Model
#    Train the LSTM model on `X_train_seq` and `y_train_seq`.
#    Use a `validation_split` and consider `EarlyStopping`.

epochs = 50
batch_size = 64

# early_stopping_lstm = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

print(f"\n--- Training LSTM for {epochs} epochs ---")
# TODO: Train the model
# history_lstm = model_lstm.fit(X_train_seq, y_train_seq,
#                               epochs=epochs,
#                               batch_size=batch_size,
#                               validation_split=0.1,
#                               # callbacks=[early_stopping_lstm], # Uncomment if using early stopping
#                               verbose=1)

# 3.5 Make Predictions and Inverse Transform
#    Generate predictions on `X_test_seq` and inverse transform them using the `scaler`.
#    Also, inverse transform `y_test_seq` for comparison.

print("\n--- Making LSTM Predictions and Inverse Transforming ---")
# TODO: Make predictions
# y_pred_lstm_scaled = model_lstm.predict(X_test_seq)

# TODO: Inverse transform
# y_pred_lstm = scaler.inverse_transform(y_pred_lstm_scaled)
# y_test_actual_lstm = scaler.inverse_transform(y_test_seq)

# Build the date index for the LSTM predictions.
# Note: the first LSTM target falls `n_steps_in` days *after* the raw test_data starts
# (the first window is consumed as input), so the matching dates are sliced from test_data.index.
# Slicing the existing index keeps dates and predictions aligned even if the calendar has gaps.
lstm_prediction_dates = test_data.index[n_steps_in:n_steps_in + len(y_pred_lstm)]

print("LSTM Predictions Head (original scale):\n", pd.Series(y_pred_lstm.flatten()[:5], index=lstm_prediction_dates[:5]))

# 3.6 Evaluate LSTM Performance
#    Calculate RMSE and MAE for the LSTM model on the original scale.

rmse_lstm = np.sqrt(mean_squared_error(y_test_actual_lstm, y_pred_lstm))
mae_lstm = mean_absolute_error(y_test_actual_lstm, y_pred_lstm)

print(f"\nLSTM RMSE: {rmse_lstm:.4f}")
print(f"LSTM MAE: {mae_lstm:.4f}")

# 3.7 Plot LSTM Forecast
#    Plot the actual test data (aligned with LSTM predictions) alongside the LSTM predictions.

plt.figure(figsize=(15, 7))
# TODO: Plot LSTM forecast
# plt.plot(train_data.index, train_data, label='Training Data', color='gray')
# plt.plot(test_data.index, test_data, label='Actual Test Data', color='blue')
# plt.plot(lstm_prediction_dates, y_pred_lstm, label='LSTM Predictions', color='green', linestyle='--')
# plt.title('LSTM Model: Actual vs. Predicted Temperatures')
# plt.xlabel('Date')
# plt.ylabel('Temperature (Celsius)')
# plt.legend()
# plt.grid(True)
# plt.show()
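
Steps 1.4 and 3.5 rely on scaling with statistics from the training data only and then mapping predictions back to degrees Celsius. The sketch below reproduces that arithmetic with plain NumPy, assuming the default `feature_range=(0, 1)`; the helper names `scale` and `unscale` and the toy values are hypothetical, and in the notebook itself `MinMaxScaler` does this work.

```python
import numpy as np

train = np.array([2.0, 4.0, 6.0, 10.0])   # toy "training" temperatures
test = np.array([1.0, 8.0, 12.0])         # toy "test" temperatures

# "Fit" on the training data only: remember its minimum and maximum
lo, hi = train.min(), train.max()

def scale(x):
    """Map values to the training range, so train spans exactly [0, 1]."""
    return (x - lo) / (hi - lo)

def unscale(z):
    """Invert the mapping to recover the original units."""
    return z * (hi - lo) + lo

scaled_test = scale(test)
restored = unscale(scaled_test)

print(scaled_test)   # test values may fall outside [0, 1]
print(restored)      # round trip recovers the original test values
```

Because the minimum and maximum come from the training set, test values below or above that range map outside [0, 1]. That is expected, and it avoids leaking test-set information into the model.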


## Part 4: Comparative Analysis (10 points)

Synthesize your findings and compare the two models.

### Your Analysis:

1.  **Tabulate and compare the RMSE and MAE results for both SARIMA and LSTM models.**

    | Model | RMSE | MAE |
    |-------|------|-----|
    | SARIMA| _(Your RMSE)_ | _(Your MAE)_ |
    | LSTM  | _(Your RMSE)_ | _(Your MAE)_ |

2.  **Based on the quantitative metrics and visual plots, which model performed better for this dataset? Briefly explain why you think this might be the case.**

    _(Your explanation here)_

3.  **Discuss the strengths and weaknesses of SARIMA compared to LSTM for time series forecasting, based on your experience in this assignment.**

    * **SARIMA Strengths:** _(Your points here)_
    * **SARIMA Weaknesses:** _(Your points here)_
    * **LSTM Strengths:** _(Your points here)_
    * **LSTM Weaknesses:** _(Your points here)_

4.  **In what real-world scenarios might you prefer a SARIMA model over an LSTM, even if the LSTM yields slightly better metrics? And vice versa?**

    * **Prefer SARIMA when:** _(Your scenarios here)_
    * **Prefer LSTM when:** _(Your scenarios here)_
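
When filling in the table for question 1, keep in mind that the LSTM predictions start `n_steps_in` days into the test period, while the SARIMA forecast covers all of it. One way to make the comparison like-for-like is to score both models on the dates they share. The sketch below shows the idea on toy numbers; the helper functions, labels, and values are all hypothetical and are not results from this dataset.

```python
import numpy as np
import pandas as pd

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

dates = pd.date_range("1989-01-01", periods=6, freq="D")
actual = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 14.0], index=dates)

# Toy forecast covering the whole test window
preds_full = pd.Series([11.0, 12.0, 10.0, 15.0, 12.0, 13.0], index=dates)
# Toy forecast that starts two days later, like a windowed model
preds_late = pd.Series([11.0, 12.0, 13.0, 14.0], index=dates[2:])

# Score both forecasts on the dates they have in common
common = preds_full.index.intersection(preds_late.index)
table = pd.DataFrame(
    {
        "RMSE": [rmse(actual[common], preds_full[common]),
                 rmse(actual[common], preds_late[common])],
        "MAE": [mae(actual[common], preds_full[common]),
                mae(actual[common], preds_late[common])],
    },
    index=["full-window toy", "late-start toy"],
)
print(table.round(4))
```

Even on a shared date range, a multi-step forecast issued once from the end of the training data is a harder task than one-step-ahead prediction from observed history, so it is worth stating which setting each number comes from.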


## Part 5: Reflection and Advanced Considerations (5 points)

Think about the broader implications and next steps.

### Your Answers to Reflection Questions:

1.  **Hyperparameter Tuning:** Both SARIMA and LSTMs have numerous hyperparameters. Briefly describe how you would approach tuning hyperparameters for each model type to potentially optimize their performance.

    * **SARIMA Tuning:** _(Your approach here)_
    * **LSTM Tuning:** _(Your approach here)_

2.  **Feature Engineering:** This assignment used only the target variable. How could you incorporate additional features (e.g., day of week, month, holidays, external weather forecasts) into both SARIMA and LSTM models to potentially improve their predictive power?

    * **SARIMA with Exogenous Regressors:** _(Your explanation here)_
    * **LSTM with Multivariate Inputs:** _(Your explanation here)_

3.  **Computational Resources:** Briefly comment on the relative computational resources (time, memory) required to train and forecast with a SARIMA model versus a deep LSTM model for this dataset size.

    _(Your answer here)_
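
As a starting point for question 2, calendar features can be derived from the date index alone. The sketch below is one possible encoding, assuming a sine/cosine transform of the day of year so that late December and early January end up close together; the column names and the 365.25-day cycle length are illustrative choices, not requirements of the assignment.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("1989-01-01", periods=5, freq="D")
day_of_year = dates.dayofyear.to_numpy()

features = pd.DataFrame(
    {
        # Cyclical encoding of the position within the year
        "doy_sin": np.sin(2 * np.pi * day_of_year / 365.25),
        "doy_cos": np.cos(2 * np.pi * day_of_year / 365.25),
        # Monday = 0 ... Sunday = 6
        "dayofweek": dates.dayofweek,
    },
    index=dates,
)
print(features.shape)
print(features.columns.tolist())
```

A frame like this could be passed to `SARIMAX` through its `exog` argument, or stacked next to the scaled temperature so that the LSTM input has shape (samples, timesteps, n_features) with more than one feature.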


## Deliverables:

1.  This completed Jupyter Notebook (`statistical_vs_neural_assignment.ipynb`) with all code cells executed and reflection questions answered.
2.  Ensure all plots are clearly visible and well-labeled within the notebook.