# ARIMA 

## Overview
**ARIMA** (AutoRegressive Integrated Moving Average) is a statistical model used for time series forecasting. It combines three components:

  1. **AR (AutoRegressive)** - the current value depends linearly on its own past values
  2. **I (Integrated)** - the series is differenced to make it stationary
  3. **MA (Moving Average)** - the current value depends on past forecast errors

In short, ARIMA forecasts future values from the linear influence of past values and past errors on the present value.  

# Packages to install
conda install conda-forge::scikit-learn  
conda install -c conda-forge pmdarima  

`root_mean_squared_error` (imported below) requires scikit-learn 1.4 or later.

# Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import (
    mean_absolute_error,
    root_mean_squared_error,
    mean_absolute_percentage_error)
from sklearn.model_selection import ParameterGrid

# Load and check stationarity 

The AR and MA components of ARIMA assume a **stationary** time series: its mean, variance, and autocorrelation structure do not change over time.  

We use the **Augmented Dickey-Fuller (ADF)** test to check whether the time series is stationary.  
- H0 (null hypothesis): the time series has a unit root, i.e. it is NOT stationary.  
- H1 (alternative hypothesis): the time series is stationary.  
- If **p-value <= 0.05**, reject H0 and treat the time series as stationary.

In [None]:
#import the "AUS monthly beer production" dataset and set the index to time 


In [None]:
# plot the data to check if the time series is stationary - use seasonal_decompose



In [None]:
# check stationarity with ADF test


# ARIMA model

In statsmodels, the order of an ARIMA model is written as (p, d, q), where:
- **p** = order of the autoregressive part, i.e. the number of lagged observations that directly influence the current value.   
It is identified from the **PACF** plot.  

- **d** = degree of differencing, i.e. the number of times the series is differenced to make it stationary. Usually: 0 ≤ d ≤ 2  
  
- **q** = order of the moving average part, i.e. the number of lagged forecast errors in the model.  
It is identified from the **ACF** plot.  
   

The `pmdarima` package can search for the ARIMA parameters automatically: [auto_arima](https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.auto_arima.html)

**Key assumptions**:
1. The time series can be made stationary through differencing
2. Past values and past errors contain information useful for forecasting
3. The relationship between past and present is linear
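
For reference, the (p, d, q) notation can be written out as an equation. The symbols are assumptions introduced here, not taken from the notebook: $B$ is the backshift operator ($B y_t = y_{t-1}$), $y'_t$ is the differenced series, $c$ is a constant, $\phi_i$ and $\theta_j$ are the AR and MA coefficients, and $\varepsilon_t$ is a white-noise error term.

$$
y'_t = (1 - B)^d \, y_t, \qquad
y'_t = c + \sum_{i=1}^{p} \phi_i \, y'_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t
$$

The right-hand side is linear in past values and past errors, which is the linearity assumption listed above.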

## Train/Test Split

We split the data into 70% training and 30% testing sets.
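
A chronological 70/30 split could look like the sketch below. The series is synthetic and the variable names are placeholders; the key point is that time series are split by position, without shuffling.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (an assumption for illustration)
index = pd.date_range("2000-01-01", periods=100, freq="MS")
series = pd.Series(np.arange(100, dtype=float), index=index)

# Chronological 70/30 split: no shuffling, the test set is the most recent 30%
split_point = int(len(series) * 0.7)
train = series.iloc[:split_point]
test = series.iloc[split_point:]

print(len(train), len(test))
print(train.index.max() < test.index.min())  # every training date precedes the test dates
```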

In [None]:
# split the dataset in training and testing


## Differencing

If the series is not stationary, we apply **differencing** to remove trends.

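**Important**: Difference only the training data, and use the differenced series only for diagnostics (stationarity check, ACF/PACF); this avoids leaking information from the test set. The ARIMA model handles differencing internally via the `d` parameter.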
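
As a minimal sketch of first differencing, assuming a synthetic training series with a linear trend (the names `train` and `train_diff` are placeholders):

```python
import numpy as np
import pandas as pd

# Synthetic training series with a linear trend (an assumption for illustration)
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", periods=60, freq="MS")
train = pd.Series(2.0 * np.arange(60) + rng.normal(scale=1.0, size=60), index=index)

# First difference: y'_t = y_t - y_{t-1}; the first value is NaN and is dropped
train_diff = train.diff().dropna()

print(f"Mean before differencing: {train.mean():.2f}")
print(f"Mean after differencing:  {train_diff.mean():.2f}")
```

After differencing, the trend is gone and the values fluctuate around the slope of the original trend (about 2 in this synthetic case).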

In [None]:
# First difference


# Check stationarity after differencing - use function "check stationarity"


## Identify ARIMA parameters p and q  
- p = order of the autoregressive part (number of lags) 
- d = degree of differencing  
- q = order of the moving average part

We use ACF and PACF plots on the **differenced** data to identify the `p` and `q` parameters:

| Plot | Identifies | How to read |
|------|------------|-------------|
| **ACF** | **q** (MA order) | Count significant lags before cutoff |
| **PACF** | **p** (AR order) | Count significant lags before cutoff |



**How to read the plots**:
- Count significant lags (bars exceeding the blue shaded confidence interval)
- The lag where the plot cuts off suggests the order

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf


# q - ACF
plt.figure(figsize=(12,6))
plt.subplot(1, 2, 1)


#p - PACF
plt.subplot(1,2,2)



## Fit ARIMA model and check summary of the model

Based on the ACF/PACF analysis, we select the parameters (p, d, q) and fit the model.

**Important**: Pass the **original** (non-differenced) training data to the ARIMA model. The model handles differencing internally via the `d` parameter.

**Model results**  
In statsmodels, the ARIMA model is a special case of the more general 
**SARIMAX** (Seasonal ARIMA with eXogenous variables) model, which is why the summary table is titled "SARIMAX Results".

The SARIMAX model covers four models: 
- ARIMA
- SARIMA
- ARIMAX
- SARIMAX

In [None]:
# fit ARIMA model


# Display summary of the model


## Forecasting and model evaluation

We evaluate the model using common metrics:

| Metric | Description | How to interpret |
|--------|-------------|------------------|
| **MAE** | Mean Absolute Error - average absolute difference | Lower = better. Easy to understand as "average error" |
| **RMSE** | Root Mean Squared Error - penalizes larger errors | Lower = better. If RMSE >> MAE, large outlier errors exist |
| **MAPE** | Mean Absolute Percentage Error - average error as a percentage of the actual value | Lower = better. Rule of thumb: <10% = excellent, 10-20% = good, 20-50% = acceptable, >50% = poor |
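
To make the three metrics concrete, here is a small worked example with made-up numbers (the values are purely illustrative), computed directly with NumPy:

```python
import numpy as np

# Toy actuals and forecasts (made-up numbers for illustration)
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 360.0])

errors = y_true - y_pred                 # [-10, 10, 40]
mae = np.mean(np.abs(errors))            # (10 + 10 + 40) / 3 = 20
rmse = np.sqrt(np.mean(errors ** 2))     # sqrt((100 + 100 + 1600) / 3) ≈ 24.49
mape = np.mean(np.abs(errors / y_true))  # (0.10 + 0.05 + 0.10) / 3 ≈ 0.0833

print(f"MAE:  {mae:.2f}")          # MAE:  20.00
print(f"RMSE: {rmse:.2f}")         # RMSE: 24.49
print(f"MAPE: {100 * mape:.2f} %")  # MAPE: 8.33 %
```

The single large error (40) pulls RMSE above MAE, which is the pattern the table describes.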


In [None]:
# Forecasting
# Forecast as many steps as there are observations in the test set (66)


In [None]:
# Function to plot results and evaluate model performance
def model_assessment(train, test, predictions, chart_title):
  """Plot train/test/prediction series and print common error metrics.

  Args:
      train (pd.Series): In-sample target used to fit the model.
      test (pd.Series): Holdout target used to evaluate the forecast.
      predictions (pd.Series or np.ndarray): Forecast aligned to `test` index.
      chart_title (str): Short label for the chart (e.g., model name).

  Notes
  -----
  * MAE = mean(|y - ŷ|)
  * RMSE = sqrt(mean((y - ŷ)^2))
  * MAPE = mean(|(y - ŷ)/y|). Use with care when y can be 0.
  """
  # Set the size of the plot to 10 inches by 4 inches
  plt.figure(figsize = (10,4))
  # Plot the train, test, and forecast data
  plt.plot(train, label = 'Train')
  plt.plot(test, label = 'Test')
  plt.plot(predictions, label = "Forecast")
  # add title and legend to the plot
  plt.title(f"Train, Test and Predictions with {chart_title}")
  plt.legend()
  plt.show()

  # Calculating the MAE, RMSE, and MAPE
  mae = mean_absolute_error(test, predictions)
  rmse = root_mean_squared_error(test, predictions)
  mape = mean_absolute_percentage_error(test, predictions)

  # Print the calculated error metrics
  print(f"The MAE is {mae:.2f}")
  print(f"The RMSE is {rmse:.2f}")
  print(f"The MAPE is {100 * mape:.2f} %")

In [None]:
# model_assessment

## Summary

### ARIMA Workflow:

1. **Load and visualize** the time series data
2. **Split** into training (70%) and testing (30%) sets
3. **Check stationarity** using ADF test
4. **Difference** the training data if not stationary
5. **Analyze ACF/PACF** on differenced data to find p and q
6. **Fit ARIMA** model on original training data with (p, d, q)
7. **Forecast** and evaluate the model

### Key Points:

- Only difference the **training data** for ACF/PACF analysis
- Pass **original data** to ARIMA model (it handles differencing via `d`)
- Use **PACF** for AR order (p) and **ACF** for MA order (q)
- When comparing candidate models fitted to the same data, lower **AIC/BIC** indicates a better trade-off between fit and complexity

# SARIMA model
SARIMA extends the ARIMA model by adding a seasonal component, described by a seasonal order (P, D, Q, s) where s is the length of the seasonal cycle.