
In [1]:
!pip install pmdarima

Collecting pmdarima
  Downloading pmdarima-2.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl.metadata (7.8 kB)
Downloading pmdarima-2.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (2.1 MB)
Installing collected packages: pmdarima
Successfully installed pmdarima-2.0.4


In [2]:
from pmdarima.arima import auto_arima

In [3]:
help(auto_arima)

Help on function auto_arima in module pmdarima.arima.auto:

    Automatically discover the optimal order for an ARIMA model.
    
    The auto-ARIMA process seeks to identify the most optimal
    parameters for an ``ARIMA`` model, settling on a single fitted ARIMA model.
    This process is based on the commonly-used R function,
    ``forecast::auto.arima`` [3].
    
    Auto-ARIMA works by conducting differencing tests (i.e.,
    Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller or
    Phillips–Perron) to determine the order of differencing, ``d``, and then
    fitting models within ranges of defined ``start_p``, ``max_p``,
    ``start_q``, ``max_q`` ranges. If the ``seasonal`` optional is enabled,
    auto-ARIMA also seeks to identify the optimal ``P`` and ``Q`` hyper-
    parameters after conducting the Canova-Hansen to determine the optimal
    order of seasonal differencing, ``D``.
    
    In order to find the best model, auto-ARIMA optimizes for a given
    ``informatio

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Let's look first at the stationary, non-seasonal <strong>Daily Female Births</strong> dataset:

In [6]:
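# Load the stationary dataset, parsing the Date column into a DatetimeIndex
# so that the daily frequency can actually be set on the index
df2 = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-total-female-births.csv', index_col='Date', parse_dates=True)
df2.index.freq = 'D'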

In [12]:
auto_arima(df2['Births'])

In [8]:
auto_arima(df2['Births'], error_action='ignore').summary()

Dep. Variable:     y                  No. Observations:   365
Model:             SARIMAX(1, 1, 1)   Log Likelihood      -1226.537
Date:              Fri, 23 Aug 2024   AIC                 2459.074
Time:              05:28:18           BIC                 2470.766
Sample:            0 - 365            HQIC                2463.721
Covariance Type:   opg

          coef      std err   z         P>|z|   [0.025    0.975]
ar.L1     0.1252    0.060     2.097     0.036   0.008     0.242
ma.L1     -0.9624   0.017     -56.429   0.000   -0.996    -0.929
sigma2    49.1512   3.250     15.122    0.000   42.781    55.522

Ljung-Box (L1) (Q):       0.04   Jarque-Bera (JB):   25.33
Prob(Q):                  0.84   Prob(JB):           0.00
Heteroskedasticity (H):   0.96   Skew:               0.57
Prob(H) (two-sided):      0.81   Kurtosis:           3.60


This shows a recommended (p,d,q) ARIMA Order of (1,1,1), with no seasonal_order component.
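
Written out, and assuming the statsmodels sign convention in which MA terms are added on the right-hand side, the fitted model with the coefficients from the summary above is roughly the following ($B$ is the backshift operator, $B y_t = y_{t-1}$; the symbols are introduced here only for illustration):

$$
(1 - \phi_1 B)(1 - B)\,y_t = (1 + \theta_1 B)\,\varepsilon_t,
\qquad \hat{\phi}_1 = 0.1252,\quad \hat{\theta}_1 = -0.9624,\quad \hat{\sigma}^2 = 49.1512
$$

The factor $(1 - B)$ is the single difference ($d = 1$), and there is no constant term because the summary reports none.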

We can see how this was determined by looking at the stepwise results. The recommended order is the one with the lowest <a href='https://en.wikipedia.org/wiki/Akaike_information_criterion'>Akaike information criterion</a> (AIC) score. Note that the recommended model may <em>not</em> be the one with the closest fit: AIC penalizes each additional parameter, so it favors the model that balances fit against complexity and is therefore expected to be the better <em>forecasting</em> model.
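
As a rough check on what the AIC score measures, the sketch below recomputes it from the summary above using the standard definition AIC = 2k − 2·ln(L), where k is the number of estimated parameters. It assumes k = 3 (ar.L1, ma.L1 and sigma2); the helper name `aic` is illustrative only.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Values copied from the SARIMAX(1, 1, 1) summary above.
log_likelihood = -1226.537
n_params = 3  # assumption: ar.L1, ma.L1 and sigma2 are the estimated parameters

score = aic(log_likelihood, n_params)
print(round(score, 3))  # compare with the AIC reported in the summary (2459.074)
```

Each extra parameter adds 2 to the score, so a larger model is only preferred if it raises the log likelihood by more than 1 per added parameter.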

In [16]:
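# Non-seasonal stepwise search; trace=True prints each candidate model with its AIC.
# The seasonal period m is omitted because it is not used when seasonal=False.
stepwise_fit = auto_arima(df2['Births'], start_p=0, start_q=0, max_p=6, max_q=3,
                          seasonal=False, d=None, trace=True,
                          error_action='ignore', suppress_warnings=True, stepwise=True)
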
Performing stepwise search to minimize aic
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=2650.760, Time=0.02 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=2565.234, Time=0.05 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=2463.584, Time=0.14 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=2648.768, Time=0.02 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=2460.154, Time=0.27 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=2461.271, Time=0.37 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=0.66 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=2460.722, Time=0.28 sec
 ARIMA(2,1,0)(0,0,0)[0] intercept   : AIC=2536.154, Time=0.17 sec
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=2463.056, Time=1.87 sec
 ARIMA(1,1,1)(0,0,0)[0]             : AIC=2459.074, Time=0.23 sec
 ARIMA(0,1,1)(0,0,0)[0]             : AIC=2462.221, Time=0.30 sec
 ARIMA(1,1,0)(0,0,0)[0]             : AIC=2563.261, Time=0.10 sec
 ARIMA(2,1,1)(0,0,0)[0]             : AIC=2460.367, Time=0.43 sec
 ARIMA(1,1,2)(0,0,0)[0]             : 
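
The selection rule can be replayed by hand on the rows printed above. The following is a minimal sketch that parses only the complete rows of the trace (the last row is cut off in the output and is left out) and picks the lowest AIC; the regular expression and variable names are illustrative and are not part of pmdarima.

```python
import re

# Complete rows copied from the stepwise trace above.
TRACE = """\
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=2650.760, Time=0.02 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=2565.234, Time=0.05 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=2463.584, Time=0.14 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=2648.768, Time=0.02 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=2460.154, Time=0.27 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=2461.271, Time=0.37 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=inf, Time=0.66 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=2460.722, Time=0.28 sec
 ARIMA(2,1,0)(0,0,0)[0] intercept   : AIC=2536.154, Time=0.17 sec
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=2463.056, Time=1.87 sec
 ARIMA(1,1,1)(0,0,0)[0]             : AIC=2459.074, Time=0.23 sec
 ARIMA(0,1,1)(0,0,0)[0]             : AIC=2462.221, Time=0.30 sec
 ARIMA(1,1,0)(0,0,0)[0]             : AIC=2563.261, Time=0.10 sec
 ARIMA(2,1,1)(0,0,0)[0]             : AIC=2460.367, Time=0.43 sec
"""

ROW = re.compile(
    r"ARIMA\((\d+),(\d+),(\d+)\)\(\d+,\d+,\d+\)\[\d+\]\s*(intercept)?\s*: AIC=([^,]+),"
)

candidates = []
for line in TRACE.splitlines():
    match = ROW.search(line)
    if match is None:
        continue
    p, d, q, intercept, aic_text = match.groups()
    # float("inf") handles the rows where the fit failed and AIC=inf was reported.
    candidates.append((float(aic_text), (int(p), int(d), int(q)), intercept is not None))

# Tuples compare on their first element first, so min() picks the lowest AIC.
best_aic, best_order, best_intercept = min(candidates)
print(best_order, "intercept" if best_intercept else "no intercept", best_aic)
```

Among the rows shown, the lowest score belongs to ARIMA(1,1,1) without an intercept, which matches the model reported in the earlier summary.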

___
Now let's look at the non-stationary, seasonal <strong>Airline Passengers</strong> dataset:

In [17]:
# Load the non-stationary dataset, parsing the first column into a DatetimeIndex
df1 = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv', header=0, index_col=0, parse_dates=True)
df1.index.freq = 'MS'

In [18]:
# Seasonal stepwise search with a 12-month period (m=12) and one seasonal difference (D=1)
stepwise_fit = auto_arima(df1['Passengers'], start_p=1, start_q=1, max_p=3, max_q=3,
                          m=12, start_P=0, seasonal=True, d=None, D=1, trace=True,
                          error_action='ignore', suppress_warnings=True, stepwise=True)
stepwise_fit.summary()

Performing stepwise search to minimize aic
 ARIMA(1,1,1)(0,1,1)[12]             : AIC=1022.896, Time=0.33 sec
 ARIMA(0,1,0)(0,1,0)[12]             : AIC=1031.508, Time=0.05 sec
 ARIMA(1,1,0)(1,1,0)[12]             : AIC=1020.393, Time=0.15 sec
 ARIMA(0,1,1)(0,1,1)[12]             : AIC=1021.003, Time=0.22 sec
 ARIMA(1,1,0)(0,1,0)[12]             : AIC=1020.393, Time=0.08 sec
 ARIMA(1,1,0)(2,1,0)[12]             : AIC=1019.239, Time=0.40 sec
 ARIMA(1,1,0)(2,1,1)[12]             : AIC=inf, Time=2.92 sec
 ARIMA(1,1,0)(1,1,1)[12]             : AIC=1020.493, Time=0.45 sec
 ARIMA(0,1,0)(2,1,0)[12]             : AIC=1032.120, Time=0.26 sec
 ARIMA(2,1,0)(2,1,0)[12]             : AIC=1021.120, Time=0.49 sec
 ARIMA(1,1,1)(2,1,0)[12]             : AIC=1021.032, Time=2.18 sec
 ARIMA(0,1,1)(2,1,0)[12]             : AIC=1019.178, Time=0.98 sec
 ARIMA(0,1,1)(1,1,0)[12]             : AIC=1020.425, Time=0.37 sec
 ARIMA(0,1,1)(2,1,1)[12]             : AIC=inf, Time=2.39 sec
 ARIMA(0,1,1)(1,1,1)[12]     

Dep. Variable:     y                                 No. Observations:   144
Model:             SARIMAX(0, 1, 1)x(2, 1, [], 12)   Log Likelihood      -505.589
Date:              Fri, 23 Aug 2024                  AIC                 1019.178
Time:              05:46:00                          BIC                 1030.679
Sample:            01-01-1949 - 12-01-1960           HQIC                1023.851
Covariance Type:   opg

            coef       std err   z        P>|z|   [0.025    0.975]
ma.L1       -0.3634    0.074     -4.945   0.000   -0.508    -0.219
ar.S.L12    -0.1239    0.090     -1.372   0.170   -0.301    0.053
ar.S.L24    0.1911     0.107     1.783    0.075   -0.019    0.401
sigma2      130.4480   15.527    8.402    0.000   100.016   160.880

Ljung-Box (L1) (Q):       0.01   Jarque-Bera (JB):   4.59
Prob(Q):                  0.92   Prob(JB):           0.10
Heteroskedasticity (H):   2.70   Skew:               0.15
Prob(H) (two-sided):      0.00   Kurtosis:           3.87
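
The P>|z| column of the coefficient table can be reproduced from the z column. The sketch below assumes the usual two-sided test against a standard normal distribution; the helper name `two_sided_p` is illustrative and is not part of pmdarima or statsmodels.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z statistic under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# (z, reported P>|z|) pairs copied from the seasonal model summary above.
rows = {
    "ma.L1": (-4.945, 0.000),
    "ar.S.L12": (-1.372, 0.170),
    "ar.S.L24": (1.783, 0.075),
}

p_values = {name: two_sided_p(z) for name, (z, _) in rows.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.3f}")
```

At the 5% level only ma.L1 is clearly significant. The two seasonal AR terms are not, even though the AIC-based search settled on the model that contains them.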


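## OPTIONAL: statsmodels arma_order_select_ic
statsmodels provides <code>arma_order_select_ic</code>, which fits ARMA(p, q) models over a grid of orders on stationary data and reports the order with the lowest information criterion (BIC by default).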

In [19]:
from statsmodels.tsa.stattools import arma_order_select_ic

In [20]:
help(arma_order_select_ic)

Help on function arma_order_select_ic in module statsmodels.tsa.stattools:

arma_order_select_ic(y, max_ar=4, max_ma=2, ic='bic', trend='c', model_kw=None, fit_kw=None)
    Compute information criteria for many ARMA models.
    
    Parameters
    ----------
    y : array_like
        Array of time-series data.
    max_ar : int
        Maximum number of AR lags to use. Default 4.
    max_ma : int
        Maximum number of MA lags to use. Default 2.
    ic : str, list
        Information criteria to report. Either a single string or a list
        of different criteria is possible.
    trend : str
        The trend to use when fitting the ARMA models.
    model_kw : dict
        Keyword arguments to be passed to the ``ARMA`` model.
    fit_kw : dict
        Keyword arguments to be passed to ``ARMA.fit``.
    
    Returns
    -------
    Bunch
        Dict-like object with attribute access. Each ic is an attribute with a
        DataFrame for the results. The AR order used is the row ind

In [21]:
arma_order_select_ic(df2['Births'])

  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


{'bic':              0            1            2
 0  2502.581666  2494.238838  2494.731532
 1  2490.780320  2484.505387  2486.223525
 2  2491.963247  2485.782753  2491.097242
 3  2496.498625  2491.061564  2493.581550
 4  2501.491895  2496.961067  2498.337798,
 'bic_min_order': (1, 1)}
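
To see where `bic_min_order` comes from, the sketch below re-enters the BIC table from the output above as a plain nested list (rows are the AR order, columns the MA order) and searches for the smallest entry. The variable names are illustrative.

```python
# BIC values copied from the output above: bic[p][q] is the score for ARMA(p, q).
bic = [
    [2502.581666, 2494.238838, 2494.731532],  # AR order 0
    [2490.780320, 2484.505387, 2486.223525],  # AR order 1
    [2491.963247, 2485.782753, 2491.097242],  # AR order 2
    [2496.498625, 2491.061564, 2493.581550],  # AR order 3
    [2501.491895, 2496.961067, 2498.337798],  # AR order 4
]

# Pair every score with its (p, q) position and keep the smallest score.
best_bic, best_order = min(
    (value, (p, q)) for p, row in enumerate(bic) for q, value in enumerate(row)
)
print(best_order, best_bic)
```

The smallest entry sits at row 1, column 1, which is the (1, 1) reported as `bic_min_order`. Note that this search is over ARMA models on the series as given, whereas `auto_arima` above first differenced the series once, so the two tools are not searching the same set of models.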