## Auto ARIMA method
#####  Pros: 
 1) Saves time
 
 2) Removes Ambiguity
 
 3) Reduces risk of human error
#####  Cons:
 1) Blindly putting our faith in one criterion
 
 2) We can never really see how well the other models perform
 
 3) Still requires topic expertise to choose sensible search settings
 
 4) Still open to human error, such as misspelled parameters or misinterpreted results

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.graphics.tsaplots as sgt
import statsmodels.tsa.stattools as sts
from statsmodels.tsa.arima.model import ARIMA
from scipy.stats.distributions import chi2
from math import sqrt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
sns.set()

In [2]:
df = pd.read_csv(r"Downloads\Index2018.csv")
df_copy = df.copy()
df_copy.date = pd.to_datetime(df_copy.date, dayfirst = True)
df_copy.set_index('date', inplace = True)
df_copy = df_copy.asfreq('b')
df_copy = df_copy.ffill()
df_copy.head()

date,spx,dax,ftse,nikkei
1994-01-07,469.9,2224.95,3445.98,18124.01
1994-01-10,475.27,2225.0,3440.58,18443.44
1994-01-11,474.13,2228.1,3413.77,18485.25
1994-01-12,474.17,2182.06,3372.02,18793.88
1994-01-13,472.47,2142.37,3360.01,18577.26


## Creating Returns

In [3]:
df_copy['ret_spx'] = df_copy.spx.pct_change(1).mul(100)
df_copy['ret_dax'] = df_copy.dax.pct_change(1).mul(100)
df_copy['ret_ftse'] = df_copy.ftse.pct_change(1).mul(100)
df_copy['ret_nikkei'] = df_copy.nikkei.pct_change(1).mul(100)
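
As a quick sanity check on what `pct_change(1).mul(100)` computes, the sketch below reproduces the percentage return, 100 * (P_t / P_(t-1) - 1), in plain Python on the first three FTSE closes from the table above. The helper name `pct_returns` is hypothetical and is not part of the notebook.

```python
def pct_returns(prices):
    """Simple percentage returns, mirroring Series.pct_change(1).mul(100)."""
    returns = [float("nan")]  # the first observation has no previous price
    for prev, curr in zip(prices, prices[1:]):
        returns.append((curr / prev - 1) * 100)
    return returns

# First three FTSE closes from the head() output above
ftse = [3445.98, 3440.58, 3413.77]
ftse_returns = pct_returns(ftse)
print([round(r, 4) for r in ftse_returns[1:]])
```

The leading NaN is the reason the later cells pass `ret_ftse[1:]` rather than the full series to `auto_arima`.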

## Splitting the data

In [4]:
size = int(len(df_copy)*0.8)
df, df_test = df_copy.iloc[:size], df_copy.iloc[size:]
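
To make the split sizes concrete, here is a small sketch of the same arithmetic. The row count of 6277 is an assumption, inferred from the 6276 return observations reported in the second model summary later in this notebook plus the one row whose return is NaN.

```python
# Assumption: 6277 rows = 6276 return observations (second summary) + 1 leading NaN row
n_rows = 6277

size = int(n_rows * 0.8)      # same rule as the cell above
n_train = size                # rows in df
n_test = n_rows - size        # rows in df_test
n_fit = n_train - 1           # ret_ftse[1:] drops the leading NaN return

print(n_train, n_test, n_fit)
```

Under that assumption, the training slice leaves 5020 usable returns, which agrees with the observation count in the first model summary.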

## Fitting a Model

In [5]:
from pmdarima.arima import auto_arima

In [6]:
model_auto = auto_arima(df.ret_ftse[1:])

In [7]:
model_auto

ARIMA(maxiter=50, method='lbfgs', order=(4, 0, 5), out_of_sample_size=0,
      scoring='mse', scoring_args={}, seasonal_order=(0, 0, 0, 0),
      with_intercept=False)

In [8]:
model_auto.summary()

Dep. Variable:,y,No. Observations:,5020
Model:,"SARIMAX(4, 0, 5)",Log Likelihood,-7883.705
Date:,"Mon, 26 Oct 2020",AIC,15787.41
Time:,09:44:23,BIC,15852.622
Sample:,0 - 5020,HQIC,15810.261
Covariance Type:,opg,,

,coef,std err,z,P>|z|,[0.025,0.975]
ar.L1,-0.0013,0.082,-0.016,0.987,-0.162,0.159
ar.L2,-0.6526,0.078,-8.413,0.000,-0.805,-0.501
ar.L3,-0.1768,0.071,-2.486,0.013,-0.316,-0.037
ar.L4,0.1979,0.075,2.653,0.008,0.052,0.344
ma.L1,-0.0232,0.081,-0.284,0.776,-0.183,0.136
ma.L2,0.6052,0.078,7.743,0.000,0.452,0.758
ma.L3,0.0761,0.068,1.111,0.266,-0.058,0.210
ma.L4,-0.1901,0.073,-2.597,0.009,-0.334,-0.047
ma.L5,-0.1049,0.010,-11.009,0.000,-0.124,-0.086

Ljung-Box (Q):,67.07,Jarque-Bera (JB):,6361.38
Prob(Q):,0.0,Prob(JB):,0.0
Heteroskedasticity (H):,1.99,Skew:,-0.19
Prob(H) (two-sided):,0.0,Kurtosis:,8.5


Here the seasonal order is (0, 0, 0, 0), so the selected model has no seasonal component. Note that `auto_arima` defaults to `m = 1`, so no seasonal terms were searched for in the first place. The order of integration, d, is also zero, which means the returns needed no differencing. And since we supplied no exogenous variables, the SARIMAX label in the summary is more general than we need: the fitted model is simply an ARMA(4, 5).
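
Written out, an ARMA(4, 5) without an intercept (the model printout above shows `with_intercept=False`) takes the form below. The notation is an assumption for illustration: r_t is the FTSE return, ε_t the residual, and the MA terms enter with a plus sign, following the statsmodels convention.

```latex
r_t = \sum_{i=1}^{4} \varphi_i \, r_{t-i} + \varepsilon_t + \sum_{j=1}^{5} \theta_j \, \varepsilon_{t-j}
```

In the coefficient table, the φ_i correspond to the rows `ar.L1` to `ar.L4` and the θ_j to the rows `ma.L1` to `ma.L5`.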

- Comments on the summary table:

1) The rules of model selection are rules of thumb rather than fixed laws.

2) Auto ARIMA ranks candidate models by a single criterion, by default the AIC.

3) We could easily have overfitted while going through the models in our previous cases.

4) The default arguments of the method restrict the number of AR and MA terms (by default, `max_p = 5` and `max_q = 5`).
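
Since the selection rests on the AIC, it may help to see how the criteria in the summary follow from the log-likelihood. The sketch below uses the standard definitions; the function name `information_criteria` is hypothetical, and the parameter count of 10 is an assumption (4 AR + 5 MA coefficients plus the residual variance, which the coefficient table above does not list).

```python
from math import log

def information_criteria(log_likelihood, n_params, n_obs):
    """Return (AIC, BIC, HQIC) from a maximised log-likelihood."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * log(n_obs) - 2 * log_likelihood
    hqic = 2 * n_params * log(log(n_obs)) - 2 * log_likelihood
    return aic, bic, hqic

# Log-likelihood and sample size read from the first summary table.
# Assumption: 10 parameters = 4 AR + 5 MA + residual variance.
aic, bic, hqic = information_criteria(-7883.705, 10, 5020)
print(round(aic, 2), round(bic, 3), round(hqic, 3))
```

With these inputs the three values agree with the AIC, BIC and HQIC printed in the summary, which suggests the assumed parameter count is the one used there. Every extra AR or MA term adds 2 to the AIC, so it only pays off if it raises the log-likelihood by more than 1.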

In [12]:
## Note: newer pmdarima versions expect X = ... instead of exogenous = ...
model_auto_1 = auto_arima(df_copy.ret_ftse[1:], exogenous = df_copy[["ret_spx", "ret_dax", "ret_nikkei"]][1:], m = 5,
                          max_order = None, max_p = 7, max_q = 7, max_d = 2, max_P = 4, max_Q = 4, max_D = 2,
                          maxiter = 50, alpha = 0.05, n_jobs = -1, trend = "ct", information_criterion = "oob",
                          out_of_sample_size = int(len(df_copy)*0.2))
## trend = "ct" fits a constant plus a linear trend; "ctt" adds a quadratic term, and [1,0,0,1] gives a constant plus a 3rd-degree term
## oob = out of bag: candidate models are scored on the held-out last 20% of the sample (out_of_sample_size)

In [13]:
model_auto_1.summary()

Dep. Variable:,y,No. Observations:,6276
Model:,"SARIMAX(0, 0, 1)x(0, 0, [1, 2, 3, 4], 5)",Log Likelihood,-6366.707
Date:,"Mon, 26 Oct 2020",AIC,12755.414
Time:,10:35:59,BIC,12829.603
Sample:,01-10-1994 - 01-29-2018,HQIC,12781.119
Covariance Type:,opg,,

,coef,std err,z,P>|z|,[0.025,0.975]
intercept,-0.0021,0.013,-0.159,0.874,-0.028,0.024
drift,-1.749e-06,4.14e-06,-0.423,0.673,-9.86e-06,6.36e-06
ret_spx,0.0989,0.006,17.680,0.000,0.088,0.110
ret_dax,0.5525,0.005,115.093,0.000,0.543,0.562
ret_nikkei,0.0730,0.004,17.231,0.000,0.065,0.081
ma.L1,-0.1098,0.007,-15.017,0.000,-0.124,-0.095
ma.S.L5,-0.0317,0.009,-3.715,0.000,-0.048,-0.015
ma.S.L10,-0.0527,0.009,-5.765,0.000,-0.071,-0.035
ma.S.L15,-0.0236,0.009,-2.684,0.007,-0.041,-0.006

Ljung-Box (Q):,72.43,Jarque-Bera (JB):,14296.82
Prob(Q):,0.0,Prob(JB):,0.0
Heteroskedasticity (H):,0.54,Skew:,0.23
Prob(H) (two-sided):,0.0,Kurtosis:,10.38


#### Here we can see a significant drop in the AIC (from 15787.41 to 12755.414), so we have obtained a better model by loosening our search criteria.
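
One caveat: the two models were fitted on different sample sizes (5020 and 6276 observations), so their raw AIC values are not strictly comparable. A rough check, which is an addition here and not part of the original analysis, is to divide each AIC by its number of observations. The variable names below are hypothetical.

```python
# AIC and observation counts read from the two summary tables
aic_first, n_first = 15787.41, 5020      # ARMA(4, 5), training set only
aic_second, n_second = 12755.414, 6276   # seasonal model with exogenous returns

# Per-observation AIC: a rough normalisation, not a formal test
per_obs_first = aic_first / n_first
per_obs_second = aic_second / n_second

print(round(per_obs_first, 4), round(per_obs_second, 4))
```

The second model still comes out lower on this per-observation basis, which is consistent with the conclusion above, although a cleaner comparison would refit both models on the same sample.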