# Assignment 1: Time Series Forecast With Python (Seasonal ARIMA)

**Lecturer:** Vincent Claes<br>
**Authors:** Bryan Honof, Jeffrey Gorissen<br>
**Start Date:** 19/10/2018
    
**Objective:** Visualize and predict the future temperatures via ARIMA

**Description:** In this notebook we train our model

**This notebook is really only used to calculate the best parameters so most of the description is left out.**

In [1]:
import warnings
import itertools

import numpy             as np
import pandas            as pd
import statsmodels.api   as sm
import matplotlib.pyplot as plt

from sklearn.metrics import mean_absolute_error

plt.style.use('fivethirtyeight')

In [2]:
data_csv = pd.read_csv('./data/data.csv')
data = pd.DataFrame()

# Convert the dateTime column to datetime64
data['dateTime'] = pd.to_datetime(data_csv['dateTime'])
# Convert the temperature column to float
data['temperature'] = pd.to_numeric(data_csv['temperature'])

# Set the dateTime column as index
data = data.set_index(['dateTime'])

# Sort the dataFrame just to be sure...
data = data.sort_index()

data = data.dropna()

# Double check the results
data.info()

df = data

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 352 entries, 2018-11-10 23:00:00 to 2018-11-25 14:00:00
Data columns (total 1 columns):
temperature    352 non-null float64
dtypes: float64(1)
memory usage: 5.5 KB


In [3]:
data.tail(5)

| dateTime | temperature |
|---|---|
| 2018-11-25 10:00:00 | 17.99 |
| 2018-11-25 11:00:00 | 17.66 |
| 2018-11-25 12:00:00 | 18.62 |
| 2018-11-25 13:00:00 | 17.34 |
| 2018-11-25 14:00:00 | 15.28 |


## Search for best parameters


```p``` is the auto-regressive part of the model. It lets us incorporate the effect of past values into the model. Intuitively, this is like saying that it is likely to be warm tomorrow if it has been warm for the past three days.<br>
```d``` is the integrated part of the model. It sets the amount of differencing (i.e. how many times past time points are subtracted from the current value) applied to the time series. Intuitively, this is like saying that tomorrow's temperature is likely to be the same as today's if the day-to-day differences over the last three days have been very small.<br>
```q``` is the moving-average part of the model. It lets us express the model's error as a linear combination of the errors observed at previous time points.
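
To make the role of ```d``` concrete, here is a minimal sketch of differencing in plain Python. The helper name ```difference``` is our own and not part of statsmodels, and the sample values are the last five readings from the data preview above. A lag of 1 corresponds to ordinary differencing; a lag equal to the seasonal period (24 in this notebook) corresponds to seasonal differencing.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist.

    Results are rounded to 2 decimals only to keep the printed output tidy.
    """
    return [round(series[t] - series[t - lag], 2) for t in range(lag, len(series))]


# Last five hourly temperatures from the data preview
temperatures = [17.99, 17.66, 18.62, 17.34, 15.28]

first_diff = difference(temperatures)   # one pass of differencing, as with d = 1
second_diff = difference(first_diff)    # differencing twice, as with d = 2

print(first_diff)
print(second_diff)
```

Each pass of differencing shortens the series by ```lag``` points, which is one reason seasonal differencing with a lag of 24 is costly on a series of only 352 observations.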

We will use a "grid search" to iteratively explore different combinations of parameters. For each combination, we fit a new seasonal ARIMA model with the ```SARIMAX()``` function from the statsmodels module and assess its overall quality. Once we have explored the entire landscape of parameters, our optimal set of parameters will be the one that yields the best performance for our criterion of interest. Let's begin by generating the various combinations of parameters that we wish to assess:

In [4]:
# Let p, d and q each take the value 0 or 1 (range(0, 2) excludes 2)
p = d = q = range(0, 2)

# Generate all combinations of (p, d, q) triplets
pdq = list(itertools.product(p, d, q))

# Generate all combinations of seasonal (p, d, q) triplets with a seasonal period of 24
seasonal_pdq = [(x[0], x[1], x[2], 24) for x in itertools.product(p, d, q)]

print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))

Examples of parameter combinations for Seasonal ARIMA...
SARIMAX: (0, 0, 1) x (0, 0, 1, 24)
SARIMAX: (0, 0, 1) x (0, 1, 0, 24)
SARIMAX: (0, 1, 0) x (0, 1, 1, 24)
SARIMAX: (0, 1, 0) x (1, 0, 0, 24)


Here each of p, d and q takes the value 0 or 1, because ```range(0, 2)``` excludes its upper bound. Widening the range may find a better-fitting model, but for time's sake we keep it small here. (We ran a separate search with ```range(0, 3)```; the result of that search is what we use for the prediction in the next notebook.)
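
A quick way to see why the wider range costs so much more time is to count the models each search has to fit. The sketch below rebuilds the same parameter grid for a given upper bound; the function name ```count_models``` is our own, introduced only for illustration.

```python
import itertools


def count_models(upper):
    """Count the (order, seasonal_order) pairs when p, d and q each come from range(0, upper)."""
    values = range(0, upper)
    pdq = list(itertools.product(values, repeat=3))
    seasonal_pdq = [(p, d, q, 24) for p, d, q in pdq]
    # Every non-seasonal triplet is paired with every seasonal one
    return len(pdq) * len(seasonal_pdq)


print(count_models(2))  # 8 * 8 = 64 models
print(count_models(3))  # 27 * 27 = 729 models
```

Going from ```range(0, 2)``` to ```range(0, 3)``` therefore multiplies the number of fits by more than eleven.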

In [5]:
warnings.filterwarnings("ignore") # specify to ignore warning messages

# Parallel lists: AIC[i] belongs to the model (_param[i], _seasonal_param[i])
AIC = []
_param = []
_seasonal_param = []

for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(df,
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)

            results = mod.fit()

            # Record the AIC of this fit; lower values are better
            AIC.append(round(results.aic, 2))
            _param.append(param)
            _seasonal_param.append(param_seasonal)

            print('ARIMA{}x{}24 - AIC:{}'.format(param, param_seasonal, round(results.aic, 2)))
        except Exception:
            # Skip combinations that cannot be fitted
            continue

ARIMA(0, 0, 0)x(0, 0, 0, 24)24 - AIC:3018.5
ARIMA(0, 0, 0)x(0, 0, 1, 24)24 - AIC:2453.34
ARIMA(0, 0, 0)x(0, 1, 0, 24)24 - AIC:1234.82
ARIMA(0, 0, 0)x(0, 1, 1, 24)24 - AIC:1121.17
ARIMA(0, 0, 0)x(1, 0, 0, 24)24 - AIC:1227.03
ARIMA(0, 0, 0)x(1, 0, 1, 24)24 - AIC:1186.71
ARIMA(0, 0, 0)x(1, 1, 0, 24)24 - AIC:1127.37
ARIMA(0, 0, 0)x(1, 1, 1, 24)24 - AIC:1124.18
ARIMA(0, 0, 1)x(0, 0, 0, 24)24 - AIC:2546.75
ARIMA(0, 0, 1)x(0, 0, 1, 24)24 - AIC:2029.01
ARIMA(0, 0, 1)x(0, 1, 0, 24)24 - AIC:1050.79
ARIMA(0, 0, 1)x(0, 1, 1, 24)24 - AIC:929.74
ARIMA(0, 0, 1)x(1, 0, 0, 24)24 - AIC:1047.61
ARIMA(0, 0, 1)x(1, 0, 1, 24)24 - AIC:1050.0
ARIMA(0, 0, 1)x(1, 1, 0, 24)24 - AIC:945.66
ARIMA(0, 0, 1)x(1, 1, 1, 24)24 - AIC:939.42
ARIMA(0, 1, 0)x(0, 0, 0, 24)24 - AIC:879.76
ARIMA(0, 1, 0)x(0, 0, 1, 24)24 - AIC:815.15
ARIMA(0, 1, 0)x(0, 1, 0, 24)24 - AIC:996.46
ARIMA(0, 1, 0)x(0, 1, 1, 24)24 - AIC:793.33
ARIMA(0, 1, 0)x(1, 0, 0, 24)24 - AIC:818.85
ARIMA(0, 1, 0)x(1, 0, 1, 24)24 - AIC:817.15
ARIMA(0, 1, 0)x(1, 1,

In [6]:
# Locate the combination with the lowest AIC
best_aic = min(AIC)
pos = AIC.index(best_aic)
print(_param[pos], _seasonal_param[pos], best_aic)

order = _param[pos]
seasonal_order = _seasonal_param[pos]

(1, 1, 1) (0, 1, 1, 24) 775.46


As we can see, the grid search selected ```(1, 1, 1) (0, 1, 1, 24)```, with the lowest AIC of ```775.46```, as the best parameters for our model. We will take these parameters and go to the next notebook to implement them. (In the other test we got a result of ```(2, 0, 2) (1, 1, 2, 24) 686.05```)
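
The selection step can also be written without parallel lists. The sketch below is one possible alternative, not the code used in this notebook: it keeps each result as a single ```(order, seasonal_order, AIC)``` tuple and picks the row with the lowest AIC. The three rows are copied from the outputs printed above; the name ```candidates``` is our own.

```python
# (order, seasonal_order, AIC) rows copied from the outputs in this notebook
candidates = [
    ((0, 0, 0), (0, 0, 0, 24), 3018.5),
    ((0, 1, 0), (0, 1, 1, 24), 793.33),
    ((1, 1, 1), (0, 1, 1, 24), 775.46),
]

# A lower AIC indicates a better trade-off between fit and model complexity
best_order, best_seasonal_order, best_aic = min(candidates, key=lambda row: row[2])

print(best_order, best_seasonal_order, best_aic)
```

Keeping the three values together in one tuple avoids the risk of the lists drifting out of step if one ```append``` is ever skipped.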

**[⬆ back to top](#table-of-contents)** <br>
[next notebook](./5_fitting_and_predicting.ipynb)