# How to Forecast a Time Series with Python

Wouldn't it be nice to know the future? This notebook accompanies the blog post on Medium. Please check the blog for visualizations and explanations; this notebook is really just for the code :)


## Processing the Data

Let's explore the industrial production of electric and gas utilities in the United States from 1985 to 2018, measured as monthly production output.

You can access this data here: https://fred.stlouisfed.org/series/IPG2211A2N

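This data measures the real output of all relevant establishments located in the United States, regardless of their ownership, but not those located in U.S. territories.

Note that the code in this notebook loads a weekly Nikkei price series (`Nikkei_weekly_10.csv`, 2008-2018) in place of the original `Electric_Production.csv`, so the outputs below show Nikkei prices.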

In [136]:
%matplotlib inline
import pandas as pd
import numpy as np
from datetime import datetime  # the pandas.datetime alias is deprecated

def parser(x):
    d = datetime.strptime(x, '%b %d, %Y')
    return d

# data = pd.read_csv("Electric_Production.csv",index_col=0)
data = pd.read_csv("Nikkei_weekly_10.csv", index_col=0, usecols=["Date", "Price_Nikkei"], parse_dates=[0], date_parser=parser)
data['Price_Nikkei'] = data['Price_Nikkei'].str.replace(',','').astype(float)
data = data.reindex(index=data.index[::-1])

data.head()

Date,Price_Nikkei
2008-05-11,14219.48
2008-05-18,14012.2
2008-05-25,14338.54
2008-06-01,14489.44
2008-06-08,13973.73


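In the original tutorial the index is just a list of strings that look like dates, so we convert them to timestamps that the forecasting analysis can interpret. Here `read_csv` has already parsed the dates, so the index is a `DatetimeIndex` and the `pd.to_datetime` call below leaves it unchanged: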

In [137]:
data.index

DatetimeIndex(['2008-05-11', '2008-05-18', '2008-05-25', '2008-06-01',
               '2008-06-08', '2008-06-15', '2008-06-22', '2008-06-29',
               '2008-07-06', '2008-07-13',
               ...
               '2018-03-04', '2018-03-11', '2018-03-18', '2018-03-25',
               '2018-04-01', '2018-04-08', '2018-04-15', '2018-04-22',
               '2018-04-29', '2018-05-06'],
              dtype='datetime64[ns]', name='Date', length=522, freq=None)

In [138]:
data.index = pd.to_datetime(data.index)

In [139]:
data.head()

Date,Price_Nikkei
2008-05-11,14219.48
2008-05-18,14012.2
2008-05-25,14338.54
2008-06-01,14489.44
2008-06-08,13973.73


In [140]:
data.index

DatetimeIndex(['2008-05-11', '2008-05-18', '2008-05-25', '2008-06-01',
               '2008-06-08', '2008-06-15', '2008-06-22', '2008-06-29',
               '2008-07-06', '2008-07-13',
               ...
               '2018-03-04', '2018-03-11', '2018-03-18', '2018-03-25',
               '2018-04-01', '2018-04-08', '2018-04-15', '2018-04-22',
               '2018-04-29', '2018-05-06'],
              dtype='datetime64[ns]', name='Date', length=522, freq=None)
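
For a file whose dates were not parsed on load, the conversion step matters more. The following is a minimal sketch on a hypothetical three-row frame (`toy` is not part of the notebook; its values are copied from the first rows shown above), assuming the index starts out as plain strings:

```python
import pandas as pd

# Hypothetical toy frame whose index holds date-like strings, not timestamps
toy = pd.DataFrame(
    {"Price_Nikkei": [14219.48, 14012.20, 14338.54]},
    index=["2008-05-11", "2008-05-18", "2008-05-25"],
)

# Convert the string labels into a DatetimeIndex
toy.index = pd.to_datetime(toy.index)
toy.index.name = "Date"

# With a DatetimeIndex, date-based slicing works even for dates
# that are not themselves present in the index
subset = toy.loc["2008-05-12":]
print(subset)
```

Once the index is a `DatetimeIndex`, slices such as the train/test split later in the notebook can be written with plain date strings.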

Let's first make sure that the data doesn't have any missing data points:

In [141]:
data[pd.isnull(data['Price_Nikkei'])]

Date,Price_Nikkei


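Let's also rename this column, since it's hard to remember what the "IPG2211A2N" code stands for. (With the Nikkei data the column is already named `Price_Nikkei`, so the rename is commented out.)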

In [142]:
# data.columns = ['Energy Production']

In [143]:
data.head()

Date,Price_Nikkei
2008-05-11,14219.48
2008-05-18,14012.2
2008-05-25,14338.54
2008-06-01,14489.44
2008-06-08,13973.73


In [144]:
import plotly
plotly.tools.set_credentials_file(username='aishlia', api_key='YOUR_API_KEY')  # use your own key; never commit a real one

In [146]:
from plotly.plotly import plot_mpl
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(data, model='multiplicative')
fig = result.plot()
plot_mpl(fig)

'https://plot.ly/~aishlia/24'

In [147]:
import plotly.plotly as ply
import cufflinks as cf
# Check the docs on setting up offline plotting

In [148]:
data.iplot(title="Nikkei Prices from 2008-2018", theme='pearl')

In [149]:
from pyramid.arima import auto_arima  # the pyramid-arima package was later renamed pmdarima

The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness-of-fit. Therefore, we are interested in finding the model that yields the lowest AIC value.
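
To make the trade-off concrete, here is a minimal sketch of the standard AIC formula, AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The helper `aic` and the log-likelihood values are hypothetical and chosen only for illustration; they are not taken from the models fitted in this notebook.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Two hypothetical models with an identical fit (same log-likelihood)
# but a different number of estimated parameters.
aic_small = aic(log_likelihood=-3540.0, n_params=3)
aic_large = aic(log_likelihood=-3540.0, n_params=8)

# The model with fewer parameters gets the lower (better) score.
print(aic_small, aic_large)
```

With equal goodness-of-fit, each extra parameter adds 2 to the score, which is why a stepwise search that minimizes AIC prefers the simpler model.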

In [190]:
stepwise_model = auto_arima(data, start_p=1, start_q=1,
                           max_p=6, max_q=6, m=52,
                           start_P=0, seasonal=True,
                           d=1, D=1, trace=True,
                           error_action='ignore',  
                           suppress_warnings=True, 
                           stepwise=True) 

Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 52); AIC=7094.992, BIC=7116.281, Fit time=13.246 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 52); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 52); AIC=7165.218, BIC=7182.248, Fit time=12.125 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 52); AIC=7093.602, BIC=7110.633, Fit time=15.914 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(1, 1, 1, 52); AIC=7094.496, BIC=7115.784, Fit time=26.029 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 0, 52); AIC=7299.202, BIC=7311.975, Fit time=0.765 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 2, 52); AIC=7094.531, BIC=7115.819, Fit time=89.274 seconds


KeyboardInterrupt: 

In [151]:
stepwise_model.aic()

7648.6072306392944

## Train Test Split

In [152]:
data.head()

Date,Price_Nikkei
2008-05-11,14219.48
2008-05-18,14012.2
2008-05-25,14338.54
2008-06-01,14489.44
2008-06-08,13973.73


In [153]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 522 entries, 2008-05-11 to 2018-05-06
Data columns (total 1 columns):
Price_Nikkei    522 non-null float64
dtypes: float64(1)
memory usage: 8.2 KB


We'll train on roughly eight and a half years of weekly data, from 2008-05-11 through 2016-12-25, and test our forecast on the data from 2017-01-01 onward, comparing it to the real values.

In [158]:
train = data.loc['2008-05-11':'2016-12-25']

In [159]:
train.tail()

Date,Price_Nikkei
2016-11-27,18426.08
2016-12-04,18996.37
2016-12-11,19401.15
2016-12-18,19427.67
2016-12-25,19114.37


In [160]:
test = data.loc['2017-01-01':]

In [161]:
test.head()

Date,Price_Nikkei
2017-01-01,19454.33
2017-01-08,19287.28
2017-01-15,19137.91
2017-01-22,19467.4
2017-01-29,18918.2


In [162]:
test.tail()

Date,Price_Nikkei
2018-04-08,21778.74
2018-04-15,22162.24
2018-04-22,22467.87
2018-04-29,22472.78
2018-05-06,22555.0


In [163]:
test_length = len(test)
test_length

71

In [166]:
stepwise_model.fit(train)

ARIMA(callback=None, disp=0, maxiter=50, method=None, order=(0, 1, 0),
   out_of_sample_size=0, scoring='mse', scoring_args={},
   seasonal_order=(1, 1, 1, 12), solver='lbfgs', start_params=None,

In [167]:
future_forecast = stepwise_model.predict(n_periods=test_length)

In [168]:
future_forecast

array([ 19034.58802479,  18988.17107727,  19011.96374623,  19006.37624557,
        19090.25241147,  19176.80415496,  19277.39703675,  19441.58206787,
        19444.25171348,  19539.35368489,  19606.6129439 ,  19848.55881485,
        19765.39121556,  19694.88312899,  19704.05823412,  19737.49298122,
        19796.52094609,  19849.8670795 ,  19930.62815176,  20104.97721621,
        20070.17274968,  20145.68833961,  20217.66188219,  20500.00193404,
        20418.48865513,  20348.19900718,  20358.24952248,  20396.27948221,
        20455.47338514,  20508.40588875,  20589.68077474,  20766.62379838,
        20731.10969085,  20807.15611441,  20881.34568807,  21168.37609309,
        21088.86664356,  21020.48124978,  21032.48157876,  21072.71931227,
        21133.81382856,  21188.60675379,  21271.80637755,  21450.81839246,
        21417.14417882,  21495.11652045,  21571.3488769 ,  21860.59365348,
        21783.11227102,  21716.74803899,  21730.77268917,  21773.05263284,
        21836.1680584 ,  

In [178]:
future_forecast = pd.DataFrame(future_forecast,index = test.index,columns=['Prediction'])

In [179]:
future_forecast.head()

Date,Prediction
2017-01-01,19034.588025
2017-01-08,18988.171077
2017-01-15,19011.963746
2017-01-22,19006.376246
2017-01-29,19090.252411


In [180]:
test.head()

Date,Price_Nikkei
2017-01-01,19454.33
2017-01-08,19287.28
2017-01-15,19137.91
2017-01-22,19467.4
2017-01-29,18918.2


In [181]:
pd.concat([test,future_forecast],axis=1).iplot()
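
Besides plotting the forecast against the held-out data, it can help to summarize the gap with a few error metrics. The sketch below is not part of the original notebook: it uses only the first five test values and the first five predictions printed above, and the choice of MAE, RMSE and MAPE is an assumption about which metrics are useful here.

```python
import numpy as np

# First five actual values (test.head()) and predictions (future_forecast.head())
actual = np.array([19454.33, 19287.28, 19137.91, 19467.40, 18918.20])
predicted = np.array([19034.588025, 18988.171077, 19011.963746,
                      19006.376246, 19090.252411])

errors = actual - predicted

mae = np.mean(np.abs(errors))                  # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))           # root mean squared error
mape = np.mean(np.abs(errors / actual)) * 100  # mean absolute percentage error

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.2f}%")
```

In the notebook itself, the same calculation could be run over the full test period by passing `test['Price_Nikkei'].values` and `future_forecast['Prediction'].values` instead of the five-element arrays.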

In [183]:
future_forecast2 = pd.DataFrame(future_forecast, index=test.index, columns=['Prediction'])  # future_forecast is already a DataFrame; this re-wraps it unchanged

In [185]:
pd.concat([data,future_forecast2],axis=1).iplot()