# `auto_arima`

Pyramid brings R's [`auto.arima`](https://www.rdocumentation.org/packages/forecast/versions/7.3/topics/auto.arima) functionality to Python by wrapping statsmodels' [`ARIMA`](https://github.com/statsmodels/statsmodels/blob/master/statsmodels/tsa/arima_model.py) and [`SARIMAX`](https://github.com/statsmodels/statsmodels/blob/master/statsmodels/tsa/statespace/sarimax.py) models into a single scikit-learn-esque estimator ([`pyramid.arima.ARIMA`](https://github.com/tgsmith61591/pyramid/blob/master/pyramid/arima/arima.py)) and adding several layers of degree and seasonal differencing tests to identify the optimal model parameters.

__Pyramid ARIMA models:__

  - Are fully picklable for easy persistence and model deployment
  - Can handle seasonal terms (unlike statsmodels ARIMAs)
  - Follow sklearn model fit/predict conventions

In [None]:
import numpy as np
import pyramid

print('numpy version: %r' % np.__version__)
print('pyramid version: %r' % pyramid.__version__)

We'll start by defining an array of data from an R time-series, `wineind`:

```r
> forecast::wineind
       Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
1980 15136 16733 20016 17708 18019 19227 22893 23739 21133 22591 26786 29740
1981 15028 17977 20008 21354 19498 22125 25817 28779 20960 22254 27392 29945
1982 16933 17892 20533 23569 22417 22084 26580 27454 24081 23451 28991 31386
1983 16896 20045 23471 21747 25621 23859 25500 30998 24475 23145 29701 34365
1984 17556 22077 25702 22214 26886 23191 27831 35406 23195 25110 30009 36242
1985 18450 21845 26488 22394 28057 25451 24872 33424 24052 28449 33533 37351
1986 19969 21701 26249 24493 24603 26485 30723 34569 26689 26157 32064 38870
1987 21337 19419 23166 28286 24570 24001 33151 24878 26804 28967 33311 40226
1988 20504 23060 23562 27562 23940 24584 34303 25517 23494 29095 32903 34379
1989 16991 21109 23740 25552 21752 20294 29009 25500 24166 26960 31222 38641
1990 14672 17543 25453 32683 22449 22316 27595 25451 25421 25288 32568 35110
1991 16052 22146 21198 19543 22084 23816 29961 26773 26635 26972 30207 38687
1992 16974 21697 24179 23757 25013 24019 30345 24488 25156 25650 30923 37240
1993 17466 19463 24352 26805 25236 24735 29356 31234 22724 28496 32857 37198
1994 13652 22784 23565 26323 23779 27549 29660 23356
```

Note that the frequency of the data is 12:

```r
> frequency(forecast::wineind)
[1] 12
```

In [3]:
# this is a dataset from R
wineind = np.array([
    # Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
    15136, 16733, 20016, 17708, 18019, 19227, 22893, 23739, 21133, 22591, 26786, 29740, 
    15028, 17977, 20008, 21354, 19498, 22125, 25817, 28779, 20960, 22254, 27392, 29945, 
    16933, 17892, 20533, 23569, 22417, 22084, 26580, 27454, 24081, 23451, 28991, 31386, 
    16896, 20045, 23471, 21747, 25621, 23859, 25500, 30998, 24475, 23145, 29701, 34365, 
    17556, 22077, 25702, 22214, 26886, 23191, 27831, 35406, 23195, 25110, 30009, 36242, 
    18450, 21845, 26488, 22394, 28057, 25451, 24872, 33424, 24052, 28449, 33533, 37351, 
    19969, 21701, 26249, 24493, 24603, 26485, 30723, 34569, 26689, 26157, 32064, 38870, 
    21337, 19419, 23166, 28286, 24570, 24001, 33151, 24878, 26804, 28967, 33311, 40226, 
    20504, 23060, 23562, 27562, 23940, 24584, 34303, 25517, 23494, 29095, 32903, 34379, 
    16991, 21109, 23740, 25552, 21752, 20294, 29009, 25500, 24166, 26960, 31222, 38641, 
    14672, 17543, 25453, 32683, 22449, 22316, 27595, 25451, 25421, 25288, 32568, 35110, 
    16052, 22146, 21198, 19543, 22084, 23816, 29961, 26773, 26635, 26972, 30207, 38687, 
    16974, 21697, 24179, 23757, 25013, 24019, 30345, 24488, 25156, 25650, 30923, 37240, 
    17466, 19463, 24352, 26805, 25236, 24735, 29356, 31234, 22724, 28496, 32857, 37198, 
    13652, 22784, 23565, 26323, 23779, 27549, 29660, 23356]
).astype(np.float64)
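Because the series has 12 observations per year, a seasonal difference compares each month with the same month one year earlier. The sketch below illustrates that idea, and an ordinary first difference on top of it, using the first two years of `wineind`. It is plain Python for illustration only; the helper `difference` is hypothetical and is not how the library computes differences internally.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# First two years (1980 and 1981) of wineind, copied from the array above.
first_two_years = [
    15136, 16733, 20016, 17708, 18019, 19227, 22893, 23739, 21133, 22591, 26786, 29740,
    15028, 17977, 20008, 21354, 19498, 22125, 25817, 28779, 20960, 22254, 27392, 29945,
]

# Seasonal difference (lag 12): each month of 1981 minus the same month of 1980.
seasonal = difference(first_two_years, lag=12)

# Ordinary first difference applied on top of the seasonal difference.
both = difference(seasonal, lag=1)

print(seasonal[:3])  # [-108, 1244, -8]
print(both[:2])      # [1352, -1252]
```

Each round of differencing shortens the series: two years of monthly data leave 12 seasonal differences, and 11 values after the additional first difference.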

## Fitting an ARIMA

We will first fit a seasonal ARIMA. Note that you do not need to call `auto_arima` in order to fit a model. If you already know the order and seasonality of your data, you can fit an ARIMA directly with those hyper-parameters:

In [5]:
from pyramid.arima import ARIMA

fit = ARIMA(order=(1, 1, 1), seasonal_order=(0, 1, 1, 12)).fit(y=wineind)


Note that your data does not have to exhibit seasonality to work with an ARIMA. We could fit an ARIMA against the same data with no seasonal terms whatsoever, although for a series this seasonal it will most likely perform worse.

In [4]:
fit = ARIMA(order=(1, 1, 1), seasonal_order=None).fit(y=wineind)

## Finding the optimal model hyper-parameters using `auto_arima`

If you are unsure (as is common) of the best parameters for your model, let `auto_arima` figure them out for you. `auto_arima` is similar to an ARIMA-specific grid search, but (by default) uses a more intelligent `stepwise` algorithm laid out in a paper by Hyndman and Khandakar (2008). If `stepwise` is False, the candidate models are fit exhaustively, as in a grid search. Note that `auto_arima` may fail to find any model that converges; in that case, it raises a `ValueError`.

`auto_arima` can also run a random search, which is much faster than the exhaustive one, when you set `random=True`. If your random search returns too many invalid (nan) models, try increasing `n_fits` or switching to an exhaustive search.
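To make the difference between an exhaustive and a random search concrete, here is a minimal sketch of how the two candidate sets relate. The function names and the seasonal bounds `max_P` and `max_Q` are hypothetical, chosen only for illustration; this is not the library's actual sampling procedure.

```python
import itertools
import random


def candidate_orders(max_p=3, max_q=3, max_P=1, max_Q=2):
    """Enumerate every (p, q, P, Q) combination up to the given bounds."""
    return list(itertools.product(range(max_p + 1), range(max_q + 1),
                                  range(max_P + 1), range(max_Q + 1)))


def random_candidates(candidates, n_fits, seed=None):
    """Sample n_fits distinct candidates without replacement."""
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_fits, len(candidates)))


grid = candidate_orders()                             # exhaustive search space
subset = random_candidates(grid, n_fits=25, seed=42)  # random search space
print(len(grid), len(subset))  # 96 25
```

With these illustrative bounds, the random search fits roughly a quarter of the models an exhaustive search would, which is where the speed-up comes from. The trade-off is that the best order may not be among the sampled candidates.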

In [5]:
# fitting a stepwise model:
from pyramid.arima import auto_arima

stepwise_fit = auto_arima(wineind, start_p=1, start_q=1, max_p=3, max_q=3, m=12,
                          start_P=0, seasonal=True, d=1, D=1, trace=True,
                          error_action='ignore',  # don't want to know if an order does not work
                          suppress_warnings=True,  # don't want convergence warnings
                          stepwise=True)  # set to stepwise

stepwise_fit.summary()

Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=3066.811, BIC=3082.663, Fit time=0.481 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 12); AIC=3099.735, BIC=3112.417, Fit time=0.154 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=3066.983, BIC=3079.665, Fit time=0.164 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=3067.666, BIC=3086.688, Fit time=0.626 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=3088.109, BIC=3100.791, Fit time=0.121 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=3067.669, BIC=3086.692, Fit time=1.284 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=3068.757, BIC=3090.951, Fit time=1.327 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=3067.485, BIC=3086.508, Fit time=0.271 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_orde

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 176.0 |
| Model: | SARIMAX(1, 1, 2)x(0, 1, 1, 12) | Log Likelihood | -1527.386 |
| Date: | Tue, 25 Jul 2017 | AIC | 3066.771 |
| Time: | 12:46:15 | BIC | 3085.794 |
| Sample: | 0 | HQIC | 3074.487 |
| | - 176 | | |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | -100.7446 | 72.306 | -1.393 | 0.164 | -242.462 | 40.973 |
| ar.L1 | -0.5139 | 0.390 | -1.319 | 0.187 | -1.278 | 0.250 |
| ma.L1 | -0.0791 | 0.403 | -0.196 | 0.844 | -0.869 | 0.710 |
| ma.L2 | -0.4438 | 0.223 | -1.988 | 0.047 | -0.881 | -0.006 |
| ma.S.L12 | -0.4021 | 0.054 | -7.448 | 0.000 | -0.508 | -0.296 |
| sigma2 | 7.663e+06 | 7.3e+05 | 10.500 | 0.000 | 6.23e+06 | 9.09e+06 |

| | | | |
|---|---|---|---|
| Ljung-Box (Q): | 48.66 | Jarque-Bera (JB): | 21.62 |
| Prob(Q): | 0.16 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 1.18 | Skew: | -0.61 |
| Prob(H) (two-sided): | 0.54 | Kurtosis: | 4.31 |
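The AIC, BIC and HQIC in the summary can be reproduced from the log likelihood using the standard textbook definitions. The sketch below assumes six estimated parameters (the six rows of the coefficient table, including `sigma2`) and 176 observations; the function name is hypothetical.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Return (AIC, BIC, HQIC) using their standard definitions."""
    neg2_loglik = -2.0 * log_likelihood
    aic = neg2_loglik + 2.0 * n_params
    bic = neg2_loglik + n_params * math.log(n_obs)
    hqic = neg2_loglik + 2.0 * n_params * math.log(math.log(n_obs))
    return aic, bic, hqic


# Values read from the summary above: six estimated parameters
# (intercept, ar.L1, ma.L1, ma.L2, ma.S.L12, sigma2) and 176 observations.
aic, bic, hqic = information_criteria(-1527.386, n_params=6, n_obs=176)
print(aic, bic, hqic)
```

The three results agree with the reported 3066.771, 3085.794 and 3074.487 to within the rounding of the printed log likelihood. All three criteria penalize extra parameters, with BIC penalizing them most heavily at this sample size.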


In [6]:
rs_fit = auto_arima(wineind, start_p=1, start_q=1, max_p=3, max_q=3, m=12,
                    start_P=0, seasonal=True, n_jobs=-1, d=1, D=1, trace=True,
                    error_action='ignore',  # don't want to know if an order does not work
                    suppress_warnings=True,  # don't want convergence warnings
                    stepwise=False, random=True, random_state=42,  # we can fit a random search (not exhaustive)
                    n_fits=25)

rs_fit.summary()

Fit ARIMA: order=(3, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(1, 1, 3) seasonal_order=(0, 1, 1, 12); AIC=3068.842, BIC=3091.036, Fit time=1.311 seconds
Fit ARIMA: order=(3, 1, 3) seasonal_order=(0, 1, 1, 12); AIC=3072.626, BIC=3101.160, Fit time=2.147 seconds
Fit ARIMA: order=(2, 1, 3) seasonal_order=(1, 1, 1, 12); AIC=3071.523, BIC=3100.057, Fit time=3.500 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=3068.086, BIC=3090.280, Fit time=1.071 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=3068.503, BIC=3090.696, Fit time=7.574 seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=3070.025, BIC=3095.389, Fit time=1.870 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=3066.771, BIC=3085.794, Fit time=1.144 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=3068.757, BIC=3090.951, Fit time=15.780 seconds
Fit ARIMA: order=(3, 1, 3) seasonal_ord

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 176.0 |
| Model: | SARIMAX(1, 1, 2)x(0, 1, 1, 12) | Log Likelihood | -1527.386 |
| Date: | Tue, 25 Jul 2017 | AIC | 3066.771 |
| Time: | 12:47:01 | BIC | 3085.794 |
| Sample: | 0 | HQIC | 3074.487 |
| | - 176 | | |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | -100.7446 | 72.306 | -1.393 | 0.164 | -242.462 | 40.973 |
| ar.L1 | -0.5139 | 0.390 | -1.319 | 0.187 | -1.278 | 0.250 |
| ma.L1 | -0.0791 | 0.403 | -0.196 | 0.844 | -0.869 | 0.710 |
| ma.L2 | -0.4438 | 0.223 | -1.988 | 0.047 | -0.881 | -0.006 |
| ma.S.L12 | -0.4021 | 0.054 | -7.448 | 0.000 | -0.508 | -0.296 |
| sigma2 | 7.663e+06 | 7.3e+05 | 10.500 | 0.000 | 6.23e+06 | 9.09e+06 |

| | | | |
|---|---|---|---|
| Ljung-Box (Q): | 48.66 | Jarque-Bera (JB): | 21.62 |
| Prob(Q): | 0.16 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 1.18 | Skew: | -0.61 |
| Prob(H) (two-sided): | 0.54 | Kurtosis: | 4.31 |
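The model reported in the summary is the one with the lowest AIC among the visible trace lines. The sketch below reproduces that selection by hand, assuming AIC is the selection criterion (which the output above is consistent with) and that candidates with a nan AIC are discarded. The helper `best_by_aic` is hypothetical.

```python
import math

# (order, seasonal_order, AIC) triples copied from the trace lines shown above;
# the trace is truncated, so this is only the visible part of the search.
trace = [
    ((3, 1, 2), (1, 1, 1, 12), float("nan")),
    ((1, 1, 3), (0, 1, 1, 12), 3068.842),
    ((3, 1, 3), (0, 1, 1, 12), 3072.626),
    ((2, 1, 3), (1, 1, 1, 12), 3071.523),
    ((1, 1, 2), (1, 1, 1, 12), 3068.086),
    ((2, 1, 1), (0, 1, 2, 12), 3068.503),
    ((2, 1, 2), (1, 1, 1, 12), 3070.025),
    ((1, 1, 2), (0, 1, 1, 12), 3066.771),
    ((1, 1, 1), (1, 1, 2, 12), 3068.757),
]


def best_by_aic(results):
    """Return the triple with the lowest finite AIC."""
    valid = [r for r in results if not math.isnan(r[2])]
    if not valid:
        raise ValueError("no candidate model produced a finite AIC")
    return min(valid, key=lambda r: r[2])


order, seasonal_order, aic = best_by_aic(trace)
print(order, seasonal_order, aic)  # (1, 1, 2) (0, 1, 1, 12) 3066.771
```

Filtering out nan values first matters because comparisons against nan are always false, so a plain `min` over the raw list could return a failed fit depending on where it sits.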


## Inspecting goodness of fit

We can look at how well the model fits in-sample data:

In [7]:
from bokeh.plotting import figure, show, output_notebook
import pandas as pd

# init bokeh
output_notebook()

def plot_arima(truth, forecasts, title="ARIMA", xaxis_label='Time',
               yaxis_label='Value', c1='#A6CEE3', c2='#B2DF8A', 
               forecast_start=None, **kwargs):
    
    # make truth and forecasts into pandas series
    n_truth = truth.shape[0]
    n_forecasts = forecasts.shape[0]
    
    # always plot truth the same
    truth = pd.Series(truth, index=np.arange(truth.shape[0]))
    
    # if no defined forecast start, start at the end
    if forecast_start is None:
        idx = np.arange(n_truth, n_truth + n_forecasts)
    else:
        # the index must span exactly n_forecasts points from forecast_start
        idx = np.arange(forecast_start, forecast_start + n_forecasts)
    forecasts = pd.Series(forecasts, index=idx)
    
    # set up the plot
    p = figure(title=title, plot_height=400, **kwargs)
    p.grid.grid_line_alpha=0.3
    p.xaxis.axis_label = xaxis_label
    p.yaxis.axis_label = yaxis_label
    
    # add the lines
    p.line(truth.index, truth.values, color=c1, legend='Observed')
    p.line(forecasts.index, forecasts.values, color=c2, legend='Forecasted')
    
    return p

In [8]:
in_sample_preds = stepwise_fit.predict_in_sample()
in_sample_preds[:10]

array([     0.        ,  10084.62490951,  12109.106077  ,  16287.36977899,
        15950.67373316,  17257.56915093,  17800.08168635,  20200.10594032,
        21476.02509694,  20789.74940127])

In [9]:
show(plot_arima(wineind, in_sample_preds, 
                title="Original Series & In-sample Predictions", 
                c2='#FF0000', forecast_start=0))
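A plot gives a visual impression of fit; a single error number can complement it. The sketch below computes a root-mean-squared error over the first ten observed values and in-sample predictions shown above. The `skip` parameter is a hypothetical convenience for dropping initial predictions, such as the leading `0.`, which presumably reflects the differencing start-up rather than the quality of the fitted model.

```python
import math


def rmse(observed, predicted, skip=0):
    """Root-mean-squared error, ignoring the first `skip` points."""
    pairs = list(zip(observed, predicted))[skip:]
    if not pairs:
        raise ValueError("no points left after skipping")
    return math.sqrt(sum((o - p) ** 2 for o, p in pairs) / len(pairs))


# First ten observed values and first ten in-sample predictions from above.
observed = [15136, 16733, 20016, 17708, 18019, 19227, 22893, 23739, 21133, 22591]
predicted = [0.0, 10084.62490951, 12109.106077, 16287.36977899, 15950.67373316,
             17257.56915093, 17800.08168635, 20200.10594032, 21476.02509694,
             20789.74940127]

print(rmse(observed, predicted))          # includes the zero first prediction
print(rmse(observed, predicted, skip=1))  # drops it
```

On the full series the same function could be applied as `rmse(wineind, in_sample_preds, skip=...)`; how many initial points to skip is a judgment call that depends on the differencing orders.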

## Predicting future values

After your model is fit, you can forecast future values using the `predict` method, just like in scikit-learn:

In [10]:
next_25 = stepwise_fit.predict(n_periods=25)
next_25

array([ 21967.58977115,  25983.67393668,  30225.87927633,  35417.43402408,
        13010.67730929,  19640.68774496,  21507.21794346,  23675.72670984,
        21686.79194458,  23672.21601578,  26956.70419416,  22755.79165077,
        19809.49757076,  23580.42643003,  27847.87411451,  32925.71207085,
        10476.65058125,  17027.65713579,  18834.04336593,  20932.71593039,
        18878.92586596,  20796.93491922,  24015.32353603,  19747.63540936,
        16734.91315808])

In [11]:
# call the plotting func
show(plot_arima(wineind, next_25))

## Updating your model

ARIMAs create forecasts by using the latest observations. Over time, your forecasts will drift, and you'll need to update the model with the observed values. The _current_ solution is to re-fit the ARIMA obtained from `auto_arima` with the new data. This way, the order (e.g., p, d, q) and other args stay the same. For this example, let us add in the forecasted values, as if they were actual observed values.

In [27]:
updated_data = np.concatenate([wineind, next_25])
updated_model = stepwise_fit.fit(updated_data)
updated_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | No. Observations: | 201.0 |
| Model: | SARIMAX(1, 1, 2)x(0, 1, 1, 12) | Log Likelihood | -1747.575 |
| Date: | Tue, 25 Jul 2017 | AIC | 3507.149 |
| Time: | 13:13:44 | BIC | 3526.969 |
| Sample: | 0 | HQIC | 3515.169 |
| | - 201 | | |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | -121.4302 | 65.168 | -1.863 | 0.062 | -249.157 | 6.296 |
| ar.L1 | -0.5663 | 0.318 | -1.783 | 0.075 | -1.189 | 0.056 |
| ma.L1 | -0.0392 | 0.328 | -0.120 | 0.905 | -0.682 | 0.604 |
| ma.L2 | -0.4769 | 0.184 | -2.591 | 0.010 | -0.838 | -0.116 |
| ma.S.L12 | -0.4021 | 0.048 | -8.294 | 0.000 | -0.497 | -0.307 |
| sigma2 | 6.688e+06 | 5.5e+05 | 12.155 | 0.000 | 5.61e+06 | 7.77e+06 |

| | | | |
|---|---|---|---|
| Ljung-Box (Q): | 53.19 | Jarque-Bera (JB): | 45.34 |
| Prob(Q): | 0.08 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 0.55 | Skew: | -0.66 |
| Prob(H) (two-sided): | 0.02 | Kurtosis: | 5.01 |


In [29]:
# visualize new forecasts
show(plot_arima(updated_data, updated_model.predict(n_periods=10)))
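The re-fit pattern generalizes to a loop: each time a batch of new observations arrives, append it to the history, re-fit, and forecast again. The sketch below shows that loop with a hypothetical `SeasonalNaive` stand-in so that it runs without any dependencies. The assumption is that any estimator exposing `fit(y)` (returning the fitted model) and `predict(n_periods=...)`, such as the model returned by `auto_arima`, could be dropped in its place.

```python
class SeasonalNaive:
    """Hypothetical stand-in: repeats the last full season when forecasting."""

    def __init__(self, m=12):
        self.m = m

    def fit(self, y):
        self.last_season_ = list(y)[-self.m:]
        return self  # mirror the fit-returns-the-model convention used above

    def predict(self, n_periods=10):
        return [self.last_season_[i % self.m] for i in range(n_periods)]


def walk_forward(model, history, new_batches, n_periods):
    """Re-fit on all data seen so far after each batch, then forecast."""
    history = list(history)  # copy so the caller's data is not modified
    forecasts = []
    for batch in new_batches:
        history.extend(batch)
        model = model.fit(history)
        forecasts.append(model.predict(n_periods=n_periods))
    return forecasts


history = [10, 20, 30, 40]
batches = [[11, 21], [31, 41]]
print(walk_forward(SeasonalNaive(m=4), history, batches, n_periods=3))
# [[30, 40, 11], [11, 21, 31]]
```

Re-fitting on the full history keeps the model order fixed while letting the coefficients adapt. The cost is that every update pays for a complete fit, which may matter for long series or frequent updates.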