# Case Study 9 - Autoregression
## ECE204 Data Science & Engineering

In [None]:
import pandas as pd
import numpy as np

## Reading the Data

Let's look at a dataset that describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia.

The units are in degrees Celsius and there are 3,650 observations. The source of the data is the Australian Bureau of Meteorology.

Note: This dataset omits one day in each leap year so that every year has the same number of observations (365).

In [None]:
df = pd.read_csv("daily-min-temp-melb.csv", index_col=0, parse_dates=True)
df.head()

In [None]:
# Reindex onto the full daily calendar (this adds the omitted leap-year days as NaN),
# then fill each gap with the next valid observation
idx = pd.date_range('1981-01-01', '1990-12-31')
df = df.reindex(idx)
df = df.bfill()
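To see what the reindex-and-backfill step does, here is a minimal sketch on a hypothetical three-point series (the `toy`, `full_idx`, and `filled` names and the dates are made up for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy series with 2000-01-03 missing from the index
toy = pd.Series(
    [10.0, 11.0, 13.0],
    index=pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-04"]),
)

# Reindexing onto a complete daily calendar inserts NaN for the missing day
full_idx = pd.date_range("2000-01-01", "2000-01-04")
filled = toy.reindex(full_idx)

# Backfill copies the next valid observation backwards into the gap
filled = filled.bfill()
print(filled)
```

Backfilling is only one possible choice; forward filling or interpolation would also produce a gap-free daily series.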

## Visualizing Data

Let's visualize what the minimum daily temperature data looks like. Alongside it, we also plot the rolling mean to smooth out variations in each 365-day window.

**NOTE:** The rolling mean starts high and stabilizes after about 365 days. During the first year, the window contains fewer than 365 observations, so the mean is taken over only the days seen so far ($\leq 365$). Those days begin in January (summer in Melbourne), when temperatures are high, which pulls the early rolling mean up.
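The partial-window behaviour can be checked on a small scale. The sketch below uses a hypothetical 10-day series and a 5-day window (both made up for illustration) to contrast a time-based window, which averages whatever it has, with a count-based window, which waits for a full window.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: 10 daily values that fall steadily from 10 to 1
idx = pd.date_range("2000-01-01", periods=10, freq="D")
toy = pd.Series(np.arange(10, 0, -1, dtype=float), index=idx)

# Time-based window: averages whatever observations fall in the last 5 days,
# so the first few means are computed from fewer than 5 values
time_window = toy.rolling("5D").mean()

# Count-based window: requires 5 observations, so it starts with NaN
count_window = toy.rolling(5).mean()

print(pd.DataFrame({
    "value": toy,
    "rolling('5D')": time_window,
    "rolling(5)": count_window,
}))
```

Because the toy values start high, the early time-based means are also high, which mirrors the effect described in the note.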

In [None]:
# Plotting the original data and the 365 day rolling mean
ax = df.plot(figsize=(12,5));

df.min_temp.rolling("365D").mean().plot(ax=ax, label='rolling mean 365 days')
#df.min_temp.rolling("7d").mean().plot(ax=ax, label='rolling mean 7 days')
ax.set_ylabel("Minimum Temperature")
ax.set_xlabel("Date")
ax.legend();

Autoregression relies on the relationship between the value at a particular time step (say, t) and the values at earlier time steps (t-1, t-2, and so on), which are called **lags**. Using a lag plot, we can visualize how the previous time step relates to the current one.
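As a small illustration of what a lag is, the sketch below builds lagged copies of a hypothetical six-point series with `shift` (the values and the `y_t`, `y_t-1`, `y_t-2` column names are made up for this example).

```python
import pandas as pd

# Hypothetical toy series standing in for a daily temperature column
idx = pd.date_range("2000-01-01", periods=6, freq="D")
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 14.0], index=idx)

# shift(k) moves every value k steps later, so row t holds the value from t-k
lagged = pd.DataFrame({
    "y_t": toy,
    "y_t-1": toy.shift(1),
    "y_t-2": toy.shift(2),
})
print(lagged)

# Rows with a missing lag cannot be used when regressing y_t on its lags
usable = lagged.dropna()
```

A lag plot at lag k is a scatter plot of the `y_t` column against the column shifted by k.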

In [None]:
from pandas.plotting import lag_plot

# Visualizing the lag plot.
# By default lag=1
lag_plot(df.min_temp, alpha=0.2);

We can visualize the lag plot at higher lags as well!

In [None]:
lag_plot(df.min_temp, lag=2, alpha=0.2);

In [None]:
lag_plot(df.min_temp, lag=365, alpha=0.2);

The lag plot at lag=1 indicates a roughly linear relationship between the minimum temperatures at time steps t and t-1 (or, equivalently, at t+1 and t). This relationship becomes weaker as the lag increases.
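One way to put a number on what the lag plots show is the correlation between a series and its lagged copy, which pandas exposes as `Series.autocorr`. The sketch below runs it on a hypothetical synthetic series (a simulated process with coefficient 0.8, chosen only for illustration); on the notebook's data the equivalent call would be `df.min_temp.autocorr(lag=k)`.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic series: each value is 0.8 times the previous one plus noise
rng = np.random.default_rng(0)
n = 2000
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + noise[t]
series = pd.Series(y)

# Pearson correlation between the series and its copy shifted by `lag` steps
lag_corr = {lag: series.autocorr(lag=lag) for lag in (1, 2, 5, 20)}
for lag, r in lag_corr.items():
    print(f"lag={lag:>2}: correlation={r:.3f}")
```

A correlation near 1 corresponds to a tight diagonal band in the lag plot, and a correlation near 0 to a shapeless cloud.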

## Autoregression Model

We will now use a simple autoregression model to forecast ahead.
We have 10 years of minimum-temperature data, so let's predict the last 7 days of this period using all the observations before them (after reindexing to a full daily calendar, that is 3652 - 7 = 3645 observations).
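For reference, an autoregression model of order $p$ is commonly written as follows. The symbols $c$ (intercept), $\phi_i$ (lag coefficients), and $\varepsilon_t$ (noise term) are notation introduced here for illustration and do not appear elsewhere in this notebook.

$$
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t
$$

In this notation, $y_t$ is the minimum temperature on day $t$, and $p$ is the number of lags the model uses.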

In [None]:
# Defining the train set, all days but the last 7: 1981-01-01 to 1990-12-24
dftrain = df.loc[:'1990-12-24']
dftrain.tail()

In [None]:
# The test set: The last 7 days: 1990-12-25 to 1990-12-31
dftest = df.loc['1990-12-25':].copy()
dftest

**Note:** The `statsmodels` library has autoregression models, but the syntax is slightly different from that of `sklearn`. In particular, when defining the model object, we pass in the time series we want to forecast in addition to the `lags` argument.

In [None]:
from statsmodels.tsa.ar_model import AutoReg
#from datetime import datetime

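# The training series goes into the constructor; lags=180 uses the previous 180 days
model = AutoReg(dftrain.min_temp, lags=180)
model_fit = model.fit()

print('Lag: %s' % model_fit.ar_lags)
print('Coefficients: %s' % model_fit.params)

# Forecast the 7 held-out days
yhat = model_fit.predict(start='1990-12-25', end='1990-12-31')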

### Visualizing our predictions

In [None]:
# add predictions to the test dataframe
dftest["predictions"] = yhat

ax = dftest.min_temp.plot(marker='o', label="True minimum Temperature")
dftest.predictions.plot(ax=ax, marker='o', label='AR Predictions');
ax.legend();

We can now check whether our predictions change when we use a different number of lags for autoregression. (Try setting `lags=2`, `3`, `365`, etc.)