# Autocorrelation and time series modeling
## Time series data

Time series data are repeated samples taken at regular time intervals.

They let us examine trends over time, as well as noise in the data.

When dealing with time series data, the main goal is often to forecast future values. This can be quite difficult, because uncertainty grows the further into the future we forecast.

Time series analysis is a very complicated field, and we are just going to scratch the surface of it. You could take a whole class just on time series analysis. 

In [112]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We are going to use data from the World Bank that has mean annual air temperatures for the last 100+ years.

We're going to subset to data starting at 1970.

In [1]:
df = pd.read_csv('../data/usa-annual-temp.csv')
df = df.loc[df['year']>=1970,:]
df.head()

Before we get started, we're going to convert year values to the data type datetime.

We will also set our index (row names) to be the year values. 

In [2]:
df['year'] = pd.to_datetime(df['year'].astype(str), format='%Y')
df = df.set_index('year')
# makes sure years are in order
df = df.sort_index()
df.head()

With this done, we can visualize the time series as a line plot. 

In [3]:
fig, ax = plt.subplots()
ax.plot(df['temp']);
ax.set_xlabel('Year')
ax.set_ylabel('Temp (C)')

The temperature appears to be increasing over time, but it is not constantly increasing. 

## Autocorrelation

Time series data has an important feature: it is sequential, meaning that we know the values before and after a given data point. 

Data points that are close together in time are likely to be correlated. This is a good starting point for our analysis. 

What we can do is calculate the correlation between our data points and the values directly before them. This is called **autocorrelation**: correlating our data with itself.

Specifically, we are going to see how strongly the value from a given year is correlated with the value from the year before (a lag of 1).

In [4]:
from scipy.stats import pearsonr

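lag = 1  # how many years removed?

# current year
current = df['temp'].values[lag:]

# one year prior
prior = df['temp'].values[:-lag]

# calculate correlation
r, p = pearsonr(current, prior)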
print('correlation:', r)
print('p-value:', p)

We can plot an **autocorrelation function** (ACF) that shows the autocorrelation for various lags. 

Points that fall above or below the shaded blue zone indicate statistically significant autocorrelation at that lag.

In [6]:
from statsmodels.graphics.tsaplots import plot_acf

plot_acf(df['temp'], lags=10);
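
To make the ACF less of a black box, here is a minimal sketch of the calculation behind it, written with plain NumPy. The series is synthetic (a made-up upward trend plus noise, not the World Bank data), and `sample_acf` is a helper name invented for this example. The band used here is the rough white-noise rule of thumb of ±1.96/√n; statsmodels may compute its shaded zone differently, so the cutoffs need not match the plot exactly.

```python
import numpy as np

# Synthetic stand-in for an annual series with an upward trend (not the real data)
rng = np.random.default_rng(0)
n = 50
series = 0.03 * np.arange(n) + rng.normal(0, 0.2, n)

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[k:] * centered[:len(x) - k]) / denom
        for k in range(max_lag + 1)
    ])

acf_values = sample_acf(series, max_lag=10)

# Rough 95% band for a white-noise series: +/- 1.96 / sqrt(n)
bound = 1.96 / np.sqrt(n)
significant_lags = [k for k in range(1, 11) if abs(acf_values[k]) > bound]

print(np.round(acf_values, 2))
print('rough band: +/-', round(bound, 2))
print('lags outside the band:', significant_lags)
```

Lag 0 is always exactly 1 (the series correlated with itself), which is why the first bar in an ACF plot is not informative.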


## Stationarity: Augmented Dickey-Fuller Test (ADF Test)

Before we can model our time series, we need our data to be **stationary**: the mean and variance of the data have to be constant across the time series.

Looking at our data, it is pretty safe to conclude it is not stationary: the mean increases as time passes.

To test this formally, we can use the ADF test.

In [7]:
from statsmodels.tsa.stattools import adfuller

adfuller(df['temp'])


Unfortunately, the output from this function is pretty overwhelming. Here's a function that prints just the key values.

The important part here is the p-value. The null hypothesis of the ADF test is that the time series is not stationary (it has a unit root), so the p-value is the probability of seeing a test statistic at least this extreme if the series really were non-stationary.

Because our p-value is quite large, we fail to reject the null hypothesis, so we treat our time series as not stationary.

In [8]:
def adf_print(time_series):
    adf_output = adfuller(time_series)
    stat = adf_output[0]
    pval = adf_output[1]
    print('ADF Statistic:', stat)
    print('p-value:', pval)
    return None

adf_print(df['temp'])

### Regression to quantify trend

If the time series is not stationary, it likely has a trend. In our case, the trend is linear, so we can use linear regression to quantify the trend.

If your data has a trend that is not linear, you can consider transforming the data, or breaking your time series into pieces that each have a linear trend.

In [9]:
from statsmodels.formula.api import ols

df1 = pd.read_csv('../data/usa-annual-temp.csv')
df1 = df1.loc[df1['year']>=1970,:]


model = ols(formula="temp~year", data=df1).fit()
model.summary()
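
Once a linear trend has been fit, the slope is easy to read as a rate, and subtracting the fitted line is one simple way to remove the trend. Here is a rough sketch using `np.polyfit` instead of `ols`. The data is synthetic, with a made-up baseline of 8.0 and a made-up slope of 0.03 degrees per year, so the numbers say nothing about the real temperature record.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic annual temperatures: invented baseline and slope plus noise
years = np.arange(1970, 2020)
t = years - years[0]                      # years since the start of the series
temps = 8.0 + 0.03 * t + rng.normal(0, 0.2, t.size)

# Fit a straight line: temps ~ intercept + slope * t
slope, intercept = np.polyfit(t, temps, deg=1)
print('estimated change per decade:', round(slope * 10, 2))

# Subtract the fitted line to get a detrended series
fitted = intercept + slope * t
detrended = temps - fitted

# The detrended series should have essentially no linear trend left
resid_slope = np.polyfit(t, detrended, deg=1)[0]
print('slope after detrending:', round(resid_slope, 6))
```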

### Differencing

We need to remove the trend from our data. There are several ways to do this, but one of the most reliable ways is to use **differencing** to transform it. 

Differencing simply means subtracting the previous value from each value in the time series, so each point becomes the change since the prior time step.
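
Here is a tiny worked example of differencing with `np.diff`, using made-up numbers so the arithmetic is easy to check by hand. Note how a perfectly linear series turns into a constant after one round of differencing, which is why differencing removes a linear trend.

```python
import numpy as np

y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])

# First difference: each value minus the one before it
first_diff = np.diff(y)
print(first_diff)   # [2. 3. 4. 5.]

# Differencing twice removes a trend in the differences themselves
second_diff = np.diff(y, n=2)
print(second_diff)  # [1. 1. 1.]

# A straight line becomes a constant after one difference
linear = 3.0 + 0.5 * np.arange(6)
print(np.diff(linear))  # [0.5 0.5 0.5 0.5 0.5]
```

The differenced series is one value shorter than the original, because the first point has nothing before it. In pandas, `Series.diff()` keeps the length the same and puts `NaN` in that first slot instead.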

We can visualize our time series and see it is now stationary. 

In [10]:
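# difference the series; the first value has no predecessor, so drop it
y_diff = df['temp'].diff().dropna()

fig, ax = plt.subplots()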
ax.plot(y_diff)
ax.set_xlabel('Year')
ax.set_ylabel('Temp (differenced)')

To be sure, we can run the ADF test again, as well. 

In [11]:
adf_print(y_diff)

We can now revisit our ACF plot and see how it has changed.

You can see the only significant lag is now 2. 

In [12]:
plot_acf(y_diff, lags=10);

There is a similar plot called the partial autocorrelation function (PACF). It shows the correlation at each lag after accounting for the shorter lags in between, so its interpretation is slightly different. For more information, please see [this article](https://www.linkedin.com/pulse/time-series-analysis-short-introduction-/?trk=pulse-article).
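
One way to see the difference between the two plots is to compute a partial autocorrelation by hand. In the sketch below, the partial autocorrelation at lag k is taken to be the coefficient on lag k when the series is regressed on lags 1 through k. This is one common way to define it, and statsmodels may use a different estimator by default, so treat the numbers as approximate. The series is simulated and `partial_autocorr` is a helper name made up for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated series where each value depends only on the one before it
n = 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + rng.normal()

def partial_autocorr(x, k):
    """Coefficient on lag k when x_t is regressed on lags 1..k plus an intercept."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Columns: intercept, x_{t-1}, ..., x_{t-k}
    cols = [np.ones(n - k)] + [x[k - j:n - j] for j in range(1, k + 1)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, x[k:], rcond=None)
    return beta[-1]

# Plain correlation at lag 2 is still sizeable, because lag 1 carries through
lag2_corr = np.corrcoef(x[2:], x[:-2])[0, 1]
pacf1 = partial_autocorr(x, 1)
pacf2 = partial_autocorr(x, 2)

print('lag-2 autocorrelation:', round(lag2_corr, 2))
print('partial autocorrelation, lag 1:', round(pacf1, 2))
print('partial autocorrelation, lag 2:', round(pacf2, 2))
```

The lag-2 partial autocorrelation is close to zero here because, once lag 1 is accounted for, lag 2 has nothing extra to add.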

Our PACF has significant lags of 2, 3, and 10. 

In [13]:
from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(y_diff, lags=10);


Let's use the fact that we have significant autocorrelation to try to make a model.

We're going to use previous timepoints to predict later values. We can start by using data from the previous year to predict the current year.

We're going to multiply the previous value by some number to predict the value for the current year.

We can do the same with a lag of 2 as well.

In [14]:
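predictions = []
coef = 0.5  # starting guess for the multiplier; adjust it to reduce the error

# start at 1 because the first point has no previous value
for i in range(1, len(y_diff)):
    # predict this year from the previous year
    y_p = coef * y_diff.iloc[i-1]
    predictions.append(y_p)

observed = y_diff.values[1:]
predictions = np.array(predictions)

fig, ax = plt.subplots(figsize=(20,10))
ax.plot(observed, label='observed')
ax.plot(predictions, label='pred')
ax.legend()
plt.show()

mean_squared_error = np.mean((observed - predictions)**2)

print(mean_squared_error)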

### Question

In groups, using the code in the cell above, try to make a model with the smallest mean squared error. 

This is a manual version of an **autoregressive (AR) model**. statsmodels has a function `AutoReg` to do the fitting process for us. 

If we use 1 lag term, we call our model an AR(1) model. If we use 3 lag terms, we call our model an AR(3) model. 
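
To connect the hand-tuned multiplier with what a fitting routine does, here is a rough sketch that scans a grid of candidate multipliers for a one-lag model and compares the best one with the closed-form least squares answer. The series is simulated with a known multiplier of -0.4 (a stand-in for `y_diff`, chosen arbitrarily), and the model has no intercept, which is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated one-lag series with a known multiplier (stand-in for y_diff)
true_coef = -0.4
n = 500
x = np.zeros(n)
for t in range(1, n):
    x[t] = true_coef * x[t - 1] + rng.normal()

previous, current = x[:-1], x[1:]

def mse_for(coef):
    """Mean squared error when predicting current = coef * previous."""
    return np.mean((current - coef * previous) ** 2)

# Manual approach: try many multipliers and keep the one with the smallest error
candidates = np.linspace(-1, 1, 201)
errors = np.array([mse_for(c) for c in candidates])
best_by_search = candidates[np.argmin(errors)]

# Least squares gives the same answer directly (no-intercept regression)
best_by_formula = np.sum(previous * current) / np.sum(previous ** 2)

print('best multiplier from grid search:', round(best_by_search, 3))
print('least squares multiplier:', round(best_by_formula, 3))
```

`AutoReg` also estimates an intercept by default, so its lag coefficient can differ slightly from this no-intercept version.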

In [15]:
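from statsmodels.tsa.ar_model import AutoReg

# fit an AR(1) model: one lag term (try other values of lags)
ar_model = AutoReg(y_diff, lags=1).fit()
ar_model.summary()

In [16]:
y_pred = ar_model.predict()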

fig, ax = plt.subplots(figsize=(20,10)) 
ax.plot(y_diff, label='observed')
ax.plot(y_pred, label='pred')
ax.legend();

## MA model

Another way to model a time series is with a **moving average** (MA) model. Instead of using prior values, we use prior errors: how far off our predictions were in previous years.

To decide how many past errors to use, we can look at the ACF plot. For our first example, we'll just use the most recent error.
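
Why the ACF plot? For a moving average process, the autocorrelation drops to roughly zero beyond the number of error terms in the model. The sketch below simulates a series built from the current error plus 0.8 times the previous error (the 0.8 is an arbitrary choice for illustration) and checks the correlation at the first few lags.

```python
import numpy as np

rng = np.random.default_rng(5)

theta = 0.8   # weight on the previous error (arbitrary value for this example)
n = 20000
errors = rng.normal(0, 1, n + 1)

# Each value is the current error plus theta times the previous error
y = errors[1:] + theta * errors[:-1]

def lag_corr(x, k):
    """Correlation between the series and itself shifted by k steps."""
    return np.corrcoef(x[k:], x[:-k])[0, 1]

for k in (1, 2, 3):
    print('lag', k, 'correlation:', round(lag_corr(y, k), 3))

# For this kind of process, the lag-1 correlation is theta / (1 + theta^2)
theory_lag1 = theta / (1 + theta ** 2)
print('theoretical lag-1 correlation:', round(theory_lag1, 3))
```

Only lag 1 stands out, so one error term is enough to describe this simulated series. With real data the cutoff is rarely this clean.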

In [17]:
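theta = 0.5  # starting guess for the weight on the previous error; adjust it to reduce the error
predictions = [0]  # no earlier error exists for the first point, so predict 0

for i in range(len(y_diff)-1):
    # how far off was the prediction for the current point?
    error = y_diff.iloc[i] - predictions[i]
    # the next prediction is a multiple of that error
    predictions.append(theta * error)

predictions = np.array(predictions)

fig, ax = plt.subplots(figsize=(20,10))
ax.plot(y_diff.values, label='observed')
ax.plot(predictions, label='pred')
ax.legend();

mean_squared_error = np.mean((y_diff.values - predictions)**2)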

print(mean_squared_error)


Similar to AR models, we can use some functions to calculate coefficients for us. The one we will use is called ARIMA (more on this later).

In the `order` parameter of the model, which is a tuple `(p, d, q)`, the last value `q` sets how many error terms we will use. If we use 2 error terms, it is an MA(2) model.

In [18]:
from statsmodels.tsa.arima.model import ARIMA

# Fit an MA(2) model to the differenced data
ma_model = ARIMA(y_diff, order=(0, 0, 2)).fit()

# Print out summary information on the fit
ma_model.summary()

In [19]:
y_pred = ma_model.predict()

fig, ax = plt.subplots(figsize=(20,10)) 
ax.plot(y_diff, label='observed')
ax.plot(y_pred, label='pred')
ax.legend();

We can use both AR and MA terms in our model. This is called an ARMA model (we'll get to the I in just a second).

In [21]:
arma_model = ARIMA(y_diff, order=(1, 0, 1)).fit()  # example: 1 AR term, 0 differences, 1 MA term

y_pred = arma_model.predict()

fig, ax = plt.subplots(figsize=(20,10)) 
ax.plot(y_diff[1:], label='observed')
ax.plot(y_pred[1:], label='pred')
ax.legend();

In [20]:
# summary
arma_model.summary()

## I: Integrated

The I in ARIMA stands for integrated. Including a nonzero value for the I term (the middle value of `order`) makes the model do the differencing for us, so we can fit our original time series without detrending it first.
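
The name "integrated" makes more sense once you see that summing is the reverse of differencing. The sketch below uses made-up numbers: it differences a short series, rebuilds the original from the differences with a running sum, and then turns a predicted change into a predicted level. The predicted change of 0.4 is a placeholder, not the output of any model.

```python
import numpy as np

y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])

# Differencing throws away the starting level and keeps only the changes
diffs = np.diff(y)

# Integrating: start from the first value and add the changes back up
rebuilt = np.concatenate(([y[0]], y[0] + np.cumsum(diffs)))
print(rebuilt)  # [10. 12. 15. 19. 24.]

# A model fit on the differences predicts a change; adding it to the
# last observed level gives a prediction on the original scale
predicted_change = 0.4  # placeholder value for illustration
next_level = y[-1] + predicted_change
print(next_level)
```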

In [22]:
mod = ARIMA(df['temp'], order=(1, 1, 1))  # example: 1 AR term, 1 difference, 1 MA term
arima_model = mod.fit()

y_pred = arima_model.predict()

fig, ax = plt.subplots(figsize=(20,10)) 
ax.plot(df['temp'][1:], label='observed')
ax.plot(y_pred[1:], label='pred')
ax.legend();

In [23]:
arima_model.summary()

Choosing the correct model is difficult. Next time, we will cover a better method for picking an ARIMA model, as well as how to forecast and how to deal with seasonal data.

In [24]:
# mean squared error
mean_squared_error = np.mean((df['temp'][1:] - y_pred[1:])**2)
print(mean_squared_error)


### Question
In groups, try several different kinds of ARIMA models, with different numbers of AR, I, and MA terms. What combinations lead to the smallest mean squared error?