# ACF: Autocorrelation Function

* #### The ACF measures how correlated a time series is with lagged copies of itself
* #### The confidence interval of the ACF at a given lag is calculated using Bartlett's formula
    * #### This confidence interval helps determine whether the autocorrelation is statistically significant
* #### White noise, autoregression, trend, and seasonality each leave a different signature on the ACF, which can be used to pick relevant lags for modeling
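To make the confidence interval idea concrete, here is a minimal sketch of Bartlett's formula using only NumPy. The helper names `sample_acf` and `bartlett_halfwidth` are hypothetical (they are not part of this notebook or of any library), and the input is synthetic noise chosen purely for illustration. The approximation used is Var(r_k) ≈ (1 + 2 Σ_{j<k} r_j²) / n.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    return np.array(
        [1.0] + [np.dot(xc[k:], xc[:-k]) / denom for k in range(1, nlags + 1)]
    )

def bartlett_halfwidth(r, n, z=1.96):
    """Half-width of the approximate 95% interval for lags 1..K.

    Uses Var(r_k) ~ (1 + 2 * sum_{j=1}^{k-1} r_j^2) / n.
    """
    r = np.asarray(r, dtype=float)
    # Cumulative sum of squared autocorrelations below each lag (0 for lag 1)
    cum = np.concatenate(([0.0], np.cumsum(r[1:-1] ** 2)))
    return z * np.sqrt((1.0 + 2.0 * cum) / n)

# Synthetic example data (an assumption, not the retail dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=500)

r = sample_acf(x, nlags=10)
halfwidth = bartlett_halfwidth(r, n=len(x))

# A lag is flagged as significant when |r_k| exceeds its half-width
significant = np.abs(r[1:]) > halfwidth
print(halfwidth.round(3))
print(significant)
```

At lag 1 the half-width reduces to 1.96/√n, and it can only widen at higher lags because each additional squared autocorrelation adds to the variance estimate.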

In [23]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

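from sktime.utils.plotting import plot_lags, plot_series, plot_windows, plot_correlations

# plot_acf / plot_pacf are statsmodels functions (the `auto_ylims` argument used below is a statsmodels option)
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf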
from statsmodels.nonparametric.smoothers_lowess import lowess
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.seasonal import STL


#### Data Set Synopsis

The timeseries is between January 1992 and Apr 2005.

It consists of a single series of monthly values representing sales volumes. 

We will work with a monthly retail sales dataset (found [here](https://raw.githubusercontent.com/facebook/prophet/master/examples/example_retail_sales.csv)).

In [2]:
df = pd.read_csv('../../Datasets/example_retail_sales.csv', parse_dates=['ds'], index_col=['ds'])
df.head()

| ds | y |
|---|---|
| 1992-01-01 | 146376 |
| 1992-02-01 | 147079 |
| 1992-03-01 | 159336 |
| 1992-04-01 | 163669 |
| 1992-05-01 | 170068 |


## Pearson correlation coefficient
* It's a measure of linear correlation between two sets of data. 
* It is the ratio between the covariance of two variables and the product of their standard deviations; 
    * thus, it is essentially a normalized measurement of the covariance 
    * the result always has a value between −1 and 1. 

<div style='background-color:white; height:100px'>
<img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/f76ccfa7c2ed7f5b085115086107bbe25d329cec' style='height:100%; margin-left:50px'>
</div>
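The definition above (covariance divided by the product of the standard deviations) can be checked numerically. This is a small sketch on made-up numbers; `pearson_r` is a hypothetical helper name, and `np.corrcoef` is used only as a cross-check.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation as covariance / (std_x * std_y) (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    covariance = np.mean((x - x.mean()) * (y - y.mean()))
    return covariance / (x.std() * y.std())

# Made-up illustrative data: y is roughly 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

Both the covariance and the standard deviations here use the population form (dividing by n), so the normalization cancels and the result matches `np.corrcoef`.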

#### Correlation describes the strength of a linear relationship

#### The steepness of the line does not affect the correlation; only the sign of the slope determines the sign of the correlation

#### A correlation of 1 means a perfect positive linear correlation

#### A correlation of -1 means a perfect negative linear correlation

#### Autocorrelation can only capture linear relationships; it cannot capture non-linear ones.

<div style='background-color:white; height:300px'> 
<img src='https://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg' style='height:100%; width:100%'>
</div>
<br/>

<span>https://en.wikipedia.org/wiki/Pearson_correlation_coefficient</span>
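The points about slope and non-linearity can be illustrated with a few synthetic relationships. This is only a sketch on made-up data: lines with very different slopes all give a correlation of magnitude 1, while a perfectly deterministic parabola gives a correlation near 0.

```python
import numpy as np

# Symmetric grid around zero (illustrative data, not from the notebook)
x = np.linspace(-1, 1, 201)

# Perfect lines: only the sign of the slope matters, not its size
r_shallow = np.corrcoef(x, 0.1 * x)[0, 1]
r_steep = np.corrcoef(x, 10.0 * x)[0, 1]
r_negative = np.corrcoef(x, -3.0 * x)[0, 1]

# A deterministic but non-linear relationship
r_parabola = np.corrcoef(x, x ** 2)[0, 1]

print(r_shallow, r_steep, r_negative, r_parabola)
```

The parabola case shows that a correlation near zero does not imply independence, which is worth keeping in mind when reading an ACF plot.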

## Let's build the ACF from scratch

In [8]:
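def ACF_util(series, lags=10, freq=None, ax=None, return_df=False, return_r=False):
    """Plot the autocorrelation of `series` for lags 0..`lags`, computed as the
    Pearson correlation between the series and each lagged copy of itself."""
    r = [1]  # the correlation at lag 0 is 1 by definition
    # to_frame(name='y') keeps the values whatever the input Series is named
    res = series.to_frame(name='y')
    y_mean = np.mean(res['y'])
    for i in range(lags):
        lag_col = f'y_lag_{i+1}'
        res[lag_col] = res['y'].shift(periods=i+1, freq=freq)
        lag_mean = np.mean(res[lag_col])

        # NaNs introduced by the shift are skipped, so only overlapping pairs contribute
        covariance = np.mean((res['y'] - y_mean) * (res[lag_col] - lag_mean))
        std_product = np.std(res['y']) * np.std(res[lag_col])

        correlation = covariance / std_product
        r.append(correlation)
    if ax is not None:
        ax.stem(r)
    else:
        plt.stem(r)

    if return_df and return_r:
        return res, r
    elif return_df:
        return res
    elif return_r:
        return r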

In [7]:
def acf_plot_util(y, title):
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,4))
    ACF_util(y, lags=25, ax=ax[0]);
    plot_acf(y, ax =ax[1], auto_ylims=True);
    fig.suptitle(title, fontsize=24)
    plt.tight_layout();

## AR Process and ACF
* AR stands for autoregressive.
#### AR(1) process - a time series determined by lag 1 

* Let's create a toy time series in which each value depends directly only on the previous value (lag 1).
* We might expect the series $y$ to be correlated only with its lag 1, but we will see that this is not the case. 
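To see why lags beyond 1 can still show correlation, it may help to write the AR(1) model out. The following is the standard textbook derivation, sketched under the assumption that the process is stationary ($|\phi| < 1$). The symbols $c$, $\phi$, $\varepsilon_t$, $\gamma_k$ and $\rho_k$ are notation introduced here for illustration: $\phi$ is the lag-1 coefficient, $c$ a constant, and $\varepsilon_t$ noise that is uncorrelated with past values of $y$.

$$
y_t = c + \phi\, y_{t-1} + \varepsilon_t
$$

$$
\gamma_k = \operatorname{Cov}(y_t, y_{t-k}) = \phi\, \operatorname{Cov}(y_{t-1}, y_{t-k}) = \phi\, \gamma_{k-1}, \qquad k \ge 1
$$

$$
\rho_k = \frac{\gamma_k}{\gamma_0} = \phi^{k}
$$

With $\phi = 0.9$ this gives $\rho_1 = 0.9$, $\rho_2 = 0.81$ and $\rho_5 \approx 0.59$, so the autocorrelation decays geometrically instead of dropping to zero after lag 1.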

In [3]:

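def AR_gen(p=1, lag_coef=(0.9,), const=0, num_data_points=300):
    """Simulate an AR(p) process driven by standard normal noise.

    `lag_coef` is ordered from the oldest lag (t-p) to the most recent (t-1).
    """
    rng = np.random.RandomState(seed=42)  # fixed seed for reproducibility
    data = np.zeros(num_data_points)
    index = pd.date_range(start='2000-01-01', periods=num_data_points)
    y = pd.Series(data=data, index=index)

    # The first p values stay at zero; each later value is built from the previous p values
    for i in range(p, num_data_points):
        y.iloc[i] = const + np.dot(y.iloc[i-p:i].values, lag_coef) + rng.normal(loc=0, scale=1)

    return y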


In [5]:
ar_1 = AR_gen()
plot_series(ar_1)
plt.title('AR(1)', size=20)
plt.xticks(rotation=20);

<img src='./plots/AR-1-plot-default.png'>

#### ACF of AR(1)

In [91]:
ACF_util(ar_1, lags=25)

<img src='./plots/acf_util_AR_1_plot.png'>

In [92]:
acf_plot_util(ar_1, title="Autocorrelation of AR(1) process")

<img src='./plots/acf_util_AR_1_plot_compare.png'>

#### So the ACF alone does not reveal that lag 1 is the only lag driving the series, even though $y_t$ is generated only from $y_{t-1}$. The dependence on $y_{t-1}$ carries over to $y_{t-2}$, $y_{t-3}$, and so on, so the autocorrelation decays gradually instead of cutting off after lag 1.

#### The partial autocorrelation function (PACF) will help identify that a lag of 1 is important in this scenario.
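The gradual decay can be compared against the textbook result that a stationary AR(1) process with coefficient φ has autocorrelation φ^k at lag k. The sketch below is a NumPy-only check, not the notebook's own code: `sample_acf` is a hypothetical helper, and the series length of 20,000 is an assumption chosen to keep sampling noise small (the toy series above uses 300 points).

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    return np.array(
        [1.0] + [np.dot(xc[k:], xc[:-k]) / denom for k in range(1, nlags + 1)]
    )

phi = 0.9    # lag-1 coefficient, matching the 0.9 used above
n = 20_000   # assumption: a long series so the estimate is stable

rng = np.random.default_rng(42)
eps = rng.normal(size=n)

# Simulate y_t = phi * y_{t-1} + eps_t, starting from zero
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t]

estimated = sample_acf(y, nlags=10)
theoretical = phi ** np.arange(11)

for k in range(11):
    print(f"lag {k:2d}: estimated={estimated[k]:.3f}  theoretical={theoretical[k]:.3f}")
```

The estimated values should track φ^k closely, which is why every low lag looks important in the ACF even though only lag 1 enters the generating equation.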

## Time series with trend and seasonality

Let's assume that we do not know that there is seasonality with period 12 (i.e., yearly) in the data.

We can use the lag plots and the ACF to highlight the period of the seasonality and therefore suggest a useful lag.

In [7]:
# Load the retail sales dataset
df = pd.read_csv(
    "../../Datasets/example_retail_sales.csv",
    parse_dates=["ds"],
    index_col=["ds"],
)

plot_series(df['y'])

<img src='./plots/retail-sales-plot.png'>

#### Let's compute the ACF for retail sales
* #### This data has both a trend and a seasonal pattern
* #### Trend, seasonality, and autoregression each leave a different signature on the ACF plot
* #### The autocorrelation is one at lag 0, as expected.
* #### The autocorrelation decays slowly due to the strong trend.
* #### There are peaks at multiples of lag 12. This suggests there is seasonality with a period of 12.

In [10]:
# Named retail_acf so the result does not shadow the imported statsmodels `acf` function
retail_lag, retail_acf = ACF_util(df['y'], lags=36, return_df=True, return_r=True)

<img src='./plots/trend_seasonality_and_ACF.png'>

In [20]:
fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(15,10) )
ax = ax.ravel()

for i, lag in enumerate([6,12,18,24,30,36]):
    retail_lag.plot(y=['y',f'y_lag_{lag}'], ax=ax[i])

<img src='./plots/acf_util_retail_lags.png'>

* #### Due to the strong trend component, a high ACF is observed at many lags
* #### The time series has a periodicity of 12 months
    * ##### So at multiples of 12 we can observe a high ACF

In [27]:
fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(15,15) )
ax = ax.ravel()

for i, lag in enumerate([6,12,18,24,30,36]):
    ax[i].scatter(retail_lag['y'], retail_lag[f'y_lag_{lag}'])
    ax[i].set(title=f'lag-{lag}')

<img src='./plots/acf_util_retail_lag_plot.png'>

* #### In summary this would suggest that we could create features using a lag of 12 (from the seasonality). 
* #### If there is a strong trend component in the data then recent values (i.e., low lags) will also be helpful. So we should also consider using a lag of 1 or 2.
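The lag features suggested above can be built with `Series.shift`. The following is a minimal sketch on a made-up monthly series, not the retail data. The choice of lags 1, 2 and 12 follows the summary, while the series values and the column naming are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up monthly series standing in for the target (values are just 0, 1, 2, ...)
index = pd.date_range("2000-01-01", periods=48, freq="MS")
y = pd.Series(np.arange(48, dtype=float), index=index, name="y")

# One column per lag: recent lags for the trend, lag 12 for the seasonality
features = y.to_frame()
for lag in (1, 2, 12):
    features[f"y_lag_{lag}"] = y.shift(lag)

# The first 12 rows have no lag-12 value yet, so they are dropped
features = features.dropna()
print(features.head())
```

Dropping the incomplete rows costs as many observations as the largest lag, which is one practical reason to keep the set of lags small.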

## ACF after detrending the original series

In [14]:
res = lowess(endog=df['y'], exog=range(len(df)), frac=0.1)

plot_series(df['y'])
plt.plot(res[:,1], c='salmon', linewidth=4);
plt.xticks(rotation=20);

<img src='./plots/retail-sales-trend.png'>

#### Detrend the timeseries

In [17]:
retail_detrend = df['y'] - res[:, 1]

plot_series(retail_detrend)
plt.xticks(rotation=20);

<img src='./plots/retail-sales-detrended.png'>

* #### There are peaks at multiples of the seasonal lag of 12 due to the seasonality. This is much clearer now that we have de-trended. 

In [19]:
acf_plot_util(retail_detrend, title="Autocorrelation of retail sales detrended")

<img src='./plots/acf_util_retail_detrended.png'>
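The effect of detrending on the ACF can be reproduced on a synthetic series with a known trend and a 12-step seasonal cycle. This sketch makes several assumptions: the data is artificial, `sample_acf` is a hypothetical helper, and a straight-line fit via `np.polyfit` is swapped in for LOWESS to keep the example dependency-free.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    return np.array(
        [1.0] + [np.dot(xc[k:], xc[:-k]) / denom for k in range(1, nlags + 1)]
    )

# Synthetic series: linear trend + period-12 seasonality + noise
rng = np.random.default_rng(0)
t = np.arange(240)
series = 0.5 * t + 10.0 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=1.0, size=t.size)

# Straight-line trend estimate (a stand-in for the LOWESS fit used above)
slope, intercept = np.polyfit(t, series, deg=1)
detrended = series - (slope * t + intercept)

acf_raw = sample_acf(series, nlags=24)
acf_detrended = sample_acf(detrended, nlags=24)

for k in (6, 12, 18, 24):
    print(f"lag {k:2d}: raw={acf_raw[k]:.2f}  detrended={acf_detrended[k]:.2f}")
```

Before detrending, every lag is strongly positive because the trend dominates. After detrending, the half-period lags (6, 18) turn negative and the full-period lags (12, 24) stand out as positive peaks.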

## ACF of trend component

In [22]:
trend = pd.Series(res[:,1], index=df.index)
acf_plot_util(trend, title="Autocorrelation of retail sales trend")

<img src='./plots/acf_util_retail_trend.png'>

#### Now let's use STL decomposition and extract the residuals.

In [27]:
stl = STL(endog=df['y'], period=12, seasonal=7).fit()
# stl.plot();

<img src='./plots/retail-STL.png'>

#### Autocorrelation of residual

In [30]:
acf_plot_util(stl.resid, title="Autocorrelation of residual");

<img src='./plots/acf_residual.png'>

* #### We can see that there aren't any significant lags

## White-noise and ACF

For white noise we expect the ACF to be small at all non-zero lags because the data points are independent of one another.

In [9]:
num_data_points=300

rng = np.random.RandomState(seed=42)

noise = rng.normal(loc=0.0, scale=1.0, size=num_data_points)

noise = pd.Series(noise, index=pd.date_range(start='2000-01-01', periods=num_data_points, freq='D'))

plot_series(noise)
plt.xticks(rotation=30);

<img src='./plots/whitenoise_sktime_plot.png'>

 - The autocorrelation is one at lag 0, as expected.
 - The autocorrelation at all other lags is not significant. This is expected for white noise: each point is generated independently of every other point, so no point is correlated with a previous one.

In [41]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,4))
ACF_util(noise, lags=25, ax=ax[0]);
plot_acf(noise, ax =ax[1]);


<img src='./plots/lag_plot_for_white_noise.png'>
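For white noise, the significance band has a simple form: with all true autocorrelations equal to zero, Bartlett's formula reduces to roughly ±1.96/√n at every lag. The sketch below checks this on freshly generated noise. `sample_acf` is a hypothetical helper, and the random generator differs from the one used above, so the exact values will not match the plots.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    return np.array(
        [1.0] + [np.dot(xc[k:], xc[:-k]) / denom for k in range(1, nlags + 1)]
    )

n = 300
rng = np.random.default_rng(42)
noise = rng.normal(loc=0.0, scale=1.0, size=n)

r = sample_acf(noise, nlags=25)

# Approximate 95% band for white noise
bound = 1.96 / np.sqrt(n)
outside = int(np.sum(np.abs(r[1:]) > bound))

print(f"band: +/-{bound:.3f}")
print(f"lags outside the band: {outside} of 25")
```

Because the band is a 95% interval, roughly one lag in twenty can fall outside it purely by chance, so an isolated small exceedance is not evidence of structure.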