## Computing the cross correlation function (CCF)

The cross-correlation function (CCF) measures the correlation between a time series and lagged values of a feature series.

An approximate $95\%$ confidence interval around zero is given by $\pm\frac{2}{\sqrt{n}}$, where $n$ is the number of observations.
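
As a minimal sketch of the definition above, the correlation at a single lag can be computed by hand with NumPy and compared against the $\frac{2}{\sqrt{n}}$ bound. The helper `cross_corr_at_lag` and the synthetic series are hypothetical, introduced only for illustration; it uses the plain Pearson correlation on the overlapping samples, so its values may differ slightly from what `statsmodels` returns.

```python
import numpy as np


def cross_corr_at_lag(y, x, lag):
    """Pearson correlation between y[t] and x[t - lag] for lag >= 0."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if lag == 0:
        return np.corrcoef(y, x)[0, 1]
    # Align y[lag:] with x[:-lag] so each pair is (y[t], x[t - lag])
    return np.corrcoef(y[lag:], x[:-lag])[0, 1]


rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)

# Hypothetical target: responds to x three steps later, plus noise
y = np.zeros(n)
y[3:] = 2.0 * x[:-3]
y += rng.normal(scale=0.5, size=n)

bound = 2 / np.sqrt(n)  # approximate 95% bound around zero
corr_lag3 = cross_corr_at_lag(y, x, 3)
corr_lag1 = cross_corr_at_lag(y, x, 1)

print(f"bound: +/-{bound:.3f}")
print(f"lag 3: {corr_lag3:.3f}, significant: {abs(corr_lag3) > bound}")
print(f"lag 1: {corr_lag1:.3f}, significant: {abs(corr_lag1) > bound}")
```

Only the lag that was built into the synthetic target should stand clearly outside the bound.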

**An example** <br>
* time-series: retail sales  
* feature-series: advertisement spending




In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

from statsmodels.tsa.seasonal import STL, MSTL
from statsmodels.tsa.stattools import pacf, acf, ccf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

from sktime.utils.plotting import plot_series

In [202]:
# helper functions

def plot_ccf(y, x, ax, title='Cross correlation'):
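    # cross-correlation via statsmodels: element k is the correlation
    # between x[t + k] and y[t]; the returned array has one entry per lag
    # and the same length as x
    cross_corr = ccf(x, y)
    # approximate 95% confidence bound around zero
    ci = 2 / np.sqrt(len(y))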

    markerline, stemlines, baseline = ax.stem(cross_corr, linefmt='r', markerfmt='.')
    markerline.set_markerfacecolor('None')

    ax.xaxis.set_major_locator(MaxNLocator(integer=True))

    ax.fill_between(range(len(y)), y1=ci, y2=-ci, alpha=0.5, color='salmon')
    ax.set(title=title)


def cross_lag_plot(x, y, ax, lag, freq=None, title='Lag plot', **kwargs):
    ax.scatter(y=y, x=x.shift(periods=lag, freq=freq), **kwargs)
    ax.set(title=title, xlabel='X', ylabel='Y')


def white_noise_gen(num_points=300, seed=42):
    index = pd.date_range(start='2000-01-01', periods=num_points)
    rng = np.random.RandomState(seed=seed)
    data = rng.normal(loc=0.0, scale=1.0, size=num_points)
    noise = pd.Series(data=data, index=index)
    return noise


def AR_gen(p=1, num_points=300, seed=42, const=0, lag_coef=(0.9,)):
    # lag_coef is ordered from the oldest lag (t-p) to the most recent (t-1)
    index = pd.date_range(start='2000-01-01', periods=num_points)
    rng = np.random.RandomState(seed=seed)
    series = pd.Series(data=np.zeros(num_points), index=index)

    for i in range(p, num_points):
        noise = rng.normal(loc=0.0, scale=1.0)
        # use .iloc so the slice is explicitly positional on the DatetimeIndex
        series.iloc[i] = const + np.dot(series.iloc[i-p:i], lag_coef) + noise

    return series


def lag_plot_of_two_series(x, y, title='Lag plot', rows=1, cols=1, figsize=(5,5), **kwargs):
    fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize=figsize, constrained_layout=True, sharex=True, sharey=True)
    ax = ax.ravel() if isinstance(ax, np.ndarray) else [ax]

    for i, frame in enumerate(ax):
        cross_lag_plot(x=x, y=y, ax=frame, lag=i+1, title=f'Lag-{i+1}', **kwargs)

    fig.suptitle(f'{title}\n', size=24);

## CCF and noise

In [110]:
noise_1 = white_noise_gen(seed=42)
plot_series(noise_1)
plt.xticks(rotation=20);

<img src='./plots/whitenoise_sktime_plot.png'>

In [107]:
noise_2 = white_noise_gen(seed=100)
plot_series(noise_2)
plt.xticks(rotation=20);

<img src='./plots/whitenoise_sktime_plot-seed-100.png'>

In [112]:
plt.figure(figsize=(18,4))
ax = plt.subplot()
plot_ccf(noise_1, noise_2, ax)

<img src='./plots/cross-correlation-whitenoise-noise_1-vs-noise_2.png'>

#### As expected, the cross-correlation is small and not significant across most lags. By chance alone we expect a few lags to fall outside the confidence interval, so seeing a handful of lags just outside it is nothing to pay attention to.
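
To make the "by chance" argument concrete, here is a rough simulation sketch. It assumes two independent Gaussian white-noise series and counts how often the lagged correlation lands outside $\pm\frac{2}{\sqrt{n}}$. The helper `lagged_corr`, the number of trials, and the maximum lag are illustrative choices, not part of the notebook.

```python
import numpy as np


def lagged_corr(a, b, lag):
    """Pearson correlation between a[t + lag] and b[t] for lag >= 0."""
    if lag == 0:
        return np.corrcoef(a, b)[0, 1]
    return np.corrcoef(a[lag:], b[:-lag])[0, 1]


rng = np.random.default_rng(1)
n, max_lag, n_trials = 300, 30, 200
bound = 2 / np.sqrt(n)

outside = 0
for _ in range(n_trials):
    # Two independent white-noise series: no true relationship at any lag
    a = rng.normal(size=n)
    b = rng.normal(size=n)
    corrs = np.array([lagged_corr(a, b, k) for k in range(max_lag + 1)])
    outside += int(np.sum(np.abs(corrs) > bound))

fraction_outside = outside / (n_trials * (max_lag + 1))
print(f"fraction of lags outside the bound: {fraction_outside:.3f}")
```

With a 95% interval, roughly one lag in twenty should fall outside the band even when the series are unrelated.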

In [139]:
fig, ax = plt.subplots(nrows=3,ncols=6, figsize=(20,10), constrained_layout=True, sharex=True, sharey=True)
ax = ax.ravel()

for i, frame in enumerate(ax):
    cross_lag_plot(x=noise_1, y=noise_2, ax=frame, lag=i+1, title=f'Lag-{i+1}', marker='.', edgecolor='k', s=64)

fig.suptitle('Whitenoise has no information\n', size=24);

<img src='./plots/lag-plot-whitenoise-noise_1-vs-noise_2-.png'>

## CCF and AR process


#### Let's study the cross-correlation between two AR-1 processes

In [225]:
# create two independent AR-1 processes
ar_1 = AR_gen(p=1, lag_coef=[0.6], seed=42)
ar_2 = AR_gen(p=1, lag_coef=[0.8], seed=0)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15,4))
plot_series(ar_1, ax=ax[0])
plot_series(ar_2, ax=ax[1])
ax[0].tick_params(axis='x', rotation=30)
ax[1].tick_params(axis='x', rotation=30)
fig.suptitle('Two AR-1 Process')
plt.show()

<img src='./plots/two-random-ar1-process.png'>

In [226]:
plt.figure(figsize=(18,4))
ax = plt.subplot()
plot_ccf(y=ar_1, x=ar_2, title='Cross correlation between two AR-1 process', ax=ax)

<img src='./plots/two-random-ar1-process-cross-correlation.png'>

In [229]:
lag_plot_of_two_series(x=ar_1, y=ar_2, rows=3, cols=3, title='Lag plot of two AR process', marker='.', edgecolor='k')

<img src='./plots/two-random-ar1-process-lag-plot.png'>

#### As expected, the cross-correlation of two independent AR-1 processes is small and not significant across most lags. By chance alone we expect a few lags to fall outside the confidence interval, so seeing a handful of lags just outside it is nothing to pay attention to.

* Let's create a toy time series $y_t$ that we expect to be correlated with a lag of a different time series $x_t$.
* For fun, let's use an AR-1 process for $x_t$.
* We then create $y_t$ so that it depends on a single lag of $x_t$ (lag $10$ in the code below) plus some additional noise. In the code, `y_1` is the AR-1 series and `y_2` is the series built from its lag.

In [243]:
index = pd.date_range(start='2000-01-01', periods=300)
rng = np.random.RandomState(seed=42)
y_1 = pd.Series(data=np.zeros(300), index=index)
y_2 = pd.Series(data=np.zeros(300), index=index)

# y_1: AR-1 process with coefficient 0.9
for i in range(1, 300):
    y_1.iloc[i] = 0.9 * y_1.iloc[i-1] + rng.normal(loc=0.0, scale=1.0)

# y_2: 3 times y_1 lagged by 10 steps, plus noise
y_2.iloc[:10] = y_1.iloc[:10]
for i in range(10, 300):
    y_2.iloc[i] = 3 * y_1.iloc[i-10] + rng.normal(loc=0.0, scale=1.0)

series = pd.DataFrame(data={ 'y':y_1, 'x':y_2})

series.plot(y=['y', 'x'], subplots=True, figsize=(15,4))

<img src='./plots/two-realted-ar-process.png'>

In [267]:
lag_plot_of_two_series(x=y_1, y=y_2, title='Lag plot of two related AR-1 process', rows=3, cols=5, figsize=(10,7), edgecolor='k')

<img src='./plots/two-realted-ar-process-lag-plot.png'>

In [271]:
plt.figure(figsize=(15,8))
plot_ccf(y=y_1, x=y_2, ax=plt.subplot(), title='Cross correlation of two related AR-1 process')

<img src='./plots/two-realted-ar-process-cross-correlation.png'>

In [277]:
plt.figure(figsize=(15,4))
plot_ccf(y=y_1[:50], x=y_2[:50], ax=plt.subplot(), title='Cross correlation of two related AR-1 process')

#### Many lags are significant even though only one drives the relationship: the underlying AR-1 series is strongly autocorrelated, so lags close to $10$ are correlated as well. Nevertheless, the CCF peaks at lag $10$, which lets us identify it as the important lag!

<img src='./plots/two-realted-ar-process-cross-correlation-zoomed.png'>
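
Reading the peak off the plot can also be done numerically. The sketch below rebuilds a similar pair of series with plain NumPy (an AR-1 driver with coefficient 0.9 and a response equal to 3 times its lag 10 plus noise) and takes the lag with the largest absolute correlation. The helper `lagged_corr` and the choice of 30 as the maximum lag are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, true_lag = 300, 10

# AR-1 driver series
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.9 * x[t - 1] + rng.normal()

# Response: 3 * x lagged by `true_lag` steps, plus noise
y = np.zeros(n)
y[:true_lag] = x[:true_lag]
y[true_lag:] = 3 * x[:-true_lag] + rng.normal(size=n - true_lag)


def lagged_corr(y, x, lag):
    """Pearson correlation between y[t] and x[t - lag] for lag >= 0."""
    if lag == 0:
        return np.corrcoef(y, x)[0, 1]
    return np.corrcoef(y[lag:], x[:-lag])[0, 1]


max_lag = 30
corrs = np.array([lagged_corr(y, x, k) for k in range(max_lag + 1)])

bound = 2 / np.sqrt(n)
n_significant = int(np.sum(np.abs(corrs) > bound))
peak_lag = int(np.argmax(np.abs(corrs)))

print(f"lags outside the bound: {n_significant} of {max_lag + 1}")
print(f"lag with the largest |correlation|: {peak_lag}")
```

Counting significant lags alone would be misleading here; the location of the maximum is what points to the lag that was actually built into the data.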

## Data Set Synopsis

* #### The retail sales dataset is a monthly time series of sales volumes collected between January 1992 and May 2016.
* #### The air passengers dataset is a monthly time series of the number of US air passengers collected between January 1949 and December 1960.

The retail sales dataset can be found [here](https://raw.githubusercontent.com/facebook/prophet/master/examples/example_retail_sales.csv) and the air passengers dataset [here](https://raw.githubusercontent.com/facebook/prophet/master/examples/example_air_passengers.csv).

In [4]:
retail_sales = pd.read_csv('../../Datasets/example_retail_sales.csv', parse_dates=['ds'], index_col=['ds'], nrows=120)

air_passenger = pd.read_csv('../../Datasets/example_air_passengers.csv', parse_dates=['ds'], index_col=['ds'], nrows=120)

## Time-series with Trend and Seasonality

#### A shared trend component alone is enough to create cross-correlations between two time series, even when the series have nothing to do with one another, in which case we would not want to use one as a predictor of the other.
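
The effect is easy to reproduce on synthetic data. The sketch below assumes two unrelated series that each consist of a linear trend plus independent noise; the slopes, noise scales, and series length are arbitrary illustrative values. It removes the trend with a fitted straight line (`np.polyfit`), which is a simpler stand-in for a full trend and seasonal decomposition.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
t = np.arange(n)

# Two unrelated series that merely share an upward trend
series_a = 0.5 * t + rng.normal(scale=5.0, size=n)
series_b = 2.0 * t + rng.normal(scale=10.0, size=n)

corr_raw = np.corrcoef(series_a, series_b)[0, 1]


def detrend_linear(values, time):
    """Subtract a least-squares straight line from the series."""
    coeffs = np.polyfit(time, values, deg=1)
    return values - np.polyval(coeffs, time)


resid_a = detrend_linear(series_a, t)
resid_b = detrend_linear(series_b, t)
corr_detrended = np.corrcoef(resid_a, resid_b)[0, 1]

print(f"correlation with trend:    {corr_raw:.3f}")
print(f"correlation after detrend: {corr_detrended:.3f}")
```

The raw correlation is driven almost entirely by the trend; once it is removed, what is left reflects only the independent noise.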

In [291]:
fig, ax= plt.subplots(nrows=2, ncols=1, figsize=(18,6))
retail_sales.plot(y=['y'], ax=ax[0], title='Retail-sales')
air_passenger.plot(y=['y'], ax=ax[1], title='Air passengers')
plt.tight_layout()

<img src='./plots/retail-sales--vs--air-passenger.png'>

In [315]:
lag_plot_of_two_series(x=retail_sales, y=air_passenger, title='Lag plot : air passengers and retail sales', rows=3, cols=6, figsize=(15, 10), edgecolor='k')

<img src='./plots/retail-sales--vs--air-passenger--lag-plot.png'>

In [316]:
plt.figure(figsize=(15,4))
ax = plt.subplot()
plot_ccf(x=air_passenger, y=retail_sales, ax=ax, title='Cross-correlation between air passengers and retail sales')

<img src='./plots/retail-sales--vs--air-passenger--CCF-plot.png'>

* #### We see significant correlations at many lags due to the presence of a strong trend
* #### We see oscillation in the CCF output due to seasonality
* #### Neither time series is stationary, so let's make them stationary by removing the trend and seasonal pattern

In [308]:
retail_stl = STL(endog=retail_sales, period=12).fit()
air_passenger_stl = STL(endog=air_passenger, period=12).fit()

In [319]:
lag_plot_of_two_series(x=retail_stl.resid, y=air_passenger_stl.resid, title='Lag plot after removing trend & seasonal pattern : air passengers and retail sales', rows=3, cols=6, figsize=(15, 10), edgecolor='k')

<img src='./plots/retail-sales--vs--air-passenger--Lag-plot-STL-detrend-deseason.png'>

In [321]:
plt.figure(figsize=(15,4))
ax = plt.subplot()
plot_ccf(x=air_passenger_stl.resid, y=retail_stl.resid, ax=ax, title='Cross Correlation on detrended and deseasonalized data :air passengers and retail sales')

<img src='./plots/retail-sales--vs--air-passenger--CCF--STL-detrend-deseason.png'>

#### This demonstrates that, once trend and seasonality are removed, neither of these time series carries much information for predicting the other, as we would expect.