### Using **correlation-based methods** to decide which lags to use

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from matplotlib.ticker import MaxNLocator

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import MSTL
from statsmodels.tsa.stattools import ccf

### Data
Air Quality Dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Air+Quality).

In [3]:
df = pd.read_csv('../../Datasets/AirQualityUCI_ready.csv', parse_dates=['Date_Time'], index_col=['Date_Time'])
# keep only quality data
df = df.query(' index>="2004-04-01" and index<="2005-04-30" ')
# put the data on a regular hourly grid; missing timestamps become NaN rows
df = df.asfreq('1H')

# Remove measurements from fixed stations.
# We'll only be using sensor data.
remove = [f for f in df.columns if '_true' in f]
# Remove adjusted humidity.
remove.append('AH')
df.drop(columns=remove, inplace=True)

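# Negative readings are invalid; mark them as missing
df[df < 0] = np.nan

# Fill missing data with the last valid observation
# (fillna(method="ffill") is deprecated in recent pandas)
df = df.ffill()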
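The cleaning steps above can be checked on a toy frame. This is a minimal sketch with made-up values (the timestamps, the `-200.0` reading, and the name `toy` are illustrative assumptions, not taken from the dataset): one hourly row is absent and one reading is negative.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings: the 02:00 row is absent and one value is negative.
idx = pd.to_datetime(
    ["2004-04-01 00:00", "2004-04-01 01:00", "2004-04-01 03:00", "2004-04-01 04:00"]
)
toy = pd.DataFrame({"CO_sensor": [1200.0, 1150.0, -200.0, 1300.0]}, index=idx)

toy = toy.asfreq("h")    # regular hourly grid: the absent 02:00 row appears as NaN
toy[toy < 0] = np.nan    # negative readings become NaN
toy = toy.ffill()        # carry the last valid observation forward

print(toy)
```

Forward filling keeps the series on a gap-free hourly grid, which the ACF, PACF and MSTL steps below rely on, at the cost of repeating stale values over long outages.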

#### Utils

In [14]:
# util for plotting ACF and PACF
def plot_acf_and_pacf(feature, lags=168, figsize=(15,4)):
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=figsize)
    plot_acf(df[feature], lags=lags, ax=ax[0], auto_ylims=True);
    plot_pacf(df[feature], lags=lags, ax=ax[1], auto_ylims=True);
    fig.suptitle(t=feature, size=20)


### Let's plot lags up to 168 hours (one week, i.e., 7 days)

#### The ACF shows peaks at multiples of 24 hours, which indicates daily seasonality. The PACF corroborates this with a peak at 24 hours.

In [18]:
for feat in df.columns:
    plot_acf_and_pacf(feature=feat)

<img src='./plots/ACF-plot-CO_sensor.png'>
<img src='./plots/ACF-plot-NMHC_sensor.png'>
<img src='./plots/ACF-plot-NOX_sensor.png'>
<img src='./plots/ACF-plot-NO2_sensor.png'>
<img src='./plots/ACF-plot-O3_sensor.png'>
<img src='./plots/ACF-plot-T_sensor.png'>
<img src='./plots/ACF-plot-RH_sensor.png'>
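To see why peaks at multiples of 24 point to daily seasonality, here is a small sketch on synthetic data. The helper `sample_acf` and the sine-plus-noise series are assumptions made for illustration; the helper is a plain NumPy estimator, not the statsmodels implementation behind `plot_acf`.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation of x for lags 0..nlags (biased estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)]
    )

# Synthetic hourly series: a 24-hour cycle plus noise, 60 days long.
rng = np.random.default_rng(0)
hours = np.arange(24 * 60)
series = np.sin(2 * np.pi * hours / 24) + 0.3 * rng.standard_normal(hours.size)

acf_vals = sample_acf(series, nlags=72)

# The ACF is high at multiples of the period and low half a period away.
for lag in (12, 24, 36, 48):
    print(f"lag {lag:>2}: {acf_vals[lag]:+.2f}")
```

A series that repeats every 24 hours lines up with itself when shifted by 24, 48, 72, ... hours, which is what produces the regularly spaced peaks in the plots above.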

### Let's zoom in and look at the plots over a smaller set of lags to confirm what we see.

### Lags up to 3 days

In [26]:
for feat in df.columns:
    plot_acf_and_pacf(feature=feat, lags=3*24)

<img src='./plots/ACF-plot-CO_sensor--3-day-lag.png'>
<img src='./plots/ACF-plot-NMHC_sensor--3-day-lag.png'>
<img src='./plots/ACF-plot-NOX_sensor--3-day-lag.png'>
<img src='./plots/ACF-plot-NO2_sensor--3-day-lag.png'>
<img src='./plots/ACF-plot-O3_sensor--3-day-lag.png'>
<img src='./plots/ACF-plot-T_sensor--3-day-lag.png'>
<img src='./plots/ACF-plot-RH_sensor--3-day-lag.png'>

### The ACF has clear peaks at 12-hour intervals for some sensors. The PACF confirms this with a small peak near a lag of 12 or 13. The PACF also shows that the most recent lags (e.g., 1 or 2 hours) typically matter most, with possible seasonality at periods of 12 and 24 hours. This agrees with what we have seen so far in this dataset using domain knowledge and feature selection methods.

#### Let's look at the ACF over just more than 2 weeks' worth of lags to see if any weekly seasonality is visible. We'll also look at the PACF to see if there are any peaks close to 1 week.

* If there is a weekly pattern, we should see a spike near a lag of 168 hours.

In [29]:
for feat in df.columns:
    plot_acf_and_pacf(feature=feat, lags=15*24)

<img src='./plots/ACF-plot-CO_sensor--weekly-pattern.png'>
<img src='./plots/ACF-plot-NMHC_sensor--weekly-pattern.png'>
<img src='./plots/ACF-plot-NOX_sensor--weekly-pattern.png'>
<img src='./plots/ACF-plot-NO2_sensor--weekly-pattern.png'>
<img src='./plots/ACF-plot-O3_sensor--weekly-pattern.png'>
<img src='./plots/ACF-plot-T_sensor--weekly-pattern.png'>
<img src='./plots/ACF-plot-RH_sensor--weekly-pattern.png'>

#### The ACF suggests there could be another peak occurring at 168 hours (i.e., 1 week). This is hinted at in the PACF but isn't always clear.

#### Based on the ACF and PACF so far, we would use recent lags (1 or 2 hours) and seasonal lags (12, 24, and 168 hours).
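As a sketch of how the lags chosen above could be turned into features, the snippet below shifts a column by each lag with `Series.shift`. The helper name `add_lag_features` and the toy data are hypothetical; it assumes the index is on a regular hourly grid (as after `asfreq`), so that a shift of `k` rows equals `k` hours.

```python
import numpy as np
import pandas as pd

def add_lag_features(data, column, lags):
    """Return a copy of data with `column` shifted by each lag (in rows)."""
    out = data.copy()
    for lag in lags:
        out[f"{column}_lag_{lag}"] = out[column].shift(lag)
    return out

# Toy hourly series covering two weeks; the values are just a running counter.
idx = pd.date_range("2004-04-01", periods=24 * 14, freq="h")
toy = pd.DataFrame({"NO2_sensor": np.arange(len(idx), dtype=float)}, index=idx)

lags = [1, 2, 12, 24, 168]  # recent lags plus the seasonal lags suggested by the ACF/PACF
features = add_lag_features(toy, "NO2_sensor", lags).dropna()

print(features.head())
```

Dropping the rows with missing lags discards the first 168 hours, the price of including a weekly lag.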

#### The PACF assumes the time series is stationary

#### As the time series has multiple seasonal components, we shall use MSTL to decompose it into a trend and multiple seasonal components. The residual is what remains after subtracting the trend and seasonal components from the original series. So we shall look at the ACF and PACF of the residuals of the decomposition.

In [34]:
mstl = {}
for feat in df.columns:
    res = MSTL(endog=df[feat], periods=[24, 7*24]).fit()
    mstl[feat] = res

In [51]:
def mstl_resid_plot_acf_pacf(feat, lags=48, figsize=(18, 6)):
    plt.figure(figsize=figsize)  # use the figsize argument instead of a hard-coded size
    grid = GridSpec(nrows=2, ncols=2)
    ax_1 = plt.subplot(grid[0,:])
    ax_2 = plt.subplot(grid[1,0])
    ax_3 = plt.subplot(grid[1,1])

    ax_1.plot(mstl[feat].resid);
    plot_acf(mstl[feat].resid, ax=ax_2, auto_ylims=True, lags=lags);
    plot_pacf(mstl[feat].resid, ax=ax_3, auto_ylims=True, lags=lags);
    plt.suptitle(f'{feat} with trend & seasonality removed', size=20)
    plt.tight_layout();



In [54]:

for feat in df.columns:
    mstl_resid_plot_acf_pacf(feat)

<img src='./plots/MSTL-CO_sensor-ACF-PACF.png'>
<img src='./plots/MSTL-NMHC_sensor-ACF-PACF.png'>
<img src='./plots/MSTL-NOX_sensor-ACF-PACF.png'>
<img src='./plots/MSTL-NO2_sensor-ACF-PACF.png'>
<img src='./plots/MSTL-O3_sensor-ACF-PACF.png'>
<img src='./plots/MSTL-T_sensor-ACF-PACF.png'>
<img src='./plots/MSTL-RH_sensor-ACF-PACF.png'>

#### CROSS CORRELATION

Let's see if there is any cross-correlation between any of the variables and `NO2_sensor`.

In [84]:
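def plot_ccf(x, y, lags=25, ax=None, title='CROSS CORRELATION'):
    # Cross-correlation of x and y at non-negative lags (index 0 is lag 0)
    res = ccf(x, y)
    # Approximate 95% band for zero correlation: +/- 2 / sqrt(n)
    ci = 2 / np.sqrt(len(y))
    if ax is None:
        ax = plt.subplot()
    ax.stem(res[:lags+1])
    ax.fill_between(range(lags+2), -ci, ci, alpha=0.5, color='steelblue')
    ax.set(title=title)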
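To build intuition for reading a cross-correlation plot, the sketch below constructs two synthetic series where `y` depends on `x` three steps earlier, then looks for the lag with the strongest correlation. The helper `lagged_corr` is a plain NumPy illustration written for this example, not the statsmodels `ccf` implementation; which series is treated as leading depends on a function's argument convention, so checking it on a toy case like this is worthwhile.

```python
import numpy as np

def lagged_corr(x, y, max_lag):
    """Correlation between x[t - k] and y[t] for k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.array(
        [np.corrcoef(x[: len(x) - k], y[k:])[0, 1] for k in range(max_lag + 1)]
    )

# Synthetic data: y[t] = 0.8 * x[t - 3] + noise, so x leads y by 3 steps.
rng = np.random.default_rng(42)
n, true_lag = 2000, 3
driver = rng.standard_normal(n + true_lag)
x = driver[true_lag:]
y = 0.8 * driver[:n] + 0.6 * rng.standard_normal(n)

corr = lagged_corr(x, y, max_lag=10)
band = 2 / np.sqrt(n)  # same approximate 95% band as in plot_ccf
best_lag = int(np.argmax(np.abs(corr)))

print(f"strongest lag: {best_lag}, correlation: {corr[best_lag]:.2f}, band: +/-{band:.3f}")
```

A lag whose correlation sits well outside the band is a candidate lagged feature of `x` for predicting `y`; lags inside the band are indistinguishable from noise at roughly the 95% level.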

In [95]:
ccf_x = [x for x in df.columns if x != 'NO2_sensor']
ccf_y = 'NO2_sensor'

In [99]:
fig, ax = plt.subplots(nrows=6, ncols=1, figsize=(10,16))
ax = ax.ravel()
for frame, feat in zip(ax, ccf_x):
    plot_ccf(x=mstl[feat].resid, y=mstl[ccf_y].resid, lags=168, ax=frame, title=f'CROSS CORRELATION {feat} vs {ccf_y}')

plt.tight_layout()

<img src='./plots/CROSS-CORRELATION-air-pollution-data.png'>