# Time Series Windows Pre-Processing Exercises


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import interact
import re

%matplotlib inline

We will continue with the alcohol_consumption dataset, so we still need the parser!


In [None]:
def parse_quarter(string):
    """
    Converts a string in the format YYYYQN into a `pd.Timestamp` at the end of quarter N.
    """

    # Note: you could also just retrieve the first four elements of the string
    # and the last one... Regex is fun but often not necessary
    year, qn = re.search(r"^(20[0-9][0-9])(Q[1-4])$", string).group(1, 2)

    # year and qn will be strings, pd.Timestamp expects integers.
    year = int(year)

    date = None

    if qn == "Q1":
        date = pd.Timestamp(year, 3, 31)
    elif qn == "Q2":
        date = pd.Timestamp(year, 6, 30)
    elif qn == "Q3":
        date = pd.Timestamp(year, 9, 30)
    else:
        date = pd.Timestamp(year, 12, 31)

    return date


# Check that it works!
print(parse_quarter("2000Q3"))  # should show 2000-09-30 00:00:00

### Giving the parser to pandas

Pandas can parse dates using a custom parser such as the one defined above.

Read in `data/NZAlcoholConsumption.csv`, passing the parser function to the `date_parser` option and setting `index_col='DATE'`.

Call your dataframe `alcohol_consumption`.
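As an aside, pandas can interpret `YYYYQN` strings natively through `pd.Period`, so a regex is not strictly required. The sketch below is only an illustration: the helper name `quarter_end` is hypothetical and not part of the exercise, and it assumes calendar-year quarters (Q4 ending in December).

```python
import pandas as pd


def quarter_end(string):
    """Hypothetical alternative parser: 'YYYYQN' -> Timestamp at the end of quarter N."""
    # pd.Period understands strings such as "2000Q3" as quarterly periods
    period = pd.Period(string)
    # end_time is the last nanosecond of the quarter; normalize() resets it to midnight
    return period.end_time.normalize()


print(quarter_end("2000Q3"))
```

For the inputs used in this notebook, this should return the same dates as `parse_quarter`.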

In [None]:
# Load the data using your parser, set the index to the date
# Note: `date_parser` is deprecated in pandas >= 2.0 and may emit a FutureWarning

alcohol_consumption = pd.read_csv(
    "data/NZAlcoholConsumption.csv",
    parse_dates=["DATE"],
    date_parser=parse_quarter,
    index_col="DATE",
)
alcohol_consumption.sort_index(inplace=True)
alcohol_consumption.head()


## Exercises: Differencing

Differencing amounts to looking at the time series formed of differences between values separated by a given lag: 

$y'_t = y_t-y_{t-l}$

where $l$ is the lag.

Compute the differenced time series for `TotalWine` with a lag of 1, using the `diff` method covered in the materials. Call this new time series `diff_series`.
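To make the formula concrete, here is a minimal sketch on a made-up series (the values in `toy` are illustrative assumptions, not taken from the dataset). With lag $l$, the first $l$ entries have no earlier value to subtract, so they come out as `NaN`.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.Series([10, 12, 15, 11, 14, 18])

lag1 = toy.diff(1)  # y_t - y_{t-1}
lag2 = toy.diff(2)  # y_t - y_{t-2}

print(pd.DataFrame({"original": toy, "lag 1": lag1, "lag 2": lag2}))
```

For example, the third entry of `lag2` is 15 - 10 = 5.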

In [None]:
# Your code here
diff_series = alcohol_consumption.TotalWine.diff(1)


The code below plots both the original and the differenced time series. What do you observe?

In [None]:
plt.figure(figsize=(8, 6))
plt.plot(alcohol_consumption.TotalWine, "-o", label="original ts")
plt.plot(diff_series, "-o", label="differenced ts (lag=1)")
plt.legend(fontsize=12)

Let's now build a function that allows us to evaluate easily how the time series would look for different lags.


Complete the function below by computing the differenced series `differenced_ts` with `diff` (for `TotalWine` and a given lag `d`).

In [None]:
def differencing_plot(d):
    # Your code here
    differenced_ts = alcohol_consumption.TotalWine.diff(d)
    plt.plot(differenced_ts, "-o")
    plt.show()


interact(differencing_plot, d=(1, 10))

## Exercise: Autocorrelation

Autocorrelation measures the correlation (similarity) between the time series and a lagged version of itself. 
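As a small illustration of the idea, the sketch below builds a synthetic series that repeats every four steps (the `pattern` values are made up) and measures its correlation with lagged copies of itself using `Series.autocorr`. It is only meant to show what a "lagged version of itself" means, not to replace the plots requested in the exercise.

```python
import numpy as np
import pandas as pd

# Synthetic series that repeats the same four made-up values ten times
pattern = [1.0, 3.0, 2.0, 5.0]
seasonal = pd.Series(np.tile(pattern, 10))

# Pearson correlation between the series and itself shifted by `lag` steps
r4 = seasonal.autocorr(lag=4)  # shift matches the period
r2 = seasonal.autocorr(lag=2)  # shift does not match the period

print(f"lag 4: {r4:.3f}, lag 2: {r2:.3f}")
```

Because the series repeats exactly every four steps, the lag-4 autocorrelation is essentially 1, while the lag-2 value is lower.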

Apply the `plot_acf` and `plot_pacf` functions to the `TotalWine` column of the alcohol consumption data.

What do you observe and what does this tell us about how we should difference our data? 

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Your code here
plot_acf(alcohol_consumption.TotalWine, lags=15)

plot_pacf(alcohol_consumption.TotalWine, lags=15)


The ACF plot shows clearly that the time series is strongly correlated with itself at a lag of 4, and consistently so at the multiples of that lag (8, 12, and so on). The PACF shows that the correlations at those multiples are largely a consequence of the high autocorrelation at lag 4. Since the data is quarterly, this points to a yearly seasonal pattern and suggests differencing with a lag of 4.
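Why a lag equal to the seasonal period helps can be sketched on synthetic data. The series below is an assumption made purely for illustration: a linear trend plus a pattern that repeats every four steps. Differencing with a lag of 4 subtracts each value from the one a full period later, so the repeating pattern cancels out.

```python
import numpy as np
import pandas as pd

# Synthetic data: linear trend (0.5 per step) plus a made-up period-4 pattern
t = np.arange(24)
series = pd.Series(0.5 * t + np.tile([0.0, 4.0, 1.0, 6.0], 6))

# Lag-4 differencing: y_t - y_{t-4}; the first four entries are NaN and are dropped
seasonal_diff = series.diff(4).dropna()

print(seasonal_diff.unique())
```

In this idealised case only the trend contribution over four steps (4 x 0.5 = 2) remains. Real data such as `TotalWine` will not cancel this cleanly.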

### Check the resulting series is now stationary

Using `adfuller`, check that differencing with a lag of 4 yields a stationary time series.
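Keep in mind that differencing with a lag of 4, `diff(4)`, is a different operation from applying lag-1 differencing four times in a row. A minimal sketch on made-up data (the series `s` is an illustrative assumption, not the exercise data):

```python
import pandas as pd

# Made-up series of squares 1, 4, 9, ...
s = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0])

lag4 = s.diff(4)                           # y_t - y_{t-4}
four_times = s.diff().diff().diff().diff() # lag-1 differencing applied four times

print(pd.DataFrame({"diff(4)": lag4, "diff() x 4": four_times}))
```

Both results start with four `NaN` values, but the remaining entries differ: `diff(4)` compares values four steps apart, whereas repeated lag-1 differencing removes polynomial trends of increasing order.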

In [None]:
from statsmodels.tsa.stattools import adfuller


# Your code here
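def testStationarity(x, alpha=0.05):
    # ADF null hypothesis: the series has a unit root (i.e. it is non-stationary)
    results = adfuller(x)
    pvalue = results[1]  # the second element of the result tuple is the p-value
    if pvalue < alpha:
        return "Reject the null: the time series is stationary"
    else:
        return "Fail to reject the null: the time series may be non-stationary"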


testStationarity(alcohol_consumption.TotalWine.diff(4).dropna())
