# <center>Predictive modelling with time series</center>
# <center>Baselines, stationarity and decomposition</center>

![Image](images/timeseries.jpg)

In [None]:
import pandas as pd
import numpy as np
from utils import adf_test
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from matplotlib import pyplot as plt

# jupyter lab configs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
plotly.offline.init_notebook_mode(connected=True)

In [None]:
import statsmodels as sta
sta.__version__

# 1. Baselines 

The simplest forecasts we can make for a univariate time series are:
* 1. Average
* 2. Naive
* 3. Seasonal Naive
* 4. Moving Average
    * a. normal
    * b. cumulative
    
![Image](images/baselines.png) 

*Source: Hyndman and Athanasopoulos. www.otexts.com/fpp2/*
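
As a minimal sketch of how these baselines can be computed, the cell below builds each forecast with pandas and numpy. The series is synthetic (a made-up trend plus a 12-step season), and the names `train`, `horizon`, `season` and the window of 5 for the moving average are assumptions chosen for illustration, not values taken from the datasets used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic series (made up for illustration): linear trend + 12-step season + noise
rng = np.random.default_rng(0)
t = np.arange(48)
y = pd.Series(10 + 0.5 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 48))

# Hold out the last 12 steps to compare the baselines
train, test = y.iloc[:36], y.iloc[36:]
horizon, season = len(test), 12

forecasts = pd.DataFrame({
    # Average: every future value equals the mean of the training data
    "average": np.repeat(train.mean(), horizon),
    # Naive: every future value equals the last observed value
    "naive": np.repeat(train.iloc[-1], horizon),
    # Seasonal naive: repeat the last observed season
    "seasonal_naive": np.tile(train.iloc[-season:].to_numpy(), horizon // season + 1)[:horizon],
    # Moving average: mean of the last 5 observations (hypothetical window size)
    "moving_average": np.repeat(train.iloc[-5:].mean(), horizon),
}, index=test.index)

# Mean absolute error of each baseline on the held-out steps
mae = forecasts.sub(test, axis=0).abs().mean()
print(mae.sort_values())
```

Comparing the errors on a held-out window like this is one simple way to answer which baseline suits trend data and which suits seasonal data.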


### Load the datasets

In [None]:
# example of trend data - wine sales
wine = pd.read_csv('datasets/wine_trend.csv')

# example of seasonal data - daily temperature
temperature = pd.read_csv('datasets/temperature_seasonal.csv')
temperature.set_index('date', drop=True, inplace=True)

# load a nice example for decomposition - production of electrical equipment
ele_df = pd.read_csv('datasets/elecequip.csv')

## Plot the data

In [None]:
# example of timeseries with very strong trend
wine.plot()

In [None]:
# example of time series with marked seasonality
temperature.plot()

In [None]:
ele_df['value'].plot()

## Naive methods and averages

Which baseline worked best in the case of trend data? Which one was best for seasonal data?

In [None]:
t2 = temperature.reset_index().sort_values("date")
t2.head()

In [None]:
# first 500 rows by position (index labels may not be in order after sorting)
t2.iloc[:500].temp.plot()

In [None]:
t2.temp.plot()

In [None]:
t2.temp.rolling(180).max().plot()

In [None]:
wine.wine_sales.mean()

In [None]:
wine.wine_sales.plot()

In [None]:
wine.wine_sales.rolling(window=5).max().plot()

## Moving Average smoothing
### Understand the difference between `rolling` windows and `expanding` windows

Let's calculate a **moving average** and a **cumulative moving average** using the methods from pandas
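
To make the difference concrete before applying it to the real data, here is a small sketch on a hand-made series (the values are made up for illustration). A rolling window only looks at the last `window` observations, while an expanding window uses everything seen so far.

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8, 10])

# Rolling window: mean of the last 3 values only; the first 2 rows are NaN
# because fewer than 3 observations are available there
rolling_mean = s.rolling(window=3).mean()

# Expanding window: mean of all values seen so far (cumulative moving average)
expanding_mean = s.expanding().mean()

print(pd.DataFrame({"value": s, "rolling_3": rolling_mean, "expanding": expanding_mean}))
```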

In [None]:
ele_df['ma_5'] = ele_df.loc[:,'value'].rolling(window=5).mean()
ele_df['ma_10'] = ele_df.loc[:,'value'].rolling(window=10).mean()
ele_df['ma_20'] = ele_df.loc[:,'value'].rolling(window=20).mean()
ele_df['ma_exp'] = ele_df.loc[:,'value'].expanding().mean()

In [None]:
p = plt.figure(figsize=(16,8))
p = plt.plot(ele_df['ma_5'], label='Moving average - 5 steps')
p = plt.plot(ele_df['ma_10'], label='Moving average - 10 steps')
p = plt.plot(ele_df['ma_20'], label='Moving average - 20 steps')
p = plt.plot(ele_df['ma_exp'], label='Cumulative moving average')
p = plt.plot(ele_df['value'], label='Original')
p = plt.legend(loc='best')
plt.show()

---

# 2. Time series Decomposition

### What is the data made of?  🤔

A time series is easier to analyse if we know how each of its components behaves.  
Typically, a time series has 3 components:  
* `S`, the seasonal component  
* `T`, the trend component
* `R`, the residual component

If we consider that these components *add to each other*, the decomposition is said to be *additive*. 
Thus, in **additive decomposition** we have:  
> y(t) = Seasonality(t) + Trend(t) + Residual(t)    
> y(t) = S(t) + T(t) + R(t) 

And in **multiplicative decomposition** we have: 
> y(t) = S(t) * T(t) * R(t)  
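
A quick way to see the practical difference between the two models is to build both from the same made-up components. In this sketch the trend and seasonal values are invented for illustration and the residual is left out (0 in the additive case, 1 in the multiplicative case). The seasonal swing stays constant in the additive series and grows with the trend in the multiplicative one.

```python
import numpy as np

t = np.arange(48)
trend = 100 + 2.0 * t                                # T(t), made-up values
seasonal_add = 10 * np.sin(2 * np.pi * t / 12)       # S(t) in the units of y
seasonal_mul = 1 + 0.1 * np.sin(2 * np.pi * t / 12)  # S(t) as a factor around 1

y_add = trend + seasonal_add   # additive model (residual omitted)
y_mul = trend * seasonal_mul   # multiplicative model (residual omitted)

# Peak-to-peak seasonal swing around the trend in the first and last 12 steps
swing_add = [np.ptp(y_add[:12] - trend[:12]), np.ptp(y_add[-12:] - trend[-12:])]
swing_mul = [np.ptp(y_mul[:12] - trend[:12]), np.ptp(y_mul[-12:] - trend[-12:])]
print("additive swing:", swing_add)
print("multiplicative swing:", swing_mul)

# Taking logs turns the multiplicative model into an additive one:
# log y(t) = log S(t) + log T(t)
log_gap = np.max(np.abs(np.log(y_mul) - (np.log(trend) + np.log(seasonal_mul))))
print("max gap after log transform:", log_gap)
```

If the seasonal swings in a plot grow with the level of the series, a multiplicative model (or an additive model on the log of the data) is usually the more natural choice.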

**Example:** Decomposition of the electrical equipment dataset 

In [None]:
ele_df['value'].plot()

### **Run the additive decomposition**  

The function `seasonal_decompose()` from `statsmodels` is very helpful:  

In [None]:
len(ele_df['value'])

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(ele_df['value'][:190], model='additive', period=12).plot()

In [None]:
# use the same period as in the decomposition plot above
ts = seasonal_decompose(ele_df['value'][:190], model='additive', period=12)

In [None]:
# here's how to get the original (raw data) of each timestep
ts.observed.head(5)

In [None]:
# this is how we can get the trend values of each time step
ts.trend.head(5)

In [None]:
# this is how we can get the seasonality values of each time step
ts.seasonal.head(5)

In [None]:
# this is how we can get the residual values of each time step
ts.resid.head(5)

### Check the actual numbers coming from the decomposition:

In [None]:
# example with multiplicative decomposition

result = seasonal_decompose(ele_df['value'][:36], model='multiplicative', period=12)
decomposed_series = pd.concat([result.observed, result.seasonal, result.trend, result.resid], axis=1)
decomposed_series['res_mult'] = decomposed_series.seasonal * decomposed_series.trend * decomposed_series.resid
decomposed_series.head(20)

In [None]:
# example with additive decomposition

result = seasonal_decompose(ele_df['value'][:36], model='additive', period=12)
decomposed_series = pd.concat([result.observed, result.seasonal, result.trend, result.resid], axis=1)
# adding the three components back together should reproduce the observed values
decomposed_series['res_add'] = decomposed_series.seasonal + decomposed_series.trend + decomposed_series.resid
decomposed_series.head(20)

# 3. Stationarity

### Is the data stationary?  🤔

A time series is stationary when its statistical properties (mean, variance, autocorrelation) do not change over time, so it has no `trend` or `seasonality` and shows no predictable long-term patterns. Stationarity is an important assumption in models such as ARIMA.
A common way to check it is the **Augmented Dickey-Fuller (ADF) test**, whose null hypothesis is that the series is non-stationary (it has a unit root).

**Example:** Check if the wine data is stationary. If it is not, 
apply differencing and see what happens.

In [None]:
# check if data is stationary
adf_test(wine.wine_sales)

In [None]:
wine.wine_sales.plot()

# 4. Differencing 
### (will be super important in ARIMA)

The results of the ADF-test above show that the wine time series is non-stationary.  
Some methods will require that the data is stationary. We can still try to adjust it by using **differencing**.
>**Differencing** replaces each observation with the difference between it and the previous one, `y(t) - y(t-1)`. It reduces (or eliminates) trend, and seasonality can be reduced by differencing at the seasonal lag.

This can be done quickly with the function `diff()` from `statsmodels.tsa.statespace.tools`.
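
To see why differencing removes trend and seasonality, here is a small sketch with plain numpy on made-up series. The values and the season length `m = 4` are assumptions chosen so the results are easy to check by hand.

```python
import numpy as np

t = np.arange(36)
linear = 5 + 3 * t                    # pure linear trend
quadratic = 0.5 * t ** 2              # trend whose slope changes over time
seasonal = np.tile([1, 4, 2, 7], 9)   # pattern repeating every 4 steps

# First difference: y(t) - y(t-1) turns a linear trend into a constant
d1_linear = np.diff(linear)

# Second difference: differencing twice flattens a quadratic trend
d2_quadratic = np.diff(quadratic, n=2)

# Seasonal difference: y(t) - y(t-m), with m equal to the season length
m = 4
d_seasonal = seasonal[m:] - seasonal[:-m]

print(d1_linear[:5], d2_quadratic[:5], d_seasonal[:5])
```

Each round of differencing shortens the series: one value is lost per ordinary difference and `m` values per seasonal difference.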

Check the behavior of `diff()`

In [None]:
from statsmodels.tsa.statespace.tools import diff

In [None]:
test = [20, 20, 20]
diff(test)

test = [20, 40, 60]
diff(test)

test = [40, 45, 34, 32, 41, 34, 41]
diff(test)
np.mean(diff(test))

test = [40, 45, 34, 32, 41, 34, 41]
diff(test, k_diff=2)
np.mean(diff(test, k_diff=2))

In [None]:
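# diff() drops the first k_diff observations, so the first 2 rows of the new column are NaN
wine['sales_diff'] = diff(wine['wine_sales'], k_diff=2)
wine['sales_diff'].plot()
adf_test(wine['sales_diff'].dropna())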

### Apply the hypothesis test to the residuals obtained from the time series decomposition exercise above.
#### Do you expect the residuals to be `stationary` or `non-stationary`?

In [None]:
# the residuals are NaN at both ends of the series, so drop them before testing
adf_test(ts.resid.dropna())

---

# 5. Autocorrelation

Let's check the daily sales of one of the Rossmann stores (Store 1):

In [None]:
sales = pd.read_csv('datasets/rossman_train.csv')
stores = pd.read_csv('datasets/rossman_store.csv')

# join store features into the sales df
sales = pd.merge(sales, stores, on='Store', how='left')

In [None]:
store_1 = sales[sales.Store == 1]
fig = px.scatter(store_1, x="Date", y="Sales", color='DayOfWeek', width=800, height=500)
# go.Line is deprecated; go.Scatter with mode='lines' draws the connecting line
fig = fig.add_trace(go.Scatter(x=store_1['Date'], y=store_1['Sales'], mode='lines'))
fig.show()

### Zoom in to 2-3 weeks to see more details: Do you see any repeating patterns?

* The store is closed on Sundays
* Sales on Mondays tend to be higher than on the other business days

**Overall** we see a cycle of 7 days in the sales -> **seasonality**   
How does that translate into autocorrelation?

In [None]:
from statsmodels.graphics.tsaplots import plot_acf

title = 'Autocorrelation - Daily Sales'
# LAGS: set the number of time steps to consider in the calculation
lags = 50 
plot_acf(sales[sales.Store==1].Sales, title=title, lags=lags);

## Interpretation of the ACF plot

* The autocorrelation at lag 0 (the series with itself) is always 1 
* The shaded area is the 95% confidence band under the **null hypothesis** that the autocorrelation at that specific time lag is zero. Bars that extend beyond it are statistically significant.
* In the example above, the autocorrelation at lags 7, 14, 21 and so on is around 0.62. If we reject the null hypothesis for these lags, the probability of making a **Type I error** is small (<= 5%)
* We can therefore say that the daily sales of Rossmann store 1 have a strong seasonal component, with a cycle of 7 days
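
To connect these points to the numbers behind the plot, the sketch below computes the sample autocorrelation by hand on a made-up daily series with a weekly pattern. The `weekly` values, the noise level and the helper `autocorr` are assumptions for illustration. The band `1.96 / sqrt(n)` is the usual white-noise approximation; the shaded area drawn by `plot_acf` can be wider at higher lags.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    if lag == 0:
        return 1.0
    # covariance between the series and its lagged copy, scaled by the variance
    return float(np.sum(x[lag:] * x[:-lag]) / np.sum(x * x))

# Made-up daily sales: busy on day 0, closed (zero sales) on day 6
rng = np.random.default_rng(1)
weekly = np.array([6000, 4500, 4400, 4300, 4600, 4800, 0])
sales_demo = np.tile(weekly, 20) + rng.normal(0, 300, 140)

acf_values = {lag: autocorr(sales_demo, lag) for lag in (0, 1, 7, 14)}

# Approximate 95% band under the null hypothesis of zero autocorrelation
band = 1.96 / np.sqrt(len(sales_demo))
print(acf_values)
print("approximate 95% band: +/-", band)
```

Lags that are multiples of 7 line the series up with itself one or more weeks earlier, which is why a weekly cycle shows up as peaks at lags 7, 14, 21 and so on.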


# Autocorrelation in stationary data? Is it possible?

In [None]:
title = 'Autocorrelation of a stationary time series (data after differencing)'
lags = 30
# drop the NaN rows left by differencing before computing the ACF
p = plot_acf(wine['sales_diff'].dropna(), title=title, lags=lags)

---

<a href='https://www.freepik.com/vectors/business'>Business vector created by freepik - www.freepik.com</a>