# <center>Predictive modelling with timeseries</center>
# <center>Baselines, stationarity and decomposition</center>

![Image](images/timeseries.jpg)

In [None]:
import pandas as pd
import numpy as np
from utils import adf_test
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
from matplotlib import pyplot as plt

# jupyter lab configs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
plotly.offline.init_notebook_mode(connected=True)

## Baselines 

The simplest forecasts for a univariate series are:
1. Average
2. Naive
3. Seasonal naive
4. Moving average
    - simple (rolling window)
    - cumulative (expanding window)
    - exponential
    
![Image](images/baselines.png) *Source: Hyndman and Athanasopoulos. www.otexts.com/fpp2/*


### Load the datasets

In [None]:
# example of trend data - wine sales
wine = pd.read_csv('datasets/wine_trend.csv')

# example of seasonal data - daily temperature
temperature = pd.read_csv('datasets/temperature_seasonal.csv')
temperature.set_index('date', drop=True, inplace=True)

# load a good example for decomposition - production of electrical equipment
ele_df = pd.read_csv('datasets/elecequip.csv')

## Plot the data

# 1. Naive methods and averages

Which baseline worked best in the case of trend data? Which one was best for seasonal data?
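
One way to compare the three simplest baselines is to hold out the end of a series and score each forecast against it. The sketch below uses a made-up monthly series rather than the workshop datasets; the values, the `season_length` of 12 and the `mae` helper are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy monthly series (made-up values): two years with a repeating yearly pattern.
pattern = [10, 12, 15, 20, 26, 30, 33, 31, 25, 18, 13, 11]
y = pd.Series(pattern + [v + 2 for v in pattern], dtype=float)

# Hold out the last 6 observations as a test set.
train, test = y.iloc[:18], y.iloc[18:]
horizon = len(test)
season_length = 12  # assumed seasonal period

# 1. Average: every future value is the mean of the training data.
average_fc = np.full(horizon, train.mean())

# 2. Naive: every future value is the last observed value.
naive_fc = np.full(horizon, train.iloc[-1])

# 3. Seasonal naive: each future value repeats the value one season earlier.
last_season = train.iloc[-season_length:].to_numpy()
seasonal_naive_fc = np.array(
    [last_season[h % season_length] for h in range(horizon)]
)


def mae(actual, forecast):
    """Mean absolute error between two equally long sequences."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(forecast))))


scores = {
    "average": mae(test, average_fc),
    "naive": mae(test, naive_fc),
    "seasonal_naive": mae(test, seasonal_naive_fc),
}
print(scores)
```

On a strongly seasonal toy series like this one, the seasonal naive forecast has the lowest error; on trending data without seasonality the ranking can be different.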

# 2. Moving Average smoothing
## Understand the difference between `rolling` windows and `expanding` windows

Let's calculate a **moving average** and a **cumulative moving average** using the methods from pandas

In [None]:
ele_df['ma_5'] = ele_df.loc[:,'value'].rolling(window=5).mean()
ele_df['ma_10'] = ele_df.loc[:,'value'].rolling(window=10).mean()
ele_df['ma_exp'] = ele_df.loc[:,'value'].expanding().mean()
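
To make the difference between the window types concrete, here is a small sketch on a made-up five-value series. Note that the `ma_exp` column above holds the expanding (cumulative) mean; an exponential moving average would use `ewm()` instead. The `alpha=0.5` smoothing factor is an arbitrary choice for illustration.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

# Rolling window: mean of the last 3 values only (NaN until 3 values exist).
rolling_mean = s.rolling(window=3).mean()

# Expanding window: mean of every value seen so far (cumulative average).
expanding_mean = s.expanding().mean()

# Exponentially weighted: recent values count more than older ones.
# With adjust=False: ewm[t] = alpha * s[t] + (1 - alpha) * ewm[t-1]
ewm_mean = s.ewm(alpha=0.5, adjust=False).mean()

comparison = pd.DataFrame(
    {"value": s, "rolling_3": rolling_mean,
     "expanding": expanding_mean, "ewm_0.5": ewm_mean}
)
print(comparison)
```

The rolling mean forgets old observations completely, the expanding mean never forgets them, and the exponentially weighted mean sits in between by down-weighting them gradually.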

In [None]:
plt.figure(figsize=(16,8))
plt.plot(ele_df['ma_5'], label='Moving average - 5')
plt.plot(ele_df['ma_10'], label='Moving average - 10')
plt.plot(ele_df['ma_exp'], label='Cumulative moving average')
plt.plot(ele_df['value'], label='Original')
plt.legend(loc='best')
plt.show()

---

# Decomposition

### What is the data made of?  🤔

A time series is easier to analyse when we know how each of its components behaves.  
Typically, a time series has 3 components:  
* `S` as the seasonal component  
* `T` as the trend component
* `R` as a residual component

If these components *add up*, the decomposition is said to be *additive*.  
Thus, in **additive decomposition** we have:  
> y(t) = S(t) + T(t) + R(t)  

And in **multiplicative decomposition** we have:  
> y(t) = S(t) × T(t) × R(t)  

**Example:** Decomposition of the electrical equipment dataset 

In [None]:
ele_df['value'].plot()

**Run the additive decomposition**  

The function `seasonal_decompose()` from `statsmodels` is very helpful:  

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(ele_df['value'], model='additive', period=12) 
result.plot()

# Stationarity

### Is the data stationary?  🤔

A time series is stationary when its statistical properties (mean, variance, autocorrelation) do not depend on the time at which it is observed, so a series with `trend` or `seasonality` is not stationary. A stationary series has no predictable patterns in the long term. Stationarity is an important assumption in models such as ARIMA.
A common way to check for it is the **Augmented Dickey-Fuller test**.

**Example:** Check whether the wine data is stationary. If it is not, 
apply differencing and see what happens.

In [None]:
# check if data is stationary
adf_test(wine.wine_sales)

### Differencing (will be super important in ARIMA)

The results show the data is non-stationary.  
Some methods require stationary data. We can try to make the series stationary by **differencing** it.
>**Differencing** replaces each observation with the change from the previous one, y(t) - y(t-1). It reduces (or eliminates) trend; differencing at the seasonal lag m, y(t) - y(t-m), does the same for seasonality.

This can be done quickly with the function `diff()` from `statsmodels.tsa.statespace.tools`.

In [None]:
from statsmodels.tsa.statespace.tools import diff

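# first-order differencing; the first row has no predecessor, so it becomes NaN
wine['sales_diff'] = diff(wine['wine_sales'], k_diff=1)
wine['sales_diff'].plot()
# drop the leading NaN before running the test
adf_test(wine['sales_diff'].dropna())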
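
The same idea can be sketched with the pandas method `Series.diff()` on two made-up series, which makes it easy to see what differencing removes. The values and the seasonal period of 4 are assumptions for illustration.

```python
import pandas as pd

# Made-up series with a linear trend: each value is 3 higher than the last.
trend = pd.Series([5.0, 8.0, 11.0, 14.0, 17.0])
first_diff = trend.diff()  # y(t) - y(t-1); the first value is NaN
print(first_diff.dropna().tolist())

# Made-up series with a period-4 pattern on top of a slow upward drift.
seasonal = pd.Series(
    [1.0, 5.0, 2.0, 8.0, 2.0, 6.0, 3.0, 9.0, 3.0, 7.0, 4.0, 10.0]
)
seasonal_diff = seasonal.diff(4)  # y(t) - y(t-4); the first 4 values are NaN
print(seasonal_diff.dropna().tolist())
```

In both cases the differenced series is constant: the first difference removes the linear trend, and the lag-4 difference removes the repeating pattern.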

## Autocorrelation plots (ACF) 

They also help to show whether the data is stationary.  
>For a stationary time series, the ACF will drop to zero relatively quickly, while the ACF of non-stationary data decreases slowly.  
(Hyndman and Athanasopoulos 2018)


In [None]:
lags = 10
# trailing semicolons suppress the duplicate figure output in Jupyter
plot_acf(wine['wine_sales'], title='Autocorrelation - before diff', lags=lags);
# drop the first row, which is NaN after differencing
plot_acf(wine['sales_diff'].dropna(), title='Autocorrelation - after diff', lags=lags);
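
The contrast between a slowly decaying and a quickly decaying ACF can also be checked numerically. The sketch below uses synthetic data (white noise and its cumulative sum) and the pandas method `Series.autocorr()`, which computes a plain lagged correlation; its estimator differs slightly from the one behind `plot_acf`, so the numbers are indicative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
noise = pd.Series(rng.normal(size=1000))  # stationary white noise
walk = noise.cumsum()                     # non-stationary random walk

lags = range(1, 11)
acf_noise = [noise.autocorr(lag) for lag in lags]
acf_walk = [walk.autocorr(lag) for lag in lags]

summary = pd.DataFrame(
    {"white_noise": acf_noise, "random_walk": acf_walk}, index=list(lags)
)
summary.index.name = "lag"
print(summary.round(3))
```

The white-noise autocorrelations stay close to zero at every lag, while those of the random walk remain high across all ten lags, which is the pattern the quote above describes.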

<a href='https://www.freepik.com/vectors/business'>Business vector created by freepik - www.freepik.com</a>