# Stationarity

[Link to really interesting video](https://www.youtube.com/watch?v=621MSxpYv60&list=PLKmQjl_R9bYd32uHImJxQSFZU5LPuXfQe&index=2)

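🤔 What is stationarity? A time series is stationary when its statistical properties, such as its mean and variance, do not change over time, so it does not exhibit any long-term trend or seasonality.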

## Import Libraries

In [16]:
import sympy as sym
import pandas as pd
import numpy as np
import plotly.express as px
from statsmodels.tsa.stattools import adfuller
sym.init_printing()
from IPython.display import display, Math

## Import Data

In [5]:
data = pd.read_csv("../data/airline.csv")

This is just a dataset with the number of US passengers for an airline per month.

In [6]:
data.head()

Unnamed: 0,Month,#Passengers
0,1949-01,112
1,1949-02,118
2,1949-03,132
3,1949-04,129
4,1949-05,121


# Helpers

In [7]:
def plotting(title, data, x, y, x_label, y_label):
    """ General function to plot data"""
    # Pass column names (not Series) so the `labels` mapping lines up with them
    fig = px.line(data, x=x, y=y, labels={x: x_label, y: y_label})

    fig.update_layout(template='simple_white', font=dict(size=18),
                      title_text=title, width=650,
                      title_x=0.5, height=400)
    
    fig.show()

# Viewing the Data

In [8]:
plotting(title='Airline Passengers', data=data, x='Month',
         y='#Passengers', x_label='Date', y_label='Passengers')

Notice that the average passenger volume is increasing year on year.
Also notice that the size of the yearly fluctuations is increasing.

Therefore we can say that this series is not stationary: there is a clear upward trend (the mean changes over time) and a clear increase in variance.
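A rough way to back up this visual impression is to compare rolling statistics at the start and end of a series. The sketch below uses a synthetic series (not the airline data) whose level and seasonal swings both grow over time. The 12-point window and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a monthly series with a rising level and
# seasonal swings that grow with the level (illustrative only).
t = np.arange(144)
level = 100 + 2.0 * t
seasonal = 0.1 * level * np.sin(2 * np.pi * t / 12)
series = pd.Series(level + seasonal)

# 12-point rolling mean and standard deviation.
rolling_mean = series.rolling(window=12).mean()
rolling_std = series.rolling(window=12).std()

# Compare the first complete window with the last one.
first_mean, last_mean = rolling_mean.iloc[11], rolling_mean.iloc[-1]
first_std, last_std = rolling_std.iloc[11], rolling_std.iloc[-1]
print(f"rolling mean: {first_mean:.1f} -> {last_mean:.1f}")
print(f"rolling std:  {first_std:.1f} -> {last_std:.1f}")
```

If both numbers drift upward like this, neither the mean nor the variance is constant, which is what the airline plot suggests as well.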

The way to make a time series stationary is through transformations 🤖

# Differencing Transform

The most common transformation is differencing

$$d(t) = y(t) - y(t-1)$$

where $d(t)$ is the difference between the value of the series at time $t$, $y(t)$, and the value one step earlier, $y(t-1)$

Basically, just take two adjacent points and subtract the earlier one from the later one
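To make the formula concrete, here is a small sketch using the first five passenger counts from the table above. It writes the difference out by hand with a one-step shift and compares it with pandas' built-in `diff()`. The variable names are just for illustration.

```python
import numpy as np
import pandas as pd

# First five monthly values from the table above.
y = pd.Series([112, 118, 132, 129, 121])

# d(t) = y(t) - y(t-1), written out with an explicit one-step shift.
manual_diff = y - y.shift(1)

# pandas' built-in first difference should give the same result.
builtin_diff = y.diff()

# The first entry is NaN because there is no y(t-1) for the first point.
print(manual_diff.tolist())
print(manual_diff.equals(builtin_diff))
```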

In [27]:
data["passenger_diff"] = data["#Passengers"].diff()

In [28]:
data.head()

Unnamed: 0,Month,#Passengers,passenger_diff
0,1949-01,112,
1,1949-02,118,6.0
2,1949-03,132,14.0
3,1949-04,129,-3.0
4,1949-05,121,-8.0


In [29]:
plotting(title="Airline Passengers", data=data, x='Month', y='passenger_diff',
         x_label='Date', y_label='Passenger<br>Difference Transform')

Now the mean is hovering around zero and we have eliminated the trend over time.

As you might have noticed, the variance is still growing over time. We have stabilized the mean, but now it's time to stabilize the variance.

# Logarithm Transform

In [31]:
# np.log is the natural log function
data['passenger_log'] = np.log(data['#Passengers'])

In [32]:
data.head()

Unnamed: 0,Month,#Passengers,passenger_diff,passenger_log
0,1949-01,112,,4.718499
1,1949-02,118,6.0,4.770685
2,1949-03,132,14.0,4.882802
3,1949-04,129,-3.0,4.859812
4,1949-05,121,-8.0,4.795791


In [33]:
plotting(title='Airline Passengers', data=data, x='Month',
         y='passenger_log', x_label='Date', y_label='Passenger<br>Log Transform')

Now we can see that the fluctuations are roughly consistent over time, but there is still a trend.

So we need to apply the difference transform as well, this time to the log-transformed series

# Logarithm and Difference Transform

In [34]:
data['passenger_diff_log'] = data['passenger_log'].diff()

In [35]:
data.head()

Unnamed: 0,Month,#Passengers,passenger_diff,passenger_log,passenger_diff_log
0,1949-01,112,,4.718499,
1,1949-02,118,6.0,4.770685,0.052186
2,1949-03,132,14.0,4.882802,0.112117
3,1949-04,129,-3.0,4.859812,-0.02299
4,1949-05,121,-8.0,4.795791,-0.064022


In [36]:
plotting(title='Airline Passengers', data=data, x='Month',
         y='passenger_diff_log', x_label='Date', y_label='Passenger<br>Log and Difference')

This looks pretty stationary. The fluctuations are similar in size throughout, and the values hover around a constant mean.
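A handy way to read the log-and-difference series: the difference of two logs is the log of their ratio, so each value is approximately the month-on-month percentage change. The transform can also be undone with a cumulative sum followed by `exp`. The sketch below checks both on the first five passenger counts. The variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# First five monthly values from the airline table.
y = pd.Series([112, 118, 132, 129, 121], dtype=float)

# Log then difference: log y(t) - log y(t-1) = log(y(t) / y(t-1)).
log_diff = np.log(y).diff()

# For small changes this is close to the relative (percentage) change.
pct_change = y.pct_change()

# Undo the transform: cumulative sum of the log differences,
# anchored at the log of the first observation, then exponentiate.
recovered = np.exp(np.log(y.iloc[0]) + log_diff.fillna(0).cumsum())

print(pd.DataFrame({"log_diff": log_diff, "pct_change": pct_change, "recovered": recovered}))
```

Being able to invert the transform matters in practice, because a forecast made on the transformed scale has to be mapped back to passenger counts.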

# Stationary Test

This is a 🧠 quantitative way to determine whether the data is indeed stationary

This is the Augmented Dickey-Fuller (ADF) test. It is a statistical hypothesis test where the null hypothesis is that the series is non-stationary (it has a unit root)
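For reference, a sketch of what the test fits, assuming a constant term only (which, to my knowledge, matches the default `regression='c'` setting of `adfuller`). The symbols $\alpha$, $\gamma$, $\delta_i$, $p$ and $\varepsilon_t$ are notation introduced here, not taken from the notebook:

$$\Delta y_t = \alpha + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t$$

The null hypothesis is $\gamma = 0$ (unit root, non-stationary) and the alternative is $\gamma < 0$. The ADF statistic is the t-statistic of $\gamma$, so more negative values are stronger evidence against non-stationarity.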

In [37]:
def adf_test(series):
    """Use an ADF test to check whether a series is stationary"""
    test_results = adfuller(series)
    print('ADF Statistic: ', test_results[0])
    print('P-Value: ', test_results[1])
    print('Critical Values: ')
    # test_results[4] maps significance levels ('1%', '5%', '10%') to critical values
    for level, critical_value in test_results[4].items():
        print('\t%s: %.2f' % (level, critical_value))

        

In [38]:
adf_test(data["passenger_diff_log"][1:])

ADF Statistic:  -2.717130598388133
P-Value:  0.07112054815085875
Critical Values: 
	1%: -3.48
	5%: -2.88
	10%: -2.58


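If our ADF statistic were below -3.48, we could reject the null hypothesis of non-stationarity at the 1% significance level

Our statistic is about -2.72, which is below the 10% critical value (-2.58) but not below the 5% critical value (-2.88), and the p-value is about 0.07. So we can reject non-stationarity at the 10% level but not at the 5% level: the evidence that the data is stationary is suggestive rather than conclusive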
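The comparison against the critical values can be sketched as a small helper. This is a hypothetical function written for this note (it is not part of statsmodels), and it only uses the numbers printed in the output above.

```python
def strictest_rejection_level(adf_statistic, critical_values):
    """Return the strictest significance level at which the null hypothesis
    of non-stationarity is rejected, or None if it is not rejected at all."""
    # Go from the strictest level (1%) to the loosest (10%).
    for level in sorted(critical_values, key=lambda k: float(k.rstrip('%'))):
        # Reject when the statistic is more negative than the critical value.
        if adf_statistic < critical_values[level]:
            return level
    return None


# Values copied from the ADF output above.
critical_values = {'1%': -3.48, '5%': -2.88, '10%': -2.58}
print(strictest_rejection_level(-2.717130598388133, critical_values))  # prints 10%
```

A statistic of -4.0 would return `'1%'`, while a statistic of -1.0 would return `None`, meaning non-stationarity cannot be rejected at any of the listed levels.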

# Conclusions

If we don't make our time series stationary, then its mean and variance change through time

Because they change over time, every data point effectively has its own mean and variance.

Because of this, the data points all belong to different distributions

Most models assume the data comes from a single, stable distribution (often a normal one).

⁉️ Once the series is stationary, most models can fit their parameters assuming there is some sort of universal distribution across all the data points, and therefore forecast better