## Stationarity:
### Does not exhibit a long-term trend
### Constant mean and variance through time
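
As a rough illustration of these two conditions, the sketch below compares the mean and variance of the first and second halves of two synthetic series. Both series are made up for this example (they are not the AirPassengers data), and `half_stats` is a hypothetical helper name: white noise should look stationary, while the same noise with a linear trend added should not.

```python
import numpy as np

rng = np.random.default_rng(0)                  # fixed seed so the run is repeatable
n = 200
noise = rng.normal(loc=0.0, scale=1.0, size=n)  # stationary: constant mean and variance
trended = noise + 0.5 * np.arange(n)            # non-stationary: the mean grows with time


def half_stats(series):
    """Return (mean, variance) for the first and the second half of a series."""
    first, second = np.array_split(series, 2)
    return (first.mean(), first.var()), (second.mean(), second.var())


for name, series in [("noise", noise), ("trended", trended)]:
    (m1, v1), (m2, v2) = half_stats(series)
    print(f"{name}: mean {m1:.2f} -> {m2:.2f}, variance {v1:.2f} -> {v2:.2f}")
```

A large shift in the mean between the two halves is an informal warning sign; a formal test such as ADF is still needed to draw a conclusion.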

In [2]:
import plotly.express as px
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.ar_model import AutoReg
import numpy as np

In [3]:
## Demo plotly plot for understanding

# Example data
dates = pd.date_range('2025-01-01', periods=10)
values = [10, 12, 15, 13, 14, 17, 18, 16, 19, 20]

df = pd.DataFrame({'Date': dates, 'Value': values})

# Create interactive plot
fig = px.line(df, x='Date', y='Value', title='Interactive Time Series Plot')
fig.show()


In [4]:
df = pd.read_csv('AirPassengers.csv', parse_dates=True)  # note: parse_dates=True only parses the index, so 'Month' stays a string here
df.describe()
df.head()
print(df.values)
print(df.shape)

[['1949-01' 112]
 ['1949-02' 118]
 ['1949-03' 132]
 ['1949-04' 129]
 ['1949-05' 121]
 ['1949-06' 135]
 ['1949-07' 148]
 ['1949-08' 148]
 ['1949-09' 136]
 ['1949-10' 119]
 ['1949-11' 104]
 ['1949-12' 118]
 ['1950-01' 115]
 ['1950-02' 126]
 ['1950-03' 141]
 ['1950-04' 135]
 ['1950-05' 125]
 ['1950-06' 149]
 ['1950-07' 170]
 ['1950-08' 170]
 ['1950-09' 158]
 ['1950-10' 133]
 ['1950-11' 114]
 ['1950-12' 140]
 ['1951-01' 145]
 ['1951-02' 150]
 ['1951-03' 178]
 ['1951-04' 163]
 ['1951-05' 172]
 ['1951-06' 178]
 ['1951-07' 199]
 ['1951-08' 199]
 ['1951-09' 184]
 ['1951-10' 162]
 ['1951-11' 146]
 ['1951-12' 166]
 ['1952-01' 171]
 ['1952-02' 180]
 ['1952-03' 193]
 ['1952-04' 181]
 ['1952-05' 183]
 ['1952-06' 218]
 ['1952-07' 230]
 ['1952-08' 242]
 ['1952-09' 209]
 ['1952-10' 191]
 ['1952-11' 172]
 ['1952-12' 194]
 ['1953-01' 196]
 ['1953-02' 196]
 ['1953-03' 236]
 ['1953-04' 235]
 ['1953-05' 229]
 ['1953-06' 243]
 ['1953-07' 264]
 ['1953-08' 272]
 ['1953-09' 237]
 ['1953-10' 211]
 ['1953-11' 18
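
The output above shows `Month` as plain text such as `'1949-01'`. A minimal sketch of how the column could be parsed into real dates is given below; it inlines the first three rows shown above so it runs without the CSV file, and the variable names `plain` and `parsed` are illustrative only.

```python
import io
import pandas as pd

# First rows of the data shown above, inlined so the sketch runs without the CSV file.
csv_text = "Month,#Passengers\n1949-01,112\n1949-02,118\n1949-03,132\n"

# parse_dates=True only tries to parse the index, so 'Month' stays as text here.
plain = pd.read_csv(io.StringIO(csv_text), parse_dates=True)

# Naming the column explicitly converts it to datetime64 values.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Month"])

print(plain["Month"].dtype, parsed["Month"].dtype)
```

With a datetime column, plotting libraries can treat the x-axis as time rather than as a list of labels.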

In [5]:
df.isnull().sum()

Month          0
#Passengers    0
dtype: int64

In [6]:
fig = px.line(df, x='Month', y='#Passengers', title='')
fig.show()

In [7]:
## Stationarity test
## One such method is the Augmented Dickey-Fuller (ADF) test, a unit root test.
## Its null hypothesis is that the series has a unit root, i.e. that it is non-stationary.

dftest = adfuller(df['#Passengers'])
print(dftest[1])  # p-value > 0.05: we cannot reject the null, so the series is treated as non-stationary




0.991880243437641


In [8]:
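## Making the data stationary

# Log transform to stabilise the growing variance
df['Passengers_log'] = np.log(df['#Passengers'])

# First difference of the logs to remove the trend (the first row becomes NaN)
df['final_passenger'] = df['Passengers_log'].diff()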

#fig = px.line(df, x='Month', y='#Passengers', title='')
#fig.show()
df.head()



 

     Month  #Passengers  Passengers_log  final_passenger
0  1949-01          112        4.718499              NaN
1  1949-02          118        4.770685         0.052186
2  1949-03          132        4.882802         0.112117
3  1949-04          129        4.859812        -0.022990
4  1949-05          121        4.795791        -0.064022
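
The differenced log series is the log of the month-over-month ratio, and the transform can be undone exactly. The sketch below checks both points on the first five passenger counts from the table above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First five monthly values from the table above.
passengers = pd.Series([112, 118, 132, 129, 121], dtype=float)

log_diff = np.log(passengers).diff()                   # same steps as the cell above
log_ratio = np.log(passengers / passengers.shift(1))   # equivalent form: log(y_t / y_{t-1})

# Undo the transform: cumulative sum of the differences,
# add back the first log value, then exponentiate.
recovered = np.exp(log_diff.fillna(0).cumsum() + np.log(passengers.iloc[0]))

print(log_diff.round(6).tolist())
print(recovered.round(6).tolist())
```

Keeping the first log value is what makes the inverse possible, so it is worth storing before differencing when forecasts need to be mapped back to passenger counts.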


In [9]:
fig = px.line(df, x='Month', y='final_passenger', title='')
fig.show()

## Box-Cox Transform

The Box-Cox transform maps non-normal data to data whose distribution is closer to normal.

Why do we need our time series data to resemble a normal distribution? Certain models, such as ARIMA, estimate their parameters by maximum likelihood estimation (MLE). MLE requires assuming a distribution for the data, and most packages assume the normal distribution.

We are not trying to make the whole time series "normal" across time. We are trying to make the distribution of values (the magnitudes) look normal. We are transforming the y-axis values (the data values themselves), NOT trying to destroy or shuffle the time order.

Time stays as it is. Only the values are transformed, so that their distribution (histogram) becomes more symmetric and bell-shaped.


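The Box-Cox transformation is parameterised by λ (typically a real value between -5 and 5) and transforms the time series, y, which must be strictly positive, as:

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln(y), & \lambda = 0
\end{cases}
$$

With λ=0 it is the natural log transform, but other values of λ give other transforms.

For example, λ=0.5 is (up to a shift and scale) the square root transform, λ=1 leaves the shape of the data unchanged (it only subtracts 1), and λ=3 is the cubic transform.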
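
To make the special cases concrete, here is a minimal NumPy sketch of the transform as it is usually defined: (y^λ − 1)/λ for λ ≠ 0 and ln(y) for λ = 0. The function names `boxcox_transform` and `boxcox_inverse` are hypothetical; the notebook itself relies on `scipy.stats.boxcox`.

```python
import numpy as np


def boxcox_transform(y, lam):
    """Box-Cox transform of strictly positive values y for a given lambda."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam


def boxcox_inverse(z, lam):
    """Map transformed values back to the original scale."""
    z = np.asarray(z, dtype=float)
    if lam == 0:
        return np.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


y = np.array([112, 118, 132, 129, 121])  # first five passenger counts
for lam in (0, 0.5, 1):
    print(lam, boxcox_transform(y, lam).round(3))
```

The inverse is what maps forecasts made on the transformed scale back to the original units.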

The value λ is chosen by seeing which value best approximates the transformed data to the normal distribution.
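
One common way to make "best approximates the normal distribution" precise is to maximise the log-likelihood of λ under a normal model, which is what `scipy.stats.boxcox` does when no λ is passed. The sketch below does a plain grid search over that log-likelihood using only the 1949 values, so the λ it finds will differ from the one fitted on the full data; `boxcox_loglik` is a hypothetical helper name.

```python
import numpy as np

# Passenger counts for 1949, taken from the output earlier in the notebook.
y = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118], dtype=float)


def boxcox_loglik(y, lam):
    """Log-likelihood of lambda, assuming the transformed data are normal."""
    # Use the log branch for lambda == 0 (with a small tolerance for float error).
    z = np.log(y) if abs(lam) < 1e-12 else (y ** lam - 1.0) / lam
    n = len(y)
    return (lam - 1.0) * np.log(y).sum() - 0.5 * n * np.log(z.var())


grid = np.linspace(-5, 5, 201)  # candidate lambdas in steps of 0.05
scores = np.array([boxcox_loglik(y, lam) for lam in grid])
best_lam = grid[scores.argmax()]
print(best_lam)
```

A grid search is used here only because it is easy to read; a numerical optimiser would give a finer answer.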


In [10]:
from scipy.stats import boxcox
df['P_boxcox'], lam = boxcox(df['#Passengers'])  # lam is the fitted λ

fig = px.line(df, x='Month', y='P_boxcox', title='')
fig.show()
print(lam)

0.14802261727063243


## Seasonality

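The easiest way to remove seasonality is seasonal differencing; other, more sophisticated methods also exist.

Differencing with $d(t) = y(t) - y(t-m)$, where $m$ is the seasonal period ($m = 12$ for monthly data with a yearly cycle).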
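
As a small worked example of seasonal differencing, the sketch below applies a lag of m = 12 to the 1949 and 1950 passenger counts listed earlier and confirms that `diff(periods=m)` is the same as subtracting the value one season back. Variable names are illustrative.

```python
import pandas as pd

# Passenger counts for 1949 and 1950, taken from the output earlier in the notebook.
y = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
               115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140], dtype=float)
m = 12  # seasonal period for monthly data

seasonal_diff = y.diff(periods=m)  # pandas shortcut
manual = y - y.shift(m)            # the same thing written out: y(t) - y(t-m)

# The first m entries have no value one season back, so they are NaN.
print(seasonal_diff.dropna().tolist())
```

Each remaining value is the change relative to the same month of the previous year, which is why the yearly pattern cancels out.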

In [12]:
df['P_seasonal'] = df['#Passengers'].diff(periods=12)  # y(t) - y(t-12); the first 12 rows are NaN

fig = px.line(df, x='Month', y='P_seasonal', title='')
fig.show()


In [None]:
## Check the ADF test again on the transformed series
