# Statistics of Time Series 
<hr style="border:2px solid black">

## 1. Stationarity

- special and often desired property of a time series  
- underlying assumption of various time-series models
- may be violated, for example, by the existence of a *unit root*

**Load Packages**

In [None]:
# data analysis stack
import numpy as np
import pandas as pd

# data visualization stack
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

# statistics stack
from statsmodels.stats.diagnostic import het_white
from statsmodels.formula.api import ols
from statsmodels.tools.sm_exceptions import InterpolationWarning

# miscellaneous
import warnings
warnings.simplefilter('ignore')

### 1.1 (Weak) Stationarity

>- statistical properties of series do not depend on the timestep 
>- expectation, variance and autocorrelations do not change over time:
>
>$$
E(y_t)=E(y)=\mu,
\quad V(y_t)=V(y)=\sigma^2,
\quad corr(y_t,y_{t-h})=c_h
$$
>
>- in the long run, a stationary time series is not predictable: long-horizon forecasts revert towards the mean $\mu$
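
The conditions above can be checked numerically on a simulated series. The following is a minimal sketch, assuming standard-normal noise and a long sample (both choices are illustrative, not taken from the example below): for $y_t = c + \phi\,y_{t-1} + \epsilon_t$ with $|\phi|<1$, the mean is $c/(1-\phi)$ and the variance is $\sigma_\epsilon^2/(1-\phi^2)$, and both should be about the same in each half of the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate y_t = 0.1 + 0.1 * y_{t-1} + eps_t with standard-normal noise (assumed)
n = 20_000
eps = rng.standard_normal(n)
y = np.empty(n)
y[0] = 0.1 / (1 - 0.1)              # start at the theoretical mean c / (1 - phi)
for t in range(1, n):
    y[t] = 0.1 + 0.1 * y[t - 1] + eps[t]

first, second = y[: n // 2], y[n // 2:]

# mean and variance should roughly agree across the two halves
print("means:    ", first.mean(), second.mean())
print("variances:", first.var(), second.var())
print("theory:   ", 0.1 / (1 - 0.1), 1 / (1 - 0.1**2))
```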

**Example:**
> $y_t = 0.1 + 0.1\,y_{t-1} + \epsilon_t$

In [None]:
def white_noise(number_of_terms):
    np.random.seed(0)
    noise_terms = np.random.randn(number_of_terms)
    return noise_terms

In [None]:
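noise = 0.1 * white_noise(200)
y = 0.1
weak_stationary_data = []

for t in range(200):
    # feed the noise into the recursion so the series follows the AR(1) equation
    # (list += array would append the noise instead of adding it elementwise)
    y = 0.1 + 0.1 * y + noise[t]
    weak_stationary_data.append(y)

weak_stationary_data = pd.Series(weak_stationary_data)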

In [None]:
mpl.rc('figure',figsize=(12,3),dpi=200)
weak_stationary_data.plot();

### 1.2 Difference Stationarity

- series becomes stationary after differencing (once or repeatedly)

**Example:**
> $y_t = 2.0 + 1.0\,y_{t-1} + \epsilon_t$

In [None]:
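noise = 10 * white_noise(200)
y = 0.0
difference_stationary_data = []

for t in range(200):
    # random walk with drift: the noise accumulates through the recursion
    # (list += array would append the noise instead of adding it elementwise)
    y = 2.0 + 1.0 * y + noise[t]
    difference_stationary_data.append(y)

difference_stationary_data = pd.Series(difference_stationary_data)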

In [None]:
mpl.rc('figure',figsize=(12,3),dpi=200)
difference_stationary_data.plot();
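
As a minimal sketch of what "stationary after differencing" means for the example equation: the series $y_t = 2.0 + y_{t-1} + \epsilon_t$ is a cumulative sum of $2.0 + \epsilon_t$, so its first difference is just drift plus noise. The sample size and random generator below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

drift = 2.0
eps = 10 * rng.standard_normal(5_000)

# y_t = 2.0 + y_{t-1} + eps_t is the cumulative sum of (drift + eps_t)
y = np.cumsum(drift + eps)

# first difference: y_t - y_{t-1} = 2.0 + eps_t
dy = np.diff(y)

print("mean of differences:", dy.mean())   # close to the drift
print("std of differences: ", dy.std())    # close to the noise scale
```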

### 1.3 Trend Stationarity

- series is stationary after removing a deterministic trend

**Example:**
> $y_t = 1.0 - e^{-0.01t} + \epsilon_t$

In [None]:
# build the series with NumPy arrays: list += array would append the noise, not add it
t = np.arange(200)
trend_stationary_data = 1.0 - np.exp(-0.01 * t) + 0.05 * white_noise(200)
trend_stationary_data = pd.Series(trend_stationary_data)

In [None]:
mpl.rc('figure',figsize=(12,3),dpi=200)
trend_stationary_data.plot();
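
A minimal sketch of detrending, assuming the trend is unknown and approximated by a quadratic polynomial (the degree is an illustrative choice): subtracting the fitted trend leaves residuals that fluctuate around zero with roughly the noise scale.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
trend = 1.0 - np.exp(-0.01 * t)
y = trend + 0.05 * rng.standard_normal(t.size)

# approximate the trend with a low-order polynomial (degree 2 is an assumption)
coefs = np.polyfit(t, y, deg=2)
detrended = y - np.polyval(coefs, t)

print("std before detrending:", y.std())
print("std after detrending: ", detrended.std())
```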

### 1.4 Seasonal Stationarity

- series is stationary after removing a deterministic seasonal component

**Example:**
> $y_t = 0.1\,\sin(0.1\pi t) + \epsilon_t$

In [None]:
# build the series with NumPy arrays: list += array would append the noise, not add it
t = np.arange(200)
seasonal_stationary_data = 0.1 * np.sin(0.1 * np.pi * t) + 0.02 * white_noise(200)
seasonal_stationary_data = pd.Series(seasonal_stationary_data)

In [None]:
mpl.rc('figure',figsize=(12,3),dpi=200)
seasonal_stationary_data.plot();
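
One way to remove a seasonal component is seasonal differencing, $y_t - y_{t-m}$, where $m$ is the period. The sketch below assumes the period is known: $\sin(0.1\pi t)$ repeats every $2\pi/(0.1\pi) = 20$ steps, so the deterministic part cancels and only noise remains.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
period = 20  # sin(0.1 * pi * t) repeats every 20 steps

seasonal = 0.1 * np.sin(0.1 * np.pi * t)
y = seasonal + 0.02 * rng.standard_normal(t.size)

# seasonal difference: y_t - y_{t-period}
dy = y[period:] - y[:-period]

# the deterministic part cancels (up to floating-point error)
residual_season = seasonal[period:] - seasonal[:-period]
print("largest leftover seasonal term:", np.abs(residual_season).max())
print("std of seasonal differences:   ", dy.std())
```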

### 1.5 Heteroscedasticity

- series has non-constant variance (non-stationary)

**Example:**
> $y_t = 2.0 + 100\, t^2\,\epsilon_t$

In [None]:
t = np.arange(200)
heteroscedastic_data = 2.0 + 100 * t**2 * white_noise(200)
heteroscedastic_data = pd.Series(heteroscedastic_data)

In [None]:
mpl.rc('figure',figsize=(12,3),dpi=200)
heteroscedastic_data.plot();
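
A quick, informal way to see non-constant variance is to compare the spread in consecutive windows. The sketch below regenerates the example with its own random generator and splits it into four windows of 50 points (the window count is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
y = 2.0 + 100 * t**2 * rng.standard_normal(t.size)

# standard deviation in four consecutive windows of 50 points each
window_std = [w.std() for w in np.split(y, 4)]

for i, s in enumerate(window_std):
    print(f"window {i}: std = {s:.3e}")
```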

<hr style="border:2px solid black">

## 2. Stationarity Tests

### 2.1 Hypothesis Testing

- statistical analysis that uses sample data to assess two mutually exclusive theories about a population
- computes a sample statistic and factors in estimates of sampling error to support one of the theories

**`null hypothesis`**

>- one of the two mutually exclusive theories in hypothesis testing
>- typically, it states that there is no effect

**`alternative hypothesis`**

>- complementary theory to null hypothesis
>- typically, it states that the population parameter does not equal the null-hypothesis value

**`p-value`**

>- metric in test of hypothesis
>- probability, assuming the null hypothesis is true, of obtaining test results at least as extreme as the result actually observed
>- a small $p$-value is significant: the observed data would be unlikely if the null hypothesis were true
>- typical significance levels for the $p$-value are 0.05 (95% confidence) or 0.01 (99% confidence)
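
As a worked illustration of "at least as extreme", the sketch below computes the two-sided $p$-value of a test statistic $z$ that is standard normal under the null hypothesis: $p = P(|Z| \ge |z|) = \operatorname{erfc}(|z|/\sqrt{2})$. The function name `two_sided_p_value` is hypothetical, and the normal null distribution is an assumption; ADF and KPSS use their own, non-normal distributions.

```python
import math

def two_sided_p_value(z):
    """two-sided p-value of a test statistic z that is standard normal under the null"""
    return math.erfc(abs(z) / math.sqrt(2))

for z in (1.0, 1.96, 2.576, 3.0):
    print(f"z = {z:5.3f}  ->  p = {two_sided_p_value(z):.4f}")
```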

**guidelines for using the p-value**

>|      p-value     |evidence against null hypothesis|
>|:----------------:|:------------------------------:|
>|     $p>0.10$     |       weak or no evidence      |
>| $0.05<p\leq0.10$ |        moderate evidence       |
>| $0.01<p\leq0.05$ |         strong evidence        |
>|    $p\leq0.01$   |       very strong evidence     |


### 2.2 Tools for Stationarity Test

**Augmented Dickey-Fuller (ADF)** 

>- null hypothesis: unit root exists (non-stationary)
>- leaves room for difference stationarity and seasonality

[`adfuller`](https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html)

In [None]:
from statsmodels.tsa.stattools import adfuller

**Kwiatkowski–Phillips–Schmidt–Shin (KPSS)**

>- null hypothesis: stationary around a constant level (`regression="c"`) or around a deterministic trend (`regression="ct"`)
>- often complements ADF in stationarity test 

[`kpss`](https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.kpss.html)

In [None]:
from statsmodels.tsa.stattools import kpss

**homoscedasticity test**

In [None]:
def white_homoscedasticity_test(series):
    """
    returns p-value for White's homoscedasticity test
    """
    series = series.reset_index(drop=True).reset_index()
    series.columns = ['time', 'value']
    series['time'] += 1
    
    olsr = ols('value ~ time', series).fit()
    p_value = het_white(olsr.resid, olsr.model.exog)[1]
    
    return round(p_value,6)

**stationarity test p-values**

In [None]:
def p_values(series):
    """
    returns p-values for ADF and KPSS Tests on a time series
    """
    # p value from Augmented Dickey-Fuller (ADF) Test
    p_adf = adfuller(series, autolag="AIC")[1]
    
    # p value from Kwiatkowski–Phillips–Schmidt–Shin (KPSS) Test
    p_kpss = kpss(series, regression="c", nlags="auto")[1]
    
    return round(p_adf,6), round(p_kpss,6)

**function for stationarity test**

In [None]:
def test_stationarity(series):
    """
    returns likely conclusions about series stationarity
    """
    # test homoscedasticity
    p_white = white_homoscedasticity_test(series)
    
    if p_white < 0.05:
        print(f"\n non-stationary: heteroscedastic (White test p-value: {p_white}) \n")
    
    # test stationarity
    else:
        p_adf, p_kpss = p_values(series)
        
        # print p-values
        print( f"\n p_adf: {p_adf}, p_kpss: {p_kpss}" )
    
        if (p_adf < 0.05) and (p_kpss >= 0.05):
            print('\n stationary or seasonal-stationary')
            
        elif (p_adf >= 0.1) and (p_kpss < 0.05):
            print('\n difference-stationary')
            
        elif (p_adf < 0.1) and (p_kpss < 0.05):
            print('\n trend-stationary')
        
        else:
            print('\n non-stationary; no robust conclusions\n')

### 2.3 Test Cases

**weakly stationary series**

In [None]:
test_stationarity(weak_stationary_data)

**difference-stationary series**

In [None]:
test_stationarity(difference_stationary_data)

**trend-stationary series**

In [None]:
test_stationarity(trend_stationary_data)

**seasonal stationary series**

In [None]:
test_stationarity(seasonal_stationary_data)

**heteroscedastic series**

In [None]:
test_stationarity(heteroscedastic_data)

<hr style="border:2px solid black">

## 3. [Box-Jenkins Methods](https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc446.htm)

- methods for time-series model selection
- based on ACF and PACF plots
- underlying assumption of weak stationarity
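
To make the ACF concrete, the sketch below computes sample autocorrelations directly with NumPy, using the common estimator $r_h = \sum_t (y_t-\bar{y})(y_{t-h}-\bar{y}) / \sum_t (y_t-\bar{y})^2$. The helper name `sample_acf` is hypothetical. For white noise, values within roughly $\pm 1.96/\sqrt{n}$ are consistent with zero autocorrelation.

```python
import numpy as np

def sample_acf(x, max_lag):
    """sample autocorrelations r_h for h = 0, ..., max_lag"""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[h:], x[: len(x) - h]) / denom for h in range(max_lag + 1)]
    )

rng = np.random.default_rng(0)
noise = rng.standard_normal(200)
r = sample_acf(noise, max_lag=25)

# approximate 95% band for white noise
band = 1.96 / np.sqrt(len(noise))
print("first autocorrelations:", r[:4])
print("white-noise band: +/-", band)
```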

**(Partial) Autocorrelation Function (ACF & PACF)**

In [None]:
# function provided by statsmodels
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

In [None]:
def auto_correlation_plot(series):
    """
    plots autocorrelations for a given series
    """
    mpl.rc('figure',figsize=(10,2),dpi=200)
    plot_acf(series,zero=False,lags=25)
    plt.xlabel('number of lags')
    plt.ylabel('autocorrelation')

In [None]:
def partial_auto_correlation_plot(series):
    """
    plots partial autocorrelations for a given series
    """
    mpl.rc('figure',figsize=(10,2),dpi=200)
    plot_pacf(series,zero=False,lags=25)
    plt.xlabel('number of lags')
    plt.ylabel('partial autocorrelation')

### 3.1 White Noise

In [None]:
data = white_noise(200)

**autocorrelation plot**

In [None]:
auto_correlation_plot(data)

**partial autocorrelation plot**

In [None]:
partial_auto_correlation_plot(data)

### 3.2 Non-Stationary Data

In [None]:
data = heteroscedastic_data

**autocorrelation plot**

In [None]:
auto_correlation_plot(data)

**partial autocorrelation plot**

In [None]:
partial_auto_correlation_plot(data)

### 3.3 Seasonal Data

In [None]:
data = seasonal_stationary_data

**autocorrelation plot**

In [None]:
auto_correlation_plot(data)

**partial autocorrelation plot**

In [None]:
partial_auto_correlation_plot(data)

### 3.4 AR(p) Model

In [None]:
# import module for simulating data
from statsmodels.tsa.arima_process import ArmaProcess

def arma_model(ar_coef=(), ma_coef=()):
    """
    generates sample data for AR, MA, and ARMA processes
    """
    np.random.seed(12345)
    # ArmaProcess expects lag-polynomial coefficients including the zero lag,
    # with the AR coefficients negated
    ar = np.array([1] + [-c for c in ar_coef])
    ma = np.array([1] + list(ma_coef))
    data = ArmaProcess(ar, ma).generate_sample(nsample=200)
    return data

In [None]:
# generate sample data for the AR (1) model: 
# y_t = 0.75 y_{t-1} + epsilon
#ar_data = arma_model(ar_coef=[0.75])

# generate sample data for the AR (2) model: 
# y_t = 0.75 y_{t-1} - 0.25 y_{t-2} + epsilon
ar_data = arma_model(ar_coef=[0.75,-0.25])
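
Whether an AR(p) process is stationary can be read off the roots of its lag polynomial $1 - \phi_1 z - \dots - \phi_p z^p$: all roots must lie outside the unit circle, and a root at $z=1$ is a unit root. The sketch below checks this with NumPy for the AR(2) coefficients used here and, for contrast, for a random walk.

```python
import numpy as np

ar_coef = [0.75, -0.25]            # y_t = 0.75 y_{t-1} - 0.25 y_{t-2} + eps_t

# lag polynomial 1 - 0.75 z + 0.25 z^2, coefficients in increasing powers of z
lag_poly = np.array([1.0] + [-c for c in ar_coef])

# np.roots expects the highest power first, so reverse the coefficients
roots = np.roots(lag_poly[::-1])
print("root moduli:", np.abs(roots))              # both are 2 (up to rounding)
print("stationary: ", bool(np.all(np.abs(roots) > 1)))

# random walk y_t = y_{t-1} + eps_t has lag polynomial 1 - z
unit_root = np.roots([-1.0, 1.0])
print("random-walk root:", unit_root)             # root at z = 1
```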

**autocorrelation plot**

In [None]:
auto_correlation_plot(ar_data)

**partial autocorrelation plot**

In [None]:
partial_auto_correlation_plot(ar_data)

### 3.5 MA(q) Model

In [None]:
# MA(1) model
# ma_data = arma_model(ma_coef=[0.65])

# MA(2) model
ma_data = arma_model(ma_coef=[0.65,0.35])

**autocorrelation plot**

In [None]:
auto_correlation_plot(ma_data)

**partial autocorrelation plot**

In [None]:
partial_auto_correlation_plot(ma_data)

### 3.6 ARMA(p,q) Model

In [None]:
# ARMA(1,1) model
arma_data = arma_model(ar_coef=[0.75],ma_coef=[0.65])

**autocorrelation plot**

In [None]:
auto_correlation_plot(arma_data)

**partial autocorrelation plot**

In [None]:
partial_auto_correlation_plot(arma_data)

<hr style="border:2px solid black">

## Additional Topics

`pip install pmdarima`
- ARIMA estimators for Python

In [None]:
import pmdarima as pm

In [None]:
model = pm.auto_arima(
    arma_data, 
    start_p=0,
    max_p=2,
    seasonal=False,     # do not fit seasonal terms
    stationary=True,    # treat the series as stationary (no differencing)
    trace=True,
    n_jobs=-1
)

### `Akaike Information Criterion (AIC)`

- metric that balances goodness of fit against model complexity, guarding against both over- and underfitting
- relative comparison of models; the lower the AIC, the better

**AIC formula**

>$$
\text{AIC} = 2k - 2\log\hat{L}
$$
>
>- $k =$ number of estimated parameters in the model; penalizes overfitting
>- $\hat{L} = $ maximum value of the likelihood function for the model; penalizes underfitting
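
As a worked example of the formula, the sketch below computes the AIC by hand for polynomial fits with Gaussian errors, where the maximized log-likelihood is $\log\hat{L} = -\tfrac{n}{2}\left(\log(2\pi\hat{\sigma}^2)+1\right)$ and $\hat{\sigma}^2$ is the mean squared residual. The helper `gaussian_aic` and the toy data are assumptions made for illustration; they are not how `pmdarima` computes its values.

```python
import numpy as np

def gaussian_aic(residuals, k):
    """AIC = 2k - 2 log L for Gaussian errors; k counts all estimated parameters"""
    n = len(residuals)
    sigma2 = np.mean(residuals**2)                       # ML estimate of error variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * k - 2 * log_lik

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(x.size)    # data from a straight line

for degree in (0, 1, 5):
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    k = degree + 2                                        # coefficients + error variance
    print(f"degree {degree}: AIC = {gaussian_aic(resid, k):.1f}")
```

The constant model (degree 0) underfits and gets a much higher AIC than the straight line; the degree-5 fit lowers the residuals only slightly while paying for four extra parameters.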

<hr style="border:2px solid black">

## References

- [Forecasting: Principles and Practice](https://otexts.com/fpp3/), R. J. Hyndman & G. Athanasopoulos (OTexts, free online book)
- [Time Series Talk : Stationarity](https://www.youtube.com/watch?v=oY-j2Wof51c)
- [Detecting stationarity in time series data](https://towardsdatascience.com/detecting-stationarity-in-time-series-data-d29e0a21e638)
- [Unit Roots : Time Series Talk](https://www.youtube.com/watch?v=ugOvehrTRRw)
- [Time Series Talk : Augmented Dickey Fuller Test](https://www.youtube.com/watch?v=1opjnegd_hA)
- [Box-Jenkins Model Identification](https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc446.htm)
- [The Critical Value and the p-Value Approach to Hypothesis Testing](https://www.geo.fu-berlin.de/en/v/soga/Basics-of-statistics/Hypothesis-Tests/Introduction-to-Hypothesis-Testing/Critical-Value-and-the-p-Value-Approach/index.html)