# Granger-Causality Test

## The Data

In [None]:
import pandas as pd

# Load the news sentiment, price change, and search interest data
df_news = pd.read_csv('/content/google_news.csv')
df_price = pd.read_csv('/content/price_change.csv')
df_search = pd.read_csv('/content/search_interest.csv')

# Parse the date columns so the three frames share a datetime key
df_news['date'] = pd.to_datetime(df_news['date'])
df_price['date'] = pd.to_datetime(df_price['date'])
df_search['date'] = pd.to_datetime(df_search['date'])

# Left-join price change and search interest onto the news sentiment rows
df = pd.merge(df_news[['date', 'intro Polarity','title Polarity']], df_price[['date', 'Meta-Tech']], on='date', how='left').rename(columns = {'Meta-Tech':'price_change'})
df = df.merge(df_search[['date', 'total']], on='date', how='left').rename(columns={'total':'search_interest'})

# Index by date, drop rows with missing values, and fix the column order
df = df.set_index('date').rename_axis('series', axis=1).dropna().sort_values(by=['date'])
df = df[['intro Polarity', 'title Polarity', 'search_interest', 'price_change']]
df.head()

date,intro Polarity,title Polarity,search_interest,price_change
2021-11-16,0.098788,0.0,7.272727,3.498496
2021-11-16,0.098788,0.0,7.272727,3.498496
2021-11-16,0.8,0.0,7.272727,3.498496
2021-11-17,0.0,0.0,6.666667,0.961152
2021-11-18,0.0,0.0,7.272727,2.161745


In [None]:
# Average the rows that share a date so that each day appears once
df = df.groupby(['date']).mean()
df.head(7)

date,intro Polarity,title Polarity,search_interest,price_change
2021-11-16,0.332525,0.0,7.272727,3.498496
2021-11-17,0.0,0.0,6.666667,0.961152
2021-11-18,0.033333,0.006667,7.272727,2.161745
2021-11-23,0.011111,-0.133333,5.454545,1.953025
2021-11-24,0.0,0.0,6.060606,1.528432
2021-11-30,-0.019444,0.0,6.666667,7.196204
2021-12-01,0.054545,0.0,7.272727,7.375129


In [None]:
df.shape

(42, 4)

## Visualize the Time Series

In [None]:
import plotly.express as px

fig = px.line(df, facet_col="series", facet_col_wrap=1)
fig.update_yaxes(matches=None)
fig.show()

In [None]:
fig = px.area(df, facet_col='series', facet_col_wrap=1)
fig.update_yaxes(matches=None)
fig.show()

## ADF Test for Stationarity

The Augmented Dickey-Fuller (ADF) test is one of the most widely used unit-root tests. It helps us decide whether a time series is stationary.

* Null hypothesis: the time series has a unit root, i.e. it is non-stationary. Failing to reject it suggests the series is not stationary.

* Alternative hypothesis: the time series is stationary. Rejecting the null hypothesis supports this.

In [None]:
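# Hold out the last 8 observations; the stationarity tests and the VAR model use df_train
n_obs = 8
df_train, df_test = df[0:-n_obs], df[-n_obs:]

from statsmodels.tsa.stattools import adfuller

def adf_test(series):
    """Run the ADF test on one series and print the statistic, p-value, and critical values."""
    result = adfuller(series.values)
    print('ADF Statistics: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value))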

print('ADF Test: intro Polarity')
adf_test(df_train['intro Polarity'])
print('ADF Test: title Polarity')
adf_test(df_train['title Polarity'])
print('ADF Test: search_interest')
adf_test(df_train['search_interest'])
print('ADF Test: price_change')
adf_test(df_train['price_change'])

ADF Test: intro Polarity
ADF Statistics: -3.229607
p-value: 0.018334
Critical values:
	1%: -3.670
	5%: -2.964
	10%: -2.621
ADF Test: title Polarity
ADF Statistics: -6.037265
p-value: 0.000000
Critical values:
	1%: -3.646
	5%: -2.954
	10%: -2.616
ADF Test: search_interest
ADF Statistics: -3.209647
p-value: 0.019433
Critical values:
	1%: -3.646
	5%: -2.954
	10%: -2.616
ADF Test: price_change
ADF Statistics: -5.659447
p-value: 0.000001
Critical values:
	1%: -3.646
	5%: -2.954
	10%: -2.616


All four p-values are below the 0.05 significance level, so we can reject the null hypothesis of a unit root for each series. All four time series are therefore treated as stationary.
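
The same decision can be read from the test statistic and the critical values printed above: the null hypothesis is rejected at a given level when the statistic is more negative than that level's critical value. A minimal sketch of that comparison, using the intro Polarity figures from the output; the helper name `adf_rejections` is hypothetical and not part of statsmodels:

```python
def adf_rejections(statistic, critical_values):
    """Map each significance level to True if the unit-root null is rejected.

    The ADF test is left-tailed: reject when the statistic is below
    (more negative than) the critical value.
    """
    return {level: statistic < crit for level, crit in critical_values.items()}

# Figures copied from the ADF output for intro Polarity above.
intro_stat = -3.229607
intro_crit = {'1%': -3.670, '5%': -2.964, '10%': -2.621}

rejections = adf_rejections(intro_stat, intro_crit)
print(rejections)
```

For intro Polarity the rejection holds at the 5% and 10% levels but not at 1%, which is in line with its p-value of about 0.018.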

## KPSS Test for Stationarity

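The KPSS test checks whether a time series is stationary around a mean (level stationary) or around a linear trend (trend stationary), as opposed to non-stationary due to a unit root. Its hypotheses are reversed relative to the ADF test. The code below calls `kpss` with its default settings, which in statsmodels test stationarity around a constant mean (`regression='c'`); trend stationarity is tested with `regression='ct'`.

Null hypothesis: The time series is stationary (around a mean by default, or around a linear trend with `regression='ct'`)

Alternative hypothesis: The time series is not stationary (it has a unit root)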

In [None]:
from statsmodels.tsa.stattools import kpss

def kpss_test(df):    
    statistic, p_value, n_lags, critical_values = kpss(df.values)
    
    print(f'KPSS Statistic: {statistic}')
    print(f'p-value: {p_value}')
    print(f'num lags: {n_lags}')
    print('Critical Values:')
    for key, value in critical_values.items():
        print(f'   {key} : {value}')

print('KPSS Test: intro Polarity')
kpss_test(df_train['intro Polarity'])
print('KPSS Test: title Polarity')
kpss_test(df_train['title Polarity'])
print('KPSS Test: search_interest')
kpss_test(df_train['search_interest'])
print('KPSS Test: price_change')
kpss_test(df_train['price_change'])

KPSS Test: intro Polarity
KPSS Statistic: 0.25672223680817335
p-value: 0.1
num lags: 10
Critical Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
KPSS Test: title Polarity
KPSS Statistic: 0.13528910525233992
p-value: 0.1
num lags: 10
Critical Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
KPSS Test: search_interest
KPSS Statistic: 0.3839095696891949
p-value: 0.08409070272017463
num lags: 10
Critical Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739
KPSS Test: price_change
KPSS Statistic: 0.1490986751350093
p-value: 0.1
num lags: 10
Critical Values:
   10% : 0.347
   5% : 0.463
   2.5% : 0.574
   1% : 0.739





p-value is greater than the indicated p-value





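The p-values are all greater than the 0.05 significance level, so we cannot reject the null hypothesis: the test gives no evidence against stationarity for any of the four time series. A reported p-value of 0.1 is the upper bound of the test's lookup table, so the actual p-value for those series is at least 0.1.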
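
Every series above reports `num lags: 10`. That figure is consistent with the rule of thumb `ceil(12 * (n / 100) ** 0.25)`, which statsmodels calls the `'legacy'` lag choice. This is an assumption about how the value was produced, since the notebook does not set `nlags` itself. A sketch of the arithmetic for the 34 training observations (42 rows minus the 8 held out); the function name is hypothetical:

```python
import math

def legacy_kpss_lags(n_obs):
    """Rule-of-thumb truncation lag: ceil(12 * (n / 100) ** (1 / 4)), capped at n - 1."""
    lags = math.ceil(12.0 * (n_obs / 100.0) ** 0.25)
    return min(lags, n_obs - 1)

n_train = 42 - 8  # rows in df minus the held-out n_obs
print(legacy_kpss_lags(n_train))
```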

Cross-checking the ADF and KPSS results, both tests agree, so we conclude that all four time series are stationary. There is no need to difference them before modeling.
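
The cross-check combines two tests with opposite null hypotheses. A minimal sketch of that decision rule, fed with the p-values printed above; the function name and the verdict labels are illustrative and not from statsmodels:

```python
def stationarity_verdict(adf_p, kpss_p, alpha=0.05):
    """Combine ADF (null: unit root) and KPSS (null: stationary) p-values."""
    adf_rejects_unit_root = adf_p < alpha
    kpss_rejects_stationarity = kpss_p < alpha
    if adf_rejects_unit_root and not kpss_rejects_stationarity:
        return 'stationary'
    if not adf_rejects_unit_root and kpss_rejects_stationarity:
        return 'non-stationary'
    return 'inconclusive'  # the two tests disagree

# (ADF p-value, KPSS p-value) pairs copied from the outputs above.
p_values = {
    'intro Polarity': (0.018334, 0.1),
    'title Polarity': (0.000000, 0.1),
    'search_interest': (0.019433, 0.08409070272017463),
    'price_change': (0.000001, 0.1),
}

verdicts = {name: stationarity_verdict(*ps) for name, ps in p_values.items()}
for name, verdict in verdicts.items():
    print(f'{name}: {verdict}')
```

When the two tests disagree, the sketch returns `'inconclusive'` rather than picking a side; that is a design choice of this example, not a rule taken from the notebook.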

## VAR Model

The VAR class assumes that the passed time series are stationary. Non-stationary or trending data can often be transformed to be stationary by first-differencing or some other method.
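
First-differencing replaces each observation with its change from the previous one. A small pure-Python sketch on a made-up trending series (the numbers are illustrative and not from this data set); on a pandas DataFrame the equivalent would be `df.diff().dropna()`:

```python
def first_difference(values):
    """Return the changes x[t] - x[t-1]; the result is one observation shorter."""
    return [curr - prev for prev, curr in zip(values, values[1:])]

# Hypothetical series with an accelerating upward trend.
trending = [10.0, 12.0, 15.0, 19.0, 24.0]

diff1 = first_difference(trending)  # still trending upward
diff2 = first_difference(diff1)     # differencing again removes the remaining trend
print(diff1)  # [2.0, 3.0, 4.0, 5.0]
print(diff2)  # [1.0, 1.0, 1.0]
```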

In [None]:
from statsmodels.tsa.api import VAR

model = VAR(df_train)
for i in [1,2,3]:
    result = model.fit(i)
    print('Lag Order =', i)
    print('AIC : ', result.aic)
    print('BIC : ', result.bic)
    print('FPE : ', result.fpe)
    print('HQIC: ', result.hqic, '\n')

Lag Order = 1
AIC :  -9.438249791314695
BIC :  -8.53127551163804
FPE :  8.037202378294534e-05
HQIC:  -9.133080541931639 

Lag Order = 2
AIC :  -9.203485810357153
BIC :  -7.55453291970746
FPE :  0.00010716140445152239
HQIC:  -8.656904578689172 

Lag Order = 3
AIC :  -8.969095355746681
BIC :  -6.563697464352241
FPE :  0.00015867339920518645
HQIC:  -8.184995620918972 




A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.

Lag order 1 gives the lowest value of all four criteria (AIC, BIC, FPE, HQIC), so a VAR(1) model is fitted below.



In [None]:
results = model.fit(maxlags=1, ic='aic')
results.summary()

  Summary of Regression Results   
Model:                         VAR
Method:                        OLS
Date:           Thu, 28, Jul, 2022
Time:                     18:02:57
--------------------------------------------------------------------
No. of Equations:         4.00000    BIC:                   -8.53128
Nobs:                     33.0000    HQIC:                  -9.13308
Log likelihood:          -11.5688    FPE:                8.03720e-05
AIC:                     -9.43825    Det(Omega_mle):     4.57116e-05
--------------------------------------------------------------------
Results for equation intro Polarity
                        coefficient       std. error           t-stat            prob
-------------------------------------------------------------------------------------
const                     -0.182200         0.079974           -2.278           0.023
L1.intro Polarity         -0.044290         0.138657           -0.319           0.749
L1.title Polarity         -0.03

Among the residual correlations reported in the model summary, the largest is 0.34 (between search_interest and price_change).

## Durbin-Watson Statistic

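The Durbin-Watson statistic measures first-order autocorrelation in the residuals of a regression. It ranges from 0 to 4, with values near 0 indicating positive autocorrelation and values near 4 indicating negative autocorrelation.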

In [None]:
from statsmodels.stats.stattools import durbin_watson

out = durbin_watson(results.resid)

for col, val in zip(df.columns, out):
    print(col, ':', round(val, 2))

intro Polarity : 1.72
title Polarity : 1.97
search_interest : 2.26
price_change : 1.95


A value of 2 or close to 2 indicates no first-order autocorrelation, and 1.50 to 2.50 is a commonly used acceptable range. All four statistics fall inside that range, so we find no evidence of first-order autocorrelation in the residuals.
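
For reference, the statistic is the sum of squared successive differences of the residuals divided by their sum of squares. A pure-Python sketch of that formula on two made-up residual sequences (illustrative values, not the VAR residuals):

```python
def durbin_watson_stat(residuals):
    """DW = sum((e[t] - e[t-1])**2) / sum(e[t]**2)."""
    numerator = sum((curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Residuals that flip sign every step: strong negative autocorrelation.
alternating = [1.0, -1.0, 1.0, -1.0]
# Residuals that never change: strong positive autocorrelation.
persistent = [1.0, 1.0, 1.0, 1.0]

print(durbin_watson_stat(alternating))  # 3.0
print(durbin_watson_stat(persistent))   # 0.0
```

The values reported for the four equations lie between these two extremes and close to the midpoint of 2.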

## Granger Causality Test

In [None]:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

maxlag = 1
test = 'ssr_chi2test'

def grangers_causation_matrix(data, variables, test='ssr_chi2test', verbose=False):
    """Return a matrix of Granger causality p-values.

    Rows are the responses (y) and columns are the predictors (x). Each cell
    holds the smallest p-value over lags 1..maxlag, where maxlag is the
    module-level setting above.
    """
    df = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
    for c in df.columns:
        for r in df.index:
            # Tests whether the second column (c) Granger causes the first column (r)
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1],4) for i in range(maxlag)]
            if verbose: print(f'Y = {r}, X = {c}, P Values = {p_values}')
            min_p_value = np.min(p_values)
            df.loc[r, c] = min_p_value
    df.columns = [var + '_x' for var in variables]
    df.index = [var + '_y' for var in variables]
    return df

grangers_causation_matrix(df, variables = df.columns)

Unnamed: 0,intro Polarity_x,title Polarity_x,search_interest_x,price_change_x
intro Polarity_y,1.0,0.776,0.5837,0.1826
title Polarity_y,0.433,1.0,0.1022,0.8843
search_interest_y,0.0432,0.4049,1.0,0.3522
price_change_y,0.0314,0.9663,0.0264,1.0


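The rows are the responses (y) and the columns are the predictors (x). Each cell holds the p-value for the null hypothesis that x does not Granger cause y, so a p-value below the 0.05 significance level lets us reject that null hypothesis. For example, the value 0.0314 in (row 4, column 1) leads us to conclude that intro Polarity_x Granger causes price_change_y. Likewise, the 0.0264 in (row 4, column 3) indicates that search_interest_x Granger causes price_change_y. The 0.0432 in (row 3, column 1) is also below 0.05, which suggests that intro Polarity_x Granger causes search_interest_y.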
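
To make the test concrete, the lag-1 `ssr_chi2test` can be reproduced by hand: fit y on its own lag, fit it again with the lag of x added, and compare the residual sums of squares. The sketch below does this with NumPy on synthetic data (both series and all function names are made up for illustration). It is meant to mirror the idea behind the statistic under these assumptions, not to replace `grangercausalitytests`:

```python
import math
import numpy as np

def ssr(design, target):
    """Sum of squared residuals from an OLS fit of target on design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)

def granger_chi2_lag1(y, x):
    """Chi-square statistic and p-value for H0: x[t-1] does not help predict y[t]."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    target = y[1:]
    const = np.ones_like(target)
    restricted = np.column_stack([const, y[:-1]])            # own lag only
    unrestricted = np.column_stack([const, y[:-1], x[:-1]])  # own lag plus lag of x
    ssr_r = ssr(restricted, target)
    ssr_u = ssr(unrestricted, target)
    # Guard against a tiny negative value caused by floating-point rounding.
    stat = max(len(target) * (ssr_r - ssr_u) / ssr_u, 0.0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Synthetic example: y depends on yesterday's x, but x does not depend on y.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = np.zeros(n)
y[1:] = 0.8 * x[:-1] + 0.5 * rng.normal(size=n - 1)

stat_xy, p_xy = granger_chi2_lag1(y, x)  # does x Granger cause y?
stat_yx, p_yx = granger_chi2_lag1(x, y)  # does y Granger cause x?
print(f'x -> y: stat={stat_xy:.2f}, p={p_xy:.4g}')
print(f'y -> x: stat={stat_yx:.2f}, p={p_yx:.4g}')
```

The `erfc` shortcut for the p-value only applies with one lag (one degree of freedom); more lags would need a general chi-square survival function.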

**Conclusion**:
* intro Polarity_x Granger causes price_change_y

* search_interest_x Granger causes price_change_y