## Investigation of the relationship between different stock market indexes from Developed and Emerging markets

<div style='text-align: justify;'>
The relationship between international markets is an important issue for investors and researchers. Global markets exhibit complex dynamics and 
interdependencies, forming an ecosystem in which various financial products and assets interact. Markets are classified as developed
or emerging according to their economic stability, political risk, financial system maturity, and regulatory framework. Generally, emerging
markets tend to have higher systematic risk and higher growth expectations. To reduce diversifiable risk and earn higher returns, portfolio managers 
add stocks from emerging markets to portfolios that contain financial products from developed markets. However, financial liberalization and global 
trade agreements have made financial markets more integrated. Consequently, information transmission among international stock markets can 
cause financial contagion that eliminates potential diversification opportunities.
</div>

In [3]:
# Libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.vector_ar.var_model import *
from statsmodels.stats.stattools import durbin_watson
from statsmodels.graphics.tsaplots import plot_acf

In [4]:
def import_data(path):
    # The data were downloaded from investment.com
    # Load the market index price series
    df = pd.read_csv(path)

    # Change column name
    df.rename(columns={'Unnamed: 0': 'Date'}, inplace=True)

    # Convert string to date time
    df['Date'] = pd.to_datetime(df['Date'])

    # Remove time
    df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')

    # Set frequency to avoid errors later in the VAR model
    df['Date'] = pd.DatetimeIndex(df['Date']).to_period('B')
    #df['Date'] = pd.DatetimeIndex(df['Date'])
    # Set Datetime index
    df.set_index('Date', inplace=True)
    return df

In [5]:
# Load market index prices
path1="market_indexes.csv"
market_price = import_data(path1)
end=round(len(market_price)*0.9)
market_price=market_price.iloc[0:end,:]
log_market_price = market_price.apply(np.log, axis=0)
log_market_ret = log_market_price - log_market_price.shift(1)
log_market_ret.dropna(inplace=True)

# Stationarity  
<div style='text-align: justify;'>
With time-series data, an initial step is to test for the presence of unit roots; in other words, to test whether the market indices are stationary.
</div>  

## ADF unit root test  


One of the most used tests for stationarity was initially introduced by (Dickey & Fuller, 1979).

The unit root test model is given by:

$$Δy_{t} = ψy_{t-1} + \sum \limits_{j=1} ^{p} a_{j}Δy_{t-j}  + e_{t}$$

<div style='text-align: justify;'>
Where ψ is the coefficient of the lagged level $y_{t-1}$, p is the number of lagged differences, $a_{j}$ is the coefficient of the j-th lagged first difference and $e_{t}$ is the error term. The number of lags is selected using the Akaike information criterion (AIC).
</div>  

<u>Hypothesis Testing.</u>  
H0: ψ=0, yt is non-stationary  
H1: ψ<0, yt is stationary  

<div style='text-align: justify;'>
The general idea is that a stationary time series is mean-reverting, so the lagged level ($y_{t-1}$) carries information about the next change of the series: when the series is above its mean it tends to fall, which corresponds to ψ<0. In the case of ψ=0, the lagged level has no effect on the change of the variable, so the series has a unit root and is not stationary.
</div>

In [1]:
from statsmodels.tsa.stattools import adfuller
def adf(df, j):

    # Augmented Dickey-Fuller
    # The number of lags was decided using AIC which is the default IC
    # Maximum lag that is tested is 12
    result = adfuller(df, maxlag=12, autolag='AIC')
    if result[1] <= 0.05:
        print('{} is stationary'.format(j))
        print("p-value: {}".format(result[1]))
    else:
        print('{} is non stationary'.format(j))
        print('p-value: {}'.format(result[1]))
    return "{:.4f}".format(result[1])

### In levels

In [20]:
market_indices_columns = market_price.columns.values.tolist()

stationarity_level = pd.DataFrame(0, index=['p-value'], columns=market_indices_columns)

for j in market_indices_columns:
    stationarity_level[j] = adf(log_market_price[j], j)

Brazil is stationary
p-value: 0.04589581919058934
China is non stationary
p-value: 0.09376716288440134
Germany is non stationary
p-value: 0.11706504529947537
Greece is non stationary
p-value: 0.07383288104199594
India is non stationary
p-value: 0.719903539990822
US is non stationary
p-value: 0.6187102275343025


### In 1st difference

In [21]:
stationarity_1stdif = pd.DataFrame(0, index=['p-value'], columns=market_indices_columns)
for j in market_indices_columns:
    stationarity_1stdif[j] = adf(log_market_ret[j], j)

Brazil is stationary
p-value: 2.2877586712872303e-19
China is stationary
p-value: 0.0
Germany is stationary
p-value: 1.572522027603825e-24
Greece is stationary
p-value: 4.4162919292700536e-20
India is stationary
p-value: 1.1626021931785866e-17
US is stationary
p-value: 6.543707176570867e-20


### Results  
<div style='text-align: justify;'>
The results show that all the variables apart from Brazil are non-stationary in levels at the 5% significance level and stationary in first differences. 
Brazil rejects the unit root in levels as well, although only marginally (p-value ≈ 0.046), so it is treated as I(0), while the other variables are integrated of order one, I(1), 
and should be tested for cointegration. Brazil is excluded from the cointegration tests because only variables integrated of the same order can be cointegrated.
</div>

# Long-Term Relationship
## Cointegration
<div style='text-align: justify;'>
The concept of cointegration was introduced by (Engle & Granger, 1987). Variables that share a common stochastic trend, so that they move together in a consistent and predictable way in the long run, are said to be cointegrated. It is important to state that the cointegration tests are applied to the (log) price levels, so the returns are not used.
</div>  

### Engle-Granger Two-step procedure
<div style='text-align: justify;'>    
The cointegration can be tested using the Engle-Granger two-step procedure, which was proposed by (Engle & Granger, 1987).
The first step is to estimate using OLS a model that includes the variables that need to be tested for cointegration.
</div>
$$y_{1t} = a_{0} + \sum \limits_{j=2} ^{N} a_{j}y_{jt}  + e_{t}$$
<div style='text-align: justify;'>
Where $y_{1t}$ and $y_{jt}$ are the selected variables, N is the total number of selected variables, $a_{0}$ is the intercept, $a_{j}$ is the coefficient of the j-th variable for j = 2, ..., N, and $e_{t}$ is the error term.
</div>
Once the parameters are estimated, the residuals are obtained as
$$ \hat{e}_{t} = y_{1t} - \hat{a}_{0} - \sum \limits_{j=2} ^{N} \hat{a}_{j}y_{jt}$$

The second step is to test whether the residuals are stationary using the ADF unit root test.

<u>Hypothesis Testing.</u>   
H0: et non-stationary, no-cointegration  
H1: et stationary, cointegration  
<div style='text-align: justify;'>
When used to identify relationships between two variables, the procedure has certain limitations: it is sensitive to the chosen lag order, and the result can depend on which variable is placed on the left-hand side of the regression.
</div>

In [17]:
from statsmodels.tsa.stattools import coint
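def engle_granger(df, col):
    # Matrix of pairwise Engle-Granger p-values; the diagonal stays NaN
    cointegration_pvalues = pd.DataFrame(np.nan, index=col, columns=col)
    for i in df:
        for j in df:
            if i != j:
                # coint(y0, y1) regresses y0 on y1 and tests the residuals for a unit root
                result = coint(df[j], df[i])
                cointegration_pvalues.loc[j, i] = result[1]
    return cointegration_pvalues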

In [18]:
# Brazil is excluded from the data set for the cointegration tests
coint_ind = log_market_price.drop('Brazil',axis=1)
coint_ind.head()


Date,China,Germany,Greece,India,US
2016-01-04,8.100544,9.23829,6.431524,10.151259,7.607213
2016-01-05,8.097947,9.240879,6.425825,10.149579,7.609223
2016-01-07,8.04719,9.208323,6.381393,10.120687,7.572035
2016-01-08,8.06665,9.19516,6.373286,10.124001,7.561137
2016-01-11,8.011919,9.192693,6.36592,10.119608,7.56199


In [30]:
cointegration_pvalues = engle_granger(coint_ind, coint_ind.columns.values)
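# Flag pairs whose Engle-Granger p-value is at or below the 5% level
cointegration = pd.DataFrame(np.where(cointegration_pvalues <= 0.05, 'YES', 'NO'),
                             index=cointegration_pvalues.index,
                             columns=cointegration_pvalues.columns).astype(object)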
for i in range(len(cointegration)):
    cointegration.iloc[i, i] = '-'
cointegration.head()

,China,Germany,Greece,India,US
China,-,YES,NO,NO,NO
Germany,YES,-,NO,NO,NO
Greece,NO,NO,-,NO,NO
India,NO,NO,NO,-,NO
US,NO,NO,NO,NO,-


### Results
Only China and Germany are pairwise cointegrated at the 5% significance level.

## Johansen cointegration test
This test is based on the VECM, which is a transformation of the VAR model. The VECM is suitable in cases where the variables exhibit cointegration.
VECM(p) specification: 

$$ΔY_{t} = D_{t} + ΠY_{t-1} + \sum \limits_{j=1} ^{p-1} Γ_{j}ΔY_{t-j}  + e_{t}$$
<div style='text-align: justify;'>
$D_{t}$ contains the deterministic terms (such as the intercept), while $Π$ and $Γ_{j}$ capture the long-run and short-run relationships between the variables respectively. Once the cointegration rank and the lag order of the model are selected, the model is estimated by maximum likelihood.
The Johansen test determines the number of cointegrating relationships and is mainly applied in two forms, the trace test and the maximum eigenvalue test. When rank(Π)=0, there is no cointegration, as there is no long-run equilibrium between the variables. Intuitively, as the rank of a matrix is the maximum number of linearly independent rows or columns, the rank of Π represents the number of independent linear combinations of the variables that are stationary.  
</div>  

<b>Trace Test</b>  
<u>Hypothesis Testing.</u>    
H0: rank(Π) ≤ r  
H1: rank(Π) > r  
<u>Trace Statistic:</u> 
$$Trace_{Stat} = -T\sum \limits_{i=r+1} ^{n} \ln(1-\hat{λ}_{i})$$

<b>Maximum Eigenvalue Test</b>  
<u>Hypothesis Testing.</u>   
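H0: rank(Π) = r  
H1: rank(Π) = r + 1  
<u>Maximum Eigenvalue Statistic:</u>
$$MaxEig_{Stat} = -T\ln(1-\hat{λ}_{r+1})$$
for r = 0, 1, ..., n-1, where T is the number of observations, n is the number of variables and $\hat{λ}_{i}$ is the i-th largest estimated eigenvalue.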

In [32]:
from statsmodels.tsa.vector_ar.vecm import *
def multivariate_johansen(df, alpha=0.05):
    """Perform Johansen's cointegration test (trace statistic) and report a summary."""
    # det_order=-1: no deterministic terms; k_ar_diff=5: five lagged differences
    out = coint_johansen(df, -1, 5)
    d = {'0.90':0, '0.95':1, '0.99':2}
    traces = out.lr1
    cvts = out.cvt[:, d[str(1-alpha)]]
    def adjust(val, length= 6): return str(val).ljust(length)
    # Summary
    print('Name   ::  Test Stat > C(95%)    =>   Signif  \n', '--'*20)
    for col, trace, cvt in zip(df.columns, traces, cvts):
        print(adjust(col), ':: ', adjust(round(trace,2), 9), ">", adjust(cvt, 8), ' =>  ' , trace > cvt)
# Reference: https://gist.github.com/BioSciEconomist/197bd86ea61e0b4a49707af74a0b9f9c

In [33]:
multivariate_johansen(coint_ind)

Name   ::  Test Stat > C(95%)    =>   Signif  
 ----------------------------------------
China  ::  50.48     > 60.0627   =>   False
Germany ::  31.0      > 40.1749   =>   False
Greece ::  15.54     > 24.2761   =>   False
India  ::  5.19      > 12.3212   =>   False
US     ::  0.1       > 4.1296    =>   False


### Results  

<div style='text-align: justify;'>
The Johansen test was run with the trace statistic and a lag order of five. The rows of the output correspond, in order, to the null hypotheses of at most 0, 1, 2, 3 and 4 cointegrating relationships; the index names printed in the first column are only positional labels. As every trace statistic is lower than its 95% critical value, the null hypothesis of no cointegration cannot be rejected, so the variables are not cointegrated.
</div>

# Short-Term Relationship  
## Granger-Causality test  
<div style='text-align: justify;'>
The Granger causality test attempts to identify causal relationships among the variables. The method was proposed by (Granger, 1969) and popularized by (Sims, 1972). In the bivariate case, it examines the causal relationship between two variables (Y1, Y2). According to (Brooks, 2008), when Y1 Granger-causes Y2, then Y1 helps to forecast Y2, and in the VAR setting this happens when the lags of Y1 in the equation of Y2 are statistically significant. This is the case of unidirectional causality from Y1 to Y2. When Y2 also Granger-causes Y1, the case is known as bidirectional causality.
</div>  

From the VAR(p) model, the equations of Y1 and Y2 are:  
$$ Y_{1t} = A_{10} + \sum \limits _{j=1} ^{p}A_{11j}Y_{1t-j} + \sum \limits _{j=1} ^{p}A_{12j}Y_{2t-j} + e_{1t}$$

$$ Y_{2t} = A_{20} + \sum \limits _{j=1} ^{p}A_{21j}Y_{1t-j} + \sum \limits _{j=1} ^{p}A_{22j}Y_{2t-j} + e_{2t}$$
The hypothesis testing can easily be implemented using the F-test statistic (or Wald statistic).  

<u>Hypothesis testing for Y1 Granger-causes Y2:</u>  
H0: Y1 fails to Granger-cause Y2 (all coefficients $A_{21j}$, j = 1, ..., p, are 0)  
H1: Y1 Granger-causes Y2 (at least one $A_{21j}$ is different from 0)  

<div style='text-align: justify;'>
It is important to note that Granger causality is a statement about predictability, not about true causation. It is a statistical method that identifies whether past values of one variable help predict the current or future values of another variable.
</div>
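The F-test described above can be sketched directly from the restricted and unrestricted regressions of the Y2 equation. This is a minimal illustration on simulated data, assuming NumPy and SciPy are available; the helpers `lagged` and `granger_f_test`, the simulated series, and the coefficients are hypothetical and are not part of the analysis in this notebook.

```python
import numpy as np
from scipy import stats


def lagged(series, p):
    """Columns are the series lagged by 1, ..., p, aligned with series[p:]."""
    n = len(series)
    return np.column_stack([series[p - j:n - j] for j in range(1, p + 1)])


def granger_f_test(cause, effect, p):
    """F-test of H0: the p lags of `cause` do not enter the `effect` equation."""
    target = effect[p:]
    const = np.ones((len(target), 1))
    restricted = np.hstack([const, lagged(effect, p)])       # own lags only
    unrestricted = np.hstack([restricted, lagged(cause, p)])  # plus lags of cause

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    df_resid = len(target) - unrestricted.shape[1]
    f_stat = ((rss_r - rss_u) / p) / (rss_u / df_resid)
    p_value = stats.f.sf(f_stat, p, df_resid)
    return f_stat, p_value


rng = np.random.default_rng(3)  # fixed seed, illustrative only
n_obs = 500
y1 = rng.standard_normal(n_obs)
y2 = np.zeros(n_obs)
for t in range(1, n_obs):
    # By construction, the first lag of y1 enters the y2 equation
    y2[t] = 0.3 * y2[t - 1] + 0.5 * y1[t - 1] + rng.standard_normal()

f_12, p_12 = granger_f_test(cause=y1, effect=y2, p=2)
f_21, p_21 = granger_f_test(cause=y2, effect=y1, p=2)
print('Y1 -> Y2: F =', f_12, 'p-value =', p_12)
print('Y2 -> Y1: F =', f_21, 'p-value =', p_21)
```

In this simulation the null should be rejected for Y1 → Y2, while the reverse direction has no built-in effect.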

In [34]:
from statsmodels.tsa.stattools import grangercausalitytests
def grangers_causality_matrix(data, variables, lag_order=3, test = 'ssr_chi2test', verbose=False):
    # Code:
    # https://towardsdatascience.com/granger-causality-and-vector-auto-regressive-model-for-time-series-forecasting-3226a64889a6
    # https://phdinds-aim.github.io/time_series_handbook/04_GrangerCausality/04_GrangerCausality.html
    # Object dtype so that the 'YES' / 'NO' / '-' labels can be stored in the matrix
    dataset = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables).astype(object)
    for c in dataset.columns:
        for r in dataset.index:
            test_result = grangercausalitytests(data[[r,c]], maxlag=lag_order, verbose=False)

            # grangercausalitytests tests whether the time series in the second column
            # Granger-causes the time series in the first column, so here X = c and Y = r.

            p_values = [round(test_result[i+1][0][test][1],4) for i in range(lag_order)]
            if verbose: print(f'Y = {r}, X = {c}, P Values = {p_values}')

            min_p_value = np.min(p_values)
            # H0: X Fails to Granger-Cause Y
            # H1: X Granger-Cause Y
            if min_p_value <= 0.05:
                dataset.loc[r,c] = "YES"
            else:
                dataset.loc[r, c] = "NO"
    for i in range(len(dataset)):
        dataset.iloc[i, i] = '-'
    dataset.columns = [var + '_X' for var in variables]
    dataset.index = [var + '_Y' for var in variables]
    return dataset

In [36]:
# The Granger-Causality test is run for lags one to five; a pair is marked YES if any lag is significant at the 5% level
pairwise_causality = grangers_causality_matrix(log_market_ret, variables = log_market_ret.columns, lag_order=5)
pairwise_causality.head(6)

,Brazil_X,China_X,Germany_X,Greece_X,India_X,US_X
Brazil_Y,-,NO,YES,YES,YES,YES
China_Y,YES,-,YES,YES,YES,YES
Germany_Y,NO,YES,-,YES,YES,YES
Greece_Y,YES,YES,NO,-,YES,YES
India_Y,YES,NO,YES,YES,-,YES
US_Y,YES,NO,YES,YES,YES,-


### Results  

<div style='text-align: justify;'>  
In the table, a YES in row Y and column X means that X Granger-causes Y. Brazil has a bidirectional relationship with Greece, India, and the US. In addition, China has a bidirectional relationship with Greece and Germany, while Germany has a bidirectional relationship with China, India, and the US. Moreover, Greece has bidirectional relationships with all the investigated indices except for the German one. Furthermore, India and the US have bidirectional relationships with all the investigated markets except for the Chinese one. The remaining relationships are unidirectional, and there is no pair without a causal relationship in at least one direction. Even though the Granger causality test does not imply theoretical causality, it is a strong indication that past values of one market index help predict the others.
</div>

# References  
* Dickey, D., & Fuller, W. (1979). Distribution of the Estimators for Autoregressive Time Series With a Unit Root. Journal of the American Statistical Association, 74(366), 427-431. doi:https://doi.org/10.2307/2286348  

* Engle, R., & Granger, C. (1987). Co-Integration and Error Correction: Representation, Estimation, and Testing. Econometrica, 55(2), 251-276. doi:https://doi.org/10.2307/1913236  

* Granger, C. W. J. (1969). Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica, 37(3), 424-438. Retrieved from https://www.jstor.org/stable/1912791  

* Sims, C. (1972). Money, Income, and Causality. The American Economic Review, 62(4), 540-552. Retrieved from https://www.jstor.org/stable/1806097  

* Brooks, C. (2008). Introductory Econometrics for Finance (2nd ed.). Cambridge University Press.  
