# Dissertation - Pairs Trading - Documentation

## Abstract

This notebook serves as a guide to pairs trading: it illustrates the practical application of theoretical concepts through code, accompanied by step-by-step explanations. Academic references are provided throughout to support the content and to facilitate further exploration of the subject. It is aimed at novices seeking an introduction to pairs trading strategies.

## Table of Contents:
* [1. Set Up of the Backtesting](#set-up)
* [2. Distance Approach](#distance-approach)
* [3. Cointegration Approach](#cointegration-approach)
* [4. Copula Approach](#copula-approach)
* [5. Conclusion](#conclusion)
* [6. Reference](#reference)

## 1. Set Up of the Backtesting <a class="anchor" id="set-up"></a>

- Data set: data/ftse100_2021_2023_06.csv
- Formation start date (date1): 4 January 2021
- Formation end date (date2): 30 December 2022
- Trading start date (date3): 1 January 2022
- Trading end date (date4): 30 June 2023
- Modules used: distance_approach.py, cointegration_approach.py, copula_approach.py, performance_measurement.py

## 2. Distance Approach <a class="anchor" id="distance-approach"></a>

This section closely follows the methodology suggested by Gatev et al. (2006). The distance approach measures mispricing by the "spread", the difference between the normalized prices of the pair's two stocks.

In [None]:
def distance_method(date1, date2, date3, date4, x, df, capital=10000):
    # Formation period: date1 to date2. Trading period: date3 to date4.
    # x: number of pairs to trade; capital: amount allocated to each leg.
    # formation period: rank all pairs and keep the x closest ones
    coeff_matrix, coeff_list = forming_pairs(date1, date2, df)
    pairs = pair_tickers(coeff_matrix, coeff_list, x)
    print(pairs)
    # trading period: trade both legs of every selected pair
    period_return = []
    for i in pairs:
        pair1 = i[0]
        pair2 = i[1]
        df_pair1, df_pair2 = trading_df(pair1, pair2, date1, date2, date3, date4, df)
        profit_1 = trading(df_pair1, capital)
        profit_2 = trading(df_pair2, capital)
        pair_return = pair_profit(profit_1, profit_2)
        period_return.append(pair_return)
    # one entry per pair, in the same order as the printed pairs
    # (the original second `return(pairs)` was unreachable and has been dropped)
    return period_return

## Formation Period

### 1. Normalization of the input data in formation period

The normalized price is the cumulative return series of a stock. It is computed by dividing each closing price by the closing price on the first day of the formation period, so that every series starts at £1 at the beginning of the formation period.

$$ Price_{normalized,t} = \dfrac{P_{t}}{P_{initial}} $$

*Reference code: distance_approach.py > def normalizing(data)*
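As a minimal sketch of the normalization step, the snippet below rescales a list of closing prices so that the series starts at 1. The function name `normalize_prices` and the prices are illustrative assumptions; this is not the code of `normalizing(data)` in `distance_approach.py`.

```python
def normalize_prices(prices):
    """Rescale a price series so that it starts at 1 (a cumulative return index)."""
    initial = prices[0]
    if initial == 0:
        raise ValueError("the initial price must be non-zero")
    return [price / initial for price in prices]


# Made-up closing prices for one stock over four days.
closing = [50.0, 51.0, 49.0, 55.0]
normalized = normalize_prices(closing)
print(normalized)  # [1.0, 1.02, 0.98, 1.1]
```

Every stock is rescaled the same way, so series with very different price levels become directly comparable.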

### 2. Finding pairs with SSD

In this method, the spread is calculated by contrasting the normalized cumulative returns ($CR$) of each stock pair combination during the 12-month formation period. The spread at time $t$ is the difference between the two series:

$$Spread_{t}=CR_{1,t}-CR_{2,t}$$

*Reference code: distance_approach.py > def ssd(col1, col2)*

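Subsequently, the distance between two cumulative return series, $CR_{1}$ and $CR_{2}$, is computed as the square root of the sum of squared differences (SSD), i.e. the Euclidean distance. Because the square root is monotonic, ranking pairs by this distance is equivalent to ranking them by SSD.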

$$SSD=\sqrt{\sum ^{T}_{t=1}\left(CR_{1,t}-CR_{2,t}\right) ^{2}}$$

*Reference code: distance_approach.py > def ssd_matrix(dta)*
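The distance calculation can be sketched in plain Python as follows. The tickers, the return series, and the helper names (`ssd`, `euclidean_distance`, `distance_table`) are assumptions made for illustration and may differ from the project's `ssd` and `ssd_matrix` functions.

```python
import math
from itertools import combinations


def ssd(cr1, cr2):
    """Sum of squared differences between two cumulative return series."""
    return sum((a - b) ** 2 for a, b in zip(cr1, cr2))


def euclidean_distance(cr1, cr2):
    """Square root of the SSD, as in the formula above."""
    return math.sqrt(ssd(cr1, cr2))


def distance_table(series_by_ticker):
    """Distance for every unordered pair of tickers."""
    return {
        (a, b): euclidean_distance(series_by_ticker[a], series_by_ticker[b])
        for a, b in combinations(sorted(series_by_ticker), 2)
    }


# Hypothetical normalized series for three tickers.
cumulative_returns = {
    "AAA": [1.00, 1.02, 1.05, 1.03],
    "BBB": [1.00, 1.01, 1.04, 1.04],
    "CCC": [1.00, 0.95, 0.90, 0.97],
}
distances = distance_table(cumulative_returns)
closest_pair = min(distances, key=distances.get)
print(closest_pair)  # ('AAA', 'BBB')
```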

### 3. Selection criteria - smallest SSD

Afterward, the **$n$ pairs with the smallest SSD** are chosen for trading in the subsequent 6-month trading period.

In this test, $n=5$ is selected for simplicity.

*Reference code: distance_approach.py >*
- *def forming_pairs(date1, date2, df),*
- *def pair_tickers(coeff_matrix, coeff_list, x)*
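Selecting the $n$ closest pairs amounts to sorting the pairwise distances in ascending order and keeping the first $n$ entries. The sketch below assumes the distances are held in a dictionary keyed by ticker pair; both the data and the name `select_top_pairs` are hypothetical.

```python
def select_top_pairs(distances, n):
    """Return the n pairs with the smallest distance, closest first."""
    ranked = sorted(distances.items(), key=lambda item: item[1])
    return [pair for pair, _ in ranked[:n]]


# Made-up distances between normalized price series.
example_distances = {
    ("AAA", "BBB"): 0.017,
    ("AAA", "CCC"): 0.176,
    ("BBB", "CCC"): 0.168,
}
top_pairs = select_top_pairs(example_distances, n=2)
print(top_pairs)  # [('AAA', 'BBB'), ('BBB', 'CCC')]
```

Note that this simple ranking allows the same stock to appear in more than one selected pair.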

### 4. Calculating historical volatility

Moreover, the standard deviation of the spread calculated during the formation period acts as a trading criterion.

*Reference code: distance_approach.py > def standard_deviation(col1,col2)*
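A minimal sketch of the historical volatility calculation is shown below. It assumes the sample standard deviation (divisor $T-1$); whether `standard_deviation(col1, col2)` uses the sample or the population formula is not stated here, so treat that choice as an assumption.

```python
import statistics


def spread_std(cr1, cr2):
    """Standard deviation of the spread between two normalized price series."""
    spread = [a - b for a, b in zip(cr1, cr2)]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.stdev(spread)


# Made-up formation-period series; the spread is roughly [0.0, 0.1, 0.2].
sigma = spread_std([1.0, 1.1, 1.2], [1.0, 1.0, 1.0])
print(round(sigma, 4))  # 0.1
```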

## Trading Period

### 1. Normalization of the input data in trading period

As in the formation period, the normalized price is the cumulative return series of a stock. Here each closing price is divided by the closing price on the first day of the trading period, so that every series starts at £1 at the beginning of the trading period.

*Reference code: distance_approach.py > def normalizing(data)*

### 2. Trading Signal

- Positions are opened when the spread deviates from zero by at least two standard deviations (2$\sigma$), where $\sigma$ is the standard deviation of the spread estimated during the formation period.
- Opening a position means buying the undervalued stock and selling the overvalued one.
- When the spread converges to 0, the position is closed and the process repeats.
- Because the 2$\sigma$ threshold is fixed from historical data, less volatile spreads trigger positions on small divergences, which can result in losses even when the spread converges.
- Any positions still open at the end of the trading period are closed, irrespective of whether the mispricing persists.

*Reference code: distance_approach.py > def trading_df(pair1, pair2, date1, date2, date3, date4, df)*
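The trading rules above can be sketched as a small state machine. The sign convention is an assumption: the spread is taken as stock 1 minus stock 2, so `+1` means long stock 1 and short stock 2, and `-1` means the opposite. The function name `generate_positions` and the sample spread are hypothetical, and this is not the implementation in `trading_df`.

```python
def generate_positions(spread, sigma, k=2.0):
    """Daily position in the pair: +1, -1, or 0 (flat)."""
    positions = []
    position = 0
    for s in spread:
        if position == 0:
            if s >= k * sigma:
                position = -1  # stock 1 looks overvalued: sell 1, buy 2
            elif s <= -k * sigma:
                position = 1   # stock 1 looks undervalued: buy 1, sell 2
        elif (position == -1 and s <= 0) or (position == 1 and s >= 0):
            position = 0       # spread has converged: close the trade
        positions.append(position)
    if positions:
        positions[-1] = 0      # close anything still open on the last day
    return positions


# Made-up spread with sigma = 0.02, so the entry thresholds are +/-0.04.
spread = [0.00, 0.03, 0.05, 0.02, -0.01, -0.05, -0.02, 0.01, 0.05]
positions = generate_positions(spread, sigma=0.02)
print(positions)  # [0, 0, -1, -1, 0, 1, 1, 0, 0]
```

On the final day the spread exceeds the threshold again, but the forced close at the end of the trading period leaves the strategy flat.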

### 3. Implement Trading Strategy

Execute the strategy based on the trading signals produced for the stock pair and calculate the profit accrued over the 6-month period.

*Reference code: distance_approach.py > def trading(dta, capital)*

## 3. Cointegration Approach <a class="anchor" id="cointegration-approach"></a>

"Cointegration" denotes a situation in which two non-stationary time series, $X_{1}$ and $X_{2}$, have a stationary linear combination. It helps pinpoint deviations from a long-term equilibrium. Engle and Granger (1987) propose a two-step method for cointegration testing:

1. Regress one variable on the other by ordinary least squares and obtain the residuals, which form the spread.
2. Apply a unit root test (such as the Augmented Dickey-Fuller test) to the residuals. Rejecting the unit root means the spread is stationary, i.e. the two series are cointegrated.

In [None]:
def Coint_Trading(date1, date2, date3, date4, n, df, capital=10000):
    # Formation period: date1 to date2. Trading period: date3 to date4.
    # n: number of cointegrated pairs to trade.
    # formation period: select n cointegrated pairs
    pairs = coint_pairs(date1, date2, n, df)
    print(pairs)
    # trading period
    returns = []
    for i in pairs:
        pair1 = i[0]
        pair2 = i[1]
        # mean and standard deviation of the spread over the formation period
        std, mean = spread_stat(date1, date2, pair1, pair2, df)
        df1, df2 = signal(date3, date4, std, mean, pair1, pair2, df)
        pair1_r = trading(df1, capital)
        pair2_r = trading(df2, capital)
        pair_return = pair_profit(pair1_r, pair2_r)
        returns.append(pair_return)
    # one entry per pair, in the same order as the printed pairs
    # (the original second `return(pairs)` was unreachable and has been dropped)
    return returns

## Formation Period

### 1. Pair Selection ranked on SSD

In the first stage, pairs are ranked based on the Sum of Squared Differences (SSD) of their normalized prices during the formation period. This involves ranking all possible pairs to identify those with the lowest SSD, highlighting pairs showing the least divergence in price movements.

*Reference code: cointegration_approach.py > def forming_pairs(date1, date2, df), def pair_tickers(coeff_matrix, coeff_list)*

### 2. Find Cointegrated Pairs

In the second stage, selected pairs undergo further evaluation of cointegration, closely following the **Engle and Granger two-step approach**. Cumulative return analysis over the formation period determines the presence of a stable, long-term relationship between pairs, identifying those with significant cointegration.

The code uses a significance level of 0.05: a pair is treated as cointegrated when the p-value of the cointegration test falls below this level.

*Reference code: cointegration_approach.py > def find_cointegrated_pairs(date1, date2, df)*

### 3. Selection criteria - cointegration and historical standard deviation

For each cointegrated pair, the cointegration coefficient $\beta$ is estimated. It serves as a hedge ratio and is used to construct the pairs portfolio.

Pairs lacking cointegration are excluded. The remaining pairs are processed in order of increasing SSD, with $\beta$ estimated for each, until $n$ cointegrated pairs with minimal SSDs are identified for trading.

The spread of each pair is then constructed as the scaled price difference between the two stocks:

$$Spread_{t} = X_{2,t}-\beta X_{1,t}$$

The equation above can be interpreted as purchasing one share of stock 2 and selling $\beta$ shares of stock 1 at time $t-1$, with the profit calculation taking place at time $t$ upon closing the position.

The mean ($\mu_{e}$) and standard deviation ($\sigma_{e}$) of the spread over the formation period are used to normalize the spread, which in turn determines the position triggers:

$$Spread_{normalized}=\dfrac{Spread-\mu_{e}}{\sigma _{e}}$$

*Reference code: cointegration_approach.py > def coint_pairs(date1, date2, x, df), def spread_stat(date1, date2, pair1, pair2, df)*
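The hedge ratio, the spread, and the normalized spread can be sketched as below. The snippet assumes $\beta$ is the ordinary least squares slope of $X_{2}$ on $X_{1}$ in a regression with an intercept, so the intercept ends up in the spread's mean $\mu_{e}$; the helper names and the price data are hypothetical and are not taken from `cointegration_approach.py`.

```python
import statistics


def hedge_ratio(x1, x2):
    """OLS slope of x2 on x1 (regression with an intercept)."""
    m1, m2 = statistics.fmean(x1), statistics.fmean(x2)
    covariance = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    variance = sum((a - m1) ** 2 for a in x1)
    return covariance / variance


def spread_series(x1, x2, beta):
    """Spread_t = X2_t - beta * X1_t."""
    return [b - beta * a for a, b in zip(x1, x2)]


def normalize_spread(spread, mean, std):
    """(Spread - mean) / std, using formation-period statistics."""
    return [(s - mean) / std for s in spread]


# Made-up formation-period prices for the two stocks.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.1, 3.9, 6.2, 7.8]

beta = hedge_ratio(x1, x2)
spread = spread_series(x1, x2, beta)
mu_e = statistics.fmean(spread)
sigma_e = statistics.stdev(spread)
z_scores = normalize_spread(spread, mu_e, sigma_e)
print(round(beta, 2))  # 1.94
```

In the trading period the same $\mu_{e}$ and $\sigma_{e}$ from the formation period would be applied to the new spread values, so the thresholds are fixed in advance.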

## Trading Period

The trading strategy, like the distance approach, opens long and short positions when the normalized spread deviates by two standard deviations (2$\sigma$).

The cointegration coefficient $\beta$ also enters the profit calculation:

$$\begin{aligned} Profit &= (X_{2,t}-\beta X_{1,t})-(X_{2,t-1}-\beta X_{1,t-1})\\ &= Spread_{t}-Spread_{t-1}\end{aligned}$$

Unlike the distance approach, the long and short positions differ in value:
- When the spread falls below -2$\sigma$, £1 of stock 2 is bought and £$\beta$ of stock 1 is sold.
- When the spread exceeds +2$\sigma$, $\dfrac{£1}{\beta}$ of stock 2 is sold and £1 of stock 1 is bought.

Positions are closed when the spread returns to 0, indicating the pair's equilibrium. This cycle repeats throughout the trading period.

*Reference code: cointegration_approach.py > def signal(date1, date2, std, mean, pair1, pair2, df), def trading(dta, capital)*
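As a worked example of the profit equation, the sketch below compares the change in the spread with the profit computed leg by leg for a position that is long one share of stock 2 and short $\beta$ shares of stock 1. All prices and the value of $\beta$ are made up, and `spread_profit` is a hypothetical helper.

```python
def spread_profit(x1_prev, x2_prev, x1_now, x2_now, beta):
    """Profit from t-1 to t of: long 1 share of stock 2, short beta shares of stock 1."""
    spread_prev = x2_prev - beta * x1_prev
    spread_now = x2_now - beta * x1_now
    return spread_now - spread_prev


beta = 0.5
# Stock 1 moves from 100 to 102, stock 2 from 60 to 62.
profit = spread_profit(100.0, 60.0, 102.0, 62.0, beta)

# Same result leg by leg: gain on the long leg minus loss on the short leg.
long_leg = 62.0 - 60.0               # +2.0 on one share of stock 2
short_leg = -beta * (102.0 - 100.0)  # -1.0 on 0.5 shares of stock 1
print(profit, long_leg + short_leg)  # 1.0 1.0
```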

## 4. Copula Approach <a class="anchor" id="copula-approach"></a>


A copula acts as a connector between individual (marginal) distribution functions and their joint distribution function. Formally, it is a joint distribution function whose marginal distributions are all uniform on $[0, 1]$, so it describes how the variables depend on each other separately from how each variable is distributed.

In the copula function below, ($U_{1}, U_{2},\ldots, U_{n}$) are uniformly distributed random variables (the quantiles of the original variables), while ($u_{1},u_{2},\ldots, u_{n}$) are specific observed values in $[0, 1]$.

$$ C\left( u_{1},u_{2},\ldots ,u_n\right)=P(U_{1}\leq u_{1},U_{2}\leq u_{2},\ldots,U_{n}\leq u_{n})$$

For a pair of variables, the conditional distribution functions are obtained by taking partial derivatives of the copula function:

$$h_{1}\left(u_{1}|u_{2}\right)=P\left(U_{1}\leq u_{1}|U_{2}=u_{2}\right)=\dfrac{\partial C\left(u_{1},u_{2}\right) }{\partial u_{2}}$$
$$h_{2}\left(u_{2}|u_{1}\right)=P\left(U_{2}\leq u_{2}|U_{1}=u_{1}\right)=\dfrac{\partial C\left(u_{1},u_{2}\right) }{\partial u_{1}}$$


The functions $h_{1}$ and $h_{2}$ give the probability that one random variable is smaller than a given value while the other takes a fixed value. In pairs trading, they indicate whether one stock in a pair is relatively high or low, and therefore more likely to fall or rise, given the observed value of the other stock.
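To make the conditional distribution function concrete, the sketch below evaluates $h_{1}$ for a Gaussian copula, where it has the closed form $\Phi\left(\dfrac{\Phi^{-1}(u_{1})-\rho\,\Phi^{-1}(u_{2})}{\sqrt{1-\rho^{2}}}\right)$. The Gaussian family and the correlation $\rho = 0.8$ are chosen only for illustration; the notebook selects the family per pair, and `gaussian_h` is a hypothetical helper rather than a function of `copula_approach.py`.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def gaussian_h(u1, u2, rho):
    """h(u1 | u2) = P(U1 <= u1 | U2 = u2) under a Gaussian copula with correlation rho."""
    z1 = _STD_NORMAL.inv_cdf(u1)
    z2 = _STD_NORMAL.inv_cdf(u2)
    return _STD_NORMAL.cdf((z1 - rho * z2) / sqrt(1.0 - rho * rho))


# With no dependence, conditioning on u2 changes nothing: h equals u1.
independent = gaussian_h(0.3, 0.7, rho=0.0)

# With strong positive dependence, a median value of stock 1 looks low
# when stock 2 sits at its 90th percentile, so h falls well below 0.5.
relatively_low = gaussian_h(0.5, 0.9, rho=0.8)
print(round(independent, 4), relatively_low < 0.5)  # 0.3 True
```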

In [None]:
import numpy as np

def copula_approach(stocks, df_data, date1, date2, date3, date4, x, y):
    # Formation period: date1 to date2. Trading period: date3 to date4.
    # x: number of pairs to keep; y: threshold passed on to copula_trading.
    # prepare data set: drop rows that contain infinite or missing values
    df_data_form, df_data_trade = train_test_split(
        df_data, date1, date2, date3, date4)
    df_data_form = df_data_form.replace([np.inf, -np.inf], np.nan).dropna()
    df_data_trade = df_data_trade.replace([np.inf, -np.inf], np.nan).dropna()
    # formation period: rank pairs by Kendall's tau and fit a copula to each
    df_tau_results = calc_tau(stocks, df_data_form).dropna()
    pairs_selected = select_pairs(df_data_form, df_tau_results)[:x]

    df_copula_results = copula_formation(
        df_data_form, df_tau_results.loc[pairs_selected], pairs_selected)
    # trading period
    threshold = y
    trade_results = {}
    for pair in pairs_selected:
        stock_1, stock_2 = parse_pair(pair)
        df_calculations, df_positions, df_returns = copula_trading(
            df_data, df_data_form, df_data_trade, df_copula_results, pair, threshold)

        trade_results[pair] = {'calculations': df_calculations,
                               'positions': df_positions,
                               'returns': df_returns,
                               'metrics': metrics(df_returns.sum(axis=1))}
    # The original debug print(df_returns) showed only the last pair and
    # failed when no pair was selected, so it has been removed.
    # Sum both legs to obtain one return series per pair.
    pairs_returns = []
    for pair in pairs_selected:
        df_pair_return = trade_results[pair]['returns'].astype(float)
        pairs_returns.append(df_pair_return.sum(axis=1))

    return pairs_returns

## Formation Period

### 1. Select Stock Pairs with Kendall's Tau $\tau$:

Whereas the previous approaches select pairs by SSD and cointegration, the copula approach uses a non-parametric rank correlation measure, Kendall's $\tau$, to assess the dependence within a pair. In this project, cumulative returns are calculated alongside Kendall's $\tau$ for each pair of stocks.

Higher Kendall's $\tau$ values indicate stronger co-movement within the pair, which aligns well with copula analysis, itself rooted in concordance. The top $n$ pairs are selected based on Kendall's $\tau$.

*Reference code: copula_approach.py > def calc_tau(stocks: list, df_data: pd.DataFrame) -> pd.DataFrame, def select_pairs(df_data_form: pd.DataFrame, df_tau_results: pd.DataFrame) -> list*
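Kendall's $\tau$ counts concordant and discordant pairs of observations. The sketch below implements the simple tau-a variant without a correction for ties; this is an assumption made for brevity, and `calc_tau` may rely on a library routine that handles ties differently.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of observation pairs."""
    concordant = discordant = 0
    for (x_i, y_i), (x_j, y_j) in combinations(zip(x, y), 2):
        product = (x_i - x_j) * (y_i - y_j)
        if product > 0:
            concordant += 1    # both series move in the same direction
        elif product < 0:
            discordant += 1    # the series move in opposite directions
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Made-up ranks: five of the six observation pairs are concordant.
tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
print(round(tau, 4))  # 0.6667
```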

### 2. Fitting Copula Functions to Selected Pairs:

The top $n$ pairs undergo copula fitting in two steps. 

1. First, the daily returns of each stock in the pair over the formation period are fitted to marginal distributions.
2. Then, the copula that best fits the resulting uniform marginals is selected and its parameters are estimated.


Following the methodology of Liew and Wu (2013) and Xie et al. (2014), copulas commonly used in financial asset analysis, namely Gaussian, Student-t, Gumbel, Clayton, and Frank, are examined.

The optimal copula is the one that best represents the dependence pattern between the two stocks. Each candidate is fitted by maximizing the log-likelihood, and the fits are compared using the Akaike Information Criterion (AIC). The copula with the lowest AIC is chosen.

*Reference code: copula_approach.py > def fit_copula(df_data: pd.DataFrame, pairs_selected: list) -> pd.DataFrame, def copula_formation(df_data_form:pd.DataFrame, df_tau_results:pd.DataFrame, pairs_selected:list)*
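The model comparison can be illustrated with the standard definition $AIC = 2k - 2\ln L$, where $k$ is the number of estimated parameters and $\ln L$ the maximized log-likelihood; under this usual convention a lower AIC indicates a better trade-off between fit and complexity. The log-likelihood values below are invented for the example and are not results from the notebook.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical fits: copula family -> (maximized log-likelihood, parameter count).
fits = {
    "Gaussian": (120.4, 1),
    "Student-t": (125.1, 2),
    "Clayton": (110.9, 1),
}
aic_by_family = {name: aic(ll, k) for name, (ll, k) in fits.items()}
best_family = min(aic_by_family, key=aic_by_family.get)
print(best_family)  # Student-t
```

In this made-up case the Student-t copula wins despite its extra parameter, because its gain in log-likelihood outweighs the penalty.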

## Trading Period

### 1. Calculating Conditional Probabilities of Pairs

During the trading period, conditional probabilities are computed from the daily returns of the two stocks in a pair, using the $h_{1}$ and $h_{2}$ functions defined above.

A value of 0.5 for $h_{1}$ or $h_{2}$ denotes a 50% probability that the price of stock 1 ($U_{1}$) or stock 2 ($U_{2}$) will be lower than its current observed value, based on the present price of the other stock. 

Values exceeding 0.5 indicate a higher likelihood of the stock price decreasing relative to its current value, while values below 0.5 suggest a greater probability of the stock price increasing.

*Reference code: copula_approach.py > def copula_trading(df_data, df_data_form:pd.DataFrame, df_data_trade:pd.DataFrame, df_copula_results:pd.DataFrame, pair:str, threshold:float)*

### 2. Establishing Cumulative Mispricing Indices (CMPI)

Following the approach of Xie et al. (2014), two mispricing indices are defined by the equations below.

$$m_{1,t}=h_{1}\left(u_{1}|u_{2}\right)-0.5=P\left(U_{1}\leq u_{1}|U_{2}=u_{2}\right)-0.5$$
$$m_{2,t}=h_{2}\left(u_{2}|u_{1}\right)-0.5=P\left(U_{2}\leq u_{2}|U_{1}=u_{1}\right)-0.5$$

On a daily basis, the cumulative mispricing indices (CMPI) $M_{1}$ and $M_{2}$ are updated using the equations below, starting from zero at the beginning of the trading period.

$$M_{1,t}=M_{1,t-1}+m_{1,t}$$
$$M_{2,t}=M_{2,t-1}+m_{2,t}$$

Positive $M_{1}$ and negative $M_{2}$ values suggest that stock 1 is overvalued compared to stock 2, whereas negative $M_{1}$ and positive $M_{2}$ values indicate the opposite.

*Reference code: copula_approach.py > def copula_trading(df_data, df_data_form:pd.DataFrame, df_data_trade:pd.DataFrame, df_copula_results:pd.DataFrame, pair:str, threshold:float)*
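The two indices can be sketched as a running sum. The daily conditional probabilities below are made up, and the helper names are hypothetical; the snippet only mirrors the equations above and is not the code of `copula_trading`.

```python
from itertools import accumulate


def mispricing_index(h_values):
    """Daily mispricing m_t = h_t - 0.5."""
    return [h - 0.5 for h in h_values]


def cumulative_mispricing(h_values):
    """CMPI: M_t = M_{t-1} + m_t, with M_0 = 0 before the first trading day."""
    return list(accumulate(mispricing_index(h_values)))


# Made-up daily values of h1 for one stock over three days.
h1_values = [0.7, 0.8, 0.4]
m1 = mispricing_index(h1_values)       # approximately [0.2, 0.3, -0.1]
M1 = cumulative_mispricing(h1_values)  # approximately [0.2, 0.5, 0.4]
print([round(v, 2) for v in M1])  # [0.2, 0.5, 0.4]
```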

### 3. Implementing Trading Strategy

A long-short position is opened when one CMPI exceeds 0.5 while, at the same time, the other falls below -0.5: the stock with the positive index is sold and the stock with the negative index is bought. Positions are closed when both CMPIs return to zero.

Continuous monitoring for potential trades is then conducted for the remainder of the trading period.

*Reference code: copula_approach.py > def copula_trading(df_data, df_data_form:pd.DataFrame, df_data_trade:pd.DataFrame, df_copula_results:pd.DataFrame, pair:str, threshold:float)*
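The entry and exit rules can be sketched as follows. Two points are assumptions: "return to zero" is interpreted as both indices reaching or crossing zero from the side on which the trade was opened, and the indices are not reset after a trade is closed. The name `cmpi_positions` and the sample values are hypothetical, with `+1` meaning long stock 1 and short stock 2.

```python
def cmpi_positions(m1, m2, threshold=0.5):
    """Daily position from two CMPI series: +1, -1, or 0 (flat)."""
    positions = []
    position = 0
    for a, b in zip(m1, m2):
        if position == 0:
            if a > threshold and b < -threshold:
                position = -1  # stock 1 overvalued: sell 1, buy 2
            elif a < -threshold and b > threshold:
                position = 1   # stock 1 undervalued: buy 1, sell 2
        elif position == -1 and a <= 0 and b >= 0:
            position = 0       # both indices back at zero: close
        elif position == 1 and a >= 0 and b <= 0:
            position = 0
        positions.append(position)
    return positions


# Made-up CMPI paths for a pair over six days.
M1 = [0.2, 0.6, 0.4, 0.1, -0.1, -0.6]
M2 = [-0.1, -0.7, -0.3, 0.0, 0.2, 0.7]
positions = cmpi_positions(M1, M2)
print(positions)  # [0, -1, -1, -1, 0, 1]
```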

## 5. Conclusion <a class="anchor" id="conclusion"></a>

Based on backtesting results, the distance approach yields a -80% cumulative return with an 81% max drawdown. In contrast, cointegration and copula approaches achieve positive returns of 42% and 41%, respectively, with lower drawdowns.

Among the three strategies, copula demonstrates the highest Sharpe ratio (12.06), indicating superior risk-adjusted returns. Despite achieving a similar cumulative return to cointegration, copula exhibits significantly lower volatility during the trading period, suggesting better risk management and potential for stable performance.

In [4]:
import pandas as pd

overall_performance = pd.read_excel('test/profitability_3_approaches.xlsx')
overall_performance

| | Approach | Start Period | End Period | Risk-Free Rate | Time in Market | Cumulative Return | CAGR﹪ | Sharpe | Prob. Sharpe Ratio | Sortino | ... | 1Y | 3Y (ann.) | 5Y (ann.) | 10Y (ann.) | All-time (ann.) | Avg. Drawdown | Avg. Drawdown Days | Recovery Factor | Ulcer Index | Serenity Index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Distance Approach | 2021-01-04 | 2022-01-04 | 2.0% | 100.0% | -80.39% | -67.53% | -13.42 | 0.0% | -10.51 | ... | -80.39% | -67.53% | -67.53% | -67.53% | -67.53% | -81.41% | 338 | 1.73 | 0.61 | -0.29 |
| 1 | Cointegration Approach | 2021-01-04 | 2022-01-04 | 2.0% | 100.0% | 41.83% | 27.29% | 5.21 | 83.6% | 9.21 | ... | 41.83% | 27.29% | 27.29% | 27.29% | 27.29% | -12.45% | 92 | 1.74 | 0.11 | 1.51 |
| 2 | Copula Approach | 2021-01-04 | 2022-01-04 | 2.0% | 100.0% | 40.79% | 26.64% | 12.06 | 100.0% | 72.05 | ... | 40.79% | 26.64% | 26.64% | 26.64% | 26.64% | -1.59% | 18 | 14.06 | 0.01 | 106.88 |


## 6. Reference <a class="anchor" id="reference"></a>

[Engle and Granger, 1987] Engle, R. F. and Granger, C.W. J. (1987). Co-Integration and Error Correction: Representation, Estimation, and Testing. Econometrica, 55(2):251–276.

[Gatev et al., 2006] Gatev, E., Goetzmann, W. N., and Rouwenhorst, K. G. (2006). Pairs trading: Performance of a relative-value arbitrage rule. Review of Financial Studies, 19(3):797–827.

[Liew and Wu, 2013] Liew, R. Q. and Wu, Y. (2013). Pairs Trading: A Copula Approach. Journal of Derivatives and Hedge Funds, 19(1):12–30.

[Xie et al., 2014] Xie, W., Liew, R. Q., Wu, Y., and Zou, X. (2014). Pairs Trading with Copulas. Journal of Derivatives & Hedge Funds.