# Basic Distance Approach - Following Hudson & Thames Video

This notebook implements the Basic Distance Approach for pairs trading as presented in the Hudson & Thames workshop:
https://www.youtube.com/watch?v=sKgDeqI39b4

## Plan Overview

### Phase 1: Pairs Formation
1. **Normalize input data** using min-max normalization
2. **Pair Selection** - 4 methods:
   - Basic method: Pairs with the smallest sum of squared differences (SSD), i.e. the squared Euclidean distance between normalized prices
   - Industry method: Pairs within same industry group
   - Zero Crossings method: Pairs with higher number of zero crossings
   - Variance method: Pairs with higher historical standard deviation
3. **Calculate historical volatility** for each portfolio
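
The normalization and basic SSD ranking can be sketched on toy data. This is a minimal illustration, not the arbitragelab implementation; the prices and variable names below are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Toy formation-period prices (hypothetical values, not market data)
prices = pd.DataFrame({
    "A": [10.0, 11.0, 12.0, 11.0, 13.0],
    "B": [20.0, 22.0, 24.0, 23.0, 26.0],
    "C": [5.0, 4.0, 6.0, 3.0, 5.0],
})

# Min-max normalization: every column is rescaled to [0, 1]
normalized = (prices - prices.min()) / (prices.max() - prices.min())

# Sum of squared differences (SSD) for every pair of normalized series
ssd = {
    (first, second): float(((normalized[first] - normalized[second]) ** 2).sum())
    for first, second in combinations(normalized.columns, 2)
}

# Smallest SSD first: these are the candidate pairs
ranked_pairs = sorted(ssd, key=ssd.get)
print(ranked_pairs[0], round(ssd[ranked_pairs[0]], 4))  # ('A', 'B') 0.0278
```

Stocks A and B move almost in lockstep after rescaling, so they form the closest pair even though their price levels differ by a factor of two.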

### Phase 2: Trading Signal Generation
1. **Normalize input data** using same min/max from formation period
2. **Create portfolios** - difference between normalized prices
3. **Generate signals**:
   - If portfolio > 2 std → Sell signal (-1)
   - If portfolio < -2 std → Buy signal (+1)
   - Close position when portfolio crosses zero
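
Steps 1 and 2 can be illustrated on a single toy pair. The sketch below assumes hypothetical prices and names; the point is only that the trading-period data reuses the formation-period minimum and maximum rather than its own.

```python
import pandas as pd

# Hypothetical formation-period and trading-period prices for one pair
train = pd.DataFrame({"A": [10.0, 12.0, 14.0], "B": [20.0, 22.0, 30.0]})
test = pd.DataFrame({"A": [13.0, 15.0], "B": [25.0, 24.0]})

# Scaling parameters are estimated on the formation period only
train_min, train_max = train.min(), train.max()

# Trading-period prices reuse those parameters, so values may leave [0, 1]
test_normalized = (test - train_min) / (train_max - train_min)

# The pair's "portfolio" is the difference of the normalized prices
portfolio = test_normalized["A"] - test_normalized["B"]
print(portfolio)
```

Reusing the formation-period parameters keeps the trading period free of information that would not have been available at the time.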

### Phase 3: Results Comparison
- Compare all 4 pair selection methods on same dataset
- Stock universe: S&P 500 (IT, Industrials, Financials, Healthcare)
- Formation: Jan 2018 - Dec 2018 (12 months)
- Trading: Jan 2019 - July 2019 (6 months)
- Output: Equity curves for each method

## Import Required Libraries

In [None]:
# Import arbitragelab distance approach
import sys
sys.path.insert(0, 'myfinlab_github')

from arbitragelab.distance_approach import basic_distance_approach
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully")

## Step 1: Load Data

Download S&P 500 stock data for 4 major industries:
- Information Technology
- Industrials
- Financials
- Healthcare

In [None]:
# Get industry data from Wikipedia 
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
stock_table = table[0]

# Industry groups to use
industry_group = ['Information Technology', 'Industrials', 'Financials', 'Health Care']

# Get tickers from S&P 500 which are in those industry groups
ticker_industry = stock_table[stock_table['GICS Sector'].isin(industry_group)].reset_index(drop=True)

# Get a dataframe of ticker and industry group
ticker_industry = ticker_industry[['Symbol', 'GICS Sector']]

# Get tickers to use as a list
tickers = ticker_industry['Symbol'].to_list()
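# Exclude tickers without usable data for the sample period (assumed reason:
# Yahoo Finance does not resolve the symbol or has no 2018 history for it)
remove_tickers = ['CARR', 'ABC', 'BRK.B', 'VNT', 'OTIS']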
tickers = [ticker for ticker in tickers if ticker not in remove_tickers]

# Get a dictionary of industry group
industry_dict = pd.Series(ticker_industry['GICS Sector'].values,
                          index=ticker_industry['Symbol']).to_dict()

print(f"Total tickers: {len(tickers)}")
print(f"Sample tickers: {tickers[:5]}")

In [None]:
# Loading data - Formation period: Jan 2018 to Dec 2018 (12 months)
# Trading period: Jan 2019 to July 2019 (6 months)
print("Downloading training data (Jan 2018 - Dec 2018)...")
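# auto_adjust=False keeps the "Adj Close" column; recent yfinance versions adjust prices by default and omit it
train_data = yf.download(tickers, start="2018-01-03", end="2019-01-01", progress=False, auto_adjust=False)
train_data = train_data["Adj Close"]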
print(f"Training data shape: {train_data.shape}")

print("\nDownloading test data (Jan 2019 - July 2019)...")
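# auto_adjust=False keeps the "Adj Close" column; recent yfinance versions adjust prices by default and omit it
test_data = yf.download(tickers, start="2019-01-02", end="2019-07-01", progress=False, auto_adjust=False)
test_data = test_data["Adj Close"]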
print(f"Test data shape: {test_data.shape}")

# Display first few rows
train_data.head()

## Step 2: Phase 1 - Pairs Formation

### Method 1: Basic Distance Approach (Euclidean SSD)

In [None]:
# Initialize strategy for basic method
strategy_basic = basic_distance_approach.DistanceStrategy()

# Form pairs using basic method (smallest Euclidean square distance)
print("Forming pairs using Basic Distance Approach...")
strategy_basic.form_pairs(train_data, num_top=20)

# Get results
pairs_basic = strategy_basic.get_pairs()
scaling_basic = strategy_basic.get_scaling_parameters()
historical_std_basic = strategy_basic.train_std

print(f"\nTop 5 pairs (Basic method):")
for i, pair in enumerate(pairs_basic[:5], 1):
    print(f"{i}. {pair}")
print(f"\nTotal pairs formed: {len(pairs_basic)}")

### Method 2: Industry Group Method

In [None]:
# Initialize strategy for industry method
strategy_industry = basic_distance_approach.DistanceStrategy()

# Form pairs using industry method
print("Forming pairs using Industry Group method...")
strategy_industry.form_pairs(train_data, industry_dict=industry_dict, num_top=20)

# Get results
pairs_industry = strategy_industry.get_pairs()

print(f"\nTop 5 pairs (Industry method):")
for i, pair in enumerate(pairs_industry[:5], 1):
    print(f"{i}. {pair}")
print(f"\nTotal pairs formed: {len(pairs_industry)}")

### Method 3: Zero Crossings Method

In [None]:
# Initialize strategy for zero crossings method
strategy_zero_crossing = basic_distance_approach.DistanceStrategy()

# Form pairs using zero crossings method
print("Forming pairs using Zero Crossings method...")
strategy_zero_crossing.form_pairs(train_data, method='zero_crossing', 
                                  industry_dict=industry_dict, num_top=20)

# Get results
pairs_zero_crossing = strategy_zero_crossing.get_pairs()
num_crossings = strategy_zero_crossing.get_num_crossing()

print(f"\nTop 5 pairs (Zero Crossings method):")
for i, pair in enumerate(pairs_zero_crossing[:5], 1):
    crossings = num_crossings.get(pair, 0)
    print(f"{i}. {pair} (crossings: {crossings})")
print(f"\nTotal pairs formed: {len(pairs_zero_crossing)}")

### Method 4: Variance (Standard Deviation) Method

In [None]:
# Initialize strategy for variance method
strategy_variance = basic_distance_approach.DistanceStrategy()

# Form pairs using variance method
print("Forming pairs using Variance method...")
strategy_variance.form_pairs(train_data, method='variance',
                             industry_dict=industry_dict, num_top=20)

# Get results
pairs_variance = strategy_variance.get_pairs()
historical_std_variance = strategy_variance.train_std

print(f"\nTop 5 pairs (Variance method):")
for i, pair in enumerate(pairs_variance[:5], 1):
    std = historical_std_variance.get(pair, 0)
    print(f"{i}. {pair} (std: {std:.4f})")
print(f"\nTotal pairs formed: {len(pairs_variance)}")
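
The variance method favors pairs whose spread was more volatile during the formation period. A minimal sketch of that ranking on toy data follows, assuming the candidates come from a pool of already-close pairs. The pool, the values, and all variable names are illustrative and do not describe arbitragelab internals.

```python
import pandas as pd

# Hypothetical normalized formation-period prices
normalized = pd.DataFrame({
    "A": [0.0, 0.5, 1.0, 0.5],
    "B": [0.0, 0.4, 1.0, 0.6],
    "C": [0.0, 0.8, 1.0, 0.2],
})

# Hypothetical candidate pool, e.g. the pairs with the smallest SSD
candidates = [("A", "B"), ("A", "C")]

# Historical standard deviation of each pair's spread
spread_std = {
    (first, second): float((normalized[first] - normalized[second]).std())
    for first, second in candidates
}

# Variance-style ranking: most volatile spread first
ranked = sorted(spread_std, key=spread_std.get, reverse=True)
print(ranked)
```

A wider spread means larger deviations from zero, which is why this criterion is associated with larger potential profit per trade.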

## Step 3: Compare Pairs Selected by Each Method

In [None]:
# Create comparison dataframe
comparison_df = pd.DataFrame({
    'Basic': pd.Series(pairs_basic),
    'Industry': pd.Series(pairs_industry),
    'Zero_Crossings': pd.Series(pairs_zero_crossing),
    'Variance': pd.Series(pairs_variance)
})

print("Pairs selected by each method:")
print(comparison_df.head(10))

# Count unique pairs across all methods
all_pairs = set(pairs_basic + pairs_industry + pairs_zero_crossing + pairs_variance)
print(f"\nTotal unique pairs across all methods: {len(all_pairs)}")

## Step 4: Phase 2 - Trading Signal Generation

Generate trading signals for all 4 methods using the test data.

**Trading Rules:**
- If portfolio value > 2 std → Sell signal (-1)
- If portfolio value < -2 std → Buy signal (+1)
- Close position when portfolio crosses zero
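
These rules amount to a small state machine. The sketch below is one possible reading of them, written from the rules as stated rather than from the library source; `generate_signals` is a hypothetical helper, and closing on the same day the crossing is observed is an assumption.

```python
import pandas as pd


def generate_signals(portfolio: pd.Series, train_std: float, divergence: float = 2.0) -> pd.Series:
    """Return a position series: +1 (long), -1 (short), or 0 (flat)."""
    positions = []
    position = 0
    previous = None
    for value in portfolio:
        # Close an open position once the portfolio crosses zero
        if position != 0 and previous is not None and previous * value <= 0:
            position = 0
        # Open a position only when flat and the threshold is breached
        if position == 0:
            if value > divergence * train_std:
                position = -1
            elif value < -divergence * train_std:
                position = 1
        positions.append(position)
        previous = value
    return pd.Series(positions, index=portfolio.index)


# Hypothetical portfolio values and formation-period standard deviation
portfolio = pd.Series([0.0, 0.5, 1.2, 0.8, -0.1, -1.5, -0.4, 0.2])
signals = generate_signals(portfolio, train_std=0.5)
print(signals.tolist())  # [0, 0, -1, -1, 0, 1, 1, 0]
```

With `train_std=0.5` the entry threshold is 1.0: the short opens at 1.2, closes when the value turns negative, and the long opens at -1.5 and closes when the value turns positive.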

In [None]:
# Generate trading signals for all methods
divergence = 2  # 2 standard deviations threshold

print("Generating trading signals for all methods...")
print(f"Threshold: {divergence} standard deviations\n")

# Signals take the values -1/0/+1, so non-zero entries are counted with abs();
# a plain sum would net long signals against short signals

# Basic method
strategy_basic.trade_pairs(test_data, divergence=divergence)
signals_basic = strategy_basic.get_signals()
portfolios_basic = strategy_basic.get_portfolios()
print(f"Basic method: {len(signals_basic.columns)} pairs, {signals_basic.abs().sum().sum():.0f} non-zero signals")

# Industry method
strategy_industry.trade_pairs(test_data, divergence=divergence)
signals_industry = strategy_industry.get_signals()
portfolios_industry = strategy_industry.get_portfolios()
print(f"Industry method: {len(signals_industry.columns)} pairs, {signals_industry.abs().sum().sum():.0f} non-zero signals")

# Zero crossings method
strategy_zero_crossing.trade_pairs(test_data, divergence=divergence)
signals_zero = strategy_zero_crossing.get_signals()
portfolios_zero = strategy_zero_crossing.get_portfolios()
print(f"Zero Crossings method: {len(signals_zero.columns)} pairs, {signals_zero.abs().sum().sum():.0f} non-zero signals")

# Variance method
strategy_variance.trade_pairs(test_data, divergence=divergence)
signals_variance = strategy_variance.get_signals()
portfolios_variance = strategy_variance.get_portfolios()
print(f"Variance method: {len(signals_variance.columns)} pairs, {signals_variance.abs().sum().sum():.0f} non-zero signals")

## Step 5: Phase 3 - Calculate Portfolio Returns

Calculate returns for each method using equal-weighted portfolios.

In [None]:
def calculate_portfolio_returns(strategy, test_data, signals):
    """
    Calculate portfolio returns for a given strategy.
    Uses equal weighting (50% long, 50% short for each pair).
    """
    pairs = strategy.get_pairs()
    
    # Calculate daily returns
    test_data_returns = (test_data / test_data.shift(1) - 1)[1:]
    
    # Store individual pair returns and total portfolio return
    pair_returns = {}
    total_return = pd.Series(0.0, index=test_data_returns.index)
    
    for pair in pairs:
        first_stock, second_stock = pair
        
        # Equal weighted portfolio (50% each leg)
        pair_return = (test_data_returns[first_stock] * 0.5 - 
                       test_data_returns[second_stock] * 0.5)
        
        # Apply trading signals (shift by 1 to avoid look-ahead); days with
        # no signal or missing prices contribute a zero return instead of NaN
        pair_return = (pair_return * signals[str(pair)].shift(1)).fillna(0.0)
        
        # Calculate cumulative returns
        pair_cumret = (pair_return + 1).cumprod()
        
        # Store final return for this pair
        pair_returns[pair] = pair_cumret.iloc[-1] - 1
        
        # Add to total portfolio
        total_return = total_return.add(pair_cumret, fill_value=0)
    
    # Equal weight across all pairs
    total_return = total_return / len(pairs)
    
    return pair_returns, total_return

# Calculate returns for each method
print("Calculating portfolio returns...\n")

pair_returns_basic, portfolio_basic = calculate_portfolio_returns(
    strategy_basic, test_data, signals_basic)
print(f"Basic Method - Final Return: {portfolio_basic.iloc[-1] - 1:.4f}")

pair_returns_industry, portfolio_industry = calculate_portfolio_returns(
    strategy_industry, test_data, signals_industry)
print(f"Industry Method - Final Return: {portfolio_industry.iloc[-1] - 1:.4f}")

pair_returns_zero, portfolio_zero = calculate_portfolio_returns(
    strategy_zero_crossing, test_data, signals_zero)
print(f"Zero Crossings Method - Final Return: {portfolio_zero.iloc[-1] - 1:.4f}")

pair_returns_variance, portfolio_variance = calculate_portfolio_returns(
    strategy_variance, test_data, signals_variance)
print(f"Variance Method - Final Return: {portfolio_variance.iloc[-1] - 1:.4f}")
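
The one-day signal lag used in `calculate_portfolio_returns` can be checked on a toy pair. The prices and signals below are hypothetical; the sketch only shows that a signal observed on day t is applied to the return of day t + 1.

```python
import pandas as pd

# Hypothetical prices for one pair and a position series on the same dates
prices = pd.DataFrame({"X": [100.0, 110.0, 121.0], "Y": [50.0, 50.0, 55.0]})
signals = pd.Series([1, 1, 0])  # long the pair after day 0 and day 1

daily_returns = prices.pct_change()

# 50% long the first leg, 50% short the second leg
pair_return = 0.5 * daily_returns["X"] - 0.5 * daily_returns["Y"]

# A signal observed at the close of day t earns the return of day t + 1
strategy_return = (pair_return * signals.shift(1)).fillna(0.0)
equity_curve = (1 + strategy_return).cumprod()
print(equity_curve.tolist())
```

On day 1 the long leg gains 10% while the short leg is flat, so the pair earns 5%. On day 2 both legs gain 10% and the pair return is zero.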

## Step 6: Plot Equity Curves for All Methods

In [None]:
# Plot all equity curves together
plt.figure(figsize=(14, 8))

# Subtract 1 to plot cumulative returns rather than equity levels
(portfolio_basic - 1).plot(label='Basic Method', linewidth=2)
(portfolio_industry - 1).plot(label='Industry Method', linewidth=2)
(portfolio_zero - 1).plot(label='Zero Crossings Method', linewidth=2)
(portfolio_variance - 1).plot(label='Variance Method', linewidth=2)

plt.title('Basic Distance Approach - Equity Curves Comparison\nForm: Jan-Dec 2018 | Trade: Jan-Jul 2019',
          fontsize=14, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Portfolio Return', fontsize=12)
plt.legend(loc='best', fontsize=11)
plt.grid(True, alpha=0.3)
plt.axhline(y=0, color='black', linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.show()

# Print summary statistics
print("\n" + "="*60)
print("FINAL PORTFOLIO RETURNS SUMMARY")
print("="*60)
print(f"Basic Method:          {portfolio_basic.iloc[-1] - 1:>7.2%}")
print(f"Industry Method:       {portfolio_industry.iloc[-1] - 1:>7.2%}")
print(f"Zero Crossings Method: {portfolio_zero.iloc[-1] - 1:>7.2%}")
print(f"Variance Method:       {portfolio_variance.iloc[-1] - 1:>7.2%}")
print("="*60)

## Step 7: Individual Plot for Each Method

In [None]:
# Plot individual equity curves
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
axes = axes.flatten()

methods = [
    ('Basic Method', portfolio_basic),
    ('Industry Method', portfolio_industry),
    ('Zero Crossings Method', portfolio_zero),
    ('Variance Method', portfolio_variance)
]

for ax, (name, portfolio) in zip(axes, methods):
    (portfolio - 1).plot(ax=ax, linewidth=2)
    ax.set_title(f'{name}\nFinal Return: {portfolio.iloc[-1] - 1:.2%}', 
                 fontsize=12, fontweight='bold')
    ax.set_xlabel('Date')
    ax.set_ylabel('Return')
    ax.grid(True, alpha=0.3)
    ax.axhline(y=0, color='black', linestyle='--', linewidth=0.5)

plt.suptitle('Basic Distance Approach - Individual Method Equity Curves',
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Step 8: Summary Statistics Table

In [None]:
# Create summary statistics
def calculate_stats(portfolio, name):
    """Summarize an equity curve (starting near 1.0) for one selection method."""
    # Daily simple returns of the equity curve
    returns = portfolio.pct_change().dropna()
    cumret = portfolio.iloc[-1] - 1
    
    # Max drawdown: largest decline of the equity curve relative to its running peak
    rolling_max = portfolio.expanding().max()
    drawdown = portfolio / rolling_max - 1
    max_dd = drawdown.min()
    
    # Sharpe ratio (annualized with 252 trading days, zero risk-free rate)
    daily_returns = returns
    sharpe = daily_returns.mean() / daily_returns.std() * np.sqrt(252) if daily_returns.std() > 0 else 0
    
    return {
        'Method': name,
        'Final Return': f'{cumret:.2%}',
        'Max Drawdown': f'{max_dd:.2%}',
        'Sharpe Ratio': f'{sharpe:.2f}',
        'Win Rate': f'{(returns > 0).sum() / len(returns):.1%}'
    }

stats_df = pd.DataFrame([
    calculate_stats(portfolio_basic, 'Basic'),
    calculate_stats(portfolio_industry, 'Industry'),
    calculate_stats(portfolio_zero, 'Zero Crossings'),
    calculate_stats(portfolio_variance, 'Variance')
])

print("\n" + "="*70)
print("SUMMARY STATISTICS - ALL METHODS")
print("="*70)
print(stats_df.to_string(index=False))
print("="*70)
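
The drawdown and Sharpe calculations can be verified by hand on a short equity curve. The values below are hypothetical, and the annualization assumes 252 trading days and a zero risk-free rate.

```python
import numpy as np
import pandas as pd

# Hypothetical equity curve that starts at 1.0
equity = pd.Series([1.00, 1.10, 0.99, 1.05, 1.21])

# Simple daily returns of the equity curve
daily_returns = equity.pct_change().dropna()

# Drawdown relative to the running peak; its minimum is the max drawdown
drawdown = equity / equity.cummax() - 1
max_drawdown = drawdown.min()

# Annualized Sharpe ratio, assuming 252 trading days and a zero risk-free rate
sharpe = daily_returns.mean() / daily_returns.std() * np.sqrt(252)

print(f"Max drawdown: {max_drawdown:.2%}")
print(f"Sharpe ratio: {sharpe:.2f}")
```

The peak of 1.10 is followed by a trough of 0.99, so the maximum drawdown is 0.99 / 1.10 - 1 = -10%.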

## Conclusion

This notebook implemented the **Basic Distance Approach** for pairs trading following the Hudson & Thames workshop methodology.

### Key Findings:

1. **Basic Method**: Selects the pairs with the smallest sum of squared differences (SSD) - the simplest criterion, but very close pairs may diverge too rarely or too little to offer profitable trades

2. **Industry Method**: Restricts pairs to same industry - adds fundamental relationship constraint

3. **Zero Crossings Method**: Selects pairs with frequent divergences/convergences - theoretically captures more trading opportunities

4. **Variance Method**: Selects high-volatility pairs - aims for larger profit opportunities

### Key Formulas Used:

- **Normalization**: $P_{normalized} = \frac{P - \min(P)}{\max(P) - \min(P)}$

- **Euclidean SSD**: $SSD = \sum^{N}_{t=1} (P^1_t - P^2_t)^{2}$

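- **Trading Signal**: If portfolio > 2σ → Sell (-1); if portfolio < -2σ → Buy (+1); close when the portfolio crosses zero. Here σ is the portfolio's historical standard deviation from the formation period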

- **Zero Crossings**: Number of times normalized spread crosses zero
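
One plausible way to count zero crossings is to look for sign changes between consecutive spread values. The sketch below uses a hypothetical spread and does not count values that are exactly zero; the library's own counting rule may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized spread between two stocks
spread = pd.Series([0.2, 0.1, -0.1, -0.3, 0.05, 0.2, -0.02])

# A crossing occurs whenever consecutive values have opposite signs
signs = np.sign(spread.to_numpy())
num_crossings = int((signs[1:] * signs[:-1] < 0).sum())
print(num_crossings)  # 3
```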

### References:
- Gatev, E., Goetzmann, W.N., Rouwenhorst, K.G. (2006) - Pairs Trading: Performance of a Relative Value Arbitrage Rule
- Do, B., Faff, R. (2010) - Does Simple Pairs Trading Still Work?
- Krauss, C. (2017) - Statistical Arbitrage Pairs Trading Strategies: Review and Outlook

---
*Note: Results shown are for the specific period (Jan 2018 - Jul 2019). Performance may vary for different time periods and market conditions.*