# Day 4: Backtesting Best Practices

## Week 12 - Backtesting & Validation

### 🎯 Learning Objectives
- Implement point-in-time data handling
- Avoid survivorship bias
- Build sanity checks into backtests
- Create a production-ready backtesting framework

### ⏱️ Time Allocation
- Theory review: 30 min
- Guided exercises: 90 min
- Practice problems: 60 min
- Interview prep: 30 min

---

**Author**: ML Quant Finance Mastery  
**Difficulty**: Intermediate  
**Prerequisites**: Days 1-3

## 1. Setup and Data Loading

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import yfinance as yf
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

# Download market data
print("📥 Downloading market data...")
tickers = ['SPY', 'AAPL', 'MSFT', 'GOOGL', 'JPM']
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)

data = yf.download(tickers, start=start_date, end=end_date, progress=False, auto_adjust=True)
prices = data['Close'].dropna()
returns = prices.pct_change().dropna()

print(f"✅ Loaded {len(prices)} days of data")
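
Before any strategy logic runs, it can help to confirm that the loaded price table is structurally sound. The sketch below is one possible set of checks, not a complete validation suite. The function name `check_price_frame`, the 50% daily-move threshold, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def check_price_frame(prices, max_abs_return=0.5):
    """Return a list of human-readable problems found in a price DataFrame."""
    problems = []
    if not prices.index.is_monotonic_increasing:
        problems.append("index is not sorted in time order")
    if prices.index.has_duplicates:
        problems.append("index contains duplicate dates")
    if prices.isna().any().any():
        problems.append("missing values present")
    if (prices <= 0).any().any():
        problems.append("non-positive prices present")
    # fill_method=None keeps gaps as NaN instead of forward-filling them
    rets = prices.pct_change(fill_method=None).iloc[1:]
    if (rets.abs() > max_abs_return).any().any():
        problems.append(f"daily move larger than {max_abs_return:.0%}")
    return problems

# Toy frame (hypothetical data) with a duplicated date and a missing price
toy = pd.DataFrame(
    {"AAA": [10.0, 10.1, np.nan, 10.3]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-03", "2024-01-04"]),
)
problems = check_price_frame(toy)
print(problems)
```

On the real download, `check_price_frame(prices)` could be called right after `prices` is built. An empty list means none of these particular checks fired.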

## 2. Point-in-Time Data Handling

### The Problem

Many data sources serve the latest **revised** values, not the values that were known on the original release date.

**Examples:**
- GDP revised months after initial release
- Earnings restated for accounting changes
- Stock prices adjusted for future splits

### The Solution

Use data as it existed at decision time.

In [None]:
class PointInTimeData:
    """
    Ensures data integrity for backtesting
    
    Prevents lookahead bias by tracking when data was available
    """
    
    def __init__(self, df, date_col='date', value_col='value', 
                 available_col='available_date'):
        """
        Parameters:
        -----------
        df : DataFrame with columns for date, value, and when it became available
        """
        self.df = df.copy()
        self.date_col = date_col
        self.value_col = value_col
        self.available_col = available_col
    
    def get_as_of(self, as_of_date):
        """
        Get data as it was known on a specific date
        
        Only returns data that was available by as_of_date
        """
        mask = self.df[self.available_col] <= as_of_date
        available_data = self.df[mask].copy()
        
        # Get most recent value for each underlying date
        available_data = available_data.sort_values(self.available_col)
        available_data = available_data.drop_duplicates(
            subset=[self.date_col], 
            keep='last'
        )
        
        return available_data.set_index(self.date_col)[self.value_col]

# Simulate point-in-time data (GDP example)
dates = pd.date_range('2020-01-01', periods=12, freq='Q')
pit_data = []

for i, date in enumerate(dates):
    # Initial release: available 1 month after quarter end
    initial_value = 100 + i * 2 + np.random.randn() * 0.5
    pit_data.append({
        'date': date,
        'value': initial_value,
        'available_date': date + pd.Timedelta(days=30),
        'revision': 'initial'
    })
    
    # First revision: 2 months after
    revised_value = initial_value + np.random.randn() * 0.3
    pit_data.append({
        'date': date,
        'value': revised_value,
        'available_date': date + pd.Timedelta(days=60),
        'revision': 'first'
    })
    
    # Final revision: 3 months after
    final_value = revised_value + np.random.randn() * 0.1
    pit_data.append({
        'date': date,
        'value': final_value,
        'available_date': date + pd.Timedelta(days=90),
        'revision': 'final'
    })

pit_df = pd.DataFrame(pit_data)
pit_handler = PointInTimeData(pit_df)

# Demonstrate difference
query_date = pd.Timestamp('2020-07-01')
print("📊 POINT-IN-TIME DATA DEMONSTRATION")
print("=" * 60)
print(f"Query date: {query_date}")
print("\nData as known on query date:")
print(pit_handler.get_as_of(query_date))
print("\n⚠️ Using final revisions in backtest = LOOKAHEAD BIAS!")
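
In a backtest loop the same idea usually has to be applied to many decision dates at once. One way to do that, sketched here with a small hypothetical release log, is pandas' `merge_asof`, which attaches to each decision date the most recent row published on or before it. The column names mirror the ones used in `PointInTimeData`. The values and dates are made up.

```python
import pandas as pd

# Hypothetical release log: one row per (reference period, release)
releases = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-31", "2020-03-31", "2020-06-30"]),
    "value": [100.0, 100.4, 102.1],
    "available_date": pd.to_datetime(["2020-04-30", "2020-05-30", "2020-07-30"]),
}).sort_values("available_date")

# Dates on which the strategy makes a decision (must be sorted for merge_asof)
decisions = pd.DataFrame({
    "decision_date": pd.to_datetime(
        ["2020-04-15", "2020-05-15", "2020-06-15", "2020-08-03"]
    )
})

# For each decision date, take the latest release published on or before it
as_known = pd.merge_asof(
    decisions,
    releases,
    left_on="decision_date",
    right_on="available_date",
    direction="backward",
)
print(as_known[["decision_date", "date", "value"]])
```

The first decision date has no value because nothing had been published yet. Leaving that gap as missing is safer than filling it with a figure released later.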

## 3. Avoiding Survivorship Bias

In [None]:
def simulate_survivorship_bias():
    """
    Demonstrate survivorship bias in stock selection
    """
    np.random.seed(42)
    n_stocks = 100
    n_days = 252 * 5  # 5 years
    
    # Generate returns for all stocks
    all_returns = np.random.randn(n_stocks, n_days) * 0.02
    
    # Some stocks will go bankrupt (cumulative return < -90%)
    cumulative = np.cumprod(1 + all_returns, axis=1)
    
    # Mark stocks that survive (never dropped below 10% of initial value)
    survivors = np.all(cumulative > 0.1, axis=1)
    
    # Calculate average return WITH vs WITHOUT survivorship bias
    # Biased: only survivors
    survivor_returns = all_returns[survivors, :]
    avg_survivor = np.mean(survivor_returns) * 252
    
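    # Unbiased: all stocks (total loss on the bankruptcy day, zero return afterwards)
    all_returns_adjusted = all_returns.copy()
    for i in range(n_stocks):
        if not survivors[i]:
            # Find the first day the stock fell below 10% of its initial value
            bankrupt_day = np.where(cumulative[i, :] < 0.1)[0]
            if len(bankrupt_day) > 0:
                first = bankrupt_day[0]
                all_returns_adjusted[i, first] = -1       # Total loss, booked once
                all_returns_adjusted[i, first + 1:] = 0   # No position after delisting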
    
    avg_all = np.mean(all_returns_adjusted) * 252
    
    return {
        'n_total': n_stocks,
        'n_survivors': survivors.sum(),
        'survival_rate': survivors.mean(),
        'avg_return_biased': avg_survivor,
        'avg_return_unbiased': avg_all,
        'bias': avg_survivor - avg_all
    }

results = simulate_survivorship_bias()

print("📊 SURVIVORSHIP BIAS DEMONSTRATION")
print("=" * 60)
print(f"Total stocks: {results['n_total']}")
print(f"Survivors: {results['n_survivors']} ({results['survival_rate']:.0%})")
print(f"\nBIASED return (survivors only): {results['avg_return_biased']:.1%}")
print(f"UNBIASED return (all stocks): {results['avg_return_unbiased']:.1%}")
print(f"BIAS: {results['bias']:.1%}")
print("\n⚠️ Ignoring failed companies inflates historical returns!")
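
In practice, avoiding survivorship bias means rebuilding the tradable universe as it stood on each rebalance date. The sketch below assumes a security master table with listing and delisting dates. The table, its column names, the tickers, and the helper `universe_as_of` are all hypothetical.

```python
import pandas as pd

# Hypothetical security master: NaT in 'delisted' means still trading
security_master = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "listed": pd.to_datetime(["2015-01-02", "2016-06-01", "2019-03-15"]),
    "delisted": pd.to_datetime(["2018-09-28", None, None]),
})

def universe_as_of(master, as_of_date):
    """Return tickers that were listed and not yet delisted on as_of_date."""
    as_of_date = pd.Timestamp(as_of_date)
    listed = master["listed"] <= as_of_date
    alive = master["delisted"].isna() | (master["delisted"] > as_of_date)
    return sorted(master.loc[listed & alive, "ticker"])

universe_2017 = universe_as_of(security_master, "2017-01-03")
universe_2020 = universe_as_of(security_master, "2020-01-02")
print(universe_2017)
print(universe_2020)
```

A backtest that selects from `universe_2017` on that date can still pick `AAA`, even though it later delists. That is the point: the strategy must be able to hold the losers it could not have known about.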

## 4. Sanity Checks for Backtests

In [None]:
class BacktestSanityChecker:
    """
    Validate backtest results for common issues
    """
    
    def __init__(self, returns, signals, transaction_costs=0.0010):
        self.returns = np.array(returns)
        self.signals = np.array(signals)
        self.costs = transaction_costs
        
        # Calculate strategy returns
        self.strategy_returns = self.signals * self.returns
        turnover = np.abs(np.diff(self.signals, prepend=0))
        self.net_returns = self.strategy_returns - turnover * self.costs
    
    def check_sharpe_ratio(self):
        """Check if Sharpe is suspiciously high"""
        sharpe = np.mean(self.net_returns) / np.std(self.net_returns) * np.sqrt(252)
        
        status = "✅ OK" if sharpe < 3 else "⚠️ SUSPICIOUS"
        return {
            'check': 'Sharpe Ratio',
            'value': f'{sharpe:.2f}',
            'threshold': '< 3.0',
            'status': status,
            'note': 'Sharpe > 3 is rare and may indicate overfitting'
        }
    
    def check_turnover(self):
        """Check if turnover is realistic"""
        daily_turnover = np.abs(np.diff(self.signals, prepend=0)).mean()
        annual_turnover = daily_turnover * 252
        
        status = "✅ OK" if annual_turnover < 50 else "⚠️ HIGH"
        return {
            'check': 'Annual Turnover',
            'value': f'{annual_turnover:.0%}',
            'threshold': '< 5000%',
            'status': status,
            'note': 'Very high turnover often destroys profits after costs'
        }
    
    def check_drawdown(self):
        """Check if drawdown is acceptable"""
        cumulative = (1 + pd.Series(self.net_returns)).cumprod()
        running_peak = cumulative.expanding().max()
        drawdown = (cumulative - running_peak) / running_peak
        max_dd = drawdown.min()
        
        annual_return = np.mean(self.net_returns) * 252
        calmar = annual_return / abs(max_dd) if max_dd != 0 else 0
        
        status = "✅ OK" if calmar > 0.5 else "⚠️ POOR"
        return {
            'check': 'Risk/Return (Calmar)',
            'value': f'{calmar:.2f}',
            'threshold': '> 0.5',
            'status': status,
            'note': 'Low Calmar suggests poor risk-adjusted returns'
        }
    
    def check_consistency(self):
        """Check if performance is consistent over time"""
        # Split into halves
        mid = len(self.net_returns) // 2
        sharpe_1 = np.mean(self.net_returns[:mid]) / np.std(self.net_returns[:mid]) * np.sqrt(252)
        sharpe_2 = np.mean(self.net_returns[mid:]) / np.std(self.net_returns[mid:]) * np.sqrt(252)
        
        ratio = sharpe_2 / sharpe_1 if sharpe_1 != 0 else 0
        
        status = "✅ OK" if 0.5 < ratio < 2.0 else "⚠️ INCONSISTENT"
        return {
            'check': 'Time Consistency',
            'value': f'{ratio:.2f}',
            'threshold': '0.5 - 2.0',
            'status': status,
            'note': 'Large variation suggests overfitting to specific period'
        }
    
    def run_all_checks(self):
        """Run all sanity checks"""
        checks = [
            self.check_sharpe_ratio(),
            self.check_turnover(),
            self.check_drawdown(),
            self.check_consistency()
        ]
        return checks

# Create a test strategy
spy_returns = returns['SPY'].values
momentum = pd.Series(spy_returns).rolling(20).mean()
# Lag by one day: the position held on day t may only use returns through day t-1
signal = np.sign(momentum.shift(1).values)
signal = np.nan_to_num(signal)

# Run sanity checks
checker = BacktestSanityChecker(spy_returns, signal, transaction_costs=0.0010)
results = checker.run_all_checks()

print("📊 BACKTEST SANITY CHECK REPORT")
print("=" * 70)
for check in results:
    print(f"\n{check['check']}")
    print(f"   Value: {check['value']} (threshold: {check['threshold']})")
    print(f"   Status: {check['status']}")
    print(f"   Note: {check['note']}")
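
The checks above look at summary statistics. A complementary test targets lookahead directly: delay the signal by extra days and see how fast performance decays. A Sharpe ratio that collapses after a single day of delay suggests the signal was using information from the day it trades on. The sketch below uses a deliberately leaky toy signal. The helper names `annualized_sharpe` and `lag_test` are assumptions for illustration, and the risk-free rate is taken as zero.

```python
import numpy as np

def annualized_sharpe(returns):
    """Annualized Sharpe ratio of daily returns (risk-free rate assumed zero)."""
    returns = np.asarray(returns, dtype=float)
    std = returns.std()
    return returns.mean() / std * np.sqrt(252) if std > 0 else 0.0

def lag_test(returns, signals, max_lag=3):
    """Sharpe of the strategy when the signal is delayed by 0..max_lag days."""
    returns = np.asarray(returns, dtype=float)
    signals = np.asarray(signals, dtype=float)
    out = {}
    for lag in range(max_lag + 1):
        # Delay the signal by `lag` days, holding no position for the first `lag` days
        lagged = np.concatenate([np.zeros(lag), signals[:len(signals) - lag]])
        out[lag] = annualized_sharpe(lagged * returns)
    return out

# Deliberately leaky toy signal: it peeks at the same-day return
rng = np.random.default_rng(0)
toy_returns = rng.normal(0.0, 0.01, size=1000)
leaky_signal = np.sign(toy_returns)

sharpe_by_lag = lag_test(toy_returns, leaky_signal)
for lag, sharpe in sharpe_by_lag.items():
    print(f"lag={lag}: Sharpe={sharpe:.2f}")
```

A genuine slow-moving signal such as 20-day momentum should degrade gradually with lag rather than fall off a cliff.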

## 5. Production-Ready Backtest Framework

In [None]:
class ProductionBacktester:
    """
    Production-ready backtesting framework
    
    Features:
    - Point-in-time data handling
    - Transaction costs
    - Walk-forward validation
    - Comprehensive metrics
    - Sanity checks
    """
    
    def __init__(self, prices, cost_bps=10, slippage_bps=2):
        self.prices = prices
        self.returns = prices.pct_change().dropna()
        self.cost = (cost_bps + slippage_bps) / 10000
        
    def generate_signals(self, signal_func, **kwargs):
        """Generate trading signals using provided function"""
        self.signals = signal_func(self.returns, **kwargs)
        return self.signals
    
    def calculate_returns(self):
        """Calculate strategy returns with costs"""
        # Gross returns
        self.gross_returns = self.signals * self.returns.values
        
        # Turnover and costs
        self.turnover = np.abs(np.diff(self.signals, prepend=0))
        self.costs = self.turnover * self.cost
        
        # Net returns
        self.net_returns = self.gross_returns - self.costs
        
        return self.net_returns
    
    def calculate_metrics(self):
        """Calculate comprehensive metrics"""
        net = pd.Series(self.net_returns)
        
        # Returns
        total_return = (1 + net).prod() - 1
        n_years = len(net) / 252
        annual_return = (1 + total_return) ** (1/n_years) - 1
        annual_vol = net.std() * np.sqrt(252)
        
        # Risk metrics
        sharpe = net.mean() / net.std() * np.sqrt(252) if net.std() > 0 else 0
        
        downside = net[net < 0].std()
        sortino = net.mean() / downside * np.sqrt(252) if downside > 0 else 0
        
        # Drawdown
        cumulative = (1 + net).cumprod()
        running_peak = cumulative.expanding().max()
        drawdown = (cumulative - running_peak) / running_peak
        max_dd = drawdown.min()
        
        calmar = annual_return / abs(max_dd) if max_dd != 0 else 0
        
        # Trading metrics
        win_rate = (net > 0).mean()
        avg_turnover = self.turnover.sum() / (len(net) / 252)
        total_costs = self.costs.sum()
        
        self.metrics = {
            'Total Return': f'{total_return:.2%}',
            'Annual Return': f'{annual_return:.2%}',
            'Annual Vol': f'{annual_vol:.2%}',
            'Sharpe': f'{sharpe:.2f}',
            'Sortino': f'{sortino:.2f}',
            'Max Drawdown': f'{max_dd:.2%}',
            'Calmar': f'{calmar:.2f}',
            'Win Rate': f'{win_rate:.2%}',
            'Annual Turnover': f'{avg_turnover:.0%}',
            'Total Cost Drag': f'{total_costs:.2%}'
        }
        
        return self.metrics
    
    def run_sanity_checks(self):
        """Run all sanity checks"""
        checker = BacktestSanityChecker(self.returns.values, self.signals, self.cost)
        return checker.run_all_checks()
    
    def generate_report(self):
        """Generate comprehensive backtest report"""
        print("=" * 70)
        print("                    BACKTEST REPORT")
        print("=" * 70)
        
        print("\n📊 PERFORMANCE METRICS")
        print("-" * 50)
        for metric, value in self.metrics.items():
            print(f"   {metric:<20}: {value}")
        
        print("\n🔍 SANITY CHECKS")
        print("-" * 50)
        checks = self.run_sanity_checks()
        for check in checks:
            print(f"   {check['check']:<20}: {check['status']}")
        
        print("\n" + "=" * 70)

# Example usage
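def momentum_signal(returns, lookback=20):
    """Simple momentum signal, lagged one day to avoid lookahead"""
    momentum = pd.Series(returns).rolling(lookback).mean()
    # The position held on day t may only use returns through day t-1
    signal = np.sign(momentum.shift(1).values)
    return np.nan_to_num(signal)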

# Run backtest
backtester = ProductionBacktester(prices['SPY'], cost_bps=10, slippage_bps=2)
backtester.generate_signals(momentum_signal, lookback=20)
backtester.calculate_returns()
backtester.calculate_metrics()
backtester.generate_report()
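
A single cost assumption can hide how fragile a strategy is. One simple way to probe this is to recompute the net return over a grid of cost levels and to find the break-even cost at which gross profit is fully consumed. The sketch below uses the same turnover convention as `ProductionBacktester`. The function name `cost_sensitivity`, the cost grid, and the toy data are assumptions.

```python
import numpy as np

def cost_sensitivity(returns, signals, cost_bps_grid=(0, 5, 10, 20, 50)):
    """Annualized net return per cost level, plus the break-even cost in bps."""
    returns = np.asarray(returns, dtype=float)
    signals = np.asarray(signals, dtype=float)
    gross = signals * returns
    turnover = np.abs(np.diff(signals, prepend=0))
    annual_net = {
        bps: (gross - turnover * bps / 10000).mean() * 252
        for bps in cost_bps_grid
    }
    total_turnover = turnover.sum()
    # Cost per unit of turnover that drives total net profit to zero
    # (negative if the strategy already loses money before costs)
    breakeven_bps = (
        gross.sum() / total_turnover * 10000 if total_turnover > 0 else float("inf")
    )
    return annual_net, breakeven_bps

# Toy inputs (hypothetical): drifting returns, position flips every 10 days
rng = np.random.default_rng(1)
toy_returns = rng.normal(0.0005, 0.01, size=500)
toy_signals = np.where(np.arange(500) // 10 % 2 == 0, 1.0, -1.0)

annual_net, breakeven_bps = cost_sensitivity(toy_returns, toy_signals)
for bps, value in annual_net.items():
    print(f"{bps:>3} bps: annualized net return {value:.2%}")
print(f"Break-even cost: {breakeven_bps:.1f} bps")
```

If the break-even cost sits close to the assumed cost, small errors in the cost estimate are enough to flip the sign of the result.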

## 6. ⏱️ TIMED CODING CHALLENGE (30 minutes)

**Challenge:** Extend the `ProductionBacktester` to include:
1. Walk-forward validation
2. Multiple asset support
3. Portfolio-level metrics
4. Visualization dashboard

In [None]:
# YOUR CODE HERE
# Extend ProductionBacktester class

## 7. Interview Question of the Day

**Q: What are the key differences between backtesting for research vs production deployment?**

Think about:
1. Data handling requirements
2. Execution assumptions
3. Risk monitoring
4. Code quality standards

In [None]:
print("📊 RESEARCH vs PRODUCTION BACKTESTING")
print("=" * 70)

comparison = {
    'Aspect': ['Data', 'Execution', 'Costs', 'Validation', 'Code'],
    'Research': [
        'May use final revisions',
        'Assume fill at close',
        'Often ignored',
        'In-sample OK for exploration',
        'Notebooks, quick iteration'
    ],
    'Production': [
        'Point-in-time only',
        'VWAP/TWAP, partial fills',
        'Conservative estimates',
        'Strict walk-forward',
        'Tested, version controlled'
    ]
}

print(pd.DataFrame(comparison).to_string(index=False))

## 8. Key Takeaways

| Practice | Description |
|----------|-------------|
| Point-in-Time | Use data as known at decision time |
| No Survivorship | Include delisted/failed stocks |
| Sanity Checks | Sharpe < 3, consistent over time |
| Realistic Costs | Commission + spread + slippage + impact |
| Walk-Forward | Never shuffle time series |
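
The walk-forward rule in the table can be sketched as an index generator: every training window ends before its test window starts, and windows only move forward in time. This is a minimal rolling-window version under assumed parameter names (`train_size`, `test_size`, `gap`). Libraries such as scikit-learn offer a comparable `TimeSeriesSplit` utility.

```python
import numpy as np

def walk_forward_splits(n_samples, train_size, test_size, gap=0):
    """Yield (train_idx, test_idx) pairs in chronological order.

    Each test window starts `gap` observations after its training window ends,
    and windows roll forward by `test_size` so test sets never overlap.
    """
    start = 0
    while start + train_size + gap + test_size <= n_samples:
        train_end = start + train_size
        test_start = train_end + gap
        yield (
            np.arange(start, train_end),
            np.arange(test_start, test_start + test_size),
        )
        start += test_size

splits = list(walk_forward_splits(n_samples=1000, train_size=500, test_size=100, gap=1))
for train_idx, test_idx in splits:
    print(f"train {train_idx[0]}-{train_idx[-1]}  test {test_idx[0]}-{test_idx[-1]}")
```

The `gap` leaves a buffer between training and testing, which can matter when features or labels span several days.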

---

**Tomorrow:** Common Pitfalls & Overfitting