# Day 3: Hypothesis Testing for Trading
## Week 2: Statistics & Probability for Finance

---

**Learning Objectives:**
- Understand null/alternative hypotheses in trading context
- Apply t-tests, chi-square tests, and A/B testing
- Calculate statistical significance and p-values
- Test trading strategy performance

In [None]:
# Day 3 Setup: Hypothesis Testing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

plt.style.use('seaborn-v0_8-whitegrid')

# Load market data
df = pd.read_csv('../datasets/raw_data/combined_adjusted_close.csv', 
                 index_col='Date', parse_dates=True)
prices = df[['AAPL', 'MSFT', 'SPY', 'JPM']].dropna()
returns = prices.pct_change().dropna()

print("=" * 60)
print("HYPOTHESIS TESTING FOR TRADING - DAY 3")
print("=" * 60)

## 1. Hypothesis Testing Framework

**Key Concepts:**
- **H₀ (Null Hypothesis)**: Default assumption (e.g., strategy has no edge)
- **H₁ (Alternative Hypothesis)**: The claim we seek evidence for (e.g., strategy beats market)
- **p-value**: Probability of seeing data at least this extreme if H₀ is true
- **α (Significance Level)**: Threshold for rejecting H₀, typically 0.05
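
To make these definitions concrete, here is a minimal sketch that computes a one-sample t-statistic and two-sided p-value by hand and cross-checks them against SciPy. The returns are synthetic (drawn from a normal distribution with an assumed mean and volatility), so the numbers are illustrative only.

```python
import numpy as np
from scipy import stats

# Synthetic daily returns (assumption: illustrative data, not market data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0005, scale=0.01, size=500)

n = sample.size
mean = sample.mean()
std_err = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# t-statistic for H0: mu = 0
t_manual = (mean - 0.0) / std_err

# Two-sided p-value: probability mass in both tails beyond |t| under H0
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Cross-check against SciPy's built-in test
t_scipy, p_scipy = stats.ttest_1samp(sample, 0.0)

alpha = 0.05
reject_h0 = p_manual < alpha
print(f"t (manual) = {t_manual:.4f}, t (scipy) = {t_scipy:.4f}")
print(f"p (manual) = {p_manual:.6f}, p (scipy) = {p_scipy:.6f}")
print(f"Reject H0 at alpha={alpha}: {reject_h0}")
```

The decision rule is just the comparison `p_manual < alpha`; the p-value itself does not say how large or economically meaningful the effect is.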

In [None]:
# Test: Is SPY's mean return significantly different from zero?
print("=" * 60)
print("TEST 1: Is SPY Mean Return Different from Zero?")
print("=" * 60)

spy_returns = returns['SPY'].values

# One-sample t-test
# H0: μ = 0 (mean return is zero)
# H1: μ ≠ 0 (mean return is not zero)
t_stat, p_value = stats.ttest_1samp(spy_returns, 0)

print(f"\nSample mean: {np.mean(spy_returns):.6f}")
print(f"Sample std:  {np.std(spy_returns, ddof=1):.6f}")
print(f"n = {len(spy_returns)}")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.6f}")
print(f"\nConclusion (α=0.05): {'Reject H₀ - Mean is significantly different from 0' if p_value < 0.05 else 'Fail to reject H₀'}")

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(spy_returns, bins=60, density=True, alpha=0.7, color='steelblue', edgecolor='white')
ax.axvline(0, color='red', lw=2, linestyle='--', label='H₀: μ = 0')
ax.axvline(np.mean(spy_returns), color='green', lw=2, label=f'Sample Mean: {np.mean(spy_returns):.5f}')
ax.set_xlabel('Daily Returns', fontsize=11)
ax.set_ylabel('Density', fontsize=11)
ax.set_title(f'SPY Returns Distribution\nt-stat={t_stat:.2f}, p={p_value:.4f}', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

## 2. Two-Sample Tests - Comparing Strategies

In [None]:
# Test: Does AAPL outperform SPY?
print("=" * 60)
print("TEST 2: Does AAPL Outperform SPY?")
print("=" * 60)

aapl_returns = returns['AAPL'].values
spy_returns = returns['SPY'].values

print(f"\nAAPL Mean: {np.mean(aapl_returns)*252:.2%} (annualized)")
print(f"SPY Mean:  {np.mean(spy_returns)*252:.2%} (annualized)")

# Independent two-sample t-test
t_stat, p_value = stats.ttest_ind(aapl_returns, spy_returns)
print(f"\nIndependent t-test (two-tailed):")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value:     {p_value:.6f}")

# One-tailed test: AAPL > SPY
p_one_tailed = p_value / 2 if t_stat > 0 else 1 - p_value/2
print(f"\nOne-tailed test (AAPL > SPY):")
print(f"  p-value:     {p_one_tailed:.6f}")
print(f"  Conclusion:  {'AAPL significantly outperforms SPY' if p_one_tailed < 0.05 else 'Cannot conclude AAPL outperforms'}")

# Paired t-test (more appropriate since same time period)
t_stat_paired, p_value_paired = stats.ttest_rel(aapl_returns, spy_returns)
print(f"\nPaired t-test (same time periods):")
print(f"  t-statistic: {t_stat_paired:.4f}")
print(f"  p-value:     {p_value_paired:.6f}")
print(f"  Conclusion:  {'Significant difference' if p_value_paired < 0.05 else 'No significant difference'}")
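
Why the paired test is usually the better fit here: a paired t-test is the same thing as a one-sample t-test on the day-by-day differences, and when the two series move together, those differences are much less noisy than either series alone. The sketch below illustrates this with synthetic data; the `market` and `stock` series and their parameters are assumptions made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_days = 1000

# Synthetic, positively correlated series (hypothetical parameters)
market = rng.normal(0.0004, 0.01, n_days)
stock = 0.0003 + 1.1 * market + rng.normal(0.0, 0.008, n_days)

# Paired test on the two series
t_rel, p_rel = stats.ttest_rel(stock, market)

# One-sample test on the daily differences gives the same answer
diff = stock - market
t_one, p_one = stats.ttest_1samp(diff, 0.0)

# Independent test ignores the pairing and the shared market noise
t_ind, p_ind = stats.ttest_ind(stock, market)

print(f"paired:      t={t_rel:.4f}, p={p_rel:.4f}")
print(f"differences: t={t_one:.4f}, p={p_one:.4f}")
print(f"independent: t={t_ind:.4f}, p={p_ind:.4f}")
```

With positively correlated series, the independent test has a larger standard error for the same mean difference, so it tends to be less sensitive than the paired test.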

## 3. Testing Trading Strategy Performance

In [None]:
# Simulate a simple momentum strategy
print("=" * 60)
print("TEST 3: Is Momentum Strategy Statistically Significant?")
print("=" * 60)

# Simple momentum: go long if last month was positive
spy_series = returns['SPY']
momentum_signal = spy_series.rolling(21).mean().shift(1) > 0
strategy_returns = spy_series[momentum_signal].dropna()
benchmark_returns = spy_series.dropna()

print(f"\nStrategy Stats (Long only when momentum positive):")
print(f"  Days in market: {len(strategy_returns)} / {len(benchmark_returns)}")
print(f"  Strategy mean:  {np.mean(strategy_returns)*252:.2%}")
print(f"  Benchmark mean: {np.mean(benchmark_returns)*252:.2%}")
print(f"  Strategy vol:   {np.std(strategy_returns, ddof=1)*np.sqrt(252):.2%}")

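# Test if strategy returns differ from benchmark
# Caveat: strategy days are a subset of the benchmark days, so the two samples
# overlap and are not truly independent; treat this p-value as a rough guide only.
t_stat, p_value = stats.ttest_ind(strategy_returns, benchmark_returns)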
print(f"\nStatistical Test (Strategy vs Benchmark):")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value:     {p_value:.6f}")
print(f"  Conclusion:  {'Strategy significantly different' if p_value < 0.05 else 'No significant difference'}")

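# Sharpe ratio, plus a one-sided t-test on excess returns (H1: mean excess > 0)
daily_rf = 0.05 / 252  # assumes a 5% annual risk-free rate
excess_returns = strategy_returns - daily_rf
strategy_sharpe = excess_returns.mean() / excess_returns.std(ddof=1)
t_sharpe, p_sharpe = stats.ttest_1samp(excess_returns, 0, alternative='greater')
print(f"\nSharpe Ratio Analysis:")
print(f"  Daily Sharpe: {strategy_sharpe:.4f}")
print(f"  Annualized:   {strategy_sharpe * np.sqrt(252):.2f}")
print(f"  t-statistic:  {t_sharpe:.4f}")
print(f"  p-value (one-sided): {p_sharpe:.6f}")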
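
A useful identity behind the Sharpe ratio: for a one-sample t-test on excess returns, the t-statistic equals the daily Sharpe ratio times √n (when the sample standard deviation uses `ddof=1`). The sketch below checks this on synthetic returns; the return parameters are hypothetical, and the 5% annual risk-free rate mirrors the assumption in the cell above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

daily_rf = 0.05 / 252                    # assumed 5% annual risk-free rate
strat = rng.normal(0.0008, 0.01, 756)    # synthetic strategy returns (~3 years)
excess = strat - daily_rf

# Daily Sharpe ratio with the sample standard deviation
sharpe_daily = excess.mean() / excess.std(ddof=1)

# t-statistic implied by the Sharpe ratio
t_from_sharpe = sharpe_daily * np.sqrt(excess.size)

# t-statistic from SciPy for H0: mean excess return = 0
t_stat, p_two_sided = stats.ttest_1samp(excess, 0.0)

# One-sided p-value for H1: mean excess return > 0
p_one_sided = stats.t.sf(t_stat, df=excess.size - 1)

print(f"t from Sharpe: {t_from_sharpe:.4f}, t from SciPy: {t_stat:.4f}")
print(f"one-sided p-value: {p_one_sided:.4f}")
```

One consequence is that the same Sharpe ratio becomes more statistically convincing as the track record gets longer, since the t-statistic grows with √n.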

## 4. Multiple Testing Problem

In [None]:
# Multiple Testing Problem Demonstration
print("=" * 60)
print("THE MULTIPLE TESTING PROBLEM")
print("=" * 60)

np.random.seed(42)

# Test 20 'strategies' that are actually random
n_strategies = 20
n_days = 252
alpha = 0.05

print(f"\nTesting {n_strategies} random 'strategies'...")
p_values = []

for i in range(n_strategies):
    # Generate random returns (no real edge)
    fake_returns = np.random.normal(0, 0.01, n_days)
    _, p = stats.ttest_1samp(fake_returns, 0)
    p_values.append(p)

significant = sum(p < alpha for p in p_values)
print(f"\nResults:")
print(f"  Strategies with p < {alpha}: {significant} / {n_strategies}")
print(f"  Expected by chance: {n_strategies * alpha:.1f}")

# Bonferroni correction
alpha_bonferroni = alpha / n_strategies
significant_bonf = sum(p < alpha_bonferroni for p in p_values)
print(f"\nWith Bonferroni correction (α = {alpha_bonferroni:.4f}):")
print(f"  Significant strategies: {significant_bonf} / {n_strategies}")

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(range(1, n_strategies+1), p_values, color='steelblue', alpha=0.7)
for bar, p in zip(bars, p_values):
    if p < alpha:
        bar.set_color('green')
ax.axhline(alpha, color='red', linestyle='--', lw=2, label=f'α = {alpha}')
ax.axhline(alpha_bonferroni, color='orange', linestyle='--', lw=2, label=f'Bonferroni α = {alpha_bonferroni:.4f}')
ax.set_xlabel('Strategy', fontsize=11)
ax.set_ylabel('p-value', fontsize=11)
ax.set_title('Multiple Testing: Random Strategies\n(Green = "Significant" at α=0.05)', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
plt.show()

print("\n⚠️ Key Insight: Testing many strategies inflates false positives!")
print("   Use Bonferroni, FDR, or out-of-sample validation.")
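
The FDR option mentioned above is commonly implemented with the Benjamini-Hochberg step-up procedure. Below is a minimal hand-rolled sketch, not a library API: the function name `benjamini_hochberg` and the example p-values are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices that sort p ascending
    ranked = p[order]
    thresholds = q * np.arange(1, m + 1) / m    # q * rank / m
    below = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])        # largest rank meeting its threshold
        reject[order[:k + 1]] = True            # reject everything up to that rank
    return reject

# Hypothetical p-values from ten strategy backtests
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

bh_mask = benjamini_hochberg(example_p, q=0.05)
bonf_mask = np.array(example_p) < 0.05 / len(example_p)

print(f"Uncorrected (p < 0.05): {sum(p < 0.05 for p in example_p)}")
print(f"Benjamini-Hochberg:     {bh_mask.sum()}")
print(f"Bonferroni:             {bonf_mask.sum()}")
```

On this toy input, Benjamini-Hochberg keeps more discoveries than Bonferroni while still discarding most of the marginal p-values, which is the usual trade-off between controlling the false discovery rate and controlling the chance of any false positive.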

## 5. Chi-Square Tests - Testing Independence

In [None]:
# Chi-Square Test: Are return categories independent of day of week?
print("=" * 60)
print("CHI-SQUARE TEST: Returns vs Day of Week")
print("=" * 60)

# Create categories
spy_df = returns['SPY'].to_frame()
spy_df['day_of_week'] = spy_df.index.dayofweek
spy_df['return_category'] = pd.cut(spy_df['SPY'], 
                                    bins=[-np.inf, -0.01, 0, 0.01, np.inf],
                                    labels=['Big Down', 'Small Down', 'Small Up', 'Big Up'])

# Contingency table
contingency = pd.crosstab(spy_df['day_of_week'], spy_df['return_category'])
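# Map weekday numbers to labels (robust even if a weekday is missing from the data)
contingency = contingency.rename(index={0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri'})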
print("\nContingency Table:")
print(contingency)

# Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
print(f"\nChi-square test results:")
print(f"  χ² statistic: {chi2:.4f}")
print(f"  p-value:      {p_value:.6f}")
print(f"  Degrees of freedom: {dof}")
print(f"  Conclusion:   {'Return category depends on day of week' if p_value < 0.05 else 'No significant evidence of day-of-week dependence'}")
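
The chi-square statistic can be rebuilt by hand: under independence, each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The sketch below does this for a small made-up table and compares the result with `stats.chi2_contingency`; the `observed` counts are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical 3x4 contingency table of counts
observed = np.array([[30, 45, 50, 25],
                     [28, 50, 48, 24],
                     [35, 40, 52, 23]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Expected counts if rows and columns were independent
expected_manual = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof_manual)

# SciPy applies a continuity correction only when dof == 1, so results match here
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)

print(f"chi2 manual={chi2_manual:.4f}, scipy={chi2_scipy:.4f}")
print(f"p manual={p_manual:.4f}, scipy={p_scipy:.4f}, dof={dof_manual}")
```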

## 📝 Key Takeaways - Day 3

### Hypothesis Testing for Interviews:

1. **One-Sample t-test**: Test if mean differs from a value
   - Is strategy return > 0?
   - Is alpha significant?

2. **Two-Sample t-test**: Compare two groups
   - Independent: Different samples
   - Paired: Same time periods

3. **Multiple Testing Problem**
   - More tests = more false positives
   - Bonferroni: α / n_tests
   - FDR control for large-scale testing

4. **Chi-Square Test**: Test independence
   - Categorical variables
   - Regime detection
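
Takeaway 3 can be made concrete with a little arithmetic. Assuming the tests are independent (a simplification, since real strategies are often correlated), the chance of at least one false positive across n tests at level α is 1 − (1 − α)ⁿ. The helper name `family_wise_error_rate` is made up for this sketch.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) across n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

alpha = 0.05
for n_tests in (1, 5, 20, 100):
    raw = family_wise_error_rate(alpha, n_tests)
    corrected = family_wise_error_rate(alpha / n_tests, n_tests)
    print(f"n={n_tests:>3}: uncorrected={raw:.3f}, Bonferroni={corrected:.3f}")
```

With 20 tests at α = 0.05, the uncorrected rate is roughly 64%, while dividing α by the number of tests keeps it at or below 5%.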

### Interview Questions:
- "How would you test if a trading strategy has real alpha?"
- "What is the multiple testing problem and how do you address it?"
- "Explain Type I vs Type II errors in trading context"
- "When would you use a paired vs independent t-test?"
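
For the Type I vs Type II question, a small simulation can help anchor the answer. A Type I error is declaring an edge when none exists; a Type II error is missing an edge that is real. The sketch below estimates both rates for a one-year backtest; the volatility and the `true_edge` value are hypothetical choices, not estimates from market data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, n_days, alpha = 5000, 252, 0.05
daily_vol = 0.01
true_edge = 0.001  # hypothetical daily mean return of a strategy with real edge

# Each row is one simulated year of daily returns
no_edge = rng.normal(0.0, daily_vol, size=(n_sims, n_days))
with_edge = rng.normal(true_edge, daily_vol, size=(n_sims, n_days))

# One t-test per simulated year (row-wise)
p_no_edge = stats.ttest_1samp(no_edge, 0.0, axis=1).pvalue
p_with_edge = stats.ttest_1samp(with_edge, 0.0, axis=1).pvalue

type_1_rate = np.mean(p_no_edge < alpha)     # false alarms when H0 is true
type_2_rate = np.mean(p_with_edge >= alpha)  # misses when H1 is true

print(f"Type I error rate:  {type_1_rate:.3f} (should be near alpha={alpha})")
print(f"Type II error rate: {type_2_rate:.3f} (power = {1 - type_2_rate:.3f})")
```

Under these assumptions, a single year of daily data misses a real but modest edge more often than it detects it, which is why short backtests are weak evidence in either direction.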