# Module 05: Statistical Validation and Hypothesis Testing

**Difficulty**: ⭐⭐ (Intermediate)

**Estimated Time**: 75 minutes

**Prerequisites**: Module 04 (Data Quality Assessment)

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Implement proper train/validation/test splits** that prevent data leakage
2. **Apply k-fold cross-validation correctly** to obtain robust performance estimates
3. **Conduct statistical hypothesis tests** with appropriate null and alternative hypotheses
4. **Perform power analysis** for a priori sample size determination
5. **Understand Type I and Type II errors** and their trade-offs
6. **Apply multiple comparison corrections** to control family-wise error rates

## Setup

Let's import the libraries we'll use in this notebook.

In [None]:
# Standard data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical testing
from scipy import stats
from scipy.stats import ttest_ind, chi2_contingency, f_oneway, mannwhitneyu

# Machine learning and validation
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score, train_test_split
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, mean_squared_error

# Power analysis
from statsmodels.stats.power import tt_solve_power, tt_ind_solve_power, FTestAnovaPower
from statsmodels.stats.multitest import multipletests

# Configuration for better visualizations
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seeds for reproducibility
np.random.seed(42)

print("✓ Libraries imported successfully!")

## 1. Data Splitting: Train, Validation, and Test Sets

### The Principle: Never Touch Your Test Set Twice

A fundamental rule in rigorous research: your **test set represents the future**. Once you've decided to test your final model on it, you cannot use it again to make any other decisions.

### The Three-Way Split

For most machine learning projects, divide your data into three independent sets:

| Set | Purpose | Size | Usage |
|-----|---------|------|-------|
| **Training** | Fit model parameters | 70-80% | Seen by model during learning |
| **Validation** | Tune hyperparameters, select model | 10-15% | Used for decision-making |
| **Test** | Final performance estimate | 10-15% | Touched ONCE at the very end |

### Why Three Sets?

**With just train/test split:**
- You might accidentally memorize patterns in the test set during hyperparameter tuning
- When you use test performance to decide which model is "best", you're optimizing for that specific test set
- This leads to **optimistic bias** in your final performance estimates

**With train/validation/test split:**
- Training set: Used only to fit model parameters (you make no decisions from it)
- Validation set: You use this for all decisions (model selection, hyperparameter tuning)
- Test set: You report this number once, at the very end (final ground truth)

### Special Case: Time-Series Data

For time-series data, **respect temporal ordering**:

```
Training Data    Validation Data    Test Data
[Jan-May]        [Jun-Jul]          [Aug-Dec]  ← Earlier → Later

✓ CORRECT: Train on past, validate and test on future

✗ WRONG: Random shuffle
[Jan, May, Jul]  [Feb, Jun, Aug]    [Mar, Apr, Sep]  ← Data leakage!
```

Training on future data to predict the past is impossible in real applications.
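
A chronological split can be sketched as follows. This is a minimal illustration that assumes the rows are already sorted from oldest to newest; the helper name `chronological_split` and the 60/20/20 proportions are hypothetical choices, not part of the examples in this notebook.

```python
import numpy as np

def chronological_split(n_samples, train_frac=0.6, val_frac=0.2):
    """Return index arrays for train/validation/test in time order.

    Assumes rows are already sorted from oldest to newest.
    """
    train_end = int(n_samples * train_frac)
    val_end = int(n_samples * (train_frac + val_frac))
    indices = np.arange(n_samples)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]

# Hypothetical data: 12 monthly observations (0 = Jan, ..., 11 = Dec)
train_idx, val_idx, test_idx = chronological_split(12)
print("Train months:", train_idx)
print("Val months:  ", val_idx)
print("Test months: ", test_idx)
```

Because the indices are never shuffled, every validation row is later than every training row, and every test row is later than every validation row.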

In [None]:
# Example 1: Creating proper train/validation/test split

# Create a synthetic dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    random_state=42
)

print(f"Total dataset size: {len(X)} samples")
print()

# Step 1: First split - separate test set (untouchable)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, 
    test_size=0.15,  # 15% for final testing
    random_state=42,
    stratify=y  # Maintain class balance
)

# Step 2: Split remaining data into training and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val,
    test_size=0.176,  # 0.176 * (1 - 0.15) ≈ 0.15, i.e. about 15% of the total
    random_state=42,
    stratify=y_train_val
)

# Display the split
print("DATA SPLIT SUMMARY")
print("="*50)
print(f"Training set:   {len(X_train)} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"Validation set: {len(X_val)} samples ({len(X_val)/len(X)*100:.1f}%)")
print(f"Test set:       {len(X_test)} samples ({len(X_test)/len(X)*100:.1f}%)")
print("="*50)

# Verify class balance is maintained
print("\nClass Distribution (Positive Class):")
print(f"Original:   {(y == 1).sum() / len(y) * 100:.1f}%")
print(f"Training:   {(y_train == 1).sum() / len(y_train) * 100:.1f}%")
print(f"Validation: {(y_val == 1).sum() / len(y_val) * 100:.1f}%")
print(f"Test:       {(y_test == 1).sum() / len(y_test) * 100:.1f}%")
print("\n✓ Class balance maintained across all sets")

In [None]:
# Example 2: Demonstrating data leakage - what NOT to do

print("DATA LEAKAGE EXAMPLE: What Happens When You Peek at Test Data")
print("="*60)

# WRONG WAY: Fitting preprocessing on ALL data (including test)
from sklearn.preprocessing import StandardScaler

# ❌ WRONG: Fit scaler on entire dataset
scaler_wrong = StandardScaler()
X_scaled_wrong = scaler_wrong.fit_transform(X)  # Scaler sees test data!

# Split after scaling (data leakage!)
X_train_wrong, X_test_wrong, y_train_wrong, y_test_wrong = train_test_split(
    X_scaled_wrong, y, test_size=0.15, random_state=42
)

# ✓ RIGHT WAY: Fit preprocessing only on training data
X_train_clean, X_test_clean, y_train_clean, y_test_clean = train_test_split(
    X, y, test_size=0.15, random_state=42
)

scaler_right = StandardScaler()
X_train_scaled = scaler_right.fit_transform(X_train_clean)  # Fit only on training
X_test_scaled = scaler_right.transform(X_test_clean)  # Transform test with training stats

# Compare: Train a model both ways
model_wrong = LogisticRegression(max_iter=1000, random_state=42)
model_wrong.fit(X_train_wrong, y_train_wrong)
acc_wrong = accuracy_score(y_test_wrong, model_wrong.predict(X_test_wrong))

model_right = LogisticRegression(max_iter=1000, random_state=42)
model_right.fit(X_train_scaled, y_train_clean)
acc_right = accuracy_score(y_test_clean, model_right.predict(X_test_scaled))

print(f"\nLeaked approach (scaler sees test data): {acc_wrong:.4f}")
print(f"Correct approach (clean split):          {acc_right:.4f}")
print(f"\nDifference: {(acc_wrong - acc_right)*100:.2f} percentage points")
print("\n⚠️  Leakage tends to inflate performance estimates.")
print("   With simple scaling the gap may be small or even zero, but the leaked")
print("   number is still not a trustworthy estimate of deployment performance.")

### Exercise 1: Proper Train/Validation/Test Split

You have 5000 customer records with a binary outcome (churned/retained):
- 3200 retained customers (label=0)
- 1800 churned customers (label=1)

Your task: Create a proper train/validation/test split with:
1. Test set: 15% of data
2. Validation set: 15% of remaining data
3. Training set: Rest
4. Maintain class balance across all sets

Calculate and report:
- Size of each set
- Proportion of churned customers in each set
- Any differences in class balance

**Hint**: Use `train_test_split` twice with `stratify=y` parameter

In [None]:
# Exercise 1: Create proper train/validation/test split

# Create customer churn dataset
np.random.seed(42)
n_total = 5000
y_churn = np.concatenate([
    np.zeros(3200),   # Retained
    np.ones(1800)     # Churned
])
np.random.shuffle(y_churn)

X_churn = np.random.randn(n_total, 10)  # Random features

print("Customer Churn Dataset:")
print(f"Total samples: {len(y_churn)}")
print(f"Churned (label=1): {(y_churn == 1).sum()} ({(y_churn == 1).sum()/len(y_churn)*100:.1f}%)")
print(f"Retained (label=0): {(y_churn == 0).sum()} ({(y_churn == 0).sum()/len(y_churn)*100:.1f}%)")
print()

# TODO: Implement proper 3-way split here
# Step 1: Separate 15% for test set
# X_train_val, X_test, y_train_val, y_test = train_test_split(...)

# Step 2: Split remaining into train and validation
# X_train, X_val, y_train, y_val = train_test_split(...)

# Report results:
# print(f"Training:   {len(X_train)} samples")
# print(f"Validation: {len(X_val)} samples")
# print(f"Test:       {len(X_test)} samples")
# print(f"\nChurned rate by set:")
# print(f"Training:   {(y_train == 1).sum() / len(y_train) * 100:.1f}%")
# print(f"Validation: {(y_val == 1).sum() / len(y_val) * 100:.1f}%")
# print(f"Test:       {(y_test == 1).sum() / len(y_test) * 100:.1f}%")

## 2. Cross-Validation: More Robust Performance Estimates

### The Problem with Single Train/Test Split

A single 80/20 split depends heavily on which specific samples end up in training vs. test:

- Split A: Get lucky with test set → Accuracy = 88%
- Split B: Get unlucky with test set → Accuracy = 82%
- Which is the "true" performance?

**Answer**: K-fold cross-validation averages over many different splits.

### How K-Fold Cross-Validation Works

```
Original Data: [▀▀▀▀▀▀▀▀▀▀] (10 samples, k=5)

Fold 1: Train [▀▀▀▀▀▀▀▀]  Test [▀▀]  → Accuracy₁ = 85%
Fold 2: Train [▀▀▀▀▀▀▀▀]  Test [▀▀]  → Accuracy₂ = 87%
Fold 3: Train [▀▀▀▀▀▀▀▀]  Test [▀▀]  → Accuracy₃ = 86%
Fold 4: Train [▀▀▀▀▀▀▀▀]  Test [▀▀]  → Accuracy₄ = 84%
Fold 5: Train [▀▀▀▀▀▀▀▀]  Test [▀▀]  → Accuracy₅ = 88%

Mean Performance: (85+87+86+84+88)/5 = 86.0% ± 1.6%
```
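
The summary line in the diagram can be reproduced with a few lines of NumPy. This is a small sketch using the five illustrative fold accuracies above; the ± value corresponds to the sample standard deviation (`ddof=1`), and the variable names are chosen only for this example.

```python
import numpy as np

# Fold accuracies from the diagram above (in percent)
fold_acc = np.array([85, 87, 86, 84, 88], dtype=float)

mean_acc = fold_acc.mean()
sd_acc = fold_acc.std(ddof=1)               # sample standard deviation across folds
se_acc = sd_acc / np.sqrt(len(fold_acc))    # standard error of the mean

print(f"Mean: {mean_acc:.1f}%  SD: {sd_acc:.1f}%  SE: {se_acc:.1f}%")
```

The SD describes how much individual folds vary, while the SE describes the uncertainty of the mean itself and is the quantity used for confidence intervals.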

### Advantages of K-Fold CV

- ✓ **More stable**: Uses all data for both training and testing
- ✓ **Better error bars**: SD of k accuracy scores shows variability
- ✓ **Smaller variance**: More realistic estimate of true performance
- ✓ **Full utilization**: No data wasted (vs. holding out 20% permanently)

### Choosing K

- **k=5**: Fast, standard choice for most problems
- **k=10**: More computation; each model trains on 90% of the data, so the estimate is slightly less biased
- **k=3**: Minimum for small datasets
- **Leave-One-Out (k=n)**: Nearly unbiased, but the estimate can have high variance; very slow

### Important: CV for Hyperparameter Tuning

**The right way**:
```python
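# Candidate hyperparameters (illustrative values of LogisticRegression's C)
param_grid = [0.01, 0.1, 1.0, 10.0]

best_C, best_score = None, -np.inf
for C in param_grid:
    candidate = LogisticRegression(C=C, max_iter=1000, random_state=42)
    cv_scores = cross_val_score(candidate, X_train, y_train, cv=5)
    if cv_scores.mean() > best_score:
        best_C, best_score = C, cv_scores.mean()

# Fit final model on full training set
final_model = LogisticRegression(C=best_C, max_iter=1000, random_state=42).fit(X_train, y_train)

# Report on held-out test set (ONLY ONCE)
test_score = final_model.score(X_test, y_test)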
```
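
The same workflow can also be sketched with scikit-learn's `GridSearchCV`. The example below is one possible setup, not the only one: the synthetic data, the grid of `C` values, and the variable names are assumptions made for illustration. Wrapping the scaler in a `Pipeline` means it is re-fit on the training portion of each fold, which avoids the preprocessing leakage shown earlier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The scaler lives inside the pipeline, so it is re-fit on each CV training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid over the regularization strength C
search = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_tr, y_tr)  # refit=True (the default) refits the best setting on all of X_tr

print("Best C:", search.best_params_["clf__C"])
print(f"Mean CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy (reported once): {search.score(X_te, y_te):.3f}")
```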

In [None]:
# Example 3: K-Fold Cross-Validation in Action

# Use our training data from earlier
model = LogisticRegression(max_iter=1000, random_state=42)

# Method 1: Manual k-fold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
fold_number = 1

print("K-FOLD CROSS-VALIDATION (Manual Implementation)")
print("="*60)

for train_idx, val_idx in kfold.split(X_train):
    # Split data
    X_fold_train = X_train[train_idx]
    y_fold_train = y_train[train_idx]
    X_fold_val = X_train[val_idx]
    y_fold_val = y_train[val_idx]
    
    # Train and evaluate
    model.fit(X_fold_train, y_fold_train)
    score = accuracy_score(y_fold_val, model.predict(X_fold_val))
    fold_scores.append(score)
    
    print(f"Fold {fold_number}: Accuracy = {score:.4f}")
    fold_number += 1

print("="*60)
print(f"Mean CV Accuracy: {np.mean(fold_scores):.4f}")
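print(f"Std Dev:          {np.std(fold_scores, ddof=1):.4f}")
# Approximate 95% CI from the standard error of the fold mean (sample SD, ddof=1)
se_cv = np.std(fold_scores, ddof=1) / np.sqrt(len(fold_scores))
print(f"95% CI:           [{np.mean(fold_scores) - 1.96*se_cv:.4f}, "
      f"{np.mean(fold_scores) + 1.96*se_cv:.4f}]")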

# Method 2: Using scikit-learn's cross_val_score (cleaner)
print("\n" + "="*60)
print("Using scikit-learn's cross_val_score:")
print("="*60)

cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000, random_state=42),
    X_train, y_train,
    cv=5,
    scoring='accuracy'
)

print(f"CV Scores: {cv_scores}")
print(f"\nMean: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

In [None]:
# Example 4: Time-Series Cross-Validation (Respecting Temporal Order)

# Create synthetic time-series data
n_time_steps = 200
X_ts = np.random.randn(n_time_steps, 5)
# Create target with time-series pattern
y_ts = (np.sin(np.arange(n_time_steps) / 20) + np.random.randn(n_time_steps) * 0.1) > 0

print("TIME-SERIES CROSS-VALIDATION")
print("="*60)
print(f"Total samples: {n_time_steps} (time steps)\n")

# Use TimeSeriesSplit to respect temporal ordering
ts_cv = TimeSeriesSplit(n_splits=4)

fold_number = 1
ts_scores = []

for train_idx, test_idx in ts_cv.split(X_ts):
    # Notice: test_idx is always AFTER train_idx (temporal ordering)
    train_start, train_end = train_idx[0], train_idx[-1]
    test_start, test_end = test_idx[0], test_idx[-1]
    
    print(f"Fold {fold_number}:")
    print(f"  Training: time steps {train_start:3d}-{train_end:3d} ({len(train_idx):3d} samples)")
    print(f"  Testing:  time steps {test_start:3d}-{test_end:3d} ({len(test_idx):3d} samples)")
    
    # Train model
    model_ts = LogisticRegression(max_iter=1000, random_state=42)
    model_ts.fit(X_ts[train_idx], y_ts[train_idx])
    score = accuracy_score(y_ts[test_idx], model_ts.predict(X_ts[test_idx]))
    ts_scores.append(score)
    print(f"  Accuracy: {score:.4f}\n")
    
    fold_number += 1

print("="*60)
print(f"Mean Time-Series CV Accuracy: {np.mean(ts_scores):.4f} ± {np.std(ts_scores):.4f}")
print("\n✓ Temporal ordering respected (no future data in training)")

### Exercise 2: Cross-Validation with Error Bars

Using the training data from earlier:
1. Perform 10-fold cross-validation with LogisticRegression
2. Calculate mean accuracy and standard error
3. Calculate 95% confidence interval: mean ± 1.96 * SE
4. Visualize the distribution of fold accuracies (histogram or boxplot)
5. Interpret: What does the spread of scores tell you?

**Hint**: Use `cross_val_score()` and then calculate statistics manually, or use `scipy.stats.sem()` for standard error

In [None]:
# Exercise 2: 10-Fold Cross-Validation with Error Bars

# TODO: Perform 10-fold cross-validation
# cv_scores = cross_val_score(...)

# TODO: Calculate mean and standard error
# mean_accuracy = cv_scores.mean()
# se = stats.sem(cv_scores)  # Standard error of the mean
# ci_lower = mean_accuracy - 1.96 * se
# ci_upper = mean_accuracy + 1.96 * se

# TODO: Print results
# print(f"Mean Accuracy: {mean_accuracy:.4f}")
# print(f"Standard Error: {se:.4f}")
# print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")

# TODO: Create visualization
# fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# axes[0].hist(cv_scores, bins=5, edgecolor='black')
# axes[0].set_xlabel('Accuracy')
# axes[0].set_ylabel('Frequency')
# axes[0].set_title('Distribution of Fold Accuracies')
# axes[0].axvline(mean_accuracy, color='red', linestyle='--', label='Mean')
# axes[0].legend()

# axes[1].boxplot(cv_scores)
# axes[1].set_ylabel('Accuracy')
# axes[1].set_title('Boxplot of Fold Accuracies')
# plt.tight_layout()
# plt.show()

print("TODO: Complete Exercise 2")

## 3. Hypothesis Testing: Statistical Inference

### Core Concepts

**Null Hypothesis (H₀)**: The default assumption - "no effect exists"
- "There is no difference between groups A and B"
- "Feature X has no predictive power"

**Alternative Hypothesis (H₁)**: What you're testing for - "some effect exists"
- "Groups A and B differ"
- "Feature X predicts the outcome"

**Test Statistic**: A number calculated from data that measures evidence against H₀
- t-statistic, F-statistic, χ² statistic, etc.

**P-value**: Probability of observing data this extreme (or more) IF H₀ is true
- P-value = 0.03 means "if there were no effect, we'd see data at least this extreme 3% of the time"
- CRITICAL: P-value is NOT "probability that H₀ is true"

**Significance Level (α)**: The threshold below which a p-value leads us to reject H₀
- Standard: α = 0.05
- Decision: If p < α, reject H₀; otherwise fail to reject H₀
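
The definition of a p-value can be made concrete with a permutation test, used here purely as an illustration rather than as the method of this notebook. The sketch below assumes two made-up groups and shuffles the group labels to approximate what the data would look like if H₀ were true; all names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical measurements for two groups
group_a = rng.normal(loc=0.0, scale=1.0, size=40)
group_b = rng.normal(loc=0.6, scale=1.0, size=40)
observed_diff = group_b.mean() - group_a.mean()

# Under H0 the group labels are exchangeable, so shuffle them many times
pooled = np.concatenate([group_a, group_b])
n_perm = 5000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[40:].mean() - shuffled[:40].mean()

# Two-sided p-value: share of shuffles at least as extreme as the observed difference
p_perm = (np.sum(np.abs(perm_diffs) >= abs(observed_diff)) + 1) / (n_perm + 1)

alpha = 0.05
print(f"Observed difference: {observed_diff:.3f}")
print(f"Permutation p-value: {p_perm:.4f}")
print("Decision:", "reject H0" if p_perm < alpha else "fail to reject H0")
```

The p-value here is literally a count: how often a world with no effect produces a difference at least as large as the one observed.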

### Common Hypothesis Tests

| Question | Test | Null Hypothesis |
|----------|------|------------------|
| Do two groups differ? | t-test (independent) | μ₁ = μ₂ |
| Do 3+ groups differ? | ANOVA | All group means equal |
| Is there association? | Chi-square | Variables independent |
| Correlation significant? | Pearson test | r = 0 |
| Non-normal data? | Mann-Whitney U | Distributions identical |
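
Each row of the table maps to a function in `scipy.stats`. The sketch below runs all five on made-up data so the calls can be compared side by side; the samples and the 2x2 counts are hypothetical and carry no meaning beyond illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical samples for illustration only
a = rng.normal(loc=0.0, scale=1.0, size=50)
b = rng.normal(loc=0.5, scale=1.0, size=50)
c = rng.normal(loc=1.0, scale=1.0, size=50)

# Two groups: independent-samples t-test (H0: equal means)
t_stat, p_t = stats.ttest_ind(a, b)

# Three or more groups: one-way ANOVA (H0: all group means equal)
f_stat, p_f = stats.f_oneway(a, b, c)

# Association between two categorical variables: chi-square test of independence
table = np.array([[30, 20], [18, 32]])  # made-up 2x2 counts
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Linear correlation: Pearson test (H0: r = 0)
r, p_r = stats.pearsonr(a, a + rng.normal(size=50))

# Non-normal data: Mann-Whitney U (H0: identical distributions)
u_stat, p_u = stats.mannwhitneyu(a, b, alternative="two-sided")

for name, p in [("t-test", p_t), ("ANOVA", p_f), ("chi-square", p_chi2),
                ("Pearson", p_r), ("Mann-Whitney U", p_u)]:
    print(f"{name:15s} p = {p:.4f}")
```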

### Critical: Report Exact P-Values

- ❌ **Avoid**: "p < 0.05"
- ✓ **Report**: "p = 0.032" or "p = 0.003"

**Why?** "p < 0.05" loses information. The exact value tells readers how strong the evidence is.
- p = 0.049 (barely significant)
- p = 0.0001 (very strong evidence)

In [None]:
# Example 5: Independent Samples T-Test

# Research question: Do users with high engagement have lower churn?
# Create two groups
np.random.seed(42)

# Low engagement group: higher churn rate (mean = 0.45)
low_engagement_churn = np.random.binomial(n=1, p=0.45, size=150)

# High engagement group: lower churn rate (mean = 0.25)
high_engagement_churn = np.random.binomial(n=1, p=0.25, size=150)

print("INDEPENDENT SAMPLES T-TEST")
print("="*60)
print("Research Question: Does engagement affect churn?")
print()
print(f"Low engagement group:  mean churn = {low_engagement_churn.mean():.3f}")
print(f"High engagement group: mean churn = {high_engagement_churn.mean():.3f}")
print(f"Difference: {low_engagement_churn.mean() - high_engagement_churn.mean():.3f}")
print()

# Null hypothesis: μ_low = μ_high (no difference)
# Alternative hypothesis: μ_low ≠ μ_high (two-tailed)

t_stat, p_value = ttest_ind(low_engagement_churn, high_engagement_churn)

print("HYPOTHESIS TEST RESULTS")
print("="*60)
print(f"H₀ (Null):       Engagement does not affect churn")
print(f"H₁ (Alternative): Engagement affects churn (two-tailed)")
print()
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")
print(f"α level:     0.05")
print()

if p_value < 0.05:
    print(f"✓ REJECT H₀: Evidence suggests engagement affects churn (p = {p_value:.4f})")
else:
    print(f"✗ FAIL TO REJECT H₀: Insufficient evidence (p = {p_value:.4f})")

print()
print("INTERPRETATION:")
print(f"- If there were no effect of engagement on churn,")
print(f"  we would observe data at least this extreme {p_value*100:.2f}% of the time")
print(f"- This is {'unusual' if p_value < 0.05 else 'not unusual'} at α = 0.05")

In [None]:
# Example 6: Effect Sizes and Confidence Intervals

from scipy.stats import t

print("EFFECT SIZES AND CONFIDENCE INTERVALS")
print("="*60)

# Calculate Cohen's d (effect size)
n1, n2 = len(low_engagement_churn), len(high_engagement_churn)
mean1, mean2 = low_engagement_churn.mean(), high_engagement_churn.mean()
# Sample standard deviations (ddof=1) to match the (n-1) weights in the pooled formula
std1, std2 = low_engagement_churn.std(ddof=1), high_engagement_churn.std(ddof=1)

# Pooled standard deviation
pooled_std = np.sqrt(((n1-1)*std1**2 + (n2-1)*std2**2) / (n1 + n2 - 2))

# Cohen's d
cohens_d = (mean1 - mean2) / pooled_std

print(f"Cohen's d: {cohens_d:.4f}")
print()
print("Effect size interpretation:")
print("  |d| < 0.2: Small effect")
print("  0.2 ≤ |d| < 0.5: Small-to-medium")
print("  0.5 ≤ |d| < 0.8: Medium-to-large")
print("  |d| ≥ 0.8: Large effect")

if abs(cohens_d) < 0.2:
    effect_interpretation = "Small"
elif abs(cohens_d) < 0.5:
    effect_interpretation = "Small-to-medium"
elif abs(cohens_d) < 0.8:
    effect_interpretation = "Medium-to-large"
else:
    effect_interpretation = "Large"

print(f"\n→ This is a {effect_interpretation} effect")

# Calculate 95% Confidence Interval for difference in means
se_diff = np.sqrt((std1**2 / n1) + (std2**2 / n2))
df = n1 + n2 - 2
t_critical = t.ppf(0.975, df)  # 97.5th percentile for 95% CI

mean_diff = mean1 - mean2
ci_lower = mean_diff - t_critical * se_diff
ci_upper = mean_diff + t_critical * se_diff

print(f"\n95% Confidence Interval for difference in means:")
print(f"[{ci_lower:.4f}, {ci_upper:.4f}]")
print()
print("Interpretation:")
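print(f"- The true difference in churn rates is plausibly between {ci_lower:.3f} and {ci_upper:.3f}")
if ci_lower > 0 or ci_upper < 0:
    print("- Since the CI excludes 0, the data are inconsistent with no difference at the 5% level")
else:
    print("- Since the CI includes 0, we cannot rule out no difference at the 5% level")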

### Exercise 3: Hypothesis Testing with Real Scenario

A company A/B tests a new feature with:
- Control group: 500 users, 245 converted
- Treatment group: 480 users, 260 converted

Your task:
1. State H₀ and H₁ formally
2. Perform an appropriate hypothesis test
3. Report the exact p-value
4. Calculate effect size (Cohen's h or similar)
5. Calculate 95% CI for the difference in conversion rates
6. Make a business decision: Should we deploy this feature? Why/why not?

**Hint**: Use chi-square test or compare proportions test

In [None]:
# Exercise 3: A/B Test Hypothesis Testing

# Data from A/B test
control_conversions = 245
control_total = 500
treatment_conversions = 260
treatment_total = 480

print("A/B TEST ANALYSIS")
print("="*60)
print(f"Control:   {control_conversions}/{control_total} conversions = {control_conversions/control_total:.1%}")
print(f"Treatment: {treatment_conversions}/{treatment_total} conversions = {treatment_conversions/treatment_total:.1%}")
print()

# TODO: 
# 1. State hypotheses
print("HYPOTHESES:")
print("H₀: ???")
print("H₁: ???")
print()

# 2. Perform chi-square test
# Create contingency table
contingency_table = np.array([
    [control_conversions, control_total - control_conversions],
    [treatment_conversions, treatment_total - treatment_conversions]
])

# TODO: chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

# 3. Report p-value
# print(f"χ² statistic: {chi2_stat:.4f}")
# print(f"p-value: {p_value:.4f}")
# print(f"Significant at α=0.05? {p_value < 0.05}")

# 4. Calculate effect size
# TODO: Calculate Cramér's V or Cohen's h

# 5. Calculate 95% CI for difference
# TODO: Use proportion CI formula

# 6. Business decision
# TODO: Summarize findings

print("TODO: Complete Exercise 3")

## 4. Power Analysis: Planning Sample Size

### The Problem: How Many Subjects Do I Need?

**Scenario**: You want to run an experiment but don't know how large your sample needs to be.

- Too small: Risk failing to detect a real effect (low power)
- Too large: Waste resources, unnecessary cost

**Power analysis solves this** by determining minimum sample size for a desired power level.

### Key Concepts: Four Things That Are Related

```
1. Sample Size (n)
2. Significance Level (α) - usually 0.05
3. Power (1 - β) - usually 0.80
4. Effect Size (d) - the minimum difference you want to detect

Given ANY THREE, you can calculate the FOURTH
```
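
The "any three give the fourth" relationship can be sketched with the normal approximation for a two-sided, two-sample comparison with equal group sizes. This is a simplified illustration, not the exact t-based calculation (which gives a slightly larger n); the three function names are hypothetical helpers written for this example.

```python
import numpy as np
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * (z_alpha + z_power) ** 2 / d ** 2

def approx_power(d, n, alpha=0.05):
    """Approximate power given effect size d and n per group."""
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf(d * np.sqrt(n / 2) - z_alpha)

def detectable_effect(n, alpha=0.05, power=0.80):
    """Approximate smallest d detectable with n per group."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return (z_alpha + z_power) * np.sqrt(2 / n)

n = n_per_group(d=0.5)
print(f"n per group for d=0.5:     {np.ceil(n):.0f}")
print(f"power with that n:         {approx_power(0.5, n):.2f}")
print(f"detectable d with that n:  {detectable_effect(n):.2f}")
```

Each function fixes three of the quantities and solves for the remaining one, and the `d ** 2` in the denominator shows why halving the effect size roughly quadruples the required sample.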

### Understanding Power

**Power** = Probability of detecting a real effect IF it exists
- Power = 0.80: 80% chance of detecting the effect
- β = 0.20 (Type II error rate): 20% chance of missing the effect

**Standard target**: Power = 0.80
- This is a convention: 80% power for 5% significance level
- Higher power (0.90) requires larger sample sizes

### A Priori vs. Post Hoc Power Analysis

| Type | When | Purpose | Interpretation |
|------|------|---------|----------------|
| **A Priori** | BEFORE collecting data | Plan sample size | "We need n=150 subjects" |
| **Post Hoc** | AFTER collecting data | Understand results | ✗ Generally not recommended |

**Warning**: Post-hoc ("observed") power is controversial because it is a direct function of the p-value and therefore adds no new information. Use confidence intervals instead.

In [None]:
# Example 7: A Priori Power Analysis for t-test

print("A PRIORI POWER ANALYSIS")
print("="*60)
print("Scenario: Plan a study comparing treatment vs. control")
print()

# Parameters
effect_size = 0.5  # Cohen's d = 0.5 (medium effect)
alpha = 0.05
power = 0.80

# Calculate required sample size
n_required = tt_ind_solve_power(
    effect_size=effect_size,
    nobs1=None,  # What we're solving for
    alpha=alpha,
    power=power,
    ratio=1.0,  # Equal sample sizes
    alternative='two-sided'
)

print(f"Effect size (Cohen's d): {effect_size}")
print(f"Significance level (α):  {alpha}")
print(f"Desired power:           {power}")
print(f"\n→ Required sample size per group: {np.ceil(n_required):.0f}")
print(f"→ Total sample size: {2 * np.ceil(n_required):.0f}")

print("\n" + "="*60)
print("SENSITIVITY ANALYSIS: How does sample size change?")
print("="*60)

# Vary effect size
effect_sizes = [0.2, 0.5, 0.8]  # Small, medium, large
sample_sizes = []

for es in effect_sizes:
    n = tt_ind_solve_power(effect_size=es, nobs1=None, alpha=0.05, 
                          power=0.80, alternative='two-sided')
    sample_sizes.append(n)
    effect_label = {0.2: "Small", 0.5: "Medium", 0.8: "Large"}[es]
    print(f"Effect size = {es} ({effect_label:7s}): n = {np.ceil(n):.0f} per group")

# Vary power
print("\nVarying power (effect size = 0.5):")
powers = [0.80, 0.90, 0.95]

for pw in powers:
    n = tt_ind_solve_power(effect_size=0.5, nobs1=None, alpha=0.05,
                          power=pw, alternative='two-sided')
    print(f"Power = {pw}: n = {np.ceil(n):.0f} per group")

In [None]:
# Example 8: Power Curves - Visualizing Sample Size Trade-offs

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Sample size vs. Effect size
effect_sizes = np.linspace(0.1, 1.0, 50)
sample_sizes_by_effect = []

for es in effect_sizes:
    n = tt_ind_solve_power(effect_size=es, nobs1=None, alpha=0.05,
                          power=0.80, alternative='two-sided')
    sample_sizes_by_effect.append(n)

axes[0].plot(effect_sizes, sample_sizes_by_effect, linewidth=2.5, color='steelblue')
axes[0].scatter([0.2, 0.5, 0.8], 
               [tt_ind_solve_power(es, None, 0.05, 0.80, alternative='two-sided')
                for es in [0.2, 0.5, 0.8]],
               s=100, color='red', zorder=5, label='Standard effects')
axes[0].set_xlabel('Effect Size (Cohen\'s d)', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Sample Size per Group', fontsize=11, fontweight='bold')
axes[0].set_title('Sample Size vs. Effect Size\n(Power = 0.80, α = 0.05)', 
                  fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].legend()

# Right plot: Sample size vs. Power
powers = np.linspace(0.50, 0.99, 50)
sample_sizes_by_power = []

for pw in powers:
    n = tt_ind_solve_power(effect_size=0.5, nobs1=None, alpha=0.05,
                          power=pw, alternative='two-sided')
    sample_sizes_by_power.append(n)

axes[1].plot(powers, sample_sizes_by_power, linewidth=2.5, color='coral')
axes[1].axvline(x=0.80, color='green', linestyle='--', linewidth=2, label='Standard (80%)')
axes[1].axhline(y=tt_ind_solve_power(0.5, None, 0.05, 0.80, alternative='two-sided'),
                color='green', linestyle='--', linewidth=2, alpha=0.5)
axes[1].scatter([0.80], 
               [tt_ind_solve_power(0.5, None, 0.05, 0.80, alternative='two-sided')],
               s=100, color='red', zorder=5)
axes[1].set_xlabel('Power (1 - β)', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Sample Size per Group', fontsize=11, fontweight='bold')
axes[1].set_title('Sample Size vs. Power\n(Effect Size = 0.5, α = 0.05)',
                  fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].legend()

plt.tight_layout()
plt.show()

print("Key Insights:")
print("- Smaller effects require much larger sample sizes")
print("- Higher power requirements increase sample size")
print("- Sample size scales with 1/d² (very nonlinear)")

### Understanding Type I and Type II Errors

In hypothesis testing, two types of mistakes are possible:

```
                 H₀ is TRUE    H₀ is FALSE
Reject H₀        Type I Error  ✓ Correct
                 (α)           (Power = 1-β)

Fail to Reject   ✓ Correct     Type II Error
H₀               (1-α)         (β)
```

**Type I Error (α = "False Positive")**:
- We claim an effect exists when it actually doesn't
- Example: "This drug works" when it doesn't
- Probability: α (usually set to 0.05)
- Medical analogy: False diagnosis of illness

**Type II Error (β = "False Negative")**:
- We fail to detect an effect that actually exists
- Example: "This drug doesn't work" when it does
- Probability: β (we set power = 1 - β, usually 0.80)
- Medical analogy: Missing real illness

### The Trade-off

- **Lower α** (stricter criteria) → fewer false positives, but more false negatives
- **Higher power** (detect effects) → requires larger sample size
- In many fields, Type I errors are considered worse than Type II
  - But in medical screening, missing disease (Type II) is often worse
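
Both error rates can be estimated by simulation. The sketch below repeats a two-sample t-test many times, once with no true effect and once with a true effect of d = 0.5; the number of simulations, the group size of 64, and the helper name `rejection_rate` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_per_group, alpha = 2000, 64, 0.05

def rejection_rate(true_effect):
    """Fraction of simulated experiments in which H0 is rejected."""
    control = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    treatment = rng.normal(true_effect, 1.0, size=(n_sims, n_per_group))
    _, p = ttest_ind(control, treatment, axis=1)  # one test per simulated experiment
    return np.mean(p < alpha)

type1_rate = rejection_rate(true_effect=0.0)  # H0 true: every rejection is a false positive
power_est = rejection_rate(true_effect=0.5)   # H0 false: every rejection is a correct detection

print(f"Estimated Type I error rate:  {type1_rate:.3f}")
print(f"Estimated power:              {power_est:.3f}")
print(f"Estimated Type II error rate: {1 - power_est:.3f}")
```

Rerunning with a smaller `alpha` should lower the first number and raise the last, which is the trade-off described above.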

### Example: Drug Approval

- **H₀**: Drug is ineffective
- **H₁**: Drug is effective
- **Type I Error**: Approve ineffective drug (patient harm)
- **Type II Error**: Reject effective drug (patients don't get help)
- **Solution**: Require strong evidence (low α = 0.01), but also high power (0.90+)

In [None]:
# Example 9: Type I and Type II Error Visualization

# Visualize Type I and Type II errors
from scipy.stats import norm

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distributions under H₀ and H₁
x = np.linspace(-4, 6, 1000)
h0_dist = norm(loc=0, scale=1)  # Null hypothesis distribution
h1_dist = norm(loc=2, scale=1)  # Alternative hypothesis (effect size d=2)

alpha = 0.05
critical_value = norm.ppf(1 - alpha/2)  # Two-tailed

# Left plot: Type I Error
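axes[0].plot(x, h0_dist.pdf(x), 'b-', linewidth=2, label='Distribution under H₀')
# Two-tailed test: the rejection region is split across both tails (α/2 each)
reject_upper = x > critical_value
reject_lower = x < -critical_value
axes[0].fill_between(x[reject_upper], 0, h0_dist.pdf(x[reject_upper]),
                      alpha=0.3, color='red', label=f'Type I Error (α = {alpha}, both tails)')
axes[0].fill_between(x[reject_lower], 0, h0_dist.pdf(x[reject_lower]),
                      alpha=0.3, color='red')
axes[0].axvline(critical_value, color='red', linestyle='--', linewidth=2)
axes[0].axvline(-critical_value, color='red', linestyle='--', linewidth=2)
axes[0].set_xlabel('Test Statistic', fontsize=11)
axes[0].set_ylabel('Probability Density', fontsize=11)
axes[0].set_title('Type I Error: False Positive\n(Reject H₀ when it\'s true)', 
                  fontsize=12, fontweight='bold')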
axes[0].legend(loc='upper right')
axes[0].set_ylim([0, 0.5])

# Right plot: Type II Error and Power
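axes[1].plot(x, h0_dist.pdf(x), 'b-', linewidth=2, label='Under H₀ (no effect)')
axes[1].plot(x, h1_dist.pdf(x), 'g-', linewidth=2, label='Under H₁ (effect exists)')

# Type II Error region: statistic falls between the two critical values
accept = np.abs(x) < critical_value
axes[1].fill_between(x[accept], 0, h1_dist.pdf(x[accept]),
                      alpha=0.3, color='orange', label='Type II Error (β)')

# Power region: statistic falls beyond either critical value
# (the lower-tail contribution is negligible for this effect size)
axes[1].fill_between(x[x > critical_value], 0, h1_dist.pdf(x[x > critical_value]),
                      alpha=0.3, color='green', label='Power (1 - β)')
axes[1].fill_between(x[x < -critical_value], 0, h1_dist.pdf(x[x < -critical_value]),
                      alpha=0.3, color='green')

axes[1].axvline(critical_value, color='red', linestyle='--', linewidth=2)
axes[1].axvline(-critical_value, color='red', linestyle='--', linewidth=2)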
axes[1].set_xlabel('Test Statistic', fontsize=11)
axes[1].set_ylabel('Probability Density', fontsize=11)
axes[1].set_title('Type II Error vs. Power\n(With true effect present)',
                  fontsize=12, fontweight='bold')
axes[1].legend(loc='upper right')
axes[1].set_ylim([0, 0.5])

plt.tight_layout()
plt.show()

# Calculate actual error rates for this example
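type1_rate = alpha
# β = P(fail to reject | H₁): statistic lands between the two critical values
type2_rate = h1_dist.cdf(critical_value) - h1_dist.cdf(-critical_value)
power = 1 - type2_rate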

print("ERROR RATE SUMMARY (for effect size d=2)")
print("="*50)
print(f"Type I Error Rate (α):        {type1_rate:.1%}")
print(f"Type II Error Rate (β):       {type2_rate:.1%}")
print(f"Power (1 - β):                {power:.1%}")
print()
print("Interpretation:")
print(f"- 5% chance of false positive")
print(f"- {type2_rate:.1%} chance of false negative")
print(f"- {power:.1%} chance of detecting the effect if it exists")

## 5. Multiple Comparisons Problem

### The Issue: Testing Many Hypotheses

Imagine you test 20 different hypotheses, all with α = 0.05:

```
Probability of at least one false positive = 1 - (0.95)^20 = 0.64

64% chance you'll find a "significant" result even if there are
NO true effects at all!
```
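
The calculation above generalizes to any number of tests. The short sketch below evaluates the same formula, 1 - (1 - α)^m, for a few values of m, assuming the tests are independent; the dictionary name `fwer_by_m` is chosen only for this example.

```python
alpha = 0.05

# Family-wise error rate for m independent tests, each at level alpha
fwer_by_m = {m: 1 - (1 - alpha) ** m for m in [1, 5, 10, 20, 50]}

for m, fwer in fwer_by_m.items():
    print(f"m = {m:2d} tests -> P(at least one false positive) = {fwer:.2f}")
```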

This is the **multiple comparisons problem**.

### When Do You Have Multiple Comparisons?

- ✓ Testing many features in a model (feature selection)
- ✓ Comparing multiple groups (ANOVA followed by pairwise tests)
- ✓ Testing many outcome variables
- ✓ Testing at multiple time points
- ✓ Exploratory analysis with many hypotheses

### Solutions: Correction Methods

| Method | How It Works | When to Use | Trade-off |
|--------|-------------|-------------|----------|
| **Bonferroni** | Divide α by m: α_adj = α/m | Few comparisons (m < 20) | Very conservative, loses power |
| **Holm-Bonferroni** | Stepwise Bonferroni: compare the k-th smallest p-value to α/(m − k + 1) | Few comparisons | Less conservative than Bonferroni |
| **False Discovery Rate (FDR)** | Control the expected proportion of false positives among significant results | Many comparisons | More powerful than Bonferroni, but tolerates some false positives |
| **None** | Report with multiplicity warning | Pre-specified hypothesis | Only if justified |

### Bonferroni Correction

**Simple rule**: Divide your significance level by the number of tests

```
Instead of:  α = 0.05
Use:         α_adjusted = 0.05 / m, where m = number of tests

Example: 10 tests
α_adjusted = 0.05 / 10 = 0.005

Only reject H₀ if p < 0.005 (much stricter)
```

**Advantage**: Simple, easy to explain
**Disadvantage**: Very conservative (loses power to detect real effects)
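An equivalent way to apply Bonferroni is to multiply each p-value by m and compare the result against the original α. The sketch below does this in plain Python and, for contrast, computes Holm-Bonferroni adjusted p-values as well. The function names and the five p-values are hypothetical, chosen only for illustration.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm_adjust(p_values):
    """Holm-adjusted p-values: step-down multipliers m, m-1, ..., 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


p_values = [0.001, 0.012, 0.02, 0.04, 0.30]  # hypothetical raw p-values
alpha = 0.05

bonf = bonferroni_adjust(p_values)
holm = holm_adjust(p_values)

for p, b, h in zip(p_values, bonf, holm):
    print(f"raw = {p:.3f}   Bonferroni = {b:.3f}   Holm = {h:.3f}")

print("Rejected (Bonferroni):", sum(b < alpha for b in bonf))
print("Rejected (Holm):      ", sum(h < alpha for h in holm))
```

In this example Holm rejects one more hypothesis than Bonferroni. Both methods control the family-wise error rate, and Holm never rejects fewer hypotheses than Bonferroni.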

### False Discovery Rate (FDR)

**Idea**: Instead of controlling the probability of even one false positive (the family-wise error rate), control the expected **proportion** of false positives among all significant findings.

**Example**:
- Test 100 features
- Find 20 significant at FDR q = 0.05
- Expected: ~1 false positive among the 20 (5%)

**Advantage**: More powerful than Bonferroni
**Disadvantage**: Slightly more complex to interpret
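The most common FDR method is the Benjamini-Hochberg step-up procedure: sort the p-values, find the largest rank k whose p-value satisfies p(k) ≤ (k/m)·q, and reject the k smallest. The following is a plain-Python sketch of that procedure; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where H0 is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first

    # Find the largest 1-based rank k with p_(k) <= (k / m) * q
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            max_k = rank

    # Reject every hypothesis ranked at or below max_k
    reject = [False] * m
    for i in order[:max_k]:
        reject[i] = True
    return reject


p_values = [0.035, 0.60, 0.004, 0.028, 0.025]  # hypothetical raw p-values
reject = benjamini_hochberg(p_values, q=0.05)

for p, r in zip(p_values, reject):
    print(f"p = {p:.3f}  ->  {'reject H0' if r else 'fail to reject'}")
```

Note the step-up behaviour: p = 0.025 exceeds its own threshold (2/5 × 0.05 = 0.02) but is still rejected, because a larger p-value further down the ranking (0.035 ≤ 4/5 × 0.05) passes. Bonferroni at 0.05/5 = 0.01 would reject only p = 0.004.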

In [None]:
# Example 10: Multiple Comparisons Problem

print("MULTIPLE COMPARISONS PROBLEM DEMONSTRATION")
print("="*60)

# Simulate testing 20 features on random data (NO TRUE EFFECTS)
np.random.seed(42)
n_tests = 20
n_features = 20
n_samples = 100

# Create random features and random target (no correlation)
X_random = np.random.randn(n_samples, n_features)
y_random = np.random.randn(n_samples)

# Test each feature for correlation
p_values = []
for i in range(n_features):
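    # Two-sided t-test for a Pearson correlation (df = n - 2)
    r = np.corrcoef(X_random[:, i], y_random)[0, 1]
    t_stat = r * np.sqrt(n_samples - 2) / np.sqrt(1 - r**2)
    # Survival function avoids precision loss from 1 - cdf for small p-values
    p_value = 2 * stats.t.sf(abs(t_stat), n_samples - 2)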
    p_values.append(p_value)

p_values = np.array(p_values)

# Count false positives at α=0.05 (without correction)
false_positives_uncorrected = (p_values < 0.05).sum()

print(f"Testing {n_features} features on random data (NO true effects)")
print()
print(f"Uncorrected α = 0.05:")
print(f"  Number of 'significant' results: {false_positives_uncorrected} out of {n_features}")
print(f"  {false_positives_uncorrected/n_features*100:.1f}% false positive rate")
print()
print(f"Expected false positives (random chance): {n_features * 0.05:.1f}")
print(f"Theoretical risk of ≥1 false positive: {(1 - (1-0.05)**n_features)*100:.1f}%")

# Apply Bonferroni correction
bonferroni_alpha = 0.05 / n_features
false_positives_bonf = (p_values < bonferroni_alpha).sum()

print(f"\nBonferroni corrected α = 0.05/{n_features} = {bonferroni_alpha:.4f}:")
print(f"  Number of 'significant' results: {false_positives_bonf} out of {n_features}")
print()

# Apply FDR correction (Benjamini-Hochberg)
reject, p_adjust_fdr, _, _ = multipletests(p_values, method='fdr_bh', alpha=0.05)
false_positives_fdr = reject.sum()

print(f"FDR (Benjamini-Hochberg) with q = 0.05:")
print(f"  Number of 'significant' results: {false_positives_fdr} out of {n_features}")
print()

print("COMPARISON:")
print("="*60)
print(f"No correction:      {false_positives_uncorrected} false positives (danger!)")
print(f"Bonferroni:         {false_positives_bonf} false positives (safe, but conservative)")
print(f"FDR:                {false_positives_fdr} false positives (balanced)")

In [None]:
# Example 11: Applying Multiple Comparison Corrections

print("PRACTICAL EXAMPLE: Feature Selection with Multiple Comparisons")
print("="*60)

# Create realistic data: some features have real effect, most don't
np.random.seed(42)
n_samples = 200
n_features = 30

X = np.random.randn(n_samples, n_features)

# Only first 3 features have real effect
y = (X[:, 0] + 0.5*X[:, 1] - 0.7*X[:, 2] + 0.3*np.random.randn(n_samples)) > 0

# Calculate p-values for each feature
p_values = []
correlations = []

for i in range(n_features):
    # T-test: feature vs. outcome
    group_0 = X[y == 0, i]
    group_1 = X[y == 1, i]
    _, p_val = ttest_ind(group_0, group_1)
    p_values.append(p_val)
    correlations.append(np.corrcoef(X[:, i], y)[0, 1])

p_values = np.array(p_values)
correlations = np.array(correlations)

# Sort by p-value
sorted_idx = np.argsort(p_values)
p_sorted = p_values[sorted_idx]
feature_idx_sorted = sorted_idx[:10]  # Show top 10

print("Top 10 Most Significant Features:")
print()
print("Feature  Raw p-value  Bonferroni  FDR      Significant?")
print("-" * 60)

# Apply corrections
bonf_threshold = 0.05 / n_features
reject_fdr, p_fdr, _, _ = multipletests(p_values, method='fdr_bh', alpha=0.05)

for rank, feat_idx in enumerate(feature_idx_sorted, 1):
    p_raw = p_values[feat_idx]
    significant_raw = "✓" if p_raw < 0.05 else ""
    significant_bonf = "✓" if p_raw < bonf_threshold else ""
    significant_fdr = "✓" if reject_fdr[feat_idx] else ""
    
    print(f"{feat_idx:3d}      {p_raw:.4f}      {significant_bonf}         {significant_fdr}      "
          f"Real effect: {feat_idx < 3}")

print()
print(f"Bonferroni threshold (α/m): {bonf_threshold:.4f}")
print()
print("INTERPRETATION:")
# Count over all features, not just the 10 displayed, and avoid hardcoded totals
print(f"- Without correction: {(p_values < 0.05).sum()} features appear 'significant'")
print(f"- Bonferroni: {(p_values < bonf_threshold).sum()} remain significant (very strict)")
print(f"- FDR: {reject_fdr.sum()} remain significant (balanced)")
print()
print("Best practice: Use FDR for exploratory analysis, Bonferroni for")
print("few pre-specified hypotheses.")

## 6. Summary: Putting It All Together

### The Complete Validation Pipeline

```
1. SPLIT DATA
   ├── Train (70-80%)
   ├── Validation (10-15%)
   └── Test (10-15%)

2. CROSS-VALIDATE
   ├── K-fold CV on training data
   ├── Estimate performance ± confidence interval
   └── Tune hyperparameters on CV scores

3. HYPOTHESIS TEST
   ├── State H₀ and H₁
   ├── Choose appropriate test
   ├── Calculate exact p-value
   ├── Report effect size and confidence interval
   └── Interpret in context

4. ACCOUNT FOR MULTIPLICITY
   ├── Count number of tests
   ├── Apply appropriate correction (Bonferroni or FDR)
   └── Report adjusted p-values

5. FINAL EVALUATION
   ├── Fit final model on training + validation
   ├── Report performance on test set (ONCE)
   └── Don't revisit test set
```
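Steps 1, 2, and 5 of this pipeline can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not a definitive implementation: the data are synthetic, cross-validation on the development set stands in for a separate validation split, and the 85/15 split and model choice are arbitrary. Steps 3 and 4 are omitted here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Step 1: hold out the test set first and leave it untouched
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

# Step 2: cross-validate on the development data only.
# The pipeline refits the scaler inside each fold, so no fold sees
# preprocessing statistics computed from its own held-out portion.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_dev, y_dev, cv=5, scoring="accuracy")
cv_mean = cv_scores.mean()
cv_se = cv_scores.std(ddof=1) / len(cv_scores) ** 0.5
print(f"CV accuracy: {cv_mean:.3f} ± {cv_se:.3f} (SE over {len(cv_scores)} folds)")

# Step 5: refit on all development data, then score the test set once
model.fit(X_dev, y_dev)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy (reported once): {test_accuracy:.3f}")
```

Putting the scaler inside the pipeline is the design choice that matters most here: it makes the "fit preprocessing only on training data" rule automatic rather than something to remember at each step.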

### Key Principles

✅ **Data integrity**: Separate train/val/test, fit preprocessing only on training data

✅ **Robust estimates**: Use cross-validation to get mean ± SE, not point estimates

✅ **Statistical rigor**: Report exact p-values, effect sizes, confidence intervals

✅ **Multiplicity awareness**: Apply corrections when testing multiple hypotheses

✅ **Planning**: Use power analysis to determine sample size A PRIORI

✅ **Reproducibility**: Set random seeds, document decisions, make code transparent
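As a small worked example of the "mean ± SE" principle, the sketch below turns a set of fold scores into a mean, standard error, and approximate 95% confidence interval. The five scores are hypothetical, and the interval treats folds as independent, which is only approximately true because folds share training data.

```python
import math
import statistics

fold_scores = [0.81, 0.84, 0.79, 0.86, 0.82]  # hypothetical 5-fold CV accuracies
k = len(fold_scores)

mean_score = statistics.mean(fold_scores)
std_error = statistics.stdev(fold_scores) / math.sqrt(k)  # sample SD / sqrt(k)

# Two-sided 95% critical value of Student's t with k - 1 = 4 degrees of freedom.
# With SciPy available, scipy.stats.t.ppf(0.975, k - 1) gives the same value.
t_crit = 2.776

ci_low = mean_score - t_crit * std_error
ci_high = mean_score + t_crit * std_error

print(f"Mean accuracy: {mean_score:.3f}")
print(f"Standard error: {std_error:.3f}")
print(f"Approx. 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```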

### Common Mistakes to Avoid

❌ Using test set performance to choose model or tune hyperparameters

❌ Running many tests and only reporting significant ones (p-hacking)

❌ Ignoring multiple comparisons and using α = 0.05 for 20 tests

❌ Fitting preprocessing on entire dataset then splitting

❌ Using single train/test split as final performance estimate

❌ Reporting only p-values without effect sizes or confidence intervals

❌ Claiming causation from correlational analyses

## Checklist: Statistical Validation Best Practices

Before finalizing your analysis, confirm:

### Data Management
- [ ] Data split into train/validation/test sets (e.g., 70/15/15)
- [ ] Test set untouched during model development
- [ ] Preprocessing fit only on training data
- [ ] Temporal ordering respected (if time-series)
- [ ] Class balance maintained across sets (if classification)

### Cross-Validation
- [ ] K-fold CV used to estimate performance
- [ ] Standard error calculated from fold scores
- [ ] 95% confidence interval reported
- [ ] Hyperparameters tuned using CV, not test set

### Hypothesis Testing
- [ ] Hypotheses stated before seeing data
- [ ] Appropriate test selected for data type
- [ ] Exact p-value reported (not p < 0.05)
- [ ] Effect size calculated
- [ ] 95% confidence interval provided

### Multiple Comparisons
- [ ] Number of tests identified
- [ ] Appropriate correction applied (Bonferroni for few, FDR for many)
- [ ] Adjusted p-values reported

### Final Reporting
- [ ] Power analysis documented (if applicable)
- [ ] Sample size justified
- [ ] Random seeds set for reproducibility
- [ ] All decisions documented
- [ ] Results communicate uncertainty (not just point estimates)

## Self-Assessment

Before moving to Module 06, ensure you can:

- [ ] Explain why separate test sets are necessary
- [ ] Implement proper train/validation/test split
- [ ] Identify and prevent data leakage
- [ ] Use k-fold cross-validation correctly
- [ ] Calculate standard error from CV folds
- [ ] State null and alternative hypotheses formally
- [ ] Choose appropriate statistical test
- [ ] Interpret p-values correctly
- [ ] Calculate and interpret effect sizes
- [ ] Compute confidence intervals
- [ ] Conduct a priori power analysis
- [ ] Distinguish Type I and Type II errors
- [ ] Apply Bonferroni and FDR corrections
- [ ] Explain when multiple comparisons corrections are needed

If you can confidently check all boxes, you're ready for Module 06: Causal Inference Fundamentals! 🎉

## Additional Resources

### Books
- "Statistical Rethinking" by Richard McElreath (Bayesian approach)
- "The Book of Why" by Judea Pearl (causal inference)
- "An Introduction to Statistical Learning" by James et al. (practical ML)

### Online Courses
- MIT OpenCourseWare: "Statistical Method in Biology"
- Coursera: "Statistics with Python" specialization
- EdX: "Causal Inference" bootcamp

### Papers & Guidelines
- "The ASA Statement on p-Values" (American Statistical Association)
- "Controlling the False Discovery Rate" (Benjamini & Hochberg, 1995)
- NeurIPS Reproducibility Checklist

### Software & Tools
- `scipy.stats`: Statistical tests
- `statsmodels`: Power analysis and advanced statistics
- `scikit-learn`: Cross-validation and model selection