# Investigation 1: Find the 12.1 Point Gap

**Mystery**: Why does the ablation study report 65.2% accuracy for the Full Week 10+ configuration on 2024 data, while actual Week 10-14 production reached only 53.1% on 2025 data?

**Gap**: 12.1 percentage points

**Hypotheses**:
1. Different feature engineering implementations
2. Different data (2024 vs 2025)
3. Different RFE feature selection
4. Implementation bugs in Week 10 production

**Approach**:
1. Extract feature engineering code from Week10/Model.ipynb
2. Compare side-by-side with ablation study
3. Run Week 10 exact code on 2024 data
4. Compare results to ablation study's 65.2%

## Setup

In [None]:
import pandas as pd
import numpy as np
import json
from pathlib import Path

print("✅ Libraries loaded")

## Step 1: Load and Compare Week 10 Model Code

In [None]:
# Load Week 10 Model.ipynb
week10_path = Path('/Users/akulaggarwal/Desktop/NFL Performance Prediction/Week10/Model.ipynb')

if week10_path.exists():
    # Notebooks are UTF-8 JSON; set the encoding explicitly
    with open(week10_path, 'r', encoding='utf-8') as f:
        week10_nb = json.load(f)
    
    print("✅ Loaded Week 10 Model.ipynb")
    print(f"   Total cells: {len(week10_nb['cells'])}")
else:
    print("❌ Week 10 Model.ipynb not found")
    week10_nb = None

## Step 2: Extract Feature Engineering Code from Week 10

In [None]:
if week10_nb:
    # Find cells containing feature engineering
    # Key feature engineering patterns to look for
    keywords = [
        'def create_team_features',
        'def create_game_features',
        'momentum_last3',
        'injury_pct',
        'defensive_ypg',
        'vegas_spread',
        'temporal',
        'def create_ensemble_model'
    ]
    
    # Record the first matching keyword and a 500-character preview per code cell
    feature_cells = []
    for i, cell in enumerate(week10_nb['cells']):
        if cell['cell_type'] != 'code':
            continue
        source = ''.join(cell['source'])
        for keyword in keywords:
            if keyword in source:
                feature_cells.append((i, keyword, source[:500]))
                break
    
    print(f"Found {len(feature_cells)} cells with feature engineering code:")
    for cell_idx, keyword, preview in feature_cells:
        print(f"\nCell {cell_idx}: Contains '{keyword}'")
        print(f"Preview: {preview}...")
else:
    print("Cannot proceed without Week 10 notebook")
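
Keyword matching only surfaces the first 500 characters of a matching cell. As a possible refinement, the sketch below pulls the complete source of a named function out of a notebook's code cells with the standard-library `ast` module. The helper name `find_function_source` and the toy `demo_cells` list are illustrative assumptions, not part of the Week 10 notebook. Cells that fail to parse (for example, those containing IPython magics) are skipped.

```python
import ast

def find_function_source(cells, func_name):
    """Return the source of the first function named func_name, or None."""
    for cell in cells:
        if cell.get('cell_type') != 'code':
            continue
        source = ''.join(cell.get('source', []))
        try:
            tree = ast.parse(source)
        except SyntaxError:
            # Cells with IPython magics (%, !) are not valid Python; skip them
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and node.name == func_name:
                return ast.get_source_segment(source, node)
    return None

# Toy cells standing in for week10_nb['cells'] (hypothetical content)
demo_cells = [
    {'cell_type': 'markdown', 'source': ['# Notes']},
    {'cell_type': 'code', 'source': ['%matplotlib inline\n', 'x = 1\n']},
    {'cell_type': 'code', 'source': ['def create_team_features(df):\n', '    return df\n']},
]

demo_source = find_function_source(demo_cells, 'create_team_features')
print(demo_source)
```

With the real notebook loaded, the call would be `find_function_source(week10_nb['cells'], 'create_team_features')`, assuming that function is defined at parse time in one of the cells.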

## Step 3: Key Comparison Points

Let's extract specific implementations to compare:

In [None]:
if week10_nb:
    print("="*80)
    print("COMPARISON: Ablation Study vs Week 10 Production")
    print("="*80)
    
    # Extract specific code snippets for comparison
    for i, cell in enumerate(week10_nb['cells']):
        if cell['cell_type'] == 'code':
            source = ''.join(cell['source'])
            
            # Check momentum implementation
            if 'momentum_last3' in source and 'def ' in source:
                print("\n" + "="*80)
                print("MOMENTUM IMPLEMENTATION (Week 10 Production):")
                print("="*80)
                print(source[:1000])
                print("\n" + "="*80)
                print("MOMENTUM IMPLEMENTATION (Ablation Study):")
                print("="*80)
                print("""
def add_momentum_features(features, weekly_data, team, season, week):
    team_stats = weekly_data[
        (weekly_data['recent_team'] == team) &
        (weekly_data['season'] == season) &
        (weekly_data['week'] < week)
    ]
    
    if len(team_stats) >= 3:
        last_3_weeks = sorted(team_stats['week'].unique())[-3:]
        last_3_stats = team_stats[team_stats['week'].isin(last_3_weeks)]
        # Proxy for recent form: mean weekly fantasy points scaled by 30 (not an actual win rate)
        features['momentum_last3'] = last_3_stats.groupby('week')['fantasy_points'].sum().mean() / 30.0
    else:
        features['momentum_last3'] = 0.5
    return features
                """)
                print("\n🔍 KEY DIFFERENCE CHECK: Compare the formulas above")
            
            # Check defensive stats implementation
            if 'defensive_ypg' in source and 'def ' in source:
                print("\n" + "="*80)
                print("DEFENSIVE STATS IMPLEMENTATION (Week 10 Production):")
                print("="*80)
                print(source[:1000])
                print("\n" + "="*80)
                print("DEFENSIVE STATS IMPLEMENTATION (Ablation Study):")
                print("="*80)
                print("""
def add_defensive_features(features, pbp_data, team, season, week):
    defensive_plays = pbp_data[
        (pbp_data['defteam'] == team) &
        (pbp_data['season'] == season) &
        (pbp_data['week'] < week)
    ]
    
    if len(defensive_plays) > 0:
        weeks = defensive_plays['week'].nunique()
        features['defensive_ypg'] = defensive_plays['yards_gained'].sum() / weeks if weeks > 0 else 0
        # Estimate points allowed from EPA
        features['defensive_ppg'] = defensive_plays['epa'].sum() * 6 / weeks if weeks > 0 else 0  # <-- BUG: Why × 6?
    else:
        features['defensive_ypg'] = 0
        features['defensive_ppg'] = 0
    return features
                """)
                print("\n🔍 KEY DIFFERENCE CHECK: Look for '× 6' or 'multiply by 6' bug")
            
            # Check temporal weighting
            if 'temporal' in source.lower() and 'weight' in source.lower():
                print("\n" + "="*80)
                print("TEMPORAL WEIGHTING (Week 10 Production):")
                print("="*80)
                print(source[:800])
else:
    print("Cannot proceed without Week 10 notebook")
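
Reading two printed code blocks by eye makes small formula differences easy to miss. One option, sketched below, is a line-level diff with the standard-library `difflib` module. The two source strings are invented toy snippets for illustration only; they are not the actual Week 10 or ablation study code, and `diff_sources` is a hypothetical helper name.

```python
import difflib

def diff_sources(production_src, study_src):
    """Return unified-diff lines between two code snippets."""
    prod_lines = production_src.strip().splitlines()
    study_lines = study_src.strip().splitlines()
    return list(difflib.unified_diff(
        prod_lines, study_lines,
        fromfile='week10_production', tofile='ablation_study',
        lineterm='',
    ))

# Toy snippets (hypothetical): a wins-based formula vs a points-based proxy
production_demo = """
def momentum(wins):
    return sum(wins[-3:]) / 3
"""
study_demo = """
def momentum(points):
    return sum(points[-3:]) / 3 / 30.0
"""

diff_lines = diff_sources(production_demo, study_demo)
print('\n'.join(diff_lines))
```

Lines prefixed with `-` appear only in the production snippet and lines prefixed with `+` only in the study snippet, so any formula change shows up as a paired `-`/`+` entry.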

## Step 4: Check Model Architecture Differences

In [None]:
if week10_nb:
    print("="*80)
    print("MODEL ARCHITECTURE COMPARISON")
    print("="*80)
    
    for i, cell in enumerate(week10_nb['cells']):
        if cell['cell_type'] == 'code':
            source = ''.join(cell['source'])
            
            # Check ensemble configuration
            if 'RandomForestClassifier' in source and 'max_depth' in source:
                print("\nWeek 10 Ensemble Configuration:")
                print(source[:600])
                
            if 'GradientBoostingClassifier' in source:
                print("\n✅ Found 4th model (Gradient Boosting) in Week 10")
                
    print("\n" + "="*80)
    print("Ablation Study Configuration:")
    print("="*80)
    print("""
Full Week 10+ Config:
- RandomForestClassifier(n_estimators=200, max_depth=15)
- LogisticRegression(C=1.0, max_iter=1000)
- XGBClassifier(max_depth=8, learning_rate=0.1)
- GradientBoostingClassifier(max_depth=8, learning_rate=0.1)  # 4th model
- CalibratedClassifierCV(method='isotonic', cv=3)
- Temporal weighting: exp(-0.15 × years_ago)
    """)

## Step 5: Data Comparison (2024 vs 2025)

In [None]:
print("="*80)
print("DATA COMPARISON")
print("="*80)

print("""
Ablation Study:
- Training: 2015-2023 (9 years)
- Testing: 2024 Weeks 1-18 (256 games)
- Walk-forward validation: Each week trained on all previous data
- Result: 65.2% accuracy

Week 10-14 Production:
- Training: 2015-2024 (10 years) 
- Testing: 2025 Weeks 10-14 (~65 games)
- Data availability: 2025 may have incomplete/different data
- Result: 53.1% accuracy

KEY DIFFERENCES:
1. Test year: 2024 (complete) vs 2025 (partial, Weeks 10-14 only)
2. Training data: Excludes 2024 vs Includes 2024
3. Sample size: 256 games vs ~65 games
4. Data quality: 2024 finalized vs 2025 may have updates/corrections
""")

print("\n🔍 ACTION ITEM: Run Week 10 exact code on 2024 data to isolate data vs implementation issues")
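
Because the two test sets differ in size (256 games vs ~65 games), it may help to quantify how much of the gap could be sampling noise before attributing it to bugs. The sketch below computes 95% Wilson score intervals for both accuracies. It assumes each game is an independent trial and uses 65 as the production game count, which the comparison above gives only approximately; `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion observed over n trials."""
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(
        accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)
    ) / denom
    return center - half_width, center + half_width

# Accuracies and sample sizes from the comparison above (n=65 is approximate)
study_low, study_high = wilson_interval(0.652, 256)
prod_low, prod_high = wilson_interval(0.531, 65)

print(f"Ablation study: {study_low:.3f} to {study_high:.3f}")
print(f"Production:     {prod_low:.3f} to {prod_high:.3f}")
print(f"Intervals overlap: {prod_high > study_low}")
```

If the intervals overlap, part of the 12.1-point gap may be explained by the small production sample alone, which would lower the amount that implementation differences need to account for.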

## Step 6: Hypothesis Testing Plan

Based on what we find above, create specific tests:

In [None]:
print("="*80)
print("HYPOTHESIS TESTING PLAN")
print("="*80)

hypotheses = [
    {
        'hypothesis': 'Different momentum calculation',
        'test': 'Compare momentum formula: Production uses wins, Study uses fantasy points proxy',
        'expected_impact': '±5 percentage points',
        'how_to_test': 'Run both formulas on same 2024 data, measure delta'
    },
    {
        'hypothesis': 'Different defensive stats (× 6 bug)',
        'test': 'Check if Week 10 has × 6 bug',
        'expected_impact': '±4 percentage points',
        'how_to_test': 'Search Week 10 code for defensive_ppg formula'
    },
    {
        'hypothesis': '2024 vs 2025 data differences',
        'test': 'Run Week 10 code on 2024 instead of 2025',
        'expected_impact': '±8 percentage points',
        'how_to_test': 'Modify Week 10 to test on 2024, compare to 65.2%'
    },
    {
        'hypothesis': 'RFE selected different features',
        'test': 'Compare RFE output from Week 10 vs ablation study',
        'expected_impact': '±3 percentage points',
        'how_to_test': 'Print selected features from both, check overlap'
    },
    {
        'hypothesis': 'Temporal weighting not actually applied in Week 10',
        'test': 'Verify sample_weight parameter actually used',
        'expected_impact': '±0.4 percentage points',
        'how_to_test': 'Check if .fit() call includes sample_weight'
    }
]

for i, h in enumerate(hypotheses, 1):
    print(f"\nHypothesis {i}: {h['hypothesis']}")
    print(f"  Test: {h['test']}")
    print(f"  Expected impact: {h['expected_impact']}")
    print(f"  How to test: {h['how_to_test']}")
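
The last two hypotheses in the plan reduce to small checks, sketched here under stated assumptions. `feature_overlap` compares two RFE-selected feature lists, and `temporal_weights` reproduces the `exp(-0.15 × years_ago)` weighting from the ablation configuration. Both helper names are hypothetical, and the example feature lists are invented from feature names mentioned earlier; the real lists have to come from each pipeline's RFE output.

```python
import math

def feature_overlap(production_features, study_features):
    """Summarize which selected features are shared or unique to each side."""
    prod, study = set(production_features), set(study_features)
    union = prod | study
    return {
        'shared': sorted(prod & study),
        'only_production': sorted(prod - study),
        'only_study': sorted(study - prod),
        'jaccard': len(prod & study) / len(union) if union else 1.0,
    }

def temporal_weights(seasons, current_season, decay=0.15):
    """Per-sample weight exp(-decay * years_ago) for each training season."""
    return [math.exp(-decay * (current_season - s)) for s in seasons]

# Invented example lists (not actual RFE output)
production_selected = ['momentum_last3', 'injury_pct', 'vegas_spread']
study_selected = ['momentum_last3', 'defensive_ypg', 'vegas_spread']
overlap = feature_overlap(production_selected, study_selected)
print(overlap)

# Older seasons receive smaller weights; the current season gets weight 1.0
seasons = [2015, 2020, 2023, 2024]
weights = temporal_weights(seasons, current_season=2024)
for season, weight in zip(seasons, weights):
    print(f"{season}: {weight:.3f}")
```

If the Week 10 notebook computes weights this way, they would need to be passed as `sample_weight` in each estimator's `fit` call, which is what hypothesis 5 asks to verify.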

## Step 7: Summary of Findings

After running the cells above, document the key differences found:

In [None]:
print("="*80)
print("INVESTIGATION 1 SUMMARY")
print("="*80)

print("""
Execute cells above to find:

1. MOMENTUM IMPLEMENTATION DIFFERENCES
   [ ] Production vs Study formulas match exactly
   [ ] Different calculation method found
   [ ] Missing/extra steps identified

2. DEFENSIVE STATS BUG
   [ ] Week 10 has × 6 bug
   [ ] Week 10 uses correct formula
   [ ] Different approach entirely

3. MODEL ARCHITECTURE
   [ ] Depths match (RF=15, XGB=8)
   [ ] 4th model (GB) present
   [ ] Temporal weighting applied
   [ ] Calibration method matches

4. DATA DIFFERENCES
   [ ] 2024 vs 2025 likely explains gap
   [ ] Implementation differences likely explain gap
   [ ] Combination of both

NEXT STEPS:
Based on findings above, proceed to:
- Investigation 2: Fix identified bugs
- Investigation 3: Validate on 2023 data
- Investigation 4: Test feature pairs
""")

## Action Items

After completing this notebook:

1. **Document differences found** in `investigation_1_findings.md`
2. **Create fix list** for Investigation 2
3. **Run Week 10 on 2024 data** to verify if implementation matches study
4. **Calculate expected accuracy delta** for each difference found
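
For the accuracy delta in item 4, one possible approach is to score two pipeline variants on the same set of games and report the difference in percentage points. The sketch below uses invented toy labels and predictions purely to show the arithmetic; `accuracy` and `accuracy_delta` are hypothetical helper names, and real inputs would be the per-game predictions from each variant.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy_delta(y_true, preds_a, preds_b):
    """Accuracy of variant B minus variant A, in percentage points."""
    return 100 * (accuracy(y_true, preds_b) - accuracy(y_true, preds_a))

# Toy data (invented): 1 = home win, 0 = away win
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
preds_production = [1, 0, 0, 1, 0, 0, 0, 1]
preds_with_fix = [1, 0, 1, 1, 0, 0, 0, 1]

delta = accuracy_delta(y_true, preds_production, preds_with_fix)
print(f"Accuracy delta: {delta:+.1f} percentage points")
```

Scoring both variants on identical games keeps the comparison paired, so the delta reflects the code change rather than a change in test data.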