# Evolver Loop 7 Analysis: Residual Pipeline Failure & Pivot Strategy

## Summary of Current State

- **Best CV Score**: 0.020470 (exp_000, baseline XGBoost)
- **Target Score**: 0.058410
- **Submissions Made**: 0/5
- **Experiments Completed**: 7

## The Residual Pipeline Catastrophe

The three-stage residual pipeline (Linear → NN → XGBoost) has **FAILED**:
- Linear only: 0.208762
- Linear + NN: 0.201961 (improvement: -0.006801)
- Linear + NN + XGBoost: 0.208070 (degradation: +0.006109 from NN stage)
- **Pipeline vs Baseline**: +0.187600 WORSE
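
The stage-to-stage deltas quoted above can be re-derived from the raw CV scores. The snippet below is a small sanity check, not part of the original pipeline; the names `stage_scores`, `baseline_cv`, and `stage_deltas` are illustrative.

```python
# CV scores copied from the list above (lower is better).
stage_scores = [
    ("Linear only", 0.208762),
    ("Linear + NN", 0.201961),
    ("Linear + NN + XGBoost", 0.208070),
]
baseline_cv = 0.020470  # exp_000, baseline XGBoost


def stage_deltas(scores):
    """Return (stage name, change vs previous stage); negative means improvement."""
    return [
        (name, round(score - prev_score, 6))
        for (_, prev_score), (name, score) in zip(scores, scores[1:])
    ]


deltas = stage_deltas(stage_scores)
gap_vs_baseline = round(stage_scores[-1][1] - baseline_cv, 6)

for name, delta in deltas:
    print(f"{name}: {delta:+.6f}")
print(f"Pipeline vs baseline: {gap_vs_baseline:+.6f}")
```

Keeping the deltas computed rather than typed by hand avoids sign and rounding slips when more stages are added.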

### Why It Failed

1. **Insufficient Linear Foundation**: Used default Ridge(alpha=1.0) without tuning
2. **Feature Poverty**: Only 8 basic features throughout pipeline
3. **Residuals Contain Only Noise**: After Linear+NN, no systematic signal remains
4. **Misapplied Technique**: Winners used residual modeling as roughly 10-20% of their solution, not as the entire solution

## What Winners Actually Did

From winning solution analysis:
- **7-12 diverse models** (XGBoost, CatBoost, LGBM, NN, Linear, KNN, etc.)
- **200-400 engineered features** (not 8!)
- **Target encoding** with proper cross-validation
- **Hill climbing ensemble** (not sequential pipelines)
- **Residual modeling as ONE component** among many

## The Path Forward

### Immediate Actions

1. **SUBMIT baseline XGBoost** to get LB feedback (calibrate CV-LB gap)
2. **ABANDON sequential pipeline** - return to direct modeling
3. **Implement comprehensive feature engineering** (50-100 features)
4. **Build diverse base models** for ensemble

### Key Techniques to Implement

1. **Target Encoding** (Critical!)
   - Sex encoding with smoothing
   - Cross-validation to prevent leakage
   - This was a KEY winner technique

2. **Advanced Feature Engineering**
   - Log1p transforms
   - Product features (all pairs)
   - Ratio features
   - Binned features
   - Groupby z-scores

3. **Model Diversity**
   - XGBoost with different feature sets
   - CatBoost with binned features
   - LightGBM
   - Neural Network
   - Linear with polynomial features

4. **Ensemble Methods**
   - Hill climbing (primary)
   - Ridge regression (secondary)
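
Hill climbing, as listed under Ensemble Methods, is commonly implemented as greedy ensemble selection with replacement over out-of-fold (OOF) predictions. The following is a minimal sketch under assumptions: the function names, the `max_steps` parameter, the synthetic data, and the use of plain RMSE as the score are all illustrative, and the competition metric should be substituted for `rmse`. It is not the winners' actual implementation.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error (placeholder for the competition metric)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def hill_climb(oof_preds, y_true, max_steps=50):
    """Greedy ensemble selection with replacement over OOF predictions.

    oof_preds maps a model name to its out-of-fold prediction array.
    Returns (weights, score), where weights maps model name to blend weight.
    """
    names = list(oof_preds)
    # Start from the single best model.
    best_name = min(names, key=lambda n: rmse(y_true, oof_preds[n]))
    counts = {n: 0 for n in names}
    counts[best_name] = 1
    blend_sum = np.asarray(oof_preds[best_name], dtype=float).copy()
    n_selected = 1
    best_score = rmse(y_true, blend_sum)

    for _ in range(max_steps):
        # Score the blend that results from adding one more copy of each model.
        trial_scores = {
            n: rmse(y_true, (blend_sum + oof_preds[n]) / (n_selected + 1))
            for n in names
        }
        candidate = min(trial_scores, key=trial_scores.get)
        if trial_scores[candidate] >= best_score:
            break  # no candidate improves the blend, so stop
        blend_sum += oof_preds[candidate]
        counts[candidate] += 1
        n_selected += 1
        best_score = trial_scores[candidate]

    weights = {n: c / n_selected for n, c in counts.items()}
    return weights, best_score


# Synthetic demo: two decent models and one noisy model.
rng = np.random.default_rng(0)
y = rng.normal(size=500)
oof = {
    "model_a": y + rng.normal(scale=0.5, size=500),
    "model_b": y + rng.normal(scale=0.5, size=500),
    "model_c": y + rng.normal(scale=2.0, size=500),
}
weights, blended_score = hill_climb(oof, y)
best_single = min(rmse(y, p) for p in oof.values())
print(weights, blended_score, best_single)
```

Because the search starts from the best single model and only accepts strict improvements, the blended OOF score can never be worse than the best single model's OOF score. The weights are still fit on OOF predictions, so the CV-LB gap should be checked before trusting them.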

## Analysis: Why Our CV is Too Good

Our baseline XGBoost achieves a CV of 0.020470, but winners needed 0.058-0.059. Since lower is better, our CV is implausibly far ahead of the winning scores. This suggests:
1. Synthetic data is too simple/easy
2. OR we're overfitting with product features
3. OR missing key complexity of real data

The residual pipeline failure suggests #2 or #3 - we need more sophisticated modeling, not just stacking weak models.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_log_error
import json

# Load session state
with open('/home/code/session_state.json', 'r') as f:
    session_state = json.load(f)

experiments = session_state['experiments']

# Create summary DataFrame
exp_summary = pd.DataFrame([
    {
        'exp_id': exp['id'],
        'model_type': exp['model_type'],
        'cv_score': exp['score'],
        'notes': exp['notes'][:200] + '...' if len(exp['notes']) > 200 else exp['notes']
    }
    for exp in experiments
])

print("Experiment Summary:")
print(exp_summary[['exp_id', 'model_type', 'cv_score']].to_string(index=False))

print(f"\n{'='*60}")
print("KEY INSIGHTS:")
print(f"{'='*60}")
# Look up the best experiment and per-experiment scores instead of hardcoding them
best = exp_summary.loc[exp_summary['cv_score'].idxmin()]
scores_by_id = exp_summary.set_index('exp_id')['cv_score']
print(f"1. Best CV: {best['cv_score']:.6f} ({best['exp_id']})")
print(f"2. Worst CV: {exp_summary['cv_score'].max():.6f}")
print(f"3. Residual pipeline degraded: {scores_by_id['exp_006']:.6f} → {scores_by_id['exp_007']:.6f}")
print(f"4. Target: 0.058410 (we're TOO GOOD - need to understand why)")
print(f"5. Submissions: 0/5 made (need LB feedback)")

## Feature Engineering Analysis

### Current Feature Counts:
- exp_000 (baseline): ~15 features (products, ratios, logs)
- exp_001-004: 18-23 features (added target encoding, binned)
- exp_005-007: 8 features (minimal for residual pipeline)

### What Winners Used:
- 200-400 engineered features
- All pairwise products
- Groupby statistics
- Target encoding
- Polynomial features

### Gap Analysis:
We're using 8-23 features while winners used 200-400. That is roughly a **9x difference at the narrowest (23 vs 200) and 50x at the widest (8 vs 400)**!
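
To give a sense of how quickly the feature count grows, here is a hedged sketch that expands 8 numeric base columns with log1p transforms, all pairwise products, and all ordered ratios. The column names `f0`..`f7`, the helper `expand_features`, and the `eps` guard are hypothetical, and the sketch assumes the columns are numeric and non-negative (a categorical column such as Sex would need encoding first).

```python
from itertools import combinations, permutations

import numpy as np
import pandas as pd


def expand_features(df, cols, eps=1e-6):
    """Add log1p, pairwise product, and ratio features for non-negative numeric columns."""
    new_cols = {}
    for c in cols:
        new_cols[f"log1p_{c}"] = np.log1p(df[c])
    for a, b in combinations(cols, 2):
        new_cols[f"{a}_x_{b}"] = df[a] * df[b]
    for a, b in permutations(cols, 2):
        # eps guards against division by zero
        new_cols[f"{a}_div_{b}"] = df[a] / (df[b] + eps)
    return pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)


rng = np.random.default_rng(0)
base_cols = [f"f{i}" for i in range(8)]  # hypothetical stand-ins for the 8 base features
demo = pd.DataFrame(rng.uniform(0.1, 10.0, size=(100, 8)), columns=base_cols)
expanded = expand_features(demo, base_cols)
print(expanded.shape)
```

With 8 base columns this yields 8 originals + 8 logs + 28 products + 56 ratios = 100 columns, so these three transform families alone reach the upper end of the 50-100 feature goal. Binned features and groupby z-scores would add more on top.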

## Next Steps Priority

1. **Submit exp_000** to get LB feedback
2. **Implement proper target encoding** (cross-validated)
3. **Create 50-100 features** for diverse models
4. **Train 5-7 diverse models** with different:
   - Algorithms (XGB, CatBoost, LGBM, NN, Linear)
   - Feature sets
   - Hyperparameters
5. **Implement hill climbing ensemble**
6. **Iterate based on CV-LB gap analysis**
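
For step 2, one common way to do cross-validated target encoding is to encode each row using category statistics from the other folds only, shrinking small categories toward the global mean. The sketch below assumes that approach. The function `oof_target_encode`, its `smoothing` parameter, the category labels, and the synthetic data are illustrative and not taken from the winning solutions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(cat, target, n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold smoothed target encoding of one categorical Series.

    Each row is encoded with statistics computed on the other folds only,
    so a row's own target never leaks into its encoding.
    """
    cat = cat.reset_index(drop=True)
    target = target.reset_index(drop=True)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        prior = target.iloc[fit_idx].mean()
        stats = target.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["mean", "count"])
        # Shrink each category mean toward the prior; small categories shrink more.
        smooth = (stats["mean"] * stats["count"] + prior * smoothing) / (
            stats["count"] + smoothing
        )
        # Categories unseen in the fitting folds fall back to the prior.
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smooth).fillna(prior).to_numpy()
    return encoded


# Synthetic demo with hypothetical category labels.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Sex": rng.choice(["A", "B", "C"], size=300)})
demo["target"] = demo["Sex"].map({"A": 11.0, "B": 8.0, "C": 10.5}) + rng.normal(size=300)
demo["Sex_te"] = oof_target_encode(demo["Sex"], demo["target"])
print(demo.groupby("Sex")["Sex_te"].mean())
```

At prediction time the test set would be encoded with statistics from the full training set. When the encoding is evaluated inside a CV loop, the encoder should be fit within each training fold so the validation fold's targets stay unseen.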

In [None]:
# Analyze residual pipeline failure in detail
print("RESIDUAL PIPELINE FAILURE ANALYSIS")
print("="*60)

# Load residuals data
residuals_lr = pd.read_csv('/home/code/experiments/005_linear_regression/residuals_lr.csv')
residuals_nn = pd.read_csv('/home/code/experiments/006_neural_network_residuals/residuals_after_nn.csv')

target_std = residuals_lr['actual'].std()
print(f"Original target std: {target_std:.2f}")
print(f"After Linear: residual std: {residuals_lr['residual'].std():.2f}")
# Variance explained is 1 - Var(residual)/Var(target), so the std ratio must be squared
print(f"Variance explained by Linear: {(1 - (residuals_lr['residual'].std()/target_std)**2)*100:.1f}%")

print(f"\nAfter Linear+NN: residual std: {residuals_nn['residual'].std():.2f}")
print(f"Variance explained by Linear+NN: {(1 - (residuals_nn['residual'].std()/target_std)**2)*100:.1f}%")

print(f"\nResidual after Linear+NN: mean={residuals_nn['residual'].mean():.3f}, std={residuals_nn['residual'].std():.2f}")
print(f"This residual is mostly noise - XGBoost overfits to it!")

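# Check whether residuals look Gaussian. This is only a rough diagnostic:
# normality by itself neither proves nor rules out remaining signal.
from scipy import stats
# Shapiro-Wilk p-values are unreliable above ~5000 samples, so subsample reproducibly
resid_sample = residuals_nn['residual'].sample(5000, random_state=42) if len(residuals_nn) > 5000 else residuals_nn['residual']
shapiro_stat, shapiro_p = stats.shapiro(resid_sample)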
print(f"\nShapiro-Wilk test for normality: p={shapiro_p:.6f}")
print(f"Residuals {'are' if shapiro_p > 0.05 else 'are NOT'} normally distributed")