# Evolver Loop 3: CV-LB Gap Analysis & Submission Strategy

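**Goal**: Understand why CV runs roughly 9-10 points above the LB (9.20 points for exp_002: CV 83.84% vs. LB 74.64%; 10.09 points if exp_003's CV of 84.73% is compared against that same LB) and decide whether to submit exp_003.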

**Key Questions**:
1. Is the gap due to overfitting, distribution shift, or both?
2. Did hyperparameter tuning (exp_003) actually help generalization?
3. What should we try next based on LB feedback?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
import json

# Load session state to track experiments
with open('/home/code/session_state.json', 'r') as f:
    session = json.load(f)

print("="*60)
print("EXPERIMENT SUMMARY")
print("="*60)
for i, exp in enumerate(session['experiments']):
    print(f"Exp {i}: {exp['name']}")
    print(f"  CV Score: {exp['score']:.4f}")
    print(f"  Model: {exp['model_type']}")
    print()

print("\n" + "="*60)
print("SUBMISSION HISTORY")
print("="*60)
for sub in session['submissions']:
    print(f"Experiment: {sub['experiment_id']}")
    print(f"  CV: {sub['cv_score']:.4f}")
    print(f"  LB: {sub['lb_score']:.4f}")
    print(f"  Gap: {sub['cv_score'] - sub['lb_score']:.4f}")
    print()

print(f"Remaining submissions: {session['remaining_submissions']}/10")

## 1. CV-LB Gap Analysis

The evaluator identified a 9.20-point gap for exp_002 (CV 83.84% vs. LB 74.64%). exp_003 has no LB score yet, so we project its LB score by assuming the same gap.

In [None]:
# Calculate gaps
gaps = []
for sub in session['submissions']:
    gaps.append({
        'experiment': sub['experiment_id'],
        'cv_score': sub['cv_score'],
        'lb_score': sub['lb_score'],
        'gap': sub['cv_score'] - sub['lb_score']
    })

gaps_df = pd.DataFrame(gaps)
print(gaps_df)

# Project the exp_003 LB score, assuming it keeps the same CV-LB gap as exp_002.
# All values below are in percent.
exp_003_cv = 84.73
exp_002_lb = 74.64
exp_002_gap = 9.20  # from evaluator feedback (83.84 CV - 74.64 LB)
projected_lb = exp_003_cv - exp_002_gap

print(f"\n{'='*60}")
print("EXP_003 PROJECTION")
print(f"{'='*60}")
print(f"CV Score: {exp_003_cv:.2f}%")
print(f"Exp_002 Gap: {exp_002_gap:.2f}%")
print(f"Projected LB: {projected_lb:.2f}%")
print(f"Projected Improvement: {projected_lb - exp_002_lb:+.2f}% vs exp_002")
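The projection above rests on a single assumed gap. A minimal sketch of a sensitivity check follows, assuming all scores are in percent. The helper `project_lb` and the gap values other than 9.20 are hypothetical, chosen only for illustration.

```python
def project_lb(cv_score: float, assumed_gap: float) -> float:
    """Projected LB score (percent) if the CV-LB gap equals assumed_gap."""
    return cv_score - assumed_gap


exp_003_cv = 84.73  # from this notebook
exp_002_lb = 74.64  # from this notebook

# 9.20 is the observed exp_002 gap; 8.0 and 10.5 are hypothetical alternatives
for gap in (8.0, 9.20, 10.5):
    lb = project_lb(exp_003_cv, gap)
    print(f"gap={gap:5.2f} -> projected LB={lb:.2f}% ({lb - exp_002_lb:+.2f} vs exp_002)")
```

Under this arithmetic, exp_003 only beats exp_002 on the LB if its gap stays below 10.09 points (84.73 - 74.64).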

## 2. Feature Importance Analysis - Title_Mr Dominance

The evaluator flagged Title_Mr at 38.9% importance as a potential overfitting signal. Let's investigate.

In [None]:
# Load the raw train/test data used by exp_003 to analyze feature patterns
train_df = pd.read_csv('/home/data/train.csv')
test_df = pd.read_csv('/home/data/test.csv')

# Recreate Title feature for analysis
def extract_title(name):
    title = name.split(',')[1].split('.')[0].strip()
    if title in ['Mr', 'Mrs', 'Miss', 'Master']:
        return title
    elif title in ['Dr']:
        return 'Dr'
    elif title in ['Col', 'Major', 'Capt']:
        return 'Military'
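    # The raw name reads "Rothes, the Countess. of ...", so the parsed title is 'the Countess'
    elif title in ['the Countess', 'Countess', 'Lady', 'Sir', 'Don', 'Dona', 'Jonkheer']: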
        return 'Noble'
    elif title in ['Rev']:
        return 'Clergy'
    else:
        return 'Other'

train_df['Title'] = train_df['Name'].apply(extract_title)
test_df['Title'] = test_df['Name'].apply(extract_title)

print("TITLE DISTRIBUTION IN TRAIN vs TEST:")
print("="*60)
train_titles = train_df['Title'].value_counts(normalize=True) * 100
test_titles = test_df['Title'].value_counts(normalize=True) * 100

title_comp = pd.DataFrame({
    'Train_%': train_titles,
    'Test_%': test_titles
}).fillna(0)
print(title_comp.round(2))

# Calculate survival rates by title
survival_by_title = train_df.groupby('Title')['Survived'].agg(['count', 'mean', 'std'])
survival_by_title['survival_rate'] = survival_by_title['mean'] * 100
print(f"\n{'='*60}")
print("SURVIVAL RATES BY TITLE (TRAIN):")
print("="*60)
print(survival_by_title[['count', 'survival_rate']].round(2))
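Survival rates for rare titles are computed from very few passengers, so the point estimates above can be misleading. A rough sketch of a Wilson score interval for a binomial proportion is shown below. The helper `wilson_interval` and the example counts are hypothetical and are not taken from `train.csv`.

```python
import math
from typing import Tuple


def wilson_interval(survived: int, count: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if count == 0:
        return (0.0, 1.0)  # no data: the rate is unconstrained
    p = survived / count
    denom = 1 + z**2 / count
    center = (p + z**2 / (2 * count)) / denom
    half = z * math.sqrt(p * (1 - p) / count + z**2 / (4 * count**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical counts, for illustration only
for label, survived, count in [("rare title", 0, 6), ("common title", 80, 500)]:
    low, high = wilson_interval(survived, count)
    print(f"{label}: {survived}/{count} -> [{low * 100:.1f}%, {high * 100:.1f}%]")
```

With the `survival_by_title` frame, the number of survivors per title could be recovered as `round(mean * count)` before calling the helper. A wide interval suggests the model should not lean heavily on that title.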

## 3. Distribution Shift Analysis

Check for distribution differences between train and test that could explain the gap.

In [None]:
# Check key feature distributions
check_features = ['Pclass', 'Sex', 'Embarked', 'HasCabin']

# Derive HasCabin (1 if a cabin is recorded, else 0) for both train and test
train_df['HasCabin'] = train_df['Cabin'].notna().astype(int)
test_df['HasCabin'] = test_df['Cabin'].notna().astype(int)

print("DISTRIBUTION SHIFT ANALYSIS")
print("="*60)

for feature in check_features:
    print(f"\n{feature}:")
    # Sort by category so train and test rows line up regardless of frequency order
    train_dist = train_df[feature].value_counts(normalize=True).sort_index() * 100
    test_dist = test_df[feature].value_counts(normalize=True).sort_index() * 100

    comp = pd.DataFrame({
        'Train_%': train_dist,
        'Test_%': test_dist
    }).fillna(0)
    print(comp.round(2))
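The side-by-side tables above have to be compared by eye. One way to condense each feature into a single shift number is the total variation distance between the train and test category frequencies. The sketch below is dependency-free. The helper `total_variation` and the sample data are hypothetical, and both inputs are assumed to be non-empty.

```python
from collections import Counter


def total_variation(train_values, test_values) -> float:
    """Total variation distance between two categorical samples.

    0.0 means identical category frequencies; 1.0 means no shared categories.
    Both inputs are assumed to be non-empty iterables of hashable values.
    """
    train_counts, test_counts = Counter(train_values), Counter(test_values)
    n_train, n_test = sum(train_counts.values()), sum(test_counts.values())
    categories = set(train_counts) | set(test_counts)
    return 0.5 * sum(
        abs(train_counts[c] / n_train - test_counts[c] / n_test) for c in categories
    )


# Hypothetical samples, for illustration only
train_sex = ["male"] * 65 + ["female"] * 35
test_sex = ["male"] * 60 + ["female"] * 40
print(f"TV distance: {total_variation(train_sex, test_sex):.3f}")
```

In the notebook this could be called as `total_variation(train_df[feature].dropna(), test_df[feature].dropna())`. Dropping missing values first matters because NaN entries do not compare equal to each other and would be counted as separate categories.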

## 4. Decision: Should We Submit exp_003?

**Arguments FOR submitting:**
- Evaluator's top priority: "SUBMIT TO LEADERBOARD NOW"
- We need LB feedback to validate if hyperparameter tuning helps
- 8 submissions remaining is plenty
- CV improved from 83.84% to 84.73% (+0.89%)

**Arguments AGAINST submitting:**
- Title_Mr dominance (38.9%) suggests potential overfitting
- We haven't tried ensembles yet (evaluator's #3 concern)
- Could waste a submission if the model is overfit

**My recommendation:** SUBMIT. The evaluator is right - we're flying blind. We need LB feedback to guide the next $1000 of effort.

In [None]:
print("="*60)
print("SUBMISSION RECOMMENDATION")
print("="*60)
print("✓ SUBMIT exp_003 NOW")
print()
print("Reasoning:")
print("1. Evaluator's #1 priority: 'SUBMIT TO LEADERBOARD NOW'")
print("2. After 3 experiments, we have no LB feedback on hyperparameter tuning")
print("3. CV improved +0.89% with stronger regularization")
print("4. 8 submissions remaining - we can afford to test")
print("5. LB feedback will tell us if gap is closing or if we need to pivot")
print()
print("Expected outcomes:")
print("  If LB ≥ 75.5%: Gap closing, continue current direction")
print("  If LB ≈ 74.6%: No improvement, pivot to ensembles")
print("  If LB < 74.0%: Overfitting worse, investigate Title_Mr dominance")
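The expected outcomes can be written as an explicit rule, so the follow-up is fixed before the LB score arrives. This is only a sketch: the function `next_step` is hypothetical, and treating every score between 74.0% and 75.5% as "no clear improvement" is an assumption, since the list above names only three points.

```python
def next_step(lb_score: float) -> str:
    """Map an exp_003 LB score (percent) to the follow-up action."""
    if lb_score >= 75.5:
        return "continue current direction"
    if lb_score < 74.0:
        return "investigate Title_Mr dominance"
    # Assumption: anything in [74.0, 75.5) counts as no clear improvement
    return "pivot to ensembles"


for lb in (76.1, 74.6, 73.2):  # hypothetical LB scores
    print(f"LB {lb:.1f}% -> {next_step(lb)}")
```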