# Loop 2 LB Feedback Analysis

## Submission Results
- **exp_000 (XGBoost 13 features)**: CV 0.8316 → LB 0.7584 (gap: 7.3 percentage points)
- **exp_001 (Simple RF 7 features)**: CV 0.8238 → LB 0.7775 (gap: 4.6 percentage points)

## Key Insight
The simpler model hypothesis was **VALIDATED**:
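- LB improved by 1.9 percentage points (0.7584 → 0.7775), a 2.5% relative gain
- CV-LB gap shrank from 7.3 to 4.6 percentage points

This is consistent with the baseline overfitting the training data. Note that the two experiments differ in both model type and feature count, so the gain cannot be attributed to the feature reduction alone.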

In [None]:
import pandas as pd
import numpy as np

# Compare the two submissions; derive the gap from the scores instead of retyping them
results = {
    'Experiment': ['exp_000 (XGBoost 13 features)', 'exp_001 (Simple RF 7 features)'],
    'CV Score': [0.8316, 0.8238],
    'LB Score': [0.7584, 0.7775],
}
df = pd.DataFrame(results)
df['CV-LB Gap'] = df['CV Score'] - df['LB Score']
print(df.to_string(index=False))

lb_base, lb_simple = df['LB Score']
gap_base, gap_simple = df['CV-LB Gap']
print(f"\nLB Improvement: {lb_simple - lb_base:.4f} absolute (+{(lb_simple - lb_base) / lb_base * 100:.1f}% relative)")
print(f"Gap Reduction: {gap_base - gap_simple:.4f}")

In [None]:
# Load data to understand what's happening
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

print("Training data shape:", train.shape)
print("Test data shape:", test.shape)
print(f"\nTraining survival rate: {train['Survived'].mean():.3f}")

In [None]:
# Load both submissions and compare
try:
    sub_baseline = pd.read_csv('/home/code/submission_candidates/candidate_000.csv')
    sub_simple = pd.read_csv('/home/code/submission_candidates/candidate_001.csv')
    
    print("Baseline (XGBoost) survival rate:", sub_baseline['Survived'].mean())
    print("Simple RF survival rate:", sub_simple['Survived'].mean())
    
    # Compare predictions
    agreement = (sub_baseline['Survived'] == sub_simple['Survived']).mean()
    print(f"\nPrediction agreement: {agreement:.1%}")
    
    # Where do they differ?
    diff_mask = sub_baseline['Survived'] != sub_simple['Survived']
    print(f"Different predictions: {diff_mask.sum()} passengers")
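except Exception as e:
    print(f"Error loading submissions: {e}")
    # Fall back to an empty mask so the diff analysis in the next cell is skipped
    # instead of raising NameError
    diff_mask = pd.Series(dtype=bool)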

In [None]:
# Analyze the passengers where predictions differ
if diff_mask.sum() > 0:
    diff_ids = sub_baseline[diff_mask]['PassengerId'].values
    diff_test = test[test['PassengerId'].isin(diff_ids)].copy()
    
    print(f"Passengers with different predictions: {len(diff_test)}")
    print("\nSex distribution:")
    print(diff_test['Sex'].value_counts())
    print("\nPclass distribution:")
    print(diff_test['Pclass'].value_counts())
    
    # What did each model predict?
    baseline_preds = sub_baseline[diff_mask]['Survived'].values
    simple_preds = sub_simple[diff_mask]['Survived'].values
    
    print(f"\nBaseline predicted survivors: {baseline_preds.sum()}")
    print(f"Simple RF predicted survivors: {simple_preds.sum()}")

In [None]:
# Key insight: Simple RF predicts fewer survivors but scores BETTER on LB
# This suggests the baseline was over-predicting survival

print("="*60)
print("KEY INSIGHT")
print("="*60)
print(f"""\nBaseline (XGBoost 13 features):
- CV: 0.8316 (higher)
- LB: 0.7584 (lower)
- Survival rate: {sub_baseline['Survived'].mean():.3f}

Simple RF (7 features):
- CV: 0.8238 (lower)
- LB: 0.7775 (higher!)
- Survival rate: {sub_simple['Survived'].mean():.3f}

Conclusion:
1. The simpler model generalizes BETTER despite lower CV
2. The baseline was likely overfitting to training patterns
3. Switching from 13-feature XGBoost to 7-feature RF improved LB by 1.9 points
4. The CV-LB gap shrank from 7.3 to 4.6 points
""")

In [None]:
# What's the next step?
print("="*60)
print("NEXT STEPS ANALYSIS")
print("="*60)
print("""
1. The simpler model approach is VALIDATED
   - Continue with simple feature sets
   - Consider adding back Title (captures sex + social status)

2. Ensemble methods should help further
   - Voting ensemble with diverse models
   - Stacking with out-of-fold predictions
   - Use the same simple feature set

3. Target: 1.0 is IMPOSSIBLE
   - State-of-the-art is 81-85% accuracy
   - Current best LB: 0.7775 (77.75%)
   - Realistic target: 0.80-0.81 (80-81%)
   - Need ~2-3 percentage points of improvement

4. Remaining submissions: 6
   - Use wisely to validate ensemble approaches
""")

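The voting and stacking options listed above could be compared along these lines. This is a minimal sketch, not the pipeline used in these experiments: the data is a synthetic stand-in from `make_classification`, and the three base models and their hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# ASSUMPTION: synthetic stand-in for the 7-feature training matrix
X, y = make_classification(
    n_samples=600, n_features=7, n_informative=4, random_state=0
)

# Diverse base models that all see the same simple feature set
base = [
    ("rf", RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=7)),
]

# Soft voting averages the predicted class probabilities
voter = VotingClassifier(estimators=base, voting="soft")

# Stacking trains the final estimator on out-of-fold predictions (cv=5)
stacker = StackingClassifier(
    estimators=base,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

results = {}
for name, model in [("voting", voter), ("stacking", stacker)]:
    results[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: CV accuracy {results[name]:.4f}")
```

On the real data, `X` and `y` would be replaced by the engineered training features and `train['Survived']`.
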
In [None]:
# Calculate what we need to beat
target = 1.0
current_best_lb = 0.7775
gap_to_target = target - current_best_lb

print(f"Target: {target}")
print(f"Current best LB: {current_best_lb}")
print(f"Gap to target: {gap_to_target:.4f} ({gap_to_target*100:.1f}%)")
print("\nNOTE: Target of 1.0 (100% accuracy) is IMPOSSIBLE for Titanic.")
print("Best public solutions achieve ~0.81-0.85 (81-85%).")
realistic_goal = 0.80
gap_to_goal = realistic_goal - current_best_lb
print(f"\nRealistic goal: Achieve {realistic_goal:.2f}+ ({realistic_goal:.0%}+ accuracy)")
print(f"Gap to realistic goal: {gap_to_goal:.4f} ({gap_to_goal * 100:.1f} percentage points)")

## Summary

### What We Learned
1. **Simpler models generalize better** - The 7-feature RF outperformed the 13-feature XGBoost on LB (0.7775 vs 0.7584)
2. **CV-LB gap reduced** - From 7.3 to 4.6 percentage points, indicating less overfitting
3. **Feature engineering can hurt** - The larger feature set likely contributed to overfitting, though the model type changed at the same time
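
One way to judge whether a CV difference as small as 0.8316 vs 0.8238 is meaningful is to look at the fold-to-fold spread of the CV score. The sketch below uses repeated stratified k-fold for that; the data is a synthetic stand-in, and the model settings, fold count, and repeat count are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# ASSUMPTION: synthetic stand-in for the training set (row count is arbitrary)
X, y = make_classification(
    n_samples=800, n_features=7, n_informative=4, random_state=0
)

# 5 folds x 3 repeats = 15 accuracy estimates
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"mean={scores.mean():.4f}  std={scores.std():.4f}  n={len(scores)}")
```

If the standard deviation across folds is larger than the difference between two models' mean CV scores, CV alone cannot rank them.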

### Strategy Going Forward
1. **Keep features simple** - 7-10 core features at most
2. **Add Title back** - It captures sex + social status and had a feature importance of 0.08
3. **Try ensemble methods** - Voting/stacking with the simple feature set
4. **Focus on LB, not CV** - CV has overestimated LB by 4.6-7.3 percentage points so far, so it is not a reliable guide for this problem
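
Adding Title back could look roughly like the following. This is a sketch under assumptions: it presumes a `Name` column in the "Surname, Title. Given names" format, the demo rows are made up, and both the `extract_title` helper and the choice of which titles count as common are hypothetical.

```python
import pandas as pd

# Made-up rows in the assumed "Surname, Title. Given names" format
demo = pd.DataFrame({
    "Name": [
        "Doe, Mr. John",
        "Roe, Mrs. Jane (Ann Smith)",
        "Poe, Master. Tim",
        "Loe, Dr. Sam",
    ]
})

# ASSUMPTION: titles outside this set are pooled into a single "Rare" bucket
COMMON_TITLES = {"Mr", "Mrs", "Miss", "Master"}


def extract_title(name: pd.Series) -> pd.Series:
    """Hypothetical helper: pull the title between the comma and the first period."""
    title = name.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Unmatched names (NaN) and uncommon titles both fall into "Rare"
    return title.where(title.isin(COMMON_TITLES), "Rare")


demo["Title"] = extract_title(demo["Name"])
print(demo)
```

Pooling uncommon titles keeps the feature low-cardinality, in line with the goal of a small feature set.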

### Target Reality Check
- Target of 1.0 (100% accuracy) is impossible
- State-of-the-art is 81-85%
- Current best: 77.75%
- Realistic improvement: 2-3 percentage points to reach ~80%
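
To make the remaining gap concrete, it can be translated into a count of additional correctly classified passengers. The sketch below assumes the LB score is plain accuracy over `N_SCORED` rows; the value 418 is an assumption and should be checked against `test.shape` and the competition's scoring rules. The `extra_correct_needed` helper is hypothetical.

```python
import math


def extra_correct_needed(current_acc: float, target_acc: float, n_scored: int) -> int:
    """Hypothetical helper: extra correct predictions needed to reach target_acc."""
    current_correct = round(current_acc * n_scored)
    # Small epsilon guards against float noise just above an integer
    target_correct = math.ceil(target_acc * n_scored - 1e-9)
    return target_correct - current_correct


N_SCORED = 418  # ASSUMPTION: number of test rows that count toward the LB score

needed = extra_correct_needed(0.7775, 0.80, N_SCORED)
print(f"Extra correct predictions needed for 0.80: {needed}")
```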