# Loop 6 LB Feedback Analysis

## Submission Results
- exp_005 (Feature Engineering, 13 features): CV 0.8395 → LB 0.7775
- CV-LB gap: +6.20%

## Key Question: Why did CV improve while LB dropped?
- exp_003: CV 0.8373 → LB 0.7847 (gap: 5.26%)
- exp_005: CV 0.8395 → LB 0.7775 (gap: 6.20%)

The new features raised CV by 0.22% but lowered LB by 0.72% (3 of 418 passengers).
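
To put the 0.72% drop in context, the sketch below converts the two LB scores into correct-prediction counts and estimates the sampling noise of a single LB score. It assumes the LB is plain accuracy over all 418 test rows (consistent with the counts used later in this notebook) and uses a simple binomial standard error. `N_TEST`, `lb_to_correct`, and `binomial_se` are illustrative names, not part of the original pipeline.

```python
import math

N_TEST = 418  # assumption: the LB score is accuracy over all 418 test rows


def lb_to_correct(lb_score, n=N_TEST):
    """Convert an accuracy score into a count of correct predictions."""
    return round(lb_score * n)


def binomial_se(acc, n=N_TEST):
    """Binomial standard error of an accuracy measured on n rows."""
    return math.sqrt(acc * (1 - acc) / n)


correct_003 = lb_to_correct(0.7847)
correct_005 = lb_to_correct(0.7775)
se_003 = binomial_se(0.7847)

print(f"exp_003: {correct_003} correct, exp_005: {correct_005} correct")
print(f"Difference: {correct_003 - correct_005} passengers")
print(f"Standard error of one LB score: {se_003 * 100:.2f} points")
```

Under these assumptions the standard error is about 2 percentage points, so a 3-passenger change is small relative to the noise in a single LB score.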

In [None]:
import pandas as pd
import numpy as np

# Load submissions
exp_003 = pd.read_csv('/home/code/submission_candidates/candidate_003.csv')
exp_005 = pd.read_csv('/home/code/submission_candidates/candidate_005.csv')

print("Submission comparison:")
print(f"exp_003 (Best LB 0.7847): {exp_003['Survived'].sum()} survivors ({exp_003['Survived'].mean()*100:.1f}%)")
print(f"exp_005 (LB 0.7775):      {exp_005['Survived'].sum()} survivors ({exp_005['Survived'].mean()*100:.1f}%)")

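# Find differences (row-wise comparison is only valid if both files list passengers in the same order)
assert (exp_003['PassengerId'] == exp_005['PassengerId']).all(), "Submissions are not row-aligned"
diff_mask = exp_003['Survived'] != exp_005['Survived']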
print(f"\nDifferent predictions: {diff_mask.sum()}")

diff_ids = exp_003.loc[diff_mask, 'PassengerId'].values
print(f"Differing PassengerIds: {diff_ids}")

# Categorize differences
exp003_1_exp005_0 = exp_003.loc[diff_mask & (exp_003['Survived'] == 1), 'PassengerId'].values
exp003_0_exp005_1 = exp_003.loc[diff_mask & (exp_003['Survived'] == 0), 'PassengerId'].values

print(f"\nexp_003=1, exp_005=0 (exp_005 predicts death): {len(exp003_1_exp005_0)} passengers")
print(f"exp_003=0, exp_005=1 (exp_005 predicts survival): {len(exp003_0_exp005_1)} passengers")
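
The comparison above relies on both files listing passengers in the same row order. A more defensive variant joins on `PassengerId` first. The sketch below uses small made-up frames (`sub_a`, `sub_b` are hypothetical stand-ins, not the real submissions) to show the idea.

```python
import pandas as pd

# Toy stand-ins for two submission files (hypothetical values, not real predictions).
sub_a = pd.DataFrame({"PassengerId": [892, 893, 894, 895], "Survived": [0, 1, 0, 1]})
# Same passengers, deliberately stored in a different row order.
sub_b = pd.DataFrame({"PassengerId": [895, 894, 893, 892], "Survived": [1, 1, 0, 0]})

# Join on the key so each row compares the same passenger;
# validate="one_to_one" raises if either file has duplicate ids.
merged = sub_a.merge(sub_b, on="PassengerId", suffixes=("_a", "_b"), validate="one_to_one")

diff = merged[merged["Survived_a"] != merged["Survived_b"]]
flipped_to_0 = diff.loc[diff["Survived_a"] == 1, "PassengerId"].tolist()
flipped_to_1 = diff.loc[diff["Survived_a"] == 0, "PassengerId"].tolist()

print(f"a=1, b=0: {flipped_to_0}")
print(f"a=0, b=1: {flipped_to_1}")
```

A plain row-wise `!=` on these two toy frames would compare different passengers against each other, which is why the join comes first.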

In [None]:
# Load test data to analyze differing passengers
test = pd.read_csv('/home/data/test.csv')

# Extract title
test['Title'] = test['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
title_mapping = {
    'Mr': 'Mr', 'Miss': 'Miss', 'Mrs': 'Mrs', 'Master': 'Master',
    'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs',
    'Lady': 'Rare', 'Countess': 'Rare', 'Capt': 'Rare', 'Col': 'Rare',
    'Don': 'Rare', 'Dr': 'Rare', 'Major': 'Rare', 'Rev': 'Rare',
    'Sir': 'Rare', 'Jonkheer': 'Rare', 'Dona': 'Rare'
}
test['Title'] = test['Title'].map(title_mapping).fillna('Rare')
test['FamilySize'] = test['SibSp'] + test['Parch'] + 1
test['IsAlone'] = (test['FamilySize'] == 1).astype(int)
test['Has_Cabin'] = test['Cabin'].notna().astype(int)

print("Passengers where exp_003=1, exp_005=0 (exp_005 flipped to death; true labels unknown):")
print("="*70)
if len(exp003_1_exp005_0) > 0:
    flipped_to_death = test[test['PassengerId'].isin(exp003_1_exp005_0)][['PassengerId', 'Pclass', 'Sex', 'Age', 'Title', 'FamilySize', 'IsAlone', 'Fare']]
    print(flipped_to_death.to_string())
else:
    print("None")
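
The title extraction in this cell can be checked in isolation. The sketch below applies the same regular expression and the same "everything uncommon becomes Rare" rule with the standard library only. The sample names are illustrative strings in the dataset's `Surname, Title. Given names` format, and `normalize_title` is a hypothetical helper name.

```python
import re

TITLE_RE = re.compile(r" ([A-Za-z]+)\.")

# Only the non-Rare targets need listing; any other title falls back to 'Rare',
# which matches the effect of the mapping plus fillna('Rare') in the cell above.
COMMON = {
    "Mr": "Mr", "Miss": "Miss", "Mrs": "Mrs", "Master": "Master",
    "Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs",
}


def normalize_title(name):
    """Extract the first 'Word.' token after a space and map it to a title group."""
    match = TITLE_RE.search(name)
    raw = match.group(1) if match else None
    return COMMON.get(raw, "Rare")


samples = [
    "Kelly, Mr. James",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Oliva y Ocana, Dona. Fermina",
]
titles = [normalize_title(s) for s in samples]
print(titles)
```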

In [None]:
print("\nPassengers where exp_003=0, exp_005=1 (exp_005 flipped to survival; true labels unknown):")
print("="*70)
if len(exp003_0_exp005_1) > 0:
    flipped_to_survival = test[test['PassengerId'].isin(exp003_0_exp005_1)][['PassengerId', 'Pclass', 'Sex', 'Age', 'Title', 'FamilySize', 'IsAlone', 'Fare']]
    print(flipped_to_survival.to_string())
else:
    print("None")

In [None]:
# Calculate net effect
print("\n" + "="*70)
print("ANALYSIS: Why did exp_005 perform worse on LB?")
print("="*70)

print("\nexp_003 LB: 0.7847 (328 correct out of 418)")
print("exp_005 LB: 0.7775 (325 correct out of 418)")
print("Difference: -3 passengers")

# On each differing passenger exactly one submission is correct, so with
# a = exp_003 correct and b = exp_005 correct: a + b = n_diff and a - b = 3.
n_diff = int(diff_mask.sum())
exp003_right = (n_diff + 3) // 2
exp005_right = (n_diff - 3) // 2
print(f"\nOf the {n_diff} differing predictions:")
print(f"  - exp_003 was right on {exp003_right}")
print(f"  - exp_005 was right on {exp005_right}")

print("\nKey insight: The new features (FamilySize, IsAlone, Has_Cabin, TicketFreq, FareBin)")
print("are causing the model to make WORSE predictions on the test set.")
print("\nThis suggests the new features are overfitting to training patterns that")
print("don't generalize to the test set.")
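
Whether a 3-passenger net loss is evidence of overfitting or just noise can be gauged with an exact sign test on the differing predictions, since on each of them exactly one submission is right. The sketch below is a minimal version of that test. The 7-versus-4 split is a hypothetical example (11 differing predictions with a net loss of 3), not a value taken from the real submissions.

```python
from math import comb


def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test p-value for wins_a vs wins_b discordant pairs."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# Hypothetical split: exp_003 right on 7 differing passengers, exp_005 right on 4.
p_value = sign_test_p(7, 4)
print(f"p-value: {p_value:.3f}")
```

A large p-value here would mean the LB difference alone cannot separate the two submissions, so the overfitting explanation should be read as a hypothesis rather than a confirmed result.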

In [None]:
# Compare all submissions
print("\n" + "="*70)
print("SUBMISSION HISTORY ANALYSIS")
print("="*70)

submissions = [
    ('exp_000', 'XGBoost (13 features)', 0.8316, 0.7584, 157),
    ('exp_001', 'Simple RF (7 features)', 0.8238, 0.7775, 131),
    ('exp_003', 'Threshold-Tuned (8 features)', 0.8373, 0.7847, 130),
    ('exp_004', 'Stacking (5 base + XGB meta)', 0.8373, 0.7631, 131),
    ('exp_005', 'Feature Eng (13 features)', 0.8395, 0.7775, 131),
]

print(f"\n{'Exp':<8} {'Model':<30} {'CV':<8} {'LB':<8} {'Gap':<8} {'Survivors':<10}")
print("-"*75)
for exp_id, model, cv, lb, surv in submissions:
    gap = cv - lb
    print(f"{exp_id:<8} {model:<30} {cv:<8.4f} {lb:<8.4f} {gap:<8.4f} {surv:<10}")

print("\nKey patterns:")
print("1. Best LB (0.7847) achieved with 8 features + threshold tuning")
print("2. Adding more features (13) HURT LB despite improving CV")
print("3. Stacking also HURT LB despite same CV")
print("4. Simpler models work best; a ~31% survival rate alone does not guarantee a good LB (exp_004: 131 survivors, LB 0.7631)")
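
One way to quantify how well CV tracks LB across the submission history above is to count, over every pair of experiments, whether the one with higher CV also has higher LB. The sketch below does this with the standard library. The scores are copied from the table, while `history` and the counter names are illustrative.

```python
from itertools import combinations

# (experiment, CV, LB) copied from the submission history above.
history = [
    ("exp_000", 0.8316, 0.7584),
    ("exp_001", 0.8238, 0.7775),
    ("exp_003", 0.8373, 0.7847),
    ("exp_004", 0.8373, 0.7631),
    ("exp_005", 0.8395, 0.7775),
]

concordant = discordant = tied = 0
for (_, cv_a, lb_a), (_, cv_b, lb_b) in combinations(history, 2):
    # Positive product: CV and LB move in the same direction for this pair.
    agreement = (cv_b - cv_a) * (lb_b - lb_a)
    if agreement > 0:
        concordant += 1
    elif agreement < 0:
        discordant += 1
    else:
        tied += 1

print(f"concordant={concordant}, discordant={discordant}, tied={tied}")
```

With only five submissions this is a rough indicator, but it makes the "CV does not track LB" observation concrete instead of relying on a single pair.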

In [None]:
# What's the optimal strategy going forward?
print("\n" + "="*70)
print("STRATEGIC RECOMMENDATIONS")
print("="*70)

print("\n1. REVERT to exp_003 approach (8 features + threshold tuning)")
print("   - Best LB: 0.7847")
print("   - Features: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Title")
print("   - Threshold: ~0.608 for 31% survival rate")

print("\n2. Try REMOVING features that hurt generalization:")
print("   - TicketFreq has data leakage (uses test data)")
print("   - FareBin may be redundant with Fare")
print("   - Has_Cabin may be noisy (Cabin is ~77% missing)")

print("\n3. Consider DIFFERENT feature engineering:")
print("   - Cabin deck extraction (A, B, C, etc.)")
print("   - Ticket prefix extraction")
print("   - Age binning (child, adult, elderly)")
print("   - Family survival rate (if family members survived)")

print("\n4. With only 3 submissions left:")
print("   - Focus on variations of exp_003 (best LB)")
print("   - Avoid complex approaches (stacking, many features)")
print("   - Test one promising variation before final submission")
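
Recommendation 1 relies on choosing a probability cutoff that yields a target survival rate. A minimal sketch of that idea follows, assuming the model outputs one survival probability per passenger. The probabilities here are random toy values, and `threshold_for_rate` is a hypothetical helper, not the function used in exp_003.

```python
import random


def threshold_for_rate(probs, target_rate):
    """Return the cutoff that labels about target_rate of probs as positive."""
    k = max(1, round(len(probs) * target_rate))
    # The k-th largest probability becomes the cutoff.
    return sorted(probs, reverse=True)[k - 1]


random.seed(0)
probs = [random.random() for _ in range(418)]  # toy probabilities, not model output

cutoff = threshold_for_rate(probs, 0.311)
preds = [int(p >= cutoff) for p in probs]
print(f"cutoff={cutoff:.3f}, predicted survivors={sum(preds)}")
```

On real model output, tied probabilities at the cutoff can push the count slightly above the target, so the resulting survivor count is worth checking before submitting.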

In [None]:
# Final summary
print("\n" + "="*70)
print("LOOP 6 SUMMARY")
print("="*70)

print("\nResult: Feature engineering FAILED to improve LB")
print("  - CV improved: 0.8373 → 0.8395 (+0.22%)")
print("  - LB dropped: 0.7847 → 0.7775 (-0.72%)")
print("  - CV-LB gap increased: 5.26% → 6.20%")

print("\nLesson learned:")
print("  - More features ≠ better generalization")
print("  - Small CV gains (+0.22%) did NOT carry over to LB on this problem")
print("  - Simpler models with fewer features work better")

print("\nBest approach so far:")
print("  - exp_003: Threshold-Tuned Ensemble (8 features)")
print("  - LB: 0.7847 (best)")
print("  - 130 survivors (31.1%)")

print("\nNext steps:")
print("  1. Analyze what makes exp_003 work")
print("  2. Try minimal variations (remove 1-2 features)")
print("  3. Consider different threshold values")
print("  4. Save submissions for promising variations")
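
Next step 2 (remove 1-2 features) can be enumerated systematically. The sketch below lists every feature set obtained by dropping one or two of the eight exp_003 features named in the recommendations. `BASE_FEATURES` and `ablation_sets` are illustrative names, and how each variant is trained and scored is left open.

```python
from itertools import combinations

BASE_FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Title"]


def ablation_sets(features, max_drop=2):
    """Yield (dropped, kept) pairs for every way to drop 1..max_drop features."""
    for n_drop in range(1, max_drop + 1):
        for dropped in combinations(features, n_drop):
            kept = [f for f in features if f not in dropped]
            yield dropped, kept


variants = list(ablation_sets(BASE_FEATURES))
print(f"{len(variants)} candidate feature sets")
print(variants[0])
```

With only 3 submissions left, these variants would have to be screened offline (for example by how many predictions change relative to exp_003) before any of them is submitted.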