# Loop 5 LB Feedback Analysis

## Critical Finding: Stacking FAILED on LB

| Exp | Model | CV | LB | Gap (CV - LB) | Predicted Survival Rate (count) |
|-----|-------|-----|-----|-----|---------------|
| exp_003 | Threshold-Tuned Ensemble | 0.8373 | 0.7847 | +5.26% | 31.1% (130) |
| exp_004 | Stacking (5 base + XGB meta) | 0.8373 | 0.7631 | +7.42% | 31.3% (131) |

**Stacking scored 2.16 percentage points WORSE on the LB (about 9 of 418 passengers) despite identical CV!**

Stacking added complexity without improving LB accuracy.

In [None]:
import pandas as pd
import numpy as np

# Load all submissions for comparison
submissions = {
    'exp_000': {'path': '/home/code/submission_candidates/candidate_000.csv', 'lb': 0.7584, 'cv': 0.8316, 'name': 'XGBoost (13 features)'},
    'exp_001': {'path': '/home/code/submission_candidates/candidate_001.csv', 'lb': 0.7775, 'cv': 0.8238, 'name': 'Simple RF (7 features)'},
    'exp_003': {'path': '/home/code/submission_candidates/candidate_003.csv', 'lb': 0.7847, 'cv': 0.8373, 'name': 'Threshold-Tuned Ensemble'},
    'exp_004': {'path': '/home/code/submission_candidates/candidate_004.csv', 'lb': 0.7631, 'cv': 0.8373, 'name': 'Stacking (5 base + XGB meta)'}
}

for exp_id, info in submissions.items():
    df = pd.read_csv(info['path'])
    survivors = df['Survived'].sum()
    rate = df['Survived'].mean() * 100
    gap = info['cv'] - info['lb']
    print(f"{exp_id}: {info['name']}")
    print(f"  CV: {info['cv']:.4f}, LB: {info['lb']:.4f}, Gap: {gap:+.4f}")  # :+ prints the sign correctly even if the gap is negative
    print(f"  Survivors: {survivors} ({rate:.1f}%)")
    print()

In [None]:
# Compare exp_003 (best LB) vs exp_004 (stacking)
best = pd.read_csv('/home/code/submission_candidates/candidate_003.csv')
stacking = pd.read_csv('/home/code/submission_candidates/candidate_004.csv')

# Merge and find differences
comp = best.merge(stacking, on='PassengerId', suffixes=('_best', '_stack'))
diff = comp[comp['Survived_best'] != comp['Survived_stack']]

print(f"Predictions that differ: {len(diff)} out of {len(comp)} ({len(diff)/len(comp)*100:.1f}%)")
print(f"\nBest LB (exp_003) survivors: {best['Survived'].sum()}")
print(f"Stacking (exp_004) survivors: {stacking['Survived'].sum()}")

# Breakdown of differences
best_only = diff[diff['Survived_best'] == 1]  # Best predicted 1, Stacking predicted 0
stack_only = diff[diff['Survived_stack'] == 1]  # Stacking predicted 1, Best predicted 0

print(f"\nDifferences breakdown:")
print(f"  Best=1, Stack=0: {len(best_only)} passengers (Best was right if they survived)")
print(f"  Best=0, Stack=1: {len(stack_only)} passengers (Stack was right if they survived)")
print(f"\nPassenger IDs where they differ:")
print(f"  Best=1, Stack=0: {list(best_only['PassengerId'].values)}")
print(f"  Best=0, Stack=1: {list(stack_only['PassengerId'].values)}")

In [None]:
# Load test data to analyze differing passengers
test = pd.read_csv('/home/data/test.csv')

# Get characteristics of differing passengers
diff_ids = diff['PassengerId'].values
diff_passengers = test[test['PassengerId'].isin(diff_ids)].copy()

print("Characteristics of passengers where predictions differ:")
print(f"\nTotal: {len(diff_passengers)} passengers")
print(f"\nSex distribution:")
print(diff_passengers['Sex'].value_counts())
print(f"\nPclass distribution:")
print(diff_passengers['Pclass'].value_counts())
print(f"\nAge statistics:")
print(diff_passengers['Age'].describe())

In [None]:
# Estimate which predictions were correct from the LB difference
# Best LB: 0.7847 (~328 correct), Stacking LB: 0.7631 (~319 correct)
# Difference: ~9 more correct predictions for Best

n_test = len(best)  # 418 test passengers

# Use round(), not int(): int(0.7631 * 418) truncates 318.98 down to 318
best_correct = round(0.7847 * n_test)  # ~328
stack_correct = round(0.7631 * n_test)  # ~319
advantage = best_correct - stack_correct

print(f"Estimated correct predictions:")
print(f"  Best (exp_003): {best_correct} correct")
print(f"  Stacking (exp_004): {stack_correct} correct")
print(f"  Difference: {advantage} more correct for Best")

# Where the two models disagree, exactly one of them is right, so
# best_right + stack_right = len(diff) and best_right - stack_right = advantage.
# With 15 differing predictions and a 9-passenger advantage: ~12 vs ~3.
best_right = (len(diff) + advantage) / 2
stack_right = (len(diff) - advantage) / 2
print(f"\nOf the {len(diff)} differing predictions:")
print(f"  Best was likely right on ~{best_right:.0f} of them")
print(f"  Stacking was likely right on ~{stack_right:.0f} of them")

print(f"\n=== KEY INSIGHT ===")
print(f"Stacking's {len(diff)} different predictions were mostly WRONG")
print(f"The simpler Threshold-Tuned Ensemble was more accurate")

In [None]:
# Analyze the specific passengers where Best=1, Stack=0 (Best was likely right)
best_only_ids = best_only['PassengerId'].values
best_only_passengers = test[test['PassengerId'].isin(best_only_ids)]

print("Passengers where Best=1, Stack=0 (Best was likely right):")
print(best_only_passengers[['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].to_string())

print(f"\nSummary:")
print(f"  Sex: {best_only_passengers['Sex'].value_counts().to_dict()}")
print(f"  Pclass: {best_only_passengers['Pclass'].value_counts().to_dict()}")
print(f"  Mean Age: {best_only_passengers['Age'].mean():.1f}")
print(f"  Mean Fare: {best_only_passengers['Fare'].mean():.1f}")

In [None]:
# Analyze the specific passengers where Best=0, Stack=1 (Stacking was likely wrong)
stack_only_ids = stack_only['PassengerId'].values
stack_only_passengers = test[test['PassengerId'].isin(stack_only_ids)]

print("Passengers where Best=0, Stack=1 (Stacking was likely wrong):")
print(stack_only_passengers[['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].to_string())

print(f"\nSummary:")
print(f"  Sex: {stack_only_passengers['Sex'].value_counts().to_dict()}")
print(f"  Pclass: {stack_only_passengers['Pclass'].value_counts().to_dict()}")
print(f"  Mean Age: {stack_only_passengers['Age'].mean():.1f}")
print(f"  Mean Fare: {stack_only_passengers['Fare'].mean():.1f}")

In [None]:
# Key learning: What patterns does stacking get wrong?
print("=" * 70)
print("KEY LEARNINGS FROM STACKING FAILURE")
print("=" * 70)

print(f"""
1. STACKING ADDED COMPLEXITY WITHOUT BENEFIT
   - Same CV (0.8373) but worse LB (0.7631 vs 0.7847)
   - CV-LB gap increased from 5.26% to 7.42%
   - This suggests stacking is overfitting to training data

2. SIMPLER IS BETTER FOR THIS DATASET
   - Best LB models: Simple RF (0.7775), Threshold-Tuned Ensemble (0.7847)
   - Worst LB models: XGBoost (0.7584), Stacking (0.7631)
   - Pattern: More complex models overfit

3. SURVIVAL RATE IS NOT THE ONLY FACTOR
   - Both exp_003 and exp_004 have ~31% survival rate
   - But exp_003 (Threshold-Tuned) is 2.16 percentage points better on LB
   - The SPECIFIC predictions matter, not just the rate

4. NEXT STEPS
   - Abandon stacking approach
   - Focus on improving the Threshold-Tuned Ensemble
   - Consider feature engineering to capture patterns stacking missed
   - Try blending Simple RF + Threshold-Tuned predictions
""")

In [None]:
# Analyze what makes exp_003 (Threshold-Tuned) better
# Compare all 4 submissions to find consensus patterns

all_preds = pd.DataFrame({'PassengerId': test['PassengerId']})
for exp_id, info in submissions.items():
    df = pd.read_csv(info['path'])
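    # Align on PassengerId rather than relying on every file sharing the same row order
    all_preds[exp_id] = all_preds['PassengerId'].map(df.set_index('PassengerId')['Survived'])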

# Find consensus predictions (at least 3 of 4 agree; 2-2 splits are marked -1)
all_preds['sum'] = all_preds[['exp_000', 'exp_001', 'exp_003', 'exp_004']].sum(axis=1)
all_preds['consensus'] = all_preds['sum'].apply(lambda x: 1 if x >= 3 else (0 if x <= 1 else -1))

print("Consensus analysis:")
print(f"  All 4 predict 0: {(all_preds['sum'] == 0).sum()} passengers")
print(f"  All 4 predict 1: {(all_preds['sum'] == 4).sum()} passengers")
print(f"  3 predict 0, 1 predicts 1: {(all_preds['sum'] == 1).sum()} passengers")
print(f"  3 predict 1, 1 predicts 0: {(all_preds['sum'] == 3).sum()} passengers")
print(f"  2 predict 0, 2 predict 1: {(all_preds['sum'] == 2).sum()} passengers")

print(f"\nConsensus predictions (3+ agree): {(all_preds['consensus'] != -1).sum()} passengers")
print(f"Split predictions (2-2): {(all_preds['consensus'] == -1).sum()} passengers")
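
To see which models drive the splits, a pairwise disagreement count can help. The sketch below runs on made-up toy predictions, not the real submissions, and `pairwise_disagreements` is a hypothetical helper written for this illustration.

```python
from itertools import combinations

def pairwise_disagreements(preds: dict[str, list[int]]) -> dict[tuple[str, str], int]:
    """Count, for every pair of models, how many rows they label differently."""
    return {
        (a, b): sum(x != y for x, y in zip(preds[a], preds[b]))
        for a, b in combinations(preds, 2)
    }

# Toy predictions for 6 passengers (illustrative values only)
toy = {
    'exp_000': [0, 1, 1, 0, 0, 1],
    'exp_001': [0, 1, 0, 0, 0, 1],
    'exp_003': [0, 1, 0, 0, 1, 1],
    'exp_004': [0, 1, 1, 0, 1, 0],
}

for (a, b), n in pairwise_disagreements(toy).items():
    print(f"{a} vs {b}: {n} differing predictions")
```

In this notebook the same helper could be called with `{c: all_preds[c].tolist() for c in submissions}` to compare the real submissions.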

In [None]:
# Create a consensus submission based on best-performing models
# Weight by LB score: exp_003 (0.7847) > exp_001 (0.7775) > exp_004 (0.7631) > exp_000 (0.7584)

# Simple majority vote weighted by LB performance
weights = {
    'exp_000': 0.7584,
    'exp_001': 0.7775,
    'exp_003': 0.7847,
    'exp_004': 0.7631
}

# Normalize weights
total_weight = sum(weights.values())
weights = {k: v/total_weight for k, v in weights.items()}

print("Normalized weights based on LB performance:")
for exp_id, w in weights.items():
    print(f"  {exp_id}: {w:.4f}")

# Calculate weighted vote
all_preds['weighted_vote'] = (
    all_preds['exp_000'] * weights['exp_000'] +
    all_preds['exp_001'] * weights['exp_001'] +
    all_preds['exp_003'] * weights['exp_003'] +
    all_preds['exp_004'] * weights['exp_004']
)

print(f"\nWeighted vote distribution:")
print(all_preds['weighted_vote'].describe())

In [None]:
# Try different thresholds for weighted vote
print("Threshold analysis for weighted vote:")
print(f"{'Threshold':<12} {'Survivors':<12} {'Rate':<12}")
print("-" * 36)

for thresh in [0.40, 0.45, 0.50, 0.55, 0.60]:
    preds = (all_preds['weighted_vote'] >= thresh).astype(int)
    survivors = preds.sum()
    rate = preds.mean() * 100
    marker = " <- TARGET" if 125 <= survivors <= 135 else ""
    print(f"{thresh:<12.2f} {survivors:<12} {rate:<12.1f}{marker}")

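# Threshold used for the blend: 0.50, intended to keep the predicted survival rate near ~31%
# (hard-coded here, not selected automatically from the table above)
optimal_thresh = 0.50
weighted_preds = (all_preds['weighted_vote'] >= optimal_thresh).astype(int)
print(f"\nChosen threshold: {optimal_thresh}")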
print(f"Survivors: {weighted_preds.sum()} ({weighted_preds.mean()*100:.1f}%)")
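
Because the four LB scores are so close, the normalized weights all sit near 0.25, and the weighted vote can take only a handful of distinct values. The sketch below enumerates every combination of the four binary predictions using the LB scores from this notebook, so it needs no submission files. The names `norm_weights` and `vote` are illustrative.

```python
from itertools import product

lb = {'exp_000': 0.7584, 'exp_001': 0.7775, 'exp_003': 0.7847, 'exp_004': 0.7631}
total = sum(lb.values())
norm_weights = {k: v / total for k, v in lb.items()}

def vote(combo, thresh):
    """Weighted vote for one tuple of 0/1 predictions, ordered like `lb`."""
    score = sum(w * p for w, p in zip(norm_weights.values(), combo))
    return int(score >= thresh)

# Which 2-2 splits clear the 0.50 threshold?
for combo in product([0, 1], repeat=4):
    if sum(combo) == 2:
        ones = [k for k, p in zip(lb, combo) if p]
        print(f"{' + '.join(ones)}: vote = {vote(combo, 0.50)}")

# How many distinct decision rules do the five thresholds produce?
rules = {
    tuple(vote(c, t) for c in product([0, 1], repeat=4))
    for t in [0.40, 0.45, 0.50, 0.55, 0.60]
}
print(f"Distinct decision rules: {len(rules)}")
```

With these weights, every 2-2 split that includes exp_003 clears 0.50 and every split that excludes it does not, so at a 0.50 threshold the weighted vote is a plain majority vote with ties resolved in favor of exp_003. The five thresholds collapse to three rules: at least two models (0.40, 0.45), majority with the exp_003 tie-break (0.50), and at least three models (0.55, 0.60).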

In [None]:
# Compare weighted vote to best LB (exp_003)
weighted_vs_best = pd.DataFrame({
    'PassengerId': all_preds['PassengerId'],
    'weighted': weighted_preds,
    'best': all_preds['exp_003']
})

diff_weighted = weighted_vs_best[weighted_vs_best['weighted'] != weighted_vs_best['best']]
print(f"Weighted vote vs Best LB (exp_003):")
print(f"  Predictions that differ: {len(diff_weighted)}")
print(f"  Weighted survivors: {weighted_preds.sum()}")
print(f"  Best survivors: {all_preds['exp_003'].sum()}")

if len(diff_weighted) > 0:
    print(f"\nDifferences:")
    print(f"  Weighted=1, Best=0: {((diff_weighted['weighted'] == 1) & (diff_weighted['best'] == 0)).sum()}")
    print(f"  Weighted=0, Best=1: {((diff_weighted['weighted'] == 0) & (diff_weighted['best'] == 1)).sum()}")

## Summary and Next Steps

### Key Findings:
1. **Stacking FAILED**: Same CV (0.8373) but LB dropped from 0.7847 to 0.7631 (-2.16 percentage points, about 9 passengers)
2. **Simpler is better**: Threshold-Tuned Ensemble outperforms complex stacking
3. **CV-LB gap increased**: From 5.26% to 7.42% with stacking (more overfitting)
4. **Survival rate alone isn't enough**: Both had ~31% but different LB scores

### Recommended Next Steps:
1. **Abandon stacking** - It doesn't help for this dataset
2. **Focus on Threshold-Tuned Ensemble** - Current best at LB 0.7847
3. **Try weighted blending** - Combine best models with LB-based weights
4. **Feature engineering** - Add FamilySize, IsAlone, Has_Cabin
5. **Analyze disagreement patterns** - Understand where models differ

### Submission Strategy:
- 4 submissions remaining
- Best LB: 0.7847 (exp_003)
- Target: 0.80+ (needs about +1.5 percentage points over the current best)
- Focus on incremental improvements to Threshold-Tuned Ensemble
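
The 0.80 target can be translated into a passenger count. This sketch assumes, as the analysis above does, that the LB score is plain accuracy over all 418 test passengers.

```python
import math

N_TEST = 418          # test-set size used throughout this analysis
current_lb = 0.7847   # exp_003, current best
target_lb = 0.80

current_correct = round(current_lb * N_TEST)
needed_correct = math.ceil(target_lb * N_TEST)

print(f"Currently correct: {current_correct}")
print(f"Needed for {target_lb:.2f}+: {needed_correct}")
print(f"Additional correct predictions required: {needed_correct - current_correct}")
```

Under that assumption the target means 7 more correct predictions: 335 of 418 is about 0.8014, while 334 of 418 falls just short of 0.80.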