# Loop 4 LB Feedback Analysis

## Submission Results
- **exp_003**: Threshold-Tuned Ensemble (31% survival rate)
  - CV: 0.8373 | LB: 0.7847 | Gap: +5.26%

## Submission History
| Exp | Model | CV | LB | Gap | Survival Rate |
|-----|-------|-----|-----|-----|---------------|
| exp_000 | XGBoost (13 features) | 0.8316 | 0.7584 | +7.32% | 37.6% |
| exp_001 | Simple RF (7 features) | 0.8238 | 0.7775 | +4.63% | 31.3% |
| exp_003 | Threshold-Tuned Ensemble | 0.8373 | 0.7847 | +5.26% | 31.1% |
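
The Gap column and the passenger counts used later in this notebook follow directly from the CV and LB scores. A minimal sketch of that arithmetic, assuming (as the notebook does) that the leaderboard is scored on all 418 test rows; the `summarize` helper and the `history` dict are illustrative names, not part of the original code:

```python
# Scores copied from the Submission History table.
N_TEST = 418  # rows in the Titanic test set (the notebook hardcodes the same number)

history = {
    "exp_000": {"cv": 0.8316, "lb": 0.7584},
    "exp_001": {"cv": 0.8238, "lb": 0.7775},
    "exp_003": {"cv": 0.8373, "lb": 0.7847},
}


def summarize(cv, lb, n_test=N_TEST):
    """Return the CV-LB gap in percentage points and the implied correct count."""
    gap_points = round((cv - lb) * 100, 2)
    n_correct = round(lb * n_test)
    return gap_points, n_correct


summary = {exp: summarize(s["cv"], s["lb"]) for exp, s in history.items()}
for exp, (gap_points, n_correct) in summary.items():
    print(f"{exp}: gap = +{gap_points:.2f} points, ~{n_correct}/{N_TEST} correct")
```

Reading the scores as counts makes the size of each improvement concrete: a change of one passenger moves the LB score by about 0.0024.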

## Key Observations
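1. **Threshold tuning worked**: LB improved from 0.7775 to 0.7847 (+0.72 points, about 3 of 418 test passengers). exp_003 is also an ensemble, so part of the gain may come from ensembling rather than from the threshold.
2. **Best LB so far**: 0.7847 (78.47% accuracy)
3. **CV-LB gap**: 5.26 points, down from 7.32 (XGBoost) but slightly above Simple RF's 4.63
4. **Survival rate hypothesis supported**: the two submissions predicting ~31% survival both outscored the XGBoost submission at 37.6%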
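
One way to tune a threshold toward a target predicted survival rate is to take a quantile of the predicted probabilities. The following is a sketch under stated assumptions, not the code used for exp_003: `threshold_for_rate` is a hypothetical helper, and the probabilities are randomly generated stand-ins for the ensemble's test-set output.

```python
import numpy as np


def threshold_for_rate(proba, target_rate):
    """Pick the cutoff so that roughly `target_rate` of rows are predicted positive."""
    # The (1 - target_rate) quantile leaves about target_rate of the scores above it.
    return float(np.quantile(proba, 1.0 - target_rate))


# Hypothetical probabilities standing in for the ensemble's test-set output.
rng = np.random.default_rng(0)
proba = rng.beta(2, 3, size=418)

cutoff = threshold_for_rate(proba, target_rate=0.31)
pred = (proba > cutoff).astype(int)
print(f"cutoff = {cutoff:.3f}, predicted survival rate = {pred.mean():.3f}")
```

Because the cutoff is derived from the score distribution itself, the same target rate can be applied to any model without re-deriving a model-specific threshold.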

In [None]:
import pandas as pd
import numpy as np

# Load all submissions for comparison
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

# Load predictions
xgb_pred = pd.read_csv('/home/code/submission_candidates/candidate_000.csv')
rf_pred = pd.read_csv('/home/code/submission_candidates/candidate_001.csv')
thresh_pred = pd.read_csv('/home/code/submission_candidates/candidate_003.csv')

print("Survival rates:")
print(f"  Training: {train['Survived'].mean():.3f}")
print(f"  XGBoost (LB 0.7584): {xgb_pred['Survived'].mean():.3f}")
print(f"  Simple RF (LB 0.7775): {rf_pred['Survived'].mean():.3f}")
print(f"  Threshold-Tuned (LB 0.7847): {thresh_pred['Survived'].mean():.3f}")

In [None]:
# Compare predictions between best models
print("\nPrediction Agreement Analysis:")
print("="*50)

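# RF vs Threshold-Tuned: count test passengers with identical predictions
# (assumes both submission files list passengers in the same row order)
n_test = len(thresh_pred)  # 418 rows in the Titanic test set
rf_vs_thresh = (rf_pred['Survived'] == thresh_pred['Survived']).sum()
print("\nSimple RF vs Threshold-Tuned:")
print(f"  Agreement: {rf_vs_thresh}/{n_test} ({rf_vs_thresh/n_test*100:.1f}%)")
print(f"  Disagreement: {n_test-rf_vs_thresh} passengers")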

# Where do they differ?
diff_idx = rf_pred['Survived'] != thresh_pred['Survived']
print(f"\nDifferences:")
print(f"  RF=1, Thresh=0: {((rf_pred['Survived']==1) & (thresh_pred['Survived']==0)).sum()}")
print(f"  RF=0, Thresh=1: {((rf_pred['Survived']==0) & (thresh_pred['Survived']==1)).sum()}")
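
The agreement and disagreement counts above can also be read off a single contingency table. A small sketch using `pd.crosstab`; the two Series are made-up stand-ins for `rf_pred['Survived']` and `thresh_pred['Survived']`:

```python
import pandas as pd

# Hypothetical stand-ins for rf_pred['Survived'] and thresh_pred['Survived'].
rf = pd.Series([1, 0, 1, 0, 0, 1, 0, 0], name="RF")
thresh = pd.Series([1, 0, 0, 0, 1, 1, 0, 0], name="Thresh")

# Rows are RF predictions, columns are Threshold-Tuned predictions.
table = pd.crosstab(rf, thresh)
print(table)

agree = table.loc[0, 0] + table.loc[1, 1]  # diagonal: both models agree
rf1_thresh0 = table.loc[1, 0]              # RF=1, Thresh=0
rf0_thresh1 = table.loc[0, 1]              # RF=0, Thresh=1
```

The off-diagonal cells are the two disagreement counts printed in the cell above.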

In [None]:
# Analyze passengers where models differ
test_with_preds = test.copy()
test_with_preds['RF_Pred'] = rf_pred['Survived']
test_with_preds['Thresh_Pred'] = thresh_pred['Survived']
test_with_preds['XGB_Pred'] = xgb_pred['Survived']

# Extract Title
test_with_preds['Title'] = test_with_preds['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Where RF=1 but Thresh=0 (RF predicts survival, Thresh doesn't)
rf_yes_thresh_no = test_with_preds[(test_with_preds['RF_Pred']==1) & (test_with_preds['Thresh_Pred']==0)]
print(f"\nPassengers where RF=1 but Thresh=0 ({len(rf_yes_thresh_no)} passengers):")
if len(rf_yes_thresh_no) > 0:
    print(f"  Sex: {rf_yes_thresh_no['Sex'].value_counts().to_dict()}")
    print(f"  Pclass: {rf_yes_thresh_no['Pclass'].value_counts().to_dict()}")
    print(f"  Title: {rf_yes_thresh_no['Title'].value_counts().to_dict()}")

# Where RF=0 but Thresh=1 (Thresh predicts survival, RF doesn't)
rf_no_thresh_yes = test_with_preds[(test_with_preds['RF_Pred']==0) & (test_with_preds['Thresh_Pred']==1)]
print(f"\nPassengers where RF=0 but Thresh=1 ({len(rf_no_thresh_yes)} passengers):")
if len(rf_no_thresh_yes) > 0:
    print(f"  Sex: {rf_no_thresh_yes['Sex'].value_counts().to_dict()}")
    print(f"  Pclass: {rf_no_thresh_yes['Pclass'].value_counts().to_dict()}")
    print(f"  Title: {rf_no_thresh_yes['Title'].value_counts().to_dict()}")

In [None]:
# Estimate the net number of extra correct predictions from the LB scores.
# The labels are hidden, so only the net difference is known: Thresh scored 0.7847 and
# RF 0.7775, a gap of 0.0072, or about 3 of 418 passengers.

lb_diff = 0.7847 - 0.7775
passengers_diff = round(lb_diff * 418)
print(f"\nLB Score Analysis:")
print(f"  Threshold-Tuned: 0.7847")
print(f"  Simple RF: 0.7775")
print(f"  Difference: {lb_diff:.4f} ({passengers_diff} passengers)")

print(f"\nThis means Threshold-Tuned got ~{passengers_diff} more predictions correct than Simple RF.")
print(f"The {418-rf_vs_thresh} disagreements between models resulted in net +{passengers_diff} for Threshold-Tuned.")
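
The net difference also pins down how the disagreements split. With binary labels exactly one model is right on each disagreement, so the two win counts sum to the number of disagreements and differ by the net gain. A sketch of that arithmetic; `split_disagreements` is a hypothetical helper, and the disagreement count of 21 is an invented example, not the notebook's actual value:

```python
def split_disagreements(n_disagree, net_gain):
    """Split disagreements into (better model right, other model right).

    With binary labels exactly one model is right on each disagreement, so
    wins_a + wins_b = n_disagree and wins_a - wins_b = net_gain.
    """
    if (n_disagree + net_gain) % 2 != 0 or abs(net_gain) > n_disagree:
        raise ValueError("inconsistent counts (LB rounding may be the cause)")
    wins_a = (n_disagree + net_gain) // 2
    return wins_a, n_disagree - wins_a


# Hypothetical disagreement count; the real one is 418 - rf_vs_thresh.
thresh_wins, rf_wins = split_disagreements(n_disagree=21, net_gain=3)
print(f"Thresh right on {thresh_wins}, RF right on {rf_wins} of 21 disagreements")
```

If the split is close to even, most of the changed predictions cancel out and only a handful of passengers account for the LB gain.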

In [None]:
# What's the best possible score we can achieve?
# State-of-the-art is 81-85% accuracy
# Current best: 0.7847
# Target: 1.0 (impossible)

print("\n" + "="*60)
print("SCORE TRAJECTORY")
print("="*60)
print(f"\nSubmission History:")
print(f"  #1: XGBoost (13 features) → LB 0.7584")
print(f"  #2: Simple RF (7 features) → LB 0.7775 (+1.91%)")
print(f"  #3: Threshold-Tuned Ensemble → LB 0.7847 (+0.72%)")
print(f"\nTotal improvement: +2.63% (from 0.7584 to 0.7847)")
print(f"\nState-of-the-art: 0.81-0.85")
print(f"Gap to close: {0.81 - 0.7847:.4f} to {0.85 - 0.7847:.4f}")
print(f"\nTarget: 1.0 (IMPOSSIBLE - inherent noise in data)")
print(f"\nRealistic goal: 0.80+ would be excellent (top ~10%)")
print(f"Current position: 0.7847 is competitive but not top-tier")

In [None]:
# What strategies could push us to 0.80+?
print("\n" + "="*60)
print("STRATEGIES TO REACH 0.80+")
print("="*60)

print("""
1. STACKING (reported to achieve 0.808 LB)
   - Use 5 base models: RF, ExtraTrees, AdaBoost, GradientBoosting, SVC
   - Generate OOF predictions using 5-fold CV
   - Train XGBoost as meta-learner on stacked predictions
   - Apply threshold tuning to final predictions

2. BLENDING RF + Threshold-Tuned Ensemble
   - Average predictions from both models
   - Or use weighted average based on LB performance
   - Simple RF: 0.7775, Threshold-Tuned: 0.7847

3. FEATURE ENGINEERING
   - Add FamilySize, IsAlone features
   - Add Has_Cabin, Deck features
   - Ticket frequency (passengers sharing same ticket)
   - Age/Fare binning

4. THRESHOLD OPTIMIZATION
   - Current threshold 0.608 gave 31.1% survival
   - Try threshold 0.55 (best OOF accuracy 0.8418, 34.4% survival)
   - Find optimal threshold that balances OOF accuracy and survival rate

5. MODEL DIVERSITY
   - Add CatBoost, LightGBM to ensemble
   - Neural network (simple MLP)
   - Different feature subsets for different models
""")

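The stacking strategy listed first could look roughly like the sketch below. It is an illustration under assumptions rather than a tested pipeline: the data is synthetic (`make_classification` stands in for the engineered Titanic features), the hyperparameters are arbitrary, and `LogisticRegression` is swapped in as the meta-learner where the plan names XGBoost, to keep the example dependent on scikit-learn only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in for the engineered Titanic features (891 train, 418 test rows).
X, y = make_classification(n_samples=1309, n_features=8, random_state=0)
X_train, y_train, X_test = X[:891], y[:891], X[891:]

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    AdaBoostClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    SVC(probability=True, random_state=0),
]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

oof_cols, test_cols = [], []
for model in base_models:
    # Out-of-fold probabilities: each training row is scored by a model that never saw it.
    oof = cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
    oof_cols.append(oof)
    # Refit on all training rows to score the test set.
    test_cols.append(model.fit(X_train, y_train).predict_proba(X_test)[:, 1])

meta_train = np.column_stack(oof_cols)
meta_test = np.column_stack(test_cols)

# Meta-learner: LogisticRegression here; the plan above proposes XGBoost instead.
meta = LogisticRegression().fit(meta_train, y_train)
stacked_proba = meta.predict_proba(meta_test)[:, 1]
```

Training the meta-learner on out-of-fold predictions rather than in-sample predictions is what keeps the base models' training-set overfitting from leaking into the second level. Threshold tuning would then be applied to `stacked_proba`.
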
## Key Insights

1. **Threshold tuning validated**: LB improved from 0.7775 to 0.7847 by adjusting threshold to match ~31% survival rate

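2. **CV overestimates LB by roughly 5 points**: the gap ranged from 4.63 to 7.32 points across the three submissions, so CV gains smaller than that spread may not carry over to the LB

3. **Survival rate matters**: Predictions with a lower survival rate (~31%) scored better than the submission closest to the training rate of 38.4% (XGBoost, 37.6%)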

4. **Next steps**:
   - Stacking is the most promising approach (reported 0.808 LB)
   - Blending RF + Threshold-Tuned could also help
   - More feature engineering (FamilySize, Has_Cabin, Deck)

5. **Target of 1.0 is impossible**: Focus on incremental improvements toward 0.80+
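
The feature ideas named under next steps can be sketched in a few lines of pandas. This assumes the standard Titanic column names (`SibSp`, `Parch`, `Cabin`, `Ticket`); the `add_features` helper, the `TicketFreq` column name, and the `"U"` placeholder for a missing deck are illustrative choices, and the demo frame is made up:

```python
import pandas as pd


def add_features(df):
    """Add the candidate features named above, using standard Titanic column names."""
    out = df.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1  # passenger plus relatives aboard
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    out["Has_Cabin"] = out["Cabin"].notna().astype(int)
    # Deck is the leading letter of the cabin code; "U" (unknown) is a placeholder choice.
    out["Deck"] = out["Cabin"].str[0].fillna("U")
    # Number of rows in this frame that share the same ticket.
    out["TicketFreq"] = out.groupby("Ticket")["Ticket"].transform("count")
    return out


# Tiny made-up frame; the real input would be train and test concatenated.
demo = pd.DataFrame({
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Cabin": ["C85", None, None],
    "Ticket": ["A1", "A1", "B2"],
})
features = add_features(demo)
print(features[["FamilySize", "IsAlone", "Has_Cabin", "Deck", "TicketFreq"]])
```

Computing the ticket frequency on train and test together gives a fuller count of passengers sharing a ticket than computing it on either file alone.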