# Loop 4 LB Feedback Analysis

## CRITICAL: XGBoost FAILED BADLY on LB

**Submission History:**
- exp_000: CV 0.8339 → LB 0.7799 (BEST LB, gap: +5.4%)
- exp_001: CV 0.8271 → LB 0.7727 (gap: +5.4%)
- exp_002: CV 0.8361 → LB 0.7703 (gap: +6.6%)
- exp_003: CV 0.8361 → LB 0.7584 (gap: +7.8%) ← WORST LB!

**Key Insight:** XGBoost (exp_003) performed TERRIBLY despite using the same features as exp_000.
Its 27 changed predictions (6.5% of the test set) hurt on net: more of them were WRONG than right.

**Only 1 submission remaining!**
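
The gaps in the submission history are differences between CV and LB accuracy, in percentage points. A minimal sketch of that arithmetic follows, using only the scores listed above. Converting the LB difference into a row count assumes the LB is scored on all 418 test rows, which this notebook does not confirm.

```python
# CV and LB accuracies copied from the submission history above.
cv = {"exp_000": 0.8339, "exp_001": 0.8271, "exp_002": 0.8361, "exp_003": 0.8361}
lb = {"exp_000": 0.7799, "exp_001": 0.7727, "exp_002": 0.7703, "exp_003": 0.7584}

# Gap in percentage points: how much CV overstates LB accuracy.
gap_pp = {name: round((cv[name] - lb[name]) * 100, 2) for name in cv}

# ASSUMPTION: the LB is computed on all 418 test rows.
N_TEST = 418
net_rows_lost = round((lb["exp_000"] - lb["exp_003"]) * N_TEST)

for name, gap in gap_pp.items():
    print(f"{name}: CV-LB gap = {gap:.2f} pp")
print(f"exp_003 vs exp_000: about {net_rows_lost} more rows wrong (net)")
```

Under that assumption, a net loss of about 9 rows out of 27 changed predictions would mean roughly 18 changes were wrong and 9 were right.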

In [1]:
import pandas as pd
import numpy as np

# Load all candidate predictions
exp_000 = pd.read_csv('/home/code/submission_candidates/candidate_000.csv')
exp_001 = pd.read_csv('/home/code/submission_candidates/candidate_001.csv')
exp_002 = pd.read_csv('/home/code/submission_candidates/candidate_002.csv')
exp_003 = pd.read_csv('/home/code/submission_candidates/candidate_003.csv')

print("Prediction distributions:")
print(f"exp_000 (LB 0.7799): {exp_000['Survived'].value_counts().to_dict()}")
print(f"exp_001 (LB 0.7727): {exp_001['Survived'].value_counts().to_dict()}")
print(f"exp_002 (LB 0.7703): {exp_002['Survived'].value_counts().to_dict()}")
print(f"exp_003 (LB 0.7584): {exp_003['Survived'].value_counts().to_dict()}")

Prediction distributions:
exp_000 (LB 0.7799): {0: 264, 1: 154}
exp_001 (LB 0.7727): {0: 255, 1: 163}
exp_002 (LB 0.7703): {0: 264, 1: 154}
exp_003 (LB 0.7584): {0: 273, 1: 145}


In [2]:
# Analyze prediction differences
all_preds = exp_000.copy()
all_preds['exp_000'] = exp_000['Survived']
all_preds['exp_001'] = exp_001['Survived']
all_preds['exp_002'] = exp_002['Survived']
all_preds['exp_003'] = exp_003['Survived']
all_preds = all_preds.drop('Survived', axis=1)

# Agreement analysis
print("\nPrediction Agreement Analysis:")
print(f"exp_000 vs exp_001: {(all_preds['exp_000'] == all_preds['exp_001']).sum()}/418 ({(all_preds['exp_000'] == all_preds['exp_001']).mean()*100:.1f}%)")
print(f"exp_000 vs exp_002: {(all_preds['exp_000'] == all_preds['exp_002']).sum()}/418 ({(all_preds['exp_000'] == all_preds['exp_002']).mean()*100:.1f}%)")
print(f"exp_000 vs exp_003: {(all_preds['exp_000'] == all_preds['exp_003']).sum()}/418 ({(all_preds['exp_000'] == all_preds['exp_003']).mean()*100:.1f}%)")


Prediction Agreement Analysis:
exp_000 vs exp_001: 395/418 (94.5%)
exp_000 vs exp_002: 410/418 (98.1%)
exp_000 vs exp_003: 391/418 (93.5%)
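
The cell above only compares each experiment against exp_000. A full pairwise agreement matrix could show whether the weaker models also disagree with each other. The sketch below uses a hypothetical helper `pairwise_agreement` and a toy frame standing in for `all_preds`; the toy values are illustrative, not the real predictions.

```python
from itertools import combinations

import pandas as pd


def pairwise_agreement(preds: pd.DataFrame) -> pd.DataFrame:
    """Return a symmetric matrix of row-wise agreement rates between columns."""
    cols = list(preds.columns)
    out = pd.DataFrame(1.0, index=cols, columns=cols)
    for a, b in combinations(cols, 2):
        rate = (preds[a] == preds[b]).mean()
        out.loc[a, b] = rate
        out.loc[b, a] = rate
    return out


# Toy stand-in for all_preds (hypothetical values).
toy = pd.DataFrame({
    "exp_000": [1, 0, 1, 0, 1],
    "exp_001": [1, 0, 0, 0, 1],
    "exp_003": [0, 0, 0, 0, 1],
})
agreement = pairwise_agreement(toy)
print(agreement.round(2))
```

With the real data, the same call would be `pairwise_agreement(all_preds[['exp_000', 'exp_001', 'exp_002', 'exp_003']])`.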


In [3]:
# LB scores for reference
lb_scores = {
    'exp_000': 0.7799,
    'exp_001': 0.7727,
    'exp_002': 0.7703,
    'exp_003': 0.7584
}

print("\nLB Score Ranking:")
for exp, score in sorted(lb_scores.items(), key=lambda x: x[1], reverse=True):
    print(f"  {exp}: {score:.4f}")

print(f"\nBest LB: exp_000 with {lb_scores['exp_000']:.4f}")
print(f"Worst LB: exp_003 with {lb_scores['exp_003']:.4f}")
print(f"Difference: {(lb_scores['exp_000'] - lb_scores['exp_003'])*100:.2f}%")


LB Score Ranking:
  exp_000: 0.7799
  exp_001: 0.7727
  exp_002: 0.7703
  exp_003: 0.7584

Best LB: exp_000 with 0.7799
Worst LB: exp_003 with 0.7584
Difference: 2.15%


In [4]:
# Majority vote analysis (with 4 voters, ">= 2" resolves a 2-2 tie to 1)
all_preds['majority_vote'] = ((all_preds['exp_000'] + all_preds['exp_001'] + all_preds['exp_002'] + all_preds['exp_003']) >= 2).astype(int)

print("\nMajority Vote Analysis (all 4 experiments):")
print(f"Majority vote distribution: {all_preds['majority_vote'].value_counts().to_dict()}")
print(f"\nAgreement with each experiment:")
for exp in ['exp_000', 'exp_001', 'exp_002', 'exp_003']:
    agreement = (all_preds['majority_vote'] == all_preds[exp]).sum()
    print(f"  vs {exp}: {agreement}/418 ({agreement/418*100:.1f}%)")


Majority Vote Analysis (all 4 experiments):
Majority vote distribution: {0: 260, 1: 158}

Agreement with each experiment:
  vs exp_000: 414/418 (99.0%)
  vs exp_001: 397/418 (95.0%)
  vs exp_002: 410/418 (98.1%)
  vs exp_003: 393/418 (94.0%)
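
With four voters, the `>= 2` threshold sends every 2-2 tie to class 1. One alternative, sketched below on toy votes, is a strict majority with ties deferred to the best-LB model (exp_000). The tie-break rule and the toy rows are assumptions for illustration, not something the notebook evaluated.

```python
import pandas as pd

# Toy votes (hypothetical), chosen to cover ties and clear majorities.
votes = pd.DataFrame({
    "exp_000": [1, 0, 1, 0, 1],
    "exp_001": [1, 0, 0, 1, 1],
    "exp_002": [0, 1, 0, 0, 1],
    "exp_003": [0, 1, 0, 0, 0],
})
n_ones = votes.sum(axis=1)

# Rule used in the cell above: a 2-2 tie is resolved to 1.
tie_to_one = (n_ones >= 2).astype(int)

# Alternative: strict majority (3+ votes), with 2-2 ties deferring to exp_000.
is_tie = n_ones == 2
tie_to_best = (n_ones > 2).astype(int).where(~is_tie, votes["exp_000"])

print(pd.DataFrame({"n_ones": n_ones, "tie_to_one": tie_to_one, "tie_to_best": tie_to_best}))
```

The two rules differ only on tied rows where exp_000 votes 0, such as the second toy row.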


In [5]:
# Best 3 experiments (excluding worst exp_003)
all_preds['majority_top3'] = ((all_preds['exp_000'] + all_preds['exp_001'] + all_preds['exp_002']) >= 2).astype(int)

print("\nMajority Vote (Top 3 - excluding exp_003):")
print(f"Distribution: {all_preds['majority_top3'].value_counts().to_dict()}")
print(f"\nAgreement with exp_000 (best LB): {(all_preds['majority_top3'] == all_preds['exp_000']).sum()}/418 ({(all_preds['majority_top3'] == all_preds['exp_000']).mean()*100:.1f}%)")


Majority Vote (Top 3 - excluding exp_003):
Distribution: {0: 264, 1: 154}

Agreement with exp_000 (best LB): 416/418 (99.5%)


In [6]:
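# Weighted voting based on LB performance
# Weight by LB score (higher is better). The four scores are so close that the
# normalized weights are nearly uniform (about 0.246 to 0.253), so with the 0.5
# threshold this vote gives the same result as the top-3 majority vote.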
weights = {
    'exp_000': 0.7799,
    'exp_001': 0.7727,
    'exp_002': 0.7703,
    'exp_003': 0.7584
}

# Normalize weights
total_weight = sum(weights.values())
norm_weights = {k: v/total_weight for k, v in weights.items()}

print("\nWeighted Voting (by LB score):")
print(f"Normalized weights: {norm_weights}")

weighted_sum = (all_preds['exp_000'] * norm_weights['exp_000'] + 
                all_preds['exp_001'] * norm_weights['exp_001'] + 
                all_preds['exp_002'] * norm_weights['exp_002'] + 
                all_preds['exp_003'] * norm_weights['exp_003'])

all_preds['weighted_vote'] = (weighted_sum >= 0.5).astype(int)
print(f"\nWeighted vote distribution: {all_preds['weighted_vote'].value_counts().to_dict()}")
print(f"Agreement with exp_000: {(all_preds['weighted_vote'] == all_preds['exp_000']).sum()}/418")


Weighted Voting (by LB score):
Normalized weights: {'exp_000': 0.25310745464576645, 'exp_001': 0.2507707785674878, 'exp_002': 0.2499918865413949, 'exp_003': 0.246129880245351}

Weighted vote distribution: {0: 264, 1: 154}
Agreement with exp_000: 416/418
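
The weighted vote and the top-3 majority vote report the same distribution and the same agreement with exp_000. The sketch below enumerates all 16 possible vote patterns to check whether the two rules can ever differ under these weights. It uses only the LB scores and the 0.5 threshold from the cell above.

```python
from itertools import product

lb_scores = {"exp_000": 0.7799, "exp_001": 0.7727, "exp_002": 0.7703, "exp_003": 0.7584}
total = sum(lb_scores.values())
norm = {name: score / total for name, score in lb_scores.items()}
names = list(lb_scores)

mismatches = []
for pattern in product([0, 1], repeat=len(names)):
    votes = dict(zip(names, pattern))
    # Weighted vote with the same 0.5 threshold as the notebook cell.
    weighted = int(sum(norm[n] * votes[n] for n in names) >= 0.5)
    # Plain majority of the three best-LB models.
    top3 = int(votes["exp_000"] + votes["exp_001"] + votes["exp_002"] >= 2)
    if weighted != top3:
        mismatches.append(pattern)

print(f"Patterns where the two rules differ: {len(mismatches)} of 16")
```

No pattern separates the two rules: any pair of top-3 models carries just over half the total weight, while exp_003 paired with any single other model carries just under half. The weighted ensemble is therefore not a distinct candidate from the top-3 majority vote.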


In [7]:
# Best strategy: Trust exp_000 (best LB) but consider where other models agree
# If exp_000, exp_001, exp_002 all agree, that's likely correct
all_preds['top3_unanimous'] = ((all_preds['exp_000'] == all_preds['exp_001']) & 
                               (all_preds['exp_001'] == all_preds['exp_002']))

print("\nUnanimous Agreement Analysis (top 3 experiments):")
print(f"Top 3 unanimous: {all_preds['top3_unanimous'].sum()}/418 ({all_preds['top3_unanimous'].mean()*100:.1f}%)")

# Where top 3 disagree
disagree_mask = ~all_preds['top3_unanimous']
print(f"\nDisagreements: {disagree_mask.sum()} cases")


Unanimous Agreement Analysis (top 3 experiments):
Top 3 unanimous: 389/418 (93.1%)

Disagreements: 29 cases
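
Not all 29 disagreements matter for a top-3 majority vote: it changes an exp_000 prediction only where exp_001 and exp_002 agree with each other and outvote exp_000. Grouping the disagreement rows by vote pattern makes that split visible. The sketch below runs on a toy frame with hypothetical values, not the real predictions.

```python
import pandas as pd

# Toy stand-in for the top-3 columns of all_preds (hypothetical values).
toy = pd.DataFrame({
    "exp_000": [1, 0, 1, 0, 0, 1, 1],
    "exp_001": [1, 0, 0, 1, 1, 1, 0],
    "exp_002": [1, 0, 0, 1, 0, 1, 0],
})
unanimous = (toy["exp_000"] == toy["exp_001"]) & (toy["exp_001"] == toy["exp_002"])
disagree = toy[~unanimous]

# Count how often each (exp_000, exp_001, exp_002) vote pattern occurs.
pattern_counts = disagree.groupby(["exp_000", "exp_001", "exp_002"]).size()

# Rows where exp_001 and exp_002 agree with each other, so exp_000 is outvoted.
outvoted = int((disagree["exp_001"] == disagree["exp_002"]).sum())

print(pattern_counts)
print(f"exp_000 outvoted in {outvoted} of {len(disagree)} disagreements")
```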


In [8]:
# Final analysis: What should we submit?
print("="*60)
print("FINAL SUBMISSION STRATEGY ANALYSIS")
print("="*60)

print("\n1. BEST OPTION: Submit exp_000 again (already submitted, LB 0.7799)")
print("   - This is our best LB score")
print("   - But we already submitted it, so no new information")

print("\n2. RISKY OPTION: Submit majority vote of top 3")
print(f"   - Differs from exp_000 by {(all_preds['majority_top3'] != all_preds['exp_000']).sum()} predictions")
print("   - Could be better or worse")

print("\n3. SAME PREDICTIONS AS OPTION 2: Submit weighted ensemble")
print(f"   - Differs from exp_000 by {(all_preds['weighted_vote'] != all_preds['exp_000']).sum()} predictions")
print("   - LB-score weights are nearly uniform, so this reduces to the top-3 majority vote")

print("\n" + "="*60)
print("RECOMMENDATION: Given only 1 submission left,")
print("the safest approach is to submit exp_000 (best LB)")
print("OR a conservative ensemble that stays close to exp_000.")
print("="*60)

FINAL SUBMISSION STRATEGY ANALYSIS

1. BEST OPTION: Submit exp_000 again (already submitted, LB 0.7799)
   - This is our best LB score
   - But we already submitted it, so no new information

2. RISKY OPTION: Submit majority vote of top 3
   - Differs from exp_000 by 2 predictions
   - Could be better or worse

3. SAME PREDICTIONS AS OPTION 2: Submit weighted ensemble
   - Differs from exp_000 by 2 predictions
   - LB-score weights are nearly uniform, so this reduces to the top-3 majority vote

RECOMMENDATION: Given only 1 submission left,
the safest approach is to submit exp_000 (best LB)
OR a conservative ensemble that stays close to exp_000.
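
How risky an ensemble is depends mainly on how many predictions it changes relative to exp_000. In binary classification, flipping one prediction always turns a correct row into a wrong one or the reverse, so each flip moves accuracy by exactly one row. The sketch below bounds the possible LB score for a given number of flips. The helper `lb_bounds` is hypothetical, and it assumes the LB is scored on all 418 test rows, which the notebook does not confirm.

```python
N_TEST = 418      # test rows, from the agreement counts above
BASE_LB = 0.7799  # exp_000 public LB score


def lb_bounds(base_lb: float, n_changed: int, n_scored: int = N_TEST) -> tuple:
    """Worst and best accuracy if n_changed predictions are flipped.

    ASSUMPTION: every one of the n_scored rows counts toward the LB score.
    """
    delta = n_changed / n_scored
    return max(0.0, base_lb - delta), min(1.0, base_lb + delta)


for label, k in [("top-3 majority", 2), ("AND of top 2", 7), ("OR of top 2", 16), ("exp_003", 27)]:
    worst, best = lb_bounds(BASE_LB, k)
    print(f"{label:>15}: {k:2d} flips -> LB between {worst:.4f} and {best:.4f}")
```

For the 2-flip ensemble, the only possible outcomes under this assumption are two rows worse, unchanged, or two rows better than exp_000.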


In [9]:
# Two-model ensembles of exp_000 and exp_001 (best 2 LB scores).
# With only two voters there is no majority, so compare OR (top2_majority) with AND (top2_both).
all_preds['top2_majority'] = ((all_preds['exp_000'] + all_preds['exp_001']) >= 1).astype(int)  # OR logic
all_preds['top2_both'] = ((all_preds['exp_000'] + all_preds['exp_001']) == 2).astype(int)  # AND logic

print("\nTop 2 Ensemble Options (exp_000 + exp_001):")
print(f"OR logic (predict 1 if either predicts 1): {all_preds['top2_majority'].value_counts().to_dict()}")
print(f"AND logic (predict 1 only if both predict 1): {all_preds['top2_both'].value_counts().to_dict()}")
print(f"\nexp_000 distribution: {all_preds['exp_000'].value_counts().to_dict()}")

print(f"\nOR differs from exp_000 by: {(all_preds['top2_majority'] != all_preds['exp_000']).sum()}")
print(f"AND differs from exp_000 by: {(all_preds['top2_both'] != all_preds['exp_000']).sum()}")


Top 2 Ensemble Options (exp_000 + exp_001):
OR logic (predict 1 if either predicts 1): {0: 248, 1: 170}
AND logic (predict 1 only if both predict 1): {0: 271, 1: 147}

exp_000 distribution: {0: 264, 1: 154}

OR differs from exp_000 by: 16
AND differs from exp_000 by: 7
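
These counts can be cross-checked against the earlier outputs. OR only adds 1s where exp_001 says 1 and exp_000 says 0; AND only removes 1s where exp_000 says 1 and exp_001 says 0. Together they must account for every row where the two models disagree. A small consistency check, using only numbers printed in this notebook:

```python
N_TEST = 418
agree_000_001 = 395          # from the agreement analysis
ones_000, ones_001 = 154, 163  # predicted survivors per model
or_diff, and_diff = 16, 7    # from the output above

disagreements = N_TEST - agree_000_001
ones_or = ones_000 + or_diff    # OR can only turn 0s into 1s
ones_and = ones_000 - and_diff  # AND can only turn 1s into 0s

print(f"Disagreements between exp_000 and exp_001: {disagreements}")
print(f"OR predicts {ones_or} survivors, AND predicts {ones_and}")
```

All counts agree with the printed distributions, including exp_001's 163 predicted survivors (154 + 16 - 7).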


In [10]:
# Analyze the 2 predictions where majority_top3 differs from exp_000
diff_mask = all_preds['majority_top3'] != all_preds['exp_000']
print(f"Cases where majority_top3 differs from exp_000: {diff_mask.sum()}")

# Load test data to understand these cases
test = pd.read_csv('/home/data/test.csv')
test_with_preds = test.copy()
test_with_preds['exp_000'] = all_preds['exp_000']
test_with_preds['exp_001'] = all_preds['exp_001']
test_with_preds['exp_002'] = all_preds['exp_002']
test_with_preds['exp_003'] = all_preds['exp_003']
test_with_preds['majority_top3'] = all_preds['majority_top3']

# Show the differing cases
diff_cases = test_with_preds[diff_mask]
print("\nDiffering cases:")
print(diff_cases[['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 
                  'exp_000', 'exp_001', 'exp_002', 'majority_top3']].to_string())

Cases where majority_top3 differs from exp_000: 2

Differing cases:
     PassengerId  Pclass                                               Name     Sex   Age  SibSp  Parch    Fare  exp_000  exp_001  exp_002  majority_top3
33           925       3  Johnston, Mrs. Andrew G (Elizabeth Lily" Watson)"  female   NaN      1      2  23.450        1        0        0              0
412         1304       3                     Henriksson, Miss. Jenny Lovisa  female  28.0      0      0   7.775        0        1        1              1


In [11]:
# Analysis of the 2 differing cases:
# 1. PassengerId 925: Mrs. Johnston - 3rd class female with family (SibSp=1, Parch=2)
#    - exp_000 predicts 1 (survived), exp_001 and exp_002 predict 0 (died)
#    - 3rd class females with large families had lower survival
#    - Majority says 0 (died) - this might be correct
#
# 2. PassengerId 1304: Miss Henriksson - 3rd class female, alone, age 28
#    - exp_000 predicts 0 (died), exp_001 and exp_002 predict 1 (survived)
#    - Single young female - could go either way
#    - Majority says 1 (survived) - this might be correct

print("Analysis of differing cases:")
print("\n1. PassengerId 925 (Mrs. Johnston):")
print("   - 3rd class female with family (SibSp=1, Parch=2)")
print("   - exp_000: 1 (survived), exp_001/exp_002: 0 (died)")
print("   - Large family in 3rd class = lower survival")
print("   - Majority vote: 0 (died) - likely CORRECT")

print("\n2. PassengerId 1304 (Miss Henriksson):")
print("   - 3rd class female, alone, age 28")
print("   - exp_000: 0 (died), exp_001/exp_002: 1 (survived)")
print("   - Young single female - could go either way")
print("   - Majority vote: 1 (survived) - uncertain")

print("\n" + "="*60)
print("CONCLUSION: The majority_top3 ensemble changes only 2 predictions")
print("from exp_000. One change is plausible on domain knowledge, the other is uncertain.")
print("This is a LOW-RISK change that could improve LB slightly.")
print("="*60)

Analysis of differing cases:

1. PassengerId 925 (Mrs. Johnston):
   - 3rd class female with family (SibSp=1, Parch=2)
   - exp_000: 1 (survived), exp_001/exp_002: 0 (died)
   - Large family in 3rd class = lower survival
   - Majority vote: 0 (died) - likely CORRECT

2. PassengerId 1304 (Miss Henriksson):
   - 3rd class female, alone, age 28
   - exp_000: 0 (died), exp_001/exp_002: 1 (survived)
   - Young single female - could go either way
   - Majority vote: 1 (survived) - uncertain

CONCLUSION: The majority_top3 ensemble changes only 2 predictions
from exp_000. One change is plausible on domain knowledge, the other is uncertain.
This is a LOW-RISK change that could improve LB slightly.
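
The domain-knowledge argument for these cases could be checked against survival rates in the training set rather than taken on intuition. The sketch below uses a hypothetical helper, `subgroup_survival`, that returns the size and survival rate of a passenger subgroup. It assumes the training file has `Pclass`, `Sex`, `SibSp`, `Parch`, and `Survived` columns, and it runs here on synthetic rows, not real Titanic data.

```python
import pandas as pd


def subgroup_survival(train: pd.DataFrame, pclass: int, sex: str, min_family: int) -> tuple:
    """Return (n_rows, survival_rate) for passengers matching the subgroup."""
    family = train["SibSp"] + train["Parch"]
    mask = (train["Pclass"] == pclass) & (train["Sex"] == sex) & (family >= min_family)
    group = train.loc[mask, "Survived"]
    rate = float(group.mean()) if len(group) else float("nan")
    return len(group), rate


# Synthetic rows for illustration only (NOT real training data).
synthetic = pd.DataFrame({
    "Pclass":   [3, 3, 3, 1, 3],
    "Sex":      ["female", "female", "male", "female", "female"],
    "SibSp":    [1, 0, 1, 1, 2],
    "Parch":    [2, 0, 2, 2, 1],
    "Survived": [0, 1, 0, 1, 1],
})
n, rate = subgroup_survival(synthetic, pclass=3, sex="female", min_family=3)
print(f"3rd class females with family size >= 3: n={n}, survival rate={rate:.2f}")
```

Running the same helper on the real training frame would show whether the "large family in 3rd class" claim holds and how many rows support it.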


In [None]:
# KEY INSIGHT: The target of 1.0 (100% accuracy) is IMPOSSIBLE
# Best Kaggle scores for Titanic are around 80-82%
# Our best LB is 0.7799 (77.99%)
# We cannot realistically reach 1.0 - this is an unrealistic target

print("="*60)
print("REALITY CHECK: TARGET OF 1.0 IS IMPOSSIBLE")
print("="*60)
print("\nTitanic competition facts:")
print("- Best public scores are around 80-82%")
print("- Perfect prediction is impossible due to:")
print("  1. Missing information (some passengers' fates are unknowable)")
print("  2. Random factors in survival (luck, timing, etc.)")
print("  3. Noise in the data")
print("\nOur performance:")
print("- Best LB: 0.7799 (77.99%)")
print("- This is competitive but not top-tier")
print("- Typical good scores: 78-80%")
print("- Top scores: 80-82%")
print("\nTo improve further, we would need:")
print("- More sophisticated feature engineering")
print("- Better handling of edge cases")
print("- Possibly manual data cleaning/research")
print("="*60)