# Loop 3 LB Feedback Analysis

## Critical Finding: Removing Age WORSENED LB Performance

**Submission Results:**
- exp_000 (RF with Age): CV 0.8339 → LB 0.7799 (gap: 5.4 points)
- exp_001 (Stacking): CV 0.8271 → LB 0.7727 (gap: 5.4 points)
- exp_002 (RF no Age): CV 0.8361 → LB 0.7703 (gap: 6.6 points) ← WORST LB!

**The hypothesis was WRONG:**
- We thought removing Age would narrow the CV-LB gap
- Instead, LB dropped by 0.96 percentage points (0.7799 → 0.7703) while CV improved by 0.22 points (0.8339 → 0.8361)
- The gap WIDENED from 5.4 to 6.6 points

**What this tells us:**
1. Age IS important for test set predictions, despite distribution shift
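2. The adversarial validation finding (Age = 56.9% shift) was misleading as a reason to drop Age: a feature that separates train from test can still carry predictive signal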
3. We need to use Age, but in a more robust way
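
The gap figures quoted above can be reproduced directly from the reported scores. This is a minimal sketch that only re-does the arithmetic; the dictionary name `scores` is introduced here for illustration.

```python
# Scores copied from the submission results above: (CV accuracy, LB accuracy)
scores = {
    'exp_000': (0.8339, 0.7799),
    'exp_001': (0.8271, 0.7727),
    'exp_002': (0.8361, 0.7703),
}

# CV-LB gap in percentage points, rounded to one decimal
gaps = {name: round((cv - lb) * 100, 1) for name, (cv, lb) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: CV-LB gap = {gap} points")
```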

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

# Load data
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")

Train shape: (891, 12)
Test shape: (418, 11)


In [2]:
# Analyze what predictions changed between exp_000 and exp_002
exp_000 = pd.read_csv('/home/code/submission_candidates/candidate_000.csv')
exp_002 = pd.read_csv('/home/code/submission_candidates/candidate_002.csv')

# Compare predictions
exp_000['pred_000'] = exp_000['Survived']
exp_002['pred_002'] = exp_002['Survived']

comparison = pd.merge(exp_000[['PassengerId', 'pred_000']], 
                       exp_002[['PassengerId', 'pred_002']], 
                       on='PassengerId')

comparison['changed'] = comparison['pred_000'] != comparison['pred_002']

print(f"Total predictions: {len(comparison)}")
print(f"Predictions that changed: {comparison['changed'].sum()}")
print(f"Percentage changed: {comparison['changed'].mean()*100:.1f}%")

# What changed?
print(f"\nChanges breakdown:")
print(f"  0→1 (now predicting survival): {((comparison['pred_000']==0) & (comparison['pred_002']==1)).sum()}")
print(f"  1→0 (now predicting death): {((comparison['pred_000']==1) & (comparison['pred_002']==0)).sum()}")

Total predictions: 418
Predictions that changed: 8
Percentage changed: 1.9%

Changes breakdown:
  0→1 (now predicting survival): 4
  1→0 (now predicting death): 4
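
A rough way to read the 8 changed predictions against the LB scores, sketched under one loud assumption: that the LB accuracy is computed on all 418 test rows (this is not confirmed anywhere in this notebook). Under that assumption the two scores translate into counts of correct rows, and the 8 flips can be split into flips that broke a correct prediction and flips that fixed a wrong one.

```python
# ASSUMPTION: the leaderboard scores every one of the 418 test rows.
n_test = 418
lb_with_age = 0.7799   # exp_000
lb_no_age = 0.7703     # exp_002
n_changed = 8          # predictions that differ between the two submissions

correct_with_age = round(lb_with_age * n_test)
correct_no_age = round(lb_no_age * n_test)
net_lost = correct_with_age - correct_no_age

# Every flipped row is either "broke" (correct -> wrong) or "fixed" (wrong -> correct):
#   broke + fixed = n_changed
#   broke - fixed = net_lost
broke = (n_changed + net_lost) // 2
fixed = (n_changed - net_lost) // 2

print(f"Correct rows: {correct_with_age} with Age, {correct_no_age} without")
print(f"Flips that broke a correct prediction: {broke}")
print(f"Flips that fixed a wrong prediction: {fixed}")
```

If the assumption holds, this is what "mostly wrong" means in concrete terms; if the LB uses only a subset of rows, the split cannot be recovered this way.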


In [3]:
# Since exp_000 had better LB (0.7799) vs exp_002 (0.7703), 
# the changes made by removing Age were mostly WRONG
# Let's understand which passengers changed predictions

changed_ids = comparison[comparison['changed']]['PassengerId'].values
print(f"Changed PassengerIds: {len(changed_ids)}")

# Look at these passengers in test data
test_changed = test[test['PassengerId'].isin(changed_ids)].copy()
print(f"\nCharacteristics of passengers with changed predictions:")
# Note: describe() summarizes only numeric columns by default,
# so Sex and Embarked do not appear in the output below
print(test_changed[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']].describe())

Changed PassengerIds: 8

Characteristics of passengers with changed predictions:
         Pclass        Age    SibSp     Parch       Fare
count  8.000000   7.000000  8.00000  8.000000   8.000000
mean   2.250000  34.428571  0.75000  0.375000  32.958337
std    1.035098  18.635603  0.46291  0.744024  29.443741
min    1.000000  16.000000  0.00000  0.000000   7.775000
25%    1.000000  20.000000  0.75000  0.000000   8.400025
50%    3.000000  28.000000  1.00000  0.000000  18.675000
75%    3.000000  46.500000  1.00000  0.250000  59.402100
max    3.000000  64.000000  1.00000  2.000000  75.250000
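
Because the summary above drops `Sex` and `Embarked`, it may help to look at the changed rows directly, with both predictions side by side. The sketch below uses toy stand-ins for `test` and `comparison`; the helper name `summarize_changed` and the toy values are hypothetical, and only the column names come from the cells above.

```python
import pandas as pd

def summarize_changed(test_df, comparison_df, cat_cols=('Sex', 'Embarked')):
    """Return the changed rows with both predictions, plus counts per categorical column."""
    changed = comparison_df.loc[comparison_df['changed'],
                                ['PassengerId', 'pred_000', 'pred_002']]
    rows = test_df.merge(changed, on='PassengerId', how='inner')
    counts = {col: rows[col].value_counts(dropna=False) for col in cat_cols}
    return rows, counts

# Toy data standing in for the real test set and comparison frame
toy_test = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Sex': ['male', 'female', 'male', 'female'],
    'Embarked': ['S', 'C', 'S', 'Q'],
    'Age': [22.0, 38.0, None, 35.0],
})
toy_comparison = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'pred_000': [0, 1, 0, 1],
    'pred_002': [1, 1, 0, 0],
})
toy_comparison['changed'] = toy_comparison['pred_000'] != toy_comparison['pred_002']

rows, counts = summarize_changed(toy_test, toy_comparison)
print(rows)
print(counts['Sex'])
```

With the real frames, the call would be `summarize_changed(test, comparison)`.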


In [4]:
# Key insight: Age provides signal that Title alone cannot capture
# Let's analyze the relationship between Age and Title

# Raw string so that \. is passed to the regex engine as a literal dot
train['Title'] = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
train['Title'] = train['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
train['Title'] = train['Title'].replace(['Mlle', 'Ms'], 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')

print("Age distribution by Title:")
for title in ['Mr', 'Miss', 'Mrs', 'Master', 'Rare']:
    ages = train[train['Title'] == title]['Age'].dropna()
    print(f"  {title}: mean={ages.mean():.1f}, std={ages.std():.1f}, range=[{ages.min():.0f}-{ages.max():.0f}]")

print("\nSurvival by Title and Age group:")
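# Bins are right-inclusive: 'Child' covers 0 < Age <= 16; rows with missing Age get NaN
train['Age_Group'] = pd.cut(train['Age'], bins=[0, 16, 32, 48, 64, 100], labels=['Child', 'Young', 'Middle', 'Senior', 'Elder'])
# observed=False keeps empty Title/Age_Group combinations in the table (shown as NaN, count 0)
print(train.groupby(['Title', 'Age_Group'], observed=False)['Survived'].agg(['mean', 'count']).round(3))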

Age distribution by Title:
  Mr: mean=32.4, std=12.7, range=[11-80]
  Miss: mean=21.8, std=12.9, range=[1-63]
  Mrs: mean=35.8, std=11.4, range=[14-63]
  Master: mean=4.6, std=3.6, range=[0-12]
  Rare: mean=45.5, std=11.8, range=[23-70]

Survival by Title and Age group:


                   mean  count
Title  Age_Group              
Master Child      0.583     36
       Young        NaN      0
       Middle       NaN      0
       Senior       NaN      0
       Elder        NaN      0
Miss   Child      0.660     47
       Young      0.724     76
       Middle     0.842     19
       Senior     0.857      7
       Elder        NaN      0
Mr     Child      0.067     15
       Young      0.176    222
       Middle     0.191    115
       Senior     0.111     36
       Elder      0.100     10
Mrs    Child      1.000      2
       Young      0.750     44
       Middle     0.766     47
       Senior     0.938     16
       Elder        NaN      0
Rare   Child        NaN      0
       Young      0.250      4
       Middle     0.286      7
       Senior     0.500     10
       Elder      0.000      1


In [5]:
# The key insight: Within each Title, Age still matters!
# For example, older 'Mr' might have different survival than younger 'Mr'

# Let's look at Mr specifically (largest group)
mr_data = train[train['Title'] == 'Mr'].copy()
print("Mr survival by Age group:")
# observed=False keeps empty age groups in the output
print(mr_data.groupby('Age_Group', observed=False)['Survived'].agg(['mean', 'count']).round(3))

print("\nMiss survival by Age group:")
miss_data = train[train['Title'] == 'Miss'].copy()
print(miss_data.groupby('Age_Group', observed=False)['Survived'].agg(['mean', 'count']).round(3))

Mr survival by Age group:
            mean  count
Age_Group              
Child      0.067     15
Young      0.176    222
Middle     0.191    115
Senior     0.111     36
Elder      0.100     10

Miss survival by Age group:
            mean  count
Age_Group              
Child      0.660     47
Young      0.724     76
Middle     0.842     19
Senior     0.857      7
Elder        NaN      0
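
Some of the cells in these tables rest on very few passengers, so the rates are noisy. As a rough sanity check (not part of the original analysis), the sketch below computes a normal-approximation standard error for a few rates copied from the tables above; `binomial_se` is a helper introduced here.

```python
import math

def binomial_se(rate, n):
    """Normal-approximation standard error of a proportion."""
    return math.sqrt(rate * (1 - rate) / n)

# (survival rate, count) copied from the Title x Age_Group output above
cells = {
    ('Mr', 'Young'): (0.176, 222),
    ('Mr', 'Middle'): (0.191, 115),
    ('Miss', 'Middle'): (0.842, 19),
    ('Miss', 'Senior'): (0.857, 7),
}

se = {key: binomial_se(rate, n) for key, (rate, n) in cells.items()}
for (title, group), value in se.items():
    rate, n = cells[(title, group)]
    print(f"{title:4s} {group:6s} rate={rate:.3f} n={n:3d} se~{value:.3f}")
```

The approximation is crude for small counts, but it is enough to show that differences between small groups deserve more caution than differences between the large `Mr` groups.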


In [6]:
# STRATEGY REVISION:
# 1. Age IS important - we should NOT remove it
# 2. The distribution shift is real, but Age still provides signal
# 3. We need a more robust approach to Age:
#    - Use Age binning instead of raw Age
#    - Or use Age interactions with other features
#    - Or use Age only for certain groups (e.g., children)

# Open question for the next experiment: what if we use Age_Bin instead of raw Age?
# Coarser bins might be less sensitive to the distribution shift while keeping the signal

print("\n" + "="*60)
print("STRATEGY REVISION")
print("="*60)
print("""
Key Learning from exp_002 submission:
- Removing Age HURT LB performance (0.7799 → 0.7703)
- Age provides important signal that Title cannot fully capture
- Within Title groups, Age still matters for survival

New Approach:
1. KEEP Age, but use it more robustly
2. Try Age binning (Child/Young/Middle/Senior) instead of raw Age
3. Consider Age interactions (Age*Pclass, Age*Sex)
4. Focus on the BEST LB model (exp_000) and improve it

Best LB so far: exp_000 (RF with Age) = 0.7799
""")


STRATEGY REVISION

Key Learning from exp_002 submission:
- Removing Age HURT LB performance (0.7799 → 0.7703)
- Age provides important signal that Title cannot fully capture
- Within Title groups, Age still matters for survival

New Approach:
1. KEEP Age, but use it more robustly
2. Try Age binning (Child/Young/Middle/Senior) instead of raw Age
3. Consider Age interactions (Age*Pclass, Age*Sex)
4. Focus on the BEST LB model (exp_000) and improve it

Best LB so far: exp_000 (RF with Age) = 0.7799
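
The "more robust" Age options listed above could look roughly like the sketch below. The bin edges and labels are copied from the `Age_Group` cell and the `Age < 16` threshold from the `IsChild` definition used elsewhere in this notebook; the function name `add_age_features` and the column `Age_x_Pclass` are hypothetical, and the actual feature code used for exp_000 is not shown here.

```python
import numpy as np
import pandas as pd

def add_age_features(df):
    """Add binned, flag, and interaction versions of Age (illustrative only)."""
    out = df.copy()
    # Coarse bins, right-inclusive: 'Child' covers 0 < Age <= 16
    out['Age_Bin'] = pd.cut(out['Age'],
                            bins=[0, 16, 32, 48, 64, 100],
                            labels=['Child', 'Young', 'Middle', 'Senior', 'Elder'])
    # Binary flag; rows with missing Age end up as 0
    out['IsChild'] = (out['Age'] < 16).astype(int)
    # Simple interaction term; stays NaN where Age is missing
    out['Age_x_Pclass'] = out['Age'] * out['Pclass']
    return out

# Toy rows standing in for the real training data
toy = pd.DataFrame({
    'Age': [4.0, 16.0, 30.0, np.nan, 70.0],
    'Pclass': [3, 2, 1, 3, 1],
})
feats = add_age_features(toy)
print(feats)
```

Tree models can take `Age_Bin` after ordinal or one-hot encoding; which variant transfers best to the test set would still have to be checked against CV and LB.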



In [7]:
# Let's analyze what made exp_000 successful
# It used: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked, Title, FamilySize, IsAlone, Has_Cabin, Age_Bin, IsChild

# Key differences from exp_002:
# - exp_000 had Age, Age_Bin, IsChild
# - exp_002 removed all Age-related features

# The Age_Bin and IsChild features might be more robust than raw Age
# As a first check, look at how much survival signal they carry on the training set

print("Feature analysis for robust Age usage:")
print("\n1. IsChild (Age < 16) survival:")
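# NaN < 16 evaluates to False, so passengers with missing Age land in IsChild == 0
train['IsChild'] = (train['Age'] < 16).astype(int)
print(train.groupby('IsChild')['Survived'].agg(['mean', 'count']).round(3))

print("\n2. Age_Bin survival:")
# Age_Group's 'Child' bin is 0 < Age <= 16, so it also includes 16-year-olds that IsChild excludes
print(train.groupby('Age_Group', observed=False)['Survived'].agg(['mean', 'count']).round(3))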

print("\n3. Children by Sex:")
children = train[train['IsChild'] == 1]
print(children.groupby('Sex')['Survived'].agg(['mean', 'count']).round(3))

Feature analysis for robust Age usage:

1. IsChild (Age < 16) survival:
          mean  count
IsChild              
0        0.363    808
1        0.590     83

2. Age_Bin survival:
            mean  count
Age_Group              
Child      0.550    100
Young      0.370    346
Middle     0.404    188
Senior     0.435     69
Elder      0.091     11

3. Children by Sex:
         mean  count
Sex                 
female  0.651     43
male    0.525     40
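
One detail behind the numbers above: the `Age_Group` counts sum to 714, so 177 of the 891 training rows have no Age, and all of them sit in the `IsChild == 0` group of 808. A sketch of how the unknown-age rows could be kept separate is shown below; the column name `ChildStatus` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real training data
toy = pd.DataFrame({
    'Age': [2.0, 15.0, 16.0, 40.0, np.nan, np.nan],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# Same definition as in the cell above: NaN < 16 is False, so missing Age becomes 0
toy['IsChild'] = (toy['Age'] < 16).astype(int)

# Three-way split that does not mix unknown ages with adults
toy['ChildStatus'] = np.select(
    [toy['Age'].isna(), toy['Age'] < 16],
    ['unknown', 'child'],
    default='adult',
)

summary = toy.groupby('ChildStatus')['Survived'].agg(['mean', 'count'])
print(toy['IsChild'].value_counts())
print(summary)
```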


In [8]:
# FINAL ANALYSIS: What should we try next?

print("="*60)
print("NEXT STEPS ANALYSIS")
print("="*60)

print("""
Submission Budget: 2 remaining

Best LB: exp_000 = 0.7799 (RF with Age)
Target: 1.0 (impossible - typical best is 80-82%)

Options:
1. Return to exp_000 approach and tune hyperparameters
2. Try Age binning instead of raw Age
3. Try ensemble of exp_000 and exp_002 predictions
4. Try different model (XGBoost, LightGBM)

Recommendation:
Since exp_000 is our best LB, we should:
1. Analyze what makes it work
2. Try small improvements (hyperparameter tuning, feature tweaks)
3. Consider blending with other predictions

With only 2 submissions left, we need to be strategic.
""")

# Check how much exp_000 and exp_002 differ: blending can only change
# the rows where the two submissions disagree
print("\nBlending Analysis:")
print(f"exp_000 predictions: 0={exp_000['pred_000'].value_counts()[0]}, 1={exp_000['pred_000'].value_counts()[1]}")
print(f"exp_002 predictions: 0={exp_002['pred_002'].value_counts()[0]}, 1={exp_002['pred_002'].value_counts()[1]}")

# Agreement analysis
agreement = (comparison['pred_000'] == comparison['pred_002']).mean()
print(f"\nAgreement between exp_000 and exp_002: {agreement*100:.1f}%")
print(f"Disagreement: {(1-agreement)*100:.1f}%")

NEXT STEPS ANALYSIS

Submission Budget: 2 remaining

Best LB: exp_000 = 0.7799 (RF with Age)
Target: 1.0 (impossible - typical best is 80-82%)

Options:
1. Return to exp_000 approach and tune hyperparameters
2. Try Age binning instead of raw Age
3. Try ensemble of exp_000 and exp_002 predictions
4. Try different model (XGBoost, LightGBM)

Recommendation:
Since exp_000 is our best LB, we should:
1. Analyze what makes it work
2. Try small improvements (hyperparameter tuning, feature tweaks)
3. Consider blending with other predictions

With only 2 submissions left, we need to be strategic.


Blending Analysis:
exp_000 predictions: 0=264, 1=154
exp_002 predictions: 0=264, 1=154

Agreement between exp_000 and exp_002: 98.1%
Disagreement: 1.9%
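
The candidate files compared here hold hard 0/1 labels, and a two-way vote between hard labels has no tie-breaker on the rows where the models disagree. A blend would therefore need predicted probabilities. The sketch below assumes such probabilities were saved, which this notebook does not show; the function `blend` and all toy values are hypothetical.

```python
import numpy as np

def blend(p_a, p_b, weight_a=0.5, threshold=0.5):
    """Weighted average of two probability vectors, thresholded to 0/1 labels."""
    p = weight_a * np.asarray(p_a) + (1 - weight_a) * np.asarray(p_b)
    return (p >= threshold).astype(int)

# Toy probabilities standing in for the two models' survival probabilities
p_000 = [0.9, 0.6, 0.4, 0.1]
p_002 = [0.8, 0.3, 0.7, 0.2]

equal_blend = blend(p_000, p_002)                  # equal weights
weighted_blend = blend(p_000, p_002, weight_a=0.8) # lean on the better-LB model

print("Equal weights:   ", equal_blend.tolist())
print("Weighted to 000: ", weighted_blend.tolist())
```

With 98.1% agreement, any such blend can differ from exp_000 on at most the 8 disagreeing rows, so the possible LB movement is small either way.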
