# Loop 2 LB Feedback Analysis

## Submission Results
- exp_000 (Baseline RF): CV 0.8339 → LB 0.7799 (gap: 5.4%)
- exp_001 (Stacking + Fixed Leakage): CV 0.8271 → LB 0.7727 (gap: 5.4%)

## Key Observations
1. Fixing data leakage did NOT improve LB - it got WORSE (0.7799 → 0.7727)
2. The CV-LB gap remained constant at ~5.4%
3. Stacking underperformed vs individual SVC (83.6% CV)

## Questions to Answer
1. Why did the fixed-leakage model perform worse on LB?
2. Is the distribution shift the real problem, not leakage?
3. What features are most robust to distribution shift?

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Load data
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")

Train shape: (891, 12)
Test shape: (418, 11)


In [2]:
# Compare Age distributions more carefully
print("=" * 50)
print("AGE DISTRIBUTION ANALYSIS")
print("=" * 50)

print("\nTrain Age (non-null):")
print(train['Age'].dropna().describe())

print("\nTest Age (non-null):")
print(test['Age'].dropna().describe())

# Check missing Age patterns
print("\nMissing Age by Pclass (Train):")
print(train.groupby('Pclass')['Age'].apply(lambda x: x.isna().sum() / len(x) * 100).round(1))

print("\nMissing Age by Pclass (Test):")
print(test.groupby('Pclass')['Age'].apply(lambda x: x.isna().sum() / len(x) * 100).round(1))

AGE DISTRIBUTION ANALYSIS

Train Age (non-null):
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

Test Age (non-null):
count    332.000000
mean      30.272590
std       14.181209
min        0.170000
25%       21.000000
50%       27.000000
75%       39.000000
max       76.000000
Name: Age, dtype: float64

Missing Age by Pclass (Train):
Pclass
1    13.9
2     6.0
3    27.7
Name: Age, dtype: float64

Missing Age by Pclass (Test):
Pclass
1     8.4
2     5.4
3    33.0
Name: Age, dtype: float64


In [3]:
# The key insight: Maybe the problem is not Age imputation leakage,
# but rather that Age itself is unreliable for generalization

# Let's test: What if we COMPLETELY REMOVE Age from features?
# This would eliminate the distribution shift issue entirely

def engineer_features_no_age(df):
    """Feature engineering WITHOUT Age"""
    df = df.copy()
    
    # Title extraction
    df['Title'] = df['Name'].str.extract(' ([A-Za-z]+)\.', expand=False)
    df['Title'] = df['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    df['Title'] = df['Title'].replace(['Mlle', 'Ms'], 'Miss')
    df['Title'] = df['Title'].replace('Mme', 'Mrs')
    
    # Family features
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
    
    # Cabin features
    df['Has_Cabin'] = df['Cabin'].notna().astype(int)
    
    # Embarked fill
    df['Embarked'] = df['Embarked'].fillna('S')
    
    # Fare fill
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    
    return df

train_no_age = engineer_features_no_age(train)
test_no_age = engineer_features_no_age(test)

print("Features engineered (no Age):", train_no_age.columns.tolist())

Features engineered (no Age): ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Title', 'FamilySize', 'IsAlone', 'Has_Cabin']


In [4]:
# Prepare features WITHOUT Age
def prepare_features_no_age(train_df, test_df):
    train_df = train_df.copy()
    test_df = test_df.copy()
    
    # Encode categorical
    for col in ['Sex', 'Embarked', 'Title']:
        le = LabelEncoder()
        combined = pd.concat([train_df[col], test_df[col]])
        le.fit(combined)
        train_df[col] = le.transform(train_df[col])
        test_df[col] = le.transform(test_df[col])
    
    # Features WITHOUT Age
    features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked',
                'Title', 'FamilySize', 'IsAlone', 'Has_Cabin']
    
    X_train = train_df[features]
    y_train = train_df['Survived']
    X_test = test_df[features]
    
    return X_train, y_train, X_test, features

X_train_no_age, y_train, X_test_no_age, features_no_age = prepare_features_no_age(train_no_age, test_no_age)

print(f"Features (no Age): {features_no_age}")
print(f"X_train shape: {X_train_no_age.shape}")

Features (no Age): ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Title', 'FamilySize', 'IsAlone', 'Has_Cabin']
X_train shape: (891, 10)


In [5]:
# Test SVC without Age
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_no_age)

svc = SVC(kernel='rbf', C=1.0, probability=True, random_state=42)
cv_scores = cross_val_score(svc, X_train_scaled, y_train, cv=skf, scoring='accuracy')

print("SVC WITHOUT Age:")
print(f"CV Scores: {cv_scores}")
print(f"Mean CV: {cv_scores.mean():.5f} (+/- {cv_scores.std():.5f})")
print(f"\nPrevious SVC with Age: 0.8361")
print(f"Difference: {cv_scores.mean() - 0.8361:.5f}")

SVC WITHOUT Age:
CV Scores: [0.84357542 0.81460674 0.82022472 0.83146067 0.84269663]
Mean CV: 0.83051 (+/- 0.01165)

Previous SVC with Age: 0.8361
Difference: -0.00559


In [6]:
# Test RF without Age
rf = RandomForestClassifier(n_estimators=200, max_depth=6, min_samples_split=4, random_state=42)
cv_scores_rf = cross_val_score(rf, X_train_no_age, y_train, cv=skf, scoring='accuracy')

print("RF WITHOUT Age:")
print(f"CV Scores: {cv_scores_rf}")
print(f"Mean CV: {cv_scores_rf.mean():.5f} (+/- {cv_scores_rf.std():.5f})")
print(f"\nPrevious RF with Age: 0.8339")
print(f"Difference: {cv_scores_rf.mean() - 0.8339:.5f}")

RF WITHOUT Age:
CV Scores: [0.84916201 0.83707865 0.8258427  0.83707865 0.84269663]
Mean CV: 0.83837 (+/- 0.00769)

Previous RF with Age: 0.8339
Difference: 0.00447


In [7]:
# Key insight: If CV drops significantly without Age, Age IS informative
# But if LB gap persists, Age is causing distribution shift

# Let's try a different approach: Use ONLY the most robust features
# Based on adversarial validation, Age (56.9%) and Fare (14.3%) are the main shift sources

# What if we use ONLY features that are stable between train/test?
# Sex, Pclass, Title, FamilySize, IsAlone, Has_Cabin, Embarked

print("\n" + "="*50)
print("ROBUST FEATURE SET ANALYSIS")
print("="*50)

# Check if Sex, Pclass distributions are similar
print("\nSex distribution:")
print("Train:", train['Sex'].value_counts(normalize=True).to_dict())
print("Test:", test['Sex'].value_counts(normalize=True).to_dict())

print("\nPclass distribution:")
print("Train:", train['Pclass'].value_counts(normalize=True).sort_index().to_dict())
print("Test:", test['Pclass'].value_counts(normalize=True).sort_index().to_dict())


ROBUST FEATURE SET ANALYSIS

Sex distribution:
Train: {'male': 0.6475869809203143, 'female': 0.35241301907968575}
Test: {'male': 0.6363636363636364, 'female': 0.36363636363636365}

Pclass distribution:
Train: {1: 0.24242424242424243, 2: 0.20650953984287318, 3: 0.5510662177328844}
Test: {1: 0.25598086124401914, 2: 0.22248803827751196, 3: 0.5215311004784688}


In [8]:
# The distributions look similar for Sex and Pclass
# Let's check Embarked
print("Embarked distribution:")
print("Train:", train['Embarked'].value_counts(normalize=True).to_dict())
print("Test:", test['Embarked'].value_counts(normalize=True).to_dict())

# Check FamilySize
train_temp = train.copy()
test_temp = test.copy()
train_temp['FamilySize'] = train_temp['SibSp'] + train_temp['Parch'] + 1
test_temp['FamilySize'] = test_temp['SibSp'] + test_temp['Parch'] + 1

print("\nFamilySize distribution:")
print("Train:", train_temp['FamilySize'].value_counts(normalize=True).sort_index().head(8).to_dict())
print("Test:", test_temp['FamilySize'].value_counts(normalize=True).sort_index().head(8).to_dict())

Embarked distribution:
Train: {'S': 0.7244094488188977, 'C': 0.1889763779527559, 'Q': 0.08661417322834646}
Test: {'S': 0.645933014354067, 'C': 0.24401913875598086, 'Q': 0.11004784688995216}

FamilySize distribution:
Train: {1: 0.6026936026936027, 2: 0.18069584736251404, 3: 0.11447811447811448, 4: 0.03254769921436588, 5: 0.016835016835016835, 6: 0.024691358024691357, 7: 0.013468013468013467, 8: 0.006734006734006734}
Test: {1: 0.6052631578947368, 2: 0.17703349282296652, 3: 0.13636363636363635, 4: 0.03349282296650718, 5: 0.01674641148325359, 6: 0.007177033492822967, 7: 0.009569377990430622, 8: 0.004784688995215311}


In [9]:
# Summary of findings:
# 1. Fixing Age leakage did NOT help LB - it got worse
# 2. The CV-LB gap is consistent at ~5.4%
# 3. Age is the main source of distribution shift (56.9% in adversarial validation)
# 4. Stacking underperformed vs individual SVC

# Hypotheses for next experiments:
# 1. Try SVC alone (best individual model at 83.6% CV)
# 2. Try completely removing Age from features
# 3. Try using only robust features (Sex, Pclass, Title, FamilySize)
# 4. Try Fare binning to reduce Fare's distribution shift impact

print("\n" + "="*50)
print("RECOMMENDATIONS FOR NEXT EXPERIMENTS")
print("="*50)
print("""
1. SVC ALONE: Best individual model (83.6% CV) - simpler may generalize better
2. NO AGE: Remove Age entirely to eliminate distribution shift
3. ROBUST FEATURES: Use only stable features (Sex, Pclass, Title, FamilySize)
4. FARE BINNING: Bin Fare to reduce its distribution shift impact
5. SIMPLER MODEL: Try Logistic Regression with robust features

Key insight: The problem is NOT data leakage - it's distribution shift.
Age is the main culprit. Need to reduce reliance on Age.
""")


RECOMMENDATIONS FOR NEXT EXPERIMENTS

1. SVC ALONE: Best individual model (83.6% CV) - simpler may generalize better
2. NO AGE: Remove Age entirely to eliminate distribution shift
3. ROBUST FEATURES: Use only stable features (Sex, Pclass, Title, FamilySize)
4. FARE BINNING: Bin Fare to reduce its distribution shift impact
5. SIMPLER MODEL: Try Logistic Regression with robust features

Key insight: The problem is NOT data leakage - it's distribution shift.
Age is the main culprit. Need to reduce reliance on Age.

