# Loop 1 LB Feedback Analysis

## Submission Results
- CV Score: 0.8339 (83.4%)
- LB Score: 0.7799 (78.0%)
- Gap: +0.054 (5.4 percentage points)
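
To judge whether a gap of this size could be sampling noise, it can be compared with the binomial standard error of the LB accuracy. The following is a rough sketch, not a formal test. It assumes the leaderboard scores all 418 test rows and treats rows as independent; `accuracy_standard_error` is a hypothetical helper name.

```python
import math

def accuracy_standard_error(accuracy: float, n_rows: int) -> float:
    """Binomial standard error of an accuracy measured on n_rows independent rows."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_rows)

cv_score = 0.8339
lb_score = 0.7799
n_scored = 418  # assumption: every test row counts toward the LB score

gap = cv_score - lb_score
se = accuracy_standard_error(lb_score, n_scored)
print(f"gap = {gap:.4f}, LB standard error = {se:.4f}, gap / SE = {gap / se:.1f}")
```

If the leaderboard is scored on only part of the test set, `n_scored` shrinks and the standard error grows, so the same gap would be weaker evidence of a real problem.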

## Key Questions
1. Why is there such a large CV-LB gap?
2. Is there distribution shift between train and test?
3. What can we do to improve LB score?

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
import warnings
warnings.filterwarnings('ignore')

# Load data
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")

Train shape: (891, 12)
Test shape: (418, 11)


In [2]:
# Compare distributions between train and test
print("=== Distribution Comparison ===")
print("\nPclass distribution:")
print("Train:", train['Pclass'].value_counts(normalize=True).sort_index().to_dict())
print("Test:", test['Pclass'].value_counts(normalize=True).sort_index().to_dict())

print("\nSex distribution:")
print("Train:", train['Sex'].value_counts(normalize=True).to_dict())
print("Test:", test['Sex'].value_counts(normalize=True).to_dict())

print("\nEmbarked distribution:")
print("Train:", train['Embarked'].value_counts(normalize=True).to_dict())
print("Test:", test['Embarked'].value_counts(normalize=True).to_dict())

=== Distribution Comparison ===

Pclass distribution:
Train: {1: 0.24242424242424243, 2: 0.20650953984287318, 3: 0.5510662177328844}
Test: {1: 0.25598086124401914, 2: 0.22248803827751196, 3: 0.5215311004784688}

Sex distribution:
Train: {'male': 0.6475869809203143, 'female': 0.35241301907968575}
Test: {'male': 0.6363636363636364, 'female': 0.36363636363636365}

Embarked distribution:
Train: {'S': 0.7244094488188977, 'C': 0.1889763779527559, 'Q': 0.08661417322834646}
Test: {'S': 0.645933014354067, 'C': 0.24401913875598086, 'Q': 0.11004784688995216}
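
The dictionaries above are hard to compare by eye. A small helper that puts train and test shares side by side, sorted by the size of the difference, may make shifts such as the one in Embarked easier to spot. This is a sketch: `compare_shares` is a hypothetical helper, and the toy frames are illustrative, not the competition data.

```python
import pandas as pd

def compare_shares(train_df, test_df, column):
    """Return train/test category shares side by side with test - train."""
    shares = pd.DataFrame({
        "train": train_df[column].value_counts(normalize=True),
        "test": test_df[column].value_counts(normalize=True),
    }).fillna(0.0)  # a category absent from one set has share 0 there
    shares["diff"] = shares["test"] - shares["train"]
    # Largest absolute differences first
    return shares.sort_values("diff", key=lambda s: s.abs(), ascending=False)

# Toy frames for illustration only
toy_train = pd.DataFrame({"Embarked": ["S", "S", "S", "C"]})
toy_test = pd.DataFrame({"Embarked": ["S", "C", "C", "Q"]})
print(compare_shares(toy_train, toy_test, "Embarked"))
```

On the real data the call would be `compare_shares(train, test, 'Embarked')`, and the same helper applies to `Pclass`, `Sex`, `Title`, and `FamilySize`.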


In [3]:
# Age distribution comparison
print("\nAge statistics:")
print(f"Train: mean={train['Age'].mean():.2f}, median={train['Age'].median():.2f}, missing={train['Age'].isna().sum()}")
print(f"Test: mean={test['Age'].mean():.2f}, median={test['Age'].median():.2f}, missing={test['Age'].isna().sum()}")

print("\nFare statistics:")
print(f"Train: mean={train['Fare'].mean():.2f}, median={train['Fare'].median():.2f}")
print(f"Test: mean={test['Fare'].mean():.2f}, median={test['Fare'].median():.2f}")


Age statistics:
Train: mean=29.70, median=28.00, missing=177
Test: mean=30.27, median=27.00, missing=86

Fare statistics:
Train: mean=32.20, median=14.45
Test: mean=35.63, median=14.45
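
Means and medians can hide differences in the shape of a distribution. One common complement is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs. Below is a minimal NumPy sketch; `ks_statistic` is a hypothetical helper, and missing values are simply dropped.

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples (NaNs dropped)."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    a = np.sort(a[~np.isnan(a)])
    b = np.sort(b[~np.isnan(b)])
    # Evaluate both step functions at every observed value
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Toy check: identical samples give 0, fully separated samples give 1
print(ks_statistic([1, 2, 3], [3, 1, 2]), ks_statistic([1, 2], [3, 4]))
```

On the real data this would be called as `ks_statistic(train['Age'], test['Age'])` and likewise for `Fare`. If SciPy is available, `scipy.stats.ks_2samp` returns the same statistic together with a p-value.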


In [4]:
# Title extraction to compare
def extract_title(df):
    df = df.copy()
    # Raw string so that \. reaches the regex engine as an escaped dot
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    df['Title'] = df['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    df['Title'] = df['Title'].replace(['Mlle', 'Ms'], 'Miss')
    df['Title'] = df['Title'].replace('Mme', 'Mrs')
    return df

train_t = extract_title(train)
test_t = extract_title(test)

print("\nTitle distribution:")
print("Train:", train_t['Title'].value_counts(normalize=True).to_dict())
print("Test:", test_t['Title'].value_counts(normalize=True).to_dict())


Title distribution:
Train: {'Mr': 0.5802469135802469, 'Miss': 0.20763187429854096, 'Mrs': 0.1414141414141414, 'Master': 0.04489337822671156, 'Rare': 0.025813692480359147}
Test: {'Mr': 0.5741626794258373, 'Miss': 0.18899521531100477, 'Mrs': 0.1722488038277512, 'Master': 0.050239234449760764, 'Rare': 0.014354066985645933}


In [5]:
# Check if there are any unusual patterns in test data
print("\n=== Potential Distribution Shift Analysis ===")

# Family size distribution
train_t['FamilySize'] = train_t['SibSp'] + train_t['Parch'] + 1
test_t['FamilySize'] = test_t['SibSp'] + test_t['Parch'] + 1

print("\nFamilySize distribution:")
print("Train:", train_t['FamilySize'].value_counts(normalize=True).sort_index().head(8).to_dict())
print("Test:", test_t['FamilySize'].value_counts(normalize=True).sort_index().head(8).to_dict())


=== Potential Distribution Shift Analysis ===

FamilySize distribution:
Train: {1: 0.6026936026936027, 2: 0.18069584736251404, 3: 0.11447811447811448, 4: 0.03254769921436588, 5: 0.016835016835016835, 6: 0.024691358024691357, 7: 0.013468013468013467, 8: 0.006734006734006734}
Test: {1: 0.6052631578947368, 2: 0.17703349282296652, 3: 0.13636363636363635, 4: 0.03349282296650718, 5: 0.01674641148325359, 6: 0.007177033492822967, 7: 0.009569377990430622, 8: 0.004784688995215311}


In [6]:
# Adversarial validation - can we distinguish train from test?
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Prepare features for adversarial validation
def prepare_for_adversarial(train_df, test_df):
    train_df = train_df.copy()
    test_df = test_df.copy()
    
    # Simple features
    for df in [train_df, test_df]:
        df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
        df['Title'] = df['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
        df['Title'] = df['Title'].replace(['Mlle', 'Ms'], 'Miss')
        df['Title'] = df['Title'].replace('Mme', 'Mrs')
        df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
        df['Has_Cabin'] = df['Cabin'].notna().astype(int)
        df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
        df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
        df['Title'] = df['Title'].map({'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3, 'Rare': 4})
    
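    # Fill missing values. Note: each set is filled with its OWN median
    # (Age: 28 for train, 27 for test), so the fill value itself can help
    # the classifier tell train rows from test rows.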
    for df in [train_df, test_df]:
        df['Age'] = df['Age'].fillna(df['Age'].median())
        df['Fare'] = df['Fare'].fillna(df['Fare'].median())
        df['Embarked'] = df['Embarked'].fillna(0)
        df['Title'] = df['Title'].fillna(0)
    
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'FamilySize', 'Has_Cabin', 'Title']
    
    train_df['is_test'] = 0
    test_df['is_test'] = 1
    
    combined = pd.concat([train_df[features + ['is_test']], test_df[features + ['is_test']]], ignore_index=True)
    
    return combined[features], combined['is_test']

X_adv, y_adv = prepare_for_adversarial(train, test)

# Train adversarial classifier
rf_adv = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=42)
adv_scores = cross_val_score(rf_adv, X_adv, y_adv, cv=5, scoring='roc_auc')

print(f"Adversarial Validation AUC: {adv_scores.mean():.4f} (+/- {adv_scores.std():.4f})")
print("\nInterpretation:")
print("- AUC ~0.50 = No distribution shift (train/test are similar)")
print("- AUC >0.60 = Some distribution shift")
print("- AUC >0.70 = Significant distribution shift")

Adversarial Validation AUC: 0.7025 (+/- 0.0305)

Interpretation:
- AUC ~0.50 = No distribution shift (train/test are similar)
- AUC >0.60 = Some distribution shift
- AUC >0.70 = Significant distribution shift


In [7]:
# Feature importance for adversarial validation
rf_adv.fit(X_adv, y_adv)
importance_adv = pd.DataFrame({
    'feature': X_adv.columns,
    'importance': rf_adv.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeatures that distinguish train from test:")
print(importance_adv.to_string(index=False))


Features that distinguish train from test:
   feature  importance
       Age    0.568987
      Fare    0.143003
  Embarked    0.068058
FamilySize    0.048450
     Parch    0.048302
     SibSp    0.036645
    Pclass    0.032765
     Title    0.029074
 Has_Cabin    0.014782
       Sex    0.009935
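
Age dominates the importances, but the adversarial cell fills missing ages with a different value in each set (the train median of 28 and the test median of 27). This may give the classifier an artificial tell. One way to check is to fill both sets with a single shared value and to expose missingness as a separate flag. This is a sketch under that assumption: `impute_shared` is a hypothetical helper, and the toy frames are illustrative only.

```python
import pandas as pd

def impute_shared(train_df, test_df, columns):
    """Fill NaNs in both frames with one shared median and add <col>_missing flags."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in columns:
        # One fill value for both sets, so imputation cannot separate them
        fill_value = pd.concat([train_out[col], test_out[col]]).median()
        for df in (train_out, test_out):
            df[f"{col}_missing"] = df[col].isna().astype(int)
            df[col] = df[col].fillna(fill_value)
    return train_out, test_out

# Toy frames for illustration only
toy_train = pd.DataFrame({"Age": [20.0, None, 40.0]})
toy_test = pd.DataFrame({"Age": [30.0, None]})
filled_train, filled_test = impute_shared(toy_train, toy_test, ["Age"])
print(filled_train)
print(filled_test)
```

Re-running the adversarial classifier on frames prepared this way would show how much of the 0.70 AUC survives. Using combined statistics is acceptable here because the adversarial task has no survival labels to leak. The survival model itself is a different matter.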


## Key Findings

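1. **CV-LB gap of 5.4 percentage points** is large and suggests:
   - Possible overfitting to the CV folds
   - Minor data leakage from using combined train+test for Age imputation
   - The CV score of 83.4% is likely inflated
   - Measurable train/test shift: the adversarial validation AUC is 0.70, driven mainly by Age (importance 0.57) and Fare (0.14). Part of the Age signal may come from filling each set with its own median (28 vs. 27), so the real shift could be smaller.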

2. **The target of 1.0 (100%) is not realistic** for Titanic:
   - Best known LB scores are 80-82%
   - Our LB score of 77.99% is therefore reasonable
   - We should focus on incremental improvements, not perfection

3. **Next Steps:**
   - Fix data leakage: compute imputation stats from train only
   - Try ensemble/stacking approaches (research shows 80.8% achievable)
   - Add more features: Ticket frequency, Deck, Age/Fare bins, interactions
   - Focus on LB score improvement, not CV inflation
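
The first next step, computing imputation statistics from train only, can be sketched by placing the imputer inside a scikit-learn pipeline. Each CV fold then learns its medians from that fold's training rows alone. This is an illustration, not the Loop 1 model: the data is synthetic, and the feature set and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the feature matrix (not the Titanic data)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan  # knock out ~20% of values

# The imputer is refit on the training part of every fold, so validation
# rows never contribute to the medians used to fill them
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, max_depth=4, random_state=42),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
```

For the submission, fitting the same pipeline on the full training set and calling `model.predict` on the test features applies the train-derived medians to the test rows.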