# Experiment 004: XGBoost on exp_000 Features

## Critical Learning from Previous Experiments
- **Cross-validation (CV) accuracy is NOT a reliable predictor of leaderboard (LB) performance**
- exp_000: CV 0.8339 → LB 0.7799 (BEST LB)
- exp_001: CV 0.8271 → LB 0.7727
- exp_002: CV 0.8361 → LB 0.7703 (WORST LB despite BEST CV)
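
The mismatch between CV and LB can be made explicit by computing the gap for each experiment. The snippet below is a minimal sketch that uses only the three score pairs listed above; the names `results`, `gaps`, `best_cv`, and `best_lb` are illustrative and do not come from the earlier notebooks.

```python
# (experiment, CV accuracy, LB accuracy) copied from the list above
results = [
    ("exp_000", 0.8339, 0.7799),
    ("exp_001", 0.8271, 0.7727),
    ("exp_002", 0.8361, 0.7703),
]

# CV minus LB for each experiment, rounded for display
gaps = {name: round(cv - lb, 4) for name, cv, lb in results}

# Experiment that ranks first under each metric
best_cv = max(results, key=lambda r: r[1])[0]
best_lb = max(results, key=lambda r: r[2])[0]

for name, cv, lb in results:
    print(f"{name}: CV {cv:.4f}  LB {lb:.4f}  gap {gaps[name]:.4f}")
print(f"Best CV: {best_cv}, best LB: {best_lb}")
```

The experiment with the highest CV score also has the largest CV-to-LB gap, which is why the strategy below keeps the exp_000 feature set rather than chasing CV.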

## Strategy
- Use SAME features as exp_000 (which has best LB)
- Try XGBoost instead of Random Forest (RF); gradient-boosted trees are often among the strongest models on tabular data
- Keep Age in features (removing Age HURT LB)
- Focus on model improvement, not feature changes

In [1]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

# Load data
train = pd.read_csv('/home/data/train.csv')
test = pd.read_csv('/home/data/test.csv')

print(f"Train shape: {train.shape}")
print(f"Test shape: {test.shape}")

Train shape: (891, 12)
Test shape: (418, 11)


In [2]:
def engineer_features(df):
    """SAME feature engineering as exp_000 (best LB)"""
    df = df.copy()
    
    # 1. Title extraction
    # Raw string so that `\.` is a regex escape rather than an invalid Python string escape
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    df['Title'] = df['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    df['Title'] = df['Title'].replace(['Mlle', 'Ms'], 'Miss')
    df['Title'] = df['Title'].replace('Mme', 'Mrs')
    
    # 2. Family features
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
    
    # 3. Cabin features
    df['Has_Cabin'] = df['Cabin'].notna().astype(int)
    
    return df

# Apply feature engineering
train = engineer_features(train)
test = engineer_features(test)

print("Titles:", train['Title'].value_counts().to_dict())

Titles: {'Mr': 517, 'Miss': 185, 'Mrs': 126, 'Master': 40, 'Rare': 23}
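
To see what the title pattern captures, the sketch below applies the same regular expression with the standard `re` module instead of pandas. The helper `extract_title` and the sample names are illustrative assumptions written in the dataset's `Surname, Title. Given names` format; they are not part of the pipeline above.

```python
import re

# Same pattern as in engineer_features: a space, a run of letters, then a literal dot
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

def extract_title(name):
    """Return the first title-like token in a name, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

# Illustrative names (assumed examples in the dataset's name format)
sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "No title here",
]

titles = [extract_title(n) for n in sample_names]
print(titles)
```

Names without a `word.` token yield `None` here; with pandas `str.extract` the equivalent result is a missing value.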


In [3]:
def fill_missing_values(train_df, test_df):
    """Fill missing values - SAME as exp_000"""
    train_df = train_df.copy()
    test_df = test_df.copy()
    
    # Combine for Age imputation (same as exp_000)
    combined = pd.concat([train_df, test_df], sort=False)
    title_age_median = combined.groupby('Title')['Age'].median()
    
    for df in [train_df, test_df]:
        for title in df['Title'].unique():
            mask = (df['Title'] == title) & (df['Age'].isna())
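            # Use the overall median when a title has no known ages (its group median is NaN)
            if title in title_age_median and pd.notna(title_age_median[title]):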
                df.loc[mask, 'Age'] = title_age_median[title]
            else:
                df.loc[mask, 'Age'] = combined['Age'].median()
    
    # Embarked: Fill with mode
    embarked_mode = train_df['Embarked'].mode()[0]
    train_df['Embarked'] = train_df['Embarked'].fillna(embarked_mode)
    test_df['Embarked'] = test_df['Embarked'].fillna(embarked_mode)
    
    # Fare: Fill with median by Pclass
    for pclass in [1, 2, 3]:
        fare_median = train_df[train_df['Pclass'] == pclass]['Fare'].median()
        train_df.loc[(train_df['Pclass'] == pclass) & (train_df['Fare'].isna()), 'Fare'] = fare_median
        test_df.loc[(test_df['Pclass'] == pclass) & (test_df['Fare'].isna()), 'Fare'] = fare_median
    
    return train_df, test_df

train, test = fill_missing_values(train, test)
print("Missing values filled")

Missing values filled
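
The Age imputation above replaces each missing value with the median age of passengers sharing the same title. The following is a dependency-free sketch of that idea on toy data; `fill_age_by_group` and `toy_rows` are hypothetical names, and the fallback to the overall median for a title with no known ages is an assumption of this sketch.

```python
from statistics import median

def fill_age_by_group(rows, group_key="Title", value_key="Age"):
    """Return copies of rows with missing values replaced by the group median."""
    known = [r[value_key] for r in rows if r[value_key] is not None]
    overall = median(known)  # fallback for groups with no known values

    # Collect the known values for each group
    by_group = {}
    for r in rows:
        if r[value_key] is not None:
            by_group.setdefault(r[group_key], []).append(r[value_key])
    group_median = {g: median(v) for g, v in by_group.items()}

    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if r[value_key] is None:
            r[value_key] = group_median.get(r[group_key], overall)
        filled.append(r)
    return filled

# Toy data (assumed values, not taken from the Titanic files)
toy_rows = [
    {"Title": "Mr", "Age": 30.0},
    {"Title": "Mr", "Age": None},
    {"Title": "Mr", "Age": 40.0},
    {"Title": "Miss", "Age": 20.0},
    {"Title": "Miss", "Age": None},
    {"Title": "Rare", "Age": None},
]

filled_ages = [r["Age"] for r in fill_age_by_group(toy_rows)]
print(filled_ages)
```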


In [4]:
def prepare_features(train_df, test_df):
    """Prepare features - SAME as exp_000"""
    train_df = train_df.copy()
    test_df = test_df.copy()
    
    # Encode categorical variables
    for col in ['Sex', 'Embarked', 'Title']:
        le = LabelEncoder()
        combined = pd.concat([train_df[col], test_df[col]])
        le.fit(combined)
        train_df[col] = le.transform(train_df[col])
        test_df[col] = le.transform(test_df[col])
    
    # SAME features as exp_000 (including Age!)
    features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked',
                'Title', 'FamilySize', 'IsAlone', 'Has_Cabin']
    
    X_train = train_df[features]
    y_train = train_df['Survived']
    X_test = test_df[features]
    
    return X_train, y_train, X_test, features

X_train, y_train, X_test, features = prepare_features(train, test)

print(f"Features ({len(features)}): {features}")
print(f"X_train shape: {X_train.shape}")

Features (11): ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Title', 'FamilySize', 'IsAlone', 'Has_Cabin']
X_train shape: (891, 11)


In [5]:
# XGBoost with conservative hyperparameters (avoid overfitting)
xgb = XGBClassifier(
    n_estimators=200,
    max_depth=4,  # Conservative depth
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    random_state=42,
    use_label_encoder=False,  # only meaningful on older XGBoost releases; newer ones deprecate or ignore it
    eval_metric='logloss'
)

# Stratified K-Fold cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = cross_val_score(xgb, X_train, y_train, cv=skf, scoring='accuracy')

print(f"CV Scores: {cv_scores}")
print(f"Mean CV Accuracy: {cv_scores.mean():.5f} (+/- {cv_scores.std():.5f})")
print(f"\nComparison (REMEMBER: CV is NOT reliable for LB):")
print(f"  exp_000 (RF with Age):    CV 0.8339 → LB 0.7799 (BEST LB)")
print(f"  exp_001 (Stacking):       CV 0.8271 → LB 0.7727")
print(f"  exp_002 (RF no Age):      CV 0.8361 → LB 0.7703 (WORST LB)")
print(f"  exp_003 (XGBoost):        CV {cv_scores.mean():.4f}")

CV Scores: [0.8603352  0.86516854 0.79213483 0.83146067 0.83146067]
Mean CV Accuracy: 0.83611 (+/- 0.02611)

Comparison (REMEMBER: CV is NOT reliable for LB):
  exp_000 (RF with Age):    CV 0.8339 → LB 0.7799 (BEST LB)
  exp_001 (Stacking):       CV 0.8271 → LB 0.7727
  exp_002 (RF no Age):      CV 0.8361 → LB 0.7703 (WORST LB)
  exp_003 (XGBoost):        CV 0.8361
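
The reported mean and spread can be reproduced from the five fold scores, and comparing them with the CV differences between experiments shows how small those differences are. This is a rough sketch using only the numbers printed above; `reported_cv` and `spread` are illustrative names.

```python
from statistics import mean, pstdev

# Fold accuracies copied from the output above
fold_scores = [0.8603352, 0.86516854, 0.79213483, 0.83146067, 0.83146067]

cv_mean = mean(fold_scores)
cv_std = pstdev(fold_scores)  # population std, matching NumPy's default (ddof=0)

# Mean CV accuracies of the earlier experiments, as listed above
reported_cv = {"exp_000": 0.8339, "exp_001": 0.8271, "exp_002": 0.8361}
spread = max(reported_cv.values()) - min(reported_cv.values())

print(f"Mean CV accuracy: {cv_mean:.5f} (+/- {cv_std:.5f})")
print(f"Range of earlier CV means: {spread:.4f}")
print(f"Range smaller than one fold std: {spread < cv_std}")
```

The earlier experiments differ by less than 0.01 in mean CV accuracy, while a single run's folds vary by about 0.026, so ranking experiments by CV at this resolution is likely to be dominated by noise.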


In [6]:
# Train on full data
xgb.fit(X_train, y_train)

# Feature importance
importance_df = pd.DataFrame({
    'feature': features,
    'importance': xgb.feature_importances_
}).sort_values('importance', ascending=False)

print("XGBoost Feature Importance:")
print(importance_df.to_string(index=False))

XGBoost Feature Importance:
   feature  importance
       Sex    0.388776
     Title    0.130855
    Pclass    0.107666
 Has_Cabin    0.092833
FamilySize    0.058316
     SibSp    0.051812
      Fare    0.038219
   IsAlone    0.037231
       Age    0.033439
  Embarked    0.032829
     Parch    0.028024


In [7]:
# Make predictions
test_predictions = xgb.predict(X_test)

# Create submission
test_original = pd.read_csv('/home/data/test.csv')
submission = pd.DataFrame({
    'PassengerId': test_original['PassengerId'],
    'Survived': test_predictions
})

submission.to_csv('/home/submission/submission.csv', index=False)
print(f"Submission saved: {len(submission)} predictions")
print(f"Prediction distribution: {pd.Series(test_predictions).value_counts().to_dict()}")

Submission saved: 418 predictions
Prediction distribution: {0: 273, 1: 145}
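
A quick sanity check on the submission is to confirm that the class counts add up to the number of test rows and to look at the predicted survival rate. The sketch below uses only the counts printed above; the variable names are illustrative.

```python
# Counts copied from the prediction distribution above
prediction_counts = {0: 273, 1: 145}
n_test_rows = 418

total = sum(prediction_counts.values())
predicted_survival_rate = prediction_counts[1] / total

print(f"Total predictions: {total} (expected {n_test_rows})")
print(f"Predicted survival rate: {predicted_survival_rate:.4f}")
```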


In [8]:
# Summary
print("="*60)
print("EXPERIMENT 004 SUMMARY: XGBoost on exp_000 Features")
print("="*60)
print(f"Model: XGBoost (n_estimators=200, max_depth=4, lr=0.05)")
print(f"Features: SAME as exp_000 (11 features, including Age)")
print(f"\nCV Accuracy: {cv_scores.mean():.5f} (+/- {cv_scores.std():.5f})")
print(f"\nKEY INSIGHT: CV is NOT reliable for LB prediction!")
print(f"  - exp_000 had lower CV but BEST LB")
print(f"  - exp_002 had highest CV but WORST LB")
print(f"\nThis experiment tests if XGBoost can improve on RF")
print(f"with the same features that gave best LB (exp_000).")
print("="*60)

============================================================
EXPERIMENT 004 SUMMARY: XGBoost on exp_000 Features
============================================================
Model: XGBoost (n_estimators=200, max_depth=4, lr=0.05)
Features: SAME as exp_000 (11 features, including Age)

CV Accuracy: 0.83611 (+/- 0.02611)

KEY INSIGHT: CV is NOT reliable for LB prediction!
  - exp_000 had lower CV but BEST LB
  - exp_002 had highest CV but WORST LB

This experiment tests if XGBoost can improve on RF
with the same features that gave best LB (exp_000).
============================================================
