# Titanic Survival Prediction

**Goal:** Predict passenger survival on the Titanic using machine learning

**Approach:** Feature engineering with domain knowledge + model comparison (Random Forest vs XGBoost)

**Result:** 82.16% mean accuracy under 5-fold cross-validation with XGBoost

---

## 1. Data Loading and Exploration

Loading the training data and examining its structure, missing values, and low-cardinality (likely categorical) columns to inform the feature engineering strategy.

In [1]:
# Load libraries and training data
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

df = pd.read_csv('/kaggle/input/titanic/train.csv')

print(df.info())

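# Columns that contain missing values
nulls = df.isnull().sum()
print(nulls[nulls.gt(0)])

# Low-cardinality columns (fewer than 10 unique values), likely categorical
cat_nunique = df.nunique()
print(cat_nunique[cat_nunique.lt(10)])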

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
Age         177
Cabin       687
Embarked      2
dtype: int64
Survived    2
Pclass      3
Sex         2
SibSp       7
Parch       7
Embarked    3
dtype: int64


In [2]:
# Separate features from the target and record training-set statistics for imputation
X = df.drop(columns=['Survived'])
y = df['Survived']

train_stats = {
    'median_age': X['Age'].median(),
    'embarked_mode': X['Embarked'].mode()[0],
    'fare_median': X['Fare'].median()
}

## 2. Feature Engineering

Creating a reusable cleaning function that:
- Handles missing values (Age, Cabin, Embarked, Fare)
- Extracts meaningful features (Title, Cabin deck, Family size)
- Creates indicator variables for missing data
- One-hot encodes categorical variables

Key insight: Missing cabin information may indicate lower social status, which correlates with survival rates.
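
One way to sanity-check the missing-cabin hypothesis before relying on it is to compare survival rates for passengers with and without a recorded cabin. The sketch below assumes a frame with `Cabin` and `Survived` columns, as in `train.csv`. The helper name `survival_rate_by_presence` and the four-row demo frame are illustrative only and are not real passenger records.

```python
import pandas as pd

def survival_rate_by_presence(frame: pd.DataFrame, column: str,
                              target: str = "Survived") -> pd.Series:
    """Mean of `target`, grouped by whether `column` is recorded (hypothetical helper)."""
    presence = frame[column].notna().map({True: "known", False: "missing"})
    return frame[target].groupby(presence).mean()

# Made-up rows purely to show the call; not real Titanic data.
demo = pd.DataFrame({
    "Cabin": ["C85", None, None, "E46"],
    "Survived": [1, 0, 1, 1],
})
rates = survival_rate_by_presence(demo, "Cabin")
print(rates)
```

In the notebook, the equivalent call would be `survival_rate_by_presence(df, 'Cabin')` on the loaded training frame. A clear gap between the two groups would support keeping `Cabin_known` as a feature.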

In [3]:
# Feature engineering function - creates new features and handles missing values
# Key features: Title extraction, cabin deck, family size, traveling alone indicator
def df_cleaner(df, train_stats):
    df = df.copy() # work on a copy so the caller's DataFrame is not modified
    df['Age_known'] = df['Age'].notna() # unknown age may signal lower status
    df['Age'] = df['Age'].fillna(train_stats['median_age'])
    
    df['Cabin_known'] = df['Cabin'].notna() # unknown cabin may signal lower status
    
    df['Cabin_deck'] = df['Cabin'].str[0] # cabin deck may impact ability to reach a life boat
    
    df['Cabin_deck'] = df['Cabin_deck'].fillna('Unknown') # unknown cabin deck may signal lower status
    
    df['Embarked'] = df['Embarked'].fillna(train_stats['embarked_mode'])

    df['Fare'] = df['Fare'].fillna(train_stats['fare_median'])
    
    df['Title'] = df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0].str.strip()
    
    title_mapping = {
        'Mr': 'Mr', 
        'Miss': 'Miss', 
        'Mrs': 'Mrs', 
        'Master': 'Master',
        'Dr': 'Rare', 
        'Rev': 'Rare', 
        'Col': 'Rare', 
        'Major': 'Rare',
        'Mlle': 'Miss',
        'Ms': 'Miss',
        'Mme': 'Mrs',
        'Countess': 'Rare', 
        'Lady': 'Rare',
        'Jonkheer': 'Rare', 
        'Don': 'Rare',
        'Sir': 'Rare',
        'Dona': 'Rare',
        'Capt': 'Rare'
    }
    df['Title'] = df['Title'].map(title_mapping) # title signifies status; titles not in the mapping become NaN
    
    df = pd.get_dummies(df, columns=['Sex', 'Embarked', 'Title', 'Cabin_deck'], drop_first=True)
    
    df['FamilySize'] = df['Parch'] + df['SibSp'] + 1 # family size may have impacted ability to make it to life boats
    
    df['is_alone'] = df['FamilySize'].eq(1) # traveling alone may have impacted ability to make it to life boats
    
    df_clean = df.drop(columns=['PassengerId','Cabin','Name','Ticket'])
    
    return df_clean
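
A caveat on the `Title` step: `Series.map` returns NaN for any title that is not a key in `title_mapping`, and `pd.get_dummies` then encodes that row as all zeros, which is indistinguishable from the dropped baseline category. Splitting on `', '` and `'.'` yields `the Countess` for a name written as `Rothes, the Countess. of (...)`, which does not match the `Countess` key. Below is a minimal sketch of one possible safeguard, assuming unmapped titles should fall into `Rare`. The helper name `extract_title` and the shortened mapping are illustrative, not part of the notebook.

```python
import pandas as pd

# Shortened stand-in for the notebook's title_mapping (illustrative only).
TITLE_MAPPING = {"Mr": "Mr", "Miss": "Miss", "Mrs": "Mrs",
                 "Master": "Master", "Countess": "Rare"}

def extract_title(names: pd.Series, mapping: dict, fallback: str = "Rare") -> pd.Series:
    """Pull the title out of 'Surname, Title. Given names' and map it,
    sending anything unmapped to `fallback` instead of NaN."""
    raw = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    return raw.map(mapping).fillna(fallback)

# Sample names in the dataset's 'Surname, Title. Given names' format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])
titles = extract_title(names, TITLE_MAPPING)
print(titles.tolist())
```

Swapping this behaviour into `df_cleaner` could shift the cross-validation scores slightly, so the figures reported in this notebook apply to the original mapping only.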


In [4]:
# clean and verify X & y variables

X_clean = df_cleaner(X, train_stats)
y_clean = y

print(f'X_clean shape: {X_clean.shape}')
print(f'X_clean dtypes: \n{X_clean.dtypes.value_counts().to_string()}')
print(f'X_clean null columns: {X_clean.isnull().sum().sum()}')
print()
print(f'y_clean shape: {y_clean.shape}')
print(f'y_clean skew: {y_clean.skew():.4f}')
print(f'y_clean dtype: {y_clean.dtypes}')
print(f'y_clean nulls: {y_clean.isnull().sum()}')
print(X_clean.info())

X_clean shape: (891, 24)
X_clean dtypes: 
bool       18
int64       4
float64     2
X_clean null columns: 0

y_clean shape: (891,)
y_clean skew: 0.4785
y_clean dtype: int64
y_clean nulls: 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Pclass              891 non-null    int64  
 1   Age                 891 non-null    float64
 2   SibSp               891 non-null    int64  
 3   Parch               891 non-null    int64  
 4   Fare                891 non-null    float64
 5   Age_known           891 non-null    bool   
 6   Cabin_known         891 non-null    bool   
 7   Sex_male            891 non-null    bool   
 8   Embarked_Q          891 non-null    bool   
 9   Embarked_S          891 non-null    bool   
 10  Title_Miss          891 non-null    bool   
 11  Title_Mr            891 non-null    bool   
 12  Title_Mrs     

## 3. Model Selection

Comparing Random Forest and XGBoost using 5-fold cross-validation. XGBoost is selected if its mean accuracy is higher; a lower standard deviation across folds is treated as additional evidence of consistent performance.

In [5]:
# Compare models using 5-fold cross-validation
# XGBoost is chosen if its mean accuracy is higher; a lower standard deviation
# across folds is reported as an extra sign of consistency
RF_model = RandomForestClassifier(random_state=42)
RF_score = cross_val_score(RF_model, X_clean, y_clean, cv=5, scoring='accuracy')
RF_score_mean = RF_score.mean()
RF_score_std = RF_score.std()

XGB_model = XGBClassifier(random_state=42)
XGB_score = cross_val_score(XGB_model, X_clean, y_clean, cv=5, scoring='accuracy')
XGB_score_mean = XGB_score.mean()
XGB_score_std = XGB_score.std()

print(f'RF_model score: {RF_score_mean:.4f} (+/- {RF_score_std:.4f})')
print(f'XGB_model score: {XGB_score_mean:.4f} (+/- {XGB_score_std:.4f})')
print()

if XGB_score_mean > RF_score_mean and XGB_score_std < RF_score_std:
    print('Proceeding with XGBoost (better accuracy and more consistent)')
elif XGB_score_mean > RF_score_mean:
    print('Proceeding with XGBoost (better accuracy)')
else:
    print('Proceeding with Random Forest')

RF_model score: 0.8036 (+/- 0.0300)
XGB_model score: 0.8216 (+/- 0.0192)

Proceeding with XGBoost (better accuracy and more consistent)


### Model Selection Rationale

XGBoost achieved higher mean accuracy (82.16% vs 80.36%) with a lower standard deviation across folds (1.92 vs 3.00 percentage points), making it the more reliable choice. The smaller spread indicates more consistent performance across different subsets of the data.
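
Since the gap between the two means (1.80 percentage points) is smaller than either model's standard deviation, it can help to look at the scores fold by fold rather than only at the summary. The sketch below assumes scikit-learn is available and uses synthetic data from `make_classification`. `GradientBoostingClassifier` is a stand-in for `XGBClassifier` so the example runs without `xgboost`, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; in the notebook this would be X_clean / y_clean.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# One shared splitter so both models are scored on identical folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    # Stand-in for XGBClassifier (assumption, avoids the xgboost dependency).
    "boosting": GradientBoostingClassifier(random_state=42),
}
fold_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

# Per-fold difference: positive means the boosting model won that fold.
diff = fold_scores["boosting"] - fold_scores["random_forest"]
print("per-fold difference:", np.round(diff, 4))
print("folds won by boosting:", int((diff > 0).sum()), "of", len(diff))
```

A model that wins most folds by a small margin is a stronger signal than one whose higher mean comes from a single favourable fold.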

## 4. Generate Predictions

Loading the test set, applying the same feature engineering pipeline, and generating predictions with an XGBoost model trained on the full training set.

In [6]:
# Load the test set and apply the same cleaning function, reusing training-set statistics
test_df = pd.read_csv('/kaggle/input/titanic/test.csv')

test_df_clean = df_cleaner(test_df, train_stats)

print(f'test_df_clean shape: {test_df_clean.shape}')
print(f'test_df_clean dtypes: \n{test_df_clean.dtypes.value_counts().to_string()}')
print(f'test_df_clean nulls: {test_df_clean.isnull().sum().sum()}')

test_df_clean shape: (418, 23)
test_df_clean dtypes: 
bool       17
int64       4
float64     2
test_df_clean nulls: 0


In [7]:
# build and train final model
model_final = XGBClassifier(random_state=42)
model_final.fit(X_clean, y_clean)

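# align test columns to the training layout (columns absent from the test set are filled with 0),
# then generate final predictions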
test_df_clean = test_df_clean.reindex(columns=X_clean.columns, fill_value=0)
predict_final = model_final.predict(test_df_clean)

LukeKubi_XGB_Titanic_Feb26 = pd.DataFrame({
    'PassengerId': test_df['PassengerId'],
    'Survived': predict_final
})

LukeKubi_XGB_Titanic_Feb26.to_csv('LukeKubi_XGB_Titanic_Feb26.csv', index=False)
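
The `reindex` call in the cell above is what reconciles the 23 test columns with the 24 training columns: one-hot encoding each file separately only creates dummy columns for categories that occur in that file. Here is a small sketch of the same idea on toy frames, where the deck values and frame names are made up for illustration:

```python
import pandas as pd

train_toy = pd.DataFrame({"Cabin_deck": ["A", "B", "T"], "Fare": [10.0, 20.0, 30.0]})
test_toy = pd.DataFrame({"Cabin_deck": ["B", "A"], "Fare": [15.0, 25.0]})

train_enc = pd.get_dummies(train_toy, columns=["Cabin_deck"], drop_first=True)
test_enc = pd.get_dummies(test_toy, columns=["Cabin_deck"], drop_first=True)

# The toy test frame has no deck 'T', so that dummy column is absent.
missing = sorted(set(train_enc.columns) - set(test_enc.columns))
print(missing)

# Align to the training layout: add absent columns as 0, drop extras, match order.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns) == list(train_enc.columns))
```

Without this alignment, the fitted model would receive a feature matrix whose columns do not match the ones it was trained on.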