# Predicting Titanic Survival Using Machine Learning
**Author:** Justin Zhao

> "The sinking of the Titanic is one of the most infamous shipwrecks in history.
>
> On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
>
> While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others."

This is a Machine Learning challenge/competition from Kaggle: https://www.kaggle.com/competitions/titanic/overview

## Project Overview
This notebook demonstrates the end-to-end machine learning pipeline used to achieve a top 5% score on the Titanic Kaggle leaderboard.

**Methodology:**
1.  **Preprocessing:** Custom feature engineering (Title extraction, Family size binning) and robust imputation.
2.  **Model Selection:** `GridSearchCV` tuning of three distinct classifiers:
    * **XGBoost** (Gradient Boosting)
    * **RandomForest** (Bagging)
    * **Logistic Regression** (Linear baseline)
3.  **Ensemble:** A `VotingClassifier` to combine predictions and reduce variance.

## Problem Definition

> Given some information (name, age, gender, socio-economic class) of an individual aboard the RMS Titanic, can we predict whether or not they survived the shipwreck?

## Data

The data for this analysis was obtained from the Kaggle Titanic competition: 

https://www.kaggle.com/competitions/titanic/overview

This dataset is widely believed to be a subset of the "Titanic Passenger Survival Data" originally compiled by Thomas Cason and hosted by the Department of Biostatistics at Vanderbilt University.

It is important to note that while the data is based on the historical record from Encyclopedia Titanica, it is not an exhaustive list of everyone on board. Specifically, this dataset contains 1,309 records (split into train/test sets) and focuses exclusively on passengers, notably excluding the ~900 crew members who were also aboard the ship.

## Features

| Variable | Role | Type | Description | Units / Key | Missing Values |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **survival** | Target | Integer | Survival status | 0 = No, 1 = Yes | No |
| **pclass** | Feature | Integer | Ticket class (Proxy for SES) | 1 = 1st, 2 = 2nd, 3 = 3rd | No |
| **sex** | Feature | Object | Sex | male, female | No |
| **Age** | Feature | Float | Age in years | Years | **Yes** |
| **sibsp** | Feature | Integer | # of siblings / spouses aboard | Count | No |
| **parch** | Feature | Integer | # of parents / children aboard | Count | No |
| **ticket** | Feature | Object | Ticket number | Alphanumeric | No |
| **fare** | Feature | Float | Passenger fare | Currency | No |
| **cabin** | Feature | Object | Cabin number | Alphanumeric | **Yes** |
| **embarked** | Feature | Object | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | **Yes** |

* **survival**: This is the target variable that indicates whether the passenger survived (1) or died (0).
* **pclass**: This represents the ticket class (1st, 2nd, or 3rd) and acts as a proxy for socio-economic status.
* **sex**: This records the gender of the passenger.
* **Age**: This is the passenger's age in years, which may be fractional for infants.
* **sibsp**: This counts the total number of siblings and spouses traveling with the passenger.
* **parch**: This counts the total number of parents and children traveling with the passenger.
* **ticket**: This is the alphanumeric ticket number printed on the passenger's ticket. It is not unique per passenger, since people traveling together can share one ticket number.
* **fare**: This records the monetary price the passenger paid for their journey.
* **cabin**: This lists the specific cabin number assigned to the passenger, though many values are missing.
* **embarked**: This indicates the port where the passenger boarded the ship (Cherbourg, Queenstown, or Southampton).
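
The "Missing Values" column above can be checked by counting nulls per column. The sketch below does this on a tiny made-up frame that only mimics the CSV layout (`Age`, `Fare`, `Cabin`, `Embarked`); the rows are illustrative and are not taken from the Kaggle data.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the layout of the Kaggle CSV; not real passengers.
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 0.75, 35.0],
        "Fare": [7.25, 71.28, 19.26, 8.05],
        "Cabin": [None, "C85", None, None],
        "Embarked": ["S", "C", None, "S"],
    }
)

# Count missing entries per column and keep only the columns that have gaps.
missing_counts = toy.isna().sum()
columns_with_gaps = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(columns_with_gaps)
```

Running the same two lines on `df` and `df_test` after loading them would show which columns actually need imputation in each split.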

In [1]:
# IMPORTS
# regular EDA (exploratory data analysis) and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# data manipulation pipeline (imputing, encoding)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import KBinsDiscretizer

# models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

# XGBoost
from xgboost import XGBClassifier

# model evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import ConfusionMatrixDisplay

In [2]:
# LOAD DATA
df = pd.read_csv("data/titanic_train.csv")
df_test = pd.read_csv("data/titanic_test_no_labels.csv")

## 1. Preprocessing & Feature Engineering

Effective preprocessing was the primary driver of model performance in this project. Rather than throwing all features into the model at once, I adopted an iterative approach: establishing a baseline with simple imputation, and then sequentially adding engineered features to validate their impact on an **XGBoost** model.

### 1.1 Feature Evolution & EDA Findings
Below is a summary of the iterative improvements observed during the Exploratory Data Analysis phase.

#### **Phase 1: Baseline**
* **Strategy:** Basic categorical encoding of `Sex`. Imputation of `Age` and `Fare` only where needed (XGBoost handles missing values natively, so it is optional for this model).
* **Logic:** Establish a control score using only raw numerical data and basic categorical encoding.
* **Cross-Validation Score:** `~82.3%`
* **AUC:** `~0.84`

#### **Phase 2: Title Extraction**
* **Strategy:** Extracted prefixes from the `Name` column (e.g., Mr., Mrs., Master., Dr.). Grouped rare titles (Don, Rev, Major) into a single `Rare` category. Cleaned up noisy/unnecessary features such as `Ticket`.
* **Logic:** Titles serve as a proxy for social status, gender, and age grouping simultaneously. 'Master' specifically helps identify male children who (unlike adult males) had high survival rates.
* **Cross-Validation Score:** `~82.8%`
* **AUC:** `~0.86`
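
The title extraction can be sketched with a single regular expression. This is an illustration rather than the code in `src.preprocessing`: the names are made up in the dataset's "Surname, Title. Given names" format, and the set of titles treated as common is an assumption.

```python
import pandas as pd

# Made-up names in the dataset's "Surname, Title. Given names" format.
names = pd.Series(
    [
        "Doe, Mr. John",
        "Doe, Mrs. Jane (Mary Smith)",
        "Roe, Master. Tim",
        "Poe, Miss. Ann",
        "Moe, Rev. Paul",
        "Noe, Major. Sam",
    ]
)

# The title is the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Assumption: these four titles are kept; everything else becomes "Rare".
common_titles = {"Mr", "Mrs", "Miss", "Master"}
titles_grouped = titles.where(titles.isin(common_titles), "Rare")
print(titles_grouped.tolist())
```

Keeping `Master` as its own category preserves the signal for male children described above.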

#### **Phase 3: Family Survival Rate**
* **Strategy:** Calculated family sizes based on `Parch` and `SibSp`; calculated the survival rate of groups sharing the same ***Surname*** and `Ticket` price.
* **Logic:** Survival was rarely independent; if one family member survived, others were significantly more likely to survive (and vice versa) due to rescue boats keeping groups together.
* **Cross-Validation Score:** `~86.7%` *(+3.9 percentage points)*
* **AUC:** `~0.92` *(+0.06 lift)*
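
One way to build a group survival feature is a leave-one-out rate: for each passenger, the share of *other* members of their group who survived. The sketch below is a guess at the idea, not the body of `create_family_survival_feature`. The function name, the grouping on surname plus `Fare`, and the `0.5` default for passengers with no companions are all assumptions, and the rows are made up.

```python
import pandas as pd


def family_survival_sketch(df: pd.DataFrame, default: float = 0.5) -> pd.Series:
    """Hypothetical leave-one-out survival rate per (surname, fare) group."""
    # The surname is the part of Name before the comma.
    surname = df["Name"].str.split(",").str[0].str.strip()
    grouped = df.assign(Surname=surname).groupby(["Surname", "Fare"])["Survived"]

    group_survivors = grouped.transform("sum")
    group_size = grouped.transform("count")

    # Exclude the passenger's own outcome so the feature never encodes it.
    others = (group_size - 1).where(group_size > 1)  # NaN when travelling alone
    rate = (group_survivors - df["Survived"]) / others
    return rate.fillna(default).rename("Family_Survival")


# Made-up rows: one family of three and one solo traveller.
toy = pd.DataFrame(
    {
        "Name": ["Alpha, Mr. A", "Alpha, Mrs. B", "Alpha, Master. C", "Beta, Miss. D"],
        "Fare": [20.0, 20.0, 20.0, 7.25],
        "Survived": [0, 1, 1, 1],
    }
)
family_survival = family_survival_sketch(toy)
print(family_survival.tolist())
```

Because the feature is derived from labels, it should be computed only from rows whose outcome is known; unlabeled passengers can then look up the rate of their labeled group members.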

---

### 1.2 Final Preprocessing Pipeline
Based on these findings, the final model incorporates all successful features. It uses a custom transformer pipeline to handle all data types and feature engineering.

In [3]:
from src.preprocessing import create_family_survival_feature
X_train_full, X_test_full = create_family_survival_feature(df, df_test)

In [4]:
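# Fix the global NumPy seed so the split below is reproducible
np.random.seed(12)

# Separate the features from the target column
X = X_train_full.drop("Survived", axis=1)
y = X_train_full["Survived"]

# Hold out 20% of the labeled data for validation (distinct from the unlabeled Kaggle test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)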

In [5]:
from src.preprocessing import feat_eng_transformer

feat_eng_functionTransformer = feat_eng_transformer()

from src.preprocessing import title_transformer
from src.preprocessing import sex_transformer
from src.preprocessing import cabin_transformer

clean_features = ["Pclass", "Fare", "Age", "FamilySize", "Family_Survival"]

final_preprocessor = ColumnTransformer(
    transformers=[
        ("sex", sex_transformer(), ["Sex"]),
        ("cabin", cabin_transformer(), ["Cabin"]),
        ("title", title_transformer(), ["Title"]),
        ("clean_cols", "passthrough", clean_features),
    ],
    remainder="drop",
)
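
The helpers imported from `src.preprocessing` are not shown in this notebook. As a rough idea of what such factory functions might look like, here is a self-contained sketch. Everything with a `_sketch` suffix is hypothetical, and the choices inside (ordinal encoding for `Sex`, deck letter plus one-hot encoding for `Cabin`, `"U"` for unknown cabins) are assumptions rather than the project's actual logic.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, OrdinalEncoder


def sex_transformer_sketch():
    # Hypothetical: encode female as 0 and male as 1.
    return OrdinalEncoder(categories=[["female", "male"]])


def _deck_letter(cabin: pd.DataFrame) -> pd.DataFrame:
    # Keep only the first character of the cabin; missing cabins become "U".
    return cabin.fillna("U").apply(lambda col: col.str[0])


def cabin_transformer_sketch():
    # Hypothetical: reduce Cabin to its deck letter, then one-hot encode it.
    return Pipeline(
        [
            ("deck", FunctionTransformer(_deck_letter)),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]
    )


preprocessor_sketch = ColumnTransformer(
    transformers=[
        ("sex", sex_transformer_sketch(), ["Sex"]),
        ("cabin", cabin_transformer_sketch(), ["Cabin"]),
        ("clean_cols", "passthrough", ["Pclass", "Fare"]),
    ],
    remainder="drop",
    sparse_threshold=0,  # always return a dense array
)

# Made-up rows, only to show the output layout.
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female"],
        "Cabin": ["C85", None, "E46"],
        "Pclass": [3, 1, 1],
        "Fare": [7.25, 71.28, 51.86],
    }
)
features = preprocessor_sketch.fit_transform(toy)
print(features.shape)
```

Returning a fresh, unfitted transformer from each factory lets the same definition be reused in more than one `ColumnTransformer`.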

In [7]:
from src.preprocessing import age_binning_transformer
from src.preprocessing import fare_bin_transformer
from src.preprocessing import family_size_transformer
from src.preprocessing import cat_transformer

logreg_preprocessor = ColumnTransformer(
    transformers=[
        ("sex", sex_transformer(), ["Sex"]),
        ("cabin", cabin_transformer(), ["Cabin"]),
        ("title", title_transformer(), ["Title"]),
        ("age", age_binning_transformer(), ["Age"]),
        ("fams_size", family_size_transformer(), ["FamilySize"]),
        ("fare", fare_bin_transformer(), ["Fare"]),
        ("cat", cat_transformer(), ["Pclass", "Family_Survival"]),
    ],
    remainder="drop",
)

## 2. Model Training & Hyperparameter Tuning

With the feature engineering finalized, we move to model selection and tuning. To ensure a robust final ensemble, I selected three distinct algorithms that learn from the data in fundamentally different ways: **Gradient Boosting (XGBoost)**, **Bagging (Random Forest)**, and a **Linear Baseline (Logistic Regression)**.
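
To make the ensemble idea concrete, here is a minimal sketch of a `VotingClassifier` over three such model families. It runs on synthetic data from `make_classification`, uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so the example needs only scikit-learn, and assumes soft voting; none of these choices are confirmed by the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): stands in for the preprocessed Titanic features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Soft voting averages the predicted probabilities of the three models.
ensemble = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
ensemble.fit(X_tr, y_tr)

val_accuracy = ensemble.score(X_val, y_val)
proba = ensemble.predict_proba(X_val)
print(f"validation accuracy: {val_accuracy:.3f}")
```

Averaging probabilities tends to help most when the members make different kinds of errors, which is the motivation for mixing boosting, bagging, and a linear model.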

### 2.1 Strategy: GridSearch & Cross-Validation
I tuned each model's hyperparameters with `GridSearchCV`, scoring every candidate by cross-validation to find the best balance between bias and variance.

#### **A. Tree-Based Models (XGBoost & Random Forest)**
* **Role:** These models are excellent at capturing non-linear relationships and interactions (e.g., *Age* vs *Class*) without explicit feature engineering.
* **Preprocessing:** Tree models are generally robust to unscaled data and can handle raw feature distributions well.

#### **B. The Linear Baseline (Logistic Regression)**
* **Role:** Provides a probabilistic baseline. If a complex tree model can't beat this, the signal is likely linear.
* **Pipeline Adaptation:** Unlike trees, Logistic Regression **required a distinct pipeline**.
    * **Binning is Critical:** Continuous features must be either binned or scaled for LogReg. In this pipeline, *Age* and *Fare* are imputed and then binned.
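
The impute-then-bin step can be sketched with `SimpleImputer` and `KBinsDiscretizer`. This is an assumption about the general shape of such a transformer, not the contents of `age_binning_transformer`; the ages are made up, and the four equal-width bins are chosen only to keep the example small.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Made-up ages with one missing value, shaped as a single feature column.
ages = np.array(
    [22, 38, 26, 35, np.nan, 54, 2, 27, 14, 4, 58, 20], dtype=float
).reshape(-1, 1)

age_binner = Pipeline(
    [
        # Fill the gap with the median age (26 for these values).
        ("impute", SimpleImputer(strategy="median")),
        # Four equal-width bins, emitted as one-hot columns.
        ("bin", KBinsDiscretizer(n_bins=4, encode="onehot-dense", strategy="uniform")),
    ]
)
age_bins = age_binner.fit_transform(ages)
print(age_bins.shape)
```

One-hot bins let a linear model assign a separate weight to each age range instead of forcing a single slope across all ages.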

---

### 2.2 Hyperparameter Tuning Implementation
The code below defines the specific parameter grids for each model.

In [None]:
param_grid_xgb = [
    {
        'model': [XGBClassifier()],
        'model__n_estimators': [800, 1000, 1200],
        'model__learning_rate': [0.005, 0.007],
        'model__max_depth': [3, 4],
        'model__subsample': [0.5],
        'model__colsample_bytree': [0.5],
        'model__gamma': [0.1, 0.2],
        'model__min_child_weight': [4, 5],
        'model__reg_alpha': [1],
        'model__reg_lambda': [2]
    }
]
param_grid_rf = [
    {
        'model': [RandomForestClassifier()],
        'model__n_estimators': [500, 800, 1000, 1200],
        'model__max_depth': [2, 3, 4],
        'model__min_samples_split': [2, 3, 4, 5],
        'model__min_samples_leaf': [3, 4, 6],
        'model__bootstrap': [True]
    }
]
param_grid_lr = [
    {
        'model': [LogisticRegression()],
        'model__C': [0.002, 0.1, 1],
        'model__solver': ["lbfgs", "liblinear"]
    }
]
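
Grids keyed on `model` and `model__...` imply a `Pipeline` with a step named `model`. The sketch below shows how a grid of that shape can be passed to `GridSearchCV`. The pipeline layout, the `StandardScaler` stand-in for the preprocessor, the synthetic data, and the `max_iter=1000` setting are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): stands in for the Titanic training split.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline(
    [
        ("preprocessor", StandardScaler()),  # stand-in for the ColumnTransformer
        ("model", LogisticRegression()),     # placeholder, replaced by the grid
    ]
)

# Same shape as param_grid_lr: the 'model' key swaps the estimator itself,
# and the 'model__*' keys set that estimator's hyperparameters.
param_grid_lr_demo = [
    {
        "model": [LogisticRegression(max_iter=1000)],
        "model__C": [0.002, 0.1, 1],
        "model__solver": ["lbfgs", "liblinear"],
    }
]

search = GridSearchCV(pipe, param_grid_lr_demo, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params["model__C"], best_params["model__solver"], round(best_score, 3))
```

Putting the estimator inside the grid means one pipeline definition can be reused for XGBoost, Random Forest, and Logistic Regression by swapping only the grid.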