# Titanic Survival Prediction

This notebook analyzes the Titanic dataset to understand which passenger attributes influenced survival, then builds and compares machine learning models to predict survival outcomes.

**Approach:**
1. Exploratory Data Analysis — understand distributions and survival patterns  
2. Feature Engineering — extract meaningful signals from raw attributes  
3. Preprocessing Pipeline — imputation, scaling, and encoding  
4. Model Comparison — evaluate three models using 5-fold cross-validation  
5. Best Model Evaluation — detailed metrics and feature importance

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## 1. Data Loading

In [None]:
train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
print(f'Training set: {train.shape[0]} rows, {train.shape[1]} columns')
print(f'Test set:     {test.shape[0]} rows,  {test.shape[1]} columns')

In [None]:
train.head()

## 2. Exploratory Data Analysis

Before modelling, we examine the data to understand distributions, missingness, and early signals of which features correlate with survival.

In [None]:
print('Missing values:')
missing = train.isnull().sum()   # count missing values once per column
print(missing[missing > 0])

`Age` is missing for ~20% of passengers, `Cabin` for ~77%, and `Embarked` for just 2 rows. `Age` and `Embarked` are imputed in the preprocessing pipeline; `Cabin` is dropped during feature engineering.

In [None]:
print(f"Overall survival rate: {train['Survived'].mean():.1%}")
print()
print('Survival rate by Sex:')
print(train.groupby('Sex')['Survived'].mean().round(3))
print()
print('Survival rate by Pclass:')
print(train.groupby('Pclass')['Survived'].mean().round(3))

In [None]:
os.makedirs('images', exist_ok=True)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Survival by Sex (groupby sorts its keys, so the bar order is female, male)
survival_by_sex = train.groupby('Sex')['Survived'].mean()
axes[0].bar(survival_by_sex.index.str.capitalize(), survival_by_sex.values, color=['#2ecc71', '#e74c3c'], edgecolor='black')
axes[0].set_title('Survival Rate by Sex', fontweight='bold')
axes[0].set_ylabel('Survival Rate')
axes[0].set_ylim(0, 1)
for i, v in enumerate(survival_by_sex.values):
    axes[0].text(i, v + 0.02, f'{v:.1%}', ha='center', fontweight='bold')

# Survival by Pclass
survival_by_class = train.groupby('Pclass')['Survived'].mean()
axes[1].bar(['1st', '2nd', '3rd'], survival_by_class.values, color=['#3498db', '#e67e22', '#95a5a6'], edgecolor='black')
axes[1].set_title('Survival Rate by Passenger Class', fontweight='bold')
axes[1].set_ylabel('Survival Rate')
axes[1].set_ylim(0, 1)
for i, v in enumerate(survival_by_class.values):
    axes[1].text(i, v + 0.02, f'{v:.1%}', ha='center', fontweight='bold')

# Age distribution by survival
train[train['Survived'] == 0]['Age'].dropna().hist(
    bins=30, alpha=0.6, color='#e74c3c', label='Did not survive', ax=axes[2], edgecolor='black')
train[train['Survived'] == 1]['Age'].dropna().hist(
    bins=30, alpha=0.6, color='#2ecc71', label='Survived', ax=axes[2], edgecolor='black')
axes[2].set_title('Age Distribution by Survival', fontweight='bold')
axes[2].set_xlabel('Age')
axes[2].set_ylabel('Count')
axes[2].legend()

plt.tight_layout()
plt.savefig('images/eda.png', dpi=150)
plt.show()

## 3. Feature Engineering

Raw columns like `Name` contain hidden signals. We extract the passenger's **title** (Mr, Mrs, Miss, Master, with uncommon titles such as Dr or Rev grouped as Rare), which reflects social status and age group and often correlates strongly with survival.

We also create:
- **FamilySize** = SibSp + Parch + 1 — captures whether group size affected survival
- **IsAlone** — binary flag for passengers travelling solo

Once `Title` has been extracted, we drop `PassengerId`, `Name`, `Ticket`, and `Cabin`: they either have too many unique values to generalise from or, in the case of `Cabin`, are missing for the majority of passengers.
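
To make the extraction rules concrete, here is a minimal plain-Python sketch of the same logic applied to single values instead of a DataFrame. The helper names `extract_title` and `family_features` are hypothetical and exist only for this illustration; the regular expression and the title groupings mirror the ones used in `engineer_features`.

```python
import re

# Same pattern as the notebook: a space, a run of letters, then a literal period.
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

# Same grouping rules as engineer_features.
RARE_TITLES = {"Lady", "Countess", "Capt", "Col", "Don", "Dr",
               "Major", "Rev", "Sir", "Jonkheer", "Dona"}
SYNONYMS = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}


def extract_title(name):
    """Return the normalised title found in a passenger name, or None."""
    match = TITLE_PATTERN.search(name)  # first match only
    if match is None:
        return None
    title = match.group(1)
    if title in RARE_TITLES:
        return "Rare"
    return SYNONYMS.get(title, title)


def family_features(sibsp, parch):
    """Return (FamilySize, IsAlone) for one passenger."""
    family_size = sibsp + parch + 1  # +1 counts the passenger themselves
    return family_size, int(family_size == 1)


print(extract_title("Braund, Mr. Owen Harris"))   # Mr
print(extract_title("Sagesser, Mlle. Emma"))      # Miss
print(family_features(sibsp=1, parch=0))          # (2, 0)
print(family_features(sibsp=0, parch=0))          # (1, 1)
```

A name without a `word.` token returns `None` here, which corresponds to a missing `Title` in the DataFrame version; the categorical imputer later fills such gaps.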

In [None]:
def engineer_features(df):
    df = df.copy()

    # Extract title from Name
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                   'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
    df['Title'] = df['Title'].replace(rare_titles, 'Rare')
    df['Title'] = df['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

    # Family features
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone']    = (df['FamilySize'] == 1).astype(int)

    # Drop identifier-like and mostly-missing columns (Title was already extracted from Name)
    df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

    return df

train_fe = engineer_features(train)
train_fe.head()

## 4. Preprocessing Pipeline

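We use scikit-learn's `Pipeline` and `ColumnTransformer` to keep preprocessing clean and prevent data leakage. Because the preprocessor is a step inside the pipeline, the imputation values, scaling statistics, and category lists are learned only from the rows passed to `fit` (the training split, or the training folds during cross-validation) and are then applied unchanged to held-out rows.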

- **Numeric features**: median imputation → standard scaling
- **Categorical features**: most-frequent imputation → one-hot encoding
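
The sketch below spells out, in plain Python without scikit-learn, what these two branches compute and why fitting on training rows only matters. It is an illustration under simplifying assumptions, not the library's implementation: the function names are hypothetical, missing values are represented as `None`, and the toy values are made up.

```python
import statistics


def fit_numeric(train_values):
    """Learn median, mean, and std from training values only."""
    observed = [v for v in train_values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in train_values]
    return {
        "median": median,
        "mean": statistics.fmean(filled),
        "std": statistics.pstdev(filled),  # population std, as StandardScaler uses
    }


def transform_numeric(values, params):
    """Impute with the training median, then scale with training statistics."""
    filled = [params["median"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]


def fit_categorical(train_values):
    """Learn the most frequent value and the category list from training values."""
    observed = [v for v in train_values if v is not None]
    return {"fill": statistics.mode(observed), "categories": sorted(set(observed))}


def transform_categorical(values, params):
    """One-hot encode; a category unseen in training becomes all zeros."""
    filled = [params["fill"] if v is None else v for v in values]
    return [[int(v == c) for c in params["categories"]] for v in filled]


# Fit on toy "training" rows, then apply to toy "held-out" rows.
age_params = fit_numeric([22, None, 38, 26])            # median 26, mean 28, std 6
print(transform_numeric([None, 40], age_params))

port_params = fit_categorical(["S", "C", "S", None])    # fill "S", categories ["C", "S"]
print(transform_categorical(["Q", None], port_params))  # [[0, 0], [0, 1]]
```

The held-out rows never influence `age_params` or `port_params`, which is the property the pipeline guarantees. The all-zero row for the unseen port `"Q"` corresponds to `handle_unknown='ignore'`.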

In [None]:
numeric_features     = ['Age', 'Fare', 'SibSp', 'Parch', 'Pclass', 'FamilySize', 'IsAlone']
categorical_features = ['Sex', 'Embarked', 'Title']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer,     numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

X = train_fe.drop('Survived', axis=1)
y = train_fe['Survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

## 5. Model Comparison

We compare three models using **5-fold stratified cross-validation** on all rows of `train.csv`. Averaging accuracy over five different validation folds gives a more reliable estimate of generalisation performance than a single train/test split, and stratification keeps the survived/not-survived ratio roughly the same in every fold.
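
As a rough illustration of what "stratified" means, the following plain-Python sketch deals each class's row indices round-robin into folds. It is a simplified scheme written for this example, not scikit-learn's `StratifiedKFold` algorithm, and the function name and toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, seed=42):
    """Split row indices into n_splits folds with roughly equal class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's indices round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


toy_labels = [1] * 6 + [0] * 9   # toy data: 40% positives
for fold in stratified_folds(toy_labels, n_splits=3):
    positives = sum(toy_labels[i] for i in fold)
    print(len(fold), positives / len(fold))   # each fold: 5 rows, 0.4 positive
```

Each fold then serves once as the validation set while the model is fit on the remaining folds, and the five accuracy scores are averaged.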

In [None]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest':       RandomForestClassifier(n_estimators=200, random_state=42),
    'Gradient Boosting':   GradientBoostingClassifier(n_estimators=200, random_state=42)
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = {}

for name, model in models.items():
    clf = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])
    scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')
    cv_results[name] = scores
    print(f'{name:22s}  mean={scores.mean():.4f}  std={scores.std():.4f}')

In [None]:
fig, ax = plt.subplots(figsize=(8, 4))
ax.boxplot(
    list(cv_results.values()),
    labels=list(cv_results.keys()),  # this parameter is named tick_labels from Matplotlib 3.9
    patch_artist=True,
    boxprops=dict(facecolor='#3498db', alpha=0.7),
    medianprops=dict(color='black', linewidth=2)
)
ax.set_title('5-Fold CV Accuracy by Model', fontsize=14, fontweight='bold')
ax.set_ylabel('Accuracy')
ax.set_ylim(0.7, 0.9)
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig('images/model_comparison.png', dpi=150)
plt.show()

## 6. Best Model Evaluation

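Gradient Boosting achieves the highest cross-validation score. We now train it on the training split and evaluate on the held-out test split. Because the cross-validation above ran on all of `X`, which includes the held-out rows, model selection was not fully independent of this split, so the resulting estimate may be slightly optimistic rather than strictly unbiased.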
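
The accuracy, classification report, and confusion matrix in the following cells all derive from four counts. The sketch below computes them by hand for the positive class on made-up toy labels; the function name `binary_metrics` is hypothetical, and the matrix layout follows the `[[TN, FP], [FN, TP]]` convention that `confusion_matrix` uses for labels 0 and 1.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion counts and basic metrics for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: actual, columns: predicted
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # of predicted survivors, how many survived
        "recall": tp / (tp + fn) if tp + fn else 0.0,     # of actual survivors, how many were found
    }


# Toy labels, not notebook output.
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(binary_metrics(toy_true, toy_pred))
```

On this toy input the counts are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so accuracy, precision, and recall all equal 0.75.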

In [None]:
best_clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=200, random_state=42))
])

best_clf.fit(X_train, y_train)
preds = best_clf.predict(X_test)

print(f'Test Accuracy: {accuracy_score(y_test, preds):.4f}')
print()
print(classification_report(y_test, preds, target_names=['Did Not Survive', 'Survived']))

In [None]:
cm = confusion_matrix(y_test, preds)
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(
    cm, annot=True, fmt='d', cmap='Blues',
    xticklabels=['Did Not Survive', 'Survived'],
    yticklabels=['Did Not Survive', 'Survived'],
    ax=ax, linewidths=0.5
)
ax.set_title('Confusion Matrix', fontsize=14, fontweight='bold')
ax.set_ylabel('Actual')
ax.set_xlabel('Predicted')
plt.tight_layout()
plt.savefig('images/confusion_matrix.png', dpi=150)
plt.show()

## 7. Feature Importance

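Gradient Boosting's `feature_importances_` are impurity-based: each feature's score reflects how much the splits on that feature reduce the split criterion across all trees, and the scores are normalised to sum to 1. Higher scores indicate features the model relied on most. Because the categorical columns are one-hot encoded, each category (for example `Title_Mr`) receives its own score.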
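
Since one-hot encoding spreads a categorical feature over several columns, one optional way to read the chart is to sum the scores of the encoded columns back onto their source feature. The sketch below does this in plain Python. The importance values are invented for illustration and are not output from this notebook, and `aggregate_importances` is a hypothetical helper; the `Feature_category` naming matches what `get_feature_names_out` produces for the encoder.

```python
def aggregate_importances(importances, categorical_features):
    """Sum one-hot column importances back onto their source feature."""
    totals = {}
    for name, score in importances.items():
        source = name  # numeric columns keep their own name
        for feature in categorical_features:
            if name.startswith(feature + "_"):
                source = feature
                break
        totals[source] = totals.get(source, 0.0) + score
    # Sort from most to least important.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Hypothetical scores that sum to 1, for illustration only.
toy_importances = {
    "Age": 0.15, "Fare": 0.20, "Pclass": 0.10,
    "Sex_female": 0.05, "Sex_male": 0.10,
    "Title_Mr": 0.30, "Title_Miss": 0.10,
}
for feature, score in aggregate_importances(toy_importances, ["Sex", "Embarked", "Title"]).items():
    print(f"{feature:8s} {score:.2f}")
```

Summing is reasonable here because the scores are normalised shares of the same total; it is a reading aid, not a different importance measure.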

In [None]:
cat_feature_names = (
    best_clf.named_steps['preprocessor']
    .named_transformers_['cat']   # fitted categorical sub-pipeline, looked up by name
    .named_steps['encoder']
    .get_feature_names_out(categorical_features)
    .tolist()
)
all_feature_names = numeric_features + cat_feature_names
importances = best_clf.named_steps['model'].feature_importances_
indices = np.argsort(importances)

fig, ax = plt.subplots(figsize=(7, 6))
ax.barh(range(len(indices)), importances[indices], color='#3498db', edgecolor='black')
ax.set_yticks(range(len(indices)))
ax.set_yticklabels([all_feature_names[i] for i in indices])
ax.set_title('Feature Importance (Gradient Boosting)', fontsize=14, fontweight='bold')
ax.set_xlabel('Importance')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('images/feature_importance.png', dpi=150)
plt.show()

## Key Findings

- **Sex** was the single strongest predictor — women survived at 74% vs. men at 19%
- **Title** (extracted from passenger names) captured social status and age group, ranking among the top engineered features
- **Passenger class** had a strong effect — 1st class passengers survived at ~63% vs. 24% in 3rd class
- **Fare** and **Age** contributed moderate predictive signal
- **Gradient Boosting** outperformed Logistic Regression and Random Forest in cross-validation, achieving ~83% accuracy on the held-out test set