# Titanic Survival Prediction: 02 - Modeling
*Date: 14.09.2025*
*Author: Jonas Lilletvedt*

--- 

## 1. Introduction

### 1.1. Objective

This notebook marks the second part of the Titanic Survival Prediction project: model training and evaluation. We will build directly on the pre-processing pipeline constructed in `01_data_cleaning_and_feature_engineering.ipynb`. The primary objective is to systematically train, evaluate, and tune a selection of models to find the optimal balance between predictive performance and interpretability. 

1. **Preparation and Pipeline Reconstruction:** We will begin by loading datasets and rebuilding the pre-processing pipeline from the previous notebook to ensure a consistent and reproducible environment.
2. **Baseline Model Evaluation:** A selection of diverse classification models (e.g., K-NN, Random Forest, Gradient Boosting) will be used as baseline and later help us identify strategies for optimization. 
3. **Hyperparameter Tuning:** The best-performing model from the baseline evaluation will undergo systematic hyperparameter tuning using GridSearchCV. The goal is to maximize its predictive performance based on our primary evaluation metric, the F1-Score. The reasoning for selecting this metric is detailed in the section below.
4. **Final Analysis and Iteration:** Finally, we will analyze the results and train the optimized model on the full dataset to generate final predictions for a Kaggle submission. We will conclude by exploring potential avenues for future improvements, such as alternative modeling techniques or further feature engineering.

Our end goal is twofold: to develop a model that is competitive on the Kaggle leaderboard, while remaining understandable enough to provide insights into the factors that influence survival. The process will be iterative, allowing us to refine our approach based on model performance. 

Beyond the Leaderboard: A Focus on Interpretability
Unlike many competition-driven projects that prioritize raw predictive power above all else, this analysis places a strong emphasis on interpretability. While achieving a high Kaggle score is a key objective, it is equally important to build a model whose decision-making process can be understood. By favoring models and features that provide clear insights—for instance, by showing why certain passengers were more likely to survive—we aim to produce not just a "black box" predictor, but a meaningful analysis of a historical event. This balanced approach ensures that our final result is both powerful and insightful.

### 1.2 Defining Success: Beyond Simple Accuracy

While accuracy is a straightforward metric, it can be misleading. A simple "naive" model that predicts survival for all females and death for all males achieves an accuracy of approximately 78.7% on the training data. Any model we build must therefore significantly outperform this baseline to be considered valuable.

To truly understand our model's performance, we must select the right evaluation metrics for the task. The most useful evaluation metric depends on the specific problem you are trying to solve. For example, in medical screening for a virus, achieving a high Recall is more important than high Precision. We would want to identify as many infected people as possible, even if it means accepting some false positives. In other situations, however, overall Accuracy might be the primary goal.
Applying this to the Titanic problem, our context is one of historical analysis and balanced prediction. There is no real-world cost that makes predicting a death incorrectly worse than predicting a survival incorrectly. Our goal is not just to be right, but to build a model that is equally effective at identifying both those who survived and those who did not.
For this reason, we need metrics that reward this balance and are not skewed by the class distribution. While we will still consider overall Accuracy, our primary metric for model evaluation is:
*   F1-Score: The harmonic mean of precision and recall. Provides an overall score between precision and recall. 

## 2. Data Loading and Setup

---

### 2.1. Library Imports

In [127]:
# Import necessary libraries

# Data manipulation and analysis
import numpy as np
import pandas as pd


### 

### 2.2. Load Datasets

In [128]:
# Load datasets
df_train = pd.read_csv('../data/01_raw/train.csv')
df_test = pd.read_csv('../data/01_raw/test.csv')

### 2.3. Initial Inspection

A quick inspection to check the datasets are loaded properly and as expected.

**Datasets Shape:**

In [129]:
# Check shape of each dataset
print(f'Training data shape: {df_train.shape}')
print(f'Test data shape: {df_test.shape}')

Training data shape: (891, 12)
Test data shape: (418, 11)


**Data Preview:**

In [130]:
# Check five first rows in df_train
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [131]:
# Check five first rows in df_test
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


### 2.4. Prepare Data for Modeling

The datasets are loaded as expected. We will now separate our training data into two distinct objects:
*   **X:** A DataFrame containing all predictor variables.
*   **y:** A Series containing the target variable `Survived`, which we aim to predict.

In [132]:
# Seperate predictors from target
X_train = df_train.drop('Survived', axis=1)
y_train = df_train['Survived']

To avoid making changes on the original test data we will copy it to a separate object.

In [133]:
# Copy to avoid making changes to the original dataset
X_test = df_test.copy()

Display the shapes of our new variables to ensure the separation and copy worked as intended.

In [134]:
print(f'Shape of X_train: {X_train.shape}')
print(f'Shape of y_train: {y_train.shape}')
print(f'Shape of X_test: {X_test.shape}')

Shape of X_train: (891, 11)
Shape of y_train: (891,)
Shape of X_test: (418, 11)


### 2.5. Reconstruct the Pre-processing Pipeline

To make this notebook self-contained and reproducible, we will now redefine the custom transformers and the full `grand_pipeline` that were built and validated in the previous notebook. All the data transformation logic is encapsulated within a single code cell below. 

**A Note on Modularity vs. Narrative Flow**

In a production-level software project, this logic would typically be defined in a separate python-file to be imported. However, for this project, which is designed to be a linear, narrative-driven analysis, I have chosen to explicitly include the code here. This approach ensures that the notebook tells the complete, end-to-end story. Any future changes or improvements made in later stages of the project does not affect the logic or result of previous notebooks, preserving the integrity of each step of of our analysis. Furthermore, it allows any modifications or improvements to be documented and explained at the precise moment they are introduced, for a more natural progression.

All data transformation logic is therefore encapsulated within the single code cell below, which can be collapsed for easier reading of the modeling work that follows.

In [138]:
# Scikit-learn tools for preprocessing and modeling
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin

class TitleExtractor(BaseEstimator,  TransformerMixin):
    def __init__(self, rare_threshold=10):
        self.rare_threshold = rare_threshold

        # Titles with same meaning
        self.title_synonym_mapping_ = {
            'Mlle.': 'Miss.',
            'Ms.': 'Miss.',
            'Mme.': 'Mrs.'
        }

    def fit(self, X, y=None):
        titles = X['Name'].str.extract(pat=' ([A-Za-z]+\.)', expand=False)
        titles = titles.replace(self.title_synonym_mapping_)
        self.non_rare_titles_ = titles.value_counts()[lambda x: x >= self.rare_threshold].index
        return self
    
    def transform(self, X, y=None):
        # Copy to avoid modifying original
        X_copy = X.copy()
        # Extract title 
        X_copy['Title_feat'] = X_copy['Name'].str.extract(pat=' ([A-Za-z]+\.)', expand=False)
        
        # Swap titles with 'Rare' and synonym
        X_copy['Title_feat'] = X_copy['Title_feat'].replace(self.title_synonym_mapping_)


        X_copy['Title_feat'] = X_copy['Title_feat'].apply(lambda x: x if x in self.non_rare_titles_ else 'Rare')

        return X_copy

class FamilySurvialRateExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, drop_surname=True, smooth_factor=1):
        self.drop_surname = drop_surname
        self.smooth_factor = smooth_factor

    def fit(self, X, y):
        X_temp = pd.concat([X, y], axis=1)

        X_temp['FamilyID_temp'] = self.__get_family_id(X)


        self.family_stats_ = X_temp.groupby('FamilyID_temp').agg(
            FamilySize_temp = ('Survived', 'size'), 
            FamilySurvivalCount_temp = ('Survived', 'sum')
        )
    
        self.global_survival_rate_ = y.mean()
        self.training_index_ = X.index
        self.y_train_ = y

        self.alone_survival_rate_ = self.family_stats_[self.family_stats_['FamilySize_temp'] == 1]['FamilySurvivalCount_temp'].mean()

        return self

    def transform(self, X, y=None):
        X_copy = X.copy()

        X_copy['FamilyID_temp'] = self.__get_family_id(X)

        X_copy = X_copy.merge(self.family_stats_, on='FamilyID_temp', how='left')

        # Different calculation for passengers from training set -- avoid data leakage
        if X.index.equals(self.training_index_):
            X_copy['FamilySurvivalCount_temp'] -= self.y_train_
            X_copy['FamilySize_temp'] -= 1

        numerator = X_copy['FamilySurvivalCount_temp'] 
        denominator = X_copy['FamilySize_temp']
        
        # Apply smoothing
        smoothed_numerator = numerator + (self.smooth_factor * self.global_survival_rate_)
        smoothed_denominator = denominator + self.smooth_factor

        # Calculate FamilySurvivalRate
        X_copy['FamilySurvivalRate_feat'] = smoothed_numerator / smoothed_denominator

        # For passengers without a family from training
        X_copy['FamilySurvivalRate_feat'] = X_copy['FamilySurvivalRate_feat'].fillna(self.global_survival_rate_)

        # For passengers which travel alone we will overwrite global_survival_rate_
        is_alone_mask = (X_copy.groupby('FamilyID_temp')['FamilyID_temp'].transform('count') == 1) & (X_copy['FamilySize_temp'] == 0 | X_copy['FamilySize_temp'].isna())

        X_copy.loc[is_alone_mask, 'FamilySurvivalRate_feat'] = self.alone_survival_rate_

        # Columns to drop
        columns_to_drop = ['FamilySurvivalCount_temp', 'FamilySize_temp', 'FamilyID_temp']
        
        # Clean df
        X_copy = X_copy.drop(columns_to_drop, axis=1)

        return X_copy

    def __get_surname(self, X):
        # Extract surname
        data = X.copy()
        return data['Name'].str.extract(pat=r'^(.+)?,', expand=False)
    
    def __get_family_id(self, X):
        return self.__get_surname(X) + '_' + X['Pclass'].astype(str)

class AgeImputer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y):
        X_temp = X.copy()
        self.median_by_title_ = X_temp.groupby('Title_feat')['Age'].median()
        return self
    
    def transform(self, X, y=None):
        X_copy = X.copy()
        X_copy['Age'] = X_copy['Age'].fillna(X_copy['Title_feat'].map(self.median_by_title_))
        return X_copy

class CabinLocationExtractor:
    def __init__(self, drop_original=True):
        self.drop_original = True
    
    def fit(self, X, y):
        pass

    def transform(self, X, y=None):
        X_copy = X.copy()
        X_copy['Deck_feat'] = X_copy['Cabin'].str.extract(pat=r'^([A-Za-z])?', expand=False).fillna('U')
        X_copy['Zone_feat'] = pd.to_numeric(
            X_copy['Cabin'].str.extract(pat=r'([0-9]+)', expand=False),
            errors='coerce'
            )
        return X_copy
    
feature_engineering_pipeline = Pipeline(steps=[
    ('title_extractor', TitleExtractor()),
    ('age_imputer', AgeImputer()),
    ('cabin_location_extractor', CabinLocationExtractor())
])

FARE_TO_LOG_TRANS = ['Fare']
CAT_FEATURES = ['Sex',
                        'Embarked',
                        'Pclass',
                        'Title_feat',
                        'Deck_feat'
                        ]
AGE_TO_BIN = ['Age']
ZONE_TO_BIN = ['Zone_feat']


fare_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('log_transform', FunctionTransformer(np.log1p)),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Simple function for pd.cut
def apply_pd_cut(X, bins, labels):
    series = pd.Series(X[:, 0])
    binned_series = pd.cut(series, bins=bins, labels=labels, right=True, include_lowest=True)
    return binned_series.to_numpy().reshape(-1, 1)

# Bins for Age
# Infant: 0-5, Child: 6-12, young-adult: 13-25, adult: 26-50, elder: 51->
AGE_BINS = [0, 5, 12, 25, 50, np.inf]
AGE_LABELS = ['Infant', 'Child', 'Young-Adult', 'Adult', 'Senior']

age_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('binner', FunctionTransformer(apply_pd_cut, kw_args={'bins': AGE_BINS, 'labels': AGE_LABELS})),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

def apply_bin_zone(X, n_bins):
    series = pd.Series(X[:, 0])
    labels  = [f'Q{i}' for i in range(1, n_bins + 1)]
    binned_series = pd.qcut(series, q=n_bins, labels=labels, duplicates='drop')

    # Convert nans to unknown
    binned_series = binned_series.cat.add_categories(['Unknown']).fillna('Unknown')
    return binned_series.to_numpy().reshape(-1, 1)

zone_pipeline = Pipeline(steps=[
    ('bin_with_missing_values', FunctionTransformer(apply_bin_zone, kw_args={'n_bins': 8})),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('fare_log', fare_pipeline, FARE_TO_LOG_TRANS),
        ('cat', categorical_pipeline, CAT_FEATURES),
        ('age_binned', age_pipeline, AGE_TO_BIN),
        ('zone_binned', zone_pipeline, ZONE_TO_BIN)
    ],
    remainder='drop'
)

grand_pipeline = Pipeline(steps=[
    ('feature_engineering', feature_engineering_pipeline),
    ('preprocessing', preprocessor)
])  


## 3. Baseline Model Evaluation

---
With our pre-processing pipeline fully constructed, we can now proceed to the modeling phase. The first step is to establish a performance baseline by evaluating several different classification algorithms. This process will help us identify the most promising model architecture before investing time in hyperparameter tuning.

As established in the introduction, our primary metric for model selection will be the F1-Score, as it provides a balanced measure of a model's performance. For a more complete picture, we will also evaluate Precision, Recall, and overall Accuracy.

To obtain a reliable estimate of each model's generalization performance, we will use 5-fold cross-validation. This method helps mitigate the risk of overfitting to a single train-test split and provides a more robust foundation for our subsequent decisions.

In [137]:
# Import SKlearn models
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Set seed for reproducible results
seed = 43

# Define models
models = {
    'Logistic Regression': LogisticRegression(random_state=seed, max_iter=1000),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(random_state=seed),
    'Gradient Boostind': GradientBoostingClassifier(random_state=seed), 
    'Support Vector Machine': SVC(random_state=seed, probability=True)
}

# Dict for results
results = {}

print('Starting baseline-evaluation of models...')

# Iterate over all models
for model_name, model in models.items():

    # Train model using the pre-constructed pipeline
    full_pipeline = Pipeline(steps=[
        ('preprocessor', grand_pipeline),
        ('classifier', model)
    ])

    # Evaluate f1-score
    f1_scores = cross_val_score(full_pipeline, X_train, y_train, cv=5, scoring='f1', n_jobs=-1)



Starting baseline-evaluation of models...


ValueError: 
All the 5 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/sklearn/pipeline.py", line 654, in fit
    Xt = self._fit(X, y, routed_params, raw_params=params)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/sklearn/pipeline.py", line 588, in _fit
    X, fitted_transformer = fit_transform_one_cached(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/joblib/memory.py", line 312, in __call__
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/sklearn/pipeline.py", line 1551, in _fit_transform_one
    res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/sklearn/pipeline.py", line 718, in fit_transform
    Xt = self._fit(X, y, routed_params)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/sklearn/pipeline.py", line 588, in _fit
    X, fitted_transformer = fit_transform_one_cached(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/joblib/memory.py", line 312, in __call__
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/sklearn/pipeline.py", line 1551, in _fit_transform_one
    res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jonaslilletvedt/miniconda3/envs/titanic-survival-prediction/lib/python3.11/site-packages/sklearn/pipeline.py", line 734, in fit_transform
    return last_step.fit(Xt, y, **last_step_params["fit"]).transform(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'transform'


For our baseline we will use four different classification models. 
1. **Logistic Regression**
2. **K-Nearest Neighbors**
3. **Random Forest**
4. **Gradient Boosting**
5. **Support Vector Machine**