# Titanic Survival Prediction: 01 - Data Cleaning and Feature Engineering
*Date: TODO*
*Author: Jonas Lilletvedt*

--- 

## 1. Introduction and Setup

### 1.1. Objective

In this notebook we will focus on data cleaning and feature engineering. Building on the findings from our initial exploration in `00_initial_data_exploration.ipynb`, this notebook's primary objective is to construct a robust end-to-end pre-processing pipeline. This pipeline will transform the raw data into a clean, feature-rich dataset ready for modeling, while ensuring results are reproducible and free from data leakage.

To achieve this, our pipeline will systematically perform these tasks:
1.  **Data Cleaning and Imputation:**
    *   Address missing data in `Age`, `Embarked`.
    *   Normalize the positively skewed data in `Fare`.
2.  **Advanced Feature Engineering:**
    *   Extract `Title` from the `Name` column to act as a proxy for age, sex, marital status and social status.
    *   Create a binned `FamilySize` feature from `SibSp` and `Parch`.
    *   Derive `Deck` (vertical location) and `Zone` (horizontal location) from `Cabin`.
    *   Bin the `Age` feature to better represent the non-linear relationship to `Survived`.

The entire process will be wrapped in a `scikit-learn` `Pipeline`, implemented using a set of custom transformers for our distinct logic, and a `ColumnTransformer` for standard pre-processing tasks.

### 1.2. Recap of Findings from Exploratory Data Analysis

The preceding data analysis (`00_initial_data_exploration.ipynb`) revealed several key insights and quality issues that will determine our work here:

**Key Predictive Relationships:**
*   **Dominant Predictors:** `Sex` and `Pclass` were found to be the strongest predictors of survival.
    *   Females had a vastly higher survival rate (~74%) than males (~18%).
    *   There was a clear, roughly linear relationship between passenger class and survival rate. First class passengers had a survival rate of ~63%, compared to ~47% and ~24% for second and third class passengers respectively.

**Data Quality and Structural Issues:**
*   **Missing Data:** Multiple columns were missing significant amounts of data.
    *   `Cabin` (~77% missing)
    *   `Age` (~20% missing)
    *   `Embarked` (2 missing values)
*   **Outliers and Skewness:** The `Fare` column shows a substantial gap between the 75th percentile ($31) and the maximum value ($512), which indicates a long right tail.
*   **Features Requiring Transformation:** The columns `Name`, `Ticket` and `Cabin` are not suitable for direct use but contain valuable information that can be extracted:
    *   `Name` can be deconstructed to extract `Title` (a proxy for sex, age, marital status and social status) and `Surname` for family identification.
    *   From `Cabin` we can gather positional information for `Deck` (vertical location) and `Zone` (horizontal location).
    *   `SibSp` and `Parch` can be combined into a more powerful feature, `FamilySize`.
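
As a minimal sketch of the last point, `FamilySize` could be derived as shown below. The `+ 1` (counting the passenger themselves) is a common convention and an assumption here, not something established in the EDA, and the toy values merely stand in for the real columns.

```python
import pandas as pd

# Toy values standing in for the real `SibSp` and `Parch` columns
df = pd.DataFrame({"SibSp": [1, 1, 0, 1, 0], "Parch": [0, 0, 0, 0, 0]})

# Assumption: family size = siblings/spouses + parents/children + the passenger
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

print(df)
```

Binning this count into categories is a separate step and is not shown here.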

## 2. Data Loading and Setup

---

The first step in this notebook is to import the required libraries and then load the train and test datasets.

All modifications will be applied to both the `train` and `test` sets for consistency. To prevent data leakage and ensure our model's performance is evaluated realistically, all transformation parameters -- such as imputation values or scaling factors -- will be learned solely from the `train` dataset. The test set will be used only for model evaluation and must not influence any part of the data analysis or pre-processing steps.
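
The fit-on-train, apply-to-both pattern described above can be sketched as follows. This is only an illustration: the toy frames and the choice of a median strategy are assumptions, not the notebook's final imputation logic.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the real train and test frames
train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0]})
test = pd.DataFrame({"Age": [np.nan, 47.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train[["Age"]])  # the median is learned from the training data only

train_age = imputer.transform(train[["Age"]])
test_age = imputer.transform(test[["Age"]])  # the test set reuses the train median

print(imputer.statistics_)  # the learned fill value
print(test_age.ravel())
```

Because `fit` never sees the test frame, nothing about the test distribution can leak into the learned parameters.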

### 2.1. Library Imports

In [69]:
# Import necessary libraries

# Data manipulation and analysis
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn tools for preprocessing and modeling
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin

### 2.2. Load Datasets

In [70]:
# Load datasets
df_train = pd.read_csv('../data/01_raw/train.csv')
df_test = pd.read_csv('../data/01_raw/test.csv')

### 2.3. Initial Inspection 

A quick inspection to confirm that the datasets loaded correctly and look as expected.

**Dataset Shapes:**

In [71]:
# Check shape of each dataset
print(f'Training data shape: {df_train.shape}')
print(f'Test data shape: {df_test.shape}')

Training data shape: (891, 12)
Test data shape: (418, 11)


**Data Preview:**

In [72]:
# Check five first rows of `df_train`
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [73]:
# Check five first rows of `df_test`
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


**Data Types and Missing Values:**

In [74]:
# Types and missing values for `df_train`
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [75]:
# Types and missing values for `df_test`
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB


Unlike the train set, the test set also has a missing value in `Fare`, in addition to the missing values in `Cabin` and `Age`.

Now that we have a feel for the data and have confirmed that everything works as expected, we can move on to `Data Cleaning and Imputation`.

### 2.4. Initial Findings and Plan Adjustments

The inspection confirms the missing values in `Age`, `Cabin` and `Embarked` in the training set, as we expected from the EDA.

We also discovered missing values in the test set. As in the train set, `Age` and `Cabin` are incomplete, and the test set additionally has one missing value in `Fare`. Our pipeline must therefore handle `Fare` imputation alongside the previously planned transformations.
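
One way the pipeline could cover the missing `Fare` value is sketched below. The median strategy and the use of `np.log1p` for the log-transformation are assumptions made for illustration, and the toy frames only borrow a few fares from the previews above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy frames; the NaN mimics the single missing `Fare` in the test set
train = pd.DataFrame({"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05]})
test = pd.DataFrame({"Fare": [7.8292, np.nan, 9.6875]})

# Assumption: impute with the train median, then apply log(1 + x)
fare_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
])
preprocessor = ColumnTransformer([("fare", fare_pipeline, ["Fare"])])

preprocessor.fit(train)  # the median comes from train only
test_fare = preprocessor.transform(test)

print(test_fare.ravel())
```

Keeping the imputer inside the pipeline means the train set, which has no missing fares, still defines the fill value used for the test set.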

## 3. Exploratory Feature Engineering and Validation

--- 

Feature selection is an important step before pipeline development. We will therefore perform an exploratory analysis to evaluate the value of the proposed engineered features.

This section will therefore act as a 'scratchpad' for our feature engineering ideas.

### 3.1. Evaluating the `Title` feature

We will begin our feature engineering exploration with `Title`. As hypothesized in the introduction, a passenger's title has the potential to be a strong proxy for both age and survival.

It is crucial to evaluate `Title` first, because our `Age` imputation strategy will depend on it. We will analyze `Title`'s relationship with both variables accordingly.

In [76]:
# Make copy for scratchpad
df_scratch = df_train.copy()

# Add new columns
df_scratch['Surname_feat'] = df_scratch['Name'].str.split(',').str[0]
# Raw string so that `\.` is read as a regex escape, not a Python string escape
df_scratch['Title_feat'] = df_scratch['Name'].str.extract(pat=r' ([A-Za-z]+\.)', expand=False)
df_scratch.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname_feat,Title_feat
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund,Mr.
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings,Mrs.
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen,Miss.
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle,Mrs.
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen,Mr.


## 4. Building the Pre-processing and Modeling Pipeline

---

In this section we will construct the components for our final, end-to-end pipeline. 

**A Note on Project Structure:** In a production setting, all source code for the pipeline would typically be organized into a `src` directory for modularity, reusability and testing. However, for the sake of clarity and simplicity, we will define these components below. This allows the reader to follow each step of the logic directly.

### 4.1. Custom Transformers
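
As a rough sketch of the shape such a component might take, the hypothetical `TitleExtractor` below wraps the title-extraction logic from the scratchpad in the scikit-learn transformer interface. The class name, the `name_col` parameter and the output column name `Title` are illustrative assumptions, not the final design.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class TitleExtractor(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that adds a `Title` column parsed from `Name`."""

    def __init__(self, name_col="Name"):
        self.name_col = name_col

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's frame
        X["Title"] = X[self.name_col].str.extract(r" ([A-Za-z]+\.)", expand=False)
        return X


pipe = Pipeline([("title", TitleExtractor())])
names = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]})
out = pipe.fit_transform(names)
print(out)
```

Transformers that do learn something from the data, such as group medians for imputation, would store it in `fit` and reuse it in `transform`.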

## 5. Data Cleaning and Imputation

---

Our exploratory data analysis revealed several data quality issues that must be resolved before feature engineering can start:

1.  **Handle Outliers and Skewed data in `Fare`:** Apply a log-transformation to the column.
2.  **Impute `Embarked`:** Fill the two missing values using the mode.
3.  **Impute `Age`:** Impute the missing values in `Age`, using the median age grouped by `Title` and `Pclass`.
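
The grouped `Age` imputation is the least standard of these steps, so a minimal sketch may help. It assumes `Title` has already been extracted, and it adds a global-median fallback for `Title`/`Pclass` combinations that never occur in the training data; that fallback and the helper name `impute_age` are assumptions, not part of the plan above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; `Title` is assumed to have been extracted already
train = pd.DataFrame({
    "Title": ["Mr.", "Mr.", "Mrs.", "Mrs.", "Mr.", "Miss."],
    "Pclass": [3, 3, 1, 1, 3, 3],
    "Age": [22.0, 35.0, 38.0, 35.0, np.nan, 26.0],
})
test = pd.DataFrame({"Title": ["Mr.", "Dr."], "Pclass": [3, 1], "Age": [np.nan, np.nan]})

# Learn the medians from the training data only
group_medians = train.groupby(["Title", "Pclass"])["Age"].median().rename("AgeMedian")
global_median = train["Age"].median()  # assumed fallback for unseen groups


def impute_age(df):
    """Fill missing ages with the train group median, then the global median."""
    out = df.merge(group_medians.reset_index(), on=["Title", "Pclass"], how="left")
    out.index = df.index  # merge resets the index, so restore it
    out["Age"] = out["Age"].fillna(out["AgeMedian"]).fillna(global_median)
    return out.drop(columns="AgeMedian")


train_imputed = impute_age(train)
test_imputed = impute_age(test)
print(test_imputed)
```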

**Note on `Cabin`:** Because such a large share of the data is missing (~77%), simple imputation is not feasible. Instead, we will treat this as a feature engineering task by extracting positional information to create two new features: `Deck` and `Zone`. For passengers with missing cabin data, these new features will be assigned an 'Unknown' class. Consequently, the handling of `Cabin` is deferred to the `Feature Engineering` section.
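
A minimal sketch of the `Deck` part, assuming the deck is the leading letter of the cabin code, is shown below. `Zone` is left out because the mapping from cabin number to a horizontal zone is not defined at this point, and the cabin values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative cabin codes with missing entries
cabin = pd.Series(["C85", np.nan, "C123", np.nan, "E46"], name="Cabin")

# Assumption: the first character of the cabin code identifies the deck
deck = cabin.str[0].fillna("Unknown").rename("Deck")

print(deck.tolist())
```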