### Titanic with feature engineering

The goal of this notebook is to improve on the previous vanilla approach:

- Use pipelines for the transformations instead of carrying each one out manually.
- Do better feature engineering in three ways: extract titles (Mr, Mrs, Master, etc.) from names, extract the first few digits of ticket numbers, and create a family-size feature.
- Decide how to handle the cabin letter.
- Impute missing data more strategically.
- Tune hyperparameters.
- Try different models (random forest and decision tree, along with the KNN and SGD classifiers used earlier).

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

In [3]:
df = pd.read_csv("titanic_data/train.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
# isolate training and target data

X0 = df.drop(['Survived', 'PassengerId'], axis=1)
yT = df['Survived']
print(X0.shape, yT.shape)

(891, 10) (891,)


In [8]:
X0.head(10)

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


1. Extract titles from 'Name' column 

In [None]:
# The regex r' ([A-Za-z]+)\.' looks for a word followed by a dot (.), which captures titles like Mr, Mrs, Miss, Master

X0['Title'] = X0['Name'].str.extract(r' ([A-Za-z]+)\.')

In [16]:
title_dict = X0['Title'].value_counts().to_dict()
print(title_dict)

{'Mr': 517, 'Miss': 182, 'Mrs': 125, 'Master': 40, 'Dr': 7, 'Rev': 6, 'Mlle': 2, 'Major': 2, 'Col': 2, 'Countess': 1, 'Capt': 1, 'Ms': 1, 'Sir': 1, 'Lady': 1, 'Mme': 1, 'Don': 1, 'Jonkheer': 1}
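
As a quick sanity check of the title regex, here is a small sketch on a handful of names taken from the preview above. The `names` and `titles` variables are only for illustration; `expand=False` is used so that `str.extract` returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# A few names in the same "Surname, Title. Given names" layout as the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# The pattern matches a space, a run of letters, then a literal dot,
# and captures only the letters.
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Master', 'Mrs']
```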


Let's group the long tail of titles into "Other" or "Royalty". This keeps the number of one-hot encoded features from ballooning and reduces the risk of overfitting to titles that appear only once or twice.

In [19]:
title_map = {
    'Mr': 'Mr', 'Mrs': 'Mrs', 'Miss': 'Miss', 'Master': 'Master',
    'Don': 'Other', 'Rev': 'Other', 'Dr': 'Other', 'Mme': 'Mrs', 
    'Ms': 'Miss', 'Major': 'Other', 'Lady': 'Royalty', 'Sir': 'Royalty', 
    'Mlle': 'Miss', 'Col': 'Other', 'Capt': 'Other', 'Countess': 'Royalty', 'Jonkheer': 'Royalty'
}

X0['Title'] = X0['Title'].map(lambda x: title_map.get(x, 'Other'))

In [21]:
X0['Title'].value_counts()

Title
Mr         517
Miss       185
Mrs        126
Master      40
Other       19
Royalty      4
Name: count, dtype: int64
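
The reason for mapping through `title_map.get(x, 'Other')` rather than passing the dict straight to `.map()` is that a title missing from the dict would otherwise become NaN. A minimal sketch of the difference, using a shortened hypothetical mapping (`demo_map`) and a made-up unseen title (`'Dona'`):

```python
import pandas as pd

# Hypothetical, shortened version of the mapping, for illustration only.
demo_map = {'Mr': 'Mr', 'Mrs': 'Mrs', 'Mlle': 'Miss', 'Sir': 'Royalty'}

# 'Dona' stands in for a title that is not a key in the mapping.
raw = pd.Series(['Mr', 'Mlle', 'Sir', 'Dona'])

# Passing the dict directly leaves unmapped titles as NaN.
with_dict = raw.map(demo_map)

# dict.get with a default sends unmapped titles to 'Other' instead.
with_default = raw.map(lambda t: demo_map.get(t, 'Other'))

print(with_dict.isna().sum())   # 1
print(with_default.tolist())    # ['Mr', 'Miss', 'Royalty', 'Other']
```

This matters mostly for data the mapping was not built from, such as a held-out test set.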

In [24]:
X0.columns, X0.shape

(Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
        'Cabin', 'Embarked', 'Title'],
       dtype='object'),
 (891, 11))

2. Add a column with the total family size on board (SibSp + Parch, plus 1 for the passenger themselves), then drop SibSp and Parch. Name is dropped as well, since the title has already been extracted from it.

In [26]:
X0['Relatives_onb'] = (X0['SibSp'] + X0['Parch'] + 1)

In [54]:
X1 = X0.drop(['SibSp', 'Parch', 'Name'], axis=1)
X1.columns

Index(['Pclass', 'Sex', 'Age', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Title',
       'Relatives_onb'],
      dtype='object')
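
A small worked check of the family-size arithmetic, using the SibSp and Parch values of three passengers from the preview above (the `demo` frame is only for illustration):

```python
import pandas as pd

# SibSp / Parch values copied from rows 0, 2 and 7 of the preview.
demo = pd.DataFrame(
    {'SibSp': [1, 0, 3], 'Parch': [0, 0, 1]},
    index=['Braund', 'Heikkinen', 'Palsson'],
)

# Siblings/spouses + parents/children + the passenger themselves.
demo['Relatives_onb'] = demo['SibSp'] + demo['Parch'] + 1
print(demo['Relatives_onb'].tolist())  # [2, 1, 5]
```

A value of 1 therefore means the passenger was travelling alone.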

3. Check whether having a recorded cabin is associated with survival, and create a new feature if it is.

In [57]:
# Replace missing cabin values with 'Z', then keep only the first letter of each cabin code, which might hold some key information.
X1['Cabin'] = X1['Cabin'].fillna('Z').str[0]

In [58]:
X1['Cabin']

0      Z
1      C
2      Z
3      C
4      Z
      ..
886    Z
887    B
888    Z
889    C
890    Z
Name: Cabin, Length: 891, dtype: object

In [60]:
df['Cabin'] = df['Cabin'].fillna('Z').str[0]

In [61]:
print(df.groupby('Cabin')['Survived'].mean())

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
T    0.000000
Z    0.299854
Name: Survived, dtype: float64
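
Group means on their own can be misleading when a group is tiny, so it can help to look at the group sizes next to the survival rates. A minimal sketch on toy data (the `toy` frame is made up and is not the Titanic file):

```python
import pandas as pd

# Toy data: cabin letter and survival flag.
toy = pd.DataFrame({
    'Cabin':    ['B', 'B', 'B', 'T', 'Z', 'Z', 'Z', 'Z'],
    'Survived': [1,   1,   0,   0,   0,   1,   0,   0],
})

# Survival rate and number of passengers per cabin letter.
summary = toy.groupby('Cabin')['Survived'].agg(['mean', 'count'])
print(summary)
```

In this toy frame the 0.0 rate for `T` rests on a single row, which is the kind of thing the `count` column makes visible.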


Let us create a new binary feature `has_cabin`, which is 1 if the cabin letter is A, B, C, D, E, F or G and 0 otherwise. Survival rates for these letters range from about 0.47 to 0.76, against about 0.30 for passengers with no recorded cabin ('Z') and 0.00 for cabin letter T. Having a cabin record may not affect survival directly, and choosing the 0/1 split by looking at survival rates may be questionable, but the hypothesis at play is: wealthy => cabin => higher survival.

In [70]:
X1['has_cabin'] = X1['Cabin'].isin(['A', 'B', 'C', 'D', 'E', 'F', 'G']).astype(int)
X1 = X1.drop('Cabin', axis = 1)
[X1.columns, X1.shape]

[Index(['Pclass', 'Sex', 'Age', 'Ticket', 'Fare', 'Embarked', 'Title',
        'Relatives_onb', 'has_cabin'],
       dtype='object'),
 (891, 9)]
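
Note that listing the letters explicitly is not the same as flagging every passenger with a recorded cabin: letter `T` ends up as 0. A short sketch of the two variants on made-up cabin letters (`listed` and `any_cabin` are illustrative names):

```python
import pandas as pd

# Made-up cabin letters; 'Z' marks a missing cabin as in the cells above.
decks = pd.Series(['C', 'Z', 'T', 'E', 'Z'])

# Variant used in this notebook: only letters A-G count as having a cabin.
listed = decks.isin(list('ABCDEFG')).astype(int)

# Alternative: any recorded cabin counts, including 'T'.
any_cabin = (decks != 'Z').astype(int)

print(listed.tolist())     # [1, 0, 0, 1, 0]
print(any_cabin.tolist())  # [1, 0, 1, 1, 0]
```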

In [71]:
print(X1['has_cabin'].sum())
X1.head(10)

203


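| | Pclass | Sex | Age | Ticket | Fare | Embarked | Title | Relatives_onb | has_cabin |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | A/5 21171 | 7.25 | S | Mr | 2 | 0 |
| 1 | 1 | female | 38.0 | PC 17599 | 71.2833 | C | Mrs | 2 | 1 |
| 2 | 3 | female | 26.0 | STON/O2. 3101282 | 7.925 | S | Miss | 1 | 0 |
| 3 | 1 | female | 35.0 | 113803 | 53.1 | S | Mrs | 2 | 1 |
| 4 | 3 | male | 35.0 | 373450 | 8.05 | S | Mr | 1 | 0 |
| 5 | 3 | male | NaN | 330877 | 8.4583 | Q | Mr | 1 | 0 |
| 6 | 1 | male | 54.0 | 17463 | 51.8625 | S | Mr | 1 | 1 |
| 7 | 3 | male | 2.0 | 349909 | 21.075 | S | Master | 5 | 0 |
| 8 | 3 | female | 27.0 | 347742 | 11.1333 | S | Mrs | 3 | 0 |
| 9 | 2 | female | 14.0 | 237736 | 30.0708 | C | Mrs | 2 | 0 |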


<span style="color:#FF0000; font-family: 'Bebas Neue'; font-size: 01em;">NOTE:</span>
For now, I will also drop the `Ticket` column, since I have a hunch that the information in the ticket number (its first digit) is already captured by other features. If needed, I will revisit this later.
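
One way to test this hunch later would be to pull out the first digit of the numeric part of the ticket and cross-tabulate it against `Pclass`. This is only a sketch on the ten tickets shown in the preview above, so it proves nothing about the full dataset; the parsing rule (last whitespace-separated token, first character) is my own assumption about the ticket format.

```python
import pandas as pd

# Ticket and Pclass values copied from the ten-row preview.
tickets = pd.Series([
    'A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450',
    '330877', '17463', '349909', '347742', '237736',
], name='Ticket')
pclass = pd.Series([3, 1, 3, 1, 3, 3, 1, 3, 3, 2], name='Pclass')

# Assumed rule: the number is the last whitespace-separated token.
# Tickets without a numeric part would yield a letter here instead of a digit.
first_digit = tickets.str.split().str[-1].str[0].rename('first_digit')

# Rows: first digit of the ticket number; columns: passenger class.
table = pd.crosstab(first_digit, pclass)
print(table)
```

If most of the counts fall into one class per digit on the full data, the digit adds little beyond `Pclass`.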

In [72]:
XT = X1.drop('Ticket', axis = 1)

In [73]:
XT.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Pclass         891 non-null    int64  
 1   Sex            891 non-null    object 
 2   Age            714 non-null    float64
 3   Fare           891 non-null    float64
 4   Embarked       889 non-null    object 
 5   Title          891 non-null    object 
 6   Relatives_onb  891 non-null    int64  
 7   has_cabin      891 non-null    int32  
dtypes: float64(2), int32(1), int64(2), object(3)
memory usage: 52.3+ KB


### Transformation pipeline

XT will serve as the training dataset after the feature engineering above. Let's create a pipeline to impute, encode and scale it.

In [76]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np

# Segregate categorical & numerical features
categorical_features = ['Sex', 'Embarked', 'Title']
numerical_features = ['Pclass', 'Relatives_onb', 'Fare', 'has_cabin']  # Excludes 'Age', which is imputed separately (and is not scaled below)

# Custom function to impute 'Age' with the mean age of each 'Title' group.
# Note: the means are recomputed from whatever data is passed in; nothing is learned in fit().
def age_imputer(X):
    df = pd.DataFrame(X, columns=['Age', 'Title'])  # Ensure a DataFrame with named columns
    df['Age'] = df.groupby('Title')['Age'].transform(lambda x: x.fillna(x.mean()))
    return df[['Age']].values  # Return Age as a 2-D NumPy array

# Wrap function in FunctionTransformer
age_transformer = FunctionTransformer(age_imputer, validate=False)

# Pipeline for numerical features (excluding Age)
num_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

# Pipeline for categorical features (One-Hot Encoding)
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # For 'Embarked'
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Full preprocessing pipeline
preprocessor = ColumnTransformer([
    ('age_impute', age_transformer, ['Age', 'Title']),  # Custom Age Imputation
    ('num', num_pipeline, numerical_features),          # Standard Scaling
    ('cat', cat_pipeline, categorical_features)         # One-Hot Encoding
])

# Apply pipeline to the dataset
XT_transformed = preprocessor.fit_transform(XT)


In [78]:
XT_transformed.shape

(891, 16)
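
The `FunctionTransformer` approach recomputes the per-title means on whatever data it is given. A possible alternative, sketched below and not what this notebook uses, is a small stateful transformer that learns the means in `fit()` and reuses them in `transform()`. `GroupMeanImputer` is a hypothetical helper of my own, and it assumes the input is a pandas DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fill NaNs in `target` with the mean of its `group`."""

    def __init__(self, target='Age', group='Title'):
        self.target = target
        self.group = group

    def fit(self, X, y=None):
        # Learn one mean per group, plus a global fallback for unseen groups.
        self.group_means_ = X.groupby(self.group)[self.target].mean()
        self.global_mean_ = X[self.target].mean()
        return self

    def transform(self, X):
        # Look up the learned mean for each row's group.
        fill = X[self.group].map(self.group_means_).fillna(self.global_mean_)
        filled = X[self.target].fillna(fill)
        return filled.to_frame().to_numpy(dtype=float)


train = pd.DataFrame({'Age': [30.0, 40.0, np.nan, 4.0],
                      'Title': ['Mr', 'Mr', 'Mr', 'Master']})
new = pd.DataFrame({'Age': [np.nan, np.nan, np.nan],
                    'Title': ['Mr', 'Master', 'Dr']})

imputer = GroupMeanImputer().fit(train)
imputed = imputer.transform(new).ravel()
print(imputed)  # Mr -> 35.0, Master -> 4.0, unseen 'Dr' -> global mean of train
```

Since `ColumnTransformer` passes a DataFrame slice when it is given a DataFrame, an instance of this class should be usable in place of `age_transformer` for the `['Age', 'Title']` columns, though I have not re-run the models with it.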

### Training models and checking performance 

Our data has been preprocessed and stored in `XT_transformed`, so let's train our models and check their performance. Hopefully it is better than the vanilla version.

1. Logistic regression

In [95]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import f1_score, confusion_matrix

In [101]:
log_reg = LogisticRegression(max_iter=10000)

y_pred_lr = cross_val_predict(log_reg, XT_transformed, yT, cv = 5)
lr_scores = cross_val_score(log_reg, XT_transformed, yT, cv = 5, scoring='accuracy')

In [102]:
c1 = confusion_matrix(yT, y_pred_lr)
c1

array([[479,  70],
       [ 82, 260]], dtype=int64)
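
Since `f1_score` is imported but accuracy is the only score computed, here is a worked sketch that derives precision, recall and F1 for the "survived" class directly from the confusion matrix above (rows are true labels, columns are predictions, following scikit-learn's convention):

```python
import numpy as np

# Confusion matrix copied from the logistic regression output above.
c1 = np.array([[479, 70],
               [82, 260]])

tn, fp, fn, tp = c1.ravel()

precision = tp / (tp + fp)   # 260 / 330
recall = tp / (tp + fn)      # 260 / 342
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / c1.sum()

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.788 recall=0.760 f1=0.774 accuracy=0.829
```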

In [103]:
print(lr_scores)
lr_scores.mean()

[0.84357542 0.81460674 0.80898876 0.82022472 0.85955056]


0.8293892411022534
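
One caveat about these scores: the preprocessor was fitted on all of XT before cross-validation, so the scaling statistics and imputed values have seen the validation folds. The effect here may well be small, but a cleaner setup puts the preprocessing and the model in one `Pipeline`, so each fold fits its own preprocessing. Below is a sketch on synthetic stand-in data (`X_demo`, `y_demo` and the column choice are made up so the example runs on its own):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for XT / yT (not the Titanic data).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'Fare': rng.gamma(2.0, 15.0, n),
    'Pclass': rng.integers(1, 4, n),
    'Sex': rng.choice(['male', 'female'], n),
    'Embarked': rng.choice(['S', 'C', 'Q'], n),
})
# Label: mostly determined by Sex, with roughly 20% of labels flipped.
y_demo = ((X_demo['Sex'] == 'female') ^ (rng.random(n) < 0.2)).astype(int)

pre = ColumnTransformer([
    ('num', StandardScaler(), ['Fare', 'Pclass']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Sex', 'Embarked']),
])

# Preprocessing and classifier are refitted together on each training fold.
model = Pipeline([('pre', pre), ('clf', LogisticRegression(max_iter=1000))])
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.mean())
```

With the real data, the same idea would be to pass the untransformed `XT` and a pipeline wrapping `preprocessor` and `log_reg` to `cross_val_score`.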

2. K-nearest neighbors

In [104]:
knn_clf = KNeighborsClassifier(n_neighbors=3)

y_knn_pred = cross_val_predict(knn_clf, XT_transformed, yT, cv=5)
knn_score = cross_val_score(knn_clf, XT_transformed, yT, cv = 5, scoring='accuracy')

In [105]:
c2 = confusion_matrix(yT, y_knn_pred)
c2

array([[468,  81],
       [101, 241]], dtype=int64)

In [106]:
knn_score.mean()

0.7957378695624883
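
The `n_neighbors=3` above was picked by hand. Since hyperparameter tuning is one of the goals of this notebook, here is a sketch of how the value could be searched with `GridSearchCV`. It runs on synthetic data from `make_classification`, and the candidate grid is an arbitrary choice of mine, so the printed result says nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for XT_transformed / yT.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Try several neighbourhood sizes with 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [3, 5, 7, 9, 11, 15]},
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

On the real data, `search.fit(XT_transformed, yT)` would be the analogous call.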