Feature Engineering

Notebook Purpose
Transform preprocessed data into model-ready features by extracting new information from existing columns.

Input
- `train_preprocessed.csv` - Cleaned training data from preprocessing step
- `test_preprocessed.csv` - Cleaned test data from preprocessing step

Output
- `train_features.csv` - Training data with engineered features
- `test_features.csv` - Test data with engineered features

Features to Engineer
1. **Title** - Extract social title from passenger names
2. **FamilySize** - Total family members aboard, including the passenger
3. **IsAlone** - Binary flag for solo travelers
4. **AgeBin** - Categorical age groups
5. **FareBin** - Categorical fare groups

In [3]:
# Initial Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Set up Visualization options

In [4]:
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

In [5]:
# Load in preprocessed data
train_df = pd.read_csv('../data/processed/train_preprocessed.csv')
test_df = pd.read_csv('../data/processed/test_preprocessed.csv')

print(f"Training set: {train_df.shape}")
print(f"Test set: {test_df.shape}")

print(train_df.head())

Training set: (891, 20)
Test set: (418, 19)
   Survived  Pclass                                               Name   Age  \
0         0       3                            Braund, Mr. Owen Harris  22.0   
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0   
2         1       3                             Heikkinen, Miss. Laina  26.0   
3         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0   
4         0       3                           Allen, Mr. William Henry  35.0   

   SibSp  Parch     Fare  Sex_encoded  Embarked_C  Embarked_Q  Embarked_S  \
0      1      0   7.2500            0       False       False        True   
1      1      0  71.2833            1        True       False       False   
2      0      0   7.9250            1       False       False        True   
3      1      0  53.1000            1       False       False        True   
4      0      0   8.0500            0       False       False        True   

   Deck_A  D

Feature 1: Title Extraction

Why This Feature Matters
Passenger names contain titles (Mr., Mrs., Miss., Master., etc.) that encode:
- Gender (redundant but confirms our encoding)
- Marital status (Mrs. vs Miss.)
- Age indicators (Master. = young boys)
- Social status (Dr., Rev., military ranks)

In [6]:
train_df['Name'].head(10)

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
5                                     Moran, Mr. James
6                              McCarthy, Mr. Timothy J
7                       Palsson, Master. Gosta Leonard
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                  Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object

In [7]:
def extract_title(name):
    """
    Extract title from passenger name.
    Names are formatted as 'LastName, Title. FirstName'.
    """
    
    title = name.split(',')[1].split('.')[0].strip()
    return title


# Testing function
print(extract_title("Braund, Mr. Owen Harris"))
print(extract_title("Cumings, Mrs. John Bradley (Florence Briggs Thayer)"))
print(extract_title("Heikkinen, Miss. Laina"))

Mr
Mrs
Miss
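
The same extraction can also be sketched with pandas' vectorized string methods instead of `apply`. This is an alternative sketch, not the notebook's method: the helper name `extract_titles` and the regular expression are assumptions introduced here. The pattern captures the text between the first comma and the next period, and a name that does not match yields NaN rather than raising an `IndexError`.

```python
import pandas as pd

def extract_titles(names: pd.Series) -> pd.Series:
    """Vectorized title extraction (hypothetical helper, not from the notebook)."""
    # Capture the text between the first comma and the next period.
    titles = names.str.extract(r',\s*([^.]+)\.', expand=False)
    return titles.str.strip()

sample = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "No title here",  # malformed name: yields NaN instead of an error
])
sample_titles = extract_titles(sample)
print(sample_titles.tolist())
```

Whether a missing title should become NaN or a placeholder category is a modeling choice; the sketch only makes the failure visible.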


In [8]:
# Apply to the datasets
train_df['Title'] = train_df['Name'].apply(extract_title)
test_df['Title'] = test_df['Name'].apply(extract_title)

print("Training set titles:")
print(train_df['Title'].value_counts())

print("Test set titles:")
print(test_df['Title'].value_counts())

Training set titles:
Title
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Col               2
Mlle              2
Major             2
Ms                1
Mme               1
Don               1
Lady              1
Sir               1
Capt              1
the Countess      1
Jonkheer          1
Name: count, dtype: int64
Test set titles:
Title
Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Ms          1
Dr          1
Dona        1
Name: count, dtype: int64


Title Grouping

Many rare titles appear only a few times. We should group them into meaningful categories to:
1. Reduce dimensionality
2. Ensure every test-set title also exists in the training set (`Dona`, for example, appears only in the test set)
3. Create statistically meaningful groups

In [9]:
def group_titles(title):
    """
    Group rare titles into common categories.
    """
    
    title_mapping = {
        # Female titles
        'Mlle': 'Miss',      # Mademoiselle (French Miss)
        'Ms': 'Miss',
        'Mme': 'Mrs',        # Madame (French Mrs)
        
        # Professional/military/noble titles (mostly male) -> 'Rare'
        'Dr': 'Rare',
        'Rev': 'Rare',
        'Col': 'Rare',
        'Major': 'Rare',
        'Capt': 'Rare',
        'Sir': 'Rare',
        'Don': 'Rare',
        'Jonkheer': 'Rare',  # Dutch honorific
        
        # Female nobility -> 'Rare'
        'Lady': 'Rare',
        'the Countess': 'Rare',
        'Dona': 'Rare'
    }
    
    return title_mapping.get(title, title)


# Apply grouping
train_df['Title'] = train_df['Title'].apply(group_titles)
test_df['Title'] = test_df['Title'].apply(group_titles)

print("Grouped titles (Training):")
print(train_df['Title'].value_counts())

Grouped titles (Training):
Title
Mr        517
Miss      185
Mrs       126
Master     40
Rare       23
Name: count, dtype: int64
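
A quick way to confirm that no test-set category is missing from the training set is a set difference. The sketch below uses toy Series in place of the real `Title` columns, and `unseen_categories` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def unseen_categories(train: pd.Series, test: pd.Series) -> set:
    """Return values present in test but absent from train (hypothetical helper)."""
    return set(test.dropna().unique()) - set(train.dropna().unique())

# Toy data standing in for the Title columns
train_titles = pd.Series(['Mr', 'Miss', 'Mrs', 'Master', 'Rare'])
raw_test_titles = pd.Series(['Mr', 'Mrs', 'Dona'])
grouped_test_titles = raw_test_titles.replace({'Dona': 'Rare'})

before = unseen_categories(train_titles, raw_test_titles)
after = unseen_categories(train_titles, grouped_test_titles)
print("Unseen before grouping:", before)
print("Unseen after grouping:", after)
```

An empty set after grouping means every test title has a counterpart in training.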


In [12]:
# Visualize survival rate by title
title_survival = train_df.groupby('Title')['Survived'].agg(['mean', 'count'])
title_survival.columns = ['Survival_Rate', 'Count']
title_survival = title_survival.sort_values('Survival_Rate', ascending=False)

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(title_survival.index, title_survival['Survival_Rate'], color='steelblue')
ax.set_ylabel('Survival Rate')
ax.set_xlabel('Title')
ax.set_title('Survival Rate by Title')
ax.axhline(y=train_df['Survived'].mean(), color='red', linestyle='--', label='Overall Average')
ax.legend()

# Add count labels on bars
for bar, count in zip(bars, title_survival['Count']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
            f'n={int(count)}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.savefig('../reports/figures/survival_rate_title.png')
plt.close()

Feature 2: Family Size

Why This Feature Matters
- SibSp = siblings + spouse count
- Parch = parents + children count
- Combined, plus one for the passenger, these give the total family size aboard

In [13]:
# Create FamilySize
train_df['FamilySize'] = train_df['SibSp'] + train_df['Parch'] + 1
test_df['FamilySize'] = test_df['SibSp'] + test_df['Parch'] + 1

print("Family Size Distribution:")
print(train_df['FamilySize'].value_counts().sort_index())

Family Size Distribution:
FamilySize
1     537
2     161
3     102
4      29
5      15
6      22
7      12
8       6
11      7
Name: count, dtype: int64


In [15]:
# Visualize survival by FamilySize
family_survival = train_df.groupby('FamilySize')['Survived'].agg(['mean', 'count'])
family_survival.columns = ['Survival_Rate', 'Count']

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(family_survival.index, family_survival['Survival_Rate'], color='steelblue')
ax.set_ylabel('Survival Rate')
ax.set_xlabel('Family Size')
ax.set_title('Survival Rate by Family Size')
ax.axhline(y=train_df['Survived'].mean(), color='red', linestyle='--', label='Overall Average')
ax.legend()

# Add count labels
for bar, count in zip(bars, family_survival['Count']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
            f'n={int(count)}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.savefig('../reports/figures/survival_rate_by_familysize.png')
plt.close()

Feature 3: Is Alone

Based on the family size analysis, solo travelers (FamilySize = 1) appear to have 
distinct survival patterns. Let's create a binary flag.

In [16]:
train_df['IsAlone'] = (train_df['FamilySize'] == 1).astype(int)
test_df['IsAlone'] = (test_df['FamilySize'] == 1).astype(int)

print("IsAlone Distribution:")
print(train_df['IsAlone'].value_counts())
print(f"\nSurvival rate - Alone: {train_df[train_df['IsAlone']==1]['Survived'].mean():.3f}")
print(f"Survival rate - With Family: {train_df[train_df['IsAlone']==0]['Survived'].mean():.3f}")

IsAlone Distribution:
IsAlone
1    537
0    354
Name: count, dtype: int64

Survival rate - Alone: 0.304
Survival rate - With Family: 0.506
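
The two filtered means can also be computed in a single `groupby`, which returns the group sizes alongside the rates. A minimal sketch on a hypothetical toy frame (the values below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for train_df
toy = pd.DataFrame({
    'IsAlone':  [1, 1, 1, 0, 0, 0, 0, 1],
    'Survived': [0, 0, 1, 1, 1, 0, 1, 0],
})

# Named aggregation: one row per IsAlone value, with rate and count
summary = (
    toy.groupby('IsAlone')['Survived']
       .agg(Survival_Rate='mean', Count='count')
)
print(summary)
```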


Feature 4: Age Binning

Why Bin Ages?
- Reduces noise from exact age values
- Captures meaningful life stages (Child, Adult, Senior)
- "Women and children first" suggests children may have had a higher survival rate
- Handles the inherent uncertainty in imputed ages

In [17]:
# Define age bins
age_bins = [0, 12, 18, 35, 60, 100]
age_labels = ['Child', 'Teenager', 'Young_Adult', 'Adult', 'Senior']

train_df['AgeBin'] = pd.cut(train_df['Age'], bins=age_bins, labels=age_labels)
test_df['AgeBin'] = pd.cut(test_df['Age'], bins=age_bins, labels=age_labels)

print("Age Bin Distribution:")
print(train_df['AgeBin'].value_counts())

Age Bin Distribution:
AgeBin
Young_Adult    514
Adult          216
Teenager        70
Child           69
Senior          22
Name: count, dtype: int64
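
It helps to know how `pd.cut` treats the bin boundaries. By default the intervals are right-closed, so an age of exactly 12 falls in `Child`, while an age of exactly 0 falls outside the first interval `(0, 12]` and becomes NaN. The sketch below reuses the bins and labels from the cell above on a few hypothetical ages chosen to sit on the edges.

```python
import pandas as pd

age_bins = [0, 12, 18, 35, 60, 100]
age_labels = ['Child', 'Teenager', 'Young_Adult', 'Adult', 'Senior']

# Hypothetical ages placed on and around the bin edges
ages = pd.Series([0.0, 0.42, 12.0, 12.5, 18.0, 60.0, 80.0])
binned = pd.cut(ages, bins=age_bins, labels=age_labels)

boundary_check = pd.DataFrame({'Age': ages, 'AgeBin': binned})
print(boundary_check)

# Number of values that fell outside every interval
print("Unbinned:", binned.isna().sum())
```

Checking `isna().sum()` on the real `AgeBin` column is a cheap guard against silently dropped rows; passing `include_lowest=True` would close the first interval on the left if zero ages were possible.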


In [19]:
# Visualize survival by age bin
age_survival = train_df.groupby('AgeBin', observed=True)['Survived'].agg(['mean', 'count'])
age_survival.columns = ['Survival_Rate', 'Count']

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(range(len(age_survival)), age_survival['Survival_Rate'], color='steelblue')
ax.set_xticks(range(len(age_survival)))
ax.set_xticklabels(age_survival.index)
ax.set_ylabel('Survival Rate')
ax.set_xlabel('Age Group')
ax.set_title('Survival Rate by Age Group')
ax.axhline(y=train_df['Survived'].mean(), color='red', linestyle='--', label='Overall Average')
ax.legend()

for bar, count in zip(bars, age_survival['Count']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
            f'n={int(count)}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.savefig('../reports/figures/survival_rate_by_agebin.png')
plt.close()

Feature 5: Fare Binning

Why Bin Fares?
- Fare has a highly skewed distribution
- Captures socioeconomic tiers (complements Pclass)
- Reduces impact of outliers (some very high fares)

In [20]:
# Look at fare distribution first
print(train_df['Fare'].describe())

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64


In [21]:
# Use quartile-based binning (qcut) for balanced groups.
# retbins=True also returns the bin edges so the test set can reuse them.
fare_labels = ['Low', 'Medium', 'High', 'Very_High']
train_df['FareBin'], fare_bin_edges = pd.qcut(
    train_df['Fare'], q=4, labels=fare_labels, retbins=True
)

# Apply the training edges to the test set with pd.cut.
# Open both outer edges so fares outside the training range still get a bin.
# The lower edge must be -inf rather than 0: pd.cut intervals are right-closed,
# so a test fare of exactly 0 would otherwise become NaN.
fare_bin_edges[0] = -np.inf
fare_bin_edges[-1] = np.inf
test_df['FareBin'] = pd.cut(test_df['Fare'], bins=fare_bin_edges, labels=fare_labels)

print("Fare Bin Distribution (Training):")
print(train_df['FareBin'].value_counts())

Fare Bin Distribution (Training):
FareBin
Medium       224
Low          223
High         222
Very_High    222
Name: count, dtype: int64
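
When training-set bin edges are reused on the test set, the outer edges deserve care: `pd.cut` builds right-closed intervals by default, so a value equal to the lowest edge, or above the highest, is left as NaN. The sketch below uses hypothetical toy fares to show the effect and one way around it, opening the outer edges to plus and minus infinity. Passing `include_lowest=True` is another option for the lower edge.

```python
import numpy as np
import pandas as pd

# Hypothetical toy fares, not the real data
train_fare = pd.Series([0.0, 7.25, 7.925, 8.05, 14.45, 31.0, 53.1, 71.28])
test_fare = pd.Series([0.0, 10.0, 600.0])
labels = ['Low', 'Medium', 'High', 'Very_High']

# Fit quartile edges on the training fares only
train_bins, edges = pd.qcut(train_fare, q=4, labels=labels, retbins=True)

# Naive reuse: 0.0 sits on the open left edge, 600.0 is above the top edge
naive = pd.cut(test_fare, bins=edges, labels=labels)
print("Naive NaN mask:", naive.isna().tolist())

# Open the outer edges so every test fare lands in a bin
open_edges = edges.copy()
open_edges[0] = -np.inf
open_edges[-1] = np.inf
test_bins = pd.cut(test_fare, bins=open_edges, labels=labels)
print("Open-edge bins:", test_bins.tolist())
```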


In [23]:
# Visualize survival by fare bin
fare_survival = train_df.groupby('FareBin', observed=True)['Survived'].agg(['mean', 'count'])
fare_survival.columns = ['Survival_Rate', 'Count']

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(range(len(fare_survival)), fare_survival['Survival_Rate'], color='steelblue')
ax.set_xticks(range(len(fare_survival)))
ax.set_xticklabels(fare_survival.index)
ax.set_ylabel('Survival Rate')
ax.set_xlabel('Fare Group')
ax.set_title('Survival Rate by Fare Group')
ax.axhline(y=train_df['Survived'].mean(), color='red', linestyle='--', label='Overall Average')
ax.legend()

for bar, count in zip(bars, fare_survival['Count']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
            f'n={int(count)}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.savefig('../reports/figures/survival_rate_by_farebin.png')
plt.close()

Encode new categorical features

Our new categorical features (Title, AgeBin, FareBin) need to be encoded for modeling.

In [24]:
# One-hot encode Title
title_dummies_train = pd.get_dummies(train_df['Title'], prefix='Title')
title_dummies_test = pd.get_dummies(test_df['Title'], prefix='Title')

# Ensure test has same columns as train
for col in title_dummies_train.columns:
    if col not in title_dummies_test.columns:
        title_dummies_test[col] = 0
title_dummies_test = title_dummies_test[title_dummies_train.columns]

train_df = pd.concat([train_df, title_dummies_train], axis=1)
test_df = pd.concat([test_df, title_dummies_test], axis=1)

print("Title columns added:", list(title_dummies_train.columns))

Title columns added: ['Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Rare']


In [25]:
# One-hot encode AgeBin
age_dummies_train = pd.get_dummies(train_df['AgeBin'], prefix='Age')
age_dummies_test = pd.get_dummies(test_df['AgeBin'], prefix='Age')

# Ensure test has same columns as train
for col in age_dummies_train.columns:
    if col not in age_dummies_test.columns:
        age_dummies_test[col] = 0
age_dummies_test = age_dummies_test[age_dummies_train.columns]

train_df = pd.concat([train_df, age_dummies_train], axis=1)
test_df = pd.concat([test_df, age_dummies_test], axis=1)

print("AgeBin columns added:", list(age_dummies_train.columns))

AgeBin columns added: ['Age_Child', 'Age_Teenager', 'Age_Young_Adult', 'Age_Adult', 'Age_Senior']


In [26]:
# One-hot encode FareBin
fare_dummies_train = pd.get_dummies(train_df['FareBin'], prefix='Fare')
fare_dummies_test = pd.get_dummies(test_df['FareBin'], prefix='Fare')

# Ensure test has same columns as train
for col in fare_dummies_train.columns:
    if col not in fare_dummies_test.columns:
        fare_dummies_test[col] = 0
fare_dummies_test = fare_dummies_test[fare_dummies_train.columns]

train_df = pd.concat([train_df, fare_dummies_train], axis=1)
test_df = pd.concat([test_df, fare_dummies_test], axis=1)

print("FareBin columns added:", list(fare_dummies_train.columns))

FareBin columns added: ['Fare_Low', 'Fare_Medium', 'Fare_High', 'Fare_Very_High']
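
The "add missing columns, then reorder" loop used in the three cells above can also be sketched as a single `reindex` call. This is an alternative under stated assumptions, not the notebook's code: it uses toy Series and `dtype=int` so that filled columns share the dtype of the generated ones. Categories that appear only in the test set are dropped, and categories missing from the test set are added as all-zero columns.

```python
import pandas as pd

# Toy data: the test set lacks 'Rare' and contains an unseen 'Dona'
train_titles = pd.Series(['Mr', 'Miss', 'Mrs', 'Master', 'Rare'])
test_titles = pd.Series(['Mr', 'Mrs', 'Dona'])

train_dummies = pd.get_dummies(train_titles, prefix='Title', dtype=int)

# Align test columns to the training layout in one step
test_dummies = (
    pd.get_dummies(test_titles, prefix='Title', dtype=int)
      .reindex(columns=train_dummies.columns, fill_value=0)
)
print(test_dummies)
```

A row whose category was unseen in training (here `Dona`) ends up with zeros in every dummy column, which is worth checking for before modeling.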


Feature Engineering Summary

In [28]:
print(f"Training set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")
print(f"\nAll Train columns ({len(train_df.columns)}):")
print(train_df.columns.tolist())
print(f"\nAll Test columns ({len(test_df.columns)}):")
print(test_df.columns.tolist())

Training set shape: (891, 39)
Test set shape: (418, 38)

All Train columns (39):
['Survived', 'Pclass', 'Name', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_encoded', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Deck_A', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_F', 'Deck_G', 'Deck_T', 'Deck_Unknown', 'Title', 'FamilySize', 'IsAlone', 'AgeBin', 'FareBin', 'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Rare', 'Age_Child', 'Age_Teenager', 'Age_Young_Adult', 'Age_Adult', 'Age_Senior', 'Fare_Low', 'Fare_Medium', 'Fare_High', 'Fare_Very_High']

All Test columns (38):
['Pclass', 'Name', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_encoded', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Deck_A', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_F', 'Deck_G', 'Deck_T', 'Deck_Unknown', 'Title', 'FamilySize', 'IsAlone', 'AgeBin', 'FareBin', 'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Rare', 'Age_Child', 'Age_Teenager', 'Age_Young_Adult', 'Age_Adult', 'Age_Senior', 'Fare_Low', 'Fare

Drop Intermediate Columns 

In [29]:
# Columns to drop (raw/intermediate columns)
columns_to_drop = ['Name', 'Title', 'AgeBin', 'FareBin']

train_df = train_df.drop(columns=columns_to_drop)
test_df = test_df.drop(columns=columns_to_drop)

print(f"Final training set shape: {train_df.shape}")
print(f"Final test set shape: {test_df.shape}")

Final training set shape: (891, 35)
Final test set shape: (418, 34)


In [30]:
# Final column list
print("Final columns:")
print(train_df.columns.tolist())

Final columns:
['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_encoded', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Deck_A', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_F', 'Deck_G', 'Deck_T', 'Deck_Unknown', 'FamilySize', 'IsAlone', 'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Rare', 'Age_Child', 'Age_Teenager', 'Age_Young_Adult', 'Age_Adult', 'Age_Senior', 'Fare_Low', 'Fare_Medium', 'Fare_High', 'Fare_Very_High']
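
Before saving, it can be worth asserting that the training and test sets share the same feature columns in the same order, apart from the target. The sketch below runs on toy frames; `check_feature_alignment` is a hypothetical helper introduced here, and the target name `Survived` is taken from the column list above.

```python
import pandas as pd

def check_feature_alignment(train: pd.DataFrame, test: pd.DataFrame,
                            target: str = 'Survived') -> None:
    """Raise if train (minus target) and test columns differ (hypothetical helper)."""
    train_features = [c for c in train.columns if c != target]
    test_features = list(test.columns)
    missing = set(train_features) - set(test_features)
    extra = set(test_features) - set(train_features)
    if missing or extra:
        raise ValueError(
            f"Missing in test: {sorted(missing)}; extra in test: {sorted(extra)}"
        )
    if train_features != test_features:
        raise ValueError("Same columns but in a different order")

# Toy frames standing in for train_df and test_df
toy_train = pd.DataFrame({'Survived': [0, 1], 'Pclass': [3, 1], 'IsAlone': [0, 1]})
toy_test = pd.DataFrame({'Pclass': [2], 'IsAlone': [1]})
check_feature_alignment(toy_train, toy_test)  # passes silently
print("Columns aligned")
```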


Save Engineered Features

In [31]:
train_df.to_csv('../data/processed/train_features.csv', index=False)
test_df.to_csv('../data/processed/test_features.csv', index=False)

print("Feature engineering complete!")
print(f"Saved: train_features.csv {train_df.shape}")
print(f"Saved: test_features.csv {test_df.shape}")

Feature engineering complete!
Saved: train_features.csv (891, 35)
Saved: test_features.csv (418, 34)
