Feature Engineering

Notebook Purpose
Transform preprocessed data into model-ready features by extracting new information from existing columns.

Input
- `train_preprocessed.csv` - Cleaned training data from the preprocessing step
- `test_preprocessed.csv` - Cleaned test data from the preprocessing step

Output
- `train_features.csv` - Training data with engineered features
- `test_features.csv` - Test data with engineered features

Features to Engineer
1. **Title** - Extract social title from passenger names
2. **FamilySize** - Total family members aboard
3. **IsAlone** - Binary flag for solo travelers
4. **AgeBin** - Categorical age groups
5. **FareBin** - Categorical fare groups
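
As a rough illustration of features 2 to 5, the sketch below derives them on a small toy frame. The column names `SibSp`, `Parch`, `Age`, and `Fare` come from the preprocessed data. The rest is assumed for illustration and may differ from the notebook's final choices: the `+ 1` that counts the passenger in `FamilySize`, the age bin edges and labels, and the use of quantile-based fare groups.

```python
import pandas as pd

# Toy rows using column names from the preprocessed data (values are illustrative).
df = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
    "Age": [22.0, 38.0, 4.0],
    "Fare": [7.25, 71.2833, 21.075],
})

# Assumption: family size = siblings/spouses + parents/children + the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Assumption: placeholder age edges and labels, not the notebook's final bins.
age_edges = [0, 12, 18, 35, 60, 120]
age_labels = ["Child", "Teen", "YoungAdult", "Adult", "Senior"]
df["AgeBin"] = pd.cut(df["Age"], bins=age_edges, labels=age_labels)

# Assumption: quantile-based fare groups; q=3 only because the toy frame has three rows.
df["FareBin"] = pd.qcut(df["Fare"], q=3, labels=["Low", "Mid", "High"])

print(df)
```

Fixed edges with `pd.cut` keep the age groups interpretable, while `pd.qcut` gives roughly equal-sized fare groups despite the skewed fare distribution. Either choice is a judgment call.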

In [2]:
# Initial Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Set Up Visualization Options

In [3]:
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

In [5]:
# Load the preprocessed data
train_df = pd.read_csv('../data/processed/train_preprocessed.csv')
test_df = pd.read_csv('../data/processed/test_preprocessed.csv')

print(f"Training set: {train_df.shape}")
print(f"Test set: {test_df.shape}")

print(train_df.head())

Training set: (891, 20)
Test set: (418, 19)
   Survived  Pclass                                               Name   Age  \
0         0       3                            Braund, Mr. Owen Harris  22.0   
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0   
2         1       3                             Heikkinen, Miss. Laina  26.0   
3         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0   
4         0       3                           Allen, Mr. William Henry  35.0   

   SibSp  Parch     Fare  Sex_encoded  Embarked_C  Embarked_Q  Embarked_S  \
0      1      0   7.2500            0       False       False        True   
1      1      0  71.2833            1        True       False       False   
2      0      0   7.9250            1       False       False        True   
3      1      0  53.1000            1       False       False        True   
4      0      0   8.0500            0       False       False        True   

   Deck_A  D

Feature 1: Title Extraction

Why This Feature Matters
Passenger names contain titles (Mr., Mrs., Miss., Master., etc.) that encode:
- Gender (redundant with `Sex_encoded`, but a useful check on that encoding)
- Marital status (Mrs. vs Miss.)
- Age indicators (Master. = young boys)
- Social status (Dr., Rev., military ranks)
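
To make the list above concrete, here is a minimal sketch that pulls titles from a few names shown in this notebook and pairs each with what it loosely implies. The `title_hints` dictionary is a hypothetical lookup written for this illustration, and the regular expression assumes names follow the `Surname, Title. Given names` pattern.

```python
import pandas as pd

# Sample names taken from the training data preview.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Nasser, Mrs. Nicholas (Adele Achem)",
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical lookup: rough hints only, not part of the notebook's pipeline.
title_hints = {
    "Mr": "adult male",
    "Mrs": "married female",
    "Miss": "unmarried female",
    "Master": "young boy",
}

summary = pd.DataFrame({"Title": titles, "Implies": titles.map(title_hints)})
print(summary)
```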

In [6]:
train_df['Name'].head(10)

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
5                                     Moran, Mr. James
6                              McCarthy, Mr. Timothy J
7                       Palsson, Master. Gosta Leonard
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                  Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object

In [None]:
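import re
def extract_title(name):
    """
    Assumed sketch: return the title between the comma and the first period,
    e.g. 'Braund, Mr. Owen Harris' -> 'Mr'; None if the pattern is absent.
    """
    match = re.search(r',\s*([^.]+)\.', name)
    return match.group(1).strip() if match else None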