# Day 4 - Feature Engineering 
### Machine Learning Roadmap — Week 1
### Author: N Manish Kumar

---

## 1. Introduction
Today we begin one of the most important stages of the ML pipeline:
Feature Engineering.

In Day 2, we cleaned the Titanic dataset by handling missing values, encoding basic columns, fixing data types, and preparing a consistent structure.
In Day 3, we performed EDA to understand patterns, trends, and relationships between features and survival rates.

Now, in Day 4, we will use those insights to transform the cleaned dataset into a machine-learning–ready dataset.

In this notebook, I will:
- Load the cleaned dataset
- Encode categorical values
- Engineer new, meaningful features
- Drop irrelevant or redundant columns
- Finalize the ML-ready dataset
- Create a train-test split
- Save the transformed dataset

---

## 2. Load Cleaned Dataset

In [1]:
import numpy as np
import pandas as pd

df= pd.read_csv("../Day2_Pandas/Data/titanic_cleaned.csv")
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S,FamilySize
0,0,3,0,22.0,1,0,7.25,0,1,2
1,1,1,1,38.0,1,0,71.2833,0,0,2
2,1,3,1,26.0,0,0,7.925,0,1,1
3,1,1,1,35.0,1,0,53.1,0,1,2
4,0,3,0,35.0,0,0,8.05,0,1,1


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Sex         891 non-null    int64  
 3   Age         891 non-null    float64
 4   SibSp       891 non-null    int64  
 5   Parch       891 non-null    int64  
 6   Fare        891 non-null    float64
 7   Embarked_Q  891 non-null    int64  
 8   Embarked_S  891 non-null    int64  
 9   FamilySize  891 non-null    int64  
dtypes: float64(2), int64(8)
memory usage: 69.7 KB


---

## 3. Encode Categorical Values

Categorical values that may be useful for training the model need to be encoded as numbers.

In the Titanic dataset, Sex and Embarked are categorical, and Age becomes categorical once it is grouped into ranges. All three may be useful, so we encode them.

I already encoded the Sex and Embarked columns on Day 2.

I created and encoded the AgeGroup column on Day 3, but since I have loaded only the Day 2 dataset here, those changes haven't carried over. So I have to create the AgeGroup column again and encode it.

In [3]:
df['AgeGroup']= pd.cut(df['Age'],bins=[0,12,18,35,60,100],labels=['Child','Teen','Adult','Middle-aged','Senior'])
df= pd.get_dummies(df,columns=['AgeGroup'],drop_first=True)
df

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_Q,Embarked_S,FamilySize,AgeGroup_Teen,AgeGroup_Adult,AgeGroup_Middle-aged,AgeGroup_Senior
0,0,3,0,22.0,1,0,7.2500,0,1,2,False,True,False,False
1,1,1,1,38.0,1,0,71.2833,0,0,2,False,False,True,False
2,1,3,1,26.0,0,0,7.9250,0,1,1,False,True,False,False
3,1,1,1,35.0,1,0,53.1000,0,1,2,False,True,False,False
4,0,3,0,35.0,0,0,8.0500,0,1,1,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,0,27.0,0,0,13.0000,0,1,1,False,True,False,False
887,1,1,1,19.0,0,0,30.0000,0,1,1,False,True,False,False
888,0,3,1,28.0,1,2,23.4500,0,1,4,False,True,False,False
889,1,1,0,26.0,0,0,30.0000,0,0,1,False,True,False,False
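
To see what `drop_first=True` does, here is a minimal sketch on a handful of made-up ages (hypothetical values, not rows from the Titanic file). The first category, `Child`, gets no column of its own, so a row of all zeros stands for it. Passing `dtype=int` is an optional extra that I am assuming here to get 0/1 values instead of True/False.

```python
import pandas as pd

# Hypothetical ages, chosen only so that every bin is hit once
toy = pd.DataFrame({"Age": [5, 15, 30, 50, 70]})

toy["AgeGroup"] = pd.cut(
    toy["Age"],
    bins=[0, 12, 18, 35, 60, 100],
    labels=["Child", "Teen", "Adult", "Middle-aged", "Senior"],
)

# drop_first=True removes the column for the first category (Child),
# because it can be inferred from the other four being 0
dummies = pd.get_dummies(toy, columns=["AgeGroup"], drop_first=True, dtype=int)
print(dummies)
```

Dropping one level avoids a perfectly redundant column, which matters mostly for linear models.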


---

## 4. Feature Engineering
Feature Engineering = creating or transforming features to better represent the underlying patterns in the data.

The goal is to make survival patterns clearer for ML models.

Let’s go through each engineered feature, one by one.

---

### 4.1. Log Transform Fare
Why do we do this?
- Fare is extremely right-skewed (long tail)
- Many models, linear ones in particular, tend to work better when a feature's distribution is less skewed
- It keeps a few very large fares from dominating the model
- It can help linear models (like Logistic Regression) converge faster

In [4]:
# Converts using log(1+x)
df['LogFare']=np.log1p(df['Fare'])
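
The effect of the transform can be checked by comparing skewness before and after. The sketch below uses a few hypothetical fares (not the full Titanic column) with one very large value to mimic the long tail.

```python
import numpy as np
import pandas as pd

# Hypothetical fares: mostly small values plus one very large one
fares = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 31.0, 512.33])

# log1p computes log(1 + x), so a fare of 0 maps to 0 instead of -inf
log_fares = np.log1p(fares)

skew_before = fares.skew()
skew_after = log_fares.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Running the same two `.skew()` calls on `df['Fare']` and `df['LogFare']` would show the effect on the real data.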

--- 

### 4.2. Create Age Bins (AgeBin) From Continuous Age
Although Age is useful as a continuous variable, survival rates are not linear with Age.

EDA showed:

- Children survived more
- Young adults had moderate survival
- Seniors had very low survival

So we convert Age into categories (bins):

- 0–12     → Child (bin 0)
- 13–18    → Teen (bin 1)
- 19–35    → Adult (bin 2)
- 36–60    → Middle-aged (bin 3)
- 60+      → Senior (bin 4)

This captures non-linear patterns in a simple numeric form.

In [13]:
df['AgeBin']= pd.cut(df['Age'],bins=[0,12,18,35,60,100],labels=[0,1,2,3,4])
df['AgeBin']= df['AgeBin'].astype(int)
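
One detail worth knowing is how `pd.cut` treats values that sit exactly on a bin edge. The sketch below uses hypothetical ages placed on and around the edges. With the default `right=True`, each bin includes its right edge.

```python
import pandas as pd

# Hypothetical ages placed on and around the bin edges
ages = pd.Series([0.42, 12, 12.5, 18, 35, 36, 60, 61])

# Bins are (0, 12], (12, 18], (18, 35], (35, 60], (60, 100]
age_bin = pd.cut(ages, bins=[0, 12, 18, 35, 60, 100], labels=[0, 1, 2, 3, 4])

# An age of exactly 0, or above 100, falls outside every bin and becomes NaN,
# so it is worth checking for gaps before converting to int
assert age_bin.notna().all()
age_bin = age_bin.astype(int)

print(age_bin.tolist())  # [0, 0, 1, 1, 2, 3, 3, 4]
```

So an age of 12 is a Child and 12.5 is a Teen. The bin edges are a modelling choice and can be adjusted.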

---

### 4.3. Create IsAlone Feature

Why do this?

EDA showed:
- Passengers traveling alone had lower survival
- Medium family sizes (2–4) had the highest survival

So we create a binary feature that flags passengers traveling alone:

- IsAlone = 1 → traveling alone
- IsAlone = 0 → traveling with family


This is a commonly used feature for the Titanic dataset. Whether it improves a given model should be checked during evaluation.

In [7]:
df['IsAlone']= (df['FamilySize']==1).astype(int)
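
As a small self-contained illustration of how FamilySize and IsAlone relate, here is a sketch on four hypothetical passengers (made-up values, not rows from the dataset).

```python
import pandas as pd

# Hypothetical passengers, not rows from the Titanic file
toy = pd.DataFrame({"SibSp": [0, 1, 0, 3], "Parch": [0, 0, 2, 1]})

# FamilySize counts the passenger plus siblings/spouses and parents/children
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# The comparison yields booleans; astype(int) turns them into 0/1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
print(toy)
```

Only the first passenger, with no relatives aboard, gets `IsAlone = 1`.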

---

### 4.4. Checking out all new Features

In [8]:
df[['Fare','LogFare','Age','AgeBin','FamilySize','IsAlone']].head()

Unnamed: 0,Fare,LogFare,Age,AgeBin,FamilySize,IsAlone
0,7.25,2.110213,22.0,2,2,0
1,71.2833,4.280593,38.0,3,2,0
2,7.925,2.188856,26.0,2,1,1
3,53.1,3.990834,35.0,2,2,0
4,8.05,2.202765,35.0,2,1,1


#### Why Feature Engineering Improves the Model

- Raw features often hide important patterns

- Log transformation reduces the impact of extreme values

- Age binning captures non-linear survival behavior

- IsAlone explains social/behavioral survival patterns

- Together, these features can make ML models more accurate and easier to interpret

Feature engineering helps models learn the real story behind the data.

---

## 5. Drop Irrelevant or Redundant Columns

After feature engineering, the dataset contains a mix of:

- Original raw columns

- Encoded categorical features

- Engineered features

Some of these columns should NOT be used for training an ML model.

We drop them for good reasons, not randomly.

### 5.1. Dropping 'SibSp' and 'Parch'
I have already created a 'FamilySize' column, defined as 'FamilySize' = 'SibSp' + 'Parch' + 1, so 'SibSp' and 'Parch' are redundant and I drop them.

In [10]:
df.drop(columns=['SibSp','Parch'],inplace=True)

Although we engineered new features (AgeBin, LogFare), we keep the original Age and Fare columns as well.
This gives the model more flexibility and allows both linear and tree-based algorithms to capture useful patterns.
Later, after evaluating feature importance and correlation, we may drop one version of these features to reduce redundancy or multicollinearity.
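
One way to spot such redundancy later is to list the feature pairs whose absolute correlation is above some cutoff. The helper below is a hypothetical sketch (the name `correlated_pairs` and the 0.8 threshold are my assumptions, not part of this notebook), shown on made-up data.

```python
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return (col_a, col_b, |r|) for column pairs whose absolute
    Pearson correlation exceeds the threshold."""
    corr = frame.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    # Walk only the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

# Hypothetical data: Age and its binned copy are strongly related, Noise is not
toy = pd.DataFrame({
    "Age": [5, 15, 30, 50, 70, 25],
    "Noise": [3, 1, 3, 1, 2, 2],
})
toy["AgeBin"] = pd.cut(
    toy["Age"], bins=[0, 12, 18, 35, 60, 100], labels=[0, 1, 2, 3, 4]
).astype(int)

pairs = correlated_pairs(toy)
print(pairs)
```

Calling `correlated_pairs(df.drop(columns=['Survived']))` on the real dataset would be one way to decide which version of a feature to keep.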

---

## 6. Final Dataset Verification

After all feature engineering and dropping irrelevant columns, we must verify that:
1) The dataset contains only valid ML features

2) All columns are numeric

3) No categorical/object columns remain

4) No missing values exist

5) Feature shapes and types are correct

6) Engineered features appear correctly

This is a crucial step before training models.
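
The first few checks in this list can also be bundled into one reusable function. This is a minimal sketch; the name `check_ml_ready` is hypothetical, and it covers only the dtype and missing-value checks.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_ml_ready(frame: pd.DataFrame, target: str = "Survived") -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if target not in frame.columns:
        problems.append(f"missing target column: {target}")
    for col in frame.columns:
        # is_numeric_dtype is True for int, float and bool columns
        if not is_numeric_dtype(frame[col]):
            problems.append(f"non-numeric column: {col}")
    missing = frame.isnull().sum()
    for col, count in missing[missing > 0].items():
        problems.append(f"{count} missing value(s) in {col}")
    return problems

# Hypothetical frame with one text column and one gap, so both checks fire
toy = pd.DataFrame({"Survived": [0, 1], "Name": ["a", "b"], "Age": [22.0, None]})
problems = check_ml_ready(toy)
print(problems)
```

An empty list from `check_ml_ready(df)` would mean the dataset passes these checks.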

### 6.1. Check Column Types

In [17]:
# Check if all columns are in either int, float or bool type
df.dtypes

Survived                  int64
Pclass                    int64
Sex                       int64
Age                     float64
Fare                    float64
Embarked_Q                int64
Embarked_S                int64
FamilySize                int64
AgeGroup_Teen              bool
AgeGroup_Adult             bool
AgeGroup_Middle-aged       bool
AgeGroup_Senior            bool
LogFare                 float64
AgeBin                    int64
IsAlone                   int64
dtype: object

In [18]:
# Check for 0 missing values
df.isnull().sum()

Survived                0
Pclass                  0
Sex                     0
Age                     0
Fare                    0
Embarked_Q              0
Embarked_S              0
FamilySize              0
AgeGroup_Teen           0
AgeGroup_Adult          0
AgeGroup_Middle-aged    0
AgeGroup_Senior         0
LogFare                 0
AgeBin                  0
IsAlone                 0
dtype: int64

In [19]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked_Q,Embarked_S,FamilySize,AgeGroup_Teen,AgeGroup_Adult,AgeGroup_Middle-aged,AgeGroup_Senior,LogFare,AgeBin,IsAlone
0,0,3,0,22.0,7.25,0,1,2,False,True,False,False,2.110213,2,0
1,1,1,1,38.0,71.2833,0,0,2,False,False,True,False,4.280593,3,0
2,1,3,1,26.0,7.925,0,1,1,False,True,False,False,2.188856,2,1
3,1,1,1,35.0,53.1,0,1,2,False,True,False,False,3.990834,2,0
4,0,3,0,35.0,8.05,0,1,1,False,True,False,False,2.202765,2,1


In [20]:
# Statistical Summary
df.describe()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked_Q,Embarked_S,FamilySize,LogFare,AgeBin,IsAlone
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.352413,29.361582,32.204208,0.08642,0.725028,1.904602,2.962246,2.034792,0.602694
std,0.486592,0.836071,0.47799,13.019697,49.693429,0.281141,0.446751,1.613459,0.969048,0.839958,0.489615
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0,1.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,22.0,7.9104,0.0,0.0,1.0,2.187218,2.0,0.0
50%,0.0,3.0,0.0,28.0,14.4542,0.0,1.0,1.0,2.737881,2.0,1.0
75%,1.0,3.0,1.0,35.0,31.0,0.0,1.0,2.0,3.465736,2.0,1.0
max,1.0,3.0,1.0,80.0,512.3292,1.0,1.0,11.0,6.240917,4.0,1.0


In [21]:
df.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked_Q', 'Embarked_S',
       'FamilySize', 'AgeGroup_Teen', 'AgeGroup_Adult', 'AgeGroup_Middle-aged',
       'AgeGroup_Senior', 'LogFare', 'AgeBin', 'IsAlone'],
      dtype='object')

In [22]:
df.shape

(891, 15)

In [23]:
df.corr(numeric_only=True)

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked_Q,Embarked_S,FamilySize,AgeGroup_Teen,AgeGroup_Adult,AgeGroup_Middle-aged,AgeGroup_Senior,LogFare,AgeBin,IsAlone
Survived,1.0,-0.338481,0.543351,-0.06491,0.257307,0.00365,-0.149683,0.016639,0.026859,-0.077053,0.01759,-0.051224,0.329862,-0.093191,-0.203367
Pclass,-0.338481,1.0,-0.1319,-0.339898,-0.5495,0.221009,0.074053,0.065997,0.061877,0.199842,-0.299461,-0.136667,-0.661022,-0.290501,0.135207
Sex,0.543351,-0.1319,1.0,-0.081163,0.182333,0.074115,-0.119224,0.200988,0.098941,-0.074542,0.00727,-0.071958,0.263276,-0.097739,-0.303646
Age,-0.06491,-0.339898,-0.081163,1.0,0.096688,-0.031415,-0.006729,-0.245619,-0.286849,-0.216849,0.62925,0.448281,0.110964,0.916255,0.171647
Fare,0.257307,-0.5495,0.182333,0.096688,1.0,-0.117216,-0.162184,0.217138,0.007332,-0.119664,0.128482,0.029368,0.787543,0.074269,-0.271832
Embarked_Q,0.00365,0.221009,0.074115,-0.031415,-0.117216,1.0,-0.499421,-0.058592,-0.030424,0.128565,-0.114494,0.002542,-0.160456,-0.027021,0.086464
Embarked_S,-0.149683,0.074053,-0.119224,-0.006729,-0.162184,-0.499421,1.0,0.077359,-0.016368,-0.055886,0.046322,0.016998,-0.128846,0.010552,0.029074
FamilySize,0.016639,0.065997,0.200988,-0.245619,0.217138,-0.058592,0.077359,1.0,0.040556,-0.22588,-0.015819,-0.048892,0.383658,-0.31011,-0.690922
AgeGroup_Teen,0.026859,0.061877,0.098941,-0.286849,0.007332,-0.030424,-0.016368,0.040556,1.0,-0.357956,-0.154558,-0.04646,-0.00628,-0.359929,-0.069803
AgeGroup_Adult,-0.077053,0.199842,-0.074542,-0.216849,-0.119664,0.128565,-0.055886,-0.22588,-0.357956,1.0,-0.64888,-0.195053,-0.220773,-0.050807,0.255478


In [25]:
# Check for Duplicate Columns
df.T.duplicated().sum()

np.int64(0)

#### Final Dataset Verification

Before moving to the modeling stage, it is essential to verify that the dataset is fully prepared for machine learning.
In this step, we check:

- Column data types
- Missing values
- Feature distributions
- Engineered feature correctness
- Removal of irrelevant columns
- Dataset shape and quality

A clean and fully numeric dataset ensures that ML models train smoothly and produce meaningful results.

---

## 7. Train-Test Split
Goal: Split the dataset into X (features) and y (target), and then further into training and testing data.
This lets us evaluate the model's performance on data it has not seen during training.

---

### 7.1. Select Features (X) and Target (y)

Target Column : 'Survived'
Everything else is a feature.

In [26]:
y= df['Survived']
X = df.drop(columns=['Survived'])

--- 

### 7.2. Perform Train-Test Split

We'll use an 80-20 split:
- 80% → training
- 20% → testing

This is a common choice for small datasets like Titanic.

In [27]:
from sklearn.model_selection import train_test_split

# stratify=y keeps the ratio of survivors to non-survivors the same in both splits as in the original dataset.
# train_test_split returns the pieces in the order X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

Why random_state=42?
- Ensures reproducibility
- Same split every time you run the notebook
- 42 is just a popular convention; any fixed integer works

---

### 7.3. Check Shapes to confirm split

In [28]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((712, 14), (179, 14), (712,), (179,))

- 712 training rows

- 179 testing rows
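
Beyond the shapes, it is worth confirming that stratification kept the class balance. The sketch below uses hypothetical stand-in data with the same size as this dataset and 342 positives (891 × 0.383838, the mean of Survived reported by `describe()`). The single feature is random noise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 891 rows, 342 positives, one random feature
rng = np.random.default_rng(0)
X = pd.DataFrame({"feature": rng.normal(size=891)})
y = pd.Series([1] * 342 + [0] * 549, name="Survived")

# Return order is X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(f"positive rate: all={y.mean():.3f} "
      f"train={y_train.mean():.3f} test={y_test.mean():.3f}")
```

With `stratify=y`, the three rates should be nearly identical. Without it, they can drift apart on a dataset this small.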

#### Why Do We Split the Dataset?

Machine learning models must be evaluated on data they have never seen.
If the same data is used for training and testing:

- The model will memorize patterns

- Accuracy will be artificially high

- Evaluation becomes meaningless

A train-test split ensures:

- The model learns on one dataset (training)

- And is judged on another (testing)

- This tells us how well the model generalizes.
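
The memorization problem can be shown with a small experiment. This is only an illustrative sketch: the data is random (so there is nothing real to learn), and the decision tree is my choice for the demonstration, not a model used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: random features and random labels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unrestricted tree can memorize every training row
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training accuracy looks perfect even though the labels are pure noise. Only the held-out test score reveals that the model has learned nothing that generalizes.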

## 8. Saving the Transformed Dataset

In [30]:
df.to_csv("Data/titanic_processed.csv",index=False)

### Why index=False?

- You do NOT want row numbers saved as a column

- ML models don’t need them

- Keeping the saved file clean and minimal is best practice
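
To make the difference concrete, here is a small round trip through an in-memory buffer (a hypothetical two-row frame, with `io.StringIO` standing in for a file on disk).

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the processed dataset
toy = pd.DataFrame({"Survived": [0, 1], "IsAlone": [0, 1]})

# Default (index=True): the row labels are written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'Survived', 'IsAlone']
print(cols_without)  # ['Survived', 'IsAlone']
```

The stray `Unnamed: 0` column would otherwise be picked up as a feature when everything except `Survived` is selected as X.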

In [33]:
# Check if file was saved
import os 
os.listdir("Data")

['titanic_processed.csv']

In [35]:
# Reload the saved file to confirm it was written correctly
test_df= pd.read_csv("Data/titanic_processed.csv")
test_df

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked_Q,Embarked_S,FamilySize,AgeGroup_Teen,AgeGroup_Adult,AgeGroup_Middle-aged,AgeGroup_Senior,LogFare,AgeBin,IsAlone
0,0,3,0,22.0,7.2500,0,1,2,False,True,False,False,2.110213,2,0
1,1,1,1,38.0,71.2833,0,0,2,False,False,True,False,4.280593,3,0
2,1,3,1,26.0,7.9250,0,1,1,False,True,False,False,2.188856,2,1
3,1,1,1,35.0,53.1000,0,1,2,False,True,False,False,3.990834,2,0
4,0,3,0,35.0,8.0500,0,1,1,False,True,False,False,2.202765,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,0,27.0,13.0000,0,1,1,False,True,False,False,2.639057,2,1
887,1,1,1,19.0,30.0000,0,1,1,False,True,False,False,3.433987,2,1
888,0,3,1,28.0,23.4500,0,1,4,False,True,False,False,3.196630,2,0
889,1,1,0,26.0,30.0000,0,0,1,False,True,False,False,3.433987,2,1


# Day 4 — Feature Engineering Summary

In this session, we transformed the cleaned Titanic dataset into a fully machine-learning–ready dataset. This involved encoding categorical variables, engineering meaningful new features, and removing unnecessary columns that do not contribute to prediction quality.

Key accomplishments of Day 4:

#### 1) Loaded the cleaned dataset

We began by loading titanic_cleaned.csv from Day 2 and verified its structure and data types.

#### 2) Reconstructed and encoded AgeGroup

Since AgeGroup was created during EDA in Day 3 but not saved, we recreated it and applied one-hot encoding to convert it into numeric dummy variables.

#### 3) Engineered new features

We added several features intended to improve model performance:

- LogFare → log-transformed version of Fare to reduce skewness
- AgeBin → grouped Age into meaningful demographic ranges
- IsAlone → indicated whether a passenger travelled alone

These features capture important patterns related to survival.

#### 4) Dropped redundant columns

We removed: SibSp and Parch

Their information is already captured by FamilySize.

#### 5) Fully validated the dataset

We checked:

- Column types
- Missing values
- Feature distributions
- Dataset shape
- That only numeric features remain

Our dataset is now fully clean and ML-ready.

#### 6) Created a train–test split

We separated the data into training and testing sets to prepare for model evaluation in Day 5.

#### 7) Saved the processed dataset

This ensures reproducibility and allows us to start modeling immediately tomorrow.