# 🌟 Feature Engineering

Feature engineering is the process of transforming raw data into meaningful features that improve a machine learning model’s performance.

Features are the input variables that represent the problem in a form the model can use.

Good features often make a bigger difference than using a more complex algorithm.

## 🔍 Key Insights

Garbage in, garbage out → If your features are poor, even the best ML model won’t perform well.

Domain knowledge is gold → Understanding the problem domain helps create meaningful features.

Simplicity works → Sometimes a single engineered feature (like “age group” instead of exact age) improves accuracy more than adding many raw features.

Balance is key → Too many features = noise. Too few features = lack of signal.

In [3]:
import seaborn as sns
import numpy as np
import pandas as pd

In [4]:
sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

In [5]:
titanic = sns.load_dataset("titanic")

In [6]:
titanic

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [7]:
titanic.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
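Raw counts are easier to judge as a share of all rows. The sketch below reuses the non-zero counts printed above and assumes 891 rows (the count reported by `data.info()`); the variable names are illustrative only.

```python
import pandas as pd

# Non-zero missing-value counts copied from the output above.
missing_counts = pd.Series({"age": 177, "embarked": 2, "deck": 688, "embark_town": 2})
n_rows = 891  # row count reported by data.info()

# Share of rows with a missing value, largest first.
missing_share = (missing_counts / n_rows).sort_values(ascending=False)
print(missing_share.round(3))

# On the loaded frame the same shares come directly from: titanic.isnull().mean()
```

Roughly 77% of `deck` is missing, against about 20% of `age`, which is why `deck` is a candidate for dropping while `age` is worth imputing.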

In [33]:
# work on a copy so the original frame stays untouched
data = titanic.copy()

In [37]:
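# Drop columns that are mostly empty or that duplicate other columns:
# 'deck' has the most missing values (688 of 891 rows).
# 'who' and 'adult_male' are derived from 'sex' and 'age'.
# 'embark_town' is the full name of the port coded in 'embarked'.
# 'alive' repeats the target 'survived' as yes/no, so keeping it would leak the label.
# 'alone' can be recomputed from 'sibsp' and 'parch'.

data = data.drop(columns=['deck', 'who', 'adult_male', 'embark_town', 'alive', 'alone'])
data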

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class
0,0,3,male,22.0,1,0,7.2500,S,Third
1,1,1,female,38.0,1,0,71.2833,C,First
2,1,3,female,26.0,0,0,7.9250,S,Third
3,1,1,female,35.0,1,0,53.1000,S,First
4,0,3,male,35.0,0,0,8.0500,S,Third
...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second
887,1,1,female,19.0,0,0,30.0000,S,First
888,0,3,female,,1,2,23.4500,S,Third
889,1,1,male,26.0,0,0,30.0000,C,First
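Before dropping a column as a duplicate, it can help to confirm that it really maps one-to-one onto the column being kept. The sketch below is one way to check this, using only the nine rows visible in the earlier output; `is_one_to_one` is a hypothetical helper, not a pandas function, and a check on nine rows is only suggestive, so the same call should be run on the full frame.

```python
import pandas as pd

# The nine rows shown in the notebook output (a subset, not the full dataset).
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0, 0, 1, 0, 1],
    "embarked": ["S", "C", "S", "S", "S", "S", "S", "S", "C"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton", "Southampton",
                    "Southampton", "Southampton", "Southampton", "Southampton",
                    "Cherbourg"],
    "alive": ["no", "yes", "yes", "yes", "no", "no", "yes", "no", "yes"],
})

def is_one_to_one(df, a, b):
    """Return True if every value of column a pairs with exactly one value of b, and vice versa."""
    pairs = df[[a, b]].dropna().drop_duplicates()
    return pairs[a].is_unique and pairs[b].is_unique

print(is_one_to_one(sample, "embarked", "embark_town"))
print(is_one_to_one(sample, "survived", "alive"))
```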


### Tasks:

1. Handle missing values for age and embarked.
2. Create a new feature: family_size = sibsp + parch + 1.
3. Convert sex and embarked into numerical features using one-hot encoding.
4. Create a categorical feature for age_group (child, adult, senior).
5. Check the association of engineered features with survival using correlation or a chi-square test.

In [40]:
# 1. Handle missing values for age and embarked.
# check for null values

data.isnull().sum()

survived      0
pclass        0
sex           0
age         177
sibsp         0
parch         0
fare          0
embarked      2
class         0
dtype: int64

In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   survived  891 non-null    int64   
 1   pclass    891 non-null    int64   
 2   sex       891 non-null    object  
 3   age       714 non-null    float64 
 4   sibsp     891 non-null    int64   
 5   parch     891 non-null    int64   
 6   fare      891 non-null    float64 
 7   embarked  889 non-null    object  
 8   class     891 non-null    category
dtypes: category(1), float64(2), int64(4), object(2)
memory usage: 56.8+ KB


In [44]:
data.describe()

,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [46]:
# Fill missing ages with the median, which is less sensitive to outliers than the mean.

data['age'] = data['age'].fillna(data['age'].median())
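The median is a reasonable default here because it barely moves when extreme values are present, while the mean does. A small sketch of that effect, using the first five ages shown earlier and then adding the maximum age of 80 from `describe()` as an outlier:

```python
import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0])  # first five ages in the output
with_outlier = pd.concat([ages, pd.Series([80.0])], ignore_index=True)  # add the max age

# The mean shifts noticeably once the outlier is added; the median does not.
print("mean:  ", ages.mean(), "->", with_outlier.mean())
print("median:", ages.median(), "->", with_outlier.median())
```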

In [48]:
data.isnull().sum()

survived    0
pclass      0
sex         0
age         0
sibsp       0
parch       0
fare        0
embarked    2
class       0
dtype: int64

In [50]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   survived  891 non-null    int64   
 1   pclass    891 non-null    int64   
 2   sex       891 non-null    object  
 3   age       891 non-null    float64 
 4   sibsp     891 non-null    int64   
 5   parch     891 non-null    int64   
 6   fare      891 non-null    float64 
 7   embarked  889 non-null    object  
 8   class     891 non-null    category
dtypes: category(1), float64(2), int64(4), object(2)
memory usage: 56.8+ KB


In [52]:
data.describe()

,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.361582,0.523008,0.381594,32.204208
std,0.486592,0.836071,13.019697,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,22.0,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,35.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292
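Comparing the two `describe()` outputs, the standard deviation of age falls from about 14.53 to 13.02 after imputation, because 177 rows now share the same value. One common way to keep the information that a value was originally missing is to add an indicator column before filling. A minimal sketch on the nine visible rows; the column name `age_missing` is an assumption, and the median of this small sample differs from the full-data median of 28.0:

```python
import numpy as np
import pandas as pd

# Ages of the nine rows shown in the output; row 888 is missing.
sample = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0, 35.0, 27.0, 19.0, np.nan, 26.0]})

# Record which rows were missing before any value is filled in.
sample["age_missing"] = sample["age"].isnull().astype(int)

std_before = sample["age"].std()
sample["age"] = sample["age"].fillna(sample["age"].median())
std_after = sample["age"].std()

print(sample)
print("std before:", round(std_before, 3), "std after:", round(std_after, 3))
```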


In [54]:
data['embarked'] = data['embarked'].fillna(data['embarked'].mode()[0])

In [56]:
data['embarked']

0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: embarked, Length: 891, dtype: object

In [58]:
data['embarked'].isnull().sum()

0

In [64]:
data.isnull().sum()

survived    0
pclass      0
sex         0
age         0
sibsp       0
parch       0
fare        0
embarked    0
class       0
dtype: int64

In [66]:
# 2. Create a new feature 'family_size' = sibsp + parch + 1

data['family_size'] = data['sibsp'] + data['parch'] + 1

In [68]:
data

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,family_size
0,0,3,male,22.0,1,0,7.2500,S,Third,2
1,1,1,female,38.0,1,0,71.2833,C,First,2
2,1,3,female,26.0,0,0,7.9250,S,Third,1
3,1,1,female,35.0,1,0,53.1000,S,First,2
4,0,3,male,35.0,0,0,8.0500,S,Third,1
...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,1
887,1,1,female,19.0,0,0,30.0000,S,First,1
888,0,3,female,28.0,1,2,23.4500,S,Third,4
889,1,1,male,26.0,0,0,30.0000,C,First,1


In [70]:
data[['sibsp', 'parch', 'family_size']].tail()

,sibsp,parch,family_size
886,0,0,1
887,0,0,1
888,1,2,4
889,0,0,1
890,0,0,1
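The `alone` column that was dropped earlier appears to be recoverable from `family_size`: a passenger travelling alone has a family size of 1. The sketch below checks this on the nine rows visible in the original output; `is_alone` is an illustrative name, and the check would need to be repeated on the full frame to be conclusive.

```python
import pandas as pd

# sibsp, parch and alone for the nine rows shown in the first output.
sample = pd.DataFrame({
    "sibsp": [1, 1, 0, 1, 0, 0, 0, 1, 0],
    "parch": [0, 0, 0, 0, 0, 0, 0, 2, 0],
    "alone": [False, False, True, False, True, True, True, False, True],
})

sample["family_size"] = sample["sibsp"] + sample["parch"] + 1
sample["is_alone"] = sample["family_size"] == 1

# True if the derived flag matches the original column on every row.
print((sample["is_alone"] == sample["alone"]).all())
```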


In [72]:
# 3. Convert sex and embarked into numerical features using one-hot encoding.

In [74]:
data[['sex', 'embarked']]

,sex,embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S
...,...,...
886,male,S
887,female,S
888,female,S
889,male,C


In [31]:
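# Preview the full one-hot encoding without reassigning `data`,
# so that 'sex' and 'embarked' still exist for the next cell.
pd.get_dummies(data, columns=['sex', 'embarked'])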

,survived,pclass,age,sibsp,parch,fare,class,family_size,sex_female,sex_male,embarked_C,embarked_Q,embarked_S
0,0,3,22.0,1,0,7.2500,Third,2,False,True,False,False,True
1,1,1,38.0,1,0,71.2833,First,2,True,False,True,False,False
2,1,3,26.0,0,0,7.9250,Third,1,True,False,False,False,True
3,1,1,35.0,1,0,53.1000,First,2,True,False,False,False,True
4,0,3,35.0,0,0,8.0500,Third,1,False,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,27.0,0,0,13.0000,Second,1,False,True,False,False,True
887,1,1,19.0,0,0,30.0000,First,1,True,False,False,False,True
888,0,3,28.0,1,2,23.4500,Third,4,True,False,False,False,True
889,1,1,26.0,0,0,30.0000,First,1,False,True,True,False,False
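In the full encoding, `sex_female` and `sex_male` carry the same information, since each is the complement of the other. Passing `drop_first=True` removes one level per feature to avoid that redundancy. A small sketch on the nine visible rows; `dtype=int` is used here only to print 0/1 instead of True/False, and this sample happens to contain no `Q` port:

```python
import pandas as pd

# sex and embarked for the nine rows shown in the output.
sample = pd.DataFrame({
    "sex": ["male", "female", "female", "female", "male",
            "male", "female", "female", "male"],
    "embarked": ["S", "C", "S", "S", "S", "S", "S", "S", "C"],
})

full = pd.get_dummies(sample, columns=["sex", "embarked"], dtype=int)
reduced = pd.get_dummies(sample, columns=["sex", "embarked"], drop_first=True, dtype=int)

# Every row has exactly one of the two sex indicators set.
print((full["sex_female"] + full["sex_male"]).eq(1).all())
print(list(full.columns))
print(list(reduced.columns))
```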


In [78]:
# One-hot encode the categorical columns into binary indicator columns;
# drop_first=True drops one level per feature because it is implied by the others.

data = pd.get_dummies(data, columns=['sex', 'embarked'], drop_first=True)

In [80]:
data

,survived,pclass,age,sibsp,parch,fare,class,family_size,sex_male,embarked_Q,embarked_S
0,0,3,22.0,1,0,7.2500,Third,2,True,False,True
1,1,1,38.0,1,0,71.2833,First,2,False,False,False
2,1,3,26.0,0,0,7.9250,Third,1,False,False,True
3,1,1,35.0,1,0,53.1000,First,2,False,False,True
4,0,3,35.0,0,0,8.0500,Third,1,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,27.0,0,0,13.0000,Second,1,True,False,True
887,1,1,19.0,0,0,30.0000,First,1,False,False,True
888,0,3,28.0,1,2,23.4500,Third,4,False,False,True
889,1,1,26.0,0,0,30.0000,First,1,True,False,False
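Task 4 is not worked through in the cells above. One possible approach is `pd.cut`. The cut points below (under 18 is a child, 60 and over is a senior) are assumptions, since the notebook does not define them. The ages are the nine imputed values visible in the output plus the minimum (0.42) and maximum (80.0) from `describe()`.

```python
import numpy as np
import pandas as pd

ages = pd.Series(
    [22.0, 38.0, 26.0, 35.0, 35.0, 27.0, 19.0, 28.0, 26.0, 0.42, 80.0],
    name="age",
)

# Assumed boundaries: [0, 18) child, [18, 60) adult, [60, inf) senior.
age_group = pd.cut(
    ages,
    bins=[0, 18, 60, np.inf],
    right=False,  # intervals are closed on the left, so age 18 counts as adult
    labels=["child", "adult", "senior"],
)

print(pd.DataFrame({"age": ages, "age_group": age_group}))
```

On the notebook's frame this would be `data['age_group'] = pd.cut(data['age'], ...)` with the same arguments.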

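Task 5 is not worked through in the cells above. The sketch below shows the mechanics on the nine visible rows: a Pearson chi-square statistic for a categorical feature against `survived`, computed by hand from the contingency table, and a plain correlation for a numeric feature. `chi_square` is a hypothetical helper. With only nine rows the numbers mean nothing statistically, so the same calls should be run on the full frame. `scipy.stats.chi2_contingency` is the usual library route and also returns a p-value; note that it applies a continuity correction to 2x2 tables by default, so its statistic can differ from the uncorrected one here.

```python
import numpy as np
import pandas as pd

# The nine rows shown in the output (a subset, not the full dataset).
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0, 0, 1, 0, 1],
    "sex": ["male", "female", "female", "female", "male",
            "male", "female", "female", "male"],
    "family_size": [2, 2, 1, 2, 1, 1, 1, 4, 1],
})

def chi_square(df, feature, target):
    """Uncorrected Pearson chi-square statistic for two categorical columns."""
    observed = pd.crosstab(df[feature], df[target]).to_numpy()
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    # Expected counts under independence of feature and target.
    expected = np.outer(row_totals, col_totals) / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())

chi2_sex = chi_square(sample, "sex", "survived")
corr_family = sample["family_size"].corr(sample["survived"])

print("chi-square (sex vs survived):", round(chi2_sex, 4))
print("correlation (family_size vs survived):", round(corr_family, 4))
```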

### ⚡ Tips & Hacks

Handle Missing Values Smartly → Fill with mean/median/mode, or add a “missing” flag.

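Scale & Normalize → Standardize or min-max scale numerical features so that features with large ranges do not dominate distance-based or gradient-based models.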

Create Ratios → E.g., income/expenses, goals/shots in sports. Ratios often capture patterns better.

Binning → Convert continuous variables into categories (e.g., age → “teen”, “adult”, “senior”). For a dataset with “date of birth”, create new features like age, age group, or is_adult.

One-hot Encoding → Convert categorical variables into numeric dummy variables.

Feature Interactions → Multiply or combine features (e.g., BMI = weight / height²).

Date/Time Features → Extract day, month, weekday, hour, season from timestamps.

Text Features → Use word counts, TF-IDF, or embeddings.

Feature Selection → Remove low-variance or irrelevant features to reduce noise.
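A few of the tips above can be sketched together: standardization, a combined feature (BMI), and date/time extraction. The records and column names below are hypothetical toy values invented purely for illustration; they do not come from the Titanic data.

```python
import pandas as pd

# Hypothetical toy records, invented for illustration only.
df = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.82],
    "weight_kg": [55.0, 70.0, 90.0],
    "signup": pd.to_datetime(["2024-01-15 08:30", "2024-06-01 17:45", "2024-12-25 23:10"]),
})

# Scale: z-score standardization (mean 0, population standard deviation 1).
df["weight_z"] = (df["weight_kg"] - df["weight_kg"].mean()) / df["weight_kg"].std(ddof=0)

# Combine features: BMI = weight / height squared.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Date/time features extracted from a timestamp column.
df["month"] = df["signup"].dt.month
df["weekday"] = df["signup"].dt.dayofweek  # Monday = 0
df["hour"] = df["signup"].dt.hour

print(df)
```

In a modelling workflow the scaling statistics would normally be computed on the training split only and then reused on the test split.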