feature_engineering.ipynb

In [2]:
import pandas as pd
import numpy as np

In [3]:
df=pd.read_csv("titanic_dataset.csv")

In [4]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [5]:
df['Age'] = df['Age'].fillna(df['Age'].median())

In [7]:
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])


In [8]:
df.drop(columns=['Cabin'],inplace=True)

In [9]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           1
Embarked       0
dtype: int64

In [10]:
df['Fare'] = df['Fare'].fillna(df['Fare'].mode()[0])


In [11]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
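
The imputation cells above can be sketched on a toy frame. This is a minimal illustration, not the notebook's data: the `toy` frame and its values are made up, and Fare is filled with the median here (the notebook uses the mode) because a continuous column often has no repeated value. The point is the assignment pattern, which avoids calling `fillna(..., inplace=True)` on a column selection.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

# Assign the filled Series back to the column instead of calling
# fillna(..., inplace=True) on the intermediate toy[col] object.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())

# Equivalent single call: pass a {column: fill value} mapping to DataFrame.fillna.
# toy = toy.fillna({"Age": toy["Age"].median(), "Fare": toy["Fare"].median()})

missing_after = toy.isnull().sum()
print(missing_after)
```

Computing the fill value before assigning keeps the statistic based only on the observed (non-missing) rows, since `median()` skips NaN by default.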

In [13]:
df['FamilySize']=df['SibSp']+df['Parch']+1

In [14]:
df[['SibSp','Parch','FamilySize']].head()

   SibSp  Parch  FamilySize
0      0      0           1
1      1      0           2
2      0      0           1
3      0      0           1
4      1      1           3


In [15]:
df['IsAlone']=np.where(df['FamilySize']==1,1,0)

In [16]:
df[['FamilySize','IsAlone']].head(10)

   FamilySize  IsAlone
0           1        1
1           2        0
2           1        1
3           1        1
4           3        0
5           1        1
6           1        1
7           3        0
8           1        1
9           3        0


Why FamilySize matters:

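Passengers traveling with small families may have had higher survival chances because family members could help each other during evacuation.

Very large families may have faced coordination problems, which would reduce survival probability.

FamilySize combines SibSp and Parch (plus the passenger) into one count, so the model sees total group size directly instead of inferring it from two separate columns.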

Why IsAlone matters for survival:

Passengers traveling alone may have been less likely to receive help during the emergency.

Passengers with family members aboard may have had better access to shared information and assistance.

IsAlone is a simple binary proxy for social support, which may influence survival outcomes.
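
The reasoning above is a hypothesis until it is checked against the data. A minimal sketch of such a check, assuming a frame with `SibSp`, `Parch`, and a 0/1 `Survived` column; the `toy` rows below are hypothetical and say nothing about the real survival rates.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; a real check would use the notebook's df.
toy = pd.DataFrame({
    "SibSp":    [0, 1, 0, 1, 4, 0],
    "Parch":    [0, 0, 0, 1, 2, 0],
    "Survived": [0, 1, 0, 1, 0, 1],
})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = np.where(toy["FamilySize"] == 1, 1, 0)

# The mean of a 0/1 target within a group is that group's survival rate.
rate_by_size = toy.groupby("FamilySize")["Survived"].mean()
rate_by_alone = toy.groupby("IsAlone")["Survived"].mean()

print(rate_by_size)
print(rate_by_alone)
```

If the rates barely differ between groups on the real data, the engineered features may add little beyond SibSp and Parch.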


Most machine learning models require numeric input.
We therefore convert categorical columns to numbers.

In [17]:
df.select_dtypes(include='object').columns

Index(['Name', 'Sex', 'Ticket', 'Embarked'], dtype='object')

In [18]:
df.drop(columns=['Name','Ticket'],inplace=True)

Why drop?

High cardinality

No direct survival signal in their raw form

Hard to encode meaningfully
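
The "high cardinality" point can be quantified as the share of distinct values per column. A minimal sketch on made-up rows; the 0.9 threshold is an arbitrary assumption chosen for illustration, not a rule from the notebook.

```python
import pandas as pd

# Hypothetical rows shaped like the Titanic text columns (names are made up).
toy = pd.DataFrame({
    "Name":     ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Mr. Alan", "Low, Miss. Ann"],
    "Ticket":   ["A100", "B200", "C300", "D400"],
    "Sex":      ["male", "female", "male", "female"],
    "Embarked": ["Q", "S", "Q", "S"],
})

# Fraction of distinct values per column: 1.0 means every row is unique.
cardinality = toy.nunique() / len(toy)

# Assumed cutoff: treat columns with more than 90% unique values as high cardinality.
high_card = cardinality[cardinality > 0.9].index.tolist()

print(cardinality)
print(high_card)
```

Columns flagged this way would produce nearly one dummy column per row if one-hot encoded, which is why dropping them (or extracting a coarser feature first) is a common choice.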

In [19]:
df['Sex']=df['Sex'].map({'male':0,'female':1})

Why label encoding?

Binary category

Preserves meaning

Efficient
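
One caveat with `map`: any value that is not a key in the mapping becomes NaN without raising an error. A minimal sketch of a guard, using a made-up Series with a deliberately miscased value.

```python
import pandas as pd

# Made-up values; the last one has unexpected casing.
sex = pd.Series(["male", "female", "female", "Male"])
mapping = {"male": 0, "female": 1}

# Series.map returns NaN for values missing from the mapping.
encoded = sex.map(mapping)
unmapped = sex[encoded.isna()].tolist()
print(unmapped)

# Normalizing case before mapping removes the gap in this toy example.
cleaned = sex.str.lower().map(mapping)
print(cleaned.tolist())
```

Checking `encoded.isna().sum()` right after the mapping is a cheap way to confirm that no new missing values were introduced.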

In [20]:
df=pd.get_dummies(df,columns=['Embarked'],drop_first=True)

Why one-hot?

No ordinal relationship

Prevents false ranking
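
A minimal sketch of what `drop_first=True` does, on a made-up column with the three ports C, Q, and S. The `dtype=int` argument is an optional addition here (the notebook keeps the default boolean output).

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["Q", "S", "C", "S"]})

# drop_first=True removes the first category in sorted order ("C").
# "C" becomes the baseline: a row with both indicators at 0 embarked at C.
# dtype=int yields 0/1 integer columns instead of booleans.
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)
print(encoded)
```

Dropping one category avoids a redundant column, since the dropped port is fully determined by the other two indicators.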

In [21]:
df['Pclass']=df['Pclass'].astype(int)

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int32  
 3   Sex          418 non-null    int64  
 4   Age          418 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Fare         418 non-null    float64
 8   FamilySize   418 non-null    int64  
 9   IsAlone      418 non-null    int32  
 10  Embarked_Q   418 non-null    bool   
 11  Embarked_S   418 non-null    bool   
dtypes: bool(2), float64(2), int32(2), int64(6)
memory usage: 30.3 KB


Why encoding is required:

Most machine learning algorithms operate on numerical data and cannot interpret raw text values.

Encoding converts categorical variables into a numerical format that models can process.

Proper encoding helps the model learn patterns without losing important information.

Why different encoding methods were used:

Sex was label encoded because it is a binary feature with only two categories.

Embarked was one-hot encoded because its categories have no inherent order.

Using appropriate encoding methods prevents introducing false relationships between categories.

In [24]:
df.head()

   PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  FamilySize  IsAlone  Embarked_Q  Embarked_S
0          892         0       3    0  34.5      0      0   7.8292           1        1        True       False
1          893         1       3    1  47.0      1      0      7.0           2        0       False        True
2          894         0       2    0  62.0      0      0   9.6875           1        1        True       False
3          895         0       3    0  27.0      0      0   8.6625           1        1       False        True
4          896         1       3    1  22.0      1      1  12.2875           3        0       False        True


New features exist (FamilySize, IsAlone)

Encoded columns (Sex, Embarked_*)

No text columns

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int32  
 3   Sex          418 non-null    int64  
 4   Age          418 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Fare         418 non-null    float64
 8   FamilySize   418 non-null    int64  
 9   IsAlone      418 non-null    int32  
 10  Embarked_Q   418 non-null    bool   
 11  Embarked_S   418 non-null    bool   
dtypes: bool(2), float64(2), int32(2), int64(6)
memory usage: 30.3 KB


All columns are int, float, or bool

No object/string columns

Reasonable memory usage
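
The `info()` output lists Embarked_Q and Embarked_S as bool. Many models accept booleans as is, but if strictly 0/1 integer columns are wanted, a sketch like the following could be used. The `toy` frame is hypothetical, and the final check is one possible way to confirm that nothing non-numeric remains.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

# Hypothetical frame mixing float, int, and bool columns.
toy = pd.DataFrame({
    "Age": [34.5, 47.0],
    "Sex": [0, 1],
    "Embarked_Q": [True, False],
})

# Cast every boolean column to 0/1 integers via a {column: dtype} mapping.
bool_cols = [c for c in toy.columns if is_bool_dtype(toy[c])]
toy = toy.astype({c: int for c in bool_cols})

# Collect any column that is still not numeric; the list should be empty.
non_numeric = [c for c in toy.columns if not is_numeric_dtype(toy[c])]
print(bool_cols, non_numeric)
```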

In [26]:
df.to_csv("titanic_cleaned.csv",index=False)
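
A short sketch of why `index=False` matters when saving. It writes to in-memory buffers rather than files, and the two-row `toy` frame is made up for illustration.

```python
import io

import pandas as pd

toy = pd.DataFrame({"PassengerId": [892, 893], "Fare": [7.8292, 7.0]})

# With index=False the row index is not written, so the columns round-trip unchanged.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Without index=False the index is written as an unnamed first column,
# which read_csv loads back as "Unnamed: 0".
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(with_index.columns))
```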