<a href="https://colab.research.google.com/github/Movya777/EDA_and_Feature_Engineering/blob/main/%F0%9F%9A%A2Titanic2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dataset link: https://www.kaggle.com/competitions/titanic/data

**Objective of this notebook**

Handle categorical features

Techniques followed:

1. Nominal: One-hot encoding, mean encoding, probability ratio encoding
2. Ordinal: Target-guided encoding

In [None]:
import pandas as pd
import numpy as np

In [None]:
df=pd.read_csv('/content/drive/MyDrive/Practice Datasets/titanic/train.csv')

In [None]:
cat_cols=[i for i in df.columns if df[i].dtypes=='O']
cat_cols

['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

In [None]:
df.isnull().sum()

| Column | Null count |
|---|---|
| PassengerId | 0 |
| Survived | 0 |
| Pclass | 0 |
| Name | 0 |
| Sex | 0 |
| Age | 177 |
| SibSp | 0 |
| Parch | 0 |
| Ticket | 0 |
| Fare | 0 |


# One-hot encoding: few categories

- Sex and Embarked columns
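Whether one-hot encoding is practical depends on how many distinct values a column holds, since each value becomes its own indicator column. A minimal sketch of that check follows; the frame `toy`, its values, and the threshold of 3 are assumptions made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Ticket": ["A1", "B2", "C3", "D4"],
})

# Count the distinct values in each column.
cardinality = toy.nunique()
print(cardinality)

# Columns with few categories are reasonable one-hot candidates.
# The cutoff of 3 is an arbitrary choice for this sketch.
low_cardinality = cardinality[cardinality <= 3].index.tolist()
print(low_cardinality)
```

On the real data, a column such as Ticket would likely have far too many distinct values for one-hot encoding to be useful.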

In [None]:
# Replace NaNs in the Embarked column with the mode (most frequent value).
# .mode() returns a Series of values, so [0] selects the value itself;
# .index[0] would return the index label 0 instead of a port code.
df['Embarked']=df['Embarked'].fillna(df['Embarked'].mode()[0])


In [None]:
# use one-hot encoding for sex and embarked
data=pd.get_dummies(df,columns=['Sex','Embarked'],drop_first=True,dtype=int)

In [None]:
data.head()

| | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | 1 | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 0 | 1 | 0 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | 0 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | 0 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | 1 | 0 | 0 | 1 |
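With `drop_first=True`, a column with k distinct values produces k-1 indicator columns, because the first category in sorted order is dropped and is represented by all zeros. A minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data with three distinct ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

# drop_first=True removes the first category (here "C") to avoid a redundant column.
reduced = pd.get_dummies(toy, columns=["Embarked"], drop_first=True, dtype=int)

print(list(full.columns))     # three indicator columns
print(list(reduced.columns))  # two indicator columns
print(reduced)
```

A row whose port is the dropped category has 0 in every remaining indicator column, so no information is lost.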


# Target-guided encoding

- Cabin column

In [None]:
# Label missing cabins explicitly; assign the result back instead of using inplace=True
data['Cabin']=data['Cabin'].fillna("Missing")


In [None]:
data['Cabin'].head(10)

| | Cabin |
|---|---|
| 0 | Missing |
| 1 | C85 |
| 2 | Missing |
| 3 | C123 |
| 4 | Missing |
| 5 | Missing |
| 6 | E46 |
| 7 | Missing |
| 8 | Missing |
| 9 | Missing |


In [None]:
data.isnull().sum()

| Column | Null count |
|---|---|
| PassengerId | 0 |
| Survived | 0 |
| Pclass | 0 |
| Name | 0 |
| Age | 177 |
| SibSp | 0 |
| Parch | 0 |
| Ticket | 0 |
| Fare | 0 |
| Cabin | 0 |


In [None]:
data['Cabin'].unique()

array(['Missing', 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62

Notice that each cabin value starts with a letter, which identifies the block (level) of the cabin.

The number that follows identifies the specific cabin within that block.

In [None]:
# Let's keep only the block letter (the first character of the cabin string)
data['Cabin_block']=data['Cabin'].astype(str).str[0]

In [None]:
data.drop('Cabin',axis=1,inplace=True)

In [None]:
data2=data.groupby(['Cabin_block'])['Survived'].mean()
data2

| Cabin_block | Survived |
|---|---|
| A | 0.466667 |
| B | 0.744681 |
| C | 0.59322 |
| D | 0.757576 |
| E | 0.75 |
| F | 0.615385 |
| G | 0.5 |
| M | 0.299854 |
| T | 0.0 |


The survival rate is about 47% for passengers in block A, about 74% for block B, and so on. Block M stands for passengers whose cabin was missing.


In [None]:
ordinal_labels = data2.sort_values().index

In [None]:
ordinal_labels

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin_block')

In [None]:
# map these labels to a number
ordinal_labels2={k:i for i,k in enumerate(ordinal_labels,0)}
ordinal_labels2

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [None]:
data['Cabin']=data['Cabin_block'].map(ordinal_labels2)

In [None]:
data.head()

| | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Cabin_block | Cabin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.25 | 1 | 0 | 0 | 1 | M | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 0 | 1 | 0 | 0 | C | 4 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | 0 | 0 | 0 | 1 | M | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1 | 0 | 0 | 0 | 1 | C | 4 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.05 | 1 | 0 | 0 | 1 | M | 1 |


In [None]:
data.drop('Cabin_block',axis=1,inplace=True)
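The target-guided steps used in this section (group by category, sort by mean target, enumerate, map) can be sketched end to end on a small frame. Everything named here (`train`, `block`, `target`, `new_rows`) is hypothetical and made up for illustration. The sketch also shows one caveat: a category that was not present when the mapping was built maps to NaN and needs separate handling.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
train = pd.DataFrame({
    "block": ["A", "A", "B", "B", "C", "C"],
    "target": [0, 1, 1, 1, 0, 0],
})

# Rank the categories by their mean target, lowest first.
order = train.groupby("block")["target"].mean().sort_values().index
mapping = {label: rank for rank, label in enumerate(order)}
print(mapping)

# Replace each category with its rank.
train["block_encoded"] = train["block"].map(mapping)

# A category absent from the mapping ("Z" here) becomes NaN.
new_rows = pd.Series(["B", "Z"])
encoded_new = new_rows.map(mapping)
print(encoded_new)
```

How to treat such unseen categories (for example, assigning a dedicated rank) is a modelling choice that this sketch leaves open.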

# Mean Encoding

In [None]:
mean_labels=data2.to_dict()

In [None]:
#map the cabin column with mean_labels
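The mapping step itself is not shown in the cell above. A minimal sketch of mean encoding on a hypothetical toy frame follows (`train`, `block`, `target`, `mean_map` and `block_mean` are made-up names). In the notebook, the same `.map` call would presumably be applied with `mean_labels` to the block letters, that is, to `Cabin_block` before that column is dropped.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only.
train = pd.DataFrame({
    "block": ["A", "A", "B", "B", "C", "C"],
    "target": [0, 1, 1, 1, 0, 0],
})

# Mean of the target per category, as a plain dict.
mean_map = train.groupby("block")["target"].mean().to_dict()
print(mean_map)

# Replace each category with its mean target value.
train["block_mean"] = train["block"].map(mean_map)
print(train)
```

Unlike the ordinal ranks above, the encoded values keep the size of the gaps between categories, not only their order.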

# Probability Ratio Encoding

Step 1: Find the proportion of passengers who survived in each category

Step 2: Find the proportion of passengers who died in each category

Step 3: Divide the survived proportion by the died proportion

Step 4: Map each category to its ratio
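The four steps can be sketched on a small frame as follows. All names (`train`, `block`, `target`, `block_ratio`) and values are hypothetical. The toy data deliberately includes a category in which everyone survived, to show that the ratio becomes infinite when the proportion that died is 0; how to handle that case is left open here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; in block "C" every row has target 1.
train = pd.DataFrame({
    "block": ["A", "A", "A", "B", "B", "C", "C"],
    "target": [1, 1, 0, 1, 0, 1, 1],
})

# Step 1: proportion with target 1 ("survived") per category.
p_survived = train.groupby("block")["target"].mean()

# Step 2: proportion with target 0 ("died") per category.
p_died = 1 - p_survived

# Step 3: ratio of the two proportions.
# pandas returns inf (not an error) where p_died is 0.
ratio = p_survived / p_died
print(ratio)

# Step 4: map each category to its ratio.
train["block_ratio"] = train["block"].map(ratio.to_dict())
print(train)
```

The opposite edge case, a category with no survivors, simply yields a ratio of 0.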

In [None]:
dpr=pd.read_csv('/content/drive/MyDrive/Practice Datasets/titanic/train.csv',usecols=['Cabin','Survived'])

In [None]:
# Fill NaN values in Cabin; assign the result back instead of using inplace=True
dpr['Cabin']=dpr['Cabin'].fillna('Missing')


In [None]:
dpr['Cabin']=dpr['Cabin'].astype(str).str[0]

In [None]:
prob_df=dpr.groupby(['Cabin'])['Survived'].mean()
# prob_df is a Series at this point

In [None]:
# convert the Series into a DataFrame
prob_df=pd.DataFrame(prob_df)

In [None]:
prob_df['Die']=1-prob_df['Survived']

In [None]:
prob_df

| Cabin | Survived | Die |
|---|---|---|
| A | 0.466667 | 0.533333 |
| B | 0.744681 | 0.255319 |
| C | 0.59322 | 0.40678 |
| D | 0.757576 | 0.242424 |
| E | 0.75 | 0.25 |
| F | 0.615385 | 0.384615 |
| G | 0.5 | 0.5 |
| M | 0.299854 | 0.700146 |
| T | 0.0 | 1.0 |


In [None]:
# probability ratio encoding
prob_df['ratio']=prob_df['Survived']/prob_df['Die']

In [None]:
ratio=prob_df['ratio'].to_dict()

In [None]:
dpr['Cabin']=dpr['Cabin'].map(ratio)

In [None]:
dpr

| | Survived | Cabin |
|---|---|---|
| 0 | 0 | 0.428274 |
| 1 | 1 | 1.458333 |
| 2 | 1 | 0.428274 |
| 3 | 1 | 1.458333 |
| 4 | 0 | 0.428274 |
| ... | ... | ... |
| 886 | 0 | 0.428274 |
| 887 | 1 | 2.916667 |
| 888 | 0 | 0.428274 |
| 889 | 1 | 1.458333 |
