## Handling Categorical Values

### Types of Encoding

#### 1) Nominal Encoding (e.g. gender, states)
1. One Hot Encoding
2. One Hot Encoding with many categories (e.g. pincode)
3. Mean Encoding (e.g. pincode): each pincode is replaced by the mean of the target for that pincode

#### 2) Ordinal Encoding (e.g. rating)
1. Label Encoding (e.g. rating, education)
2. Target Guided Ordinal Encoding (labels are assigned on the basis of the target mean: the higher the mean, the higher the label)

*Order the labels according to the target*

*Replace each label with the probability of the target being 1 for that label (the target mean)*

#### 3) Count Encoding: replace each category with its count
1. Count Encoding

*It is used when a variable has many categories.*

*It does not create new features.*

**Disadvantage:** Categories that occur the same number of times are replaced by the same value, so the information that distinguishes them is lost.

In [38]:
import pandas as pd

df = pd.read_csv("titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [39]:
df.shape

(891, 12)

In [40]:
df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [41]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [42]:
from sklearn.impute import KNNImputer

# Numeric columns only: KNNImputer cannot handle object (string) columns.
num = [col for col in df.columns if df[col].dtypes != 'O']

# Fill each missing value with the mean of that column over the 5 nearest rows.
knn = KNNImputer(n_neighbors=5)
df2 = pd.DataFrame(knn.fit_transform(df[num]), columns=num)
df2.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
dtype: int64

In [43]:
df2.dtypes

PassengerId    float64
Survived       float64
Pclass         float64
Age            float64
SibSp          float64
Parch          float64
Fare           float64
dtype: object

In [44]:
# Note: casting to int truncates the fractional part of Age and Fare (e.g. 7.25 becomes 7).
df2 = df2.astype(int)
df2.dtypes

PassengerId    int32
Survived       int32
Pclass         int32
Age            int32
SibSp          int32
Parch          int32
Fare           int32
dtype: object

In [45]:
df2["Cabin"] = df.Cabin
df2["Embarked"] = df.Embarked
df2["Sex"] = df.Sex
df2["Name"] = df.Name
df2.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Age              0
SibSp            0
Parch            0
Fare             0
Cabin          687
Embarked         2
Sex              0
Name             0
dtype: int64

In [46]:
def impute_nan(df2, variable):
    # Fill missing values with the most frequent category (the mode).
    most_frequent_category = df2[variable].mode()[0]
    df2[variable] = df2[variable].fillna(most_frequent_category)

for feature in ['Cabin', 'Embarked']:
    impute_nan(df2, feature)

df = df2
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
Cabin          0
Embarked       0
Sex            0
Name           0
dtype: int64

In [47]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Cabin,Embarked,Sex,Name
0,1,0,3,22,1,0,7,B96 B98,S,male,"Braund, Mr. Owen Harris"
1,2,1,1,38,1,0,71,C85,C,female,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,3,1,3,26,0,0,7,B96 B98,S,female,"Heikkinen, Miss. Laina"
3,4,1,1,35,1,0,53,C123,S,female,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,5,0,3,35,0,0,8,B96 B98,S,male,"Allen, Mr. William Henry"


In [48]:
df.to_csv(r'titanic_with_no_nan.csv', index = False)

### 1.1) One Hot Encoding


In [49]:
df['Sex'].head()

0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: object

In [50]:
pd.get_dummies(df['Sex']).head()

Unnamed: 0,female,male
0,False,True
1,True,False
2,True,False
3,True,False
4,False,True


In [51]:
pd.get_dummies(df['Sex'],drop_first=True).head()

Unnamed: 0,male
0,True
1,False
2,False
3,False
4,True
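
The effect of `drop_first=True` is easier to see with more than two categories. The following is a minimal sketch on a made-up `Embarked`-style column (the values are illustrative, not read from the Titanic file): in the full encoding every row sums to 1, so one column is redundant and can be dropped without losing information.

```python
import pandas as pd

# Toy nominal column; the values are assumptions made for illustration.
embarked = pd.Series(["S", "C", "Q", "S", "C"], name="Embarked")

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(embarked, dtype=int)

# Each row contains exactly one 1, so the indicator columns sum to 1.
row_sums = full.sum(axis=1)

# drop_first=True removes the first category in sorted order ("C" here),
# leaving k-1 columns that carry the same information.
reduced = pd.get_dummies(embarked, drop_first=True, dtype=int)

print(full)
print(reduced)
```

A row with 0 in every remaining column then stands for the dropped category.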


### 1.2) One Hot Encoding With Many Categories

* Examples: pincode, cabin number

* Find the k most frequent categories, create one new indicator column for each of them, and drop the original column. Rows whose category is not among the top k get 0 in every new column.

In [52]:

top20=[x for x in df['Cabin'].value_counts().sort_values(ascending=False).head(20).index]
top20

['B96 B98',
 'C23 C25 C27',
 'G6',
 'C22 C26',
 'F33',
 'D',
 'F2',
 'E101',
 'C52',
 'D20',
 'B22',
 'E25',
 'D36',
 'E67',
 'D17',
 'C123',
 'E33',
 'B28',
 'F4',
 'C83']

In [53]:
df.shape

(891, 11)

In [54]:
import numpy as np
# One indicator column per frequent cabin: 1 where the row matches, else 0.
for label in top20:
    df[label] = np.where(df['Cabin'] == label, 1, 0)
df = df.drop("Cabin", axis=1)

In [55]:
df.shape

(891, 30)

In [56]:
df.head(3).T

Unnamed: 0,0,1,2
PassengerId,1,2,3
Survived,0,1,1
Pclass,3,1,3
Age,22,38,26
SibSp,1,1,0
Parch,0,0,0
Fare,7,71,7
Embarked,S,C,S
Sex,male,female,female
Name,"Braund, Mr. Owen Harris","Cumings, Mrs. John Bradley (Florence Briggs Th...","Heikkinen, Miss. Laina"
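
The top-k steps above can be wrapped in a small reusable function. This is only a sketch: the helper name `one_hot_top_k`, the `column_label` naming of the new columns, and the toy data are assumptions and do not come from the notebook.

```python
import pandas as pd

def one_hot_top_k(frame, column, k):
    """Return a copy of `frame` where `column` is replaced by k indicator columns.

    Hypothetical helper: only the k most frequent categories get a column.
    """
    # value_counts() sorts by frequency, so head(k) gives the k most common.
    top = frame[column].value_counts().head(k).index
    out = frame.drop(columns=[column])
    for label in top:
        out[f"{column}_{label}"] = (frame[column] == label).astype(int)
    return out

# Toy data standing in for the Titanic frame.
toy = pd.DataFrame({
    "Cabin": ["A", "B", "A", "C", "A", "B", "D"],
    "Fare": [7, 71, 7, 53, 8, 8, 13],
})
encoded = one_hot_top_k(toy, "Cabin", k=2)
print(encoded)
```

Prefixing the new column names with the original column name avoids clashes when several columns are encoded this way.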


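### 1.3) Mean Encoding

The cells below replace each cabin with its relative frequency (its count divided by the number of rows), so despite the `mean_encoding` name this is frequency encoding. Encoding by the mean of the target is shown in section 2.2.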

In [57]:
import pandas as pd
df = pd.read_csv("titanic_with_no_nan.csv",usecols=['Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,B96 B98
1,1,C85
2,1,B96 B98
3,1,C123
4,0,B96 B98


In [58]:
df['Cabin'].value_counts()

Cabin
B96 B98        691
G6               4
C23 C25 C27      4
C22 C26          3
F33              3
              ... 
E34              1
C7               1
C54              1
E36              1
C148             1
Name: count, Length: 147, dtype: int64

In [59]:
df['Cabin'].value_counts()/len(df['Cabin'])

Cabin
B96 B98        0.775533
G6             0.004489
C23 C25 C27    0.004489
C22 C26        0.003367
F33            0.003367
                 ...   
E34            0.001122
C7             0.001122
C54            0.001122
E36            0.001122
C148           0.001122
Name: count, Length: 147, dtype: float64

In [60]:
temp=df['Cabin'].value_counts()/len(df['Cabin'])
mean_encoding=temp.to_dict()

from collections import Counter
Counter(mean_encoding).most_common(5)

[('B96 B98', 0.7755331088664422),
 ('G6', 0.004489337822671156),
 ('C23 C25 C27', 0.004489337822671156),
 ('C22 C26', 0.003367003367003367),
 ('F33', 0.003367003367003367)]

In [61]:
df['mean_encoding']=df['Cabin'].map(mean_encoding)
df[['Cabin','mean_encoding']].head()

Unnamed: 0,Cabin,mean_encoding
0,B96 B98,0.775533
1,C85,0.001122
2,B96 B98,0.775533
3,C123,0.002245
4,B96 B98,0.775533
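
The mapping above is built from `value_counts()` divided by the number of rows, which is each category's relative frequency. For contrast, here is a minimal sketch on a toy frame (the data and the column names `cabin_freq` and `cabin_target_mean` are made up for illustration) that places the frequency mapping next to a mapping built from the mean of the target.

```python
import pandas as pd

# Toy data; not taken from the Titanic file.
toy = pd.DataFrame({
    "Cabin":    ["A", "A", "A", "B", "B", "C"],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Frequency encoding: share of rows that fall in each category.
freq_map = toy["Cabin"].value_counts(normalize=True).to_dict()

# Target mean encoding: average of the target within each category.
mean_map = toy.groupby("Cabin")["Survived"].mean().to_dict()

toy["cabin_freq"] = toy["Cabin"].map(freq_map)
toy["cabin_target_mean"] = toy["Cabin"].map(mean_map)
print(toy)
```

When the target is used, the mapping is normally computed on the training split only and then applied to the test split, so that test labels do not leak into the features.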


***************************************************************************************************************************

# Ordinal Encoding

In [62]:
import datetime
import pandas as pd
today_date = datetime.datetime.today()
days = [today_date-datetime.timedelta(x) for x in range(0,15)]
df = pd.DataFrame(days,columns=["Day"])
df

Unnamed: 0,Day
0,2025-04-02 17:23:16.760130
1,2025-04-01 17:23:16.760130
2,2025-03-31 17:23:16.760130
3,2025-03-30 17:23:16.760130
4,2025-03-29 17:23:16.760130
5,2025-03-28 17:23:16.760130
6,2025-03-27 17:23:16.760130
7,2025-03-26 17:23:16.760130
8,2025-03-25 17:23:16.760130
9,2025-03-24 17:23:16.760130


In [63]:
df["weekday"]=df["Day"].dt.day_name()
df["weekday"]

0     Wednesday
1       Tuesday
2        Monday
3        Sunday
4      Saturday
5        Friday
6      Thursday
7     Wednesday
8       Tuesday
9        Monday
10       Sunday
11     Saturday
12       Friday
13     Thursday
14    Wednesday
Name: weekday, dtype: object

### Label Encoding

In [64]:
dictionary = {
    "Monday":1, "Tuesday":2,
    "Wednesday":3, "Thursday":4,
    "Friday":5, "Saturday":6,
    "Sunday":7
}

df["weekday"].map(dictionary)

0     3
1     2
2     1
3     7
4     6
5     5
6     4
7     3
8     2
9     1
10    7
11    6
12    5
13    4
14    3
Name: weekday, dtype: int64
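
As an alternative to a hand-written dictionary, the same ordering can be expressed with an ordered pandas `Categorical`. This is a sketch on a short made-up series; the names `order`, `weekday`, and `codes` are illustrative.

```python
import pandas as pd

# The intended order of the categories, lowest first.
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]

# Toy input standing in for df["weekday"].
weekday = pd.Series(["Wednesday", "Tuesday", "Monday", "Sunday"])

# An ordered categorical stores each value as its position in `order`.
cat = pd.Categorical(weekday, categories=order, ordered=True)

# Codes start at 0, so add 1 to match the Monday=1 ... Sunday=7 mapping.
codes = pd.Series(cat.codes, index=weekday.index) + 1
print(codes.tolist())
```

A value that is not listed in `order` gets the code -1 (0 after the shift), which makes unexpected labels easy to spot.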

# Titanic Dataset Encoding

## 2.1) Label Encoding

**Advantages**


**Disadvantages**

In [65]:
import pandas as pd
df = pd.read_csv("titanic_with_no_nan.csv",usecols=['Sex','Survived'])
df.head()

Unnamed: 0,Survived,Sex
0,0,male
1,1,female
2,1,female
3,1,female
4,0,male


In [66]:
dict_map = {'male':1,"female":0}
df["Sex_num"] = df['Sex'].map(dict_map)
df.head()

Unnamed: 0,Survived,Sex,Sex_num
0,0,male,1
1,1,female,0
2,1,female,0
3,1,female,0
4,0,male,1


## 2.2) Target Guided Ordinal Encoding

In [67]:
import pandas as pd
df = pd.read_csv("titanic_with_no_nan.csv",usecols=['Sex','Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin,Sex
0,0,B96 B98,male
1,1,C85,female
2,1,B96 B98,female
3,1,C123,female
4,0,B96 B98,male


In [68]:
df.groupby(['Cabin'])['Survived'].mean()

Cabin
A10    0.0
A14    0.0
A16    1.0
A19    0.0
A20    1.0
      ... 
F33    1.0
F38    0.0
F4     1.0
G6     0.5
T      0.0
Name: Survived, Length: 147, dtype: float64

In [69]:
mean_ordinal=df.groupby(['Cabin'])['Survived'].mean().to_dict()
Counter(mean_ordinal).most_common(10)


[('A16', 1.0),
 ('A20', 1.0),
 ('A23', 1.0),
 ('A26', 1.0),
 ('A31', 1.0),
 ('A34', 1.0),
 ('A6', 1.0),
 ('B101', 1.0),
 ('B18', 1.0),
 ('B20', 1.0)]

In [70]:
mean_ordinal

{'A10': 0.0,
 'A14': 0.0,
 'A16': 1.0,
 'A19': 0.0,
 'A20': 1.0,
 'A23': 1.0,
 'A24': 0.0,
 'A26': 1.0,
 'A31': 1.0,
 'A32': 0.0,
 'A34': 1.0,
 'A36': 0.0,
 'A5': 0.0,
 'A6': 1.0,
 'A7': 0.0,
 'B101': 1.0,
 'B102': 0.0,
 'B18': 1.0,
 'B19': 0.0,
 'B20': 1.0,
 'B22': 0.5,
 'B28': 1.0,
 'B3': 1.0,
 'B30': 0.0,
 'B35': 1.0,
 'B37': 0.0,
 'B38': 0.0,
 'B39': 1.0,
 'B4': 1.0,
 'B41': 1.0,
 'B42': 1.0,
 'B49': 1.0,
 'B5': 1.0,
 'B50': 1.0,
 'B51 B53 B55': 0.5,
 'B57 B59 B63 B66': 1.0,
 'B58 B60': 0.5,
 'B69': 1.0,
 'B71': 0.0,
 'B73': 1.0,
 'B77': 1.0,
 'B78': 1.0,
 'B79': 1.0,
 'B80': 1.0,
 'B82 B84': 0.0,
 'B86': 0.0,
 'B94': 0.0,
 'B96 B98': 0.30390738060781475,
 'C101': 1.0,
 'C103': 1.0,
 'C104': 1.0,
 'C106': 1.0,
 'C110': 0.0,
 'C111': 0.0,
 'C118': 0.0,
 'C123': 0.5,
 'C124': 0.0,
 'C125': 1.0,
 'C126': 1.0,
 'C128': 0.0,
 'C148': 1.0,
 'C2': 0.5,
 'C22 C26': 0.3333333333333333,
 'C23 C25 C27': 0.5,
 'C30': 0.0,
 'C32': 1.0,
 'C45': 1.0,
 'C46': 0.0,
 'C47': 1.0,
 'C49': 0.0,
 'C50':

In [71]:
df['mean_nominal_encode']=df['Cabin'].map(mean_ordinal)
df[['Cabin','mean_nominal_encode']].head()

Unnamed: 0,Cabin,mean_nominal_encode
0,B96 B98,0.303907
1,C85,1.0
2,B96 B98,0.303907
3,C123,0.5
4,B96 B98,0.303907
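
The cell above maps each cabin directly to its mean survival rate. The ordinal variant, in which a higher target mean gives a higher integer label, could be sketched as follows. The toy data and the names `ordered_labels`, `ordinal_map`, and `cabin_ordinal` are assumptions made for illustration.

```python
import pandas as pd

# Toy data; not taken from the Titanic file.
toy = pd.DataFrame({
    "Cabin":    ["A", "A", "B", "B", "B", "C", "C"],
    "Survived": [1, 1, 0, 0, 1, 0, 0],
})

# Mean of the target per category, sorted from lowest to highest.
ordered_labels = toy.groupby("Cabin")["Survived"].mean().sort_values().index

# Rank 0 goes to the category with the lowest mean, rank 1 to the next, and so on.
ordinal_map = {label: rank for rank, label in enumerate(ordered_labels)}

toy["cabin_ordinal"] = toy["Cabin"].map(ordinal_map)
print(ordinal_map)
```

Categories with equal means still receive different ranks here, in an arbitrary order, so ties may need separate handling.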


# 3) Count Encoding

### Replace the categories with their count

**Advantages**
* Easy to use
* Does not increase the feature space

**Disadvantages**
* Categories with the same frequency get the same value

**Target Guided Ordinal Encoding**

* Ordering the labels according to the target
* Replace each label with the probability of the target being 1 for that label

In [72]:
import pandas as pd
df = pd.read_csv("titanic_with_no_nan.csv",usecols=['Sex','Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin,Sex
0,0,B96 B98,male
1,1,C85,female
2,1,B96 B98,female
3,1,C123,female
4,0,B96 B98,male


In [73]:
df["Cabin"].value_counts().to_dict()

{'B96 B98': 691,
 'G6': 4,
 'C23 C25 C27': 4,
 'C22 C26': 3,
 'F33': 3,
 'D': 3,
 'F2': 3,
 'E101': 3,
 'E24': 2,
 'B49': 2,
 'E8': 2,
 'C125': 2,
 'B20': 2,
 'B77': 2,
 'D35': 2,
 'C78': 2,
 'C93': 2,
 'C65': 2,
 'B57 B59 B63 B66': 2,
 'B5': 2,
 'E121': 2,
 'B51 B53 B55': 2,
 'B18': 2,
 'C124': 2,
 'C126': 2,
 'B35': 2,
 'E44': 2,
 'C92': 2,
 'C68': 2,
 'D20': 2,
 'B22': 2,
 'E25': 2,
 'D36': 2,
 'E67': 2,
 'D17': 2,
 'D33': 2,
 'C123': 2,
 'C52': 2,
 'B28': 2,
 'F4': 2,
 'C83': 2,
 'E33': 2,
 'B58 B60': 2,
 'C2': 2,
 'F G73': 2,
 'D26': 2,
 'D28': 1,
 'D19': 1,
 'A23': 1,
 'D50': 1,
 'D9': 1,
 'A20': 1,
 'D11': 1,
 'B41': 1,
 'A26': 1,
 'E17': 1,
 'E68': 1,
 'A10': 1,
 'A24': 1,
 'C101': 1,
 'A16': 1,
 'C70': 1,
 'C86': 1,
 'C50': 1,
 'B42': 1,
 'B39': 1,
 'B50': 1,
 'B71': 1,
 'D48': 1,
 'B82 B84': 1,
 'D30': 1,
 'C46': 1,
 'E77': 1,
 'D45': 1,
 'F38': 1,
 'B3': 1,
 'B101': 1,
 'C95': 1,
 'D6': 1,
 'C45': 1,
 'E58': 1,
 'C90': 1,
 'C62 C64': 1,
 'A36': 1,
 'F G63': 1,
 'B102': 1,
 '

In [74]:
count = df["Cabin"].value_counts().to_dict()
df["cabin_count"]=df["Cabin"].map(count)
df.head()

Unnamed: 0,Survived,Cabin,Sex,cabin_count
0,0,B96 B98,male,691
1,1,C85,female,1
2,1,B96 B98,female,691
3,1,C123,female,2
4,0,B96 B98,male,691
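
The disadvantage of count encoding noted earlier shows up even on a tiny example. In this sketch (toy data, illustrative names) two different cabins occur equally often and therefore receive the same encoded value.

```python
import pandas as pd

# Toy data: "A" and "B" both occur twice, "C" once.
toy = pd.DataFrame({"Cabin": ["A", "A", "B", "B", "C"]})

# Count encoding: map each category to the number of rows it appears in.
count_map = toy["Cabin"].value_counts().to_dict()
toy["cabin_count"] = toy["Cabin"].map(count_map)

# "A" and "B" collapse onto the same value after encoding.
collided = toy.loc[toy["Cabin"].isin(["A", "B"]), "cabin_count"].nunique() == 1
print(toy)
print(collided)
```

In the Titanic data the same thing happens to the many cabins that occur exactly once or twice.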
