# Categorical Data:

A categorical variable is one whose values fall into two or more categories (labels).
- Many ML algorithms cannot operate on categorical or label data directly. 
- They require all input variables and output variables to be numeric. 
- This means that categorical data must be converted to a numerical form.
- Some algorithms, such as decision trees, can in principle learn from categorical data directly, although several implementations (scikit-learn's trees, for example) still expect numeric input.

- **Ordinal Data**: The categories have an inherent order
- **Nominal Data**: The categories do not have an inherent order

# One Hot Encoding

- `get_dummies` in the pandas library does the encoding, as shown below. 
- It creates one extra column per category, filled with 0s and 1s. 
- A 1 means the category is present in that row; a 0 means it is not.
- With `drop_first=True` the first category is dropped, so k categories yield k-1 columns. The dropped category is the one for which all remaining columns are 0.

In [2]:
import pandas as pd
import numpy as np


In [5]:
df=pd.read_csv('titanic.csv',usecols=['Sex'])
df.head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [11]:
df.value_counts()

Sex   
male      577
female    314
dtype: int64

In [10]:
pd.get_dummies(df,drop_first=True)

Unnamed: 0,Sex_male
0,1
1,0
2,0
3,0
4,1
...,...
886,1
887,0
888,0
889,1


In [12]:
df=pd.read_csv('titanic.csv',usecols=['Embarked'])
df.head()

Unnamed: 0,Embarked
0,S
1,C
2,S
3,S
4,S


In [13]:
df.value_counts()

Embarked
S           644
C           168
Q            77
dtype: int64

In [14]:
pd.get_dummies(df,drop_first=True)

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1
...,...,...
886,0,1
887,0,1
888,0,1
889,0,0
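
The effect of `drop_first=True` in the two examples above can be reproduced on a small made-up column. This is a minimal sketch: the `toy` frame is hypothetical data, not the Titanic file, and `dtype=int` is passed only so the output shows 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy column with three nominal categories (made-up data).
toy = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", "C"]})

# Full one-hot encoding: one 0/1 column per category.
full = pd.get_dummies(toy, dtype=int)

# drop_first=True drops the first category in sorted order ("C" here).
# A row with category "C" is then the row where all remaining columns are 0.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Dropping one column loses no information, because the dropped category can always be recovered from the others.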


# One Hot Encoding on Many Categories

When a variable has many labels, we can limit one-hot encoding to its 10 most frequent labels. We create one binary variable for each of these 10 labels only, which is equivalent to grouping all other labels into a single new category that gets no column of its own. For a given observation, each of the 10 dummy variables is 1 if the corresponding label is present and 0 otherwise; an observation with a less frequent label has 0 in all 10.

#### The "mercedesbenz" dataset contains features with many categories. Let's apply one-hot encoding to it.

# Importing Modules

In [1]:
import pandas as pd
import numpy as np

# Dataset

In [8]:
df=pd.read_csv('mercedesbenz.csv',usecols=['X0','X1','X2','X3','X4','X5','X6'])
df.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6
0,k,v,at,a,d,u,j
1,k,t,av,e,d,y,l
2,az,w,n,c,d,x,j
3,az,t,n,f,d,x,l
4,az,v,n,f,d,h,d


In [16]:
for col in df:
    print(col,'=',df[col].unique())

X0 = ['k' 'az' 't' 'al' 'o' 'w' 'j' 'h' 's' 'n' 'ay' 'f' 'x' 'y' 'aj' 'ak' 'am'
 'z' 'q' 'at' 'ap' 'v' 'af' 'a' 'e' 'ai' 'd' 'aq' 'c' 'aa' 'ba' 'as' 'i'
 'r' 'b' 'ax' 'bc' 'u' 'ad' 'au' 'm' 'l' 'aw' 'ao' 'ac' 'g' 'ab']
X1 = ['v' 't' 'w' 'b' 'r' 'l' 's' 'aa' 'c' 'a' 'e' 'h' 'z' 'j' 'o' 'u' 'p' 'n'
 'i' 'y' 'd' 'f' 'm' 'k' 'g' 'q' 'ab']
X2 = ['at' 'av' 'n' 'e' 'as' 'aq' 'r' 'ai' 'ak' 'm' 'a' 'k' 'ae' 's' 'f' 'd'
 'ag' 'ay' 'ac' 'ap' 'g' 'i' 'aw' 'y' 'b' 'ao' 'al' 'h' 'x' 'au' 't' 'an'
 'z' 'ah' 'p' 'am' 'j' 'q' 'af' 'l' 'aa' 'c' 'o' 'ar']
X3 = ['a' 'e' 'c' 'f' 'd' 'b' 'g']
X4 = ['d' 'b' 'c' 'a']
X5 = ['u' 'y' 'x' 'h' 'g' 'f' 'j' 'i' 'd' 'c' 'af' 'ag' 'ab' 'ac' 'ad' 'ae'
 'ah' 'l' 'k' 'n' 'm' 'p' 'q' 's' 'r' 'v' 'w' 'o' 'aa']
X6 = ['j' 'l' 'd' 'h' 'i' 'a' 'g' 'c' 'k' 'e' 'f' 'b']


# Finding Unique Values:

In [10]:
df.columns

Index(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6'], dtype='object')

In [14]:
for col in df.columns:
    print(col,':',len(df[col].unique()),'labels')
    

X0 : 47 labels
X1 : 27 labels
X2 : 44 labels
X3 : 7 labels
X4 : 4 labels
X5 : 29 labels
X6 : 12 labels


In [18]:
df.shape

(4209, 7)

# No of Columns after Applying One Hot Encoding:

In [78]:
pd.get_dummies(df,drop_first=True).shape

(4209, 183)

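- The 7 features have 170 labels in total, so dropping one label per feature leaves 163 dummy columns instead of 7. (The shape printed above reports 183 columns; judging by the cell numbers, that cell was probably run after extra indicator columns had been added to df.)
- This growth in the number of columns is the main problem with one-hot encoding when there are many categorical features with many labels.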
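
The number of dummy columns can be predicted from the label counts before encoding anything. The helper below is a sketch; `expected_dummy_columns` and the `toy` frame are hypothetical names introduced for illustration, and the helper assumes every column in the frame is categorical.

```python
import pandas as pd

def expected_dummy_columns(frame, drop_first=True):
    """Predict how many columns pd.get_dummies creates for an all-categorical frame."""
    per_column = frame.nunique()  # number of distinct labels in each column
    dropped = len(per_column) if drop_first else 0  # one label dropped per column
    return int(per_column.sum() - dropped)

# Made-up frame: column "a" has 3 labels, column "b" has 2.
toy = pd.DataFrame({"a": ["x", "y", "z", "x"], "b": ["p", "q", "p", "q"]})

print(expected_dummy_columns(toy, drop_first=False))  # 3 + 2 labels
print(expected_dummy_columns(toy, drop_first=True))   # minus one per column
print(pd.get_dummies(toy, drop_first=True).shape[1])  # matches the prediction
```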

### Most Frequent Labels

In [68]:
# For X0:

df['X0'].value_counts().sort_values(ascending=False).head(10)


z     360
ak    349
y     324
ay    313
t     306
x     300
o     269
f     227
n     195
w     182
Name: X0, dtype: int64

In [74]:
top_10= [x for x in df['X0'].value_counts().sort_values(ascending=False).head(10).index]
top_10

['z', 'ak', 'y', 'ay', 't', 'x', 'o', 'f', 'n', 'w']

In [76]:
for label in top_10:
    df[label]=np.where(df['X0']==label,1,0)
df[['X0']+top_10].head(10)

Unnamed: 0,X0,z,ak,y,ay,t,x,o,f,n,w
0,k,0,0,0,0,0,0,0,0,0,0
1,k,0,0,0,0,0,0,0,0,0,0
2,az,0,0,0,0,0,0,0,0,0,0
3,az,0,0,0,0,0,0,0,0,0,0
4,az,0,0,0,0,0,0,0,0,0,0
5,t,0,0,0,0,1,0,0,0,0,0
6,al,0,0,0,0,0,0,0,0,0,0
7,o,0,0,0,0,0,0,1,0,0,0
8,w,0,0,0,0,0,0,0,0,0,1
9,j,0,0,0,0,0,0,0,0,0,0


In [83]:
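# Reusable helper for any variable in the dataset.
# It adds one 0/1 column per label in top_10_labels and modifies df in place.
# top_10_labels must be computed from the same variable that is being encoded.

def one_hot_encoding_top_10(df,variable,top_10_labels):
    for label in top_10_labels:
        df[variable+'_'+label]=np.where(df[variable]==label,1,0)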


In [84]:
df=pd.read_csv('mercedesbenz.csv',usecols=['X0','X1','X2','X3','X4','X5','X6'])


In [86]:
# top_10 was computed from X0, so it is applied to X0 here
one_hot_encoding_top_10(df,'X0',top_10)
df.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X0_z,X0_ak,X0_y,X0_ay,X0_t,X0_x,X0_o,X0_f,X0_n,X0_w
0,k,v,at,a,d,u,j,0,0,0,0,0,0,0,0,0,0
1,k,t,av,e,d,y,l,0,0,0,0,0,0,0,0,0,0
2,az,w,n,c,d,x,j,0,0,0,0,0,0,0,0,0,0
3,az,t,n,f,d,x,l,0,0,0,0,0,0,0,0,0,0
4,az,v,n,f,d,h,d,0,0,0,0,0,0,0,0,0,0
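
One way to avoid pairing a variable with another variable's label list is to compute the most frequent labels inside the helper. The function below is a sketch of that idea, not the notebook's own code; `one_hot_top_k` and the `toy` frame are hypothetical names, and the function returns a copy instead of modifying its input.

```python
import numpy as np
import pandas as pd

def one_hot_top_k(frame, variable, k=10):
    """Return a copy of frame with 0/1 columns for the k most frequent labels of variable."""
    # value_counts() sorts by frequency in descending order by default.
    top_labels = frame[variable].value_counts().head(k).index
    out = frame.copy()
    for label in top_labels:
        out[f"{variable}_{label}"] = np.where(out[variable] == label, 1, 0)
    return out

# Made-up data: "z" occurs 3 times, "ak" twice, "k" once.
toy = pd.DataFrame({"X0": ["z", "z", "z", "ak", "ak", "k"]})
encoded = one_hot_top_k(toy, "X0", k=2)
print(encoded)
```

Rows whose label is not among the top k (here `k`) get 0 in every new column.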


# Ordinal Number Encoding (Label Encoding)

- We use this encoding technique when the categorical feature is ordinal. In this case, retaining the order is important, so the encoding should reflect the sequence.
- In label encoding, each label is converted into an integer value.
- To replace the categories in a column, we create a dictionary whose keys are the categories and whose values are integers that follow the order of the categories.
- Then, each category in the column is mapped to the number defined in the dictionary. 

In [15]:
import datetime

In [16]:
today_date=datetime.datetime.today()

In [17]:
today_date

datetime.datetime(2022, 12, 20, 10, 2, 58, 292633)

In [18]:
today_date-datetime.timedelta(5)

datetime.datetime(2022, 12, 15, 10, 2, 58, 292633)

In [21]:
[today_date-datetime.timedelta(n) for n in range(0,25)]

[datetime.datetime(2022, 12, 20, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 19, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 18, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 17, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 16, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 15, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 14, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 13, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 12, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 11, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 10, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 9, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 8, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 7, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 6, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 5, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 4, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 3, 10, 2, 58, 292633),
 datetime.datetime(2022, 12, 2, 10, 2, 58, 292633),
 

In [25]:
data=[today_date-datetime.timedelta(n) for n in range(0,25)]
data=pd.DataFrame(data)
data.columns=['days']
data

Unnamed: 0,days
0,2022-12-20 10:02:58.292633
1,2022-12-19 10:02:58.292633
2,2022-12-18 10:02:58.292633
3,2022-12-17 10:02:58.292633
4,2022-12-16 10:02:58.292633
5,2022-12-15 10:02:58.292633
6,2022-12-14 10:02:58.292633
7,2022-12-13 10:02:58.292633
8,2022-12-12 10:02:58.292633
9,2022-12-11 10:02:58.292633


In [29]:
data['week_day']=data['days'].dt.day_name()

In [30]:
data.head()

Unnamed: 0,days,week_day
0,2022-12-20 10:02:58.292633,Tuesday
1,2022-12-19 10:02:58.292633,Monday
2,2022-12-18 10:02:58.292633,Sunday
3,2022-12-17 10:02:58.292633,Saturday
4,2022-12-16 10:02:58.292633,Friday


In [32]:
# Named day_map so that the built-in dict is not shadowed
day_map={'Monday':1,
         'Tuesday':2,
         'Wednesday':3,
         'Thursday':4,
         'Friday':5,
         'Saturday':6,
         'Sunday':7}

In [33]:
data['day_ordinal']=data['week_day'].map(day_map)

In [34]:
data.head()

Unnamed: 0,days,week_day,day_ordinal
0,2022-12-20 10:02:58.292633,Tuesday,2
1,2022-12-19 10:02:58.292633,Monday,1
2,2022-12-18 10:02:58.292633,Sunday,7
3,2022-12-17 10:02:58.292633,Saturday,6
4,2022-12-16 10:02:58.292633,Friday,5
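
The same mapping can also be expressed with an ordered pandas categorical, which keeps the order in one list instead of a hand-written dictionary. This is an alternative sketch, not the approach used above; the `days` series is made-up data that mirrors the first five rows.

```python
import pandas as pd

days = pd.Series(["Tuesday", "Monday", "Sunday", "Saturday", "Friday"])

# The position in this list defines the order of the categories.
order = ["Monday", "Tuesday", "Wednesday", "Thursday",
         "Friday", "Saturday", "Sunday"]

as_ordered = pd.Categorical(days, categories=order, ordered=True)

# .codes are 0-based positions in `order`; add 1 to get Monday=1 ... Sunday=7.
# A label missing from `order` would get code -1, i.e. 0 after adding 1.
day_ordinal = pd.Series(as_ordered.codes, index=days.index) + 1
print(day_ordinal.tolist())
```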


# Count or Frequency Encoding

- The first step is to create a dictionary with each category as key and its frequency (count) as value. 
- Then, replace the categories with their counts using the dictionary.

In [62]:
df = pd.read_csv('mercedesbenz.csv', usecols=['X1', 'X2', 'X3'])
df.head()

Unnamed: 0,X1,X2,X3
0,v,at,a
1,t,av,e
2,w,n,c
3,t,n,f
4,v,n,f


In [63]:
df.shape

(4209, 3)

**Unique Values**

In [64]:
for feature in ['X1','X2','X3']:
    print(feature,'\n' ,df[feature].unique())

X1 
 ['v' 't' 'w' 'b' 'r' 'l' 's' 'aa' 'c' 'a' 'e' 'h' 'z' 'j' 'o' 'u' 'p' 'n'
 'i' 'y' 'd' 'f' 'm' 'k' 'g' 'q' 'ab']
X2 
 ['at' 'av' 'n' 'e' 'as' 'aq' 'r' 'ai' 'ak' 'm' 'a' 'k' 'ae' 's' 'f' 'd'
 'ag' 'ay' 'ac' 'ap' 'g' 'i' 'aw' 'y' 'b' 'ao' 'al' 'h' 'x' 'au' 't' 'an'
 'z' 'ah' 'p' 'am' 'j' 'q' 'af' 'l' 'aa' 'c' 'o' 'ar']
X3 
 ['a' 'e' 'c' 'f' 'd' 'b' 'g']


**No. of Labels**

In [65]:
for feature in ['X1','X2','X3']:
    print(feature,':' ,df[feature].nunique(),'Labels')

X1 : 27 Labels
X2 : 44 Labels
X3 : 7 Labels


In [66]:
df['X2'].value_counts()

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
k       25
i       25
b       21
ao      20
ag      19
z       19
d       18
ac      13
g       12
ap      11
y       11
x       10
aw       8
at       6
h        6
al       5
an       5
q        5
av       4
ah       4
p        4
au       3
am       1
j        1
af       1
l        1
aa       1
c        1
o        1
ar       1
Name: X2, dtype: int64

In [67]:
frequency=df['X2'].value_counts().to_dict()

In [68]:
df.head()

Unnamed: 0,X1,X2,X3
0,v,at,a
1,t,av,e
2,w,n,c
3,t,n,f
4,v,n,f


In [69]:
df['X2']=df['X2'].map(frequency)
df.head()

Unnamed: 0,X1,X2,X3
0,v,6,a
1,t,4,e
2,w,137,c
3,t,137,f
4,v,137,f
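
Count encoding cannot tell apart two labels that occur equally often; in the counts above, `k` and `i` both occur 25 times and would both become 25. The sketch below, on a made-up `toy` frame, shows count encoding next to relative-frequency encoding and lists the colliding labels. The column names `X2_count` and `X2_freq` are hypothetical.

```python
import pandas as pd

# Made-up data: "as" x3, "ae" x2, "n" x1, "m" x1.
toy = pd.DataFrame({"X2": ["as", "as", "as", "ae", "ae", "n", "m"]})

counts = toy["X2"].value_counts()               # absolute counts per label
freqs = toy["X2"].value_counts(normalize=True)  # relative frequencies, summing to 1

# Mapping with a Series looks up each value in the Series index.
toy["X2_count"] = toy["X2"].map(counts)
toy["X2_freq"] = toy["X2"].map(freqs)

# Labels that share a count become indistinguishable after encoding.
collisions = counts[counts.duplicated(keep=False)]
print(toy)
print(sorted(collisions.index))
```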


# Target Guided Ordinal Encoding

- Target encoding is a Bayesian encoding technique.
- Bayesian encoders use information from the dependent (target) variable to encode the categorical data.
- In target encoding, we calculate the mean of the target variable for each category and replace the category with that mean value.
- In the case of a categorical target variable, the posterior probability of the target replaces each category.
- In target guided ordinal encoding, the categories are ranked by the mean of the target, and each category is replaced by its integer rank.

In [76]:
import pandas as pd
df=pd.read_csv('titanic.csv', usecols=['Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [77]:
df['Cabin']=df['Cabin'].fillna('Missing')
df.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [78]:
df['Cabin']=df['Cabin'].astype(str).str[0]
df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [79]:
df['Cabin'].unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [84]:
df.groupby(['Cabin'])['Survived'].mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [88]:
df.groupby(['Cabin'])['Survived'].mean().sort_values(ascending=False)*100

Cabin
D    75.757576
E    75.000000
B    74.468085
F    61.538462
C    59.322034
G    50.000000
A    46.666667
M    29.985444
T     0.000000
Name: Survived, dtype: float64

In [90]:
ordinal_labels=(df.groupby(['Cabin'])['Survived'].mean().sort_values(ascending=False)*100).index
ordinal_labels

Index(['D', 'E', 'B', 'F', 'C', 'G', 'A', 'M', 'T'], dtype='object', name='Cabin')

In [93]:
enumerate(ordinal_labels,0)

<enumerate at 0x1eb1c1bbb80>

In [110]:
ordinal_labels1={k:i for i,k in enumerate(ordinal_labels,0)}
ordinal_labels1

{'D': 0, 'E': 1, 'B': 2, 'F': 3, 'C': 4, 'G': 5, 'A': 6, 'M': 7, 'T': 8}

In [114]:
df['Cabin_ordinal_labels']=df['Cabin'].map(ordinal_labels1)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels
0,0,M,7
1,1,C,4
2,1,M,7
3,1,C,4
4,0,M,7
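
The steps above (group by category, take the target mean, sort, enumerate, map) can be collected into one function. This is a sketch under the same convention as the notebook, where rank 0 goes to the category with the highest target mean; `target_guided_ordinal` and the `toy` frame are hypothetical names.

```python
import pandas as pd

def target_guided_ordinal(frame, variable, target):
    """Rank the categories of variable by the mean of target and return (encoded, mapping)."""
    means = frame.groupby(variable)[target].mean().sort_values(ascending=False)
    # Rank 0 is the category with the highest target mean.
    mapping = {label: rank for rank, label in enumerate(means.index)}
    return frame[variable].map(mapping), mapping

# Made-up data: mean Survived is 1.0 for "C", 1/3 for "M", 0.0 for "T".
toy = pd.DataFrame({"Cabin": ["M", "C", "M", "C", "M", "T"],
                    "Survived": [0, 1, 1, 1, 0, 0]})

encoded, mapping = target_guided_ordinal(toy, "Cabin", "Survived")
print(mapping)
print(encoded.tolist())
```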


#### Mean Encoding

In [115]:
df.groupby(['Cabin'])['Survived'].mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [117]:
ordinal_mean_encoding=df.groupby(['Cabin'])['Survived'].mean().to_dict()
ordinal_mean_encoding

{'A': 0.4666666666666667,
 'B': 0.7446808510638298,
 'C': 0.5932203389830508,
 'D': 0.7575757575757576,
 'E': 0.75,
 'F': 0.6153846153846154,
 'G': 0.5,
 'M': 0.29985443959243085,
 'T': 0.0}

In [120]:
df['Cabin_mean_encoding']=df['Cabin'].map(ordinal_mean_encoding)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels,Cabin_mean_encoding
0,0,M,7,0.299854
1,1,C,4,0.59322
2,1,M,7,0.299854
3,1,C,4,0.59322
4,0,M,7,0.299854
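
Because mean encoding uses the target, the means are usually computed on the training data only and then applied to new data. The notebook does not split the data, so the `train` and `test` frames below are assumptions made for illustration, as is the choice to fall back to the overall training mean for a category that was never seen in training.

```python
import pandas as pd

# Made-up training data with the target, and new data without it.
train = pd.DataFrame({"Cabin": ["M", "M", "C", "C", "B"],
                      "Survived": [0, 1, 1, 1, 1]})
test = pd.DataFrame({"Cabin": ["M", "C", "Z"]})  # "Z" does not occur in train

# Learn the per-category means from the training data only.
means = train.groupby("Cabin")["Survived"].mean()
global_mean = train["Survived"].mean()

# Unseen categories map to NaN; fill them with the overall training mean.
test["Cabin_mean_encoding"] = test["Cabin"].map(means).fillna(global_mean)
print(test)
```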


# Probability Ratio Encoding 
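1. Compute the probability of Survived for each category of the categorical feature (Cabin)
2. Compute the probability of Not Survived (1 - probability of Survived)
3. Compute the ratio pr(Survived)/pr(Not Survived)
4. Build a dictionary that maps each Cabin category to its probability ratio
5. Replace the categorical feature with the mapped ratio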

In [122]:
df=pd.read_csv('titanic/train.csv',usecols=['Cabin', 'Survived'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [123]:
df['Cabin']=df['Cabin'].fillna('Missing ')
df.head(20)

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing
5,0,Missing
6,0,E46
7,0,Missing
8,1,Missing
9,1,Missing


In [124]:
df['Cabin'].unique()

array(['Missing ', 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C6

In [125]:
df['Cabin']=df['Cabin'].astype(str).str[0]
df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [126]:
df.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [127]:
prob_df=df.groupby(['Cabin'])['Survived'].mean()
prob_df=pd.DataFrame(prob_df)
prob_df

Cabin,Survived
A,0.466667
B,0.744681
C,0.59322
D,0.757576
E,0.75
F,0.615385
G,0.5
M,0.299854
T,0.0


In [128]:
prob_df['Died']=1-prob_df['Survived']
prob_df.head()

Cabin,Survived,Died
A,0.466667,0.533333
B,0.744681,0.255319
C,0.59322,0.40678
D,0.757576,0.242424
E,0.75,0.25


In [129]:
prob_df['Probability_ratio']=prob_df['Survived']/prob_df['Died']
prob_df.head()

Cabin,Survived,Died,Probability_ratio
A,0.466667,0.533333,0.875
B,0.744681,0.255319,2.916667
C,0.59322,0.40678,1.458333
D,0.757576,0.242424,3.125
E,0.75,0.25,3.0


In [130]:
probability_encoded=prob_df['Probability_ratio'].to_dict()
df['Cabin_encoded']=df['Cabin'].map(probability_encoded)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_encoded
0,0,M,0.428274
1,1,C,1.458333
2,1,M,0.428274
3,1,C,1.458333
4,0,M,0.428274
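
The ratio is undefined when the probability of Not Survived is 0. No Cabin group above has a survival rate of exactly 1, but a small group easily could. The sketch below shows the problem on made-up data and one possible guard, clipping the denominator at a small value; the `eps` threshold is an arbitrary assumption, not part of the original method.

```python
import numpy as np
import pandas as pd

# Made-up data: everyone in cabin "B" survived, nobody in cabin "T" did.
toy = pd.DataFrame({"Cabin": ["A", "A", "B", "B", "T"],
                    "Survived": [1, 0, 1, 1, 0]})

p_survived = toy.groupby("Cabin")["Survived"].mean()
p_died = 1 - p_survived

# Plain ratio: "B" divides by zero and becomes inf.
ratio = p_survived / p_died

# Guarded ratio: keep the denominator at or above eps (arbitrary choice).
eps = 1e-3
ratio_safe = p_survived / p_died.clip(lower=eps)

print(ratio)
print(ratio_safe)
```

With the guard, a group where everyone survived gets a large but finite value, and a group where nobody survived still gets 0.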


# Conclusion
- To summarize, encoding categorical data is an unavoidable part of feature engineering. 
- What matters most is choosing a suitable encoding scheme, taking into consideration the dataset we are working with and the model we are going to use.
- Any of the above techniques can be applied, depending on the type of model. 
- Some techniques work better with linear models such as Logistic Regression, and some with non-linear models such as Decision Trees. 
- If the data is nominal and has few categories, one-hot encoding works just fine. 
- If the relationship between a categorical independent variable and the dependent (target) variable is important, Target Guided Ordinal Encoding (Ordered Integer Encoding) can be applied.
- For ordinal categorical data, simple Label Encoding can be used.