## One Hot Encoding - variables with many categories

In [1]:
import pandas as pd
import numpy as np

In [9]:
data = pd.read_csv("Downloads/mercedesbenz.csv", usecols=['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])

In [11]:
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,n,f,d,t,a
1,b,ai,a,d,b,g
2,v,as,f,d,a,j
3,l,n,f,d,z,l
4,s,as,c,d,y,i


In [25]:
# Number of unique categories (labels) in each column
for col in data.columns:
    print(col, ':', len(data[col].unique()), 'labels')

X1 : 27 labels
X2 : 45 labels
X3 : 7 labels
X4 : 4 labels
X5 : 32 labels
X6 : 12 labels


In [26]:
# Let's check how many columns we get after one-hot encoding these variables
# (drop_first=True drops the first category of each variable)
pd.get_dummies(data, drop_first=True).shape

(4209, 121)

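#### Observation: in this scenario we get 121 features, because `drop_first=True` drops the first category of each of the 6 variables (127 labels in total, minus 6). With variables that have many more categories, one-hot encoding could produce thousands of features. So instead we encode only the most frequent categories of each variable.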
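The column count can be checked by hand: with `drop_first=True`, each variable contributes its number of labels minus one. Here that is 27 + 45 + 7 + 4 + 32 + 12 = 127 labels, minus 6 variables, which gives 121. The sketch below shows the same arithmetic on a small made-up DataFrame (`toy` and its values are illustrative only, not taken from the Mercedes-Benz file).

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the column arithmetic
toy = pd.DataFrame({
    "A": ["x", "y", "z", "x"],   # 3 labels
    "B": ["p", "q", "p", "q"],   # 2 labels
})

# Total labels across all columns: 3 + 2 = 5
n_labels = sum(toy[col].nunique() for col in toy.columns)

# drop_first=True removes one dummy column per variable: 5 - 2 = 3
expected_cols = n_labels - toy.shape[1]

encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.shape[1], expected_cols)
```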

In [35]:
# Let's find the 10 most frequent categories of the variable X2
# (value_counts() already sorts in descending order)

data.X2.value_counts().head(10)

as    1658
ae     478
ai     462
m      348
ak     260
r      155
n      113
s      100
f       85
e       84
Name: X2, dtype: int64

In [37]:
# Let's make a list of the 10 most frequent categories of X2

top_10 = list(data.X2.value_counts().head(10).index)
top_10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [39]:
# Now we create the 10 binary variables, one per frequent category
for label in top_10:
    data[label] = np.where(data.X2 == label, 1, 0)

data[['X2'] + top_10].head(10)

Unnamed: 0,X2,as,ae,ai,m,ak,r,n,s,f,e
0,n,0,0,0,0,0,0,1,0,0,0
1,ai,0,0,1,0,0,0,0,0,0,0
2,as,1,0,0,0,0,0,0,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,as,1,0,0,0,0,0,0,0,0,0
5,ai,0,0,1,0,0,0,0,0,0,0
6,ae,0,1,0,0,0,0,0,0,0,0
7,ae,0,1,0,0,0,0,0,0,0,0
8,s,0,0,0,0,0,0,0,1,0,0
9,as,1,0,0,0,0,0,0,0,0,0


In [49]:
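# Now we wrap this in a function so it can be applied to any categorical variable.

def one_hot_top_x(df, variable, top_x_labels):
    # Add one binary column per label, named "<variable>_<label>".
    # Use df (the argument), not the global data, so the function works on any DataFrame.
    for label in top_x_labels:
        df[variable + '_' + label] = np.where(df[variable] == label, 1, 0)

one_hot_top_x(data, 'X2', top_10)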
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,as,ae,ai,m,...,X2_as,X2_ae,X2_ai,X2_m,X2_ak,X2_r,X2_n,X2_s,X2_f,X2_e
0,v,n,f,d,t,a,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,b,ai,a,d,b,g,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
2,v,as,f,d,a,j,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,l,n,f,d,z,l,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,s,as,c,d,y,i,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0


#### After that, we repeat this for every categorical variable and then drop the original X1 to X6 columns.
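A minimal sketch of that last step, assuming we want the top categories computed separately per column. The helper name `one_hot_top_x_all`, its `top_x` parameter, and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def one_hot_top_x_all(df, variables, top_x=10):
    """Hypothetical helper: encode the top_x most frequent labels of each
    variable as binary columns, then drop the original columns."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for variable in variables:
        # value_counts() sorts by frequency, most frequent first
        top_labels = out[variable].value_counts().head(top_x).index
        for label in top_labels:
            out[f"{variable}_{label}"] = np.where(out[variable] == label, 1, 0)
    return out.drop(columns=list(variables))

# Toy data for illustration only
toy = pd.DataFrame({
    "X1": ["a", "a", "b", "c"],
    "X2": ["m", "n", "n", "n"],
})
encoded_toy = one_hot_top_x_all(toy, ["X1", "X2"], top_x=1)
print(encoded_toy)
```

On the real data, the call would look like `one_hot_top_x_all(data, ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])`. Working on a copy is a design choice here; the notebook's own function modifies the DataFrame in place.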

## One Hot Encoding of top categories

#### Advantages
1. Straightforward to implement
2. Does not require hours of variable exploration
3. Does not massively expand the feature space (the number of columns in the dataset)

#### Disadvantages
1. Does not add any information that may make the variable more predictive
2. Does not keep the information of the ignored labels
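The second disadvantage can be seen on a small example: every row whose label is not among the encoded ones ends up with zeros in all indicator columns, so different rare labels become indistinguishable. The sketch below uses made-up data, and `top_labels` is simply assumed rather than computed.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; "zz" and "qq" stand in for rare labels
toy = pd.DataFrame({"X2": ["as", "as", "ae", "zz", "qq"]})
top_labels = ["as", "ae"]  # assumed to be the most frequent labels

for label in top_labels:
    toy["X2_" + label] = np.where(toy["X2"] == label, 1, 0)

# Rows with an ignored label have zeros in every indicator column,
# so "zz" and "qq" can no longer be told apart after encoding.
all_zero = toy[["X2_as", "X2_ae"]].sum(axis=1) == 0
print(toy[all_zero])
```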