## One Hot Encoding - variables with many categories

How do we perform One Hot Encoding on a variable with a lot of categories?

In [1]:
import pandas as pd
import numpy as np

In [8]:
data = pd.read_csv('/Users/suryapratapsingh/Downloads/test.csv',
                usecols = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,n,f,d,t,a
1,b,ai,a,d,b,g
2,v,as,f,d,a,j
3,l,n,f,d,z,l
4,s,as,c,d,y,i


In [11]:
#let's check how many unique categories are present in each column

for col in data.columns:
    print(col, ': ', len(data[col].unique()), 'labels')

X1 :  27 labels
X2 :  45 labels
X3 :  7 labels
X4 :  4 labels
X5 :  32 labels
X6 :  12 labels


In [13]:
#let's check how many columns will get created after applying one hot encoding on these variables
pd.get_dummies(data, drop_first = True).shape

(4209, 121)

- We can see that there are a total of 121 columns after applying One Hot Encoding (with `drop_first = True`).
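The figure of 121 follows directly from the label counts printed earlier: with `drop_first = True`, each variable contributes one dummy column fewer than its number of labels. A quick check of that arithmetic, with the counts copied by hand from the output above (the name `n_labels` is only illustrative):

```python
# Label counts per variable, copied from the uniqueness check above
n_labels = {'X1': 27, 'X2': 45, 'X3': 7, 'X4': 4, 'X5': 32, 'X6': 12}

# drop_first=True drops one dummy per variable,
# so each variable contributes (n - 1) columns
n_dummy_columns = sum(n - 1 for n in n_labels.values())
print(n_dummy_columns)  # 121
```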

##### Suppose the number of columns generated by One Hot Encoding is even higher, say ~500. How do we handle such cases?

### What can we do instead?

- We can limit One Hot Encoding to the 10 most frequent labels of the variable, creating one binary variable for each of those labels only. The number 10 is a choice and can differ depending on how many variables and labels are present.
- All other labels are effectively grouped into a single new category that gets no dummy variable of its own: observations with those labels have 0 in every dummy column.
- Therefore, the 10 new dummy variables indicate whether one of the 10 most frequent labels is present for a particular observation.
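To make the all-zero behaviour for rare labels concrete, here is a minimal sketch on a made-up column. The data, the column name `colour`, and the choice of keeping only 2 labels are all assumptions for illustration and are not part of the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical data, not the notebook's dataset)
toy = pd.DataFrame({'colour': ['red', 'red', 'blue', 'red', 'blue', 'green', 'pink']})

# Keep only the 2 most frequent labels: red (3 rows) and blue (2 rows)
top_2 = toy['colour'].value_counts().head(2).index.tolist()

# One binary column per retained label
for label in top_2:
    toy['colour_' + label] = np.where(toy['colour'] == label, 1, 0)

# Rows whose label was not retained have 0 in every dummy column
rare_rows = toy[~toy['colour'].isin(top_2)]
print(rare_rows)
```

The `green` and `pink` rows become indistinguishable after encoding, which is the price paid for the smaller feature space.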

In [18]:
# Let's see how we can implement this in python

data.X2.value_counts().sort_values(ascending = False).head(20)

X2
as    1658
ae     478
ai     462
m      348
ak     260
r      155
n      113
s      100
f       85
e       84
ay      78
aq      72
a       44
b       38
k       25
t       25
ag      23
ac      20
ao      19
i       15
Name: count, dtype: int64

In [22]:
#let's create a list with the most frequent categories of the variable

top_10 = [x for x in data.X2.value_counts().sort_values(ascending = False).head(10).index]
top_10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [26]:
#let's create the 10 binary variables

for label in top_10:
    data[label] = np.where(data['X2']==label, 1, 0)

data[['X2']+top_10].head(10)

Unnamed: 0,X2,as,ae,ai,m,ak,r,n,s,f,e
0,n,0,0,0,0,0,0,1,0,0,0
1,ai,0,0,1,0,0,0,0,0,0,0
2,as,1,0,0,0,0,0,0,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,as,1,0,0,0,0,0,0,0,0,0
5,ai,0,0,1,0,0,0,0,0,0,0
6,ae,0,1,0,0,0,0,0,0,0,0
7,ae,0,1,0,0,0,0,0,0,0,0
8,s,0,0,0,0,0,0,0,1,0,0
9,as,1,0,0,0,0,0,0,0,0,0


In [27]:
#now, we need to implement this case across all categorical variables

def one_hot_top_x(df, variable, top_x_labels):
    # create one dummy variable per label in top_x_labels (modifies df in place)
    for label in top_x_labels:
        df[variable+'_'+label] = np.where(df[variable]==label, 1, 0)
        
#read the data again
data = pd.read_csv('/Users/suryapratapsingh/Downloads/test.csv',
                usecols = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'])

#encode X2 for the top 10 most frequent labels
one_hot_top_x(data, 'X2', top_10)
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X2_as,X2_ae,X2_ai,X2_m,X2_ak,X2_r,X2_n,X2_s,X2_f,X2_e
0,v,n,f,d,t,a,0,0,0,0,0,0,1,0,0,0
1,b,ai,a,d,b,g,0,0,1,0,0,0,0,0,0,0
2,v,as,f,d,a,j,1,0,0,0,0,0,0,0,0,0
3,l,n,f,d,z,l,0,0,0,0,0,0,1,0,0,0
4,s,as,c,d,y,i,1,0,0,0,0,0,0,0,0,0


In [28]:
# let's encode the remaining categorical variables
for variable in ['X1', 'X3', 'X4', 'X5', 'X6']:
    # top 10 most frequent labels of this variable (fewer if it has under 10 labels)
    top_10 = [x for x in data[variable].value_counts().sort_values(ascending = False).head(10).index]
    one_hot_top_x(data, variable, top_10)

In [30]:
#dropping the original columns
data.drop(columns = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'], inplace = True)

In [31]:
data.head()

Unnamed: 0,X2_as,X2_ae,X2_ai,X2_m,X2_ak,X2_r,X2_n,X2_s,X2_f,X2_e,...,X6_g,X6_j,X6_d,X6_i,X6_l,X6_h,X6_a,X6_k,X6_c,X6_f
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [33]:
data.columns

Index(['X2_as', 'X2_ae', 'X2_ai', 'X2_m', 'X2_ak', 'X2_r', 'X2_n', 'X2_s',
       'X2_f', 'X2_e', 'X1_aa', 'X1_s', 'X1_l', 'X1_b', 'X1_v', 'X1_r', 'X1_i',
       'X1_a', 'X1_c', 'X1_o', 'X3_c', 'X3_f', 'X3_a', 'X3_d', 'X3_g', 'X3_e',
       'X3_b', 'X4_d', 'X4_b', 'X4_a', 'X4_c', 'X5_v', 'X5_r', 'X5_p', 'X5_w',
       'X5_af', 'X5_ad', 'X5_ac', 'X5_n', 'X5_l', 'X5_s', 'X6_g', 'X6_j',
       'X6_d', 'X6_i', 'X6_l', 'X6_h', 'X6_a', 'X6_k', 'X6_c', 'X6_f'],
      dtype='object')
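X3 and X4 have fewer than 10 labels (7 and 4), so `head(10)` simply returns all of them and those two variables end up fully one hot encoded. The expected number of columns can be checked by hand. The counts below are copied from the earlier output, and `top_x` is an illustrative name:

```python
# Label counts per variable, copied from the uniqueness check earlier
n_labels = {'X1': 27, 'X2': 45, 'X3': 7, 'X4': 4, 'X5': 32, 'X6': 12}
top_x = 10  # maximum number of labels kept per variable

# Each variable contributes at most top_x columns
n_columns = sum(min(n, top_x) for n in n_labels.values())
print(n_columns)  # 51
```

That matches the 51 column names listed above, compared with 121 columns for the full encoding.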

# Note:
   - In this particular case we limit the number of most frequent labels to 10, i.e., we consider only the top 10 most frequent labels per categorical variable.
   - This limit can differ from dataset to dataset and should be chosen using domain knowledge.

# Advantages:
 - Straightforward to implement
 - Avoids spending hours on variable exploration
 - Does not massively expand the feature space
 
# Disadvantages:
 - Does not add any information to the data that makes the variable more predictive
 - Does not retain the information carried by the ignored (less frequent) labels
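How much information is lost can be roughly quantified as the share of rows whose label falls outside the top 10. A small sketch for X2, using the counts from the `value_counts()` output and the row count from `.shape` shown earlier (both copied by hand; the variable names are illustrative):

```python
# Counts of the 10 most frequent X2 labels, copied from the value_counts() output
top_10_counts = [1658, 478, 462, 348, 260, 155, 113, 100, 85, 84]
n_rows = 4209  # number of rows reported by .shape

kept = sum(top_10_counts)          # rows that keep their own dummy column
coverage = kept / n_rows
print(f"{coverage:.1%} of rows keep their X2 label")       # 88.9%
print(f"{n_rows - kept} rows fall into the ignored group")  # 466
```

For X2, the remaining 466 rows are spread over the other 35 labels and all receive the same all-zero encoding. Running the same check for other values of the limit is one simple way to judge whether 10 is a sensible choice for a given variable.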