## One-Hot Encoding - For Multiple Features

In [23]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [24]:
# Read the Mercedes-Benz data

df = pd.read_csv('D:\\CS50\\projects data set\\mercedes benz\\train.csv', usecols=['X1','X2','X3','X4','X5','X6'])
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [25]:
for col in df.columns:
    print(col, ':', len(df[col].unique()), 'labels')

X1 : 27 labels
X2 : 44 labels
X3 : 7 labels
X4 : 4 labels
X5 : 29 labels
X6 : 12 labels


In [26]:
# let's check how many columns we obtain after one hot encoding these variables

pd.get_dummies(df, drop_first=True).shape

(4209, 117)

We can see that from just 6 existing categorical variables, we end up with 117 new variables.

These numbers are still not huge, and in practice we could work with them relatively easily. However, in business datasets and also other Kaggle or KDD datasets, it is not unusual to find several categorical variables with many labels. If we use one hot encoding on them, we end up with datasets with thousands of columns, which increases the dimensionality and can hurt the performance of the algorithm.
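
The column count follows directly from the label counts: with `drop_first=True`, each variable contributes one column fewer than its number of labels, so here 26 + 43 + 6 + 3 + 28 + 11 = 117. A minimal sketch on a toy DataFrame (the `colour` and `size` columns are made up for illustration, not taken from the Mercedes-Benz data) shows the same arithmetic:

```python
import pandas as pd

# Toy data (hypothetical values, not the Mercedes-Benz dataset)
toy = pd.DataFrame({
    "colour": ["red", "blue", "green", "red"],
    "size": ["S", "M", "S", "L"],
})

# With drop_first=True each variable contributes (number of labels - 1) columns
expected_columns = sum(toy[col].nunique() - 1 for col in toy.columns)
encoded = pd.get_dummies(toy, drop_first=True)

# Both numbers should agree
print(expected_columns, encoded.shape[1])
```

The same sum applied to a dataset with a few high-cardinality variables quickly reaches hundreds or thousands of columns.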

What can we do instead?

In the winning solution of the KDD 2009 cup: "Winning the KDD Cup Orange Challenge with Ensemble Selection" (http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf), the authors limit one hot encoding to the 10 most frequent labels of the variable. This means that they would make one binary variable for each of the 10 most frequent labels only. This is equivalent to grouping all the other labels under a new category, that in this case will be dropped. Thus, the 10 new dummy variables indicate if one of the 10 most frequent labels is present (1) or not (0) for a particular observation.

How can we do that in Python?


In [27]:
# let's find the top 10 most frequent categories for the variable X2

df.X2.value_counts().sort_values(ascending=False).head(10)

as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
Name: X2, dtype: int64

In [28]:
# let's make a list with the most frequent categories of the variable

# the list comprehension ('x for x') turns the index of labels into a plain Python list
# a list is needed because a later cell concatenates it with ['X2'] to select columns

## SAMPLE ENCODING FOR X2 ALONE

In [29]:
top_10 = [x for x in df.X2.value_counts().sort_values(ascending=False).head(10).index]
top_10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [30]:
# df['as'] = np.where(df['X2'] == 'as',1,0)

In [31]:
# and now we make the 10 binary variables

for label in top_10:
    df[label] = np.where(df['X2'] == label, 1, 0)  # 1 where X2 equals the label, 0 otherwise

df[['X2'] + top_10].head(10)

Unnamed: 0,X2,as,ae,ai,m,ak,r,n,s,f,e
0,at,0,0,0,0,0,0,0,0,0,0
1,av,0,0,0,0,0,0,0,0,0,0
2,n,0,0,0,0,0,0,1,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,n,0,0,0,0,0,0,1,0,0,0
5,e,0,0,0,0,0,0,0,0,0,1
6,e,0,0,0,0,0,0,0,0,0,1
7,as,1,0,0,0,0,0,0,0,0,0
8,as,1,0,0,0,0,0,0,0,0,0
9,aq,0,0,0,0,0,0,0,0,0,0


Only the top ten labels get their own binary column.

Observations whose label falls outside the top ten (for example 'at', 'av' and 'aq' in rows 0, 1 and 9 above) have 0 in all ten columns.
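
The all-zero rows are the "grouped and dropped" category mentioned earlier. As a rough check of that equivalence, the sketch below builds the binary columns both ways on a toy Series (the values and the `"other"` label are assumptions made for illustration) and compares the results:

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values)
s = pd.Series(["a", "a", "a", "b", "b", "c", "d"], name="X")
top_2 = s.value_counts().head(2).index.tolist()

# Approach 1: one binary column per frequent label
direct = pd.DataFrame({label: np.where(s == label, 1, 0) for label in top_2})

# Approach 2: group rare labels under "other", encode, then drop "other"
grouped = s.where(s.isin(top_2), "other")
via_other = pd.get_dummies(grouped).drop(columns="other")[top_2].astype(int)

# True when both approaches produce identical 0/1 values
print((direct.to_numpy() == via_other.to_numpy()).all())
```

Rows holding a rare label sum to 0 across the binary columns in both versions.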

## ENCODING THE WHOLE DATA 
##### i.e., creating dummy variables for categorical data


In [32]:
# read the data again
df = pd.read_csv('D:\\CS50\\projects data set\\mercedes benz\\train.csv', usecols=['X1','X2','X3','X4','X5','X6'])
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [34]:
# Top ten categories in each feature

for col in df.columns:
    # use a distinct loop name so it is not shadowed inside the expression
    print('top_10_' + col, '=', df[col].value_counts().sort_values(ascending=False).head(10).index.tolist())


top_10_X1 = ['aa', 's', 'b', 'l', 'v', 'r', 'i', 'a', 'c', 'o']
top_10_X2 = ['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']
top_10_X3 = ['c', 'f', 'a', 'd', 'g', 'e', 'b']
top_10_X4 = ['d', 'a', 'b', 'c']
top_10_X5 = ['w', 'v', 'q', 'r', 's', 'd', 'n', 'p', 'm', 'i']
top_10_X6 = ['g', 'j', 'd', 'i', 'l', 'a', 'h', 'k', 'c', 'b']


In [35]:
# get the whole set of dummy variables, for all the categorical variables

# function to create the dummy variables for the most frequent labels
# we can also vary the number of most frequent labels that we encode (e.g. top 10 or top 5)


def one_hot_encode_features(data, feature, top_10_labels):
    # add one binary column per label, named <feature>_<label>, to `data` in place
    for label in top_10_labels:
        data[feature + '_' + label] = np.where(data[feature] == label, 1, 0)


In [36]:

# capture the original feature names before new columns are added
features = list(df.columns)

for feature in features:
    # look up the top 10 labels of this feature and pass the list itself
    top_10_labels = df[feature].value_counts().sort_values(ascending=False).head(10).index.tolist()
    one_hot_encode_features(df, feature, top_10_labels)

df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X1_aa,X1_s,X1_b,X1_l,...,X6_g,X6_j,X6_d,X6_i,X6_l,X6_a,X6_h,X6_k,X6_c,X6_b
0,v,at,a,d,u,j,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


One Hot encoding of top variables

Advantages

- Straightforward to implement
- Does not require hours of variable exploration
- Does not massively expand the feature space (number of columns in the dataset)

Disadvantages

- Does not add any information that may make the variable more predictive
- Does not keep the information of the ignored labels

Categorical variables often have a few dominating categories, while the remaining labels add mostly noise. For that reason, this simple and straightforward approach may be useful on many occasions.
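
One way to judge whether a few categories really dominate is to measure the share of observations that the top labels cover. In the X2 counts shown earlier, the top 10 labels account for 3754 of 4209 rows, roughly 89%. A minimal sketch of that check on a toy Series (hypothetical values, with `coverage` as an illustrative name):

```python
import pandas as pd

# Toy column (hypothetical values): one dominant label and a few rare ones
s = pd.Series(["as"] * 6 + ["ae"] * 2 + ["m", "n"])

# Labels that would receive their own binary column
top_2 = s.value_counts().head(2).index

# Fraction of rows whose label is among the top labels
coverage = s.isin(top_2).mean()
print(coverage)
```

A low coverage value would suggest that many observations end up as all-zero rows after encoding.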

It is worth noting that the choice of the top 10 labels is arbitrary. You could also choose the top 5, or the top 20.
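
Since the cut-off is a free choice, it can be exposed as a parameter. The sketch below is one possible variant, not the notebook's own function: `encode_top_labels` and `top_n` are hypothetical names, and it returns a copy instead of modifying the input.

```python
import numpy as np
import pandas as pd


def encode_top_labels(data, feature, top_n=10):
    """Return a copy of `data` with one binary column for each of the
    `top_n` most frequent labels of `feature`."""
    out = data.copy()
    top_labels = out[feature].value_counts().head(top_n).index.tolist()
    for label in top_labels:
        out[f"{feature}_{label}"] = np.where(out[feature] == label, 1, 0)
    return out


# Toy data (hypothetical values)
toy = pd.DataFrame({"X2": ["as", "as", "as", "ae", "ae", "m", "n"]})
encoded = encode_top_labels(toy, "X2", top_n=2)
print(encoded.columns.tolist())
```

Changing `top_n` to 5 or 20 only changes how many binary columns are created; the rest of the procedure stays the same.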

This encoding was more than enough for the team to win the KDD 2009 cup. They also did some other powerful feature engineering, as we will see in following lectures, that improved the performance of the variables dramatically.