#### One Hot Encoding - variables with many categories

##### Advantages
- Straightforward to implement
- Does not require hours of variable exploration
- Does not massively expand the feature space (number of columns in the dataset)

##### Disadvantages
- Does not add any information that may make the variables more predictive
- Does not keep the information of the ignored labels

Categorical variables often have a few dominant categories, while the remaining labels add mostly noise. Limiting one hot encoding to the most frequent labels is therefore a simple approach that can be useful on many occasions.

Note that the choice of the top 10 labels is arbitrary. You could also choose the top 5 or the top 20.

The team that won the KDD Cup 2009 used this approach.
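One way to pick the cutoff is to check what share of the observations the k most frequent labels cover. The sketch below uses a small toy variable (hypothetical data, not the dataset used in this notebook) and the names `s`, `freq` and `coverage` are illustrative only.

```python
import pandas as pd

# Toy categorical variable (hypothetical data, not the Mercedes-Benz dataset)
s = pd.Series(['a'] * 50 + ['b'] * 30 + ['c'] * 10 + ['d'] * 5 + ['e'] * 3 + ['f'] * 2)

# Relative frequency of each label, most frequent first
freq = s.value_counts(normalize=True)

# Cumulative share of observations covered by the k most frequent labels
coverage = freq.cumsum()

for k in (1, 3, 5):
    print(k, round(coverage.iloc[k - 1], 2))
```

If the coverage curve flattens early, a small k probably loses little information; if it keeps rising, a larger k may be worth the extra columns.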

In [2]:
import pandas as pd
import numpy as np

In [4]:
data = pd.read_csv('../../datasets/mercedesbenz.csv', usecols=['X1','X2','X3','X4','X5','X6'])
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
0,v,at,a,d,u,j
1,t,av,e,d,y,l
2,w,n,c,d,x,j
3,t,n,f,d,x,l
4,v,n,f,d,h,d


In [4]:
for col in data.columns:
    print(col, ' : ', len(data[col].unique()), ' labels')

X1  :  27  labels
X2  :  44  labels
X3  :  7  labels
X4  :  4  labels
X5  :  29  labels
X6  :  12  labels


In [7]:
# Let's examine how many columns we will obtain after one hot encoding these variables
pd.get_dummies(data, drop_first=True).shape

(4209, 117)
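The column count can be checked by hand: with `drop_first=True`, a variable with k labels contributes k - 1 dummy columns. A small sketch using the label counts printed above (the dictionary `n_labels` is just a hand-copied summary, so no dataset is needed):

```python
# Label counts per variable, copied from the output above
n_labels = {'X1': 27, 'X2': 44, 'X3': 7, 'X4': 4, 'X5': 29, 'X6': 12}

# drop_first=True removes one dummy per variable, so each contributes k - 1 columns
n_dummies = sum(k - 1 for k in n_labels.values())
print(n_dummies)
```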

Next, we limit one hot encoding to the 10 most frequent labels of the variable: we create one binary variable for each of these 10 labels only. This is equivalent to grouping all the other labels under a new category, which in this case is dropped. The 10 new dummy variables thus indicate whether one of the most frequent labels is present (1) or not (0) for a particular observation.
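The equivalence with grouping can be illustrated on a toy variable. This is a minimal sketch on hypothetical data; the label `'other'` and the variable names are assumptions made for the example, and only the top 2 labels are kept to keep it short.

```python
import pandas as pd

# Toy variable (hypothetical data); keep only the 2 most frequent labels
s = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'd'], name='var')
top = s.value_counts().head(2).index

# Approach 1: one dummy per top label
direct = pd.DataFrame({label: (s == label).astype(int) for label in top})

# Approach 2: group all other labels under 'other', one hot encode, drop 'other'
grouped = s.where(s.isin(top), 'other')
via_group = pd.get_dummies(grouped).drop(columns='other').astype(int)[list(top)]

# Both approaches should produce the same dummy values
print((direct.to_numpy() == via_group.to_numpy()).all())
```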

In [9]:
data.X2.value_counts().sort_values(ascending=False).head(20)

X2
as    1659
ae     496
ai     415
m      367
ak     265
r      153
n      137
s       94
f       87
e       81
aq      63
ay      54
a       47
t       29
k       25
i       25
b       21
ao      20
z       19
ag      19
Name: count, dtype: int64

In [10]:
top10 = list(data.X2.value_counts().sort_values(ascending=False).head(10).index)
top10

['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

In [11]:
for label in top10:
    data[label] = np.where(data['X2']==label, 1, 0)

data[['X2']+top10].head(10)

Unnamed: 0,X2,as,ae,ai,m,ak,r,n,s,f,e
0,at,0,0,0,0,0,0,0,0,0,0
1,av,0,0,0,0,0,0,0,0,0,0
2,n,0,0,0,0,0,0,1,0,0,0
3,n,0,0,0,0,0,0,1,0,0,0
4,n,0,0,0,0,0,0,1,0,0,0
5,e,0,0,0,0,0,0,0,0,0,1
6,e,0,0,0,0,0,0,0,0,0,1
7,as,1,0,0,0,0,0,0,0,0,0
8,as,1,0,0,0,0,0,0,0,0,0
9,aq,0,0,0,0,0,0,0,0,0,0


In [12]:
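def one_hot_top_x(df, variable, top_x_labels):
    # Add one binary column per label in top_x_labels (modifies df in place)
    for label in top_x_labels:
        # Use the df argument rather than the global data, so the function works on any dataframe
        df[variable+'_'+label] = np.where(df[variable]==label, 1, 0)

one_hot_top_x(data, 'X2', top10)
data.head()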

Unnamed: 0,X1,X2,X3,X4,X5,X6,as,ae,ai,m,...,X2_as,X2_ae,X2_ai,X2_m,X2_ak,X2_r,X2_n,X2_s,X2_f,X2_e
0,v,at,a,d,u,j,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,t,av,e,d,y,l,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,w,n,c,d,x,j,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,t,n,f,d,x,l,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,v,n,f,d,h,d,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


#### Mean / Target Encoding

In [None]:
df = pd.read_csv('../../datasets/loanPrediction.csv')
df.head()
# Loan status is the target variable

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [6]:
df['Loan_Status'] = df['Loan_Status'].map({'Y': 1, 'N': 0})
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,1
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,0
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,1
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,1
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,1


In [None]:
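from category_encoders import TargetEncoder

cols = ['Gender','Dependents']
target = 'Loan_Status'
for col in cols:
    # Fit one encoder per column: each label is replaced by a (smoothed) mean of the target
    te = TargetEncoder()
    te.fit(X=df[col], y=df[target])
    values = te.transform(df[col])
    # The encoded column keeps the name of the original one, so df ends up with
    # duplicate 'Gender' and 'Dependents' columns (shown as Gender.1 and Dependents.1 below)
    df = pd.concat([df,values], axis=1)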

In [8]:
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,Gender.1,Dependents.1
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,1,0.693252,0.689855
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,0,0.693252,0.64707
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,1,0.693252,0.689855
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,1,0.693252,0.689855
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,1,0.693252,0.689855


In [10]:
df.sample(frac=1).head(10)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,Gender.1,Dependents.1
457,LP002467,Male,Yes,0,Graduate,No,3708,2569.0,173.0,360.0,1.0,Urban,0,0.693252,0.689855
453,LP002449,Male,Yes,0,Graduate,No,2483,2466.0,90.0,180.0,0.0,Rural,1,0.693252,0.689855
161,LP001562,Male,Yes,0,Graduate,No,7933,0.0,275.0,360.0,1.0,Urban,0,0.693252,0.689855
179,LP001630,Male,No,0,Not Graduate,No,2333,1451.0,102.0,480.0,0.0,Urban,0,0.693252,0.689855
54,LP001186,Female,Yes,1,Graduate,Yes,11500,0.0,286.0,360.0,0.0,Urban,0,0.669645,0.64707
364,LP002180,Male,No,0,Graduate,Yes,6822,0.0,141.0,360.0,1.0,Rural,1,0.693252,0.689855
488,LP002555,Male,Yes,2,Graduate,Yes,4583,2083.0,160.0,360.0,1.0,Semiurban,1,0.693252,0.752455
400,LP002288,Male,Yes,2,Not Graduate,No,2889,0.0,45.0,180.0,0.0,Urban,0,0.693252,0.752455
207,LP001698,Male,No,0,Not Graduate,No,3975,2531.0,55.0,360.0,1.0,Rural,1,0.693252,0.689855
323,LP002055,Female,No,0,Graduate,No,3166,2985.0,132.0,360.0,,Rural,1,0.669645,0.689855
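For intuition, plain mean encoding replaces each label with the mean of the target for that label. The sketch below does this with a group-by on toy data (hypothetical values, not the loan dataset); the column name `Gender_mean_enc` is an assumption for the example. `TargetEncoder` from `category_encoders` additionally smooths each value toward the overall target mean, so its output can differ slightly from the raw group means.

```python
import pandas as pd

# Toy data (hypothetical values, not the loan dataset)
toy = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Female', 'Female'],
    'Loan_Status': [1, 1, 0, 1, 0],
})

# Mean of the target per category
means = toy.groupby('Gender')['Loan_Status'].mean()

# Replace each category by its target mean
toy['Gender_mean_enc'] = toy['Gender'].map(means)
print(toy)
```

In a modelling setting, the mapping would typically be learned on the training set only and then applied to the test set, so that the target of the test rows does not leak into the feature.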


#### Label Encoder

In [13]:
iris = pd.read_csv('../../datasets/Iris.csv')
iris.head()
# Species is the target

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [15]:
iris.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [17]:
iris.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

In [20]:
iris['Species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [21]:
iris['Species'].value_counts()

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

In [24]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
print(label_encoder)

LabelEncoder()


In [25]:
iris['Species'] = label_encoder.fit_transform(iris['Species'])

In [27]:
iris.sample(frac=1).head(10)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
147,148,6.5,3.0,5.2,2.0,2
86,87,6.7,3.1,4.7,1.5,1
146,147,6.3,2.5,5.0,1.9,2
46,47,5.1,3.8,1.6,0.2,0
74,75,6.4,2.9,4.3,1.3,1
11,12,4.8,3.4,1.6,0.2,0
68,69,6.2,2.2,4.5,1.5,1
87,88,6.3,2.3,4.4,1.3,1
136,137,6.3,3.4,5.6,2.4,2
31,32,5.4,3.4,1.5,0.4,0


In [28]:
iris.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species            int64
dtype: object

In [30]:
iris['Species'].value_counts()

Species
0    50
1    50
2    50
Name: count, dtype: int64
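To see which integer stands for which species, and to map the codes back, `LabelEncoder` exposes the `classes_` attribute and the `inverse_transform` method. A minimal sketch on a short list of species names (the list `species` is made up for the example, using the same labels as the Iris data):

```python
from sklearn.preprocessing import LabelEncoder

# Small sample of species names (same labels as in the Iris data)
species = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

le = LabelEncoder()
codes = le.fit_transform(species)

# classes_ holds the original labels in sorted order; the position of each label is its code
print(list(le.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original labels
print(list(le.inverse_transform(codes)))
```

Because the codes follow the sorted order of the labels, they carry no meaningful ranking; this is fine for a target variable such as `Species`, but may mislead some models if applied to input features.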