# Encoding Categorical Features

Here we will cover three different ways of encoding categorical features:

1. LabelEncoder and OneHotEncoder

2. DictVectorizer

3. Pandas get_dummies


For most cases, pandas get_dummies should be the first choice. But if the number of categorical features is huge, DictVectorizer is a good choice because it supports sparse matrix output.

In [1]:
import pandas as pd
import numpy as np

In [58]:
# load data
df = pd.read_csv('input/kidney.csv', header=None, 
 names=['age', 'bp', 'sg', 'al', 'su', 'rb', 'pc', 'pcc', 'ba', 'bgr', 'bu', 'sc', 'sod', 'pot', 
 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane', 'class'])

In [59]:
df.head(10)

Unnamed: 0,age,bp,sg,al,su,rb,pc,pcc,ba,bgr,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,class
id,'age','bp','sg','al','su','rbc','pc','pcc','ba','bgr',...,'pcv','wbcc','rbcc','htn','dm','cad','appet','pe','ane','class'
1,48,80,1.020,1,0,?,normal,notpresent,notpresent,121,...,44,7800,5.2,yes,yes,no,good,no,no,ckd
2,7,50,1.020,4,0,?,normal,notpresent,notpresent,?,...,38,6000,?,no,no,no,good,no,no,ckd
3,62,80,1.010,2,3,normal,normal,notpresent,notpresent,423,...,31,7500,?,no,yes,no,poor,no,yes,ckd
4,48,70,1.005,4,0,normal,abnormal,present,notpresent,117,...,32,6700,3.9,yes,no,no,poor,yes,yes,ckd
5,51,80,1.010,2,0,normal,normal,notpresent,notpresent,106,...,35,7300,4.6,no,no,no,good,no,no,ckd
6,60,90,1.015,3,0,?,?,notpresent,notpresent,74,...,39,7800,4.4,yes,yes,no,good,yes,no,ckd
7,68,70,1.010,0,0,?,normal,notpresent,notpresent,100,...,36,?,?,no,no,no,good,no,no,ckd
8,24,?,1.015,2,4,normal,abnormal,notpresent,notpresent,410,...,44,6900,5,no,yes,no,good,yes,no,ckd
9,52,100,1.015,3,0,normal,abnormal,present,notpresent,138,...,33,9600,4.0,yes,yes,no,good,no,yes,ckd


In [30]:
# load data
df = pd.read_csv('input/kidney.csv', sep='\t')
df.head()

Unnamed: 0,"id,'age','bp','sg','al','su','rbc','pc','pcc','ba','bgr','bu','sc','sod','pot','hemo','pcv','wbcc','rbcc','htn','dm','cad','appet','pe','ane','class'"
0,"1,48,80,1.020,1,0,?,normal,notpresent,notprese..."
1,"2,7,50,1.020,4,0,?,normal,notpresent,notpresen..."
2,"3,62,80,1.010,2,3,normal,normal,notpresent,not..."
3,"4,48,70,1.005,4,0,normal,abnormal,present,notp..."
4,"5,51,80,1.010,2,0,normal,normal,notpresent,not..."


In [88]:
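# skip malformed rows and report them
# (pandas >= 1.3 uses on_bad_lines; older versions used error_bad_lines=False)
df = pd.read_csv('input/kidney.csv', header=0, on_bad_lines='warn')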

b'Skipping line 71: expected 26 fields, saw 27\nSkipping line 74: expected 26 fields, saw 27\nSkipping line 371: expected 26 fields, saw 27\n'


In [89]:
df.head(10)

Unnamed: 0,id,'age','bp','sg','al','su','rbc','pc','pcc','ba',...,'pcv','wbcc','rbcc','htn','dm','cad','appet','pe','ane','class'
0,1,48,80,1.02,1,0,?,normal,notpresent,notpresent,...,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,2,7,50,1.02,4,0,?,normal,notpresent,notpresent,...,38,6000,?,no,no,no,good,no,no,ckd
2,3,62,80,1.01,2,3,normal,normal,notpresent,notpresent,...,31,7500,?,no,yes,no,poor,no,yes,ckd
3,4,48,70,1.005,4,0,normal,abnormal,present,notpresent,...,32,6700,3.9,yes,no,no,poor,yes,yes,ckd
4,5,51,80,1.01,2,0,normal,normal,notpresent,notpresent,...,35,7300,4.6,no,no,no,good,no,no,ckd
5,6,60,90,1.015,3,0,?,?,notpresent,notpresent,...,39,7800,4.4,yes,yes,no,good,yes,no,ckd
6,7,68,70,1.01,0,0,?,normal,notpresent,notpresent,...,36,?,?,no,no,no,good,no,no,ckd
7,8,24,?,1.015,2,4,normal,abnormal,notpresent,notpresent,...,44,6900,5,no,yes,no,good,yes,no,ckd
8,9,52,100,1.015,3,0,normal,abnormal,present,notpresent,...,33,9600,4.0,yes,yes,no,good,no,yes,ckd
9,10,53,90,1.02,2,0,abnormal,abnormal,present,notpresent,...,29,12100,3.7,yes,yes,no,poor,no,yes,ckd
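The output above shows two quirks of the file: the column names keep their single quotes, and missing values appear as '?'. A minimal sketch of one way to handle both at load time follows. It assumes a small in-memory sample (hypothetical rows that only mimic the layout of kidney.csv), and the name `sample_df` is illustrative.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics the layout of kidney.csv:
# single-quoted header names and '?' for missing values.
sample = io.StringIO(
    "id,'age','bp','rbc','class'\n"
    "1,48,80,?,ckd\n"
    "2,7,50,?,ckd\n"
    "3,62,80,normal,ckd\n"
)

# quotechar="'" strips the single quotes around the header names;
# na_values='?' turns the '?' placeholders into NaN.
sample_df = pd.read_csv(sample, quotechar="'", na_values='?')

print(sample_df.columns.tolist())
print(sample_df.dtypes)
```

With the placeholders parsed as NaN, purely numeric columns such as age keep a numeric dtype instead of being read as text.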


# LabelEncoder & OneHotEncoder



LabelEncoder and OneHotEncoder only work on categorical features, so we first need to extract the categorical features using a boolean mask.

In [64]:
# Categorical boolean mask
categorical_feature_mask = df.dtypes==object

# filter categorical columns using mask and turn it into a list
categorical_cols = df.columns[categorical_feature_mask].tolist()
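One caveat with a dtype-based mask: a numeric column that contains a text placeholder such as '?' is read as text and would be selected as categorical. The sketch below shows one possible way to guard against that, assuming you know which columns are meant to be numeric. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: 'age' is numeric in intent, but '?' forces a text dtype
toy = pd.DataFrame({
    'age': ['48', '7', '?'],
    'pc': ['normal', 'normal', 'abnormal'],
})

# Coerce the column known to be numeric; values that cannot be parsed become NaN
toy['age'] = pd.to_numeric(toy['age'], errors='coerce')

# Keep only the non-numeric columns as categorical candidates
toy_categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(toy_categorical_cols)
```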

LabelEncoder converts each class of a feature to a numerical value. Let’s go through the steps to see how to do it.

Instantiate a LabelEncoder object:

In [65]:
# import labelencoder
from sklearn.preprocessing import LabelEncoder

# instantiate labelencoder object
le = LabelEncoder()

In [66]:
# Apply LabelEncoder on each of the categorical columns:

# apply le on categorical feature columns
df[categorical_cols] = df[categorical_cols].apply(lambda col: le.fit_transform(col))

df[categorical_cols].head(10)

Unnamed: 0,'age','bp','sg','al','su','rbc','pc','pcc','ba','bgr',...,'pcv','wbcc','rbcc','htn','dm','cad','appet','pe','ane','class'
0,35,8,3,1,0,0,2,1,1,21,...,30,69,33,2,2,1,1,1,1,0
1,59,5,3,4,0,0,2,1,1,145,...,24,53,48,1,1,1,1,1,1,0
2,51,8,1,2,3,2,2,1,1,113,...,17,67,48,1,2,1,2,1,2,0
3,35,7,0,4,0,2,1,2,1,17,...,18,59,18,2,1,1,2,2,2,0
4,39,8,1,2,0,2,2,1,1,6,...,21,65,26,1,1,1,1,1,1,0
5,49,9,2,3,0,0,0,1,1,120,...,25,69,24,2,2,1,1,2,1,0
6,57,7,1,0,0,0,2,1,1,0,...,22,89,48,1,1,1,1,1,1,0
7,11,10,2,2,4,2,1,1,1,111,...,30,61,30,1,2,1,1,2,1,0
8,40,0,2,3,0,2,1,2,1,35,...,19,85,20,2,2,1,1,1,2,0
9,41,9,3,2,0,1,1,2,1,119,...,15,15,16,2,2,1,2,1,2,0
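To check which integer stands for which class, a fitted LabelEncoder exposes `classes_` and `inverse_transform`. The sketch below uses a hypothetical three-class column (not taken from kidney.csv) and an illustrative encoder named `le_demo`. Note that reusing a single encoder across columns, as in the cell above, refits it each time, so it only remembers the classes of the last column it saw.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical multi-class column, not taken from kidney.csv
appetite = ['good', 'poor', 'fair', 'good', 'fair']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(appetite)

# classes_ lists the classes in sorted order; a class's position is its code
print(le_demo.classes_)
# integer code assigned to each row
print(codes)
# map the codes back to the original labels
print(le_demo.inverse_transform(codes))
```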


As the output shows, LabelEncoder assigns a different integer to each class of a feature. For a binary feature this is harmless, but for a multi-class feature the integers imply an order that does not exist. Suppose a hypothetical feature ‘anyfeature’ has as many as 24 classes, and class_a is encoded as 5 while class_b is encoded as 24. Is class_b ‘greater’ than class_a? Obviously not. A model that treats these codes as magnitudes learns a spurious ordering, which can lead to poor performance. Therefore, for a dataframe containing multi-class features, a further OneHotEncoder step is needed. Let’s see the steps to do it.

In [74]:
# Instantiate OneHotEncoder object:

# import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
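# instantiate OneHotEncoder
# note: categorical_features was removed in scikit-learn 0.22 and sparse was renamed
# sparse_output in 1.2, so this call needs an older scikit-learn release
ohe = OneHotEncoder(categorical_features = categorical_feature_mask, sparse=False ) 
# categorical_features = boolean mask for categorical columns
# sparse = False outputs a dense array, not a sparse matrix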

In [75]:
# Apply OneHotEncoder on DataFrame:

# apply OneHotEncoder on categorical feature columns
X_ohe = ohe.fit_transform(df) # It returns a numpy array


Because the categories were already converted to integers with LabelEncoder above, OneHotEncoder can be applied to the dataframe directly.


In [76]:
X_ohe

array([[  0.,   0.,   0., ...,   1.,   0.,   1.],
       [  0.,   0.,   0., ...,   1.,   0.,   2.],
       [  0.,   0.,   0., ...,   1.,   0.,   3.],
       ...,
       [  0.,   1.,   0., ...,   0.,   1., 398.],
       [  0.,   0.,   0., ...,   0.,   1., 399.],
       [  0.,   0.,   0., ...,   0.,   1., 400.]])

Note that the output is a numpy array, not a dataframe. A new column is created for each class of each categorical feature. For example, ten binary categorical features would produce 20 columns.
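Recent scikit-learn releases no longer accept a column mask on OneHotEncoder itself. The sketch below shows one possible equivalent using ColumnTransformer, which encodes the listed columns and passes the rest through. This is a swapped-in technique, not the notebook's original call, and the frame `toy` and the names `toy_cat_cols` and `ct` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: one numeric column and two categorical columns
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'htn': ['yes', 'no', 'no', 'yes'],
    'appet': ['good', 'good', 'poor', 'good'],
})
toy_cat_cols = ['htn', 'appet']

ct = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), toy_cat_cols)],
    remainder='passthrough',  # keep the non-categorical columns unchanged
    sparse_threshold=0,       # always return a dense numpy array
)
toy_ohe = ct.fit_transform(toy)

# one-hot columns come first, the passed-through 'id' column comes last
print(toy_ohe.shape)
print(toy_ohe)
```

As in the output above, the one-hot columns come first and the untouched numeric column ends up on the right.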


# DictVectorizer

As we can see, LabelEncoder and OneHotEncoder usually need to be used together as a two-step procedure. A more convenient way is to use DictVectorizer, which performs both steps at once.

First, we need to convert the dataframe into a dictionary. This can be done with the pandas to_dict method.

In [84]:
# turn each row of df into a dict of {column: value} pairs
X_dict = df.to_dict(orient='records')

# show X_dict
X_dict

[{'id': 1,
  "'age'": '48',
  "'bp'": '80',
  "'sg'": '1.020',
  "'al'": '1',
  "'su'": '0',
  "'rbc'": '?',
  "'pc'": 'normal',
  "'pcc'": 'notpresent',
  "'ba'": 'notpresent',
  "'bgr'": '121',
  "'bu'": '36',
  "'sc'": '1.2',
  "'sod'": '?',
  "'pot'": '?',
  "'hemo'": '15.4',
  "'pcv'": '44',
  "'wbcc'": '7800',
  "'rbcc'": '5.2',
  "'htn'": 'yes',
  "'dm'": 'yes',
  "'cad'": 'no',
  "'appet'": 'good',
  "'pe'": 'no',
  "'ane'": 'no',
  "'class'": 'ckd'},
 {'id': 2,
  "'age'": '7',
  "'bp'": '50',
  "'sg'": '1.020',
  "'al'": '4',
  "'su'": '0',
  "'rbc'": '?',
  "'pc'": 'normal',
  "'pcc'": 'notpresent',
  "'ba'": 'notpresent',
  "'bgr'": '?',
  "'bu'": '18',
  "'sc'": '0.8',
  "'sod'": '?',
  "'pot'": '?',
  "'hemo'": '11.3',
  "'pcv'": '38',
  "'wbcc'": '6000',
  "'rbcc'": '?',
  "'htn'": 'no',
  "'dm'": 'no',
  "'cad'": 'no',
  "'appet'": 'good',
  "'pe'": 'no',
  "'ane'": 'no',
  "'class'": 'ckd'},
 {'id': 3,
  "'age'": '62',
  "'bp'": '80',
  "'sg'": '1.010',
  "'al'": '2',
 

The orient='records' argument is required to turn the dataframe into {column: value} format. The result is a list of dictionaries, each of which represents one sample. Note that in this case we don’t need to extract the categorical features; we can convert the whole dataframe into a dict. This is one advantage over LabelEncoder and OneHotEncoder.

In [85]:
# Now we instantiate a DictVectorizer:

# DictVectorizer
from sklearn.feature_extraction import DictVectorizer
# instantiate a Dictvectorizer object for X
dv_X = DictVectorizer(sparse=False) 
# sparse = False returns a dense array instead of a sparse matrix

In [86]:
# DictVectorizer fit and transform on the converted dict:

# apply dv_X on X_dict
X_encoded = dv_X.fit_transform(X_dict)

# show X_encoded
X_encoded

array([[  0.,   0.,   0., ...,   0.,   0.,   1.],
       [  0.,   0.,   0., ...,   0.,   0.,   2.],
       [  0.,   0.,   0., ...,   0.,   0.,   3.],
       ...,
       [  0.,   1.,   0., ...,   0.,   0., 398.],
       [  0.,   0.,   0., ...,   0.,   0., 399.],
       [  0.,   0.,   0., ...,   0.,   0., 400.]])
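When there are many categories, the dense array above is mostly zeros. Leaving `sparse` at its default of True makes DictVectorizer return a SciPy sparse matrix instead. A small sketch with hypothetical records (not rows from kidney.csv) and illustrative names `dv_sparse` and `X_sparse`:

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: 'id' is numeric, 'htn' and 'appet' are strings
records = [
    {'id': 1, 'htn': 'yes', 'appet': 'good'},
    {'id': 2, 'htn': 'no', 'appet': 'good'},
    {'id': 3, 'htn': 'no', 'appet': 'poor'},
]

dv_sparse = DictVectorizer()  # sparse=True is the default
X_sparse = dv_sparse.fit_transform(records)

print(sparse.issparse(X_sparse))     # True: only non-zero entries are stored
print(X_sparse.shape, X_sparse.nnz)  # matrix shape and number of stored values
print(X_sparse.toarray())            # dense view, for inspection only
```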

Each row represents a sample and each column represents a feature. To find out which feature each column holds, we can check the vocabulary of this DictVectorizer:

In [87]:
# to check vocabulary
vocab = dv_X.vocabulary_

# show vocab
vocab

{'id': 870,
 "'age'=48": 35,
 "'bp'=80": 246,
 "'sg'=1.020": 735,
 "'al'=1": 77,
 "'su'=0": 773,
 "'rbc'=?": 593,
 "'pc'=normal": 500,
 "'pcc'=notpresent": 502,
 "'ba'=notpresent": 90,
 "'bgr'=121": 113,
 "'bu'=36": 312,
 "'sc'=1.2": 654,
 "'sod'=?": 772,
 "'pot'=?": 592,
 "'hemo'=15.4": 433,
 "'pcv'=44": 534,
 "'wbcc'=7800": 849,
 "'rbcc'=5.2": 629,
 "'htn'=yes": 497,
 "'dm'=yes": 375,
 "'cad'=no": 369,
 "'appet'=good": 87,
 "'pe'=no": 548,
 "'ane'=no": 84,
 "'class'=ckd": 371,
 "'age'=7": 59,
 "'bp'=50": 243,
 "'al'=4": 80,
 "'bgr'=?": 237,
 "'bu'=18": 281,
 "'sc'=0.8": 649,
 "'hemo'=11.3": 389,
 "'pcv'=38": 528,
 "'wbcc'=6000": 833,
 "'rbcc'=?": 644,
 "'htn'=no": 496,
 "'dm'=no": 374,
 "'age'=62": 51,
 "'sg'=1.010": 733,
 "'al'=2": 78,
 "'su'=3": 776,
 "'rbc'=normal": 595,
 "'bgr'=423": 205,
 "'bu'=53": 330,
 "'sc'=1.8": 660,
 "'hemo'=9.6": 490,
 "'pcv'=31": 521,
 "'wbcc'=7500": 847,
 "'appet'=poor": 88,
 "'ane'=yes": 85,
 "'bp'=70": 245,
 "'sg'=1.005": 732,
 "'pc'=abnormal": 499,
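The vocabulary maps each feature name to its column index: string values get one `name=value` column per class, while numeric values such as id keep a single column. To read the encoded array column by column, the mapping can be inverted. A minimal sketch with hypothetical records and illustrative names `dv_demo` and `index_to_name`:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: 'id' is numeric, 'htn' and 'appet' are strings
records = [
    {'id': 1, 'htn': 'yes', 'appet': 'good'},
    {'id': 2, 'htn': 'no', 'appet': 'good'},
    {'id': 3, 'htn': 'no', 'appet': 'poor'},
]

dv_demo = DictVectorizer(sparse=False)
encoded = dv_demo.fit_transform(records)

# vocabulary_ maps feature name -> column index; invert it to get index -> name
index_to_name = {idx: name for name, idx in dv_demo.vocabulary_.items()}
for idx in sorted(index_to_name):
    print(idx, index_to_name[idx])

# feature_names_ holds the same names, already ordered by column index
print(dv_demo.feature_names_)
```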
 

# Get Dummies

The pandas get_dummies function is a straightforward one-step way to create dummy variables for categorical features. Its advantage is that you can apply it directly to the dataframe: it detects the categorical columns and encodes only those. Here is how to do it:

In [90]:
# Get dummies
X = pd.get_dummies(df, prefix_sep='_', drop_first=True)

# X head
X.head()

Unnamed: 0,id,'age'_12,'age'_14,'age'_15,'age'_17,'age'_19,'age'_2,'age'_20,'age'_21,'age'_22,...,'dm'_yes,'cad'_no,'cad'_yes,'appet'_good,'appet'_poor,'pe'_no,'pe'_yes,'ane'_no,'ane'_yes,'class'_notckd
0,1,0,0,0,0,0,0,0,0,0,...,1,1,0,1,0,1,0,1,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,1,0,1,0,0
2,3,0,0,0,0,0,0,0,0,0,...,1,1,0,0,1,1,0,0,1,0
3,4,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,1,0,1,0
4,5,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,1,0,1,0,0


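prefix_sep='_' joins the feature name and the class name with the given delimiter, so each dummy column gets a unique name. drop_first=True drops the first class of each feature from the resulting dummy columns. The purpose is to avoid multicollinearity: the dummy columns of one feature always sum to one, so any one of them can be derived from the others.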
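The effect of drop_first is easiest to see on a tiny example. The sketch below uses a hypothetical frame with a single three-class feature and compares the resulting column names with and without the option. The names `toy`, `full`, and `reduced` are illustrative.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one three-class categorical column
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'appet': ['good', 'poor', 'fair'],
})

# one dummy column per class
full = pd.get_dummies(toy, prefix_sep='_')
# same, but the first class (in sorted order) of each feature is dropped
reduced = pd.get_dummies(toy, prefix_sep='_', drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

A row whose remaining dummy columns are all zero belongs to the dropped class, so no information is lost.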