## Encoding Categorical Features

Python code showing how to transform nominal and ordinal variables into integers.

This Notebook includes:

- Ordinal Encoding with LabelEncoder, pandas factorize, and pandas map
- Nominal Encoding with One-Hot Encoding and Binary Encoding

### Why?

- Many machine learning algorithms can't work with categorical data directly; they require numbers. Examples are SVMs, Logistic Regression, Linear Regression, and Neural Networks.

- On the other hand, tree-based algorithms like Random Forest, XGBoost, and LightGBM, as well as Naive Bayes, can work with categorical data, but their accuracy *might* improve with encoding.

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import time
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce

pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('max_colwidth', 100)

In [2]:
train = pd.read_csv('train.csv')
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [3]:
train['ExterQual'].value_counts()

TA    906
Gd    488
Ex     52
Fa     14
Name: ExterQual, dtype: int64

### Ordinal Encoding - LabelEncoder and pandas factorize
Each **unique** categorical value becomes a **unique** number. With *LabelEncoder* and pandas *factorize*, each unique value gets a number between 0 and *N*-1, where *N* is the total number of unique values.

The result will have **1 dimension**.

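After encoding, a model treats the feature as **continuous**, reading an order and a distance into the integers, and that might be a problem. So after encoding nominal features with LabelEncoder, a one-hot encoding **should** be used.

The main difference between the pandas and scikit-learn encoders is that scikit-learn encoders expose fit and transform methods, so a mapping learned on the training data can be reapplied to new data. Note that scikit-learn documents LabelEncoder as an encoder for target labels; OrdinalEncoder is its counterpart for input features inside pipelines.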

In [4]:
train1 = train.copy()

lbl = LabelEncoder()
lbl.fit(train1['ExterQual'])
train1['ExterQual'] = lbl.transform(train1['ExterQual'])

train1['ExterQual'].value_counts()

3    906
2    488
0     52
1     14
Name: ExterQual, dtype: int64

In [5]:
train2 = train.copy()

train2['ExterQual'], indexer = pd.factorize(train2['ExterQual'])

train2['ExterQual'].value_counts()

1    906
0    488
2     52
3     14
Name: ExterQual, dtype: int64
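
The two cells above assign different integers to the same categories. A minimal sketch on a toy Series (hypothetical data, not the housing set) makes the reason visible: LabelEncoder sorts the unique values before numbering them, while `pd.factorize` numbers them in order of first appearance. The variable names here are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (hypothetical data) with the same four labels as ExterQual
quality = pd.Series(['Gd', 'TA', 'Gd', 'Ex', 'Fa', 'TA'])

# LabelEncoder sorts the unique values, so the codes follow alphabetical order
lbl = LabelEncoder()
label_codes = lbl.fit_transform(quality)
label_mapping = {str(cls): code for code, cls in enumerate(lbl.classes_)}

# pd.factorize assigns codes in order of first appearance in the column
factor_codes, uniques = pd.factorize(quality)
factor_mapping = {str(val): code for code, val in enumerate(uniques)}

print(label_mapping)   # {'Ex': 0, 'Fa': 1, 'Gd': 2, 'TA': 3}
print(factor_mapping)  # {'Gd': 0, 'TA': 1, 'Ex': 2, 'Fa': 3}

# The fitted encoder can also map codes back to the original labels
restored = lbl.inverse_transform(label_codes)
```

Neither mapping reflects the quality ranking of the labels, which is the motivation for the next section.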

### Ordinal Encoding - Pandas map

LabelEncoder and `pd.factorize` assign numbers without knowing what the categories mean: LabelEncoder sorts the unique values (alphabetically for strings), while `pd.factorize` numbers them in the **order** that it finds them in the column.

For ordinal columns that might be a problem. For example, given a column with unique values `['TA','Fa','Gd','Ex']` as in the example above, LabelEncoder encodes them to `[3,1,2,0]`, which is not the right order (considering 'Fa' is Fair, 'TA' is Typical/Average, 'Gd' is Good, and 'Ex' is Excellent).

So, pandas **.map()** can be used to give our own mapping to the column values. With too many unique values, this becomes impractical though.

In [6]:
train3 = train.copy()

# Map from worst to best: Fair < Typical/Average < Good < Excellent
train3['ExterQual'] = train3['ExterQual'].map({'Fa':0,'TA':1,'Gd':2,'Ex':3})

train3['ExterQual'].value_counts()

1    906
2    488
3     52
0     14
Name: ExterQual, dtype: int64
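
One caveat with `.map()` is that any value missing from the dictionary silently becomes `NaN`. The sketch below uses a hypothetical shirt-size column (not part of the housing data) to show an explicit ordinal mapping together with a quick check for unmapped categories.

```python
import pandas as pd

# Hypothetical ordinal column; 'XXL' is deliberately left out of the mapping
sizes = pd.Series(['M', 'S', 'XL', 'L', 'XXL'])

# Explicit mapping from smallest to largest
size_order = {'S': 0, 'M': 1, 'L': 2, 'XL': 3}

encoded = sizes.map(size_order)

# .map() returns NaN for values missing from the dictionary,
# so list the categories that were not mapped before modeling
unmapped = sizes[encoded.isna()].unique().tolist()
print(unmapped)  # ['XXL']
```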

### Nominal Encoding - One-hot Encoding with pandas get_dummies
It converts a categorical column to *N* columns, where *N* is the number of unique values in that column. Each new column gets a 1 if the row contains that column's value, and a 0 otherwise.

This encoding can create a **huge** number of features if there are high-cardinality features, which can result in **memory** problems.
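
To give a rough feel for the memory concern, here is a small sketch with a hypothetical ID column that has 1,000 distinct values. It compares the default dense output of `pd.get_dummies` with its `sparse=True` option; the column and sizes are made up for illustration.

```python
import pandas as pd

# Hypothetical high-cardinality column: 5,000 rows, 1,000 distinct IDs
ids = pd.Series([f'id_{i % 1000}' for i in range(5000)])

# One-hot encoding creates one column per distinct value
dense = pd.get_dummies(ids)
print(dense.shape)  # (5000, 1000)

# sparse=True stores only the non-zero entries of each indicator column
sparse = pd.get_dummies(ids, sparse=True)

dense_bytes = int(dense.memory_usage(deep=True).sum())
sparse_bytes = int(sparse.memory_usage(deep=True).sum())
print(dense_bytes, sparse_bytes)
```

The exact byte counts depend on the pandas version, but the sparse frame should be considerably smaller because almost every entry is zero.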

In [7]:
train4 = train.copy()

train4 = pd.concat([train4, pd.get_dummies(train4['ExterQual'])], axis=1)
#train4.drop(['ExterQual'], axis=1, inplace=True) # Uncomment to drop the original column

train4[['ExterQual','Ex','Fa','Gd','TA']].head()

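|   | ExterQual | Ex | Fa | Gd | TA |
|---|-----------|----|----|----|----|
| 0 | Gd        | 0  | 0  | 1  | 0  |
| 1 | TA        | 0  | 0  | 0  | 1  |
| 2 | Gd        | 0  | 0  | 1  | 0  |
| 3 | TA        | 0  | 0  | 0  | 1  |
| 4 | Gd        | 0  | 0  | 1  | 0  |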
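
As a variation on the cell above, the following sketch on a toy DataFrame (hypothetical data) uses the `prefix` argument so the new columns keep the name of the source feature, and replaces the original column in one step. `dtype=int` is passed so the indicators are 0/1 integers regardless of the pandas version's default.

```python
import pandas as pd

# Toy frame (hypothetical data) with a single nominal column
df = pd.DataFrame({'ExterQual': ['Gd', 'TA', 'Gd', 'Ex', 'Fa']})

# One indicator column per unique value, named ExterQual_<value>
dummies = pd.get_dummies(df['ExterQual'], prefix='ExterQual', dtype=int)

# Replace the original column with its indicator columns
encoded = pd.concat([df.drop(columns='ExterQual'), dummies], axis=1)

print(encoded.columns.tolist())
# ['ExterQual_Ex', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA']
```

Every row contains exactly one 1 across the four indicator columns.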


### Nominal Encoding - Binary Encoding with category_encoders
This technique is not as intuitive as the previous ones. **First**, the categories are encoded as **ordinal** integers; then those integers are converted into **binary code**; finally, the **digits** of that binary string are split into separate **columns**. This encodes the data in **fewer dimensions than one-hot**.

In [8]:
train5 = train.copy()

encoder = ce.BinaryEncoder(cols=['ExterQual'])
train5 = encoder.fit_transform(train5)

train5.head()

Unnamed: 0,ExterQual_0,ExterQual_1,ExterQual_2,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,0,0,1,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,0,1,0,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,0,0,1,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,0,1,0,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,0,0,1,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


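We can see that 3 columns were created (`ExterQual_0`, `ExterQual_1`, and `ExterQual_2`), compared with the 4 that one-hot encoding produced for the same feature.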
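
The three steps of binary encoding can also be sketched by hand with pandas. This is an illustration of the idea, not the category_encoders implementation: `binary_encode` is a hypothetical helper, and starting the ordinal codes at 1 is an assumption of this sketch (it happens to reproduce the bit patterns shown in the first rows of the output above).

```python
import pandas as pd

def binary_encode(series, prefix):
    """Hand-rolled binary encoding: ordinal codes -> binary digits -> columns."""
    # Step 1: ordinal codes in order of first appearance, starting at 1
    # (the starting value is an assumption of this sketch)
    codes, _ = pd.factorize(series)
    codes = codes + 1
    # Step 2: number of binary digits needed for the largest code
    n_bits = int(codes.max()).bit_length()
    # Step 3: one column per binary digit, most significant digit first
    columns = {
        f'{prefix}_{i}': (codes >> (n_bits - 1 - i)) & 1
        for i in range(n_bits)
    }
    return pd.DataFrame(columns, index=series.index)

# Toy column (hypothetical data) with four distinct labels
quality = pd.Series(['Gd', 'TA', 'Gd', 'TA', 'Ex', 'Fa'])
encoded = binary_encode(quality, 'ExterQual')
print(encoded.shape)  # (6, 3)
```

With codes 1 to 4, three binary digits are needed, so four categories yield three columns here. The saving over one-hot grows with cardinality: about 1,000 categories would need 10 columns instead of 1,000.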