## Encoding Categorical Columns

Categorical variables are features in our dataset whose values are labels rather than numbers. They are usually stored as text (in pandas, the column datatype is object). The counterpart of categorical variables is continuous variables, which can be numeric or date/time values.

### There are two main types of categorical variables: nominal and ordinal

1.	Nominal feature: a feature whose categories are only labels, without any order of precedence. For example, a feature like gender with two categories (male and female) has no particular order.

2.	Ordinal feature: a feature whose categories have an order associated with them. For example, a feature like economic status with three categories (low, medium and high) has a natural order.
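
The difference between the two types can be sketched with pandas' `Categorical` type. This is only an illustration on toy values (the lists below are made up, not taken from the housing data); the `ordered=True` flag is what marks a feature as ordinal.

```python
import pandas as pd

# Nominal: the categories are plain labels with no order (toy values)
gender = pd.Categorical(["male", "female", "female"], categories=["female", "male"])

# Ordinal: the categories carry an explicit order, low < medium < high
status = pd.Categorical(
    ["low", "high", "medium"],
    categories=["low", "medium", "high"],
    ordered=True,
)

print(gender.ordered)               # False
print(status.ordered)               # True
print(status.codes.tolist())        # [0, 2, 1]
print(status.min(), status.max())   # low high
```

Because `status` is ordered, comparisons such as minimum and maximum are meaningful; for `gender` they are not.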

#### Why are we concerned about categorical variables?

•	“Many machine learning models, such as regression or SVM, are algebraic. This means that their input must be numerical. To use these models, categories must be transformed into numbers first, before you can apply the learning algorithm on them.”

•	“While some ML packages or libraries might transform categorical data to numeric automatically based on some default embedding method, many other ML packages don’t support such inputs.”

•	“The machine doesn’t interpret categorical data the way we humans do. For example, take the cities Abuja, Minna and Accra. Humans tend to group Abuja and Minna together and treat Accra as distinct, because Abuja and Minna are in the same country (Nigeria) and are therefore more closely related to each other than either is to Accra. For the model, however, these cities are just three different levels (possible values) of the same feature, City. If you do not supply the additional contextual information, it will be impossible for the model to tell how similar or different the levels are.”
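
As a small illustration of the first point, the sketch below tries to fit a linear regression directly on a text column. The data frame and target values are made up for the example; the only claim is that scikit-learn refuses the input because it cannot convert the strings to numbers.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical values): one text feature and a numeric target
X_toy = pd.DataFrame({"City": ["Abuja", "Minna", "Accra"]})
y_toy = [1.0, 2.0, 3.0]

try:
    LinearRegression().fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    # The model needs numeric input and cannot turn the city names into floats
    fit_failed = True
    print("Fit failed:", err)
```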


For all these reasons, we are faced with the task of transforming categorical data into numerical data for further processing and, where possible, extracting additional useful information from the categorical columns. This is one of the steps in feature engineering.
In this class we will use the Housing Price competition dataset from Kaggle to explain how to encode categorical variables.

#### I will explain two simple approaches to categorical encoding

These are:
1. Label Encoding
2. One-hot Encoding

Let's jump into the code cells.

First, we import the basic libraries we need.

In [99]:
import pandas as pd
import numpy as np
import copy

Now let's read our data into a pandas dataframe.

In [66]:
df = pd.read_csv('housing_train.csv')

In [67]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


We can see from the head of the dataframe that there are some missing values (NaN) and that there are integer, object, and float columns.


To keep things simple, we drop every column that contains a NaN value.

In [68]:
nan_cols = [c for c in df.columns if df[c].isnull().any()]

df.drop(nan_cols, axis=1, inplace=True)
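
To make the effect of this step concrete, here is a minimal sketch on a three-column toy frame. The values are invented for illustration; the column names merely mimic the housing data. It also shows that `dropna(axis=1)` gives the same result as the list-comprehension approach.

```python
import numpy as np
import pandas as pd

# Toy frame: only 'Alley' contains missing values
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Alley": [np.nan, "Grvl", np.nan],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Same logic as above: collect columns with at least one NaN, then drop them
nan_cols = [c for c in toy.columns if toy[c].isnull().any()]
toy_clean = toy.drop(columns=nan_cols)

print(nan_cols)                              # ['Alley']
print(list(toy_clean.columns))               # ['LotArea', 'Street']
print(list(toy.dropna(axis=1).columns))      # ['LotArea', 'Street']
```

Dropping whole columns is the bluntest way to handle missing data; it is used here only so the encoding steps stay simple.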

Next, we collect the names of the categorical columns (those with the object datatype).

In [69]:
cat_cols = [c for c in df if df[c].dtypes == 'object']

cat_cols

['MSZoning',
 'Street',
 'LotShape',
 'LandContour',
 'Utilities',
 'LotConfig',
 'LandSlope',
 'Neighborhood',
 'Condition1',
 'Condition2',
 'BldgType',
 'HouseStyle',
 'RoofStyle',
 'RoofMatl',
 'Exterior1st',
 'Exterior2nd',
 'ExterQual',
 'ExterCond',
 'Foundation',
 'Heating',
 'HeatingQC',
 'CentralAir',
 'KitchenQual',
 'Functional',
 'PavedDrive',
 'SaleType',
 'SaleCondition']
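
An alternative way to pick out the same kind of columns is pandas' `select_dtypes`. The sketch below uses a made-up toy frame whose text columns are explicitly stored with the object datatype, as in the dataframe above; `nunique()` then shows how many distinct categories each column has.

```python
import pandas as pd

# Toy frame; the text columns are explicitly stored as dtype object
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": pd.Series(["Pave", "Grvl", "Pave"], dtype=object),
    "CentralAir": pd.Series(["Y", "N", "Y"], dtype=object),
})

# Select the object-typed columns by dtype instead of looping over them
toy_cat_cols = toy.select_dtypes(include="object").columns.tolist()

print(toy_cat_cols)                            # ['Street', 'CentralAir']
print(toy[toy_cat_cols].nunique().to_dict())   # {'Street': 2, 'CentralAir': 2}
```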

In [70]:
X = df.drop('SalePrice', axis=1)
y = df.SalePrice

In [72]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [74]:
# function for comparing different approaches
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
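
The function above scores each approach by mean absolute error (MAE): the average of the absolute differences between predicted and true values, so a lower score is better. The following sketch checks that definition against scikit-learn on three made-up prices.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted sale prices
y_true = np.array([200000, 150000, 300000])
y_pred = np.array([210000, 140000, 270000])

# MAE by hand: mean of |true - predicted|
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

# Absolute errors are 10000, 10000 and 30000, so both values are about 16666.67
print(mae_manual, mae_sklearn)
```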

### Label Encoding

This approach encodes categorical values with a technique called "label encoding", which converts each distinct value in a column to an integer. The integer labels always lie between 0 and n_categories-1.
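
Before applying it to the housing data, here is a minimal sketch of what `LabelEncoder` does on a toy list (the values reuse the economic-status example from earlier). Note that the classes are sorted alphabetically, so the integers do not follow the low < medium < high order.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column with three distinct categories
status = ["low", "high", "medium", "low"]

encoder = LabelEncoder()
codes = encoder.fit_transform(status)

print(encoder.classes_.tolist())                     # ['high', 'low', 'medium']
print(codes.tolist())                                # [1, 0, 2, 1]
print(encoder.inverse_transform([0, 2]).tolist())    # ['high', 'medium']
```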


In [75]:
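X_train_lb = X_train.copy()
X_test_lb = X_test.copy()

from sklearn.preprocessing import LabelEncoder

lb = LabelEncoder()

# Encode every categorical column as integers in the range 0..n_categories-1.
# Caution: fit_transform is called separately on the training and test data,
# so each set learns its own mapping and the same category can receive a
# different integer in X_train_lb and X_test_lb.
for col in cat_cols:
    X_train_lb[col] = lb.fit_transform(X_train[col])
    X_test_lb[col] = lb.fit_transform(X_test[col])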

In [76]:
X_train_lb.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
254,255,20,3,8400,1,3,3,0,4,0,...,0,0,0,0,0,0,6,2010,8,4
1066,1067,60,3,7837,1,0,3,0,4,0,...,40,0,0,0,0,0,5,2009,8,4
638,639,30,3,8777,1,3,3,0,4,0,...,0,164,0,0,0,0,5,2008,8,4
799,800,50,3,7200,1,3,3,0,0,0,...,0,264,0,0,0,0,6,2007,8,4
380,381,50,3,5000,1,3,3,0,4,0,...,0,242,0,0,0,0,5,2010,8,4


In [77]:
X_test_lb.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
892,893,20,3,8414,1,3,3,0,4,0,...,0,0,0,0,0,0,2,2006,6,3
1105,1106,60,3,12256,1,0,3,0,0,0,...,32,0,0,0,0,0,4,2010,6,3
413,414,30,4,8960,1,3,3,0,4,0,...,0,130,0,0,0,0,3,2010,6,3
522,523,50,4,5000,1,3,3,0,0,0,...,24,36,0,0,0,0,10,2006,6,3
1036,1037,20,3,12898,1,0,1,0,4,0,...,0,0,0,0,0,0,9,2009,6,3


In [78]:
print("Mean Absolute Error from Approach Label Encoding:") 
print(score_dataset(X_train_lb, X_test_lb, y_train, y_test))

Mean Absolute Error from Approach Label Encoding:
18129.23383561644
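
One caveat about the score above: the encoding loop calls `fit_transform` separately on the training and test data, so the two sets may not share the same category-to-integer mapping (in the heads shown above, SaleType is 8 in the training rows and 6 in the test rows, which is consistent with that). A possible alternative, sketched here on toy data and not used to produce any result in this class, is to fit once on the training data and reuse the mapping. The sketch swaps in scikit-learn's `OrdinalEncoder` because it can assign a placeholder code to categories never seen during fitting; the frames and the choice of -1 are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames; 'Oth' appears only in the test data
train = pd.DataFrame({"SaleType": ["WD", "New", "COD", "WD"]})
test = pd.DataFrame({"SaleType": ["WD", "Oth"]})

# Unseen categories are mapped to -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train[["SaleType"]])   # learn the mapping on train only
test_codes = enc.transform(test[["SaleType"]])         # reuse the same mapping on test

print(train_codes.ravel().tolist())   # [2.0, 1.0, 0.0, 2.0]
print(test_codes.ravel().tolist())    # [2.0, -1.0]
```

Here 'WD' is encoded as 2.0 in both sets, which is what a model trained on one and evaluated on the other needs.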


### One-Hot Encoding

One-hot encoding takes each distinct value in a categorical column and turns it into a new column, then assigns a 1 or 0 (i.e. True/False) to each row of that column. This makes up for the disadvantage of label encoding, which attaches an artificial order of importance to the values.

This can be done with several libraries, but the simplest way is the pandas get_dummies() function.

The get_dummies() function is so named because it creates dummy/indicator variables (1 or 0). Three arguments matter most here: the first is the DataFrame you want to encode, the second is the columns argument, which lets you specify the columns to encode, and the third is the prefix argument, which lets you specify the prefix for the new columns created by the encoding.
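
The three arguments can be seen together in a small sketch. The toy frame and the prefix "St" are made up for illustration; `dtype=int` is passed so the indicators print as 1/0 regardless of the pandas version's default.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Encode only 'Street' and name the new columns with the prefix 'St'
encoded = pd.get_dummies(toy, columns=["Street"], prefix="St", dtype=int)

print(list(encoded.columns))          # ['LotArea', 'St_Grvl', 'St_Pave']
print(encoded["St_Pave"].tolist())    # [1, 0, 1]
```

When `prefix` is omitted, as in the cell that follows, the original column name is used as the prefix.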

In [96]:
X_pandas = X.copy()

X_pandas = pd.get_dummies(X_pandas, columns=cat_cols)

X_pandas.head()

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,60,8450,7,5,2003,2003,706,0,150,...,0,0,0,1,0,0,0,0,1,0
1,2,20,9600,6,8,1976,1976,978,0,284,...,0,0,0,1,0,0,0,0,1,0
2,3,60,11250,7,5,2001,2002,486,0,434,...,0,0,0,1,0,0,0,0,1,0
3,4,70,9550,7,5,1915,1970,216,0,540,...,0,0,0,1,1,0,0,0,0,0
4,5,60,14260,8,5,2000,2000,655,0,490,...,0,0,0,1,0,0,0,0,1,0


We can now see that new columns have been created, with names of the form _'old column name'_\_'value', where 'value' is one of the values of that column.

In [93]:
X_train_pandas, X_test_pandas, y_train, y_test = train_test_split(X_pandas, y, test_size=0.2, random_state=42)

In [97]:
print("Mean Absolute Error from Approach One Hot Encoding(pd.get_dummies):") 
print(score_dataset(X_train_pandas, X_test_pandas, y_train, y_test))

Mean Absolute Error from Approach One Hot Encoding(pd.get_dummies):
17626.303698630138


#### scikit-learn also supports one-hot encoding through LabelBinarizer and OneHotEncoder in its preprocessing module. For this class we will do the same encoding with OneHotEncoder:

In [98]:
from sklearn.preprocessing import OneHotEncoder

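# Apply one-hot encoder to each column with categorical data.
# From scikit-learn 1.2 onward the dense-output flag is named sparse_output;
# on earlier versions use sparse=False instead.
OH_enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_enc.fit_transform(X_train[cat_cols]))
OH_cols_valid = pd.DataFrame(OH_enc.transform(X_test[cat_cols]))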

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_test.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(cat_cols, axis=1)
num_X_test = X_test.drop(cat_cols, axis=1)

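# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_test = pd.concat([num_X_test, OH_cols_valid], axis=1)

# The one-hot columns are named 0, 1, 2, ...; recent scikit-learn versions
# require all column names to be strings, so convert them
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_test.columns = OH_X_test.columns.astype(str)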

print("Mean Absolute Error from Approach One-Hot Encoding:") 
print(score_dataset(OH_X_train, OH_X_test, y_train, y_test))

Mean Absolute Error from Approach One-Hot Encoding:
17790.68866438356


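Comparing the approaches we have seen, one-hot encoding performs better on this dataset: both pd.get_dummies (MAE 17626.30) and OneHotEncoder (MAE 17790.69) give a lower mean absolute error than label encoding (MAE 18129.23), and a lower MAE means more accurate predictions.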