# Categorical Encoding

This project demonstrates the encoding of categorical data. Categorical encoding is one of the important tasks you may have to perform while transforming data during feature engineering. Before data is fed to a machine learning model, it has to be transformed into a form that the model can understand so that it can give precise predictions.

Categorical data is divided into two types:
1. Nominal data
2. Ordinal data

In this article we will see how to impute and encode ordinal and nominal data using OrdinalEncoder, LabelEncoder, OneHotEncoder, ColumnTransformer and SimpleImputer from scikit-learn, together with pandas.

In [1]:
import numpy as np
import pandas as pd

In [52]:
customers = pd.read_csv("./Datasets/customers.csv")

In [104]:
# customers.drop(columns=['ID', 'Ever_Married', 'Work_Experience', 'Family_Size', 'Var_1', 'Segmentation'], inplace=True)
# customers.rename(columns={'Profession':'Education','Spending_Score':'Review','Graduated':'Purchased'}, inplace=True)
# customers.reset_index(drop=True, inplace=True)
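# assign the result back instead of calling replace(..., inplace=True) on a selected column
customers['Education'] = customers['Education'].replace({'Healthcare': 'PG', 'Engineer':'PG', 'Lawyer':'UG', 'Artist':'UG', 'Doctor':'PG','Homemaker':'HS',
                 'Entertainment': 'HS','Marketing':'UG','Executive':'PG' })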
customers

Unnamed: 0,Gender,Age,Purchased,Education,Review
0,Male,22,No,PG,Low
1,Female,67,Yes,PG,Low
2,Male,67,Yes,UG,High
3,Male,56,No,UG,Average
4,Male,32,Yes,PG,Low
...,...,...,...,...,...
6660,Male,41,Yes,UG,High
6661,Male,35,No,PG,Low
6662,Female,33,Yes,PG,Low
6663,Female,27,Yes,PG,Low


In the above data we have both Nominal and Ordinal data.

__Nominal Data__ has no inherent order. <br>
In the above dataset the __Gender__ and __Purchased__ columns are nominal. You cannot arrange 'Yes' and 'No' in ascending or descending order. <br>

In contrast, __Ordinal Data__ can be arranged in order. <br>
In the above dataset the __Education__ and __Review__ columns hold ordinal data.
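
One way to make the distinction concrete in pandas is an ordered categorical. The sketch below uses a small made-up Series rather than the customers dataset, and it assumes the order Low < Average < High for review values.

```python
import pandas as pd

# Hypothetical review values, for illustration only
review = pd.Series(["Low", "High", "Average", "Low"])

# Ordinal: the order of the categories is declared explicitly
ordered = pd.Categorical(review, categories=["Low", "Average", "High"], ordered=True)
codes = ordered.codes.tolist()   # position of each value in the declared order
largest = ordered.max()          # meaningful only because the data is ordered

# Nominal: categories exist, but no order is attached to them
gender = pd.Categorical(["Male", "Female", "Male"])
is_ordered = gender.ordered

print(codes, largest, is_ordered)
```

The integer codes follow the declared order, which is the same idea OrdinalEncoder relies on when the categories are passed explicitly.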


## Handling Ordinal Categorical Data


From the above dataset we will consider only the 'Purchased', 'Education' and 'Review' columns for encoding.

In [155]:
newcust = customers.iloc[:, 2:5]
newcust

Unnamed: 0,Purchased,Education,Review
0,No,PG,Low
1,Yes,PG,Low
2,Yes,UG,High
3,No,UG,Average
4,Yes,PG,Low
...,...,...,...
6660,Yes,UG,High
6661,No,PG,Low
6662,Yes,PG,Low
6663,Yes,PG,Low


In [119]:
# checking the categories of the data for each column
print(newcust.Education.unique())
print(newcust.Purchased.unique())
print(newcust.Review.unique())

['PG' 'UG' 'HS']
['No' 'Yes']
['Low' 'High' 'Average']


In [120]:
# splitting the data into train and test 
# We will consider 'Purchased' column as output

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(newcust.iloc[:, 1:3],
                                                   newcust.iloc[:, 0],
                                                   test_size=0.2)

In [108]:
# check how the input data looks now
X_train

Unnamed: 0,Education,Review
5981,UG,Low
3766,UG,Low
5503,PG,Low
2781,UG,Average
5393,PG,Low
...,...,...
885,PG,Average
4548,PG,Average
3877,HS,Average
172,PG,Average


## Ordinal Encoding 

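Ordinal encoding is applied to the input features (X_train, X_test). The encoder is fitted on the training data only, and the category order is passed explicitly so that each category maps to its position in that order.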

In [109]:
from sklearn.preprocessing import OrdinalEncoder

In [112]:
# creating an object of OrdinalEncoder class
oe = OrdinalEncoder(categories=[['HS','UG','PG'], ['Low','Average','High']])


In [113]:
# fitting
oe.fit(X_train)

OrdinalEncoder(categories=[['HS', 'UG', 'PG'], ['Low', 'Average', 'High']])

In [114]:
# transform both splits with the encoder fitted on the training data
X_train = oe.transform(X_train)
X_test = oe.transform(X_test)

In [115]:
# check the transformed data
X_train

array([[1., 0.],
       [1., 0.],
       [2., 0.],
       ...,
       [0., 1.],
       [2., 1.],
       [2., 0.]])

In [116]:
# check what are the categories
oe.categories_

[array(['HS', 'UG', 'PG'], dtype=object),
 array(['Low', 'Average', 'High'], dtype=object)]
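
If the test data can contain a category that was never listed, the encoder raises an error by default. The following is a minimal sketch of one way to handle that, assuming scikit-learn 0.24 or later, where OrdinalEncoder accepts handle_unknown='use_encoded_value'. The toy frames and the unseen value 'PhD' are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames standing in for X_train / X_test
train = pd.DataFrame({"Education": ["HS", "UG", "PG"],
                      "Review": ["Low", "Average", "High"]})
test = pd.DataFrame({"Education": ["PG", "PhD"],   # 'PhD' was never seen
                     "Review": ["High", "Low"]})

enc = OrdinalEncoder(
    categories=[["HS", "UG", "PG"], ["Low", "Average", "High"]],
    handle_unknown="use_encoded_value",  # do not raise on unseen categories
    unknown_value=-1,                    # code assigned to unseen categories
)
enc.fit(train)

encoded = enc.transform(test)
print(encoded)

# Known codes can be mapped back to the original labels
decoded = enc.inverse_transform(encoded[:1])
print(decoded)
```

Choosing -1 keeps unseen categories clearly separate from the valid codes 0, 1 and 2. Whether a model should treat that value as "lower than everything" is a modelling decision.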

## Label Encoding

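Label encoding is applied to the output (target) column. LabelEncoder assigns integer codes to the classes in sorted order, so here 'No' becomes 0 and 'Yes' becomes 1.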

In [125]:
from sklearn.preprocessing import LabelEncoder

In [126]:
# creating LabelEncoder object
le = LabelEncoder()

In [127]:
# fitting
le.fit(y_train)

LabelEncoder()

In [134]:
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [156]:
y_train

array([1, 1, 1, ..., 1, 1, 1])

In [157]:
le.classes_

array(['No', 'Yes'], dtype=object)
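
After a model predicts integer codes, they usually need to be turned back into the original labels. A small sketch of that round trip, using made-up labels rather than the customers data:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values, for illustration only
labels = ["No", "Yes", "Yes", "No"]

le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(labels)   # classes are sorted: 'No' -> 0, 'Yes' -> 1
print(y_demo)

# Pretend these are predictions coming back from a model
predicted_codes = np.array([1, 0])
predicted_labels = le_demo.inverse_transform(predicted_codes)
print(predicted_labels)
```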

## Handling Nominal Categorical Data

In [161]:
cars = pd.read_csv('./Datasets/cars.csv')

In [174]:
cars

Unnamed: 0,manufacturer_name,year_produced,engine_fuel,state,price_usd
0,Subaru,2010,gasoline,owned,10900.00
1,Subaru,2002,gasoline,owned,5000.00
2,Subaru,2001,gasoline,owned,2800.00
3,Subaru,1999,gasoline,owned,9999.00
4,Subaru,2001,gasoline,owned,2134.11
...,...,...,...,...,...
38526,Chrysler,2000,gasoline,owned,2750.00
38527,Chrysler,2004,diesel,owned,4800.00
38528,Chrysler,2000,gasoline,owned,4300.00
38529,Chrysler,2001,gasoline,owned,4000.00


In [175]:
cars.columns

Index(['manufacturer_name', 'year_produced', 'engine_fuel', 'state',
       'price_usd'],
      dtype='object')

In [177]:
cars.manufacturer_name.nunique()

55

In [178]:
cars.engine_fuel.value_counts()

gasoline         24065
diesel           12872
gas               1347
hybrid-petrol      235
electric            10
hybrid-diesel        2
Name: engine_fuel, dtype: int64

In [179]:
cars.state.value_counts()

owned        37723
new            438
emergency      370
Name: state, dtype: int64

## One-Hot Encoding with Pandas

In [180]:
pd.get_dummies(cars, columns=['engine_fuel', 'state'])

Unnamed: 0,manufacturer_name,year_produced,price_usd,engine_fuel_diesel,engine_fuel_electric,engine_fuel_gas,engine_fuel_gasoline,engine_fuel_hybrid-diesel,engine_fuel_hybrid-petrol,state_emergency,state_new,state_owned
0,Subaru,2010,10900.00,0,0,0,1,0,0,0,0,1
1,Subaru,2002,5000.00,0,0,0,1,0,0,0,0,1
2,Subaru,2001,2800.00,0,0,0,1,0,0,0,0,1
3,Subaru,1999,9999.00,0,0,0,1,0,0,0,0,1
4,Subaru,2001,2134.11,0,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
38526,Chrysler,2000,2750.00,0,0,0,1,0,0,0,0,1
38527,Chrysler,2004,4800.00,1,0,0,0,0,0,0,0,1
38528,Chrysler,2000,4300.00,0,0,0,1,0,0,0,0,1
38529,Chrysler,2001,4000.00,0,0,0,1,0,0,0,0,1


## K-1 One-Hot Encoding

In [182]:
pd.get_dummies(cars, columns=['engine_fuel', 'state'], drop_first=True)

Unnamed: 0,manufacturer_name,year_produced,price_usd,engine_fuel_electric,engine_fuel_gas,engine_fuel_gasoline,engine_fuel_hybrid-diesel,engine_fuel_hybrid-petrol,state_new,state_owned
0,Subaru,2010,10900.00,0,0,1,0,0,0,1
1,Subaru,2002,5000.00,0,0,1,0,0,0,1
2,Subaru,2001,2800.00,0,0,1,0,0,0,1
3,Subaru,1999,9999.00,0,0,1,0,0,0,1
4,Subaru,2001,2134.11,0,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...
38526,Chrysler,2000,2750.00,0,0,1,0,0,0,1
38527,Chrysler,2004,4800.00,0,0,0,0,0,0,1
38528,Chrysler,2000,4300.00,0,0,1,0,0,0,1
38529,Chrysler,2001,4000.00,0,0,1,0,0,0,1
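
The reason for dropping one dummy column can be checked numerically. With all K columns, each row sums to 1, so the dummies are perfectly collinear with an intercept term. The sketch below uses a toy 'state' column (not the cars dataset) and compares the rank of the two design matrices.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
state = pd.DataFrame({"state": ["owned", "new", "emergency", "owned"]})

full = pd.get_dummies(state, columns=["state"], dtype=int)
reduced = pd.get_dummies(state, columns=["state"], drop_first=True, dtype=int)

# With all K dummy columns, every row sums to exactly 1
row_sums = full.sum(axis=1).tolist()

# Design matrices with an intercept column prepended
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)        # lower than the number of columns
rank_reduced = np.linalg.matrix_rank(X_reduced)  # equal to the number of columns

print(row_sums, X_full.shape[1], rank_full, X_reduced.shape[1], rank_reduced)
```

The full encoding has four columns but rank three, while the K-1 encoding has full column rank. This matters mainly for linear models with an intercept.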


## Drawback:

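A big drawback of the above approach is that pd.get_dummies does not remember which categories it has seen. The columns it produces depend on the values present in the data it is given, so encoding the training and test sets separately, or encoding new data later, can produce a different set or order of columns. This causes problems when the result is passed to a machine learning model. <br>

To resolve this issue we can use OneHotEncoder from the scikit-learn library.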
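
To see how the dummy columns can differ between two datasets, the sketch below encodes a made-up training frame and test frame separately. It also shows one possible pandas-only workaround, reindexing the test columns to match the training columns, which is an illustration and not the approach used in the rest of this article.

```python
import pandas as pd

# Hypothetical data: 'electric' appears only in the test frame
train = pd.DataFrame({"engine_fuel": ["gasoline", "diesel", "gas"]})
test = pd.DataFrame({"engine_fuel": ["gasoline", "electric"]})

train_d = pd.get_dummies(train, columns=["engine_fuel"], dtype=int)
test_d = pd.get_dummies(test, columns=["engine_fuel"], dtype=int)

train_cols = list(train_d.columns)
test_cols = list(test_d.columns)
print(train_cols)
print(test_cols)   # a different set of columns

# Force the test frame into the training layout:
# missing columns are filled with 0, unseen ones are dropped
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned)
```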

## One-Hot Encoding in Scikit-learn

In [185]:
# splitting data
X_train, X_test, y_train, y_test = train_test_split(cars.iloc[:, 0:4],
                                                   cars.iloc[:, -1],
                                                   test_size=0.2,
                                                   random_state=1)

In [187]:
X_train

Unnamed: 0,manufacturer_name,year_produced,engine_fuel,state
21590,Audi,2003,gasoline,owned
26715,Nissan,2013,gasoline,owned
13508,Renault,2009,diesel,owned
5849,Mitsubishi,1993,gasoline,owned
11121,Ford,2008,gasoline,owned
...,...,...,...,...
7813,SsangYong,2000,gasoline,owned
32511,Skoda,2017,gasoline,owned
5192,Mitsubishi,2003,diesel,owned
12172,Renault,1996,gas,owned


In [218]:
from sklearn.preprocessing import OneHotEncoder

In [219]:
# creating object
ohe = OneHotEncoder(sparse=False, dtype=np.int32)
# sparse=False makes fit_transform return a dense NumPy array,
# so you don't need to call .toarray() on ohe.fit_transform(X_train[['engine_fuel', 'state']])
# (in scikit-learn 1.2 and later this parameter is named sparse_output)

# to remove multicollinearity, create the encoder as follows instead
# ohe = OneHotEncoder(drop='first')
# this drops the first category of both the 'engine_fuel' and 'state' columns

In [220]:
X_train_new = ohe.fit_transform(X_train[['engine_fuel', 'state']])

In [221]:
# take the manufacturer_name and year_produced columns and append the one-hot encoded array to them
np.hstack((X_train[['manufacturer_name', 'year_produced']].values, X_train_new))

array([['Audi', 2003, 0, ..., 0, 0, 1],
       ['Nissan', 2013, 0, ..., 0, 0, 1],
       ['Renault', 2009, 1, ..., 0, 0, 1],
       ...,
       ['Mitsubishi', 2003, 1, ..., 0, 0, 1],
       ['Renault', 1996, 0, ..., 0, 0, 1],
       ['Infiniti', 2003, 0, ..., 0, 0, 1]], dtype=object)
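
Because the fitted encoder stores its categories, the same column layout can be reused on new data. The sketch below, on made-up frames, shows this with handle_unknown='ignore', which encodes an unseen category as all zeros instead of raising an error. It keeps the default sparse output and calls .toarray() so that it does not depend on the name of the sparse parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'electric' is not present in the training frame
train = pd.DataFrame({"engine_fuel": ["gasoline", "diesel", "gas"],
                      "state": ["owned", "new", "owned"]})
test = pd.DataFrame({"engine_fuel": ["electric"], "state": ["owned"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# Column names rebuilt from the categories learned during fit
names = [f"{col}_{cat}"
         for col, cats in zip(train.columns, enc.categories_)
         for cat in cats]
print(names)

# Same five columns as for the training data; the unseen fuel type is all zeros
test_encoded = enc.transform(test).toarray()
print(test_encoded)
```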

## One-Hot Encoding with Top Categories

In [228]:
# number of cars for each brand
brandCount=cars.manufacturer_name.value_counts()
brandCount

Volkswagen       4243
Opel             2759
BMW              2610
Ford             2566
Renault          2493
Audi             2468
Mercedes-Benz    2237
Peugeot          1909
Citroen          1562
Nissan           1361
Mazda            1328
Toyota           1246
Hyundai          1116
Skoda            1089
Kia               912
Mitsubishi        887
Fiat              824
Honda             797
Volvo             721
ВАЗ               481
Chevrolet         436
Chrysler          410
Seat              303
Dodge             297
Subaru            291
Rover             235
Suzuki            234
Daewoo            221
Lexus             213
Alfa Romeo        207
ГАЗ               200
Land Rover        184
Infiniti          162
LADA              146
Iveco             139
Saab              108
Jeep              107
Lancia             92
SsangYong          79
УАЗ                74
Geely              71
Mini               68
Acura              66
Porsche            61
Dacia              59
Chery     

In [225]:
# number of brands
cars.manufacturer_name.nunique()

55

In [235]:
# brands with 100 cars or fewer
threshold = 100

LowCountBrand = brandCount[brandCount <= threshold].index
LowCountBrand

Index(['Lancia', 'SsangYong', 'УАЗ', 'Geely', 'Mini', 'Acura', 'Porsche',
       'Dacia', 'Chery', 'Москвич', 'Jaguar', 'Buick', 'Lifan', 'Cadillac',
       'Pontiac', 'ЗАЗ', 'Lincoln', 'Great Wall'],
      dtype='object')

In [239]:
# replacing the names of brands with 100 cars or fewer with 'uncommon'
pd.get_dummies(cars['manufacturer_name'].replace(LowCountBrand, 'uncommon'))

Unnamed: 0,Alfa Romeo,Audi,BMW,Chevrolet,Chrysler,Citroen,Daewoo,Dodge,Fiat,Ford,...,Seat,Skoda,Subaru,Suzuki,Toyota,Volkswagen,Volvo,uncommon,ВАЗ,ГАЗ
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38526,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
38527,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
38528,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
38529,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
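
When there is a train/test split, the list of frequent brands is usually learned from the training data and then applied to both splits. A minimal sketch of that idea follows. The helper name group_rare, the toy brand values and the threshold of 1 are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical brand columns; 'Tesla' appears only in the test data
train = pd.Series(["VW", "VW", "VW", "Opel", "Opel", "Mini"])
test = pd.Series(["VW", "Mini", "Tesla"])

# Learn the frequent brands from the training data only
threshold = 1
counts = train.value_counts()
frequent = counts[counts > threshold].index

def group_rare(series, frequent, other="uncommon"):
    """Keep frequent categories; replace rare or unseen ones with `other`."""
    return series.where(series.isin(frequent), other)

train_grouped = group_rare(train, frequent)
test_grouped = group_rare(test, frequent)
print(train_grouped.tolist())
print(test_grouped.tolist())
```

Unseen brands in the test data fall into the same 'uncommon' bucket, so the encoded columns stay identical across splits.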


## Column Transformer

In [250]:
from sklearn.impute import SimpleImputer

In [319]:
covid = pd.read_csv('./Datasets/coviddummy.csv')

In [379]:
covid['Gender'].unique()

array(['Male', 'Female'], dtype=object)

In [372]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(covid.iloc[:,1:5],
                                                   covid.iloc[:,0],
                                                   test_size=0.2)

In [373]:
X_train

Unnamed: 0,Age,Severity,Gender,Fever
8,24,Moderate,Male,98.9
7,12,Mild,Female,99.0
5,60,Severe,Female,
2,40,Severe,Male,
3,20,Mild,Male,99.8
13,29,Severe,Female,99.0
0,10,Mild,Male,100.0
10,41,Mild,Male,102.0
1,20,Moderate,Female,
6,70,Moderate,Male,


In [396]:
from sklearn.compose import ColumnTransformer

In [392]:
cot = ColumnTransformer(transformers=[('tnf1', SimpleImputer(), ['Fever']),
                                      ('tnf2', OrdinalEncoder(categories=[['Mild','Moderate','Severe']]),['Severity']),
                                      ('tnf3', OneHotEncoder(sparse=False, drop='first'), ['Gender'])],
                        remainder='passthrough')

In [393]:
cot.fit_transform(X_train).shape

(12, 4)

In [395]:
cot.transform(X_test).shape

(3, 4)
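
The four output columns come from the three transformers, in the order they are listed, followed by the passed-through Age column. The sketch below reproduces this on a three-row made-up frame so that the values can be inspected. The data and the transformer names are hypothetical. It sets sparse_threshold=0 on the ColumnTransformer to get a dense array without relying on the encoder's sparse parameter name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data with the same column names as the covid example
toy = pd.DataFrame({
    "Age": [24, 60, 20],
    "Severity": ["Moderate", "Severe", "Mild"],
    "Gender": ["Male", "Female", "Male"],
    "Fever": [98.0, np.nan, 100.0],
})

ct = ColumnTransformer(
    transformers=[
        ("impute_fever", SimpleImputer(), ["Fever"]),   # mean imputation by default
        ("encode_severity",
         OrdinalEncoder(categories=[["Mild", "Moderate", "Severe"]]),
         ["Severity"]),
        ("encode_gender", OneHotEncoder(drop="first"), ["Gender"]),
    ],
    remainder="passthrough",   # Age is appended unchanged at the end
    sparse_threshold=0,        # always return a dense array
)

out = ct.fit_transform(toy)
# Column order: Fever (imputed), Severity (ordinal), Gender_Male, Age
print(out)
```

The missing Fever value is replaced by the mean of the other two, and Gender collapses to a single column because the first category ('Female') is dropped.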