
In [0]:
import pandas as pd
import numpy as np

### Label Encoder

Convert categorical columns to numeric values, since most machine learning algorithms accept only numeric input.

In [0]:
bridge_types = ('Arch','Beam','Truss','Cantilever','Tied Arch','Suspension','Cable')
bridge_df = pd.DataFrame(bridge_types, columns=['Bridge_Type'])

In [0]:
bridge_df

Unnamed: 0,Bridge_Type
0,Arch
1,Beam
2,Truss
3,Cantilever
4,Tied Arch
5,Suspension
6,Cable


In [0]:
bridge_df.dtypes

Bridge_Type    object
dtype: object

In [0]:
bridge_df['Bridge_Type'] = bridge_df['Bridge_Type'].astype('category')

In [0]:
bridge_df.dtypes

Bridge_Type    category
dtype: object

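Encode the categories as integer codes. pandas assigns the codes by the sorted order of the category labels, not by their order of appearance.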

In [0]:
bridge_df['Bridge_Type_cat'] = bridge_df["Bridge_Type"].cat.codes

In [0]:
bridge_df

Unnamed: 0,Bridge_Type,Bridge_Type_cat
0,Arch,0
1,Beam,1
2,Truss,6
3,Cantilever,3
4,Tied Arch,5
5,Suspension,4
6,Cable,2


In [0]:
bridge_df['Bridge_Type'].cat.categories

Index(['Arch', 'Beam', 'Cable', 'Cantilever', 'Suspension', 'Tied Arch',
       'Truss'],
      dtype='object')

In [0]:
bridge_df['Bridge_Type'].cat.ordered

False
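The relationship between `cat.codes` and `cat.categories` can be sketched as follows: each code is the position of the label in the sorted category index, so the codes can be mapped back to the original labels. The names `demo_df`, `code_to_label` and `decoded` are illustrative and not part of the notebook.

```python
import pandas as pd

bridge_types = ('Arch', 'Beam', 'Truss', 'Cantilever', 'Tied Arch', 'Suspension', 'Cable')
demo_df = pd.DataFrame(bridge_types, columns=['Bridge_Type'])
demo_df['Bridge_Type'] = demo_df['Bridge_Type'].astype('category')

codes = demo_df['Bridge_Type'].cat.codes
categories = demo_df['Bridge_Type'].cat.categories

# Each code is the position of the label in the sorted category index
code_to_label = dict(enumerate(categories))

# Map the integer codes back to the original labels
decoded = codes.map(code_to_label)
print(decoded.tolist())
```

Keeping this mapping around is useful when predictions or reports need to show the readable label rather than the code.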

Using the scikit-learn library

In [0]:
from sklearn.preprocessing import LabelEncoder

In [0]:
labelencoder = LabelEncoder()

In [0]:
bridge_df['Bridge_Type_Label'] = labelencoder.fit_transform(bridge_df['Bridge_Type'])

In [0]:
bridge_df

Unnamed: 0,Bridge_Type,Bridge_Type_cat,Bridge_Type_Label
0,Arch,0,0
1,Beam,1,1
2,Truss,6,6
3,Cantilever,3,3
4,Tied Arch,5,5
5,Suspension,4,4
6,Cable,2,2


LabelEncoder introduces a new problem: the integer codes imply an order (for example, Truss = 6 is "greater than" Arch = 0), and an algorithm may interpret that order as a ranking or magnitude even though bridge types have no natural order. One-hot encoding avoids this.
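A small sketch of why the integer labels can mislead: arithmetic on the codes produces values that look meaningful but are not. Here the midpoint of the codes for Arch and Truss happens to land on the code for Cantilever. The names `demo_encoder`, `midpoint` and `restored` are illustrative and not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

bridge_types = ['Arch', 'Beam', 'Truss', 'Cantilever', 'Tied Arch', 'Suspension', 'Cable']

demo_encoder = LabelEncoder()
labels = demo_encoder.fit_transform(bridge_types)

# classes_ holds the sorted unique labels; a label's index is its code
classes = list(demo_encoder.classes_)

# Arithmetic on codes is meaningless for unordered categories:
# the midpoint of Arch (0) and Truss (6) is 3, the code for Cantilever
midpoint = (labels[0] + labels[2]) // 2
print(classes[midpoint])

# inverse_transform recovers the original labels from the codes
restored = list(demo_encoder.inverse_transform(labels))
```

A model that treats the column as numeric can draw the same kind of spurious conclusion, which is the motivation for an encoding without an implied order.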

### One Hot Encoder

Each category becomes its own column, holding 1 where the row belongs to that category and 0 otherwise. Because no column carries an order, this removes the implied hierarchy.

In [0]:
from sklearn.preprocessing import OneHotEncoder

In [0]:
# handle_unknown='ignore': a category not seen during fit is encoded as all zeros at transform time.
# With the default handle_unknown='error', transform raises an error instead.
enc = OneHotEncoder(handle_unknown='ignore')

In [0]:
enc_df=pd.DataFrame(enc.fit_transform(bridge_df[['Bridge_Type_cat']]).toarray())

In [0]:
bridge_df = bridge_df.join(enc_df)
bridge_df

Unnamed: 0,Bridge_Type,Bridge_Type_cat,Bridge_Type_Label,0,1,2,3,4,5,6
0,Arch,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Beam,1,1,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,Truss,6,6,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,Cantilever,3,3,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,Tied Arch,5,5,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,Suspension,4,4,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6,Cable,2,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0
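The one-hot columns above are named 0 to 6, which makes them hard to read. One possible variant, sketched here under the assumption that the encoder is fitted on the label column directly rather than on the integer codes, builds readable column names from the fitted encoder's `categories_` attribute. It also shows what `handle_unknown='ignore'` does with a category that was not seen during fitting. The three-row `demo_df` and the value `'Drawbridge'` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo_df = pd.DataFrame({'Bridge_Type': ['Arch', 'Beam', 'Truss']})

demo_enc = OneHotEncoder(handle_unknown='ignore')
encoded = demo_enc.fit_transform(demo_df[['Bridge_Type']]).toarray()

# Build readable column names from the categories learned during fit
column_names = ['Bridge_Type_' + str(c) for c in demo_enc.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=column_names, index=demo_df.index)
print(encoded_df)

# A category unseen during fit is encoded as a row of zeros
unseen = pd.DataFrame({'Bridge_Type': ['Drawbridge']})
unseen_row = demo_enc.transform(unseen[['Bridge_Type']]).toarray()
print(unseen_row)
```

Passing `index=demo_df.index` keeps the encoded frame aligned with the source rows, so a later `join` matches rows correctly even if the original index is not 0 to n-1.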
