# What is Encoding?

Encoding is the process of converting data, such as a sequence of characters or symbols, into a specified format. In Machine Learning, it usually means converting categorical values into numbers that an algorithm can work with.

# Categorical Variables

Most Machine Learning algorithms cannot handle Categorical Variables unless we convert them to numerical values.

The performance of many algorithms varies based on how Categorical Variables are encoded.

Categorical Variables can be divided into two categories:

- Nominal (no particular order)
- Ordinal (the categories have a natural order)

![title](../images/cat.png)
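
As a rough illustration of the difference, pandas lets you declare whether a categorical has an order. The sketch below uses small made-up lists (`quadrant` and `age` are hypothetical toy variables, not the dataset loaded later).

```python
import pandas as pd

# Nominal: the categories have no inherent order (toy values).
quadrant = pd.Categorical(["left_up", "central", "left_low"], ordered=False)

# Ordinal: the order of the categories is declared explicitly.
age = pd.Categorical(
    ["40-49", "20-29", "30-39"],
    categories=["20-29", "30-39", "40-49"],
    ordered=True,
)

print(quadrant.ordered)      # False
print(age.ordered)           # True
print(age.codes.tolist())    # [2, 0, 1], positions in the declared order
print(age.min(), age.max())  # 20-29 40-49
```

Order-based operations such as `min()` and `max()` are only defined for the ordered categorical.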

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('data/breast-cancer.csv')

In [3]:
df.rename(columns={'Class': 'target'}, inplace=True)

In [4]:
df.fillna('unknown', inplace = True)

In [5]:
bin_cols = ['breast','irradiat','target']

df[bin_cols].head()

Unnamed: 0,breast,irradiat,target
0,right,no,recurrence-events
1,right,no,no-recurrence-events
2,left,no,recurrence-events
3,right,yes,no-recurrence-events
4,left,no,recurrence-events


In [6]:
nom_cols = ['breast-quad','menopause','node-caps']

df[nom_cols].head()

Unnamed: 0,breast-quad,menopause,node-caps
0,left_up,premeno,yes
1,central,ge40,no
2,left_low,ge40,no
3,left_low,premeno,yes
4,right_up,premeno,yes


In [7]:
ord_cols = ['age','tumor-size','inv-nodes','deg-malig']

df[ord_cols].head()

Unnamed: 0,age,tumor-size,inv-nodes,deg-malig
0,40-49,15-19,0-2,3
1,50-59,15-19,0-2,1
2,50-59,35-39,0-2,2
3,40-49,35-39,0-2,3
4,40-49,30-34,3-5,2


# Replace Values

In [8]:
df[bin_cols].head()

Unnamed: 0,breast,irradiat,target
0,right,no,recurrence-events
1,right,no,no-recurrence-events
2,left,no,recurrence-events
3,right,yes,no-recurrence-events
4,left,no,recurrence-events


In [9]:
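# Assign the result back: calling replace(..., inplace=True) on a selected column
# may not update df in recent pandas versions.
df['breast'] = df['breast'].replace({'right': 1, 'left': 0})
df['irradiat'] = df['irradiat'].replace({'yes': 1, 'no': 0})
df['target'] = df['target'].replace({'recurrence-events': 1, 'no-recurrence-events': 0})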

In [10]:
df[bin_cols].head()

Unnamed: 0,breast,irradiat,target
0,1,0,1
1,1,0,0
2,0,0,1
3,1,1,0
4,0,0,1


In [11]:
y = df['target']

In [12]:
bin_cols.remove('target')
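
An alternative to `replace` for binary columns is `Series.map`. The sketch below uses a hypothetical toy frame (`toy` and `mapping` are illustrative names). Unlike `replace`, `map` turns any value missing from the dictionary into `NaN`, which can help catch unexpected categories.

```python
import pandas as pd

# Toy column with one value that is not in the mapping.
toy = pd.DataFrame({"irradiat": ["no", "yes", "unknown"]})
mapping = {"yes": 1, "no": 0}

# map() looks up every value in the dictionary; unmapped values become NaN.
toy["irradiat_enc"] = toy["irradiat"].map(mapping)

# Rows whose category was not covered by the mapping.
unmapped = toy.loc[toy["irradiat_enc"].isna(), "irradiat"].tolist()
print(unmapped)  # ['unknown']
```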

# Label Encoding

In this encoding, each category is assigned a value from 0 to N-1 (here N is the number of categories for the feature). 

One major issue with this approach is that nominal categories have no relation or order between them, but an algorithm might interpret the integer codes as ordered or as otherwise related.

![title](images/label.png)
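
A minimal sketch of the 0 to N-1 assignment on a made-up list (`values` is a toy example): `LabelEncoder` sorts the distinct categories and uses each category's position in that sorted list as its code.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values with three distinct categories (N = 3).
values = ["premeno", "ge40", "lt40", "ge40"]

enc = LabelEncoder()
codes = enc.fit_transform(values)

print(list(enc.classes_))  # ['ge40', 'lt40', 'premeno'], sorted alphabetically
print(codes.tolist())      # [2, 0, 1, 0]

# The mapping can be reversed to recover the original categories.
print(enc.inverse_transform([0, 2]).tolist())  # ['ge40', 'premeno']
```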

In [13]:
from sklearn.preprocessing import LabelEncoder

In [14]:
X_label = df[nom_cols].copy()

X_label.head()

Unnamed: 0,breast-quad,menopause,node-caps
0,left_up,premeno,yes
1,central,ge40,no
2,left_low,ge40,no
3,left_low,premeno,yes
4,right_up,premeno,yes


In [15]:
for col in nom_cols:
    labelEnc = LabelEncoder()
    X_label[col] = labelEnc.fit_transform(X_label[col])

In [16]:
X_label.head()

Unnamed: 0,breast-quad,menopause,node-caps
0,2,2,2
1,0,0,0
2,1,0,0
3,1,2,2
4,4,2,2
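
The scikit-learn documentation describes `LabelEncoder` as a tool for encoding target values (`y`) rather than input features. As a sketch of an alternative to the loop above, `OrdinalEncoder` with default settings assigns the same kind of sorted integer codes to several feature columns in one call. The frame below is a hypothetical toy example, not the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with two nominal columns.
toy = pd.DataFrame({
    "menopause": ["premeno", "ge40", "lt40"],
    "node-caps": ["yes", "no", "no"],
})

# With no explicit categories, each column's values are sorted and numbered.
enc = OrdinalEncoder()
encoded = enc.fit_transform(toy)

print([c.tolist() for c in enc.categories_])
# [['ge40', 'lt40', 'premeno'], ['no', 'yes']]
print(encoded.tolist())
# [[2.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
```

Note that `OrdinalEncoder` returns floats, while `LabelEncoder` returns integers.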


# Ordinal Encoding

We use Ordinal Encoding to ensure that the encoded values retain the ordinal nature of the variable.

This encoding looks similar to Label Encoding, with one difference: Label Encoding does not consider whether the variable is ordinal and simply assigns a sequence of integers, whereas Ordinal Encoding lets us specify the order of the categories.
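
The difference matters for range-like strings. In the sketch below (`sizes` is a toy list), `LabelEncoder` sorts the strings alphabetically, so `'10-14'` comes before `'5-9'`, while `OrdinalEncoder` with an explicit `categories` list keeps the intended order.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy size ranges, listed in their natural order.
sizes = ["0-4", "5-9", "10-14", "15-19"]

# LabelEncoder sorts as strings: '0-4' < '10-14' < '15-19' < '5-9'.
label_codes = LabelEncoder().fit_transform(sizes)
print(label_codes.tolist())  # [0, 3, 1, 2]

# OrdinalEncoder uses the order given in `categories`.
ord_enc = OrdinalEncoder(categories=[sizes])
ord_codes = ord_enc.fit_transform([[s] for s in sizes]).ravel()
print(ord_codes.tolist())  # [0.0, 1.0, 2.0, 3.0]
```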

In [17]:
from sklearn.preprocessing import OrdinalEncoder

In [18]:
X_ord = df[ord_cols].copy()

X_ord.head()

Unnamed: 0,age,tumor-size,inv-nodes,deg-malig
0,40-49,15-19,0-2,3
1,50-59,15-19,0-2,1
2,50-59,35-39,0-2,2
3,40-49,35-39,0-2,3
4,40-49,30-34,3-5,2


In [19]:
# Pass the categories explicitly so the codes follow the natural order of each range.
ordEnc_1 = OrdinalEncoder(categories=[['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']])
X_ord[['age']] = ordEnc_1.fit_transform(X_ord[['age']])

ordEnc_2 = OrdinalEncoder(categories=[['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54']])
X_ord[['tumor-size']] = ordEnc_2.fit_transform(X_ord[['tumor-size']])

ordEnc_3 = OrdinalEncoder(categories=[['0-2', '3-5', '6-8', '9-11', '12-14', '15-17', '24-26']])
X_ord[['inv-nodes']] = ordEnc_3.fit_transform(X_ord[['inv-nodes']])

# deg-malig is already numeric, so the default sorted order is the correct one.
ordEnc_4 = OrdinalEncoder()
X_ord[['deg-malig']] = ordEnc_4.fit_transform(X_ord[['deg-malig']])

In [20]:
X_ord.head()

Unnamed: 0,age,tumor-size,inv-nodes,deg-malig
0,2.0,3.0,0.0,2.0
1,3.0,3.0,0.0,0.0
2,3.0,7.0,0.0,1.0
3,2.0,7.0,0.0,2.0
4,2.0,6.0,1.0,1.0
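
The separate encoders above could also be combined into a single `OrdinalEncoder` by passing one category list per column. The sketch below does this on a hypothetical two-column toy frame with shortened category lists. It also assumes you want unseen categories mapped to `-1` instead of raising an error, which needs the `handle_unknown` and `unknown_value` parameters available in recent scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with two ordinal columns.
toy = pd.DataFrame({
    "age": ["40-49", "20-29", "30-39"],
    "inv-nodes": ["0-2", "6-8", "3-5"],
})

enc = OrdinalEncoder(
    # One ordered list per column, in the same order as the columns.
    categories=[
        ["20-29", "30-39", "40-49"],
        ["0-2", "3-5", "6-8"],
    ],
    # Assumption: unseen categories should be encoded as -1.
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
encoded = enc.fit_transform(toy)
print(encoded.tolist())  # [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

# A category that was not listed ('70-79') is encoded as -1.
new_rows = pd.DataFrame({"age": ["70-79"], "inv-nodes": ["0-2"]})
unseen = enc.transform(new_rows)
print(unseen.tolist())  # [[-1.0, 0.0]]
```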


In [21]:
X_label.to_csv('X_label.csv', index=False)
X_ord.to_csv('X_ord.csv', index=False)