## `Ordinal Encoding` and `Label Encoder`



**Encoding `Categorical Data`**
- **Categorical Data** is mainly of two types:
    - **Nominal Categorical Data**
        - There is no relationship or order between the categories.
        - Examples: the states of a country, or the branches of engineering.
    - **Ordinal Categorical Data**
        - There is an order between the categories.
        - Example: grades on a marksheet, i.e. Distinction, 1st Division, 2nd Division, 3rd Division, etc. The order of the grades is:
        - **Distinction > 1st Division > 2nd Division > 3rd Division**.
        - The encoded values must follow this order, so **Distinction** gets the highest value and **3rd Division** gets the lowest.
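
As a minimal sketch of the ordering idea above, the grades can be mapped to integers by hand. The `grades` series and the `grade_order` list are illustrative assumptions and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical grade column; the category names follow the marksheet example.
grades = pd.Series(["2nd Division", "Distinction", "3rd Division", "1st Division"])

# Lowest grade first, so a larger code means a better grade.
grade_order = ["3rd Division", "2nd Division", "1st Division", "Distinction"]
grade_to_code = {grade: code for code, grade in enumerate(grade_order)}

encoded = grades.map(grade_to_code)
print(encoded.tolist())  # [1, 3, 0, 2]
```

The only thing that matters here is that the integer codes preserve the ranking; the exact numbers are arbitrary.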
        


**How to encode `Categorical Data`?**

- **Categorical Data** is mostly in `str` format. To use it in an ML algorithm we need to convert it to numeric values. There are a few techniques for this conversion: `Ordinal Encoding`, mainly used for `Ordinal Data`; `One Hot Encoding`, mainly used for `Nominal Data`; and `Label Encoding`, which works much like `Ordinal Encoding` but is meant for the output feature.
- If an input feature of the dataset contains `Ordinal Data`, we use the **Ordinal Encoder**; if the output feature of the dataset is categorical, we use `Label Encoding`. For this, **scikit-learn** provides a class named **LabelEncoder**.
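
A small sketch of the difference in usage, on made-up toy values: `OrdinalEncoder` works on a 2D array of input features (one column per feature), while `LabelEncoder` works on a 1D array of target labels. The arrays `X_toy` and `y_toy` are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy data: one ordinal input feature (2D) and a categorical target (1D).
X_toy = np.array([["Poor"], ["Good"], ["Average"]], dtype=object)
y_toy = np.array(["No", "Yes", "No"], dtype=object)

# Input feature: explicit order, lowest category first.
X_encoded = OrdinalEncoder(categories=[["Poor", "Average", "Good"]]).fit_transform(X_toy)

# Target: LabelEncoder takes the 1D label array directly.
y_encoded = LabelEncoder().fit_transform(y_toy)

print(X_encoded.ravel().tolist())  # [0.0, 2.0, 1.0]
print(y_encoded.tolist())          # [0, 1, 0]
```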

In [1]:
# importing the libraries

import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
# importing the dataset

df = pd.read_csv('datasets/customer.csv')
df.sample(5)

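|    | age | gender | review  | education | purchased |
|----|-----|--------|---------|-----------|-----------|
| 6  | 18  | Male   | Good    | School    | No        |
| 30 | 73  | Male   | Average | UG        | No        |
| 38 | 45  | Female | Good    | School    | No        |
| 18 | 19  | Male   | Good    | School    | No        |
| 42 | 30  | Female | Good    | PG        | Yes       |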


In [3]:
df.dtypes

age           int64
gender       object
review       object
education    object
purchased    object
dtype: object

In [4]:
# Checking how many unique categories in each column

df.nunique()

age          41
gender        2
review        3
education     3
purchased     2
dtype: int64

**Notes**

- The input features `gender`, `review` and `education` and the output feature `purchased` are all categorical data.
- `gender` is of `Nominal category` type and has 2 unique categories, so **One Hot Encoder** is the right choice for it.
- `review` and `education` are of `Ordinal category` type and each has 3 unique categories. These will be transformed using **Ordinal Encoder**.
- The output feature `purchased` is of `Nominal category` type and has 2 unique categories. Here we will apply **Label Encoder**.
- As we have not covered **One Hot Encoder** yet, we will remove the `gender` column, together with the numeric `age` column, and work only with the remaining columns.

In [5]:
# removing 'age' and 'gender' columns

df = df.iloc[:, 2:]
df.head()

|   | review  | education | purchased |
|---|---------|-----------|-----------|
| 0 | Average | School    | No        |
| 1 | Poor    | UG        | No        |
| 2 | Good    | PG        | No        |
| 3 | Good    | PG        | No        |
| 4 | Average | UG        | No        |


In [6]:
df.shape

(50, 3)

### Doing train test split

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:2],
                                                    df.iloc[:, -1],
                                                    test_size=0.2,
                                                    random_state=42)

X_train.shape, X_test.shape

((40, 2), (10, 2))

In [8]:
X_train

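|    | review  | education |
|----|---------|-----------|
| 12 | Poor    | School    |
| 4  | Average | UG        |
| 37 | Average | PG        |
| 8  | Average | UG        |
| 3  | Good    | PG        |
| 6  | Good    | School    |
| 41 | Good    | PG        |
| 46 | Poor    | PG        |
| 47 | Good    | PG        |
| 15 | Poor    | UG        |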


### `OrdinalEncoder`

In [9]:
from sklearn.preprocessing import OrdinalEncoder

In [10]:
# Creating an object of the OrdinalEncoder class.
# The 'categories' parameter takes a list of lists: one inner list per ordinal column,
# giving that column's categories in the order they should be encoded (lowest first).
# For the 'review' column the order is 'Poor' < 'Average' < 'Good',
# so after the transformation 'Poor' gets the minimum value (0) and 'Good' gets the maximum (2).
# The same applies to the 'education' column: 'School' < 'UG' < 'PG'.
# If this parameter is not passed, the categories are inferred from the data and sorted
# (alphabetically for strings), which usually does not match the intended order.


oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']])
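
To see what happens when `categories` is left out, here is a minimal sketch on a made-up `reviews` frame (an assumption for illustration). With the default setting, scikit-learn infers the categories from the data and sorts them, so the codes follow alphabetical order and not the `Poor < Average < Good` ranking.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column for illustration only.
reviews = pd.DataFrame({"review": ["Poor", "Average", "Good", "Poor"]})

# No 'categories' argument: categories are inferred and sorted.
default_oe = OrdinalEncoder()
default_codes = default_oe.fit_transform(reviews)

print(default_oe.categories_[0].tolist())  # ['Average', 'Good', 'Poor']
print(default_codes.ravel().tolist())      # [2.0, 0.0, 1.0, 2.0]
```

Here `'Poor'` ends up with the largest code, which is why the explicit `categories` list matters for ordinal data.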

In [11]:
# training with the train data

oe.fit(X_train)

In [12]:
# Now transforming both train and test data


X_train = oe.transform(X_train)
X_test = oe.transform(X_test)
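
By default, `OrdinalEncoder` raises an error if `transform` meets a category that is not among its known categories. As a hedged sketch, assuming scikit-learn 0.24 or newer and a made-up unseen category `'Excellent'`, such values can be mapped to a sentinel code instead. The frames `train_demo` and `test_demo` are illustrative and are not the dataset above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames for illustration; 'Excellent' never appears in the training data.
train_demo = pd.DataFrame({"review": ["Poor", "Average", "Good"]})
test_demo = pd.DataFrame({"review": ["Good", "Excellent"]})

oe_safe = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"]],
    handle_unknown="use_encoded_value",  # do not raise on unknown categories
    unknown_value=-1,                    # code assigned to unknown categories
)
oe_safe.fit(train_demo)

test_codes = oe_safe.transform(test_demo)
print(test_codes.ravel().tolist())  # [2.0, -1.0]
```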

In [13]:
# Checking the training data after transformation

X_train

array([[0., 0.],
       [1., 1.],
       [1., 2.],
       [1., 1.],
       [2., 2.],
       [2., 0.],
       [2., 2.],
       [0., 2.],
       [2., 2.],
       [0., 1.],
       [2., 1.],
       [0., 1.],
       [1., 2.],
       [1., 0.],
       [0., 0.],
       [1., 0.],
       [1., 1.],
       [0., 2.],
       [2., 2.],
       [1., 0.],
       [1., 1.],
       [2., 1.],
       [2., 1.],
       [0., 1.],
       [1., 2.],
       [2., 2.],
       [0., 2.],
       [0., 0.],
       [2., 0.],
       [2., 0.],
       [2., 1.],
       [0., 2.],
       [2., 0.],
       [2., 1.],
       [1., 0.],
       [0., 0.],
       [2., 2.],
       [0., 2.],
       [0., 0.],
       [2., 0.]])

In [14]:
# Checking the categories in the object

oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
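
The stored categories also allow the codes to be turned back into the original strings. A minimal sketch, using a separate encoder `oe_demo` and two made-up rows so that it runs on its own:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

oe_demo = OrdinalEncoder(categories=[["Poor", "Average", "Good"],
                                     ["School", "UG", "PG"]])

# Two illustrative rows: (review, education).
rows = np.array([["Poor", "School"], ["Good", "PG"]], dtype=object)

codes = oe_demo.fit_transform(rows)
restored = oe_demo.inverse_transform(codes)

print(codes.tolist())     # [[0.0, 0.0], [2.0, 2.0]]
print(restored.tolist())  # [['Poor', 'School'], ['Good', 'PG']]
```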

### `LabelEncoder`

**Remember: `LabelEncoder` is meant only for encoding the output feature (target) when it has categorical values.**

In [15]:
from sklearn.preprocessing import LabelEncoder

In [16]:
# Creating object

le = LabelEncoder()

In [17]:
# training the object

le.fit(y_train)

In [18]:
# Checking the classes
# Here 'No' is assigned 0 and 'Yes' is assigned 1
# LabelEncoder sorts the classes itself (alphabetically for strings); the order cannot be set through a parameter.

le.classes_

array(['No', 'Yes'], dtype=object)
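
The position of each label in `classes_` is its code, and `inverse_transform` maps codes back to labels. A small self-contained sketch with made-up labels (the `le_demo` encoder and `labels` array are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["Yes", "No", "Yes", "No"], dtype=object)

le_demo = LabelEncoder()
label_codes = le_demo.fit_transform(labels)

print(le_demo.classes_.tolist())                          # ['No', 'Yes']
print(label_codes.tolist())                               # [1, 0, 1, 0]
print(le_demo.inverse_transform(label_codes).tolist())    # ['Yes', 'No', 'Yes', 'No']
```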

In [19]:
# Transforming both train and test output features

y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [20]:
# Checking the transformed training output data

y_train

array([0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0])

In [21]:
# Sanity check against the original data:
# index 12 is the 1st row of the train data, so its row here should match the first encoded values

df.head(15)

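|   | review  | education | purchased |
|---|---------|-----------|-----------|
| 0 | Average | School    | No        |
| 1 | Poor    | UG        | No        |
| 2 | Good    | PG        | No        |
| 3 | Good    | PG        | No        |
| 4 | Average | UG        | No        |
| 5 | Average | School    | Yes       |
| 6 | Good    | School    | No        |
| 7 | Poor    | School    | Yes       |
| 8 | Average | UG        | No        |
| 9 | Good    | UG        | Yes       |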
