# Ordinal Encoding
Ordinal encoding is a data preprocessing technique used in machine learning and statistics to convert categorical variables into numerical values. It assigns a unique integer to each category of a feature, with the integers chosen so that they follow the natural order or hierarchy of the categories.

For example, if we have a feature called "Education Level" with categories "High School," "Bachelor's Degree," and "Master's Degree," we can assign 1, 2, and 3 respectively to preserve the order. In ordinal encoding, we don't create dummy variables for each category as in one-hot encoding.
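
The mapping described above can be sketched in plain Python. The list name `education_order` and the sample values are assumptions made up for illustration; only the three categories and the codes 1, 2, 3 come from the example.

```python
# Hypothetical category order, lowest to highest, taken from the example above.
education_order = ["High School", "Bachelor's Degree", "Master's Degree"]

# Map each category to its rank, starting at 1 to match the text.
education_to_rank = {level: rank for rank, level in enumerate(education_order, start=1)}

# Made-up sample values to encode.
samples = ["Master's Degree", "High School", "Bachelor's Degree", "High School"]
encoded = [education_to_rank[level] for level in samples]

print(encoded)  # [3, 1, 2, 1]
```

The feature stays a single column of integers, which is the difference from one-hot encoding mentioned above.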

Ordinal encoding is useful when a categorical variable has an inherent order or hierarchy. However, it should be used with caution: the integers imply not only an order but also equal spacing between neighbouring categories, and a model may treat that spacing as meaningful even when it is not. If the categories have no real order at all, the assigned numbers carry no meaning and can bias the model.
# Label Encoding
Label encoding is a process of converting categorical variables into numerical variables by assigning a unique integer label to each category in the variable. This is a simple and commonly used technique in machine learning for encoding categorical data.

For example, if we have a categorical variable "Color" with categories "Red", "Green", and "Blue", we can use label encoding to assign "Red" to 1, "Green" to 2, and "Blue" to 3.

Label encoding works well for target variables, and for features whose integer codes happen to match a natural ordering of the categories. However, it is important to note that label encoding assigns the integers in an arbitrary order (for example, alphabetical) that may not reflect any real relationship between the categories. This can cause issues in machine learning algorithms that treat the values as ordered numbers, such as linear models and distance-based methods, because they will read 3 as "greater than" 1 even when the categories have no such order.
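
To see the arbitrary order in practice, here is a small sketch using scikit-learn's `LabelEncoder` on the colour example. The sample list is made up. Note that `LabelEncoder` numbers the classes from 0 in sorted order, so the codes differ from the 1, 2, 3 used in the illustration above.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample values for the "Color" variable.
colors = ["Red", "Green", "Blue", "Green"]

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ is sorted, so the codes follow alphabetical order,
# not the order in which the values appear in the data.
print(le.classes_.tolist())  # ['Blue', 'Green', 'Red']
print(codes.tolist())        # [2, 1, 0, 1]
```
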
## Label encoding and ordinal encoding are similar, but not the same
Both encoding methods are used to convert categorical data into numerical data, which can be used as input for machine learning algorithms.

In label encoding, each category is assigned a unique numerical label. For example, if we have a categorical variable "color" with categories "red", "green", and "blue", we might assign the labels 0, 1, and 2 to these categories, respectively. The problem with label encoding is that it implicitly assumes an ordering between the categories, which may not always be valid or desirable.

In ordinal encoding, the categories are assigned numerical labels according to their order or rank. For example, if we have a categorical variable "education" with categories "high school", "college", and "graduate", we might assign the labels 0, 1, and 2 to these categories, respectively, based on their increasing level of education. Ordinal encoding preserves the ordering between the categories, which can be useful in certain contexts.

So while both encoding methods convert categorical data into numerical data, label encoding assigns arbitrary numerical labels to each category, while ordinal encoding assigns numerical labels based on the order or rank of the categories.
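
The difference can be sketched with scikit-learn's `OrdinalEncoder`, once with inferred categories and once with an explicit rank order. The sample values are made up; with no `categories` argument the encoder sorts the categories, which for strings means alphabetical order.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up "education" column, one value per row.
education = np.array(
    [["high school"], ["college"], ["graduate"], ["college"]], dtype=object
)

# Categories inferred from the data: sorted alphabetically, so the codes
# are arbitrary with respect to the level of education.
auto = OrdinalEncoder()
auto_codes = auto.fit_transform(education).ravel().tolist()

# Categories given explicitly from lowest to highest: codes follow the rank.
ranked = OrdinalEncoder(categories=[["high school", "college", "graduate"]])
ranked_codes = ranked.fit_transform(education).ravel().tolist()

print(auto_codes)    # [2.0, 0.0, 1.0, 0.0]
print(ranked_codes)  # [0.0, 1.0, 2.0, 1.0]
```

In the first result "high school" gets the largest code, which is exactly the kind of arbitrary order described above.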

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("customer.csv")

In [3]:
df.sample(5)

Unnamed: 0,age,gender,review,education,purchased
5,31,Female,Average,School,Yes
39,76,Male,Poor,PG,No
0,30,Female,Average,School,No
23,96,Female,Good,School,No
30,73,Male,Average,UG,No


In [4]:
# For now I am dropping age and gender: keeping gender would require one-hot encoding, which is not the topic of this notebook.

In [5]:
df = df.iloc[:,2:]
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No
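
As an alternative to slicing by position, the same columns could be dropped by name, which keeps working if the column order in the file changes. This is only a sketch: the two rows below are made up, and only the column names are taken from the output above.

```python
import pandas as pd

# Made-up rows using the same column names as customer.csv.
df_demo = pd.DataFrame(
    {
        "age": [30, 76],
        "gender": ["Female", "Male"],
        "review": ["Average", "Poor"],
        "education": ["School", "PG"],
        "purchased": ["No", "No"],
    }
)

# Drop by name instead of by position.
by_name = df_demo.drop(columns=["age", "gender"])

# Positional version used in the cell above, for comparison.
by_position = df_demo.iloc[:, 2:]

print(by_name.columns.tolist())  # ['review', 'education', 'purchased']
```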


In [6]:
from sklearn.model_selection import train_test_split
X = df.iloc[:,:2]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# The split above can also be written in a single line:
# X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,:2], df.iloc[:,-1], test_size=0.2, random_state=42)

In [7]:
X_train.sample(5)

Unnamed: 0,review,education
33,Good,PG
46,Poor,PG
27,Poor,PG
44,Average,UG
22,Poor,PG


In [8]:
from sklearn.preprocessing import OrdinalEncoder # used to encode the input features, not the target

In [9]:
# Pass one list of categories per column, in the same order as the columns (review, education).
# Each list must go from the lowest category to the highest.
oe = OrdinalEncoder(categories=[["Poor", "Average", "Good"],["School", "UG", "PG"]])

In [10]:
oe.fit(X_train)

In [11]:
X_train = oe.transform(X_train)
X_test = oe.transform(X_test)
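
One thing to watch for at this step: with the default settings, `transform` raises an error if the test data contains a category the encoder was not given. A possible way around it, sketched below, is the `handle_unknown="use_encoded_value"` option (available in scikit-learn 0.24 and later, as far as I know). The two small frames and the unseen values "Excellent" and "PhD" are made up, and using -1 as the placeholder code is just one possible choice.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up train/test frames with the same two columns as the notebook.
train = pd.DataFrame(
    {"review": ["Poor", "Good", "Average"], "education": ["School", "PG", "UG"]}
)
test = pd.DataFrame({"review": ["Good", "Excellent"], "education": ["UG", "PhD"]})

oe_safe = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]],
    handle_unknown="use_encoded_value",  # do not raise on unseen categories
    unknown_value=-1,                    # code assigned to unseen categories
)
oe_safe.fit(train)
encoded_test = oe_safe.transform(test)

print(encoded_test.tolist())  # [[2.0, 1.0], [-1.0, -1.0]]
```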

In [12]:
X_train

array([[0., 0.],
       [1., 1.],
       [1., 2.],
       [1., 1.],
       [2., 2.],
       [2., 0.],
       [2., 2.],
       [0., 2.],
       [2., 2.],
       [0., 1.],
       [2., 1.],
       [0., 1.],
       [1., 2.],
       [1., 0.],
       [0., 0.],
       [1., 0.],
       [1., 1.],
       [0., 2.],
       [2., 2.],
       [1., 0.],
       [1., 1.],
       [2., 1.],
       [2., 1.],
       [0., 1.],
       [1., 2.],
       [2., 2.],
       [0., 2.],
       [0., 0.],
       [2., 0.],
       [2., 0.],
       [2., 1.],
       [0., 2.],
       [2., 0.],
       [2., 1.],
       [1., 0.],
       [0., 0.],
       [2., 2.],
       [0., 2.],
       [0., 0.],
       [2., 0.]])

In [13]:
oe.categories_ # the categories the encoder uses for each column, in encoding order

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
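
Because the encoder stores the category order, the integer codes can be mapped back to the original strings with `inverse_transform`. A minimal sketch on two made-up rows:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

oe_demo = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]]
)

# Made-up rows: (review, education).
X_demo = np.array([["Good", "PG"], ["Poor", "UG"]], dtype=object)

codes = oe_demo.fit_transform(X_demo)
restored = oe_demo.inverse_transform(codes)

print(codes.tolist())     # [[2.0, 2.0], [0.0, 1.0]]
print(restored.tolist())  # [['Good', 'PG'], ['Poor', 'UG']]
```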

In [15]:
# Label-encode the target variable. Per the scikit-learn documentation,
# LabelEncoder should only be used on the target (y), not on input features (X).
from sklearn.preprocessing import LabelEncoder

In [16]:
le = LabelEncoder() # with this class you cannot choose the order; it is determined automatically (classes are sorted)

In [17]:
le.fit(y_train)

In [21]:
y_train.sample(5)

20    Yes
3      No
23     No
22    Yes
9     Yes
Name: purchased, dtype: object

In [24]:
le.classes_ # the classes found during fit; the position is the code, so here No=0 and Yes=1

array(['No', 'Yes'], dtype=object)

In [25]:
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [28]:
y_train

array([0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0])
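
After a model predicts 0s and 1s, the fitted `LabelEncoder` can turn them back into the original labels with `inverse_transform`. A small sketch, where the target values and the "predictions" `[1, 0]` are made up:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up target values with the same two classes as the notebook.
y_demo = ["No", "Yes", "Yes", "No"]

le_demo = LabelEncoder()
y_codes = le_demo.fit_transform(y_demo)

# Pretend these are model predictions and map them back to labels.
decoded = le_demo.inverse_transform([1, 0])

print(y_codes.tolist())  # [0, 1, 1, 0]
print(decoded.tolist())  # ['Yes', 'No']
```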