In [1]:
import numpy as np 
import pandas as pd

In [2]:
df = pd.read_csv('/content/customer.csv')

In [3]:
df.sample(5)

Unnamed: 0,age,gender,review,education,purchased
12,51,Male,Poor,School,No
42,30,Female,Good,PG,Yes
27,69,Female,Poor,PG,No
21,32,Male,Average,PG,No
2,70,Female,Good,PG,No


In [4]:
# Drop age and gender; keep only review, education and purchased
df = df.iloc[:, 2:]


In [5]:
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


In [6]:
from sklearn.model_selection import train_test_split
# Features: review and education (first two columns); target: purchased (last column)
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:2], df.iloc[:, -1], test_size=0.2)

In [7]:
X_train


Unnamed: 0,review,education
48,Good,UG
19,Poor,PG
29,Average,UG
11,Good,UG
21,Average,PG
12,Poor,School
46,Poor,PG
42,Good,PG
3,Good,PG
38,Good,School


# Ordinal Encoder

## OrdinalEncoder encodes categorical features as numerical values, similar to LabelEncoder. The difference is that OrdinalEncoder works on a 2D feature matrix (one or more columns at once) and lets you state the order of the categories explicitly through the categories parameter, so the assigned numbers can reflect a meaningful ordinal relationship. If no order is given, the categories of each column are simply sorted (alphabetically for strings).

## For example, suppose you have a categorical feature "education" with the categories "high school", "college", and "graduate school". Passing the categories in that order, OrdinalEncoder represents "high school" as 0, "college" as 1, and "graduate school" as 2. This encoding reflects the ordinal relationship between the categories, where higher education levels get higher numerical values.
 
## In this way, OrdinalEncoder can capture the ordinal relationship between categories and allow machine learning algorithms to use this information. However, the integer codes also imply that neighbouring categories are equally spaced (the step from "Poor" to "Average" looks the same as the step from "Average" to "Good"), which may not be true, and no new features are created. For categories with no natural order, one-hot encoding is usually the better choice, because it does not impose an order on them.
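The effect of the category order can be seen in a minimal sketch. The toy reviews array below is made up for illustration and is not taken from customer.csv; it compares the default (sorted) encoding with an explicitly ordered one.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: a single feature column of review labels.
reviews = np.array([["Good"], ["Poor"], ["Average"], ["Good"]], dtype=object)

# Default behaviour: categories are sorted, so the codes follow
# alphabetical order (Average=0, Good=1, Poor=2), not the real ranking.
default_oe = OrdinalEncoder()
default_codes = default_oe.fit_transform(reviews)

# Explicit order Poor < Average < Good, so Poor=0, Average=1, Good=2.
ordered_oe = OrdinalEncoder(categories=[["Poor", "Average", "Good"]])
ordered_codes = ordered_oe.fit_transform(reviews)

print(default_codes.ravel())  # [1. 2. 0. 1.]
print(ordered_codes.ravel())  # [2. 0. 1. 2.]
```

Only the second encoder gives codes that increase with the quality of the review, which is why the categories are listed explicitly for ordinal features.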

In [8]:
from sklearn.preprocessing import OrdinalEncoder

In [9]:
oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']])

In [10]:
oe.fit(X_train)

OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])

In [11]:
X_train = oe.transform(X_train)
X_test = oe.transform(X_test)

In [13]:
X_train


array([[2., 1.],
       [0., 2.],
       [1., 1.],
       [2., 1.],
       [1., 2.],
       [0., 0.],
       [0., 2.],
       [2., 2.],
       [2., 2.],
       [2., 0.],
       [0., 0.],
       [0., 2.],
       [1., 1.],
       [2., 2.],
       [2., 0.],
       [0., 2.],
       [2., 2.],
       [2., 0.],
       [2., 1.],
       [2., 0.],
       [0., 1.],
       [1., 1.],
       [0., 1.],
       [0., 2.],
       [1., 0.],
       [1., 1.],
       [2., 2.],
       [0., 1.],
       [1., 2.],
       [0., 2.],
       [2., 1.],
       [0., 0.],
       [1., 1.],
       [2., 0.],
       [2., 0.],
       [0., 0.],
       [1., 0.],
       [1., 2.],
       [1., 0.],
       [0., 0.]])

In [14]:
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]

# LabelEncoding 

## Generally, label encoding is applied to the target column

## Label encoding is a common technique used in machine learning for encoding categorical values into numerical values so that they can be processed by algorithms. In label encoding, each unique category is assigned a unique integer. scikit-learn's LabelEncoder sorts the classes and numbers them from 0 to n_classes-1, so the integers only identify the categories and do not express any ordinal relationship between them. This is fine for a classification target, where the model only needs to tell the classes apart.

## For example, suppose you have a categorical column "color" with the categories "red", "green", and "blue". Because LabelEncoder sorts the classes, it represents "blue" as 0, "green" as 1, and "red" as 2. This gives a numerical version of the column that algorithms accepting only numerical input can work with.
 
## It is important to note that label encoding does not create any new features, and its integer codes carry no meaning beyond identifying the category. For input features with no natural order, one-hot encoding is usually a better choice, as it creates a new binary feature for each category and does not impose an artificial order on them.
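The sorted numbering described above can be checked with a short sketch. The colors list is a hypothetical example, not part of the dataset used in this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels.
colors = ["red", "green", "blue", "green"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(colors)

# classes_ is sorted, so blue=0, green=1, red=2.
print(list(le_demo.classes_))  # ['blue', 'green', 'red']
print(codes.tolist())          # [2, 1, 0, 1]

# inverse_transform maps the integer codes back to the original labels.
restored = le_demo.inverse_transform(codes)
print(list(restored))          # ['red', 'green', 'blue', 'green']
```

The round trip through inverse_transform is useful when predictions have to be reported with the original class names.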

In [15]:
from sklearn.preprocessing import LabelEncoder

In [16]:
le = LabelEncoder()

In [17]:
le.fit(y_train)

LabelEncoder()

In [19]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [20]:
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [21]:
y_train

array([1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1])

# Difference 


## LabelEncoder assigns a unique integer value to each category, without considering any ordinal relationship between the categories: the classes are sorted and numbered in that order. It takes a single 1D column and is intended for the target, as a simple way of converting class labels into numbers.

## On the other hand, OrdinalEncoder can assign numerical values that reflect a meaningful ordinal relationship between the categories, provided the order is passed through the categories parameter. This is useful when there is a meaningful order or ranking between the categories, as the numerical values capture this relationship and allow machine learning algorithms to use it.

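## Another difference is the shape and type of the data. LabelEncoder expects one 1D array and returns integer codes, while OrdinalEncoder expects a 2D array, encodes every column separately, and by default returns the codes as floating-point numbers (which is why the encoded X_train above shows values such as 2. and 1.). In both cases the codes themselves run from 0 to n_categories-1.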
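A side-by-side sketch of the two encoders, assuming small made-up arrays that mirror the review, education and purchased columns (X_demo and y_demo are illustrative names, not the notebook's variables):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy data shaped like the features and target used above.
X_demo = np.array([["Poor", "School"], ["Good", "PG"], ["Average", "UG"]], dtype=object)
y_demo = np.array(["No", "Yes", "No"], dtype=object)

# OrdinalEncoder: 2D feature matrix in, one code column per feature out.
oe_demo = OrdinalEncoder(categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]])
X_codes = oe_demo.fit_transform(X_demo)

# LabelEncoder: 1D target in, 1D integer codes out.
le_demo = LabelEncoder()
y_codes = le_demo.fit_transform(y_demo)

print(X_codes.shape, X_codes.dtype)  # 2D array, floating point by default
print(y_codes.shape, y_codes.dtype)  # 1D array of integers
```

This is why the notebook fits OrdinalEncoder on X_train and LabelEncoder on y_train rather than the other way round.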