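Encoding categorical data means converting category labels into numbers that a model can consume. Ordinal Encoding and Label Encoding both assign an integer to each category: Ordinal Encoding is meant for input features whose categories have a natural order, while Label Encoding is meant for the target column.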

Ordinal Encoding – Categorical Feature Encoding

✅ When to Use:

Categories have a meaningful order (e.g., "low", "medium", "high").

You want to preserve that order for models that benefit from it (like tree-based or linear models).
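As a minimal sketch of what preserving the order means, the snippet below maps each category to its rank by hand. The `levels` list and the `encode_ordinal` helper are hypothetical names used only for illustration; they are not part of pandas or scikit-learn.

```python
# Hypothetical helper for illustration only: map ordered categories to ranks.
levels = ["low", "medium", "high"]  # assumed order, lowest to highest


def encode_ordinal(values, ordered_levels):
    """Return the position of each value within ordered_levels."""
    rank = {level: i for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]


encoded = encode_ordinal(["medium", "low", "high", "low"], levels)
print(encoded)  # [1, 0, 2, 0]
```

Because "low" < "medium" < "high" becomes 0 < 1 < 2, comparisons on the encoded column agree with the original ranking.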

In [76]:
import pandas as pd

In [77]:
df = pd.read_csv('data.csv')

df

,age,gender,review,education,purchased
0,30,Female,Average,School,No
1,68,Female,Poor,UG,No
2,70,Female,Good,PG,No
3,72,Female,Good,PG,No
4,16,Female,Average,UG,No
5,31,Female,Average,School,Yes
6,18,Male,Good,School,No
7,60,Female,Poor,School,Yes
8,65,Female,Average,UG,No
9,74,Male,Good,UG,Yes


In [78]:
# Drop age and gender; keep review, education and purchased
df = df.iloc[:, 2:]

df

,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No
5,Average,School,Yes
6,Good,School,No
7,Poor,School,Yes
8,Average,UG,No
9,Good,UG,Yes


In [79]:
from sklearn.model_selection import train_test_split

In [80]:
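x_train, x_test, y_train, y_test = train_test_split(
    df.iloc[:, 0:2],   # features: review, education
    df.iloc[:, -1],    # target: purchased
    test_size=0.2,     # hold out 20% of the rows for testing
    random_state=42    # fixed seed so the split is reproducible
)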

In [81]:
x_train

,review,education
12,Poor,School
4,Average,UG
37,Average,PG
8,Average,UG
3,Good,PG
6,Good,School
41,Good,PG
46,Poor,PG
47,Good,PG
15,Poor,UG


In [82]:
from sklearn.preprocessing import OrdinalEncoder

In [83]:
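# One list per feature column, in column order (review, education).
# Each list runs from the lowest category to the highest.
encoder = OrdinalEncoder(
    categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']]
)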
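The explicit `categories` argument matters because, when it is omitted, `OrdinalEncoder` sorts the unique values it finds, which for strings means alphabetical order. The sketch below compares the two behaviours on a small made-up column; `reviews`, `default_enc` and `ordered_enc` are illustrative names, not part of the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy single-column input using the same review labels as above.
reviews = np.array([["Poor"], ["Average"], ["Good"]], dtype=object)

# Without `categories`, the encoder sorts the unique values it sees.
default_enc = OrdinalEncoder().fit(reviews)
default_codes = default_enc.transform(reviews).ravel().tolist()

# With an explicit list, the codes follow the order we supply.
ordered_enc = OrdinalEncoder(categories=[["Poor", "Average", "Good"]]).fit(reviews)
ordered_codes = ordered_enc.transform(reviews).ravel().tolist()

print(default_enc.categories_[0].tolist())  # ['Average', 'Good', 'Poor']
print(default_codes)                        # [2.0, 0.0, 1.0]
print(ordered_codes)                        # [0.0, 1.0, 2.0]
```

With the default, "Poor" would receive the largest code, which reverses the intended ranking.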

In [84]:
encoder.fit(x_train)

In [85]:
x_train = encoder.transform(x_train)
x_test = encoder.transform(x_test)
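Transforming the test set only works if every value was listed at fit time. A hedged sketch of what happens otherwise, assuming scikit-learn 0.24 or later (where `handle_unknown="use_encoded_value"` is available): the label "Excellent" is made up for illustration, and -1 is an arbitrary choice of placeholder code.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

order = [["Poor", "Average", "Good"]]
train = np.array([["Poor"], ["Average"], ["Good"]], dtype=object)
new = np.array([["Good"], ["Excellent"]], dtype=object)  # "Excellent" was never seen

# Default behaviour: an unseen category raises ValueError at transform time.
strict_enc = OrdinalEncoder(categories=order).fit(train)
try:
    strict_enc.transform(new)
    raised = False
except ValueError:
    raised = True

# Alternative: map unseen categories to a sentinel value instead of failing.
safe_enc = OrdinalEncoder(
    categories=order,
    handle_unknown="use_encoded_value",
    unknown_value=-1,
).fit(train)
safe_codes = safe_enc.transform(new).ravel().tolist()

print(raised)      # True
print(safe_codes)  # [2.0, -1.0]
```

Whether a sentinel is appropriate depends on the model; a linear model will treat -1 as "below Poor", which may not be what you want.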

In [86]:
x_train

array([[0., 0.],
       [1., 1.],
       [1., 2.],
       [1., 1.],
       [2., 2.],
       [2., 0.],
       [2., 2.],
       [0., 2.],
       [2., 2.],
       [0., 1.],
       [2., 1.],
       [0., 1.],
       [1., 2.],
       [1., 0.],
       [0., 0.],
       [1., 0.],
       [1., 1.],
       [0., 2.],
       [2., 2.],
       [1., 0.],
       [1., 1.],
       [2., 1.],
       [2., 1.],
       [0., 1.],
       [1., 2.],
       [2., 2.],
       [0., 2.],
       [0., 0.],
       [2., 0.],
       [2., 0.],
       [2., 1.],
       [0., 2.],
       [2., 0.],
       [2., 1.],
       [1., 0.],
       [0., 0.],
       [2., 2.],
       [0., 2.],
       [0., 0.],
       [2., 0.]])

In [87]:
x_test

array([[1., 0.],
       [0., 2.],
       [1., 1.],
       [0., 2.],
       [0., 1.],
       [2., 1.],
       [0., 2.],
       [2., 0.],
       [1., 1.],
       [0., 2.]])

In [88]:
encoder.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
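Since `categories_` stores the mapping, the encoder can also turn codes back into labels with `inverse_transform`. A small sketch on two made-up rows (`rows` is an illustrative array, not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]]
)
rows = np.array([["Poor", "School"], ["Good", "PG"]], dtype=object)

codes = enc.fit_transform(rows)         # labels -> numeric codes
decoded = enc.inverse_transform(codes)  # numeric codes -> labels

print(codes.tolist())    # [[0.0, 0.0], [2.0, 2.0]]
print(decoded.tolist())  # [['Poor', 'School'], ['Good', 'PG']]
```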

In [89]:
y_train

12     No
4      No
37    Yes
8      No
3      No
6      No
41    Yes
46     No
47    Yes
15     No
9     Yes
16    Yes
24    Yes
34     No
31    Yes
0      No
44     No
27     No
33    Yes
5     Yes
29    Yes
11    Yes
36    Yes
1      No
21     No
2      No
43     No
35    Yes
23     No
40     No
10    Yes
22    Yes
18     No
49     No
20    Yes
7     Yes
42    Yes
14    Yes
28     No
38     No
Name: purchased, dtype: object

In [90]:
encode = OrdinalEncoder()

In [91]:
# OrdinalEncoder expects 2-D input. Passing the Series directly raises
# "ValueError: Expected a 2-dimensional container", so wrap the target
# in a single-column DataFrame first.
encode.fit(y_train.to_frame())

In [None]:
y_train = encode.transform(y_train.to_frame())
y_test = encode.transform(y_test.to_frame())

In [None]:
y_train

array([[0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [1.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.]])

In [None]:
y_test

array([[0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [0.],
       [0.],
       [1.],
       [1.]])
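The target here is a 1-D Series, which is the input shape scikit-learn's `LabelEncoder` is designed for. As an alternative sketch, the snippet below encodes a small stand-in for `purchased`; the four values in `y` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the `purchased` target column (illustrative values).
y = pd.Series(["No", "Yes", "No", "Yes"], name="purchased")

label_enc = LabelEncoder()
y_codes = label_enc.fit_transform(y)  # 1-D in, 1-D out

print(label_enc.classes_.tolist())  # ['No', 'Yes']
print(y_codes.tolist())             # [0, 1, 0, 1]
```

`LabelEncoder` assigns codes in sorted order of the labels and returns a flat integer array, so no reshaping is needed before passing it to an estimator.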