### Handling Categorical Variables

- Categorical variables come in two kinds: ordinal data (categories with a natural order) and nominal data (categories with no order)
- For ordinal data we use ordinal encoding
- For nominal data we use one-hot encoding
- If the output (target) variable is categorical, we should use label encoding on it
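The rest of this section only walks through ordinal and label encoding, so here is a minimal sketch of one-hot encoding for a nominal column. The toy `gender` values (including `"Male"`) and the names `toy`, `ohe`, and `gender_encoded` are assumptions made up for illustration; they are not taken from `data.csv`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy nominal column (assumed values, for illustration only)
toy = pd.DataFrame({"gender": ["Female", "Male", "Female", "Male"]})

# One-hot encoding creates one 0/1 column per category
ohe = OneHotEncoder()
gender_encoded = ohe.fit_transform(toy[["gender"]]).toarray()

print(ohe.categories_)   # categories found in the column
print(gender_encoded)    # one row per sample, one column per category
```

Each row has a single 1 in the column of its category, so no artificial order is introduced between the categories.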

### Why do we encode?

- Because most machine learning models don't understand strings. They only work with numbers. So we need to encode those strings as numbers.
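To see this in practice, the sketch below tries to fit a scikit-learn model directly on string features. The toy data and the names `X_raw`, `y_toy`, and `fit_failed` are made up for illustration, and `LogisticRegression` is just one example estimator.

```python
from sklearn.linear_model import LogisticRegression

# Toy data (assumed, for illustration): one string feature, numeric labels
X_raw = [["Poor"], ["Good"], ["Average"], ["Good"]]
y_toy = [0, 1, 0, 1]

model = LogisticRegression()
try:
    model.fit(X_raw, y_toy)
    fit_failed = False
except (ValueError, TypeError) as err:
    # The strings cannot be converted to floats, so fitting is rejected
    fit_failed = True
    print("Fit failed:", type(err).__name__)
```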

In [51]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [35]:
# load my data

df = pd.read_csv("data.csv")
df.head()

| | age | gender | review | education | purchased |
|---|---|---|---|---|---|
| 0 | 30 | Female | Average | School | No |
| 1 | 68 | Female | Poor | UG | No |
| 2 | 70 | Female | Good | PG | No |
| 3 | 72 | Female | Good | PG | No |
| 4 | 16 | Female | Average | UG | No |


#### Ordinal Encoding

In the dataset above, ordinal encoding applies to the review and education columns, since both have a natural order (Poor < Average < Good and School < UG < PG)
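Conceptually, ordinal encoding just replaces each category with its position in the order. A minimal sketch of that idea with a plain dictionary follows; the names `review_order`, `reviews`, and `review_codes` are hypothetical, and this is not how `OrdinalEncoder` is implemented internally.

```python
import pandas as pd

# Position of each category in its natural order
review_order = {"Poor": 0, "Average": 1, "Good": 2}

# First five review values from df.head()
reviews = pd.Series(["Average", "Poor", "Good", "Good", "Average"])

# Replace each category by its rank
review_codes = reviews.map(review_order)
print(review_codes.tolist())  # [1, 0, 2, 2, 1]
```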

In [36]:
df1 = df.iloc[:,2:]

In [37]:
df1.head()

| | review | education | purchased |
|---|---|---|---|
| 0 | Average | School | No |
| 1 | Poor | UG | No |
| 2 | Good | PG | No |
| 3 | Good | PG | No |
| 4 | Average | UG | No |


In [38]:
X = df1[["review","education"]]
y = df1["purchased"]

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [40]:
X_train.shape

(35, 2)

In [41]:
X_test.shape

(15, 2)

In [42]:
y_train.shape

(35,)

In [43]:
y_test.shape

(15,)

In [44]:
oe = OrdinalEncoder(categories=[["Poor","Average","Good"],["School","UG","PG"]])

In [45]:
oe.fit(X_train)

OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])

In [46]:
X_train_transformed = oe.transform(X_train)
X_test_transformed = oe.transform(X_test)
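One thing to watch for: with the default settings, `transform` raises an error if the test data contains a category the encoder was not told about. A hedged sketch of one way to handle that is below, using the `handle_unknown="use_encoded_value"` and `unknown_value` parameters (available in scikit-learn 0.24 and later). The `"Excellent"` value and the names `train_toy`, `test_toy`, `oe_safe`, and `test_codes` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (assumed): the test set contains an unseen category
train_toy = pd.DataFrame({"review": ["Poor", "Average", "Good"]})
test_toy = pd.DataFrame({"review": ["Good", "Excellent"]})

oe_safe = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"]],
    handle_unknown="use_encoded_value",  # do not raise on unseen categories
    unknown_value=-1,                    # code assigned to unseen categories
)
oe_safe.fit(train_toy)

test_codes = oe_safe.transform(test_toy)
print(test_codes)  # "Good" -> 2.0, "Excellent" -> -1.0
```

Whether -1 is a sensible code for unknown values depends on the model, so treat this as a starting point rather than a rule.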

In [47]:
X_train

| | review | education |
|---|---|---|
| 7 | Poor | School |
| 14 | Poor | PG |
| 45 | Poor | PG |
| 48 | Good | UG |
| 29 | Average | UG |
| 15 | Poor | UG |
| 30 | Average | UG |
| 32 | Average | UG |
| 16 | Poor | UG |
| 42 | Good | PG |


In [49]:
X_train_transformed

array([[0., 0.],
       [0., 2.],
       [0., 2.],
       [2., 1.],
       [1., 1.],
       [0., 1.],
       [1., 1.],
       [1., 1.],
       [0., 1.],
       [2., 2.],
       [1., 0.],
       [0., 2.],
       [1., 1.],
       [1., 0.],
       [2., 0.],
       [1., 0.],
       [0., 1.],
       [2., 0.],
       [2., 1.],
       [0., 1.],
       [0., 0.],
       [1., 2.],
       [1., 2.],
       [2., 0.],
       [2., 0.],
       [2., 1.],
       [1., 2.],
       [0., 2.],
       [2., 1.],
       [0., 2.],
       [0., 2.],
       [2., 2.],
       [1., 0.],
       [2., 2.],
       [1., 1.]])

In [50]:
# To see my categories

oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
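Because the encoder remembers these categories, the numeric codes can be turned back into the original strings with `inverse_transform`. A small self-contained sketch follows; `oe_demo`, `demo_rows`, and `decoded_rows` are hypothetical names, and the encoder is fitted on three made-up rows rather than on `X_train`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Same category order as in the notebook
oe_demo = OrdinalEncoder(
    categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]]
)

# Made-up rows, just enough to fit the encoder
demo_rows = np.array(
    [["Poor", "School"], ["Average", "UG"], ["Good", "PG"]], dtype=object
)
oe_demo.fit(demo_rows)

# Map codes back to their categories
decoded_rows = oe_demo.inverse_transform([[0.0, 2.0], [2.0, 1.0]])
print(decoded_rows)  # [['Poor' 'PG'], ['Good' 'UG']]
```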

#### Label Encoder

Since our target column is also categorical, we use a label encoder to encode it

In [52]:
le = LabelEncoder()

le.fit(y_train)

y_train_transformed = le.transform(y_train)
y_test_transformed = le.transform(y_test)

In [53]:
le.classes_

array(['No', 'Yes'], dtype=object)
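`LabelEncoder` sorts the classes, so the position of each class in `classes_` is its code: here `No` becomes 0 and `Yes` becomes 1. A short sketch on made-up labels, with the hypothetical names `le_demo`, `label_codes`, and `label_names`:

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()

# Made-up target values, for illustration only
label_codes = le_demo.fit_transform(["No", "Yes", "No", "No", "Yes"])
print(label_codes)  # [0 1 0 0 1]

# Codes can be mapped back to the original labels
label_names = le_demo.inverse_transform(label_codes)
print(label_names)  # ['No' 'Yes' 'No' 'No' 'Yes']
```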

So that's how we encode ordinal categorical features, along with a categorical target