# Encoding the Categorical Data

In [1]:
"""
Two types of categorical data

1. Nominal data  
   - Categories have **no inherent order or ranking**.  
   - Each value holds **equal importance**.  
   - Examples: gender (male, female), color (red, blue, green), city names.  
   - In this case we use One Hot Encoding

2. Ordinal data  
   - Categories have a **meaningful order or ranking**, but the **intervals between them are not necessarily equal**.  
   - Each value holds **different importance or priority**.  
   - Examples: education level (high school < bachelor < master < PhD), customer satisfaction (poor < average < good < excellent).
   - In this case we use Ordinal Encoding


Categorical data is usually stored as text, so it must be converted into numerical data before a model can use it.
There are two main ways to do this:

1. Ordinal encoding
   - Label encoding (the variant used for the output column)
2. One hot encoding


When an input (feature) column holds ordered text categories, use Ordinal Encoding. When the output (target) column is in the same text format, do not use Ordinal Encoding on it; use Label Encoding instead.

Label Encoding is meant explicitly for the output column.

"""

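The notes above name One Hot Encoding as the choice for nominal data, but the notebook does not show it. The following is a minimal sketch, assuming scikit-learn's `OneHotEncoder` and a made-up `color` column (the data and variable names are hypothetical, not taken from `customer.csv`). Each category becomes its own 0/1 column, so no order is implied between the values.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal column; the values are invented for illustration.
colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# handle_unknown='ignore' encodes categories not seen during fit as all zeros
# instead of raising an error at transform time.
ohe = OneHotEncoder(handle_unknown='ignore')

# The encoder returns a sparse matrix by default; convert it to a dense array.
encoded = ohe.fit_transform(colors).toarray()

print(ohe.categories_)   # learned categories, in sorted order: blue, green, red
print(encoded)           # one row per sample, one column per category
```

Here `red` becomes `[0, 0, 1]`, because the columns follow the sorted order of the categories.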
### Ordinal encoding & Label encoding

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

In [3]:
df = pd.read_csv('customer.csv')
df.head()

Unnamed: 0,age,gender,review,education,purchased
0,30,Female,Average,School,No
1,68,Female,Poor,UG,No
2,70,Female,Good,PG,No
3,72,Female,Good,PG,No
4,16,Female,Average,UG,No


In [4]:
df = df.iloc[:, 2:]

# review, education -- Ordinal encoder (ordered input columns)
# purchased -- Label encoder (output column)

x_train, x_test, y_train, y_test = train_test_split(df.drop('purchased', axis=1), df['purchased'], test_size=0.2)
# x_train, x_test, y_train, y_test = train_test_split(df.iloc[:, 0:2], df.iloc[:,-1], test_size=0.2)

# using ordinal encoding
oe = OrdinalEncoder(categories = [['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])
# ['Poor', 'Average', 'Good'] -- categories of the 'review' column, listed from lowest to highest
#                                (Poor -> 0, Average -> 1, Good -> 2)
# ['School', 'UG', 'PG'] -- categories of the 'education' column, also listed from lowest to highest

# using ordinal encoding on input columns
x_train = oe.fit_transform(x_train)
x_test = oe.transform(x_test)

# using label encoding on the output column
le = LabelEncoder()
y_train = le.fit_transform(y_train)
# only transform the test labels, so they reuse the mapping learned from y_train
y_test = le.transform(y_test)
# le.classes_     # uncomment to see which value is assigned which number


print(x_train)
print('*******************************************************')
print(y_train)


# That covers ordinal encoding and label encoding. With the data encoded,
# a model such as logistic regression can now be trained to make predictions.




[[0. 0.]
 [2. 2.]
 [2. 2.]
 [1. 1.]
 [2. 2.]
 [2. 0.]
 [2. 1.]
 [0. 2.]
 [0. 0.]
 [2. 0.]
 [0. 2.]
 [0. 0.]
 [2. 1.]
 [0. 2.]
 [2. 0.]
 [0. 1.]
 [1. 1.]
 [2. 1.]
 [1. 1.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 0.]
 [2. 2.]
 [2. 1.]
 [1. 1.]
 [1. 2.]
 [0. 0.]
 [1. 0.]
 [2. 2.]
 [0. 2.]
 [0. 1.]
 [0. 2.]
 [0. 2.]
 [1. 1.]
 [0. 2.]
 [0. 2.]
 [1. 0.]
 [1. 1.]]
*******************************************************
[1 1 1 1 1 0 1 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1 1 1 1 1 0 0 0 1 0 0 0 0 0
 1 0 0]
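
To make the printed numbers easier to read, here is a small sketch of how the two encoders assign their codes. It uses tiny hypothetical arrays (`x_demo`, `y_demo`) in place of the real train split, so the values are illustrative only. It also shows why the test labels should go through `transform` rather than a second `fit_transform`: a refit on a split that happened to contain only `Yes` would map `Yes` to 0 instead of 1.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

# Hypothetical stand-ins for the 'review'/'education' inputs and the 'purchased' target.
x_demo = np.array([['Poor', 'School'], ['Good', 'PG'], ['Average', 'UG']], dtype=object)
y_demo = np.array(['No', 'Yes', 'No'])

# With explicit categories, the position in each list becomes the code.
oe_demo = OrdinalEncoder(categories=[['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']])
x_encoded = oe_demo.fit_transform(x_demo)
print(x_encoded)

# LabelEncoder sorts the labels; the index in classes_ is the assigned integer.
le_demo = LabelEncoder()
y_encoded = le_demo.fit_transform(y_demo)
print(le_demo.classes_)

# Reuse the fitted encoder on new labels instead of refitting it.
y_new = le_demo.transform(np.array(['Yes', 'Yes']))
print(y_new)
print(le_demo.inverse_transform(y_new))   # map the integers back to the text labels
```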

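The closing comment in the cell above mentions logistic regression as the next step. A minimal sketch of that step follows, assuming scikit-learn's `LogisticRegression` with default settings. The arrays are hypothetical and only mimic the shape of the encoded `x_train` and `y_train`; in the notebook you would pass the real encoded splits instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical encoded data: two ordinal features per row and a 0/1 target.
x_train_demo = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 2.], [2., 1.], [2., 2.]])
y_train_demo = np.array([0, 0, 0, 1, 1, 1])
x_test_demo = np.array([[0., 0.], [2., 2.]])

# Fit on the encoded training data, then predict labels for the encoded test data.
model = LogisticRegression()
model.fit(x_train_demo, y_train_demo)
predictions = model.predict(x_test_demo)

print(predictions)   # integer labels; decode with the fitted LabelEncoder if needed
```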