<div id="header">
    <p style="color:#6a66bd; text-align:center; font-weight:bold; font-family:verdana; font-size:20px;">Encoding Categorical Data
    </p>
</div>

---

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>What is Encoding Categorical Data?</strong>
<br>
• Encoding categorical data is the process of converting categorical variables into a numerical format.
<br>
• Most machine learning models accept only numeric input, so categorical features must be encoded before they can be used for training.
<br>
<br>
<strong>Categorical Data and its Types</strong>
<br>
• Categorical data falls into two main types, and each type calls for a different encoding technique.
<br>
➩ <strong>Nominal Data</strong>
<br>
• Categories that do not have any intrinsic ordering.
<br>
Examples: ["Ford", "Swift", "Honda"], ["Red", "Green", "Blue"]
<br>
➩ <strong>Ordinal Data</strong>
<br>
• Categories that have a meaningful order associated with them.
<br>
Examples: ["low", "medium", "high"], ["poor", "average", "good"].
</div>
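
The difference between the two types can be sketched with pandas' `Categorical` type. This is only an illustration with made-up values (the names `colors` and `levels` are not part of the notebook): an unordered categorical behaves like nominal data, while `ordered=True` attaches a ranking to the categories.

```python
import pandas as pd

# Nominal: the categories carry no order, so only equality checks are meaningful.
colors = pd.Categorical(
    ["Red", "Green", "Blue", "Green"],
    categories=["Red", "Green", "Blue"],
    ordered=False,
)

# Ordinal: the category list defines the ranking, from lowest to highest.
levels = pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)

print(colors.ordered, levels.ordered)  # False True
print(levels.codes)                    # position of each value in the category list
print(levels.min(), levels.max())      # only defined because the categories are ordered
```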

<div id="header">
    <p style="color:#6a66bd; text-align:center; font-weight:bold; font-family:verdana; font-size:20px;">Ordinal Data
    </p>
</div>

---

In [88]:
# Importing Libraries
import numpy as np
import pandas as pd

In [89]:
# Reading CSV Data
df = pd.read_csv("customer.csv")
df.sample(5)

Unnamed: 0,age,gender,review,education,purchased
39,76,Male,Poor,PG,No
43,27,Male,Poor,PG,No
41,23,Male,Good,PG,Yes
14,15,Male,Poor,PG,Yes
45,61,Male,Poor,PG,Yes


In [90]:
# Selecting the columns needed for this example
df = df[["review","education","purchased"]]
df.sample(5)

Unnamed: 0,review,education,purchased
5,Average,School,Yes
30,Average,UG,No
26,Poor,PG,No
36,Good,UG,Yes
38,Good,School,No


In [91]:
# Shape of the DataFrame
df.shape

(50, 3)

In [92]:
# Unique values in review column
df["review"].value_counts()

review
Poor       18
Good       18
Average    14
Name: count, dtype: int64

In [93]:
# Unique values in education column
df["education"].value_counts()

education
PG        18
School    16
UG        16
Name: count, dtype: int64

In [94]:
# Unique values in purchased column
df["purchased"].value_counts()

purchased
No     26
Yes    24
Name: count, dtype: int64

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Train Test Split</strong>
<br>
The train-test split is a common technique in machine learning for evaluating model performance. It divides your dataset into two parts:
<br>
• <strong>Training Set:</strong> Used to train the model.
<br>
• <strong>Testing Set:</strong> Used to evaluate the model's performance on unseen data.
<br>
<br>
<strong>Parameters</strong>
<br>
• <strong>arrays:</strong> One or more arrays of the same length to split together (e.g., the features and the target variable).
<br>
• <strong>test_size:</strong> The proportion of the dataset to include in the test split (e.g., 0.2 for 20%).
<br>
• <strong>random_state:</strong> Seeds the shuffling applied before the split; passing an integer makes the split reproducible across runs.
<br>
• <strong>shuffle:</strong> A boolean that indicates whether to shuffle the data before splitting (True by default).
</div>

In [95]:
# Importing train_test_split
from sklearn.model_selection import train_test_split

In [96]:
# Dividing Features and Target Variables
X = df[["review","education"]]
y = df["purchased"]

In [97]:
# Splitting the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [98]:
# Shape of Training and Testing Set 
print(X_train.shape, X_test.shape)

(35, 2) (15, 2)
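
The split above is not seeded, so it selects different rows on every run. The sketch below shows how `random_state` makes a split repeatable and how `stratify` keeps the class balance of the target in both parts. The `toy` frame is made-up data standing in for `customer.csv`, and the value 42 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for customer.csv (values are invented for illustration).
toy = pd.DataFrame({
    "review": ["Poor", "Good", "Average", "Good", "Poor",
               "Average", "Good", "Poor", "Average", "Good"],
    "purchased": ["No", "Yes", "No", "Yes", "No",
                  "Yes", "Yes", "No", "No", "Yes"],
})
X_toy, y_toy = toy[["review"]], toy["purchased"]

# Two calls with the same seed return the same rows in the same order.
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

print(split_a[0].index.equals(split_b[0].index))  # True
print(split_a[0].shape, split_a[1].shape)         # training and testing shapes
```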


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Ordinal Encoder</strong>
<br>
• Ordinal encoding converts ordinal categorical variables, whose categories have a meaningful order or ranking, into a numerical format.
<br>
<strong>How does Ordinal Encoding work?</strong>
<br>
• Each unique category is assigned an integer value that reflects its order.
<br>
• For instance: "low" = 0 | "medium" = 1 | "high" = 2
<br>
• Unlike one-hot encoding, which creates multiple binary columns, ordinal encoding produces a single numeric column per feature.
<br>
<strong>How to implement?</strong>
<br>
• You can specify the order of categories when initializing the encoder by passing the categories parameter.
<br>
• The categories parameter is a list of lists, with one inner list per feature, given in the same order as the columns passed to fit.
<br>
• Each inner list gives the ordered categories for that feature, from lowest to highest.
<br>
• Let’s say we have a dataset with a "satisfaction" variable that has three levels: "low", "medium" and "high".
<br>
• We want to encode these in that specific order.
<br>
<div style="background-color:white; padding:8px; border:1px solid gainsboro; border-radius:4px;">
from sklearn.preprocessing import OrdinalEncoder
<br>
oe = OrdinalEncoder(categories=[["low","medium","high"]])
</div>
</div>
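
As a minimal sketch of the "satisfaction" example described above (the frames `train` and `test` are hypothetical), the encoder can be fitted on a small frame and then applied to new data. The `handle_unknown="use_encoded_value"` and `unknown_value=-1` options are one way to handle categories that were not listed; without them, `transform` raises an error on unseen values.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical "satisfaction" feature; "very high" is not in the category list.
train = pd.DataFrame({"satisfaction": ["low", "high", "medium", "low"]})
test = pd.DataFrame({"satisfaction": ["medium", "very high"]})

oe_demo = OrdinalEncoder(
    categories=[["low", "medium", "high"]],  # order: lowest to highest
    handle_unknown="use_encoded_value",      # do not raise on unseen values
    unknown_value=-1,                        # code assigned to unseen values
)

train_codes = oe_demo.fit_transform(train)
test_codes = oe_demo.transform(test)

print(train_codes.ravel())  # [0. 2. 1. 0.]
print(test_codes.ravel())   # [ 1. -1.]
```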

In [99]:
# Importing OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder

In [100]:
# Creating OrdinalEncoder Object 
# With order of categories defined in categories parameter
oe = OrdinalEncoder(categories=[["Poor","Average","Good"],["School","UG","PG"]])

In [101]:
# Fit and Transform is called on Training Data only
X_train_transformed = oe.fit_transform(X_train)

In [102]:
# Transformation on Testing Data 
X_test_transformed = oe.transform(X_test)

In [103]:
# Fitted attribute holding the ordered categories of each feature
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
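
`fit_transform` returns a plain NumPy array, so the column names are lost. A possible way to keep them, sketched here on a three-row toy frame rather than the real training set, is to wrap the result in a DataFrame. `inverse_transform` maps the codes back to the original labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows using the same two columns and category orders as the notebook.
toy = pd.DataFrame({
    "review": ["Good", "Poor", "Average"],
    "education": ["UG", "PG", "School"],
})
enc = OrdinalEncoder(categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]])

# Keep column names and row index by rebuilding a DataFrame around the array.
encoded = pd.DataFrame(enc.fit_transform(toy), columns=toy.columns, index=toy.index)
print(encoded)

# Map the integer codes back to the original category labels.
decoded = enc.inverse_transform(encoded.to_numpy())
print(decoded)
```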

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>LabelEncoder</strong>
<br>
• LabelEncoder is a class from the scikit-learn library.
<br>
• It converts categorical labels into integers from 0 to n_classes - 1, which many machine learning algorithms need because they require numerical input.
<br>
• The official scikit-learn documentation states that LabelEncoder should be used to encode target values (y), not input features (X).
<br>
• One reason is that LabelEncoder assigns integers in sorted (alphabetical) order, which can imply a ranking among feature categories that does not exist in the original data.
<br>
<br>
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html">View the Official Documentation</a>
</div>

In [104]:
# Importing LabelEncoder
from sklearn.preprocessing import LabelEncoder

In [105]:
# Creating LabelEncoder Object
le = LabelEncoder()

In [106]:
# Fit and Transform is called on Training Data only
y_train_transformed = le.fit_transform(y_train)

In [107]:
# Transformation on Testing Data 
y_test_transformed = le.transform(y_test)
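
To see which integer each label received, the fitted encoder's `classes_` attribute can be inspected. The short sketch below uses a hand-written list of labels instead of `y_train` (the name `le_demo` is only for illustration).

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["Yes", "No", "No", "Yes"])

# classes_ is sorted, and each label's code is its position in classes_.
print(le_demo.classes_)                   # ['No' 'Yes']
print(codes)                              # [1 0 0 1]
print(le_demo.inverse_transform([0, 1]))  # ['No' 'Yes']
```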

<div id="header">
    <p style="color:#6a66bd; text-align:center; font-weight:bold; font-family:verdana; font-size:20px;">Nominal Data
    </p>
</div>

---

In [108]:
# Reading CSV Data
df = pd.read_csv("cars.csv")
df.sample(5)

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
2377,Ford,20706,Diesel,First Owner,650000
6611,Mahindra,152000,Diesel,Fourth & Above Owner,200000
1847,Maruti,20000,Petrol,First Owner,550000
4176,Volkswagen,80000,Diesel,Fourth & Above Owner,250000
1074,Hyundai,63309,Diesel,First Owner,1150000


In [109]:
# Shape of the DataFrame
df.shape

(8128, 5)

In [110]:
# Unique values in fuel column
df["fuel"].value_counts()

fuel
Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: count, dtype: int64

In [111]:
# Unique values in owner column
df["owner"].value_counts()

owner
First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: count, dtype: int64

In [112]:
# Unique values in brand column
df["brand"].unique()

array(['Maruti', 'Skoda', 'Honda', 'Hyundai', 'Toyota', 'Ford', 'Renault',
       'Mahindra', 'Tata', 'Chevrolet', 'Fiat', 'Datsun', 'Jeep',
       'Mercedes-Benz', 'Mitsubishi', 'Audi', 'Volkswagen', 'BMW',
       'Nissan', 'Lexus', 'Jaguar', 'Land', 'MG', 'Volvo', 'Daewoo',
       'Kia', 'Force', 'Ambassador', 'Ashok', 'Isuzu', 'Opel', 'Peugeot'],
      dtype=object)

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>What is pd.get_dummies()?</strong>
<br>
• pd.get_dummies() is a pandas function that converts categorical variables into indicator (dummy) columns suitable for machine learning.
<br>
• It does this by creating a new binary (0 or 1) column for each category within a nominal variable.
<br>
<strong>Why use it for Nominal Data?</strong>
<br>
• Nominal data consists of categories without any inherent order.
<br>
• Many machine learning algorithms require numerical input, so pd.get_dummies() transforms these categorical values into a numerical format without imposing an order on them.
<br>
<strong>How does it work?</strong>
<br>
• Each unique category in a nominal variable becomes its own column.
<br>
• If a data point belongs to that category, the new column has a value of 1; otherwise it has a value of 0.
</div>
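
One practical caveat is that `pd.get_dummies()` builds its columns from whatever values are present in the frame it receives. The sketch below, on made-up `train` and `test` frames, shows a test set that lacks one category and therefore gets fewer columns. Reindexing against the training columns is one possible fix.

```python
import pandas as pd

# Made-up data: the test frame happens to contain no "CNG" rows.
train = pd.DataFrame({"fuel": ["Petrol", "Diesel", "CNG"]})
test = pd.DataFrame({"fuel": ["Diesel", "Petrol"]})

train_d = pd.get_dummies(train, columns=["fuel"], dtype=int)
test_d = pd.get_dummies(test, columns=["fuel"], dtype=int)

print(train_d.columns.tolist())  # ['fuel_CNG', 'fuel_Diesel', 'fuel_Petrol']
print(test_d.columns.tolist())   # ['fuel_Diesel', 'fuel_Petrol']

# Align the test columns with the training columns; missing dummies become 0.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned)
```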

In [113]:
# Encoding Categorical Data 
# Using Pandas pd.get_dummies() method
pd.get_dummies(df, columns=["fuel","owner"], dtype=np.int32)

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,0,1,0,0,1,0,0,0,0
1,Skoda,120000,370000,0,1,0,0,0,0,1,0,0
2,Honda,140000,158000,0,0,0,1,0,0,0,0,1
3,Hyundai,127000,225000,0,1,0,0,1,0,0,0,0
4,Maruti,120000,130000,0,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,0,1,1,0,0,0,0
8124,Hyundai,119000,135000,0,1,0,0,0,1,0,0,0
8125,Maruti,120000,382000,0,1,0,0,1,0,0,0,0
8126,Tata,25000,290000,0,1,0,0,1,0,0,0,0


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Why use drop_first=True?</strong>
<br>
• Setting drop_first=True drops the first category of each categorical variable, which helps to avoid the dummy variable trap.
<br>
➩ <strong>Here’s what that means:</strong>
<br>
• If you have a categorical variable with three categories ("red", "green", "blue"), pd.get_dummies() creates three new binary columns.
<br>
→ is_red
<br>
→ is_green
<br>
→ is_blue
<br>
• Each column contains 1 or 0 indicating the presence of that category.
<br>
• If you include all of these columns in a model, they are perfectly collinear: the three columns always sum to 1, so any one of them can be predicted exactly from the other two. This multicollinearity makes the coefficients of linear models unstable and hard to interpret.
<br>
• With drop_first=True, the first category in sorted order is dropped. For these values that is "blue", which leaves only (is_green, is_red).
<br>
• This reduces the number of columns, and the dropped category can still be determined from the remaining two.
<br>
• If both is_green and is_red are 0, then the category is "blue".
</div>
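
The effect of `drop_first=True` can be checked on a small made-up `colors` frame. pandas orders the dummy columns by sorted category value, so in this sketch the dropped level is the alphabetically first one. It also prefixes each column with the source column name rather than `is_`.

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

full = pd.get_dummies(colors, columns=["color"], dtype=int)
reduced = pd.get_dummies(colors, columns=["color"], drop_first=True, dtype=int)

print(full.columns.tolist())     # ['color_blue', 'color_green', 'color_red']
print(reduced.columns.tolist())  # ['color_green', 'color_red']

# Without drop_first, every row sums to 1 (the source of the collinearity).
print(full.sum(axis=1).tolist())  # [1, 1, 1, 1]

# The dropped category is represented by a row of zeros (row 2 is "blue").
print(reduced.iloc[2].tolist())   # [0, 0]
```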

In [114]:
# Encoding Categorical Data 
# Using Pandas pd.get_dummies() method with drop_first=True
# To avoid dummy variable trap and multicollinearity
pd.get_dummies(df, columns=["fuel","owner"], drop_first=True, dtype=np.int32)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,1,0,0,0,0,0,0
1,Skoda,120000,370000,1,0,0,0,1,0,0
2,Honda,140000,158000,0,0,1,0,0,0,1
3,Hyundai,127000,225000,1,0,0,0,0,0,0
4,Maruti,120000,130000,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,1,0,0,0,0
8124,Hyundai,119000,135000,1,0,0,1,0,0,0
8125,Maruti,120000,382000,1,0,0,0,0,0,0
8126,Tata,25000,290000,1,0,0,0,0,0,0



In [115]:
# Importing train_test_split
from sklearn.model_selection import train_test_split

In [116]:
# Dividing Features and Target Variables
X = df.iloc[:,0:4]
y = df["selling_price"]

In [117]:
# Splitting the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [118]:
# Shape of Training and Testing Set 
print(X_train.shape, X_test.shape)

(5689, 4) (2439, 4)


<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>OneHotEncoder</strong>
<br>
• OneHotEncoder is a scikit-learn class that converts categorical data into a numerical format that can be used in machine learning models.
<br>
• It transforms each category into a new binary column (0 or 1) representing the presence or absence of that category.
<br>
<strong>How it Works</strong>
<br>
• For a categorical variable, identify all unique categories. For example, a variable "Color" might have the values "Red", "Green" and "Blue".
<br>
• Each category is converted into a new column (Is_Red, Is_Green, Is_Blue).
<br>
• For each observation, put a 1 in the column corresponding to its category and 0s in the others.
<br>
<strong>Example</strong>
<br>
• Suppose you have the following data :
    
<table style="border-collapse: collapse; width: 25%; text-align:center;">
    <tr>
        <th style="border:2px solid black;">Color</th>
    </tr>
    <tr>
        <td style="border:2px solid black;">Red</td>
    </tr>
    <tr>
        <td style="border:2px solid black;">Green</td>
    </tr>
    <tr>
        <td style="border:2px solid black;">Blue</td>
    </tr>
</table>

• After one-hot encoding it would look like this :

<table style="border-collapse: collapse; width: 50%;">
    <tr>
        <th style="border: 2px solid black; padding: 8px; text-align:center;">Is_Red</th>
        <th style="border: 2px solid black; padding: 8px; text-align:center;">Is_Green</th>
        <th style="border: 2px solid black; padding: 8px; text-align:center;">Is_Blue</th>
    </tr>
    <tr>
        <td style="border: 2px solid black; padding: 8px; text-align:center;">1</td>
        <td style="border: 2px solid black; padding: 8px; text-align:center;">0</td>
        <td style="border: 2px solid black; padding: 8px; text-align:center;">0</td>
    </tr>
    <tr>
        <td style="border: 2px solid black; padding: 8px; text-align:center;">0</td>
        <td style="border: 2px solid black; padding: 8px; text-align:center;">1</td>
        <td style="border: 2px solid black; padding: 8px; text-align:center;">0</td>
    </tr>
    <tr>
        <td style="border: 2px solid black; padding: 8px; text-align:center;">0</td>
        <td style="border: 2px solid black; padding: 8px; text-align:center;">0</td>
        <td style="border: 2px solid black; padding: 8px; text-align:center;">1</td>
    </tr>
</table>
<br>
<strong>Parameters of OneHotEncoder</strong>
<br>
➩ <strong>drop</strong>
<br>
• Controls whether to drop one of the categories to avoid the dummy variable trap.
<br>
→ 'first': The first category of each feature is dropped.
<br>
➩ <strong>sparse_output</strong> (named sparse before scikit-learn 1.2)
<br>
• Determines whether to return a sparse matrix (which uses less memory when there are many categories) or a dense array.
<br>
→ True (default): Returns a sparse matrix.
<br>
→ False: Returns a dense array.
</div>

In [119]:
# Importing OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

In [120]:
# Creating OneHotEncoder Object
ohe = OneHotEncoder(drop="first", sparse_output=False)

In [121]:
# Fit and Transform is called on Training Data only
X_train_transformed = ohe.fit_transform(X_train[["fuel","owner"]])

In [122]:
# Transformation on Testing Data 
X_test_transformed = ohe.transform(X_test[["fuel","owner"]])

<div style="background-color:gainsboro; padding:8px; border:2px dotted black; border-radius:8px; font-family:verdana; line-height: 1.7em">
<strong>Combining Rare Categories</strong>
<br>
• When dealing with categorical data in machine learning, it’s common to encounter categories that occur infrequently.
<br>
• One-hot encoding every rare category adds many sparse columns backed by very few rows, which can introduce noise and complexity and often hurts model performance.
<br>
• To address this, one effective approach is to combine the less frequent categories into a single category, typically labeled "Others" (the cells below use "uncommon").
</div>

In [123]:
# Unique values in brand column
df["brand"].value_counts()

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Land                6
Force               6
Isuzu               5
Ambassador          4
Kia                 4
MG                  3
Daewoo              3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [124]:
# No of categories in brand column
df["brand"].nunique()

32

In [125]:
# Saving value_counts() result in counts variable
counts = df["brand"].value_counts()

In [126]:
# Setting a limit for frequency of categories
limit = 100

In [127]:
# Collecting all car brand names that appear at most `limit` (100) times in the data
repl = counts[counts <= limit].index

In [128]:
# Replacing every car brand that appears at most 100 times
# with "uncommon" and then one-hot encoding the resulting column
pd.get_dummies(df['brand'].replace(repl, 'uncommon'), dtype=np.int32).sample(5)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
5695,0,0,0,0,0,0,1,0,0,0,0,0,0
4792,0,0,0,1,0,0,0,0,0,0,0,0,0
2959,0,0,0,0,0,0,0,0,0,0,1,0,0
4331,0,0,0,0,0,0,0,0,0,0,1,0,0
1147,0,0,0,0,1,0,0,0,0,0,0,0,0
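
The steps above can be wrapped in a small reusable helper. The function name `collapse_rare`, its parameters, and the toy `brands` series are hypothetical and only illustrate the idea. It uses the same `<=` comparison as the cells above.

```python
import pandas as pd

def collapse_rare(series, limit, label="uncommon"):
    """Replace categories that occur at most `limit` times with `label`."""
    counts = series.value_counts()
    rare = counts[counts <= limit].index
    # Keep values that are not rare; replace the rest with the label.
    return series.where(~series.isin(rare), label)

# Toy data: two frequent brands and two brands that appear once each.
brands = pd.Series(["Maruti"] * 5 + ["Hyundai"] * 3 + ["Opel", "Kia"])
collapsed = collapse_rare(brands, limit=1)

print(collapsed.value_counts())
print(pd.get_dummies(collapsed, dtype=int).columns.tolist())
```

In a train/test workflow, the list of rare categories would normally be derived from the training data only and then applied unchanged to the test data.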
