<a href="https://colab.research.google.com/github/Ds2023/ML_Concepts/blob/main/Encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Encoding Techniques**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df_original = pd.read_csv("/content/drive/MyDrive/data/Invistico_Airline.csv")

In [None]:
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 23 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   satisfaction                       129880 non-null  object 
 1   Gender                             129880 non-null  object 
 2   Customer Type                      129880 non-null  object 
 3   Age                                129880 non-null  int64  
 4   Type of Travel                     129880 non-null  object 
 5   Class                              129880 non-null  object 
 6   Flight Distance                    129880 non-null  int64  
 7   Seat comfort                       129880 non-null  int64  
 8   Departure/Arrival time convenient  129880 non-null  int64  
 9   Food and drink                     129880 non-null  int64  
 10  Gate location                      129880 non-null  int64  
 11  Inflight wifi service              1298

## Step 2: Data exploration, data cleaning, and model preparation

### Prepare the data

After loading the dataset, prepare it for use with decision tree classifiers. This includes:

*   Exploring the data
*   Checking for missing values
*   Encoding the data
*   Renaming a column
*   Creating the training and testing data

In [None]:
pd.set_option('display.max_columns', 999)

In [None]:
df_original.head()

Unnamed: 0,satisfaction,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Female,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,2,4,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Male,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,0,2,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Female,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,2,0,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Female,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,3,4,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Female,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,4,3,4,2,2,0,2,4,2,5,0,0.0


# **Drop the missing rows**

In [None]:
df_subset = df_original.dropna(axis = 0)

In [None]:
df_subset.isnull().sum()

satisfaction                         0
Gender                               0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
dtype: int64

# **Encoding**

Encoding converts categorical data into a numerical format that machine learning algorithms can work with. Categorical data consists of values that represent categories or groups, such as types of animals, colors, or names of countries.

There are several encoding techniques, each with its own advantages and use cases. This notebook covers the following:

**One-Hot Encoding:** This technique converts categorical variables into a binary format where each category is represented by a binary feature (0 or 1).

**Label Encoding:** Label encoding assigns a unique numerical label to each category in the variable, effectively converting categories into numerical values.

**Frequency Encoding:** Frequency encoding replaces categories with the frequency of their occurrences in the dataset, which can be useful for high cardinality categorical variables.

**Binary Encoding:** Binary encoding converts categories into binary code, resulting in fewer binary features compared to one-hot encoding while still preserving the categorical information.

**Ordinal Encoding:** Ordinal encoding converts categories into integers that follow the order or rank of the categories.

**Target Encoding:** Also known as mean encoding or likelihood encoding, target encoding replaces each category with the mean (or another statistic) of the target variable for that category.

Which technique fits best depends on the number of categories, whether they have a natural order, and the model being used.

The following sections demonstrate each technique in turn.

In [None]:
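# Work on a copy so that adding encoded columns later does not trigger SettingWithCopyWarning
categorical_columns = df_subset.select_dtypes(include=['category', 'object']).copy()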

In [None]:
categorical_columns

Unnamed: 0,satisfaction,Gender,Customer Type,Type of Travel,Class
0,satisfied,Female,Loyal Customer,Personal Travel,Eco
1,satisfied,Male,Loyal Customer,Personal Travel,Business
2,satisfied,Female,Loyal Customer,Personal Travel,Eco
3,satisfied,Female,Loyal Customer,Personal Travel,Eco
4,satisfied,Female,Loyal Customer,Personal Travel,Eco
...,...,...,...,...,...
129875,satisfied,Female,disloyal Customer,Personal Travel,Eco
129876,dissatisfied,Male,disloyal Customer,Personal Travel,Business
129877,dissatisfied,Male,disloyal Customer,Personal Travel,Eco
129878,dissatisfied,Male,disloyal Customer,Personal Travel,Eco


# **Ordinal Encoding**

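The `Class` column is ordinal (Eco, then Eco Plus, then Business), so we'll use an ordinal encoder. The encoder assigns integer codes by position in the `categories` list, so that list should follow the intended ranking.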

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
classes = categorical_columns['Class'].unique()
classes

array(['Eco', 'Business', 'Eco Plus'], dtype=object)

In [None]:
# Create the encoder; codes follow the order of `classes` (order of first appearance in the data)
encoder = OrdinalEncoder(categories = [classes])

In [None]:
encoder.fit_transform(categorical_columns[['Class']])

array([[0.],
       [1.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]])

In [None]:
categorical_columns['Class_Ord_Enc'] = encoder.fit_transform(categorical_columns[['Class']])

In [None]:
categorical_columns.head(10)

Unnamed: 0,satisfaction,Gender,Customer Type,Type of Travel,Class,Class_Ord_Enc
0,satisfied,Female,Loyal Customer,Personal Travel,Eco,0.0
1,satisfied,Male,Loyal Customer,Personal Travel,Business,1.0
2,satisfied,Female,Loyal Customer,Personal Travel,Eco,0.0
3,satisfied,Female,Loyal Customer,Personal Travel,Eco,0.0
4,satisfied,Female,Loyal Customer,Personal Travel,Eco,0.0
5,satisfied,Male,Loyal Customer,Personal Travel,Eco,0.0
6,satisfied,Female,Loyal Customer,Personal Travel,Eco,0.0
7,satisfied,Male,Loyal Customer,Personal Travel,Eco,0.0
8,satisfied,Female,Loyal Customer,Personal Travel,Business,1.0
9,satisfied,Male,Loyal Customer,Personal Travel,Eco,0.0
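
Because `unique()` returns categories in order of first appearance, the mapping above is Eco → 0, Business → 1, Eco Plus → 2. A minimal sketch of passing an explicit ranking instead, assuming the intended order is Eco < Eco Plus < Business (the `toy_classes` frame and `class_order` list are illustrative, not part of the notebook's data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column standing in for the airline 'Class' feature (illustrative data)
toy_classes = pd.DataFrame({'Class': ['Eco', 'Business', 'Eco Plus', 'Eco']})

# Assumed ranking, lowest to highest; the position in the list becomes the code
class_order = ['Eco', 'Eco Plus', 'Business']

ordered_encoder = OrdinalEncoder(categories=[class_order])
toy_classes['Class_Ord_Enc'] = ordered_encoder.fit_transform(toy_classes[['Class']]).ravel()

print(toy_classes)
```

With this ordering, Business receives the highest code, so a model that treats the feature as numeric sees the classes in rank order.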


# **Pandas Get Dummies**

In [None]:
encoded_df = pd.get_dummies(categorical_columns, columns=['Type of Travel'])

In [None]:
encoded_df.head()

Unnamed: 0,satisfaction,Gender,Customer Type,Class,Class_Ord_Enc,Type of Travel_Business travel,Type of Travel_Personal Travel
0,satisfied,Female,Loyal Customer,Eco,0.0,False,True
1,satisfied,Male,Loyal Customer,Business,1.0,False,True
2,satisfied,Female,Loyal Customer,Eco,0.0,False,True
3,satisfied,Female,Loyal Customer,Eco,0.0,False,True
4,satisfied,Female,Loyal Customer,Eco,0.0,False,True


# **One Hot Encoding**

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
data = {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Chicago'],
    'Age': [25, 30, 35, 40, 45]
}
df = pd.DataFrame(data)

In [None]:
df.head()

Unnamed: 0,Gender,City,Age
0,Male,New York,25
1,Female,Los Angeles,30
2,Male,Chicago,35
3,Female,Houston,40
4,Male,Chicago,45


In [None]:
# Columns to be one-hot encoded
columns_to_encode = ['Gender', 'City']

# Initialize OneHotEncoder (the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2)
one_hot_encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the categorical variables
encoded_data = one_hot_encoder.fit_transform(df[columns_to_encode])

encoded_data



array([[0., 1., 0., 0., 0., 1.],
       [1., 0., 0., 0., 1., 0.],
       [0., 1., 1., 0., 0., 0.],
       [1., 0., 0., 1., 0., 0.],
       [0., 1., 1., 0., 0., 0.]])

In [None]:
# Create a DataFrame with the encoded data
encoded_df = pd.DataFrame(encoded_data, columns=one_hot_encoder.get_feature_names_out(columns_to_encode))

encoded_df

Unnamed: 0,Gender_Female,Gender_Male,City_Chicago,City_Houston,City_Los Angeles,City_New York
0,0.0,1.0,0.0,0.0,0.0,1.0
1,1.0,0.0,0.0,0.0,1.0,0.0
2,0.0,1.0,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0
4,0.0,1.0,1.0,0.0,0.0,0.0


Alternatively, `set_output(transform='pandas')` makes the encoder return a DataFrame directly:

In [None]:
ohe = OneHotEncoder(sparse_output=False).set_output(transform='pandas')

In [None]:
ohetransform = ohe.fit_transform(df[['City']])

ohetransform

Unnamed: 0,City_Chicago,City_Houston,City_Los Angeles,City_New York
0,0.0,0.0,0.0,1.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0


# **Target Encoding**

In [None]:
%pip install category_encoders

Collecting category_encoders
  Downloading category_encoders-2.6.3-py2.py3-none-any.whl (81 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.3


In [None]:
# Example DataFrame with categorical features and target variable
data = {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Chicago'],
    'Age': [25, 30, 35, 40, 45],
    'Target': ['ClassA', 'ClassB', 'ClassA', 'ClassB', 'ClassB']  # Categorical target variable
}
df = pd.DataFrame(data)
df

Unnamed: 0,Gender,City,Age,Target
0,Male,New York,25,ClassA
1,Female,Los Angeles,30,ClassB
2,Male,Chicago,35,ClassA
3,Female,Houston,40,ClassB
4,Male,Chicago,45,ClassB


In [None]:
from category_encoders import TargetEncoder

In [None]:
mapping = {'ClassA': 1, 'ClassB': 2}

# Use the map function to apply the mapping to the column
df['Target'] = df['Target'].map(mapping)

In [None]:
df.head()

Unnamed: 0,Gender,City,Age,Target
0,Male,New York,25,1
1,Female,Los Angeles,30,2
2,Male,Chicago,35,1
3,Female,Houston,40,2
4,Male,Chicago,45,2


In [None]:

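cols = ['Gender', 'City']
target = 'Target'
for col in cols:
    te = TargetEncoder()
    te.fit(X=df[col], y=df[target])
    # transform returns a one-column DataFrame; store it under a new name
    # so the original categorical column is not duplicated
    df[f'{col}_target_enc'] = te.transform(df[col]).iloc[:, 0]


In [None]:
df.head()

Unnamed: 0,Gender,City,Age,Target,Gender_target_enc,City_target_enc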
0,Male,New York,25,1,1.558809,1.521935
1,Female,Los Angeles,30,2,1.65674,1.652043
2,Male,Chicago,35,1,1.558809,1.585815
3,Female,Houston,40,2,1.65674,1.652043
4,Male,Chicago,45,2,1.558809,1.585815
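
For intuition, target encoding without any smoothing is just a per-category mean of the target. The sketch below computes that with a pandas `groupby` on a copy of the toy data (`toy` and the `*_mean_enc` column names are illustrative). `category_encoders.TargetEncoder` additionally blends each category mean with the overall target mean, which is why the values in the table above sit closer to the global mean of 1.6 than the raw means computed here.

```python
import pandas as pd

# Same toy data as above, with the target already mapped to numbers
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Chicago'],
    'Target': [1, 2, 1, 2, 2],
})

for col in ['Gender', 'City']:
    # Mean of the target within each category, broadcast back to every row
    toy[f'{col}_mean_enc'] = toy.groupby(col)['Target'].transform('mean')

print(toy)
```

In practice the category means should be learned on the training split only, since computing them on rows that are later used for evaluation leaks the target into the features.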


# **Label Encoding**

In [None]:
from sklearn.preprocessing import LabelEncoder

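In [None]:
# Recreate the example DataFrame so the target-encoding changes above do not carry over
df = pd.DataFrame(data)
df.head()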

Unnamed: 0,Gender,City,Age,Target
0,Male,New York,25,ClassA
1,Female,Los Angeles,30,ClassB
2,Male,Chicago,35,ClassA
3,Female,Houston,40,ClassB
4,Male,Chicago,45,ClassB


In [None]:
# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit label encoder and transform the column
df['label_enc_city'] = label_encoder.fit_transform(df['City'])

df

Unnamed: 0,Gender,City,Age,Target,label_enc_city
0,Male,New York,25,ClassA,3
1,Female,Los Angeles,30,ClassB,2
2,Male,Chicago,35,ClassA,0
3,Female,Houston,40,ClassB,1
4,Male,Chicago,45,ClassB,0
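
`LabelEncoder` sorts the distinct values and uses each value's position as its code, which is why Chicago maps to 0 and New York to 3 above. A short sketch of inspecting that mapping and reversing it (`cities`, `city_encoder`, and `code_map` are illustrative names). Note that scikit-learn documents `LabelEncoder` as intended for target labels; for input features, `OrdinalEncoder` plays the same role.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cities = pd.Series(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Chicago'])

city_encoder = LabelEncoder()
codes = city_encoder.fit_transform(cities)

# classes_ holds the sorted categories; the index of each entry is its code
code_map = {str(c): i for i, c in enumerate(city_encoder.classes_)}
print(code_map)

# inverse_transform recovers the original strings from the integer codes
decoded = city_encoder.inverse_transform(codes)
print(list(decoded))
```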


# **Frequency Encoding**

In [None]:
# Calculate the frequency of each category
frequency_map = df['Gender'].value_counts(normalize=True)

In [None]:
# Replace categories with their frequencies
df['frequency_encoded_gender'] = df['Gender'].map(frequency_map)

In [None]:
df

Unnamed: 0,Gender,City,Age,Target,label_enc_city,frequency_encoded_gender
0,Male,New York,25,ClassA,3,0.6
1,Female,Los Angeles,30,ClassB,2,0.4
2,Male,Chicago,35,ClassA,0,0.6
3,Female,Houston,40,ClassB,1,0.4
4,Male,Chicago,45,ClassB,0,0.6
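
When the frequency map is learned on one dataset and applied to another, categories missing from the map come back as `NaN` from `Series.map`. A minimal sketch of one way to handle this, filling unseen categories with 0 (the `train` and `new` series and the choice of 0 as the default are assumptions for illustration):

```python
import pandas as pd

train = pd.Series(['Male', 'Female', 'Male', 'Female', 'Male'], name='Gender')
new = pd.Series(['Female', 'Male', 'Other'], name='Gender')  # 'Other' never appears in train

# Relative frequency of each category in the training data
freq_map = train.value_counts(normalize=True)

# Unseen categories map to NaN, so fill them with a default frequency of 0
new_encoded = new.map(freq_map).fillna(0.0)
print(new_encoded.tolist())
```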


# **Binary Encoding**

In [None]:
df['City'].unique()

array(['New York', 'Los Angeles', 'Chicago', 'Houston'], dtype=object)

In [None]:
import category_encoders as ce

In [None]:
# Initialize BinaryEncoder
binary_encoder = ce.BinaryEncoder()

# Fit and transform the DataFrame
city_binary_encoded = binary_encoder.fit_transform(df['City'])


city_binary_encoded

Unnamed: 0,City_0,City_1,City_2
0,0,0,1
1,0,1,0
2,0,1,1
3,1,0,0
4,0,1,1
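
The three output columns can be reproduced by hand: number the categories 1, 2, 3, ... in order of first appearance, then write each number in binary. A sketch with pandas and NumPy, under the assumption (consistent with the output above) that this is how `BinaryEncoder` numbers the categories; `manual_binary` and the other names are illustrative:

```python
import numpy as np
import pandas as pd

cities = pd.Series(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Chicago'], name='City')

# Integer code per category in order of first appearance, starting at 1
codes = pd.factorize(cities)[0] + 1

# Number of bits needed to write the largest code (4 -> '100' -> 3 bits)
n_bits = int(codes.max()).bit_length()

# Extract each bit, most significant first
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1

manual_binary = pd.DataFrame(bits, columns=[f'City_{i}' for i in range(n_bits)])
print(manual_binary)
```

With 4 categories this needs 3 columns versus 4 for one-hot encoding; the saving grows with cardinality (under this numbering, 1,000 categories would need 10 binary columns).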


In [None]:
# For comparison: one-hot encoding the same column needs four columns instead of three
one_hot_encoded = pd.get_dummies(df['City'], prefix='one_hot')

In [None]:
one_hot_encoded

Unnamed: 0,one_hot_Chicago,one_hot_Houston,one_hot_Los Angeles,one_hot_New York
0,False,False,False,True
1,False,False,True,False
2,True,False,False,False
3,False,True,False,False
4,True,False,False,False
