# Intro to Data Encoding

**`Feature Encoding`**

Feature encoding is the process of transforming categorical features into numeric features. This is necessary because most machine learning algorithms, including nearly all scikit-learn estimators, expect numeric input. There are many ways to encode categorical features, and each has its own advantages and disadvantages. In this notebook, we explore some of the most popular methods:

- Label encoding
- Ordinal encoding
- One-hot encoding
- Binary encoding

## Importing the Libraries and Dataset

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Load the tips dataset bundled with seaborn
df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [4]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

## Label Encoder

In [5]:
# Encode the 'time' column with scikit-learn's LabelEncoder
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
le = LabelEncoder()
df['encoded_time'] = le.fit_transform(df['time'])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time
0,16.99,1.01,Female,No,Sun,Dinner,2,0
1,10.34,1.66,Male,No,Sun,Dinner,3,0
2,21.01,3.5,Male,No,Sun,Dinner,3,0
3,23.68,3.31,Male,No,Sun,Dinner,2,0
4,24.59,3.61,Female,No,Sun,Dinner,4,0


In [6]:
df['encoded_time'].value_counts()

encoded_time
0    176
1     68
Name: count, dtype: int64
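
The counts above show which integer each label received, but not why. As a minimal sketch, assuming a small hand-made Series standing in for the `time` column (the values are illustrative, not the real dataset), the fitted encoder's `classes_` attribute exposes the mapping and `inverse_transform` reverses it:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative stand-in for df['time']; not the real tips data
time = pd.Series(['Dinner', 'Lunch', 'Dinner', 'Lunch', 'Dinner'])

le = LabelEncoder()
codes = le.fit_transform(time)

# classes_ holds the labels in sorted order; each label's position is its code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
restored = le.inverse_transform(codes)
print(list(restored))
```

Because the codes follow sorted label order, `Dinner` comes before `Lunch` regardless of how often each appears.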

## Ordinal Encoder

In [7]:
# Ordinal-encode the 'day' column using an explicit category order
oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])
df['encoded_day'] = oe.fit_transform(df[['day']])
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time,encoded_day
0,16.99,1.01,Female,No,Sun,Dinner,2,0,3.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0,3.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0,3.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0,3.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0,3.0


In [8]:
df['encoded_day'].value_counts()

encoded_day
2.0    87
3.0    76
0.0    62
1.0    19
Name: count, dtype: int64
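
One practical question with an explicit category order is what happens to a value outside that list. The sketch below assumes a tiny illustrative DataFrame and uses the encoder's `handle_unknown='use_encoded_value'` option with `unknown_value=-1` (the choice of -1 is an assumption for this example) so unseen days get a sentinel code instead of raising an error:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative stand-in for df[['day']]; not the real tips data
days = pd.DataFrame({'day': ['Sun', 'Thur', 'Sat', 'Fri', 'Sun']})

oe = OrdinalEncoder(
    categories=[['Thur', 'Fri', 'Sat', 'Sun']],  # explicit order: Thur < Fri < Sat < Sun
    handle_unknown='use_encoded_value',          # do not raise on unseen categories
    unknown_value=-1,                            # sentinel code chosen for this sketch
)

# fit_transform returns a 2D float array; flatten it for inspection
day_codes = oe.fit_transform(days).ravel().tolist()
print(day_codes)

# A day that is not in the category list maps to the sentinel
unseen = pd.DataFrame({'day': ['Mon']})
unseen_code = oe.transform(unseen).ravel().tolist()
print(unseen_code)
```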

## One Hot Encoder

In [11]:
# One-hot encode the 'sex' column
# (output columns 0 and 1 follow sorted category order: Female, Male)
ohe = OneHotEncoder()
df_encoded = pd.DataFrame(ohe.fit_transform(df[['sex']]).toarray())
df_encoded.head()

Unnamed: 0,0,1
0,1.0,0.0
1,0.0,1.0
2,0.0,1.0
3,0.0,1.0
4,1.0,0.0


## Binary Encoding

In [12]:
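# Manual integer mapping of the day column (this overwrites encoded_day).
# Note: this step alone is an ordinal-style mapping; binary encoding
# additionally splits each integer into separate bit columns.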
df['encoded_day'] = df['day'].map({'Thur': 0, 'Fri': 1, 'Sat': 2, 'Sun': 3})
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_time,encoded_day
0,16.99,1.01,Female,No,Sun,Dinner,2,0,3
1,10.34,1.66,Male,No,Sun,Dinner,3,0,3
2,21.01,3.5,Male,No,Sun,Dinner,3,0,3
3,23.68,3.31,Male,No,Sun,Dinner,2,0,3
4,24.59,3.61,Female,No,Sun,Dinner,4,0,3


## Binary Encoding using Category Encoder

In [14]:
from category_encoders import BinaryEncoder

binary_encoder = BinaryEncoder()
df_binary = binary_encoder.fit_transform(df['day'])
df_binary

Unnamed: 0,day_0,day_1,day_2
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
239,0,1,0
240,0,1,0
241,0,1,0
242,0,1,0
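
To make the three bit columns less mysterious, here is a rough pandas-only sketch of the idea behind binary encoding. It is not the `category_encoders` implementation; the helper name `binary_encode` is hypothetical, and numbering categories from 1 in order of first appearance is an assumption chosen to mirror the output above:

```python
import pandas as pd

def binary_encode(series, prefix):
    """Hypothetical helper: integer-code the categories, then split into bit columns."""
    # factorize numbers categories 0, 1, 2, ... by first appearance; shift to start at 1
    codes, _ = pd.factorize(series)
    codes = codes + 1
    # Number of bits needed to represent the largest code
    n_bits = int(codes.max()).bit_length()
    # Column i holds the i-th bit, most significant bit first
    bits = {
        f'{prefix}_{i}': (codes >> (n_bits - 1 - i)) & 1
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)

# Illustrative stand-in for df['day']; not the real tips data
days = pd.Series(['Sun', 'Sat', 'Thur', 'Fri', 'Sun'], name='day')
encoded = binary_encode(days, 'day')
print(encoded)
```

Under these assumptions, four categories need three bit columns, since the largest code (4) is `100` in binary.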


## Dummy Variables

In [15]:
# Reload the original dataset and use pandas for feature encoding

df = sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [16]:
# Store the result under a descriptive name rather than reusing the function's name
df_dummies = pd.get_dummies(df, columns=['day'])
df_dummies.head()

Unnamed: 0,total_bill,tip,sex,smoker,time,size,day_Thur,day_Fri,day_Sat,day_Sun
0,16.99,1.01,Female,No,Dinner,2,False,False,False,True
1,10.34,1.66,Male,No,Dinner,3,False,False,False,True
2,21.01,3.5,Male,No,Dinner,3,False,False,False,True
3,23.68,3.31,Male,No,Dinner,2,False,False,False,True
4,24.59,3.61,Female,No,Dinner,4,False,False,False,True
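
By default `pd.get_dummies` keeps one column per category. A minimal sketch, assuming a small illustrative DataFrame with the same day ordering, shows how `drop_first=True` removes the first category so that it becomes the implicit baseline (`dtype=int` is used here only to print 0/1 instead of booleans):

```python
import pandas as pd

# Illustrative stand-in for the 'day' column, with the same category order
days = pd.DataFrame({
    'day': pd.Categorical(
        ['Thur', 'Fri', 'Sat', 'Sun', 'Sun'],
        categories=['Thur', 'Fri', 'Sat', 'Sun'],
    )
})

# Full one-hot style output: one column per category
full = pd.get_dummies(days, columns=['day'], dtype=int)

# Dummy encoding: drop the first category ('Thur'), which becomes the baseline
reduced = pd.get_dummies(days, columns=['day'], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

A row for `Thur` is then represented by zeros in all remaining columns.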


## **When and Where To Use Different Encoding Methods?**

Understanding when and where to use different types of encoding methods is crucial for effective data preprocessing. Here’s a breakdown of various encoding techniques and their appropriate applications:
- **One Hot Encoding**
    * `When to Use:` One Hot Encoding is ideal when dealing with categorical variables that do not have any ordinal relationship (i.e., categories that do not have a natural order) and when the number of categories is relatively small.
    * `Where to Use:` It is particularly useful in machine learning algorithms that do not handle categorical data directly, such as linear regression and logistic regression.
- **Label Encoding**
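    * `When to Use:` Label Encoding assigns a unique integer to each category. scikit-learn's `LabelEncoder` assigns these integers in sorted label order, not in any meaningful ranking, and is intended mainly for target labels. Use it for the target variable or for features where the integer order carries no meaning, such as two-level categories.
    * `Where to Use:` Tree-based models such as decision trees and gradient boosting tend to tolerate arbitrary integer codes, whereas linear models would treat them as ordered magnitudes.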
- **Ordinal Encoding**
    * `When to Use:` Ordinal Encoding is specifically designed for categorical variables that have a clear order or ranking (e.g., "low," "medium," "high").
    * `Where to Use:` It is effective in models that can leverage the ordinal nature of the data, such as ordinal regression models.
- **Binary Encoding**
    * `When to Use:` Binary Encoding is beneficial when dealing with high cardinality categorical variables (i.e., variables with many unique categories) as it reduces the dimensionality compared to One Hot Encoding.
    * `Where to Use:` This method is useful when you want to retain the category information while minimizing the number of features, particularly in tree-based models.
- **Dummy Encoding**
    * `When to Use:` Dummy Encoding is used when you want to avoid the dummy variable trap (i.e., perfect multicollinearity) by dropping one category, which then serves as the baseline. In pandas this corresponds to `pd.get_dummies(..., drop_first=True)`; the default call keeps every category.
    * `Where to Use:` It is commonly applied in regression models and other algorithms that are sensitive to perfectly collinear input features.
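
To put rough numbers on the dimensionality trade-off described above, the sketch below counts the columns each scheme would produce for a feature with `k` categories. The helper names are hypothetical, and the binary width assumes categories are numbered from 1 to `k` before being split into bits, which matches the three columns seen for the four days earlier:

```python
def one_hot_width(k):
    # One column per category
    return k

def dummy_width(k):
    # One category is dropped and acts as the baseline
    return k - 1

def binary_width(k):
    # Bits needed to represent the codes 1..k (assumed numbering)
    return k.bit_length()

for k in (4, 50, 1000):
    print(k, one_hot_width(k), dummy_width(k), binary_width(k))
```

The gap widens quickly: the binary width grows logarithmically with the number of categories, while one-hot and dummy widths grow linearly.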