# Feature Encoding

Feature encoding is the process of converting categorical data into numerical data so that machine learning models can use it.

There are many ways to encode categorical data. Some of the most common are:

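- **One-hot Encoding** (Dummy Variables):
    - It creates a new binary column for each category in the feature.
    - For n categories, one-hot encoding creates n columns. The dummy-variable variant drops one of them and keeps n-1 columns, which avoids perfectly collinear columns in linear models.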

- **Label Encoding**:
    - It assigns a unique integer to each category.
    - The integers are arbitrary, so a model may read an order into them that does not exist. It is most often used for target labels or with tree-based models.

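- **Frequency Encoding**:
    - It replaces the category with the frequency (count or proportion) of that category in the dataset.
    - It is used when how often a category occurs is itself informative. Categories that occur equally often receive the same value and can no longer be told apart.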

- **Ordinal Encoding**:
    - It assigns an integer to each category following the natural order of the categories (for example, Low < Medium < High).
    - It is used when the categories have an ordinal relationship.

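- **Target Encoding**:
    - It replaces the category with the mean of the target variable for that category.
    - It works with a continuous or a binary target, and the category mean is often smoothed toward the overall mean so that rare categories do not get extreme values.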

- **Binary Encoding**:
    - It first converts the categories into integers and then writes each integer in binary, with one column per bit.
    - It is used when the number of categories is high, because n categories need only about log2(n) columns.

- **BaseN Encoding**:
    - It is similar to binary encoding but uses base-n instead of base-2.
    - It is used when the number of categories is high.

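- **Hashing Encoding**:
    - It applies a hash function to each category and maps the result to one of a fixed number of columns, so different categories can collide in the same column.
    - It is used when the number of categories is high.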

- **Embedding**:
    - It is used in neural networks to convert categorical data into dense vectors.
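
The embedding idea in the last bullet can be sketched without a deep learning framework: each category gets an integer index, and that index selects a row of a matrix. The sketch below uses NumPy only. The names `vocab` and `embedding_matrix`, the vector size of 2, and the random values are assumptions made for illustration; in a real network the matrix entries would be learned during training.

```python
import numpy as np

categories = ['A', 'B', 'C', 'A']

# Map each distinct category to a row index (hypothetical helper structure).
vocab = {cat: idx for idx, cat in enumerate(sorted(set(categories)))}

# One dense vector per category; random here, learned in a real model.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 2))

# Encoding is a row lookup: one vector per input value.
indices = np.array([vocab[cat] for cat in categories])
embedded = embedding_matrix[indices]

print(embedded.shape)
```

Rows 0 and 3 of `embedded` are identical because both inputs are category `A` and therefore select the same row.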

In [46]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import seaborn as sns
import mglearn

In [47]:
import category_encoders as ce
import warnings
# Suppress all FutureWarning messages
warnings.filterwarnings("ignore", category=FutureWarning)

In [48]:
# Sample DataFrame
data = pd.DataFrame({
    'category': ['A', 'B', 'C', 'A'],
    'ordinal': ['Low', 'Medium', 'High', 'Low'],
    'frequency': ['A', 'A', 'B', 'C'],
    'target': [1, 2, 3, 4]
})

In [49]:
# One-hot Encoding
onehot_encoder = ce.OneHotEncoder(cols=['category'])
onehot_encoded = onehot_encoder.fit_transform(data)
print("One-hot Encoded DataFrame:")
display(onehot_encoded)

One-hot Encoded DataFrame:


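|   | category_1 | category_2 | category_3 | ordinal | frequency | target |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 0 | Low | A | 1 |
| 1 | 0 | 1 | 0 | Medium | A | 2 |
| 2 | 0 | 0 | 1 | High | B | 3 |
| 3 | 1 | 0 | 0 | Low | C | 4 |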
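
The output above has three columns for three categories. The n-1 variant mentioned in the list corresponds to dropping one of those columns. As a minimal sketch, assuming plain pandas rather than `category_encoders`, `pd.get_dummies` can produce both forms; the small frame `df` is a stand-in for the sample data.

```python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'B', 'C', 'A']})

# Full one-hot encoding: one column per category (n columns).
one_hot = pd.get_dummies(df['category'], prefix='category', dtype=int)

# Dummy-variable encoding: the first category is dropped (n-1 columns).
dummies = pd.get_dummies(df['category'], prefix='category',
                         drop_first=True, dtype=int)

print(list(one_hot.columns))  # ['category_A', 'category_B', 'category_C']
print(list(dummies.columns))  # ['category_B', 'category_C']
```

With `drop_first=True`, a row of all zeros stands for the dropped category `A`.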


In [50]:
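# Label Encoding (category_encoders provides this as OrdinalEncoder).
# Without an explicit mapping, integers follow the order in which values first appear.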
label_encoder = ce.OrdinalEncoder(cols=['ordinal'])
label_encoded = label_encoder.fit_transform(data)
print("Label Encoded DataFrame:")
display(label_encoded)

Label Encoded DataFrame:


|   | category | ordinal | frequency | target |
| --- | --- | --- | --- | --- |
| 0 | A | 1 | A | 1 |
| 1 | B | 2 | A | 2 |
| 2 | C | 3 | B | 3 |
| 3 | A | 1 | C | 4 |
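
In the output above, the codes 1, 2, 3 line up with Low, Medium, High, which appears to happen only because the values first occur in that order in the sample data. When the order matters, it is safer to state it explicitly. The following is a minimal sketch that uses `pd.Categorical` instead of `category_encoders`; the series `levels` is made-up data in a deliberately shuffled order, and the codes start at 0 rather than 1.

```python
import pandas as pd

levels = pd.Series(['High', 'Low', 'Medium', 'Low'])

# State the order explicitly instead of relying on order of appearance.
order = ['Low', 'Medium', 'High']
codes = pd.Categorical(levels, categories=order, ordered=True).codes

print(codes.tolist())  # [2, 0, 1, 0]
```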


In [51]:
# Frequency Encoding
frequency_encoding = data['frequency'].value_counts(normalize=True)
data['frequency_encoded'] = data['frequency'].map(frequency_encoding)
print("Frequency Encoded DataFrame:")
display(data)

Frequency Encoded DataFrame:


|   | category | ordinal | frequency | target | frequency_encoded |
| --- | --- | --- | --- | --- | --- |
| 0 | A | Low | A | 1 | 0.5 |
| 1 | B | Medium | A | 2 | 0.5 |
| 2 | C | High | B | 3 | 0.25 |
| 3 | A | Low | C | 4 | 0.25 |


In [52]:
# Target Encoding
target_encoder = ce.TargetEncoder(cols=['category'])
target_encoded = target_encoder.fit_transform(data['category'], data['target'])
# fit_transform returns a DataFrame; select its single column explicitly
data['target_encoded'] = target_encoded['category']
print("Target Encoded DataFrame:")
display(data)

Target Encoded DataFrame:


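|   | category | ordinal | frequency | target | frequency_encoded | target_encoded |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | A | Low | A | 1 | 0.5 | 2.5 |
| 1 | B | Medium | A | 2 | 0.5 | 2.434946 |
| 2 | C | High | B | 3 | 0.25 | 2.565054 |
| 3 | A | Low | C | 4 | 0.25 | 2.5 |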
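
The encoded values for `B` and `C` above are not the raw category means (2 and 3): they are pulled toward the overall mean of 2.5. The sketch below shows the general idea with simple additive smoothing. This is not the formula `ce.TargetEncoder` uses, so the numbers differ from the output above; the weight `m` and the frame `df` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['A', 'B', 'C', 'A'],
    'target': [1, 2, 3, 4],
})

m = 1.0  # hypothetical smoothing weight: larger values pull harder toward the global mean
global_mean = df['target'].mean()

# Per-category mean and count of the target.
stats = df.groupby('category')['target'].agg(['mean', 'count'])

# Weighted average of the category mean and the global mean.
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['target_encoded'] = df['category'].map(smoothed)
print(df)
```

Here `A` stays at 2.5 because its own mean already equals the global mean, while `B` moves from 2 to 2.25 and `C` from 3 to 2.75.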


In [53]:
# Binary Encoding
binary_encoder = ce.BinaryEncoder(cols=['category'])
binary_encoded = binary_encoder.fit_transform(data['category'])
data = pd.concat([data, binary_encoded], axis=1)
print("Binary Encoded DataFrame:")
display(data)

Binary Encoded DataFrame:


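|   | category | ordinal | frequency | target | frequency_encoded | target_encoded | category_0 | category_1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | A | Low | A | 1 | 0.5 | 2.5 | 0 | 1 |
| 1 | B | Medium | A | 2 | 0.5 | 2.434946 | 1 | 0 |
| 2 | C | High | B | 3 | 0.25 | 2.565054 | 1 | 1 |
| 3 | A | Low | C | 4 | 0.25 | 2.5 | 0 | 1 |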


In [54]:
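# BaseN Encoding (base 3)
# Note: the new columns reuse the names category_0 and category_1 from the
# binary encoder, so the concatenated frame has duplicate column names.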
basen_encoder = ce.BaseNEncoder(cols=['category'], base=3)
basen_encoded = basen_encoder.fit_transform(data['category'])
data = pd.concat([data, basen_encoded], axis=1)
print("Base3 Encoded DataFrame:")
display(data)

Base3 Encoded DataFrame:


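|   | category | ordinal | frequency | target | frequency_encoded | target_encoded | category_0 | category_1 | category_0.1 | category_1.1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | A | Low | A | 1 | 0.5 | 2.5 | 0 | 1 | 0 | 1 |
| 1 | B | Medium | A | 2 | 0.5 | 2.434946 | 1 | 0 | 0 | 2 |
| 2 | C | High | B | 3 | 0.25 | 2.565054 | 1 | 1 | 1 | 0 |
| 3 | A | Low | C | 4 | 0.25 | 2.5 | 0 | 1 | 0 | 1 |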
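
The binary and base-3 columns above can be reproduced by hand, which makes the relationship between the two encoders easier to see. The sketch below assumes the integer codes A=1, B=2, C=3 (order of appearance) and uses a hypothetical helper `to_base_digits`; it is an illustration of the digit expansion, not the library's implementation.

```python
def to_base_digits(value, base, width):
    """Return the digits of a non-negative integer in the given base,
    most significant digit first, padded to `width` digits."""
    digits = []
    for _ in range(width):
        digits.append(value % base)
        value //= base
    return digits[::-1]

# Integer codes assumed to be assigned in order of appearance.
codes = {'A': 1, 'B': 2, 'C': 3}

binary = {cat: to_base_digits(code, base=2, width=2) for cat, code in codes.items()}
base3 = {cat: to_base_digits(code, base=3, width=2) for cat, code in codes.items()}

print(binary)  # {'A': [0, 1], 'B': [1, 0], 'C': [1, 1]}
print(base3)   # {'A': [0, 1], 'B': [0, 2], 'C': [1, 0]}
```

A larger base needs fewer digits for the same number of categories, at the cost of columns that are no longer binary.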


In [55]:
# Hashing Encoding
hashing_encoder = ce.HashingEncoder(cols=['category'], n_components=4)
hashing_encoded = hashing_encoder.fit_transform(data['category'])
data = pd.concat([data, hashing_encoded], axis=1)
print("Hashing Encoded DataFrame:")
display(data)

Hashing Encoded DataFrame:


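|   | category | ordinal | frequency | target | frequency_encoded | target_encoded | category_0 | category_1 | category_0.1 | category_1.1 | col_0 | col_1 | col_2 | col_3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | A | Low | A | 1 | 0.5 | 2.5 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | B | Medium | A | 2 | 0.5 | 2.434946 | 1 | 0 | 0 | 2 | 0 | 1 | 0 | 0 |
| 2 | C | High | B | 3 | 0.25 | 2.565054 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | A | Low | C | 4 | 0.25 | 2.5 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |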
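
In the output above, `A` and `B` both set `col_1`: the two categories collide in the same column, which is the price of a fixed output width. The mechanism can be sketched with the standard library alone. The helpers `hash_bucket` and `hash_encode` are hypothetical names, MD5 is chosen only because it is stable across runs, and the resulting column assignments are not expected to match those of `ce.HashingEncoder`.

```python
import hashlib

def hash_bucket(value, n_components):
    """Map a string to a column index in [0, n_components) with a stable hash."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components=4):
    """Return one row per value with a single 1 in the hashed column."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] = 1
        rows.append(row)
    return rows

categories = ['A', 'B', 'C', 'A']
encoded_rows = hash_encode(categories, n_components=4)
for category, row in zip(categories, encoded_rows):
    print(category, row)
```

The same category always maps to the same column, and no lookup table has to be stored, so unseen categories can be encoded without refitting.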
