<img src="./images/banner.png" width="800">



# Encoding Categorical Variables

Categorical variables are a fundamental data type in machine learning and statistical analysis. These variables represent qualitative data that can be divided into distinct categories or groups. Understanding and properly handling categorical variables is crucial for effective data preprocessing and model performance.


Categorical variables (also called qualitative variables) take one of a limited number of possible values or categories. Unlike numerical variables, their values are labels rather than quantities: some have no inherent order at all, while others can be ranked without the gaps between ranks being measurable.


Examples of categorical variables include:
- Color (red, blue, green)
- Gender (male, female, non-binary)
- Country (USA, Canada, Japan)
- Product type (electronics, clothing, food)


Categorical variables can be further classified into two main types:

1. **Nominal Variables:** Categories without any natural order or ranking.
   Example: Blood types (A, B, AB, O)

2. **Ordinal Variables:** Categories with a meaningful order or ranking.
   Example: Education level (High School, Bachelor's, Master's, PhD)
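
The distinction between the two types can be made explicit in code. The following is a minimal sketch, assuming pandas is available; the variable names and sample values are illustrative only. It uses `pd.Categorical` to mark one variable as unordered and the other as ordered:

```python
import pandas as pd

# Nominal: the categories carry no order
blood_type = pd.Categorical(['A', 'O', 'AB', 'B'])

# Ordinal: the order is declared explicitly, lowest to highest
education = pd.Categorical(
    ["Master's", 'High School', 'PhD', "Bachelor's"],
    categories=['High School', "Bachelor's", "Master's", 'PhD'],
    ordered=True,
)

print(blood_type.ordered)      # False
print(list(education.codes))   # [2, 0, 3, 1], positions in the declared order
print(education.max())         # PhD
```

Declaring the order up front means the integer codes follow the meaning of the categories rather than their spelling.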


Machine learning algorithms typically work with numerical data. Therefore, we need to convert categorical variables into a numerical format that algorithms can understand and process effectively. This process is called encoding.


🤔 **Why This Matters:** Proper encoding of categorical variables is essential for:
- Improving model performance
- Capturing the inherent information in the categories
- Enabling algorithms to process and learn from categorical data


Working with categorical variables presents several challenges:

1. **High Cardinality:** When a categorical variable has many unique categories, encodings that create one column per category can inflate the feature space, contributing to the "curse of dimensionality" and overfitting.

2. **Rare Categories:** Some categories may appear very infrequently in the dataset, making it difficult for models to learn from them.

3. **New Categories:** In real-world applications, new categories may appear in test data that were not present in the training data.

4. **Ordinal Relationships:** Some encoding methods may introduce unintended ordinal relationships between categories.
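
As an illustration of the rare-category challenge in the list above, one common workaround is to merge infrequent categories into a single bucket before encoding. This is only a sketch: the `min_count` threshold and the `'Other'` label are hypothetical choices, and in practice the counts would be computed on the training data only.

```python
import pandas as pd

countries = pd.Series(['USA'] * 5 + ['Canada'] * 4 + ['Japan', 'Peru'])

min_count = 2  # hypothetical threshold: categories seen fewer times are "rare"
counts = countries.value_counts()
rare = counts[counts < min_count].index

# Keep frequent categories as they are; replace rare ones with 'Other'
grouped = countries.where(~countries.isin(rare), 'Other')

print(grouped.value_counts())
```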


There are various techniques for encoding categorical variables, each with its own strengths and use cases. Some common methods include:

- One-Hot Encoding
- Label Encoding
- Ordinal Encoding
- Target Encoding
- Frequency Encoding
- Binary Encoding
- Hashing Encoding
- Leave-One-Out Encoding
- Embedding Encoding
- Weight of Evidence (WOE) Encoding


💡 **Pro Tip:** The choice of encoding method can significantly impact model performance. It's often beneficial to experiment with different encoding techniques and evaluate their effect on your specific problem.


In the following sections, we'll dive deep into each of these encoding techniques, exploring their mechanics, advantages, disadvantages, and appropriate use cases.


❗️ **Important Note:** Split your data into training and testing sets first, then fit each encoder on the training set only and apply the fitted encoder to the test set. This prevents data leakage and preserves the integrity of your model evaluation.
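
A minimal sketch of that workflow, assuming scikit-learn 0.24 or later: the encoder learns its categories from the training rows alone, and a category that appears only in the test rows receives a sentinel value. The toy data, the manual split, and the choice of `-1` as the sentinel are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Country': ['USA', 'Canada', 'Japan', 'USA',
                               'Canada', 'Japan', 'USA', 'Brazil']})

# Manual split so the example is deterministic: first six rows train, last two test
train, test = df.iloc[:6], df.iloc[6:]

# Fit on the training data only; unseen test categories are encoded as -1
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train)

test_encoded = encoder.transform(test)
print(test_encoded.ravel())  # 'USA' keeps its training code, 'Brazil' becomes -1
```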

**Table of contents**<a id='toc0_'></a>    
- [Label Encoding and Ordinal Encoding](#toc1_)    
  - [Ordinal Encoding](#toc1_1_)    
  - [Limitations and Considerations](#toc1_2_)    
- [One-Hot Encoding](#toc2_)    
  - [Understanding One-Hot Encoding](#toc2_1_)    
  - [Advantages of One-Hot Encoding](#toc2_2_)    
  - [Handling Multiple Categorical Variables](#toc2_3_)    
  - [Considerations and Potential Drawbacks](#toc2_4_)    
  - [The Dummy Variable Trap](#toc2_5_)    
- [Target Encoding](#toc3_)    
  - [Understanding Target Encoding](#toc3_1_)    
  - [Advantages of Target Encoding](#toc3_2_)    
  - [Smoothing in Target Encoding](#toc3_3_)    
  - [Considerations and Potential Drawbacks](#toc3_4_)    
- [Frequency Encoding](#toc4_)    
  - [Understanding Frequency Encoding](#toc4_1_)    
  - [Advantages of Frequency Encoding](#toc4_2_)    
  - [Variations of Frequency Encoding](#toc4_3_)    
  - [Considerations and Potential Drawbacks](#toc4_4_)    
  - [Handling New Categories in Test Data](#toc4_5_)    
- [Binary Encoding](#toc5_)    
  - [Understanding Binary Encoding](#toc5_1_)    
  - [How Binary Encoding Works](#toc5_2_)    
  - [Advantages of Binary Encoding](#toc5_3_)    
  - [Implementing Binary Encoding](#toc5_4_)    
  - [Considerations and Potential Drawbacks](#toc5_5_)    
  - [Comparison with Other Encoding Methods](#toc5_6_)    
- [Embedding Encoding](#toc6_)    
  - [Using Pre-trained Embeddings](#toc6_1_)    
  - [Advantages of Embedding Encoding](#toc6_2_)    
  - [Visualizing Embeddings](#toc6_3_)    
  - [Considerations and Potential Drawbacks](#toc6_4_)    
  - [Handling New Categories in Test Data](#toc6_5_)    
  - [Comparison with Other Encoding Methods](#toc6_6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Label Encoding and Ordinal Encoding](#toc0_)

In the world of machine learning and data preprocessing, Label Encoding and Ordinal Encoding are two fundamental techniques used to convert categorical variables into numerical format. These methods are particularly useful when dealing with ordinal data or when we need to represent categories in a compact numerical form. Let's dive into these encoding techniques, understand their similarities and differences, and explore their applications and limitations.


Label Encoding is a simple and intuitive method of converting categorical variables into numerical values. It assigns a unique integer to each category, effectively creating a numerical representation of the categorical data.


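Integer codes are most meaningful when the categories have a natural order or hierarchy. Note, however, that scikit-learn's `LabelEncoder` assigns integers by sorting the labels alphabetically, not by their semantic order. Let's consider an example to illustrate this: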


In [1]:
from sklearn.preprocessing import LabelEncoder

# Example: Education levels
education_levels = ['High School', 'Bachelor', 'Master', 'PhD']

le = LabelEncoder()
encoded_levels = le.fit_transform(education_levels)

print("Original levels:", education_levels)
print("Encoded levels:", encoded_levels)

Original levels: ['High School', 'Bachelor', 'Master', 'PhD']
Encoded levels: [1 0 2 3]


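🔑 **Key Concept:** `LabelEncoder` assigns codes in sorted (alphabetical) label order, so here 'Bachelor' becomes 0 and 'High School' becomes 1. The encoding therefore does not preserve the ordinal relationship between education levels. When the order matters, define the mapping explicitly or use `OrdinalEncoder` with its `categories` parameter.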
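
To see exactly which integer each label received, it can help to inspect the fitted encoder. The sketch below reads `LabelEncoder.classes_`, which scikit-learn stores in sorted order, and contrasts it with a hand-written mapping; the `rank` dictionary is an illustrative assumption about the intended order, not part of the library.

```python
from sklearn.preprocessing import LabelEncoder

education_levels = ['High School', 'Bachelor', 'Master', 'PhD']

le = LabelEncoder()
le.fit(education_levels)

# classes_ is sorted, so a label's code is its position in that sorted array
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)  # {'Bachelor': 0, 'High School': 1, 'Master': 2, 'PhD': 3}

# An explicit mapping that follows the intended order (assumed here)
rank = {level: i for i, level in enumerate(education_levels)}
explicit_codes = [rank[level] for level in education_levels]
print(explicit_codes)  # [0, 1, 2, 3]
```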


Label Encoding is often used for encoding target variables in classification problems. Here's an example:


In [2]:
# Example: Customer Satisfaction Levels
satisfaction_levels = ['Unsatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']
customer_responses = ['Satisfied', 'Unsatisfied', 'Very Satisfied', 'Neutral', 'Satisfied']

le = LabelEncoder()
encoded_responses = le.fit_transform(customer_responses)

print("Original responses:", customer_responses)
print("Encoded responses:", encoded_responses)

Original responses: ['Satisfied', 'Unsatisfied', 'Very Satisfied', 'Neutral', 'Satisfied']
Encoded responses: [1 2 3 0 1]




💡 **Pro Tip:** scikit-learn classifiers treat integer-encoded target labels as discrete classes, so Label Encoding is safe for target variables. The risk arises when the same integer codes are used for input features, because many algorithms will interpret them as having numerical significance. For nominal input features, consider One-Hot Encoding to prevent the model from assuming ordinal relationships where none exist.


### <a id='toc1_1_'></a>[Ordinal Encoding](#toc0_)


Ordinal Encoding is similar to Label Encoding but is specifically designed for handling multiple categorical features simultaneously. It's particularly useful when dealing with ordinal variables across multiple columns in your dataset.


While both Label Encoding and Ordinal Encoding serve the purpose of converting categories to numbers, they differ in their implementation and typical use cases:

1. **Dimensionality:** OrdinalEncoder can handle multiple features (columns) at once, while LabelEncoder is designed for a single array of labels.

2. **Input Shape:** OrdinalEncoder expects input data with shape (n_samples, n_features), whereas LabelEncoder expects (n_samples,).

3. **Typical Usage:** OrdinalEncoder is commonly used for feature encoding, while LabelEncoder is often applied to target variables.


Let's see an example of Ordinal Encoding:


In [4]:
from sklearn.preprocessing import OrdinalEncoder

# Example: Multiple ordinal features
data = [
    ['Low', 'Warm', 'Fast'],
    ['Medium', 'Cold', 'Slow'],
    ['High', 'Hot', 'Medium']
]

oe = OrdinalEncoder()
encoded_data = oe.fit_transform(data)

In [6]:
print("Original data:")
data

Original data:


[['Low', 'Warm', 'Fast'],
 ['Medium', 'Cold', 'Slow'],
 ['High', 'Hot', 'Medium']]

In [7]:
print("\nEncoded data:")
encoded_data


Encoded data:


array([[1., 2., 0.],
       [2., 0., 2.],
       [0., 1., 1.]])

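🤔 **Why This Matters:** Ordinal Encoding lets us encode multiple categorical features in one step. Note that, by default, `OrdinalEncoder` orders each column's categories alphabetically (above, 'High' is 0, 'Low' is 1, and 'Medium' is 2), so an ordinal relationship is preserved only if you supply the intended order through the `categories` parameter.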
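
The sketch below shows one way to pass the intended order to `OrdinalEncoder` through its `categories` parameter, using the same three columns as the example above. The specific orderings (for instance 'Cold' < 'Warm' < 'Hot') are assumptions about what the data means.

```python
from sklearn.preprocessing import OrdinalEncoder

data = [
    ['Low', 'Warm', 'Fast'],
    ['Medium', 'Cold', 'Slow'],
    ['High', 'Hot', 'Medium'],
]

# One list per column, each written from lowest to highest
oe = OrdinalEncoder(categories=[
    ['Low', 'Medium', 'High'],
    ['Cold', 'Warm', 'Hot'],
    ['Slow', 'Medium', 'Fast'],
])
encoded = oe.fit_transform(data)

print(encoded)
# [[0. 1. 2.]
#  [1. 0. 0.]
#  [2. 2. 1.]]
```

With the order stated explicitly, 'Low' < 'Medium' < 'High' maps to 0 < 1 < 2 instead of following alphabetical order.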


### <a id='toc1_2_'></a>[Limitations and Considerations](#toc0_)


While Label Encoding and Ordinal Encoding are powerful tools, they come with some important considerations:

1. **Assumed Ordinal Relationship:** Both methods produce integer codes, which many models interpret as an ordered, evenly spaced scale. This can be misleading if no such relationship exists.

2. **Arbitrary Numerical Assignment:** For non-ordinal categories, the numerical assignment is arbitrary and may introduce unintended relationships.


Consider this example where Label Encoding might be misleading:


In [8]:
# Example: Colors (non-ordinal)
colors = ['Red', 'Blue', 'Green', 'Yellow']

le = LabelEncoder()
encoded_colors = le.fit_transform(colors)

print("Original colors:", colors)
print("Encoded colors:", encoded_colors)

Original colors: ['Red', 'Blue', 'Green', 'Yellow']
Encoded colors: [2 0 1 3]


❗️ **Important Note:** In this case, the encoding suggests that 'Blue' (0) is somehow less than 'Green' (1), which is not a meaningful relationship for colors.


In situations where categories don't have a natural order, other encoding techniques like One-Hot Encoding might be more appropriate. We'll explore these alternatives in the upcoming sections.

## <a id='toc2_'></a>[One-Hot Encoding](#toc0_)

One-Hot Encoding is a popular and widely used technique for handling categorical variables in machine learning. It's particularly useful when dealing with nominal categorical variables, where there's no inherent order among the categories. This method creates new binary columns for each category, effectively representing the presence or absence of a category for each observation.


### <a id='toc2_1_'></a>[Understanding One-Hot Encoding](#toc0_)


One-Hot Encoding works by creating a new binary column for each unique category in the original categorical variable. Each observation will have a '1' in the column corresponding to its category and '0' in all other columns.


🔑 **Key Concept:** One-Hot Encoding transforms a single categorical variable into multiple binary variables, each representing a unique category.


Let's start with a simple example to illustrate this concept:


In [19]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data
data = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})

# Initialize and fit the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(data[['Color']])

# Create a new dataframe with encoded values
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Color']))

In [21]:
print("Original data:")
data


Original data:


Unnamed: 0,Color
0,Red
1,Blue
2,Green
3,Red
4,Green


In [22]:
print("One-Hot Encoded data:")
encoded_df

One-Hot Encoded data:


Unnamed: 0,Color_Blue,Color_Green,Color_Red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,0.0,1.0
4,0.0,1.0,0.0


### <a id='toc2_2_'></a>[Advantages of One-Hot Encoding](#toc0_)


One-Hot Encoding offers several benefits in data preprocessing and model training:

1. **No Ordinal Relationship:** It doesn't impose any ordinal relationship between categories, making it suitable for nominal variables.

2. **Model Interpretability:** Each category becomes its own feature, making it easier to interpret the impact of each category on the model's predictions.

3. **Compatibility:** Most machine learning algorithms can work with one-hot encoded features without any issues.


💡 **Pro Tip:** One-Hot Encoding is particularly useful when you want to preserve the categorical nature of your variables without introducing any artificial ordering.


### <a id='toc2_3_'></a>[Handling Multiple Categorical Variables](#toc0_)


One-Hot Encoding can be applied to multiple categorical variables simultaneously. Let's look at an example with two categorical variables:


In [24]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data with multiple categorical variables
data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Green'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']
})

# Initialize and fit the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoded_data = encoder.fit_transform(data)

# Create a new dataframe with encoded values
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(data.columns))

In [25]:
print("Original data:")
data

Original data:


Unnamed: 0,Color,Size
0,Red,Small
1,Blue,Medium
2,Green,Large
3,Red,Medium
4,Green,Small


In [26]:
print("One-Hot Encoded data:")
encoded_df

One-Hot Encoded data:


Unnamed: 0,Color_Blue,Color_Green,Color_Red,Size_Large,Size_Medium,Size_Small
0,0.0,0.0,1.0,0.0,0.0,1.0
1,1.0,0.0,0.0,0.0,1.0,0.0
2,0.0,1.0,0.0,1.0,0.0,0.0
3,0.0,0.0,1.0,0.0,1.0,0.0
4,0.0,1.0,0.0,0.0,0.0,1.0


### <a id='toc2_4_'></a>[Considerations and Potential Drawbacks](#toc0_)


While One-Hot Encoding is powerful, it's important to be aware of its limitations:

1. **Curse of Dimensionality:** For categorical variables with many unique categories, One-Hot Encoding can lead to a large number of new features, potentially causing the "curse of dimensionality."

2. **Memory Usage:** One-Hot Encoded data can be memory-intensive, especially for large datasets with high-cardinality categorical variables.

3. **Multicollinearity:** The full set of one-hot columns for a variable always sums to 1, which introduces perfect multicollinearity with the intercept term and can be problematic for some algorithms (e.g., unregularized linear regression).


🤔 **Why This Matters:** Understanding these limitations helps in deciding when to use One-Hot Encoding and when to consider alternative encoding methods.
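
To make the dimensionality and memory points above concrete, here is a small sketch on synthetic data, assuming scikit-learn 1.2 or later. The `user_id` column and its cardinality are made up for illustration. It keeps the encoder's default sparse output and uses `handle_unknown='ignore'` so that a category unseen during fitting becomes an all-zero row instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Synthetic high-cardinality feature: up to 1000 distinct IDs over 5000 rows
rng = np.random.default_rng(0)
ids = pd.DataFrame({'user_id': rng.integers(0, 1000, size=5000).astype(str)})

# The default output is a SciPy sparse matrix, which stores only the non-zero cells
encoder = OneHotEncoder(handle_unknown='ignore')
X = encoder.fit_transform(ids)

n_rows, n_cols = X.shape
print(f"columns created: {n_cols}")
print(f"stored values: {X.nnz} of {n_rows * n_cols} cells")

# A category never seen during fit is encoded as a row of zeros
unseen = encoder.transform(pd.DataFrame({'user_id': ['not-a-user']}))
print(unseen.nnz)  # 0
```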


### <a id='toc2_5_'></a>[The Dummy Variable Trap](#toc0_)


When using One-Hot Encoding, it's important to be aware of the "dummy variable trap." This occurs when you include all binary columns created by One-Hot Encoding, leading to perfect multicollinearity.
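
The multicollinearity can be checked numerically. The following sketch, which assumes NumPy and pandas and uses a hypothetical helper `with_intercept`, builds a design matrix with an intercept column and compares its rank to its number of columns, once with all k dummy columns and once with k-1.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['Red', 'Blue', 'Green', 'Red', 'Green'], name='Color')

full = pd.get_dummies(colors, dtype=float)                      # k = 3 columns
reduced = pd.get_dummies(colors, drop_first=True, dtype=float)  # k - 1 = 2 columns

def with_intercept(df):
    """Prepend a column of ones, as a linear model with an intercept would."""
    return np.column_stack([np.ones(len(df)), df.to_numpy()])

X_full = with_intercept(full)
X_reduced = with_intercept(reduced)

# The three dummy columns sum to the intercept column, so one column is redundant
print(X_full.shape[1], np.linalg.matrix_rank(X_full))        # 4 columns, rank 3
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))  # 3 columns, rank 3
```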


To avoid this, you can drop one of the binary columns for each encoded categorical variable. This is known as using "k-1" categories, where k is the number of unique categories. You can do this manually or use the `drop='first'` parameter in the OneHotEncoder.


In [28]:
# Avoiding the dummy variable trap
encoded_df_no_trap = encoded_df.drop(['Color_Green', 'Size_Large'], axis=1)

print("One-Hot Encoded data (avoiding dummy variable trap):")
encoded_df_no_trap

One-Hot Encoded data (avoiding dummy variable trap):


Unnamed: 0,Color_Blue,Color_Red,Size_Medium,Size_Small
0,0.0,1.0,0.0,1.0
1,1.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0
3,0.0,1.0,1.0,0.0
4,0.0,0.0,0.0,1.0


In [32]:
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_df_no_trap = pd.DataFrame(encoder.fit_transform(data), columns=encoder.get_feature_names_out(data.columns))
encoded_df_no_trap

Unnamed: 0,Color_Green,Color_Red,Size_Medium,Size_Small
0,0.0,1.0,0.0,1.0
1,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,0.0
3,0.0,1.0,1.0,0.0
4,1.0,0.0,0.0,1.0


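❗️ **Important Note:** scikit-learn's `OneHotEncoder` keeps all k columns by default, so you must request `drop='first'` explicitly. Regularized linear models and tree-based models generally tolerate the redundant column, whereas unregularized linear regression with an intercept does not, so it's always good to be aware of this issue when preprocessing your data.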

In conclusion, One-Hot Encoding is a versatile and widely-used technique for handling categorical variables. Its ability to represent nominal categories without imposing order makes it a go-to choice for many data preprocessing tasks. However, it's important to consider its limitations, especially when dealing with high-cardinality categorical variables or large datasets.

## <a id='toc3_'></a>[Target Encoding](#toc0_)

Target Encoding, also known as Mean Encoding or Likelihood Encoding, is an advanced technique for encoding categorical variables. This method replaces categories with a numerical value based on the target variable, making it particularly useful for high-cardinality features and in capturing complex relationships between categories and the target.


### <a id='toc3_1_'></a>[Understanding Target Encoding](#toc0_)


Target Encoding works by replacing each category with the mean (for regression) or proportion (for classification) of the target variable for that category. This approach can capture more information than simpler encoding methods, especially when there's a strong relationship between the category and the target variable.


🔑 **Key Concept:** Target Encoding leverages the target variable to create a meaningful numerical representation of categorical features.


Let's start with a simple example to illustrate this concept:


In [34]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Sample data
np.random.seed(0)
data = pd.DataFrame({
    'City': np.random.choice(['New York', 'London', 'Paris', 'Tokyo'], 1000),
    'Sales': np.random.normal(loc=100, scale=20, size=1000)
})

In [35]:
data

Unnamed: 0,City,Sales
0,New York,80.289785
1,Tokyo,70.563300
2,London,132.962699
3,New York,103.284555
4,Tokyo,111.345806
...,...,...
995,New York,103.292876
996,New York,117.703751
997,Paris,129.475296
998,London,107.781879


In [36]:
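# Split the data (copy so that adding columns later does not trigger SettingWithCopyWarning)
train, test = train_test_split(data, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()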

In [37]:
# Calculate target encoding on training data
target_means = train.groupby('City')['Sales'].mean()
target_means

City
London      100.866774
New York     99.332273
Paris        98.009190
Tokyo       101.526541
Name: Sales, dtype: float64

In [38]:
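# Apply the training-set means to both train and test sets
train['City_Encoded'] = train['City'].map(target_means)
# A category unseen during training would map to NaN, so fall back to the global training mean
test['City_Encoded'] = test['City'].map(target_means).fillna(train['Sales'].mean())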

In [40]:
print("Original data (first 5 rows):")
train[['City', 'Sales']].head()

Original data (first 5 rows):


Unnamed: 0,City,Sales
29,New York,113.690022
535,New York,125.521507
695,Paris,93.666894
557,Tokyo,155.187102
836,London,135.493171


In [42]:
print("Encoded data (first 5 rows):")
train[['City', 'Sales', 'City_Encoded']].head()

Encoded data (first 5 rows):


Unnamed: 0,City,Sales,City_Encoded
29,New York,113.690022,99.332273
535,New York,125.521507,99.332273
695,Paris,93.666894,98.00919
557,Tokyo,155.187102,101.526541
836,London,135.493171,100.866774


In [43]:
print("Target encoding mapping:")
target_means

Target encoding mapping:


City
London      100.866774
New York     99.332273
Paris        98.009190
Tokyo       101.526541
Name: Sales, dtype: float64

### <a id='toc3_2_'></a>[Advantages of Target Encoding](#toc0_)


Target Encoding offers several benefits in data preprocessing and model training:

1. **Handling High Cardinality:** It's effective for categorical variables with many unique categories.

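2. **Capturing Complex Relationships:** Because each category receives a value derived directly from the target, the encoding can reflect relationships between categories and the target that an arbitrary integer code would miss.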

3. **Dimensionality Reduction:** Unlike One-Hot Encoding, it doesn't increase the number of features.

💡 **Pro Tip:** Target Encoding can be particularly powerful in competitions and when dealing with datasets where categorical variables have strong predictive power.


### <a id='toc3_3_'></a>[Smoothing in Target Encoding](#toc0_)


To prevent overfitting, especially for categories with few samples, it's common to use smoothing techniques. One popular method is to combine the global mean with the category mean:


In [45]:
def smooth_target_encoding(train, test, column, target, weight=100):
    # Calculate global mean
    global_mean = train[target].mean()

    # Calculate the means for each category
    category_means = train.groupby(column)[target].agg(['mean', 'count'])

    # Calculate smoothed means
    smoothed_means = (category_means['count'] * category_means['mean'] + weight * global_mean) / (category_means['count'] + weight)

    # Apply encoding
    train_encoded = train[column].map(smoothed_means)
    test_encoded = test[column].map(smoothed_means).fillna(global_mean)

    return train_encoded, test_encoded

# Apply smoothed target encoding
train['City_Smooth_Encoded'], test['City_Smooth_Encoded'] = smooth_target_encoding(train, test, 'City', 'Sales')

In [47]:
print("Smoothed Target Encoding (first 5 rows):")
train[['City', 'Sales', 'City_Smooth_Encoded']].head()

Smoothed Target Encoding (first 5 rows):


Unnamed: 0,City,Sales,City_Smooth_Encoded
29,New York,113.690022,99.519702
535,New York,125.521507,99.519702
695,Paris,93.666894,98.638451
557,Tokyo,155.187102,100.983928
836,London,135.493171,100.538959
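
Written out, the smoothed value computed by `smooth_target_encoding` for a category $c$ is a weighted average of the category mean and the global mean. The symbols below are introduced here for illustration: $n_c$ is the number of training rows in category $c$, $\bar{y}_c$ is the mean target within that category, $\bar{y}$ is the global training mean, and $w$ is the `weight` argument.

$$
\text{encoded}_c = \frac{n_c \, \bar{y}_c + w \, \bar{y}}{n_c + w}
$$

When $n_c$ is much larger than $w$, the result stays close to the category mean $\bar{y}_c$. When $n_c$ is much smaller than $w$, it is pulled toward the global mean $\bar{y}$. At $n_c = w$ it lies exactly halfway between the two.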


### <a id='toc3_4_'></a>[Considerations and Potential Drawbacks](#toc0_)


While Target Encoding can be powerful, it's important to be aware of its limitations:

1. **Risk of Overfitting:** It can lead to overfitting if not implemented carefully, especially for categories with few samples.

2. **Data Leakage:** If the encoding is computed using target values from the test set, or if each training row's own target value contributes to its encoding, the feature leaks target information into the model.

3. **Interpretability:** The encoded values are less interpretable compared to simpler encoding methods.


🤔 **Why This Matters:** Understanding these limitations helps in implementing Target Encoding correctly and interpreting the results accurately.
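
One common way to limit the leakage described above is out-of-fold encoding: each training row is encoded with category means computed from the other folds, so its own target value never enters its encoding. The function below is a minimal sketch under that assumption. The name `out_of_fold_target_encode`, its parameters, and the synthetic data are hypothetical and not part of any library.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_target_encode(df, column, target, n_splits=5, seed=0):
    """Encode each row with category means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        fold_means = fit_part.groupby(column)[target].mean()
        fallback = fit_part[target].mean()  # used if a category is absent from the fit folds

        values = df[column].iloc[enc_idx].map(fold_means).fillna(fallback)
        encoded.iloc[enc_idx] = values.to_numpy()

    return encoded

# Synthetic data for illustration
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'City': rng.choice(['New York', 'London', 'Paris', 'Tokyo'], 200),
    'Sales': rng.normal(100, 20, 200),
})
demo['City_OOF'] = out_of_fold_target_encode(demo, 'City', 'Sales')
print(demo.head())
```

The test set is still encoded with means computed from the full training set. The out-of-fold step only changes how the training rows themselves are encoded.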


In conclusion, Target Encoding is a sophisticated technique that can capture complex relationships between categorical variables and the target. While it offers significant advantages, especially for high-cardinality features, it requires careful implementation to avoid overfitting and data leakage. When used correctly, it can significantly boost model performance, particularly in scenarios where categorical variables have strong predictive power.

## <a id='toc4_'></a>[Frequency Encoding](#toc0_)

Frequency Encoding is a simple yet effective technique for handling categorical variables. This method replaces each category with its frequency (count) or relative frequency (proportion) of occurrences in the dataset. Frequency Encoding can be particularly useful for high-cardinality features and when the frequency of a category is informative for the prediction task.


### <a id='toc4_1_'></a>[Understanding Frequency Encoding](#toc0_)


Frequency Encoding works by calculating how often each category appears in the dataset and using this information to create a numerical representation. This approach can capture the relative importance or prevalence of each category, which can be valuable information for many machine learning algorithms.


🔑 **Key Concept:** Frequency Encoding replaces categories with their frequency or relative frequency in the dataset, potentially capturing the importance of each category.


Let's start with a simple example to illustrate this concept:


In [48]:
import pandas as pd
import numpy as np

# Sample data
np.random.seed(0)
data = pd.DataFrame({
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Watch'], 1000),
    'Sales': np.random.normal(loc=100, scale=20, size=1000)
})

In [49]:
data

Unnamed: 0,Product,Sales
0,Laptop,80.289785
1,Watch,70.563300
2,Phone,132.962699
3,Laptop,103.284555
4,Watch,111.345806
...,...,...
995,Laptop,103.292876
996,Laptop,117.703751
997,Tablet,129.475296
998,Phone,107.781879


In [50]:
# Calculate frequency encoding
freq_encoding = data['Product'].value_counts() / len(data)
data['Product_Freq_Encoded'] = data['Product'].map(freq_encoding)
data

Unnamed: 0,Product,Sales,Product_Freq_Encoded
0,Laptop,80.289785,0.255
1,Watch,70.563300,0.251
2,Phone,132.962699,0.253
3,Laptop,103.284555,0.255
4,Watch,111.345806,0.251
...,...,...,...
995,Laptop,103.292876,0.255
996,Laptop,117.703751,0.255
997,Tablet,129.475296,0.241
998,Phone,107.781879,0.253


In [51]:
print("Original data (first 5 rows):")
data[['Product', 'Sales']].head()

Original data (first 5 rows):


Unnamed: 0,Product,Sales
0,Laptop,80.289785
1,Watch,70.5633
2,Phone,132.962699
3,Laptop,103.284555
4,Watch,111.345806


In [52]:
print("Encoded data (first 5 rows):")
data[['Product', 'Sales', 'Product_Freq_Encoded']].head()

Encoded data (first 5 rows):


Unnamed: 0,Product,Sales,Product_Freq_Encoded
0,Laptop,80.289785,0.255
1,Watch,70.5633,0.251
2,Phone,132.962699,0.253
3,Laptop,103.284555,0.255
4,Watch,111.345806,0.251


In [53]:
print("Frequency encoding mapping:")
freq_encoding

Frequency encoding mapping:


Product
Laptop    0.255
Phone     0.253
Watch     0.251
Tablet    0.241
Name: count, dtype: float64

### <a id='toc4_2_'></a>[Advantages of Frequency Encoding](#toc0_)


Frequency Encoding offers several benefits in data preprocessing and model training:

1. **Handling High Cardinality:** It's effective for categorical variables with many unique categories, as it doesn't create new columns.

2. **Capturing Category Importance:** The frequency can often be a proxy for the importance or relevance of a category.

3. **Simplicity:** It's straightforward to implement and interpret.


💡 **Pro Tip:** Frequency Encoding can be particularly useful when the frequency of a category is likely to be informative for the prediction task, such as in recommender systems or market basket analysis.

### <a id='toc4_3_'></a>[Variations of Frequency Encoding](#toc0_)


Instead of using relative frequencies, we can use the raw count of occurrences. This can be useful when the absolute number of occurrences is more informative than the proportion. This is known as **Count Encoding**.


In [54]:
# Count Encoding
count_encoding = data['Product'].value_counts()
data['Product_Count_Encoded'] = data['Product'].map(count_encoding)

In [55]:
print("Count Encoding (first 5 rows):")
data[['Product', 'Product_Count_Encoded']].head()


Count Encoding (first 5 rows):


Unnamed: 0,Product,Product_Count_Encoded
0,Laptop,255
1,Watch,251
2,Phone,253
3,Laptop,255
4,Watch,251


For datasets with a large range of frequencies, applying a logarithm to the counts can help compress the range and potentially improve model performance. This is known as **Log-Frequency Encoding**. The example below uses `np.log1p`, which computes log(1 + count).

In [56]:
# Log-Frequency Encoding
log_freq_encoding = np.log1p(data['Product'].value_counts())
data['Product_LogFreq_Encoded'] = data['Product'].map(log_freq_encoding)

print("Log-Frequency Encoding (first 5 rows):")
data[['Product', 'Product_LogFreq_Encoded']].head()

Log-Frequency Encoding (first 5 rows):


Unnamed: 0,Product,Product_LogFreq_Encoded
0,Laptop,5.545177
1,Watch,5.529429
2,Phone,5.537334
3,Laptop,5.545177
4,Watch,5.529429


### <a id='toc4_4_'></a>[Considerations and Potential Drawbacks](#toc0_)


While Frequency Encoding can be effective, it's important to be aware of its limitations:

1. **Loss of Categorical Nature:** The original categorical information is lost, and the model sees only the frequency. Categories with equal (or nearly equal) frequencies become indistinguishable.

2. **Potential for Noise and Overfitting:** Frequency may be unrelated to the target variable, in which case the encoded feature adds noise that a model can overfit to.

3. **Handling New Categories:** It doesn't naturally handle new categories in test data that weren't present in training data.

🤔 **Why This Matters:** Understanding these limitations helps in deciding when to use Frequency Encoding and how to interpret the results.
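
The loss of categorical identity is easy to demonstrate. In this small sketch with made-up labels, two different categories occur equally often and therefore receive the same encoded value:

```python
import pandas as pd

labels = pd.Series(['A', 'A', 'B', 'B', 'C'])

# Relative frequency of each category
freq = labels.value_counts(normalize=True)
encoded = labels.map(freq)

print(encoded.tolist())   # [0.4, 0.4, 0.4, 0.4, 0.2]
print(encoded.nunique())  # 2 distinct values for 3 distinct categories
```

After encoding, the model can no longer tell 'A' and 'B' apart.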


### <a id='toc4_5_'></a>[Handling New Categories in Test Data](#toc0_)


When applying Frequency Encoding to test data, you might encounter categories that weren't present in the training data. Here's a way to handle this:


In [57]:
def frequency_encode(train, test, column):
    # Calculate frequency encoding on training data
    freq_encoding = train[column].value_counts() / len(train)

    # Apply encoding to train and test
    train_encoded = train[column].map(freq_encoding)
    test_encoded = test[column].map(freq_encoding)

    # Handle new categories in test data
    test_encoded = test_encoded.fillna(0)  # or use the minimum frequency from train

    return train_encoded, test_encoded

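# Simulate train/test split (copy so that the assignments below do not trigger SettingWithCopyWarning)
train, test = train_test_split(data, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()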

# Add a new category to test data
test.loc[test.index[0], 'Product'] = 'Desktop'

# Apply frequency encoding
train['Product_Freq_Encoded'], test['Product_Freq_Encoded'] = frequency_encode(train, test, 'Product')

print("Test data with new category:")
test[['Product', 'Product_Freq_Encoded']].head()

Test data with new category:


Unnamed: 0,Product,Product_Freq_Encoded
521,Desktop,0.0
737,Laptop,0.26
740,Tablet,0.2525
660,Phone,0.24
411,Phone,0.24


❗️ **Important Note:** When using Frequency Encoding, always consider whether the frequency of categories is likely to be informative for your specific problem. In some cases, combining Frequency Encoding with other encoding methods might yield better results.


In conclusion, Frequency Encoding is a straightforward and often effective method for encoding categorical variables, especially when the frequency of categories carries meaningful information for the prediction task. Its simplicity and ability to handle high-cardinality features make it a valuable tool in the data preprocessing toolkit. However, like all encoding methods, it should be used thoughtfully and in conjunction with a good understanding of the data and the problem at hand.

## <a id='toc5_'></a>[Binary Encoding](#toc0_)

Binary Encoding is an efficient technique for encoding categorical variables, particularly useful for high-cardinality features. This method combines aspects of integer encoding and one-hot encoding, representing each category as a binary code. Binary Encoding can significantly reduce the dimensionality of the encoded data compared to one-hot encoding while still capturing the categorical nature of the variable.


### <a id='toc5_1_'></a>[Understanding Binary Encoding](#toc0_)


Binary Encoding works by first assigning an integer to each category, then converting these integers into binary code. Each digit of the binary code becomes a separate column in the encoded data.


🔑 **Key Concept:** Binary Encoding represents categories as binary codes, creating a compact representation that balances information retention and dimensionality reduction.


Let's start with a simple example to illustrate this concept:


In [58]:
%pip install category_encoders



In [60]:
import pandas as pd
import numpy as np
from category_encoders import BinaryEncoder

# Sample data
data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Yellow', 'Purple', 'Orange', 'Pink', 'Brown']
})

# Initialize and fit the BinaryEncoder
encoder = BinaryEncoder(cols=['Color'])
encoded_data = encoder.fit_transform(data)

print("Original data:")
data


Original data:


| | Color |
|---|---|
| 0 | Red |
| 1 | Blue |
| 2 | Green |
| 3 | Yellow |
| 4 | Purple |
| 5 | Orange |
| 6 | Pink |
| 7 | Brown |


In [61]:
print("Binary Encoded data:")
encoded_data

Binary Encoded data:


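| | Color_0 | Color_1 | Color_2 | Color_3 |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 1 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 1 |
| 5 | 0 | 1 | 1 | 0 |
| 6 | 0 | 1 | 1 | 1 |
| 7 | 1 | 0 | 0 | 0 |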


### <a id='toc5_2_'></a>[How Binary Encoding Works](#toc0_)


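1. **Integer Encoding:** Each unique category is assigned an integer. A from-scratch implementation can use 0 to n-1 (where n is the number of categories). The `category_encoders` output above numbers categories from 1 instead (Red is `0001`), which leaves the all-zeros code free for unknown values and is why eight categories produce four columns rather than three.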
2. **Binary Conversion:** These integers are converted to their binary representation.
3. **Column Creation:** Each bit in the binary representation becomes a separate column.


For example, with 8 categories numbered from 0, three bits are enough:
- Red    -> 0 -> 000
- Blue   -> 1 -> 001
- Green  -> 2 -> 010
- Yellow -> 3 -> 011
- Purple -> 4 -> 100
- Orange -> 5 -> 101
- Pink   -> 6 -> 110
- Brown  -> 7 -> 111
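
The number of columns follows from the largest integer code that has to be represented. The sketch below illustrates this with the standard library only. `binary_width` and its `start` parameter are hypothetical names introduced here, and the `start=1` case is an assumption inferred from the `category_encoders` output shown in this section (codes beginning at `0001`), not a description of the library's internals.

```python
def binary_width(n_categories: int, start: int = 0) -> int:
    """Number of bits needed when codes run from `start` to `start + n - 1`."""
    if n_categories < 1:
        raise ValueError("n_categories must be at least 1")
    largest_code = start + n_categories - 1
    # int.bit_length() gives the bits needed for a non-negative integer;
    # a lone code of 0 still needs one column.
    return max(1, largest_code.bit_length())

# Compare zero-based codes with one-based codes (assumed library behavior)
for n in (8, 100, 1000):
    print(n, binary_width(n, start=0), binary_width(n, start=1))
```

With 8 categories this gives 3 columns for zero-based codes and 4 for one-based codes, which matches the two encodings of the color example in this section.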


### <a id='toc5_3_'></a>[Advantages of Binary Encoding](#toc0_)


Binary Encoding offers several benefits in data preprocessing and model training:


1. **Dimensionality Reduction:** It creates fewer columns than one-hot encoding, especially for high-cardinality variables.
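2. **Information Preservation:** Each category still gets a unique code, but unlike simple integer encoding, the categories are not placed on a single ordered numeric scale.
3. **Efficiency:** Fewer columns mean less memory and, in many cases, faster training than with one-hot encoding.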


💡 **Pro Tip:** Binary Encoding is particularly useful when dealing with categorical variables that have a large number of categories, where one-hot encoding would create too many features.


### <a id='toc5_4_'></a>[Implementing Binary Encoding](#toc0_)


While we used the `category_encoders` library in the previous example, let's implement Binary Encoding from scratch to better understand its mechanics:


In [62]:
def custom_binary_encode(data, column):
    # Get unique categories and assign integers
    categories = data[column].unique()
    int_encoded = {cat: i for i, cat in enumerate(categories)}

    # Convert integers to binary and pad with zeros
    max_bits = len(bin(len(categories) - 1)[2:])
    binary_encoded = {cat: format(i, f'0{max_bits}b') for cat, i in int_encoded.items()}

    # Create new columns for each bit
    for i in range(max_bits):
        data[f'{column}_bin_{i}'] = data[column].map(lambda x: int(binary_encoded[x][i]))

    return data

# Apply custom Binary Encoding
encoded_data_custom = custom_binary_encode(data.copy(), 'Color')

print("Custom Binary Encoded data:")
encoded_data_custom

Custom Binary Encoded data:


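| | Color | Color_bin_0 | Color_bin_1 | Color_bin_2 |
|---|---|---|---|---|
| 0 | Red | 0 | 0 | 0 |
| 1 | Blue | 0 | 0 | 1 |
| 2 | Green | 0 | 1 | 0 |
| 3 | Yellow | 0 | 1 | 1 |
| 4 | Purple | 1 | 0 | 0 |
| 5 | Orange | 1 | 0 | 1 |
| 6 | Pink | 1 | 1 | 0 |
| 7 | Brown | 1 | 1 | 1 |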


### <a id='toc5_5_'></a>[Considerations and Potential Drawbacks](#toc0_)


While Binary Encoding is powerful, it's important to be aware of its limitations:

1. **Loss of Interpretability:** The binary columns are less interpretable than one-hot encoded columns.
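2. **Ordinal Assumption:** It may introduce an implicit ordinal relationship between categories. The integer assignment is arbitrary, yet categories that happen to share bit values can look similar to a model.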
3. **Complexity:** It can be more complex to implement and explain compared to simpler encoding methods.
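
To make the second limitation concrete, the following sketch counts how many bits differ between the zero-based codes of the eight-color example. It is an illustration only: `hamming` is a helper defined here, and Hamming distance is just one assumed way to show how "close" two codes look, not something a particular model computes.

```python
from itertools import combinations

# Zero-based codes from the eight-color example
colors = ['Red', 'Blue', 'Green', 'Yellow', 'Purple', 'Orange', 'Pink', 'Brown']
codes = {color: format(i, '03b') for i, color in enumerate(colors)}

def hamming(a: str, b: str) -> int:
    """Count the bit positions where two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

distances = {(a, b): hamming(codes[a], codes[b]) for a, b in combinations(colors, 2)}

print(distances[('Red', 'Blue')])   # 000 vs 001
print(distances[('Red', 'Brown')])  # 000 vs 111
```

Red ends up one bit away from Blue but three bits away from Brown purely because of the order in which the categories were numbered.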


🤔 **Why This Matters:** Understanding these trade-offs helps in deciding when to use Binary Encoding and how to interpret the results.


When applying Binary Encoding to test data, you might encounter categories that weren't present in the training data. With its default setting (`handle_unknown='value'`), `BinaryEncoder` from `category_encoders` encodes unknown categories as all zeros, as the last row of the output below shows. In practice, you might want a more deliberate strategy for new categories, depending on your use case.

In [69]:
# Sample data
data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Yellow', 'Purple', 'Orange', 'Pink', 'Brown']
})

# Initialize and fit the BinaryEncoder
encoder = BinaryEncoder(cols=['Color'])
encoded_data = encoder.fit_transform(data)

test_data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Yellow', 'Purple', 'Orange', 'Pink', 'Brown', 'Unknown']
})

# Pass the DataFrame (same column layout as during fit) rather than a single Series
encoder.transform(test_data)

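| | Color_0 | Color_1 | Color_2 | Color_3 |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 1 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 1 |
| 5 | 0 | 1 | 1 | 0 |
| 6 | 0 | 1 | 1 | 1 |
| 7 | 1 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 |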
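
One possible alternative to the all-zeros default is to map unseen values to an explicit catch-all category before encoding. The sketch below uses only pandas. `bucket_unseen` and the `'Other'` label are hypothetical, and the approach assumes the training data already contains an `'Other'` bucket (for example, from grouping rare categories) so that the encoder has a code for it.

```python
import pandas as pd

def bucket_unseen(values: pd.Series, known: set, other_label: str = "Other") -> pd.Series:
    """Replace any value not in `known` with `other_label`."""
    return values.where(values.isin(known), other_label)

# Toy data: 'Other' is assumed to exist in training already
train_colors = pd.Series(['Red', 'Blue', 'Green', 'Other'])
test_colors = pd.Series(['Red', 'Magenta', 'Green', 'Teal'])

known = set(train_colors.unique())
cleaned = bucket_unseen(test_colors, known)
print(cleaned.tolist())
```

Here `Magenta` and `Teal` are both replaced by `'Other'`, so the encoder only ever sees categories it was fitted on.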


### <a id='toc5_6_'></a>[Comparison with Other Encoding Methods](#toc0_)


Let's briefly compare Binary Encoding with One-Hot Encoding for a high-cardinality variable:


In [71]:
from sklearn.preprocessing import OneHotEncoder

# Create a dataset with 100 unique categories
high_cardinality_data = pd.DataFrame({'Category': [f'Cat_{i}' for i in range(100)] * 10})

# One-Hot Encoding
onehot = OneHotEncoder(sparse_output=False)
onehot_encoded = onehot.fit_transform(high_cardinality_data)

# Binary Encoding
binary = BinaryEncoder(cols=['Category'])
binary_encoded = binary.fit_transform(high_cardinality_data)

print(f"One-Hot Encoded shape: {onehot_encoded.shape}")
print(f"Binary Encoded shape: {binary_encoded.shape}")

One-Hot Encoded shape: (1000, 100)
Binary Encoded shape: (1000, 7)


This example shows how Binary Encoding reduces the feature count for a high-cardinality variable: the same 100 categories need 100 one-hot columns but only 7 binary columns.


In conclusion, Binary Encoding is a powerful technique that offers a balance between information retention and dimensionality reduction. It's particularly useful for high-cardinality categorical variables where one-hot encoding would create too many features. While it introduces some complexity and potential loss of interpretability, its efficiency in handling large category sets makes it a valuable tool in the data preprocessing toolkit. As with all encoding methods, it's important to consider the specific characteristics of your data and problem when deciding to use Binary Encoding.

## <a id='toc6_'></a>[Embedding Encoding](#toc0_)

Embedding Encoding is an advanced technique for handling categorical variables, particularly useful for high-cardinality features and when dealing with natural language processing tasks. This method represents each category as a dense vector in a lower-dimensional space, capturing semantic relationships between categories.


Embedding Encoding maps each category to a fixed-size vector of real numbers. These vectors are learned representations that capture the meaning and relationships between categories in a continuous vector space.


🔑 **Key Concept:** Embedding Encoding represents categories as dense vectors, potentially capturing complex relationships and similarities between categories.
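
Mechanically, an embedding is a lookup table: each category maps to a row of a matrix. The sketch below shows only that lookup step, under stated assumptions: the category names, the dimension of 4, and the random values are placeholders chosen for illustration. In a real setting the matrix would be learned or taken from a pre-trained model.

```python
import numpy as np

categories = ['electronics', 'clothing', 'food']
embedding_dim = 4  # illustrative size only

# Each category gets a row index into the embedding matrix
category_to_index = {cat: i for i, cat in enumerate(categories)}

# Random placeholder values; a real embedding matrix would be learned or pre-trained
rng = np.random.default_rng(seed=0)
embedding_matrix = rng.normal(size=(len(categories), embedding_dim))

def embed(category: str) -> np.ndarray:
    """Look up the dense vector for a category."""
    return embedding_matrix[category_to_index[category]]

print(embed('clothing').shape)
```

Every category yields a vector of the same fixed length, regardless of how many categories exist.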


### <a id='toc6_1_'></a>[Using Pre-trained Embeddings](#toc0_)


In this section, we'll focus on using pre-trained embeddings rather than training our own. Pre-trained embeddings are particularly useful when working with text data or when you want to leverage existing knowledge about relationships between categories.


Let's start with an example using pre-trained word embeddings:


In [72]:
%pip install gensim



In [78]:
import gensim.downloader as api

# Download and load a pre-trained word embedding model
# In this example, we're using the 'word2vec-google-news-300' model
# You can choose other models like 'glove-twitter-25' or 'fasttext-wiki-news-subwords-300'
model_name = 'word2vec-google-news-300'
model = api.load(model_name)

def get_word_embedding(word):
    try:
        # Get the word embedding
        embedding = model[word]
        return embedding
    except KeyError:
        print(f"The word '{word}' is not in the vocabulary.")
        return None

# Example usage
word = "example"
embedding = get_word_embedding(word)

if embedding is not None:
    print(f"Word: {word}")
    print(f"Embedding shape: {embedding.shape}")
    print(f"First 10 dimensions of the embedding: {embedding[:10]}")



Word: example
Embedding shape: (300,)
First 10 dimensions of the embedding: [ 0.20507812  0.00078583  0.03540039  0.10058594 -0.05444336  0.15332031
  0.25585938 -0.21875    -0.00331116  0.20996094]


In [79]:
def get_phrase_embedding(phrase):
    # Tokenize the phrase into words
    words = phrase.lower().split()

    # Initialize a list to store embeddings of individual words
    word_embeddings = []

    for word in words:
        try:
            # Get the word embedding
            embedding = model[word]
            word_embeddings.append(embedding)
        except KeyError:
            print(f"The word '{word}' is not in the vocabulary and will be skipped.")

    if not word_embeddings:
        print("None of the words in the phrase were found in the vocabulary.")
        return None

    # Calculate the average of the word embeddings
    phrase_embedding = np.mean(word_embeddings, axis=0)

    return phrase_embedding

❗️ **Important Note:** Pre-trained embeddings can handle new categories as long as the words in the category are present in the embedding vocabulary. For completely new words, you might need to implement a fallback strategy.
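
A simple fallback, sketched here as one option among several, is to return a zero vector when no word in a category is known. `embed_with_fallback` is a hypothetical helper, and `toy_vectors` is a made-up stand-in for a pre-trained model. The function only relies on `word in vectors` and `vectors[word]`, which gensim's `KeyedVectors` also supports, so it should work with a loaded model as well.

```python
import numpy as np

def embed_with_fallback(phrase, vectors, dim):
    """Average the vectors of known words; return a zero vector if none are known."""
    known = [vectors[w] for w in phrase.lower().split() if w in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

# Toy stand-in for a pre-trained model (values are made up for illustration)
toy_vectors = {
    'smart': np.array([1.0, 0.0], dtype=np.float32),
    'watch': np.array([0.0, 1.0], dtype=np.float32),
}

print(embed_with_fallback('smart watch', toy_vectors, dim=2))
print(embed_with_fallback('gizmo', toy_vectors, dim=2))
```

The trade-off is that every fully unknown category receives the same vector, so a model cannot tell them apart.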


In [81]:
# Sample data
data = pd.DataFrame({
    'Product': ['running shoes', 'laptop', 'smartphone', 'headphones', 'smart watch']
})

In [83]:
# Apply embedding encoding
data['Embedding'] = data['Product'].apply(get_phrase_embedding)


In [84]:
print("Original data:")
data['Product']

Original data:


0    running shoes
1           laptop
2       smartphone
3       headphones
4      smart watch
Name: Product, dtype: object

In [87]:
print("Data with embeddings:")
data


Data with embeddings:


| | Product | Embedding |
|---|---|---|
| 0 | running shoes | [0.021362305, -0.018798828, -0.014862061, 0.06...] |
| 1 | laptop | [0.026489258, -0.1640625, -0.007019043, 0.2792...] |
| 2 | smartphone | [0.13769531, -0.31445312, -0.47265625, 0.07666...] |
| 3 | headphones | [-0.17480469, -0.07080078, -0.14257812, 0.1132...] |
| 4 | smart watch | [0.056152344, 0.017150879, 0.064575195, 0.1923...] |
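
Most models expect one numeric column per feature rather than a column of arrays. The sketch below shows one way to expand such a column. It uses a toy frame with made-up 3-dimensional vectors in place of the real `data`, the `emb_` column prefix is an arbitrary choice, and it assumes every row has an embedding (rows where the embedding is `None` would need to be filled or dropped first).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the 3-dimensional vectors are made up
toy = pd.DataFrame({
    'Product': ['laptop', 'smartphone'],
    'Embedding': [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
})

# Stack the per-row vectors into a 2D array, then give each dimension its own column
matrix = np.vstack(toy['Embedding'].tolist())
emb_cols = pd.DataFrame(
    matrix,
    index=toy.index,
    columns=[f'emb_{i}' for i in range(matrix.shape[1])],
)
features = pd.concat([toy[['Product']], emb_cols], axis=1)
print(features.shape)
```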


### <a id='toc6_2_'></a>[Advantages of Embedding Encoding](#toc0_)


Embedding Encoding offers several benefits in data preprocessing and model training:

1. **Dimensionality Reduction:** It creates a dense, fixed-size representation whose width does not grow with the number of categories, unlike one-hot encoding.
2. **Semantic Relationships:** It can capture complex relationships and similarities between categories.
3. **Handling High Cardinality:** It's effective for variables with many unique categories.
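
Similarity between categories is commonly measured with cosine similarity between their vectors. The sketch below computes it with NumPy on made-up 3-dimensional vectors, so the numbers illustrate the calculation only and say nothing about real embeddings. For single words in a loaded gensim model, `KeyedVectors` also provides a `similarity` method.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors, chosen only to illustrate the calculation
laptop = np.array([0.9, 0.1, 0.0])
smartphone = np.array([0.8, 0.2, 0.1])
shoes = np.array([0.0, 0.2, 0.9])

print(cosine_similarity(laptop, smartphone))
print(cosine_similarity(laptop, shoes))
```

In this toy setup the first pair scores much higher than the second because their vectors point in nearly the same direction.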


💡 **Pro Tip:** Pre-trained embeddings work best when category names are meaningful words or phrases, because the vectors reflect how those words are used in the corpus the model was trained on.


### <a id='toc6_4_'></a>[Considerations and Potential Drawbacks](#toc0_)


While Embedding Encoding is powerful, it's important to be aware of its limitations:


1. **Complexity:** It can be more complex to implement and interpret compared to simpler encoding methods.
2. **Dependency on Pre-trained Models:** The quality of encoding depends on the quality and relevance of the pre-trained embeddings.
3. **Out-of-Vocabulary Words:** Pre-trained embeddings might not cover all possible categories in your data.


🤔 **Why This Matters:** Understanding these trade-offs helps in deciding when to use Embedding Encoding and how to interpret the results.


In conclusion, Embedding Encoding is a sophisticated technique that offers a dense, lower-dimensional representation of categorical variables. It's particularly useful for high-cardinality features and when working with text data. By leveraging pre-trained embeddings, we can capture complex semantic relationships between categories without the need for extensive training data. While it introduces some complexity in implementation and interpretation, its ability to handle high-cardinality variables and capture semantic relationships makes it a powerful tool in the data preprocessing toolkit, especially for natural language processing tasks and recommender systems.
