Several techniques exist for encoding categorical features (nominal or ordinal) as numbers. They are needed whenever a machine learning algorithm requires numerical inputs but the data contains discrete, non-numeric variables. Each technique has its own advantages and disadvantages, and the choice depends on the characteristics of the data and the requirements of the model being used.

Code examples shown in this notebook use the following libraries: Scikit-Learn, Pandas, and [Category Encoders](https://contrib.scikit-learn.org/category_encoders/). To install the latter, follow this [link](https://contrib.scikit-learn.org/category_encoders/).

Also, some of these examples use the `banking` dataset.

In [106]:
# pip install category_encoders

In [107]:
import pandas as pd

df_banking_train = pd.read_csv("../data/banking/train.csv", sep=';')
df_banking_test = pd.read_csv("../data/banking/test.csv", sep=';')

In [108]:
# First 10 rows of the training dataset
df_banking_train.head(10)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
5,35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no
6,28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no
7,42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,5,may,380,1,-1,0,unknown,no
8,58,retired,married,primary,no,121,yes,no,unknown,5,may,50,1,-1,0,unknown,no
9,43,technician,single,secondary,no,593,yes,no,unknown,5,may,55,1,-1,0,unknown,no


In [109]:
df_banking_train.job.unique()

array(['management', 'technician', 'entrepreneur', 'blue-collar',
       'unknown', 'retired', 'admin.', 'services', 'self-employed',
       'unemployed', 'housemaid', 'student'], dtype=object)

In [110]:
df_banking_train.marital.unique()

array(['married', 'single', 'divorced'], dtype=object)

In [111]:
df_banking_train.education.unique()

array(['tertiary', 'secondary', 'unknown', 'primary'], dtype=object)

In [112]:
df_banking_train.contact.unique()

array(['unknown', 'cellular', 'telephone'], dtype=object)

In [113]:
df_banking_train.month.unique()

array(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'jan', 'feb',
       'mar', 'apr', 'sep'], dtype=object)

In [114]:
df_banking_train.poutcome.unique()

array(['unknown', 'failure', 'other', 'success'], dtype=object)

# Label Encoding

Label encoding converts each category into a unique integer.

> `LabelEncoder` should only be used to encode the labels of the target variable. It assigns an integer to each unique category, following the sorted (alphabetical) order of the category names. These numbers therefore do not represent a meaningful order or distance between categories, and even for a truly ordinal variable the sorted order rarely matches the intended one.

For categorical attributes to be used as features, use one-hot encoder, ordinal encoder, or any of the other encoding schemes described in the next sections.

In [115]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# df = pd.DataFrame(df_banking_train)
X = df_banking_train.drop(columns=['y'])
y = df_banking_train['y']

print(y.tail())

# Label encoding
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

print(y[-5:])

# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

45206    yes
45207    yes
45208    yes
45209     no
45210     no
Name: y, dtype: object
[1 1 1 0 0]
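
To see how the integers are assigned, the sketch below uses small toy lists (not the `banking` data) to inspect the fitted encoder's `classes_` attribute and to reverse the encoding with `inverse_transform`. The `sizes` feature is a hypothetical example added only to show that classes are sorted alphabetically, which is why `LabelEncoder` is a poor fit for ordinal features.

```python
from sklearn.preprocessing import LabelEncoder

# Toy target labels (illustrative only)
labels = ["no", "yes", "yes", "no", "yes"]

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(labels)

# classes_ is sorted; the position of each class is its integer code
print(demo_encoder.classes_)
print(encoded)

# inverse_transform recovers the original labels
print(demo_encoder.inverse_transform(encoded))

# Hypothetical ordinal feature: the codes follow alphabetical order,
# not the intended Small < Medium < Large order
sizes = ["Small", "Medium", "Large"]
size_codes = LabelEncoder().fit_transform(sizes)
print(dict(zip(sizes, size_codes.tolist())))
```

Here `Large` receives the smallest code and `Small` the largest, so an explicit mapping or `OrdinalEncoder` with a user-defined order is the safer choice for ordered features.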


# Dummy Encoding (aka One-Hot Encoding)

Dummy Encoding converts categorical variables into dummy variables (also known as indicator variables or binary variables), which are fictitious variables that take on values of 0 or 1 to indicate the presence or absence of a particular category.

Basic steps of how Dummy Encoder works:

1. Identification of Categorical Variables: The first step is to identify categorical variables in the dataset. These are variables that represent different categories but do not have an intrinsic order.

2. Creation of Dummy Variables: For each unique category in the categorical variable, the Dummy Encoder creates a new binary variable (dummy). If the observation belongs to a certain category, the corresponding dummy variable receives the value 1; otherwise, it receives the value 0.

3. Elimination of a Reference Category (optional): In many cases, it is desirable to eliminate one of the dummy variables to avoid the so-called "dummy variable trap". The trap occurs because the full set of dummy variables is perfectly collinear: the columns always sum to 1, so any one of them can be derived from the others, which is a problem for linear models with an intercept. It is avoided by dropping one of the categories and treating it as the reference. All other dummy variables are then interpreted in relation to this reference.
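
The effect of step 3 can be checked numerically. The following is a minimal sketch on toy data (the `Color` column and all variable names are illustrative assumptions): it builds a design matrix with an intercept column and compares its rank with and without a dropped reference category.

```python
import numpy as np
import pandas as pd

df_colors = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red", "Green"]})
intercept = np.ones((len(df_colors), 1))

# All three dummy columns vs. two columns with the first category dropped
full = pd.get_dummies(df_colors["Color"], dtype=float).to_numpy()
reduced = pd.get_dummies(df_colors["Color"], drop_first=True, dtype=float).to_numpy()

design_full = np.hstack([intercept, full])        # intercept + 3 dummies
design_reduced = np.hstack([intercept, reduced])  # intercept + 2 dummies

rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)

print("columns:", design_full.shape[1], "rank:", rank_full)
print("columns:", design_reduced.shape[1], "rank:", rank_reduced)
```

With all dummies kept, the matrix has more columns than its rank, meaning one column is redundant. After dropping the reference category, the rank equals the number of columns.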

In [116]:
# Example: Suppose we have a categorical variable "Gender" with two categories: Male and Female.

# Sample data
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']}
df = pd.DataFrame(data)

# Dummy encoding with the first category ('Female') dropped as the reference
binary_encoded = pd.get_dummies(df['Gender'], drop_first=True)

print(binary_encoded)

    Male
0   True
1  False
2   True
3  False
4   True


In [117]:
import pandas as pd

# Sample data
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']}
df = pd.DataFrame(data)

# One-hot encoding
one_hot_encoded = pd.get_dummies(df['Color'])

print(one_hot_encoded)

    Blue  Green    Red
0  False  False   True
1   True  False  False
2  False   True  False
3  False  False   True
4  False   True  False


In [118]:
import pandas as pd

temp = df_banking_train.copy()

print(temp.contact.unique())

# Applying the dummy encoder to the 'contact' column
# (the `columns` argument only applies to DataFrames, so it is not needed for a Series)
X_train_encoded = pd.get_dummies(temp['contact'], drop_first=True)

print(X_train_encoded)


['unknown' 'cellular' 'telephone']
       telephone  unknown
0          False     True
1          False     True
2          False     True
3          False     True
4          False     True
...          ...      ...
45206      False    False
45207      False    False
45208      False    False
45209       True    False
45210      False    False

[45211 rows x 2 columns]


# Ordinal Encoding

Ordinal Encoding converts ordinal categorical variables into numerical values. Ordinal categorical variables have an intrinsic order: the categories can be ranked, but the distance between them is not defined.

Transformation steps:

1. Numeric Label Assignment: The OrdinalEncoder maps each unique category of the categorical variable to a numerical value based on the specified order. This mapping is usually defined by the user or inferred from the original order of the categories.

2. Variable Encoding: Replaces the categories in the dataset with the numerical labels assigned to each category.

Here are two simple examples using Pandas and scikit-learn:

In [119]:
# Suppose we have an ordinal categorical variable "Size" with three categories: Small, Medium, and Large.

# Sample data
data = {'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']}
df = pd.DataFrame(data)

# Ordinal encoding
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}
df['Size_encoded'] = df['Size'].map(size_mapping)

print(df)

     Size  Size_encoded
0   Small             1
1  Medium             2
2   Large             3
3  Medium             2
4   Small             1


In [120]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

selected_columns = ['marital', 'month', 'education']

print(X_train[selected_columns].head())

# Define the order for each categorical column
categories_order = [
    ["single", "married", "divorced"],  # Order for 'marital' (a nominal feature; this order is arbitrary and for illustration only)
    ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"],  # Order for 'month'
    ["unknown","primary", "secondary", "tertiary"]  # Order for 'education'
]

ct = ColumnTransformer(
     [("enc", OrdinalEncoder(categories=categories_order), selected_columns)], remainder="passthrough"
     )
X_train_encoded = ct.fit_transform(X_train)

# Get new column names (encoded columns first, then passthrough columns)
encoded_feature_names = selected_columns
passthrough_feature_names = [col for col in X_train.columns if col not in selected_columns]
new_column_names = encoded_feature_names + passthrough_feature_names

X_train_encoded = pd.DataFrame(X_train_encoded, columns=new_column_names)

print()
print(X_train_encoded[selected_columns].head())

        marital month  education
10747    single   jun   tertiary
26054   married   nov  secondary
9125    married   jun  secondary
41659  divorced   oct   tertiary
4443    married   may  secondary

  marital month education
0     0.0   5.0       3.0
1     1.0  10.0       2.0
2     1.0   5.0       2.0
3     2.0   9.0       3.0
4     1.0   4.0       2.0
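
By default, `OrdinalEncoder` raises an error when `transform` encounters a category that was not seen in `fit`. The sketch below shows one way to handle this with the `handle_unknown="use_encoded_value"` option (available in scikit-learn 0.24 and later). The toy data and the unseen value `"doctorate"` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"education": ["primary", "secondary", "tertiary", "secondary"]})
test = pd.DataFrame({"education": ["tertiary", "doctorate"]})  # "doctorate" is unseen

ordinal = OrdinalEncoder(
    categories=[["primary", "secondary", "tertiary"]],  # explicit order
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # code reserved for categories outside the list
)
ordinal.fit(train)

encoded_education = ordinal.transform(test)
print(encoded_education)
```

Reserving a value such as `-1` keeps the pipeline running on new data, at the cost of placing unknown categories below the lowest known rank.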


# Frequency Encoding

Frequency encoding is a method for encoding categorical variables based on the frequency of each category in the dataset. Instead of assigning arbitrary numerical values to categories, this method replaces each category with its absolute count or its relative frequency. It can be useful when how often a category occurs is itself related to the target variable.

When to Use Frequency Encoding
- When dealing with high-cardinality categorical variables.
- In tree-based models, where frequency-based splits can be informative.
- When one-hot encoding would create an excessive number of features.

In [121]:
import pandas as pd

# Sample data
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']}
df = pd.DataFrame(data)

# Relative frequency of each category (the lookup table used for frequency encoding)
frequency_encoded = df['Color'].value_counts(normalize=True)

print(frequency_encoded)

Color
Red      0.4
Green    0.4
Blue     0.2
Name: proportion, dtype: float64
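
The table above is only the lookup step. To complete the encoding, the frequencies are mapped back onto the column, as in this short sketch on the same toy data (the column name `Color_freq` is an illustrative choice):

```python
import pandas as pd

df_freq = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red", "Green"]})

# Step 1: relative frequency of each category
frequencies = df_freq["Color"].value_counts(normalize=True)

# Step 2: replace each category with its frequency
df_freq["Color_freq"] = df_freq["Color"].map(frequencies)

print(df_freq)
```

Note that `Red` and `Green` occur equally often, so they receive the same encoded value and become indistinguishable to the model. This is a known limitation of frequency encoding.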


In [122]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

class FrequencyEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        # Only store constructor parameters here (scikit-learn convention)
        self.columns = columns

    def fit(self, X, y=None):
        """Compute the absolute frequency of each category in the specified columns."""
        # Learned state lives in trailing-underscore attributes, so the
        # `columns` parameter is never modified during fit
        if self.columns is None:
            self.columns_ = list(X.select_dtypes(include=['object', 'category']).columns)
        else:
            self.columns_ = list(self.columns)

        self.frequency_maps_ = {
            col: X[col].value_counts().to_dict() for col in self.columns_
        }
        return self

    def transform(self, X):
        """Apply frequency encoding to the specified columns."""
        X = X.copy()
        for col in self.columns_:
            # Categories not seen during fit are encoded as 0
            X[col] = X[col].map(self.frequency_maps_[col]).fillna(0)
        return X

selected_columns = ['job', 'marital', 'month']

# Create a pipeline with frequency encoding and a classifier
pipeline = Pipeline([
    ('freq_encoder', FrequencyEncoder(columns=selected_columns)),  # Frequency encoding
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))  # Model
])

# Train the pipeline
pipeline.fit(X_train[selected_columns], y_train)

# Make predictions
y_pred = pipeline.predict(X_test[selected_columns])

# Display transformed test set
print("Transformed Test Set:")
print(pipeline.named_steps['freq_encoder'].transform(X_test[selected_columns]))

Transformed Test Set:
        job  marital  month
3776   6863    19100   9558
9928   2907     8942   3799
33409   649     8942   2043
31885  6573    19100   2043
15738  6573    19100   4830
...     ...      ...    ...
9016   5300     8942   3799
380    6863    19100   9558
7713   3634    19100   9558
12188   649    19100   3799
28550  6573     3605    989

[13564 rows x 3 columns]
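
The `fillna(0)` step in `transform` matters when new data contains a category that was absent from the training data. A minimal standalone sketch of that behavior follows; the toy series and the unseen value `"pilot"` are hypothetical.

```python
import pandas as pd

train_jobs = pd.Series(["admin.", "student", "admin.", "retired"], name="job")
test_jobs = pd.Series(["student", "pilot"], name="job")  # "pilot" is unseen

# Counts are learned from the training data only
job_counts = train_jobs.value_counts()

# Unseen categories map to NaN, which is then replaced by 0
encoded_jobs = test_jobs.map(job_counts).fillna(0).astype(int)
print(encoded_jobs)
```

Encoding unseen categories as 0 is a reasonable default because it reflects that the category never occurred in training, but other fallbacks (such as 1 or the minimum observed count) are also possible.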


# Target Encoding

Target Encoding uses information from the target variable to assign numerical values to the categories. The basic process of `TargetEncoder` involves the following:

1. **Calculate the target variable means per category:** For each category in the categorical variable you are encoding, the `TargetEncoder` calculates the mean of the target variable (which is typically a binary variable indicating the class, e.g., 0 or 1).

2. **Assign the target variable mean to the category:** The numerical value assigned to each category is the mean of the target variable for that category. This means that categories with a high mean of the target variable will receive a higher value, while those with a low mean will receive a lower value.

3. **Encode the categories in the dataset:** Replaces the original categories in the dataset with the values calculated in the previous step.

The idea behind `TargetEncoder` is to capture the relationship between the categorical variable and the target variable, making it useful in predictive models. However, it is important to be cautious when using `TargetEncoder` to avoid information leakage. Leakage occurs when the encoding statistics (such as the per-category mean of the target variable) are calculated using information from the test set, which results in an overly optimistic evaluation of the model's performance. The statistics should therefore be computed on the training data only and then applied to the test data.

Here's an example using the Pandas library:

In [123]:
# Target encoding replaces categorical values with the mean of the target variable for each category. 
# It can be useful for categorical variables with high cardinality.

# Suppose we have a categorical variable "City" and a target variable "Salary".

# Sample data
data = {'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
        'Salary': [80000, 75000, 70000, 85000, 72000]}
df = pd.DataFrame(data)

# Target encoding
city_means = df.groupby('City')['Salary'].mean()
df['City_encoded'] = df['City'].map(city_means)

print(df)

          City  Salary  City_encoded
0     New York   80000       82500.0
1  Los Angeles   75000       75000.0
2      Chicago   70000       71000.0
3     New York   85000       82500.0
4      Chicago   72000       71000.0
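
The example above encodes the same rows it learned from. To respect the leakage caveat, the means can be computed on the training rows only and then applied to new rows. The sketch below assumes a hypothetical test frame containing an unseen city (`"Boston"`), and falls back to the global training mean for it, which is one common choice rather than the only one.

```python
import pandas as pd

train_df = pd.DataFrame({
    "City": ["New York", "Los Angeles", "Chicago", "New York", "Chicago"],
    "Salary": [80000, 75000, 70000, 85000, 72000],
})
test_df = pd.DataFrame({"City": ["Chicago", "Boston"]})  # "Boston" is unseen

# Statistics come from the training data only
train_city_means = train_df.groupby("City")["Salary"].mean()
global_mean = train_df["Salary"].mean()

# Unseen categories fall back to the global training mean
test_df["City_encoded"] = test_df["City"].map(train_city_means).fillna(global_mean)
print(test_df)
```

No target values from the test rows are used at any point, so the encoded test features carry no information about the test targets.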


The Python library [category_encoders](https://pypi.org/project/category-encoders/) offers a more robust implementation of target encoding. This implementation is provided in the class `TargetEncoder`, which is exemplified in the code below. Make sure to check the [library documentation](https://contrib.scikit-learn.org/category_encoders/targetencoder.html) for specific details about the parameters and options available.

In [124]:
import pandas as pd
from sklearn.model_selection import train_test_split
from category_encoders import TargetEncoder

# Split training data into features and target
df = df_banking_train.copy()
X_train = df.drop(columns=['y'])
# Note: 'yes' is mapped to 0 and 'no' to 1 here (the reverse of the LabelEncoder
# result shown earlier), so the encoded values below are smoothed per-category rates of 'no'
y_train = df['y'].map({'yes': 0, 'no': 1})

# Split test data into features and target
df = df_banking_test.copy()
X_test = df.drop(columns=['y'])
y_test = df['y'].map({'yes': 0, 'no': 1})

# Initialize the TargetEncoder
# use target encoding to encode two categorical features, 'job' and 'marital'.
encoder = TargetEncoder(cols=['job', 'marital'])

# Fit the encoder on the training data and transform both the training and testing data
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)

# Display part of the encoded data
print("Encoded Training Data:")
print(X_train_encoded.head())
print(X_test_encoded.head())

Encoded Training Data:
   age       job   marital  education default  balance housing loan  contact  \
0   58  0.862444  0.898765   tertiary      no     2143     yes   no  unknown   
1   44  0.889430  0.850508  secondary      no       29     yes   no  unknown   
2   33  0.917283  0.898765  secondary      no        2     yes  yes  unknown   
3   47  0.927250  0.898765    unknown      no     1506     yes   no  unknown   
4   33  0.881944  0.850508    unknown      no        1      no   no  unknown   

   day month  duration  campaign  pdays  previous poutcome  
0    5   may       261         1     -1         0  unknown  
1    5   may       151         1     -1         0  unknown  
2    5   may        76         1     -1         0  unknown  
3    5   may        92         1     -1         0  unknown  
4    5   may       198         1     -1         0  unknown  
   age       job   marital  education default  balance housing loan   contact  \
0   30  0.844973  0.898765    primary      no    
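
One reason library implementations are more robust than a plain group mean is smoothing: categories with few observations are pulled toward the global mean (the `category_encoders` `TargetEncoder` exposes `smoothing` and `min_samples_leaf` parameters for this purpose). The sketch below uses a simplified additive (m-estimate) formula to illustrate the idea. It is an assumption made for illustration and not the exact formula used by the library, and the weight `m` is a hypothetical parameter.

```python
import pandas as pd

salaries = pd.DataFrame({
    "City": ["New York", "Los Angeles", "Chicago", "New York", "Chicago"],
    "Salary": [80000, 75000, 70000, 85000, 72000],
})

m = 2  # hypothetical smoothing weight: larger values pull harder toward the global mean
overall_mean = salaries["Salary"].mean()
stats = salaries.groupby("City")["Salary"].agg(["mean", "count"])

# Weighted blend of the category mean and the global mean
smoothed = (stats["count"] * stats["mean"] + m * overall_mean) / (stats["count"] + m)

salaries["City_smoothed"] = salaries["City"].map(smoothed)
print(smoothed)
```

`Los Angeles` has a single observation, so its encoded value moves furthest (in relative weight) from its raw mean toward the global mean, which reduces the risk of overfitting to rare categories.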

# Summary

| Encoding Method       | Best for                                    | Handles High-Cardinality? | Example |
|----------------------|-------------------------------------------|--------------------------|---------|
| **One-Hot Encoding**  | Nominal variables with few categories     | ❌ No (creates too many columns) | `pd.get_dummies(data)` |
| **Ordinal Encoding**  | Truly ordinal features (e.g., "low" < "medium" < "high") | ✅ Yes | `OrdinalEncoder(categories=[["low", "medium", "high"]])` |
| **Frequency Encoding** | High-cardinality categorical features   | ✅ Yes | `df[col] = df[col].map(df[col].value_counts())` |
| **Target Encoding**   | High-cardinality categorical features with a strong correlation to target | ✅ Yes (risk of data leakage) | `df[col] = df.groupby(col)['target'].transform('mean')` |


# References

- [A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems](https://dl.acm.org/doi/10.1145/507533.507538)