There are several techniques used for encoding categorical features (nominal or ordinal) in machine learning tasks. They are commonly used when dealing with discrete variables in machine learning algorithms that require numerical inputs. Each has its own advantages/disadvantages and is chosen based on the characteristics of the data and the requirements of the model being used.

Code examples shown in this notebook use the following libraries: Scikit-Learn, Pandas, and [Category Encoders](https://contrib.scikit-learn.org/category_encoders/). To install the latter, run `pip install category_encoders` or see its [documentation](https://contrib.scikit-learn.org/category_encoders/).

Also, some of these examples use the `banking` dataset.

In [1]:
# pip install category_encoders

In [2]:
import pandas as pd

df_train = pd.read_csv("../data/banking/train.csv", sep=';')
df_test = pd.read_csv("../data/banking/test.csv", sep=';')

In [3]:
# First 10 rows of the training dataset
df_train.head(10)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
5,35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no
6,28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no
7,42,entrepreneur,divorced,tertiary,yes,2,yes,no,unknown,5,may,380,1,-1,0,unknown,no
8,58,retired,married,primary,no,121,yes,no,unknown,5,may,50,1,-1,0,unknown,no
9,43,technician,single,secondary,no,593,yes,no,unknown,5,may,55,1,-1,0,unknown,no


In [4]:
df_train.job.unique()

array(['management', 'technician', 'entrepreneur', 'blue-collar',
       'unknown', 'retired', 'admin.', 'services', 'self-employed',
       'unemployed', 'housemaid', 'student'], dtype=object)

In [5]:
df_train.marital.unique()

array(['married', 'single', 'divorced'], dtype=object)

In [6]:
df_train.education.unique()

array(['tertiary', 'secondary', 'unknown', 'primary'], dtype=object)

In [7]:
df_train.contact.unique()

array(['unknown', 'cellular', 'telephone'], dtype=object)

In [8]:
df_train.month.unique()

array(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'jan', 'feb',
       'mar', 'apr', 'sep'], dtype=object)

In [9]:
df_train.poutcome.unique()

array(['unknown', 'failure', 'other', 'success'], dtype=object)

# Binary Encoding

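Binary encoding assigns each category an integer, writes that integer in binary, and stores each binary digit in its own column. A variable with n categories therefore needs roughly log2(n) columns instead of the n columns produced by one-hot encoding. It's a compromise between one-hot encoding (high memory usage) and label encoding (which implies an order among the categories).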

In [10]:
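# Example: Suppose we have a categorical variable "Gender" with two categories: Male and Female.
# With only two categories a single indicator column is enough,
# so pd.get_dummies with drop_first=True is used as the encoder here.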

# Sample data
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']}
df = pd.DataFrame(data)

# Binary encoding
binary_encoded = pd.get_dummies(df['Gender'], drop_first=True)

print(binary_encoded)

    Male
0   True
1  False
2   True
3  False
4   True
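
With only two categories the result is a single indicator column, so it does not show how binary encoding behaves for larger category sets. The following is a minimal sketch in plain Pandas, assuming categories are numbered from 0 in order of first appearance (the `BinaryEncoder` in Category Encoders may number them differently). The helper `binary_encode` and the sample job list are hypothetical and used only for illustration.

```python
import pandas as pd

def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: encode each category as the bits of an integer code."""
    # Assign integer codes 0..n-1 in order of first appearance
    codes, categories = pd.factorize(series)
    # Number of bits needed to represent the largest code (at least 1)
    n_bits = max(1, int(len(categories) - 1).bit_length())
    # Extract the bits, most significant first, one column per bit
    columns = {
        f"{series.name}_{bit}": (codes >> (n_bits - 1 - bit)) & 1
        for bit in range(n_bits)
    }
    return pd.DataFrame(columns, index=series.index)

jobs = pd.Series(
    ['management', 'technician', 'entrepreneur', 'blue-collar', 'retired'],
    name='job',
)
job_binary = binary_encode(jobs)
print(job_binary)
```

Five categories fit into three columns here, whereas one-hot encoding would need five.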


# Frequency Encoding

Frequency encoding replaces each category with how often it occurs in the dataset, either as a raw count or, as in the example below, as a proportion. It can be useful when how common a category is carries information about the target variable.

In [11]:
import pandas as pd

# Sample data
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']}
df = pd.DataFrame(data)

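# Frequency encoding: compute the relative frequency of each category...
frequency_encoded = df['Color'].value_counts(normalize=True)
# ...and map it back onto the column to obtain the encoded feature
df['Color_encoded'] = df['Color'].map(frequency_encoded)

print(frequency_encoded)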

Color
Red      0.4
Green    0.4
Blue     0.2
Name: proportion, dtype: float64
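
In a modelling workflow the frequencies are usually learned from the training data and then applied to new data. The sketch below illustrates this under two assumptions that are not part of the original example: a small made-up test set, and the choice to encode categories never seen during training as 0.0.

```python
import pandas as pd

train = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})
test = pd.DataFrame({'Color': ['Blue', 'Red', 'Purple']})  # 'Purple' never appears in train

# Learn the relative frequencies from the training data only
color_freq = train['Color'].value_counts(normalize=True)

# Replace each category with its training frequency;
# unseen categories get 0.0 (an assumption, other defaults are possible)
train['Color_freq'] = train['Color'].map(color_freq)
test['Color_freq'] = test['Color'].map(color_freq).fillna(0.0)

print(train)
print(test)
```

Note that categories with the same frequency (here Red and Green) become indistinguishable after encoding.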


# Label Encoding

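Label encoding converts each category into a unique integer. Scikit-Learn's `LabelEncoder` assigns the integers in sorted (alphabetical) order, so the codes do not reflect any meaningful order among the categories. It is intended for encoding target labels rather than input features.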

In [12]:
# Example: a "Size" variable with three categories: Small, Medium, and Large.

from sklearn.preprocessing import LabelEncoder

# Sample data
data = {'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']}
df = pd.DataFrame(data)

# Label encoding
label_encoder = LabelEncoder()
df['Size_encoded'] = label_encoder.fit_transform(df['Size'])

print(df)

     Size  Size_encoded
0   Small             2
1  Medium             1
2   Large             0
3  Medium             1
4   Small             2
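
The codes in the output follow alphabetical order (Large, Medium, Small), not the natural size order. A short sketch of how to inspect the learned mapping and reverse it, using the `classes_` attribute and `inverse_transform` method of `LabelEncoder`; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

sizes = ['Small', 'Medium', 'Large', 'Medium', 'Small']

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(sizes)

# classes_ lists the categories in sorted order; the position is the integer code
mapping = {category: int(code) for code, category in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = label_encoder.inverse_transform(encoded)
print(list(decoded))
```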


# Dummy Encoding

Dummy Encoding converts categorical variables into dummy variables (also known as indicator variables or binary variables), which are fictitious variables that take on values of 0 or 1 to indicate the presence or absence of a particular category.

Basic steps of how Dummy Encoder works:

1. Identification of Categorical Variables: The first step is to identify categorical variables in the dataset. These are variables that represent different categories but do not have an intrinsic order.

2. Creation of Dummy Variables: For each unique category in the categorical variable, the Dummy Encoder creates a new binary variable (dummy). If the observation belongs to a certain category, the corresponding dummy variable receives the value 1; otherwise, it receives the value 0.

3. Elimination of a Reference Category (optional): In many cases, it is desirable to eliminate one of the dummy variables to avoid the so-called "dummy variable trap," which occurs when the dummy variables are perfectly collinear: with one column per category, the columns always sum to 1, which duplicates the intercept of a linear model. This is often done by eliminating one of the categories, considering it as the reference. All other dummy variables are then interpreted in relation to this reference.
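
The collinearity described in step 3 can be checked numerically. This is a minimal sketch on made-up color data; `dtype=int` is passed only so the dummies print as 0/1 instead of booleans.

```python
import pandas as pd

df_colors = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']})

# Full set of dummies: one column per category
full = pd.get_dummies(df_colors['Color'], dtype=int)
# Every row sums to 1, so the columns are perfectly collinear with an intercept
row_sums = full.sum(axis=1).tolist()
print(row_sums)

# Dropping the first category ('Blue', alphabetically) makes it the reference level
reduced = pd.get_dummies(df_colors['Color'], drop_first=True, dtype=int)
print(reduced)
```

In the reduced frame a row of all zeros means "Blue", so no information is lost.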

In [13]:
import pandas as pd

# Sample data
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']}
df = pd.DataFrame(data)

# Dummy (one-hot) encoding: one column per category, no reference category dropped
one_hot_encoded = pd.get_dummies(df['Color'])

print(one_hot_encoded)

    Blue  Green    Red
0  False  False   True
1   True  False  False
2  False   True  False
3  False  False   True
4  False   True  False


In [14]:
import pandas as pd

df = df_train.copy()
X_train = df['contact']
y_train = df['y']


# Apply dummy encoding, dropping the first category ('cellular') as the reference
df_encoded = pd.get_dummies(X_train, drop_first=True)

print(df_encoded)


       telephone  unknown
0          False     True
1          False     True
2          False     True
3          False     True
4          False     True
...          ...      ...
45206      False    False
45207      False    False
45208      False    False
45209       True    False
45210      False    False

[45211 rows x 2 columns]


# Ordinal Encoding

Ordinal Encoding is used to encode ordinal categorical variables into numerical values. Ordinal categorical variables have an intrinsic meaning of order, meaning there is a relationship of order between the categories, but the distance between them is not defined.

Transformation steps:

1. Numeric Label Assignment: The OrdinalEncoder maps each unique category of the categorical variable to a numerical value based on the specified order. This mapping is usually defined by the user; if it is not, Scikit-Learn's `OrdinalEncoder` sorts the categories (alphabetically for strings), which rarely matches the intended order.

2. Variable Encoding: Replaces the categories in the dataset with the numerical labels assigned to each category.

Here are two simple examples using Pandas and scikit-learn:

In [15]:
# Suppose we have an ordinal categorical variable "Size" with three categories: Small, Medium, and Large.

# Sample data
data = {'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']}
df = pd.DataFrame(data)

# Ordinal encoding
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}
df['Size_encoded'] = df['Size'].map(size_mapping)

print(df)

     Size  Size_encoded
0   Small             1
1  Medium             2
2   Large             3
3  Medium             2
4   Small             1
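
As an alternative to a hand-written dictionary, the same order can be declared with an ordered Pandas categorical. This is only a sketch of one option; note that the resulting codes start at 0 rather than 1, and the variable names are illustrative.

```python
import pandas as pd

df_sizes = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# Declare the order explicitly, from lowest to highest
size_order = ['Small', 'Medium', 'Large']
size_cat = pd.Categorical(df_sizes['Size'], categories=size_order, ordered=True)

# .codes gives the zero-based position of each value in size_order
# (values missing from size_order would get -1)
df_sizes['Size_encoded'] = size_cat.codes

print(df_sizes)
```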


In [16]:
from sklearn.preprocessing import OrdinalEncoder

df = df_train.copy()
# Copy the selection so the new column is added to an independent DataFrame
X_train = df[['education']].copy()
y_train = df['y']

# Define the category order ('unknown' is placed lowest here)
ordem_categorias = [ 'unknown', 'primary', 'secondary', 'tertiary']

# Initialize the OrdinalEncoder with the specified order
encoder = OrdinalEncoder(categories=[ordem_categorias])

print(X_train)

# Fit the encoder and transform the data
X_train['education_encoded'] = encoder.fit_transform(X_train[['education']])

print(X_train)


       education
0       tertiary
1      secondary
2      secondary
3        unknown
4        unknown
...          ...
45206   tertiary
45207    primary
45208  secondary
45209  secondary
45210  secondary

[45211 rows x 1 columns]
       education  education_encoded
0       tertiary                3.0
1      secondary                2.0
2      secondary                2.0
3        unknown                0.0
4        unknown                0.0
...          ...                ...
45206   tertiary                3.0
45207    primary                1.0
45208  secondary                2.0
45209  secondary                2.0
45210  secondary                2.0

[45211 rows x 2 columns]




# Target Encoding

Target Encoding uses information from the target variable to assign numerical values to the categories. The basic process of `TargetEncoder` involves the following:

1. **Calculate the target variable means per category:** For each category in the categorical variable you are encoding, the `TargetEncoder` calculates the mean of the target variable (which is typically a binary variable indicating the class, e.g., 0 or 1).

2. **Assign the target variable mean to the category:** The numerical value assigned to each category is the mean of the target variable for that category. This means that categories with a high mean of the target variable will receive a higher value, while those with a low mean will receive a lower value.

3. **Encode the categories in the dataset:** Replaces the original categories in the dataset with the values calculated in the previous step.

The idea behind `TargetEncoder` is to capture the relationship between the categorical variable and the target variable, making it useful in predictive models. However, it's important to be cautious when using `TargetEncoder` to avoid information leakage. Leakage occurs when statistics (such as the mean of the target variable) are calculated using information that will not be available at prediction time, for example target values from the test set, which can result in an optimistic evaluation of the model's performance. The encoder should therefore be fitted on the training data only and then applied, unchanged, to the test data.

Here's an example using the Pandas library:

In [17]:
# Target encoding replaces categorical values with the mean of the target variable for each category. 
# It can be useful for categorical variables with high cardinality.

# Suppose we have a categorical variable "City" and a target variable "Salary".

# Sample data
data = {'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
        'Salary': [80000, 75000, 70000, 85000, 72000]}
df = pd.DataFrame(data)

# Target encoding
city_means = df.groupby('City')['Salary'].mean()
df['City_encoded'] = df['City'].map(city_means)

print(df)

          City  Salary  City_encoded
0     New York   80000       82500.0
1  Los Angeles   75000       75000.0
2      Chicago   70000       71000.0
3     New York   85000       82500.0
4      Chicago   72000       71000.0
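
In the example above, each row's own salary contributes to the mean that encodes it. One common way to reduce this effect is out-of-fold encoding, where each row is encoded with means computed from the other folds. The sketch below is an illustration under assumptions, not a prescribed method: the sixth data row, the number of folds, and the use of the global mean as a fallback are all choices made for this example.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Made-up data: the original sample plus one extra Los Angeles row
df_oof = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago', 'Los Angeles'],
    'Salary': [80000, 75000, 70000, 85000, 72000, 78000],
})

global_mean = df_oof['Salary'].mean()
df_oof['City_encoded'] = global_mean  # fallback for categories unseen in a fold

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
for fit_idx, enc_idx in kfold.split(df_oof):
    # Means come only from the rows outside the fold being encoded
    fold_means = df_oof.iloc[fit_idx].groupby('City')['Salary'].mean()
    encoded = df_oof.iloc[enc_idx]['City'].map(fold_means).fillna(global_mean)
    df_oof.loc[df_oof.index[enc_idx], 'City_encoded'] = encoded

print(df_oof)
```

No row is encoded using its own target value; when a city has no rows outside the current fold, the global mean is used instead.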


The Python library [category_encoders](https://pypi.org/project/category-encoders/) offers an implementation of the `TargetEncoder` that can be used as exemplified in the code below. Make sure to check the [library documentation](https://contrib.scikit-learn.org/category_encoders/targetencoder.html) for specific details about the parameters and options available.

In [18]:
import pandas as pd
from category_encoders import TargetEncoder

df = df_train.copy()

# Split the data into features and target
X_train = df['job']
y_train = df['y']

# Convert the target to numeric. Note that 'no' is mapped to 1 here,
# so the encoded values approximate the share of 'no' outcomes per job.
y_train = y_train.replace({'yes': 0, 'no': 1})

# Initialize the TargetEncoder
encoder = TargetEncoder()

# Fit the encoder on the training data and transform it
X_train_encoded = encoder.fit_transform(X_train, y_train)
# The test data would be transformed with the already fitted encoder:
# X_test_encoded = encoder.transform(X_test)

# Display the encoded data
print("Encoded Training Data:")
print(X_train_encoded)

Encoded Training Data:
            job
0      0.862444
1      0.889430
2      0.917283
3      0.927250
4      0.881944
...         ...
45206  0.889430
45207  0.772085
45208  0.772085
45209  0.927250
45210  0.917283

[45211 rows x 1 columns]
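
Category means based on few rows are noisy, so target encoders typically blend each category mean with the overall mean. The sketch below shows one simple blending scheme (additive smoothing) in plain Pandas. It is not the exact formula used by `category_encoders`, and the smoothing strength `m` is a hypothetical value chosen for illustration.

```python
import pandas as pd

df_smooth = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
    'Salary': [80000, 75000, 70000, 85000, 72000],
})

m = 2  # hypothetical smoothing strength: weight given to the global mean
global_mean = df_smooth['Salary'].mean()

stats = df_smooth.groupby('City')['Salary'].agg(['mean', 'count'])
# Blend each category mean with the global mean, weighted by category size
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df_smooth['City_encoded'] = df_smooth['City'].map(smoothed)
print(df_smooth)
```

Categories with few rows (here Los Angeles, with a single row) are pulled more strongly toward the global mean than categories with many rows.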


# References

- [A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems](https://dl.acm.org/doi/10.1145/507533.507538)