# Categorical Encoding

**Video reference:** [Ana Chaloska - To One-Hot or Not: A guide to feature encoding and when to use what | PDAMS 2023](https://www.youtube.com/watch?v=4Opsiqj6gcY&list=WL)

**Author:** BrenoAV

**Last date modified:** 11-29-2023

# Libraries

In [1]:
import pandas as pd

# Ordinal Encoding

In [2]:
from sklearn.preprocessing import OrdinalEncoder

In [3]:
df = pd.DataFrame(data=["BCs", "MSc", "PhD", "MSc", "PhD"],
                          columns=["education_level"])
print("--- Before ---")
print(df)

--- Before ---
  education_level
0             BCs
1             MSc
2             PhD
3             MSc
4             PhD


In [4]:
enc = OrdinalEncoder(categories=[["BCs", "MSc", "PhD"]])
ordinal_df = enc.fit_transform(df)
df["ordinal"] = ordinal_df
print("--- After ---")
print(df)

--- After ---
  education_level  ordinal
0             BCs      0.0
1             MSc      1.0
2             PhD      2.0
3             MSc      1.0
4             PhD      2.0


✅ Works well when the categories have a natural order, for example when a PhD ranks above a bachelor's degree.

🚫 Not good when the categories have no inherent order, for example eye color, because the integer codes would imply a ranking that does not exist.
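
The reason for passing an explicit `categories` list can be illustrated with a small sketch in plain Python. It assumes that an automatic category list would be sorted alphabetically; the `sizes` data and the variable names are hypothetical and only serve the illustration.

```python
# Hypothetical ordinal feature: low < medium < high.
sizes = ["low", "medium", "high", "medium", "low"]

# Alphabetical order, i.e. what a sorted, automatically inferred category list gives.
auto_order = sorted(set(sizes))
auto_codes = [auto_order.index(v) for v in sizes]

# Explicit order supplied by the analyst, from lowest to highest.
explicit_order = ["low", "medium", "high"]
explicit_codes = [explicit_order.index(v) for v in sizes]

print(auto_order, auto_codes)          # ['high', 'low', 'medium'] [1, 2, 0, 2, 1]
print(explicit_order, explicit_codes)  # ['low', 'medium', 'high'] [0, 1, 2, 1, 0]
```

With the alphabetical order, "high" receives the smallest code, so the integers no longer reflect the real ranking.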

# Frequency Encoder

In [5]:
df = pd.DataFrame(data=["chair", "table", "chair", "chair", "table", "plate"],
                          columns=["sold_product"])
print("--- Before ---")
print(df)

--- Before ---
  sold_product
0        chair
1        table
2        chair
3        chair
4        table
5        plate


In [6]:
# Count each category once, then map the counts back onto the column
df["frequency"] = df["sold_product"].map(df["sold_product"].value_counts())
print("--- After ---")
print(df)

--- After ---
  sold_product  frequency
0        chair          3
1        table          2
2        chair          3
3        chair          3
4        table          2
5        plate          1


**Note:** another option is `feature_engine.encoding.CountFrequencyEncoder`.

✅ Good for reducing high cardinality to a single numeric column, which makes it simpler for the model to find a pattern

🚫 Not good when several categories share the same frequency, because they become indistinguishable (the worst case is every category occurring equally often)
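
The collision problem can be made concrete with a minimal sketch using only the standard library. The extra `"lamp"` item is a hypothetical addition to the data above, included so that two categories end up with the same count.

```python
from collections import Counter

# Same products as above, plus a hypothetical "lamp" sold once.
sold = ["chair", "table", "chair", "chair", "table", "plate", "lamp"]

counts = Counter(sold)
encoded = [counts[item] for item in sold]  # frequency-encoded column

# Group categories by their count to find the ones that collide.
by_count = {}
for item, c in counts.items():
    by_count.setdefault(c, []).append(item)
collisions = {c: sorted(items) for c, items in by_count.items() if len(items) > 1}

print(encoded)     # [3, 2, 3, 3, 2, 1, 1]
print(collisions)  # {1: ['lamp', 'plate']}
```

After encoding, "plate" and "lamp" both become `1`, so the model can no longer tell them apart.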

# One-Hot Encoding

In [7]:
from sklearn.preprocessing import OneHotEncoder

In [8]:
df = pd.DataFrame(data=["brown", "blue", "green", "black", "green", "brown"],
                          columns=["eye_color"])
print("--- Before ---")
print(df)

--- Before ---
  eye_color
0     brown
1      blue
2     green
3     black
4     green
5     brown


In [9]:
enc = OneHotEncoder()
onehot_df = enc.fit_transform(df["eye_color"].values.reshape(-1, 1))
onehot_features = enc.get_feature_names_out(["eye_color"])
df[onehot_features] = onehot_df.toarray()
print("--- After ---")
print(df)

--- After ---
  eye_color  eye_color_black  eye_color_blue  eye_color_brown  eye_color_green
0     brown              0.0             0.0              1.0              0.0
1      blue              0.0             1.0              0.0              0.0
2     green              0.0             0.0              0.0              1.0
3     black              1.0             0.0              0.0              0.0
4     green              0.0             0.0              0.0              1.0
5     brown              0.0             0.0              1.0              0.0


**For Pandas, it is practical to use:** [https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)

✅ Efficient when the data has low cardinality

🚫 Inefficient and not recommended when the data has high cardinality, because it creates many columns in a sparse matrix. For example, encoding the day of the year (1-365) creates 365 columns.
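
For reference, a minimal sketch of the `pandas.get_dummies` route mentioned above, using the same eye color data. The `prefix` and `dtype` arguments are choices made for this illustration, not something the original notebook prescribes.

```python
import pandas as pd

df = pd.DataFrame({"eye_color": ["brown", "blue", "green", "black", "green", "brown"]})

# One indicator column per distinct value; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(df["eye_color"], prefix="eye_color", dtype=int)
result = pd.concat([df, dummies], axis=1)
print(result)
```

Each row has exactly one indicator set to 1, and the number of new columns equals the number of distinct categories.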

# Rare Label Encoding

In [10]:
!pip install feature-engine



In [11]:
from feature_engine.encoding.rare_label import RareLabelEncoder

In [12]:
df = pd.DataFrame(data=["ideal", "klarna", "credit_card", "credit_card", "ideal", "klarna",
                        "EPS", "klarna", "ideal", "Sofort", "credit_card", "Belfius"],
                  columns=["payment_method"])
print("--- Before ---")
print(df)

--- Before ---
   payment_method
0           ideal
1          klarna
2     credit_card
3     credit_card
4           ideal
5          klarna
6             EPS
7          klarna
8           ideal
9          Sofort
10    credit_card
11        Belfius


In [13]:
rle = RareLabelEncoder(n_categories=3, tol=0.1)
rare_df = rle.fit_transform(df[["payment_method"]])
df["rare"] = rare_df
print("--- After ---")
print(df)

--- After ---
   payment_method         rare
0           ideal        ideal
1          klarna       klarna
2     credit_card  credit_card
3     credit_card  credit_card
4           ideal        ideal
5          klarna       klarna
6             EPS         Rare
7          klarna       klarna
8           ideal        ideal
9          Sofort         Rare
10    credit_card  credit_card
11        Belfius         Rare


**Note:** rare label encoding still outputs categories, so another encoding has to be applied afterwards.

✅ Efficient for reducing high cardinality

✅ Good for grouping categories that have few observations

🚫 Be careful, because a category grouped into 'Rare' can still be meaningful. For example, a newly launched product has few observations, gets labeled 'Rare', and never has the chance to be considered on its own. Avoid assigning categories to 'Rare' blindly.
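
The grouping rule can be sketched by hand with the standard library. This is an illustration under the assumption that a category is kept when its relative frequency is at least `tol`; it is not the feature-engine implementation, and it ignores the `n_categories` parameter.

```python
from collections import Counter

payments = ["ideal", "klarna", "credit_card", "credit_card", "ideal", "klarna",
            "EPS", "klarna", "ideal", "Sofort", "credit_card", "Belfius"]
tol = 0.1  # minimum relative frequency to keep a category (assumed rule)

# Relative frequency of each category.
freq = {k: c / len(payments) for k, c in Counter(payments).items()}

# Replace infrequent categories with the label "Rare".
grouped = [p if freq[p] >= tol else "Rare" for p in payments]

# The result is still categorical, so a second encoding is needed, e.g. counts.
grouped_counts = Counter(grouped)
encoded = [grouped_counts[g] for g in grouped]

print(grouped)
print(encoded)
```

Here "EPS", "Sofort", and "Belfius" each appear once out of twelve rows (about 0.083), so they fall below `tol` and are merged into a single 'Rare' group of size 3.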

# Hash Encoding

In [14]:
!pip install category_encoders



In [15]:
from category_encoders.hashing import HashingEncoder

In [16]:
df = pd.DataFrame(data=["Jane Austen", "J.K. Rowling", "Mark Twain", "George Orwell", 
                        "Agatha Christie", "Agatha Chirstie", "Jane Austen"],
                  columns=["author_name"])
print("--- Before ---")
print(df)

--- Before ---
       author_name
0      Jane Austen
1     J.K. Rowling
2       Mark Twain
3    George Orwell
4  Agatha Christie
5  Agatha Chirstie
6      Jane Austen


In [17]:
enc = HashingEncoder(n_components=3, hash_method="md5")
hash_df = enc.fit_transform(df["author_name"].values)
hash_df = hash_df.add_prefix("author_name_", axis=1)
df = pd.concat([df, hash_df], axis=1)
print("--- After ---")
print(df)

--- After ---
       author_name  author_name_col_0  author_name_col_1  author_name_col_2
0      Jane Austen                  0                  0                  1
1     J.K. Rowling                  1                  0                  0
2       Mark Twain                  1                  0                  0
3    George Orwell                  0                  0                  1
4  Agatha Christie                  1                  0                  0
5  Agatha Chirstie                  1                  0                  0
6      Jane Austen                  0                  0                  1


In [18]:
enc.transform(["BrenoAV"])  # Also works for a category that was not present during fit

   col_0  col_1  col_2
0      0      1      0


✅ Efficient when the data has very high cardinality, because the number of output columns is fixed (dimensionality reduction)

✅ Memory efficient, because the hash is computed directly from the value, so nothing has to be regenerated for data never seen before

🚫 Loss of information, because different categories are grouped into the same bucket

🚫 Don't use it on ordinal data, because the order is lost and distinct levels (for example, simple and premium) can end up in the same bucket.
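
The idea behind the hashing trick can be sketched with `hashlib`. The helper names `hash_bucket` and `hash_encode` are hypothetical, and the bucket assignment is an assumption for illustration; it is not guaranteed to reproduce the exact columns produced by `HashingEncoder` above.

```python
import hashlib

def hash_bucket(value, n_components=3):
    """Map a value to one of n_components buckets via its MD5 digest."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(value, n_components=3):
    """Return an indicator row with a 1 in the bucket of the value."""
    row = [0] * n_components
    row[hash_bucket(value, n_components)] = 1
    return row

authors = ["Jane Austen", "J.K. Rowling", "Mark Twain", "George Orwell",
           "Agatha Christie", "Agatha Chirstie", "Jane Austen"]

for name in authors:
    print(name, hash_encode(name))

# No fit step is needed: an unseen value is hashed the same way.
print("BrenoAV", hash_encode("BrenoAV"))
```

With six distinct names and only three buckets, at least two names must share a bucket, which is the information loss listed above.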

# Target (mean) Encoder

In [19]:
from sklearn.preprocessing import TargetEncoder

In [20]:
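# Note: passing a nested list to `columns` builds MultiIndex columns, so
# df["payment_method"] below returns a 2D DataFrame rather than a 1D Series.
df = pd.DataFrame(data=[["ideal", 0], ["klarna", 1], ["credit_card", 1], ["credit_card", 1],
                        ["klarna", 0], ["ideal", 0], ["credit_card", 0]],
                  columns=[["payment_method", "fraud"]])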
print("--- Before ---")
print(df)

--- Before ---
  payment_method fraud
0          ideal     0
1         klarna     1
2    credit_card     1
3    credit_card     1
4         klarna     0
5          ideal     0
6    credit_card     0


In [21]:
# A high `smooth` value puts more weight on the global target mean.
# A low `smooth` value puts more weight on the target mean within each category.
enc = TargetEncoder(target_type="continuous", smooth=1e-8)
enc.fit(df["payment_method"].values, y=df["fraud"].values.reshape(-1))
target_df = enc.transform(df["payment_method"].values)
df["payment_method_target"] = target_df
print("--- After ---")
print(df)

--- After ---
  payment_method fraud payment_method_target
0          ideal     0          2.142857e-09
1         klarna     1          5.000000e-01
2    credit_card     1          6.666667e-01
3    credit_card     1          6.666667e-01
4         klarna     0          5.000000e-01
5          ideal     0          2.142857e-09
6    credit_card     0          6.666667e-01


**Note:** this is a Bayesian technique, because it uses the dependency between the categorical variable and the target to encode the data.
I think of it as replacing each category with a posterior estimate of the target.

✅ Useful when the data has high cardinality

✅ Target is **continuous** (regression) or **binary** (classification)

🚫 Not good when the categorical features depend on each other (one workaround is to combine the features before encoding)

🚫 Not good when the feature-target relationship changes over time, because the encoding is a single mean over the whole training set
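
The effect of `smooth` can be sketched by hand. The function below is a hypothetical helper that blends each category mean with the global mean as `(sum + smooth * global_mean) / (count + smooth)`; this formula is an assumption for illustration, although with `smooth=1e-8` it gives the same values as the output printed above.

```python
from collections import defaultdict

methods = ["ideal", "klarna", "credit_card", "credit_card", "klarna", "ideal", "credit_card"]
fraud = [0, 1, 1, 1, 0, 0, 0]

def smoothed_target_mean(categories, targets, smooth):
    """Blend each category's target mean with the global mean."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: (sums[c] + smooth * global_mean) / (counts[c] + smooth) for c in counts}

low = smoothed_target_mean(methods, fraud, smooth=1e-8)    # close to the per-category mean
high = smoothed_target_mean(methods, fraud, smooth=100.0)  # pulled toward the global mean

print(low)
print(high)
```

With a large `smooth`, all three categories end up near the global fraud rate of 3/7, so the encoding carries little category-specific information.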


# Weight of Evidence Encoder (WoE)

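$$WoE_c = \log\left(\frac{p(X=c \mid Y=1)}{p(X=c \mid Y=0)}\right)$$

Here $X$ is the categorical feature, $c$ is one of its categories, and $Y$ is the binary target.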

In [22]:
from category_encoders.woe import WOEEncoder

In [23]:
df = pd.DataFrame(data=[["ideal", 0], ["ideal", 1], ["credit_card", 1], ["credit_card", 1],
                        ["ideal", 0], ["ideal", 0], ["credit_card", 0]],
                  columns=[["payment_method", "fraud"]])
print("--- Before ---")
print(df)

--- Before ---
  payment_method fraud
0          ideal     0
1          ideal     1
2    credit_card     1
3    credit_card     1
4          ideal     0
5          ideal     0
6    credit_card     0


In [24]:
enc = WOEEncoder(regularization=1)
enc.fit(df["payment_method"].values, df["fraud"].values.reshape(-1))
woe_df = enc.transform(df["payment_method"].values)
df["woe_payment_method"] = woe_df
print("--- After ---")
print(df)

--- After ---
  payment_method fraud woe_payment_method
0          ideal     0          -0.510826
1          ideal     1          -0.510826
2    credit_card     1           0.587787
3    credit_card     1           0.587787
4          ideal     0          -0.510826
5          ideal     0          -0.510826
6    credit_card     0           0.587787


✅ Good for nominal data (mutually exclusive categories with no order), for example car, bus, train, tram, or bicycle.

✅ Good for logistic regression models (models that predict a discrete outcome)

🚫 Not good for ordinal data
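
The WoE values can be checked by hand: for each category, take its share among the fraud rows, divide by its share among the non-fraud rows, and take the logarithm. The sketch below adds a regularization term to both shares; that additive form is an assumption for illustration, although with `regularization=1` it gives the same numbers as the output above.

```python
import math

methods = ["ideal", "ideal", "credit_card", "credit_card", "ideal", "ideal", "credit_card"]
fraud = [0, 1, 1, 1, 0, 0, 0]

def weight_of_evidence(categories, targets, regularization=1.0):
    """Log ratio of a category's share among positives vs. negatives."""
    total_pos = sum(targets)
    total_neg = len(targets) - total_pos
    result = {}
    for cat in set(categories):
        pos = sum(t for c, t in zip(categories, targets) if c == cat)
        neg = sum(1 - t for c, t in zip(categories, targets) if c == cat)
        share_pos = (pos + regularization) / (total_pos + 2 * regularization)
        share_neg = (neg + regularization) / (total_neg + 2 * regularization)
        result[cat] = math.log(share_pos / share_neg)
    return result

woe = weight_of_evidence(methods, fraud)
print(woe)
```

For "ideal" this is log(0.4 / 0.667), roughly -0.51, and for "credit_card" it is log(0.6 / 0.333), roughly 0.59. A positive value means the category is over-represented among fraud cases.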

# CatBoost

In [25]:
from category_encoders.cat_boost import CatBoostEncoder

In [26]:
df = pd.DataFrame(data=[["ideal", 0], ["klarna", 1], ["credit_card", 1], ["credit_card", 1],
                        ["klarna", 0], ["ideal", 0], ["credit_card", 0]],
                  columns=[["payment_method", "fraud"]])
print("--- Before ---")
print(df)

--- Before ---
  payment_method fraud
0          ideal     0
1         klarna     1
2    credit_card     1
3    credit_card     1
4         klarna     0
5          ideal     0
6    credit_card     0


In [27]:
enc = CatBoostEncoder()
catboost_df = enc.fit_transform(df["payment_method"].values, df["fraud"].values.reshape(-1))
catboost_df

          0
0  0.428571
1  0.428571
2  0.428571
3  0.714286
4  0.714286
5  0.214286
6  0.809524


✅ Useful when the target depends on time (the rows have a meaningful order)

✅ Useful when the relationship between the category and the target changes over time

🚫 Not good with limited data (few observations per category)

🚫 Computationally expensive, because the encoding is computed row by row
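
The row-by-row computation can be sketched as an ordered target statistic: each row is encoded using only the targets of earlier rows with the same category, smoothed toward the global mean. The helper name and the smoothing weight `a` are assumptions for illustration; with `a=1` the sketch gives the same values as the output above.

```python
methods = ["ideal", "klarna", "credit_card", "credit_card", "klarna", "ideal", "credit_card"]
fraud = [0, 1, 1, 1, 0, 0, 0]

def ordered_target_encode(categories, targets, a=1.0):
    """Encode each row from the targets of previous rows in the same category."""
    prior = sum(targets) / len(targets)  # global target mean
    sums, counts, encoded = {}, {}, []
    for c, t in zip(categories, targets):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded.append((s + prior * a) / (n + a))  # uses past rows only
        sums[c], counts[c] = s + t, n + 1          # then reveal this row's target
    return encoded

encoded = ordered_target_encode(methods, fraud)
print([round(v, 6) for v in encoded])
# [0.428571, 0.428571, 0.428571, 0.714286, 0.714286, 0.214286, 0.809524]
```

The first occurrence of every category gets the global mean of 3/7, because no earlier rows exist for it. This is also why the same category receives different values in different rows.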

# END NOTES

This Jupyter Notebook was created by **BrenoAV**. For any inquiries or feedback, please feel free to create an [issue on GitHub](https://github.com/BrenoAV/MachineLearning-Studies/issues) 📣.