# <div align="center">Feature Engineering: Encoding Categorical Variables</div>
<hr>

What is Feature Engineering?

Feature engineering is a crucial step in the data pre-processing phase of machine learning, where data scientists and machine learning engineers create new features or modify existing ones to improve the performance of machine learning models. The goal of feature engineering is to provide the model with informative, non-redundant, and interpretable data that captures the underlying structure of the dataset. This process can significantly enhance model accuracy and performance by leveraging domain knowledge and mathematical transformations.

Key Aspects of Feature Engineering Include:

1. Creation of New Features: Involves generating new features from the existing data, which might be more relevant to the prediction task. This could include combining two or more features, extracting parts of a date-time stamp (like the day of the week, month, or year), or creating interaction terms that capture the relationship between different variables.

2. Feature Transformation: Applying transformations to features to change their distribution or scale. Common transformations include normalization, standardization, log transformation, and power transformations. These are especially important for algorithms that assume normally distributed data and for algorithms that are sensitive to the scale of features, like k-Nearest Neighbors (KNN) and gradient descent-based algorithms.

3. Feature Selection: Identifying the most relevant features to use in model training. This involves removing irrelevant, redundant, or noisy data that can detract from model performance. Techniques for feature selection include filter methods, wrapper methods, and embedded methods.

4. Feature Extraction: Techniques like Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and t-distributed Stochastic Neighbour Embedding (t-SNE) are used to reduce the number of features in a dataset while retaining as much useful structure as possible (for PCA, as much of the variance as possible). This is particularly useful for high-dimensional data.

5. Handling Missing Values: Developing strategies for dealing with missing data, such as imputation (filling in missing values with the mean, median, mode, or using more complex algorithms), or creating binary indicators that signal whether data was missing.

6. Encoding Categorical Variables: Converting categorical variables into a form that can be provided to ML models to improve performance. This includes using techniques like one-hot encoding, label encoding, and target encoding.

7. Working with different modalities: Feature engineering also includes applying all the above techniques to different modalities of data like temporal, textual and geospatial data.
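
A few of the aspects above can be sketched in a handful of lines with pandas and NumPy. The toy DataFrame and its column names (`signup`, `income`) are made up for illustration and are not part of the datasets used later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the notebook's datasets)
toy = pd.DataFrame({
    "signup": pd.to_datetime(["2024-01-05", "2024-02-17", "2024-03-09"]),
    "income": [30000.0, np.nan, 120000.0],
})

# Creation of new features: pull parts out of a date-time stamp
toy["signup_month"] = toy["signup"].dt.month
toy["signup_dayofweek"] = toy["signup"].dt.dayofweek  # Monday=0 ... Sunday=6

# Handling missing values: flag missingness, then impute with the median
toy["income_missing"] = toy["income"].isna().astype(int)
toy["income_filled"] = toy["income"].fillna(toy["income"].median())

# Feature transformation: log transform to compress a wide, skewed scale
toy["income_log"] = np.log1p(toy["income_filled"])

print(toy)
```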

Feature engineering is often considered more of an art than a science, requiring creativity, intuition, and domain knowledge. The quality and relevance of the features used can often make a more significant difference in the performance of a machine learning model than the choice of model itself. It enables models to learn better from the data, leading to more accurate predictions.

We will cover three topics. The first is **feature encoding**, which transforms categorical data into a numerical format. This is essential because most machine learning algorithms require numerical input. Encoding converts categorical features into a representation that preserves their information while making them compatible with these algorithms. Common encoding techniques include ordinal encoding, one-hot encoding, target encoding, and frequency encoding.

Second is **Discretization**, in which we take continuous numerical data and convert it into categorical form by dividing it into bins. This process is also called **Binning**.

Third, we will also learn to encode the target feature or output column using techniques like Label Encoding, Label Binarizer, and MultiLabel Binarizer.
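
As a quick illustration of discretization (the second topic above), here is a minimal sketch using `pandas.cut`. The ages, bin edges, and bin labels are assumptions chosen for the example.

```python
import pandas as pd

ages = pd.Series([5, 17, 18, 34, 35, 62, 80])  # made-up values

# Bin edges and labels are illustrative choices, not a recommendation.
# Intervals are right-closed by default: (0, 17], (17, 35], (35, 120].
age_group = pd.cut(ages, bins=[0, 17, 35, 120],
                   labels=["child", "young_adult", "adult"])

print(age_group.tolist())
```

The result is an ordered categorical column, so it can be passed on to any of the encoders discussed in this notebook.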

Machine learning is all about data, which we broadly classify into two forms: numerical data and categorical data. Numerical data includes features like age, weight, and marks, while categorical data includes categories like gender and state. Categorical data comes in two types:

1. **Ordinal data:** This is categorical data with an intrinsic order, like feedback ratings (poor, good, excellent). If there's an order in categorical data, we call it ordinal data.
2. **Nominal data:** This is categorical data without an intrinsic order, like gender.

In [1873]:
import numpy as np
import pandas as pd

In [1874]:
df = pd.read_csv(r'C:\Feature Engineering\Datasets\customer.csv').drop(columns=['age','gender'])

In [1875]:
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


### 1. Ordinal Encoding

Ordinal encoding is a data preprocessing technique used in machine learning to convert categorical variables that have a natural, ordered relationship into a numerical format. This method assigns a unique integer to each category based on its rank or position in that order.

**How Ordinal Encoding Works**

The core principle of ordinal encoding is to replace categorical labels with integers that preserve the inherent sequence of the data. For this to be effective, the variable must be ordinal, meaning its categories have a logical progression.

For instance, consider a variable for clothing size. The categories "Small," "Medium," and "Large" have a clear order. Ordinal encoding would convert them as follows:

- Small → 0

- Medium → 1

- Large → 2

This numerical representation allows machine learning algorithms, which primarily operate on numbers, to interpret the hierarchical nature of the feature.
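
The Small/Medium/Large mapping above can be reproduced with scikit-learn's `OrdinalEncoder`. This is a minimal sketch; the `sizes` array is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up clothing sizes, one column
sizes = np.array([["Medium"], ["Small"], ["Large"], ["Small"]], dtype=object)

# Pass the order explicitly so that Small -> 0, Medium -> 1, Large -> 2
size_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
encoded_sizes = size_encoder.fit_transform(sizes)

print(encoded_sizes.ravel())
```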

**When to Use Ordinal Encoding**

This technique is most appropriate under specific circumstances:

- Presence of Inherent Order: It should only be used for categorical features where the categories have a meaningful, ranked relationship. Examples include educational levels ("High School," "Bachelor's," "Master's"), customer satisfaction ratings ("Dissatisfied," "Neutral," "Satisfied"), or economic status ("Low," "Medium," "High").

- Tree-Based Models: Algorithms like Decision Trees, Random Forests, and Gradient Boosting are well-suited for ordinally encoded data. These models can effectively use the ordered nature of the integers to make splits and decisions.

**Key Assumptions and Limitations**

While useful, ordinal encoding operates on a crucial assumption that can also be its main limitation:

- Assumption of Equal Intervals: The technique implicitly assumes that the numerical distance between each category is equal. For example, in the "Small, Medium, Large" mapping (0, 1, 2), a model might interpret the difference between "Small" and "Medium" as being the same as the difference between "Medium" and "Large." This may not be true in reality and can mislead certain types of models, such as linear models or Support Vector Machines, which are sensitive to the magnitude of feature values.

- Not for Nominal Data: It is unsuitable for nominal categorical variables, where no intrinsic order exists (e.g., colors like "Red," "Green," "Blue"). Applying ordinal encoding to such data would introduce a false and misleading order, likely harming the model's performance. For nominal data, techniques like one-hot encoding are preferred.
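
A related pitfall is worth seeing once: a misleading order can creep in even for genuinely ordinal data when the order is not passed explicitly. The sketch below (with a made-up `reviews` array) shows that `OrdinalEncoder` sorts inferred categories alphabetically, so "Poor" ends up with the largest code.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

reviews = np.array([["Poor"], ["Average"], ["Good"]], dtype=object)

# No `categories` argument: categories are inferred and sorted alphabetically
default_encoder = OrdinalEncoder().fit(reviews)
default_codes = default_encoder.transform(reviews).ravel()

print(default_encoder.categories_)  # Average, Good, Poor
print(default_codes)                # codes for Poor, Average, Good in that row order
```

This is why the cells below pass `categories=[...]` with the intended ranking.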

In [1876]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

In [1877]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:2], df.iloc[:,-1], test_size=0.2)

In [1878]:
X_train

Unnamed: 0,review,education
15,Poor,UG
8,Average,UG
18,Good,School
14,Poor,PG
44,Average,UG
28,Poor,School
41,Good,PG
1,Poor,UG
10,Good,UG
3,Good,PG


In [1879]:
# specify order
oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']])

In [1880]:
X_train = oe.fit_transform(X_train)
X_test = oe.transform(X_test)

In [1881]:
X_train

Unnamed: 0,review,education
15,0.0,1.0
8,1.0,1.0
18,2.0,0.0
14,0.0,2.0
44,1.0,1.0
28,0.0,0.0
41,2.0,2.0
1,0.0,1.0
10,2.0,1.0
3,2.0,2.0


In [1882]:
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]

In [1883]:
oe.feature_names_in_

array(['review', 'education'], dtype=object)

In [1884]:
oe.n_features_in_

2

In [1885]:
oe.inverse_transform(np.array([0,2]).reshape(1,2))

array([['Poor', 'PG']], dtype=object)

In [1886]:
oe.get_feature_names_out()

array(['review', 'education'], dtype=object)

In [1887]:
# set unknown value
oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']],
                    handle_unknown='use_encoded_value',
                    unknown_value=-1)
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:2], df.iloc[:,-1], test_size=0.2)
X_train = oe.fit_transform(X_train)
oe.transform(np.array(['Poor','college']).reshape(1,2))



Unnamed: 0,review,education
0,0.0,-1.0


Infrequent categories, often referred to as "rare categories," are categories within a categorical variable that appear very seldom in the dataset. These categories are characterized by having a low frequency or count compared to other categories within the same feature.

How to handle:

- Aggregation: Combining rare categories into a single "Other" category to reduce the feature's cardinality and simplify the model.

- Encoding with Special Treatment: Using encoding techniques that specifically account for the rarity of categories, such as setting a min_frequency or max_categories threshold in OrdinalEncoder, or employing target encoding where the influence of rare categories is mitigated.

- Exclusion: In some cases, particularly when a category is extremely rare, it might be justified to exclude those data points from the analysis if it's believed they do not add value or could introduce noise.
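
The aggregation strategy can also be done by hand with pandas before any encoder is involved. This is a minimal sketch; the threshold `min_count` and the label `"Other"` are illustrative choices.

```python
import pandas as pd

animals = pd.Series(["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10
                    + ["snake"] * 3 + ["horse"] * 2)

min_count = 5  # illustrative threshold
counts = animals.value_counts()
rare = counts[counts < min_count].index

# Replace every rare category with a single "Other" label
animals_grouped = animals.where(~animals.isin(rare), "Other")

print(animals_grouped.value_counts())
```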

In [1888]:
# handling infrequent categories
X = np.array([['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 +['snake'] * 3 + ['horse'] * 2], dtype=object).T
X

array([['dog'],
       ['dog'],
       ['dog'],
       ['dog'],
       ['dog'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['cat'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['rabbit'],
       ['snake'],
       ['snake'],
       ['snake'],
       ['horse'],
       ['horse']], dtype=object)

### `max_categories` in OrdinalEncoder (with grouping)

- **Purpose**: Limits the number of categories encoded for a column.
- **How it works**:
  - If a column has **more unique categories than `max_categories`**, the encoder keeps only the **most frequent categories**.
  - All other less frequent categories are **grouped into a single "other" (infrequent) category**, which itself counts toward the `max_categories` limit.
- **Example**:
  - Column has categories `A, B, C, D, E`, with `A` and `B` the most frequent.
  - `max_categories=3` → the encoder keeps the top 2 categories (`A`, `B`) as-is.
  - All remaining categories (`C, D, E`) are combined into **one "other" category**, giving 3 output categories in total.
- **Benefit**: Reduces complexity and prevents rare categories from affecting the model.

In [1889]:
enc = OrdinalEncoder(max_categories=3).fit(X)

In [1890]:
enc.infrequent_categories_

[array(['dog', 'horse', 'snake'], dtype=object)]

In [1891]:
enc.transform(np.array([['cat','rabbit','snake','dog']]).reshape(4,1))

Unnamed: 0,x0
0,0.0
1,1.0
2,2.0
3,2.0


### `min_frequency` in OrdinalEncoder

- **Purpose**: Groups together *rare categories* based on how often they appear in the data.
- **How it works**:
  - Instead of specifying a fixed number of top categories (like `max_categories`),
    you set a **minimum occurrence threshold**.
  - Any category that appears **less than the given threshold** is combined into a single
    **"other" category**.
- **Parameter options**:
  - An **integer** (e.g., `6`) → categories with counts **≥ 6** are kept; all others grouped as “other”.
  - A **float** between `0` and `1` → treated as a fraction of the total sample size
    (e.g., `0.05` keeps categories that occur in at least 5% of the rows).
- **Benefit**: Automatically handles rare or noisy categories without manually counting them.




In [1892]:
enc = OrdinalEncoder(min_frequency=4).fit(X)
enc.infrequent_categories_

[array(['horse', 'snake'], dtype=object)]

In [1893]:
enc.transform(np.array([['cat','rabbit','snake','dog','horse']]).reshape(5,1))

Unnamed: 0,x0
0,0.0
1,2.0
2,3.0
3,1.0
4,3.0


In [1894]:
# handling missing data

# Example categorical data with missing values
data = [['Cat'], [np.nan], ['Dog'], ['Fish'], [np.nan]]

# Setting encoded_missing_value to -1, indicating we want missing values to be encoded as -1
encoder = OrdinalEncoder(encoded_missing_value=-1)

encoded_data = encoder.fit_transform(data)

print(encoded_data)

    x0
0  0.0
1 -1.0
2  1.0
3  2.0
4 -1.0


### Label Encoder

- **Purpose**: Encodes the **target/output feature** in classification problems.
- **Key Points**:
  - Not used for input features; it encodes the **target labels** (the output column). It should not be confused with *target encoding*, which is a technique for encoding input features.
  - Converts categorical labels into **integer values**.
  - Useful when the output column contains categories like `['cat', 'dog', 'mouse']`.
- **Example Mapping**:

| Original Label | Encoded Value |
|----------------|---------------|
| cat            | 0             |
| dog            | 1             |
| mouse          | 2             |




In [1895]:
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


In [1896]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:2], df.iloc[:,-1], test_size=0.2)

In [1897]:
from sklearn.preprocessing import LabelEncoder

In [1898]:
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

In [1899]:
le.classes_

array(['No', 'Yes'], dtype=object)

In [1900]:
le.inverse_transform(np.array([1,1,0]))

array(['Yes', 'Yes', 'No'], dtype=object)

### 2. OneHotEncoder

- **Purpose**: Used for **categorical input features** that are **nominal**, i.e., have **no intrinsic order**.  
  Examples: `brand` = ['Samsung', 'Apple', 'OnePlus']

- **Why not use Ordinal Encoding here?**
  - Ordinal encoding assigns integer values based on order.
  - For nominal data, integers can mislead algorithms into thinking some categories are "greater" or "more important."
  - Some algorithms may give **unintended weight** to certain categories.

- **How OneHotEncoder works**:
  - For each category in a feature, a **new binary column** is created.
  - If a row has that category, the column gets `1`; otherwise, `0`.
  - Example:

| brand      | brand_Samsung | brand_Apple | brand_OnePlus |
|------------|---------------|-------------|---------------|
| Samsung    | 1             | 0           | 0             |
| Apple      | 0             | 1           | 0             |
| OnePlus    | 0             | 0           | 1             |

- **Key Points**:
  - If a feature has `n` categories, it creates **n new features** (can also use `drop='first'` to avoid dummy variable trap).
  - Produces a **sparse matrix** (mostly zeros).
  - Called **"one-hot"** because only **one column has 1** for each row, rest are 0.




In [1901]:
cars = pd.read_csv(r'C:\Feature Engineering\Datasets\cars.csv').drop(columns=['km_driven','owner'])
cars.head()


Unnamed: 0,brand,fuel,selling_price
0,Maruti,Diesel,450000
1,Skoda,Diesel,370000
2,Honda,Petrol,158000
3,Hyundai,Diesel,225000
4,Maruti,Petrol,130000


In [1902]:
cars.isnull().sum()

brand            0
fuel             0
selling_price    0
dtype: int64

In [1903]:
X = cars.iloc[:,0:2]
y = cars.iloc[:,-1]

In [1904]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [1905]:
X_train

Unnamed: 0,brand,fuel
6518,Tata,Petrol
6144,Honda,Petrol
6381,Hyundai,Diesel
438,Maruti,Diesel
5939,Maruti,Petrol
...,...,...
5226,Mahindra,Diesel
5390,Maruti,Diesel
860,Hyundai,Petrol
7603,Maruti,Diesel


In [1906]:
from sklearn.preprocessing import OneHotEncoder

In [1907]:
ohe = OneHotEncoder()

In [1908]:
ohe = OneHotEncoder(sparse_output=False)

In [1909]:
X_train = ohe.fit_transform(X_train)
X_train

Unnamed: 0,brand_Ambassador,brand_Ashok,brand_Audi,brand_BMW,brand_Chevrolet,brand_Daewoo,brand_Datsun,brand_Fiat,brand_Force,brand_Ford,...,brand_Renault,brand_Skoda,brand_Tata,brand_Toyota,brand_Volkswagen,brand_Volvo,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol
6518,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5226,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5390,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
860,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
7603,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [1910]:
X_train.shape

(6502, 36)

In [1911]:
ohe.categories_

[array(['Ambassador', 'Ashok', 'Audi', 'BMW', 'Chevrolet', 'Daewoo',
        'Datsun', 'Fiat', 'Force', 'Ford', 'Honda', 'Hyundai', 'Isuzu',
        'Jaguar', 'Jeep', 'Kia', 'Land', 'Lexus', 'MG', 'Mahindra',
        'Maruti', 'Mercedes-Benz', 'Mitsubishi', 'Nissan', 'Opel',
        'Peugeot', 'Renault', 'Skoda', 'Tata', 'Toyota', 'Volkswagen',
        'Volvo'], dtype=object),
 array(['CNG', 'Diesel', 'LPG', 'Petrol'], dtype=object)]

In [1912]:
ohe.feature_names_in_

array(['brand', 'fuel'], dtype=object)

In [1913]:
ohe.n_features_in_

2

In [1914]:
ohe.get_feature_names_out()

array(['brand_Ambassador', 'brand_Ashok', 'brand_Audi', 'brand_BMW',
       'brand_Chevrolet', 'brand_Daewoo', 'brand_Datsun', 'brand_Fiat',
       'brand_Force', 'brand_Ford', 'brand_Honda', 'brand_Hyundai',
       'brand_Isuzu', 'brand_Jaguar', 'brand_Jeep', 'brand_Kia',
       'brand_Land', 'brand_Lexus', 'brand_MG', 'brand_Mahindra',
       'brand_Maruti', 'brand_Mercedes-Benz', 'brand_Mitsubishi',
       'brand_Nissan', 'brand_Opel', 'brand_Peugeot', 'brand_Renault',
       'brand_Skoda', 'brand_Tata', 'brand_Toyota', 'brand_Volkswagen',
       'brand_Volvo', 'fuel_CNG', 'fuel_Diesel', 'fuel_LPG',
       'fuel_Petrol'], dtype=object)

In [1915]:
pd.DataFrame(data = X_train,columns = ohe.get_feature_names_out())

Unnamed: 0,brand_Ambassador,brand_Ashok,brand_Audi,brand_BMW,brand_Chevrolet,brand_Daewoo,brand_Datsun,brand_Fiat,brand_Force,brand_Ford,...,brand_Renault,brand_Skoda,brand_Tata,brand_Toyota,brand_Volkswagen,brand_Volvo,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol
6518,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
438,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5226,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5390,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
860,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
7603,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [1916]:
ohe.inverse_transform(np.array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0. , 0. , 1.]).reshape(1,36))

array([['Mercedes-Benz', 'Petrol']], dtype=object)

# The Dummy Variable Trap

The **Dummy Variable Trap** is a scenario where the features created by one-hot encoding are linearly dependent: any one dummy column can be computed exactly from the others. This **perfect multicollinearity** is a problem for models that estimate a separate coefficient per feature, such as Linear and Logistic Regression, because the coefficients are no longer uniquely determined.

---

## Why Does It Happen?

When we one-hot encode a categorical feature with $k$ categories, we create $k$ new binary columns (dummy variables). The trap occurs because the information in these $k$ columns is redundant: the value of any one column can be perfectly predicted from the values of the others.

For example, if we have a feature "Season" with four categories (Winter, Spring, Summer, Fall), we create four dummy variables.

| Season_Winter | Season_Spring | Season_Summer | Season_Fall |
| :-----------: | :-----------: | :-----------: | :---------: |
|       1       |       0       |       0       |      0      |
|       0       |       1       |       0       |      0      |
|       0       |       0       |       1       |      0      |
|       0       |       0       |       0       |      1      |

If a row is not Winter, Spring, or Summer, it *must* be Fall. This means:

Season_Winter + Season_Spring + Season_Summer + Season_Fall = 1

This perfect relationship confuses linear models because they can't determine the independent influence of each category on the outcome.
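
The redundancy can be checked numerically. The sketch below builds a made-up design matrix (eight rows, each season twice) and compares the number of columns with the matrix rank; the intercept column of ones stands in for the constant term of a linear model.

```python
import numpy as np

# Eight made-up rows: each season appears twice.
# Dummy columns are ordered Winter, Spring, Summer, Fall.
season_dummies = np.tile(np.eye(4), (2, 1))

# Every row sums to 1, which is exactly the intercept column of a linear model
row_sums = season_dummies.sum(axis=1)
print(row_sums)

# Intercept + 4 dummies = 5 columns, but only 4 are linearly independent
design = np.column_stack([np.ones(8), season_dummies])
n_columns = design.shape[1]
rank = np.linalg.matrix_rank(design)
print(n_columns, rank)
```

A rank lower than the number of columns means the least-squares coefficients are not uniquely determined.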

---

## The Solution: Drop One Column

The solution is simple: for a feature with $k$ categories, we only use **$k-1$** dummy variables.

We drop one of the columns. The dropped category becomes the **baseline** or **reference category**. Its effect is captured by the model's intercept, and it is represented by a row where all the other dummy variables for that feature are zero.

**Example (Dropping `Season_Fall`):**

| Original | Season_Winter | Season_Spring | Season_Summer | Interpretation                       |
| :------: | :-----------: | :-----------: | :-----------: | ------------------------------------ |
|  Winter  |       1       |       0       |       0       | It is Winter.                        |
|  Spring  |       0       |       1       |       0       | It is Spring.                        |
|  Summer  |       0       |       0       |       1       | It is Summer.                        |
|   Fall   |     **0**     |     **0**     |     **0**     | It is not Winter, Spring, or Summer. |

---

## When Should You Care?

You primarily need to worry about the dummy variable trap when using models that are sensitive to multicollinearity.

* **Affected Models:**
    * Linear Regression
    * Logistic Regression
    * Linear Discriminant Analysis (LDA)

* **Unaffected Models:**
    * Decision Trees
    * Random Forest
    * XGBoost / Gradient Boosting
    * K-Nearest Neighbors (KNN)

Tree-based models are generally immune because they split on one feature at a time and never estimate a separate coefficient for each column, so redundant columns do not make the fit ill-defined. KNN likewise estimates no coefficients.

---

## Practical Implementation

Most data science libraries have a simple parameter to handle this automatically:

* **Pandas:** `pd.get_dummies(df, drop_first=True)`
* **Scikit-learn:** `OneHotEncoder(drop='first')`
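
Here is a minimal sketch of the pandas option on the Season example (the eight-row DataFrame is made up). With `drop_first=True`, pandas drops the first category in sorted order, which happens to be `Fall` here, matching the table above. The `dtype=int` argument is only there to get 0/1 instead of booleans.

```python
import numpy as np
import pandas as pd

seasons = pd.DataFrame({"Season": ["Winter", "Spring", "Summer", "Fall"] * 2})

# k = 4 categories -> k - 1 = 3 dummy columns
dropped = pd.get_dummies(seasons, drop_first=True, dtype=int)
print(dropped.columns.tolist())

# With an intercept column added, the design matrix has full column rank
design_dropped = np.column_stack([np.ones(len(dropped)), dropped.to_numpy()])
print(design_dropped.shape[1], np.linalg.matrix_rank(design_dropped))
```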

In [1917]:
X = cars.iloc[:,0:2]
y = cars.iloc[:,-1]

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

ohe = OneHotEncoder(drop='first',sparse_output=False)
ohe.fit_transform(X_train).shape

(6502, 34)

In [1918]:
ohe.drop_idx_

array([0, 0], dtype=object)

In [1919]:
# handling rare categories
X_train['brand'].value_counts()

brand
Maruti           1953
Hyundai          1127
Mahindra          635
Tata              586
Toyota            391
Honda             369
Ford              320
Chevrolet         185
Renault           183
Volkswagen        154
BMW                96
Skoda              82
Nissan             62
Jaguar             59
Volvo              54
Datsun             48
Mercedes-Benz      43
Fiat               35
Audi               30
Jeep               26
Lexus              22
Mitsubishi         13
Force               6
Land                5
Kia                 4
Daewoo              3
MG                  3
Ambassador          3
Isuzu               2
Ashok               1
Peugeot             1
Opel                1
Name: count, dtype: int64

In [1920]:
cars['fuel'].value_counts()

fuel
Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: count, dtype: int64

In [1921]:
# using min frequency

ohe = OneHotEncoder(sparse_output=False, min_frequency=100)
ohe.fit_transform(X_train).shape

(6502, 14)

In [1922]:
ohe.get_feature_names_out()

array(['brand_Chevrolet', 'brand_Ford', 'brand_Honda', 'brand_Hyundai',
       'brand_Mahindra', 'brand_Maruti', 'brand_Renault', 'brand_Tata',
       'brand_Toyota', 'brand_Volkswagen', 'brand_infrequent_sklearn',
       'fuel_Diesel', 'fuel_Petrol', 'fuel_infrequent_sklearn'],
      dtype=object)

In [1923]:
# using max_categories
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore', max_categories=15)
ohe.fit_transform(X_train).shape

(6502, 19)

In [1924]:
ohe.get_feature_names_out()

array(['brand_BMW', 'brand_Chevrolet', 'brand_Ford', 'brand_Honda',
       'brand_Hyundai', 'brand_Jaguar', 'brand_Mahindra', 'brand_Maruti',
       'brand_Nissan', 'brand_Renault', 'brand_Skoda', 'brand_Tata',
       'brand_Toyota', 'brand_Volkswagen', 'brand_infrequent_sklearn',
       'fuel_CNG', 'fuel_Diesel', 'fuel_LPG', 'fuel_Petrol'], dtype=object)

### Handling Unknown Categories During Prediction (OneHotEncoder)

- **Context**:  
  After training a model, the **OneHotEncoder** is used again during prediction  
  (e.g., when transforming new test data or live user inputs).  
  Sometimes, the new data may contain a **category that was never seen** during training.

- **Default Behavior (`handle_unknown='error'`)**:  
  - If a new category appears during prediction, the encoder will raise a **ValueError**.
  - Example:
    - Training categories: `Red`, `Blue`, `Green`
    - Prediction data contains: `Yellow`
    - Error: `Found unknown categories ['Yellow'] in column 0 during transform`

- **Safe Option (`handle_unknown='ignore'`)**:  
  - When creating the encoder, set:
    ```python
    OneHotEncoder(handle_unknown='ignore')
    ```
  - During prediction, any unseen category is encoded as **all zeros** across the dummy columns.
  - Example:
    - Training categories: `Red`, `Blue`, `Green`
    - Prediction contains: `Yellow`
    - Encoding for `Yellow` → `[0, 0, 0]`

- **Key Takeaways**:
  - Always fit the encoder on the **training data only**, and then reuse it for prediction.
  - `handle_unknown='ignore'` ensures the model will **not crash** when a new category appears.
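  - An all-zero vector means **no known category matched**, so the model receives no contribution from that feature. If `drop` is also set, an all-zero vector is indistinguishable from the dropped baseline category.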


In [1925]:
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe.fit_transform(X_train)

ohe.transform(np.array(['local','Petrol']).reshape(1,2))



Unnamed: 0,brand_Ambassador,brand_Ashok,brand_Audi,brand_BMW,brand_Chevrolet,brand_Daewoo,brand_Datsun,brand_Fiat,brand_Force,brand_Ford,...,brand_Renault,brand_Skoda,brand_Tata,brand_Toyota,brand_Volkswagen,brand_Volvo,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [1926]:
ohe.inverse_transform(np.array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1.]).reshape(1,36))

array([[None, 'Petrol']], dtype=object)

### LabelBinarizer

- **Purpose**:  
  - `LabelBinarizer` is used to **one-hot encode the target (output) feature**.  
  - While `LabelEncoder` converts target categories into **single numeric labels** (0, 1, 2, …),  
    sometimes we need the output to be **one-hot encoded** instead of a single number.

- **When to Use**:
  - Useful in **multi-class classification** problems where the model expects  
    one-hot encoded targets.
  - Examples:
    - **Deep learning** models with a `softmax` output layer.
    - **Logistic Regression** with a **one-vs-all** strategy.
    - Datasets like **Iris**, where the target feature (`species`) has 3 classes.

- **Example**:
  Suppose the target `y` contains three categories: `['yes', 'no', 'maybe']`.

  - Using `LabelEncoder` (classes are sorted alphabetically: `maybe`, `no`, `yes`):
    ```
    yes   → 2
    no    → 1
    maybe → 0
    ```
     Only a **single column** with numeric labels.

  - Using `LabelBinarizer`:
    ```
    yes   → [0, 0, 1]
    no    → [0, 1, 0]
    maybe → [1, 0, 0]
    ```
     **Three new columns** (`y_maybe`, `y_no`, `y_yes`), one per class.




In [1927]:
from sklearn.preprocessing import LabelBinarizer

# Sample target variable for a multi-class classification problem
y = ['cat', 'dog', 'fish', 'dog', 'cat']

# Initialize the LabelBinarizer
lb = LabelBinarizer()

# Fit and transform the target variable
y_binarized = lb.fit_transform(y)

print("Binarized labels:\n", y_binarized)

# Inverse transform to recover original labels
y_original = lb.inverse_transform(y_binarized)

print("Original labels:\n", y_original)


Binarized labels:
 [[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 1 0]
 [1 0 0]]
Original labels:
 ['cat' 'dog' 'fish' 'dog' 'cat']


### MultiLabelBinarizer

- **Purpose**:
  - Used for **multi-label problems**, where each sample can be associated with **multiple target labels** simultaneously.
  - Transforms a list of lists (or similar iterables) of labels into a **binary matrix** where each column corresponds to a unique label and a `1` indicates the presence of that label for a given sample.
- **When to Use**:
  - When the output column is multi-label. For example, if you have movie summaries and need to identify all applicable genres (a movie can have multiple genres).
- **How it Works**:
  - It identifies all unique labels across all samples.
  - It creates a binary column for each unique label.
  - For each sample, it places a `1` in the columns corresponding to the labels present in that sample, and `0` otherwise.

In [1928]:
from sklearn.preprocessing import MultiLabelBinarizer

# Example multi-label data
y = [('red', 'blue'), ('blue', 'green'), ('green',), ('red',)]

# Initialize MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Fit and transform the data to binary matrix format
Y = mlb.fit_transform(y)

print("Binary matrix:\n", Y)
print("Class labels:", mlb.classes_)

# Inverse transform to recover original labels
y_inv = mlb.inverse_transform(Y)
print("Inverse transformed labels:", y_inv)


Binary matrix:
 [[1 0 1]
 [1 1 0]
 [0 1 0]
 [0 0 1]]
Class labels: ['blue' 'green' 'red']
Inverse transformed labels: [('blue', 'red'), ('blue', 'green'), ('green',), ('red',)]


### 3. Count Encoder/Frequency Encoder

# Count Encoding vs Frequency Encoding

When working with categorical data in machine learning, we need to convert categories (text labels) into numerical values.  
Two common techniques for this are **Count Encoding** and **Frequency Encoding**.  

Both techniques aim to represent categorical variables in a numeric form that models can understand, but they differ in *how* they assign values.

---

## 1. Count Encoding 

### What It Does
- Each unique category is replaced by the **number of times** it appears in the dataset.  
- This means the encoded value is proportional to how common that category is.  
- Categories that appear often will get higher numbers, and rare categories will get smaller numbers.

### Why It Helps
- Provides a simple way to capture the importance or popularity of a category.  
- Keeps categories that occur frequently distinct from those that occur rarely.  
- Works well when the absolute number of occurrences matters (e.g., "most purchased product IDs").

### Example

Dataset of pets and their breeds:

| Pet ID | Breed     |
| :----: | :-------- |
|   1    | Labrador  |
|   2    | Beagle    |
|   3    | Beagle    |
|   4    | Labrador  |
|   5    | Siamese   |

Steps:
1. Count occurrences in the `Breed` column:
   - Labrador → 2  
   - Beagle → 2  
   - Siamese → 1  
2. Replace each category with its count.

**After Count Encoding:**

| Pet ID | Breed (Count Encoded) |
| :----: | :-------------------- |
|   1    | 2                     |
|   2    | 2                     |
|   3    | 2                     |
|   4    | 2                     |
|   5    | 1                     |
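
The two steps above can be sketched in a few lines of plain Python. This is an illustration of the idea only (the notebook itself uses `category_encoders` later); `count_encoded` is a name introduced here.

```python
from collections import Counter

breeds = ['Labrador', 'Beagle', 'Beagle', 'Labrador', 'Siamese']

# Step 1: count occurrences of each category
counts = Counter(breeds)

# Step 2: replace each category with its count
count_encoded = [counts[breed] for breed in breeds]

print(dict(counts))
print(count_encoded)
```

Note that Labrador and Beagle end up with the same value, so the encoded column alone cannot tell them apart.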

---

## 2. Frequency Encoding 

### What It Does
- Each unique category is replaced by the **relative frequency** of that category.  
- Relative frequency = (Count of category ÷ Total number of rows).  
- Values are between `0` and `1`, making the encoding normalized.  

### Why It Helps
- Useful when dataset sizes differ, since it scales values automatically.  
- Prevents raw counts from dominating just because the dataset is large.  
- Encoded values can be read as the empirical probability of drawing that category from the dataset.  

### Example

Using the same dataset:

Total rows = 5  

- Labrador → 2/5 = **0.4**  
- Beagle → 2/5 = **0.4**  
- Siamese → 1/5 = **0.2**

**After Frequency Encoding:**

| Pet ID | Breed (Frequency Encoded) |
| :----: | :------------------------ |
|   1    | 0.4                       |
|   2    | 0.4                       |
|   3    | 0.4                       |
|   4    | 0.4                       |
|   5    | 0.2                       |
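
As a rough sketch of the same calculation in plain Python (names such as `frequencies` are illustrative), frequency encoding is count encoding divided by the number of rows:

```python
from collections import Counter

breeds = ['Labrador', 'Beagle', 'Beagle', 'Labrador', 'Siamese']
total_rows = len(breeds)

# Relative frequency = count of category / total number of rows
counts = Counter(breeds)
frequencies = {breed: count / total_rows for breed, count in counts.items()}

# Replace each category with its relative frequency
freq_encoded = [frequencies[breed] for breed in breeds]

print(frequencies)
print(freq_encoded)
```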

---

## 3. Relationship Between Count and Frequency Encoding 

At their core, both techniques use the **same idea**:  
> Encode a category based on how often it appears.  

The only difference is in the **scale of numbers**:  

- Count Encoding → Raw counts (e.g., 2, 1, 5).  
- Frequency Encoding → Normalized counts (proportions, e.g., 0.4, 0.2).  

This means **Frequency Encoding is essentially a normalized version of Count Encoding**.

---

## 4. The `normalize` Parameter in Count Encoder 

In libraries like `category_encoders`, the **Count Encoder** has a parameter called `normalize`.  

- `normalize=False` (default):  
  - Outputs raw counts.  
  - Example: Labrador → 2, Beagle → 2, Siamese → 1.  

- `normalize=True`:  
  - Divides counts by the total number of rows, turning them into frequencies.  
  - Example: Labrador → 0.4, Beagle → 0.4, Siamese → 0.2.  
  - In this case, the Count Encoder **behaves exactly like a Frequency Encoder**.

In other words:  
`CountEncoder(normalize=True)` = **Frequency Encoder**

---

## 5. Practical Considerations 

- **Scaling of Values**  
  - Count Encoding → Values can get very large in big datasets.  
  - Frequency Encoding → Always between 0 and 1, making it stable across dataset sizes.  

- **Interpretability**  
  - Count Encoding → Easy to interpret ("how many times did this category occur?").  
  - Frequency Encoding → Easy to compare across datasets ("what proportion of the dataset is this category?").  

- **Model Sensitivity**  
  - Tree-based models (like Random Forest, XGBoost) are usually fine with either encoding.  
  - Linear models may prefer Frequency Encoding since values are normalized and won’t distort scale.

---

- **Count Encoding** → Categories replaced by raw counts.  
- **Frequency Encoding** → Categories replaced by relative frequencies.  
- **`normalize=True` in Count Encoder** → Makes it act like Frequency Encoding.  

Both methods are powerful, and the choice depends on whether you care about **absolute counts** or **relative proportions** in your dataset.


In [1929]:
# dataset generation
import pandas as pd
import numpy as np
import category_encoders as ce

# Simulating a dataset
data = {
    'Age': np.random.randint(20, 60, size=100).astype(float),  # Random ages between 20 and 60
    'State': np.random.choice(['Karnataka', 'Tamil Nadu', 'Maharashtra', 'Delhi', 'Telangana'], size=100),
    'Education': np.random.choice(['High School', 'UG', 'PG'], size=100),
    'Package': np.random.rand(100) * 100  # Random package values for demonstration
}

# Introducing missing values in 'Age' column (5%)
np.random.seed(0)  # Fixes only which rows go missing; the columns above were drawn before seeding
missing_indices = np.random.choice(data['Age'].shape[0], replace=False, size=int(data['Age'].shape[0] * 0.05))
data['Age'][missing_indices] = np.nan

df = pd.DataFrame(data)

df.head()

Unnamed: 0,Age,State,Education,Package
0,38.0,Tamil Nadu,PG,17.465839
1,27.0,Maharashtra,High School,32.7988
2,,Telangana,High School,68.034867
3,21.0,Delhi,UG,6.320762
4,22.0,Karnataka,High School,60.724937


In [1930]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['Package']), df['Package'], test_size=0.2, random_state=42)

In [1931]:
X_train.head()

Unnamed: 0,Age,State,Education
55,,Tamil Nadu,High School
88,34.0,Maharashtra,UG
26,,Maharashtra,High School
42,25.0,Karnataka,UG
69,30.0,Delhi,PG


In [1932]:
X_train['State'].value_counts()

State
Maharashtra    25
Telangana      18
Delhi          15
Tamil Nadu     12
Karnataka      10
Name: count, dtype: int64

In [1933]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import sklearn

In [1934]:
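class CountEncoder(BaseEstimator, TransformerMixin):
    """Replace each category with the number of times it appeared during fit."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # Learned state lives in trailing-underscore attributes so that the
        # constructor parameter `columns` is never modified by fit()
        self.columns_ = list(X.columns) if self.columns is None else list(self.columns)
        self.count_map_ = {col: X[col].value_counts().to_dict() for col in self.columns_}
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.columns_:
            # Categories not seen during fit are encoded as 0
            X[col] = X[col].map(self.count_map_[col]).fillna(0)
        return X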

In [1935]:
preprocessor = ColumnTransformer(
    transformers=[
        ('age_missing', SimpleImputer(strategy='mean'), ['Age']),
        ('cat_state', CountEncoder(), ['State']),
        # No `categories` given, so levels are ordered alphabetically: High School=0, PG=1, UG=2
        ('education_ordinal', OrdinalEncoder(), ['Education'])
    ])

sklearn.set_config(transform_output="pandas")

In [1936]:
preprocessor.fit_transform(X_train)

Unnamed: 0,age_missing__Age,cat_state__State,education_ordinal__Education
55,37.586667,12,0.0
88,34.000000,25,2.0
26,37.586667,25,0.0
42,25.000000,10,2.0
69,30.000000,15,1.0
...,...,...,...
60,26.000000,10,0.0
71,45.000000,15,0.0
14,32.000000,25,0.0
92,49.000000,25,0.0


# using category encoders

In [1937]:
from category_encoders.count import CountEncoder

In [1938]:
preprocessor = ColumnTransformer(
    transformers=[
        ('age_missing', SimpleImputer(strategy='mean'), ['Age']),
        ('cat_state', CountEncoder(), ['State']),
        ('education_ordinal', OrdinalEncoder(), ['Education'])
    ])
sklearn.set_config(transform_output="pandas")

In [1939]:
preprocessor.fit_transform(X_train)

Unnamed: 0,age_missing__Age,cat_state__State,education_ordinal__Education
55,37.586667,12,0.0
88,34.000000,25,2.0
26,37.586667,25,0.0
42,25.000000,10,2.0
69,30.000000,15,1.0
...,...,...,...
60,26.000000,10,0.0
71,45.000000,15,0.0
14,32.000000,25,0.0
92,49.000000,25,0.0


In [1940]:
preprocessor = ColumnTransformer(
    transformers=[
        ('age_missing', SimpleImputer(strategy='mean'), ['Age']),
        ('cat_state', CountEncoder(normalize=True), ['State']),
        ('education_ordinal', OrdinalEncoder(), ['Education'])
    ])
sklearn.set_config(transform_output="pandas")

In [1941]:
preprocessor.fit_transform(X_train)

Unnamed: 0,age_missing__Age,cat_state__State,education_ordinal__Education
55,37.586667,0.1500,0.0
88,34.000000,0.3125,2.0
26,37.586667,0.3125,0.0
42,25.000000,0.1250,2.0
69,30.000000,0.1875,1.0
...,...,...,...
60,26.000000,0.1250,0.0
71,45.000000,0.1875,0.0
14,32.000000,0.3125,0.0
92,49.000000,0.3125,0.0


In [1942]:
# parameters
import pandas as pd
import numpy as np
import category_encoders as ce

# Simulating a dataset
np.random.seed(42)  # For reproducibility

data = {
    'State': np.random.choice(
        ['Karnataka', 'Tamil Nadu', 'Maharashtra', 'Delhi', 'Telangana', np.nan],
        size=100
    ),
    'Education': np.random.choice(
        ['High School', 'UG', 'PG', np.nan],
        size=100
    )
}

df = pd.DataFrame(data)
df.replace("nan", np.nan, inplace=True)

df.head(25)

Unnamed: 0,State,Education
0,Delhi,PG
1,Telangana,High School
2,Maharashtra,High School
3,Telangana,High School
4,Telangana,PG
5,Tamil Nadu,High School
6,Maharashtra,
7,Maharashtra,High School
8,Maharashtra,
9,Telangana,


In [1943]:
df.isnull().sum()

State        17
Education    23
dtype: int64

# Count Encoder Parameters: `handle_unknown`, `handle_missing`, `min_group_size`, `min_group_name`, `combine_min_nan_groups`

When applying **Count Encoding** using libraries such as `category_encoders`, we have several important parameters that control how categories are grouped and how special cases are handled.  

---

## 1. `handle_unknown`

This parameter defines the behavior when the model encounters **unseen categories** (i.e., categories that were not present during training but appear during inference).

- **Options:**
  - `"error"` → Raises an error if an unknown category is found.  
  - `"value"` → Encodes the unknown category with a fixed placeholder value; in the example later in this section, an unseen state is encoded as `0`.  
  - `"return_nan"` → Returns `NaN` for unknown categories, leaving it for the user to handle later.

- **Example:**  
  If training data has breeds `{Labrador, Beagle, Siamese}` and test data contains `Pug`, the encoding behavior depends on this parameter.
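
To make the three options concrete, here is a small sketch in plain Python. It mimics the behaviour described above and is not the `category_encoders` implementation; `encode_with_counts` is a hypothetical helper, and using `0` as the placeholder for `"value"` is an assumption that matches the output shown later in this section.

```python
import math
from collections import Counter

def encode_with_counts(values, counts, handle_unknown="value"):
    """Hypothetical helper: map categories to counts learned during fit."""
    encoded = []
    for value in values:
        if value in counts:
            encoded.append(counts[value])
        elif handle_unknown == "error":
            raise ValueError(f"Unknown category: {value!r}")
        elif handle_unknown == "return_nan":
            encoded.append(math.nan)
        else:
            # "value": fall back to a fixed placeholder (assumed to be 0 here)
            encoded.append(0)
    return encoded

train_breeds = ['Labrador', 'Beagle', 'Beagle', 'Labrador', 'Siamese']
counts = Counter(train_breeds)      # learned on the training data only
test_breeds = ['Beagle', 'Pug']     # 'Pug' was never seen during fit

value_result = encode_with_counts(test_breeds, counts, "value")
nan_result = encode_with_counts(test_breeds, counts, "return_nan")
print(value_result)
print(nan_result)
```

With `"error"`, the same call raises a `ValueError` as soon as it reaches `Pug`.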

---

## 2. `handle_missing`

This parameter defines how the encoder deals with **missing values** (`NaN`) in the categorical feature.

- **Options:**
  - `"error"` → Raises an error if missing values are present.  
  - `"value"` → Treats `NaN` as a category of its own and encodes it with its count, as the `NaN` rows in `encoder.mapping` later in this section show.  
  - `"return_nan"` → Keeps them as `NaN` in the transformed dataset.  

- **Example:**  
  If the `Breed` column has some missing values, they will be encoded differently depending on this setting.

---

## 3. `min_group_size`

This parameter sets a **minimum threshold** for how many times a category must appear in the dataset.  

- If a category occurs fewer times than `min_group_size`, it will not be encoded separately.  
- Instead, such rare categories are grouped together under a single label (defined by `min_group_name`).  

**Why it’s useful:**  
- Prevents overfitting on categories that occur very rarely.  
- Ensures that categories with too few samples don’t produce unstable encodings.  

---

## 4. `min_group_name`

When categories are combined due to `min_group_size`, this parameter specifies the **name of the new grouped category**.  

- Common values: `"__other__"`, `"rare"`, or any custom string you provide.  
- All categories below the frequency threshold will be merged under this name before encoding.  
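
The grouping controlled by `min_group_size` and `min_group_name` can be sketched as follows. `group_rare` is a hypothetical helper written for illustration, not a library function, and the counts are illustrative values.

```python
def group_rare(counts, min_group_size, min_group_name):
    """Hypothetical helper: merge categories rarer than min_group_size."""
    grouped = {}
    for category, count in counts.items():
        if count < min_group_size:
            # Rare category: add its rows to the combined group
            grouped[min_group_name] = grouped.get(min_group_name, 0) + count
        else:
            grouped[category] = count
    return grouped

# Illustrative category counts
counts = {'A': 34, 'B': 22, 'C': 21, 'D': 12, 'nan': 5, 'F': 4, 'E': 2}

grouped = group_rare(counts, min_group_size=5, min_group_name='others')
print(grouped)
```

Here `F` (4 rows) and `E` (2 rows) fall below the threshold and are merged into a single `others` group of 6 rows, while `nan` (5 rows) meets the threshold and stays separate.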

---

## 5. `combine_min_nan_groups`

This parameter controls whether **rare categories (below `min_group_size`)** should be grouped together **with missing values (`NaN`)**.  

- `True` → Missing values and rare categories are combined into the same group.  
- `False` → Missing values are handled separately from rare categories.  

**Why it matters:**  
- If your dataset has both rare categories and many missing values, combining them can simplify the feature space.  
- But if you want to treat missing values differently from rare categories, keep this set to `False`.

---

##  Why These Parameters Matter

- **`handle_unknown` / `handle_missing`** → Deal with unseen or missing data.  
- **`min_group_size` / `min_group_name` / `combine_min_nan_groups`** → Control how rare categories and missing values are grouped.  

Together, these parameters give you fine-grained control over how categorical data is transformed, making your encodings more **robust**, **consistent**, and **production-ready**.

---

**Best Practice Tip**:  
- Use `min_group_size` to avoid overfitting on rare categories.  
- Set a clear `min_group_name` (e.g., `"rare"`) for transparency.  
- Only use `combine_min_nan_groups=True` if it makes sense to merge missing values with rare categories in your context.


In [1944]:
# Initialize the CountEncoder with various parameters
encoder = ce.CountEncoder(
    cols=['State', 'Education'],  # Specify columns to encode. None would automatically select categorical columns.
    handle_missing='value',  # Treat NaNs as a countable category
    handle_unknown='value',  # Encode categories unseen during fit with a placeholder (0 in the output below)
)

In [1945]:
# Fit and transform the dataset
encoder.fit_transform(df)

#print(encoded_df.head(25))

Unnamed: 0,State,Education
0,25,34
1,17,27
2,11,27
3,17,27
4,17,34
...,...,...
95,25,27
96,25,16
97,17,23
98,11,23


In [1946]:
encoder.mapping

{'State': State
 Delhi          25
 Tamil Nadu     19
 Telangana      17
 NaN            17
 Maharashtra    11
 Karnataka      11
 Name: count, dtype: int64,
 'Education': Education
 PG             34
 High School    27
 NaN            23
 UG             16
 Name: count, dtype: int64}

In [1947]:
new_data = pd.DataFrame({'State': ['Bihar'], 'Education': ['UG']})

encoder.transform(new_data)

Unnamed: 0,State,Education
0,0.0,16


In [1948]:
np.random.seed(0)  # For reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D', 'E', 'F', np.nan], size=100, p=[0.3, 0.25, 0.15, 0.15, 0.05, 0.05, 0.05]),
    'Value': np.random.rand(100)
}

df = pd.DataFrame(data)

df.sample(10)


Unnamed: 0,Category,Value
91,C,0.209844
29,B,0.290078
2,C,0.735194
50,C,0.149448
44,C,0.806194
78,A,0.704414
33,C,0.298282
65,B,0.855803
75,A,0.223925
45,C,0.703889


In [1949]:
df['Category'].value_counts()

Category
A      34
B      22
C      21
D      12
nan     5
F       4
E       2
Name: count, dtype: int64

In [1950]:
encoder = ce.CountEncoder(
    cols=['Category'],
    min_group_size=5,  # Groups with counts less than 5 will be combined
    min_group_name='others',  # Custom name for the combined group of rare categories
)

# Fit and transform the dataset
encoded_df = encoder.fit_transform(df['Category'])

# Display the original and encoded data for comparison
df['Encoded'] = encoded_df
print(df.head(20))

   Category     Value  Encoded
0         B  0.677817       22
1         D  0.270008       12
2         C  0.735194       21
3         B  0.962189       22
4         B  0.248753       22
5         C  0.576157       21
6         B  0.592042       22
7         E  0.572252        6
8       nan  0.223082        5
9         B  0.952749       22
10        D  0.447125       12
11        B  0.846409       22
12        C  0.699479       21
13        F  0.297437        6
14        A  0.813798       34
15        A  0.396506       34
16        A  0.881103       34
17        D  0.581273       12
18        D  0.881735       12
19        E  0.692532        6


In [1951]:
encoder.mapping

{'Category': Category
 A         34
 B         22
 C         21
 D         12
 nan        5
 others     6
 Name: count, dtype: int64}

# Binary Encoding

Binary Encoding is a technique used for converting categorical data into numerical format for machine learning models.  

Unlike **One-Hot Encoding**:
- One-hot creates a new column for every unique category (which can lead to very high dimensionality if categories are many).
- Binary Encoding reduces dimensionality by first converting categories into integers, then representing those integers in **binary form**, and finally splitting the binary digits across multiple columns.

---

## How It Works

1. Assign an **integer value** to each category.  
2. Convert these integer values into **binary numbers**.  
3. Split the binary digits into separate columns (bits).  

This way, fewer columns are created compared to one-hot encoding.

---

## Example

Consider a dataset of fruits:

### Original Dataset

| Item  | Fruit      |
|-------|-----------|
| Item1 | Apple     |
| Item2 | Banana    |
| Item3 | Cherry    |
| Item4 | Date      |
| Item5 | Elderberry|
| Item6 | Fig       |
| Item7 | Grape     |
| Item8 | Honeydew  |

---

### Step 1: Assign Integers

| Fruit      | Integer |
|------------|---------|
| Apple      | 1       |
| Banana     | 2       |
| Cherry     | 3       |
| Date       | 4       |
| Elderberry | 5       |
| Fig        | 6       |
| Grape      | 7       |
| Honeydew   | 8       |

---

### Step 2: Convert Integers → Binary

- 1 → `0001`  
- 2 → `0010`  
- 3 → `0011`  
- 4 → `0100`  
- 5 → `0101`  
- 6 → `0110`  
- 7 → `0111`  
- 8 → `1000`  

---

### Step 3: Split Binary into Columns

| Item  | Fruit      | Bit1 | Bit2 | Bit3 | Bit4 |
|-------|-----------|------|------|------|------|
| Item1 | Apple     | 0    | 0    | 0    | 1    |
| Item2 | Banana    | 0    | 0    | 1    | 0    |
| Item3 | Cherry    | 0    | 0    | 1    | 1    |
| Item4 | Date      | 0    | 1    | 0    | 0    |
| Item5 | Elderberry| 0    | 1    | 0    | 1    |
| Item6 | Fig       | 0    | 1    | 1    | 0    |
| Item7 | Grape     | 0    | 1    | 1    | 1    |
| Item8 | Honeydew  | 1    | 0    | 0    | 0    |
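
The three steps can be reproduced by hand with a short plain-Python sketch. It assumes integers are assigned from 1 in order of first appearance, as in the table above; `to_bits` is a helper name introduced here.

```python
fruits = ['Apple', 'Banana', 'Cherry', 'Date', 'Elderberry', 'Fig', 'Grape', 'Honeydew']

# Step 1: assign integers starting at 1, in order of first appearance
codes = {}
for fruit in fruits:
    codes.setdefault(fruit, len(codes) + 1)

# Number of bit columns needed to represent the largest integer
n_bits = max(codes.values()).bit_length()

def to_bits(code, width):
    """Steps 2 and 3: write the integer in binary and split it into one value per bit."""
    return [int(bit) for bit in format(code, f'0{width}b')]

encoded = {fruit: to_bits(code, n_bits) for fruit, code in codes.items()}

for fruit, bits in encoded.items():
    print(fruit, bits)
```

Eight categories need 4 bit columns here, whereas one-hot encoding would need 8. The gap widens with cardinality: 1,000 categories would fit in 10 bit columns.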

---

## Advantages 
- **Reduces dimensionality** compared to One-Hot Encoding.  
- Works well with **high-cardinality categorical features**.  
- Every category still gets a **unique bit pattern**, so category identity is preserved.  

## Disadvantages 
- Slightly more complex than One-Hot or Label Encoding.  
- Models may **misinterpret binary patterns** as having ordinal meaning.  
- Not as interpretable as One-Hot.  

---

## Practical Use Cases
- Large datasets with **high-cardinality categorical variables** (e.g., user IDs, product codes, zip codes).  
- Works well with **tree-based models** (Random Forest, XGBoost, LightGBM).  
- Useful when **memory and computation cost** are important concerns.  

---


In [1952]:
import pandas as pd
import category_encoders as ce

# Sample dataset
data = {
    'Item': ['Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8'],
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Elderberry', 'Fig', 'Grape', 'Honeydew']
}
df = pd.DataFrame(data)

df

Unnamed: 0,Item,Fruit
0,Item1,Apple
1,Item2,Banana
2,Item3,Cherry
3,Item4,Date
4,Item5,Elderberry
5,Item6,Fig
6,Item7,Grape
7,Item8,Honeydew


In [1953]:
# Initialize the Binary Encoder
encoder = ce.BinaryEncoder(cols=['Fruit'], return_df=True)

# Fit and transform the data
df_encoded = encoder.fit_transform(df)

# Display the encoded data
print(df_encoded)

    Item  Fruit_0  Fruit_1  Fruit_2  Fruit_3
0  Item1        0        0        0        1
1  Item2        0        0        1        0
2  Item3        0        0        1        1
3  Item4        0        1        0        0
4  Item5        0        1        0        1
5  Item6        0        1        1        0
6  Item7        0        1        1        1
7  Item8        1        0        0        0


### Target Encoder/Mean Encoder

In [1954]:
# using category_encoder

import pandas as pd
import category_encoders as ce

# Sample data
data = {
    'Feature': ['A', 'B', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Target': [1, 0, 0, 1, 1, 1, 0, 1]
}
df = pd.DataFrame(data)

# Separating the feature and target columns
X = df.drop('Target', axis=1)
y = df['Target']

# Initialize the TargetEncoder
encoder = ce.TargetEncoder(cols=['Feature'])

# Fit the encoder using the feature data and target variable
encoder.fit(X, y)

# Transform the data
encoded = encoder.transform(X)

# Show the original and encoded data
print(pd.concat([df, encoded], axis=1))


   Feature  Target   Feature
0        A       1  0.631436
1        B       0  0.579948
2        A       0  0.631436
3        B       1  0.579948
4        C       1  0.678194
5        A       1  0.631436
6        B       0  0.579948
7        C       1  0.678194


In [1955]:
encoder.mapping

{'Feature': Feature
  1    0.631436
  2    0.579948
  3    0.678194
 -1    0.625000
 -2    0.625000
 dtype: float64}

In [1956]:
# using sklearn
import pandas as pd
from sklearn.preprocessing import TargetEncoder

# Sample data
data = {
    'Feature': ['A', 'B', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Target': [1, 0, 0, 1, 1, 1, 0, 1]
}
df = pd.DataFrame(data)

# Separating the feature and target columns
X = df.drop('Target', axis=1)
y = df['Target']

# Initialize the TargetEncoder
encoder = TargetEncoder(smooth=0.0)

# Fit the encoder using the feature data and target variable
encoder.fit(X, y)

# Transform the data
encoded = encoder.transform(X)

encoded


Unnamed: 0,Feature
0,0.666667
1,0.333333
2,0.666667
3,0.333333
4,1.0
5,0.666667
6,0.333333
7,1.0
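
The two outputs above differ because of smoothing. The sketch below, in plain Python, computes the plain per-category target mean and a smoothed version that blends it with the global mean. The sigmoid weight and the values `min_samples_leaf=20` and `smoothing=10` are assumptions made here because they reproduce the `category_encoders` numbers shown above; they are not taken from the library's documentation.

```python
import math
from collections import defaultdict

features = ['A', 'B', 'A', 'B', 'C', 'A', 'B', 'C']
targets = [1, 0, 0, 1, 1, 1, 0, 1]

# Per-category target sum and row count
sums, counts = defaultdict(float), defaultdict(int)
for feature, target in zip(features, targets):
    sums[feature] += target
    counts[feature] += 1

global_mean = sum(targets) / len(targets)
category_mean = {c: sums[c] / counts[c] for c in counts}  # plain mean encoding

def smoothed(category, min_samples_leaf=20, smoothing=10):
    """ASSUMPTION: sigmoid weight that grows with the category's row count."""
    weight = 1 / (1 + math.exp(-(counts[category] - min_samples_leaf) / smoothing))
    return global_mean * (1 - weight) + category_mean[category] * weight

for c in sorted(counts):
    print(c, round(category_mean[c], 6), round(smoothed(c), 6))
```

With only two or three rows per category, the weight on the category mean is small, so the smoothed values sit close to the global mean of 0.625. The plain means match the scikit-learn cell, where `smooth=0.0` was passed.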


# Weight of Evidence (WoE) Encoding

**Weight of Evidence (WoE)** is a powerful technique used primarily in **credit risk modeling** and other areas of **financial analytics**, but its utility extends to any predictive modeling task that involves categorical variables.  

WoE encoding transforms categorical variables into a **continuous scale**, representing the logarithmic ratio of the distribution of *"good"* outcomes to the distribution of *"bad"* outcomes within each category.  
This makes WoE particularly useful for **binary classification problems**.

---

## Formula for Weight of Evidence


$$
\text{WoE} = \ln \left( \frac{\text{Distribution of Good}}{\text{Distribution of Bad}} \right)
$$

Where:  

- **Distribution of Good** → Share of all *positive* ("Good") outcomes that fall in the category  
- **Distribution of Bad** → Share of all *negative* ("Bad") outcomes that fall in the category  
- **ln** → Natural logarithm  

---

In practice, WoE helps in:  
- Capturing the predictive power of categorical variables  
- Handling variables with strong separation between classes  
- Making features more interpretable in models such as **Logistic Regression**


### WoE Encoding: A Practical Example 

Following the theory, let's walk through a practical example of calculating the **Weight of Evidence (WoE)**. We will encode the `EmploymentStatus` categorical feature based on its relationship with the `LoanOutcome` target variable.

**Our Goal:** Convert the categories 'Employed', 'Unemployed', and 'Student' into meaningful numerical WoE scores.

**Initial Dataset:**
| ApplicationID | EmploymentStatus | LoanOutcome |
|:-------------:|:----------------:|:-----------:|
| 1             | Employed         | Good        |
| 2             | Unemployed       | Bad         |
| 3             | Employed         | Bad         |
| 4             | Student          | Good        |
| 5             | Employed         | Good        |
| 6             | Unemployed       | Bad         |
| 7             | Student          | Bad         |
| 8             | Employed         | Good        |

---

#### Step 1: Calculate Frequency of Good and Bad Outcomes

First, we group by each category in `EmploymentStatus` and count the number of "Good" (event) and "Bad" (non-event) outcomes.

| EmploymentStatus | Good | Bad |
|:-----------------|:----:|:---:|
| Employed         | 3    | 1   |
| Unemployed       | **0**| 2   |
| Student          | 1    | 1   |
| **Total** | **4**| **4**|

Here, we notice the **"Unemployed"** category has zero "Good" outcomes, which requires special handling.

---

#### Step 2: Calculate the Distribution of Good and Bad

Next, we find the proportion of all "Good" and "Bad" outcomes that belong to each category.

* **Distribution of Good** = (Count of 'Good' in Category) / (Total 'Good')
* **Distribution of Bad** = (Count of 'Bad' in Category) / (Total 'Bad')

| EmploymentStatus | Distribution of Good | Distribution of Bad |
|:-----------------|:--------------------:|:-------------------:|
| Employed         | 3 / 4 = **0.75** | 1 / 4 = **0.25** |
| Unemployed       | 0 / 4 = **0.00** | 2 / 4 = **0.50** |
| Student          | 1 / 4 = **0.25** | 1 / 4 = **0.25** |

---

#### Step 3: Apply the WoE Formula & Handle Division by Zero

The WoE is the natural logarithm of the ratio of these distributions.

$$
\text{WoE} = \ln \left( \frac{\text{Distribution of Good}}{\text{Distribution of Bad}} \right)
$$

For the "Unemployed" category, the calculation `ln(0.00 / 0.50)` results in `ln(0)`, which is mathematically undefined.

To solve this, we adjust the original count of `0` to a very small number (e.g., `0.001`) to prevent this error.

**Adjusted Counts:**
* **Good outcomes for "Unemployed"**: `0.001`
* **New Total Good outcomes**: `3 + 1 + 0.001 = 4.001`

---

#### Step 4: Recalculate Distributions and Final WoE Scores

Using the adjusted total, we re-calculate the WoE for all categories to ensure consistency.

| EmploymentStatus | Calculation                                | Final WoE |
|:-----------------|:-------------------------------------------|:---------:|
| Employed         | $\ln \left( (3/4.001) / (1/4) \right)$     | **+1.098**|
| Unemployed       | $\ln \left( (0.001/4.001) / (2/4) \right)$ | **-7.601**|
| Student          | $\ln \left( (1/4.001) / (1/4) \right)$     | **0.000** |
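
Steps 1 to 4 can be checked with a short plain-Python sketch. It follows the worked example, including the substitution of `0.001` for a zero count; the names `EPS`, `adjusted`, and `woe` are introduced here for illustration.

```python
import math

# (Good, Bad) counts per category, from Step 1
counts = {'Employed': (3, 1), 'Unemployed': (0, 2), 'Student': (1, 1)}

EPS = 0.001  # small value substituted for a zero count, as in Step 3

# Replace zero counts, then recompute the totals on the adjusted counts
adjusted = {
    category: (good if good > 0 else EPS, bad if bad > 0 else EPS)
    for category, (good, bad) in counts.items()
}
total_good = sum(good for good, _ in adjusted.values())
total_bad = sum(bad for _, bad in adjusted.values())

# WoE = ln(Distribution of Good / Distribution of Bad)
woe = {
    category: math.log((good / total_good) / (bad / total_bad))
    for category, (good, bad) in adjusted.items()
}

for category, value in woe.items():
    print(f"{category}: {value:+.3f}")
```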


---

#### Step 5: Final Encoded Table and Interpretation 

The final step is to map these WoE scores back to the original categories. This new numerical feature is now ready for modeling.

| EmploymentStatus | Final WoE | Risk Interpretation     |
|:-----------------|:---------:|:------------------------|
| Employed         | +1.098    | **Lower Risk** |
| Unemployed       | -7.601    | **Very High Risk** |
| Student          | 0.000     | **Neutral / No Insight**|

Final Encoded Dataset

Now, replacing `EmploymentStatus` with WoE scores:

| ApplicationID | EmploymentStatus (WoE) | LoanOutcome |
|:-------------:|:----------------------:|:-----------:|
| 1             | +1.098                 | Good        |
| 2             | -7.601                 | Bad         |
| 3             | +1.098                 | Bad         |
| 4             | 0.000                  | Good        |
| 5             | +1.098                 | Good        |
| 6             | -7.601                 | Bad         |
| 7             | 0.000                  | Bad         |
| 8             | +1.098                 | Good        |



###  Monotonic Relationship in Weight of Evidence (WoE)

One of the **biggest advantages** of **WoE encoding** is that it creates (or enforces) a **monotonic relationship** between the encoded feature and the **target variable** (usually "Good" vs. "Bad").  

---

####  What does monotonic mean here?

A **monotonic relationship** means that as the **WoE value increases**, the probability of the target being "Good" (or "Bad") moves in **one consistent direction** — it either always increases or always decreases, but never switches back and forth.

- With WoE defined as ln(Good / Bad), as in this section: **WoE increases → probability of Good increases**.  
- If the ratio is inverted to ln(Bad / Good), the direction flips but stays consistent: **WoE increases → probability of Good decreases**.  

This ensures that the model learns a **clean, ordered relationship** between categories and the outcome.

---

####  Example (from our dataset)

Final WoE scores for `EmploymentStatus`:

| EmploymentStatus | Final WoE | Interpretation     |
|------------------|-----------|--------------------|
| Employed         | +1.098    | Lower Risk (more Good) |
| Student          | 0.000     | Neutral            |
| Unemployed       | -7.601    | Very High Risk (more Bad) |

We can observe the monotonic pattern:

- **Unemployed (-7.601)** → Strongly associated with **Bad** outcomes.  
- **Student (0.000)** → Balanced / neutral.  
- **Employed (+1.098)** → Strongly associated with **Good** outcomes.  

As WoE **increases from -7.601 → 0.000 → +1.098**, the likelihood of "Good" outcomes also **increases monotonically**.  
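
A quick way to check this pattern is to sort the categories by WoE and confirm that the share of "Good" outcomes never drops. The sketch below uses the WoE scores and the Good/Bad counts from the worked example; the variable names are illustrative.

```python
# (WoE score, share of Good outcomes within the category)
table = {
    'Employed':   (1.098, 3 / 4),
    'Student':    (0.000, 1 / 2),
    'Unemployed': (-7.601, 0 / 2),
}

# Sort categories by WoE and read off the Good rate in that order
ordered = sorted(table.items(), key=lambda item: item[1][0])
ordered_names = [name for name, _ in ordered]
good_rates = [rate for _, (_, rate) in ordered]

# Monotonic: each Good rate is at least as large as the previous one
is_monotonic = all(a <= b for a, b in zip(good_rates, good_rates[1:]))

print(ordered_names)
print(good_rates)
print(is_monotonic)
```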

---

####  Why is this useful?

1. **Model Stability**: Logistic regression and credit risk models prefer features that have a **linear / monotonic relationship** with the log-odds of the target. WoE encoding directly provides this.  
2. **Interpretability**: You can interpret categories in terms of risk. A higher WoE always means "less risky" (or more Good), which makes sense to domain experts.  
3. **Lower Model Complexity**: Unlike one-hot encoding, WoE summarizes categories into a single ordered numeric column, which reduces the number of parameters the model has to fit.  

---

 **In simple terms:**  
WoE transforms messy categories into numbers that line up nicely with the target variable, so the model sees a clean "increasing or decreasing" pattern.  


####  Advantages
1. **Monotonic Relationship**  
   - Ensures a clear, ordered relationship between categories and the target variable.  
   - Very useful for logistic regression and credit risk models.  

2. **Interpretability**  
   - Easy to explain to domain experts (e.g., higher WoE = lower risk).  
   - Values directly relate to "risk" or "likelihood" of an event.  

3. **Handles Categorical Variables Well**  
   - Converts categories into continuous numeric values, reducing dimensionality compared to one-hot encoding.  

4. **Stability**  
   - More stable than raw categories when applied to datasets with many categories.  

---

####  Disadvantages
1. **Requires a Binary Target**  
   - Defined for **binary classification problems** (e.g., Good vs. Bad).  
   - Not directly suitable for regression or multi-class classification.  

2. **Dependent on Target Information**  
   - WoE uses the target variable for encoding, so it must be fitted on the **training split only** and then applied to the test split, to avoid data leakage.  

3. **Handling of Zero Frequencies**  
   - If a category has **0 Good or 0 Bad outcomes**, WoE becomes undefined (`ln(0)` issue).  
   - Requires smoothing (e.g., replacing 0 with 0.001).  

4. **Not Always Generalizable**  
   - WoE encoding may overfit if calculated on small datasets or categories with very few observations.  

---

####  Use Cases
1. **Credit Risk Modeling**  
   - Widely used in banking to score customers (Good = repays loan, Bad = defaults).  

2. **Fraud Detection**  
   - Helps quantify the likelihood of fraud based on categorical attributes.  

3. **Insurance Risk Assessment**  
   - Used to assess customer claim risks based on categorical factors (e.g., job type, region).  

4. **Any Binary Classification Task**  
   - When interpretability and monotonic relationships are desired.  

---

In [1959]:
import pandas as pd
import category_encoders as ce

# Example dataset
data = {
    'Feature': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
    'Target': [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)

# Define the features and target
X = df[['Feature']]
y = df['Target']

# Initialize and fit the WOEEncoder
encoder = ce.WOEEncoder(cols=['Feature'])
X_encoded = encoder.fit_transform(X, y)

# Display the original and encoded data
df['Feature_Encoded'] = X_encoded
print(df)


  Feature  Target  Feature_Encoded
0       A       1         0.000000
1       B       0        -0.405465
2       A       0         0.000000
3       C       1         0.405465
4       B       1        -0.405465
5       A       0         0.000000
6       C       1         0.405465
7       B       0        -0.405465
8       A       1         0.000000
9       C       0         0.405465
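
The values above can be reproduced by hand. The sketch below assumes the encoder adds a regularization term of `1.0` to each Good and Bad count (and twice that to each total) instead of substituting `0.001` for zeros. This additive form is an assumption chosen because it matches the printed output; it is not a statement of the library's documented behaviour.

```python
import math

features = ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C']
targets = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

REG = 1.0  # ASSUMPTION: additive smoothing term, keeps every ratio finite

total_good = sum(targets)
total_bad = len(targets) - total_good

woe = {}
for category in sorted(set(features)):
    good = sum(t for f, t in zip(features, targets) if f == category)
    bad = sum(1 - t for f, t in zip(features, targets) if f == category)
    # Smoothed share of Good and of Bad outcomes falling in this category
    dist_good = (good + REG) / (total_good + 2 * REG)
    dist_bad = (bad + REG) / (total_bad + 2 * REG)
    woe[category] = math.log(dist_good / dist_bad)

for category, value in woe.items():
    print(category, round(value, 6))
```

Because every count is shifted away from zero, no category can produce `ln(0)`, which is the same problem the `0.001` adjustment in the manual example addresses.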


### Choosing the Right Categorical Encoding

---

####  One-Hot Encoder (OHE)
 
- Best for **low cardinality features** (few unique categories).  
- Ideal in **linear models** and **neural networks**, where preserving category distinction without implying order is crucial.  
-  Beware of **dimensionality explosion** if categories are many.  

---

####  Ordinal Encoder (OE)
 
- Suitable when the categorical variable has a **natural, meaningful order** (e.g., ratings, education levels).  
- Ensure your model can properly handle this **imposed ordinality**.  
- Not suitable for **nominal data** where no order exists.  

---

####  Count Encoder

- Captures the **frequency signal** of categories — great for large datasets.  
- Distinct categories that happen to share the same count receive the same code, and counts for **rare categories** are noisy, so interpret them with care.  
- Often paired with other encoders or ensemble models for **better stability**.  

---

####  Binary Encoder
 
- Useful for **medium to high cardinality features** where OHE is impractical.  
- Reduces dimensionality efficiently while preserving more information than simple ordinal encoding.  
- Ensure binary patterns remain **meaningful to your model**.  

---

####  Target Encoder

- Powerful when there is a **strong relationship** between the category and the target.  
- Well suited to **high-cardinality features**, because each category collapses to a single number regardless of how many categories exist.  
- Always use **smoothing** and **cross-validation** to prevent **overfitting** and **leakage**.  

---

####  Weight of Evidence (WoE)

- Excels in **binary classification tasks**, especially in **financial and risk domains**.  
- Transforms each category into a **log-ratio score** showing how strongly it separates the two classes, which aids interpretability.  
- Ensure the **target is binary** and apply smoothing for categories with no events to avoid infinite values.  

---

### General Wisdom

1. **Understand Your Data**  
   - The best encoding depends heavily on your categorical data and problem context.  

2. **Experimentation is Key**  
   - There is no universal best method. Test multiple encoders and compare performance.  

3. **Guard Against Leakage**  
   - Target-based encoders (Target Encoding, WoE) must be applied carefully with proper cross-validation.  

4. **Balance Complexity and Interpretability**  
   - Complex encodings may boost performance but reduce interpretability.  
   - Always align the choice with your project goals.  

---


###  Categorical Encoding Techniques: Overview

---

#### 1. Ordinal Encoder
**How it works:** Converts each category into a unique integer, assigned either by a user-supplied order or, by default, by order of appearance or alphabetical order.  

**Advantages:**  
- Simple to implement and understand.  
- Preserves order when the integer mapping follows the categories' natural order.  

**Disadvantages:**  
- Imposes an ordinal relationship that may not exist, potentially misleading the model.  
- Not suitable for non-ordinal data or models sensitive to numerical relationships.  

**Use Cases:**  
- Tree-based models where ordinal relationships are useful.  
- Situations where natural order of categories carries meaningful information.  
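
A minimal sketch of ordinal encoding with an explicit order, using a hypothetical education feature. The point is that the order is stated by the analyst rather than inferred from appearance or alphabet, either of which would scramble it here.

```python
# Hypothetical ordered feature; the ranking is supplied explicitly.
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
rank = {level: i for i, level in enumerate(education_order)}

values = ['Master', 'High School', 'PhD', 'Bachelor', 'Master']
encoded = [rank[v] for v in values]
print(encoded)  # [2, 0, 3, 1, 2]
```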

---

#### 2. One-Hot Encoder
**How it works:** Creates a separate binary column for each category level, with a 1 indicating presence.  

**Advantages:**  
- Prevents artificial ordinal relationships.  
- Easy to interpret and implement.  

**Disadvantages:**  
- Can lead to large increases in dimensionality with high-cardinality features.  
- Not efficient for models that inherently handle categorical data.  

**Use Cases:**  
- Linear models or neural networks requiring explicit numeric conversion.  
- Datasets with a small number of unique categories.  
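
To make the mechanics concrete, here is a small standard-library sketch with a hypothetical `colors` feature. Each row gets exactly one 1, in the column belonging to its category.

```python
colors = ['red', 'green', 'blue', 'green']
levels = sorted(set(colors))  # column order: ['blue', 'green', 'red']

# One column per level; 1 marks the level present in that row.
one_hot = [[1 if value == level else 0 for level in levels] for value in colors]

for value, row in zip(colors, one_hot):
    print(value, row)
# red [0, 0, 1]
# green [0, 1, 0]
# blue [1, 0, 0]
# green [0, 1, 0]
```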

---

#### 3. Binary Encoder
**How it works:** Converts categories to ordinal numbers, encodes as binary, and splits digits into separate columns.  

**Advantages:**  
- More space-efficient than OHE for high-cardinality features.  
- Mitigates the curse of dimensionality by using roughly log2(k) columns for k categories.  

**Disadvantages:**  
- Introduces a form of ordinality that may not be inherent.  
- Binary representation can be less intuitive.  

**Use Cases:**  
- High-cardinality features where OHE is impractical.  
- Models benefiting from reduced dimensionality but unable to exploit categorical nature directly.  
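
The three steps described above (ordinal index, binary digits, one column per digit) can be sketched as follows. The helper `binary_encode` is hypothetical, and numbering categories from 1 in order of first appearance is an assumption of this sketch; a library may number or order them differently.

```python
def binary_encode(values):
    """Map categories to integers by first appearance, then to fixed-width bit lists."""
    index = {}
    for v in values:
        index.setdefault(v, len(index) + 1)  # assumption: codes start at 1
    width = max(index.values()).bit_length()  # number of binary columns needed
    return {cat: [int(bit) for bit in format(i, f'0{width}b')] for cat, i in index.items()}

cities = ['Paris', 'Tokyo', 'Lima', 'Cairo', 'Oslo', 'Tokyo']
codes = binary_encode(cities)
for city, bits in codes.items():
    print(city, bits)
# Paris [0, 0, 1]
# Tokyo [0, 1, 0]
# Lima [0, 1, 1]
# Cairo [1, 0, 0]
# Oslo [1, 0, 1]
```

Five categories fit into three columns, where one-hot encoding would need five. Note that Lima and Tokyo share a set bit even though nothing relates them, which is the artificial structure listed under the disadvantages.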

---

#### 4. BaseN Encoder
**How it works:** Generalization of binary encoding using base-N representation for flexible dimensionality control.  

**Advantages:**  
- Customizable balance between dimensionality and information retention.  
- Controls expansion of feature space.  

**Disadvantages:**  
- Base selection requires tuning.  
- Interpretation can be complex depending on chosen base.  

**Use Cases:**  
- Datasets where neither binary nor OHE is satisfactory.  
- Scenarios requiring tunable categorical encoding.  
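
The trade-off controlled by the base is easiest to see by counting columns. The sketch below uses two hypothetical helpers: `digits_needed` returns how many base-N digits are required to give each category a distinct code (numbered from 0), and `to_base` writes one code as digits.

```python
def digits_needed(n_categories, base):
    """Columns required to give n_categories distinct codes in the given base (base >= 2)."""
    columns, capacity = 1, base
    while capacity < n_categories:
        columns += 1
        capacity *= base
    return columns

def to_base(number, base, width):
    """Write a non-negative integer as `width` digits in the given base, most significant first."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]

for base in (2, 3, 4, 10):
    print(base, digits_needed(1000, base))
# 2 10
# 3 7
# 4 5
# 10 3

print(to_base(5, 3, 2))  # [1, 2]
```

A higher base means fewer columns, but each column then takes more distinct values and carries a stronger implied ordering, which is why the base needs tuning.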

---

#### 5. Target Encoder
**How it works:** Replaces category with mean of target variable for that category.  

**Advantages:**  
- Captures information about the target, improving model performance.  
- Produces a single numeric column regardless of cardinality, although categories with similar target means become indistinguishable.  

**Disadvantages:**  
- Risk of overfitting and data leakage if not properly regularized.  
- Mean calculation must avoid including validation/test sets.  

**Use Cases:**  
- Models capturing the relationship between features and target.  
- Categories with a direct relationship to the target variable.  
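
A minimal sketch of the leakage rule stated above: the category means are learned from the training rows only and then looked up for the test rows. The data and the helper `fit_target_means` are hypothetical, and falling back to the global training mean for an unseen category is an assumption of this sketch.

```python
from collections import defaultdict

def fit_target_means(categories, targets):
    """Learn the target mean per category, plus the global mean, from training data only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

train_cat = ['A', 'B', 'A', 'C', 'B', 'A']
train_y = [1, 0, 0, 1, 1, 0]
test_cat = ['A', 'C', 'D']  # 'D' never appears in training

means, global_mean = fit_target_means(train_cat, train_y)
test_encoded = [means.get(c, global_mean) for c in test_cat]
print([round(v, 3) for v in test_encoded])  # [0.333, 1.0, 0.5]
```

Category C is encoded as 1.0 on the strength of a single training row, which is exactly the overfitting risk that smoothing is meant to address.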

---

#### 6. James-Stein Encoder
**How it works:** Uses James-Stein estimator to shrink category means towards the global mean, regularizing small samples.  

**Advantages:**  
- Improves estimates for small categories.  
- Prevents overfitting by shrinking extreme values.  

**Disadvantages:**  
- More complex and less intuitive than simpler methods.  
- Gains depend on dataset context.  

**Use Cases:**  
- Regression tasks with continuous targets and many small categories.  

---

#### 7. M-estimate Encoder
**How it works:** Simplified target encoding with smoothing to balance category mean and global mean.  

**Advantages:**  
- Reduces overfitting for small sample categories.  
- Simple implementation with a smoothing parameter.  

**Disadvantages:**  
- Requires careful tuning of smoothing.  
- Can still be prone to data leakage without proper cross-validation.  

**Use Cases:**  
- Target encoding with control over small sample influence.  
- Regression and binary classification with categorical features.  
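
The smoothing can be written as a one-line formula: the category's target sum is padded with `m` pseudo-observations at the global mean. The sketch below assumes this additive form; the function name `m_estimate` and the numbers are illustrative.

```python
def m_estimate(cat_sum, cat_count, prior, m=1.0):
    """Blend the category mean with the global mean (prior); larger m pulls harder to the prior."""
    return (cat_sum + prior * m) / (cat_count + m)

prior = 0.5  # hypothetical global target mean

# A category seen once (target = 1) versus one seen 100 times (80 positives).
rare_m1 = m_estimate(1, 1, prior, m=1)
rare_m10 = m_estimate(1, 1, prior, m=10)
common_m10 = m_estimate(80, 100, prior, m=10)

print(round(rare_m1, 4), round(rare_m10, 4), round(common_m10, 4))
# 0.75 0.5455 0.7727
```

The rare category moves from its raw mean of 1.0 most of the way to the prior, while the well-populated category stays close to its raw mean of 0.8.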

---

#### 8. Weight of Evidence (WoE)
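**How it works:** Replaces each category with the log of the ratio between its share of one target class and its share of the other, which equals the category's log-odds relative to the overall log-odds.  

**Advantages:**  
- Interpretable as a log odds ratio relative to the overall population.  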
- Handles binary targets naturally.  

**Disadvantages:**  
- Only for binary classification.  
- Zero-event categories require careful handling.  

**Use Cases:**  
- Credit risk assessment and other binary classification tasks.  
- Financial and medical domains requiring interpretability.  

---

#### 9. Leave One Out Encoder
**How it works:** Similar to target encoding but leaves out the current row to reduce overfitting.  

**Advantages:**  
- Reduces overfitting compared to standard target encoding.  
- Maintains category-target relationship.  

**Disadvantages:**  
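- Training rows are encoded row by row, while unseen rows can only use the category mean, so train and test encodings may be distributed differently.  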
- May still require additional regularization.  

**Use Cases:**  
- Supervised learning with high risk of target leakage.  
- Projects prioritizing predictive accuracy over interpretability.  
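
A short sketch of the leave-one-out rule on hypothetical data: each training row is encoded with the mean target of the other rows in its category. Using the global mean for a category with a single row is an assumption of this sketch.

```python
def leave_one_out(categories, targets):
    """Encode each row with the mean target of the other rows in the same category."""
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0) + t
        counts[c] = counts.get(c, 0) + 1
    global_mean = sum(targets) / len(targets)
    encoded = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            encoded.append((sums[c] - t) / (counts[c] - 1))
        else:
            encoded.append(global_mean)  # singleton: nothing is left once the row is removed
    return encoded

cats = ['A', 'A', 'A', 'B', 'B', 'C']
y = [1, 0, 1, 1, 0, 1]
loo = leave_one_out(cats, y)
print([round(v, 4) for v in loo])  # [0.5, 1.0, 0.5, 0.0, 1.0, 0.6667]
```

Rows in the same category now receive different values depending on their own target, so a row's encoding never contains its own label.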

---

#### 10. CatBoost Encoder
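**How it works:** A variant of leave-one-out encoding inspired by CatBoost: each row is encoded using only the target values of the rows that precede it, smoothed towards a prior, so the result depends on row order.  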

**Advantages:**  
- Strong performance with categorical variables, especially in tree-based models.  
- Sophisticated smoothing to combat overfitting.  

**Disadvantages:**  
- Harder to interpret than simpler encoders.  
- Performance gains may vary by dataset.  

**Use Cases:**  
- Classification/regression with important categorical features.  
- Projects using CatBoost or tree-based models.  
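
The ordered idea can be sketched with running sums: each row sees only the targets of earlier rows in the same category. The smoothing form, the weight `a`, and the use of the global mean as the prior are assumptions of this sketch, and `ordered_target_encode` is a hypothetical name.

```python
def ordered_target_encode(categories, targets, a=1.0):
    """Encode each row from the targets of earlier rows in its category, smoothed to a prior."""
    prior = sum(targets) / len(targets)
    running_sum, running_count = {}, {}
    encoded = []
    for c, t in zip(categories, targets):
        s, n = running_sum.get(c, 0.0), running_count.get(c, 0)
        encoded.append((s + prior * a) / (n + a))  # uses history only, never the current target
        running_sum[c] = s + t
        running_count[c] = n + 1
    return encoded

cats = ['A', 'A', 'B', 'A', 'B']
y = [1, 0, 1, 1, 0]
ordered = ordered_target_encode(cats, y)
print([round(v, 4) for v in ordered])  # [0.6, 0.8, 0.6, 0.5333, 0.8]
```

The first row of each category has no history and falls back to the prior (0.6 here). Because the result depends on row order, the training data is usually shuffled first unless the order is meaningful, as with time series.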

---

#### 11. Generalized Linear Mixed Model (GLMM) Encoder
**How it works:** Uses GLMM to estimate effect of each category on the target, blending encoding with statistical modeling.  

**Advantages:**  
- Captures complex category-target relationships.  
- Statistically rigorous.  

**Disadvantages:**  
- Requires more statistical knowledge.  
- Computationally intensive.  

**Use Cases:**  
- Biostatistics or social sciences predictive modeling.  
- Complex, non-linear relationships between categorical features and target.  

---

#### 12. Sum Encoder (Effect Encoder)
**How it works:** Similar to OHE but with k-1 columns for k categories; one reference category is coded as -1 in every column, so each coefficient represents a category's deviation from the overall (grand) mean.  

**Advantages:**  
- Captures each category's effect relative to the grand mean, which is useful in linear models.  
- Sometimes more informative than standard OHE.  

**Disadvantages:**  
- Can introduce multicollinearity in linear models.  
- Interpretation less straightforward than OHE.  

**Use Cases:**  
- Linear regression models where categories are compared with the overall mean rather than with a single baseline category.  
- Situations requiring relative effect analysis.  
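
A small sketch of the coding scheme for a hypothetical three-level feature. Treating the last level as the reference is an assumption here; implementations may pick a different level or add an intercept column.

```python
def sum_contrast(levels):
    """Sum (deviation) coding: k levels -> k-1 columns; the last level is -1 in every column."""
    k = len(levels)
    matrix = {}
    for i, level in enumerate(levels):
        if i < k - 1:
            matrix[level] = [1 if j == i else 0 for j in range(k - 1)]
        else:
            matrix[level] = [-1] * (k - 1)
    return matrix

coding = sum_contrast(['low', 'mid', 'high'])
for level, row in coding.items():
    print(level, row)
# low [1, 0]
# mid [0, 1]
# high [-1, -1]
```

Each column sums to zero across the levels, which is what makes the coefficients read as deviations from the overall mean.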

---

#### 13. Polynomial Encoder
**How it works:** Encodes ordered categories with orthogonal polynomial contrasts (linear, quadratic, cubic, and so on) to capture trends across the levels.  

**Advantages:**  
- Models complex, non-linear category-target relationships.  
- Useful for trend analysis in ordered categories.  

**Disadvantages:**  
- Harder to interpret.  
- Only meaningful for ordered categorical variables, and typically assumes the levels are equally spaced.  

**Use Cases:**  
- Analysis where order matters and relationships are non-linear.  
- Regression tasks benefiting from polynomial relationships.  

---

#### 14. Backward Difference Encoder
**How it works:** Uses contrast coding in which each coefficient compares the mean target of one category level with that of the preceding level.  

**Advantages:**  
- Highlights differences between adjacent categories.  
- Reduces multicollinearity compared to OHE.  

**Disadvantages:**  
- Assumes ordinal relationship, may not always be valid.  
- Interpretation can be challenging with non-ordinal data.  

**Use Cases:**  
- Focus on changes between adjacent categories.  
- Ordered categorical data with meaningful sequence.  
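
The sketch below builds the textbook backward-difference contrast matrix and checks that, with it, the regression coefficients are the steps between adjacent level means. The matrix form is the standard statistical definition and is assumed here; a library's output may differ in scaling or include an intercept column. The level means are hypothetical.

```python
from fractions import Fraction

def backward_difference_contrast(k):
    """Rows are levels 1..k; columns are the k-1 comparisons between adjacent levels."""
    return [
        [Fraction(-(k - j), k) if i < j else Fraction(j, k) for j in range(1, k)]
        for i in range(k)
    ]

contrast = backward_difference_contrast(4)

level_means = [10, 12, 15, 19]  # hypothetical mean target per ordered level
grand_mean = Fraction(sum(level_means), len(level_means))
steps = [b - a for a, b in zip(level_means, level_means[1:])]  # [2, 3, 4]

# Intercept = grand mean, coefficients = adjacent differences: this rebuilds every level mean.
rebuilt = [grand_mean + sum(c * s for c, s in zip(row, steps)) for row in contrast]
print([float(x) for x in rebuilt])  # [10.0, 12.0, 15.0, 19.0]
```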

---

#### 15. Helmert Encoder
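**How it works:** Contrast coding that compares each level to the mean of the subsequent levels. Definitions vary: some implementations use the reverse form, comparing each level to the mean of the preceding levels.  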

**Advantages:**  
- Useful in statistical analysis, experimental designs.  
- Systematic category comparison.  

**Disadvantages:**  
- Less intuitive interpretation.  
- Assumes an order in categories.  

**Use Cases:**  
- ANOVA and contrast-based analysis.  
- Regression modeling in experimental designs.  

---

#### 16. Hashing Encoder
**How it works:** Applies a hash function to map each category to one of a fixed number of columns, so new categories can be encoded without refitting.  

**Advantages:**  
- Efficient with high-cardinality and large datasets.  
- Handles unseen categories.  

**Disadvantages:**  
- Possible collisions and loss of information.  
- Encoded features are not interpretable.  

**Use Cases:**  
- Text classification, NLP tasks.  
- Large or dynamic categorical datasets.  
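
A minimal sketch of the hashing trick with the standard library. The choice of MD5, the eight output columns, and the helper name `hash_encode` are assumptions for illustration. No fitting step is needed, which is why unseen categories are not a problem.

```python
import hashlib

def hash_encode(value, n_components=8):
    """Map a category to one of n_components columns via a stable hash."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % n_components
    row = [0] * n_components
    row[bucket] = 1
    return row

for city in ['Paris', 'Tokyo', 'Paris', 'Reykjavik']:
    print(city, hash_encode(city))
```

The same category always lands in the same column, but two different categories can land there too. With more categories than columns such collisions are guaranteed, which is the information loss noted above.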

---

#### 17. Quantile Encoder
**How it works:** Replaces each category with a chosen quantile (for example the median) of the target values within that category, instead of the mean.  

**Advantages:**  
- Captures distributional information within categories.  
- Useful for skewed or non-normal targets.  

**Disadvantages:**  
- Requires careful handling to avoid overfitting.  
- Interpretation more complex than mean-based encoding.  

**Use Cases:**  
- Regression tasks where target distribution matters.  
- Projects where target mean alone does not capture category-target relationship.  
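
A short sketch with hypothetical target values shows why the choice of statistic matters for skewed targets. Category A contains one extreme value, which dominates its mean but not its median.

```python
from statistics import mean, median

# Hypothetical target values grouped by category; 'A' has one outlier.
target_by_category = {
    'A': [10, 12, 11, 200],
    'B': [50, 55, 60, 65],
}

mean_encoding = {c: mean(v) for c, v in target_by_category.items()}
median_encoding = {c: median(v) for c, v in target_by_category.items()}

print(mean_encoding)    # {'A': 58.25, 'B': 57.5}
print(median_encoding)  # {'A': 11.5, 'B': 57.5}
```

The mean ranks A slightly above B, although three of the four A values are far below every B value; the median keeps the two categories clearly apart.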
