
## What is Data Preparation?

Data preparation is the process of cleaning, transforming, and organizing raw data into a usable format for analysis or machine learning. It is the foundational step in any data-driven project, ensuring that the dataset is accurate, consistent, and suitable for deriving meaningful insights.

### Key Steps:

A. Data Collection: Gathering data from various sources (e.g., databases, APIs, files).

B. Data Cleaning:
   - Handling missing values (e.g., filling, interpolation, or removal).
   - Removing duplicates.
   - Correcting inconsistencies and errors.

C. Data Transformation:
   - Normalization or standardization of data.
   - Encoding categorical variables (e.g., one-hot encoding, label encoding).
   - Log transformations to handle skewed data.

D. Data Integration: Merging datasets from multiple sources into a cohesive structure.

E. Data Reduction:
   - Feature selection (choosing the most relevant variables).
   - Dimensionality reduction techniques (e.g., PCA, t-SNE).
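
A minimal sketch of a few of these steps (removing duplicates, filling a missing value, and applying a log transformation), assuming a small made-up DataFrame called `raw`. The column names and the choice of the median as the fill value are illustrative assumptions, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicated row and one missing income
raw = pd.DataFrame({
    'customer': ['a', 'b', 'b', 'c'],
    'income': [1000.0, 2000.0, 2000.0, np.nan],
})

# Data cleaning: remove the duplicated row
clean = raw.drop_duplicates().copy()

# Data cleaning: fill the missing income with the median of the observed values
clean['income'] = clean['income'].fillna(clean['income'].median())

# Data transformation: log(1 + x) compresses large values and reduces skew
clean['income_log'] = np.log1p(clean['income'])

print(clean)
```

Other fill strategies (mean, interpolation, or dropping the row) follow the same pattern; which one is appropriate depends on the data.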

## What is Feature Engineering?

Feature engineering is the process of creating, modifying, or selecting input variables (features) to improve the performance of machine learning models. It transforms raw data into meaningful representations that a model can effectively utilize.

### Key Techniques:

A. Feature Creation:
   - Combining features (e.g., creating a "total purchase" feature from quantity and price).
   - Extracting features (e.g., deriving day, month, or year from a timestamp).
   - Generating interaction terms (e.g., multiplying features to capture combined effects).

B. Feature Transformation:
   - Scaling numerical features to ensure uniform ranges (e.g., Min-Max Scaling, Standard Scaling).
   - Handling outliers using methods like clipping or transformations.

C. Feature Encoding:
   - Converting categorical features into numerical formats (e.g., ordinal, one-hot, target encoding).
   - Hashing for high-cardinality categorical variables.

D. Feature Selection:
   - Filter Methods: Selecting features based on statistical tests (e.g., correlation, chi-squared tests).
   - Wrapper Methods: Using algorithms like Recursive Feature Elimination (RFE).
   - Embedded Methods: Features are selected during model training (e.g., LASSO regression).

E. Handling Temporal Data:
   - Creating lag features or rolling window aggregations for time-series data.
   - Encoding cyclical features like hour of the day or day of the year.
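
As a rough illustration of the temporal techniques listed above, the sketch below builds cyclical, lag, and rolling-window features with pandas and NumPy. The `ts` DataFrame, its values, and its column names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings; the column names are illustrative only
ts = pd.DataFrame({
    'hour': [0, 6, 12, 18, 23],
    'value': [10.0, 12.0, 15.0, 11.0, 9.0],
})

# Cyclical encoding: place the hour on a circle so that 23 and 0 end up close together
ts['hour_sin'] = np.sin(2 * np.pi * ts['hour'] / 24)
ts['hour_cos'] = np.cos(2 * np.pi * ts['hour'] / 24)

# Lag feature: the previous row's value (the first row has no predecessor, so it is NaN)
ts['value_lag1'] = ts['value'].shift(1)

# Rolling window aggregation: mean of the current and the previous row
ts['value_roll2'] = ts['value'].rolling(window=2).mean()

print(ts)
```

With the raw `hour` column, 23 and 0 are 23 units apart; in the (`hour_sin`, `hour_cos`) plane they are neighbours, which is the point of the cyclical encoding.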

## Why should we bother with Data Preparation and Feature Engineering?

- Improves Model Performance: High-quality data and engineered features enhance the predictive power of models.
- Reduces Complexity: Simplifying data through dimensionality reduction or feature selection makes models more efficient.
- Mitigates Overfitting: By selecting only relevant features, the risk of overfitting decreases.
- Enhances Interpretability: Well-engineered features make it easier to interpret model results and understand data patterns.

# Data Transformation (AKA Feature Engineering)

### **`Numerical data`**

Numerical inputs usually sit on different scales. Some algorithms are sensitive to the scale of the features: they can treat an input that ranges from 0 to 10 as having little significance compared to another that ranges from 0 to 100. Examples include KNN (which relies on distance measures) and linear regression, logistic regression, SVM, etc., when they are trained with gradient-based optimization.

In cases like this, normalization/standardization is applied to bring all features to a common scale.


| Normalization | Standardization |
|:---|:---|
| Scales values to a specific range usually between 0 and 1 | Transforms the features to have a mean of 0 and std of 1 |
| Distance-based algorithms benefit here, e.g., KNN or clustering algorithms | Commonly used with algorithms that assume roughly Gaussian (normal) features, e.g., linear regression, SVM |

NOTE THAT:

Whatever normalization/standardization you use, its parameters (like min, max, mean, std) must be learned from the train data only and then applied unchanged to the test data to ensure consistency.
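
A minimal sketch of that rule with scikit-learn's `StandardScaler`, assuming a made-up train/test split (`X_train` and `X_test` are hypothetical arrays): `fit_transform` learns the mean and standard deviation from the training data, and `transform` reuses them on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test split of a single numerical feature
X_train = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
X_test = np.array([[30.0], [60.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)  # parameters learned from X_train
print(X_test_scaled.ravel())
```

Note that the test value 60 lies outside the training range, so its scaled value exceeds every scaled training value. That is expected; refitting the scaler on the test set to "fix" it would leak test information into preprocessing.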

**We will use the following dataset for demonstration:**

In [None]:
import pandas as pd

# Sample dataset
data = {
    'Feature1': [10, 20, 30, 40, 50],
    'Feature2': [100, 200, 300, 400, 500],
    'Feature3': [1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Feature1,Feature2,Feature3
0,10,100,1
1,20,200,0
2,30,300,1
3,40,400,0
4,50,500,1


##### For standardization

In [None]:
#importing the required library
from sklearn.preprocessing import StandardScaler

# Create StandardScaler instance
scaler = StandardScaler()

# Apply standardization
standardized_data = scaler.fit_transform(df[['Feature1', 'Feature2']])

# Add standardized data back to the dataframe
df_standardized = pd.DataFrame(standardized_data, columns=['Feature1_Standardized', 'Feature2_Standardized'])
df1 = pd.concat([df, df_standardized], axis=1)
df1

Unnamed: 0,Feature1,Feature2,Feature3,Feature1_Standardized,Feature2_Standardized
0,10,100,1,-1.414214,-1.414214
1,20,200,0,-0.707107,-0.707107
2,30,300,1,0.0,0.0
3,40,400,0,0.707107,0.707107
4,50,500,1,1.414214,1.414214


In [None]:
standardized_data

array([[-1.41421356, -1.41421356],
       [-0.70710678, -0.70710678],
       [ 0.        ,  0.        ],
       [ 0.70710678,  0.70710678],
       [ 1.41421356,  1.41421356]])

##### For normalization

In [None]:
#importing the required library
from sklearn.preprocessing import MinMaxScaler

# Create MinMaxScaler instance
scaler = MinMaxScaler()

# Apply normalization
normalized_data = scaler.fit_transform(df[['Feature1', 'Feature2']])

# Add normalized data back to the dataframe
df_normalized = pd.DataFrame(normalized_data, columns=['Feature1_Normalized', 'Feature2_Normalized'])
df2 = pd.concat([df, df_normalized], axis=1)
df2

Unnamed: 0,Feature1,Feature2,Feature3,Feature1_Normalized,Feature2_Normalized
0,10,100,1,0.0,0.0
1,20,200,0,0.25,0.25
2,30,300,1,0.5,0.5
3,40,400,0,0.75,0.75
4,50,500,1,1.0,1.0
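
The outputs above can be reproduced by hand, which makes the two formulas explicit. The sketch below uses plain pandas on `Feature1`. One detail worth noting: `StandardScaler` divides by the population standard deviation (`ddof=0`), which is why the first value is -1.414214 rather than the -1.264911 that the sample standard deviation would give.

```python
import pandas as pd

feature = pd.Series([10, 20, 30, 40, 50], name='Feature1')

# Standardization: subtract the mean, divide by the population standard deviation (ddof=0)
standardized = (feature - feature.mean()) / feature.std(ddof=0)

# Normalization (min-max): shift by the minimum, divide by the range
normalized = (feature - feature.min()) / (feature.max() - feature.min())

print(standardized.round(6).tolist())
print(normalized.tolist())
```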


### **`Categorical data`**

Here, the focus is to convert non-numerical data to numerical representation so that ML algorithms can process it alongside the numerical features. There are several techniques you can use:

- One Hot encoding
- Label encoding
- Ordinal encoding
- Target (Mean) Encoding
- Frequency (Count) encoding
- Binary Encoding
- Hashing Encoding
- Leave-One-Out Encoding (LOO Encoding)
- Polynomial Encoding
- Entity Embedding

For the scope of this course track, we will cover the first six techniques, from one-hot encoding through binary encoding.

### A. One-hot encoding vs Label encoding

When working with categorical data in machine learning, one-hot encoding and label encoding are two common techniques to convert categorical variables into a format suitable for machine learning algorithms. While they serve a similar purpose, their application and effects differ significantly.

#### What is One-Hot Encoding?

One-hot encoding transforms each unique category in a categorical feature into a new binary feature column. Each column corresponds to a category and contains a 1 or 0, indicating the presence or absence of that category.

**How It Works:**

Example: A feature Color with categories: Red, Blue, Green.

| Color | Red | Blue | Green |
|-------|-----|------|-------|
| Red   |  1  |  0   |   0   |
| Blue  |  0  |  1   |   0   |
| Green |  0  |  0   |   1   |

**Pros:**

- Avoids Ordinal Relationships: Prevents unintended ordering or ranking between categories.
- Effective for Non-Ordinal Data: Ideal for features where categories are distinct and unordered.

**Cons:**

- High Dimensionality: Can significantly increase the number of features, especially for high-cardinality variables (many unique categories).
- Sparse Data: Results in a sparse matrix with many zero values, which can increase computational cost.

#### What is Label Encoding?

Label encoding assigns a unique integer to each category, converting them into numerical labels.

**How It Works:**

Example: A feature Color with categories: Red, Blue, Green.

| Color | Encoded |
|-------|---------|
| Red   |    0    |
| Blue  |    1    |
| Green |    2    |

**Pros:**

- Memory Efficient: Requires less storage compared to one-hot encoding, especially for large datasets.
- Simpler Representation: Keeps the dataset compact without creating additional columns.

**Cons:**

- Creates Ordinal Bias: Implies an ordinal relationship (e.g., Red < Blue < Green), which may mislead models that interpret numerical values as ordered.
- Not Suitable for Non-Ordinal Data: Can affect performance of algorithms sensitive to numerical magnitude (e.g., linear regression).
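
A short sketch of label encoding with scikit-learn's `LabelEncoder`, using a made-up `colors` list. Two caveats: `LabelEncoder` assigns integers in sorted (alphabetical) order, so the mapping differs from the illustrative table above, and the scikit-learn documentation describes it as intended for target labels rather than input features.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; a category's position is its integer label
mapping = {category: code for code, category in enumerate(label_encoder.classes_)}

print(mapping)           # {'Blue': 0, 'Green': 1, 'Red': 2}
print(encoded.tolist())  # [2, 0, 1, 0, 2]
```

The integers carry no meaning beyond identity, yet a model that treats them as magnitudes would read Red (2) as "larger" than Blue (0), which is the ordinal bias described above.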

#### When to Use One-Hot Encoding vs Label Encoding

| Aspect                  | One-Hot Encoding                    | Label Encoding                     |
|-------------------------|--------------------------------------|-------------------------------------|
| **Type of Data**        | Non-ordinal (no intrinsic order)    | Ordinal (has meaningful order)     |
| **Feature Cardinality** | Best for low to moderate categories | Best for low to moderate categories |
| **Impact on Models**    | Avoids ordinal bias                | May introduce ordinal bias         |
| **Dimensionality**      | Increases dimensionality            | Keeps dimensionality constant      |
| **Examples**            | City, Color                        | Grade, Size (e.g., Small, Medium, Large) |

**We’ll use a small dataset to demonstrate encoding techniques.**

In [None]:
import pandas as pd

# Sample dataset
data = {
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
    'Product': ['A', 'B', 'C', 'A', 'C'],
    'Sales': [200, 300, 150, 250, 100]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Size,City,Product,Sales
0,Small,New York,A,200
1,Medium,Los Angeles,B,300
2,Large,Chicago,C,150
3,Medium,New York,A,250
4,Small,Chicago,C,100


#### One-Hot Encoding

Library: `pandas` or `sklearn.preprocessing.OneHotEncoder`

In [None]:
# One-Hot Encoding using pandas
df_one_hot = pd.get_dummies(df, columns=['City'], prefix='City', dtype=int)
df_one_hot

Unnamed: 0,Size,Product,Sales,City_Chicago,City_Los Angeles,City_New York
0,Small,A,200,0,0,1
1,Medium,B,300,0,1,0
2,Large,C,150,1,0,0
3,Medium,A,250,0,0,1
4,Small,C,100,1,0,0


Note that for linear models such as Linear Regression or Logistic Regression, it is usually advisable to pass the `drop_first=True` argument to `pd.get_dummies`. This helps to prevent multicollinearity by removing one dummy column per feature.

**What is Multicollinearity?**

Multicollinearity occurs when two or more independent variables (features) in a dataset are highly correlated with each other. This means that one variable can be linearly predicted from the others with a high degree of accuracy.

In regression models (e.g., Linear Regression or Logistic Regression), multicollinearity can cause issues because the model struggles to determine the individual effect of each variable on the target variable. This leads to unstable or unreliable coefficient estimates.

Imagine you’re trying to predict a person’s monthly electricity bill based on two factors:

- The number of hours they use their air conditioner (AC).
- The number of hours they use their ceiling fan.

Since people often use an AC and a fan during the same time (e.g., on hot days), these two variables are highly correlated. This creates a redundancy because the model may not be able to distinguish whether the AC usage or the fan usage is contributing more to the electricity bill.
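
To connect this back to one-hot encoding, the sketch below (an illustration, not part of the original walkthrough) checks the dummy columns for the `City` values used in this notebook. In every row the dummy columns sum to 1, which duplicates the intercept column of a linear model. NumPy's `matrix_rank` makes the redundancy visible.

```python
import numpy as np
import pandas as pd

cities = pd.Series(['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'], name='City')

# All dummy columns, plus the intercept column a linear model adds implicitly
full = pd.get_dummies(cities, dtype=int)
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])

# Same design matrix, but with the first dummy column dropped
reduced = pd.get_dummies(cities, dtype=int, drop_first=True)
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])

# In every row the dummy columns sum to 1, i.e. they reproduce the intercept column
print(full.sum(axis=1).tolist())

# A rank below the column count means one column is a linear combination of the others
print(X_full.shape[1], np.linalg.matrix_rank(X_full))        # 4 columns, rank 3
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))  # 3 columns, rank 3
```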

In [None]:
#trying to avoid multicollinearity
df_one_hot2 = pd.get_dummies(df, columns=['City'], prefix='City', dtype=int, drop_first=True)
df_one_hot2

Unnamed: 0,Size,Product,Sales,City_Los Angeles,City_New York
0,Small,A,200,0,1
1,Medium,B,300,1,0
2,Large,C,150,0,0
3,Medium,A,250,0,1
4,Small,C,100,0,0


In [None]:
#importing the required library
from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder
one_hot_encoder = OneHotEncoder(sparse_output=False)  # Ensure output is a dense array
one_hot_encoded = one_hot_encoder.fit_transform(df[['City']])

# Combine back into the dataframe
city_columns = one_hot_encoder.get_feature_names_out(['City'])
df_one_hot_sklearn = pd.concat([df, pd.DataFrame(one_hot_encoded, columns=city_columns)], axis=1)
df_one_hot_sklearn

Unnamed: 0,Size,City,Product,Sales,City_Chicago,City_Los Angeles,City_New York
0,Small,New York,A,200,0.0,0.0,1.0
1,Medium,Los Angeles,B,300,0.0,1.0,0.0
2,Large,Chicago,C,150,1.0,0.0,0.0
3,Medium,New York,A,250,0.0,0.0,1.0
4,Small,Chicago,C,100,1.0,0.0,0.0


In [None]:
one_hot_encoded

array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

### B. Ordinal Encoding

Ordinal encoding assigns a unique integer to each category, similar to label encoding, but is explicitly used when the categories have a meaningful order or rank.

**How It Works:**

Example: A feature Size with categories: Small, Medium, Large.

| Size   | Ordinal Encoded |
|--------|-----------------|
| Small  |        1        |
| Medium |        2        |
| Large  |        3        |


**Pros:**

- Preserves Order: Maintains the ordinal relationship between categories.
- Compact Representation: No additional columns are created, keeping the dataset concise.
- Efficient for Models: Works well for algorithms that can understand ordinal relationships, such as tree-based models.

**Cons:**

- Not Suitable for Non-Ordinal Data: Implies relationships where none exist, which can mislead models.
- Bias in Distance-Based Models: Algorithms like k-Nearest Neighbors or linear regression may interpret the numerical values incorrectly as equidistant.

**When to Use:**

Use ordinal encoding when the feature has an inherent order, such as:
- Education levels: High School < Bachelor < Master < PhD
- Customer satisfaction: Very Dissatisfied < Neutral < Very Satisfied

In [None]:
#importing the required library
from sklearn.preprocessing import OrdinalEncoder

# Create OrdinalEncoder instance
ordinal_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])

# Apply encoding to the 'Size' column
df['Size_Ordinal'] = ordinal_encoder.fit_transform(df[['Size']])
df[['Size', 'Size_Ordinal']]

Unnamed: 0,Size,Size_Ordinal
0,Small,0.0
1,Medium,1.0
2,Large,2.0
3,Medium,1.0
4,Small,0.0


**Label and ordinal encoding seem similar, so what's the difference?**

| **Aspect**            | **Ordinal Encoding**                | **Label Encoding**                   |
|-----------------------|-------------------------------------|-------------------------------------|
| **Order Assumption**   | Assumes a meaningful order exists.  | Does not assume any order.           |
| **Data Type**          | Suitable for ordinal data.         | Suitable for nominal data.           |
| **Example Use Case**   | `low, medium, high` → `0, 1, 2`     | `red, blue, green` → `0, 1, 2`       |
| **Impact on Models**   | Numeric order may affect predictions.| Values are arbitrary, but models may still read them as ordered. |

**Ordinal Data and Nominal Data?**

A. **Ordinal Data**:
   - Refers to categorical data with a specific order or rank.
   - Example: Education levels (`high school < bachelor's < master's < doctorate`).

B. **Nominal Data**:
   - Refers to categorical data with no inherent order or rank.
   - Example: Colors (`red, blue, green`) or species names (`dog, cat, bird`).

### C. Target (Mean) Encoding

Target encoding replaces each category with the mean (or another aggregate statistic) of the target variable for that category.

**How It Works:**

Example: A feature City with corresponding target variable Sales.

| City       | Sales (Target) | Target Encoded (Mean Sales) |
|------------|----------------|-----------------------------|
| New York   | 200, 250       | 225                         |
| Los Angeles| 300, 350, 400  | 350                         |
| Chicago    | 150, 200, 100  | 150                         |

For City = New York, the target encoding value is the average of Sales:

$$\frac {200+250}{2}=225$$

**Detailed explanation on target encoding**

Let's consider an example. The table below lists people with different favorite colors, their heights, and whether they love (1) or do not love (0) the movie Troll 2.

| Favorite color    | Height   |Loves Troll 2   |
|------------------|-----------|----------------|
| Blue       | 1.77         |1|
| Red         | 1.32         |0|
| Green      | 1.81          |1|
| Blue          | 1.56       |0|
| Green           | 1.64     |1|
| Green            | 1.61    |0|
| Blue         | 1.73        |0|


Note that `Loves Troll 2` is our target variable (on which we can calculate the mean) and `Favorite color` is the feature we want to encode.

`A. For Blue`

Of the 3 people who like Blue, only one loves Troll 2. So the mean value for Blue = $\frac{1}{3} \approx 0.33$


`B. For Red`

Only one person likes Red and they do not love Troll 2. Mean for Red = $\frac{0}{1} = 0$

`C. For Green`

Two of the three people who like Green love Troll 2. Mean for Green = $\frac{2}{3} \approx 0.67$

Our encoded data would look like this:

| Favorite color    | Height   |Loves Troll 2   |
|------------------|-----------|----------------|
| 0.33       | 1.77         |1|
| 0         | 1.32         |0|
| 0.67      | 1.81          |1|
| 0.33          | 1.56       |0|
| 0.67           | 1.64     |1|
| 0.67            | 1.61    |0|
| 0.33         | 1.73        |0|

However, because less data supports the value we replaced Red with, we have less confidence that we replaced Red with the best value than we have for Blue and Green. To tackle this, target encoding is usually done using a weighted mean that combines the mean for a specific option like Red, with the overall mean of the Target.

$$Weighted \ mean = \frac{(n * option \ mean) + (m * overall \ mean)}{n+m}$$

where:

- n = weight for the option mean (usually the number of rows for that option)
- m = weight for the overall mean (a user-defined hyperparameter; here m = 2)

**_More on hyperparameters in the AI course track_**

With m = 2, we need at least 3 rows of data before the option mean (for example, the mean we calculated for Blue) becomes more important than the overall mean.

`A. For Blue`

$$Weighted \ mean = \frac{(3 * \frac{1}{3}) + (2 * \frac{3}{7})}{3+2} =0.37$$

`B. For Red`

$$Weighted \ mean = \frac{(1 * \frac{0}{1}) + (2 * \frac{3}{7})}{1+2} =0.29$$


`C. For Green`

$$Weighted \ mean = \frac{(3 * \frac{2}{3}) + (2 * \frac{3}{7})}{3+2} =0.57$$


| Favorite color    | Height   |Loves Troll 2   |
|------------------|-----------|----------------|
| 0.37       | 1.77         |1|
| 0.29         | 1.32         |0|
| 0.57      | 1.81          |1|
| 0.37          | 1.56       |0|
| 0.57           | 1.64     |1|
| 0.57            | 1.61    |0|
| 0.37         | 1.73        |0|
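
The weighted-mean calculation above can be sketched in pandas as follows. The DataFrame `troll` rebuilds the example table; the variable names (`stats`, `Color_Encoded`) are chosen for this illustration only.

```python
import pandas as pd

troll = pd.DataFrame({
    'Favorite color': ['Blue', 'Red', 'Green', 'Blue', 'Green', 'Green', 'Blue'],
    'Loves Troll 2': [1, 0, 1, 0, 1, 0, 0],
})

m = 2  # weight given to the overall mean
overall_mean = troll['Loves Troll 2'].mean()  # 3/7

# n = number of rows per colour, option_mean = mean of the target per colour
stats = troll.groupby('Favorite color')['Loves Troll 2'].agg(n='count', option_mean='mean')
stats['weighted_mean'] = (stats['n'] * stats['option_mean'] + m * overall_mean) / (stats['n'] + m)

# Replace each colour with its weighted mean
troll['Color_Encoded'] = troll['Favorite color'].map(stats['weighted_mean'])

print(stats['weighted_mean'].round(2).to_dict())  # {'Blue': 0.37, 'Green': 0.57, 'Red': 0.29}
```

Raising `m` pulls every category closer to the overall mean, which matters most for categories with few rows, such as Red here.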

However, using the target variable to encode the data results in data leakage and therefore overfitting, where the model performs excellently on the training data and does worse on the test data. There are, however, ways to avoid data leakage so that we can use target encoding without overfitting the model. One of the best methods is `K-Fold Target Encoding`. Note that the word "Fold" in K-Fold Target Encoding refers to splitting the data into equal-sized subsets, and the K refers to how many subsets we create.


| Favorite color    | Height   |Loves Troll 2   |
|------------------|-----------|----------------|
| Blue       | 1.77         |1|
| Red         | 1.32         |0|
| Green      | 1.81          |1|
| Blue          | 1.56       |0|
|      ...          |     ...      |.. |
| Green           | 1.64     |1|
| Green            | 1.61    |0|
| Blue         | 1.73        |0|

Here K = 2: the rows above the `...` form subset A and the rows below it form subset B. To target encode Blue in subset A, we ignore the target values in this subset and instead plug the target values from subset B into the weighted mean equation.

`A. Blue present in subset A`

$$Weighted \ mean = \frac{(1 * \frac{0}{1}) + (2 * \frac{1}{3})}{1+2} =0.22$$


`B. Blue present in subset B`

$$Weighted \ mean = \frac{(2 * \frac{1}{2}) + (2 * \frac{2}{4})}{2+2} =0.5$$


`C. Red present in subset A`

$$Weighted \ mean = \frac{(0 * 0) + (2 * \frac{1}{3})}{0+2} =0.33$$


`D. Green present in subset A`

$$Weighted \ mean = \frac{(2 * \frac{1}{2}) + (2 * \frac{1}{3})}{2+2} =0.42$$


`E. Green present in subset B`

$$Weighted \ mean = \frac{(1 * \frac{1}{1}) + (2 * \frac{2}{4})}{1+2} =0.67$$


The final table will be:

| Favorite color    | Height   |Loves Troll 2   |
|------------------|-----------|----------------|
| 0.22       | 1.77         |1|
| 0.33         | 1.32         |0|
| 0.42      | 1.81          |1|
| 0.22          | 1.56       |0|
|      ...          |     ...      |.. |
| 0.67           | 1.64     |1|
| 0.67            | 1.61    |0|
| 0.5         | 1.73        |0|


This process reduces data leakage because the rows do not use their own target values to calculate their encoding.
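
The two-fold calculation can be sketched as below. The `fold` column encodes the split inferred from the table (first four rows in subset A, last three in subset B), and `encode_fold` is a hypothetical helper written for this illustration, not a library function.

```python
import pandas as pd

troll = pd.DataFrame({
    'Favorite color': ['Blue', 'Red', 'Green', 'Blue', 'Green', 'Green', 'Blue'],
    'Loves Troll 2': [1, 0, 1, 0, 1, 0, 0],
    'fold': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],  # assumed split: subset A, then subset B
})
m = 2  # weight given to the overall mean


def encode_fold(data, fold, m):
    """Encode the rows of one fold using target values from the other rows only."""
    inside = data[data['fold'] == fold]
    outside = data[data['fold'] != fold]
    overall_mean = outside['Loves Troll 2'].mean()
    stats = outside.groupby('Favorite color')['Loves Troll 2'].agg(n='count', total='sum')
    # Colours absent from the other fold get n = 0 and fall back to the overall mean
    n = inside['Favorite color'].map(stats['n']).fillna(0)
    total = inside['Favorite color'].map(stats['total']).fillna(0)
    # total equals n * option mean, so this is the weighted mean formula
    return (total + m * overall_mean) / (n + m)


troll['Color_Encoded'] = pd.concat([encode_fold(troll, f, m) for f in ['A', 'B']])
print(troll['Color_Encoded'].round(2).tolist())  # [0.22, 0.33, 0.42, 0.22, 0.67, 0.67, 0.5]
```

In practice the folds would usually be assigned at random (for example with scikit-learn's `KFold`) rather than by row position.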

**Pros:**

- Compact Representation: Keeps the dataset concise without increasing dimensionality.
- Captures Target Relationships: Leverages the relationship between categories and the target variable.
- Effective for High-Cardinality Data: Handles features with many unique categories efficiently.

**Cons:**

- Overfitting Risk: Can overfit to the training data if not used carefully (e.g., without cross-validation).
- Dependency on Target Variable: Cannot be used in unsupervised learning.

**When to Use:**

Use target encoding for high-cardinality features (e.g., Zip Code) in supervised learning tasks.

In [None]:
!pip install category-encoders

Collecting category-encoders
  Downloading category_encoders-2.6.4-py2.py3-none-any.whl.metadata (8.0 kB)
Downloading category_encoders-2.6.4-py2.py3-none-any.whl (82 kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.6.4


In [None]:
#importing the required library
import category_encoders as ce

# Create TargetEncoder instance
target_encoder = ce.TargetEncoder(cols=['City'])

# Apply target encoding to the 'City' column with 'Sales' as the target
df['City_Target'] = target_encoder.fit_transform(df['City'], df['Sales'])
df[['City', 'City_Target', 'Sales']]

Unnamed: 0,City,City_Target,Sales
0,New York,203.546277,200
1,Los Angeles,213.010847,300
2,Chicago,189.36117,150
3,New York,203.546277,250
4,Chicago,189.36117,100


### What is Data Leakage?

In simple terms, data leakage happens when a machine learning model gets access to information during training that it wouldn’t normally have during real-world use or testing. This unfair advantage can make the model look much better during training, but it performs poorly when deployed.


**How It Happens**

Data leakage typically occurs when:

- Future Information is Used: Information from the target (what you're trying to predict) sneaks into the training data. Example: Using data about a patient’s recovery (which happens after treatment) to predict their initial diagnosis.
- Data Preprocessing Issues: Features derived from the entire dataset (training + testing) are used during training. Example: Normalizing data using statistics calculated from both training and testing datasets.

**An Example of Data Leakage**

Suppose you’re predicting whether a customer will buy a product based on their online behavior. If the dataset includes the purchase confirmation as one of the features, the model will learn that a purchase confirmation equals "Yes." This makes the model 100% accurate during training but useless in real-world predictions because it already "knows" the answer.

**Why is Data Leakage a Problem?**

- Overly Optimistic Performance: The model seems to perform exceptionally well on training or validation data but fails in real-world scenarios.
- Poor Generalization: The model doesn't actually learn meaningful patterns; it learns shortcuts from leaked data.
- Waste of Resources: Time and effort spent training and deploying a model that ultimately fails in production.

**How to Prevent Data Leakage**

- Separate Training and Testing Data:
  - Always split your dataset into training and testing sets before preprocessing.
  - Ensure no overlap of information between the two sets.

- Be Cautious with Features: Avoid using features that include future information or directly depend on the target variable.

- Pipeline Processing: Use pipelines to ensure preprocessing steps (e.g., scaling, encoding) are fitted only on the training data and then applied unchanged to new data.

- Cross-Validation: Use proper cross-validation techniques to detect and prevent leakage during model evaluation.
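
A minimal sketch of these prevention steps with scikit-learn, assuming synthetic data from `make_classification` (the dataset, the choice of `LogisticRegression`, and all parameter values are illustrative only). The split happens before any preprocessing, and the pipeline refits the scaler inside each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split first, so the test set plays no part in any preprocessing decision
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The pipeline refits the scaler on the training portion of every cross-validation fold
model = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit on the training set; the test set is only transformed and scored
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(cv_scores.mean(), test_accuracy)
```

Scaling the full dataset before calling `cross_val_score` would let each validation fold influence the scaler's mean and standard deviation, which is the preprocessing leak described above.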

**Simple Analogy**

Think of data leakage like taking an exam where you accidentally get access to the answer key beforehand. You’ll ace the exam but won’t truly understand the subject, and in real-world scenarios (like a job), you’ll likely fail because you didn’t learn properly.

### D. Frequency (Count) Encoding

Frequency encoding replaces each category with its frequency (count of occurrences) in the dataset.

**How It Works:**

Example: A feature Product with frequency counts.

| Product      | Frequency Encoded |
|--------------|-------------------|
| A            |         5         |
| B            |         3         |
| C            |         2         |


If Product A appears 5 times in the dataset, its encoded value is 5.

**Pros:**

- Simple and Effective: Intuitive and easy to implement.
- Compact Representation: Does not increase dimensionality.
- Useful for Rare Categories: Highlights the importance of frequent vs. infrequent categories.

**Cons:**

- Ignores Category Relationships: Does not account for relationships with the target variable.
- Less Informative for Balanced Data: May not provide much insight if all categories have similar frequencies.

**When to Use:**

Use frequency encoding for categorical features in exploratory data analysis or when the frequency of occurrence carries meaningful information (e.g., product popularity).

In [None]:
# Frequency encoding for 'Product' column
freq_encoding = df['Product'].value_counts()
df['Product_Frequency'] = df['Product'].map(freq_encoding)
df[['Product', 'Product_Frequency']]

Unnamed: 0,Product,Product_Frequency
0,A,2
1,B,1
2,C,2
3,A,2
4,C,2


### E. Binary Encoding

Binary encoding converts categories into binary code and represents them as binary digits across multiple columns.

**How It Works:**

Example: A feature Category with three unique values.

| Category | Integer Label          | Binary Encoded |
|----------|------------------------|----------------|
| A        | 1                      | 0 1            |
| B        | 2                      | 1 0            |
| C        | 3                      | 1 1            |

Each category is assigned a unique integer (e.g., A=1, B=2, C=3), which is then converted into binary (1 -> 01, 2 -> 10, 3 -> 11).

**Pros:**

- Compact Representation: Reduces the number of features compared to one-hot encoding for high-cardinality data.
- Preserves Information: Captures both unique category information and reduces dimensionality.
- Avoids Ordinal Bias: Binary encoding does not imply order.

**Cons:**

- Complexity in Interpretation: Binary-encoded values are less interpretable than other encoding methods.
- Overhead in Conversion: Requires additional computation to convert categories into binary.

**When to Use:**

Use binary encoding for high-cardinality categorical features in scenarios where dimensionality reduction is important but order should not be inferred.


**Library: `pandas` and `sklearn.preprocessing.LabelEncoder`**


**Steps for Binary Encoding:**

a. Mapping each unique category to an integer.

b. Converting the integer representation to binary.

c. Splitting the binary digits into separate columns.

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Dataset
data = {'ID': [1, 2, 3, 4], 'Category': ['A', 'B', 'C', 'D']}
df = pd.DataFrame(data)
df

Unnamed: 0,ID,Category
0,1,A
1,2,B
2,3,C
3,4,D


In [None]:
# Step 1: Label Encoding
label_encoder = LabelEncoder()
df['Integer'] = label_encoder.fit_transform(df['Category']) + 1  # Start integers from 1
df

Unnamed: 0,ID,Category,Integer
0,1,A,1
1,2,B,2
2,3,C,3
3,4,D,4


In [None]:
# Step 2: Binary Encoding
df['Binary'] = df['Integer'].apply(lambda x: format(x, 'b'))  # Convert to binary
binary_cols = df['Binary'].apply(lambda x: list(x.zfill(3)))  # Zero-fill for alignment
df

Unnamed: 0,ID,Category,Integer,Binary
0,1,A,1,1
1,2,B,2,10
2,3,C,3,11
3,4,D,4,100


In [None]:
# Step 3: Split Binary into Columns (cast the '0'/'1' characters to integers)
binary_df = pd.DataFrame(binary_cols.tolist(), columns=['Binary_1', 'Binary_2', 'Binary_3']).astype(int)

# Final DataFrame
df = pd.concat([df, binary_df], axis=1)

# Display the result
df

Unnamed: 0,ID,Category,Integer,Binary,Binary_1,Binary_2,Binary_3
0,1,A,1,1,0,0,1
1,2,B,2,10,0,1,0
2,3,C,3,11,0,1,1
3,4,D,4,100,1,0,0


By carefully selecting the appropriate encoding method based on your dataset and machine learning algorithm, you can ensure better model performance and interpretability.