## Encoding Categorical Variables:

Categorical variables represent categories or groups and have no inherent numerical value. They must be encoded because many machine learning algorithms accept only numerical input. The most commonly used encoding techniques are:

#### 1. One-Hot Encoding:

**Concept:**
One-Hot Encoding converts categorical variables into binary vectors where each category is represented by a binary indicator (0 or 1).

**Maths:**
If a categorical variable has $ k $ unique categories, one-hot encoding creates $ k $ binary columns, one per category. For each sample, exactly one column has the value 1 (the column of the sample's category) and the other $ k - 1 $ columns are 0.

**Example:**
Suppose we have a categorical variable "Color" with three categories: Red, Blue, Green. After one-hot encoding:
```
| Color_Red | Color_Blue | Color_Green |
|-----------|------------|-------------|
|    1      |     0      |      0      |
|    0      |     1      |      0      |
|    0      |     0      |      1      |
```
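The table above can be reproduced with a few lines of plain Python. This is a minimal sketch rather than a production encoder: the helper name `one_hot_encode` and the `prefix` parameter are made up for illustration, and columns are ordered by first appearance of each category.

```python
def one_hot_encode(values, prefix):
    """Return (column_names, rows) for a list of category labels."""
    # Keep categories in order of first appearance so column order is stable.
    categories = list(dict.fromkeys(values))
    columns = [f"{prefix}_{c}" for c in categories]
    # One row per sample: 1 in the column of its category, 0 elsewhere.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


colors = ["Red", "Blue", "Green", "Blue"]
columns, rows = one_hot_encode(colors, prefix="Color")

print(columns)  # ['Color_Red', 'Color_Blue', 'Color_Green']
for row in rows:
    print(row)
```

Every row sums to 1, which is the "only one column is 1" property described in the Maths paragraph.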

**When to Use:**
- When the categorical variable has no inherent order or hierarchy.
- When the number of categories is not excessively large.

**Why to Use:**
- Ensures each category is treated equally.
- Prevents the model from reading a spurious order or magnitude into the categories.

**Advantages:**
- Maintains all information from the categorical variable.
- Works well with most machine learning algorithms.

**Disadvantages:**
- Increases dimensionality: a variable with $ k $ categories adds $ k $ columns, which becomes costly for high-cardinality variables.

#### 2. Label Encoding:

**Concept:**
Label Encoding assigns a unique integer to each category, essentially converting categories into numerical labels.

**Maths:**
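Each of the $ k $ categories is mapped to a distinct integer in $ \{0, 1, \dots, k-1\} $. The assignment is arbitrary (for example alphabetical, or by order of first appearance), so the integers identify categories but do not by themselves encode a meaningful order.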

**Example:**
Suppose we have a categorical variable "Size" with categories: Small, Medium, Large.
```
Small  -> 0
Medium -> 1
Large  -> 2
```
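A minimal sketch of label encoding in plain Python follows. The helper `fit_label_encoder` is a hypothetical name, and sorting the categories alphabetically is an assumption made here to get a deterministic mapping. Note that this produces a different mapping from the hand-written one above, which is exactly why the integers should not be read as an order.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer 0..k-1."""
    # Sorting makes the mapping deterministic; the order is alphabetical,
    # not semantic, so "Large" ends up with the smallest code.
    return {category: code for code, category in enumerate(sorted(set(values)))}


sizes = ["Small", "Medium", "Large", "Medium"]
mapping = fit_label_encoder(sizes)
encoded = [mapping[s] for s in sizes]

print(mapping)  # {'Large': 0, 'Medium': 1, 'Small': 2}
print(encoded)  # [2, 1, 0, 1]
```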

**When to Use:**
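- When the integers serve only as identifiers, for example when encoding the class labels of a target variable.
- When the assigned integers happen to follow the variable's natural order, as in the example above. If the order matters, an explicit ordinal mapping is safer.

**Why to Use:**
- Produces a compact, single-column numerical representation.

**Advantages:**
- Reduces dimensionality compared to one-hot encoding.
- Preserves ordinal information only if the integer assignment follows the natural order of the categories.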

**Disadvantages:**
- Can introduce unintended ordinality to categorical variables without natural order.

#### 3. Ordinal Encoding:

**Concept:**
Ordinal Encoding assigns numerical values to categories based on their order or rank.

**Maths:**
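Given a known ordering $ c_1 < c_2 < \dots < c_k $ of the categories, each category $ c_j $ is mapped to its rank $ j $. This is similar to Label Encoding, except that the mapping is specified from the order of the categories rather than assigned arbitrarily.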

**Example:**
Suppose we have a categorical variable "Education" with categories: High School, Bachelor's, Master's, PhD.
```
High School -> 1
Bachelor's  -> 2
Master's    -> 3
PhD         -> 4
```
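The key difference from label encoding is that the order is supplied explicitly. The sketch below illustrates this in plain Python; `EDUCATION_ORDER`, `ordinal_encode`, and the `start` parameter are names chosen for this example, and raising an error on unknown categories is one possible design choice among several.

```python
EDUCATION_ORDER = ["High School", "Bachelor's", "Master's", "PhD"]


def ordinal_encode(values, order, start=1):
    """Replace each category with its rank in the explicitly given order."""
    rank = {category: i for i, category in enumerate(order, start=start)}
    # Fail loudly on categories that are missing from the declared order.
    unknown = sorted(set(values) - set(rank))
    if unknown:
        raise ValueError(f"categories not in the declared order: {unknown}")
    return [rank[v] for v in values]


education = ["PhD", "High School", "Master's", "Bachelor's", "PhD"]
encoded = ordinal_encode(education, EDUCATION_ORDER)
print(encoded)  # [4, 1, 3, 2, 4]
```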

**When to Use:**
- When the categorical variable has an inherent order or hierarchy, and only the ranking matters rather than the exact size of the gaps between categories.

**Why to Use:**
- Preserves ordinal information.

**Advantages:**
- Reduces dimensionality compared to one-hot encoding.
- Preserves ordinal information.

**Disadvantages:**
- The integer codes imply equal spacing between adjacent categories, which may not hold in reality.

#### 4. Target Encoding:

**Concept:**
Target Encoding (or Mean Encoding) replaces categories with the mean of the target variable for each category.

**Maths:**
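For each category $ i $, let $ n_i $ be the number of samples in that category. Every occurrence of the category is replaced with the mean of the target variable $ y $ over those samples: $ \mu_i = \frac{1}{n_i} \sum_{j \,:\, x_j = i} y_j $.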

**Example:**
Suppose we have a categorical variable "City" with different categories and a target variable "Salary".
```
|   City   | Salary |
|----------|--------|
|  London  |  5000  |
|  Paris   |  6000  |
|  London  |  4800  |
|  Paris   |  5500  |
```
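After target encoding, each city is replaced by its mean salary (London: (5000 + 4800) / 2 = 4900; Paris: (6000 + 5500) / 2 = 5750):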
```
|   City   | City_encoded | Salary |
|----------|--------------|--------|
|  London  |     4900     |  5000  |
|  Paris   |     5750     |  6000  |
|  London  |     4900     |  4800  |
|  Paris   |     5750     |  5500  |
```
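The City/Salary example can be checked with a short plain-Python sketch. The function name `target_mean_encode` is hypothetical. In practice the means would be computed on training data only and then applied to new data, a step this sketch leaves out.

```python
from collections import defaultdict


def target_mean_encode(categories, targets):
    """Return (encoded_values, per_category_means)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    # Mean of the target within each category.
    means = {category: sums[category] / counts[category] for category in sums}
    return [means[category] for category in categories], means


cities = ["London", "Paris", "London", "Paris"]
salaries = [5000, 6000, 4800, 5500]

encoded, means = target_mean_encode(cities, salaries)
print(means)    # {'London': 4900.0, 'Paris': 5750.0}
print(encoded)  # [4900.0, 5750.0, 4900.0, 5750.0]
```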

**When to Use:**
- When one-hot encoding would create too many features.
- When preserving the relationship between the category and the target variable is important.

**Why to Use:**
- Captures information about the target variable within the categorical variable.

**Advantages:**
- Reduces dimensionality compared to one-hot encoding.
- Preserves relationship with target variable.

**Disadvantages:**
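- Prone to overfitting and target leakage, especially with small or noisy datasets, because each sample's own target value contributes to its encoding.
- Sensitive to outliers and imbalanced data: categories with few samples get unreliable means.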

### Types of Target Encoding:

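- **Simple Mean Encoding:** Replaces each category with the mean of the target variable for that category.
- **Smoothing Mean Encoding:** Blends the category mean with the overall target mean, for example $ \frac{n_i \mu_i + m \mu}{n_i + m} $ with overall mean $ \mu $ and a smoothing strength $ m $, so that categories with few samples are pulled toward the overall mean to reduce overfitting.
- **Leave-One-Out Encoding:** Similar to mean encoding, but each sample's category mean is computed from the other samples in that category, excluding the sample's own target value to reduce leakage.
- **Expanding Mean Encoding:** Processes the samples in a fixed order and encodes each one with the mean target of the earlier samples in the same category, so a sample never sees its own or later target values.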
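
The three leakage-reducing variants in the list above can be sketched in plain Python as follows. All function names are hypothetical. The smoothing strength `m`, the fallback to the overall mean for single-sample categories, and the `prior` used before a category has any history are assumptions of this sketch, not fixed parts of the techniques.

```python
from collections import defaultdict


def _category_stats(categories, targets):
    """Per-category sum and count of the target."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return sums, counts


def smoothed_mean_encode(categories, targets, m=2.0):
    # (n_i * mean_i + m * global_mean) / (n_i + m); n_i * mean_i equals sums[c].
    sums, counts = _category_stats(categories, targets)
    global_mean = sum(targets) / len(targets)
    return [(sums[c] + m * global_mean) / (counts[c] + m) for c in categories]


def leave_one_out_encode(categories, targets):
    # Category mean computed without the current row's own target.
    sums, counts = _category_stats(categories, targets)
    global_mean = sum(targets) / len(targets)
    return [
        (sums[c] - y) / (counts[c] - 1) if counts[c] > 1 else global_mean
        for c, y in zip(categories, targets)
    ]


def expanding_mean_encode(categories, targets, prior):
    # Each row sees only earlier rows of the same category, in the given order.
    sums, counts = defaultdict(float), defaultdict(int)
    encoded = []
    for c, y in zip(categories, targets):
        encoded.append(sums[c] / counts[c] if counts[c] else prior)
        sums[c] += y
        counts[c] += 1
    return encoded


cities = ["London", "Paris", "London", "Paris"]
salaries = [5000, 6000, 4800, 5500]

smoothed = smoothed_mean_encode(cities, salaries, m=2.0)
loo = leave_one_out_encode(cities, salaries)
expanding = expanding_mean_encode(cities, salaries, prior=5325.0)

print(smoothed)   # [5112.5, 5537.5, 5112.5, 5537.5]
print(loo)        # [4800.0, 5500.0, 5000.0, 6000.0]
print(expanding)  # [5325.0, 5325.0, 5000.0, 6000.0]
```

With only two samples per city, the smoothed values sit halfway between each city's mean and the overall mean of 5325, which shows how strongly small categories are shrunk when `m` equals the category size.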

These encoding techniques prepare categorical variables for machine learning models. Each one trades off dimensionality, preserved information, and risk of overfitting, so the right choice depends on the nature of the data and the problem at hand.