# **How to Encode Data**

---

## **Introduction**

Categorical data must be encoded into a numerical format before most machine learning models can process it. A well-chosen encoding lets algorithms use these features without introducing structure, such as an artificial ordering, that is not present in the data.

This notebook covers:
1. **Why encoding is necessary.**
2. **Popular encoding methods and their applications.**
3. **Choosing the best encoding approach based on real-world scenarios.**

---

## **1. Why Encoding is Necessary**

### Machine Learning Models and Numerical Data:
Most machine learning models, like Logistic Regression, SVMs, and Decision Trees, expect numerical inputs. Categorical variables therefore have to be converted to numbers in a way that keeps the information they carry without adding relationships (for example, an ordering or a distance between categories) that do not exist.

### Challenges in Encoding:
1. **Preserving relationships**:
   - Nominal data should not imply an order.
   - Ordinal data must respect the natural order.
2. **Avoiding data leakage**:
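   - Methods that use the target, such as Target Encoding, must be fitted on the training folds only during cross-validation; otherwise target information leaks into the validation data.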
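
To make the first challenge concrete, here is a minimal sketch of how integer codes impose an order on a nominal feature. The integer assignment below is arbitrary and chosen only for illustration.

```python
import numpy as np

# Arbitrary integer codes for a nominal feature (illustrative assignment)
int_codes = {"Red": 0, "Blue": 1, "Green": 2}

# One-hot vectors for the same three categories
one_hot = {
    "Red": np.array([1, 0, 0]),
    "Blue": np.array([0, 1, 0]),
    "Green": np.array([0, 0, 1]),
}

pairs = [("Red", "Blue"), ("Red", "Green"), ("Blue", "Green")]

# Distance between categories under each encoding
int_dist = {p: abs(int_codes[p[0]] - int_codes[p[1]]) for p in pairs}
ohe_dist = {p: float(np.linalg.norm(one_hot[p[0]] - one_hot[p[1]])) for p in pairs}

print(int_dist)  # "Red" ends up twice as far from "Green" as from "Blue"
print(ohe_dist)  # every pair is the same distance apart
```

With integer codes, a distance-based or linear model sees "Red" as closer to "Blue" than to "Green," which is an artifact of the assignment. With one-hot vectors, all pairs are equidistant.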

---

## **2. Encoding Methods**

### **2.1 One-Hot Encoding**
- **Purpose**: Converts each category into a binary column.
- **When to Use**:
  - For **nominal variables** (no order), such as colors or product types.
  - When the number of categories is relatively small.



#### **Example**: Nominal Variable

In [6]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Nominal feature example
data = {"Color": ["Red", "Blue", "Green", "Blue", "Red", "Green", "Red"]}
df_nominal = pd.DataFrame(data)

# One-Hot Encoding
# drop="first" removes the first category in sorted order ("Blue") to prevent multicollinearity
ohe = OneHotEncoder(sparse_output=False, drop="first")
encoded = ohe.fit_transform(df_nominal[["Color"]])
encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out())
result = pd.concat([df_nominal, encoded_df], axis=1)
result

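|   | Color | Color_Green | Color_Red |
|---|-------|-------------|-----------|
| 0 | Red   | 0.0         | 1.0       |
| 1 | Blue  | 0.0         | 0.0       |
| 2 | Green | 1.0         | 0.0       |
| 3 | Blue  | 0.0         | 0.0       |
| 4 | Red   | 0.0         | 1.0       |
| 5 | Green | 1.0         | 0.0       |
| 6 | Red   | 0.0         | 1.0       |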


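**Each remaining category ("Green," "Red") gets its own binary column. `drop="first"` removes the first category in sorted order, "Blue," so a row of all zeros means "Blue." Dropping one column avoids perfect multicollinearity.**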
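
In practice the encoder is fitted on training data and then applied to new data, which may contain categories it has never seen. The sketch below shows one way to handle that with `handle_unknown="ignore"`. The `train` and `new` frames and the "Purple" value are made up for illustration, and no column is dropped here to keep the behaviour easy to read.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and new data; "Purple" never appears in training
train = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})
new = pd.DataFrame({"Color": ["Green", "Purple"]})

# Unknown categories are encoded as all zeros instead of raising an error
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train[["Color"]])

# The default output is a sparse matrix; convert it to a dense array for display
encoded_new = ohe.transform(new[["Color"]]).toarray()
columns = [f"Color_{c}" for c in ohe.categories_[0]]
encoded_new_df = pd.DataFrame(encoded_new, columns=columns)
print(encoded_new_df)
```

The row for "Purple" contains only zeros, so the model treats it as "none of the known colors."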

---

### **2.2 Label Encoding**
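- **Purpose**: Assigns an integer to each category.
- **When to Use**:
  - For **ordinal variables** (with a meaningful order), such as "Low," "Medium," "High."
  - Define the mapping explicitly so the integers follow the category order; scikit-learn's `LabelEncoder` assigns codes in alphabetical order and is intended for target labels rather than features.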

#### **Example**: Ordinal Variable

In [7]:
# Ordinal feature example
data = {"Priority": ["Low", "Medium", "High", "Medium", "Low"]}
df_ordinal = pd.DataFrame(data)

# Map ordinal categories to integers
ordinal_mapping = {"Low": 1, "Medium": 2, "High": 3}
df_ordinal["Encoded_Priority"] = df_ordinal["Priority"].map(ordinal_mapping)
df_ordinal

|   | Priority | Encoded_Priority |
|---|----------|------------------|
| 0 | Low      | 1                |
| 1 | Medium   | 2                |
| 2 | High     | 3                |
| 3 | Medium   | 2                |
| 4 | Low      | 1                |


**"Low" = 1, "Medium" = 2, "High" = 3, reflecting their natural order.**
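
The same idea can be sketched with scikit-learn. `OrdinalEncoder` accepts an explicit category order, while `LabelEncoder` sorts categories alphabetically and loses it. The column names `Ordinal_Encoded` and `Label_Encoded` are illustrative, and note that `OrdinalEncoder` starts counting at 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df_ordinal = pd.DataFrame({"Priority": ["Low", "Medium", "High", "Medium", "Low"]})

# Explicit order: Low < Medium < High, encoded as 0, 1, 2
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df_ordinal["Ordinal_Encoded"] = (
    ordinal.fit_transform(df_ordinal[["Priority"]]).ravel().astype(int)
)

# LabelEncoder sorts alphabetically: High=0, Low=1, Medium=2 (natural order lost)
df_ordinal["Label_Encoded"] = LabelEncoder().fit_transform(df_ordinal["Priority"])

print(df_ordinal)
```

In the `Label_Encoded` column, "High" receives the smallest code, which would mislead any model that treats the values as ordered.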

---

### **2.3 Target Encoding**
- **Purpose**: Encodes categories based on the mean value of the target variable.
- **When to Use**:
  - For nominal variables with many unique categories.
  - Works well for high-cardinality features.

#### **Example**: High Cardinality

In [8]:
# Toy example: three categories stand in for a high-cardinality nominal variable
data = {"Category": ["A", "B", "C", "B", "A", "C", "A"], "Target": [1, 0, 1, 0, 1, 0, 1]}
df_target = pd.DataFrame(data)

# Calculate target mean per category
target_mean = df_target.groupby("Category")["Target"].mean()
df_target["Target_Encoded"] = df_target["Category"].map(target_mean)
df_target

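|   | Category | Target | Target_Encoded |
|---|----------|--------|----------------|
| 0 | A        | 1      | 1.0            |
| 1 | B        | 0      | 0.0            |
| 2 | C        | 1      | 0.5            |
| 3 | B        | 0      | 0.0            |
| 4 | A        | 1      | 1.0            |
| 5 | C        | 0      | 0.5            |
| 6 | A        | 1      | 1.0            |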


**Each category is replaced with the mean value of the target variable for that category.**

#### **Challenges**:
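- Can lead to **data leakage**: the example above computes the means on the full dataset, so each row's own target value contributes to its encoding. In practice, compute the means on the training folds only and apply them to the held-out fold.
- Categories with few rows produce noisy means; libraries like `category_encoders` provide target encoders with smoothing for this case.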
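
One common way to limit leakage is out-of-fold encoding: each row is encoded with means computed from the other folds only. The following is a minimal sketch of that idea, not a replacement for a tested library. The column name `Target_Encoded_OOF`, the choice of three folds, and the fallback to the fold's overall mean for unseen categories are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df_target = pd.DataFrame({
    "Category": ["A", "B", "C", "B", "A", "C", "A"],
    "Target": [1, 0, 1, 0, 1, 0, 1],
})

# Holds the out-of-fold encoding for every row
oof = pd.Series(np.nan, index=df_target.index, dtype=float)

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df_target):
    fit_part = df_target.iloc[fit_idx]

    # Means come only from rows outside the fold being encoded
    fold_means = fit_part.groupby("Category")["Target"].mean()
    fallback = fit_part["Target"].mean()

    encoded = df_target["Category"].iloc[enc_idx].map(fold_means)
    # Categories absent from the fitting rows fall back to the fitting rows' mean
    oof.iloc[enc_idx] = encoded.fillna(fallback).to_numpy()

df_target["Target_Encoded_OOF"] = oof
print(df_target)
```

On a dataset this small the out-of-fold values are noisy, but no row's encoding depends on its own target.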

---

### **2.4 Frequency Encoding**
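- **Purpose**: Replaces each category with its relative frequency (or count) in the dataset.
- **When to Use**:
  - For high-cardinality nominal features where how common a category is carries signal.
  - Keep in mind that categories with the same frequency receive the same value and become indistinguishable ("LA" and "SF" in the example below).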

In [9]:
# Frequency Encoding
data = {"City": ["NY", "LA", "SF", "NY", "SF", "NY", "LA"]}
df_freq = pd.DataFrame(data)

# Calculate frequency
frequency = df_freq["City"].value_counts(normalize=True)
df_freq["Frequency_Encoded"] = df_freq["City"].map(frequency)
df_freq

|   | City | Frequency_Encoded |
|---|------|-------------------|
| 0 | NY   | 0.428571          |
| 1 | LA   | 0.285714          |
| 2 | SF   | 0.285714          |
| 3 | NY   | 0.428571          |
| 4 | SF   | 0.285714          |
| 5 | NY   | 0.428571          |
| 6 | LA   | 0.285714          |
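
When the encoding is applied to new data, the frequencies should come from the training data. The sketch below shows one way to do that; the `new` frame, the "Boston" value, and the choice to encode unseen categories as 0.0 are assumptions for illustration.

```python
import pandas as pd

train = pd.DataFrame({"City": ["NY", "LA", "SF", "NY", "SF", "NY", "LA"]})
new = pd.DataFrame({"City": ["NY", "Boston"]})  # "Boston" is not in the training data

# Learn the frequencies from the training data only
frequency = train["City"].value_counts(normalize=True)

# Apply the learned mapping; unseen categories get 0.0 (one possible convention)
new["Frequency_Encoded"] = new["City"].map(frequency).fillna(0.0)
print(new)
```

Whether 0.0 is a sensible default depends on the model; a small constant or a separate "unknown" flag are alternatives.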


---

### **2.5 Binary Encoding**
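- **Purpose**: Assigns each category an integer and writes that integer in binary, with one column per bit.
- **When to Use**:
  - For features with medium cardinality.
  - When One-Hot Encoding would create too many columns: the number of columns grows roughly with log2 of the number of categories instead of linearly.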

In [12]:
from category_encoders import BinaryEncoder

# Binary Encoding
data = {"State": ["CA", "TX", "NY", "TX", "CA", "NY", "FL"]}
df_binary = pd.DataFrame(data)
encoder = BinaryEncoder(cols=["State"])
binary_encoded = encoder.fit_transform(df_binary)
binary_encoded

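|   | State_0 | State_1 | State_2 |
|---|---------|---------|---------|
| 0 | 0       | 0       | 1       |
| 1 | 0       | 1       | 0       |
| 2 | 0       | 1       | 1       |
| 3 | 0       | 1       | 0       |
| 4 | 0       | 0       | 1       |
| 5 | 0       | 1       | 1       |
| 6 | 1       | 0       | 0       |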
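
To show where these bit patterns come from, the sketch below rebuilds them by hand with pandas. It assumes integer codes are assigned in order of first appearance starting at 1, which reproduces the output shown for this example; it is not meant to describe the library's internals.

```python
import pandas as pd

df_binary = pd.DataFrame({"State": ["CA", "TX", "NY", "TX", "CA", "NY", "FL"]})

# Step 1: integer code per category, in order of first appearance, starting at 1
codes, uniques = pd.factorize(df_binary["State"])
codes = codes + 1  # CA=1, TX=2, NY=3, FL=4

# Step 2: number of bits needed to write the largest code
n_bits = len(uniques).bit_length()

# Step 3: one column per bit, most significant bit first
manual = pd.DataFrame(
    {f"State_{i}": (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)}
)
print(manual)
```

Four categories fit into three bit columns here, whereas One-Hot Encoding would need four (or three with one dropped). The gap widens quickly as the number of categories grows.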


---

## **3. Choosing the Right Encoding**

### **Guidelines**:
| Encoding Type | When to Use              | Advantages                  | Disadvantages                                   |
|---------------|--------------------------|-----------------------------|-------------------------------------------------|
| **One-Hot**   | Nominal, few categories  | Easy to implement           | Increases dimensionality                        |
| **Label**     | Ordinal                  | Simple, preserves order     | Misleading for nominal data                     |
| **Target**    | High-cardinality nominal | Captures target information | Risk of data leakage                            |
| **Frequency** | High-cardinality nominal | Captures distribution info  | Limited interpretability; equal frequencies collide |
| **Binary**    | Medium-cardinality       | Reduces dimensions          | Bit columns are hard to interpret               |
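
The dimensionality trade-off in the table can be made concrete by counting output columns for a feature with `k` categories. The helper `encoded_width` below is hypothetical, and the binary count assumes integer codes start at 1, as in the Binary Encoding example in this notebook.

```python
def encoded_width(n_categories: int) -> dict:
    """Number of output columns each encoding produces for one feature."""
    return {
        "one_hot": n_categories,
        "one_hot_drop_first": n_categories - 1,
        # Bits needed to write the largest code when codes run from 1 to n_categories
        "binary": n_categories.bit_length(),
        "label": 1,
        "target": 1,
        "frequency": 1,
    }


for k in (3, 50, 1000):
    print(k, encoded_width(k))
```

For 1000 categories, One-Hot Encoding produces 1000 columns while Binary Encoding needs 10, which is why the table recommends One-Hot only for features with few categories.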

---

## **4. Implementation on Synthetic Dataset**

### **Dataset**:

In [13]:
data = {
    "Nominal_Category": ["Red", "Blue", "Green", "Blue", "Red", "Green", "Red"],
    "Ordinal_Category": ["Low", "Medium", "High", "Medium", "Low", "High", "Medium"],
    "Target": [1, 0, 1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)

### **Step-by-Step Encoding**:
1. **One-Hot Encoding** for Nominal_Category.
2. **Label Encoding** for Ordinal_Category.
3. **Target Encoding** for Nominal_Category based on Target.
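
The three steps could be carried out as follows. This is a minimal sketch using pandas only; the output column names (`Nominal_Green`, `Ordinal_Encoded`, `Nominal_Target_Encoded`) are illustrative, and the target means are computed on the full dataset purely for demonstration, which would leak target information in a real modelling workflow.

```python
import pandas as pd

df = pd.DataFrame({
    "Nominal_Category": ["Red", "Blue", "Green", "Blue", "Red", "Green", "Red"],
    "Ordinal_Category": ["Low", "Medium", "High", "Medium", "Low", "High", "Medium"],
    "Target": [1, 0, 1, 0, 1, 0, 1],
})

# 1. One-Hot Encoding for Nominal_Category ("Blue" is dropped as the first sorted category)
one_hot = pd.get_dummies(
    df["Nominal_Category"], prefix="Nominal", drop_first=True, dtype=int
)

# 2. Label Encoding for Ordinal_Category with an explicit order
order = {"Low": 1, "Medium": 2, "High": 3}
ordinal = df["Ordinal_Category"].map(order).rename("Ordinal_Encoded")

# 3. Target Encoding for Nominal_Category (full-data means, for illustration only)
target_mean = df.groupby("Nominal_Category")["Target"].mean()
target_enc = df["Nominal_Category"].map(target_mean).rename("Nominal_Target_Encoded")

encoded_df = pd.concat([df, one_hot, ordinal, target_enc], axis=1)
print(encoded_df)
```

In a real pipeline, step 3 would be fitted on training folds only, for the leakage reasons discussed in the Target Encoding section.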

---

## **5. Summary**

- Encoding categorical variables is crucial for preparing data for machine learning.
- The choice of encoding depends on the type of variable (nominal vs. ordinal) and on how many distinct categories it has (its cardinality).
- Always consider trade-offs like dimensionality and data leakage when selecting an encoding method.

---