### **Challenge 5: Encoding Categorical Variables with LabelEncoder and OneHotEncoder**

**Topic:** Encoding Categorical Variables (LabelEncoder, OneHotEncoder)  

---

### **Problem Description**
You are given a dataset with **categorical features** and a **categorical target variable**. Your task is to:
1. **Apply One-Hot Encoding** to categorical features (`X`).
2. **Apply Label Encoding** to the target variable (`y`).

Your function should return **two transformed objects**, packed in a dictionary:
- A dataset where **categorical features** are **one-hot encoded**.
- A **separate series with the target variable** label-encoded.

---

### **Function Signature**
```python
def encode_features_and_target(data: pd.DataFrame, target_column: str) -> dict:
    """
    Converts categorical features into numerical form using One-Hot Encoding (OHE) 
    and encodes the target variable using Label Encoding.

    Args:
    data (pd.DataFrame): The input dataset containing categorical features and a categorical target.
    target_column (str): The name of the target column.

    Returns:
    dict: A dictionary containing:
          - 'features_encoded': DataFrame with one-hot encoding applied to categorical features.
          - 'target_encoded': Series with label encoding applied to the target column.
    """
```

---

### **Constraints**
1. The dataset contains **categorical features (`X`)** and a **categorical target (`y`)**.
2. **Only categorical features should be one-hot encoded**; numeric features pass through unchanged.
3. The **target column should be label-encoded**.
4. If a categorical feature **only has two unique values**, use **Label Encoding instead of One-Hot Encoding** to reduce dimensionality.

---

### **Example Input**
```python
import pandas as pd

data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],  # Categorical feature
    'Size': ['Small', 'Large', 'Medium', 'Large', 'Small'],  # Categorical feature
    'Category': ['Dog', 'Cat', 'Bird', 'Dog', 'Cat']  # Target column (classification labels)
})

target_column = 'Category'
```

---

### **Expected Output**
#### **1️⃣ One-Hot Encoded Features (`X`)**
```plaintext
   Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0          0           0         1          0           0          1
1          1           0         0          1           0          0
2          0           1         0          0           1          0
3          0           0         1          1           0          0
4          1           0         0          0           0          1
```
`OneHotEncoder` returns floats by default, so the values may appear as `0.0`/`1.0`; pass `dtype=int` to the encoder if integer output is required.

#### **2️⃣ Label Encoded Target (`y`)**
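```plaintext
0    2
1    1
2    0
3    2
4    1
Name: Category, dtype: int64
```
Here, the **target labels** have been **converted to integers** in alphabetical order (`Bird=0, Cat=1, Dog=2`), because `LabelEncoder` sorts the class labels before assigning codes. The exact integer dtype (`int64` or `int32`) can vary by platform.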
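
`LabelEncoder` assigns codes by the sorted order of the class labels, not by their order of appearance. A minimal sketch of how the learned mapping could be inspected through the encoder's `classes_` attribute and reversed with `inverse_transform` (the variable names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

y = pd.Series(['Dog', 'Cat', 'Bird', 'Dog', 'Cat'], name='Category')

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# classes_ holds the sorted labels; each label's position is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)             # {'Bird': 0, 'Cat': 1, 'Dog': 2}
print(y_encoded.tolist())  # [2, 1, 0, 2, 1]

# inverse_transform recovers the original labels from the codes
decoded = le.inverse_transform(y_encoded).tolist()
print(decoded)             # ['Dog', 'Cat', 'Bird', 'Dog', 'Cat']
```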

---

### **Hints**
1. Use `select_dtypes` to identify **categorical feature columns** (`X`).
2. Use **`OneHotEncoder`** for features **with more than two categories**.
3. Use **`LabelEncoder`** for the target column (`y`).
4. If a categorical feature **has exactly 2 unique values**, apply **Label Encoding instead of One-Hot Encoding**.
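
The first and last hints can be sketched together as follows. This is one possible approach, not the reference solution: it assumes every non-numeric column is categorical (hence `exclude=['number']`), and the `InStock` column is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'InStock': ['Yes', 'No', 'Yes', 'Yes', 'No'],   # hypothetical binary feature
    'Price': [10, 20, 15, 25, 30],
    'Category': ['Dog', 'Cat', 'Bird', 'Dog', 'Cat'],
})
target_column = 'Category'

# Drop the target first so it can never be mistaken for a feature
features = df.drop(columns=[target_column])

# Treat every non-numeric column as categorical (an assumption of this sketch)
categorical_cols = features.select_dtypes(exclude=['number']).columns.tolist()

# Split by cardinality: exactly two unique values vs. more than two
binary_cols = [c for c in categorical_cols if features[c].nunique() == 2]
multi_cols = [c for c in categorical_cols if features[c].nunique() > 2]

print(categorical_cols)  # ['Color', 'InStock']
print(binary_cols)       # ['InStock']
print(multi_cols)        # ['Color']
```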

---

# Solution

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

def encode_features_and_target(data: pd.DataFrame, target_column: str) -> dict:
    """
    Converts categorical features into numerical form using One-Hot Encoding (OHE) 
    and encodes the target variable using Label Encoding.

    Args:
    data (pd.DataFrame): The input dataset containing categorical features and a categorical target.
    target_column (str): The name of the target column.

    Returns:
    dict: A dictionary containing:
          - 'features_encoded': DataFrame with one-hot encoding applied to categorical features.
          - 'target_encoded': Series with label encoding applied to the target column.
    """
    # Step 1: Identify categorical feature columns (excluding target column)
    categorical_cols = data.select_dtypes(include=['object']).columns.tolist()
    if target_column in categorical_cols:  # Ensure target is not treated as a feature
        categorical_cols.remove(target_column)  # Guarded: a non-object target is not in the list

    # Step 2: Apply One-Hot Encoding to categorical features
    ohe = OneHotEncoder(sparse_output=False, drop='if_binary')  # Drop one column for binary features
    ohe_encoded = ohe.fit_transform(data[categorical_cols])

    # Convert to DataFrame with proper column names
    ohe_feature_names = ohe.get_feature_names_out(categorical_cols)
    ohe_df = pd.DataFrame(ohe_encoded, columns=ohe_feature_names)

    # Step 3: Apply Label Encoding to the target column
    le = LabelEncoder()
    target_encoded = le.fit_transform(data[target_column])

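    # Step 4: Concatenate One-Hot Encoded features with existing numeric features
    # (drop the target first so a numeric target cannot leak into the features)
    numeric_cols = data.drop(columns=[target_column]).select_dtypes(include=['number']).columns.tolist()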
    final_features = pd.concat([data[numeric_cols].reset_index(drop=True), ohe_df], axis=1)

    return {
        'features_encoded': final_features,
        'target_encoded': pd.Series(target_encoded, name=target_column)
    }

# Example Execution

In [2]:
data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],  # Categorical feature
    'Size': ['Small', 'Large', 'Medium', 'Large', 'Small'],  # Categorical feature
    'Price': [10, 20, 15, 25, 30],  # Numeric feature (remains unchanged)
    'Category': ['Dog', 'Cat', 'Bird', 'Dog', 'Cat']  # Target (classification labels)
})

target_column = 'Category'

In [3]:
result = encode_features_and_target(data, target_column)
result['features_encoded']

   Price  Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0     10         0.0          0.0        1.0         0.0          0.0         1.0
1     20         1.0          0.0        0.0         1.0          0.0         0.0
2     15         0.0          1.0        0.0         0.0          1.0         0.0
3     25         0.0          0.0        1.0         1.0          0.0         0.0
4     30         1.0          0.0        0.0         0.0          0.0         1.0


In [4]:
result['target_encoded']

0    2
1    1
2    0
3    2
4    1
Name: Category, dtype: int32

# Alternative Solution 1 (modular approach for data science)

In [5]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

def get_categorical_features(data: pd.DataFrame, target_column: str):
    """Identifies categorical features, excluding the target column."""
    categorical_cols = data.select_dtypes(include=['object']).columns.tolist()
    if target_column in categorical_cols:
        categorical_cols.remove(target_column)
    return categorical_cols

def one_hot_encode_features(data: pd.DataFrame, categorical_cols: list) -> pd.DataFrame:
    """Applies One-Hot Encoding to categorical features."""
    ohe = OneHotEncoder(sparse_output=False, drop='if_binary')
    ohe_encoded = ohe.fit_transform(data[categorical_cols])
    
    # Convert to DataFrame with proper column names
    ohe_feature_names = ohe.get_feature_names_out(categorical_cols)
    return pd.DataFrame(ohe_encoded, columns=ohe_feature_names)

def label_encode_target(data: pd.DataFrame, target_column: str) -> pd.Series:
    """Encodes the target variable using Label Encoding."""
    le = LabelEncoder()
    return pd.Series(le.fit_transform(data[target_column]), name=target_column)

def encode_features_and_target(data: pd.DataFrame, target_column: str) -> dict:
    """
    Converts categorical features into numerical form using One-Hot Encoding (OHE) 
    and encodes the target variable using Label Encoding.

    Args:
    data (pd.DataFrame): The input dataset containing categorical features and a categorical target.
    target_column (str): The name of the target column.

    Returns:
    dict: A dictionary containing:
          - 'features_encoded': DataFrame with one-hot encoding applied to categorical features.
          - 'target_encoded': Series with label encoding applied to the target column.
    """
    categorical_cols = get_categorical_features(data, target_column)
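    # Drop the target first so a numeric target cannot leak into the features
    numeric_cols = data.drop(columns=[target_column]).select_dtypes(include=['number']).columns.tolist()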
    
    ohe_df = one_hot_encode_features(data, categorical_cols)
    target_encoded = label_encode_target(data, target_column)

    # Combine numeric and one-hot encoded features
    final_features = pd.concat([data[numeric_cols].reset_index(drop=True), ohe_df], axis=1)

    return {
        'features_encoded': final_features,
        'target_encoded': target_encoded
    }

In [6]:
result = encode_features_and_target(data, target_column)
result['features_encoded']

   Price  Color_Blue  Color_Green  Color_Red  Size_Large  Size_Medium  Size_Small
0     10         0.0          0.0        1.0         0.0          0.0         1.0
1     20         1.0          0.0        0.0         1.0          0.0         0.0
2     15         0.0          1.0        0.0         0.0          1.0         0.0
3     25         0.0          0.0        1.0         1.0          0.0         0.0
4     30         1.0          0.0        0.0         0.0          0.0         1.0


In [7]:
result['target_encoded']

0    2
1    1
2    0
3    2
4    1
Name: Category, dtype: int32

### **🚀 Why This is Better for Data Science**
✅ **Modular functions** for encoding features and targets.  
✅ **Encapsulation of logic** into separate functions.  
✅ **Improved reusability** (e.g., you can use `one_hot_encode_features()` for other datasets).  
✅ **Easier testing & debugging**.  

# Alternative Solution 2 (production-ready approach)

In [8]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

class CategoricalEncoder:
    """Encodes categorical features using One-Hot Encoding and encodes the target variable using Label Encoding."""

    def __init__(self, target_column: str):
        self.target_column = target_column
        self.ohe = OneHotEncoder(sparse_output=False, drop='if_binary')
        self.le = LabelEncoder()
        self.fitted = False  # Track if encoder is already fitted
        self.categorical_cols = None  # Categorical feature columns seen during fit
        self.numeric_cols = None  # Numeric feature columns seen during fit
        self.ohe_feature_names = None  # Store feature names after fitting

    def fit(self, data: pd.DataFrame):
        """Fits One-Hot Encoder on categorical features and Label Encoder on the target variable."""
        features = data.drop(columns=[self.target_column])  # Ensure target is not treated as a feature
        self.categorical_cols = features.select_dtypes(include=['object']).columns.tolist()
        self.numeric_cols = features.select_dtypes(include=['number']).columns.tolist()

        self.ohe.fit(data[self.categorical_cols])
        self.ohe_feature_names = self.ohe.get_feature_names_out(self.categorical_cols)  # Save feature names
        self.le.fit(data[self.target_column])
        self.fitted = True
        return self  # Allow chaining, e.g. encoder.fit(train).transform(test)

    def transform(self, data: pd.DataFrame):
        """Transforms categorical features and target using fitted encoders."""
        if not self.fitted:
            raise ValueError("Encoders must be fitted before transforming data.")

        # Reuse the columns recorded during fit so the output layout stays consistent
        ohe_encoded = self.ohe.transform(data[self.categorical_cols])
        ohe_df = pd.DataFrame(ohe_encoded, columns=self.ohe_feature_names)

        # Apply Label Encoding to target
        target_encoded = self.le.transform(data[self.target_column])

        # Preserve numeric features (the target is excluded even if it is numeric)
        final_features = pd.concat([data[self.numeric_cols].reset_index(drop=True), ohe_df], axis=1)

        return {
            'features_encoded': final_features,
            'target_encoded': pd.Series(target_encoded, name=self.target_column)
        }

    def fit_transform(self, data: pd.DataFrame):
        """Fits the encoders and transforms the data."""
        self.fit(data)
        return self.transform(data)

### **🚀 Why This is Better for Production**
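✅ **Encapsulates logic in a class (`CategoricalEncoder`)**  
✅ **Allows `fit()` and `transform()` separately**, which helps prevent data leakage when `fit()` is called on the training data only.  
✅ **Ensures feature names remain consistent** between fitting and later transforms, useful for handling real-time data.  
✅ **Mirrors scikit-learn's `fit`/`transform` interface.** To plug directly into a scikit-learn `Pipeline`, the class would also need to follow the transformer API (for example, return an array or DataFrame rather than a dict).  
✅ **Reusable**: a fitted instance can be applied to any dataset with the same columns. Note that categories unseen during `fit()` raise an error unless `handle_unknown` is configured on the `OneHotEncoder`.  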

---

## **🔹 How Encoding Works in the Real World**
1️⃣ **One-Hot Encoding (`OneHotEncoder`)**  
   - Used for **categorical features** (independent variables, `X`) in ML models.
   - It **prevents the model from learning unintended ordinal relationships**.
   - Example: `'Red'`, `'Blue'`, `'Green'` should not have an ordinal relationship (so we OHE them).

2️⃣ **Label Encoding (`LabelEncoder`)**  
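   - Mostly used for **target labels (dependent variable, `y`)** in classification tasks.
   - Converts categorical class labels (`"Bird", "Cat", "Dog"`) into numbers (`0, 1, 2`), assigned in sorted (alphabetical) order.
   - Not a good fit for ordinal features: alphabetical sorting gives `'Large'=0, 'Medium'=1, 'Small'=2`, which does not match `'Small' < 'Medium' < 'Large'`. For features with ordinal meaning, use `OrdinalEncoder` with an explicit category order instead.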
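
Because `LabelEncoder` sorts labels alphabetically, it has no notion that `'Small' < 'Medium' < 'Large'`. A minimal sketch, assuming scikit-learn's `OrdinalEncoder` with its `categories` parameter is used for ordinal features (the solutions in this challenge do not use it), contrasts the two encodings:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = pd.DataFrame({'Size': ['Small', 'Large', 'Medium', 'Large', 'Small']})

# LabelEncoder sorts alphabetically: Large=0, Medium=1, Small=2
alphabetical = LabelEncoder().fit_transform(sizes['Size'])
print(alphabetical.tolist())  # [2, 0, 1, 0, 2]

# OrdinalEncoder accepts an explicit category order for each feature
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
ordered = encoder.fit_transform(sizes[['Size']]).ravel()
print(ordered.tolist())       # [0.0, 2.0, 1.0, 2.0, 0.0]
```

With the explicit order, larger codes correspond to larger sizes, which is what a model needs if it is to treat the feature as ordinal.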

---