# Module 02: Encoding Categorical Variables

**Difficulty**: ⭐ Beginner  
**Estimated Time**: 60 minutes  
**Prerequisites**: [Module 01: Handling Missing Data](01_handling_missing_data.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand why ML algorithms need numeric features
2. Apply one-hot encoding for nominal categories
3. Use ordinal encoding for ordered categories and recognize the limits of label encoding
4. Implement target encoding for high-cardinality features
5. Handle the curse of dimensionality with categorical variables
6. Choose the right encoding method for your data

## 1. Why Encode Categorical Variables?

**Problem**: Most machine learning algorithms operate on numeric arrays, so categories stored as text must be converted to numbers first!

**Examples of categorical variables**:
- **Nominal** (no order): Colors (red, blue, green), Cities (New York, London, Tokyo)
- **Ordinal** (ordered): Education (High School < Bachelor < Master < PhD), Size (Small < Medium < Large)
- **Binary**: Yes/No, True/False, Male/Female

### The Wrong Way

```python
# ❌ DON'T DO THIS
cities = ['New York', 'London', 'Tokyo', 'Paris']
encoded = [1, 2, 3, 4]  # Label encoding for nominal data
```

**Problem**: This implies Tokyo (3) > London (2), which is meaningless!
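
To make the false ordering concrete, the sketch below compares distances between cities under integer codes and under one-hot vectors. It reuses the illustrative city list from the snippet above; the variable names are chosen here for illustration only.

```python
import numpy as np

cities = ['New York', 'London', 'Tokyo', 'Paris']

# Integer codes: distances depend on the arbitrary numbering
int_codes = {city: i + 1 for i, city in enumerate(cities)}
d_ny_london = abs(int_codes['New York'] - int_codes['London'])
d_ny_paris = abs(int_codes['New York'] - int_codes['Paris'])

# One-hot vectors: every pair of distinct cities is equally far apart
one_hot = {city: np.eye(len(cities))[i] for i, city in enumerate(cities)}
oh_ny_london = np.linalg.norm(one_hot['New York'] - one_hot['London'])
oh_ny_paris = np.linalg.norm(one_hot['New York'] - one_hot['Paris'])

print(f"Integer codes: NY-London = {d_ny_london}, NY-Paris = {d_ny_paris}")
print(f"One-hot:       NY-London = {oh_ny_london:.3f}, NY-Paris = {oh_ny_paris:.3f}")
```

With integer codes, a distance-based or linear model would treat Paris as three times further from New York than London is, purely because of the numbering.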

### The Right Way

Choose encoding based on:
1. **Type of category** (nominal vs ordinal)
2. **Number of unique values** (cardinality)
3. **Algorithm type** (tree-based vs linear)
4. **Relationship to target** (predictive power)

## 2. Setup

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Encoding methods
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from category_encoders import TargetEncoder, BinaryEncoder
import category_encoders as ce

# Model evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)
pd.set_option('display.max_columns', None)

print("✓ Libraries imported successfully!")

## 3. Create Sample Dataset

Let's create a realistic customer dataset for predicting product purchases.

In [None]:
# Set seed for reproducibility
np.random.seed(42)
n_samples = 1000

# Define categories
cities = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 
          'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'Austin',
          'San Jose', 'Fort Worth', 'Columbus', 'Charlotte', 'Indianapolis']

education_levels = ['High School', 'Associate', 'Bachelor', 'Master', 'PhD']
product_categories = ['Electronics', 'Clothing', 'Books', 'Home & Garden', 'Sports']
membership_tiers = ['Bronze', 'Silver', 'Gold', 'Platinum']

# Create dataset
customer_data = pd.DataFrame({
    'age': np.random.randint(18, 80, n_samples),
    'city': np.random.choice(cities, n_samples),
    'education': np.random.choice(education_levels, n_samples),
    'product_category': np.random.choice(product_categories, n_samples),
    'membership_tier': np.random.choice(membership_tiers, n_samples),
    'annual_income': np.random.normal(60000, 25000, n_samples).clip(15000, 200000),
    'purchase_amount': np.random.normal(500, 200, n_samples).clip(10, 2000)
})

# Create target variable with some logic
# Higher education, income, and certain cities increase purchase probability
purchase_prob = (
    0.2 +  # Base probability
    0.1 * (customer_data['education'].map({'High School': 0, 'Associate': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}) / 4) +
    0.3 * ((customer_data['annual_income'] - customer_data['annual_income'].min()) / 
           (customer_data['annual_income'].max() - customer_data['annual_income'].min())) +
    0.15 * (customer_data['city'].isin(['New York', 'Los Angeles'])).astype(int) +  # only cities present in `cities`
    np.random.normal(0, 0.1, n_samples)
)

customer_data['will_purchase'] = (purchase_prob > 0.5).astype(int)

print(f"Created dataset with {len(customer_data)} customers")
print(f"\nFeature types:")
print(customer_data.dtypes)
print(f"\nFirst few rows:")
customer_data.head()

In [None]:
# Analyze categorical variables
categorical_cols = ['city', 'education', 'product_category', 'membership_tier']

print("Categorical Variable Analysis:\n")
for col in categorical_cols:
    n_unique = customer_data[col].nunique()
    print(f"{col}:")
    print(f"  Unique values: {n_unique}")
    print(f"  Type: {'High-cardinality' if n_unique >= 10 else 'Low-cardinality'}")
    print(f"  Sample values: {customer_data[col].unique()[:5].tolist()}")
    print()

## 4. Method 1: One-Hot Encoding

**Best for**: Nominal categories with low cardinality (<10 unique values)

**How it works**: Create a binary column for each category

**Example**:
```
Color        → Color_Red  Color_Blue  Color_Green
Red          →     1          0           0
Blue         →     0          1           0
Green        →     0          0           1
```

**Pros**: 
- No ordinal assumption
- Works well with linear models

**Cons**: 
- Increases dimensionality
- Can trigger the curse of dimensionality with high-cardinality features

In [None]:
# One-hot encode product_category (only 5 unique values)
product_encoded = pd.get_dummies(customer_data['product_category'], prefix='product')

print("Original product_category column:")
print(customer_data['product_category'].head())
print("\nOne-hot encoded:")
print(product_encoded.head())
print(f"\nOriginal: 1 column")
print(f"Encoded: {len(product_encoded.columns)} columns")

In [None]:
# Using sklearn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

# IMPORTANT: Fit on training data only to avoid data leakage!
# For this demo, we'll show the mechanics first

encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop='first' avoids multicollinearity

# Encode multiple columns
categorical_features = ['product_category', 'membership_tier']
encoded_array = encoder.fit_transform(customer_data[categorical_features])

# Get feature names
feature_names = encoder.get_feature_names_out(categorical_features)

encoded_df = pd.DataFrame(encoded_array, columns=feature_names, index=customer_data.index)

print("Encoded features:")
print(encoded_df.head())
print(f"\nShape: {encoded_df.shape}")
print("Note: drop='first' removes one column per feature to avoid multicollinearity")

## 5. Method 2: Label Encoding

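**Best for**: Target labels, or a quick integer encoding of features for tree-based models

**How it works**: Map each category to an integer, assigned in sorted (alphabetical) order

**Example**:
```
Education    → Encoded
Associate    →    0
Bachelor     →    1
High School  →    2
Master       →    3
PhD          →    4
```

**Pros**: 
- Simple, no new columns
- Compact representation that tree-based models can split on

**Cons**: 
- ❌ DON'T use for nominal data with linear models (creates false ordering)
- Integer order is alphabetical, so it rarely matches a true ordinal order
- Assumes equal spacing between categories
- scikit-learn's `LabelEncoder` is designed for the target `y`; prefer `OrdinalEncoder` for feature columns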

In [None]:
# Label encoding (automatic ordering - not recommended for education!)
label_encoder = LabelEncoder()
education_label = label_encoder.fit_transform(customer_data['education'])

print("Label Encoding (automatic):")
print(pd.DataFrame({
    'Original': customer_data['education'][:10],
    'Encoded': education_label[:10]
}))
print(f"\nMapping: {dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))}")
print("\n⚠️ Problem: Order is alphabetical, not meaningful!")
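
scikit-learn documents `LabelEncoder` as a transformer for target values rather than input features. A small sketch of that intended use, with hypothetical class labels, could be:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels for a classification target
y_raw = ['churn', 'stay', 'stay', 'churn', 'stay']

le = LabelEncoder()
y_encoded = le.fit_transform(y_raw)          # classes are sorted, then numbered from 0
y_decoded = le.inverse_transform(y_encoded)  # map integers back to the original labels

print(list(le.classes_))
print(y_encoded.tolist())
print(list(y_decoded))
```

For a target, the arbitrary integer order is harmless because classifiers treat the values as class identifiers, not magnitudes.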

## 6. Method 3: Ordinal Encoding

**Best for**: Ordinal categories where YOU define the order

**How it works**: Map categories to integers with custom ordering

In [None]:
# Ordinal encoding with custom order
education_order = ['High School', 'Associate', 'Bachelor', 'Master', 'PhD']
membership_order = ['Bronze', 'Silver', 'Gold', 'Platinum']

ordinal_encoder = OrdinalEncoder(
    categories=[education_order, membership_order]
)

ordinal_features = customer_data[['education', 'membership_tier']]
ordinal_encoded = ordinal_encoder.fit_transform(ordinal_features)

print("Ordinal Encoding (custom order):")
result_df = pd.DataFrame({
    'education': customer_data['education'][:10],
    'education_encoded': ordinal_encoded[:10, 0],
    'membership': customer_data['membership_tier'][:10],
    'membership_encoded': ordinal_encoded[:10, 1]
})
print(result_df)

print("\n✓ Correct: PhD (4) > Bachelor (2) > High School (0)")
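
By default `OrdinalEncoder` raises an error when it meets a category it was not given. One way to handle that, sketched below under the assumption that `-1` is an acceptable sentinel for "unknown", is the `handle_unknown='use_encoded_value'` option. The `'Bootcamp'` value is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

education_order = ['High School', 'Associate', 'Bachelor', 'Master', 'PhD']

# Unknown categories are mapped to -1 instead of raising an error
enc = OrdinalEncoder(
    categories=[education_order],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)

train = pd.DataFrame({'education': ['Bachelor', 'PhD', 'High School']})
new_rows = pd.DataFrame({'education': ['Master', 'Bootcamp']})  # 'Bootcamp' is hypothetical

enc.fit(train)
encoded = enc.transform(new_rows)
print(encoded.ravel().tolist())
```

Whether `-1` is a sensible value depends on the model: tree-based models can isolate it with a split, while a linear model would read it as "below High School".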

## 7. Method 4: Target Encoding

**Best for**: High-cardinality nominal features (many unique values)

**How it works**: Replace category with mean of target variable for that category

**Example**: 
```
City         Purchase_Rate    Encoded
New York     0.65        →    0.65
Houston      0.42        →    0.42
```

**Pros**: 
- Captures relationship to target
- Handles high-cardinality well
- No dimensionality increase

**Cons**: 
- Can cause overfitting
- Requires careful cross-validation
- Leaks target information into the features (mitigate with smoothing and out-of-fold encoding)
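
To illustrate what smoothing does, the sketch below blends each category mean with the global mean using a simple additive formula. This is an assumption chosen for clarity, not the exact weighting used by `category_encoders.TargetEncoder`; the toy data and the strength `m` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 'Austin' has a single row, so its raw mean is unreliable
df = pd.DataFrame({
    'city': ['New York'] * 4 + ['Houston'] * 3 + ['Austin'],
    'will_purchase': [1, 1, 1, 0, 0, 1, 0, 1],
})

m = 5  # assumed smoothing strength: acts like m "pseudo-rows" at the global mean
global_mean = df['will_purchase'].mean()
stats = df.groupby('city')['will_purchase'].agg(['count', 'mean'])

# Blend each category mean with the global mean, weighted by category size
stats['smoothed'] = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
print(stats)
```

Categories with few rows are pulled strongly toward the global mean, which limits how much a single lucky or unlucky observation can influence the encoded value.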

In [None]:
# Demonstrate target encoding concept
# Calculate purchase rate by city
city_purchase_rate = customer_data.groupby('city')['will_purchase'].mean().sort_values(ascending=False)

print("Purchase rate by city (what target encoding captures):")
print(city_purchase_rate)

# Visualize
plt.figure(figsize=(12, 5))
city_purchase_rate.plot(kind='barh')
plt.xlabel('Purchase Rate')
plt.title('Target Encoding: Each City Encoded by Its Purchase Rate')
plt.tight_layout()
plt.show()

print("\nNotice: Cities with higher purchase rates get higher encoded values!")

In [None]:
# Proper target encoding with train/test split
# CRITICAL: Fit on training data only!

X = customer_data.drop('will_purchase', axis=1)
y = customer_data['will_purchase']

# Split data first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Target encode 'city' (high-cardinality feature)
target_encoder = TargetEncoder(cols=['city'], smoothing=1.0)  # smoothing prevents overfitting

# Fit on training data with training target
X_train_encoded = target_encoder.fit_transform(X_train, y_train)

# Transform test data (uses training statistics)
X_test_encoded = target_encoder.transform(X_test)

print("Target Encoding applied to 'city':")
print("\nTraining set - original vs encoded:")
comparison = pd.DataFrame({
    'city_original': X_train['city'].values[:10],
    'city_encoded': X_train_encoded['city'].values[:10]
})
print(comparison)
print("\n✓ Each city is replaced by its smoothed average purchase rate, computed on the training set only")
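
Even with a train/test split, each training row's encoded value still includes its own target. A common remedy is out-of-fold encoding, sketched here with plain pandas and `KFold` on hypothetical random data. It uses unsmoothed means to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Hypothetical training frame with a categorical column and a binary target
train = pd.DataFrame({
    'city': rng.choice(['New York', 'Houston', 'Austin'], size=30),
    'will_purchase': rng.integers(0, 2, size=30),
})

global_mean = train['will_purchase'].mean()
oof = pd.Series(np.nan, index=train.index, name='city_te')

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(train):
    # Category means come from the other folds only
    fold_means = train.iloc[fit_idx].groupby('city')['will_purchase'].mean()
    encoded = train.iloc[enc_idx]['city'].map(fold_means)
    # Categories absent from the fitting folds fall back to the global mean
    oof.iloc[enc_idx] = encoded.fillna(global_mean).to_numpy()

train['city_te'] = oof
print(train.head())
```

Test rows would then be encoded with means computed on the full training set, as in the cell above.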

## 8. Comparing Encoding Methods on Model Performance

Let's compare how different encoding methods affect model accuracy.

In [None]:
# Prepare data for comparison
X_base = customer_data[['age', 'annual_income', 'purchase_amount']].copy()
y = customer_data['will_purchase']

# Function to prepare dataset with different encodings
def prepare_encoded_data(encoding_method):
    X = X_base.copy()
    
    if encoding_method == 'one_hot':
        # One-hot encode all categorical features
        for col in ['city', 'education', 'product_category', 'membership_tier']:
            dummies = pd.get_dummies(customer_data[col], prefix=col, drop_first=True)
            X = pd.concat([X, dummies], axis=1)
    
    elif encoding_method == 'label':
        # Label encoding (not ideal for nominal features, but let's compare)
        for col in ['city', 'education', 'product_category', 'membership_tier']:
            le = LabelEncoder()
            X[col] = le.fit_transform(customer_data[col])
    
    elif encoding_method == 'ordinal':
        # Ordinal for ordered features, label for others
        X['education'] = OrdinalEncoder(categories=[education_levels]).fit_transform(
            customer_data[['education']]
        )
        X['membership_tier'] = OrdinalEncoder(categories=[membership_order]).fit_transform(
            customer_data[['membership_tier']]
        )
        # Label encode nominal features
        for col in ['city', 'product_category']:
            le = LabelEncoder()
            X[col] = le.fit_transform(customer_data[col])
    
    elif encoding_method == 'target':
        # Target encoding for high-cardinality 'city', label encoding for the others
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        
        # Add categorical columns back
        for col in ['city', 'education', 'product_category', 'membership_tier']:
            X_train[col] = customer_data.loc[X_train.index, col]
            X_test[col] = customer_data.loc[X_test.index, col]
        
        # Target encode high-cardinality 'city'
        te = TargetEncoder(cols=['city'], smoothing=1.0)
        X_train = te.fit_transform(X_train, y_train)
        X_test = te.transform(X_test)
        
        # Label encode others
        for col in ['education', 'product_category', 'membership_tier']:
            le = LabelEncoder()
            X_train[col] = le.fit_transform(X_train[col])
            X_test[col] = le.transform(X_test[col])
        
        return X_train, X_test, y_train, y_test
    
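    # For non-target encodings, do a standard split.
    # Note: these encoders were fit on all rows above for brevity. They never see the
    # target, but in a real project fit them on the training split only.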
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test

print("✓ Data preparation functions ready")

In [None]:
# Compare encoding methods
encoding_methods = ['one_hot', 'label', 'ordinal', 'target']
results = []

for method in encoding_methods:
    # Prepare data
    X_train, X_test, y_train, y_test = prepare_encoded_data(method)
    
    # Train Random Forest
    rf = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
    rf.fit(X_train, y_train)
    
    # Evaluate
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    results.append({
        'Encoding Method': method,
        'Accuracy': accuracy,
        'Num Features': X_train.shape[1]
    })

# Display results
results_df = pd.DataFrame(results).sort_values('Accuracy', ascending=False)
print("Model Performance by Encoding Method:\n")
print(results_df.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
axes[0].barh(results_df['Encoding Method'], results_df['Accuracy'])
axes[0].set_xlabel('Accuracy')
axes[0].set_title('Model Accuracy by Encoding Method')
axes[0].set_xlim([0.5, 1.0])
for i, v in enumerate(results_df['Accuracy']):
    axes[0].text(v + 0.01, i, f'{v:.3f}', va='center')

# Feature count comparison
axes[1].barh(results_df['Encoding Method'], results_df['Num Features'], color='coral')
axes[1].set_xlabel('Number of Features')
axes[1].set_title('Feature Dimensionality by Encoding Method')
for i, v in enumerate(results_df['Num Features']):
    axes[1].text(v + 0.5, i, f'{int(v)}', va='center')

plt.tight_layout()
plt.show()
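
A single train/test split can make small accuracy differences look meaningful. One way to get a steadier comparison is to wrap the encoder and model in a `Pipeline` and score it with `cross_val_score`. The sketch below uses a small hypothetical stand-in for `customer_data`, so the numbers it prints say nothing about the dataset above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(42)
n = 300
# Hypothetical stand-in for customer_data with one numeric and one categorical feature
demo = pd.DataFrame({
    'annual_income': rng.normal(60000, 25000, n),
    'city': rng.choice(['New York', 'Chicago', 'Austin'], n),
})
y_demo = (demo['annual_income'] + 20000 * (demo['city'] == 'New York') > 65000).astype(int)

preprocess = ColumnTransformer(
    [('city_ohe', OneHotEncoder(handle_unknown='ignore'), ['city'])],
    remainder='passthrough',  # numeric columns pass through unchanged
)
pipe = Pipeline([
    ('encode', preprocess),
    ('model', RandomForestClassifier(n_estimators=50, random_state=42)),
])

# The encoder is re-fit inside each fold, so validation rows never influence it
scores = cross_val_score(pipe, demo, y_demo, cv=5, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Swapping the encoder inside `ColumnTransformer` is then enough to compare methods under identical folds.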

## 9. Handling High-Cardinality Features

**Problem**: Features with 100s or 1000s of unique values

**Examples**: 
- ZIP codes (40,000+ in US)
- User IDs (millions)
- Product SKUs (thousands)

**Solutions**:
1. **Target encoding** (shown above)
2. **Frequency encoding**: Replace with category frequency
3. **Grouping**: Combine rare categories into "Other"
4. **Feature hashing**: Hash categories to fixed number of bins
5. **Embedding**: Learn dense representations (deep learning)
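
Feature hashing is listed above but not demonstrated in this notebook. A minimal sketch with scikit-learn's `FeatureHasher` follows; the ZIP codes are hypothetical and `n_features=8` is an arbitrary choice for readability.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical high-cardinality values; n_features=8 is an arbitrary choice
zip_codes = ['10001', '94105', '60601', '10001', '73301']

hasher = FeatureHasher(n_features=8, input_type='string')
# With input_type='string', each sample is an iterable of strings
hashed = hasher.transform([[z] for z in zip_codes]).toarray()

print(hashed.shape)
print(hashed)
```

The output width is fixed regardless of how many distinct values exist, at the cost of possible collisions where two categories share a column.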

In [None]:
# Frequency Encoding
def frequency_encoding(column):
    """Replace categories with their frequency of occurrence"""
    freq_map = column.value_counts(normalize=True).to_dict()
    return column.map(freq_map)

# Apply to city
city_freq_encoded = frequency_encoding(customer_data['city'])

print("Frequency Encoding:")
comparison = pd.DataFrame({
    'city': customer_data['city'][:15],
    'frequency': city_freq_encoded[:15]
})
print(comparison)
print("\nFrequent cities get higher values, rare cities get lower values")
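
The helper above computes frequencies on whatever column it receives. In a train/test workflow the frequencies would typically be learned from the training rows and reused, as in this sketch. The city values are hypothetical, and mapping unseen categories to `0.0` is only one possible convention.

```python
import pandas as pd

train_city = pd.Series(['Austin', 'Dallas', 'Austin', 'Houston', 'Austin'])
test_city = pd.Series(['Dallas', 'Austin', 'El Paso'])  # 'El Paso' never appears in training

# Learn frequencies from the training rows only
freq_map = train_city.value_counts(normalize=True)

# Unseen categories map to NaN; treating them as frequency 0 is one possible convention
test_encoded = test_city.map(freq_map).fillna(0.0)
print(test_encoded.tolist())
```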

In [None]:
# Grouping Rare Categories
def group_rare_categories(column, threshold=0.05):
    """Combine categories that appear less than threshold into 'Other'"""
    freq = column.value_counts(normalize=True)
    rare_categories = freq[freq < threshold].index
    return column.apply(lambda x: 'Other' if x in rare_categories else x)

city_grouped = group_rare_categories(customer_data['city'], threshold=0.05)

print(f"Original unique cities: {customer_data['city'].nunique()}")
print(f"After grouping rare cities: {city_grouped.nunique()}")
print(f"\nValue counts after grouping:")
print(city_grouped.value_counts())
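
The same train-only idea applies to grouping: decide which categories are frequent using the training rows, then send everything else, including values never seen before, to `'Other'`. The data and the threshold in this sketch are hypothetical.

```python
import pandas as pd

train_city = pd.Series(['Austin'] * 6 + ['Dallas'] * 3 + ['Waco'])
test_city = pd.Series(['Austin', 'Waco', 'Lubbock'])  # 'Lubbock' is unseen

threshold = 0.2  # illustrative; the cell above uses 0.05
freq = train_city.value_counts(normalize=True)
frequent = set(freq[freq >= threshold].index)  # categories kept as-is

# Anything not in the frequent set, including unseen values, becomes 'Other'
train_grouped = train_city.where(train_city.isin(frequent), 'Other')
test_grouped = test_city.where(test_city.isin(frequent), 'Other')
print(test_grouped.tolist())
```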

## 10. Best Practices and Decision Guide

### Encoding Decision Tree

```
Is the feature categorical?
├─ YES → Continue
└─ NO → No encoding needed

Is there a meaningful order?
├─ YES (Ordinal) → Use OrdinalEncoder with custom order
└─ NO (Nominal) → Continue

How many unique values?
├─ <10 (Low cardinality)
│   ├─ Linear Model → One-Hot Encoding
│   └─ Tree Model → Label Encoding or One-Hot
└─ ≥10 (High cardinality)
    ├─ Target Encoding (with smoothing)
    ├─ Frequency Encoding
    └─ Group rare categories
```
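
The guide above can be written as a small helper. The function name `suggest_encoding` and its arguments are hypothetical, and the returned strings simply restate the branches of the tree.

```python
def suggest_encoding(is_ordinal: bool, n_unique: int, model_type: str = 'linear') -> str:
    """Return a suggested encoding following the decision guide.

    `suggest_encoding` and its arguments are hypothetical names for illustration.
    """
    if is_ordinal:
        return 'ordinal (custom order)'
    if n_unique < 10:
        return 'one-hot' if model_type == 'linear' else 'label or one-hot'
    return 'target (with smoothing), frequency, or group rare categories'


print(suggest_encoding(is_ordinal=True, n_unique=5))
print(suggest_encoding(is_ordinal=False, n_unique=4, model_type='linear'))
print(suggest_encoding(is_ordinal=False, n_unique=15, model_type='tree'))
```

Treat the output as a starting point; validation scores on your own data should make the final call.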

### Critical Rules

✅ **DO**:
1. Split data BEFORE encoding
2. Fit encoder on training data only
3. Handle unseen categories in test set
4. Use one-hot with `drop='first'` for linear models
5. Use smoothing with target encoding

❌ **DON'T**:
1. Use label encoding for nominal features with linear models
2. Fit encoder on test data (data leakage!)
3. Forget to handle categories not seen in training
4. One-hot encode high-cardinality features
5. Use target encoding without cross-validation

## 11. Exercise Section

### Exercise 1: Choose the Right Encoding

For each feature, choose the most appropriate encoding method.

In [None]:
# Exercise 1: Match each feature to the best encoding method

features = {
    'A': 'T-shirt size: XS, S, M, L, XL, XXL',
    'B': 'Country: (195 different countries)',
    'C': 'Color: Red, Blue, Green, Yellow',
    'D': 'Customer satisfaction: Very Bad, Bad, Neutral, Good, Very Good',
    'E': 'Product ID: (50,000 unique products)'
}

encoding_options = {
    '1': 'One-Hot Encoding',
    '2': 'Label Encoding',
    '3': 'Ordinal Encoding (custom order)',
    '4': 'Target Encoding',
    '5': 'Frequency Encoding'
}

print("Features:")
for key, feature in features.items():
    print(f"{key}. {feature}")

print("\nEncoding Methods:")
for key, method in encoding_options.items():
    print(f"{key}. {method}")

print("\nYour answers (write as comments):")
# A: ?
# B: ?
# C: ?
# D: ?
# E: ?

In [None]:
# Solution to Exercise 1

print("Solutions:\n")
print("A: 3 - Ordinal Encoding (XS < S < M < L < XL < XXL - clear order)")
print("B: 4 - Target Encoding (195 countries = high cardinality, nominal)")
print("C: 1 - One-Hot Encoding (4 colors = low cardinality, nominal)")
print("D: 3 - Ordinal Encoding (Very Bad < Bad < Neutral < Good < Very Good)")
print("E: 4 or 5 - Target or Frequency Encoding (50k products = very high cardinality)")

print("\nKey insights:")
print("- Ordinal data (A, D): Use ordinal encoding to preserve order")
print("- Low-cardinality nominal (C): One-hot encoding works well")
print("- High-cardinality nominal (B, E): Target or frequency encoding to avoid dimensionality explosion")

### Exercise 2: Implement Custom Ordinal Encoding

Create a dataset with movie ratings and apply ordinal encoding.

In [None]:
# Exercise 2: Apply ordinal encoding to movie ratings

# Movie ratings dataset
movie_data = pd.DataFrame({
    'movie': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
    'rating': ['Poor', 'Excellent', 'Good', 'Fair', 'Good']
})

print("Movie ratings:")
print(movie_data)

# TODO: Create ordinal encoding for ratings
# Order: Poor < Fair < Good < Excellent
# Your code here:

# rating_order = ???
# encoder = ???
# movie_data['rating_encoded'] = ???

# print("\nEncoded ratings:")
# print(movie_data)

In [None]:
# Solution to Exercise 2

movie_data = pd.DataFrame({
    'movie': ['Movie A', 'Movie B', 'Movie C', 'Movie D', 'Movie E'],
    'rating': ['Poor', 'Excellent', 'Good', 'Fair', 'Good']
})

# Define order
rating_order = ['Poor', 'Fair', 'Good', 'Excellent']

# Apply ordinal encoding
encoder = OrdinalEncoder(categories=[rating_order])
movie_data['rating_encoded'] = encoder.fit_transform(movie_data[['rating']])

print("Solution:")
print(movie_data)
print("\nMapping:")
for i, rating in enumerate(rating_order):
    print(f"{rating}: {i}")
print("\n✓ Correct: Excellent (3) > Good (2) > Fair (1) > Poor (0)")

### Exercise 3: Detect and Fix Data Leakage

Find the data leakage problem in this code.

In [None]:
# Exercise 3: What's wrong with this code?

print("Code snippet:")
print('''
# Prepare data
X = data[['city', 'product_category']]
y = data['purchased']

# One-hot encode
X_encoded = pd.get_dummies(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y)

# Train model
model.fit(X_train, y_train)
''')

print("\nWhat's the problem? How would you fix it?")
# Your answer:

In [None]:
# Solution to Exercise 3

print("Problem: the encoding is learned from the full dataset before the split.")
print("\npd.get_dummies derives its columns from every row, including rows that")
print("end up in the test set, so the test set is no longer truly unseen.")
print("\nIt also breaks at prediction time: calling pd.get_dummies on new data that")
print("contains a new city (or lacks a known one) produces different columns.")
print("\nBetter approach:")
print('''
# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Use sklearn's OneHotEncoder which handles unseen categories
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)  # Same columns as training
''')
print("\n✓ This ensures consistent columns and handles unseen categories")
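
The column mismatch is easy to reproduce. The sketch below uses hypothetical city values and also shows a pandas-only workaround, aligning new data to the training columns with `reindex`, for cases where `OneHotEncoder` is not an option.

```python
import pandas as pd

train = pd.DataFrame({'city': ['Austin', 'Dallas', 'Austin']})
new_data = pd.DataFrame({'city': ['Dallas', 'El Paso']})  # 'El Paso' was not in training

train_dummies = pd.get_dummies(train, columns=['city'])
new_dummies = pd.get_dummies(new_data, columns=['city'])
print(list(train_dummies.columns))
print(list(new_dummies.columns))

# Align new data to the training columns: missing columns are filled with 0,
# columns unknown to the model are dropped
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(new_aligned)
```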

### Exercise 4: High-Cardinality Challenge

You have a dataset with a high-cardinality product ID feature (up to 500 distinct values). Compare different encoding strategies.

In [None]:
# Exercise 4: Handle high-cardinality feature

# Create dataset with up to 500 product categories
np.random.seed(42)
n_samples = 5000
n_categories = 500

# Some categories are more popular (Zipf distribution)
category_ids = np.random.zipf(1.5, n_samples) % n_categories

high_card_data = pd.DataFrame({
    'product_id': [f'PROD_{i:04d}' for i in category_ids],
    'price': np.random.uniform(10, 1000, n_samples),
    'quantity': np.random.randint(1, 10, n_samples)
})

# Target: high-value purchases
high_card_data['high_value'] = (high_card_data['price'] * high_card_data['quantity'] > 500).astype(int)

print(f"Dataset: {len(high_card_data)} samples")
print(f"Unique products: {high_card_data['product_id'].nunique()}")
print(f"\nProduct frequency distribution:")
print(high_card_data['product_id'].value_counts().head(10))

# TODO: Try different encoding strategies and compare
# 1. Frequency encoding
# 2. Grouping rare categories (threshold=0.01)
# 3. Target encoding
# Which works best?

# Your code here:

In [None]:
# Solution to Exercise 4

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X = high_card_data[['price', 'quantity', 'product_id']]
y = high_card_data['high_value']

results = []

# 1. Frequency encoding
X_freq = X.copy()
X_freq['product_id'] = frequency_encoding(X_freq['product_id'])
X_train, X_test, y_train, y_test = train_test_split(X_freq, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42, max_depth=10)
model.fit(X_train, y_train)
results.append({'Method': 'Frequency', 'Accuracy': accuracy_score(y_test, model.predict(X_test))})

# 2. Grouping rare categories
X_grouped = X.copy()
X_grouped['product_id'] = group_rare_categories(X_grouped['product_id'], threshold=0.01)
X_grouped = pd.get_dummies(X_grouped, columns=['product_id'], drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(X_grouped, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42, max_depth=10)
model.fit(X_train, y_train)
results.append({'Method': 'Grouped+OneHot', 'Accuracy': accuracy_score(y_test, model.predict(X_test))})

# 3. Target encoding
X_target = X[['price', 'quantity']].copy()
X_train, X_test, y_train, y_test = train_test_split(X_target, y, test_size=0.2, random_state=42)
X_train['product_id'] = X.loc[X_train.index, 'product_id']
X_test['product_id'] = X.loc[X_test.index, 'product_id']
te = TargetEncoder(cols=['product_id'], smoothing=1.0)
X_train = te.fit_transform(X_train, y_train)
X_test = te.transform(X_test)
model = RandomForestClassifier(random_state=42, max_depth=10)
model.fit(X_train, y_train)
results.append({'Method': 'Target', 'Accuracy': accuracy_score(y_test, model.predict(X_test))})

# Display results
results_df = pd.DataFrame(results).sort_values('Accuracy', ascending=False)
print("\nResults:")
print(results_df.to_string(index=False))
print("\nInsight: here the target depends only on price and quantity, so expect similar scores; on real data, target encoding often helps with high-cardinality features.")

## 12. Summary

### Key Takeaways

1. **Encoding converts categories to numbers** that ML algorithms can process
   - Critical preprocessing step
   - Choice of encoding affects model performance

2. **Main encoding methods**:
   - **One-Hot**: Nominal, low-cardinality (<10 values)
   - **Ordinal**: Ordered categories with custom ordering
   - **Label**: Only for tree-based models, not linear models
   - **Target**: High-cardinality, captures target relationship
   - **Frequency**: High-cardinality, simple alternative

3. **Decision factors**:
   - Nominal vs Ordinal
   - Cardinality (number of unique values)
   - Model type (linear vs tree-based)
   - Relationship to target

4. **Avoid data leakage**:
   - Split data first
   - Fit encoder on training data only
   - Handle unseen categories in test set

5. **High-cardinality strategies**:
   - Target encoding (with smoothing)
   - Frequency encoding
   - Group rare categories
   - Feature hashing

### What's Next?

**Module 03**: Feature Scaling and Normalization - Learn when and how to scale numeric features

### Additional Resources

- [Category Encoders Library](https://contrib.scikit-learn.org/category_encoders/)
- [Sklearn Preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html)
- "Categorical Encoding Methods" by Kaggle Learn

---

**Congratulations!** You've completed Module 02. You now know:
- Why categorical encoding is necessary
- How to apply one-hot, label, ordinal, and target encoding
- When to use each encoding method
- How to handle high-cardinality features
- How to avoid data leakage during encoding

Ready to continue? Move to **Module 03: Feature Scaling and Normalization**!