# 🏷️ Categorical Features Handling

<div style="background-color: #e3f2fd; padding: 15px; border-radius: 5px; border-left: 5px solid #2196F3;">
<b>📓 Notebook Information</b><br>
<b>Level:</b> Intermediate<br>
<b>Estimated Time:</b> 15 minutes<br>
<b>Prerequisites:</b> 01_basic_usage.ipynb<br>
<b>Dataset:</b> Synthetic customer data
</div>

---

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
- ✅ Understand how DBDataset handles categorical features
- ✅ Use automatic categorical detection
- ✅ Apply different encoding strategies
- ✅ Handle high-cardinality categorical features
- ✅ Integrate categorical preprocessing with pipelines
- ✅ Avoid common categorical feature pitfalls

---

## 📚 Table of Contents

1. [Introduction](#intro)
2. [Setup](#setup)
3. [Auto Detection](#auto)
4. [Encoding Strategies](#encoding)
5. [High Cardinality Features](#cardinality)
6. [Best Practices](#practices)
7. [Conclusion](#conclusion)

<a id="intro"></a>
## 1. 📖 Introduction

### What Are Categorical Features?

> **Categorical features** represent discrete values from a fixed set of categories.

**Examples:**
- 🎨 **Nominal**: Color (red, blue, green), country, product type
- 📊 **Ordinal**: Education (HS, Bachelor, Master, PhD), rating (1-5 stars)
- 🔢 **Binary**: Gender (M/F), yes/no questions

### Why Do Categorical Features Matter?

**The Challenge:**
- 🤖 **ML models need numbers** - Can't directly process "red", "blue"
- 🔄 **Encoding required** - Must convert categories to numeric
- ⚠️ **Wrong encoding = poor performance** - Different strategies for different cases

**Real-world impact:**
```python
# ❌ BAD: Treating 'country' as numeric
country_as_int = {'USA': 1, 'UK': 2, 'France': 3}  # Implies USA < UK < France!

# ✅ GOOD: One-hot encoding
country_one_hot = {'USA': [1, 0, 0], 'UK': [0, 1, 0], 'France': [0, 0, 1]}
```

### Common Encoding Methods

| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Label Encoding** | Ordinal features | Simple, fast | Implies ordering |
| **One-Hot Encoding** | Low cardinality (<20) | No false ordering | High dimensionality |
| **Target Encoding** | High cardinality | Compact | Risk of overfitting |
| **Frequency Encoding** | High cardinality | Simple | Loses category identity |
| **Binary Encoding** | Medium cardinality | Compact | Less interpretable |
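
To make the table concrete, here is a minimal sketch of three of these encodings applied to a toy `country` column using plain pandas. The column name and values are illustrative assumptions, not part of the notebook's dataset.

```python
import pandas as pd

# Toy nominal feature (illustrative values only)
country = pd.Series(['USA', 'UK', 'France', 'USA', 'USA', 'UK'], name='country')

# Label encoding: one integer per category (codes follow alphabetical order)
label_encoded = country.astype('category').cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(country, prefix='country', dtype=int)

# Frequency encoding: replace each category with its relative frequency
frequency_encoded = country.map(country.value_counts(normalize=True))

print(label_encoded.tolist())
print(one_hot.columns.tolist())
print(frequency_encoded.round(2).tolist())
```

Note that the label codes carry an ordering (France < UK < USA) that has no real meaning for a nominal feature, which is exactly the risk listed in the table.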

### DBDataset Approach

DBDataset provides:
- 🔍 **Automatic detection** - Identifies categorical columns
- 🛠️ **Flexible encoding** - Works with any sklearn preprocessor
- 🔒 **Safe handling** - Prevents common encoding mistakes

**Let's see how!** 🚀

<a id="setup"></a>
## 2. 🛠️ Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import (
    LabelEncoder,
    OneHotEncoder,
    OrdinalEncoder
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report

# DeepBridge
from deepbridge import DBDataset, Experiment

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')
%matplotlib inline

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("✅ Setup complete!")
print("🏷️  Topic: Categorical Features with DBDataset")

### Create Synthetic Dataset with Categorical Features

In [None]:
print("📊 Creating synthetic customer churn dataset...\n")

# Set random seed for reproducibility
np.random.seed(RANDOM_STATE)

n_samples = 2000

# Categorical features
df = pd.DataFrame({
    # Low cardinality
    'contract_type': np.random.choice(['Month-to-Month', 'One Year', 'Two Year'], n_samples, p=[0.5, 0.3, 0.2]),
    'payment_method': np.random.choice(['Credit Card', 'Bank Transfer', 'Electronic Check', 'Mailed Check'], n_samples),
    'internet_service': np.random.choice(['DSL', 'Fiber Optic', 'No'], n_samples, p=[0.4, 0.4, 0.2]),
    
    # Ordinal
    'customer_level': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], n_samples, p=[0.4, 0.3, 0.2, 0.1]),
    
    # Binary
    'paperless_billing': np.random.choice(['Yes', 'No'], n_samples),
    'phone_service': np.random.choice(['Yes', 'No'], n_samples, p=[0.9, 0.1]),
    
    # High cardinality (simulated)
    'city': np.random.choice([f'City_{i}' for i in range(50)], n_samples),
    
    # Numeric features
    'tenure_months': np.random.randint(1, 73, n_samples),
    'monthly_charges': np.random.uniform(20, 120, n_samples),
    'total_charges': np.random.uniform(20, 8000, n_samples),
})

# Create target based on features (simulate business logic)
churn_prob = (
    (df['contract_type'] == 'Month-to-Month').astype(int) * 0.3 +
    (df['tenure_months'] < 12).astype(int) * 0.25 +
    (df['monthly_charges'] > 80).astype(int) * 0.2 +
    (df['payment_method'] == 'Electronic Check').astype(int) * 0.15 +
    np.random.uniform(0, 0.1, n_samples)  # Random noise
)

df['churn'] = (churn_prob > 0.5).astype(int)

print(f"✅ Dataset created: {df.shape}")
print(f"\n📋 Feature types:")
print(f"   Categorical: {df.select_dtypes(include='object').columns.tolist()}")
print(f"   Numeric: {df.select_dtypes(include='number').columns.drop('churn').tolist()}")
print(f"\n📊 Churn rate: {df['churn'].mean():.1%}")

# Show sample
print("\n👀 Sample data:")
display(df.head())

### Explore Categorical Features

In [None]:
print("🔍 Categorical Feature Analysis\n")
print("=" * 70)

categorical_cols = df.select_dtypes(include='object').columns

cat_info = []
for col in categorical_cols:
    n_unique = df[col].nunique()
    top_value = df[col].value_counts().index[0]
    top_freq = df[col].value_counts().iloc[0] / len(df)
    
    # Classify cardinality
    if n_unique <= 5:
        cardinality = '🟢 Low'
    elif n_unique <= 20:
        cardinality = '🟡 Medium'
    else:
        cardinality = '🔴 High'
    
    cat_info.append({
        'Feature': col,
        'Unique Values': n_unique,
        'Cardinality': cardinality,
        'Top Value': top_value,
        'Top Frequency': f"{top_freq:.1%}"
    })

cat_df = pd.DataFrame(cat_info)
display(cat_df)

print("\n💡 Cardinality Guide:")
print("   🟢 Low (≤5): Use one-hot encoding")
print("   🟡 Medium (6-20): Use one-hot or binary encoding")
print("   🔴 High (>20): Use target/frequency encoding")

<a id="auto"></a>
## 3. 🔍 Auto Detection

### DBDataset Automatic Categorical Detection

In [None]:
print("🔍 Auto-Detection of Categorical Features\n")
print("   DBDataset can automatically identify categorical columns (object dtype)")

# Detection does not encode anything: the model still needs numeric input,
# so categorical features must be encoded before they reach it.
# One-hot encoding is used below for the low cardinality features.
print("\n⚠️  Important: sklearn models need numeric input!")
print("   Encode categorical features before they reach the model,")
print("   either up front or inside a pipeline passed to DBDataset")
print("\n✅ Best practice: Use sklearn pipelines for encoding")
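
As a rough illustration of what dtype-based detection can look like, the sketch below flags string, object, and `category` columns, plus integer columns with few distinct values. This is a pandas-only heuristic written for this notebook; the function `detect_categorical` and its `max_int_unique` threshold are hypothetical and do not describe DBDataset's internal logic.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_object_dtype, is_string_dtype


def detect_categorical(frame, max_int_unique=10):
    """Return columns that look categorical (heuristic, not DBDataset's logic)."""
    detected = []
    for col in frame.columns:
        series = frame[col]
        is_text = is_object_dtype(series) or is_string_dtype(series)
        if is_text or isinstance(series.dtype, pd.CategoricalDtype):
            detected.append(col)
        elif is_integer_dtype(series) and series.nunique() <= max_int_unique:
            # Low-cardinality integers (e.g. star ratings) are often categories
            detected.append(col)
    return detected


toy = pd.DataFrame({
    'contract_type': ['Month-to-Month', 'One Year', 'Two Year', 'One Year'],
    'rating': [1, 5, 3, 5],
    'monthly_charges': [29.9, 70.5, 99.0, 45.2],
})
print(detect_categorical(toy))
```

A heuristic like this would also flag a binary integer target such as `churn`, so the target column should be excluded before applying it.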

<a id="encoding"></a>
## 4. 🔧 Encoding Strategies

### Strategy 1: One-Hot Encoding (Low Cardinality)

In [None]:
print("🔧 Strategy 1: One-Hot Encoding\n")
print("   Best for: Low cardinality features (≤10 categories)")

# Select low cardinality categorical features
low_card_features = ['contract_type', 'payment_method', 'internet_service', 
                     'paperless_billing', 'phone_service']

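# One-hot encode (illustration only: this encodes the full DataFrame;
# for modeling, fit the encoder on the training split via a pipeline)
df_encoded = pd.get_dummies(df, columns=low_card_features, drop_first=True)

print(f"\n📊 Before encoding: {len(df.columns)} columns")
print(f"   After encoding: {len(df_encoded.columns)} columns")
print(f"   Net change: {len(df_encoded.columns) - len(df.columns):+d} columns (originals replaced by dummies)")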

# Show encoded columns
encoded_cols = [col for col in df_encoded.columns if any(feat in col for feat in low_card_features)]
print(f"\n✅ Encoded columns (sample):")
for col in encoded_cols[:8]:
    print(f"   • {col}")
if len(encoded_cols) > 8:
    print(f"   ... and {len(encoded_cols) - 8} more")

### Strategy 2: Ordinal Encoding (Ordinal Features)

In [None]:
print("🔧 Strategy 2: Ordinal Encoding\n")
print("   Best for: Features with natural ordering")

# Define ordering for customer_level
level_order = ['Bronze', 'Silver', 'Gold', 'Platinum']

# Create ordinal mapping
level_mapping = {level: i for i, level in enumerate(level_order)}

df_encoded['customer_level_encoded'] = df['customer_level'].map(level_mapping)

print(f"✅ Ordinal encoding for 'customer_level':")
print(f"\n   Mapping:")
for level, code in level_mapping.items():
    print(f"   {level:10s} → {code}")

# Show distribution, aligned by level so each label sits next to its code
original_counts = df['customer_level'].value_counts().reindex(level_order)
encoded_counts = df_encoded['customer_level_encoded'].value_counts().reindex(range(len(level_order)))
comparison = pd.DataFrame({
    'Code': list(range(len(level_order))),
    'Original': original_counts.to_numpy(),
    'Encoded': encoded_counts.to_numpy()
}, index=level_order)

print(f"\n📊 Distribution comparison:")
display(comparison)

print("\n💡 Ordinal encoding preserves the ordering: Bronze < Silver < Gold < Platinum")
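
The manual mapping above can also be expressed with sklearn's `OrdinalEncoder`, which is easier to drop into a pipeline. This is a minimal sketch assuming the same level ordering; the `-1` code for unseen levels and the `'Diamond'` value are illustrative choices.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

level_order = ['Bronze', 'Silver', 'Gold', 'Platinum']
levels = pd.DataFrame({'customer_level': ['Gold', 'Bronze', 'Platinum', 'Silver']})

# Pass the order explicitly; otherwise categories are sorted alphabetically
ordinal_encoder = OrdinalEncoder(
    categories=[level_order],
    handle_unknown='use_encoded_value',
    unknown_value=-1,  # code assigned to levels not listed in level_order
)
codes = ordinal_encoder.fit_transform(levels)
print(codes.ravel().tolist())

# A level that was never declared maps to the unknown_value
unseen = ordinal_encoder.transform(pd.DataFrame({'customer_level': ['Diamond']}))
print(unseen.ravel().tolist())
```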

### Strategy 3: Frequency Encoding (High Cardinality)

In [None]:
print("🔧 Strategy 3: Frequency Encoding\n")
print("   Best for: High cardinality features (>20 categories)")

# Calculate frequency for 'city'
# (computed on the full DataFrame for illustration; in a real workflow,
# learn the frequencies from the training split only)
city_freq = df['city'].value_counts(normalize=True)

# Map frequencies
df_encoded['city_frequency'] = df['city'].map(city_freq)

print(f"✅ Frequency encoding for 'city' (50 unique values):")
print(f"\n   Top 5 cities by frequency:")
for city, freq in city_freq.head().items():
    print(f"   {city:12s} → {freq:.3f} ({freq*100:.1f}%)")

print(f"\n📊 Frequency encoding reduces:")
print(f"   From: 50 unique cities (would need 49 one-hot columns)")
print(f"   To: 1 frequency column")
print(f"\n💡 Huge dimensionality reduction for high cardinality features!")
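
In a train/test workflow, the frequencies should be learned from the training split and then reused. The sketch below shows one way to do that with plain pandas, giving unseen categories a frequency of 0. The toy series and the `0.0` fallback are assumptions made for illustration.

```python
import pandas as pd

train_city = pd.Series(['City_1', 'City_1', 'City_2', 'City_3'], name='city')
test_city = pd.Series(['City_1', 'City_9'], name='city')  # 'City_9' is unseen

# Learn frequencies from the training split only
city_freq_train = train_city.value_counts(normalize=True)

# Apply the same mapping to both splits; unseen categories fall back to 0
train_city_encoded = train_city.map(city_freq_train)
test_city_encoded = test_city.map(city_freq_train).fillna(0.0)

print(train_city_encoded.tolist())
print(test_city_encoded.tolist())
```

Without the `fillna`, an unseen city would become `NaN` and most sklearn estimators would reject the input.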

### Strategy 4: Using sklearn Pipeline (Recommended)

In [None]:
print("🔧 Strategy 4: sklearn Pipeline (Production-Ready)\n")
print("   Best practice: Encapsulate all preprocessing in a pipeline")

# Separate features by type
numeric_features = ['tenure_months', 'monthly_charges', 'total_charges']
categorical_low = ['contract_type', 'payment_method', 'internet_service']
categorical_binary = ['paperless_billing', 'phone_service']

# Create preprocessing pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(drop='first', sparse_output=False), categorical_low),
    (OneHotEncoder(drop='first', sparse_output=False), categorical_binary),
    remainder='drop'  # Drop the remaining columns (customer_level, city) for this example
)

# Create full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE))
])

print("✅ Pipeline created:")
print("\n   Step 1: StandardScaler for numeric features")
print(f"           {numeric_features}")
print("\n   Step 2: OneHotEncoder for categorical features")
print(f"           {categorical_low + categorical_binary}")
print("\n   Step 3: RandomForest classifier")

# Prepare data
X = df.drop('churn', axis=1)
y = df['churn']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

# Fit pipeline
pipeline.fit(X_train, y_train)

# Evaluate
y_pred = pipeline.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print(f"\n✅ Pipeline trained!")
print(f"   Accuracy: {acc:.3f}")
print(f"\n💡 Pipeline handles all encoding automatically!")

### Use Pipeline with DBDataset

In [None]:
print("🔬 Using Pipeline with DBDataset\n")

# Create DBDataset with the pipeline
dataset = DBDataset(
    data=df,
    target_column='churn',
    model=pipeline,  # Pass the entire pipeline!
    test_size=0.2,
    random_state=RANDOM_STATE
)

print("✅ DBDataset created with preprocessing pipeline")
print(f"   Total features: {len(dataset.features)}")
print(f"   Target: {dataset.target_column}")

# Run a quick test
exp = Experiment(
    dataset=dataset,
    experiment_type='binary_classification',
    experiment_name='Churn Prediction with Categorical Features',
    random_state=RANDOM_STATE
)

print("\n🔬 Running robustness test...")
result = exp.run_test('robustness', config='quick')

print("\n✅ Test complete!")
print("   Pipeline correctly handles categorical features during testing")
print("\n💡 This is the recommended approach for production!")

<a id="cardinality"></a>
## 5. 🔢 High Cardinality Features

### The High Cardinality Problem

In [None]:
print("⚠️  The High Cardinality Problem\n")
print("=" * 70)

# Simulate different cardinality levels
cardinality_scenarios = [
    {'name': 'Contract Type', 'unique': 3, 'encoding': 'One-Hot', 'columns_created': 2},
    {'name': 'Country', 'unique': 195, 'encoding': 'One-Hot', 'columns_created': 194},
    {'name': 'Zip Code', 'unique': 40000, 'encoding': 'One-Hot', 'columns_created': 39999},
    {'name': 'Customer ID', 'unique': 1000000, 'encoding': 'One-Hot', 'columns_created': 999999},
]

cardinality_df = pd.DataFrame(cardinality_scenarios)
# Rough estimate: dense float64 (8 bytes per value), assuming 10,000 rows
n_rows_assumed = 10_000
cardinality_df['Memory (MB)'] = cardinality_df['columns_created'] * 8 * n_rows_assumed / (1024 * 1024)

print("📊 One-Hot Encoding Memory Impact:\n")
display(cardinality_df.style.background_gradient(
    cmap='Reds', subset=['columns_created', 'Memory (MB)']
).format({
    'Memory (MB)': '{:.1f}'
}))

print("\n⚠️  Problems with high cardinality one-hot encoding:")
print("   1. 💾 Huge memory consumption")
print("   2. ⏱️  Slow training and inference")
print("   3. 📉 Curse of dimensionality (sparse data)")
print("   4. 🎯 Overfitting risk")
print("\n✅ Solution: Use alternative encoding methods!")
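
The memory figures above assume a dense matrix. As a rough sketch of how much of that cost comes from storing zeros, the example below one-hot encodes a made-up `zip_code` column and compares the dense estimate with the size of the sparse matrix that `OneHotEncoder` returns by default. The column name, row count, and value range are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
zip_codes = pd.DataFrame({'zip_code': rng.integers(0, 5000, size=10_000).astype(str)})

# By default OneHotEncoder returns a SciPy sparse matrix (CSR format)
sparse_matrix = OneHotEncoder(handle_unknown='ignore').fit_transform(zip_codes)

n_rows, n_cols = sparse_matrix.shape
dense_mb = n_rows * n_cols * 8 / (1024 ** 2)  # float64 if materialized densely
sparse_mb = (
    sparse_matrix.data.nbytes
    + sparse_matrix.indices.nbytes
    + sparse_matrix.indptr.nbytes
) / (1024 ** 2)

print(f"{n_cols} columns | dense ~ {dense_mb:.1f} MB | sparse ~ {sparse_mb:.2f} MB")
```

Sparse storage eases the memory problem, but the model still sees thousands of mostly empty columns, so the dimensionality and overfitting concerns remain.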

### Solutions for High Cardinality

In [None]:
print("✅ Solutions for High Cardinality Features\n")
print("=" * 70)

solutions = pd.DataFrame({
    'Method': [
        'Frequency Encoding',
        'Target Encoding',
        'Feature Hashing',
        'Category Grouping',
        'Embeddings (Deep Learning)'
    ],
    'Columns Created': [1, 1, 'Fixed (e.g., 32)', '1 per group', 'Fixed (e.g., 50)'],
    'Best For': [
        'Any cardinality',
        'Very high cardinality',
        'Extremely high (millions)',
        'Domain knowledge available',
        'Deep learning models'
    ],
    'Pros': [
        'Simple, fast, low overfitting risk',
        'Captures target relationship',
        'Handles unseen categories',
        'Interpretable, reduces noise',
        'Learns optimal representation'
    ],
    'Cons': [
        'Loses category identity',
        'Risk of overfitting',
        'Hash collisions',
        'Requires domain expertise',
        'Complex, needs more data'
    ]
})

display(solutions.style.set_properties(**{
    'text-align': 'left',
    'white-space': 'pre-wrap'
}))

print("\n💡 Rule of Thumb:")
print("   • <10 categories: One-Hot Encoding")
print("   • 10-50 categories: Frequency or Target Encoding")
print("   • 50-1000 categories: Target Encoding or Hashing")
print("   • >1000 categories: Feature Hashing or Embeddings")
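
Target encoding is listed above with "risk of overfitting" as its main drawback. One common mitigation is to shrink each category's mean toward the global mean, so that rare categories are not trusted too much. The sketch below is a hand-rolled illustration of that idea; the function `fit_target_encoding` and its `smoothing` parameter are hypothetical and not part of DeepBridge or sklearn.

```python
import pandas as pd


def fit_target_encoding(categories, target, smoothing=10.0):
    """Learn smoothed per-category target means from training data."""
    global_mean = target.mean()
    stats = target.groupby(categories).agg(['mean', 'count'])
    # Weighted average of the category mean and the global mean:
    # the fewer samples a category has, the closer it stays to the global mean
    smoothed = (
        stats['count'] * stats['mean'] + smoothing * global_mean
    ) / (stats['count'] + smoothing)
    return smoothed, global_mean


train = pd.DataFrame({
    'city': ['City_1', 'City_1', 'City_1', 'City_2', 'City_2', 'City_3'],
    'churn': [1, 1, 0, 0, 0, 1],
})
mapping, fallback = fit_target_encoding(train['city'], train['churn'], smoothing=2.0)

# Unseen categories fall back to the global mean
new_cities = pd.Series(['City_1', 'City_3', 'City_99'])
encoded = new_cities.map(mapping).fillna(fallback)
print(encoded.round(3).tolist())
```

The mapping must be learned on the training split only, since it uses the target directly; computing it on the full dataset leaks label information into the features.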

<a id="practices"></a>
## 6. ✨ Best Practices

<div style="background-color: #e8f5e9; padding: 15px; border-radius: 5px; border-left: 5px solid #4CAF50;">
<b>✅ DO</b><br><br>

1. **Use Pipelines**
   ```python
   # ✅ GOOD: Encoding in pipeline
   pipeline = Pipeline([
       ('encoder', OneHotEncoder()),
       ('model', RandomForestClassifier())
   ])
   pipeline.fit(X_train, y_train)  # Fits encoder on train only!
   ```

2. **Handle Unknown Categories**
   ```python
   # ✅ GOOD: Handle unseen categories in production
   encoder = OneHotEncoder(handle_unknown='ignore')
   ```

3. **Check Cardinality**
   ```python
   # ✅ GOOD: Choose encoding based on cardinality
   n_unique = df['city'].nunique()
   if n_unique < 10:
       encoding = 'one-hot'
   else:
       encoding = 'frequency'
   ```

4. **Preserve Ordinal Relationships**
   - Use OrdinalEncoder for features with natural ordering
   - Define explicit ordering to avoid arbitrary assignment

5. **Drop First Category (One-Hot)**
   - Prevents multicollinearity
   - Reduces dimensionality by 1 per feature

</div>

<div style="background-color: #ffebee; padding: 15px; border-radius: 5px; border-left: 5px solid #f44336; margin-top: 15px;">
<b>❌ DON'T</b><br><br>

1. **Encode on Full Dataset**
   ```python
   # ❌ BAD: Encoder sees all data, including the test split (leakage!)
   df_encoded = pd.get_dummies(df)
   X_train, X_test = train_test_split(df_encoded)
   ```
   ```python
   # ✅ GOOD: Fit on train only
   X_train, X_test = train_test_split(df)
   encoder = OneHotEncoder(handle_unknown='ignore')
   encoder.fit(X_train)
   X_train_enc = encoder.transform(X_train)
   X_test_enc = encoder.transform(X_test)
   ```

2. **Use Label Encoding for Nominal Features**
   ```python
   # ❌ BAD: Implies blue < green < red (LabelEncoder sorts alphabetically)!
   df['color'] = LabelEncoder().fit_transform(df['color'])
   ```

3. **One-Hot Encode High Cardinality**
   ```python
   # ❌ BAD: 1000 columns!
   df_encoded = pd.get_dummies(df['zip_code'])
   ```

4. **Ignore Unknown Categories**
   - Always handle unseen categories in production
   - Use `handle_unknown='ignore'`, or give frequency encoding a default value for unseen categories

5. **Encode Target Variable**
   - Target encoding is for features, not the target itself
   - For classification, sklearn handles string labels automatically

</div>

<a id="conclusion"></a>
## 7. 🎓 Conclusion

### What You Learned

- ✅ **Categorical types** - Nominal, ordinal, binary
- ✅ **Encoding methods** - One-hot, ordinal, frequency, target
- ✅ **Cardinality handling** - Different strategies for different scales
- ✅ **Pipeline integration** - Production-ready preprocessing
- ✅ **DBDataset usage** - Pass pipelines for automatic handling
- ✅ **Best practices** - Avoid leakage, handle unknowns

### Key Takeaways

1. 🎯 **Choose encoding wisely** - Based on cardinality and feature type
2. 🔒 **Prevent data leakage** - Fit encoders on training data only
3. ⚡ **Watch cardinality** - High cardinality = high dimensionality
4. 🛠️ **Use pipelines** - Encapsulate all preprocessing
5. 🔄 **Handle unknowns** - Production will see new categories
6. 📊 **Preserve relationships** - Use ordinal encoding when appropriate

### Encoding Decision Tree

```
Is the feature categorical?
├── No → Use as-is (numeric)
└── Yes
    ├── Is it ordinal (natural order)?
    │   ├── Yes → Ordinal Encoding
    │   └── No → Check cardinality
    │       ├── Low (<10) → One-Hot Encoding
    │       ├── Medium (10-50) → Frequency/Target Encoding
    │       └── High (>50) → Target Encoding / Hashing / Embeddings
```
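
The decision tree can be written as a small helper for quick triage of a new dataset. This is a minimal sketch using the thresholds from the tree; the function name `choose_encoding` and its return labels are hypothetical, and the cut-offs are rules of thumb rather than hard limits.

```python
def choose_encoding(n_unique, is_ordinal=False):
    """Suggest an encoding following the decision tree's rule-of-thumb thresholds."""
    if is_ordinal:
        return 'ordinal'
    if n_unique < 10:
        return 'one-hot'
    if n_unique <= 50:
        return 'frequency/target'
    return 'target/hashing/embeddings'


# Cardinalities below mirror the kinds of columns used in this notebook
suggestions = {
    'contract_type': choose_encoding(3),
    'customer_level': choose_encoding(4, is_ordinal=True),
    'city': choose_encoding(50),
    'zip_code': choose_encoding(40_000),
}
for feature, encoding in suggestions.items():
    print(f"{feature:15s} -> {encoding}")
```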

### Next Steps

1. **Try target encoding** (category_encoders library)
2. **Experiment with embeddings** (for deep learning)
3. **Handle missing values** in categorical features
4. **Feature engineering** (create category interactions)

---

**Remember: The right encoding can make or break your model!** 🏷️✨