# Module 05: Binning and Discretization

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 50 minutes  
**Prerequisites**: [Module 04: Polynomial Features and Interactions](04_polynomial_features_interactions.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand when and why to discretize continuous features
2. Apply equal-width binning to create uniform intervals
3. Use equal-frequency binning to get balanced bins
4. Create custom bins based on domain knowledge
5. Use sklearn's KBinsDiscretizer for automatic binning
6. Recognize when discretization helps vs hurts model performance
7. Choose an appropriate number of bins for your data

## 1. What is Binning/Discretization?

**Binning** (also called **discretization**) converts continuous features into categorical bins.

### Example: Age

**Continuous**: 23, 45, 67, 18, 52, 71...

**Binned**:
```
18-30: Young Adult
31-50: Middle Aged
51-70: Senior
71+:   Elderly
```
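
A minimal sketch of how this mapping could be produced with `pd.cut`. The bin edges (17, 30, 50, 70, 120) are assumptions chosen so that the right-inclusive intervals match the age ranges listed; the upper limit of 120 is an arbitrary cap.

```python
import pandas as pd

ages = pd.Series([23, 45, 67, 18, 52, 71])

# pd.cut uses right-inclusive intervals by default:
# (17, 30], (30, 50], (50, 70], (70, 120]
age_groups = pd.cut(
    ages,
    bins=[17, 30, 50, 70, 120],  # illustrative edges, not an official standard
    labels=['Young Adult', 'Middle Aged', 'Senior', 'Elderly']
)

print(pd.DataFrame({'age': ages, 'group': age_groups}))
```

Each age is replaced by the label of the interval it falls into, so 23 and 18 end up in the same category.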

### Why Discretize?

#### ✅ Benefits:

1. **Capture non-linear patterns**: Age 25 vs 27 might not matter, but 25 vs 65 does
2. **Handle outliers**: Extreme values fall into the outermost bin, which limits their influence
3. **Interpretability**: "Senior citizens" is clearer than "age > 60"
4. **Match domain knowledge**: Income brackets, age groups already exist
5. **Reduce noise**: Small variations within bins are ignored
6. **Help linear models**: Can approximate non-linear relationships

#### ❌ Drawbacks:

1. **Loss of information**: 25 and 29 treated identically in "20-30" bin
2. **Arbitrary boundaries**: Is 30 really different from 31?
3. **Coarser signal**: Each binned feature takes only a handful of distinct values
4. **May not help tree models**: Trees already learn their own split thresholds

### When to Use Binning?

- ✅ **Linear models** with non-linear relationships
- ✅ **Domain knowledge** suggests natural categories
- ✅ **Outliers** are problematic
- ✅ **Interpretability** is important
- ❌ **Tree-based models** (usually not needed)
- ❌ **Small datasets** (losing information is costly)

## 2. Setup

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Binning methods
from sklearn.preprocessing import KBinsDiscretizer

# Models for comparison
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 3)

print("✓ Libraries imported successfully!")

## 3. Create Sample Dataset

Let's create a customer dataset for credit card approval prediction.

In [None]:
# Set seed for reproducibility
np.random.seed(42)
n_samples = 1000

# Create customer data
customer_data = pd.DataFrame({
    'age': np.random.randint(18, 80, n_samples),
    'income': np.random.gamma(5, 10000, n_samples).clip(15000, 200000),
    'credit_score': np.random.randint(400, 850, n_samples),
    'years_employed': np.random.exponential(5, n_samples).clip(0, 40),
    'debt': np.random.gamma(3, 5000, n_samples).clip(0, 100000)
})

# Create target: credit card approval
# Non-linear relationship: middle-aged with stable income most likely to be approved
approval_prob = (
    # Age: peak approval at 30-50
    0.3 * np.exp(-((customer_data['age'] - 40)**2) / 500) +
    # Income: higher is better
    0.3 * (customer_data['income'] - customer_data['income'].min()) / 
          (customer_data['income'].max() - customer_data['income'].min()) +
    # Credit score: higher is better
    0.2 * (customer_data['credit_score'] - 400) / 450 +
    # Employment: more years is better
    0.1 * (customer_data['years_employed'] / 40) +
    # Debt: lower is better
    0.1 * (1 - customer_data['debt'] / customer_data['debt'].max()) +
    np.random.normal(0, 0.1, n_samples)
)

customer_data['approved'] = (approval_prob > 0.5).astype(int)

print(f"Created dataset with {len(customer_data)} customers")
print(f"Approval rate: {customer_data['approved'].mean():.1%}")
print(f"\nFirst few rows:")
customer_data.head()

In [None]:
# Visualize continuous features
features = ['age', 'income', 'credit_score', 'years_employed', 'debt']

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.ravel()

for idx, feature in enumerate(features):
    axes[idx].hist(customer_data[feature], bins=30, edgecolor='black', alpha=0.7)
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Frequency')
    axes[idx].set_title(f'Distribution of {feature}')
    axes[idx].grid(True, alpha=0.3)

axes[-1].axis('off')  # Hide last subplot

plt.tight_layout()
plt.show()

print("These continuous features will be binned into categories.")

## 4. Method 1: Equal-Width Binning

**Strategy**: Divide the range into bins of equal width

**Formula**: 
```
bin_width = (max - min) / n_bins
```

**Example**: Age 18-80 with 4 bins
- Bin 1: 18-33.5
- Bin 2: 33.5-49
- Bin 3: 49-64.5
- Bin 4: 64.5-80
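
The arithmetic behind this example can be checked with a short sketch. It assumes the same range (18 to 80) and 4 bins; `np.linspace` is used here only as a convenient way to list the edges.

```python
import numpy as np

min_age, max_age, n_bins = 18, 80, 4

# Width of each bin: (80 - 18) / 4 = 15.5
bin_width = (max_age - min_age) / n_bins

# n_bins + 1 evenly spaced edges delimit n_bins intervals
edges = np.linspace(min_age, max_age, n_bins + 1)

print(f"bin_width = {bin_width}")
print(f"edges     = {edges.tolist()}")
```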

**Pros**: 
- Simple and intuitive
- Equal-sized intervals

**Cons**: 
- May create imbalanced bins (some bins have very few samples)
- Sensitive to outliers
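
To see why equal-width bins are sensitive to outliers, here is a small sketch on made-up data: 99 ordinary values and a single extreme one. The numbers are arbitrary and only serve to illustrate the effect.

```python
import numpy as np
import pandas as pd

# 99 ordinary values (1..99) plus one extreme outlier (made-up data)
values = pd.Series(np.append(np.arange(1, 100), 10_000))

# Four equal-width bins must span the full range, so the outlier stretches every bin
counts = pd.cut(values, bins=4).value_counts().sort_index()
print(counts)
```

All 99 ordinary values land in the first bin, the two middle bins stay empty, and the outlier sits alone in the last bin.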

In [None]:
# Equal-width binning for age using pandas.cut()
age_bins_uniform = pd.cut(
    customer_data['age'], 
    bins=4,  # 4 equal-width bins
    labels=['Young', 'Adult', 'Middle-Aged', 'Senior']
)

print("Equal-Width Binning for Age:\n")
print("Value counts:")
print(age_bins_uniform.value_counts().sort_index())

print("\nBin intervals:")
print(pd.cut(customer_data['age'], bins=4).value_counts().sort_index())

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original distribution
axes[0].hist(customer_data['age'], bins=30, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Original Age Distribution')
axes[0].grid(True, alpha=0.3)

# After binning
age_bins_uniform.value_counts().sort_index().plot(kind='bar', ax=axes[1], edgecolor='black')
axes[1].set_xlabel('Age Group')
axes[1].set_ylabel('Count')
axes[1].set_title('Equal-Width Binning (4 bins)')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\nNotice: Bins have roughly similar counts because age is uniformly distributed.")

In [None]:
# Equal-width binning for income (skewed distribution)
income_bins_uniform = pd.cut(
    customer_data['income'],
    bins=5,
    labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']
)

print("Equal-Width Binning for Income:\n")
print("Value counts:")
print(income_bins_uniform.value_counts().sort_index())

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(customer_data['income'], bins=30, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Income ($)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Original Income Distribution\n(Right-skewed)')
axes[0].grid(True, alpha=0.3)

income_bins_uniform.value_counts().sort_index().plot(kind='bar', ax=axes[1], edgecolor='black')
axes[1].set_xlabel('Income Group')
axes[1].set_ylabel('Count')
axes[1].set_title('Equal-Width Binning (5 bins)')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n⚠️  Problem: Most data in first few bins, last bins nearly empty!")
print("This happens with skewed distributions.")

## 5. Method 2: Equal-Frequency Binning (Quantile-Based)

**Strategy**: Create bins with an approximately equal number of samples

**How**: Use quantiles/percentiles as bin edges

**Example**: 1000 samples, 4 bins → each bin has ~250 samples

**Pros**: 
- Balanced bins
- Works well with skewed data
- Each bin has enough samples

**Cons**: 
- Bin widths vary (can be counterintuitive)
- May group very different values together
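
A rough sketch of the idea using plain NumPy, assuming a synthetic right-skewed sample (the variable names and the random generator seed are illustrative). The quartiles serve as bin edges, and each value is assigned to a bin by comparing it with the three interior edges.

```python
import numpy as np

rng = np.random.default_rng(0)
incomes_demo = rng.gamma(5, 10_000, size=1_000)  # synthetic right-skewed sample

# Quartile edges: minimum, 25th, 50th, 75th percentile, maximum
edges = np.quantile(incomes_demo, [0, 0.25, 0.5, 0.75, 1])

# Assign each value to a bin using the three interior edges
bin_index = np.searchsorted(edges[1:-1], incomes_demo, side='left')
counts = np.bincount(bin_index, minlength=4)

print("edges :", np.round(edges))
print("widths:", np.round(np.diff(edges)))
print("counts:", counts)
```

The counts come out equal while the widths do not: the top bin has to stretch across the long right tail to collect its share of samples.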

In [None]:
# Equal-frequency binning for income using pandas.qcut()
income_bins_quantile = pd.qcut(
    customer_data['income'],
    q=5,  # 5 bins with equal frequency
    labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']
)

print("Equal-Frequency Binning for Income:\n")
print("Value counts:")
print(income_bins_quantile.value_counts().sort_index())

print("\nBin intervals:")
print(pd.qcut(customer_data['income'], q=5).value_counts().sort_index())

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Equal-width
income_bins_uniform.value_counts().sort_index().plot(kind='bar', ax=axes[0], edgecolor='black', color='coral')
axes[0].set_xlabel('Income Group')
axes[0].set_ylabel('Count')
axes[0].set_title('Equal-Width Binning\n(Imbalanced)')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3, axis='y')

# Equal-frequency
income_bins_quantile.value_counts().sort_index().plot(kind='bar', ax=axes[1], edgecolor='black', color='lightgreen')
axes[1].set_xlabel('Income Group')
axes[1].set_ylabel('Count')
axes[1].set_title('Equal-Frequency Binning\n(Balanced)')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n✓ Equal-frequency creates balanced bins even with skewed data!")

## 6. Method 3: Custom Binning (Domain Knowledge)

**Strategy**: Use domain knowledge to create meaningful bins

**Examples**:
- **Age**: 0-17 (minor), 18-64 (adult), 65+ (senior)
- **Income**: Based on tax brackets
- **Credit score**: Poor (<580), Fair (580-669), Good (670-739), Very Good (740-799), Excellent (800+)

**Pros**: 
- Most interpretable
- Aligned with business logic
- Can leverage expert knowledge

**Cons**: 
- Requires domain expertise
- May not be optimal for model performance

In [None]:
# Custom binning for credit score based on industry standards
credit_score_bins = pd.cut(
    customer_data['credit_score'],
    bins=[400, 580, 670, 740, 800, 850],
    labels=['Poor', 'Fair', 'Good', 'Very Good', 'Excellent'],
    include_lowest=True
)

print("Custom Credit Score Binning (Industry Standard):\n")
print("Value counts:")
print(credit_score_bins.value_counts().sort_index())

# Show relationship to approval
approval_by_credit = pd.crosstab(
    credit_score_bins, 
    customer_data['approved'],
    normalize='index'
) * 100

print("\nApproval rate by credit score category:")
print(approval_by_credit[1].sort_index())

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution
credit_score_bins.value_counts().sort_index().plot(kind='bar', ax=axes[0], edgecolor='black')
axes[0].set_xlabel('Credit Score Category')
axes[0].set_ylabel('Count')
axes[0].set_title('Custom Binning: Credit Score Distribution')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3, axis='y')

# Approval rate
approval_by_credit[1].sort_index().plot(kind='bar', ax=axes[1], color='green', edgecolor='black')
axes[1].set_xlabel('Credit Score Category')
axes[1].set_ylabel('Approval Rate (%)')
axes[1].set_title('Approval Rate by Credit Score')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n✓ Custom bins align with domain knowledge and show clear relationship to target!")

In [None]:
# Custom age groups
# Intervals are right-inclusive: (0, 25], (25, 40], (40, 60], (60, 100]
age_bins_custom = pd.cut(
    customer_data['age'],
    bins=[0, 25, 40, 60, 100],
    labels=['Young (18-25)', 'Adult (26-40)', 'Middle Age (41-60)', 'Senior (61+)']
)

print("Custom Age Grouping:\n")
print(age_bins_custom.value_counts().sort_index())

# Approval rate by age group
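# observed=False keeps every age group in the result and avoids a pandas FutureWarning
approval_by_age = customer_data.groupby(age_bins_custom, observed=False)['approved'].agg(['mean', 'count'])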
approval_by_age['approval_rate_%'] = approval_by_age['mean'] * 100

print("\nApproval statistics by age group:")
print(approval_by_age[['count', 'approval_rate_%']])

## 7. Method 4: KBinsDiscretizer from sklearn

**sklearn's unified interface** for binning with multiple strategies.

**Strategies**:
1. `uniform`: Equal-width binning
2. `quantile`: Equal-frequency binning
3. `kmeans`: Bin edges derived from 1D k-means clustering, so each value joins the bin of its nearest cluster center

**Encoding options**:
- `ordinal`: 0, 1, 2, 3... (one column per feature, preserves order)
- `onehot`: One binary column per bin, returned as a sparse matrix (the default)
- `onehot-dense`: Same as `onehot`, but returned as a dense array
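
The three encodings can be compared on a toy column. This is a minimal sketch with made-up data (`X_demo`); `strategy='uniform'` and `n_bins=3` are arbitrary choices for the illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import KBinsDiscretizer

X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])  # toy single-feature data

encoded = {}
for encode in ['ordinal', 'onehot', 'onehot-dense']:
    binner = KBinsDiscretizer(n_bins=3, encode=encode, strategy='uniform')
    encoded[encode] = binner.fit_transform(X_demo)
    kind = 'sparse' if sparse.issparse(encoded[encode]) else 'dense'
    print(f"{encode:13s} shape={encoded[encode].shape} ({kind})")

# Ordinal output: one bin index per row
print(encoded['ordinal'].ravel())
```

Ordinal encoding keeps one column per feature, while both one-hot variants produce one column per bin.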

In [None]:
# Prepare data
X = customer_data[['age', 'income', 'credit_score', 'years_employed', 'debt']]
y = customer_data['approved']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

In [None]:
# KBinsDiscretizer with uniform strategy
binner_uniform = KBinsDiscretizer(
    n_bins=5,
    encode='ordinal',
    strategy='uniform'
)

X_train_binned_uniform = binner_uniform.fit_transform(X_train)
X_test_binned_uniform = binner_uniform.transform(X_test)

print("KBinsDiscretizer (uniform strategy):\n")
print("Original data (first 5 rows):")
print(X_train.head())
print("\nBinned data (first 5 rows):")
print(pd.DataFrame(X_train_binned_uniform[:5], columns=X.columns))
print("\nValues are now 0, 1, 2, 3, 4 representing the bin index.")

In [None]:
# Compare different strategies
strategies = ['uniform', 'quantile', 'kmeans']

fig, axes = plt.subplots(len(strategies), 1, figsize=(12, 12))

for idx, strategy in enumerate(strategies):
    binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy=strategy)
    X_binned = binner.fit_transform(X_train[['income']])
    
    # Plot: edges at -0.5, 0.5, ..., 4.5 center each bar on its bin index
    axes[idx].hist(X_binned.ravel(), bins=np.arange(6) - 0.5, edgecolor='black', alpha=0.7)
    axes[idx].set_xlabel('Bin Index')
    axes[idx].set_ylabel('Count')
    axes[idx].set_title(f'Strategy: {strategy}')
    axes[idx].grid(True, alpha=0.3, axis='y')
    axes[idx].set_xticks([0, 1, 2, 3, 4])

plt.tight_layout()
plt.show()

print("Observations:")
print("- uniform: Imbalanced (most in first bins)")
print("- quantile: Balanced (equal counts per bin)")
print("- kmeans: Clusters data points, somewhat balanced")

## 8. Impact on Model Performance

Let's compare how binning affects different models.

In [None]:
# Compare: No binning vs Binning for different models

# Prepare binned data (quantile strategy, 5 bins)
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
X_train_binned = binner.fit_transform(X_train)
X_test_binned = binner.transform(X_test)

# Models to test
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
}

# Store results
from sklearn.base import clone  # returns an unfitted copy with the same hyperparameters

results = []

for model_name, model in models.items():
    # Without binning
    model.fit(X_train, y_train)
    accuracy_no_bin = model.score(X_test, y_test)
    
    # With binning: clone so the binned fit uses a separate, fresh estimator
    model_binned = clone(model)
    model_binned.fit(X_train_binned, y_train)
    accuracy_with_bin = model_binned.score(X_test_binned, y_test)
    
    results.append({
        'Model': model_name,
        'Without Binning': accuracy_no_bin,
        'With Binning': accuracy_with_bin,
        'Difference': accuracy_with_bin - accuracy_no_bin
    })

results_df = pd.DataFrame(results)
print("Impact of Binning on Model Performance:\n")
print(results_df.to_string(index=False))

# Visualize
results_df.set_index('Model')[['Without Binning', 'With Binning']].plot(
    kind='bar', figsize=(10, 6), edgecolor='black'
)
plt.ylabel('Accuracy')
plt.title('Model Performance: Binning vs No Binning')
plt.xticks(rotation=45, ha='right')
plt.ylim([0.5, 1.0])
plt.legend(['Without Binning', 'With Binning'])
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nKey Insights:")
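print("1. Logistic Regression may benefit from binning, but ordinal bin codes are still one monotonic feature;")
print("   one-hot encoded bins are what let a linear model fit a non-monotonic pattern")
print("2. Tree-based models usually don't need binning (they learn their own split thresholds)")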
print("3. Binning can help with outliers and noise")
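
The comparison above uses `encode='ordinal'`, which gives a linear model a single monotonic feature per column. The sketch below, on a synthetic one-feature problem (all names and data here are illustrative assumptions), suggests why one-hot encoded bins matter when the target depends on the feature in a non-monotonic way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(2000, 1))
# Synthetic non-monotonic target: positive only in the middle of the range
y_mid = (np.abs(x[:, 0] - 0.5) < 0.25).astype(int)

x_tr, x_te, y_tr, y_te = train_test_split(x, y_mid, test_size=0.25, random_state=0)

candidates = {
    'raw feature': make_pipeline(LogisticRegression()),
    'ordinal bins': make_pipeline(
        KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform'),
        LogisticRegression()),
    'one-hot bins': make_pipeline(
        KBinsDiscretizer(n_bins=4, encode='onehot-dense', strategy='uniform'),
        LogisticRegression()),
}

demo_scores = {}
for name, pipe in candidates.items():
    # Fit on the training split, score on the held-out split
    demo_scores[name] = pipe.fit(x_tr, y_tr).score(x_te, y_te)
    print(f"{name:13s} accuracy = {demo_scores[name]:.3f}")
```

With one-hot bins the model learns a separate coefficient per bin, so it can assign a high score to the middle bins and a low score to both ends. A single raw or ordinal feature only allows one threshold along the axis.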

## 9. Choosing the Number of Bins

How many bins should you create?

### Rules of Thumb:

1. **Sturges' Rule**: $bins = \lceil \log_2(n) + 1 \rceil$
   - Good for normal distributions
   
2. **Rice's Rule**: $bins = \lceil 2n^{1/3} \rceil$
   - More bins than Sturges'
   
3. **Domain Knowledge**: Use meaningful categories
   - Age: Young, Adult, Middle-aged, Senior (4 bins)
   - Income: Based on tax brackets
   
4. **Cross-Validation**: Try different numbers, pick best
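
As a quick worked example of the first two rules, the sketch below evaluates them for `n = 800`, an assumed sample size matching an 80% training split of 1,000 rows.

```python
import math

n = 800  # assumed sample size, e.g. an 80% training split of 1,000 rows

sturges = math.ceil(math.log2(n) + 1)   # log2(800) is about 9.64
rice = math.ceil(2 * n ** (1 / 3))      # 800 ** (1/3) is about 9.28

print(f"Sturges' rule: {sturges} bins")
print(f"Rice's rule:   {rice} bins")
```

Both rules were designed for drawing histograms, so for feature binning they are better read as rough starting points than as targets.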

### Guidelines:

- **Too few bins** (2-3): May lose too much information
- **Too many bins** (20+): Defeats the purpose of binning
- **Sweet spot**: Usually 4-10 bins
- **Consider sample size**: More data → can support more bins

In [None]:
# Test different numbers of bins
n_bins_range = [3, 5, 7, 10, 15, 20]
scores = []

for n_bins in n_bins_range:
    # Create bins
    binner = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='quantile')
    X_train_temp = binner.fit_transform(X_train)
    X_test_temp = binner.transform(X_test)
    
    # Train model
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train_temp, y_train)
    
    # Evaluate
    accuracy = model.score(X_test_temp, y_test)
    scores.append(accuracy)
    
    print(f"n_bins = {n_bins:2d}: Accuracy = {accuracy:.4f}")

# Visualize
plt.figure(figsize=(10, 6))
plt.plot(n_bins_range, scores, 'o-', linewidth=2, markersize=8)
plt.xlabel('Number of Bins')
plt.ylabel('Accuracy')
plt.title('Model Performance vs Number of Bins')
plt.grid(True, alpha=0.3)
plt.xticks(n_bins_range)
plt.ylim([min(scores) - 0.01, max(scores) + 0.01])
plt.show()

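# Picking n_bins by test accuracy is for illustration only; in practice, select it
# with cross-validation on the training data and keep the test set untouched.
optimal_bins = n_bins_range[np.argmax(scores)]
print(f"\nBest number of bins on this test split: {optimal_bins}")
print("\nNote: Performance usually plateaus after a certain number of bins.")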
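
The loop above scores each candidate on the test split. One possible alternative is to tune `n_bins` with cross-validation inside a pipeline, so the binner is refit on each training fold. The sketch below assumes synthetic data from `make_classification`; the step names `bin` and `clf`, the candidate grid, and the `uniform` strategy are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for a training set
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

pipe = Pipeline([
    ('bin', KBinsDiscretizer(encode='onehot-dense', strategy='uniform')),
    ('clf', LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation over the number of bins
search = GridSearchCV(pipe, {'bin__n_bins': [3, 5, 7, 10]}, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

best_n_bins = search.best_params_['bin__n_bins']
print(f"Best n_bins: {best_n_bins} (CV accuracy = {search.best_score_:.3f})")
```

Because the binner lives inside the pipeline, each fold computes its bin edges from that fold's training portion only.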

## 10. Best Practices

### ✅ DO:

1. **Use domain knowledge**
   - Create bins that make business sense
   - Align with industry standards

2. **Choose appropriate binning strategy**
   - Uniform: When data is evenly distributed
   - Quantile: When data is skewed
   - Custom: When domain knowledge exists

3. **Bin AFTER splitting**
   - Fit binner on training data
   - Transform both train and test
   - Avoid data leakage

4. **Experiment with number of bins**
   - Try 3-10 bins
   - Use cross-validation
   - Balance granularity and simplicity

5. **Consider for linear models**
   - Helps capture non-linearity
   - Makes features more interpretable

### ❌ DON'T:

1. **Don't bin with tree models** (usually)
   - Trees learn their own split thresholds
   - Binning may reduce performance

2. **Don't create too many bins**
   - Defeats purpose of discretization
   - May lead to overfitting

3. **Don't ignore bin distribution**
   - Check if bins are balanced
   - Avoid bins with very few samples

4. **Don't forget to encode**
   - Use ordinal when the effect is monotonic or the model is tree-based
   - Use one-hot when a linear model must learn a separate effect per bin

5. **Don't bin everything**
   - Only bin when it makes sense
   - Keep continuous when relationship is linear

## 11. Exercise Section

### Exercise 1: Choose the Right Binning Strategy

For each scenario, choose the best binning approach.

In [None]:
# Exercise 1: Match scenarios to binning strategies

scenarios = {
    'A': 'Age for insurance risk categories (industry has standard groups)',
    'B': 'Website response time in milliseconds (heavily right-skewed)',
    'C': 'Test scores from 0-100 (evenly distributed)',
    'D': 'Income for tax bracket analysis (specific thresholds exist)',
    'E': 'Temperature readings (normally distributed)'
}

strategies_options = {
    '1': 'Equal-width (uniform)',
    '2': 'Equal-frequency (quantile)',
    '3': 'Custom bins (domain knowledge)',
    '4': "Don't bin (keep continuous)"
}

print("Scenarios:")
for key, scenario in scenarios.items():
    print(f"{key}. {scenario}")

print("\nBinning Strategies:")
for key, strategy in strategies_options.items():
    print(f"{key}. {strategy}")

print("\nYour answers:")
# A: ?
# B: ?
# C: ?
# D: ?
# E: ?

In [None]:
# Solution to Exercise 1

print("Solutions:\n")
print("A: 3 - Custom bins")
print("   Insurance industry has standard age groups (e.g., 18-25, 26-40, 41-60, 61+)\n")

print("B: 2 - Equal-frequency (quantile)")
print("   Skewed data needs quantile binning to avoid imbalanced bins\n")

print("C: 1 - Equal-width (uniform)")
print("   Evenly distributed data works well with uniform bins (e.g., 0-20, 21-40, 41-60, 61-80, 81-100)\n")

print("D: 3 - Custom bins")
print("   Tax brackets are legally defined, use those exact thresholds\n")

print("E: 1 or 4 - Equal-width or Don't bin")
print("   Normally distributed data doesn't need special handling. May not need binning at all.\n")

print("Key principle: Use domain knowledge when available, otherwise choose based on distribution!")

### Exercise 2: Create Custom Bins

Create meaningful income brackets for financial analysis.

In [None]:
# Exercise 2: Create custom income brackets

# Income data
incomes = customer_data['income'].copy()

print("Income statistics:")
print(incomes.describe())

# TODO: Create meaningful income brackets
# Consider: Low, Lower-Middle, Middle, Upper-Middle, High
# Choose thresholds that make sense (e.g., poverty line, median, etc.)

# Your code here:
# income_brackets = pd.cut(
#     incomes,
#     bins=[???, ???, ???, ???, ???, ???],
#     labels=[???, ???, ???, ???, ???]
# )

# print("\nIncome bracket distribution:")
# print(income_brackets.value_counts().sort_index())

In [None]:
# Solution to Exercise 2

# Create meaningful income brackets
income_brackets = pd.cut(
    incomes,
    bins=[0, 30000, 50000, 75000, 100000, 300000],
    labels=['Low (<30k)', 'Lower-Middle (30-50k)', 'Middle (50-75k)', 
            'Upper-Middle (75-100k)', 'High (>100k)'],
    include_lowest=True
)

print("Solution: Income Brackets\n")
print("Distribution:")
print(income_brackets.value_counts().sort_index())

# Analyze approval rate by income bracket
approval_by_income = pd.crosstab(
    income_brackets, 
    customer_data['approved'],
    normalize='index'
) * 100

print("\nApproval rate by income bracket:")
print(approval_by_income[1].sort_index())

# Visualize
approval_by_income[1].sort_index().plot(
    kind='bar', figsize=(10, 6), color='green', edgecolor='black'
)
plt.xlabel('Income Bracket')
plt.ylabel('Approval Rate (%)')
plt.title('Credit Card Approval Rate by Income Bracket')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nThresholds are illustrative round numbers, not official definitions:")
print("- $30k: Upper bound of the low-income bracket")
print("- $50k: Upper bound of the lower-middle bracket")
print("- $75k: Upper bound of the middle bracket")
print("- $100k: Start of the high-income bracket")

### Exercise 3: Compare Binning Strategies

Implement and compare equal-width and equal-frequency binning on the same data.

In [None]:
# Exercise 3: Compare equal-width vs equal-frequency

# Use debt data (likely skewed)
debt_data = customer_data['debt'].copy()

print("Debt statistics:")
print(debt_data.describe())

# TODO:
# 1. Create equal-width bins (5 bins)
# 2. Create equal-frequency bins (5 bins)
# 3. Compare the distributions
# 4. Which is more balanced?

# Your code here:

In [None]:
# Solution to Exercise 3

# Equal-width binning
debt_uniform = pd.cut(debt_data, bins=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

# Equal-frequency binning
debt_quantile = pd.qcut(debt_data, q=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

print("Comparison of Binning Strategies:\n")
print("Equal-Width Binning:")
print(debt_uniform.value_counts().sort_index())
print(f"\nStandard deviation of counts: {debt_uniform.value_counts().std():.1f}")

print("\nEqual-Frequency Binning:")
print(debt_quantile.value_counts().sort_index())
print(f"\nStandard deviation of counts: {debt_quantile.value_counts().std():.1f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

debt_uniform.value_counts().sort_index().plot(kind='bar', ax=axes[0], edgecolor='black', color='coral')
axes[0].set_xlabel('Debt Category')
axes[0].set_ylabel('Count')
axes[0].set_title('Equal-Width Binning')
axes[0].tick_params(axis='x', rotation=45)
axes[0].grid(True, alpha=0.3, axis='y')

debt_quantile.value_counts().sort_index().plot(kind='bar', ax=axes[1], edgecolor='black', color='lightgreen')
axes[1].set_xlabel('Debt Category')
axes[1].set_ylabel('Count')
axes[1].set_title('Equal-Frequency Binning')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\nConclusion:")
print("- Equal-width: Imbalanced, most data in the lowest bins")
print("- Equal-frequency: Balanced, each bin has similar count")
print("- For skewed data, equal-frequency is usually better!")

### Exercise 4: Prevent Data Leakage

Identify and fix the data leakage problem.

In [None]:
# Exercise 4: Fix the data leakage

print("Problematic code:")
print('''
# Bin the data
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
X_binned = binner.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_binned, y)

# Train model
model.fit(X_train, y_train)
''')

print("\nWhat's wrong? How would you fix it?")
# Your answer:

In [None]:
# Solution to Exercise 4

print("Problem: DATA LEAKAGE!\n")
print("The binner is fit on ALL data before splitting.")
print("For quantile binning, bin edges are calculated from rows that later end up in the test set!\n")

print("Why this is bad:")
print("- Test set statistics leak into training")
print("- Quantiles calculated using entire dataset")
print("- Model sees information it shouldn't have")
print("- Performance estimates are overly optimistic\n")

print("Correct approach:")
print('''
# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Fit binner on training data only
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
X_train_binned = binner.fit_transform(X_train)

# Apply same binning to test data
X_test_binned = binner.transform(X_test)  # Note: transform, not fit_transform!

# Train model
model.fit(X_train_binned, y_train)
''')

print("\n✓ Always: Split → Fit on train → Transform both")
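
A small sketch of what "fit on train, transform both" means in terms of bin edges. It assumes toy data and the `uniform` strategy; `bin_edges_` holds the edges learned during `fit`, and values outside the training range are placed in the first or last bin when transformed.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

train = np.arange(10, dtype=float).reshape(-1, 1)  # toy training values 0..9
new = np.array([[-5.0], [4.0], [100.0]])           # includes values outside the training range

binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
binner.fit(train)                                  # edges come from training data only

edges = binner.bin_edges_[0]                       # edges for the first (and only) feature
new_binned = binner.transform(new).ravel()         # reuses the training edges

print("edges     :", edges)
print("new values:", new.ravel())
print("bin index :", new_binned)
```

The unseen values do not move the edges: -5 falls into the first bin and 100 into the last one.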

## 12. Summary

### Key Takeaways

1. **Binning converts continuous features to categorical**
   - Useful for capturing non-linear patterns
   - Makes features more interpretable
   - Can handle outliers better

2. **Three main binning strategies**:
   - **Equal-width**: Uniform intervals, good for evenly distributed data
   - **Equal-frequency**: Balanced bins, better for skewed data
   - **Custom bins**: Use domain knowledge for meaningful categories

3. **sklearn's KBinsDiscretizer** provides unified interface
   - Strategies: uniform, quantile, kmeans
   - Encoding: ordinal, onehot, onehot-dense
   - Easy to integrate in pipelines

4. **When binning helps**:
   - ✅ Linear models with non-linear data
   - ✅ Domain knowledge suggests categories
   - ✅ Outliers are problematic
   - ✅ Interpretability is important
   - ❌ Tree-based models (usually don't need it)

5. **Choosing number of bins**:
   - Too few (2-3): Loss of information
   - Too many (20+): Defeats purpose
   - Sweet spot: 4-10 bins
   - Use cross-validation to optimize

6. **Always avoid data leakage**:
   - Split data first
   - Fit binner on training data only
   - Transform both train and test

### What's Next?

**Module 06**: Date and Time Features - Learn how to extract temporal features from timestamps

### Additional Resources

- [Sklearn KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html)
- [Pandas cut and qcut](https://pandas.pydata.org/docs/reference/api/pandas.cut.html)
- "Feature Engineering for Machine Learning" by Alice Zheng & Amanda Casari

---

**Congratulations!** You've completed Module 05. You now know:
- What binning/discretization is and when to use it
- How to apply equal-width and equal-frequency binning
- How to create custom bins based on domain knowledge
- How to use sklearn's KBinsDiscretizer
- When binning helps vs hurts model performance
- How to choose the right number of bins

Ready to work with temporal data? Move to **Module 06: Date and Time Features**!