# Binning & Binarization

## Binning (Discretization)

Binning is the process of converting **continuous numerical data into discrete bins or intervals**. It reduces the impact of minor observation errors and can reveal patterns in noisy data.

### When to Use Binning:
- When you have **continuous variables** that would benefit from being **categorical**
- To **reduce the effect of outliers** by grouping extreme values
- When the relationship between feature and target is **non-linear** and per-bin effects capture it better than a single slope
- To **simplify models** by reducing the number of distinct values
- When domain knowledge suggests **natural groupings** (e.g., age groups, income brackets)

### Types of Binning:

## 1. Unsupervised Binning

Unsupervised binning methods **do not use the target variable** and only consider the distribution of the feature itself.

### i. Equal Width Binning (Uniform Binning)

- Divides the **entire range** of values into **equal-sized intervals**
- Each bin has the **same width**: (max - min) / n_bins
- **Limits the influence of outliers** by placing them in edge bins, although extreme values also stretch the range and can crowd most observations into a few bins
- **Visual characteristic**: The histogram shape remains similar before and after binning (no changes in distribution shape)
- **Limitation**: Can create **imbalanced bins** with very few observations in some bins if data is skewed

**Example**: For ages 0-100 with 5 bins → [0, 20), [20, 40), [40, 60), [60, 80), [80, 100] (each boundary value belongs to exactly one bin)

**Implementation**: `pd.cut()` or `KBinsDiscretizer(strategy='uniform')`
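
As a minimal sketch of the age example above, the snippet below builds five bins of width 20 with `pd.cut()`. The `ages` values are made up for illustration, and the explicit `edges` array is an assumption chosen to match the 0-100 range rather than something derived from real data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, made up purely for illustration.
ages = pd.Series([2, 15, 22, 37, 41, 58, 63, 79, 85, 100])

# Six evenly spaced edges over 0-100 give five bins of width 20.
edges = np.linspace(0, 100, num=6)

# labels=False returns the integer bin index instead of interval labels.
# include_lowest=True closes the first bin on the left so a value of 0 is kept.
bin_index = pd.cut(ages, bins=edges, labels=False, include_lowest=True)

print(bin_index.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```

Note that `pd.cut()` uses right-closed intervals by default, so a value of exactly 20 lands in the first bin here. Passing an integer such as `bins=5` instead of explicit edges lets pandas derive the range from the data's own minimum and maximum.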

---

### ii. Equal Frequency Binning (Quantile Binning) ⭐ HIGHLY USED

- Each bin contains approximately the **same number of observations**
- If you want 10 bins, each interval contains roughly **10% of total observations** (ties at a cut point can make the counts slightly uneven)
- The **width of intervals may NOT be equal** (varies based on data density)
- **Useful for handling outliers** and skewed distributions
- **Visual characteristic**: Makes the binned distribution **uniform** - a bar chart of bin counts shows bars of **roughly equal height**
- **Most commonly used** because it ensures balanced representation across all bins

**Example**: For 100 observations with 4 bins → Each bin contains about 25 observations (exactly 25 when no values tie at the cut points), but bin widths differ

**Implementation**: `pd.qcut()` or `KBinsDiscretizer(strategy='quantile')`
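
The sketch below illustrates the 100-observation example with `pd.qcut()`. The data (the squares of 1 to 100) is a hypothetical right-skewed sample picked only because it has no ties, so the counts come out even.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample: 1, 4, 9, ..., 10000 (100 distinct values).
values = pd.Series(np.arange(1, 101) ** 2)

# q=4 cuts at the quartiles; retbins=True also returns the bin edges.
quartile, edges = pd.qcut(values, q=4, labels=False, retbins=True)

# Observations per bin, ordered by bin index.
counts = quartile.value_counts().sort_index()
# Width of each bin; these differ because the data is denser at the low end.
widths = np.diff(edges)

print(counts.tolist())  # [25, 25, 25, 25]
print(widths)
```

The counts are equal while the widths grow from the first bin to the last, which is the trade-off equal-frequency binning makes.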

---

### iii. K-Means Binning

- Uses the **K-Means clustering algorithm** to create bins
- The bin edges are the **midpoints between adjacent cluster centroids** (cluster centers)
- Each data point therefore falls in the bin of its **nearest centroid**
- **Best used when** data naturally divides into **multiple clusters** or has a multi-modal distribution
- Adapts bin edges to the cluster structure of the data, at the cost of running a clustering step
- If you want 5 bins → K-Means finds 5 cluster centers and assigns each point to the nearest center

**Example**: Salary data with natural clusters (entry-level, mid-level, senior, executive)

**Implementation**: `KBinsDiscretizer(strategy='kmeans')`
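
A small sketch of K-Means binning with `KBinsDiscretizer` follows. The salary figures (in thousands) are invented and deliberately form three well-separated groups, so three bins are assumed instead of the four levels named in the example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical salaries in thousands, forming three obvious groups.
salaries = np.array([30, 32, 35, 60, 62, 65, 120, 125, 130], dtype=float)

# scikit-learn transformers expect a 2D array: one column per feature.
X = salaries.reshape(-1, 1)

# encode="ordinal" returns the bin index rather than a one-hot matrix.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
codes = disc.fit_transform(X).ravel().astype(int)

print(codes.tolist())      # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(disc.bin_edges_[0])  # outer edges are the data min and max
```

The inner entries of `bin_edges_` sit halfway between neighbouring cluster centers, so the edges follow the gaps in the data rather than a fixed width or a fixed count.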

---

## 2. Supervised Binning

Supervised binning methods **use the target variable** to create optimal bin boundaries.

### i. Decision Tree Binning

- Uses a **decision tree algorithm** to determine optimal split points
- Bins are created based on how well they **separate different target classes**
- The algorithm finds split points that **maximize information gain** or minimize impurity
- **Best for classification problems** where you want bins that discriminate between classes
- Results in bins that are **most predictive** of the target variable
- Can handle **non-linear relationships** between feature and target

**Example**: Binning age based on how well different age ranges predict loan default (Yes/No)

**Implementation**: Train a `DecisionTreeClassifier` on the single feature with `max_leaf_nodes` set to the desired number of bins, then use the leaf assignments (or the learned split thresholds) as bins
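
One way to sketch this is shown below, assuming a toy loan dataset in which the youngest and oldest applicants default. The `age` and `default` arrays are hypothetical, and reading thresholds from `tree_.threshold` is just one of several ways to turn the fitted tree into bins.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: applicant age and whether the loan defaulted (1 = yes).
age = np.array([21, 23, 25, 28, 35, 38, 42, 45, 61, 64, 67, 70]).reshape(-1, 1)
default = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1])

# max_leaf_nodes=3 caps the tree at three leaves, i.e. three bins.
tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
tree.fit(age, default)

# Internal nodes have a feature index >= 0; leaves are marked with -2.
thresholds = np.sort(tree.tree_.threshold[tree.tree_.feature >= 0])

# right=True matches the tree's "value <= threshold goes left" rule.
bins = np.digitize(age.ravel(), thresholds, right=True)

print(thresholds)     # [31.5 53. ]
print(bins.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

Because the split points are learned from the target, they should be fitted on training data only and then reused unchanged on validation or test data.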

---

## 3. Custom Binning

- Based on **domain knowledge**, **business logic**, or **expert judgment**
- You manually define bin boundaries that make sense for your specific use case
- **Most interpretable** because bins have real-world meaning
- Requires understanding of the domain and business context

**Examples**:
- Income: \$0-30k → Low, \$30k-70k → Middle, \$70k-150k → Upper Middle, \$150k+ → High
- Age: 0-17 → Minor, 18-64 → Working Age, 65+ → Retired
- Credit Score: <580 → Poor, 580-669 → Fair, 670-739 → Good, 740-799 → Very Good, 800+ → Exceptional

**Implementation**: `pd.cut()` with custom bin edges
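
The credit score example can be sketched with `pd.cut()` and explicit edges, as below. The sample `scores` are made up, and the outer edges of 0 and infinity are an assumption so that no score falls outside the bins.

```python
import numpy as np
import pandas as pd

# Hypothetical credit scores, including values right on the boundaries.
scores = pd.Series([540, 580, 669, 670, 739, 740, 799, 800, 850])

# Edges follow the credit score bands listed above; 0 and inf are assumed outer bounds.
edges = [0, 580, 670, 740, 800, np.inf]
labels = ["Poor", "Fair", "Good", "Very Good", "Exceptional"]

# right=False makes each bin left-closed, so 580 is "Fair" and 800 is "Exceptional".
rating = pd.cut(scores, bins=edges, labels=labels, right=False)

print(rating.tolist())
```

Choosing `right=False` or the default `right=True` decides which bin a boundary value belongs to, so it is worth checking against how the business rule is worded.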

---

### Scikit-learn and pandas Tools:
- `KBinsDiscretizer`: Flexible binning with strategies (uniform, quantile, kmeans)
- `pd.cut()`: Pandas function for equal-width and custom binning
- `pd.qcut()`: Pandas function for equal-frequency binning

---

## Binarization

Binarization converts **numerical features into binary (0 or 1)** based on a threshold value.

### When to Use Binarization:
- When you only care about **whether a value exceeds a threshold**, not the actual value
- For **presence/absence** scenarios (e.g., customer purchased: yes/no)
- To create **indicator variables** (e.g., high risk vs low risk)
- When **simplifying decision boundaries** improves model performance
- For **text processing** (word present or not in a document)

### How It Works:
- Choose a **threshold value**
- Values **≤ threshold** → 0
- Values **> threshold** → 1

### Scikit-learn Tool:
- `Binarizer`: Converts continuous values to binary based on threshold
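
A minimal sketch of the threshold rule with `Binarizer`, using invented temperature readings and an assumed threshold of 30°C:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical temperatures in °C, as a single-feature 2D array.
temps = np.array([[25.0], [30.0], [30.5], [41.0]])

# Values strictly greater than the threshold become 1, all others 0.
hot = Binarizer(threshold=30.0).fit_transform(temps)

# The same rule written directly in NumPy.
hot_np = (temps > 30.0).astype(int)

print(hot.ravel().tolist())  # [0.0, 0.0, 1.0, 1.0]
```

A reading of exactly 30.0 maps to 0 because the comparison is strict. `Binarizer` learns nothing during fitting, so its main advantage over the NumPy one-liner is that it slots into a scikit-learn `Pipeline`.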

---

## Key Differences

| Aspect | Binning | Binarization |
|--------|---------|-------------|
| **Output** | Multiple ordered categories (usually 3+ bins) | Binary (0 or 1) |
| **Use Case** | Group continuous data into ranges | Simple threshold-based classification |
| **Complexity** | More flexible, multiple bins | Simple, only 2 values |
| **Example** | Age → Child/Teen/Adult/Senior | Temperature > 30°C → Hot (1) or Not (0) |

---

## Important Considerations

### For Binning:
- **Information loss**: Binning reduces granularity of data
- **Bin selection matters**: Wrong bin boundaries can hide patterns
- **Encoding needed**: After binning, you may need one-hot encoding for models (such as linear models) that would otherwise treat bin indices as numeric magnitudes
- **Reversibility**: Original values cannot be recovered after binning
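
To illustrate the encoding point, the sketch below bins and one-hot encodes in a single step using the `encode` parameter of `KBinsDiscretizer`. The input values are arbitrary and the choice of three uniform bins is an assumption for the example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical feature values spanning 0-9.
X = np.array([0, 1, 4, 5, 8, 9], dtype=float).reshape(-1, 1)

# encode="onehot-dense" returns one 0/1 column per bin as a dense array.
disc = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform")
onehot = disc.fit_transform(X)

print(onehot.shape)                    # (6, 3)
print(onehot.argmax(axis=1).tolist())  # [0, 0, 1, 1, 2, 2]
```

Tree-based models can usually work with ordinal bin indices directly (`encode="ordinal"`), so the one-hot step matters mostly for linear and distance-based models.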

### For Binarization:
- **Threshold selection is critical**: Wrong threshold leads to poor results
- **Maximum information loss**: Only preserves above/below threshold info
- **Domain expertise needed**: Threshold should have business/scientific meaning
- **Not always appropriate**: Use only when binary distinction is meaningful

---

## Practical Examples

**Binning Example:**
- Income: \$0-30k → Low, \$30k-70k → Middle, \$70k+ → High
- Age: 0-17 → Minor, 18-64 → Adult, 65+ → Senior

**Binarization Example:**
- Systolic Blood Pressure > 140 mmHg → Hypertensive (1) or Normal (0)
- Transaction Amount > \$1000 → High Value (1) or Low Value (0)
- Test Score > 75% → Pass (1) or Fail (0)