# Binning and Binarization in Feature Transformation | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2016%20Binning%20and%20Binarization)

When working with continuous data, it is often useful to convert it into discrete intervals (bins) or into binary values. The first process is known as **binning** (or discretization), the second as **binarization**. These techniques can help reduce noise, limit the influence of outliers, and simplify models.

---

## 1. Binning and Binarization

### Binarization

**Binarization** converts a continuous feature into binary (0/1) values based on a threshold \( t \). The transformation is given by:  

$$  
f(x) =   
\begin{cases}  
1 & \text{if } x > t, \\
0 & \text{if } x \leq t.  
\end{cases}  
$$  

### Python Code: Binarization Example

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Sample data
data = {'Feature': [0.2, 0.5, 1.5, 2.0, 3.5, 4.0, 5.2]}
df = pd.DataFrame(data)

# Set threshold value
threshold = 2.0

# Binarize the feature based on the threshold
df['Binarized'] = (df['Feature'] > threshold).astype(int)

print("Binarization Example:")
print(df)

# Plot original and binarized data
plt.figure(figsize=(8, 4))
plt.plot(df['Feature'], 'o-', label='Original Feature')
plt.step(range(len(df)), df['Binarized'], where='mid', label='Binarized', color='red')
plt.xlabel('Index')
plt.ylabel('Value')
plt.legend()
plt.title("Binarization of Feature")
plt.show()
```
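The same thresholding rule is also available as a scikit-learn transformer, which can be convenient inside a preprocessing pipeline. The snippet below is a minimal sketch, assuming scikit-learn is installed; the variable names `values` and `binarized` are illustrative and not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Same sample values as above, reshaped to the 2D layout scikit-learn expects
values = np.array([0.2, 0.5, 1.5, 2.0, 3.5, 4.0, 5.2]).reshape(-1, 1)

# Values strictly greater than the threshold become 1, all others become 0
binarizer = Binarizer(threshold=2.0)
binarized = binarizer.fit_transform(values).ravel().astype(int)

print(binarized)  # [0 0 0 0 1 1 1]
```

Note that the value 2.0 maps to 0, because the rule uses a strict inequality \( x > t \).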

---

## 2. Discretization

<p><strong>Discretization</strong> (or binning) converts a continuous feature into categorical bins. For a feature <i>x</i> and a set of bin boundaries <i>{b<sub>0</sub>, b<sub>1</sub>, &hellip;, b<sub>k</sub>}</i>, the bin index <i>i</i> is defined as:</p>  

$$  
\text{bin}(x) = i \quad \text{if} \quad b_i \leq x < b_{i+1}, \qquad i = 0, 1, \ldots, k-1  
$$  

### Python Code: Discretization using Fixed Bins

```python
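# Define bin edges and one label per interval
bins = [0, 1, 2, 4, 6]
labels = ['Very Low', 'Low', 'Medium', 'High']

# Use pd.cut to discretize the feature (reuses df from the binarization example).
# right=False makes each bin left-closed, [b_i, b_{i+1}), matching the definition above;
# the pandas default (right=True) would instead use right-closed bins (b_i, b_{i+1}].
df['Discretized'] = pd.cut(df['Feature'], bins=bins, labels=labels, right=False)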

print("\nDiscretization (Fixed Bins) Example:")
print(df)
```
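To see the half-open rule \( b_i \leq x < b_{i+1} \) at work on values that sit exactly on a boundary, the bin index can also be computed with `np.digitize`. This is a small sketch for illustration; the names `values`, `edges`, and `bin_index` are assumptions introduced here.

```python
import numpy as np
import pandas as pd

values = np.array([0.2, 0.5, 1.5, 2.0, 3.5, 4.0, 5.2])
edges = [0, 1, 2, 4, 6]

# np.digitize returns j such that edges[j-1] <= x < edges[j];
# subtracting 1 gives the zero-based bin index i from the definition
bin_index = np.digitize(values, edges, right=False) - 1

# pd.cut with right=False and labels=False should produce the same indices
cut_index = pd.cut(values, bins=edges, labels=False, right=False)

print(bin_index)  # [0 0 1 2 2 3 3]
```

The boundary values 2.0 and 4.0 fall into the bin that starts at them, not the one that ends at them.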

---

## 3. Quantile Binning

**Quantile Binning** divides the data into bins such that each bin has (approximately) the same number of observations. The bin boundaries are defined by the quantiles of the feature’s distribution.

For \( k \) bins, the interior boundaries are the quantiles taken at multiples of \( \frac{100}{k} \) percent, and the minimum and maximum of the data close the outer bins.  
For example, with 4 bins (quartiles), the interior boundaries are the 25th, 50th, and 75th percentiles.

### Python Code: Quantile Binning

```python
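# Use pd.qcut to perform quantile binning (reuses df from the earlier examples);
# labels=False returns integer bin indices 0..3 instead of interval labels
df['Quantile_Bin'] = pd.qcut(df['Feature'], q=4, labels=False)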

print("\nQuantile Binning Example:")
print(df)
```
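It can help to look at the bin edges themselves. The sketch below computes the quartile boundaries with `np.quantile` and compares them with the edges that `pd.qcut` reports via `retbins=True`. The variable names are illustrative, and the comparison assumes both libraries use their default linear interpolation between data points.

```python
import numpy as np
import pandas as pd

values = pd.Series([0.2, 0.5, 1.5, 2.0, 3.5, 4.0, 5.2])
k = 4

# Quantile levels 0, 1/k, 2/k, ..., 1 (the outer two are the min and max)
probs = np.linspace(0, 1, k + 1)
edges = np.quantile(values, probs)

# retbins=True makes qcut return the edges it used alongside the bin codes
codes, qcut_edges = pd.qcut(values, q=k, labels=False, retbins=True)

print("Edges from np.quantile:", edges)
print("Edges from pd.qcut:    ", qcut_edges)
print("Observations per bin:  ", codes.value_counts().sort_index().tolist())
```

With seven values and four bins the counts cannot be exactly equal, which is why the definition says "approximately".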

---

## 4. KMeans Binning

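**KMeans Binning** uses clustering to group similar values together. The idea is to apply KMeans clustering to the one-dimensional feature and then use the cluster labels as bin indices. Note that KMeans numbers its clusters arbitrarily, so the labels identify groups but do not necessarily follow the order of the feature values.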

### Steps:
1. Reshape the feature into a 2D array.
2. Fit KMeans with the desired number of bins \( k \).
3. Use the cluster labels as the bin assignment.

### Python Code: KMeans Binning

```python
from sklearn.cluster import KMeans

# Reshape the feature to a 2D array (required by KMeans)
X = df['Feature'].values.reshape(-1, 1)

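# Set the desired number of bins (clusters)
k = 3
# n_init is set explicitly so the result does not depend on the library's version-specific default
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
df['KMeans_Bin'] = kmeans.fit_predict(X)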

print("\nKMeans Binning Example:")
print(df)

# Plot the clusters (bins)
plt.figure(figsize=(8, 4))
plt.scatter(range(len(df)), df['Feature'], c=df['KMeans_Bin'], cmap='viridis', s=100)
plt.xlabel('Index')
plt.ylabel('Feature Value')
plt.title("KMeans Binning of Feature")
plt.colorbar(label='Cluster Label (Bin)')
plt.show()
```
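If the bins are meant to be ordinal (bin 0 holding the smallest values, and so on), the cluster labels can be renumbered by the position of their cluster centers. The following is one possible sketch, not the only way to do it; the names `rank` and `ordered_bins` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([0.2, 0.5, 1.5, 2.0, 3.5, 4.0, 5.2])
X = values.reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
raw_labels = kmeans.fit_predict(X)

# rank[j] is the ordinal position of cluster j when centers are sorted ascending
centers = kmeans.cluster_centers_.ravel()
rank = np.argsort(np.argsort(centers))

# Map each arbitrary cluster label to its ordinal bin index
ordered_bins = rank[raw_labels]

print("Raw labels:  ", raw_labels)
print("Ordered bins:", ordered_bins)
```

scikit-learn also provides `KBinsDiscretizer` with `strategy="kmeans"`, which returns bins ordered by value directly.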

---

## Conclusion

- **Binarization** converts a continuous variable into a binary value based on a threshold.
- **Discretization** groups continuous data into predefined bins using fixed boundaries.
- **Quantile Binning** aims for approximately equal frequency in each bin by setting boundaries at data quantiles.
- **KMeans Binning** applies clustering to determine bin assignments based on natural groupings in the data.
