# Binning and Discretization Practice
This notebook demonstrates common discretization strategies using scikit-learn on a small customer dataset.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, Binarizer

In [2]:
def summarize_bins(discretizer, feature_names):
    """Return a DataFrame of rounded bin edges, one row per feature."""
    summaries = {
        feature: np.round(edges, 2).tolist()
        for feature, edges in zip(feature_names, discretizer.bin_edges_)
    }
    # Column count follows the longest row, so features with fewer edges are padded with NaN.
    return pd.DataFrame.from_dict(summaries, orient="index").add_prefix("edge_")

## Inspect the Dataset
We start by loading the synthetic dataset and reviewing its structure before applying any transformations.

In [3]:
df = pd.read_csv("dataset.csv")
df.head()

Unnamed: 0,CustomerID,Age,AnnualIncome,PurchaseFrequency,LoyaltyScore
0,1,24,32000,4,0.35
1,2,31,54000,9,0.58
2,3,28,42000,6,0.44
3,4,45,86000,12,0.73
4,5,52,91000,8,0.69


### Data Summary
Check data types, missing values, and basic statistics to understand scale before binning.

In [4]:
df.info()
print("\nMissing values per column:")
print(df.isna().sum())
df.describe().T

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CustomerID         20 non-null     int64  
 1   Age                20 non-null     int64  
 2   AnnualIncome       20 non-null     int64  
 3   PurchaseFrequency  20 non-null     int64  
 4   LoyaltyScore       20 non-null     float64
dtypes: float64(1), int64(4)
memory usage: 932.0 bytes

Missing values per column:
CustomerID           0
Age                  0
AnnualIncome         0
PurchaseFrequency    0
LoyaltyScore         0
dtype: int64


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CustomerID,20.0,10.5,5.91608,1.0,5.75,10.5,15.25,20.0
Age,20.0,37.9,9.497368,24.0,30.5,37.0,45.5,56.0
AnnualIncome,20.0,61950.0,19874.54071,32000.0,46500.0,60000.0,78750.0,99000.0
PurchaseFrequency,20.0,7.9,2.918183,3.0,5.75,8.0,10.0,13.0
LoyaltyScore,20.0,0.5725,0.146714,0.3,0.4625,0.595,0.6825,0.81


## Equal-Width Binning (Uniform)
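Each bin spans the same value range. This works well when values are spread roughly evenly, but it is sensitive to outliers: a single extreme value stretches the range and can leave most samples crowded into one or two bins.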

In [5]:
features = ["Age", "AnnualIncome"]
uniform_discretizer = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
uniform_binned = uniform_discretizer.fit_transform(df[features])
uniform_df = pd.DataFrame(uniform_binned, columns=[f"{col}_uniform_bin" for col in features])

pd.concat([df[features], uniform_df], axis=1).head()

Unnamed: 0,Age,AnnualIncome,Age_uniform_bin,AnnualIncome_uniform_bin
0,24,32000,0.0,0.0
1,31,54000,0.0,1.0
2,28,42000,0.0,0.0
3,45,86000,2.0,3.0
4,52,91000,3.0,3.0


In [6]:
summarize_bins(uniform_discretizer, features)

Unnamed: 0,edge_0,edge_1,edge_2,edge_3,edge_4
Age,24.0,32.0,40.0,48.0,56.0
AnnualIncome,32000.0,48750.0,65500.0,82250.0,99000.0
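
The edges in the table above can be reproduced by hand. The following is a minimal sketch that assumes only the minimum and maximum values reported by `df.describe()` earlier. `uniform_edges` and `assign_bins` are hypothetical helper names, not scikit-learn functions, and the `searchsorted` assignment is one way to reproduce the bin indices shown, not necessarily how the library works internally.

```python
import numpy as np

def uniform_edges(col_min, col_max, n_bins):
    """Equal-width edges: n_bins + 1 evenly spaced points from min to max."""
    return np.linspace(col_min, col_max, n_bins + 1)

def assign_bins(values, edges):
    """Map values to ordinal bin indices using the inner edges only."""
    # A value equal to an inner edge goes to the upper bin; the maximum stays in the last bin.
    return np.searchsorted(edges[1:-1], values, side="right")

# Min/max taken from the describe() output shown earlier.
age_edges = uniform_edges(24, 56, n_bins=4)
income_edges = uniform_edges(32000, 99000, n_bins=4)

# First five rows of the dataset, as displayed by df.head().
age_bins = assign_bins(np.array([24, 31, 28, 45, 52]), age_edges)
income_bins = assign_bins(np.array([32000, 54000, 42000, 86000, 91000]), income_edges)

print(age_edges, age_bins)
print(income_edges, income_bins)
```

The computed edges and bin indices agree with the `uniform_discretizer` output for the same five rows.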


## Equal-Frequency Binning (Quantile)
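Bins contain roughly the same number of samples because the edges are placed at quantiles of the data. This keeps bins populated even when the distribution is skewed, although heavily tied values can still make the counts uneven.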

In [7]:
freq_features = ["Age", "PurchaseFrequency"]
quantile_discretizer = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
quantile_binned = quantile_discretizer.fit_transform(df[freq_features])
quantile_df = pd.DataFrame(quantile_binned, columns=[f"{col}_quantile_bin" for col in freq_features])

pd.concat([df[freq_features], quantile_df], axis=1).head()



Unnamed: 0,Age,PurchaseFrequency,Age_quantile_bin,PurchaseFrequency_quantile_bin
0,24,4,0.0,0.0
1,31,9,1.0,2.0
2,28,6,0.0,1.0
3,45,12,2.0,3.0
4,52,8,3.0,2.0


In [8]:
summarize_bins(quantile_discretizer, freq_features)

Unnamed: 0,edge_0,edge_1,edge_2,edge_3,edge_4
Age,24.0,30.5,37.0,45.5,56.0
PurchaseFrequency,3.0,5.75,8.0,10.0,13.0
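
The inner edges above coincide with the 25%, 50%, and 75% rows of the earlier `describe()` output, which suggests the edges are plain percentiles of each column. Here is a minimal sketch of that idea, assuming linear-interpolation percentiles (NumPy's default). It uses the `CustomerID` values 1 to 20 as a stand-in column because their quartiles appear in the summary table; `quantile_edges` is a hypothetical helper name.

```python
import numpy as np

def quantile_edges(values, n_bins):
    """Equal-frequency edges: percentiles at evenly spaced ranks from 0 to 100."""
    ranks = np.linspace(0, 100, n_bins + 1)
    return np.percentile(values, ranks)

# CustomerID runs from 1 to 20 in this dataset.
customer_ids = np.arange(1, 21)
edges = quantile_edges(customer_ids, n_bins=4)

# Count samples per bin; a value equal to an inner edge goes to the upper bin.
bin_index = np.searchsorted(edges[1:-1], customer_ids, side="right")
counts = np.bincount(bin_index, minlength=4)

print(edges)   # inner edges match the CustomerID quartiles in describe()
print(counts)  # samples per bin
```

Newer scikit-learn releases may use a different quantile definition, so edges can differ slightly from this sketch on other data.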


## K-Means Binning
Cluster-based bins adapt to data density and can capture multi-modal distributions. Each feature is clustered on its own with 1D k-means, and the inner bin edges fall halfway between adjacent cluster centers.

In [9]:
kmeans_features = ["AnnualIncome", "LoyaltyScore"]
kmeans_discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans", random_state=42)
kmeans_binned = kmeans_discretizer.fit_transform(df[kmeans_features])
kmeans_df = pd.DataFrame(kmeans_binned, columns=[f"{col}_kmeans_bin" for col in kmeans_features])

pd.concat([df[kmeans_features], kmeans_df], axis=1).head()

Unnamed: 0,AnnualIncome,LoyaltyScore,AnnualIncome_kmeans_bin,LoyaltyScore_kmeans_bin
0,32000,0.35,0.0,0.0
1,54000,0.58,0.0,1.0
2,42000,0.44,0.0,0.0
3,86000,0.73,2.0,2.0
4,91000,0.69,2.0,2.0


In [10]:
summarize_bins(kmeans_discretizer, kmeans_features)

Unnamed: 0,edge_0,edge_1,edge_2,edge_3
AnnualIncome,32000.0,54344.44,75733.33,99000.0
LoyaltyScore,0.3,0.49,0.65,0.81
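
To make the link between cluster centers and bin edges concrete, here is a rough NumPy-only sketch of 1D k-means binning. It is not scikit-learn's implementation: the helper name `kmeans_bin_edges`, the choice to start the centers at the middle of equal-width bins, and the sample values are all assumptions made for illustration.

```python
import numpy as np

def kmeans_bin_edges(values, n_bins, n_iter=100):
    """Sketch of 1D k-means binning: edges are midpoints between cluster centers."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()

    # Assumed initialisation: centers of equal-width bins over the data range.
    uniform = np.linspace(lo, hi, n_bins + 1)
    centers = (uniform[:-1] + uniform[1:]) / 2

    for _ in range(n_iter):
        # Assign each value to its nearest center.
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        new_centers = np.array([
            values[labels == k].mean() if np.any(labels == k) else centers[k]
            for k in range(n_bins)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    centers = np.sort(centers)
    inner = (centers[:-1] + centers[1:]) / 2
    # Outer edges are the data minimum and maximum.
    return np.concatenate([[lo], inner, [hi]])

# Hypothetical sample with three well-separated groups.
sample = [1, 2, 3, 10, 11, 12, 20, 21, 22]
edges = kmeans_bin_edges(sample, n_bins=3)
print(edges)
```

With this sample the centers settle at the group means (2, 11, and 21), so the inner edges land in the gaps between groups instead of at fixed widths or fixed counts.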


## Feature Binarization
Convert continuous features into binary indicators using a threshold: values strictly greater than the threshold become 1, and all other values become 0. Handy for rule-based models or engineered flags.

In [11]:
binarizer = Binarizer(threshold=60000)
df_binarized = df.copy()
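# fit_transform returns a 2D array of shape (n_samples, 1); flatten it before assigning to a single column.
df_binarized["HighIncomeFlag"] = binarizer.fit_transform(df_binarized[["AnnualIncome"]]).ravel()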
df_binarized[["AnnualIncome", "HighIncomeFlag"]].head()

Unnamed: 0,AnnualIncome,HighIncomeFlag
0,32000,0
1,54000,0
2,42000,0
3,86000,1
4,91000,1
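
The same flag can be written as a plain comparison, which makes the boundary behaviour explicit. This is a small sketch using the five incomes shown above plus one hypothetical value placed exactly at the threshold.

```python
import numpy as np

threshold = 60000

# First five incomes from df.head(), plus a hypothetical value exactly at the threshold.
incomes = np.array([32000, 54000, 42000, 86000, 91000, 60000])

# Strictly-greater-than comparison: the boundary value itself maps to 0.
high_income_flag = (incomes > threshold).astype(int)
print(high_income_flag)
```

If customers earning exactly 60000 should count as high income, the threshold would need to be set just below that value.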


### Next Steps
Experiment with different numbers of bins, thresholds, and features to see how discretization affects downstream models.
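
As one possible starting point, the sketch below compares how many samples land in each bin under equal-width and equal-frequency edges for two bin counts. The skewed sample and the helper `bin_counts` are hypothetical and serve only to illustrate the effect of an outlier; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical right-skewed sample with one outlier.
values = np.array([1, 2, 2, 3, 3, 4, 5, 6, 8, 40], dtype=float)

def bin_counts(values, edges):
    """Number of samples per bin, using inner edges only."""
    bins = np.searchsorted(edges[1:-1], values, side="right")
    return np.bincount(bins, minlength=len(edges) - 1)

results = {}
for n_bins in (2, 4):
    uniform_edges = np.linspace(values.min(), values.max(), n_bins + 1)
    quantile_edges = np.percentile(values, np.linspace(0, 100, n_bins + 1))
    results[n_bins] = {
        "uniform": bin_counts(values, uniform_edges),
        "quantile": bin_counts(values, quantile_edges),
    }
    print(n_bins, results[n_bins]["uniform"], results[n_bins]["quantile"])
```

With four bins, the equal-width edges put nine of the ten samples in the first bin, while the quantile edges spread them as 3, 2, 2, 3.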