## Equal-width k bins for numeric values

Say we want to transform a numeric variable into k bins of equal width. Two approaches are tested here:

1. use `pd.cut`
2. use `KBinsDiscretizer` from sklearn

#### `pd.cut` approach

With this approach:

* Missing values are handled natively (NaN in, NaN out)
* It is easy to switch between string and numeric bin labels
    * string labels: easy to build from the bin edges
    * numeric labels: just number the bins in ascending order

In [6]:
# both approaches can use this function to build the labels
def create_labels(cutoff_values):
  """Creates a list of labels based on a list of cutoff values.
  """

  labels = []
  labels.append(f"[{cutoff_values[0]}, {cutoff_values[1]}]")
  for i in range(1, len(cutoff_values) - 1):
    labels.append(f"({cutoff_values[i]}, {cutoff_values[i + 1]}]")
  return labels
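
As a quick check of what `create_labels` returns, here is a minimal sketch on made-up cutoff values (the `edges` list is an assumption for illustration only; the function is repeated so the snippet runs on its own). It shows that n + 1 cutoffs yield n labels, with only the first interval closed on both ends.

```python
def create_labels(cutoff_values):
    """Creates a list of interval labels from a list of cutoff values."""
    labels = [f"[{cutoff_values[0]}, {cutoff_values[1]}]"]
    for i in range(1, len(cutoff_values) - 1):
        labels.append(f"({cutoff_values[i]}, {cutoff_values[i + 1]}]")
    return labels


# Hypothetical cutoffs, chosen only for illustration
edges = [0, 10, 20, 30]
labels = create_labels(edges)
print(labels)  # ['[0, 10]', '(10, 20]', '(20, 30]']
```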


In [7]:
import pandas as pd
import numpy as np

# Sample training data
train_df = pd.DataFrame({'value': [0, 1, np.nan, 3, 4, 5, 6, 7, np.nan, 9, 10, 11]})
# New data
new_df = pd.DataFrame({'value': [-1, 1, 5, 8, 12, 16, np.nan]})

# Number of bins (n); the labels are built after fitting
n = 4

# FIT: Create equal-width bins for training data, get bin_edges and labels to apply to new data
train_bins, bin_edges = pd.cut(train_df['value'], bins=n, retbins=True)
train_df['bin'] = train_bins
bin_edges[0] = -np.inf
bin_edges[-1] = np.inf

label_type = 'string'  # or 'number'
if label_type == 'string':
    bin_labels = create_labels(bin_edges)
    print(bin_labels)
elif label_type == 'number':
    bin_labels = list(range(n))

# TRANSFORM: Apply the same binning scheme to the new data using the bin edges from training
new_bins = pd.cut(new_df['value'], bins=bin_edges, labels=bin_labels)
new_df['bin'] = new_bins

# Print the new DataFrame with bins
print("\nNew Data:")
print(new_df)


['[-inf, 2.75]', '(2.75, 5.5]', '(5.5, 8.25]', '(8.25, inf]']

New Data:
   value           bin
0   -1.0  [-inf, 2.75]
1    1.0  [-inf, 2.75]
2    5.0   (2.75, 5.5]
3    8.0   (5.5, 8.25]
4   12.0   (8.25, inf]
5   16.0   (8.25, inf]
6    NaN           NaN
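
The run above uses string labels. For numeric labels, one option is to pass `labels=False` to `pd.cut`, which returns integer bin codes directly instead of a categorical. A minimal sketch on the same sample data (the variable names `train`, `new`, `edges`, and `codes` are chosen here for illustration):

```python
import numpy as np
import pandas as pd

train = pd.Series([0, 1, np.nan, 3, 4, 5, 6, 7, np.nan, 9, 10, 11])
new = pd.Series([-1, 1, 5, 8, 12, 16, np.nan])

# FIT: learn equal-width edges on the training data, then open the outer edges
_, edges = pd.cut(train, bins=4, retbins=True)
edges[0], edges[-1] = -np.inf, np.inf

# TRANSFORM: labels=False returns the bin position instead of an interval label
codes = pd.cut(new, bins=edges, labels=False)
print(codes)  # codes 0, 0, 1, 2, 3, 3 and NaN for the missing row
```

Because of the NaN row the codes come back as floats; cast to a nullable integer dtype such as `Int64` if integer codes are needed.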


#### `KBinsDiscretizer` approach

One issue here is that `KBinsDiscretizer` does not natively accept missing values encoded as NaN. One simple solution is to mask the missing values before the transformation and add them back afterwards, as shown below.
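
To see the limitation directly, the sketch below fits `KBinsDiscretizer` on a tiny array containing NaN. It assumes a scikit-learn release in which input validation still rejects NaN for this transformer; the toy array `X` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy input with one missing value (illustrative only)
X = np.array([[0.0], [1.0], [np.nan], [3.0]])
kbins = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")

try:
    kbins.fit(X)
    raised = False
except ValueError as err:
    # Input validation refuses NaN before any bin edges are computed
    raised = True
    print(f"fit rejected the input: {err}")
```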

In [8]:
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np
import pandas as pd

# Sample training data
train_df = pd.DataFrame({'value': [0, 1, np.nan, 3, 4, 5, 6, 7, np.nan, 9, 10, 11]})
                        
# New data
new_df = pd.DataFrame({'value': [-1, 1, 5, 8, 12, 16, np.nan]})

# Create a mask for missing values in both training and new data
train_mask = train_df['value'].isna()
new_mask = new_df['value'].isna()

# Apply KBinsDiscretizer to masked data
kbins = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')

# Fit and transform the training data (excluding missing values)
# fit_transform returns a 2-D array of shape (n_rows, 1), so flatten it before assigning
binned_values_train = kbins.fit_transform(train_df.loc[~train_mask, ['value']])
train_df.loc[~train_mask, 'binned_value'] = binned_values_train.ravel()

# Transform the new data (excluding missing values) using the same transformer
binned_values_new = kbins.transform(new_df.loc[~new_mask, ['value']])
new_df.loc[~new_mask, 'binned_value'] = binned_values_new.ravel()

# Transformed data: numeric bin codes, with missing values left as NaN
print("\nNew Data:")
print(new_df)



New Data:
   value  binned_value
0   -1.0           0.0
1    1.0           0.0
2    5.0           1.0
3    8.0           2.0
4   12.0           3.0
5   16.0           3.0
6    NaN           NaN


In [10]:
# Use a non-default index to confirm that the NaN mask keeps the original index labels
train_df = pd.DataFrame({'value': [0, 1, np.nan, 3, 4, 5, 6, 7, np.nan, 9, 10, 11]})
train_df.index = range(2000, 2000+train_df.shape[0])

train_mask = train_df['value'].isna()
print(train_mask)

2000    False
2001    False
2002     True
2003    False
2004    False
2005    False
2006    False
2007    False
2008     True
2009    False
2010    False
2011    False
Name: value, dtype: bool
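
Since the mask carries the original index labels, the mask-and-fill steps can be wrapped in a small helper. The sketch below is one possible way to do that; `bin_with_nan` is a hypothetical name, not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer


def bin_with_nan(kbins, values):
    """Hypothetical helper: apply a fitted KBinsDiscretizer to a Series,
    leaving NaN rows as NaN and preserving the Series index."""
    mask = values.isna()
    out = pd.Series(np.nan, index=values.index, dtype="float64")
    if (~mask).any():
        # transform returns shape (n_rows, 1); flatten before assigning
        out[~mask] = kbins.transform(values[~mask].to_frame()).ravel()
    return out


train = pd.Series([0, 1, np.nan, 3, 4, 5, 6, 7, np.nan, 9, 10, 11],
                  index=range(2000, 2012), name="value")

kbins = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
kbins.fit(train.dropna().to_frame())

binned = bin_with_nan(kbins, train)
print(binned)
```

The result keeps the 2000 to 2011 index, with NaN at 2002 and 2008, where the input was missing.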


Some more work is needed to get threshold labels, if desired. We can take the bin edges from `kbins.bin_edges_` to build labels as we did with `pd.cut`. One issue is that the lowest edge is 0 rather than -inf, and the highest edge is 11 rather than inf, unlike the edges we used with `pd.cut`.

In [None]:
kbins.bin_edges_[0]

array([ 0.  ,  2.75,  5.5 ,  8.25, 11.  ])

In [None]:
# the issue above is easily addressed by opening the outer edges
kbins.bin_edges_[0][0] = -np.inf
kbins.bin_edges_[0][-1] = np.inf
kbins.bin_edges_[0]

array([-inf, 2.75, 5.5 , 8.25,  inf])

In [None]:
# these labels can then be mapped to the numeric bin codes created above
labels = create_labels(kbins.bin_edges_[0])
print(labels)

['[-inf, 2.75]', '(2.75, 5.5]', '(5.5, 8.25]', '(8.25, inf]']
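
A minimal sketch of that mapping, assuming the ordinal codes are floats as in the output above (the names `code_to_label` and `bin_label` are chosen here for illustration). One caveat worth checking in your own version: `KBinsDiscretizer` appears to put a value that sits exactly on an interior edge into the upper bin, while `pd.cut` by default puts it into the lower bin. The `(a, b]` labels therefore describe the `pd.cut` result exactly but can be off by one bin at the edges for `KBinsDiscretizer`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

train = pd.DataFrame({"value": [0, 1, 3, 4, 5, 6, 7, 9, 10, 11]})
kbins = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform").fit(train)

labels = ['[-inf, 2.75]', '(2.75, 5.5]', '(5.5, 8.25]', '(8.25, inf]']
# Ordinal codes come back as floats, so use float keys
code_to_label = {float(i): label for i, label in enumerate(labels)}

new = pd.DataFrame({"value": [-1.0, 1.0, 5.0, 8.0, 12.0, 16.0]})
codes = pd.Series(kbins.transform(new).ravel(), index=new.index)
new["bin_label"] = codes.map(code_to_label)
print(new)

# Edge check: 2.75 is an interior edge of the fitted bins
edge_code_sklearn = kbins.transform(pd.DataFrame({"value": [2.75]}))[0, 0]
edge_code_pandas = pd.cut(pd.Series([2.75]), bins=kbins.bin_edges_[0],
                          labels=False, include_lowest=True)[0]
print(edge_code_sklearn, edge_code_pandas)
```

If the two codes differ on your installation as well, labels of the form `[a, b)` would describe the `KBinsDiscretizer` bins more accurately.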
