# Binning

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html

__Converts a numerical column (bill_length_mm) to a matrix of binary variables__

1. Execute the code 
   
2. Understand what is happening

3. Search the internet for the benefits of binning

4. Explain to the rest of the group what you did



### Import libraries

In [1]:
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

### Read data into a dataframe df

In [2]:
df = pd.read_csv('../data/penguins.csv')
df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
4,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
...,...,...,...,...,...,...,...
337,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,Female
338,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
339,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
340,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


### Check for missing values

In [3]:
df.isna().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        3
flipper_length_mm    2
body_mass_g          0
sex                  9
dtype: int64

#### Drop all rows with missing values

In [4]:
df = df.dropna()
df.isna().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

### Discretize a numerical column

In [5]:
# Define the transformer
kbins = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='quantile')
kbins

In [6]:
# Numerical columns to be transformed
columns = df[['bill_length_mm']]
columns

Unnamed: 0,bill_length_mm
0,39.1
1,39.5
2,40.3
3,36.7
4,39.3
...,...
337,47.2
338,46.8
339,50.4
340,45.2


#### "Fitting" the transformer kbins on the numerical column
During the fitting step, KBinsDiscretizer only does the following:

1. __Calculation of Bin Edges__:
    + For each feature (e.g., bill_length_mm), __the discretizer calculates the edges of the bins based on the selected strategy:__
        + __Quantile__ (in this case):
            + The data is divided into n_bins (here 5) bins such that each bin contains approximately the same number of samples.
            + The inner bin edges are placed at evenly spaced percentiles (here 20%, 40%, 60%, 80%); the outer edges are the minimum and maximum of the column.
    + These edges are stored in the **bin_edges_** attribute of the KBinsDiscretizer object.
2. __No Transformation of Data__:
    + During fitting, the data itself is not transformed. The discretizer only learns the bin boundaries.
3. __Storage of Bin Information__:
    + The discretizer memorizes the bin edges for each feature. This information will later be used during the transform step to assign each value to a bin.
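
The quantile fitting step described above can be sketched with NumPy alone. This is a minimal illustration, not scikit-learn's own implementation: the sample values are made up, and `quantile_bin_edges` is a hypothetical helper name.

```python
import numpy as np

def quantile_bin_edges(values, n_bins):
    """Return n_bins + 1 edges so that each bin holds roughly the same number of samples."""
    # Percentiles 0, 100/n_bins, ..., 100 (e.g. 0, 20, 40, 60, 80, 100 for 5 bins)
    percentiles = np.linspace(0, 100, n_bins + 1)
    return np.percentile(values, percentiles)

# Made-up sample values, for illustration only
values = np.array([36.7, 39.1, 39.3, 39.5, 40.3, 45.2, 46.8, 47.2, 50.4, 59.6])
edges = quantile_bin_edges(values, n_bins=5)

# Fitting only computes the edges; `values` itself is left unchanged
print(edges)
```

The first and last edges are the minimum and maximum of the data, and the four inner edges split the samples into five groups of roughly equal size.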

In [7]:
kbins.fit(columns)

In [8]:
print(kbins.bin_edges_)

[array([32.1 , 38.6 , 42.02, 46.1 , 49.5 , 59.6 ])]


In [9]:
# Print the range covered by each bin (lower edge first)
edges = kbins.bin_edges_[0]
for i in range(len(edges)-1):
    edge1 = edges[i]
    edge2 = edges[i+1]
    print(f'bin{i+1}: {edge1}_to_{edge2}')


bin1: 32.1_to_38.6
bin2: 38.6_to_42.02
bin3: 42.02_to_46.1
bin4: 46.1_to_49.5
bin5: 49.5_to_59.6


Note that with the 'quantile' strategy, _**kbins.fit(columns)**_ computes the same edges here as:

In [10]:
print(columns.quantile(q=0.).values)  # this is equivalent to the min
print(columns.quantile(q=0.20).values)
print(columns.quantile(q=0.40).values)
print(columns.quantile(q=0.60).values)
print(columns.quantile(q=0.80).values)
print(columns.quantile(q=1.00).values)

[32.1]
[38.6]
[42.02]
[46.1]
[49.5]
[59.6]


#### Transforming the numerical column
Each value is assigned to one of the 5 bins and encoded as 5 one-hot columns, each indicating membership in one bin (0 or 1).

In [11]:
t = kbins.transform(columns)
print(t.shape)
print(t)

(329, 5)
[[0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1.]]
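
How a single value ends up in a bin can be sketched in plain Python. This is a minimal illustration of the assignment logic, not scikit-learn's own code; `bin_index` and `one_hot` are hypothetical helper names, and the edges are copied from the `kbins.bin_edges_` output printed earlier.

```python
from bisect import bisect_right

# Bin edges as printed by kbins.bin_edges_ earlier in this notebook
edges = [32.1, 38.6, 42.02, 46.1, 49.5, 59.6]
n_bins = len(edges) - 1

def bin_index(value, edges):
    """Index of the bin containing value; each bin includes its left edge."""
    inner_edges = edges[1:-1]  # only the inner edges decide the bin
    return bisect_right(inner_edges, value)

def one_hot(index, n_bins):
    """Row of zeros with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(n_bins)]

# First five bill lengths from the table at the top of the notebook
for value in [39.1, 39.5, 40.3, 36.7, 39.3]:
    idx = bin_index(value, edges)
    print(value, idx, one_hot(idx, n_bins))
```

Because only the inner edges are compared, a value equal to the maximum (59.6) still lands in the last bin, and a value that sits exactly on an inner edge goes to the bin on its right.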


---
#### 🌶️ Bonus: Create nice labels

In [12]:
edges = kbins.bin_edges_[0]
labels = []
for i in range(len(edges)-1):
    edge1 = edges[i]
    edge2 = edges[i+1]
    labels.append(f"{edge1}_to_{edge2}")

# create a DataFrame
df_bins = pd.DataFrame(t, columns=labels)
df_bins.head()

Unnamed: 0,32.1_to_38.6,38.6_to_42.02,42.02_to_46.1,46.1_to_49.5,49.5_to_59.6
0,0.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0
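
Since rows were dropped with `dropna()`, `df` may no longer have a consecutive index, while a DataFrame built from the transformed array gets a fresh index starting at 0. If the labelled bins are to be joined back onto the original rows, passing the original index keeps them aligned. A minimal sketch with made-up data; `toy`, `t_toy`, and `bins_toy` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value, for illustration only
toy = pd.DataFrame({'bill_length_mm': [39.1, np.nan, 40.3, 36.7]}).dropna()
print(toy.index.tolist())  # the label of the dropped row is gone

# Stand-in for the array returned by a transform (one row per remaining sample)
t_toy = np.array([[0., 1.], [0., 1.], [1., 0.]])

# Reuse the original row labels so the bins line up with the source rows
bins_toy = pd.DataFrame(t_toy, columns=['bin1', 'bin2'], index=toy.index)
combined = pd.concat([toy, bins_toy], axis=1)
print(combined)
```

Without `index=toy.index`, `pd.concat` would align on row labels and produce rows with missing values wherever the two indexes differ.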


---

__Hint__: You may have noticed that scikit-learn transformers return a NumPy array. If you want or need a DataFrame as output, you can add the following to your code (requires scikit-learn 1.2 or later):
```python
from sklearn import set_config
set_config(transform_output="pandas")
```


---
🌶️🌶️🌶️ __Bonus__: set the strategy parameter to 'uniform' and see how the edges change

In [17]:
# Define the transformer
kbins_uniform = KBinsDiscretizer(n_bins=5, strategy='uniform')
kbins_uniform

In [20]:
# Learn the bin edges on the column
kbins_uniform.fit(columns)

In [22]:
# Display the bin edges
print(kbins_uniform.bin_edges_)
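
With the 'uniform' strategy, the range between the minimum and maximum is split into bins of equal width, regardless of how many samples fall into each. A minimal sketch in plain Python, assuming the minimum (32.1) and maximum (59.6) of bill_length_mm from the quantile edges printed earlier; `uniform_bin_edges` is a hypothetical helper name, not part of scikit-learn.

```python
def uniform_bin_edges(low, high, n_bins):
    """Return n_bins + 1 equally spaced edges from low to high."""
    width = (high - low) / n_bins
    return [low + i * width for i in range(n_bins + 1)]

# Minimum and maximum of bill_length_mm, taken from the quantile edges printed earlier
edges_uniform = uniform_bin_edges(32.1, 59.6, n_bins=5)
print([round(e, 2) for e in edges_uniform])
```

Each bin is 5.5 mm wide here, whereas the quantile bins had different widths but roughly equal sample counts.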
