# Binning

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html

__Converts a numerical column (bill_length_mm) to a matrix of binary variables__

1. Execute the code 
   
2. Understand what is happening

3. Search the internet for the benefits of binning

4. Explain to the rest of the group what you did



### Import libraries

In [None]:
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

### Read data into a dataframe df

In [None]:
df = pd.read_csv('../data/penguins.csv')
df

### Check for missing values

In [None]:
df.isna().sum()

#### Drop all rows with missing values

In [None]:
df = df.dropna()
df.isna().sum()

### Discretize a numerical column

In [None]:
# Define the transformer
kbins = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='quantile')
kbins

In [None]:
# Numerical columns to be transformed
columns = df[['bill_length_mm']]
columns

#### "Fitting" the transformer kbins on the numerical columns
During the fitting step of KBinsDiscretizer, only the following happens:

1. __Calculation of Bin Edges__:
    + For each feature (e.g., bill_length_mm), __the discretizer calculates the edges of the bins based on the selected strategy:__
        + __Quantile__ (in this case):
            + The data is divided into n_bins (e.g., 5) such that each bin contains approximately the same number of samples.
            + Percentiles (e.g., 20%, 40%, 60%, 80%) are calculated to determine where to place the bin edges.
    + These edges are stored in the **bin_edges_** attribute of the KBinsDiscretizer object.
2. __No Transformation of Data__:
    + During fitting, the data itself is not transformed. The discretizer only learns the bin boundaries.
3. __Storage of Bin Information__:
    + The discretizer memorizes the bin edges for each feature. This information will later be used during the transform step to assign each value to a bin.
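
To make the quantile idea above concrete, here is a minimal sketch that uses NumPy only. It is not scikit-learn's internal code, and the data is made up (the `values` array is an assumption for illustration, not the penguins column). It computes the edges from evenly spaced percentiles and then counts how many samples land in each bin.

```python
import numpy as np

# Made-up numerical column (assumption: stands in for something like bill_length_mm)
rng = np.random.default_rng(0)
values = rng.normal(loc=44.0, scale=5.0, size=200)

n_bins = 5
# Evenly spaced percentiles: 0, 20, 40, 60, 80, 100
percentiles = np.linspace(0, 100, n_bins + 1)

# "Fitting" in the quantile sense: only the edges are computed, the data is untouched
edges = np.percentile(values, percentiles)

# Count the samples per bin to check that the bins are about equally populated
counts, _ = np.histogram(values, bins=edges)

print(edges)
print(counts)
```

The first and last edges are the minimum and maximum of the data; the inner edges are the percentiles in between.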

In [None]:
kbins.fit(columns)

In [None]:
print(kbins.bin_edges_)

In [None]:
# Each bin goes from one edge to the next
edges = kbins.bin_edges_[0]
for i in range(len(edges)-1):
    edge1 = edges[i]      # lower edge of the bin
    edge2 = edges[i+1]    # upper edge of the bin
    print(f'bin{i+1}: {edge1}_to_{edge2}')


Wait, with the quantile strategy _**kbins.fit(columns)**_ is essentially equivalent to:

In [None]:
print(columns.quantile(q=0.).values)  # this is equivalent to the min
print(columns.quantile(q=0.20).values)
print(columns.quantile(q=0.40).values)
print(columns.quantile(q=0.60).values)
print(columns.quantile(q=0.80).values)
print(columns.quantile(q=1.00).values)

#### Transforming the numerical column
Each value is assigned to one of the 5 bins, and the result is encoded as 5 one-hot columns, each representing membership in one bin (0 or 1).

In [None]:
t = kbins.transform(columns)
print(t.shape)
print(t)
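
The transform step can be pictured as two small operations: look up the bin index of each value, then turn that index into a one-hot row. The sketch below uses NumPy only, with hypothetical edges and values chosen for illustration (they are not the edges learned from the penguins data), and it is an approximation of the idea rather than the library's exact implementation.

```python
import numpy as np

# Hypothetical bin edges for 5 bins (assumption: not the fitted edges from above)
edges = np.array([30.0, 38.0, 42.0, 46.0, 50.0, 60.0])
# Hypothetical values to discretize
values = np.array([30.0, 37.9, 38.0, 44.5, 59.9, 60.0])

n_bins = len(edges) - 1

# Compare against the inner edges only, so the minimum lands in bin 0
# and the maximum lands in the last bin; a value on an edge goes to the bin on its right
bin_index = np.searchsorted(edges[1:-1], values, side="right")

# One-hot encode: row i has a 1 in column bin_index[i] and 0 elsewhere
one_hot = np.eye(n_bins)[bin_index]

print(bin_index)
print(one_hot)
```

Every row of `one_hot` contains exactly one 1, which is why the transformed matrix has one column per bin.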

---
#### 🌶️ Bonus: Create nice labels

In [None]:
edges = kbins.bin_edges_[0]
labels = []
for i in range(len(edges)-1):
    edge1 = edges[i]
    edge2 = edges[i+1]
    labels.append(f"{edge1}_to_{edge2}")

# create a DataFrame
df_bins = pd.DataFrame(t, columns=labels)
df_bins.head()

---

__Hint__: You may have noticed that the output of the transformations with sklearn Feature Engineering methods is a numpy array. In case you want/need a DataFrame as output you can add to your code:
```python
from sklearn import set_config
set_config(transform_output="pandas")
```


---
🌶️🌶️🌶️ __Bonus__: set the strategy parameter to 'uniform' and see how the edges change

In [None]:
# define the transformer
kbins_uniform = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy=...)
kbins_uniform

In [None]:
# learn the bin edges on the column
kbins_uniform.?

In [None]:
# Display the bin edges
print(kbins_uniform.?)