<div style="
  max-width:100%;
  overflow-x:hidden;
  background-color:#C19A6B;
  text-align:center;
  padding:20px;
  border-radius:10px;
  box-shadow:0 0 8px rgba(0,0,0,0.2);
">
  <h1 style="margin:0;">Binning</h1>
  <p style="text-align:center; color:green; font-size:18px; margin:10px 0 0 0;">
    Converts continuous data into discrete intervals (bins)
  </p>
</div>


![ytss](assets/Binning.png) <br><br>

## Equal Width Binning  
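The range of the data is split into intervals of the same width, for example 0-10, 10-20, 20-30, ..., 90-100, where every bin has a width of 10. The width is (max - min) / number of bins.
* Outliers are handled to some extent, because they end up in the outermost bins.
* The spread of the data does not change: the histogram keeps the shape of the original distribution.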
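
As a rough illustration of equal-width binning, here is a minimal NumPy sketch. The `ages` array is made-up sample data (not the Titanic column), and using `np.linspace` plus `np.digitize` is just one possible way to do it.

```python
import numpy as np

# Hypothetical sample: ages with one large value at the top end.
ages = np.array([0, 15, 22, 24, 31, 38, 45, 52, 67, 100], dtype=float)
n_bins = 10

# Equal-width edges: every bin spans (max - min) / n_bins.
edges = np.linspace(ages.min(), ages.max(), n_bins + 1)
width = edges[1] - edges[0]

# Bin index 0..n_bins-1 for each value, using only the interior edges
# so that the maximum lands in the last bin.
bin_idx = np.digitize(ages, edges[1:-1])
counts = np.bincount(bin_idx, minlength=n_bins)

print(width)    # 10.0
print(bin_idx)  # [0 1 2 2 3 3 4 5 6 9]
print(counts)   # [1 1 2 2 1 1 1 0 0 1]
```

The value 100 sits alone in the last bin and the bins before it stay empty, which is what "the histogram keeps its shape" means in practice.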

## Quantile Binning  
![ytss](assets/Quantile.png)

The number of intervals is again given as input, but here each interval covers an equal share of the percentiles. For example, with 10 intervals each bin spans 10 percentiles: the lowest 10% of the values fall in the first interval, the next 10% in the second, and so on. Whether that is 10, 20 or 50 values per bin depends on the size of the dataset.

* Handles outliers, since extreme values are absorbed into the first and last bins.
* Makes the binned data uniform: the histogram shows bars of nearly equal height, because every bin holds almost the same number of values.
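
A minimal sketch of quantile binning with NumPy, assuming a synthetic right-skewed sample (`fares` is generated here and only stands in for a column like Fare). The edges are simply the 0th, 10th, ..., 100th percentiles of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample, not the real Fare column.
fares = rng.exponential(scale=30.0, size=1000)
n_bins = 10

# Quantile edges: one edge every 10 percentiles.
edges = np.quantile(fares, np.linspace(0, 1, n_bins + 1))

# Assign bins with the interior edges and count the values per bin.
bin_idx = np.digitize(fares, edges[1:-1])
counts = np.bincount(bin_idx, minlength=n_bins)
widths = np.diff(edges)

print(counts)            # roughly 100 values in every bin
print(np.round(widths))  # bin widths grow towards the long tail
```

The counts are nearly identical while the widths differ a lot, which is the opposite trade-off to equal-width binning.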

## KMeans Binning  
Uses the k-means clustering algorithm to form bins.  <br><br>
![ytss](assets/KMeans.png)  
<br><br>
Here is what happens:  
1. Just like the other binning techniques, you first choose the number of intervals.
2. Then you pick that many random points within the range of the data as initial centroids (see the image).
3. Then you calculate each data point's distance from each centroid.
4. Each data point is assigned to its closest centroid, which forms a temporary bin.
5. After that, you take the mean of all the data points in every cluster/bin and move the centroid to that mean.
6. Steps 3 to 5 are repeated until no data point moves from one cluster to another. That happens only once the centroids stop moving, which they eventually will.
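
The steps above can be sketched for a single column in plain NumPy. This is an illustrative sketch, not how any particular library implements it: the function name `kmeans_bins_1d`, the `seed` and `n_iter` parameters, and the choice to draw the initial centroids from the data points themselves are all assumptions made for the example.

```python
import numpy as np

def kmeans_bins_1d(values, n_bins, n_iter=100, seed=0):
    """Bin a 1-D array with a simple k-means loop (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: pick n_bins distinct data points as the starting centroids.
    centroids = rng.choice(values, size=n_bins, replace=False)
    for _ in range(n_iter):
        # Steps 3-4: distance to every centroid, then assign the nearest one.
        labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Step 5: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            values[labels == k].mean() if np.any(labels == k) else centroids[k]
            for k in range(n_bins)
        ])
        # Step 6: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Sort the centroids so that bin 0 is the lowest bin, then relabel.
    centroids = np.sort(centroids)
    labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return labels, centroids

# Hypothetical ages forming three loose groups.
ages = np.array([1, 2, 3, 30, 31, 32, 70, 71, 72], dtype=float)
labels, centroids = kmeans_bins_1d(ages, n_bins=3)
print(labels)
print(centroids)
```

Because the starting centroids are random, different seeds can settle on different bins; k-means only guarantees that the centroids stop moving, not that the result is the best possible grouping.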

# Let's Code all of these

In [13]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import KBinsDiscretizer

In [6]:
df = pd.read_csv("assets/Titanic-Dataset.csv",usecols=["Age",'Fare','Survived'])

In [7]:
df.head()

|   | Survived | Age  | Fare    |
|---|----------|------|---------|
| 0 | 0        | 22.0 | 7.25    |
| 1 | 1        | 38.0 | 71.2833 |
| 2 | 1        | 26.0 | 7.925   |
| 3 | 1        | 35.0 | 53.1    |
| 4 | 0        | 35.0 | 8.05    |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Age       714 non-null    float64
 2   Fare      891 non-null    float64
dtypes: float64(2), int64(1)
memory usage: 21.0 KB


In [9]:
df.dropna(inplace=True)  # drop rows with missing Age; imputation is not the focus here

In [10]:
X = df.drop('Survived',axis=1)
Y= df['Survived']

In [11]:
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=42)

In [12]:
X_train.head()

|     | Age  | Fare    |
|-----|------|---------|
| 328 | 31.0 | 20.525  |
| 73  | 26.0 | 14.4542 |
| 253 | 30.0 | 16.1    |
| 719 | 33.0 | 7.775   |
| 666 | 25.0 | 13.0    |


In [60]:
# One discretizer per column; these parameters were varied over several runs.
kbin_age = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
kbin_fare = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')

In [61]:
trf = ColumnTransformer([
    ('first',kbin_age,[0]),
    ('second',kbin_fare,[1])
])

In [62]:
X_train_trf = trf.fit_transform(X_train)
X_test_trf = trf.transform(X_test)

In [63]:
trf.named_transformers_['first'].bin_edges_[0].tolist()

[0.42, 14.0, 19.0, 22.0, 25.0, 28.5, 32.0, 36.0, 42.0, 50.0, 80.0]

In [64]:
X_train_trf[:,0].shape

(571,)

In [65]:
output = pd.DataFrame({
    'age': X_train['Age'], 'age_trf': X_train_trf[:, 0],
    'Fare': X_train['Fare'], 'Fare_trf': X_train_trf[:, 1]})
output.head()

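|     | age  | age_trf | Fare    | Fare_trf |
|-----|------|---------|---------|----------|
| 328 | 31.0 | 5.0     | 20.525  | 5.0      |
| 73  | 26.0 | 4.0     | 14.4542 | 4.0      |
| 253 | 30.0 | 5.0     | 16.1    | 5.0      |
| 719 | 33.0 | 6.0     | 7.775   | 1.0      |
| 666 | 25.0 | 4.0     | 13.0    | 4.0      |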


In [66]:
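# Label each value with its interval, reusing the edges learned by the transformers.
# Note: pd.cut uses right-closed intervals (a, b] by default, while KBinsDiscretizer
# assigns bins as [a, b). Values exactly on an edge therefore get the neighbouring
# label, and the minimum value gets no label unless include_lowest=True is passed.
output['Age_labels'] = pd.cut(x=X_train['Age'],
                        bins=trf.named_transformers_['first'].bin_edges_[0].tolist())
output['Fare_labels'] = pd.cut(x=X_train['Fare'],
                              bins=trf.named_transformers_['second'].bin_edges_[0].tolist())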

In [67]:
output.Age_labels.value_counts(normalize=True)

Age_labels
(14.0, 19.0]    0.119298
(28.5, 32.0]    0.119298
(0.42, 14.0]    0.101754
(22.0, 25.0]    0.101754
(32.0, 36.0]    0.098246
(36.0, 42.0]    0.094737
(42.0, 50.0]    0.094737
(19.0, 22.0]    0.091228
(50.0, 80.0]    0.091228
(25.0, 28.5]    0.087719
Name: proportion, dtype: float64

In [68]:
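# Only the last expression in a cell is displayed, so the Fare proportions
# computed on the next line are not shown; only output.shape is.
output.Fare_labels.value_counts(normalize=True)
output.shape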

(571, 6)

The proportions are not exactly 10% for two reasons: the number of training records (571) is not divisible by 10, and identical values always go into the same bin, even when that pushes the bin past its 10% share.
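
To make the effect of repeated values concrete, here is a small sketch on made-up data: 20 hypothetical ages with the value 20 repeated five times, split into 4 quantile bins. Bins are assigned as left-closed intervals, which is an assumption chosen to mirror the `age_trf` column above, where 25.0 falls in the bin that starts at 25.0.

```python
import numpy as np

# Hypothetical ages: 18, 19, five copies of 20, then 21..33 (20 values in total).
ages = np.array([18, 19] + [20] * 5 + list(range(21, 34)), dtype=float)
n_bins = 4

# Quartile edges of the sample.
edges = np.quantile(ages, np.linspace(0, 1, n_bins + 1))

# Left-closed assignment: a value equal to an edge goes to the upper bin.
bin_idx = np.digitize(ages, edges[1:-1])
counts = np.bincount(bin_idx, minlength=n_bins)

print(edges)   # [18.   20.   23.5  28.25 33.  ]
print(counts)  # [2 8 5 5]
```

An even split would be 5 values per bin, but all five copies of 20 have to stay together, so the first bin ends up with 2 values and the second with 8.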

In [69]:
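# Train a decision tree on the binned features and predict on the binned test set.
# No random_state is set, so the exact accuracy can vary slightly between runs.
clf = DecisionTreeClassifier()
clf.fit(X_train_trf,y_train)
y_pred = clf.predict(X_test_trf)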

accuracy_score(y_test,y_pred)

0.6223776223776224
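
The cells above only use `strategy='quantile'`. As a hedged sketch of how the other two techniques could be tried with the same `KBinsDiscretizer` class, the snippet below fits all three strategies on synthetic data and prints the learned bin edges. The array `X` is generated for illustration and is not the Titanic data; depending on the scikit-learn version, the quantile strategy may emit a warning about its default quantile method.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature, shaped (n_samples, 1) as scikit-learn expects.
X = rng.exponential(scale=30.0, size=(500, 1))

edges = {}
for strategy in ("uniform", "quantile", "kmeans"):
    kbin = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
    kbin.fit(X)
    # bin_edges_ holds one array of edges per input column.
    edges[strategy] = kbin.bin_edges_[0]
    print(strategy, np.round(edges[strategy], 2))
```

Comparing the three sets of edges on the same column shows the difference directly: `uniform` edges are evenly spaced, `quantile` edges crowd together where the data is dense, and `kmeans` edges sit halfway between neighbouring cluster centres.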