<a href="https://colab.research.google.com/github/Sujeet2003/Feature-Engineering/blob/main/Binning_and_Binarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Encoding Numerical Features
Numerical features can be encoded into discrete values in `2 ways`:
  - Discretization / Binning
  - Binarization

Types of `Discretization`
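  - *Uniform / equal-width binning* : every bin spans the `same width` of the value range
  - *Quantile binning* : every bin holds roughly the `same number of data points`, with edges placed at quantiles of the data distribution
  - *k-means binning* : bin edges are derived from 1-D k-means clusters, which suits data that is naturally clustered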


USES: `sklearn.preprocessing.KBinsDiscretizer(n_bins=10, encode='ordinal' | 'onehot' | 'onehot-dense', strategy='uniform' | 'quantile' | 'kmeans')`
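To make the difference between the uniform and quantile strategies concrete, here is a minimal sketch on a toy, right-skewed feature. The values in `x` and the helper `bin_counts` are hypothetical and exist only for this illustration; the k-means strategy is left out to keep it short.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy, right-skewed feature (hypothetical values chosen for illustration).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float).reshape(-1, 1)

def bin_counts(strategy, n_bins=2):
    """Fit a discretizer and return how many samples land in each bin."""
    kb = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy=strategy)
    codes = kb.fit_transform(x).ravel().astype(int)
    return np.bincount(codes, minlength=n_bins).tolist()

uniform_counts = bin_counts('uniform')    # equal-width bins
quantile_counts = bin_counts('quantile')  # equal-frequency bins
print(uniform_counts, quantile_counts)
```

With equal-width bins the single outlier stretches the range, so almost every point falls into the first bin, while quantile bins split the points evenly.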

In [43]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.compose import ColumnTransformer

In [29]:
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/train.csv', usecols=['Age', 'Fare', 'Survived'])
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/test.csv')

In [30]:
train.shape

(891, 3)

In [31]:
train.describe()

| | Survived | Age | Fare |
|---|---|---|---|
| count | 891.0 | 714.0 | 891.0 |
| mean | 0.383838 | 29.699118 | 32.204208 |
| std | 0.486592 | 14.526497 | 49.693429 |
| min | 0.0 | 0.42 | 0.0 |
| 25% | 0.0 | 20.125 | 7.9104 |
| 50% | 0.0 | 28.0 | 14.4542 |
| 75% | 1.0 | 38.0 | 31.0 |
| max | 1.0 | 80.0 | 512.3292 |


In [32]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Age       714 non-null    float64
 2   Fare      891 non-null    float64
dtypes: float64(2), int64(1)
memory usage: 21.0 KB


In [33]:
train.dropna(inplace=True)

In [34]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 890
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  714 non-null    int64  
 1   Age       714 non-null    float64
 2   Fare      714 non-null    float64
dtypes: float64(2), int64(1)
memory usage: 22.3 KB


In [35]:
train.head()

| | Survived | Age | Fare |
|---|---|---|---|
| 0 | 0 | 22.0 | 7.25 |
| 1 | 1 | 38.0 | 71.2833 |
| 2 | 1 | 26.0 | 7.925 |
| 3 | 1 | 35.0 | 53.1 |
| 4 | 0 | 35.0 | 8.05 |


In [36]:
X = train.iloc[:, 1:]
y = train.iloc[:, 0]

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [38]:
dtc = DecisionTreeClassifier()

In [39]:
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)

In [40]:
accuracy_score(y_test, y_pred)

0.6363636363636364

In [41]:
np.mean(cross_val_score(DecisionTreeClassifier(), X, y, cv=10, scoring='accuracy'))

0.6289319248826291

In [44]:
kbins = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
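The `encode` argument controls the shape of the output. As a small sketch on a toy array (the values are hypothetical, not taken from the Titanic data), `'ordinal'` returns one column of bin indices, while `'onehot-dense'` returns one indicator column per bin:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy feature: 0.0, 1.0, ..., 9.0 (hypothetical values for illustration).
x = np.arange(10, dtype=float).reshape(-1, 1)

# One column holding the bin index of each sample.
ordinal = KBinsDiscretizer(
    n_bins=3, encode='ordinal', strategy='uniform'
).fit_transform(x)

# One 0/1 column per bin, returned as a dense array.
onehot = KBinsDiscretizer(
    n_bins=3, encode='onehot-dense', strategy='uniform'
).fit_transform(x)

print(ordinal.shape, onehot.shape)
```

Ordinal codes keep the order of the bins, which tree models can exploit directly; one-hot columns are usually the safer choice for linear models.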

In [47]:
trf = ColumnTransformer([
    ('age', kbins, [0]),
    ('fare', kbins, [1])
])

In [48]:
X_train_trf = trf.fit_transform(X_train)
X_test_trf = trf.transform(X_test)

In [51]:
trf.named_transformers_['fare'].n_bins_

array([5])

In [53]:
trf.named_transformers_['fare'].bin_edges_

array([array([  0.    ,   7.8958,  13.    ,  26.    ,  51.4792, 512.3292])],
      dtype=object)
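The ordinal code of a value is simply the index of the interval it falls into. The sketch below reproduces that lookup with plain NumPy, using the `Fare` edges printed above and a few fares from this dataset; it illustrates the idea and is not necessarily how the library implements it internally.

```python
import numpy as np

# Fare bin edges copied from the `bin_edges_` output above.
fare_edges = np.array([0.0, 7.8958, 13.0, 26.0, 51.4792, 512.3292])

# A few fares from the training data.
fares = np.array([20.525, 7.775, 13.0, 61.175, 134.5])

# Only the inner edges are needed: a value is assigned to bin i when it is
# at or above i of the inner edges (side='right' sends 13.0 to the upper bin).
codes = np.searchsorted(fare_edges[1:-1], fares, side='right')
print(codes.tolist())
```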

In [54]:
X_train_trf

array([[2., 2.],
       [2., 2.],
       [2., 2.],
       ...,
       [3., 4.],
       [3., 2.],
       [3., 1.]])

In [57]:
final = pd.DataFrame({
    'age': X_train['Age'],
    'trf_age': X_train_trf[:, 0],
    'fare': X_train['Fare'],
    'trf_fare': X_train_trf[:, 1]
})

In [58]:
final

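| | age | trf_age | fare | trf_fare |
|---|---|---|---|---|
| 328 | 31.0 | 2.0 | 20.5250 | 2.0 |
| 73 | 26.0 | 2.0 | 14.4542 | 2.0 |
| 253 | 30.0 | 2.0 | 16.1000 | 2.0 |
| 719 | 33.0 | 3.0 | 7.7750 | 0.0 |
| 666 | 25.0 | 2.0 | 13.0000 | 2.0 |
| ... | ... | ... | ... | ... |
| 92 | 46.0 | 4.0 | 61.1750 | 4.0 |
| 134 | 25.0 | 2.0 | 13.0000 | 2.0 |
| 337 | 41.0 | 3.0 | 134.5000 | 4.0 |
| 548 | 33.0 | 3.0 | 20.5250 | 2.0 |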


In [59]:
dtc = DecisionTreeClassifier()
dtc.fit(X_train_trf, y_train)

In [60]:
y_pred = dtc.predict(X_test_trf)
accuracy_score(y_test, y_pred)

0.6433566433566433

In [61]:
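# Note: X still holds the raw (un-binned) Age and Fare, so this score does not measure the effect of binning
np.mean(cross_val_score(DecisionTreeClassifier(), X, y, cv=10, scoring='accuracy'))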

0.6372848200312988
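The cross-validation call above scores `X`, which still holds the raw columns. One way to cross-validate the binned features is to put the discretizer and the model in a single pipeline, so the bin edges are re-fitted on the training part of every fold. The sketch below assumes synthetic stand-in data (`X_demo`, `y_demo` are hypothetical, since the Titanic CSV is not bundled here), so its score says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Age/Fare columns (hypothetical data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'Age': rng.uniform(1, 80, size=200),
    'Fare': rng.exponential(30, size=200),
})
y_demo = (X_demo['Fare'] > 30).astype(int)

# Same binning setup as in the notebook, but selecting columns by name.
binner = ColumnTransformer([
    ('age', KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile'), ['Age']),
    ('fare', KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile'), ['Fare']),
])

# The pipeline fits the binner inside each fold, then the tree on binned data.
model = make_pipeline(binner, DecisionTreeClassifier(random_state=42))
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.mean())
```

Fitting the transformer inside the pipeline also avoids leaking bin edges computed on the validation fold into training.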

## Binarization
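Binarization converts a numerical feature into `0 or 1` by comparing it with a threshold: values above the threshold become 1, all others become 0. Below, `family` (siblings/spouses plus parents/children aboard) is binarized to flag whether a passenger travelled with any family member.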
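As a minimal sketch of the idea, `Binarizer` with a threshold of `0.0` maps every positive value to 1 and everything else to 0. The `family` values below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical family sizes for five passengers, as a single-column 2-D array.
family = np.array([[0.0], [1.0], [3.0], [0.0], [2.0]])

# Values strictly greater than the threshold become 1, the rest become 0.
has_family = Binarizer(threshold=0.0).fit_transform(family)
flags = has_family.ravel().astype(int).tolist()
print(flags)
```

Raising `threshold` moves the cut-off, for example `threshold=1.0` would flag only passengers with two or more relatives aboard.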

In [63]:
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/train.csv', usecols=['Age', 'Fare', 'SibSp', 'Parch', 'Survived'])

In [66]:
train['family'] = train['SibSp'] + train['Parch']

In [70]:
train.drop(columns=['SibSp', 'Parch'], inplace=True)

In [71]:
train.head()

| | Survived | Age | Fare | family |
|---|---|---|---|---|
| 0 | 0 | 22.0 | 7.25 | 1 |
| 1 | 1 | 38.0 | 71.2833 | 1 |
| 2 | 1 | 26.0 | 7.925 | 0 |
| 3 | 1 | 35.0 | 53.1 | 1 |
| 4 | 0 | 35.0 | 8.05 | 0 |


In [75]:
train.dropna(inplace=True)

In [76]:
X = train.drop(columns=['Survived'])
y = train['Survived']

In [77]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [79]:
# Without Binarization
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)

In [80]:
accuracy_score(y_test, y_pred)

0.6363636363636364

In [81]:
np.mean(cross_val_score(DecisionTreeClassifier(), X, y, cv=10, scoring='accuracy'))

0.6485133020344287

In [82]:
from sklearn.preprocessing import Binarizer

In [83]:
trf = ColumnTransformer([
    ('binarizing', Binarizer(copy=False), ['family'])
], remainder='passthrough')

In [84]:
X_train_trf = trf.fit_transform(X_train)
X_test_trf = trf.transform(X_test)

In [87]:
final = pd.DataFrame(X_train_trf, columns=['family', 'Age', 'Fare'])

In [89]:
final.head()

| | family | Age | Fare |
|---|---|---|---|
| 0 | 1.0 | 31.0 | 20.525 |
| 1 | 1.0 | 26.0 | 14.4542 |
| 2 | 1.0 | 30.0 | 16.1 |
| 3 | 0.0 | 33.0 | 7.775 |
| 4 | 0.0 | 25.0 | 13.0 |
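The column order `['family', 'Age', 'Fare']` follows from how `ColumnTransformer` assembles its output: the explicitly transformed columns come first, and the `remainder='passthrough'` columns are appended after them in their original order. A small sketch with made-up rows (`X_demo` and `trf_demo` are hypothetical names) shows this even when `family` is the last input column:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Binarizer

# Hypothetical rows; note that `family` is the LAST input column.
X_demo = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Fare': [7.25, 71.28, 7.93],
    'family': [2, 1, 0],
})

trf_demo = ColumnTransformer(
    [('binarizing', Binarizer(), ['family'])],
    remainder='passthrough',
)

# Output layout: binarized `family` first, then the untouched Age and Fare.
out = trf_demo.fit_transform(X_demo)
out_df = pd.DataFrame(out, columns=['family', 'Age', 'Fare'])
print(out_df)
```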


In [90]:
dtc = DecisionTreeClassifier()
dtc.fit(X_train_trf, y_train)
y_pred = dtc.predict(X_test_trf)

In [91]:
accuracy_score(y_test, y_pred)

0.6083916083916084

In [92]:
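# Note: X still holds the original (un-binarized) family column, so this score does not measure the effect of binarization
np.mean(cross_val_score(DecisionTreeClassifier(), X, y, cv=10, scoring='accuracy'))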

0.6443466353677622