<h5>Binning and Binarization:</h5> These are preprocessing techniques that transform numerical columns into categorical ones. Most of the time the raw numerical data yields better results, but in some cases the reverse is true. For example, suppose you are working with an <h6>Appstore dataset</h6> that has a column called No. of downloads. The problem with such a column is that the most popular apps will have millions or even billions of downloads, while new or unpopular apps will have very few: 10, 20, 100 and so on.

In such cases, the values in the column span several orders of magnitude, so binning can be useful.

<h5 style='color: green'>Binning creates bins (intervals) such as 1-100, 100-1000, 1000-10K, etc., and replaces each value with the bin it falls into. Counting the values in each bin then gives the frequency of each range in the dataset.</h5>
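As a minimal sketch of the idea, hand-picked intervals can be applied with `pd.cut`. The download counts, bin edges and labels below are made up for illustration and are not taken from any real Appstore dataset.

```python
import pandas as pd

# Hypothetical download counts (illustrative values only)
downloads = pd.Series([12, 85, 640, 3_200, 9_999, 45_000, 2_500_000])

# Hand-picked, right-closed intervals:
# (0, 100], (100, 1000], (1000, 10000], (10000, inf]
edges = [0, 100, 1_000, 10_000, float("inf")]
labels = ["1-100", "100-1K", "1K-10K", "10K+"]

# Replace every value with the label of the bin it falls into
binned = pd.cut(downloads, bins=edges, labels=labels)

# Frequency of each bin, kept in bin order rather than sorted by count
counts = binned.value_counts(sort=False)

print(binned.tolist())
print(counts)
```

After binning, an app with 45,000 downloads and one with 2.5 million land in the same category, which is exactly the loss of detail that binning trades for a more compact range.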

In [101]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import accuracy_score

from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier

In [102]:
df = pd.read_csv('titanic.csv', usecols=['Age', 'Fare', 'Survived'])
df.head()

,Survived,Age,Fare
0,0,22.0,7.25
1,1,38.0,71.2833
2,1,26.0,7.925
3,1,35.0,53.1
4,0,35.0,8.05


In [103]:
df.shape

(891, 3)

In [104]:
df.dropna(inplace=True)

In [105]:
df.shape

(714, 3)

In [106]:
# Target is Survived; features are Age and Fare
y = df['Survived']
X = df[['Age', 'Fare']]

In [107]:
X.shape

(714, 2)

In [108]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [109]:
X_train.shape

(571, 2)

In [110]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
pred = model.predict(X_test)
score = accuracy_score(y_test, pred)
print("Accuracy score:", score)

Accuracy score: 0.6293706293706294


In [111]:
clf = DecisionTreeClassifier()
np.mean(cross_val_score(clf, X, y, scoring='accuracy', cv=10))

0.6274843505477308

In [112]:
age_discretizer = KBinsDiscretizer(n_bins=15, strategy='quantile', encode='ordinal')
fare_discretizer = KBinsDiscretizer(n_bins=15, strategy='quantile', encode='ordinal')

In [113]:
# Columns are selected by position in X: 0 is Age, 1 is Fare
trnsf = ColumnTransformer(transformers=[
    ('first', age_discretizer, [0]),
    ('second', fare_discretizer, [1])
])

In [114]:
X_train_trnsf = trnsf.fit_transform(X_train)
X_test_trnsf = trnsf.transform(X_test)

In [115]:
trnsf.named_transformers_['first'].n_bins_

array([15])
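Besides `n_bins_`, a fitted `KBinsDiscretizer` exposes the learned interval boundaries through `bin_edges_`. The sketch below uses synthetic right-skewed values as a stand-in for a column like Fare (an assumption, not the Titanic data) and only 5 bins to keep the output short. It also contrasts `strategy='quantile'` with `strategy='uniform'` to show how the choice affects the number of rows per bin.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic, right-skewed values standing in for a Fare-like column
rng = np.random.default_rng(0)
values = rng.exponential(scale=30.0, size=500).reshape(-1, 1)

quantile_binner = KBinsDiscretizer(n_bins=5, strategy="quantile", encode="ordinal")
uniform_binner = KBinsDiscretizer(n_bins=5, strategy="uniform", encode="ordinal")

quantile_codes = quantile_binner.fit_transform(values).ravel()
uniform_codes = uniform_binner.fit_transform(values).ravel()

# bin_edges_ holds one array of edges per input column
print("quantile edges:", quantile_binner.bin_edges_[0])
print("uniform edges: ", uniform_binner.bin_edges_[0])

# Number of rows assigned to each bin under each strategy
quantile_counts = np.bincount(quantile_codes.astype(int), minlength=5)
uniform_counts = np.bincount(uniform_codes.astype(int), minlength=5)
print("quantile counts:", quantile_counts)
print("uniform counts: ", uniform_counts)
```

Quantile bins hold roughly equal numbers of rows, while equal-width bins on skewed data put most rows in the first bin and very few in the last.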

In [116]:
# Put the original and binned values side by side for inspection
output = {
    'Age': X_train['Age'],
    'Fare': X_train['Fare'],
    'Age_transf': X_train_trnsf[:, 0],
    'Fare_transf': X_train_trnsf[:, 1]
}

result = pd.DataFrame(output)
result

,Age,Fare,Age_transf,Fare_transf
328,31.0,20.5250,8.0,8.0
73,26.0,14.4542,6.0,7.0
253,30.0,16.1000,8.0,7.0
719,33.0,7.7750,9.0,2.0
666,25.0,13.0000,6.0,6.0
...,...,...,...,...
92,46.0,61.1750,12.0,12.0
134,25.0,13.0000,6.0,6.0
337,41.0,134.5000,11.0,14.0
548,33.0,20.5250,9.0,8.0


In [117]:
# using transformed columns to make predictions
clf = DecisionTreeClassifier()
clf.fit(X_train_trnsf, y_train)
preds = clf.predict(X_test_trnsf)
score = accuracy_score(y_test, preds)
print("Accuracy score:", score)

Accuracy score: 0.6363636363636364


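Accuracy improved slightly this time, from about 0.629 to about 0.636, which amounts to one additional correct prediction on the 143-row test set. Since the decision tree is created without a fixed random_state and only a single split is used, a difference this small should not be read as firm evidence that binning helps.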
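The binned accuracy above comes from a single train/test split, whereas the raw features were also cross-validated. One way to compare the two on equal footing is to wrap the discretizer and the tree in a pipeline, so that the bins are refit inside every fold, and cross-validate both variants. The sketch below runs on synthetic data (the `X_demo` and `y_demo` arrays are assumptions, not the Titanic columns), so its scores say nothing about the Titanic result; with the real data, `X` and `y` would take their place.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Age and Fare columns (not the Titanic data)
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.uniform(1, 80, size=400),      # age-like feature
    rng.exponential(30.0, size=400),   # fare-like, right-skewed feature
])
# Noisy label loosely tied to the fare-like feature
y_demo = (X_demo[:, 1] + rng.normal(0, 20, size=400) > 30).astype(int)

# Discretizer + tree in one pipeline: bins are learned on each training fold only
binned_tree = make_pipeline(
    KBinsDiscretizer(n_bins=15, strategy="quantile", encode="ordinal"),
    DecisionTreeClassifier(random_state=0),
)
raw_tree = DecisionTreeClassifier(random_state=0)

raw_scores = cross_val_score(raw_tree, X_demo, y_demo, scoring="accuracy", cv=10)
binned_scores = cross_val_score(binned_tree, X_demo, y_demo, scoring="accuracy", cv=10)

print("raw mean accuracy:   ", raw_scores.mean())
print("binned mean accuracy:", binned_scores.mean())
```

Fixing `random_state` on the tree makes the two runs repeatable, and looking at the spread of the ten fold scores gives a sense of how large a difference would have to be before it matters.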