**Objective**: Convert numerical data to categorical data. Numerical values can be turned into categories by grouping them into bins.
![image.png](attachment:image.png)

There are two main methods:

![image.png](attachment:image.png)

# A. Discretization

![image.png](attachment:image.png)

## Types of Binning

![image.png](attachment:image.png)

## I. Unsupervised


### 1. Equal Width / Uniform Binning

**Benefits**
- Outliers are handled: extreme values are absorbed into the outermost bins.
- The spread (shape of the distribution) of the data does not change.

![image-2.png](attachment:image-2.png)
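
As a rough illustration of equal-width binning, the sketch below splits the range between the minimum and maximum into bins of identical width. The `ages` values and `n_bins = 5` are made up for the example and are not taken from the Titanic data.

```python
import numpy as np

# Hypothetical ages (assumption), including one extreme value (95)
ages = np.array([2, 15, 22, 24, 28, 36, 39, 55, 95], dtype=float)
n_bins = 5

# Equal-width edges: split [min, max] into n_bins intervals of the same width
edges = np.linspace(ages.min(), ages.max(), n_bins + 1)

# Map each value to a bin index in 0..n_bins-1 using only the interior edges,
# so the minimum falls in bin 0 and the maximum falls in the last bin
bin_idx = np.digitize(ages, edges[1:-1], right=False)

print("edges:", edges)
print("bin index per value:", bin_idx)
```

Every bin has the same width, so the number of points per bin can differ a lot from bin to bin.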

### 2. Equal Frequency / Quantile Binning

**Benefits**
- Handles outliers.
- Makes the spread of values uniform, because each bin holds roughly the same number of observations.

![image.png](attachment:image.png)
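
A minimal sketch of quantile binning, assuming a small made-up `fares` array: the bin edges are placed at quantiles of the data, so each bin receives roughly the same number of values even when the data are heavily skewed.

```python
import numpy as np

# Hypothetical, right-skewed fares (assumption, not the real Titanic column)
fares = np.array([0.0, 7.75, 7.9, 8.03, 9.5, 14.45,
                  24.15, 26.0, 30.5, 31.39, 80.0, 512.0])
n_bins = 4

# Edges sit at the 0%, 25%, 50%, 75% and 100% quantiles
edges = np.quantile(fares, np.linspace(0, 1, n_bins + 1))

# Assign bin indices using the interior edges only
bin_idx = np.digitize(fares, edges[1:-1], right=False)

# Count how many values landed in each bin
counts = np.bincount(bin_idx, minlength=n_bins)

print("edges:", edges)
print("values per bin:", counts)
```

Unlike equal-width binning, the bin widths here vary while the counts stay balanced.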

### 3. KMeans Binning

**KMeans is a clustering algorithm.**
This strategy is useful when the data form clusters, as shown in the figure below.

Each bin is represented by a centroid.
We compute the distance from every point to every centroid.
Each point is assigned to the cluster of its nearest centroid.

Each centroid is then recalculated as the mean of the points assigned to it, and the points are reassigned. These two steps repeat until the assignments stop changing.

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)
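
The assign-then-recompute loop described above can be sketched for one-dimensional data as follows. This is an illustrative implementation, not scikit-learn's code: the function name `kmeans_1d`, the evenly spaced starting centroids, and the sample values are all assumptions made for the example.

```python
import numpy as np

def kmeans_1d(values, n_clusters, n_iter=100):
    """Plain k-means on 1-D data (illustrative sketch only)."""
    values = np.asarray(values, dtype=float)
    # Assumption: start the centroids evenly spaced over the data range
    centroids = np.linspace(values.min(), values.max(), n_clusters)
    for _ in range(n_iter):
        # Step 1: assign each point to its nearest centroid
        distances = np.abs(values[:, None] - centroids[None, :])
        labels = np.argmin(distances, axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for k in range(n_clusters):
            if np.any(labels == k):
                new_centroids[k] = values[labels == k].mean()
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data with three visible groups
values = [1, 2, 3, 20, 21, 22, 50, 51, 52]
centroids, labels = kmeans_1d(values, n_clusters=3)

# One way to turn centroids into bin boundaries: midpoints between neighbors
edges = (centroids[:-1] + centroids[1:]) / 2

print("centroids:", centroids)
print("labels:", labels)
print("interior bin edges:", edges)
```

In practice, `KBinsDiscretizer(strategy='kmeans')` handles this for each feature, so a hand-written loop like this is only useful for understanding the idea.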

# How to implement it.

![image.png](attachment:image.png)

# Titanic Dataset

In [2]:
# Importing library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.compose import ColumnTransformer

In [20]:
# Reading CSV file
df = pd.read_csv("train.csv", usecols = ['Age', 'Fare', 'Survived'])

In [24]:
df.isnull().sum()

Survived    0
Age         0
Fare        0
dtype: int64

In [22]:
df.dropna(inplace = True)

In [25]:
df.shape

(714, 3)

In [26]:
df.sample(5)

|     | Survived | Age  | Fare    |
|-----|----------|------|---------|
| 233 | 1        | 5.0  | 31.3875 |
| 492 | 0        | 55.0 | 30.5    |
| 770 | 0        | 24.0 | 9.5     |
| 179 | 0        | 36.0 | 0.0     |
| 22  | 1        | 15.0 | 8.0292  |


In [33]:
# Features: Age and Fare; target: Survived
X = df.drop('Survived', axis = 1)
Y = df['Survived']

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                   test_size = 0.2,
                                                   random_state = 1)

In [35]:
y_train.head(2)

830    1
565    0
Name: Survived, dtype: int64

In [36]:
clf = DecisionTreeClassifier()

In [38]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [39]:
accuracy_score(y_test, y_pred)

0.6573426573426573

In [40]:
np.mean(cross_val_score(DecisionTreeClassifier(),
                        X, Y, cv = 10 , scoring = 'accuracy' ))

0.6316705790297339

### Quantiles Strategy

In [41]:
kbin_age = KBinsDiscretizer(n_bins = 10, encode = 'ordinal', strategy = 'quantile')
kbin_fare = KBinsDiscretizer(n_bins = 10, encode = 'ordinal', strategy = 'quantile')

In [60]:
trf = ColumnTransformer([
    ('first', kbin_age, [0]),
    ('Second', kbin_fare, [1])
])

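Useful attributes of a fitted `KBinsDiscretizer`:
- `bin_edges_`: the edges of each bin, per feature
- `n_bins_`: the number of bins per feature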

In [61]:
X_train_trf = trf.fit_transform(X_train)
X_test_trf = trf.transform(X_test)

In [59]:
X_train_trf

array([[1., 4.],
       [3., 5.],
       [7., 6.],
       ...,
       [4., 5.],
       [2., 9.],
       [2., 1.]])

In [64]:
X_train['Age']

830    15.0
565    24.0
148    36.5
105    28.0
289    22.0
       ... 
179    36.0
808    39.0
93     26.0
291    19.0
51     21.0
Name: Age, Length: 571, dtype: float64

In [65]:
output = pd.DataFrame({
    'age' : X_train['Age'],
    'age_trf': X_train_trf[:,0],
    'fare': X_train['Fare'],
    'fare_trf': X_train_trf[:, 1],
    
})

In [66]:
output.head()

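|     | age  | age_trf | fare    | fare_trf |
|-----|------|---------|---------|----------|
| 830 | 15.0 | 1.0     | 14.4542 | 4.0      |
| 565 | 24.0 | 3.0     | 24.15   | 5.0      |
| 148 | 36.5 | 7.0     | 26.0    | 6.0      |
| 105 | 28.0 | 5.0     | 7.8958  | 2.0      |
| 289 | 22.0 | 3.0     | 7.75    | 1.0      |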


## II. Custom/Domain Based Binning

In this approach, we choose the bin intervals ourselves based on domain knowledge. scikit-learn's `KBinsDiscretizer` does not accept user-defined bin edges, so the logic has to be written with pandas.

![image.png](attachment:image.png)
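
One possible way to write this logic with pandas is `pd.cut` with explicit edges. The age groups, edges, and labels below are hypothetical choices made for illustration; real cut points should come from the domain.

```python
import pandas as pd

# Hypothetical ages (assumption)
ages = pd.Series([5.0, 15.0, 24.0, 36.5, 55.0, 71.0])

# Hypothetical domain-driven edges and labels (assumption)
edges = [0, 12, 19, 59, 120]
labels = ['child', 'teen', 'adult', 'senior']

# right=True makes each interval closed on the right: (0, 12], (12, 19], ...
age_group = pd.cut(ages, bins=edges, labels=labels, right=True)

print(pd.DataFrame({'age': ages, 'age_group': age_group}))
```

Values that fall outside the outermost edges become `NaN`, so the first and last edges should cover the full expected range.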

# B. Binarization
Binarization converts numerical data into a binary class (0 or 1).
For example, given salary data, we may want to flag whether a salary is taxable.
We set a threshold of 6 lakh: any salary above 6 lakh is mapped to 1, and any salary at or below it is mapped to 0.
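
The salary example can be sketched with a simple comparison. The salary figures below are made up, and the threshold of 600,000 corresponds to 6 lakh.

```python
import numpy as np

# Hypothetical salaries (assumption); 6 lakh = 600,000
salaries = np.array([250_000, 600_000, 600_001, 1_200_000])
threshold = 600_000

# 1 if the salary is strictly above the threshold, otherwise 0
taxable = (salaries > threshold).astype(int)

print(taxable)
```

scikit-learn's `Binarizer(threshold=...)` applies the same rule: values strictly greater than the threshold become 1 and all others become 0.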

In [67]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer

In [68]:
df = pd.read_csv("train.csv")[['Age', 'Fare', 'SibSp', 'Parch', 'Survived']]

In [69]:
df.isnull().sum()

Age         177
Fare          0
SibSp         0
Parch         0
Survived      0
dtype: int64

In [70]:
df.dropna(inplace = True)

In [71]:
df.isnull().sum()

Age         0
Fare        0
SibSp       0
Parch       0
Survived    0
dtype: int64

In [72]:
df.sample(5)

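|     | Age  | Fare  | SibSp | Parch | Survived |
|-----|------|-------|-------|-------|----------|
| 873 | 47.0 | 9.0   | 0     | 0     | 0        |
| 706 | 45.0 | 13.5  | 0     | 0     | 1        |
| 442 | 25.0 | 7.775 | 1     | 0     | 0        |
| 834 | 18.0 | 8.3   | 0     | 0     | 0        |
| 234 | 24.0 | 10.5  | 0     | 0     | 0        |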
