## Numerical Encoding

1. Discretization / Binning
2. Binarization

---

### 1. Discretization
Discretization transforms a continuous variable into a discrete one by splitting its range into a set of contiguous intervals.  
It is also called **binning**; a *bin* is another name for an interval.

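#### Why use Discretization:
1. To handle outliers: extreme values fall into the first or last bin, so their magnitude no longer dominates.  
2. To improve the value spread: skewed values can be distributed more evenly across bins, for example with quantile binning.  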

---

#### Types of Binning:
1. **Unsupervised Binning**
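   - **Equal Width (Uniform):**  
     Specify the number of bins and compute the width of each bin:  
     $$
     \text{Width} = \dfrac{\text{max} - \text{min}}{\text{bins}}
     $$
     The intervals start at the minimum and each spans one width; every value is then assigned to the interval it falls in.  
   - **Equal Frequency (Quantile):** bin edges are placed at quantiles, so each bin holds roughly the same number of observations.  
   - **K-means Binning:** values are grouped into bins based on one-dimensional k-means clusters.  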

2. **Supervised Binning**
   - **Decision Tree Binning**  

3. **Custom Binning**
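
The difference between equal-width and equal-frequency binning can be sketched with plain NumPy. This is a minimal illustration on made-up values (the array `values` and the choice of 3 bins are assumptions, not part of the Titanic example); it applies the width formula above and uses `np.quantile` for the equal-frequency edges.

```python
import numpy as np

# Hypothetical data: eight small values and one outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)
n_bins = 3

# Equal width: width = (max - min) / bins, edges spaced evenly from min to max.
width = (values.max() - values.min()) / n_bins
width_edges = values.min() + width * np.arange(n_bins + 1)

# Equal frequency: edges placed at evenly spaced quantiles.
freq_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))

# Assign each value to a bin using the interior edges only,
# so the minimum lands in bin 0 and the maximum in the last bin.
width_bins = np.digitize(values, width_edges[1:-1])
freq_bins = np.digitize(values, freq_edges[1:-1])

# Number of values per bin for each strategy.
width_counts = np.bincount(width_bins, minlength=n_bins)
freq_counts = np.bincount(freq_bins, minlength=n_bins)

print("equal-width edges:", width_edges, "counts:", width_counts)
print("equal-frequency edges:", freq_edges, "counts:", freq_counts)
```

With the outlier present, equal-width binning puts almost every value in the first bin, while equal-frequency binning keeps the bins balanced.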

---

#### Encoding Discretized Values:
Use the `sklearn` class **KBinsDiscretizer**:  
- Specify the **number of bins** (`n_bins`)  
- Choose a **strategy** (`uniform`, `quantile`, `kmeans`)  
- Select an **encoding** (`ordinal`, `onehot`, `onehot-dense`)  
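
As a small sketch of these options, the snippet below bins a toy single-column array (the `ages` values are hypothetical) with the `uniform` strategy and compares the `ordinal` and `onehot-dense` encodings. The `uniform` strategy is used here only to keep the output easy to follow; `quantile` and `kmeans` are passed the same way.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy single-feature data (hypothetical values), shape (n_samples, 1).
ages = np.array([[2.0], [11.0], [19.0], [25.0], [33.0], [47.0], [58.0], [80.0]])

# Equal-width bins, encoded as integer bin indices.
ordinal = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
age_bins = ordinal.fit_transform(ages)

# Same bins, encoded as one indicator column per bin (dense array).
onehot = KBinsDiscretizer(n_bins=4, encode="onehot-dense", strategy="uniform")
age_onehot = onehot.fit_transform(ages)

print(ordinal.bin_edges_[0])  # learned bin edges for the single feature
print(age_bins.ravel())       # bin index per sample
print(age_onehot.shape)       # (n_samples, n_bins)
```

Ordinal encoding keeps a single column and preserves bin order, which suits tree models; one-hot encoding adds a column per bin.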


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

from sklearn.preprocessing import KBinsDiscretizer
from sklearn.compose import ColumnTransformer

from sklearn.metrics import accuracy_score

In [None]:
df = sns.load_dataset("titanic")
df = df[['age', 'fare', 'survived']]


In [None]:
df.head()

In [None]:
df.dropna(inplace = True)

In [None]:
df.shape

In [None]:
df.sample(5)

In [None]:
x = df.iloc[:,:2]
y = df.iloc[:,-1]


In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=42)

In [None]:
x_train.head()

In [None]:
clf = DecisionTreeClassifier()


In [None]:
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
np.mean(cross_val_score(clf, x,y,cv=10, scoring='accuracy'))

In [None]:
kbin_age = KBinsDiscretizer(n_bins = 10, encode= 'ordinal', strategy = 'quantile', quantile_method='linear')
kbin_fare= KBinsDiscretizer(n_bins=10, encode='ordinal', strategy= 'quantile', quantile_method='linear')   

In [None]:
trf = ColumnTransformer([
    ('kbin_age', kbin_age, [0]),
    ('kbin_fare', kbin_fare, [1])
])

In [None]:
x_train_trf = trf.fit_transform(x_train)
x_test_trf = trf.transform(x_test)

In [None]:
trf.named_transformers_['kbin_fare'].n_bins_

In [None]:
trf.named_transformers_['kbin_age'].bin_edges_

In [None]:
output = pd.DataFrame({
    'age':x_train['age'],
    'age_trf': x_train_trf[:,0],
    'fare': x_train['fare'],
    'fare_trf':x_train_trf[:,1]
})

In [None]:
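# include_lowest=True keeps the minimum value inside the first interval instead of NaN.
# pd.cut intervals are right-closed by default, while KBinsDiscretizer bins are left-closed,
# so a value lying exactly on an edge can show the neighbouring label.
output['age_labels'] = pd.cut(x=x_train['age'], bins=trf.named_transformers_['kbin_age'].bin_edges_[0].tolist(), include_lowest=True)
output['fare_labels'] = pd.cut(x=x_train['fare'], bins=trf.named_transformers_['kbin_fare'].bin_edges_[0].tolist(), include_lowest=True)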

In [None]:
output.sample(5)

In [None]:
clf = DecisionTreeClassifier()
clf.fit(x_train_trf, y_train)
y_pred2 = clf.predict(x_test_trf)

In [None]:
accuracy_score(y_test, y_pred2)

In [None]:
x_trf = trf.fit_transform(x)
# Score the discretized features (x_trf), not the raw x
np.mean(cross_val_score(clf, x_trf, y, cv=10, scoring='accuracy'))
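
Fitting the transformer on all rows before cross-validation means the bin edges are learned from rows that later act as validation data. One possible alternative, sketched here on synthetic data rather than the Titanic columns (`X_demo`, `y_demo`, and the bin settings are assumptions for illustration), is to wrap the transformer and the classifier in a `Pipeline` so that each fold fits its own bins.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data (hypothetical), label depends on the first feature.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 80, size=(200, 2))
y_demo = (X_demo[:, 0] > 40).astype(int)

# One discretizer per column, as in the notebook.
binning = ColumnTransformer([
    ("bin_0", KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform"), [0]),
    ("bin_1", KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform"), [1]),
])

# The pipeline refits the binning on the training part of every fold.
model = make_pipeline(binning, DecisionTreeClassifier(random_state=0))
scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring="accuracy")
print(scores.mean())
```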

In [None]:
def discretize(bins, strategy):
    kbin_age = KBinsDiscretizer(n_bins = bins, encode= 'ordinal', strategy = strategy, quantile_method='linear')
    kbin_fare= KBinsDiscretizer(n_bins=bins, encode='ordinal', strategy= strategy, quantile_method='linear')

    trf = ColumnTransformer([
        ('first', kbin_age,[0]),
        ('second', kbin_fare,[1])
    ])

    x_trf = trf.fit_transform(x)
    print(np.mean(cross_val_score(clf,x_trf, y, cv=10, scoring='accuracy')))

    plt.figure(figsize = (14,4))
    plt.subplot(121)
    plt.hist(x['age'])
    plt.title('Before')

    plt.subplot(122)
    plt.hist(x_trf[:,0], color = 'red')
    plt.title('After')
    plt.show()

In [None]:
discretize(10, 'quantile')

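### 2. Binarization
Binarization converts a numerical variable into a binary one by comparing each value with a threshold. For example, for salary:
- less than 6 lakh: value is 0
- greater than 6 lakh: value is 1

It is also used in image processing, for example to turn a grayscale image into black and white.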
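
A minimal sketch of the salary example with scikit-learn's `Binarizer` follows. The salary figures are made up, and the threshold of 6.0 (in lakh) is taken from the example above; note that `Binarizer` maps values equal to the threshold to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical salaries in lakh, shape (n_samples, 1).
salaries = np.array([[3.5], [6.0], [7.2], [12.0], [5.9]])

# Values strictly greater than the threshold become 1, all others 0.
binarizer = Binarizer(threshold=6.0)
flags = binarizer.fit_transform(salaries)

print(flags.ravel())
```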



In [3]:
# Other imports are the same as above
df = pd.read_csv('E:/Machine learning/ML/FeatureEngineering/ML_pipelines/train.csv')[['Age','Fare', 'SibSp','Parch', 'Survived']]

In [4]:
df.dropna(inplace = True)

In [5]:
df.head()

| | Age | Fare | SibSp | Parch | Survived |
|---|---|---|---|---|---|
| 0 | 22.0 | 7.25 | 1 | 0 | 0 |
| 1 | 38.0 | 71.2833 | 1 | 0 | 1 |
| 2 | 26.0 | 7.925 | 0 | 0 | 1 |
| 3 | 35.0 | 53.1 | 1 | 0 | 1 |
| 4 | 35.0 | 8.05 | 0 | 0 | 0 |


In [6]:
df['Family'] = df['SibSp']+ df['Parch']

In [7]:
df.sample(5)

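| | Age | Fare | SibSp | Parch | Survived | Family |
|---|---|---|---|---|---|---|
| 673 | 31.0 | 13.0 | 0 | 0 | 1 | 0 |
| 762 | 20.0 | 7.2292 | 0 | 0 | 1 | 0 |
| 13 | 39.0 | 31.275 | 1 | 5 | 0 | 6 |
| 326 | 61.0 | 6.2375 | 0 | 0 | 0 | 0 |
| 210 | 24.0 | 7.05 | 0 | 0 | 0 | 0 |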


In [8]:
df.drop(columns = ['SibSp', 'Parch'], inplace = True)

In [9]:
df.head()

| | Age | Fare | Survived | Family |
|---|---|---|---|---|
| 0 | 22.0 | 7.25 | 0 | 1 |
| 1 | 38.0 | 71.2833 | 1 | 1 |
| 2 | 26.0 | 7.925 | 1 | 0 |
| 3 | 35.0 | 53.1 | 1 | 1 |
| 4 | 35.0 | 8.05 | 0 | 0 |


In [10]:
x = df.drop(columns=['Survived'])
y= df['Survived']

In [17]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)

In [26]:
x_train.head()

| | Age | Fare | Family |
|---|---|---|---|
| 328 | 31.0 | 20.525 | 2 |
| 73 | 26.0 | 14.4542 | 1 |
| 253 | 30.0 | 16.1 | 1 |
| 719 | 33.0 | 7.775 | 0 |
| 666 | 25.0 | 13.0 | 0 |


In [19]:
# Without Binarization
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
accuracy_score(y_test, y_pred)


0.6223776223776224

In [20]:
np.mean(cross_val_score(clf,x,y,cv =10, scoring='accuracy'))

np.float64(0.6569640062597809)

In [21]:
# Applying Binarization
from sklearn.preprocessing import Binarizer

In [22]:
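# Default threshold is 0: Family > 0 becomes 1, Family == 0 stays 0.
# The binarized column comes first in the output; passthrough columns follow.
trf = ColumnTransformer([
    ('bin', Binarizer(copy=False), ['Family'])
], remainder='passthrough')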

In [24]:
x_train_trf = trf.fit_transform(x_train)
x_test_trf = trf.transform(x_test)

In [25]:
pd.DataFrame(x_train_trf,columns = ['Family', 'Age','Fare'])

| | Family | Age | Fare |
|---|---|---|---|
| 0 | 1.0 | 31.0 | 20.5250 |
| 1 | 1.0 | 26.0 | 14.4542 |
| 2 | 1.0 | 30.0 | 16.1000 |
| 3 | 0.0 | 33.0 | 7.7750 |
| 4 | 0.0 | 25.0 | 13.0000 |
| ... | ... | ... | ... |
| 566 | 1.0 | 46.0 | 61.1750 |
| 567 | 0.0 | 25.0 | 13.0000 |
| 568 | 0.0 | 41.0 | 134.5000 |
| 569 | 1.0 | 33.0 | 20.5250 |


In [27]:
clf= DecisionTreeClassifier()
clf.fit(x_train_trf, y_train)
y_pred2 = clf.predict(x_test_trf)

accuracy_score(y_test, y_pred2)

0.6013986013986014

In [28]:
x_trf = trf.fit_transform(x)
np.mean(cross_val_score(clf,x_trf,y,cv=10, scoring='accuracy'))

np.float64(0.6219874804381847)