# Feature Scaling

- Standardization
- Normalization
- Maximum Absolute Scaler
- Robust Scaler
- Binarizer

In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing

In [14]:
data = {
    'TB': [0.8, 0.85, 0.92, 0.88, 0.82, 0.86, 0.87, 0.93, 0.81, 1],
    'BB': [20, 25, 21, 29, 30, 21, 28, 27, 29, 30]
}
df = pd.DataFrame(data)
df

| | TB | BB |
|---|---|---|
| 0 | 0.80 | 20 |
| 1 | 0.85 | 25 |
| 2 | 0.92 | 21 |
| 3 | 0.88 | 29 |
| 4 | 0.82 | 30 |
| 5 | 0.86 | 21 |
| 6 | 0.87 | 28 |
| 7 | 0.93 | 27 |
| 8 | 0.81 | 29 |
| 9 | 1.00 | 30 |


<hr>

### 3. Maximum Absolute Scaler

- Maximum Absolute Scaler formula: $\displaystyle x' = \frac {x} {\max(|x|)}$

- The scaled data falls in the range __-1__ to __1__.

In [10]:
# 1. MaxAbsScaler without sklearn: divide by the largest absolute value
df['BB'] / df['BB'].abs().max()

0    0.666667
1    0.833333
2    0.700000
3    0.966667
4    1.000000
5    0.700000
6    0.933333
7    0.900000
8    0.966667
9    1.000000
Name: BB, dtype: float64

In [11]:
# 2. MaxAbsScaler using sklearn (estimator class)
from sklearn.preprocessing import MaxAbsScaler
MaxAbsScaler().fit_transform(df[['BB']])

array([[0.66666667],
       [0.83333333],
       [0.7       ],
       [0.96666667],
       [1.        ],
       [0.7       ],
       [0.93333333],
       [0.9       ],
       [0.96666667],
       [1.        ]])

In [13]:
# 3. MaxAbsScaler using sklearn (maxabs_scale function)
from sklearn.preprocessing import maxabs_scale
maxabs_scale(df['BB'])

array([0.66666667, 0.83333333, 0.7       , 0.96666667, 1.        ,
       0.7       , 0.93333333, 0.9       , 0.96666667, 1.        ])
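
The `BB` column contains only positive values, so the negative half of the range never appears in the outputs above. The following is a minimal sketch, using a hypothetical array `x` that is not part of the original data, to show that dividing by the largest absolute value maps the data into the range -1 to 1 and that `MaxAbsScaler` agrees with the manual calculation.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical column (illustration only): the value with the
# largest magnitude is negative.
x = np.array([[-4.0], [-1.0], [0.0], [2.0]])

# Manual calculation: divide by max(|x|), which is 4 here.
manual = x / np.abs(x).max()

# sklearn calculation on the same column.
scaled = MaxAbsScaler().fit_transform(x)

print(manual.ravel())
print(scaled.ravel())
```

Dividing by `x.max()` instead (which is 2 here) would send -4 to -2, outside the expected range, which is why the absolute value matters when a column has negative entries.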

<hr>

### 4. Robust Scaler

- Robust Scaler formula $\displaystyle x' = \frac {x - \textrm{median}(x)} {\textrm{IQR}} = \frac {x - Q_2} {Q_3 - Q_1}$

In [28]:
med = df['TB'].median()
q1 = np.quantile(df['TB'], .25)  # first quartile
q2 = np.quantile(df['TB'], .5)   # second quartile, equal to the median
q3 = np.quantile(df['TB'], .75)  # third quartile
med, q1, q2, q3

(0.865, 0.8275, 0.865, 0.91)

In [29]:
# 1. robust scaler manual calculation
(df['TB'] - q2) / (q3 - q1)

0   -0.787879
1   -0.181818
2    0.666667
3    0.181818
4   -0.545455
5   -0.060606
6    0.060606
7    0.787879
8   -0.666667
9    1.636364
Name: TB, dtype: float64

In [30]:
# 2. robust scaler using sklearn
from sklearn.preprocessing import RobustScaler
RobustScaler(quantile_range=(25, 75)).fit_transform(df[['TB']])

array([[-0.78787879],
       [-0.18181818],
       [ 0.66666667],
       [ 0.18181818],
       [-0.54545455],
       [-0.06060606],
       [ 0.06060606],
       [ 0.78787879],
       [-0.66666667],
       [ 1.63636364]])
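
As a cross-check of the formula, a fitted `RobustScaler` exposes the statistics it uses through its `center_` and `scale_` attributes. The sketch below rebuilds the `TB` column so that it runs on its own (the variable names `tb`, `median`, and `iqr` are chosen here for illustration) and compares those attributes with the median and IQR computed by NumPy.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Same TB values as in the DataFrame defined at the top of the notebook.
tb = pd.DataFrame(
    {'TB': [0.8, 0.85, 0.92, 0.88, 0.82, 0.86, 0.87, 0.93, 0.81, 1]}
)

# Fit with the default quantile_range=(25.0, 75.0).
scaler = RobustScaler().fit(tb)

median = scaler.center_[0]  # value subtracted from each sample
iqr = scaler.scale_[0]      # value each sample is divided by (Q3 - Q1)

q1, q3 = np.quantile(tb['TB'], [0.25, 0.75])
print(median, iqr, q3 - q1)
```

If the printed values match, the manual calculation and the sklearn scaler are applying the same median and IQR.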

<hr>

### 5. Binarizer

- Binarizer converts data to binary values (__0__ or __1__) according to a given _threshold_. For example, with _threshold_ = __20__, values <= 20 become __0__ and values > 20 become __1__.

In [39]:
# 1. manual binarizer
threshold = 0.85
df['TB'].apply(lambda x: 0 if x <= threshold else 1)

0    0
1    0
2    1
3    1
4    0
5    1
6    1
7    1
8    0
9    1
Name: TB, dtype: int64

In [38]:
# 2. sklearn binarizer
from sklearn.preprocessing import Binarizer
Binarizer(threshold=0.85).fit_transform(df[['TB']])

array([[0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [1.],
       [1.],
       [1.],
       [0.],
       [1.]])
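
The threshold of 20 mentioned in the description can be tried on the `BB` column, where one value sits exactly on the threshold. This is a small sketch that rebuilds the `BB` values as a float array named `bb` (a name chosen here for illustration) so that it runs on its own.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Same BB values as in the DataFrame defined at the top of the notebook,
# stored as a single float column.
bb = np.array([20, 25, 21, 29, 30, 21, 28, 27, 29, 30], dtype=float).reshape(-1, 1)

# Values strictly greater than the threshold become 1, the rest become 0.
binarized = Binarizer(threshold=20).fit_transform(bb)

print(binarized.ravel())
```

Only the first value (20) equals the threshold, so it is the only one mapped to 0, which matches the `<=` rule stated in the description.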