Feature scaling transforms numerical features onto a comparable, usually small, range of values. Three common techniques are covered here:

1. Normalisation
2. Standardization
3. Robust Scaling

# 1. Normalisation

Normalisation (min-max scaling) rescales each feature to the range 0 to 1.
Its formula is $X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$.
It is mostly preferred when the data at hand does not follow a normal (Gaussian) distribution.
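
To make the min-max formula concrete, here is a minimal hand-worked sketch in plain Python. The helper name `min_max_scale` and the sample values are made up for illustration, and mapping a constant feature to 0.0 is a choice made for this sketch only.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant feature has no range; map it to 0.0 (sketch assumption).
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

# Hypothetical bill amounts, not taken from the tips dataset.
bills = [10.0, 20.0, 25.0, 50.0]
scaled_bills = min_max_scale(bills)
print(scaled_bills)  # [0.0, 0.25, 0.375, 1.0]
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.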

In [1]:
from seaborn import load_dataset
tip_data=load_dataset('tips')
tip_data.head()

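|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |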


In [2]:
num_feats=tip_data[['total_bill','tip','size']]

In [3]:
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()

num_scaled=scaler.fit_transform(num_feats)
num_scaled[:5]

# The scaler returns a NumPy array; convert it back to a pandas DataFrame.

array([[0.29157939, 0.00111111, 0.2       ],
       [0.1522832 , 0.07333333, 0.4       ],
       [0.3757855 , 0.27777778, 0.4       ],
       [0.43171345, 0.25666667, 0.2       ],
       [0.45077503, 0.29      , 0.6       ]])

In [4]:
import pandas as pd

num_scaled_df=pd.DataFrame(num_scaled,columns=num_feats.columns)
num_scaled_df.head()

|   | total_bill | tip | size |
|---|---|---|---|
| 0 | 0.291579 | 0.001111 | 0.2 |
| 1 | 0.152283 | 0.073333 | 0.4 |
| 2 | 0.375786 | 0.277778 | 0.4 |
| 3 | 0.431713 | 0.256667 | 0.2 |
| 4 | 0.450775 | 0.29 | 0.6 |


We can see that all the values are now scaled between 0 and 1.

# 2. Standardization

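Standardization is generally preferred when the training data is known to follow, at least approximately, a normal (Gaussian) distribution.

Its formula is $X_{std} = \frac{X - \mu}{\sigma}$, where $\mu$ is the feature's mean and $\sigma$ is its standard deviation, so the scaled feature has zero mean and unit standard deviation.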
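
As a hand-worked illustration of standardization, the sketch below subtracts the mean and divides by the standard deviation using only Python's `statistics` module. The helper name `standardize` and the sample values are invented for this example. It uses the population standard deviation, which is also what `StandardScaler` uses.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population std (divides by n)
    return [(v - mu) / sigma for v in values]

# Hypothetical sample with mean 5 and population standard deviation 2.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = standardize(sample)
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Each z-score says how many standard deviations a value lies above or below the mean.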

In [5]:
from sklearn.preprocessing import StandardScaler

std_scaler=StandardScaler()
num_std=std_scaler.fit_transform(num_feats)
num_std[:5]

array([[-0.31471131, -1.43994695, -0.60019263],
       [-1.06323531, -0.96920534,  0.45338292],
       [ 0.1377799 ,  0.36335554,  0.45338292],
       [ 0.4383151 ,  0.22575414, -0.60019263],
       [ 0.5407447 ,  0.4430195 ,  1.50695847]])

In [6]:
# mean of each feature in the original (unscaled) data, learned during fit
std_scaler.mean_

array([19.78594262,  2.99827869,  2.56967213])

In [7]:
# variance of each feature in the original (unscaled) data, learned during fit
std_scaler.var_

array([78.92813149,  1.90660851,  0.9008835 ])

In [8]:
import numpy as np

print(f'The mean of scaled data:{np.round(num_std.mean(axis=0))}')
print(f'The standard deviation of scaled data:{num_std.std(axis=0)}')

The mean of scaled data:[-0.  0. -0.]
The standard deviation of scaled data:[1. 1. 1.]


In [9]:
# converting the scaled data back to dataframe
num_std_scaled_df=pd.DataFrame(num_std,columns=num_feats.columns)
num_std_scaled_df.head()

|   | total_bill | tip | size |
|---|---|---|---|
| 0 | -0.314711 | -1.439947 | -0.600193 |
| 1 | -1.063235 | -0.969205 | 0.453383 |
| 2 | 0.13778 | 0.363356 | 0.453383 |
| 3 | 0.438315 | 0.225754 | -0.600193 |
| 4 | 0.540745 | 0.44302 | 1.506958 |


# 3. Robust Scaling

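Robust scaling is similar to standardization, but it centers each feature on its median and divides by the interquartile range (IQR) instead of using the mean and standard deviation. This makes it a better choice when the data contains many outliers.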
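
The idea behind robust scaling can be sketched by hand: subtract the median and divide by the interquartile range (IQR, the 75th percentile minus the 25th). The helper name `robust_scale` and the sample values are made up for illustration, and the quartiles are computed with linear interpolation, which is an assumption of this sketch.

```python
import statistics

def robust_scale(values):
    """Return (x - median) / IQR for each value."""
    med = statistics.median(values)
    # 'inclusive' quartiles interpolate linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - med) / iqr for v in values]

# Hypothetical sample with one extreme outlier (100.0).
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
robust = robust_scale(sample)
print(robust)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier stays far from the rest, but it barely affects how the ordinary values are scaled, because the median and IQR depend only on the middle of the data.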

In [10]:
from sklearn.preprocessing import RobustScaler
rob_scaler=RobustScaler()
num_rob_scaled=rob_scaler.fit_transform(num_feats)

num_rob_scaled[:5]

array([[-0.07467532, -1.2096    ,  0.        ],
       [-0.69155844, -0.7936    ,  1.        ],
       [ 0.29823748,  0.384     ,  1.        ],
       [ 0.54591837,  0.2624    ,  0.        ],
       [ 0.63033395,  0.4544    ,  2.        ]])

In [11]:
print(f'The median of scaled data: {np.round(np.median(num_rob_scaled, axis=0))}')

The median of scaled data: [-0.  0.  0.]
