# Feature Transformation and Scaling

<b>StandardScaler</b>

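StandardScaler rescales each feature (column) so that its mean is 0 and its standard deviation (STD) is 1: the column mean is subtracted from every value, and the result is divided by the column standard deviation.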

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

%matplotlib inline

In [2]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

In [3]:
housing.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

In [4]:
X_train, X_test, Y_train, Y_test=train_test_split(housing.data,housing.target,test_size=0.3)

In [5]:
np.set_printoptions(suppress=True)
print(X_train)

[[   3.6125       15.            5.91780822 ...    2.84931507
    38.53       -120.59      ]
 [   4.8173       22.            5.53623188 ...    2.90144928
    38.15       -121.74      ]
 [   4.4423       37.            5.68435013 ...    2.77718833
    33.84       -118.12      ]
 ...
 [   1.4639        6.            3.6898263  ...    3.74441687
    34.         -118.29      ]
 [   2.4375       12.            3.96173733 ...    2.33505688
    34.2        -118.48      ]
 [  13.8093        7.            6.51724138 ...    2.89655172
    37.31       -122.11      ]]


In [6]:
print("the mean is : {}".format(X_train.mean(axis=0)))
print("STD is : {}".format(X_train.std(axis=0)))

the mean is : [   3.8626996    28.55087209    5.4382726     1.0983114  1424.96947674
    3.08010931   35.6347356  -119.56775055]
STD is : [   1.87758414   12.5984457     2.40242197    0.47692123 1128.80065849
   11.3510555     2.13827971    2.004983  ]


In [7]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(X_train)
X_train_scaled=scaler.transform(X_train)
X_test_scaled=scaler.transform(X_test)


In [8]:
print("the mean after scaling is : {}".format(X_train_scaled.mean(axis=0)))
print("STD after scaling is : {}".format(X_train_scaled.std(axis=0)))

the mean after scaling is : [-0.  0.  0. -0. -0.  0. -0.  0.]
STD after scaling is : [1. 1. 1. 1. 1. 1. 1. 1.]
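The same transformation can be reproduced by hand, which makes it clearer what the scaler does. The sketch below uses a small made-up array (`X_demo` is an illustrative name, not part of the housing data) and assumes the default `StandardScaler` settings, under which the standard deviation is computed with `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up matrix: 4 samples, 2 features on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 700.0]])

# Manual z-score: subtract the column mean, divide by the column standard deviation.
# np.std uses ddof=0 by default, which matches what StandardScaler computes.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Same result through scikit-learn.
sk_scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual, sk_scaled))
print(sk_scaled.mean(axis=0), sk_scaled.std(axis=0))
```

In the notebook cells above, the scaler is fitted on `X_train` only and then applied to `X_test`, so the test set is scaled with the training mean and standard deviation rather than its own.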


<b>MinMax Scaler</b>

MinMaxScaler rescales each feature to the range 0 to 1: the minimum of each column is mapped to 0 and its maximum to 1.

In [9]:
from sklearn.preprocessing import MinMaxScaler
m_scaler = MinMaxScaler()
m_scaler.fit(X_train)
X_train_m_scaled=m_scaler.transform(X_train)
X_test_m_scaled=m_scaler.transform(X_test)


In [10]:
print("the mean after scaling is : {}".format(X_train_m_scaled.mean(axis=0)))
print("STD after scaling is : {}".format(X_train_m_scaled.std(axis=0)))

the mean after scaling is : [0.23191401 0.54021318 0.03487142 0.02267722 0.03985452 0.00187521
 0.32992917 0.47631967]
STD after scaling is : [0.12948678 0.24702835 0.0182434  0.01413798 0.03163768 0.00913505
 0.22796159 0.1996995 ]


In [11]:
X_train_m_scaled.min()

0.0

In [12]:
X_train_m_scaled.max()

1.0000000000000002
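The maximum printed above is slightly larger than 1 only because of floating-point rounding. Note also that `.min()` and `.max()` without an axis return a single value for the whole array; `min(axis=0)` and `max(axis=0)` would show the range per feature. As a minimal sketch of the formula behind the scaler, the example below uses made-up arrays (`X_fit` and `X_new` are illustrative names) and assumes the default `feature_range=(0, 1)`. It also shows that data not seen during fitting can land outside that range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data used to fit the scaler: 3 samples, 2 features.
X_fit = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [5.0, 40.0]])

# Made-up unseen sample with values outside the fitted min/max of each column.
X_new = np.array([[6.0, 5.0]])

mm = MinMaxScaler().fit(X_fit)

# Manual formula: (x - column_min) / (column_max - column_min)
col_min = X_fit.min(axis=0)
col_max = X_fit.max(axis=0)
manual = (X_fit - col_min) / (col_max - col_min)

print(np.allclose(manual, mm.transform(X_fit)))

# The unseen sample is scaled with the fitted min/max,
# so its values are not guaranteed to stay inside [0, 1].
new_scaled = mm.transform(X_new)
print(new_scaled)
```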

<b>Robust Scaler</b>

StandardScaler and MinMaxScaler rely on statistics such as the mean, standard deviation, minimum and maximum of each column, all of which are sensitive to outliers. If the data contains many outliers, they distort these statistics, so neither of the two methods above guarantees that the scaled features end up on comparable, well-balanced ranges. RobustScaler is an alternative that is robust to outliers. It first subtracts the median from the data and then divides by the interquartile range (IQR).
IQR is the difference between the third and first quartiles of the variable:

                                                IQR = Q3 – Q1
and the scaled value is:
                                        x_scaled = (x – median)/(Q3 – Q1)



In [13]:
from sklearn.preprocessing import RobustScaler
r_scaler = RobustScaler()
r_scaler.fit(X_train)
X_train_r_scaled=r_scaler.transform(X_train)
X_test_r_scaled=r_scaler.transform(X_test)
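As a rough check of the formula, the sketch below applies it by hand to a made-up single-feature array containing one extreme outlier and compares the result with `RobustScaler`. It assumes the scaler's default settings (centering on the median and scaling by the 25th to 75th percentile range). The variable names are illustrative only. For contrast, it also measures how much spread the non-outlier values keep under `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up feature: five ordinary values and one extreme outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

# Manual robust scaling: (x - median) / (Q3 - Q1)
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

robust = RobustScaler().fit_transform(x)
print(np.allclose(manual, robust))

# Spread (max - min) of the five ordinary values after each kind of scaling.
standard = StandardScaler().fit_transform(x)
robust_spread = robust[:5].max() - robust[:5].min()
standard_spread = standard[:5].max() - standard[:5].min()
print(robust_spread, standard_spread)
```

Because the median and the IQR are barely affected by the outlier, the ordinary values keep a usable spread after robust scaling, whereas the mean and standard deviation are pulled by the outlier and compress those values into a narrow band.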

