# Feature Transformation and Scaling

<b>StandardScaler</b>

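StandardScaler rescales each feature (column) so that its mean is 0 and its standard deviation (STD) is 1. For every value it subtracts the column mean and divides by the column standard deviation.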

In [14]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

%matplotlib inline

In [15]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

In [16]:
housing.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])

In [17]:
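# Hold out 30% of the samples for testing (no random_state is set, so the split differs between runs)
X_train, X_test, Y_train, Y_test = train_test_split(housing.data, housing.target, test_size=0.3)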

In [18]:
np.set_printoptions(suppress=True)
print(X_train)

[[   5.9641       44.            4.35714286 ...    2.22857143
    33.99       -118.44      ]
 [   4.5455       25.            5.51132686 ...    2.50809061
    34.68       -118.14      ]
 [   2.2765       30.            4.02164502 ...    3.81962482
    34.07       -117.64      ]
 ...
 [   2.0096       22.            4.02702703 ...    3.3963964
    33.9        -118.26      ]
 [   2.6875       23.            6.35744681 ...    2.88510638
    36.27       -119.25      ]
 [   7.5696        4.            8.02617801 ...    2.85340314
    34.87       -120.45      ]]


In [19]:
print("the mean is : {}".format(X_train.mean(axis=0)))
print("STD is : {}".format(X_train.std(axis=0)))

the mean is : [   3.84886232   28.67421096    5.40042744    1.0940336  1426.09759136
    3.10512481   35.63013358 -119.56068868]
STD is : [   1.89567269   12.55080901    2.30360322    0.46073619 1107.38404741
   12.24310365    2.13846495    2.00399271]


In [20]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Learn the per-column mean and standard deviation from the training data only
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

In [21]:
print("the mean after scaling is : {}".format(X_train_scaled.mean(axis=0)))
print("STD after scaling is : {}".format(X_train_scaled.std(axis=0)))

the mean after scaling is : [-0.  0. -0. -0.  0.  0. -0. -0.]
STD after scaling is : [1. 1. 1. 1. 1. 1. 1. 1.]
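To make the transformation concrete, here is a minimal sketch that reproduces the standardization by hand with NumPy and compares it with `StandardScaler`. The small arrays `X_demo_train` and `X_demo_test` are made-up values used only for illustration, not part of the housing data. The sketch also shows the usual practice of scaling held-out data with the statistics learned from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features, four training rows, two test rows
X_demo_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 60.0]])
X_demo_test = np.array([[2.5, 25.0], [5.0, 5.0]])

# Statistics are computed on the training rows only
mu = X_demo_train.mean(axis=0)
sigma = X_demo_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

# z = (x - mean) / std, applied column by column
manual_train = (X_demo_train - mu) / sigma
manual_test = (X_demo_test - mu) / sigma

# The same thing with scikit-learn: fit on train, transform both splits
demo_scaler = StandardScaler().fit(X_demo_train)
sk_train = demo_scaler.transform(X_demo_train)
sk_test = demo_scaler.transform(X_demo_test)

print(np.allclose(manual_train, sk_train))
print(np.allclose(manual_test, sk_test))
```

Note that the scaled test rows will generally not have a mean of exactly 0 or a standard deviation of exactly 1, because the statistics come from the training rows.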


<b>MinMax Scaler</b>

MinMaxScaler linearly rescales each feature so that its minimum maps to 0 and its maximum maps to 1 (the default range).

In [22]:
from sklearn.preprocessing import MinMaxScaler
m_scaler = MinMaxScaler()
# Learn the per-column minimum and maximum from the training data
m_scaler.fit(X_train)
X_train_m_scaled = m_scaler.transform(X_train)

In [23]:
print("the mean after scaling is : {}".format(X_train_m_scaled.mean(axis=0)))
print("STD after scaling is : {}".format(X_train_m_scaled.std(axis=0)))

the mean after scaling is : [0.23095973 0.54263159 0.03458403 0.0225504  0.04982311 0.00189535
 0.32767379 0.47702304]
STD after scaling is : [0.13073424 0.24609429 0.017493   0.01365819 0.03876988 0.00985294
 0.22749627 0.19960087]


In [24]:
X_train_m_scaled.min()

0.0

In [25]:
X_train_m_scaled.max()

1.0000000000000004
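The maximum of 1.0000000000000004 printed above is most likely floating-point rounding rather than a scaling error. As a rough check of what the scaler computes, the sketch below applies the min-max formula `(x - min) / (max - min)` by hand to a small made-up array (`X_demo` is hypothetical) and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales
X_demo = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 1000.0]])

# Per-column minimum and maximum
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# x_scaled = (x - min) / (max - min)
manual_mm = (X_demo - col_min) / (col_max - col_min)

# Same result from scikit-learn (default feature_range is (0, 1))
sk_mm = MinMaxScaler().fit_transform(X_demo)

print(manual_mm)
print(np.allclose(manual_mm, sk_mm))
```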

<b>Robust Scaler</b>

Standard Scaler and MinMax Scaler rely on statistics such as the mean, standard deviation, minimum and maximum of each column, all of which are sensitive to outliers. If the data has many outliers, they distort these statistics, so neither of the two methods above guarantees well-balanced features after scaling. (None of these scalers changes the shape of a distribution, so none of them makes the data normally distributed.) Robust Scaler is another method, one that is robust to outliers. It first subtracts the median from each column, then scales the data by the interquartile range (IQR).
The IQR is the difference between the third and first quartiles of the variable:

    IQR = Q3 - Q1

and the scaled value is:

    x_scaled = (x - median) / (Q3 - Q1)



In [26]:
from sklearn.preprocessing import RobustScaler
r_scaler = RobustScaler()
# Learn the per-column median and interquartile range from the training data
r_scaler.fit(X_train)
X_train_r_scaled = r_scaler.transform(X_train)
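As a sanity check on what the robust scaler computes, the following sketch subtracts the median and divides by the interquartile range by hand, then compares the result with `RobustScaler` using its default settings. The array `X_demo` is a made-up single feature containing one obvious outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy feature: four ordinary values and one outlier
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Median and quartiles are barely affected by the outlier
median = np.median(X_demo, axis=0)
q1, q3 = np.percentile(X_demo, [25, 75], axis=0)

# x_scaled = (x - median) / (Q3 - Q1)
manual_rs = (X_demo - median) / (q3 - q1)

# Same result from scikit-learn (default quantile_range is (25.0, 75.0))
sk_rs = RobustScaler().fit_transform(X_demo)

print(manual_rs.ravel())
print(np.allclose(manual_rs, sk_rs))
```

The outlier is still present after scaling and still far from the rest; the robust scaler does not remove it, it only keeps it from dominating the scale of the other values.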


<b>Comparison</b>
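One possible way to compare the three scalers is to apply them to a single feature that contains an outlier and look at how much room the ordinary values are left with afterwards. The sketch below is only an illustration on made-up numbers (the array `x` and the `inlier_spread` dictionary are hypothetical names, not part of the notebook above).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature: nine ordinary values plus one large outlier
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

scalers = {
    "StandardScaler": StandardScaler(),
    "MinMaxScaler": MinMaxScaler(),
    "RobustScaler": RobustScaler(),
}

# For each scaler, measure the range covered by the nine ordinary values
inlier_spread = {}
for name, s in scalers.items():
    scaled = s.fit_transform(x).ravel()
    inliers = scaled[:-1]  # everything except the outlier in the last position
    inlier_spread[name] = inliers.max() - inliers.min()
    print(f"{name:15s} inlier spread = {inlier_spread[name]:.4f}, "
          f"outlier -> {scaled[-1]:.2f}")
```

On this toy feature the mean, standard deviation and maximum are all pulled towards the outlier, so StandardScaler and MinMaxScaler squeeze the ordinary values into a very narrow band, while RobustScaler keeps them spread over a range of order one.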

In [None]:
import 