# Chapter 7: Prepare Your Data For Machine Learning

## Summary
* Rescale data.
* Standardize data.
* Normalize data.
* Binarize data.

## 7.1 Need For Data Pre-processing

We import the libraries needed for this chapter.

In [2]:
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import Binarizer

We load the dataset.

In [3]:
filename = "kaggle-house-prices-train.csv"
data = read_csv(filename, index_col=0)

We select the numeric columns.

In [4]:
data_num = data.select_dtypes(exclude=['object'])
print(data_num)

      MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
Id                                                                            
1             60         65.0     8450            7            5       2003   
2             20         80.0     9600            6            8       1976   
3             60         68.0    11250            7            5       2001   
4             70         60.0     9550            7            5       1915   
5             60         84.0    14260            8            5       2000   
...          ...          ...      ...          ...          ...        ...   
1456          60         62.0     7917            6            5       1999   
1457          20         85.0    13175            6            6       1978   
1458          70         66.0     9042            7            9       1941   
1459          20         68.0     9717            5            6       1950   
1460          20         75.0     9937            5 

We split the dataset into input and output variables for machine learning.
* We create an array from the values of the numeric columns.
* We separate the array into input components (`X`, the first 36 columns) and the output component (`Y`, the last column).

In [5]:
array = data_num.values
X = array[:,0:36]
Y = array[:,36]
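
The hard-coded indices `0:36` and `36` only work when the numeric frame has exactly 37 columns with the target in the last position. As a minimal sketch, the same split can be written with negative indexing so that it does not depend on the column count. The frame `toy` and its column names and values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame: two feature columns, target column last.
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0],
    "feature_b": [10.0, 20.0, 30.0],
    "target": [100.0, 200.0, 300.0],
})

toy_array = toy.values

# Every column except the last one is an input; the last column is the output.
toy_X = toy_array[:, :-1]
toy_Y = toy_array[:, -1]

print(toy_X.shape, toy_Y.shape)
```

This assumes the target really is the last column, which should be checked against `data_num.columns` before relying on it.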

## 7.3 Rescale Data

In [6]:
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[0.235 0.151 0.033 0.667 0.5   0.949 0.883 0.122 0.125 0.    0.064 0.14
  0.12  0.414 0.    0.259 0.333 0.    0.667 0.5   0.375 0.333 0.5   0.
  0.936 0.5   0.386 0.    0.112 0.    0.    0.    0.    0.    0.091 0.5  ]
 [0.    0.202 0.039 0.556 0.875 0.754 0.433 0.    0.173 0.    0.122 0.207
  0.213 0.    0.    0.175 0.    0.5   0.667 0.    0.375 0.333 0.333 0.333
  0.691 0.5   0.324 0.348 0.    0.    0.    0.    0.    0.    0.364 0.25 ]
 [0.235 0.161 0.047 0.667 0.5   0.935 0.867 0.101 0.086 0.    0.186 0.151
  0.134 0.419 0.    0.274 0.333 0.    0.667 0.5   0.375 0.333 0.333 0.333
  0.918 0.5   0.429 0.    0.077 0.    0.    0.    0.    0.    0.727 0.5  ]
 [0.294 0.134 0.039 0.667 0.5   0.312 0.333 0.    0.038 0.    0.231 0.124
  0.144 0.366 0.    0.261 0.333 0.    0.333 0.    0.375 0.333 0.417 0.333
  0.891 0.75  0.453 0.    0.064 0.493 0.    0.    0.    0.    0.091 0.   ]
 [0.235 0.216 0.061 0.778 0.5   0.928 0.833 0.219 0.116 0.    0.21  0.187
  0.186 0.51  0.    0.351 0.333 0.    
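
As a quick check of what `MinMaxScaler` computes, the sketch below applies it to a small made-up array and compares the result with the column-wise formula `(x - min) / (max - min)`. The array `toy` is hypothetical and only serves as an illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 3 rows, 2 columns on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy)

# Same result computed by hand, column by column.
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

print(scaled)
print(np.allclose(scaled, manual))
```

Each column is rescaled independently, so both columns end up spanning the range 0 to 1 even though their original scales differ.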

## 7.4 Standardize Data

In [7]:
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[ 0.073 -0.208 -0.207  0.651 -0.517  1.051  0.879  0.51   0.575 -0.289
  -0.945 -0.459 -0.793  1.162 -0.12   0.37   1.108 -0.241  0.79   1.228
   0.164 -0.211  0.912 -0.951  0.992  0.312  0.351 -0.752  0.217 -0.359
  -0.116 -0.27  -0.069 -0.088 -1.599  0.139]
 [-0.873  0.41  -0.092 -0.072  2.18   0.157 -0.43  -0.573  1.172 -0.289
  -0.641  0.466  0.257 -0.795 -0.12  -0.483 -0.82   3.949  0.79  -0.762
   0.164 -0.211 -0.319  0.6   -0.102  0.312 -0.061  1.626 -0.704 -0.359
  -0.116 -0.27  -0.069 -0.088 -0.489 -0.614]
 [ 0.073 -0.084  0.073  0.651 -0.517  0.985  0.83   0.322  0.093 -0.289
  -0.302 -0.313 -0.628  1.189 -0.12   0.515  1.108 -0.241  0.79   1.228
   0.164 -0.211 -0.319  0.6    0.911  0.312  0.632 -0.752 -0.07  -0.359
  -0.116 -0.27  -0.069 -0.088  0.991  0.139]
 [ 0.31  -0.414 -0.097  0.651 -0.517 -1.864 -0.72  -0.573 -0.499 -0.289
  -0.062 -0.687 -0.522  0.937 -0.12   0.384  1.108 -0.241 -1.026 -0.762
   0.164 -0.211  0.297  0.6    0.79   1.65   0.791 -0.752 -0.176  4.093
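
To see what standardization does, the sketch below runs `StandardScaler` on a small made-up array and compares it with the column-wise formula `(x - mean) / std`. The array `toy` is hypothetical. Note that the hand computation uses NumPy's default population standard deviation (`ddof=0`), which is what matches the scaler's output here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 3 rows, 2 columns on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

standardized = StandardScaler().fit_transform(toy)

# Same result by hand: subtract the column mean, divide by the column std.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(standardized, manual))
print(standardized.mean(axis=0))  # close to 0 for each column
print(standardized.std(axis=0))   # close to 1 for each column
```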
 

### For the next steps:
* We convert all values to float64.
* We drop every row that contains a NaN.

In [8]:
data_float = data_num.astype('float64')
data_ok = data_float.dropna()
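
A likely reason for dropping NaN rows at this point: as far as we know, in recent scikit-learn versions `MinMaxScaler` and `StandardScaler` ignore NaN when fitting and pass it through on transform, whereas `Normalizer` and `Binarizer` reject input that contains NaN. The sketch below illustrates that difference on a made-up array (`toy` is hypothetical, and the behaviour may differ in older scikit-learn releases).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

# Hypothetical toy data with one missing value.
toy = np.array([[1.0, 2.0],
                [np.nan, 4.0],
                [3.0, 6.0]])

# StandardScaler skips NaN when computing mean/std and keeps it in the output.
standardized = StandardScaler().fit_transform(toy)
print(np.isnan(standardized).sum())

# Normalizer refuses NaN input and raises a ValueError.
try:
    Normalizer().fit_transform(toy)
    normalizer_failed = False
except ValueError as err:
    normalizer_failed = True
    print("Normalizer raised:", err)

# Dropping incomplete rows first avoids the error.
clean = toy[~np.isnan(toy).any(axis=1)]
normalized = Normalizer().fit_transform(clean)
print(normalized.shape)
```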

We split the cleaned dataset into input and output variables again.

* We create an array from the values of the cleaned data.
* We separate the array into input components (`X`) and the output component (`Y`).

In [9]:
array = data_ok.values
X = array[:,0:36]
Y = array[:,36]

## 7.5 Normalize Data

In [10]:
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])

[[6.206e-03 6.724e-03 8.741e-01 7.241e-04 5.172e-04 2.072e-01 2.072e-01
  2.027e-02 7.303e-02 0.000e+00 1.552e-02 8.854e-02 8.854e-02 8.834e-02
  0.000e+00 1.769e-01 1.034e-04 0.000e+00 2.069e-04 1.034e-04 3.103e-04
  1.034e-04 8.275e-04 0.000e+00 2.072e-01 2.069e-04 5.668e-02 0.000e+00
  6.310e-03 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 2.069e-04
  2.077e-01]
 [1.873e-03 7.492e-03 8.990e-01 5.619e-04 7.492e-04 1.850e-01 1.850e-01
  0.000e+00 9.159e-02 0.000e+00 2.660e-02 1.182e-01 1.182e-01 0.000e+00
  0.000e+00 1.182e-01 0.000e+00 9.365e-05 1.873e-04 0.000e+00 2.809e-04
  9.365e-05 5.619e-04 9.365e-05 1.850e-01 1.873e-04 4.308e-02 2.791e-02
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 4.682e-04
  1.880e-01]
 [4.914e-03 5.569e-03 9.214e-01 5.733e-04 4.095e-04 1.639e-01 1.640e-01
  1.327e-02 3.980e-02 0.000e+00 3.555e-02 7.535e-02 7.535e-02 7.093e-02
  0.000e+00 1.463e-01 8.190e-05 0.000e+00 1.638e-04 8.190e-05 2.457e-04
  8.190e-05 4.914e-04 8.190e-05 1.639e
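
Unlike the two scalers above, `Normalizer` works on rows rather than columns: with its default `norm="l2"` setting, each row is divided by its own Euclidean length. The sketch below checks this on a small made-up array (`toy` is hypothetical).

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy data: each row is one observation.
toy = np.array([[3.0, 4.0],
                [1.0, 0.0],
                [6.0, 8.0]])

normalized = Normalizer().fit_transform(toy)  # default norm is "l2"

# Same result by hand: divide each row by its Euclidean length.
row_norms = np.linalg.norm(toy, axis=1, keepdims=True)
manual = toy / row_norms

print(normalized)
print(np.linalg.norm(normalized, axis=1))  # every row now has length 1
```

Because the scaling is per row, a column with large raw values takes most of each row's length. This appears to be why the third value in each output row above (the `LotArea` column) is close to 0.9 while most other values are small.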

## 7.6 Binarize Data (Make Binary)

In [11]:
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(binaryX[0:5,:])

[[1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 0.
  1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 0. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1.
  1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1.
  1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 1. 1.]]
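
The thresholding rule can be checked on a small made-up array: values strictly greater than the threshold become 1, and all other values (including values equal to the threshold) become 0. The array `toy` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical toy data with zero, positive, and negative values.
toy = np.array([[0.0, 5.0, -1.0],
                [2.5, 0.0, 10.0]])

binary = Binarizer(threshold=0.0).fit_transform(toy)

# Same result by hand: strictly greater than the threshold gives 1, else 0.
manual = (toy > 0.0).astype(float)

print(binary)
print(np.array_equal(binary, manual))
```

With `threshold=0.0`, the output effectively marks which features are present (non-zero and positive) for each row, which matches the pattern of 0s and 1s in the output above.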
