This section walks through the steps of data pre-processing for machine learning:
1. Load the dataset.
2. Split the data into input and output variables for machine learning.
3. Apply a pre-processing transform to the input variables.
4. Summarize the transformed data to see the change.

In [11]:
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, Binarizer

In [12]:
# load data
file = '../data/pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
df = read_csv(file, names= names)
array = df.values

In [13]:
# separate the array into input (X) and output (Y) components
X = array[:,0:8]
Y = array[:,8]  # class 

The following pre-processing techniques bring the input attributes to a common scale or form. This matters because attributes measured on very different scales can distort training and lead to a poorly fitted model.
1. Min Max Scaler 
2. Standard Scaler
3. Data Normalization
4. Binarize data

Rescale data: Rescaling maps each attribute into the range 0 to 1. This is useful for the optimization algorithms at the core of many machine learning methods, such as gradient descent. It also helps algorithms that weight inputs, such as regression and neural networks, and algorithms that use distance measures, such as k-nearest neighbors.

In [14]:
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_mms = scaler.fit_transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescaledX_mms[0:5,:])

[[0.353 0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.254 0.183]
 [0.059 0.447 0.541 0.232 0.111 0.419 0.038 0.   ]
 [0.    0.688 0.328 0.354 0.199 0.642 0.944 0.2  ]]
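
To make the transform concrete, the sketch below applies the min-max formula by hand with NumPy. The array `X_toy` is a hypothetical example, not the Pima data, and the code assumes no column is constant (a constant column would divide by zero here).

```python
import numpy as np

# Hypothetical toy data: 3 rows, 2 attributes on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [4.0, 200.0]])

# Per-column minimum and maximum.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)

# Min-max rescaling to the range [0, 1]: (x - min) / (max - min).
X_minmax = (X_toy - col_min) / (col_max - col_min)
print(X_minmax)
```

Each column now runs from 0 (its smallest value) to 1 (its largest), which is the same per-column mapping that `MinMaxScaler(feature_range=(0, 1))` applies.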


Standardize data:
    Standardization transforms attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1. It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression, and LDA. After the transform, the values of each attribute have a mean of 0 and a standard deviation of 1.

In [15]:
scaler = StandardScaler().fit(X)
rescaledX_ss = scaler.transform(X)
print(rescaledX_ss[0:5,:])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
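
As a rough check on what standardization does, the sketch below computes it by hand with NumPy on a hypothetical array `X_toy` (not the Pima data). It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, and assumes no column has zero variance.

```python
import numpy as np

# Hypothetical toy data: 3 rows, 2 attributes.
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# Per-column mean and population standard deviation (ddof=0).
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)

# Standardize: subtract the mean, divide by the standard deviation.
X_standardized = (X_toy - col_mean) / col_std

print(X_standardized.mean(axis=0))  # approximately 0 for every column
print(X_standardized.std(axis=0))   # approximately 1 for every column
```

Unlike min-max rescaling, the result is not bounded, so outliers can still produce large values, such as the 5.485 in the output above.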


Normalize data: Normalization rescales each observation (row) to have a length of 1, called a unit norm. It is useful for sparse datasets with attributes of varying scales when using algorithms that weight input values, such as neural networks, and algorithms that use distance measures, such as k-nearest neighbors.

In [16]:
# NORMALIZE DATA
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
print(normalizedX[0:5,:])

[[0.034 0.828 0.403 0.196 0.    0.188 0.004 0.28 ]
 [0.008 0.716 0.556 0.244 0.    0.224 0.003 0.261]
 [0.04  0.924 0.323 0.    0.    0.118 0.003 0.162]
 [0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]
 [0.    0.596 0.174 0.152 0.731 0.188 0.01  0.144]]
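
Normalization works on rows rather than columns, which is easy to miss. The sketch below shows the L2 case by hand with NumPy (L2 is the default norm of `Normalizer`). The array `X_toy` is a hypothetical example and is assumed to contain no all-zero rows.

```python
import numpy as np

# Hypothetical toy data: rows 0 and 2 point in the same direction.
X_toy = np.array([[3.0, 4.0],
                  [1.0, 0.0],
                  [6.0, 8.0]])

# Euclidean (L2) length of each row, kept as a column for broadcasting.
row_norms = np.linalg.norm(X_toy, axis=1, keepdims=True)

# Divide each row by its own length so every row has unit norm.
X_unit = X_toy / row_norms
print(X_unit)
```

Rows 0 and 2 become identical after the transform, because only the direction of each observation is kept and its magnitude is discarded.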


Binarize data: Binarizing transforms the data using a binary threshold. All values above the threshold are marked 1 and all values equal to or below it are marked 0. This is also called thresholding your data.
      It can be useful when you have probabilities that you want to turn into crisp values.
      With threshold=0.0, as used here, every positive value becomes 1 and every zero becomes 0.

In [17]:
# BINARIZE DATA
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)
print(binaryX[0:5,:])

[[1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 0. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1. 1. 1.]]
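
The thresholding rule can be written in one line of NumPy. The sketch below uses a hypothetical array `X_toy` and a hypothetical `threshold` variable to show that values equal to the threshold map to 0, not 1.

```python
import numpy as np

# Hypothetical toy data, including values equal to the threshold.
X_toy = np.array([[0.0, 0.5, 2.0],
                  [1.0, 0.0, 0.0]])

threshold = 0.0

# Strictly greater than the threshold -> 1.0, otherwise -> 0.0.
X_binary = (X_toy > threshold).astype(float)
print(X_binary)
```

Raising `threshold` to 0.5 in this sketch would also map the 0.5 entry to 0, since the comparison is strict.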
