# Lecture 4: Data Preprocessing

This section introduces data preprocessing with sklearn (scikit-learn) and pandas.

## Handling Missing Values

Refer to Lecture 3.

## Feature Scaling: Standardization and Normalization
Normalization and standardization are both forms of feature scaling.
Normalization (min-max scaling) rescales each feature to fit into a specific interval, usually [0, 1].
Standardization rescales each feature to a mean of 0 and a standard deviation of 1, so standardized values can be positive or negative.
Features are scaled because a feature whose variance is much larger than the others can dominate the objective function and prevent the estimator from learning from the other features correctly.
Standardization tends to work less well when the original data is far from a Gaussian distribution. Min-max normalization is sensitive to outliers, because it depends only on the minimum and maximum of each feature.
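
The two scalings described above can be written out directly with NumPy. The following is a minimal sketch on randomly generated data; the variable names (`x`, `z`, `m`) and the fixed seed are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
x = rng.normal(loc=1.0, scale=3.0, size=(30, 4))  # 30 samples, 4 features

# Standardization (z-score): subtract the column mean, divide by the column std.
z = (x - x.mean(axis=0)) / x.std(axis=0)

# Min-max normalization: map each column's minimum to 0 and its maximum to 1.
x_min, x_max = x.min(axis=0), x.max(axis=0)
m = (x - x_min) / (x_max - x_min)

print(z.mean(axis=0).round(6), z.std(axis=0).round(6))  # per-column mean and std
print(m.min(axis=0), m.max(axis=0))                      # per-column range
```

Both operations work column by column (`axis=0`), so each feature is scaled independently of the others.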



In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from matplotlib import gridspec
import numpy as np
import matplotlib.pyplot as plt

### Standardization and inverse transformation using sklearn

Standardization consists of two steps.

* Centering the mean (the mean becomes 0).
* Scaling the variance (the variance becomes 1).


Standardization is applied to each column independently, so every column ends up with mean 0 and standard deviation 1. Note that this only shifts and rescales a column; it does not change the shape of its distribution.
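
The two steps above can be checked against `StandardScaler` itself. The sketch below uses a hypothetical array `demo` (a stand-in for the lecture's `data`) and compares the scaler's output with a manual computation; `mean_` and `scale_` are the fitted attributes that hold the per-column mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
demo = rng.normal(loc=1.0, scale=3.0, size=(30, 4))  # hypothetical stand-in for `data`

scaler = StandardScaler()
demo_scaled = scaler.fit_transform(demo)

# Manual version: centre each column, then divide by its population std (ddof=0).
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(np.allclose(demo_scaled, manual))
print(np.allclose(scaler.mean_, demo.mean(axis=0)))
print(np.allclose(scaler.scale_, demo.std(axis=0)))

# inverse_transform undoes the scaling: x = z * scale_ + mean_
restored = scaler.inverse_transform(demo_scaled)
print(np.allclose(restored, demo))
```

The standard deviation used here is the population one (`ddof=0`), which is also NumPy's default for `std`.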

In [2]:
data = np.random.randn(30, 4) * 3 + 1  # 30 samples, 4 features, drawn with mean 1 and std 3

In [3]:
data

array([[ 3.06704735, -0.42527931, -3.73118218,  1.68359769],
       [-1.1477499 , -1.31135066, -7.93709137,  1.48954002],
       [ 0.41252904, -3.34466299,  5.80735588, -1.26779775],
       [ 3.11064336, -3.53980013,  1.16909773, -1.44219196],
       [ 1.42869309, -0.05256714, -5.50733057,  0.16114609],
       [-1.29285012, -2.06498592,  6.82152767,  1.29159616],
       [-1.44876403, -3.09863312, -6.03431092,  0.29704696],
       [-5.40197207, -0.27729005,  0.85934617, -2.14119705],
       [ 3.28585635,  2.01873394, -6.25936775, -0.87398321],
       [-4.66358549, -3.51057435,  6.19122408, -0.95748436],
       [-1.26436722, -1.36999214,  2.05693612,  0.04560163],
       [ 0.88552068,  3.69656872,  1.21831224,  4.59176866],
       [-0.17564638,  6.65459102,  1.7844178 , -3.2470935 ],
       [ 2.75781144,  3.1205553 , -0.36811365,  4.11045494],
       [-0.91380537, -0.68436589, -6.839186  ,  2.34985202],
       [ 0.25141211,  5.13975599, -1.75772343,  3.45198479],
       [-0.77249465,  0.

In [13]:
data.std(axis=0)

array([3.10111586, 2.58138223, 2.83467788, 2.85750296])

In [14]:
std = StandardScaler()

In [15]:
data_n = std.fit_transform(data)  # fit on data, then standardize each column

In [16]:
data_n

array([[ 0.44352074,  0.00992653, -0.74118271,  0.74850431],
       [-0.61089722,  0.12367808,  1.15590952, -1.06467252],
       [ 0.07283776,  1.05506955, -0.87936683, -1.38182642],
       [ 0.38951615,  0.94615321, -1.63039494, -0.23953292],
       [ 0.37732286, -1.08215241,  2.15864238,  0.64034304],
       [-2.83233896, -0.41273491,  0.2947669 , -0.68481087],
       [ 1.64272292, -0.89896435, -1.83113262,  0.56077623],
       [ 0.50745092, -0.33858374, -0.26683716,  0.55442794],
       [-0.09692605,  0.36875628,  0.90069506,  1.18045959],
       [ 0.62767106,  1.39061447,  0.44630222, -0.27809114],
       [-0.00987112,  0.31010379, -0.93698492,  2.24589872],
       [ 0.47660112,  0.65518326,  1.04382931,  0.27364609],
       [ 0.70281006,  0.20306403,  0.70095004,  0.11593608],
       [ 1.01650186, -1.28403625,  0.26882762, -0.1258953 ],
       [-0.74717079,  0.78255253,  0.50213918,  1.06021717],
       [-0.01040599,  0.15492097,  0.66220064, -0.8062387 ],
       [-0.32027703, -0.

In [17]:
data_n.std(axis=0)

array([1., 1., 1., 1.])

In [18]:
origin_data = std.inverse_transform(data_n)  # map the standardized values back to the original scale

In [20]:
origin_data

array([[ 2.59926934,  1.24932397, -1.19465453,  3.33336576],
       [-0.67060292,  1.54296021,  4.18299086, -1.8477924 ],
       [ 1.44973848,  3.94723758, -1.58636201, -2.75406061],
       [ 2.43179485,  3.66608289, -3.71528476,  0.51004645],
       [ 2.39398206, -1.56974918,  7.02541551,  3.02429459],
       [-7.55955111,  0.15827324,  1.74192892, -0.76233663],
       [ 6.31813425, -1.09687079, -4.28431144,  2.79693219],
       [ 2.79752424,  0.34968577,  0.1499623 ,  2.77879194],
       [ 0.92328125,  2.17560071,  3.45954007,  4.56767924],
       [ 3.17034083,  4.81340729,  2.17148274,  0.3998662 ],
       [ 1.19324865,  2.02419623, -1.74969072,  7.61217469],
       [ 2.70185543,  2.91497823,  3.86527955,  1.97645699],
       [ 3.40335559,  1.74788569,  2.89332729,  1.52580016],
       [ 4.37615019, -2.09088854,  1.66839942,  0.83476626],
       [-1.09320305,  3.24376701,  2.32976253,  4.22408617],
       [ 1.19158996,  1.62361006,  2.78348522, -1.10931701],
       [ 0.23064397, -0.

### Normalization to [0, 1] and inverse transformation using sklearn

In [21]:
mm = MinMaxScaler()  # create a MinMaxScaler object
mm_data = mm.fit_transform(data)  # normalize the data
origin_data = mm.inverse_transform(mm_data)  # map the normalized values back to the original scale

In [22]:
mm_data

array([[0.67473217, 0.56479822, 0.27318581, 0.61861095],
       [0.45755263, 0.58808603, 0.74867434, 0.15679154],
       [0.59838222, 0.7787653 , 0.23855124, 0.07601187],
       [0.66360879, 0.75646739, 0.05031303, 0.36695608],
       [0.66109733, 0.34122218, 1.        , 0.59106207],
       [0.        , 0.47826879, 0.53283695, 0.25354298],
       [0.92173308, 0.37872538, 0.        , 0.57079626],
       [0.68789994, 0.4934494 , 0.39207611, 0.56917934],
       [0.5634158 , 0.6382597 , 0.6847072 , 0.72863073],
       [0.71266181, 0.8474598 , 0.57081786, 0.35713523],
       [0.58134659, 0.62625206, 0.22410981, 1.        ],
       [0.68154577, 0.69689852, 0.72058247, 0.4976637 ],
       [0.72813826, 0.60433833, 0.63464297, 0.45749468],
       [0.79274968, 0.29989148, 0.52633551, 0.39589979],
       [0.42948422, 0.72297421, 0.58481288, 0.69800477],
       [0.58123642, 0.59448224, 0.6249308 , 0.2226151 ],
       [0.51741197, 0.40743363, 0.27240277, 0.        ],
       [0.46108796, 0.45991531,
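
As a rough check of what `MinMaxScaler` computes, the sketch below compares it with the manual min-max formula on a hypothetical array `demo` (not the lecture's `data`). It also shows the `feature_range` parameter, which selects a target interval other than the default [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
demo = rng.normal(loc=1.0, scale=3.0, size=(30, 4))  # hypothetical stand-in for `data`

mm_demo = MinMaxScaler()  # default feature_range=(0, 1)
demo_mm = mm_demo.fit_transform(demo)

# Manual version: (x - min) / (max - min), computed per column.
col_min, col_max = demo.min(axis=0), demo.max(axis=0)
manual = (demo - col_min) / (col_max - col_min)

print(np.allclose(demo_mm, manual))
print(np.allclose(mm_demo.data_min_, col_min), np.allclose(mm_demo.data_max_, col_max))

# inverse_transform maps the normalized values back to the original scale.
print(np.allclose(mm_demo.inverse_transform(demo_mm), demo))

# A different target interval, here [-1, 1].
wide = MinMaxScaler(feature_range=(-1, 1)).fit_transform(demo)
print(wide.min(axis=0), wide.max(axis=0))
```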

## Encoding Categorical Variables

Categorical variables are data that represent categories or group membership, usually stored as strings or characters. Most machine learning models work only on numerical data, so these variables have to be encoded in numerical form.

There are several methods for encoding categorical variables.
Here we discuss the `get_dummies` method from the pandas library.
This method creates dummy variables for categorical data: each category that appears in a column becomes its own 0/1 column.
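
To make the idea concrete, the sketch below builds the dummy columns by hand and compares them with `pd.get_dummies`. The names `manual` and `auto` are illustrative. `dtype=int` is passed explicitly because recent pandas versions (2.0 and later) return boolean columns by default, whereas the 0/1 outputs shown in this lecture come from integer columns.

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abca"))

# One column per category: 1 where the value equals that category, else 0.
categories = sorted(s.unique())
manual = pd.DataFrame({cat: (s == cat).astype(int) for cat in categories})

# The same encoding produced by pandas.
auto = pd.get_dummies(s, dtype=int)

print(list(auto.columns) == categories)
print(np.array_equal(manual.to_numpy(), auto.to_numpy()))
```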

In [24]:
import pandas as pd

In [25]:
s = pd.Series(list('abca'))
print(pd.get_dummies(s))

   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0


In [26]:
s1 = ['a', 'b', np.nan]
print(pd.get_dummies(s1))

   a  b
0  1  0
1  0  1
2  0  0


In [27]:
print(pd.get_dummies(s1, dummy_na=True))

   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1


In [28]:
df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
                   'C': [1, 2, 3]})
print(df)

   A  B  C
0  a  b  1
1  b  a  2
2  a  c  3


In [29]:
print(pd.get_dummies(df))

   C  A_a  A_b  B_a  B_b  B_c
0  1    1    0    0    1    0
1  2    0    1    1    0    0
2  3    1    0    0    0    1
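
In the output above, the string columns `A` and `B` are encoded while the numeric column `C` is left unchanged. When only some columns should be encoded, the `columns` parameter selects them, and `prefix` changes the prefix of the generated column names. The sketch below uses a copy of the frame under the illustrative name `df_demo`; the prefix `colA` is likewise just an example.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": ["a", "b", "a"], "B": ["b", "a", "c"], "C": [1, 2, 3]})

# Encode only column A; B and C are passed through unchanged.
only_a = pd.get_dummies(df_demo, columns=["A"], dtype=int)
print(only_a.columns.tolist())

# Same encoding, with a custom prefix for the generated columns.
renamed = pd.get_dummies(df_demo, columns=["A"], prefix="colA", dtype=int)
print(renamed.columns.tolist())
```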


In [30]:
pd.get_dummies(pd.Series(list('abcaa')))


   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  1  0  0


In [31]:
pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)

   b  c
0  0  0
1  1  0
2  0  1
3  0  0
4  0  0
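
With `drop_first=True` the first category (`a`) gets no column of its own, and a row of all zeros stands for it. No information is lost, because every row of the full encoding contains exactly one 1. The sketch below, with illustrative names `full`, `reduced` and `recovered_a`, rebuilds the dropped column from the remaining ones.

```python
import pandas as pd

s = pd.Series(list("abcaa"))

full = pd.get_dummies(s, dtype=int)                      # columns a, b, c
reduced = pd.get_dummies(s, drop_first=True, dtype=int)  # columns b, c

# Each row of `full` sums to 1, so column a equals 1 minus the sum of the others.
recovered_a = 1 - reduced.sum(axis=1)
print((recovered_a == full["a"]).all())
```

Dropping one column in this way removes the exact linear dependence among the dummy columns, which matters for models such as linear regression with an intercept.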


In [32]:
pd.get_dummies(pd.Series(list('abc')), dtype=float)

     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
