# Scikit Learn for Preprocessing


Preprocessing refers to the transformations applied to raw data to clean it and put it into a suitable form before feeding it to a learning algorithm. In general, many algorithms benefit from standardization, normalization, or scaling of the dataset.

scikit-learn is a Python module for machine learning built on top of SciPy. Its **preprocessing** package (`sklearn.preprocessing`) provides many preprocessing utilities such as standardization, normalization, and encoding of categorical features. Imputation of missing values is provided by the separate `sklearn.impute` package.

Resources : https://scikit-learn.org/stable/modules/preprocessing.html



In [0]:
from sklearn import preprocessing
import numpy as np
import pandas as pd

In [0]:
# Load the UCI wine dataset; the file has no header row, so columns are numbered 0 to 13
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1,14.23,1.71,2.43,15.6,127,2.80,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.20,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.40,1050
2,1,13.16,2.36,2.67,18.6,101,2.80,3.24,0.30,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.50,16.8,113,3.85,3.49,0.24,2.18,7.80,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.80,2.69,0.39,1.82,4.32,1.04,2.93,735
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,3,13.71,5.65,2.45,20.5,95,1.68,0.61,0.52,1.06,7.70,0.64,1.74,740
174,3,13.40,3.91,2.48,23.0,102,1.80,0.75,0.43,1.41,7.30,0.70,1.56,750
175,3,13.27,4.28,2.26,20.0,120,1.59,0.69,0.43,1.35,10.20,0.59,1.56,835
176,3,13.17,2.59,2.37,20.0,120,1.65,0.68,0.53,1.46,9.30,0.60,1.62,840


In [0]:
# Column 0 holds the class label; columns 1 to 13 are the features
x = df.iloc[:, 1:].values
y = df.iloc[:, 0].values
y

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3])

Each sample belongs to one of three classes, labeled 1, 2, and 3, which correspond to the three grape cultivars the wines were made from.

# Standardization
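Standardization is the process of rescaling each attribute (column) so that it has zero mean and unit standard deviation. Min-max scaling, shown afterwards, instead maps each column to the range [0, 1].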

In [0]:
from sklearn.preprocessing import StandardScaler
# Standardize each column: x_std = (x - mean) / std_deviation
scaler = StandardScaler().fit(x)  # fit learns the per-column mean and standard deviation
x_sc_trans=scaler.transform(x)
x_sc_trans[0]

array([ 1.51861254, -0.5622498 ,  0.23205254, -1.16959318,  1.91390522,
        0.80899739,  1.03481896, -0.65956311,  1.22488398,  0.25171685,
        0.36217728,  1.84791957,  1.01300893])

In [0]:
# Fitting and transforming can also be done in one step
std=StandardScaler()
x_std=std.fit_transform(x)
x_std[0]

array([ 1.51861254, -0.5622498 ,  0.23205254, -1.16959318,  1.91390522,
        0.80899739,  1.03481896, -0.65956311,  1.22488398,  0.25171685,
        0.36217728,  1.84791957,  1.01300893])

In [0]:
x_std.std(axis=0)    # every column now has a standard deviation of 1

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
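To make the formula in the comment concrete, here is a minimal sketch that reproduces what `StandardScaler` does using plain NumPy. The array `x_toy` is made-up illustrative data, not the wine dataset, so the cell runs without a download.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative values, not the wine dataset): 4 samples, 2 features
x_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

scaler = StandardScaler().fit(x_toy)

# Manual standardization: subtract the column mean and divide by the column
# standard deviation (population std, ddof=0, which is NumPy's default)
x_manual = (x_toy - x_toy.mean(axis=0)) / x_toy.std(axis=0)

print(np.allclose(scaler.transform(x_toy), x_manual))  # both approaches agree
print(scaler.mean_)    # per-column means learned during fit
print(scaler.scale_)   # per-column standard deviations learned during fit
```

The fitted `mean_` and `scale_` attributes are what `transform` reuses, which is why the same fitted scaler can later be applied to new data.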

In [0]:
from sklearn.preprocessing import MinMaxScaler
# Min-max scaling: each column is mapped to the range [0, 1]
mnmx = MinMaxScaler()              # x_scaled = (x - min) / (max - min)
X_mnmx=mnmx.fit_transform(x)
X_mnmx[0]


array([0.84210526, 0.1916996 , 0.57219251, 0.25773196, 0.61956522,
       0.62758621, 0.57383966, 0.28301887, 0.59305994, 0.37201365,
       0.45528455, 0.97069597, 0.56134094])

In [0]:
X_mnmx.std(axis=0)

array([0.21303761, 0.22015882, 0.14629534, 0.17165823, 0.15480769,
       0.21520364, 0.21013691, 0.23415709, 0.18004696, 0.19724954,
       0.18530781, 0.25933819, 0.22398121])
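As the output above shows, min-max scaling does not produce unit standard deviation; it only bounds each column to [0, 1]. The sketch below checks the formula by hand on a small made-up array (`x_toy` is an assumption for illustration, not the wine data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (illustrative values): 3 samples, 2 features
x_toy = np.array([[1.0, 10.0],
                  [2.0, 40.0],
                  [5.0, 20.0]])

mnmx = MinMaxScaler()
x_scaled = mnmx.fit_transform(x_toy)

# Manual min-max scaling, column by column
col_min = x_toy.min(axis=0)
col_max = x_toy.max(axis=0)
x_manual = (x_toy - col_min) / (col_max - col_min)

print(np.allclose(x_scaled, x_manual))               # both approaches agree
print(x_scaled.min(axis=0), x_scaled.max(axis=0))    # each column spans 0 to 1

# inverse_transform maps the scaled values back to the original units
x_back = mnmx.inverse_transform(x_scaled)
print(np.allclose(x_back, x_toy))
```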

# Normalization
Normalization is the process of scaling individual samples to have unit norm. 

In [0]:
from sklearn.preprocessing import Normalizer
nl =  Normalizer()
# Default L2 norm: each value in a row is divided by sqrt(x1^2 + x2^2 + ... + xn^2),
# where x1..xn are the feature (column) values of that same row
x_nl = nl.fit_transform(x)
x[0]


array([1.423e+01, 1.710e+00, 2.430e+00, 1.560e+01, 1.270e+02, 2.800e+00,
       3.060e+00, 2.800e-01, 2.290e+00, 5.640e+00, 1.040e+00, 3.920e+00,
       1.065e+03])

In [0]:
x_nl[0]

array([1.32644724e-02, 1.59397384e-03, 2.26512072e-03, 1.45415157e-02,
       1.18382852e-01, 2.61001565e-03, 2.85237424e-03, 2.61001565e-04,
       2.13461994e-03, 5.25731723e-03, 9.69434383e-04, 3.65402190e-03,
       9.92738094e-01])
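Unlike the scalers, `Normalizer` works row by row. The following sketch, on a small made-up array, shows that each normalized row has an L2 norm of 1 and that the result matches dividing each row by its own norm.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy data (illustrative values): 3 samples, 2 features
x_toy = np.array([[3.0, 4.0],
                  [1.0, 0.0],
                  [6.0, 8.0]])

x_nl_toy = Normalizer(norm="l2").fit_transform(x_toy)

# Manual version: divide every row by its own Euclidean length
row_norms = np.linalg.norm(x_toy, axis=1, keepdims=True)
x_manual = x_toy / row_norms

print(x_nl_toy)                           # rows [3, 4] and [6, 8] both become [0.6, 0.8]
print(np.linalg.norm(x_nl_toy, axis=1))   # every row now has length 1
```

This also explains the wine output above: the last feature (1065 in the first row) is far larger than the others, so after normalization it takes almost all of the unit length (about 0.99).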

# Encoding categorical features
Categorical data that is represented as strings can be converted into the numeric format that learning algorithms need.

In [0]:
df = pd.DataFrame({
    'Height':[125,149,190,132,175,154],
    'Gender':['Male','Female','Male','Female','Male','Male']
})

In [0]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()     # encodes each distinct value as an integer from 0 to (number of unique values) - 1
df['encoded_gender'] = le.fit_transform(df.Gender)
df

Unnamed: 0,Height,Gender,encoded_gender
0,125,Male,1
1,149,Female,0
2,190,Male,1
3,132,Female,0
4,175,Male,1
5,154,Male,1


In a similar way, in the wine dataset the types of wine are encoded as 1, 2, and 3.
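To see which integer was assigned to which string, the fitted encoder can be inspected. This is a small sketch that reuses the gender values from the table above; the variable name `genders` is only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

genders = ['Male', 'Female', 'Male', 'Female', 'Male', 'Male']

le = LabelEncoder()
codes = le.fit_transform(genders)

# classes_ lists the distinct values in sorted order; the position of a
# value in this array is the integer code it receives
print(le.classes_)

# inverse_transform maps the integer codes back to the original strings
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Because the classes are sorted alphabetically, 'Female' gets 0 and 'Male' gets 1, which matches the `encoded_gender` column.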

In [0]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
y_ohe=ohe.fit_transform(df[['Gender']]).toarray()
y_ohe  

array([[0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.]])
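Each output column of the one-hot encoding corresponds to one category. The sketch below shows how to read that mapping from `categories_`, and one possible way to handle a category that was not seen during fitting. The value 'Other' and the `handle_unknown='ignore'` setting are assumptions added for illustration; they are not part of the example above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df_toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})

# handle_unknown='ignore' encodes unseen categories as an all-zero row
# instead of raising an error at transform time
ohe = OneHotEncoder(handle_unknown='ignore')
encoded = ohe.fit_transform(df_toy[['Gender']]).toarray()

print(ohe.categories_)   # column order of the encoded output
print(encoded)

# Hypothetical unseen category, used only to demonstrate the behavior
unseen = pd.DataFrame({'Gender': ['Other']})
unseen_encoded = ohe.transform(unseen).toarray()
print(unseen_encoded)
```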

# Imputation
Many datasets contain missing values, often encoded as blanks or NaNs. We can impute missing values with the column's mean, median, or most frequent value.

In [0]:
df = pd.DataFrame({
    'coloumn1':[1,2,3,4,np.nan,7,3,8,4,np.nan,1,5,9,2,np.nan,6,4,8,9,np.nan],
    'coloumn2':[3,4,1,np.nan,9,2,np.nan,1,4,5,2,4,7,np.nan,7,3,8,1,6,np.nan]
})
df

Unnamed: 0,coloumn1,coloumn2
0,1.0,3.0
1,2.0,4.0
2,3.0,1.0
3,4.0,
4,,9.0
5,7.0,2.0
6,3.0,
7,8.0,1.0
8,4.0,4.0
9,,5.0


In [0]:
from sklearn.impute import SimpleImputer
im = SimpleImputer(strategy='mean')
im.fit_transform(df)

array([[1.    , 3.    ],
       [2.    , 4.    ],
       [3.    , 1.    ],
       [4.    , 4.1875],
       [4.75  , 9.    ],
       [7.    , 2.    ],
       [3.    , 4.1875],
       [8.    , 1.    ],
       [4.    , 4.    ],
       [4.75  , 5.    ],
       [1.    , 2.    ],
       [5.    , 4.    ],
       [9.    , 7.    ],
       [2.    , 4.1875],
       [4.75  , 7.    ],
       [6.    , 3.    ],
       [4.    , 8.    ],
       [8.    , 1.    ],
       [9.    , 6.    ],
       [4.75  , 4.1875]])

In [0]:
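# strategy='median' and strategy='constant' work the same way.
# With 'most_frequent', ties are resolved by taking the smallest value (here 1 rather than 4 in the second column).
im = SimpleImputer(strategy='most_frequent')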
im.fit_transform(df)

array([[1., 3.],
       [2., 4.],
       [3., 1.],
       [4., 1.],
       [4., 9.],
       [7., 2.],
       [3., 1.],
       [8., 1.],
       [4., 4.],
       [4., 5.],
       [1., 2.],
       [5., 4.],
       [9., 7.],
       [2., 1.],
       [4., 7.],
       [6., 3.],
       [4., 8.],
       [8., 1.],
       [9., 6.],
       [4., 1.]])
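The value used to fill each column is stored on the fitted imputer. As a rough sketch, the example below uses the median strategy on a small made-up frame (`df_toy` and its column names are illustrative assumptions) and compares the stored fill values with NumPy's NaN-aware median.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (illustrative values) with one missing entry per column
df_toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 8.0],
    'b': [2.0, 4.0, np.nan, 6.0],
})

im = SimpleImputer(strategy='median')
filled = im.fit_transform(df_toy)

# statistics_ holds the fill value computed for each column during fit
print(im.statistics_)

# The same values computed directly, ignoring NaNs
manual_median = np.nanmedian(df_toy.values, axis=0)
print(manual_median)

print(filled)
```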

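# Polynomial Features
Polynomial features are non-linear combinations (powers and products) of the original features. Adding them lets a linear model, most commonly linear regression, fit non-linear relationships.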

In [0]:
from sklearn.preprocessing import PolynomialFeatures

X = pd.DataFrame({
     'x1':[1,2,3,5,8], 
     'x2':[4,5,6,7,1]
})
print(X) 
poly = PolynomialFeatures(2)  # degree 2: output columns are 1, x1, x2, x1^2, x1*x2, x2^2
poly.fit_transform(X)

   x1  x2
0   1   4
1   2   5
2   3   6
3   5   7
4   8   1


array([[ 1.,  1.,  4.,  1.,  4., 16.],
       [ 1.,  2.,  5.,  4., 10., 25.],
       [ 1.,  3.,  6.,  9., 18., 36.],
       [ 1.,  5.,  7., 25., 35., 49.],
       [ 1.,  8.,  1., 64.,  8.,  1.]])
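The six output columns can be rebuilt by hand, which makes their meaning explicit. This is a minimal sketch using the first three rows of the frame above as a NumPy array; `X_toy` and `X_manual` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# First three rows of the example above
X_toy = np.array([[1, 4],
                  [2, 5],
                  [3, 6]])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_toy)

# Build the same columns manually: bias, x1, x2, x1^2, x1*x2, x2^2
x1, x2 = X_toy[:, 0], X_toy[:, 1]
X_manual = np.column_stack([np.ones(len(X_toy)), x1, x2, x1**2, x1 * x2, x2**2])

print(np.allclose(X_poly, X_manual))  # both constructions agree

# powers_ gives the exponent of each input feature in each output column
print(poly.powers_)
```

Each row of `powers_` reads as (exponent of x1, exponent of x2), so `[1, 1]` is the interaction term x1*x2.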