#### Python Data Preprocessing Techniques for Machine Learning Algorithms

Machine learning algorithms rarely work well on raw data. Before we can feed such data to an ML algorithm, we must preprocess it, that is, apply transformations that turn the raw data into a clean data set.

Let's discuss some methods of preprocessing.

@author: Muhammad Shifa

In [1]:
# import necessary packages

import numpy as np
import scipy as sc
import pandas as pd
import sklearn as sk

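# Any dataset works here; this example loads the red wine quality data from the UCI repository
dataFframe = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')
print(dataFframe)
# Return a NumPy representation of the DataFrame
array = dataFframe.values
print(array)
# Separate the data into input and output components.
# Note: this takes the first 8 columns as inputs and column 8 (pH) as the output;
# the quality label is the last column, array[:, 11].
x = array[:, 0:8]
y = array[:, 8]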


      fixed acidity  volatile acidity  citric acid  ...  sulphates  alcohol  quality
0               7.4             0.700         0.00  ...       0.56      9.4        5
1               7.8             0.880         0.00  ...       0.68      9.8        5
2               7.8             0.760         0.04  ...       0.65      9.8        5
3              11.2             0.280         0.56  ...       0.58      9.8        6
4               7.4             0.700         0.00  ...       0.56      9.4        5
...             ...               ...          ...  ...        ...      ...      ...
1594            6.2             0.600         0.08  ...       0.58     10.5        5
1595            5.9             0.550         0.10  ...       0.76     11.2        6
1596            6.3             0.510         0.13  ...       0.75     11.0        6
1597            5.9             0.645         0.12  ...       0.71     10.2        5
1598            6.0             0.310         0.47  ...       0.6

### a. Rescaling Data

When attributes have varying scales, we can rescale them so that they all share the same scale. Rescaling each attribute into the range 0 to 1 is often called normalization (or min-max scaling). We use the MinMaxScaler class from scikit-learn. Let’s see an example.

In [2]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0,1))
reScaledX = scaler.fit_transform(x)
np.set_printoptions(precision=3)
print(reScaledX)

[[0.248 0.397 0.    ... 0.141 0.099 0.568]
 [0.283 0.521 0.    ... 0.338 0.216 0.494]
 [0.283 0.438 0.04  ... 0.197 0.17  0.509]
 ...
 [0.15  0.267 0.13  ... 0.394 0.12  0.416]
 [0.115 0.36  0.12  ... 0.437 0.134 0.396]
 [0.124 0.13  0.47  ... 0.239 0.127 0.398]]


This gives us values between 0 and 1. Rescaling is useful for neural networks, for optimization algorithms, for algorithms that use distance measures, such as k-nearest neighbors, and for algorithms that weight inputs, such as regression.
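To make the transformation concrete, the sketch below applies the min-max formula, (value - column minimum) / (column maximum - column minimum), by hand and compares the result with MinMaxScaler. The `toy` array is made-up illustration data, not part of the wine dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [4.0, 500.0]])

# Manual min-max rescaling, computed column by column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
via_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy)

print(manual)
print(np.allclose(manual, via_sklearn))
```

Both approaches should agree: every column ends up with a minimum of 0 and a maximum of 1.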

### b. Standardizing Data

Standardizing takes attributes that have a Gaussian distribution with differing means and standard deviations and transforms them so that each has a mean of 0 and a standard deviation of 1. For this, we use the StandardScaler class. Let’s take an example.

In [3]:
from sklearn.preprocessing import StandardScaler
#Standardizing Data

scaler = StandardScaler().fit(x)
scaledX = scaler.transform(x)
print(scaledX)

[[-0.528  0.962 -1.391 ... -0.466 -0.379  0.558]
 [-0.299  1.967 -1.391 ...  0.873  0.624  0.028]
 [-0.299  1.297 -1.186 ... -0.084  0.229  0.134]
 ...
 [-1.16  -0.1   -0.724 ...  1.255 -0.197 -0.534]
 [-1.39   0.655 -0.775 ...  1.542 -0.075 -0.677]
 [-1.333 -1.217  1.022 ...  0.203 -0.136 -0.666]]
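As a quick check of what StandardScaler computes, the following sketch standardizes a small made-up array by hand, subtracting each column's mean and dividing by its standard deviation, and compares the result with the scaler's output. The `toy` data is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with different means and spreads per column
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

# Manual standardization: subtract the column mean, divide by the column std
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same transformation through scikit-learn
via_sklearn = StandardScaler().fit_transform(toy)

print(np.allclose(manual, via_sklearn))
print(via_sklearn.mean(axis=0))  # each column mean is (numerically) zero
print(via_sklearn.std(axis=0))   # each column standard deviation is one
```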


### c. Normalizing Data
Here, we rescale each observation (row) to a length of 1 (a unit norm). For this, we use the Normalizer class. Let’s take an example.

In [4]:
from sklearn.preprocessing import Normalizer
#Normalizing Data
scaler = Normalizer().fit(x)
normalizedX = scaler.transform(x)
print(normalizedX)
normalizedX[0:5,:6]

[[2.024e-01 1.914e-02 0.000e+00 ... 3.008e-01 9.299e-01 2.729e-02]
 [1.083e-01 1.222e-02 0.000e+00 ... 3.472e-01 9.306e-01 1.385e-02]
 [1.377e-01 1.342e-02 7.061e-04 ... 2.648e-01 9.533e-01 1.760e-02]
 ...
 [1.263e-01 1.023e-02 2.607e-03 ... 5.815e-01 8.020e-01 1.997e-02]
 [1.077e-01 1.178e-02 2.191e-03 ... 5.842e-01 8.033e-01 1.817e-02]
 [1.298e-01 6.704e-03 1.016e-02 ... 3.893e-01 9.083e-01 2.153e-02]]


array([[0.202, 0.019, 0.   , 0.052, 0.002, 0.301],
       [0.108, 0.012, 0.   , 0.036, 0.001, 0.347],
       [0.138, 0.013, 0.001, 0.041, 0.002, 0.265],
       [0.177, 0.004, 0.009, 0.03 , 0.001, 0.268],
       [0.202, 0.019, 0.   , 0.052, 0.002, 0.301]])
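One way to confirm that each row now has unit length is to compute the Euclidean norm of every row after the transformation. The sketch below does this on a small made-up array (the `toy` values are assumptions chosen so the arithmetic is easy to follow).

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy data; the first row has length 5, the third has length 10
toy = np.array([[3.0, 4.0],
                [1.0, 0.0],
                [6.0, 8.0]])

# Rescale every row to unit Euclidean (L2) length
normalized = Normalizer(norm="l2").fit_transform(toy)

# The length of each row after normalization
row_lengths = np.linalg.norm(normalized, axis=1)

print(normalized)
print(row_lengths)
```

Note that rows pointing in the same direction, such as the first and third here, become identical after normalization, because only their direction is kept.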

### d. Binarizing Data
Using a binary threshold, it is possible to transform our data by marking the values above it 1 and those equal to or below it, 0. For this purpose, we use the Binarizer class. Let’s take an example.

In [5]:
from sklearn.preprocessing import Binarizer
#Binarizing Data
scaler = Binarizer(threshold = 0.1).fit(x)
binarizedX = scaler.transform(x)
print(binarizedX)
binarizedX[0:5, : ]

[[1. 1. 0. ... 1. 1. 1.]
 [1. 1. 0. ... 1. 1. 1.]
 [1. 1. 0. ... 1. 1. 1.]
 ...
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 1. 1. 1.]]


array([[1., 1., 0., 1., 0., 1., 1., 1.],
       [1., 1., 0., 1., 0., 1., 1., 1.],
       [1., 1., 0., 1., 0., 1., 1., 1.],
       [1., 1., 1., 1., 0., 1., 1., 1.],
       [1., 1., 0., 1., 0., 1., 1., 1.]])

With a threshold of 0.1, this marks every value equal to or less than 0.1 as 0 and every other value as 1. This functionality comes in handy when you want to turn probabilities into crisp values.
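The probability use case can be sketched as follows. The `probs` values and the 0.5 cut-off are assumptions picked for illustration; note that a value exactly equal to the threshold maps to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities for four samples
probs = np.array([[0.1, 0.5, 0.51, 0.9]])

# Values strictly greater than the threshold become 1, all others become 0
crisp = Binarizer(threshold=0.5).fit_transform(probs)

print(crisp)
```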

### e. Mean Removal

We can remove the mean from each feature to center it on zero. The scale function used here also scales each feature to unit variance by default.

In [None]:
from sklearn.preprocessing import scale

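# Mean removal: scale() centers each column and, by default, also scales it to unit variance
data_standardized = scale(dataFframe)
data_mean = data_standardized.mean(axis = 0)
# Called without an axis, std() is computed over all values at once
data_std = data_standardized.std()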
print(data_mean)
print(data_std)

[ 3.555e-16  1.733e-16 -8.887e-17 -1.244e-16  3.910e-16 -6.221e-17
  4.444e-17  2.364e-14  2.862e-15  6.754e-16  1.066e-16  8.887e-17]
1.0
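If only the mean should be removed, without changing the spread of each feature, `scale` accepts `with_std=False`. The sketch below shows this on a small made-up array; the `toy` data is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical toy data
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

# Center each column on zero but leave its standard deviation untouched
centered = scale(toy, with_std=False)

print(centered.mean(axis=0))  # close to zero for every column
print(centered.std(axis=0))   # same spread as the original columns
```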


### f. One Hot Encoding
When a feature takes only a few distinct, scattered numerical values, we may not need to store the values themselves. Instead, we can perform One Hot Encoding: for k distinct values, the feature is transformed into a k-dimensional vector with a single 1 and 0s everywhere else.

In [None]:
#One Hot Encoding

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
# Fit the encoder and transform x in one step; the result is a sparse matrix
encodedX = encoder.fit_transform(x)
print(encodedX)
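One hot encoding is easiest to read on a feature with only a handful of distinct values. The sketch below uses a made-up single-column `colours` array (an assumption, not part of the wine data) and converts the sparse result to a dense array for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature with k = 3 distinct values
colours = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; toarray() makes it dense for display
encoded = encoder.fit_transform(colours).toarray()

print(encoder.categories_)  # the distinct values found, in sorted order
print(encoded)              # one row per sample, one column per distinct value
```

Each row contains exactly one 1, in the column of the value that sample holds.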

### g. Label Encoding
Labels can be words or numbers. Training data is usually labelled with words to make it readable. Label encoding converts word labels into numbers so that algorithms can work with them. Let’s take an example.

In [None]:
#Label Encoding

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
input_classes = ['Havells','Philips','Syska','Eveready','Lloyd']
label_encoder.fit(input_classes)
# classes_ holds the fitted labels in sorted order; each index is the encoded value
for i, item in enumerate(label_encoder.classes_):
    print(item, '-->', i)
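Once the encoder is fitted, `transform` maps word labels to numbers and `inverse_transform` maps them back. The sketch below refits the same class list so that it runs on its own; the `labels` list is a made-up sample for illustration.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(['Havells', 'Philips', 'Syska', 'Eveready', 'Lloyd'])

# Hypothetical labels to encode
labels = ['Syska', 'Eveready', 'Lloyd']

# Word labels -> integer codes (indices into the sorted classes_)
encoded = label_encoder.transform(labels)

# Integer codes -> original word labels
decoded = label_encoder.inverse_transform(encoded)

print(list(encoded))
print(list(decoded))
```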