Data preprocessing (Scaling) with Python.

Data preprocessing is the act of transforming raw data into a form that is more useful for analysis and modelling.
The main data preprocessing methods include scaling, normalization, binarization and standardization.
In today's case, we will use three Python libraries to illustrate data preprocessing:
1. Pandas (we will use its read_csv function to open the CSV file we are going to use).
2. Sklearn (its preprocessing and feature_selection modules provide the scalers and selectors we will use).
3. Numpy (for its set_printoptions function).

*Note: I use the terms modules and methods interchangeably in this post.


In [8]:
#Scaling

from pandas import read_csv
from numpy import set_printoptions
from sklearn import preprocessing

In [9]:
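#The CSV file is assumed to sit in the working directory.
path = 'pima-indians-diabetes.csv'
#read_csv treats the first line as column names by default; pass header=None if the file has no header row.
dataframe = read_csv(path)
#Convert the DataFrame to a NumPy array for scikit-learn.
array = dataframe.values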

Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1. It is used in machine learning to make model training less sensitive to the scale of the features, which can help the model converge to better weights and, in turn, lead to a more accurate model. To learn more about data normalization, visit https://www.educative.io/answers/data-normalization-in-python
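To make the rescaling concrete, here is a minimal sketch of the min-max formula, x' = (x - min) / (max - min), applied column by column with NumPy. The toy array and the helper name min_max_scale are made up for illustration; this is not the scikit-learn implementation, only the arithmetic behind the idea.

```python
import numpy as np

# Toy data (made up): two columns on very different scales.
toy = np.array([[1.0, 200.0],
                [5.0, 400.0],
                [9.0, 1000.0]])

def min_max_scale(data, low=0.0, high=1.0):
    """Rescale each column of `data` linearly into [low, high] (hypothetical helper)."""
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    span = col_max - col_min
    # Guard against constant columns, which would otherwise divide by zero.
    span[span == 0] = 1.0
    unit = (data - col_min) / span
    return unit * (high - low) + low

scaled = min_max_scale(toy)
print(scaled)
```

After scaling, both columns span 0 to 1, so neither dominates simply because its raw numbers are larger.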

In [10]:
#Use the MinMaxScaler to rescale each column to the range 0 to 1.
#This is data normalization.
data_scaler = preprocessing.MinMaxScaler(feature_range = (0,1))
data_rescaled = data_scaler.fit_transform(array)

print(data_rescaled)

[[0.05882353 0.42713568 0.54098361 ... 0.11656704 0.16666667 0.        ]
 [0.47058824 0.91959799 0.52459016 ... 0.25362938 0.18333333 1.        ]
 [0.05882353 0.44723618 0.54098361 ... 0.03800171 0.         0.        ]
 ...
 [0.29411765 0.6080402  0.59016393 ... 0.07130658 0.15       0.        ]
 [0.05882353 0.63316583 0.49180328 ... 0.11571307 0.43333333 1.        ]
 [0.05882353 0.46733668 0.57377049 ... 0.10119556 0.03333333 0.        ]]


The second preprocessing method is binarization. As the name suggests, it converts data into two states, usually denoted by 0 and 1: every value above a chosen threshold becomes 1 and every other value becomes 0. In our case the threshold is 0.5, so the outcome column keeps its meaning after the data is binarized: a person whose diabetes result is positive is a 1 and a diabetes-free individual is a 0.
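The thresholding rule can be sketched in a couple of lines of NumPy. The toy values below are made up, and the comparison is only an illustration of the rule (strictly greater than the threshold maps to 1), not the library's own code.

```python
import numpy as np

# Toy data (made up), including a value exactly on the threshold.
toy = np.array([[0.0, 0.5, 0.51],
                [3.0, -1.0, 148.0]])

threshold = 0.5
# Values strictly above the threshold become 1.0; everything else becomes 0.0.
binarized = (toy > threshold).astype(float)
print(binarized)
```

Note that a value equal to the threshold (0.5 here) maps to 0, and that any large raw measurement maps to 1 regardless of its size.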

In [20]:
from pandas import read_csv
from sklearn.preprocessing import Binarizer
path = 'pima-indians-diabetes.csv'
df = read_csv(path)
array = df.values
#Values above the threshold become 1; values at or below it become 0.
binarizer = Binarizer(threshold = 0.5).fit(array)
Data_binarized = binarizer.transform(array)

print(Data_binarized)

[[1. 1. 1. ... 0. 1. 0.]
 [1. 1. 1. ... 1. 1. 1.]
 [1. 1. 1. ... 0. 1. 0.]
 ...
 [1. 1. 1. ... 0. 1. 0.]
 [1. 1. 1. ... 0. 1. 1.]
 [1. 1. 1. ... 0. 1. 0.]]


In [27]:
#Standardization (most useful when the features are roughly Gaussian).
#Use the StandardScaler class to rescale each column to a mean of 0 and a standard deviation of 1.
from sklearn.preprocessing import StandardScaler
data_scaler = StandardScaler().fit(array)
data_rescaled = data_scaler.transform(array)

print(data_rescaled)

[[-0.84372629 -1.12208597 -0.16024856 ... -0.36426474 -0.18894038
  -0.73075304]
 [ 1.23423997  1.94447577 -0.26357823 ...  0.60470064 -0.1037951
   1.36845138]
 [-0.84372629 -0.99692019 -0.16024856 ... -0.91968415 -1.0403932
  -0.73075304]
 ...
 [ 0.343683    0.0044061   0.14974046 ... -0.68423462 -0.27408566
  -0.73075304]
 [-0.84372629  0.16086333 -0.47023757 ... -0.37030191  1.17338414
   1.36845138]
 [-0.84372629 -0.8717544   0.04641078 ... -0.47293375 -0.87010264
  -0.73075304]]
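As a rough sketch of what standardization computes, each column can be turned into z-scores, z = (x - mean) / std, by hand. The toy array and the helper name standardize are made up for illustration, and the sketch assumes the population standard deviation (ddof=0).

```python
import numpy as np

# Toy data (made up): two columns with different means and spreads.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

def standardize(data):
    """Return column-wise z-scores of `data` (hypothetical helper)."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)  # population standard deviation (ddof=0)
    # Guard against constant columns, which would otherwise divide by zero.
    std[std == 0] = 1.0
    return (data - mean) / std

z = standardize(toy)
print(z.mean(axis=0))  # approximately 0 for every column
print(z.std(axis=0))   # approximately 1 for every column
```

Unlike min-max scaling, the result is not bounded to a fixed range; values simply express how many standard deviations each entry lies from its column mean.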



FEATURE SELECTION TECHNIQUES AND WHY FEATURE SELECTION IS IMPORTANT.
Importance of feature selection.
1. Helps prevent overfitting.
2. Increases the efficiency of the model.
3. Reduces training time.
4. Helps models generalize better.
5. Minimizes collinearity while enhancing interpretability.
6. Helps avoid the curse of dimensionality.

In [31]:
#Univariate selection.
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [32]:
path = 'pima-indians-diabetes.csv'
df = read_csv(path)
array = df.values

#Separate array into input and output components.
X = array[:,0:8]
Y = array[:,8]

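#Score each feature against the target with the chi-squared test (it needs non-negative inputs) and keep the 4 highest-scoring features.
test = SelectKBest(score_func = chi2, k=4)
fit = test.fit(X,Y)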
set_printoptions(precision = 2)
print(fit.scores_)

featured_data = fit.transform(X)
print("\n Featured Data: \n", featured_data[0:4])

[ 110.73 1406.59   17.5    51.01 2219.4   127.67    5.36  178.01]

 Featured Data: 
 [[ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]
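In the output above, the four columns kept are the ones with the four highest scores (indices 1, 4, 5 and 7). To show where such scores can come from, here is a hedged sketch of a chi-squared score computed by hand: for each feature, the per-class sums (observed) are compared with what they would be if the feature were independent of the class (expected). The toy data and the helper name chi2_scores are made up, and the sketch assumes non-negative features with no all-zero column.

```python
import numpy as np

# Toy data (made up): 6 samples, 3 non-negative features, binary label.
X = np.array([[1.0, 5.0, 2.0],
              [2.0, 5.0, 0.0],
              [1.0, 4.0, 1.0],
              [8.0, 5.0, 1.0],
              [9.0, 6.0, 2.0],
              [7.0, 5.0, 0.0]])
y = np.array([0, 0, 0, 1, 1, 1])

def chi2_scores(features, labels):
    """Chi-squared score of each feature against the labels (hypothetical helper)."""
    classes = np.unique(labels)
    # One-hot encode the labels: shape (n_samples, n_classes).
    one_hot = (labels[:, None] == classes[None, :]).astype(float)
    # Observed: sum of each feature within each class.
    observed = one_hot.T @ features
    # Expected under independence: class frequency times the feature total.
    feature_total = features.sum(axis=0)
    class_prob = one_hot.mean(axis=0)
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

scores = chi2_scores(X, y)
print(scores)

# Keep the k highest-scoring columns, preserving their original order.
k = 2
top_k = np.sort(np.argsort(scores)[::-1][:k])
selected = X[:, top_k]
print(top_k)
print(selected[:2])
```

The first toy feature differs sharply between the two classes and gets the largest score, while the third feature has identical class sums and scores 0.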
