<h2><font color="#004D7F" size=6>Module 3. Data Processing Module</font></h2>

<h1><font color="#004D7F" size=5>2. Data Transformation</font></h1>

<br><br>

---

<h2><font color="#004D7F" size=5>Index</font></h2>
<a id="index"></a>

* [1. Introduction](#section1)
    * [1.1. Libraries and CSV](#section11)
* [2. Transformations](#section2)
    * [2.1. Scaling](#section21)
    * [2.2. Standardization](#section22)
    * [2.3. Normalization](#section23)
    * [2.4. Binarization](#section24)
    * [2.5. Box-Cox](#section25)
    * [2.6. Yeo-Johnson](#section26)


<a id="section1"></a>
# <font color="#004D7F"> 1. Introduction</font>

Raw, unanalyzed data is unlikely to yield robust insights, because many analysis steps require the data to be in a specific form, so the dataset usually has to be transformed first. In addition, some algorithms perform better when the data is prepared in a particular way, for example, tree-based algorithms with nominal feature attributes. Preprocessing is therefore a fundamental part of any machine learning project.

<a id="section11"></a>
## <font color="#004D7F"> 1.1. Libraries and CSV</font>


For this practice, we will load the Pima Indians Diabetes dataset and apply different types of data transformations to it. For some transformations, we will also work with other datasets where the effect of the transformation is easier to see.

As for the libraries, we will import them as each transformation requires. Note that all of these transformations are performed with the **Scikit-learn** library.


In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

filename = 'data/pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names = names)
array = data.values

X = array[:, 0:8]  # Features: all rows, columns 0 through 7 (the slice end, 8, is excluded)
Y = array[:, 8]    # Target: all rows, last column (index 8)
print(X)

[[  6.  148.   72.  ...  33.6 627.   50. ]
 [  1.   85.   66.  ...  26.6 351.   31. ]
 [  8.  183.   64.  ...  23.3 672.   32. ]
 ...
 [  5.  121.   72.  ...  26.2 245.   30. ]
 [  1.  126.   60.  ...  30.1 349.   47. ]
 [  1.   93.   70.  ...  30.4 315.   23. ]]


---
<a id="section2"></a>
# <font color="#004D7F"> 2. Transformations</font>


Fitting first and then transforming is the preferred approach. Call the `fit()` function once to compute the transformation parameters from your data. Then call the `transform()` function to apply them to that same data to prepare it for modeling, and again to the test or validation dataset or to any new data you may see in the future. The combined `fit_transform()` function is a convenience for one-time tasks, such as plotting or summarizing the transformed data.

You can review the API at [**sklearn.preprocessing**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing). It offers a large number of functions that can be applied in this preprocessing phase, depending on the needs of the data.
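The fit-once, transform-many pattern described above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the synthetic matrix `X_demo` and the 75/25 split are assumptions chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Small synthetic feature matrix (illustrative only, not the Pima data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(low=0.0, high=100.0, size=(20, 3))

# Hold out 25% of the rows to play the role of unseen data
X_train, X_test = train_test_split(X_demo, test_size=0.25, random_state=0)

scaler = MinMaxScaler()
scaler.fit(X_train)                         # learn per-column min and max from the training rows only
X_train_scaled = scaler.transform(X_train)  # apply the learned parameters to the training rows
X_test_scaled = scaler.transform(X_test)    # reuse the same parameters on the held-out rows

# fit_transform() is shorthand for fit() followed by transform() on the same data
X_train_again = MinMaxScaler().fit_transform(X_train)
print(np.allclose(X_train_scaled, X_train_again))
```

Because the parameters come from the training rows alone, values in `X_test_scaled` may fall slightly outside $[0,1]$ when a held-out row is smaller or larger than anything seen during `fit()`.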

It is important to see how the data looks before and after transformation. In the following code, you can see how the original data looks and compare it with each transformation.


<a id="section21"></a>
## <font color="#004D7F"> 2.1. Scaling</font>


This transformation is useful for the optimization algorithms at the core of many machine learning methods, such as gradient descent. It also helps algorithms that weight their inputs, such as regression and neural networks, and algorithms that use distance measures, such as k-Nearest Neighbors. You can rescale your data using the [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) class.

After scaling, you can see that all values are in the range $[0,1]$.


In [15]:
# Rescale data (between 0 and 1)
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler whose output range is [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))

# Learn each column's min and max from X, then rescale X in a single step
rescaledX = scaler.fit_transform(X)

# Display NumPy arrays with three decimal places
np.set_printoptions(precision=3)

# Note: names also includes the target label 'class', which is not a column of X
print(names)

# Print the first 5 rows of the rescaled data, all columns
print(rescaledX[0:5, :])



['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
[[0.353 0.744 0.59  0.354 0.    0.501 0.269 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.151 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.289 0.183]
 [0.059 0.447 0.541 0.232 0.111 0.419 0.072 0.   ]
 [0.    0.688 0.328 0.354 0.199 0.642 0.982 0.2  ]]
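
With `feature_range=(0, 1)`, the rescaling applied to each column is $x' = (x - x_{min}) / (x_{max} - x_{min})$. As a small check, the sketch below computes this by hand with NumPy and compares it with `MinMaxScaler`. The array `X_demo` is a made-up example, not taken from the Pima dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up matrix: 3 rows, 2 feature columns
X_demo = np.array([[1.0, 10.0],
                   [2.0, 30.0],
                   [4.0, 50.0]])

# Per-column minimum and maximum
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# Min-max rescaling by hand: (x - min) / (max - min), column by column
manual = (X_demo - col_min) / (col_max - col_min)

# The same rescaling with scikit-learn
sk = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)

print(manual)
print(np.allclose(manual, sk))
```

In every column the smallest value maps to 0 and the largest to 1, which is why the rescaled Pima output above contains exact `0.` entries.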
