# Data pre-processing in Python

In this exercise we will learn how to perform the most basic data pre-processing operations (normalization, one-hot encoding, binarization) using the `scikit-learn` and `pandas` libraries.

In [1]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt

from sklearn import datasets, preprocessing

## Fisher dataset (Iris)

The `load_iris()` function creates an object representing the famous [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set).

In [2]:
iris = datasets.load_iris()

print('Feature names: ', iris.feature_names)
print('Decision classes: ', iris.target_names)

print('shape of the dataset: ', iris.data.shape)
print('shape of labels: ', iris.target.shape)

Feature names:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Decision classes:  ['setosa' 'versicolor' 'virginica']
shape of the dataset:  (150, 4)
shape of labels:  (150,)


In [3]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In this exercise we will be using [pandas](https://pandas.pydata.org) to store the data and temporary results. Pandas is a highly specialized library fully compatible with [Numerical Python](http://www.numpy.org) and [SciKit Learn](http://scikit-learn.org), the primary libraries for data mining and machine learning in Python. Two nice introductions to Pandas can be found here:

* [Introduction to Pandas in Python](https://medium.com/@wbusaka/a-gentle-introduction-to-pandas-5ed17421a59d), 
* [Quick introduction to Pandas Python library](https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673)

In [4]:
# create a new DataFrame object
df = pd.DataFrame(iris.data)

In [5]:
df

Unnamed: 0,0,1,2,3
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [6]:
# adding a new column (feature)
df['target'] = iris.target

In [7]:
df

Unnamed: 0,0,1,2,3,target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


In [8]:
df.columns

Index([0, 1, 2, 3, 'target'], dtype='object')

In [9]:
# changing the list of column names
df.columns = iris.feature_names + ['target']

# display first 10 rows
df.head(n=10)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
5,5.4,3.9,1.7,0.4,0
6,4.6,3.4,1.4,0.3,0
7,5.0,3.4,1.5,0.2,0
8,4.4,2.9,1.4,0.2,0
9,4.9,3.1,1.5,0.1,0
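
The steps above build the `DataFrame` by hand. As a possible shortcut, and assuming scikit-learn 0.23 or newer, `load_iris()` can return pandas objects directly. The names `iris_bunch` and `iris_df` below are illustrative only.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
iris_bunch = datasets.load_iris(as_frame=True)

# .frame holds the four named feature columns plus a 'target' column
iris_df = iris_bunch.frame

print(iris_df.shape)
print(list(iris_df.columns))
```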


Each column in the `DataFrame` object has the type `Series` and has [a very rich API](https://pandas.pydata.org/docs/reference/series.html). 

In [10]:
df['sepal length (cm)'].describe()

count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: sepal length (cm), dtype: float64

A very useful function is `apply()`, which lets you run *ad hoc* functions against the values of a column.

In [None]:
# round every value in the column to the nearest integer
df['sepal length (cm)'].apply(np.round)

In [None]:
# create a binary vector representing the results of evaluating a condition
df['sepal length (cm)'].head().apply(lambda x: x > 5.0)

In [None]:
# create a condition-based index for quick access to subsets of data
sepal_idx = df['sepal length (cm)'] > 7.0

In [None]:
sepal_idx

In [None]:
# display only rows that fulfill the condition
df[sepal_idx]
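
Condition-based indexes can also be combined. The sketch below uses a small made-up frame (`toy` and its column names are hypothetical, not part of the Iris data) to show two conditions joined with `&` and passed to `.loc`.

```python
import pandas as pd

# a tiny hypothetical frame, not the Iris data
toy = pd.DataFrame({
    'length': [4.9, 5.8, 7.1, 7.4, 6.3],
    'width':  [3.0, 2.7, 3.0, 2.8, 3.3],
})

# each comparison yields a boolean Series; combine them with & (and), | (or), ~ (not)
mask = (toy['length'] > 6.0) & (toy['width'] >= 3.0)

# .loc accepts the mask for rows and a list of labels for columns
selected = toy.loc[mask, ['length']]
print(selected)
```

The parentheses around each comparison are required, because `&` binds more tightly than `>` in Python.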

We will use the [Matplotlib](https://matplotlib.org) library to draw basic plots. There are many more advanced alternatives to `Matplotlib`, such as `Seaborn`, `Bokeh`, `plotnine`, but `Matplotlib` will be sufficient for our purposes.

In [None]:
x = df['sepal length (cm)'][:]
y = df['sepal width (cm)'][:]
t = df['target']

plt.scatter(x, y, c=t)
plt.show()

A similar effect can be achieved by calling the `plot()` method of the `pandas.DataFrame` object directly. In the following example `iloc` performs integer-position indexing: it selects all rows (`:`) and the second and third columns (`[1,2]`). Keep in mind that row and column positions start at 0.

In [None]:
df.iloc[:,1:3]

In [None]:
df.iloc[:,[1,2]].plot(kind='scatter', x=0, y=1)
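
The two `iloc` forms used above select the same columns. A minimal sketch on a made-up frame (`toy` is hypothetical) compares them with label-based `loc`:

```python
import pandas as pd

toy = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=['a', 'b', 'c', 'd'],
)

by_slice = toy.iloc[:, 1:3]      # positions 1 and 2; the stop position 3 is excluded
by_list = toy.iloc[:, [1, 2]]    # the same columns, listed explicitly
by_label = toy.loc[:, 'b':'c']   # label slices include both endpoints

print(by_slice.equals(by_list), by_slice.equals(by_label))
```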

## Normalization

The first operation is the linear normalization performed by the [MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) class. This class performs the following transformation:

$$v' = \frac{v-min}{max-min} * (max'-min') + min'$$

where $max,min$ are the maximum and minimum value of the attribute, $max',min'$ are the maximum and minimum value in the new scale, $v'$ is the new value of the attribute, and $v$ is the original value of the attribute.
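
The formula above can be checked by hand. The following sketch applies it to a small made-up column (`v`, `new_min` and `new_max` are illustrative names) and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn import preprocessing

# a small hypothetical feature column
v = np.array([[2.0], [4.0], [6.0], [10.0]])

new_min, new_max = 0.0, 1.0

# apply the formula directly, column by column
v_manual = (v - v.min(axis=0)) / (v.max(axis=0) - v.min(axis=0)) * (new_max - new_min) + new_min

# the same transformation through scikit-learn
v_sklearn = preprocessing.MinMaxScaler(feature_range=(new_min, new_max)).fit_transform(v)

print(v_manual.ravel())
print(np.allclose(v_manual, v_sklearn))
```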

We will be transforming only the features of iris flowers, not the labels (the last column), so in the first step we will store these four features in a separate `X` variable.

In [None]:
# get all rows and all but the last column into X
X = df.iloc[:, :-1]

In [None]:
X

In [None]:
# don't forget to remove the label column from the list of columns
cols = df.columns[:-1]

The following code performs normalization of the entire dataset. All preprocessors follow the pattern of `fit().transform()`.

In [None]:
norm = preprocessing.MinMaxScaler(feature_range=(0,1)).fit(X)
X_minmax = pd.DataFrame(norm.transform(X), columns=cols)

X_minmax.head(n=10)

In [None]:
X_minmax.describe()

In [None]:
x = X_minmax['sepal length (cm)'][:]
y = X_minmax['sepal width (cm)'][:]
t = df['target']

plt.scatter(x, y, c=t)
plt.show()
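
One reason for keeping `fit()` and `transform()` separate is that the parameters learned from one dataset can be reused on another. The sketch below, with made-up arrays `train` and `new`, illustrates this for `MinMaxScaler`.

```python
import numpy as np
from sklearn import preprocessing

# hypothetical training data and new data for a single feature
train = np.array([[0.0], [5.0], [10.0]])
new = np.array([[2.5], [12.0]])

scaler = preprocessing.MinMaxScaler().fit(train)   # learns min=0 and max=10
print(scaler.data_min_, scaler.data_max_)

# the learned range is reused, so values outside it land outside [0, 1]
new_scaled = scaler.transform(new)
print(new_scaled.ravel())

# fit_transform() is a shortcut for fit() followed by transform() on the same data
train_scaled = preprocessing.MinMaxScaler().fit_transform(train)
```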

## Standardization

Another type of normalization is *standardization*, a transformation after which the mean value of the feature is 0 and its standard deviation is 1. In the `scikit-learn` library standardization is performed by the [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) class, which performs the following transformation:

$$v' = \frac{v-\mu}{\sigma}$$

where $\mu$ is the mean, and $\sigma$ is the standard deviation of the feature.
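
As a quick check of the formula above, the sketch below standardizes a small made-up column `v` by hand and compares it with `StandardScaler`. It also prints the standard deviation computed in two ways.

```python
import numpy as np
from sklearn import preprocessing

# a small hypothetical feature column
v = np.array([[1.0], [2.0], [3.0], [4.0]])

# np.std uses ddof=0 (population standard deviation) by default
v_manual = (v - v.mean(axis=0)) / v.std(axis=0)
v_sklearn = preprocessing.StandardScaler().fit_transform(v)

print(np.allclose(v_manual, v_sklearn))
print(v_sklearn.std(ddof=0), v_sklearn.std(ddof=1))
```

`StandardScaler` divides by the population standard deviation, while pandas `describe()` reports the sample standard deviation (`ddof=1`), so a standardized column typically shows a `std` slightly above 1 in the `describe()` output.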

In [None]:
scale = preprocessing.StandardScaler().fit(X)
X_scaled = pd.DataFrame(scale.transform(X), columns=cols)

X_scaled.head()

In [None]:
X_scaled.describe()

In [None]:
x = X_scaled['sepal length (cm)'][:]
y = X_scaled['sepal width (cm)'][:]
t = df['target']

plt.scatter(x, y, c=t)
plt.show()

## Discretization 

An alternative to manual discretization is the automatic discovery of bin boundaries using the [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer) class. With `strategy='kmeans'`, it runs one-dimensional k-means clustering on the feature to find *k* cluster centers, and each value is assigned to the bin of its nearest center.

In [None]:
kbin = preprocessing.KBinsDiscretizer(n_bins=3, strategy='kmeans', encode='ordinal').fit(df[['sepal length (cm)']])

df_kbinned = pd.DataFrame(kbin.transform(df[['sepal length (cm)']]))

x = df['sepal length (cm)'][:]
y = df_kbinned[0]  # the single column of ordinal bin indices
t = df['target']

plt.scatter(x, y, c=t)
plt.show()
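
To see which boundaries were discovered, the fitted discretizer exposes them in `bin_edges_`. The sketch below prints them and, for comparison, performs a manual discretization with `pd.cut`. The manual boundaries are hand-picked for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn import datasets, preprocessing

lengths = datasets.load_iris().data[:, [0]]   # sepal length as a 2-D column

kbin = preprocessing.KBinsDiscretizer(n_bins=3, strategy='kmeans', encode='ordinal').fit(lengths)

# bin_edges_ holds one array of edges per feature: n_bins + 1 values
edges = kbin.bin_edges_[0]
print(edges)

# manual discretization with hand-picked (hypothetical) boundaries
manual = pd.cut(lengths.ravel(), bins=[4.0, 5.5, 6.5, 8.0], labels=False)
print(np.bincount(manual))
```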

## Binarization

Sometimes we need to transform a numerical attribute into a binary flag representing the result of a threshold test performed on the feature: values greater than the threshold become 1 and all other values become 0. This can be easily achieved using the [Binarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html) class.

In [None]:
binarize = preprocessing.Binarizer(threshold=3).fit(X)

X_binned = pd.DataFrame(binarize.transform(X), columns=cols)

pd.concat([df,X_binned], axis=1).head()
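
The test performed by `Binarizer` is a strict "greater than" comparison. A minimal sketch on a made-up array `values` shows the behavior at the threshold itself:

```python
import numpy as np
from sklearn import preprocessing

values = np.array([[2.9, 3.0, 3.1]])

flags = preprocessing.Binarizer(threshold=3).fit_transform(values)

# the comparison is strict: a value equal to the threshold maps to 0
print(flags)
print(np.array_equal(flags, (values > 3).astype(float)))
```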

## Displaying histograms

For simple counting of values in a feature we can use:
- `pandas.Series.value_counts()`
- `collections.Counter`

If we just want to draw the histogram, it is enough to use `pandas.Series.hist()` (or `pandas.DataFrame.hist()` to draw one histogram per column).

In [None]:
X_binned['sepal width (cm)'].value_counts()

In [None]:
from collections import Counter

Counter(X_binned['sepal width (cm)'].values)

In [None]:
X_binned.hist()

## Missing values imputation

Missing values can seriously impact the results of the analysis. Many data mining algorithms do not accept datasets with missing values present. The [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) class replaces the missing values with the mean, median, or mode of the feature.

In [None]:
from sklearn.impute import SimpleImputer

matrix = np.array([[ 1, 2, np.nan], [np.nan, 4, 5], [6, np.nan, 7]])

# available strategies include 'mean', 'median' and 'most_frequent'
imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(matrix)

print(matrix)
print(imp.transform(matrix))
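
To make explicit what the imputer does with `strategy='mean'`, the sketch below recomputes the column means by hand (ignoring `NaN`) and checks that filling the gaps with them gives the same matrix. The names `col_means` and `filled` are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

matrix = np.array([[1, 2, np.nan], [np.nan, 4, 5], [6, np.nan, 7]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean').fit(matrix)

# statistics_ stores the fill value learned for each column
print(imp.statistics_)

# the same column means computed by hand, ignoring NaN
col_means = np.nanmean(matrix, axis=0)
filled = np.where(np.isnan(matrix), col_means, matrix)

print(np.allclose(filled, imp.transform(matrix)))
```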

## Label encoding

A frequently used class is the [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder), which transforms a categorical feature into a set of binary flags. For a feature with *k* distinct values, *k* new binary features will be created which represent a bitmap index on the feature.

In [None]:
df_target = df['target'].values

print(df_target)

In [None]:
one_hot = preprocessing.OneHotEncoder(categories='auto').fit(df_target.reshape(-1,1))

# the result is a sparse matrix; toarray() converts it to a regular NumPy array
one_hot.transform(df_target.reshape(-1,1)).toarray()

In [None]:
one_hot.inverse_transform(np.array([[0,0,1]]))
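
The encoder works the same way on string categories. The sketch below uses a short made-up array `labels` to show how `categories_` determines the order of the generated columns and how `inverse_transform()` recovers the original values.

```python
import numpy as np
from sklearn import preprocessing

# hypothetical string labels; the encoder sorts the distinct values
labels = np.array(['setosa', 'virginica', 'versicolor', 'setosa']).reshape(-1, 1)

encoder = preprocessing.OneHotEncoder(categories='auto').fit(labels)

# categories_ gives the order of the generated columns
print(encoder.categories_[0])

encoded = encoder.transform(labels).toarray()
print(encoded)

# every row contains exactly one 1, and inverse_transform recovers the labels
decoded = encoder.inverse_transform(encoded)
```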

# Assignment

Refer to the documentation of the [Normalizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer) class which performs normalization of individual instances in the learning set. Perform normalization of the Iris set, while checking the effect of changing the value of the `norm` parameter used when initializing the class.

Hint: use the [DataFrame.sum()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html) method.