# Week 4 - Data Preprocessing

## Learning Objectives
+ Introduction to scikit-learn
+ Standardization and Normalization
+ Discretization
+ Data Encoding
+ Feature construction: generating polynomial features
+ Dimensionality Reduction using PCA

The contents of this tutorial are drawn from [scikit-tutorials](https://scikit-learn.org/stable/auto_examples/index.html#preprocessing), [preprocessing tutorial](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing), [column transformers tutorial](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py), and [pipelines tutorial](https://scikit-learn.org/stable/modules/compose.html#combining-estimators).

For this tutorial, you need the following packages installed:
```
conda install -c anaconda scikit-learn
```

In the previous tutorial, we learnt data visualization. However, in practice, data cleaning and visualization go hand in hand and are usually done together. In this tutorial, we will go over a few more preprocessing strategies.

# Dataset - Heart Disease Dataset

Let us work on the heart disease [dataset](https://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29). This dataset has 13 attributes and 1 label column (presence or absence of heart disease). In this tutorial, we are working only on data preprocessing, and are not concerning ourselves with any model or its predictions. For simplicity, we will therefore work on the whole dataset, treating it as the training set.

The columns of this dataset include: 
```
['age', 'sex', 'chest_pain','restBP','cholesterol','fast_sugar','rest_ECG','max_HR','exer_angina','oldpeak','slope','vessels','thal', 'disease']
```
The doc file accompanying this dataset has further details regarding the dataset. 

Let us first read the dataset using pandas.

As the "disease" column has the presence and absence of disease, we are not using that for our preprocessing today. The preprocessing we will be doing will only use features and focus on the following tasks:
1. Standardization and Normalization
2. Discretization
3. Encoding of Nominal and Ordinal Variables
4. Feature construction

The data has no missing values - this is given in the accompanying doc file. We can get quick descriptive statistics of the data using ```describe()```.
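The reading and summary steps above can be sketched as follows. This is a minimal illustration, not the tutorial's original cell: it assumes the data file is whitespace-delimited with no header row, and it reads three made-up stand-in rows from an in-memory string so that the snippet runs on its own. With the real file, a path would replace `io.StringIO(raw_text)`.

```python
import io

import pandas as pd

columns = ['age', 'sex', 'chest_pain', 'restBP', 'cholesterol', 'fast_sugar',
           'rest_ECG', 'max_HR', 'exer_angina', 'oldpeak', 'slope', 'vessels',
           'thal', 'disease']

# Made-up stand-in rows (NOT real records), used only to keep this runnable.
raw_text = """\
63 1 4 140 260 0 2 120 1 2.0 2 1 7 2
45 0 2 120 210 0 0 165 0 0.0 1 0 3 1
58 1 3 130 245 1 0 150 0 1.2 2 0 6 1
"""

# Assumption: whitespace-separated values, no header line in the file.
data = pd.read_csv(io.StringIO(raw_text), sep=r"\s+", header=None, names=columns)

# Keep only the 13 feature columns; the label is not used for preprocessing.
features = data.drop(columns="disease")

# Count, mean, std, min, quartiles and max for every numeric column.
summary = features.describe()

# Total number of missing cells across all feature columns.
n_missing = int(features.isna().sum().sum())

print(summary)
print("missing values:", n_missing)
```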

# Introduction to scikit-learn
The library [```scikit-learn```](https://scikit-learn.org/stable/index.html) is a part of the SciPy (Scientific Python) ecosystem, a set of libraries created for scientific computing. The first part of the name refers to this origin of the library (a SciPy toolkit, or "scikit"), while the second part refers to the discipline this library pertains to: Machine Learning. It is built on NumPy, and offers efficient, reusable code.

This library has the package [```sklearn.preprocessing```](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing), which provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for tasks such as classification, regression, etc.

Let us import the preprocessing package from sklearn and use the information from the doc file to categorize the columns according to the different types of data: ordinal, nominal, numeric, binary, and the column we wish to discretize.
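One possible way to write this grouping down is sketched below. The list names are hypothetical. Only the nominal columns (those that are one-hot encoded later) and the choice of `age` for discretization follow from this tutorial; the remaining assignments are assumptions that should be checked against the accompanying doc file.

```python
from sklearn import preprocessing

# Hypothetical grouping of the 13 feature columns by data type.
# Assumption: verify each assignment against the dataset's doc file.
numeric_columns = ['restBP', 'cholesterol', 'max_HR', 'oldpeak', 'vessels']
binary_columns = ['sex', 'fast_sugar', 'exer_angina']
ordinal_columns = ['slope']
nominal_columns = ['chest_pain', 'rest_ECG', 'thal']
discretize_columns = ['age']

all_feature_columns = (numeric_columns + binary_columns + ordinal_columns
                       + nominal_columns + discretize_columns)

# Sanity check: every feature appears in exactly one group.
print(len(all_feature_columns), len(set(all_feature_columns)))
```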

# Scaling and Normalization

Standardization involves rescaling the features such that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one. Feature scaling through standardization (or Z-score normalization) can be an important preprocessing step for many machine learning algorithms. If a feature has a variance that is orders of magnitude larger than others, it might end up dominating the estimator, which might not learn well from other features. 

The ```preprocessing``` module provides the [```StandardScaler```](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) utility class, which is a quick and easy way to perform the standardization on an array-like dataset. The scaled data has zero mean and a unit variance.

Normalization is the process of scaling individual samples to have unit norm, independently of the distribution of the samples. The ```preprocessing``` module has the [```Normalizer```](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer) utility class, which transforms individual samples to unit norm. 

Note that the terms standardization and scaling are used for feature-wise (column-wise) operations, while the term normalization is used for sample-wise (row-wise) operations.
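The difference between the two operations can be seen on a small array. The sketch below uses made-up values rather than the heart dataset: `StandardScaler` works down each column, while `Normalizer` works across each row.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Made-up values: 3 samples (rows) and 3 features (columns).
X = np.array([[120.0, 200.0, 1.0],
              [140.0, 260.0, 2.5],
              [160.0, 320.0, 4.0]])

# Standardization: each COLUMN ends up with mean 0 and standard deviation 1.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

# Normalization: each ROW is rescaled to have unit L2 norm.
normalizer = Normalizer(norm="l2")
X_normalized = normalizer.fit_transform(X)
row_norms = np.linalg.norm(X_normalized, axis=1)

print(col_means, col_stds)
print(row_norms)
```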

The ```fit```, ```transform``` and ```fit_transform``` methods are available on all standard *Transformers* in sklearn. This is a good time to understand the [sklearn convention](https://scikit-learn.org/stable/glossary.html) for these terms.

1. *Estimator*: An object which manages the estimation and decoding of a model. Estimators must provide a ```fit``` method, and should provide ```set_params``` and ```get_params```, although these are usually provided by inheritance from ```base.BaseEstimator```.
2. *Transformer*: An estimator supporting ```transform``` and/or ```fit_transform```. 
3. *Predictor*: An estimator supporting ```predict``` and/or ```fit_predict```. This encompasses classifier, regressor, outlier detector and clusterer.
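The convention above can be illustrated with a transformer on toy data. In this sketch, `fit` learns the statistics, `transform` applies them, and `fit_transform` does both in one call; new data is transformed with the statistics learnt from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_new = np.array([[5.0]])

# fit: learn the per-feature mean and scale from the training data.
scaler = StandardScaler()
scaler.fit(X_train)

# transform: apply the learnt statistics.
X_train_a = scaler.transform(X_train)

# fit_transform: the same two steps in a single call.
X_train_b = StandardScaler().fit_transform(X_train)

# New data is scaled using the TRAINING mean and scale, not its own.
X_new_scaled = scaler.transform(X_new)

# get_params comes from BaseEstimator and lists the constructor settings.
params = scaler.get_params()

print(scaler.mean_, scaler.scale_)
print(X_new_scaled, params)
```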

# Encoding Ordinal and Nominal Values

For ordinal data, we have the [```OrdinalEncoder```](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder). The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature. For our data, we do not actually need the ordinal encoder, as the data is already in integer form.

A common technique for encoding categorical variables is [```OneHotEncoder```](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder). It transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1 and all others 0. When ```handle_unknown='ignore'``` is specified and unknown categories are encountered during transform, no error is raised, but the resulting one-hot encoded columns for this feature will be all zeros.
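Both encoders can be tried on small made-up columns. In this sketch, the `severity` values and the category order passed to `OrdinalEncoder` are illustrative assumptions, and the integer codes for `chest_pain` are placeholders rather than values taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal feature: the order of the categories is stated explicitly.
severity = np.array([["low"], ["high"], ["medium"], ["low"]])
ordinal_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
severity_encoded = ordinal_encoder.fit_transform(severity)

# Nominal feature: one binary column per category seen during fit.
chest_pain = np.array([[1], [2], [3], [4], [2]])
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
# The encoder returns a sparse matrix by default; convert it for display.
chest_pain_encoded = onehot_encoder.fit_transform(chest_pain).toarray()

# A category that was never seen during fit becomes an all-zero row.
unseen = onehot_encoder.transform(np.array([[9]])).toarray()

print(severity_encoded.ravel())
print(chest_pain_encoded)
print(unseen)
```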

# Discretization

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. This is possible using [```KBinsDiscretizer```](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer). We can also change the encoding of the transformed result to ordinal using the ```encode``` parameter.

In our example, we can perform discretization on the age column. As the range of the column is 29 to 77, we can bin it into 5 bins to express different age groups. The ```strategy``` parameter defines how the bin widths are chosen. In our case, we just want equal-width bins, which effectively groups people into age bands of roughly 10 years.
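A sketch of this binning on a few made-up ages spanning 29 to 77 is given below. It assumes `strategy="uniform"` for equal-width bins, and shows both the ordinal and the dense one-hot encodings of the result.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up ages covering the same range as the age column (29 to 77).
ages = np.array([[29], [35], [41], [48], [54], [60], [67], [77]], dtype=float)

# Ordinal encoding: each age is replaced by its bin index (0 to 4).
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
age_bins = binner.fit_transform(ages)

# Equal-width edges: (77 - 29) / 5 = 9.6 years per bin.
bin_edges = binner.bin_edges_[0]

# Dense one-hot encoding: one column per bin.
onehot_binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense",
                                 strategy="uniform")
age_onehot = onehot_binner.fit_transform(ages)

print(bin_edges)
print(age_bins.ravel())
print(age_onehot.shape)
```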

# Feature Construction - constructing polynomial features

It is often useful to add complexity to the model by considering nonlinear features of the input data. [```PolynomialFeatures```](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures) allows us to generate higher order terms and interaction terms to consider this non-linearity. 

```include_bias=True``` is the default in the ```PolynomialFeatures``` transformer. The bias column is the feature in which all polynomial powers are zero (i.e. a column of ones, which acts as an intercept term in a linear model). We can set it to ```False``` for our tutorial.
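The effect of the degree and of `include_bias` is easiest to see on a single made-up sample with two features, a = 2 and b = 3. This is a small sketch, not the tutorial's original cell.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up sample with two features: a = 2, b = 3.
X = np.array([[2.0, 3.0]])

# Degree 2 without the bias column: a, b, a^2, a*b, b^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Default include_bias=True prepends a column of ones.
poly_bias = PolynomialFeatures(degree=2)
X_poly_bias = poly_bias.fit_transform(X)

print(X_poly)
print(X_poly_bias)
```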

# Putting it all together for data with mixed data types - Pipeline and ColumnTransformer

Our dataset contains heterogeneous data types. As we have applied different preprocessing steps to different columns, how do we bring it all together? A simple approach could be to stitch the results together in a new dataframe. The following code snippet combines the categorical encoding and binning results with the remaining columns.
```
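import pandas as pd

# Assumes the earlier steps produced: `data` (the raw DataFrame),
# `data_disc` (one-hot encoded age bins) and `data_cat` (one-hot encoded
# chest_pain, rest_ECG and thal, in that column order).
new_data = pd.DataFrame()

# Binned age: one column per bin (5 bins, as chosen in the Discretization section)
for i in range(5):
    new_data['age_'+str(i)] = data_disc[:,i]
new_data['sex'] = data.sex
# chest_pain: columns 0-3 of data_cat
for i in range(4):
    new_data['chest_pain_'+str(i)] = data_cat[:,i]
new_data['restBP'] = data.restBP
new_data['cholesterol'] = data.cholesterol
new_data['fast_sugar'] = data.fast_sugar
# rest_ECG: columns 4-6 of data_cat
for i in range(3):
    new_data['rest_ECG_'+str(i)] = data_cat[:,4+i]
new_data['max_HR'] = data.max_HR
new_data['exer_angina'] = data.exer_angina
new_data['oldpeak'] = data.oldpeak
new_data['slope'] = data.slope
new_data['vessels'] = data.vessels
# thal: columns 7-9 of data_cat
for i in range(3):
    new_data['thal_'+str(i)] = data_cat[:,7+i]

new_data.head()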
```

However, as the number of preprocessing steps increases and the steps change, this approach becomes difficult to scale. To rescue us from this difficulty, sklearn has the [```sklearn.compose```](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose) and [```sklearn.pipeline```](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline) packages.

[```Pipeline```](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) can be used to chain multiple fixed steps into one. So, for example, our preprocessing steps for numeric columns are fixed: Scaling, and doing polynomial feature creation. So, we can essentially encapsulate these into a pipeline.

The [```ColumnTransformer```](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) helps perform different transformations on different columns of the data, within a Pipeline that is safe from data leakage and that can be parametrized. A different transformation can be applied to each column, such as preprocessing for different types of data.

It is interesting to note that this ColumnTransformer can in turn be used in a Pipeline to perform, say, a classification task. Though we are only focussed on preprocessing in this tutorial, the code blocks written now can easily be integrated with further steps in your project, which makes the ```Pipeline``` and ```ColumnTransformer``` utilities extremely useful.
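The two utilities described above can be combined as in the following sketch. It runs on a small made-up DataFrame that reuses a few of the heart dataset's column names; the step names (`"scale"`, `"poly"`, `"numeric"`, `"nominal"`, `"binned"`) and the choice of which columns go to which transformer are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (KBinsDiscretizer, OneHotEncoder,
                                   PolynomialFeatures, StandardScaler)

# Made-up rows, only to keep the example self-contained.
toy = pd.DataFrame({
    "age": [29, 41, 52, 60, 68, 77],
    "sex": [1, 0, 1, 1, 0, 1],
    "chest_pain": [1, 2, 3, 4, 2, 4],
    "restBP": [120, 130, 140, 125, 150, 135],
    "cholesterol": [200, 240, 260, 210, 300, 280],
})

# Fixed chain of steps for numeric columns: scale, then add degree-2 terms.
numeric_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
])

# A different transformer for each group of columns.
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipeline, ["restBP", "cholesterol"]),
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["chest_pain"]),
        ("binned", KBinsDiscretizer(n_bins=5, encode="onehot-dense",
                                    strategy="uniform"), ["age"]),
    ],
    remainder="passthrough",  # columns not listed (here: sex) are kept as-is
    sparse_threshold=0,       # always return a dense array
)

processed = preprocessor.fit_transform(toy)

# 5 polynomial + 4 one-hot + 5 age bins + 1 passthrough = 15 columns.
print(processed.shape)
```

Because `preprocessor` is itself a transformer, it could be placed as the first step of a larger `Pipeline` that ends in a classifier.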

# Image Data and PCA (Feature Decomposition)

## Dataset

Now let us work on image data, as we have already explored tabular, hierarchical and array data in the previous tutorials. Let us use the [Olivetti dataset](https://scikit-learn.org/0.19/datasets/olivetti_faces.html). This dataset contains a set of face images of 40 different subjects. This dataset is available in sklearn itself. 

Remember that standardization involves rescaling the features such that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one. Let us standardize the images in this dataset using ```StandardScaler```. Feature Scaling and Processing plays an important role for PCA.

## Principal Component Analysis

Principal Component Analysis is used to decompose a multivariate dataset into a set of successive orthogonal components that explain a maximum amount of the variance. It is a technique which essentially helps us to reduce the dimensionality of our dataset. As PCA is interested in the components that maximize the variance, if one feature varies less than another only because of their respective scales, PCA might determine that the direction of maximal variance more closely corresponds with the larger-scale feature, which could be incorrect. As we have already scaled our dataset, we can proceed to performing PCA.

In [```sklearn.decomposition```](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition), we have the *transformer* [```PCA```](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA). It learns n_components in its ```fit``` method, and can be used on new data to project it on these components. 

Let us first find out how many components are sufficient to explain our data.
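One common way to answer this question is to look at the cumulative explained variance ratio. The sketch below does not download the Olivetti faces; as a stand-in it uses synthetic data generated from 3 hidden factors plus a little noise. The 95% threshold is also an assumption chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 20 features driven by 3 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

# Scale first so that no feature dominates only because of its units.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

print(cumulative.round(3))
print("components needed for 95% of the variance:", n_needed)
```

Plotting `cumulative` against the component index gives the usual curve from which a cut-off can be read by eye.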

As we have image data, however, we can actually view the orthogonal components that PCA has learnt. These are called **Eigenfaces**. A combination of these eigenfaces is usually sufficient to recreate the original sample.

Let us first perform PCA and get these orthogonal components, or Eigenfaces.
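The step can be sketched as follows. To stay self-contained, this example uses random 8x8 arrays as stand-in "images" instead of the Olivetti faces, and the number of components (10) is an arbitrary assumption. With real face images, each row of `pca.components_` reshaped to the image size is one eigenface.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in "images": 50 random 8x8 arrays, flattened to 64 features each.
rng = np.random.default_rng(0)
n_samples, height, width = 50, 8, 8
images = rng.random(size=(n_samples, height * width))

# Learn the first 10 orthogonal components.
n_components = 10
pca = PCA(n_components=n_components).fit(images)

# Each component has one value per pixel, so it can be viewed as an image.
eigenfaces = pca.components_.reshape((n_components, height, width))

# Project the images onto the components, then map back to pixel space.
codes = pca.transform(images)
reconstructed = pca.inverse_transform(codes)
reconstruction_error = float(np.mean((images - reconstructed) ** 2))

print(eigenfaces.shape, codes.shape)
print("mean squared reconstruction error:", reconstruction_error)
```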

Let us define a helper function to plot a gallery of eigenface images. We use the basics of creating multiple plots as learnt in the previous tutorial.
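A possible shape for such a helper is sketched below. The function name `plot_gallery`, its parameters, and the random stand-in components are all assumptions made for illustration; with real data, the rows of `pca.components_` and the actual image shape would be passed in instead.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_gallery(title, images, image_shape, n_col=3, n_row=2):
    """Plot flattened images on an n_row x n_col grid and return the figure."""
    fig, axes = plt.subplots(n_row, n_col, squeeze=False,
                             figsize=(2.0 * n_col, 2.3 * n_row))
    fig.suptitle(title)
    # zip stops at the shorter sequence, so extra images are simply ignored.
    for ax, image in zip(axes.ravel(), images):
        ax.imshow(image.reshape(image_shape), cmap="gray")
        ax.set_xticks([])
        ax.set_yticks([])
    return fig


# Stand-in components: 6 random 8x8 "images" flattened to 64 values each.
rng = np.random.default_rng(0)
fake_components = rng.normal(size=(6, 8 * 8))
fig = plot_gallery("Stand-in components", fake_components, (8, 8))
```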

# Practice Exercise (Optional)
1. From the heart dataset, randomly select a few columns and randomly select a few rows of these columns, and set the values of these to ```np.nan```. This creates a dataset with missing values.
2. Read the documentation of [```SimpleImputer```](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer). The various strategies in this can be used for different types of data. For each of the columns, a specific strategy can be used for imputation based on the type of data in that column. Do the missing value imputation accordingly.
3. Visualize the heart dataset, and see if you can find some outliers. How does scaling affect the visualizations? You can also try other standardization and normalization methods on the data and visualize their effects.