<left><img width=100% height=100% src="img/itu_logo.png"></left>

## Lecture 03: Feature Pre-processing Approaches

### __Gül İnan__<br><br>Istanbul Technical University

## Video Games Data

In a survey of video game playing, 91 students completed questionnaires. The data available here are the students’ responses to the questionnaire.
If a question **was not answered or improperly answered**, then it was coded as **99**. 

The questions and the coding of their answers are given below:
   - `Time`: Number of hours played in the week prior to survey. 
   - `How often`: semesterly=0; monthly=1; weekly=2; daily=3.
   - `Sex`: female=0; male=1.
   - `Age`: Student’s age in years.
   - `Computer at home`:  no=0; yes=1.
   - `Hate math`:  no=0; yes=1.
   - `Work`: Number of hours worked in week prior to the survey.
   - `Own PC`:  no=0; yes=1.
   - `Grade expected`: C=0; B=1; A=2.

In [280]:
#import dataset
import pandas as pd
video_df = pd.read_csv("datasets/video.csv", sep = ";",  na_values="99", index_col=0)
video_df.head(16)

Unnamed: 0,time,freq,sex,age,home,math,work,own,grade
0,2.0,weekly,female,19,yes,no,10.0,yes,A
1,0.0,monthly,female,18,yes,yes,0.0,yes,C
2,0.0,monthly,male,19,yes,no,0.0,yes,B
3,0.5,monthly,female,19,yes,no,0.0,yes,B
4,0.0,semesterly,female,19,yes,yes,0.0,no,B
5,0.0,semesterly,male,19,no,no,12.0,no,B
6,0.0,semesterly,male,20,yes,yes,10.0,yes,B
7,0.0,semesterly,female,19,yes,no,13.0,no,B
8,2.0,daily,male,19,no,no,0.0,no,A
9,0.0,semesterly,male,19,yes,yes,0.0,yes,A
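The effect of `na_values="99"` in the `read_csv` call can be illustrated on a tiny in-memory file. This is only a sketch: the three rows below are made up (they are not taken from `video.csv`) and `demo_df` is a hypothetical name.

```python
import io
import pandas as pd

# Hypothetical three-row extract: column names mimic video.csv, values are made up
csv_text = "id;time;work\n0;2.0;10\n1;0.0;99\n2;99;0\n"

# Every cell equal to 99 is read as a missing value (NaN)
demo_df = pd.read_csv(io.StringIO(csv_text), sep=";", na_values="99", index_col=0)

print(demo_df)
print(demo_df.isna().sum())  # number of missing values per column
```

One caveat of this coding scheme is that a genuine answer of 99 hours would also be treated as missing.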


In [281]:
#check the shape of the data set
video_df.shape

(91, 9)

In [282]:
#get some info on variables
video_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 91 entries, 0 to 90
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    91 non-null     float64
 1   freq    78 non-null     object 
 2   sex     91 non-null     object 
 3   age     91 non-null     int64  
 4   home    91 non-null     object 
 5   math    91 non-null     object 
 6   work    88 non-null     float64
 7   own     91 non-null     object 
 8   grade   91 non-null     object 
dtypes: float64(2), int64(1), object(6)
memory usage: 7.1+ KB


In [283]:
#investigate data types of variables
video_df.dtypes

time     float64
freq      object
sex       object
age        int64
home      object
math      object
work     float64
own       object
grade     object
dtype: object


Prior to building a predictive model for `the number of hours played`:

  - The numerical features should be scaled (not necessary in this application, but we will do it anyway),
  - The **non-numerical answers** to these questions are required to be **coded numerically**, and
  - The **missing values** should be filled in.

## Feature Scaling

-   When the **numerical features** in a data set have **different measurement scales** (e.g., age, salary, and height with ranges of 18–100 years, 25,000–75,000 euros, and 1–2 meters, respectively), **features with much larger values may dominate** the objective function and prevent the model from learning properly from the other features.
-   Moreover, **several optimization algorithms** (gradient descent in particular) work much better (they are more **robust** and **numerically stable**, and they **converge faster**) when the features have a smaller, comparable range.
-   In this sense, to bring all the features in a data set within the same range, the features are often scaled by a **data-dependent transformation method** prior to model training and testing.
-  **Feature scaling** is a method used to **scale the range of features of data**. 

## Unsupervised Feature Scaling Approaches

-   There are many different ways of scaling features; however, we will only cover the two most popular ones:
    -   `Z-score standardization` and
    -   `Min-Max scaling`.


## Remember that:

We represent the feature matrix $\textbf{X}$ as follows:

$$
\textbf{X} = \begin{bmatrix}
x_{11} & \ldots & x_{1j} & \ldots & x_{1d} \\
\vdots & \vdots &    \vdots    &  \vdots      &  \vdots\\
x_{i1} & \ldots & x_{ij} & \ldots & x_{id} \\
\vdots & \vdots &    \vdots    &  \vdots      &  \vdots\\
x_{n1} & \ldots & x_{nj} & \ldots & x_{nd} \\
\end{bmatrix}=[\mathbf{X}_1,\ldots, \mathbf{X}_j, \ldots, \mathbf{X}_d],
$$

where the $j$th column of the feature matrix $\mathbf{X}$ is denoted by $\mathbf{X}_j$. Thus, the column vector $\mathbf{X}_j$ contains the $n$ observed values of the $j$th feature.


## Z-score standardization

-   The formula for `standardizing` a single data point $x_{ij}$ in a feature vector $\mathbf{X}_j$ is given below:

$$
\begin{equation}
x^{std}_{ij} = \frac{x_{ij}-\bar{x}_j}{s_j},
\end{equation}
$$

-   where $\bar{x}_j= \frac{\sum_{i=1}^{n} x_{ij}}{n}$ and $s_j= \sqrt\frac{\sum_{i=1}^{n} (x_{ij}-\bar{x}_j)^2}{(n-1)}$.
-   After standardizing a feature, it will have `zero mean` and `unit variance`, that is:

$$
\begin{eqnarray}
\bar{x}^{std}_j= \frac{\sum_{i=1}^{n} x^{std}_{ij}}{n}=0  \quad \text{and} \quad
s^{std}_j= \sqrt\frac{\sum_{i=1}^{n} (x^{std}_{ij}-\bar{x}^{std}_j)^2}{(n-1)}=1. \nonumber
\end{eqnarray}
$$

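- Standardization is implemented in scikit-learn's [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) transformer class, which ensures that each feature has mean 0 and variance 1, bringing all features to the same magnitude. Note that `StandardScaler` computes the standard deviation with the denominator $n$ rather than $n-1$, so its output differs from the formula above by a factor of $\sqrt{n/(n-1)}$.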
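The standardization formula can be checked by hand on a toy feature. The sketch below uses made-up ages (not the video games data) and hypothetical variable names; it compares the formula with the $n-1$ denominator against `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature (ages in years), stored as a single-column matrix
x = np.array([[18.0], [19.0], [19.0], [20.0], [24.0]])
n = x.shape[0]

# Manual z-scores with the sample standard deviation (denominator n - 1)
x_std_sample = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

# StandardScaler divides by the standard deviation computed with denominator n
x_std_sklearn = StandardScaler().fit_transform(x)

print(x_std_sample.ravel())
print(x_std_sklearn.ravel())

# The two results differ only by the constant factor sqrt(n / (n - 1))
print(np.allclose(x_std_sklearn, x_std_sample * np.sqrt(n / (n - 1))))
```

Both versions have zero mean; which of them has a standard deviation of exactly 1 depends on the denominator used to measure it.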

## Mean centering

-   The formula for `mean centering` a single data point $x_{ij}$ in a feature vector $\mathbf{X}_j$ is given below:

$$
\begin{equation}
x^{centered}_{ij} = x_{ij}-\bar{x}_j,
\end{equation}
$$

-   where $\bar{x}_j= \frac{\sum_{i=1}^{n} x_{ij}}{n}$.
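As a small illustration, mean centering can be done by hand or, as one possible route, with `StandardScaler(with_std=False)`, which subtracts the mean without rescaling. The values below are made up and the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical weekly work hours for five students
x_work = np.array([[0.0], [10.0], [12.0], [13.0], [0.0]])

# Manual mean centering: subtract the column mean from every entry
x_centered_manual = x_work - x_work.mean(axis=0)

# Same operation with scikit-learn: centering only, no division by the standard deviation
x_centered_sklearn = StandardScaler(with_std=False).fit_transform(x_work)

print(x_centered_manual.ravel())
print(np.allclose(x_centered_manual, x_centered_sklearn))
```

Centering shifts the feature so that its mean is zero but leaves its spread unchanged.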

## Min-max scaling

-   `Min-max scaling` squashes the features into the \[0, 1\] range. For a single data point $x_{ij}$ in a feature vector $\mathbf{X}_j$, it is given by:

$$
\begin{equation}
x^{scaled}_{ij} = \frac{x_{ij}-x_{min,j}}{x_{max,j}-x_{min,j}}.
\end{equation}
$$

- where $x_{min,j}$ and $x_{max,j}$ are the minimum and maximum values in the feature vector $\mathbf{X}_j$, respectively.
- Min-max scaling is implemented in scikit-learn's [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) transformer class, which shifts and rescales the data such that all features of the data it was fitted on lie exactly between 0 and 1.
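A minimal sketch of min-max scaling on made-up numbers (the arrays and names are hypothetical). It also shows that new data transformed with the learned minimum and maximum may fall outside \[0, 1\].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages used for fitting, and some new ages seen later
x_fit = np.array([[18.0], [19.0], [20.0], [24.0]])
x_new = np.array([[17.0], [22.0], [30.0]])

# Learn the minimum (18) and maximum (24) from x_fit only
minmax = MinMaxScaler().fit(x_fit)
x_fit_scaled = np.asarray(minmax.transform(x_fit))
x_new_scaled = np.asarray(minmax.transform(x_new))

# Manual version of the same formula
x_fit_manual = (x_fit - x_fit.min(axis=0)) / (x_fit.max(axis=0) - x_fit.min(axis=0))

print(x_fit_scaled.ravel())  # values in [0, 1]
print(x_new_scaled.ravel())  # 17 maps below 0 and 30 maps above 1
```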

## Other alternative scalers

- [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler), 
- [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer), and 
- [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler).
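The three alternatives can be compared on a small made-up matrix. This is only a sketch with default settings and hypothetical names: `MaxAbsScaler` divides each column by its maximum absolute value, `RobustScaler` subtracts the median and divides by the interquartile range, and `Normalizer` rescales each row (not each column) to unit Euclidean norm.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, Normalizer, RobustScaler

# Hypothetical feature matrix with one outlier (100) in the first column
X_toy = np.array([[1.0, -2.0],
                  [2.0, 0.0],
                  [3.0, 4.0],
                  [100.0, 2.0]])

X_maxabs = np.asarray(MaxAbsScaler().fit_transform(X_toy))  # column-wise, values in [-1, 1]
X_robust = np.asarray(RobustScaler().fit_transform(X_toy))  # column-wise, median becomes 0
X_unit = np.asarray(Normalizer().fit_transform(X_toy))      # row-wise, each row has norm 1

print(np.abs(X_maxabs).max(axis=0))
print(np.median(X_robust, axis=0))
print(np.linalg.norm(X_unit, axis=1))
```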

## CORRECT Supervised learning pipeline involving unsupervised feature pre-processing

-   Assume that you have only one numerical feature in your data set.
-   Split the data into a training set ($\mathbf{x}_{train},\mathbf{y}_{train}$) and a test set ($\mathbf{x}_{test},\mathbf{y}_{test}$).
-   Learn the parameters associated with the scaling approach (e.g., the mean and sample standard deviation for standardization, or the minimum and maximum values of the feature for min-max scaling) from the training feature ($\mathbf{x}_{train}$),
-  Transform the **feature vector** ($\mathbf{x}_{train}$) with the learned parameters and get $\mathbf{x}_{train}^{transformed}$,
-  Transform the **feature vector** ($\mathbf{x}_{test}$) with the same learned parameters and get $\mathbf{x}_{test}^{transformed}$,
-  Train the model on the pre-processed training data ($\mathbf{x}_{train}^{transformed},\mathbf{y}_{train}$), and
-  Evaluate it on the pre-processed test data ($\mathbf{x}_{test}^{transformed},\mathbf{y}_{test}$).

-   To guarantee that the **test error is an unbiased estimator of model performance**, all data-dependent **unsupervised pre-processing** operations should be determined using **only the training set** $\mathbf{x}_{train}$ and then merely applied to the test set $\mathbf{x}_{test}$.
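The steps above can be sketched end to end on synthetic data. Everything in this block is an assumption made for illustration: the simulated feature and target, the variable names, and the choice of `LinearRegression` as the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: one numerical feature and a noisy linear target (made up)
rng = np.random.default_rng(0)
x_demo = rng.normal(loc=20.0, scale=2.0, size=(100, 1))
y_demo = 0.5 * x_demo[:, 0] + rng.normal(scale=1.0, size=100)

# 1) Split first
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

# 2) Learn the scaling parameters from the training feature only
demo_scaler = StandardScaler().fit(x_tr)

# 3) Transform both sets with the parameters learned in step 2
x_tr_t = demo_scaler.transform(x_tr)
x_te_t = demo_scaler.transform(x_te)

# 4) Train on the transformed training data, 5) evaluate on the transformed test data
demo_model = LinearRegression().fit(x_tr_t, y_tr)
test_r2 = demo_model.score(x_te_t, y_te)
print(test_r2)
```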

## INCORRECT Supervised learning pipeline involving unsupervised feature pre-processing

-   In practice, one common approach is:
    -   Pre-process the **ENTIRE feature vector** ($\mathbf{x}$) before data splitting and get the transformed feature vector ($\mathbf{x}^{transformed}$),
    -   Split the transformed feature vector into ($\mathbf{x}_{train}^{transformed},\mathbf{x}_{test}^{transformed}$),
    -   Train the model on the pre-processed training data ($\mathbf{x}_{train}^{transformed},\mathbf{y}_{train}$), and
    -   Evaluate it on the pre-processed test data ($\mathbf{x}_{test}^{transformed},\mathbf{y}_{test}$).

## Data leakage

-   In the scenario described above, the scaling parameters are computed using the test observations as well, so the test error is **no longer** guaranteed to be an **unbiased estimate** of the generalization error.
-   When information from outside the training data is used to create a model, this is called the `data leakage` problem.
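A small sketch of how the leak arises when scaling happens before the split. The data are simulated and the names are hypothetical; the point is only that a scaler fitted on all rows has seen the test rows, so its parameters differ from those learned on the training rows alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Simulated skewed feature with 50 observations (made up)
rng = np.random.default_rng(1)
x_all = rng.exponential(scale=5.0, size=(50, 1))
x_tr_part, x_te_part = train_test_split(x_all, test_size=0.2, random_state=1)

# Leaky: parameters estimated from ALL rows, including the future test rows
leaky_scaler = StandardScaler().fit(x_all)

# Clean: parameters estimated from the training rows only
clean_scaler = StandardScaler().fit(x_tr_part)

print(leaky_scaler.n_samples_seen_, clean_scaler.n_samples_seen_)
print(leaky_scaler.mean_, clean_scaler.mean_)
```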


```
Data Leakage happens when the data you are using to train a machine learning algorithm happens to have the information you are trying to predict.

— Daniel Gutierrez, [Ask a Data Scientist: Data Leakage](https://insidebigdata.com/2014/11/26/ask-data-scientist-data-leakage/)
```

### Scikit-learn Transformer API

- The `transformer API` in scikit-learn is very similar to the `estimator API`; the main difference is that transformers are typically "unsupervised," meaning, **they don't make use of class labels or target values**.

![](img/transformer.png)
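The three core methods of a transformer can be sketched on a tiny made-up array (the names below are hypothetical). Note that no target vector is passed at any point.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_seen = np.array([[1.0], [2.0], [3.0]])  # data used to learn the parameters
X_later = np.array([[4.0]])               # data arriving afterwards

demo_transformer = StandardScaler()
demo_transformer.fit(X_seen)                         # learn mean_ and scale_ from X_seen
out_seen = np.asarray(demo_transformer.transform(X_seen))    # apply the learned parameters
out_later = np.asarray(demo_transformer.transform(X_later))  # reuse them on new data

# fit_transform is a shortcut for fit followed by transform on the same data
out_shortcut = np.asarray(StandardScaler().fit_transform(X_seen))

print(out_seen.ravel())
print(out_later.ravel())
print(np.allclose(out_seen, out_shortcut))
```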

## Applying Feature Scaling Approaches on Video Games Data

Let's take `student’s age in years` and the `number of hours worked` as predictors to build a model that predicts a student's `number of hours played`. We first create the feature matrix and target vector, and then split the data set into training and test sets.

**Step 1**. Arrange data into a features matrix and target vector and split it into train and test.

In [284]:
#Prepare X and y

import pandas as pd

video_X = video_df[["age", "work"]] #work involves missing values
video_y = video_df[["time"]]

In [285]:
video_X.info() #the work variable involves missing values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 91 entries, 0 to 90
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     91 non-null     int64  
 1   work    88 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 2.1 KB


In [286]:
from sklearn.model_selection import train_test_split

#Split 80:20
video_X_train, video_X_test, video_y_train, video_y_test = train_test_split(video_X, video_y, test_size=0.2, random_state=1400)

In [287]:
print(video_X_train.shape)
print(video_y_train.shape)
print(video_X_test.shape)
print(video_y_test.shape)

(72, 2)
(72, 1)
(19, 2)
(19, 1)


**Step 2.** Choose a class of pre-processing.  

For pre-processing, we need to import the class that implements the pre-processing. Import [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) class.

In [288]:
#import sklearn
#print(sklearn.__version__)

In [289]:
#!pip install scikit-learn --upgrade

In [290]:
from sklearn import set_config
set_config(transform_output="pandas")  #available in scikit-learn 1.2.1; otherwise transforms return NumPy arrays and we lose column names

In [291]:
from sklearn.preprocessing import StandardScaler

**Step 3**. Instantiate pre-processing class

Then instantiate the class. We will accept the default `with_mean=True`, `with_std=True`.

In [292]:
scaler = StandardScaler()   # create feature transformer object

**Step 4**. Fit the transformer to your training data.

We then fit the scaler using the `fit` method, applied to the training data. For the `StandardScaler`, the `fit` method computes the mean and the standard deviation of each feature on the training data set (scikit-learn uses the denominator $n$ for the standard deviation, not $n-1$).

In [293]:
scaler.fit(video_X_train)

In [294]:
#get the mean of each column in Xtrain
print(scaler.feature_names_in_)
scaler.mean_

['age' 'work']


array([19.54166667,  7.57142857])

In [295]:
import numpy as np
#get the standard deviation of each column in Xtrain (square root of the variance)
print(scaler.feature_names_in_)
np.sqrt(scaler.var_)

['age' 'work']


array([ 1.99956593, 10.84774029])

**Step 5**. Apply the transform method to your training data.

To apply the transformation that we learned, we use the `transform` method of the `scaler`.

In [296]:
video_X_train_scaled = scaler.transform(video_X_train)  #scaler.fit_transform() is the shortcut version

In [297]:
video_X_train_scaled.head() #hopefully, we have column labels

Unnamed: 0,age,work
79,-0.270892,0.223878
21,2.229651,0.223878
87,0.229216,0.592618
5,-0.270892,0.408248
86,-0.270892,-0.697973


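The transformed data has the same shape as the original data, but the features now have zero mean and unit standard deviation. (The `std` reported by pandas below is slightly above 1 because `describe()` uses the denominator $n-1$, whereas `StandardScaler` uses $n$.)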

In [298]:
print("Untransformed shape:", video_X_train.shape)
print("Transformed shape:", video_X_train_scaled.shape)

Untransformed shape: (72, 2)
Transformed shape: (72, 2)


In [299]:
video_X_train.describe().loc[["mean","std"]].round(4)

Unnamed: 0,age,work
mean,19.5417,7.5714
std,2.0136,10.9261


In [308]:
video_X_train_scaled.describe().loc[["mean","std"]].round(4)

Unnamed: 0,age,work
mean,-0.0,0.0
std,1.007,1.0072
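The value 1.007 for `age` is what one would expect from 72 training rows. The sketch below reproduces the effect on simulated data of the same size; the data and names are made up, and only the sample size matches the training set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n_rows = 72  # same size as the training set above
rng = np.random.default_rng(0)
demo_age = pd.DataFrame({"age": rng.normal(19.5, 2.0, size=n_rows)})  # simulated ages

# np.asarray makes the code work whether the scaler returns a DataFrame or an array
demo_scaled = np.asarray(StandardScaler().fit_transform(demo_age)).ravel()

print(round(demo_scaled.std(ddof=0), 4))          # denominator n, as used by StandardScaler
print(round(demo_scaled.std(ddof=1), 4))          # denominator n - 1, as used by describe()
print(round(np.sqrt(n_rows / (n_rows - 1)), 4))   # theoretical ratio between the two
```

The same reasoning with the 70 non-missing `work` values gives $\sqrt{70/69}\approx 1.0072$.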


In [301]:
video_X_train_scaled.info() #missing values are ignored in fit and remain missing (NaN) after transform

<class 'pandas.core.frame.DataFrame'>
Int64Index: 72 entries, 79 to 75
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     72 non-null     float64
 1   work    70 non-null     float64
dtypes: float64(2)
memory usage: 1.7 KB
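The 70 non-null `work` values show that the missing entries survive the transformation. A minimal sketch of this behaviour on a made-up column with one missing value (names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical work hours with one missing answer
x_with_nan = np.array([[0.0], [10.0], [np.nan], [20.0]])

nan_scaler = StandardScaler().fit(x_with_nan)   # NaN is disregarded when computing mean and std
x_with_nan_scaled = np.asarray(nan_scaler.transform(x_with_nan))

print(nan_scaler.mean_)            # mean of the three observed values
print(x_with_nan_scaled.ravel())   # the missing entry is still NaN
```

Scaling therefore does not replace imputation; the missing values still have to be filled in separately.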


**Step 6**. Apply the transform method to your test data.

To apply a model to the transformed data, we also need to transform the test set. This is done by calling the `transform` method on `X_test`.

In [302]:
# transform test data
video_X_test_scaled = scaler.transform(video_X_test)

In [303]:
# print test data properties after scaling
print("Untransformed shape:", video_X_test.shape)
print("Transformed shape:", video_X_test_scaled.shape)

Untransformed shape: (19, 2)
Transformed shape: (19, 2)


In [304]:
video_X_test_scaled.describe().loc[["mean","std"]].round(4)

Unnamed: 0,age,work
mean,-0.0603,-0.0988
std,0.5088,0.7072


One eye-catching point is that the means of the features in the test set are not 0 and, similarly, their standard deviations are not 1.
Keep in mind that `StandardScaler` transforms the test data with the mean and standard deviation learned from the training data, not from the test data.

Note that we SHOULD NEVER apply **scaler.fit_transform()** to the test data, since doing so would re-estimate the scaling parameters from the test set.
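A toy example of what goes wrong when `fit_transform` is applied to the test set (the numbers and names are made up): the same raw value ends up with different scaled values in the training and test sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_train_toy = np.array([[10.0], [20.0], [30.0]])
x_test_toy = np.array([[20.0], [30.0], [40.0]])

# Correct: fit on the training data, then only transform the test data
toy_scaler = StandardScaler().fit(x_train_toy)
train_ok = np.asarray(toy_scaler.transform(x_train_toy))
test_ok = np.asarray(toy_scaler.transform(x_test_toy))

# Wrong: the test data get their own mean and standard deviation
test_wrong = np.asarray(StandardScaler().fit_transform(x_test_toy))

# The raw value 20 appears in both sets
print(train_ok[1, 0], test_ok[0, 0])  # identical scaled values, as they should be
print(test_wrong[0, 0])               # a different value for the same raw input
```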

`An important side note:` Scaling does not change the correlation between variables, since each feature undergoes a linear transformation with a positive slope and correlation, as a measure of linear association, is invariant to such transformations. Please, see the example below:

In [305]:
video_X_train.corr()

Unnamed: 0,age,work
age,1.0,0.471771
work,0.471771,1.0


In [306]:
video_X_train_scaled.corr() 

Unnamed: 0,age,work
age,1.0,0.471771
work,0.471771,1.0
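The same invariance can be checked for more than one scaler on simulated data. The data frame below is made up and the names are hypothetical; `MinMaxScaler` is included as a second example of a positive linear transformation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Simulated, positively correlated features (not the video games data)
rng = np.random.default_rng(42)
toy_age = rng.normal(20.0, 2.0, size=200)
toy_work = 2.0 * toy_age + rng.normal(0.0, 5.0, size=200)
toy = pd.DataFrame({"age": toy_age, "work": toy_work})

# Rebuild DataFrames explicitly so the code does not depend on set_config
toy_z = pd.DataFrame(np.asarray(StandardScaler().fit_transform(toy)), columns=toy.columns)
toy_mm = pd.DataFrame(np.asarray(MinMaxScaler().fit_transform(toy)), columns=toy.columns)

r_raw = toy.corr().loc["age", "work"]
r_z = toy_z.corr().loc["age", "work"]
r_mm = toy_mm.corr().loc["age", "work"]
print(r_raw, r_z, r_mm)  # the three correlations agree up to floating-point error
```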


## References
- https://www.stat.berkeley.edu/users/statlabs/labs.html
- https://github.com/DCtheTall/introduction-to-machine-learning/blob/master/chapter3/scaling.py

## Further reading suggestion:

- https://www.atoti.io/articles/when-to-perform-a-feature-scaling/
- https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/
- https://medium.com/atoti/what-is-data-leakage-and-how-to-mitigate-it-5be11f6d2f94

- Chapters 3-4 suggested:
    
   https://michael-fuchs-python.netlify.app/2019/01/01/tag-archive/#data-pre-processing
   
-  Chapter 4 suggested:
   
   Müller, A.C., and Guido, S. (2016). Introduction to Machine Learning with Python: A Guide for Data Scientists. O’Reilly Media, Inc. [Available online at https://github.com/amueller/introduction_to_ml_with_python]. 

In [307]:
import session_info
session_info.show()