# Normalization vs Standardization - Quantitative analysis

## Table of Contents
* [0. Why are we here?](#why_are_we_here)
* [1. Out-of-the-box classifiers](#Out-of-the-box_classifier)
* [2. Classifier + Scaling](#Classifiers_Scaling)
* [3. Classifier + Scaling + PCA](#Classifiers_Scaling_pca)
* [4. Classifier + Scaling + PCA + Hyperparameter Tuning](#Classifiers_Scaling_pca_hyper_param)


<a id="why_are_we_here"></a>
# 0. Why are we here?

First, I was trying to understand the difference between Normalization and Standardization.<br>
I came across this excellent <a href="https://sebastianraschka.com/Articles/2014_about_feature_scaling.html">blog</a> that answered my question. <br>
There is also a great explanation by Hinton of why to scale features for neural networks <a href="https://www.youtube.com/watch?v=Xjtu1L7RwVM&list=PLoRl3Ht4JOcdU872GhiYWf6jwrk_SNhz9&index=26">here</a>.

So, now I know the difference between Normalization and Standardization. Is that it? No! <br>
I saw a lot of ML pipeline tutorials that use StandardScaler (usually called Z-score Standardization) or MinMaxScaler (usually called min-max Normalization) to scale features. However, when I checked Sklearn, I saw that there are many other scaling methods. Why does no one use the other scaling techniques? Are StandardScaler and MinMaxScaler really the best scaling methods? <br>
I didn't see any explanation in the tutorials about why or when to use each one of them, so I ran some experiments to check all of them, as a good data scientist should. <b>This is what this notebook is all about.</b>
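
To make the two familiar techniques concrete, here is a minimal sketch of what Z-score standardization and min-max normalization compute for a single feature. The values in `x` are made up for illustration and do not come from the dataset used in this notebook.

```python
import numpy as np

# Toy feature column (illustrative values, not taken from the notebook's dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# Z-score standardization: subtract the mean, divide by the standard deviation.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
standardized = (x - x.mean()) / x.std()

# Min-max normalization: map the minimum to 0 and the maximum to 1.
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized.round(3))
print(normalized.round(3))
```

After standardization the feature has zero mean and unit variance, while after min-max normalization it lies in the range [0, 1]. Note how the outlier (10.0) squeezes the other normalized values toward 0.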


Usually, I prefer more solid mathematical explanations, but I couldn't find one that covers more than these two scaling techniques and how each technique affects different well-known classifiers, so I ran a quantitative experiment.<br>
If you can point me to some mathematical reading, please email me at shayzm1@gmail.com <br>

If you find some mistakes or have proposals to improve the coverage or the validity of the experiment, please notify me.


## Project details

Like many DS projects, let's read some data and check several out-of-the-box classifiers.<br>
As a preprocessing step, I already calculated all the results (it takes some time). So we only load the results file and work with it.<br>


The classifiers I used are taken from the Sklearn library and denoted as:<br>
- 'LR', LogisticRegression <br>
- 'LDA', LinearDiscriminantAnalysis <br>
- 'KNN', KNeighborsClassifier<br>
- 'CART', DecisionTreeClassifier<br>
- 'NB', GaussianNB<br>
- 'SVM', SVC<br>
- 'RF', RandomForestClassifier<br>
- 'MLP', MLPClassifier (Multi Layer Perceptron - Neural Network)<br>


The scalers and normalizers I used are also taken from the Sklearn library and denoted as:<br>
- 'StandardScaler', StandardScaler<br>
- 'MinMaxScaler', MinMaxScaler<br>
- 'MaxAbsScaler', MaxAbsScaler<br>
- 'RobustScaler', RobustScaler<br>
- 'QuantileTransformer-Normal', QuantileTransformer(output_distribution='normal')<br>
- 'QuantileTransformer-Uniform', QuantileTransformer(output_distribution='uniform')<br>
- 'PowerTransformer-Yeo-Johnson', PowerTransformer(method='yeo-johnson')<br>
- 'Normalizer', Normalizer<br>


The code that produces the results can be found here:<br>
TODO: Add link to github

The dataset: TODO: Add explanation about the dataset

Experiment details: <br>
- I randomly split the data into train and test sets of 80% and 20% respectively. <br>
- Then I used only the train part. I left the test part for further results. <br>
- I do not discuss the results on the test set here. Usually, the test set should be kept hidden, and all of our conclusions about our classifiers should be derived only from the cross-validation scores.<br>
- All results are accuracy scores on 10-fold random cross-validation splits from the <b>train set</b>. <br>
 - In part 4, I performed nested cross-validation: an inner cross-validation with 5 random splits for hyperparameter tuning, and an outer CV with 10 random splits to get the model's score using the best parameters. In this part too, all data was taken only from the train set. <br>
- The same seed was used when needed for reproducibility.
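
The setup described above could look roughly like the following sketch. It is not the code that produced the results file: the synthetic data from `make_classification`, the `SEED` value, and the choice to place the scaler inside a pipeline are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0  # hypothetical seed; the notebook does not state the value it used

# Synthetic stand-in data (assumption); the real dataset is not used here.
X, y = make_classification(n_samples=500, n_features=10, random_state=SEED)

# 80%-20% train-test split; the test part is set aside and not used below.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

# With the scaler inside the pipeline, it is re-fitted on the training folds
# of every split, so no statistics leak from the validation fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 10-fold cross-validation with shuffled (random) splits on the train set only.
cv = KFold(n_splits=10, shuffle=True, random_state=SEED)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

cv_mean, cv_std = scores.mean(), scores.std()
print(f"CV_mean={cv_mean:.3f}, CV_std={cv_std:.3f}")
```

Swapping `StandardScaler()` for any other scaler, or `LogisticRegression` for any other classifier, gives one scaler and classifier combination of the kind reported in the results table.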

### Let's read the results file

In [14]:
import os
import pandas as pd


# results_file = "sonar_results.csv"
results_file = "weatherAUS_results.csv"

results_df = pd.read_csv(os.path.join("..","data","processed",results_file)).dropna().round(3)
results_df

Unnamed: 0,Dataset,Classifier_Name,CV_mean,CV_std,Test_score
0,weatherAUS,_LR,1.0,0.0,1.0
1,weatherAUS,StandardScaler_LR,0.994,0.001,0.994
2,weatherAUS,MinMaxScaler_LR,0.887,0.004,0.892
3,weatherAUS,MaxAbsScaler_LR,0.887,0.005,0.89
4,weatherAUS,RobustScaler_LR,1.0,0.0,1.0
5,weatherAUS,QuantileTransformer-Normal_LR,1.0,0.0,1.0
6,weatherAUS,QuantileTransformer-Uniform_LR,0.994,0.001,0.995
7,weatherAUS,PowerTransformer-Yeo-Johnson_LR,1.0,0.0,1.0
8,weatherAUS,Normalizer_LR,0.827,0.004,0.835
10,weatherAUS,_LR-PCA,0.85,0.005,0.852


<a id="Out-of-the-box_classifier"></a>
# 1. Out-of-the-box classifiers

In [15]:
import operator
results_df.loc[operator.and_(results_df["Classifier_Name"].str.startswith("_"), ~results_df["Classifier_Name"].str.endswith("PCA"))].dropna()

Unnamed: 0,Dataset,Classifier_Name,CV_mean,CV_std,Test_score
0,weatherAUS,_LR,1.0,0.0,1.0
20,weatherAUS,_LDA,0.88,0.003,0.883
40,weatherAUS,_KNN,0.89,0.004,0.893
60,weatherAUS,_CART,1.0,0.0,1.0
80,weatherAUS,_NB,0.95,0.003,0.951
100,weatherAUS,_SVM,0.78,0.004,0.78
120,weatherAUS,_RF,0.97,0.003,0.971
140,weatherAUS,_MLP,0.992,0.01,0.999


Nice results.
We can see that at the moment LR and CART are in the lead, each with a CV mean of 1.0, followed closely by MLP (0.992).

Now, let's see how different scaling methods change the scores for each classifier

<a id="Classifiers_Scaling"></a>
# 2. Classifiers+Scaling

In [16]:
import operator
import numpy as np


temp = results_df.loc[~results_df["Classifier_Name"].str.endswith("PCA")].dropna()
temp["model"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[1])
temp["scaler"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[0])

def df_style(val):
    return 'font-weight: 800'
    

pivot_t = pd.pivot_table(temp, values='CV_mean', index=["scaler"], columns=['model'], aggfunc=np.sum)
pivot_t_bold = pivot_t.style.applymap(df_style,
                      subset=pd.IndexSlice[pivot_t["CART"].idxmax(),"CART"])
for col in list(pivot_t):
    pivot_t_bold = pivot_t_bold.applymap(df_style,
                      subset=pd.IndexSlice[pivot_t[col].idxmax(),col])
pivot_t_bold

model,CART,KNN,LDA,LR,MLP,NB,RF,SVM
scaler,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
,1,0.89,0.88,1.0,0.992,0.95,0.97,0.78
MaxAbsScaler,1,0.84,0.88,0.887,0.998,0.95,0.97,0.868
MinMaxScaler,1,0.845,0.88,0.887,0.998,0.95,0.97,0.869
Normalizer,1,0.893,0.884,0.827,0.999,0.942,0.948,0.78
PowerTransformer-Yeo-Johnson,1,0.974,0.961,1.0,1.0,0.978,0.97,0.997
QuantileTransformer-Normal,1,0.917,0.884,1.0,0.999,0.891,0.97,0.985
QuantileTransformer-Uniform,1,0.919,0.884,0.994,0.999,0.916,0.97,0.994
RobustScaler,1,0.991,0.88,1.0,1.0,0.95,0.97,0.998
StandardScaler,1,0.89,0.88,0.994,1.0,0.95,0.97,0.978


In [17]:
# Print table for the Medium article

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 100000)
pd.options.display.max_rows
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated in newer pandas

dict2 = {'StandardScaler': "StandardScaler",
'MinMaxScaler':"MinMaxScaler",
'MaxAbsScaler':"MaxAbsScaler",
'RobustScaler':"RobustScaler",
'QuantileTransformer-Normal':"QuantileTransformer(output_distribution='normal')",
'QuantileTransformer-Uniform':"QuantileTransformer(output_distribution='uniform')",
'PowerTransformer-Yeo-Johnson':"PowerTransformer(method='yeo-johnson')",
'Normalizer':"Normalizer"}

scalers_df = pd.DataFrame(list(dict2.items()), columns=["Name","Sklearn_Class"])
s = scalers_df.style.set_properties(subset=["Name", "Sklearn_Class"], **{'text-align': 'left'})
s.set_table_styles([ dict(selector='th', props=[('text-align', 'left')] ) ])

Unnamed: 0,Name,Sklearn_Class
0,StandardScaler,StandardScaler
1,MinMaxScaler,MinMaxScaler
2,MaxAbsScaler,MaxAbsScaler
3,RobustScaler,RobustScaler
4,QuantileTransformer-Normal,QuantileTransformer(output_distribution='normal')
5,QuantileTransformer-Uniform,QuantileTransformer(output_distribution='uniform')
6,PowerTransformer-Yeo-Johnson,PowerTransformer(method='yeo-johnson')
7,Normalizer,Normalizer


In [18]:
# Print table for the Medium article
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 100000)
pd.options.display.max_rows
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated in newer pandas

dict2 = {'LR': "LogisticRegression",
'LDA':"LinearDiscriminantAnalysis",
'KNN':"KNeighborsClassifier",
'CART':"DecisionTreeClassifier",
'NB':"GaussianNB",
'SVM':"SVC",
'RF':"RandomForestClassifier",
'MLP':"MLPClassifier"}

scalers_df = pd.DataFrame(list(dict2.items()), columns=["Name","Sklearn_Class"])
s = scalers_df.style.set_properties(subset=["Name", "Sklearn_Class"], **{'text-align': 'left'})
s.set_table_styles([ dict(selector='th', props=[('text-align', 'left')] ) ])

Unnamed: 0,Name,Sklearn_Class
0,LR,LogisticRegression
1,LDA,LinearDiscriminantAnalysis
2,KNN,KNeighborsClassifier
3,CART,DecisionTreeClassifier
4,NB,GaussianNB
5,SVM,SVC
6,RF,RandomForestClassifier
7,MLP,MLPClassifier


In [19]:
import operator

cols_max_vals = {}
cols_max_row_names = {}
for col in list(pivot_t):
    row_name = pivot_t[col].idxmax()
    cell_val = pivot_t[col].max()
    cols_max_vals[col] = cell_val
    cols_max_row_names[col] = row_name
    
sorted_cols_max_vals = sorted(cols_max_vals.items(), key=lambda kv: kv[1], reverse=True)

print("Best classifiers sorted:\n")
counter = 1
for model, score in sorted_cols_max_vals:
    print(str(counter) + ". " + model + " + " +cols_max_row_names[model] + " : " +str(score))
    counter +=1

Best classifiers sorted:

1. CART +  : 1.0
2. LR +  : 1.0
3. MLP + PowerTransformer-Yeo-Johnson : 1.0
4. SVM + RobustScaler : 0.998
5. KNN + RobustScaler : 0.991
6. NB + PowerTransformer-Yeo-Johnson : 0.978
7. RF +  : 0.97
8. LDA + PowerTransformer-Yeo-Johnson : 0.961



## Let's analyze the results

1. <b>There is no single scaling method to rule them all.</b>


2. We can see that scaling improved the results. SVM (0.78 to 0.998) and KNN (0.89 to 0.991) got a significant boost from different scaling methods, while NB (0.95 to 0.978) and MLP (0.992 to 1.0) improved more modestly.


3. Notice that NB, RF, LDA, CART are <b>unaffected</b> by some of the scaling methods. This is, of course, related to how each of the classifiers works. Trees are not affected by scaling because the splitting criterion first orders the values of each feature and then calculates the Gini/entropy of each split. Some scaling methods don't change this order, so the accuracy score doesn't change either. <br>
    NB is not affected because the model's priors are determined by the count in each class and not by the actual values, and the Gaussian fitted to each feature rescales together with the data. LDA fits a Gaussian density to each class, so a linear rescaling of the features doesn't matter either.


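4. Some of the methods are not simple per-feature linear rescalings. Normalizer rescales each sample (row), which can change the order of the values within a feature, and the non-linear QuantileTransformer and PowerTransformer change the shape of each feature's distribution. Hence the change in score even for the classifiers above that were agnostic to other scaling methods (for example, NB drops from 0.95 to 0.891 with QuantileTransformer-Normal, and RF drops from 0.97 to 0.948 with Normalizer).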
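
Whether a transformation keeps the order of the values within each feature, which is what tree splits depend on, can be checked directly. The sketch below uses a tiny hand-made matrix (an assumption, not the notebook's data) and plain NumPy versions of per-feature min-max scaling and of per-sample L2 normalization, which is the default behavior of Normalizer. The helper `same_order` is a hypothetical name introduced here.

```python
import numpy as np

# Tiny illustrative matrix: 3 samples (rows), 2 features (columns).
X = np.array([
    [1.0, 100.0],
    [2.0, 1.0],
    [3.0, 50.0],
])

# Per-feature min-max scaling: each column is rescaled on its own.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Per-sample L2 normalization: each row is divided by its own norm.
X_rownorm = X / np.linalg.norm(X, axis=1, keepdims=True)


def same_order(a, b):
    """Return True if every column of `a` and `b` sorts its rows in the same order."""
    return all(
        np.array_equal(
            np.argsort(a[:, j], kind="stable"),
            np.argsort(b[:, j], kind="stable"),
        )
        for j in range(a.shape[1])
    )


print(same_order(X, X_minmax))   # True: column-wise rescaling keeps the order
print(same_order(X, X_rownorm))  # False: row-wise rescaling reorders feature 0
```

In the row-normalized matrix, the second sample has the largest value of the first feature even though it was in the middle originally, so a tree would see different candidate splits.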

<a id="Classifiers_Scaling_pca"></a>
# 3. Classifier+Scaling+PCA

We know that some well-known ML methods like PCA can benefit from scaling (<a href="https://sebastianraschka.com/Articles/2014_about_feature_scaling.html">blog</a>).
Let's try adding PCA(n_components=4) to the pipeline and analyze the results.
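
As a small illustration of why PCA is sensitive to feature scale, the sketch below builds two independent synthetic features on very different scales (an assumption for illustration, unrelated to the notebook's data) and computes the share of variance captured by each principal component with a plain NumPy SVD. The helper `explained_variance_ratio` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Two independent features on very different scales (illustrative).
small = rng.normal(0.0, 1.0, size=n_samples)
large = rng.normal(0.0, 1000.0, size=n_samples)
X = np.column_stack([small, large])


def explained_variance_ratio(data):
    """Share of the total variance captured by each principal component."""
    centered = data - data.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance along each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


raw_ratio = explained_variance_ratio(X)

# Standardize each feature to zero mean and unit variance, then repeat.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_ratio = explained_variance_ratio(X_std)

print("raw:         ", raw_ratio.round(4))
print("standardized:", std_ratio.round(4))
```

Without scaling, the first component is essentially just the large-scale feature. After standardization, the two components share the variance almost evenly, as expected for independent features.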

In [20]:
import operator
temp = results_df.copy()
temp["model"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[1])
temp["scaler"] = results_df["Classifier_Name"].apply(lambda sen: sen.split("_")[0])

def df_style(val):
    return 'font-weight: 800'
    

pivot_t = pd.pivot_table(temp, values='CV_mean', index=["scaler"], columns=['model'], aggfunc=np.sum)
pivot_t_bold = pivot_t.style.applymap(df_style,
                      subset=pd.IndexSlice[pivot_t["CART"].idxmax(),"CART"])
for col in list(pivot_t):
    pivot_t_bold = pivot_t_bold.applymap(df_style,
                      subset=pd.IndexSlice[pivot_t[col].idxmax(),col])
pivot_t_bold

model,CART,CART-PCA,KNN,KNN-PCA,LDA,LDA-PCA,LR,LR-PCA,MLP,MLP-PCA,NB,NB-PCA,RF,RF-PCA,SVM,SVM-PCA
scaler,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
,1,0.78,0.89,0.836,0.88,0.85,1.0,0.85,0.992,0.851,0.95,0.844,0.97,0.831,0.78,0.78
MaxAbsScaler,1,0.753,0.84,0.808,0.88,0.819,0.887,0.821,0.998,0.827,0.95,0.814,0.97,0.809,0.868,0.82
MinMaxScaler,1,0.759,0.845,0.814,0.88,0.824,0.887,0.825,0.998,0.83,0.95,0.823,0.97,0.801,0.869,0.824
Normalizer,1,0.78,0.893,0.834,0.884,0.848,0.827,0.821,0.999,0.849,0.942,0.844,0.948,0.839,0.78,0.781
PowerTransformer-Yeo-Johnson,1,0.818,0.974,0.858,0.961,0.869,1.0,0.872,1.0,0.873,0.978,0.864,0.97,0.861,0.997,0.873
QuantileTransformer-Normal,1,0.943,0.917,0.954,0.884,0.872,1.0,0.954,0.999,0.965,0.891,0.893,0.97,0.907,0.985,0.965
QuantileTransformer-Uniform,1,0.844,0.919,0.878,0.884,0.885,0.994,0.885,0.999,0.888,0.916,0.88,0.97,0.859,0.994,0.886
RobustScaler,1,0.997,0.991,0.995,0.88,0.86,1.0,1.0,1.0,1.0,0.95,0.871,0.97,0.947,0.998,0.998
StandardScaler,1,0.792,0.89,0.839,0.88,0.853,0.994,0.853,1.0,0.855,0.95,0.834,0.97,0.846,0.978,0.854


## Let's analyze the results

1. We can see that PCA only improves LDA and RF, so PCA is not a magic solution.<br> 
    That's fine. We didn't tune the n_components parameter, and even if we did, PCA is not guaranteed to improve predictions. 
    

2. Most of the time scaling methods improve models with PCA, <b> but </b> no single scaling method dominates. <br>
    Let's look at "QuantileTransformer-Uniform", the method with most of the high scores. <br>
    In LDA-PCA it improved the results from 0.704 to 0.783 (a jump of about 8 percentage points in accuracy!), but in RF-PCA it makes things worse, from 0.711 to 0.668 (a drop of about 4.3 percentage points!) <br>
    On the other hand, using RF-PCA with "QuantileTransformer-Normal" improved the accuracy to 0.766 (a jump of about 5.5 percentage points!) <br>
3. We can see that StandardScaler and MinMaxScaler achieve the best scores in only 4 out of 16 cases. So one should think carefully about which scaling method to choose, even as a default one.<br> 

<b>We can conclude that even though PCA is a known component that benefits from scaling, no single scaling method always improves our results, and some of them even cause harm.</b>

<a id="Classifiers_Scaling_pca_hyper_param"></a>

# 4. Classifier+Scaling+PCA+Hyperparameter tuning

There were big differences in the accuracy score between different scaling methods for a given classifier. One might assume that once the hyperparameters are tuned, the difference between the scaling techniques will be minor, and we can use StandardScaler or MinMaxScaler as is done in many classification pipeline tutorials on the web. <br>
Let's check that!
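
The nested cross-validation used for this part (5 inner splits for tuning, 10 outer splits for scoring) could be sketched as follows. This is only an illustration under assumptions: the synthetic data, the `SEED` value, the SVC classifier, and the small `param_grid` are hypothetical and are not the settings behind the results shown here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 0  # hypothetical seed

# Synthetic stand-in for the train set (assumption).
X_train, y_train = make_classification(
    n_samples=300, n_features=10, random_state=SEED
)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])

# Hypothetical search space; the grid actually used is not given in this notebook.
param_grid = {"clf__C": [0.1, 1, 10]}

# Inner loop: 5 random splits, used only to pick the best hyperparameters.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=SEED)
# Outer loop: 10 random splits, used to score the tuned model.
outer_cv = KFold(n_splits=10, shuffle=True, random_state=SEED)

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="accuracy")

# Each outer fold re-runs the whole grid search on its own training part,
# so the reported score never comes from data used for tuning.
nested_scores = cross_val_score(
    search, X_train, y_train, cv=outer_cv, scoring="accuracy"
)

print(f"CV_mean={nested_scores.mean():.3f}, CV_std={nested_scores.std():.3f}")
```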

TODO

In [13]:
import os

import numpy as np  # needed below for aggfunc=np.sum
import pandas as pd

results_hyper_file = "sonar_results_hypertune.csv"
results_hyper_df = pd.read_csv(os.path.join("..","data","processed",results_hyper_file)).dropna().round(3)


temp = results_hyper_df.copy()
temp["model"] = results_hyper_df["Classifier_Name"].apply(lambda sen: sen.split("_")[1])
temp["scaler"] = results_hyper_df["Classifier_Name"].apply(lambda sen: sen.split("_")[0])

def df_style(val):
    return 'font-weight: 800'
    

pivot_t = pd.pivot_table(temp, values='CV_mean', index=["scaler"], columns=['model'], aggfunc=np.sum)
pivot_t_bold = pivot_t.style.applymap(df_style,
                      subset=pd.IndexSlice[pivot_t["KNN"].idxmax(),"KNN"])
for col in list(pivot_t):
    pivot_t_bold = pivot_t_bold.applymap(df_style,
                      subset=pd.IndexSlice[pivot_t[col].idxmax(),col])
pivot_t_bold

model,CART,CART-PCA,KNN,KNN-PCA,LDA,LDA-PCA,LR,LR-PCA,MLP,MLP-PCA,RF,RF-PCA,SVM,SVM-PCA
scaler,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
,0.734,0.704,0.85,0.771,0.77,0.676,0.76,0.689,0.734,0.651,0.771,0.699,0.789,0.67
MaxAbsScaler,0.734,0.723,0.843,0.741,0.765,0.76,0.736,0.766,0.728,0.754,0.776,0.758,0.759,0.743
MinMaxScaler,0.734,0.657,0.837,0.753,0.782,0.754,0.711,0.766,0.746,0.742,0.776,0.722,0.735,0.737
PowerTransformer-Yeo-Johnson,0.65,0.74,0.874,0.777,0.778,0.771,0.789,0.777,0.807,0.681,0.776,0.772,0.873,0.728
QuantileTransformer-Normal,0.64,0.687,0.806,0.731,0.771,0.778,0.795,0.766,0.694,0.735,0.795,0.773,0.837,0.735
QuantileTransformer-Uniform,0.651,0.717,0.891,0.771,0.783,0.765,0.814,0.778,0.758,0.741,0.746,0.741,0.814,0.753
RobustScaler,0.734,0.686,0.838,0.789,0.776,0.73,0.758,0.724,0.783,0.71,0.776,0.711,0.771,0.759
StandardScaler,0.734,0.736,0.825,0.776,0.783,0.753,0.741,0.748,0.777,0.742,0.776,0.722,0.861,0.718
