In [1]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
import pandas as pd
np.set_printoptions(suppress=True)

#### Load sample data of video views

In [19]:
views = pd.DataFrame([1295., 25., 19000., 5., 1., 300.], columns=['views'])
views

Unnamed: 0,views
0,1295.0
1,25.0
2,19000.0
3,5.0
4,1.0
5,300.0


#### Standard Scaler 
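The standard scaler standardizes each value in a feature column by subtracting the column mean and
dividing by the column standard deviation, so that the scaled feature has a mean of 0 and a variance of 1.
This is also known as centering and scaling and can be denoted mathematically as
$SS(X_i) = {X_i-\mu_X \over \sigma_X}$, where $\mu_X$ and $\sigma_X$ are the mean and standard deviation of feature $X$.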

In [20]:
ss = StandardScaler()
views['zscore'] = ss.fit_transform(views[['views']])
views

Unnamed: 0,views,zscore
0,1295.0,-0.307214
1,25.0,-0.489306
2,19000.0,2.231317
3,5.0,-0.492173
4,1.0,-0.492747
5,300.0,-0.449877


We can see the standardized and scaled values in the zscore column of the preceding dataframe. You can
also apply the formula shown earlier by hand to obtain the same result. The following example
computes the z-scores directly:

In [35]:
vw = views[['views']]
# Use the population standard deviation (ddof=0), which is what StandardScaler uses
vv = (vw - vw.mean()) / vw.std(ddof=0)
vv

Unnamed: 0,views
0,-0.307214
1,-0.489306
2,2.231317
3,-0.492173
4,-0.492747
5,-0.449877
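
One detail matters when reproducing the z-score by hand: StandardScaler divides by the population standard deviation (`ddof=0`), whereas the pandas `.std()` method defaults to the sample standard deviation (`ddof=1`). The following is a minimal sketch of that difference; the names `demo`, `scaler`, `pop_std` and `sample_std` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same sample data as above, rebuilt so the snippet is self-contained
demo = pd.DataFrame({'views': [1295., 25., 19000., 5., 1., 300.]})

scaler = StandardScaler().fit(demo[['views']])

# Statistics learned during fit
print(scaler.mean_)   # per-column mean
print(scaler.scale_)  # per-column population standard deviation

pop_std = demo['views'].std(ddof=0)   # agrees with scaler.scale_
sample_std = demo['views'].std()      # pandas default (ddof=1), slightly larger

# Manual z-score with ddof=0 agrees with the scaler's output
z_manual = (demo['views'] - demo['views'].mean()) / pop_std
z_sklearn = scaler.transform(demo[['views']]).ravel()
print(np.allclose(z_manual, z_sklearn))
```

Using `sample_std` in the manual formula would give slightly smaller z-scores than the ones in the zscore column.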


#### Min-Max Scaling
With min-max scaling, we can transform and scale our feature values such that each value lies within the
range [0, 1]. The MinMaxScaler class in scikit-learn also lets you specify your own lower and upper
bounds for the scaled range through the feature_range parameter. Mathematically, we can
represent this scaler as
$$MM(X_i) = {X_i-\min(X) \over \max(X)-\min(X)}$$

In [29]:
mm = MinMaxScaler()
views['mmscore'] = mm.fit_transform(views[['views']])
views

Unnamed: 0,views,zscore,mmscore
0,1295.0,-0.307214,0.068109
1,25.0,-0.489306,0.001263
2,19000.0,2.231317,1.0
3,5.0,-0.492173,0.000211
4,1.0,-0.492747,0.0
5,300.0,-0.449877,0.015738


The preceding output shows the min-max scaled values in the mmscore column. As expected, the
most-viewed video in row index 2 has a value of 1, and the least-viewed video in row index 4 has a
value of 0. You can also compute this mathematically using the following code:

In [36]:
(vw - vw.min()) / (vw.max() - vw.min())

Unnamed: 0,views
0,0.068109
1,0.001263
2,1.0
3,0.000211
4,0.0
5,0.015738
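
The feature_range parameter mentioned earlier can be illustrated with a short sketch. It scales the same values into [-1, 1] and compares the result with a manual computation that stretches the [0, 1] result to the requested bounds. The names `demo`, `mm_custom`, `low` and `high` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Same sample data as above, rebuilt so the snippet is self-contained
demo = pd.DataFrame({'views': [1295., 25., 19000., 5., 1., 300.]})

# Scale into [-1, 1] instead of the default [0, 1]
low, high = -1, 1
mm_custom = MinMaxScaler(feature_range=(low, high))
scaled = mm_custom.fit_transform(demo[['views']]).ravel()

# Manual equivalent: min-max scale to [0, 1], then stretch and shift
x = demo['views']
manual = (x - x.min()) / (x.max() - x.min()) * (high - low) + low

print(np.allclose(scaled, manual))
```

The smallest value still maps to the lower bound and the largest to the upper bound; only the target interval changes.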


#### Robust Scaler
A disadvantage of min-max scaling is that outliers strongly affect the scaled values of a feature: a single
extreme value sets the minimum or maximum and compresses every other value into a narrow part of the range.
Robust scaling instead relies on statistics that are far less sensitive to outliers, namely the median and
the inter-quartile range. Mathematically, this scaler can be represented as

$$RS(X_i)={X_i-median(X)\over IQR_{(1,3)}(X)}$$

where we scale each value of feature X by subtracting the median of X and dividing the result by the IQR
of X. The IQR, or inter-quartile range, is the difference between the third quartile (75th percentile) and
the first quartile (25th percentile). The following code performs robust scaling on our sample feature.

In [37]:
rs = RobustScaler()
views['rs'] = rs.fit_transform(views[['views']])
views

Unnamed: 0,views,zscore,mmscore,rs
0,1295.0,-0.307214,0.068109,1.092883
1,25.0,-0.489306,0.001263,-0.13269
2,19000.0,2.231317,1.0,18.178528
3,5.0,-0.492173,0.000211,-0.15199
4,1.0,-0.492747,0.0,-0.15585
5,300.0,-0.449877,0.015738,0.13269


You can also compute the same result using the mathematical equation we
formulated for the robust scaler, as depicted in the following snippet:

In [43]:
quartiles = np.percentile(vw,(25.,75.))
iqr = quartiles[1] - quartiles[0]
(vw - np.median(vw))/iqr

Unnamed: 0,views
0,1.092883
1,-0.13269
2,18.178528
3,-0.15199
4,-0.15585
5,0.13269
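
As a cross-check on the manual computation, the fitted RobustScaler exposes the statistics it learned through its `center_` and `scale_` attributes. The sketch below compares them with the median and IQR computed by NumPy; the names `demo`, `robust`, `q1` and `q3` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Same sample data as above, rebuilt so the snippet is self-contained
demo = pd.DataFrame({'views': [1295., 25., 19000., 5., 1., 300.]})

# Default quantile_range is (25.0, 75.0), i.e. the inter-quartile range
robust = RobustScaler().fit(demo[['views']])

print(robust.center_)  # median of the column
print(robust.scale_)   # IQR of the column

# The same statistics computed directly with NumPy
q1, q3 = np.percentile(demo['views'], (25., 75.))
print(np.median(demo['views']), q3 - q1)
```

A different quantile_range, for example `(10.0, 90.0)`, can be passed to the constructor when a wider or narrower spread is preferred.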
