In [144]:
from sklearn.preprocessing import StandardScaler, scale
import numpy as np
import pandas as pd
array1 = np.random.normal(loc = 50, scale=5, size=100)
array2 = np.random.normal(loc = 60, scale=2, size=100)
array3 = np.vstack((array1, array2)).T
a3_df = pd.DataFrame(array3)
StdScaler = StandardScaler()


# General Reasons for Scaling

1) Scaling is required by some machine learning algorithms
2) "Unscaled data can also slow down or even prevent the convergence of many gradient-based estimators."


# Algorithms that Do Not Require Scaling

1) Linear regression (without gradient descent)
2) Logistic regression (without gradient descent)
3) Decision Trees and Random Forests
4) Naive Bayes (?)
5) Linear Discriminant Analysis(LDA) (?)


# General Information About Scalers

1) "Scalers are linear (or more precisely affine) transformers"
2) Accept arrays, Series, and DataFrames as input, but return NumPy arrays
3) Have the same set of methods, but different parameters (i.e., hyperparameters) and attributes (parameters).
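
The points above can be illustrated with a minimal sketch. It assumes scikit-learn's default output configuration (no pandas output has been requested), and the column names `a` and `b` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical two-feature DataFrame, similar in spirit to a3_df
df = pd.DataFrame({"a": rng.normal(50, 5, 100), "b": rng.normal(60, 2, 100)})

scaler = StandardScaler()        # hyperparameters are chosen at construction
out = scaler.fit_transform(df)   # a DataFrame goes in ...

print(type(out))                 # ... and a NumPy array comes out by default
print(scaler.get_params())       # parameters (hyperparameters), set by the user
print(scaler.mean_, scaler.scale_)  # attributes learned in fit (trailing underscore)
```

Attributes learned from the data end with an underscore, which makes them easy to tell apart from the hyperparameters returned by get_params().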

# Standard Scaler

In [148]:
a1_scale = scale(array1)
a2_scale = scale(array2)
a3_scale = scale(array3)
a3_df_scale = scale(a3_df)

In [149]:
a1_scale[0:5]

array([-0.89588916,  1.26027338,  2.13964297,  0.7563316 , -1.51852875])

In [152]:
a3_df_scale[0:5]

array([[-0.89588916, -0.4626879 ],
       [ 1.26027338, -1.18378559],
       [ 2.13964297,  1.00225161],
       [ 0.7563316 , -0.88672299],
       [-1.51852875, -0.34785406]])

In [63]:
a3_scale_df = pd.DataFrame(a3_scale)
print((a3_scale_df[0]==a1_scale).sum(), (a3_scale_df[1]==a2_scale).sum())

100 100


In [64]:
np.mean(a3_scale_df)

0    1.334488e-15
1   -1.273426e-15
dtype: float64

In [65]:
a3_scale_df.describe()

                  0              1
count         100.0          100.0
mean   1.298961e-15  -1.285638e-15
std        1.005038       1.005038
min       -2.463903       -3.14701
25%      -0.7011991      -0.660654
50%     -0.07981316      0.1445285
75%       0.5449091      0.7059845
max        2.979439        2.10299


In [62]:
np.mean(a3_scale_df[0])

1.298960938811433e-15

In [141]:
a3_Std = StdScaler.fit_transform(array3)
a3_Std
# a3_Std_df = pd.DataFrame(a3_Std)
# print(np.mean(a3_Std_df[0]), np.mean(a3_Std_df[1]))

array([[-2.68170191e-01,  1.37658231e+00],
       [-1.27488934e+00, -6.95114201e-01],
       [ 3.91416929e-01, -8.57022221e-01],
       [ 7.83989317e-01,  1.60713470e+00],
       [-6.89619362e-01, -3.54620647e-01],
       [ 8.60610042e-01,  1.93253029e+00],
       [-4.87320045e-01, -5.58929062e-01],
       [ 2.18364114e+00,  1.64232537e-01],
       [ 2.35583592e-01,  1.05566701e+00],
       [ 1.40533843e-01,  4.14650467e-01],
       [ 8.42404770e-01, -1.22687306e+00],
       [ 1.67961608e+00, -5.77775940e-01],
       [-7.79162968e-01,  1.86128589e+00],
       [ 2.97943922e+00,  1.30590575e+00],
       [-1.24710978e-01,  8.38522445e-01],
       [ 4.56170863e-01,  1.06683687e+00],
       [ 3.25440553e-01, -1.45881187e+00],
       [ 1.36156467e+00,  2.10299048e+00],
       [-8.87182608e-01,  2.54005505e-01],
       [-1.57016615e-01, -1.90947256e+00],
       [ 5.79936361e-01,  2.92045473e-01],
       [-7.78011194e-01, -1.69757731e+00],
       [-1.50633170e+00,  7.87753486e-01],
       [-1.

In [143]:
a1_StdScale = StdScaler.fit_transform(array1.reshape(-1,1))


In [82]:
a2_StdScale = StdScaler.fit_transform(array2.reshape(-1,1))
np.mean(a2_StdScale)

-1.2856382625159313e-15

scale works on a 1D array. StandardScaler requires a 2D array, so a single feature must be reshaped with array.reshape(-1, 1). With StandardScaler, it is necessary to fit and then transform (or to call fit_transform).

The outputs of the two methods are different. Calling scale returns the transformed array directly. Calling StandardScaler() returns a StandardScaler object; its fit and transform methods (or fit_transform) must be called to get the scaled data.

The StandardScaler can apply a transformation fit on one dataset to a second dataset. 
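
A minimal sketch of reusing a fitted scaler on a second dataset follows. The `train` and `test` arrays are made-up data, not part of the notebook above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
train = rng.normal(loc=50, scale=5, size=(100, 1))  # hypothetical training feature
test = rng.normal(loc=50, scale=5, size=(20, 1))    # hypothetical held-out feature

scaler = StandardScaler().fit(train)   # mean and std come from train only
test_scaled = scaler.transform(test)   # the same shift and scale are applied to test

# Manual equivalent using the training statistics (np.std defaults to ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual))
```

The scale function has no way to do this, because it computes its statistics from whatever array it is given.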

In both, "NaNs are treated as missing values: disregarded in fit, and maintained in transform."
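
The NaN behaviour quoted above can be checked with a small sketch. It assumes a scikit-learn version in which both scale and StandardScaler accept NaN (this is what the quoted documentation describes); the array `x` is made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

x = np.array([[1.0], [2.0], [np.nan], [3.0]])  # one missing value

scaler = StandardScaler().fit(x)
x_scaled = scaler.transform(x)

print(scaler.mean_)       # mean of the non-missing values 1, 2, 3
print(x_scaled.ravel())   # the NaN stays in the same position
print(np.allclose(x_scaled, scale(x), equal_nan=True))
```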

In both, "We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0)."
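
The biased estimator also explains why describe() reported a std of 1.005038 rather than exactly 1 for the scaled data: pandas uses ddof=1 by default, while the scaler divides by the ddof=0 standard deviation. With 100 samples the ratio is sqrt(100/99). A short sketch with made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

rng = np.random.default_rng(2)
x = rng.normal(loc=50, scale=5, size=100)
x_scaled = scale(x)

std_biased = np.std(x_scaled, ddof=0)       # what the scaler normalizes to 1
std_unbiased = pd.Series(x_scaled).std()    # pandas default is ddof=1
expected = np.sqrt(100 / 99)                # about 1.005038

print(std_biased, std_unbiased, expected)
```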

*Compare parameters and attributes of each method

StandardScaler has methods (fit, fit_transform, get_params, inverse_transform, partial_fit, set_params, and transform)

scale is a function. StandardScaler is a utility class that creates an object.

Both can be applied to multi-dimensional arrays and pandas dataframes containing multiple features.

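np.mean and .mean() give slightly different values when called on an entire dataframe than when called on an individual column (for example, 1.334488e-15 versus 1.298961e-15 for column 0). Both values are floating-point round-off around a true mean of zero, so the difference is not meaningful.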

Use Cases: 
1) Scale and StandardScaler are used with the RBF kernel in SVMs and L1 and L2 regularizers of linear models (i.e., Lasso and Ridge regression).
2) Metric-based and gradient-based estimators
3) PCA
4) Perceptrons
5) Neural Networks
6) Algorithms that use Euclidean distance (e.g., KNN, k-means, HAC)
7) Logistic regression using gradient descent



"This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data."

Very sensitive to outliers

Uses a biased estimator of the standard deviation (i.e., sets degrees of freedom to 0). 
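
The note above about sparse CSR or CSC matrices and with_mean=False can be sketched as follows. The small matrix is made up, and SciPy is assumed to be installed alongside scikit-learn.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Hypothetical sparse feature matrix in CSR format
X = sparse.csr_matrix(np.array([[0.0, 2.0],
                                [0.0, 0.0],
                                [4.0, 0.0],
                                [0.0, 6.0]]))

# with_mean=False skips centering, so zeros stay zeros
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)

print(sparse.issparse(X_scaled))   # output is still sparse
print(X_scaled.nnz == X.nnz)       # same number of stored non-zero entries
```

Centering would turn most zeros into non-zero values, which is why it is switched off for sparse input.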

# MaxAbsScaler and maxabs_scale

Used with very small standard deviations of features and to preserve zero entries in sparse data.

Used with data centered at zero or sparse data

maxabs_scale is a function, while MaxAbsScaler is a class that creates an object. 

Scales each feature to the range \[-1, 1\].

Divides the values by the maximum absolute value in each feature.

Can be applied to sparse CSR (compressed sparse rows) or CSC (compressed sparse columns) matrices
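
A minimal sketch of the division by the per-feature maximum absolute value, using a made-up array:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, maxabs_scale

X = np.array([[ 1.0, -10.0],
              [-4.0,   0.0],
              [ 2.0,   5.0]])

scaler = MaxAbsScaler().fit(X)
X_scaled = scaler.transform(X)

# Manual equivalent: divide each column by its largest absolute value
manual = X / np.abs(X).max(axis=0)

print(scaler.max_abs_)                        # per-feature maximum absolute value
print(np.allclose(X_scaled, manual))
print(np.allclose(X_scaled, maxabs_scale(X)))  # the function gives the same result
```

Because the data are only divided and never shifted, a zero entry remains zero after scaling.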

"Suffers from the presence of large outliers"

In both, "NaNs are treated as missing values: disregarded in fit, and maintained in transform." 

Use Cases:
1) The recommended way to scale sparse data.


# MinMaxScaler and minmax_scale

Scales features to a given range.

Scaled values often fall between \[0, 1\], though different min and max values can be specified.

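Robust to very small standard deviations of features. Zero entries are preserved only when a feature's minimum is zero (with the default range); MinMaxScaler does not accept sparse input, so MaxAbsScaler is the usual choice for sparse data.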

Reduces the standard deviation

Very sensitive to outliers.

Used as an alternative to zero mean, unit variance scaling.

Transformation given by:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where min and max are the bounds given by the feature_range hyperparameter

Transformation calculated by:
X_scaled = scale * X + min - X.min(axis=0) * scale
where scale = (max - min) / (X.max(axis=0) - X.min(axis=0))
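
Both forms of the transformation above can be checked against MinMaxScaler. This is a sketch with made-up data; `lo` and `hi` stand in for min and max in the formulas to avoid shadowing Python's built-ins.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(3)
X = rng.normal(loc=50, scale=5, size=(100, 1))

lo, hi = 0, 2   # feature_range, i.e. min and max in the formulas
scaler = MinMaxScaler(feature_range=(lo, hi)).fit(X)

# "Transformation given by"
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
form1 = X_std * (hi - lo) + lo

# "Transformation calculated by"
scale_ = (hi - lo) / (X.max(axis=0) - X.min(axis=0))
form2 = scale_ * X + lo - X.min(axis=0) * scale_

print(np.allclose(form1, scaler.transform(X)))
print(np.allclose(form2, scaler.transform(X)))
print(scaler.scale_, scaler.min_)   # match scale_ and lo - X.min * scale_
```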


For MinMaxScaler, "NaNs are treated as missing values: disregarded in fit, and maintained in transform."

Use Cases:
1) Data with hard boundaries such as colors in image data (https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e) 
2) Some neural networks

In [123]:
from sklearn.preprocessing import MinMaxScaler, minmax_scale

In [126]:
a1_MM_01 = MinMaxScaler().fit(array1.reshape(-1,1))
1/(array1.max() - array1.min()), a1_MM_01.scale_

(0.035821816042655816, array([0.03582182]))

In [127]:
a1_MM_02 = MinMaxScaler(feature_range=(0,2)).fit(array1.reshape(-1,1))
1/(array1.max() - array1.min()), a1_MM_02.scale_

(0.035821816042655816, array([0.07164363]))

In [132]:
a1_MM_02.min_, a1_MM_02.data_min_, a1_MM_02.data_max_, a1_MM_02.data_range_

(array([-2.67046568]),
 array([37.2742922]),
 array([65.19024149]),
 array([27.91594929]))

# RobustScaler and robust_scale

Robust to outliers

Centers the data using the median instead of the mean.

Scales the data by dividing by a quantile range (the default is the IQR, i.e., the range between the 25th and 75th percentiles).

Use in machine learning tasks where the StandardScaler is typically used, but the data contains outliers.

Outliers remain in the transformed data
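
The following sketch compares RobustScaler and StandardScaler on made-up data with a single injected outlier (the value 500.0 is arbitrary). It assumes the default quantile_range of (25.0, 75.0).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(4)
x = rng.normal(loc=50, scale=5, size=(100, 1))
x_out = np.vstack([x, [[500.0]]])   # add one extreme value

robust = RobustScaler().fit(x_out)
standard = StandardScaler().fit(x_out)

q25, q50, q75 = np.percentile(x_out, [25, 50, 75])
print(robust.center_, q50)            # center is the median
print(robust.scale_, q75 - q25)       # scale is the IQR
print(standard.mean_, standard.scale_)  # mean and std are pulled by the outlier

# The outlier is still present (and still extreme) after the transform
print(robust.transform(x_out).max())
```

The median and IQR barely move when the outlier is added, while the mean and standard deviation shift noticeably. The outlier itself is not removed or clipped.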

Cannot be used with sparse data

In [89]:
from sklearn.preprocessing import RobustScaler, robust_scale

In [154]:
a1_RS = RobustScaler(with_centering=True).fit(array1.reshape(-1,1))
print(type(a1_RS))
print(a1_RS.center_, a1_RS.scale_)
print(a1_RS.get_params())
#a1_RS.transform(array1.reshape(-1,1))

a1_RS.set_params(quantile_range= (80.0, 90.0))
print(a1_RS.get_params())
# a1_RS = a1_RS.fit_transform(array1.reshape(-1,1))
# print(type(a1_RS))
#a1_RS.fit(array1.reshape(-1,1))

<class 'sklearn.preprocessing.data.RobustScaler'>
[50.46710884] [6.86258537]
{'copy': True, 'quantile_range': (25.0, 75.0), 'with_centering': True, 'with_scaling': True}
{'copy': True, 'quantile_range': (80.0, 90.0), 'with_centering': True, 'with_scaling': True}


In [159]:
a1_RS.get_params()

{'copy': True,
 'quantile_range': (80.0, 90.0),
 'with_centering': True,
 'with_scaling': True}

# Methods

fit: Fits the scaler (e.g., calculates the center and scale to be used) and stores the parameters to be used to transform the data. Returns the fitted scaler object.

fit_transform: Combines the fit and transform methods. Returns an array of transformed values.

get_params: Call with () to return a dictionary of the parameters (hyperparameters) of the scaler. Without (), the expression only references the method itself and does not return the parameters.

inverse_transform: reverses the transformation 
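
A short round-trip sketch with made-up data, using StandardScaler as the example scaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(loc=50, scale=5, size=(100, 2))

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_back = scaler.inverse_transform(X_scaled)   # undo the scaling

print(np.allclose(X, X_back))   # original values recovered up to round-off
```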

partial_fit: Online computation of data parameters. "All of X is processed as a single batch." Use with very large samples or continuously streaming data. 
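
A minimal sketch of incremental fitting, assuming the data arrive in ten batches (the batch count and data are made up). The statistics accumulated by partial_fit should agree with a single fit on the full array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(loc=50, scale=5, size=(1000, 2))

online = StandardScaler()
for batch in np.array_split(X, 10):   # pretend each batch arrives separately
    online.partial_fit(batch)         # running mean and variance are updated

full = StandardScaler().fit(X)        # one pass over all the data

print(np.allclose(online.mean_, full.mean_))
print(np.allclose(online.scale_, full.scale_))
```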

set_params: Sets the parameters (hyperparameters) of the scaler. If the parameters are changed after fitting the scaler, you will need to fit the scaler again to apply the changes and then transform the data.
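
The need to refit after set_params can be seen in a small sketch with RobustScaler. The data and the (10.0, 90.0) range are chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(7)
x = rng.normal(loc=50, scale=5, size=(100, 1))

scaler = RobustScaler().fit(x)          # default quantile_range=(25.0, 75.0)
scale_before = scaler.scale_.copy()

scaler.set_params(quantile_range=(10.0, 90.0))
scale_after_set = scaler.scale_.copy()  # fitted attribute has not changed yet

scaler.fit(x)                           # refit to apply the new hyperparameter
scale_after_fit = scaler.scale_.copy()

print(scale_before, scale_after_set, scale_after_fit)
```

get_params() reflects the new value immediately, but the learned scale_ only changes once fit is called again.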

transform: Using the parameters from the fit method above, transform transforms the data. Returns an array of transformed values.

# ADD

Mean Normalization
Normalization
Unit Vector Scaling
