In [2]:
from sklearn import preprocessing
from sklearn import impute
import numpy as np

## Data scaling using StandardScaler

Scaling is a two-step process applied independently to each column (feature) of the data matrix:
1. **fit** step: Compute the column mean $\mu$ and standard deviation $\sigma$.
2. **transform** step: Subtract the mean and divide by the standard deviation.

\begin{equation}
X_{scaled} = (X - \mu) / \sigma
\end{equation}
<br/>

**Note**: `fit` and `transform` can be combined into a single call, `fit_transform`. The code cell for this section runs the two steps separately; the `MinMaxScaler` cell uses `fit_transform`.
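
As a quick check of the note on `fit_transform`, the sketch below runs both variants on the same small matrix and compares the results. It assumes scikit-learn and NumPy are installed; the variable names `two_step` and `one_step` are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

# same 3x3 matrix as in the StandardScaler code cell
X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# two separate steps: learn the statistics, then apply them
two_step = preprocessing.StandardScaler().fit(X).transform(X)

# one combined step
one_step = preprocessing.StandardScaler().fit_transform(X)

print(np.allclose(two_step, one_step))
print(one_step.mean(axis=0))  # every column is centered (numerically) at 0
print(one_step.std(axis=0))   # every column has unit standard deviation
```

`fit_transform` is convenient on training data. For unseen data, only `transform` should be called, so that the statistics learned during `fit` are reused.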

#### Documentation:
- <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler">StandardScaler</a>

In [11]:
# data matrix
X = np.array([[ 1., -1.,  2.], [ 2.,  0.,  0.], [ 0.,  1., -1.]])
print( X )
print( X.shape )

# create a standard scaler model
model = preprocessing.StandardScaler()

# fit the model
model.fit( X )

# print the learned per-column statistics (the scaler divides by sqrt(var_))
print( "mean = {}".format( model.mean_ ) )
print( "variance = {}".format( model.var_ ) )

# apply the model to unseen data
Z = np.random.rand( 3, 3 )
print( Z )
print( Z.shape ) 

# transform data using learned model attributes
Z_tx = model.transform( Z )
print( Z_tx )

[[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]]
(3, 3)
mean = [1.         0.         0.33333333]
variance = [0.66666667 0.66666667 1.55555556]
[[0.558812   0.16923081 0.81280186]
 [0.92962664 0.06064424 0.52665387]
 [0.01464053 0.88218252 0.90599068]]
(3, 3)
[[-0.54034274  0.20726456  0.38443006]
 [-0.08618941  0.07427372  0.15500126]
 [-1.20681396  1.08044852  0.45914734]]
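
The printed `mean_`, `var_`, and the first transformed row of `Z` can be reproduced by hand with plain NumPy. This is a minimal sketch of the formula, not scikit-learn's implementation; it relies on the population variance (division by $n$), which is what `StandardScaler` reports in `var_`.

```python
import numpy as np

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# fit step by hand: per-column mean and population variance
mean = X.mean(axis=0)
var = X.var(axis=0)  # ddof=0, i.e. divide by n
print(mean)
print(var)

# transform step by hand, using the first row of Z printed above
z = np.array([0.558812, 0.16923081, 0.81280186])
z_tx = (z - mean) / np.sqrt(var)
print(z_tx)
```

Note that `z` is scaled with the statistics of `X`, not its own, so transformed unseen data does not necessarily have zero mean or unit variance.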


## Data scaling using MinMaxScaler

Scaling is a two-step process applied independently to each column (feature) of the data matrix:
1. Subtract the column minimum, then divide by the range (maximum minus minimum). This maps the values into $[0, 1]$.

\begin{equation}
\acute{X} = (X - X_{min}) / (X_{max} - X_{min})
\end{equation}
<br/>
2. Rescale to a new minimum and maximum (only needed if the target range is not $[0, 1]$). This maps the values into $[min, max]$.

\begin{equation}
X_{scaled} = \acute{X} * (max - min) + min
\end{equation}
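
The two equations above can be sketched directly in NumPy. The target range `(-1, 1)` and the names `new_min`, `new_max`, and `X_unit` are assumptions chosen for illustration; they do not come from the notebook.

```python
import numpy as np

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.], [1., 0., -1.]])

# hypothetical target range, chosen only for this illustration
new_min, new_max = -1.0, 1.0

# step 1: map each column into [0, 1]
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_unit = (X - X_min) / (X_max - X_min)

# step 2: stretch and shift into [new_min, new_max]
X_scaled = X_unit * (new_max - new_min) + new_min
print(X_scaled)
```

In scikit-learn the target range is passed to `MinMaxScaler` through its `feature_range` parameter, for example `MinMaxScaler(feature_range=(-1, 1))`. The sketch assumes no column is constant, since that would make the denominator zero.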

#### Documentation:
- <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler">MinMaxScaler</a>

In [4]:
# data matrix
X = np.array([[ 1., -1.,  2.], [ 2.,  0.,  0.], [ 0.,  1., -1.], [ 1.,  0., -1.]])
print( X )
print( X.shape )

# create a MinMax model
model = preprocessing.MinMaxScaler()
# fit the model and transform X in a single call
X_tx = model.fit_transform( X )

# print the model attributes
print( "max = {}".format( model.data_max_ ) )
print( "min = {}".format( model.data_min_ ) )
print( X_tx )

# apply the model to unseen data
Z = np.random.rand( 3, 3 )
print( Z.shape )
print( Z )

# transform data using learned model attributes
Z_tx = model.transform( Z )
print( Z_tx )

[[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]
 [ 1.  0. -1.]]
(4, 3)
max = [2. 1. 2.]
min = [ 0. -1. -1.]
[[0.5        0.         1.        ]
 [1.         0.5        0.33333333]
 [0.         1.         0.        ]
 [0.5        0.5        0.        ]]
(3, 3)
[[0.58558157 0.98236996 0.45718929]
 [0.91312806 0.2077435  0.05437107]
 [0.20781968 0.05968542 0.47237055]]
[[0.29279079 0.99118498 0.48572976]
 [0.45656403 0.60387175 0.35145702]
 [0.10390984 0.52984271 0.49079018]]
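
The random `Z` above happens to fall inside the range seen during fitting, so its scaled values stay within $[0, 1]$. That is not guaranteed in general. The sketch below applies the printed `data_min_` and `data_max_` by hand to a hypothetical row `z` that lies partly outside the training range.

```python
import numpy as np

# statistics learned from X, as printed above
data_min = np.array([0., -1., -1.])
data_max = np.array([2., 1., 2.])

# hypothetical unseen row, partly outside the training range
z = np.array([3., -2., 0.5])

# same formula as the first scaling step, using the training statistics
z_tx = (z - data_min) / (data_max - data_min)
print(z_tx)  # the first two entries fall outside [0, 1]
```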


## Data scaling using L1 and L2 norms

In general, the `p-norm` of a vector $x$ with components $x_i$ is written $||x||_p$. `Normalizer` rescales each row (sample) of the data matrix so that its norm equals 1.

Specifically, the `p-norm` is defined as:

\begin{equation}
||x||_p=(\sum_i|x_i|^p)^{1/p}
\end{equation}
 
Conceptually, the most familiar norm is the `L2` (Euclidean) norm, the straight-line length of the vector ($p = 2$).

\begin{equation}
||x||_2=\sqrt{\sum_i|x_i|^2}
\end{equation}
 
Another common norm is the `L1` (taxicab) norm, the sum of the absolute values of the components ($p = 1$).

\begin{equation}
||x||_1=\sum_i|x_i|
\end{equation}
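
To make the definitions concrete, the sketch below evaluates the general formula for $p = 1$ and $p = 2$ on a single vector and compares the results with `np.linalg.norm`. The helper `p_norm` is a hypothetical name written for this illustration.

```python
import numpy as np

def p_norm(x, p):
    """p-norm of a 1-D vector, written directly from the definition."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([1., -1., 2.])

l1 = p_norm(x, 1)  # |1| + |-1| + |2| = 4
l2 = p_norm(x, 2)  # sqrt(1 + 1 + 4) = sqrt(6)
print(l1, l2)

# NumPy's built-in vector norm gives the same values
print(np.linalg.norm(x, ord=1), np.linalg.norm(x, ord=2))
```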

#### Documentation:
- <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer">Normalizer</a>

In [7]:
# our data matrix example
X = np.array([[ 1., -1.,  2.], [ 2.,  0.,  0.], [ 0.,  1., -1.]])
print( X )
print( X.shape )

# create a normalizer model
model = preprocessing.Normalizer(norm="l1")
# normalize X; Normalizer is stateless, so fit() learns nothing from the data
X_tx = model.fit_transform( X )

# note: each row is divided by its own L1 norm (row-wise, not column-wise)
print( X_tx )

[[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]]
(3, 3)
[[ 0.25 -0.25  0.5 ]
 [ 1.    0.    0.  ]
 [ 0.    0.5  -0.5 ]]
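
The row-wise result above can be reproduced by dividing every row by its own norm. The sketch below does this by hand for the `L1` norm and, for comparison, for the `L2` norm. It assumes no row is all zeros, since that would mean dividing by zero.

```python
import numpy as np

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# L1 norm of every row, kept as a column so it broadcasts across each row
row_l1 = np.abs(X).sum(axis=1, keepdims=True)
X_l1 = X / row_l1
print(X_l1)

# same idea with the L2 norm: every row ends up with Euclidean length 1
row_l2 = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
X_l2 = X / row_l2
print(np.linalg.norm(X_l2, axis=1))
```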


## Data imputation using SimpleImputer

For each column of the matrix, replace missing values with a descriptive statistic of that column (`mean`, `median`, or `most_frequent`) or with a `constant` value.
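
The sketch below compares the four strategies on a small matrix with one missing value per column. The data and the dictionary `filled` are hypothetical, chosen so that the mean, median, and most frequent value of each column differ.

```python
import numpy as np
from sklearn import impute

# hypothetical data: the missing cells are X[3, 0] and X[1, 1]
X = np.array([[1., 4.], [1., np.nan], [7., 4.], [np.nan, 10.]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    model = impute.SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled[strategy] = model.fit_transform(X)

# the "constant" strategy uses fill_value as the replacement
model = impute.SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0.0)
filled["constant"] = model.fit_transform(X)

for name, result in filled.items():
    # show only the two cells that were missing
    print(name, result[3, 0], result[1, 1])
```

The statistic is always computed from the observed values of the same column, so the two columns receive different replacements.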

#### Documentation:
- <a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html">SimpleImputer</a>

In [5]:
# our data matrix example
X = np.array([[ 1., np.nan,  2.], [ 2.,  0.,  0.], [ 0.,  1., np.nan]])
print( X )
print( X.shape )

# create a simple imputer model
model = impute.SimpleImputer(missing_values=np.nan, strategy='mean')
# fit the model and transform X in a single call
X_tx = model.fit_transform( X )

# each nan has been replaced by the mean of its column
print( X_tx )

[[ 1. nan  2.]
 [ 2.  0.  0.]
 [ 0.  1. nan]]
(3, 3)
[[1.  0.5 2. ]
 [2.  0.  0. ]
 [0.  1.  1. ]]
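
The imputed matrix above can be checked by hand: the column means are computed over the observed entries only and then substituted wherever a value is missing. This is a minimal NumPy sketch of the `mean` strategy, not scikit-learn's implementation.

```python
import numpy as np

X = np.array([[1., np.nan, 2.], [2., 0., 0.], [0., 1., np.nan]])

# column means over the observed (non-nan) entries only
col_mean = np.nanmean(X, axis=0)
print(col_mean)

# keep observed values; substitute the column mean where a value is missing
X_filled = np.where(np.isnan(X), col_mean, X)
print(X_filled)
```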
