### Standardization

Standardization, or Z-score normalization, transforms each feature by subtracting its mean and dividing by its standard deviation. The resulting value is often called the Z-score. Standardization is often helpful when the data follow a Gaussian distribution, although this is not a requirement. Geometrically, it translates the data so that the mean vector of the original data moves to the origin, and then squishes or expands the points along each feature so that the standard deviation becomes 1. Because this is a linear transformation that only changes the mean and the standard deviation, the shape of the distribution is not affected: normally distributed data become standard normal, which is still normal.
Standardization is less constrained by outliers than range-based scaling because the transformed features have no predefined range. It is not immune to them, though: the mean and the standard deviation are computed from every point, so extreme values still shift and compress the result.
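To make the interaction with outliers concrete, the sketch below standardizes the same small sample with and without one extreme value. The arrays `clean` and `with_outlier` are made-up illustration data, and the computation assumes the population standard deviation (NumPy's default, `ddof=0`).

```python
import numpy as np

# Hypothetical data: identical except for the last value.
clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 500.0])

# Z-score: subtract the mean, divide by the (population) standard deviation.
z_clean = (clean - clean.mean()) / clean.std()
z_out = (with_outlier - with_outlier.mean()) / with_outlier.std()

print(z_clean)
print(z_out)

# Spread of the four shared points before and after adding the outlier.
spread_clean = z_clean[:4].max() - z_clean[:4].min()
spread_out = z_out[:4].max() - z_out[:4].min()
print(spread_clean, spread_out)
```

The transformed values are not clipped to any interval, but the outlier inflates the mean and the standard deviation, so the four ordinary points end up much closer together than they were without it.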

Formula for standardization:

$ x_{std} = \frac{x - \mu}{\sigma} $
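The formula can be applied directly with NumPy. This is a minimal sketch, not a replacement for a library scaler: the helper name `zscore` and the array `X` are hypothetical, and it assumes the population standard deviation (`ddof=0`) computed separately for each column.

```python
import numpy as np


def zscore(X):
    """Standardize each column of X as (x - mu) / sigma, with sigma the population std."""
    mu = X.mean(axis=0)      # per-column mean
    sigma = X.std(axis=0)    # per-column standard deviation (ddof=0)
    return (X - mu) / sigma


# Hypothetical data: two features on very different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

Z = zscore(X)
print(Z.mean(axis=0))  # close to 0 for each column
print(Z.std(axis=0))   # close to 1 for each column
```

Note that a column with zero variance would cause a division by zero here; this sketch does not handle that case.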

In [1]:
import numpy as np
from sklearn import preprocessing


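def standarizer(X):
    """Standardize the columns of X; return the scaled data, per-column means, and per-column variances."""
    scaler = preprocessing.StandardScaler().fit(X)
    scaled = scaler.transform(X)
    mean = scaler.mean_  # Per-feature mean learned during fit
    varia = scaler.var_  # Per-feature variance learned during fit

    return scaled, mean, varia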



scaled, mean, varia = standarizer(np.array([[100000, 150000, 350000, 200000], [1, 2, 3, 4]]).T)
print(mean)
print(varia)

[2.0e+05 2.5e+00]
[8.75e+09 1.25e+00]


In [2]:
print(scaled)
print(np.mean(scaled))
print(np.std(scaled))


[[-1.06904497 -1.34164079]
 [-0.53452248 -0.4472136 ]
 [ 1.60356745  0.4472136 ]
 [ 0.          1.34164079]]
0.0
1.0


After standardization, each feature has a mean of zero and a standard deviation of 1.
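The zero mean and unit standard deviation hold for each feature (column), not just for the array as a whole. A short check, reusing the same input values as the example and assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same two features as in the example: one large-scale, one small-scale.
X = np.array([[100000, 150000, 350000, 200000], [1, 2, 3, 4]]).T

scaled = StandardScaler().fit_transform(X)

# Statistics computed column by column (axis=0), i.e. per feature.
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means)
print(col_stds)
```

Both features end up on the same scale even though their original magnitudes differ by several orders of magnitude.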

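Sparse matrices, taken here to be matrices in which more than 50% of the entries are zero, need some adjustments. Centering subtracts a nonzero mean from every entry, which turns the zeros into nonzero values and destroys the sparsity, so such data should be scaled without centering (`with_mean=False`).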

In [3]:
sparse_mat = np.array([[1, 0, 0],
                       [0, 5, 6],
                       [0, 0, 7]])

def counting_zeros(mat):
    """Return the percentage of entries in mat that are zero."""
    n = np.count_nonzero(mat == 0)  # number of zero entries
    tot = mat.size                  # total number of entries

    return (n/tot)*100

counting_zeros(sparse_mat)


55.55555555555556

In [4]:
s, a, b = standarizer(sparse_mat)
s # Wrong: centering turned the zero entries into nonzero values, so the sparsity is lost

array([[ 1.41421356, -0.70710678, -1.40182605],
       [-0.70710678,  1.41421356,  0.53916387],
       [-0.70710678, -0.70710678,  0.86266219]])

In [5]:
preprocessing.StandardScaler(with_mean=False).fit_transform(sparse_mat)

array([[2.12132034, 0.        , 0.        ],
       [0.        , 2.12132034, 1.94098992],
       [0.        , 0.        , 2.26448824]])
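The arrays above are dense NumPy arrays that happen to contain many zeros. With an actual SciPy sparse matrix, `StandardScaler` refuses to center the data and raises a `ValueError`, while `with_mean=False` scales it and keeps the result sparse. The sketch below assumes SciPy and scikit-learn are installed; the variable names are illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

dense = np.array([[1, 0, 0],
                  [0, 5, 6],
                  [0, 0, 7]], dtype=float)
X_sparse = sparse.csr_matrix(dense)  # stores only the 4 nonzero entries

# Centering would make every zero nonzero, so it is rejected for sparse input.
try:
    StandardScaler().fit_transform(X_sparse)
    centering_rejected = False
except ValueError:
    centering_rejected = True

# Scaling by the standard deviation alone leaves zeros as zeros.
scaled_sparse = StandardScaler(with_mean=False).fit_transform(X_sparse)

print(centering_rejected)
print(sparse.issparse(scaled_sparse), scaled_sparse.nnz)
print(scaled_sparse.toarray())
```

The number of stored entries is unchanged after scaling, which is the point of skipping the centering step.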