#**Normalization**

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.

The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, either using the L1, L2, or max norms.

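Because each sample (row) is rescaled independently of the others, normalization puts samples of very different magnitudes on a comparable scale. This can help models that compare samples through dot products or cosine similarity.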


In [7]:
import numpy as np
from sklearn import preprocessing

In [8]:
data = [[ 1., -1.,  2.],  [ 2.,  0.,  0.],  [ 0.,  1., -1.]]
data

[[1.0, -1.0, 2.0], [2.0, 0.0, 0.0], [0.0, 1.0, -1.0]]

**1) L1 Normalization**

The L1 norm is calculated as the sum of the absolute values of the vector's components. Each sample is divided by its L1 norm, so the absolute values in every row sum to 1.


In [9]:
# L1: scale each row so that its absolute values sum to 1
normalized = preprocessing.normalize(data, norm='l1')
normalized

array([[ 0.25, -0.25,  0.5 ],
       [ 1.  ,  0.  ,  0.  ],
       [ 0.  ,  0.5 , -0.5 ]])
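
To make the definition concrete, the same result can be reproduced by hand with NumPy. This is a minimal sketch rather than scikit-learn's actual implementation; the names l1_norms and l1_manual are made up for illustration, and it assumes no row is all zeros (which would cause a division by zero here).

```python
import numpy as np

data = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# L1 norm of each row: sum of absolute values along axis 1.
# keepdims=True keeps the shape (3, 1) so the division broadcasts row-wise.
l1_norms = np.abs(data).sum(axis=1, keepdims=True)

# Divide every row by its own norm.
l1_manual = data / l1_norms
print(l1_manual)
```

The row norms here are 4, 2 and 2, which is why the first row is divided by 4 and the other two by 2.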

**2) L2 Normalization**

The L2 norm is calculated as the square root of the sum of the squared vector values. Each sample is divided by its L2 norm, so every row has Euclidean length 1.

In [10]:
# L2: scale each row to unit Euclidean length
normalized = preprocessing.normalize(data, norm='l2')
normalized

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
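
One reason L2 normalization pairs well with the dot product is that the dot product of two unit-length vectors is their cosine similarity. The sketch below illustrates this with plain NumPy; the names unit and cosine are chosen for this example only, and it again assumes no all-zero rows.

```python
import numpy as np

data = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# Scale each row to unit Euclidean length.
l2_norms = np.linalg.norm(data, axis=1, keepdims=True)
unit = data / l2_norms

# Pairwise dot products of unit rows = pairwise cosine similarities.
cosine = unit @ unit.T
print(np.round(cosine, 4))
```

The diagonal of cosine is 1 (every sample is perfectly similar to itself), and the entry for the second and third samples is 0 because those two vectors are orthogonal.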

**3) Max Normalization**

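The max norm is calculated as the maximum absolute value among the vector's components. Each sample is divided by its max norm, so the largest absolute value in every row is 1.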

In [11]:
# max: scale each row so that its largest absolute value is 1
normalized = preprocessing.normalize(data, norm='max')
normalized

array([[ 0.5, -0.5,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -1. ]])
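
The max norm corresponds to the infinity norm in NumPy. As a hedged sketch, assuming the norm is taken over absolute values and that no row is all zeros, the output above can be reproduced as follows (max_norms and max_manual are illustrative names):

```python
import numpy as np

data = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# ord=np.inf gives max(|x|) for each row when axis=1.
max_norms = np.linalg.norm(data, ord=np.inf, axis=1, keepdims=True)

# Divide every row by its largest absolute value.
max_manual = data / max_norms
print(max_manual)
```

Unlike L1 and L2 normalization, this keeps at least one entry of each row at exactly 1 or -1.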

#**Rescaling Data**

When our data is comprised of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes so that they all share the same scale, most commonly the range from 0 to 1.
This is useful for the optimization algorithms at the core of many machine learning methods, such as gradient descent.
It is also useful for algorithms that weight their inputs, like regression and neural networks, and for algorithms that use distance measures, like K-Nearest Neighbors.
We can rescale our data in scikit-learn with the MinMaxScaler class. Unlike normalize, which works on each sample (row), MinMaxScaler works on each attribute (column).


In [4]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
print(scaler.fit(data))

MinMaxScaler()


In [5]:
print(scaler.data_max_)

[2. 1. 2.]


In [6]:
print(scaler.transform(data))

[[0.5        0.         1.        ]
 [1.         0.5        0.33333333]
 [0.         1.         0.        ]]
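
With its default settings, MinMaxScaler maps each column to the range from 0 to 1 using (x - min) / (max - min). The sketch below applies that formula directly with NumPy to show where the numbers above come from; col_min, col_max and scaled are illustrative names, and the code assumes no column is constant (otherwise max - min would be zero).

```python
import numpy as np

data = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# Per-column minimum and maximum (axis=0 runs down the rows).
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Shift each column to start at 0, then divide by its range.
scaled = (data - col_min) / (col_max - col_min)
print(scaled)
```

Here col_max matches the data_max_ attribute printed earlier, and each column of scaled has a minimum of 0 and a maximum of 1.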


#**Standardizing Data**

Standardization rescales each attribute so that it has a mean of 0 and a standard deviation of 1. It is most useful for attributes that are roughly Gaussian but differ in their means and standard deviations, since the result then approximates a standard Gaussian distribution. We can standardize data in scikit-learn with the StandardScaler class.

In [12]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
print(scaler.fit(data))

StandardScaler()


In [13]:
print(scaler.mean_)

[1.         0.         0.33333333]


In [14]:
print(scaler.transform(data))

[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
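
The transformation is (x - mean) / std, applied column by column. As a minimal sketch, the output above can be reproduced with NumPy using the population standard deviation (ddof=0, NumPy's default); col_mean, col_std and standardized are illustrative names, and the code assumes no column has zero variance.

```python
import numpy as np

data = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# Per-column mean and population standard deviation (ddof=0).
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)

# Center each column at 0, then scale it to unit standard deviation.
standardized = (data - col_mean) / col_std
print(standardized)
```

Here col_mean matches the mean_ attribute printed earlier. Using the sample standard deviation (ddof=1) instead would give slightly smaller values that no longer match the StandardScaler output shown above.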
