# <span style="color:#54B1FF">Preprocessing:</span> &nbsp; <span style="color:#1B3EA9"><b>Normalization</b></span>

<br>

**Normalization** is the process of scaling vectors so that they have a magnitude of one. It appears in many mathematical problems. For example: in what direction does vector $\boldsymbol{r}$ point? The answer to this question is a unit vector, $\hat{\boldsymbol{r}} = \boldsymbol{r} / |\boldsymbol{r}|$;  **normalization** computes unit vectors.

This unit vector transformation is useful when you are more interested in **direction** than in **magnitude**. For example, if you are developing a machine learning algorithm whose goal is to predict the direction of travel of a walking person (based on a video stream, say), the calculation result (i.e., the target value) will be a unit vector. The input data (i.e., the features) might also include unit vectors, such as the directions of travel in previous frames.

Unit feature vectors can easily be calculated in **sklearn** using the `preprocessing.normalize` function.

Let's first import the packages we'll need for this notebook. We'll then consider some normalization examples.

<br>

In [1]:

import numpy as np
from matplotlib import pyplot as plt
from sklearn import preprocessing


<br>
<br>

___

## Normalization basics

The `preprocessing.normalize` function will transform a set of feature vectors into unit feature vectors:

<br>

In [2]:

np.random.seed(0)

x    = [50, 500, 1000] + 100 * np.random.rand(8, 3)
xn   = preprocessing.normalize(x)

print( x )
print()
print( xn )


[[ 104.88135039  571.51893664 1060.27633761]
 [ 104.4883183   542.36547993 1064.58941131]
 [  93.75872113  589.17730008 1096.36627605]
 [  88.34415188  579.17250381 1052.88949198]
 [ 106.80445611  592.55966383 1007.10360582]
 [  58.71292997  502.02183974 1083.26198455]
 [ 127.81567509  587.00121482 1097.86183422]
 [ 129.91585642  546.14793623 1078.05291763]]

[[0.08674637 0.47269792 0.87694455]
 [0.08712114 0.45221802 0.88764225]
 [0.07511668 0.47203119 0.8783758 ]
 [0.07331978 0.48067472 0.87382837]
 [0.09102386 0.50500764 0.85830178]
 [0.0491166  0.41996895 0.90620839]
 [0.10213143 0.46904477 0.87724921]
 [0.10688577 0.44933272 0.88694732]]
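<br>

As a rough sketch of what happens under the hood, the same result can be reproduced with NumPy alone by dividing each row by its magnitude. This assumes the default settings of `preprocessing.normalize` (L2 norm, applied row-wise); the names `mag` and `xn_manual` are introduced here only for illustration.

```python
import numpy as np
from sklearn import preprocessing

np.random.seed(0)
x = [50, 500, 1000] + 100 * np.random.rand(8, 3)

# magnitude (L2 norm) of each row, kept as a column so that it broadcasts across the row
mag       = np.linalg.norm(x, axis=1, keepdims=True)
xn_manual = x / mag

# compare with sklearn's result (default: L2 norm, row-wise)
xn = preprocessing.normalize(x)
print( np.allclose(xn_manual, xn) )
```

If the two approaches agree, the final line prints `True`.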


<br>
<br>

The feature vectors' magnitudes can be checked using the `np.linalg.norm` function like this:

<br>
<br>


In [3]:

print( np.linalg.norm(xn, axis=1) )


[1. 1. 1. 1. 1. 1. 1. 1.]


<br>
<br>

Alternatively, you can manually calculate the vector magnitude ( $|x| = \sqrt{x_0^2 + x_1^2 + x_2^2}$ ) like this:

<br>
<br>

In [4]:

m = np.sqrt(   xn[:,0]**2 + xn[:,1]**2 + xn[:,2]**2  )

print( m )


[1. 1. 1. 1. 1. 1. 1. 1.]


<br>
<br>

___

## Applying normalization to training and test sets

⚠️ Scaling parameters are calculated from the **training set** and can then be applied to the **test set**. Normalization, in contrast, is conducted on **individual feature vectors**, so normalization results do not depend on training.
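<br>

The row-wise nature of normalization can be checked directly. The sketch below (with illustrative variable names that are not part of the original notebook) normalizes one row on its own and as part of a larger set, then repeats the comparison with `StandardScaler`, whose output does depend on the other rows.

```python
import numpy as np
from sklearn import preprocessing

np.random.seed(0)
x = np.random.rand(5, 3)

# normalize all rows together, then the first row by itself
all_rows  = preprocessing.normalize(x)
first_row = preprocessing.normalize(x[:1])
same_norm = np.allclose(all_rows[0], first_row[0])

# standardize all rows together, then only the first three rows
s_all      = preprocessing.StandardScaler().fit_transform(x)
s_sub      = preprocessing.StandardScaler().fit_transform(x[:3])
same_scale = np.allclose(s_all[:3], s_sub)

print( same_norm )   # normalization: unaffected by the other rows
print( same_scale )  # standardization: affected by the other rows
```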

For example, we might be tempted to train a `Normalizer` on a training set like this:

<br>

In [5]:

np.random.seed(0)

x_train    = np.random.rand(10, 2)
x_test     = 10 * np.random.rand(3, 2)



normalizer = preprocessing.Normalizer()
normalizer.fit(x_train)   # this line actually does not fit data;  see below

x_train_n  = normalizer.transform( x_train )
x_test_n   = normalizer.transform( x_test )

print( np.linalg.norm(x_test_n, axis=1) )


[1. 1. 1.]


<br>
<br>

However, note that the `Normalizer` actually does not fit data. This can be seen when the `fit` method is bypassed:

<br>
<br>

In [6]:

np.random.seed(0)

normalizer = preprocessing.Normalizer()
x_train_n  = normalizer.transform( x_train )
x_test_n   = normalizer.transform( x_test )

print( np.linalg.norm(x_test_n, axis=1) )


[1. 1. 1.]


<br>
<br>

For many other processing routines (`StandardScaler`, for example), bypassing the `fit` method like this will result in an error.
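<br>

As a minimal illustration, the sketch below calls `transform` on an unfitted `StandardScaler`. It assumes that sklearn signals this with `sklearn.exceptions.NotFittedError`; the `raised` flag is only there to record the outcome.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.exceptions import NotFittedError

np.random.seed(0)
x_test = 10 * np.random.rand(3, 2)

scaler = preprocessing.StandardScaler()

try:
    scaler.transform( x_test )   # "fit" has not been called
    raised = False
except NotFittedError as err:
    raised = True
    print( type(err).__name__ )
```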

<br>

Why does `Normalizer` have a `fit` method if `fit` does nothing?

<br>

The answer to this question requires basic knowledge of data processing [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline). A pipeline consists of a collection of data processing routines that are usually conducted in sequential order. Since most processing routines must be fit before use (and will not work until `fit` is called), all processing routines in **sklearn** implement a `fit` method, regardless of whether `fit` actually does anything. This makes it easy to substitute processing routines for one another and to apply them sequentially. For example:

<br>
<br>

In [7]:

np.random.seed(0)

x_train    = np.random.rand(10, 2)
x_test     = 10 * np.random.rand(3, 2)

routine0   = preprocessing.StandardScaler()
routine1   = preprocessing.Normalizer()

# apply routines sequentially (each routine is fit to the output of the previous one):
for r in [routine0, routine1]:
    r.fit( x_train )  # if Normalizer did not have a "fit" method, this command would raise an error
    x_train = r.transform( x_train )
    x_test  = r.transform( x_test )
    print(x_test)
    print()


[[32.46150544 29.19906584]
 [14.34386741 28.45888677]
 [ 2.31989891 22.87227202]]

[[0.74347962 0.66875859]
 [0.45008362 0.89298641]
 [0.1009107  0.99489549]]



<br>

If the `Normalizer` routine did **not** have a `fit` method, this type of sequential processing would be difficult.

Refer to [sklearn's Pipeline documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) for more details regarding sequential data processing.
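<br>

For comparison, the same two-step processing can be sketched with sklearn's `make_pipeline` helper, which builds a `Pipeline` from a list of routines. This is one possible arrangement rather than the only one, and the variable names `pipe`, `x_test_p` and `x_test_m` are illustrative.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.pipeline import make_pipeline

np.random.seed(0)
x_train = np.random.rand(10, 2)
x_test  = 10 * np.random.rand(3, 2)

# pipeline: standardize, then normalize
pipe = make_pipeline( preprocessing.StandardScaler(), preprocessing.Normalizer() )
pipe.fit( x_train )              # calls "fit" on every step, including Normalizer
x_test_p = pipe.transform( x_test )

# the same two steps applied manually
scaler   = preprocessing.StandardScaler().fit( x_train )
x_test_m = preprocessing.normalize( scaler.transform(x_test) )

print( np.allclose(x_test_p, x_test_m) )
```

Because the pipeline calls `fit` on each step in turn, every step needs a `fit` method, which is why `Normalizer` provides one even though it learns nothing from the data.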