# Normalize

Normalization rescales each input vector (sample) individually so that it has unit norm. This is a common preprocessing step for text classification and clustering.

https://towardsdatascience.com/preprocessing-with-sklearn-a-complete-and-comprehensive-guide-670cb98fcfb9
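
One reason unit norms are convenient for text data: once two vectors are L2-normalized, their plain dot product equals their cosine similarity. A minimal sketch of that, where `doc_a` and `doc_b` are hypothetical term-count vectors with made-up values (they are not part of this notebook's data):

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two hypothetical term-count vectors (made-up values for illustration)
doc_a = np.array([[3.0, 0.0, 1.0, 2.0]])
doc_b = np.array([[6.0, 0.0, 2.0, 4.0]])  # same direction as doc_a, twice the length

# Cosine similarity computed directly from its definition
cosine = (doc_a @ doc_b.T).item() / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))

# After L2 normalization every row has length 1,
# so a plain dot product gives the same number
unit_a = normalize(doc_a, norm='l2')
unit_b = normalize(doc_b, norm='l2')
dot_of_units = (unit_a @ unit_b.T).item()

print(cosine, dot_of_units)
```

Because `doc_b` is just a longer copy of `doc_a`, both vectors collapse to the same unit vector, which is the sense in which normalization removes document length from the comparison.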

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn import preprocessing
from sklearn.preprocessing import Normalizer

In [2]:
data = [169.6,166.8,157.1,181.1,158.4,165.6,166.7,156.5,168.1,165.3]
df = pd.DataFrame(data, columns=['TB'])
df

      TB
0  169.6
1  166.8
2  157.1
3  181.1
4  158.4
5  165.6
6  166.7
7  156.5
8  168.1
9  165.3


- Normalization (each vector $x$ is divided by its norm):
    - Max norm $\displaystyle x' = \frac {x} {\max_i |x_i|}$
    - L1 norm $\displaystyle x' = \frac {x} {\sum_i |x_i|}$
    - L2 norm $\displaystyle x' = \frac {x} {\sqrt {\sum_i x_i^2}}$
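
The max and L1 norms are defined on absolute values, which only becomes visible once a vector has negative entries (the TB data in this notebook is all positive). A small sketch with a made-up vector `v`, using `np.linalg.norm` rather than any scikit-learn call:

```python
import numpy as np

v = np.array([-4.0, 3.0, 0.0])  # hypothetical vector with a negative entry

max_norm = np.linalg.norm(v, ord=np.inf)  # max(|v_i|)         = 4
l1_norm = np.linalg.norm(v, ord=1)        # sum(|v_i|)         = 7
l2_norm = np.linalg.norm(v, ord=2)        # sqrt(sum(v_i ** 2)) = 5

print(v / max_norm)
print(v / l1_norm)
print(v / l2_norm)

# A plain sum or max without absolute values gives different divisors here
print(v.sum(), v.max())  # -1.0 and 3.0, neither of which is a norm of v
```

For data that is entirely positive, `x.max()` and `x.sum()` coincide with the max and L1 norms, so the shorter expressions are safe in that case.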

In [26]:
# 0. Manual normalization
# Max norm: x_normalized = x / max(|x|)  (all TB values are positive, so max(x) is enough)

df['TB'] / df['TB'].max()

0    0.936499
1    0.921038
2    0.867477
3    1.000000
4    0.874655
5    0.914412
6    0.920486
7    0.864163
8    0.928216
9    0.912755
Name: TB, dtype: float64

In [27]:
# L1 norm: x_normalized = x / sum(|x|)  (all TB values are positive, so sum(x) is enough)

df['TB'] / df['TB'].sum()

0    0.102465
1    0.100773
2    0.094913
3    0.109413
4    0.095698
5    0.100048
6    0.100713
7    0.094551
8    0.101559
9    0.099867
Name: TB, dtype: float64

In [28]:
# L2 norm: x_normalized = x / sqrt(sum(x ** 2))

df['TB'] / np.sqrt(np.sum(df['TB'] ** 2))

0    0.323744
1    0.318399
2    0.299883
3    0.345696
4    0.302365
5    0.316108
6    0.318208
7    0.298738
8    0.320881
9    0.315536
Name: TB, dtype: float64

In [19]:
# 1. Normalizer(): works row-wise (one sample per row), so the TB column is passed as a single row

print(Normalizer(norm='max').fit_transform([df['TB'].values]))
print(Normalizer(norm='l1').fit_transform([df['TB'].values]))
print(Normalizer(norm='l2').fit_transform([df['TB'].values]))

[[0.93649917 0.9210381  0.86747653 1.         0.87465489 0.91441193
  0.92048592 0.86416345 0.92821645 0.91275538]]
[[0.10246496 0.10077332 0.094913   0.10941276 0.09569841 0.10004833
  0.1007129  0.09455051 0.10155872 0.09986709]]
[[0.32374385 0.31839902 0.29988301 0.34569582 0.30236454 0.31610838
  0.31820813 0.29873769 0.32088055 0.31553572]]
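
The row-wise behaviour is easier to see on a matrix with more than one row. A minimal sketch, assuming a made-up 2x2 matrix `X` (not part of the TB data):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two samples (rows) with two features each; the values are made up
X = np.array([[3.0, 4.0],
              [6.0, 8.0]])

X_l2 = Normalizer(norm='l2').fit_transform(X)
print(X_l2)

# Each row is rescaled on its own, so every row ends up with L2 length 1
row_lengths = np.linalg.norm(X_l2, axis=1)
print(row_lengths)
```

Both rows map to `[0.6, 0.8]` because they point in the same direction. `Normalizer` is stateless: `fit` learns nothing from the data, so `transform` alone gives the same result.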


In [21]:
# 2. normalize(): works row-wise by default (axis=1); axis=0 normalizes each column instead

print(preprocessing.normalize(df[['TB']], axis=0, norm='max'))
print(preprocessing.normalize(df[['TB']], axis=0, norm='l1'))
print(preprocessing.normalize(df[['TB']], axis=0, norm='l2'))

[[0.93649917]
 [0.9210381 ]
 [0.86747653]
 [1.        ]
 [0.87465489]
 [0.91441193]
 [0.92048592]
 [0.86416345]
 [0.92821645]
 [0.91275538]]
[[0.10246496]
 [0.10077332]
 [0.094913  ]
 [0.10941276]
 [0.09569841]
 [0.10004833]
 [0.1007129 ]
 [0.09455051]
 [0.10155872]
 [0.09986709]]
[[0.32374385]
 [0.31839902]
 [0.29988301]
 [0.34569582]
 [0.30236454]
 [0.31610838]
 [0.31820813]
 [0.29873769]
 [0.32088055]
 [0.31553572]]
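
The `axis` argument matters a lot for a single-column input like `df[['TB']]`. A short sketch of the difference, using only the first three TB values as a column array (the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import normalize

col = np.array([[169.6], [166.8], [157.1]])  # first three TB values, one per row

# Default axis=1: each row is a one-element vector, so every value becomes 1.0
by_row = normalize(col, norm='l2')

# axis=0: the whole column is treated as one vector
by_col = normalize(col, norm='l2', axis=0)

print(by_row.ravel())
print(by_col.ravel())

# axis=0 is equivalent to transposing, normalizing the rows, and transposing back
via_transpose = normalize(col.T, norm='l2').T
```

With the default `axis=1`, normalizing a single feature column discards all information, which is why the cell above passes `axis=0`.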
