# Standardization and Normalization
---

## Standardization vs. Normalization

**`Standardization`** scales **features** while **`Normalization`** scales **samples**.

* **`Standardization`** - subtracts each feature's mean and then divides by that feature's standard deviation. Every scaled feature has **mean 0** and **standard deviation 1**.
* **`Normalization`** - **scales individual samples** so that all samples have **unit norm**.

However, the `scale` and `normalize` functions both take an **`axis`** parameter that controls the scaling axis, so either one can operate by features or by samples.
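To make the role of the `axis` parameter concrete, here is a minimal sketch on a small made-up array (`X_demo` is hypothetical and not part of the Iris data used below). It relies on the documented defaults: `scale` works by features (`axis=0`) and `normalize` works by samples (`axis=1`).

```python
import numpy as np
from sklearn.preprocessing import scale, normalize

# Hypothetical 3x3 array: rows are samples, columns are features.
X_demo = np.array([[1.0, 10.0, 100.0],
                   [2.0, 20.0, 300.0],
                   [3.0, 60.0, 200.0]])

# Standardize each feature (column): the default behaviour of scale.
by_feature = scale(X_demo, axis=0)
# Standardize each sample (row) instead.
by_sample = scale(X_demo, axis=1)

# Normalize each sample (row) to unit L2 norm: the default behaviour of normalize.
unit_rows = normalize(X_demo, norm='l2', axis=1)
# Normalize each feature (column) to unit L2 norm instead.
unit_cols = normalize(X_demo, norm='l2', axis=0)

print(by_feature.mean(axis=0))             # close to 0 for every column
print(by_sample.mean(axis=1))              # close to 0 for every row
print(np.linalg.norm(unit_rows, axis=1))   # close to 1 for every row
print(np.linalg.norm(unit_cols, axis=0))   # close to 1 for every column
```

Switching `axis` changes which direction is treated as "one vector to rescale", not the formula itself.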

## Import data

In [2]:
from sklearn.datasets import load_iris

In [3]:
X,y = load_iris(return_X_y=True)

In [5]:
X[:5]

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])

## Standardization

In [6]:
from sklearn.preprocessing import scale

In [9]:
X_scaled = scale(X, axis=0)

In [10]:
X_scaled[:5]

array([[-0.90068117,  1.03205722, -1.3412724 , -1.31297673],
       [-1.14301691, -0.1249576 , -1.3412724 , -1.31297673],
       [-1.38535265,  0.33784833, -1.39813811, -1.31297673],
       [-1.50652052,  0.10644536, -1.2844067 , -1.31297673],
       [-1.02184904,  1.26346019, -1.3412724 , -1.31297673]])

### Feature means

In [16]:
X_scaled.mean(axis=0)

array([ -1.69031455e-15,  -1.63702385e-15,  -1.48251781e-15,
        -1.62314606e-15])

### Feature standard deviations

In [19]:
X_scaled.std(axis=0)

array([ 1.,  1.,  1.,  1.])
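The checks above can be reproduced by hand, which may help clarify what standardization does. The following is a rough NumPy-only sketch on a hypothetical array (`X_demo` is made up for illustration); it assumes the population standard deviation (`ddof=0`), which is what NumPy's `std` uses by default.

```python
import numpy as np

# Hypothetical data: the second feature is the first one on a 200x larger scale.
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0],
                   [4.0, 800.0]])

mu = X_demo.mean(axis=0)     # per-feature mean
sigma = X_demo.std(axis=0)   # per-feature standard deviation (ddof=0)

# Standardize: subtract the feature mean, then divide by the feature std.
X_manual = (X_demo - mu) / sigma

print(X_manual.mean(axis=0))  # close to 0 for each feature
print(X_manual.std(axis=0))   # close to 1 for each feature
```

After scaling, both columns hold the same values, because standardization removes differences in location and scale between features. Note that a feature with zero variance would cause a division by zero in this sketch, so it is only suitable for features that actually vary.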

### Standard Scaling with `StandardScaler`

The `StandardScaler` class implements the `Transformer` API, which lets you apply the same transformation (**mean and standard deviation from the train dataset**) to both the train and test datasets.

In [22]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [21]:
X_train, X_test = train_test_split(X, test_size=0.3)

In [31]:
ss = StandardScaler().fit(X_train)

In [32]:
## scaler mean
ss.mean_

array([ 5.79904762,  3.07809524,  3.68666667,  1.19809524])

In [34]:
## scaler standard deviation
ss.scale_

array([ 0.80811306,  0.4349014 ,  1.74585586,  0.78254541])

#### Transform X_train

In [26]:
X_train_scaled = ss.transform(X_train)

In [27]:
X_train_scaled[:5]

array([[ 0.86739396,  0.28030437,  0.80953609,  1.02473895],
       [-0.12256654, -0.40950716,  0.29402962,  0.13022217],
       [-1.73125234, -0.17956998, -1.3670468 , -1.27544705],
       [-0.2463116 , -0.63944434,  0.6949791 ,  1.02473895],
       [-1.11252703, -1.32925586,  0.46586511,  0.64137461]])

#### Transform X_test

In [35]:
X_test_scaled = ss.transform(X_test)

In [36]:
X_test_scaled[:5]

array([[ 1.85735445, -0.40950716,  1.49687806,  0.76916272],
       [ 0.37241371, -0.17956998,  0.6949791 ,  0.76916272],
       [-0.37005666, -1.55919304,  0.06491563, -0.12535405],
       [ 0.61990383, -0.63944434,  0.80953609,  0.38579839],
       [ 0.49615877, -2.01906739,  0.46586511,  0.38579839]])

As you can see, the mean and standard deviation of the scaled train dataset are 0 and 1, respectively. This is not the case for the test dataset, because the test dataset is scaled with the train mean and standard deviation.

In [39]:
X_train_scaled.mean(axis=0)

array([ -2.46363776e-16,  -7.88787025e-16,  -4.69465736e-16,
        -3.80647894e-17])

In [40]:
X_test_scaled.mean(axis=0)

array([ 0.18267128, -0.1846797 ,  0.13746839,  0.00243406])

In [41]:
X_train_scaled.std(axis=0)

array([ 1.,  1.,  1.,  1.])

In [42]:
X_test_scaled.std(axis=0)

array([ 1.05827554,  0.96645231,  1.01751721,  0.90319671])
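One way to confirm that the test set is scaled with the train statistics is to apply the stored `mean_` and `scale_` by hand and compare with `transform`. This is a small sketch; the `random_state=0` argument is an assumption added here only to make the split repeatable, so the numbers will differ from the outputs shown above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
# random_state is an assumption for reproducibility, not part of the original split.
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Fit on the train set only: this stores the train mean and standard deviation.
ss = StandardScaler().fit(X_train)

# Apply the train statistics to the test set manually.
X_test_manual = (X_test - ss.mean_) / ss.scale_

print(np.allclose(ss.transform(X_test), X_test_manual))
print(np.allclose(ss.mean_, X_train.mean(axis=0)))
print(np.allclose(ss.scale_, X_train.std(axis=0)))
```

Fitting on the train set alone keeps information about the test set out of the preprocessing step.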

## Normalization

**Norm Options**

* L1-norm - sum of absolute values. **The absolute feature values** of a normalized sample add up to 1.
* L2-norm - Euclidean length. **The squares of feature values** of a normalized sample add up to 1.
* Max-norm - each sample is divided by its maximum absolute feature value, so the largest absolute value in a normalized sample is 1.
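The three options above can be sketched with plain NumPy on a single sample. The vector `x` below is the first Iris row shown earlier; the variable names are illustrative only.

```python
import numpy as np

# One sample (the first Iris row).
x = np.array([5.1, 3.5, 1.4, 0.2])

# L1: divide by the sum of absolute values.
x_l1 = x / np.abs(x).sum()
# L2: divide by the Euclidean length.
x_l2 = x / np.sqrt((x ** 2).sum())
# Max: divide by the largest absolute value.
x_max = x / np.abs(x).max()

print(np.abs(x_l1).sum())    # close to 1
print((x_l2 ** 2).sum())     # close to 1
print(np.abs(x_max).max())   # close to 1
```

In every case the sample is divided by a single number computed from that sample alone, so the ratios between its feature values are preserved.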

### L1 Norm

In [43]:
from sklearn.preprocessing import normalize

In [51]:
X_norm_l1 = normalize(X, norm='l1')

In [54]:
X_norm_l1[:5]

array([[ 0.5       ,  0.34313725,  0.1372549 ,  0.01960784],
       [ 0.51578947,  0.31578947,  0.14736842,  0.02105263],
       [ 0.5       ,  0.34042553,  0.13829787,  0.0212766 ],
       [ 0.4893617 ,  0.32978723,  0.15957447,  0.0212766 ],
       [ 0.49019608,  0.35294118,  0.1372549 ,  0.01960784]])

In [53]:
X_norm_l1.sum(axis=1)[:5]

array([ 1.,  1.,  1.,  1.,  1.])

### L2 Norm

In [55]:
X_norm_l2 = normalize(X, norm='l2')
X_norm_l2[:5]

array([[ 0.80377277,  0.55160877,  0.22064351,  0.0315205 ],
       [ 0.82813287,  0.50702013,  0.23660939,  0.03380134],
       [ 0.80533308,  0.54831188,  0.2227517 ,  0.03426949],
       [ 0.80003025,  0.53915082,  0.26087943,  0.03478392],
       [ 0.790965  ,  0.5694948 ,  0.2214702 ,  0.0316386 ]])

In [58]:
(X_norm_l2**2).sum(axis=1)[:5]

array([ 1.,  1.,  1.,  1.,  1.])

### Max Norm

In [63]:
X_norm_max = normalize(X, norm='max')

In [65]:
X[:5]

array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2]])

In [66]:
X_norm_max[:5]

array([[ 1.        ,  0.68627451,  0.2745098 ,  0.03921569],
       [ 1.        ,  0.6122449 ,  0.28571429,  0.04081633],
       [ 1.        ,  0.68085106,  0.27659574,  0.04255319],
       [ 1.        ,  0.67391304,  0.32608696,  0.04347826],
       [ 1.        ,  0.72      ,  0.28      ,  0.04      ]])

## Transformer API for normalization

In [67]:
X_train, X_test = train_test_split(X, test_size=0.3)

In [68]:
from sklearn.preprocessing import Normalizer

In [72]:
normalizer = Normalizer(norm='l2').fit(X_train)

### Transform train data

In [74]:
X_train_norm = normalizer.transform(X_train)

In [75]:
X_train_norm[:5]

array([[ 0.69276796,  0.31889319,  0.61579374,  0.1979337 ],
       [ 0.72785195,  0.32870733,  0.56349829,  0.21131186],
       [ 0.68619022,  0.31670318,  0.61229281,  0.232249  ],
       [ 0.72366005,  0.32162669,  0.58582004,  0.17230001],
       [ 0.73350949,  0.35452959,  0.55013212,  0.18337737]])

In [77]:
(X_train_norm**2).sum(axis=1)[:5]

array([ 1.,  1.,  1.,  1.,  1.])

### Transform test data

In [78]:
X_test_norm = normalizer.transform(X_test)

In [79]:
(X_test_norm**2).sum(axis=1)[:5]

array([ 1.,  1.,  1.,  1.,  1.])
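Unlike `StandardScaler`, `Normalizer` does not learn any statistics from the data it is fitted on: each row is rescaled using only its own values, which is why the test samples above also end up with unit norm. A minimal sketch with two hypothetical arrays (`X_a` and `X_b` are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

# Two hypothetical datasets with very different values.
X_a = np.array([[3.0, 4.0],
                [1.0, 0.0]])
X_b = np.array([[6.0, 8.0],
                [0.0, 2.0]])

# Fit one normalizer on each dataset, then transform the same data with both.
out_fit_on_a = Normalizer(norm='l2').fit(X_a).transform(X_b)
out_fit_on_b = Normalizer(norm='l2').fit(X_b).transform(X_b)

print(np.allclose(out_fit_on_a, out_fit_on_b))    # the fitted data makes no difference
print(np.allclose(out_fit_on_a, normalize(X_b)))  # same result as the normalize function
print(out_fit_on_a)
```

The `fit` step exists so that `Normalizer` can be used wherever a transformer is expected, for example as one step in a preprocessing chain.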