## **Normalization**

### **Theory**

Normalization is the process of converting a numerical feature into a standard range of values, typically [-1, 1] or [0, 1]. For example, suppose we have a data set with two features named "**Age**" and "**Weight**", as shown below:

In [None]:
import pandas as pd

In [None]:
X = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
y = [5, 8, 13, 17, 27, 33, 36, 40, 50, 70, 78, 80, 100, 103, 108, 109, 113, 120, 123, 130]

In [None]:
df = pd.DataFrame(list(zip(X, y)), columns =['Age', 'Weight'])
df

|   | Age | Weight |
|---|-----|--------|
| 0 | 5 | 5 |
| 1 | 10 | 8 |
| 2 | 15 | 13 |
| 3 | 20 | 17 |
| 4 | 25 | 27 |
| 5 | 30 | 33 |
| 6 | 35 | 36 |
| 7 | 40 | 40 |
| 8 | 45 | 50 |
| 9 | 50 | 70 |


Suppose the actual range of the feature "**Age**" is **5** to **100**. We can normalize these values into the range **[0, 1]** by subtracting **5** from every value of the "**Age**" column and then dividing the result by **95** (100 − 5). To make this clearer, we can write it as a formula:

![alt text](https://cdn-images-1.medium.com/max/800/1*i0oJBKdU7QgTLjwRTAvhIA.png)

where min^(j) and max^(j) are the minimum and the maximum values of the feature j in the dataset.
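As a minimal sketch of the formula, the "**Age**" example can be worked through with plain NumPy. The variable names (`age`, `age_scaled`) are illustrative and not part of the original notebook.

```python
import numpy as np

# Age values from the example data set: 5, 10, ..., 100
age = np.arange(5, 101, 5, dtype=float)

# Min-max normalization: (x - min) / (max - min)
age_min, age_max = age.min(), age.max()
age_scaled = (age - age_min) / (age_max - age_min)

print(age_scaled[:3])                      # first three scaled values
print(age_scaled.min(), age_scaled.max())  # bounds of the scaled feature
```

The smallest age maps to 0, the largest to 1, and every other value lands in between in proportion to its distance from the minimum.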



---



## **Implementation**

Now that you know the theory, let's see how to put it into practice. As usual, there are two ways to implement this: the **traditional, old-school manual method**, and the `sklearn.preprocessing` module. Today let's use the `sklearn` library to perform normalization.


### **Using sklearn preprocessing - Normalizer**


Before passing the "**Age**" and "**Weight**" values to the method, we need to convert these data frame columns into `numpy` arrays. To do this we can use the `to_numpy()` method as shown below:

In [None]:
# Store the Age column in X and the Weight column in y
X = df['Age']
y = df['Weight']
X = X.to_numpy()
y = y.to_numpy()

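The above step is important because both the `fit()` and the `transform()` methods work on array-like input. More precisely, they expect a 2-D array of shape (n_samples, n_features), which is why `X` and `y` are wrapped in a list in the cells below.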

In [None]:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer().fit([X])
normalizer.transform([X])

array([[0.01866633, 0.03733267, 0.055999  , 0.07466534, 0.09333167,
        0.11199801, 0.13066434, 0.14933068, 0.16799701, 0.18666335,
        0.20532968, 0.22399602, 0.24266235, 0.26132869, 0.27999502,
        0.29866136, 0.31732769, 0.33599403, 0.35466036, 0.3733267 ]])

In [None]:
normalizer = Normalizer().fit([y])
normalizer.transform([y])

array([[0.01394837, 0.02231739, 0.03626577, 0.04742446, 0.07532121,
        0.09205925, 0.10042828, 0.11158697, 0.13948372, 0.1952772 ,
        0.2175946 , 0.22317395, 0.27896743, 0.28733646, 0.30128483,
        0.3040745 , 0.3152332 , 0.33476092, 0.34312994, 0.36265766]])

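As seen above, both arrays have values in the range **[0, 1]**. Note, however, that `Normalizer` rescales each sample (row) to unit L2 norm rather than applying the min-max formula from the theory section, so the smallest value is not mapped to 0 and the largest is not mapped to 1. For column-wise min-max scaling, `sklearn` provides `MinMaxScaler`. More details about the library can be found below: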

[Pre-processing data](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-normalization)
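To make the difference concrete, here is a hedged sketch that applies both `MinMaxScaler` (column-wise min-max scaling) and `Normalizer` (row-wise unit-norm scaling) to the same "**Age**" values. The variable names are illustrative; only the two `sklearn` classes come from the library.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

age = np.arange(5, 101, 5, dtype=float)

# Scalers expect a 2-D array of shape (n_samples, n_features),
# so the single feature is reshaped into one column of 20 rows.
age_col = age.reshape(-1, 1)

# Column-wise min-max scaling: smallest value -> 0, largest -> 1
age_minmax = MinMaxScaler().fit_transform(age_col).ravel()

# Row-wise L2 normalization, as in the cells above:
# the 20 values are treated as one sample and scaled to unit length
age_unit = Normalizer().fit_transform([age]).ravel()

print(age_minmax.min(), age_minmax.max())
print(np.linalg.norm(age_unit))
```

If the goal is the [0, 1] range described in the theory section, `MinMaxScaler` is the closer match; `Normalizer` is meant for giving each sample unit length.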



---



## **When should we actually normalize the data?**

Normalization is not a strict requirement, but it can help in two ways:



1.   Normalizing the data often **increases the speed of learning**, both when building (training) the model and when testing it. Give it a try!

2.   It helps avoid **numeric overflow**. What this really means is that normalization keeps our inputs in a relatively small range, which avoids problems because computers usually have trouble dealing with very small or very large numbers.
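As a small illustration of the overflow point, the sketch below feeds raw and min-max scaled inputs to `np.exp`. The input values are made up for this example.

```python
import numpy as np

# Made-up raw inputs; the largest one is far too big for exp() in float64
raw = np.array([5.0, 500.0, 1000.0])

with np.errstate(over="ignore"):   # silence the overflow warning for the demo
    overflowed = np.exp(raw)       # exp(1000) exceeds the float64 range -> inf

# The same inputs after min-max scaling into [0, 1]
scaled = (raw - raw.min()) / (raw.max() - raw.min())
safe = np.exp(scaled)              # every result is finite

print(overflowed)
print(safe)
```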



---





## **Standardization**

### **Theory**

Standardization, or **z-score normalization**, is a technique of rescaling the values of a feature such that they have the properties of a standard normal distribution with **μ** = 0 (the mean, or average value, of the feature) and **σ** = 1 (the standard deviation from the mean). Unlike min-max scaling, it does not force the values into a fixed range. This can be written as:

![alt text](https://cdn-images-1.medium.com/max/800/1*JAmQxAfwtO9AM1xTzMIIXw.png)
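In case the image does not load, the usual z-score formula can be written in the same per-feature notation used for normalization above (the hat symbol for the rescaled value is an assumption made here):

$$
\hat{x}^{(j)} = \frac{x^{(j)} - \mu^{(j)}}{\sigma^{(j)}}
$$

where μ^(j) and σ^(j) are the mean and the standard deviation of the feature j in the dataset.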

## **When to standardize:**

1️⃣ Linear distances: the model works in a linear space

Examples:
-   k-Nearest Neighbors (kNN)
-   Linear regression
-   K-Means Clustering

2️⃣ Dataset features have high variance

3️⃣ Different scales: features are on different scales

Example: predicting house prices using the number of bedrooms and the last sale price.
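The house-price example can be sketched as follows. The three houses are hypothetical, and the helper `euclidean` is defined here only for illustration.

```python
import numpy as np

# Hypothetical houses: [number of bedrooms, last sale price in dollars]
houses = np.array([
    [2, 300_000],
    [5, 301_000],
    [2, 310_000],
], dtype=float)

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw features: the price column dominates every distance
d_raw_01 = euclidean(houses[0], houses[1])  # 3 bedrooms apart, $1,000 apart
d_raw_02 = euclidean(houses[0], houses[2])  # same bedrooms, $10,000 apart

# Standardize each column (population standard deviation, ddof=0)
scaled = (houses - houses.mean(axis=0)) / houses.std(axis=0)
d_std_01 = euclidean(scaled[0], scaled[1])
d_std_02 = euclidean(scaled[0], scaled[2])

print(d_raw_01, d_raw_02)
print(d_std_01, d_std_02)
```

On the raw data the three-bedroom difference is practically invisible next to the price difference. After standardization both features contribute on a comparable scale, which is what distance-based models such as kNN and K-Means rely on.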



## **Implementation**

There are plenty of ways to implement standardization. Just as with normalization, we can use the `sklearn` library and its `StandardScaler` class, as shown below:



In [None]:
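from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# StandardScaler works column-wise on a 2-D array of shape (n_samples, n_features),
# so each 1-D array is reshaped into a single column first
X_scaled = sc.fit_transform(X.reshape(-1, 1))
y_scaled = sc.fit_transform(y.reshape(-1, 1))

Each scaled column now has mean 0 and standard deviation 1. Be careful with the input shape: passing `[X]` would treat the 20 values as one sample with 20 features, each feature would then hold a single value equal to its own mean, and the result would be an array of zeros.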

You can read more about the library from below:

[Pre-processing data](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling)
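As a quick sanity check, the sketch below standardizes both features at once and confirms that each column ends up with mean 0 and standard deviation 1. It assumes the two features are stacked into one (20, 2) array; the names `data` and `scaled` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

age = np.arange(5, 101, 5, dtype=float)
weight = np.array([5, 8, 13, 17, 27, 33, 36, 40, 50, 70,
                   78, 80, 100, 103, 108, 109, 113, 120, 123, 130], dtype=float)

# One row per sample, one column per feature: shape (20, 2)
data = np.column_stack([age, weight])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

print(scaler.mean_)         # per-column means learned by fit
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```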



---



## **Z-Score Normalization**

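Similarly, we can use the pandas `mean` and `std` methods to compute z-scores directly. Note that pandas `std` uses the sample standard deviation (`ddof=1`) by default, whereas `StandardScaler` uses the population standard deviation (`ddof=0`), so the two results differ slightly.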

In [None]:
# Subtract each column's mean and divide by its standard deviation
df_z = (df - df.mean())/df.std()
df_z

|   | Age | Weight |
|---|-----|--------|
| 0 | -1.605793 | -1.458724 |
| 1 | -1.436762 | -1.389426 |
| 2 | -1.267731 | -1.273929 |
| 3 | -1.098701 | -1.181531 |
| 4 | -0.92967 | -0.950538 |
| 5 | -0.760639 | -0.811942 |
| 6 | -0.591608 | -0.742644 |
| 7 | -0.422577 | -0.650247 |
| 8 | -0.253546 | -0.419253 |
| 9 | -0.084515 | 0.042734 |
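The pandas result and the `StandardScaler` result are not identical, because the two use different `ddof` defaults for the standard deviation. The sketch below, with illustrative variable names, shows how to make them agree.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

frame = pd.DataFrame({
    "Age": list(range(5, 101, 5)),
    "Weight": [5, 8, 13, 17, 27, 33, 36, 40, 50, 70,
               78, 80, 100, 103, 108, 109, 113, 120, 123, 130],
})

# pandas default: sample standard deviation (ddof=1)
z_sample = (frame - frame.mean()) / frame.std()

# population standard deviation (ddof=0), which is what StandardScaler uses
z_population = (frame - frame.mean()) / frame.std(ddof=0)

z_sklearn = StandardScaler().fit_transform(frame)

print(np.allclose(z_population, z_sklearn))  # same scaling convention
print(np.allclose(z_sample, z_sklearn))      # differs by a constant factor
```

With 20 rows the two versions differ by a factor of sqrt(20/19), which rarely matters for modelling but explains why the numbers do not match exactly.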




---



## **Min-Max scaling**


Here we can use the pandas `min` and `max` methods to apply min-max scaling:



In [None]:
# Subtract each column's minimum and divide by its range (max - min)
df_minmax = (df - df.min())/(df.max() - df.min())
df_minmax

|   | Age | Weight |
|---|-----|--------|
| 0 | 0.0 | 0.0 |
| 1 | 0.052632 | 0.024 |
| 2 | 0.105263 | 0.064 |
| 3 | 0.157895 | 0.096 |
| 4 | 0.210526 | 0.176 |
| 5 | 0.263158 | 0.224 |
| 6 | 0.315789 | 0.248 |
| 7 | 0.368421 | 0.28 |
| 8 | 0.421053 | 0.36 |
| 9 | 0.473684 | 0.52 |


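Usually, **Z-score normalization** is preferred when the data contains outliers, because min-max scaling is sensitive to them: a single extreme value squeezes all the other values into a narrow part of the **[0, 1]** range.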
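One practical weakness of min-max scaling is how it reacts to extreme values. The sketch below uses made-up numbers and a small helper, `min_max`, defined only for this example.

```python
import numpy as np

def min_max(x):
    """Rescale a 1-D array into [0, 1] using its own minimum and maximum."""
    return (x - x.min()) / (x.max() - x.min())

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(values, 1000.0)  # one extreme value added

spread_before = min_max(values)
spread_after = min_max(with_outlier)

print(spread_before)     # the five values span the whole [0, 1] range
print(spread_after[:5])  # the same five values, now squeezed near 0
```

Z-scores are also influenced by outliers through the mean and the standard deviation, but they do not pin the output range to the most extreme values.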






---
**References**
-   [The Hundred-Page Machine Learning Book by Andriy Burkov](http://themlbook.com/) (Chapter 5)

-   Datacamp.com
