# Normalization
<mark>**Normalization**</mark>, in the **_general sense_**, refers to rescaling numeric columns into the **same terms** so that they fall within a smaller, standard range of values and are easier to compare.

This is useful because variables are typically measured in different units and on different scales. For example, a patient's age is measured in years, while heart rate is measured in beats per minute. Because these variables are expressed in completely different terms, it is difficult to compare their raw values directly.

However, normalization, in the **_specific sense_**, refers to scaling variables so that their values fall between 0 and 1 inclusive (the minimum maps to 0 and the maximum maps to 1), making it easier to compare them. This is also known as min-max scaling.

$$\large X_{normalized} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
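
The formula above can be sketched in plain Python. This is a minimal illustration rather than a definitive implementation: the function name `min_max_scale` and the list of ages are hypothetical (only 29, 41, and 74 come from the example in this section), and returning zeros for a constant column is just one possible convention.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1] using min-max scaling."""
    lo, hi = min(values), max(values)
    span = hi - lo  # the range: max minus min
    if span == 0:
        # All values are identical, so the formula would divide by zero.
        # Returning zeros is an assumption made for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical ages; 29 is the minimum and 74 is the maximum.
ages = [41, 29, 74, 56]
scaled = min_max_scale(ages)
print([round(s, 2) for s in scaled])  # [0.27, 0.0, 1.0, 0.6]
```

The minimum always maps to 0 and the maximum always maps to 1, and every other value lands somewhere in between.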

To normalize the values in the age column below, we first need to identify the minimum and maximum values.

<img src='./images/n1.png'>



Next, we calculate the difference between the maximum and minimum values (known as the <mark>range</mark>). In this case, the max-min difference is 74 - 29 = **45**. This becomes the denominator for our scaling.

<img src='./images/n2.png'>

Now, let's try scaling the first age in the column. We will take 41 and subtract the minimum age from it.

<img src='./images/n3.png'>

That's 41 - 29, which yields 12. So to scale the first age, we divide 12 by 45 (the max-min difference calculated above). The scaled age is approximately **0.27**.

<img src='./images/n4.png'>
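
Written out in one line, using the minimum (29) and maximum (74) ages identified earlier, the arithmetic for the first age is:

$$\large \frac{41 - 29}{74 - 29} = \frac{12}{45} \approx 0.27$$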



Below is the normalized version of the age column, with all values scaled to fall between 0 and 1:

<img src='./images/n5.png'>


Normalization conveys the proportion of the range, from min to max, that is covered by a given value within a column. So, in our example, from 29 (min age) to 74 (max age), each age value will cover some proportion of that range, and that is its normalized value.

<img src='./images/n6.png'>
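
One way to see the "proportion of the range" interpretation is to convert an age to its normalized value and back again. The sketch below assumes the minimum and maximum ages from the example (29 and 74); the helper names `normalize` and `denormalize` are hypothetical and chosen only for illustration.

```python
age_min, age_max = 29, 74
age_range = age_max - age_min  # 45


def normalize(age):
    """Return the fraction of the min-to-max range covered by `age`."""
    return (age - age_min) / age_range


def denormalize(proportion):
    """Map a proportion of the range back to an age."""
    return age_min + proportion * age_range


p = normalize(41)
print(round(p, 2))            # 0.27
print(round(denormalize(p)))  # 41

# An age exactly halfway between the minimum and maximum covers half the range.
print(normalize(51.5))        # 0.5
```

Because the normalized value is just a position within the range, multiplying it by the range and adding the minimum recovers the original age.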

## Normalizing a sample heart disease dataset

In the sample dataset below, "age", cholesterol ("chol"), and max heart rate ("max_hr") are all measured in different units and have different magnitudes and ranges of values. Let's <mark>**normalize**</mark> the dataset. 

In [1]:
import pandas as pd

In [3]:
df = pd.read_csv('./heart_disease.csv', usecols=[0, 4, 5])

df

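| | age | chol | max_hr |
|---|---|---|---|
| 0 | 63 | 233 | 150 |
| 1 | 37 | 250 | 187 |
| 2 | 41 | 204 | 172 |
| 3 | 56 | 236 | 178 |
| 4 | 57 | 354 | 163 |
| ... | ... | ... | ... |
| 298 | 57 | 241 | 123 |
| 299 | 45 | 264 | 132 |
| 300 | 68 | 193 | 141 |
| 301 | 57 | 131 | 115 |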


## The normalized dataset

In [4]:
(df - df.min()) / (df.max() - df.min())

| | age | chol | max_hr |
|---|---|---|---|
| 0 | 0.708333 | 0.244292 | 0.603053 |
| 1 | 0.166667 | 0.283105 | 0.885496 |
| 2 | 0.250000 | 0.178082 | 0.770992 |
| 3 | 0.562500 | 0.251142 | 0.816794 |
| 4 | 0.583333 | 0.520548 | 0.702290 |
| ... | ... | ... | ... |
| 298 | 0.583333 | 0.262557 | 0.396947 |
| 299 | 0.333333 | 0.315068 | 0.465649 |
| 300 | 0.812500 | 0.152968 | 0.534351 |
| 301 | 0.583333 | 0.011416 | 0.335878 |


It should now be clear why normalizing values is useful. It's an effective way of comparing values that are measured in different units and it scales values down into a standard range. 
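
As a quick sanity check on the approach, the sketch below applies the same pandas expression to a small DataFrame and confirms that every column ends up with a minimum of 0 and a maximum of 1. It uses only the nine rows visible in the output above rather than the full CSV file, so the per-column minimums and maximums, and therefore the scaled values, will not match the full result; the variable names `sample` and `normalized` are illustrative.

```python
import pandas as pd

# Only the rows visible in the displayed output; the full file has more rows,
# so these scaled values differ from the ones computed on the whole dataset.
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 57, 45, 68, 57],
    "chol": [233, 250, 204, 236, 354, 241, 264, 193, 131],
    "max_hr": [150, 187, 172, 178, 163, 123, 132, 141, 115],
})

# Column-wise min-max scaling: pandas aligns the per-column min and max
# with the matching columns of the DataFrame.
normalized = (sample - sample.min()) / (sample.max() - sample.min())

print(normalized.min())  # each column's minimum
print(normalized.max())  # each column's maximum
```

Regardless of the original units, every column now spans the same 0-to-1 range.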

<div class="alert alert-block alert-success">
<b>Tip: </b>Another scaling technique used to bring variables into similar terms is called <b>standardization</b>. This technique expresses each value as its distance from the column's mean, measured in standard deviations. Rather than falling between 0 and 1 as normalized values do, standardized values have no fixed bounds, although they typically fall between -3 and 3 (see Z-score).
</div>
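
For comparison, standardization can be sketched with Python's built-in `statistics` module. This is a minimal illustration under stated assumptions: it uses only the first five ages shown in the dataset above, and it uses the sample standard deviation (`stdev`), which is a choice made for this sketch; the population version (`pstdev`) would give slightly different z-scores.

```python
from statistics import mean, stdev

# First five ages shown in the dataset above.
ages = [63, 37, 41, 56, 57]

mu = mean(ages)
sigma = stdev(ages)  # sample standard deviation (divides by n - 1)

# Each z-score is the number of standard deviations a value lies from the mean.
z_scores = [(a - mu) / sigma for a in ages]
print([round(z, 2) for z in z_scores])
```

After standardization the values have a mean of 0 and a standard deviation of 1, but unlike normalized values they are not confined to a fixed interval.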