# Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization

## Why Should We Use Feature Scaling?

The first question we need to address: why do we need to scale the variables in our dataset? Some machine learning algorithms are sensitive to feature scaling, while others are virtually invariant to it. Let's look at each group in more detail.

### Gradient Descent Based Algorithms

Machine learning algorithms like linear regression, logistic regression, and neural networks that use gradient descent as an optimization technique require the data to be scaled. Take a look at the formula for gradient descent below:

$$
\theta_j := \theta_j -\alpha\frac{1}{m}\sum_{i=1}^{m}\Big{(}h_{\theta}(x^{(i)})-y^{(i)}\Big{)}x_j^{(i)}
$$

The feature value $x_j^{(i)}$ appears in the update, so it directly affects the step size of gradient descent. Features with different ranges therefore produce different step sizes for their parameters. To ensure that gradient descent moves smoothly towards the minimum and that the parameters are updated at a similar rate for all the features, we scale the data before feeding it to the model.

> *Having features on a similar scale can help gradient descent converge more quickly towards the minimum.*
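
To make the step-size argument concrete, here is a minimal sketch that evaluates the gradient from the formula above at $\theta = 0$, once on raw features and once on standardized ones. The toy data, the function names, and the omission of an intercept term are assumptions made for illustration; they are not from a real dataset.

```python
from statistics import mean, pstdev


def gradient(theta, X, y):
    """Gradient of the squared-error cost for a linear hypothesis h(x) = theta . x."""
    m = len(X)
    grads = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        error = sum(t * x for t, x in zip(theta, x_i)) - y_i
        for j, x_ij in enumerate(x_i):
            grads[j] += error * x_ij / m
    return grads


def standardize_columns(X):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*X))
    scaled = [[(v - mean(c)) / pstdev(c) for v in c] for c in columns]
    return [list(row) for row in zip(*scaled)]


# Hypothetical toy data: feature 0 is in single digits, feature 1 in thousands.
X_raw = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0], [4.0, 5000.0]]
y = [1.0, 2.0, 2.5, 4.0]
theta = [0.0, 0.0]

grad_raw = gradient(theta, X_raw, y)
grad_scaled = gradient(theta, standardize_columns(X_raw), y)

print("raw gradient:   ", grad_raw)
print("scaled gradient:", grad_scaled)
```

On the raw data the second gradient component is roughly a thousand times larger than the first, so a single learning rate $\alpha$ cannot suit both parameters. After standardization the two components have a similar magnitude.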

### Distance-Based Algorithms

Distance-based algorithms like KNN, K-means, and SVM are the most affected by the range of features. This is because, behind the scenes, they use distances between data points to determine their similarity.

For example, let’s say we have data containing the high school CGPA scores of students (ranging from 0 to 5) and their future incomes (in thousands of rupees):

| |Student|CGPA|Salary'000|
|:-:|:-:|:-:|:-:|
|**0**|1|3.0|60|
|**1**|2|3.0|40|
|**2**|3|4.0|40|
|**3**|4|4.5|50|
|**4**|5|4.2|52|

Since the two features have different scales, there is a chance that more weight is given to the feature with the higher magnitude. This will impact the performance of the machine learning algorithm, and obviously we do not want our algorithm to be biased towards one feature.

> *Therefore, we scale our data before employing a distance-based algorithm so that all the features contribute equally to the result.*

| |Student|CGPA|Salary'000|
|:-:|:-:|:-:|:-:|
|**0**|1|-1.184341|1.520013|
|**1**|2|-1.184341|-1.100699|
|**2**|3|0.416120|-1.100699|
|**3**|4|1.216350|0.209657|
|**4**|5|0.736212|0.471728|

The table above shows the same data after scaling (here, standardization). The effect of scaling is conspicuous when we compare the Euclidean distance between students 1 and 2 (call them A and B), and between students 2 and 3 (B and C), before and after scaling:

- Distance AB before scaling: $\sqrt{(40-60)^2+(3-3)^2}=20$
- Distance BC before scaling: $\sqrt{(40-40)^2+(4-3)^2}=1$
- Distance AB after scaling: $\sqrt{(-1.10-1.52)^2+(-1.18+1.18)^2}\approx 2.62$
- Distance BC after scaling: $\sqrt{(-1.10+1.10)^2+(0.42+1.18)^2}\approx 1.60$

Scaling has brought both the features into the picture and the distances are now more comparable than they were before we applied scaling.
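
The distances above can be checked with a short sketch like the following. It uses only the Python standard library and the values from the tables; the helper names `standardize` and `euclidean` are illustrative choices, and the scaled results are computed from unrounded values, so they may differ in the last digit from a hand calculation with rounded inputs.

```python
import math
from statistics import mean, pstdev

# Values from the tables above; salary is in thousands.
cgpa = [3.0, 3.0, 4.0, 4.5, 4.2]
salary = [60.0, 40.0, 40.0, 50.0, 52.0]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


raw_points = list(zip(salary, cgpa))
scaled_points = list(zip(standardize(salary), standardize(cgpa)))

a, b, c = 0, 1, 2  # rows of students 1, 2 and 3 (A, B and C)
ab_raw = euclidean(raw_points[a], raw_points[b])
bc_raw = euclidean(raw_points[b], raw_points[c])
ab_scaled = euclidean(scaled_points[a], scaled_points[b])
bc_scaled = euclidean(scaled_points[b], scaled_points[c])

print(f"AB raw: {ab_raw:.2f}, BC raw: {bc_raw:.2f}")
print(f"AB scaled: {ab_scaled:.2f}, BC scaled: {bc_scaled:.2f}")
```

Before scaling, AB is twenty times BC purely because salary is measured in larger numbers. After scaling, the two distances are of the same order.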

### Tree-Based Algorithms

Tree-based algorithms, on the other hand, are fairly insensitive to the scale of the features. Think about it: a decision tree splits a node based on a single feature, choosing the feature and threshold that most increase the homogeneity of the resulting nodes. This split on a feature is not influenced by the other features.

A split only asks whether a value lies above or below a threshold, so rescaling a feature moves the threshold along with the values and leaves the same samples on each side. The scales of the remaining features have virtually no effect on the split either. This is what makes tree-based algorithms invariant to the scale of the features!
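
As a rough illustration, the sketch below searches for the best single-feature split on the salary column, first on the raw values and then on standardized ones. It is a simplified stand-in for what a decision tree does at one node, not a full tree implementation; the class labels and the function names are hypothetical and exist only for this example.

```python
from statistics import mean, pstdev


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, left_rows) for the split with the lowest weighted impurity."""
    n = len(values)
    candidates = sorted(set(values))
    best_score, best_threshold, best_left = None, None, None
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / n
        if best_score is None or score < best_score:
            best_score, best_threshold, best_left = score, threshold, frozenset(left)
    return best_threshold, best_left


salary = [60.0, 40.0, 40.0, 50.0, 52.0]  # salary column, in thousands
labels = [1, 0, 0, 1, 1]                 # hypothetical class labels, illustration only

mu, sigma = mean(salary), pstdev(salary)
salary_scaled = [(v - mu) / sigma for v in salary]

threshold_raw, left_raw = best_split(salary, labels)
threshold_scaled, left_scaled = best_split(salary_scaled, labels)

print("raw threshold:   ", threshold_raw, "left rows:", sorted(left_raw))
print("scaled threshold:", threshold_scaled, "left rows:", sorted(left_scaled))
```

The threshold value changes with the scale, but the rows sent to each side of the split stay the same, which is why the resulting tree makes the same decisions.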

### What is Normalization?

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

Here’s the formula for normalization:

$$
X^{'}=\frac{X-X_{min}}{X_{max}-X_{min}}
$$

Here, $X_{max}$ and $X_{min}$ are the maximum and the minimum values of the feature, respectively.

- When the value of $X$ is the minimum value in the column, the numerator is 0, and hence $X'$ is 0.
- On the other hand, when the value of $X$ is the maximum value in the column, the numerator is equal to the denominator, and thus $X'$ is 1.
- If the value of $X$ is between the minimum and the maximum value, then $X'$ is between 0 and 1.
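
The three cases can be seen in a minimal sketch of the formula, applied here to the CGPA column from the earlier example. The function name `min_max_normalize` is an illustrative choice; in practice a library scaler such as scikit-learn's `MinMaxScaler` is commonly used for the same purpose.

```python
def min_max_normalize(values):
    """Rescale values to the [0, 1] range using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range, so the formula would divide by zero.
        raise ValueError("cannot normalize a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


cgpa = [3.0, 3.0, 4.0, 4.5, 4.2]
cgpa_normalized = min_max_normalize(cgpa)

print([round(v, 3) for v in cgpa_normalized])
```

The minimum (3.0) maps to 0, the maximum (4.5) maps to 1, and every other value lands in between.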

### What is Standardization?

Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

Here’s the formula for standardization:

$$
X^{'}=\frac{X-\mu}{\sigma}
$$

Here, $\mu$ is the mean of the feature values and $\sigma$ is the standard deviation of the feature values. Note that in this case, the values are not restricted to a particular range.

Now, the big question in your mind must be: when should we use normalization, and when should we use standardization? Let’s find out!
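
A minimal sketch of the formula follows, again on the CGPA column from the earlier example. One detail the formula leaves open is whether $\sigma$ is the population or the sample standard deviation; the `sample` flag below is a hypothetical parameter added to show both. The standardized table earlier in this article matches the population form, which is also what scikit-learn's `StandardScaler` uses.

```python
from statistics import mean, pstdev, stdev


def standardize(values, sample=False):
    """Return (x - mean) / std, using the population std unless sample=True."""
    mu = mean(values)
    sigma = stdev(values) if sample else pstdev(values)
    return [(v - mu) / sigma for v in values]


cgpa = [3.0, 3.0, 4.0, 4.5, 4.2]

z_population = standardize(cgpa)
z_sample = standardize(cgpa, sample=True)

print("population std:", [round(v, 6) for v in z_population])
print("sample std:    ", [round(v, 6) for v in z_sample])
print("mean after scaling:", round(mean(z_population), 6))
print("std after scaling: ", round(pstdev(z_population), 6))
```

With only five rows the two variants differ noticeably; on large datasets the difference becomes negligible.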

## The Big Question – Normalize or Standardize?

Normalization vs. standardization is an eternal question among machine learning newcomers. Let me elaborate on the answer in this section.

- **Normalization** is good to use when you know that your data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data, like K-Nearest Neighbors and Neural Networks.
- **Standardization**, on the other hand, can be helpful in cases where the data follows a Gaussian distribution. However, this does not have to be necessarily true. Also, unlike normalization, standardization does not have a bounding range. So, if you have outliers in your data, they are not squeezed into a fixed interval together with the rest of the values, which generally makes standardization less sensitive to outliers than normalization. Keep in mind that outliers still shift the mean and inflate the standard deviation.

However, at the end of the day, the choice of using normalization or standardization will depend on your problem and the machine learning algorithm you are using. There is no hard and fast rule to tell you when to normalize or standardize your data. **You can always start by fitting your model to raw, normalized and standardized data and compare the performance for best results.**

> It is good practice to fit the scaler on the training data and then use it to transform the test data. This avoids data leakage during model testing. Also, scaling the target values is generally not required.
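
The fit-on-train, transform-on-test practice can be sketched as follows. `SimpleStandardScaler` is a hypothetical minimal class written for this example, and the split of the CGPA values into training and test sets is arbitrary; library scalers such as scikit-learn's `StandardScaler` follow the same `fit` / `transform` pattern.

```python
from statistics import mean, pstdev


class SimpleStandardScaler:
    """Hypothetical minimal scaler that remembers the training statistics."""

    def fit(self, values):
        # Learn the mean and standard deviation from the training data only.
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the stored statistics; nothing is re-estimated here.
        return [(v - self.mean_) / self.std_ for v in values]


train = [3.0, 3.0, 4.0, 4.5]  # illustrative training split
test = [4.2]                  # illustrative test split

scaler = SimpleStandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print("train mean used for scaling:", scaler.mean_)
print("scaled test value:", round(test_scaled[0], 4))
```

Because the test value is scaled with the training mean and standard deviation, no information from the test set leaks into the preprocessing step.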