
# ML/Python Course Material
### [Mohamad Dia](http://mohamaddia.me)
Feb 11 2020

# T3: Common Questions

#### 1. Do we always need to normalize our data?
Data normalization (or feature scaling) is an essential preprocessing technique in machine learning that makes the learning algorithm numerically more stable by bringing all the features to the same scale. Whether to normalize the data depends on the type and scale of the features, as well as on the choice of the machine learning model.

**Tree-based** learning models (such as decision trees) do not require data normalization, because the decision threshold is learned independently for each feature regardless of its scale or range. **Graph-based** models (such as Naive Bayes) are designed to treat each feature independently and do not require normalization either. In both cases, however, normalizing the data won't really hurt.
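The scale-invariance of a tree split can be illustrated with a minimal sketch of a single decision stump. The function `best_split`, the toy salaries, and the labels below are hypothetical and only serve the illustration; real tree learners use impurity measures such as Gini or entropy rather than the plain error count used here.

```python
def best_split(values, labels):
    """Find the threshold split with the fewest misclassified points.

    Returns (indices sent to the left branch, number of errors).
    Each side of the split predicts its own majority label.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_errors, best_left = None, None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        errors = 0
        for side in (left, right):
            ones = sum(labels[i] for i in side)
            errors += min(ones, len(side) - ones)
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left, best_errors


# Hypothetical toy feature (annual salary in CHF) with binary labels.
salary = [40_000.0, 52_000.0, 61_000.0, 90_000.0, 120_000.0]
labels = [0, 0, 0, 1, 1]

# Min-max scaled copy of the same feature.
lo, hi = min(salary), max(salary)
salary_scaled = [(s - lo) / (hi - lo) for s in salary]

split_raw = best_split(salary, labels)
split_scaled = best_split(salary_scaled, labels)
print(split_raw == split_scaled)
```

Because scaling preserves the ordering of the values, the same points end up on each side of the split; only the numeric value of the threshold changes.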

On the other hand, **gradient-based** models (e.g. logistic regression, neural networks, ...) and **distance-based** models (e.g. KNN) require data normalization when the features have different scales or units. Take for example a binary classification problem solved with logistic regression on three features: the height of a person in cm, the weight in kg, and the annual salary in CHF. Since the features have completely different scales, the learning algorithm will be very sensitive and numerically unstable, with many inefficient oscillations (a small change in the weight of the third feature can have a huge impact on the output compared to a similar change in the weight of the first feature). Therefore, if we assume that the different features have relatively similar importance in the classification, it is essential to normalize the data so that we attain the following:
* Numerical stability in the learning algorithm
* Fast convergence
* Learning algorithms can focus on learning the absolute importance of each feature (instead of learning its scale).
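For distance-based models, the effect of mismatched scales can be seen in a small sketch. The three people below are made-up values, and `nearest_to_first` is a hypothetical helper standing in for a 1-nearest-neighbor lookup.

```python
import math
from statistics import mean, pstdev

# Hypothetical toy data: (height in cm, weight in kg, annual salary in CHF).
people = [
    (170.0, 65.0, 80_000.0),  # query person
    (172.0, 67.0, 95_000.0),  # similar body, different salary
    (190.0, 95.0, 81_000.0),  # different body, similar salary
]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Rescale every column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stds))
        for row in rows
    ]


def nearest_to_first(rows):
    """Index of the row closest to rows[0] in Euclidean distance."""
    return min(range(1, len(rows)), key=lambda i: euclidean(rows[0], rows[i]))


raw_neighbor = nearest_to_first(people)
scaled_neighbor = nearest_to_first(standardize(people))
print(raw_neighbor, scaled_neighbor)
```

On the raw data the salary dominates the distance, so the nearest neighbor is person 2. After standardization all three features contribute comparably, and the nearest neighbor becomes person 1.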

#### 2. When should we use min-max or zero mean/unit variance scaling, and what's the difference?

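Zero mean/unit variance scaling (or standardization) rescales each feature so that it has mean $0$ and standard deviation $1$, like a standard normal distribution: $x' = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the feature. Note that this shifts and rescales the feature but does not change the shape of its distribution.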

Min-max scaling (often simply called normalization) brings all the feature values to the range $[0,1]$: $x' = (x - x_{\min})/(x_{\max} - x_{\min})$.
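Both scalings can be sketched in a few lines of plain Python. The function names and the toy heights are assumptions made for illustration; in practice one would typically rely on a library implementation and compute the statistics on the training set only.

```python
from statistics import mean, pstdev


def standardize(values):
    """Zero mean / unit variance scaling: (x - mean) / std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    # Assumes sigma > 0, i.e. the feature is not constant.
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Min-max scaling to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    # Assumes hi > lo, i.e. the feature is not constant.
    return [(v - lo) / (hi - lo) for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical feature values
z = standardize(heights_cm)
m = min_max_scale(heights_cm)
print(z)
print(m)
```

The standardized values are centered at zero with no fixed bounds, while the min-max values always start at 0 and end at 1.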

Standardization or normalization? There is no absolute answer; it depends on the application. In general, most machine learning algorithms benefit from standardization more than normalization. However, there are cases where normalization is preferred.

Standardization works better whenever there are outliers in the data (standardization leaves the data unbounded, while normalization bounds the data). Moreover, whenever the learning algorithm assumes normality (Gaussianity) of the features, we use standardization. For example, some neural networks with $\tanh$ activations assume that the features are centered around zero, and hence it is better to standardize the data. However, this is not always the case. There are cases where the Gaussian assumption does not hold (e.g. image classification with neural networks using ReLU activations), hence one can use normalization. Furthermore, there are cases where the standard deviation is very small, which makes normalization a better option than standardization.
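The remark about outliers can be made concrete with a small sketch on made-up values containing a single outlier. Keep in mind that standardization is not immune to outliers either, since they also shift the mean and inflate the standard deviation; the sketch only shows how differently the two scalings treat the bounds.

```python
from statistics import mean, pstdev

# Hypothetical feature values: five inliers and one outlier.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]

# Min-max scaling: the outlier defines the upper bound of [0, 1].
lo, hi = min(values), max(values)
m = [(v - lo) / (hi - lo) for v in values]

# Standardization: no fixed bounds, the outlier gets a large z-score.
mu, sigma = mean(values), pstdev(values)
z = [(v - mu) / sigma for v in values]

print([round(x, 3) for x in m])
print([round(x, 3) for x in z])
```

With min-max scaling the outlier is pinned to exactly 1 and the five inliers are squeezed into less than 5% of the available range. With standardization the outlier simply appears as a value more than two standard deviations above the mean.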
