## Table of Contents
- [Regularization](#Regularization)
    - [Types of Regularization](#Types-of-Regularization)
      - [1. L1 Regularization](#1.-L1-Regularization-(Lasso-Regression))
      - [2. L2 Regularization](#2.-L2-Regularization-(Ridge-Regression))
    - [Regularization Conclusion](#Conclusion)
- [Revisit on Feature Scaling](#Revisit-on-Feature-Scaling)
# Regularization
Regularization is a way to reduce model overfitting. It deliberately adds some bias to the model and requires searching for the optimal value of a penalty hyperparameter.
#### How it is done:
- Minimizing model complexity
- Penalizing the loss function
- Reducing model overfitting (by adding some bias to reduce variance)

#### How overfitting happens:
- It happens when the model is too flexible and the training process adapts too closely to the training data, thereby losing predictive accuracy on new test data. The cause is high variance combined with low bias.
- <b><u>Intuition</u></b>: the fitted curve is too wiggly, passing through every data point in the training set

# Types of Regularization
1. L1 Regularization (or Lasso Regression)
2. L2 Regularization (or Ridge Regression)
3. Elastic Net Regularization (combination of L1 and L2)

# 1. L1 Regularization (Lasso Regression)
- Adds a penalty equal to the sum of the absolute values of the coefficients to the loss function (aka cost function)
- It limits the size of the coefficients in the regression equation
- It can yield sparse models where some coefficients are exactly zero. (In polynomial regression we saw some coefficients so small that they were almost zero; Lasso can set such coefficients to exactly zero, eliminating the corresponding term.)
$$ L1 = \sum_{i=0}^{m-1}(y_i - \hat y_i)^2 + \lambda \sum_{j=0}^{n-1} |\beta_j| $$
which is:
$$ L1 = SSR + \lambda \sum_{j=0}^{n-1} |\beta_j| $$
    - here, SSR = Sum of Squared Residuals
    - $\lambda \ge 0$ is a hyperparameter that controls the strength of the penalty
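
A minimal sketch of the L1 loss, plus a one-feature illustration of why Lasso can produce exact zeros. The function names, the toy data, and the simplification to a single feature with no intercept are assumptions made for illustration; they are not part of the notes above. For that simplified case, setting the derivative of the loss to zero gives a soft-thresholding rule.

```python
def ssr(y, y_hat):
    """Sum of squared residuals."""
    return sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))


def lasso_loss(y, y_hat, betas, lam):
    """SSR plus lambda times the sum of absolute coefficient values."""
    return ssr(y, y_hat) + lam * sum(abs(b) for b in betas)


def lasso_one_feature(x, y, lam):
    """Coefficient minimising the L1 loss for y ~ beta * x (no intercept).

    With s = sum(x * y), the minimiser is
    beta = sign(s) * max(|s| - lam / 2, 0) / sum(x^2).
    """
    s = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    shrunk = max(abs(s) - lam / 2, 0.0)
    return (shrunk if s >= 0 else -shrunk) / sxx


# Toy data (hypothetical): exactly y = 2 * x
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for lam in (0.0, 10.0, 100.0):
    beta = lasso_one_feature(x, y, lam)
    y_hat = [beta * xi for xi in x]
    print(lam, beta, lasso_loss(y, y_hat, [beta], lam))
```

With $\lambda = 0$ the coefficient is the unpenalized value 2.0; with $\lambda = 100$ the penalty outweighs the fit and the coefficient becomes exactly 0.0.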
# 2. L2 Regularization (Ridge Regression)
- Adds a penalty equal to the sum of the squared coefficients to the loss function
- All the coefficients are shrunk toward zero, but they are not necessarily eliminated (they generally stay nonzero).

$$ L2 = \sum_{i=0}^{m-1}(y_i - \hat y_i)^2 + \lambda \sum_{j=0}^{n-1} (\beta_j)^2$$
which is:
$$ L2 = SSR + \lambda \sum_{j=0}^{n-1} (\beta_j)^2 $$
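
A minimal sketch of the L2 loss, with a one-feature illustration of shrinkage without elimination. The function names, the toy data, and the simplification to a single feature with no intercept are assumptions made for illustration. In that simplified case the minimizer has the closed form $\beta = \sum x_i y_i / (\sum x_i^2 + \lambda)$.

```python
def ridge_loss(y, y_hat, betas, lam):
    """SSR plus lambda times the sum of squared coefficients."""
    ssr = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    return ssr + lam * sum(b ** 2 for b in betas)


def ridge_one_feature(x, y, lam):
    """Coefficient minimising the L2 loss for y ~ beta * x (no intercept)."""
    s = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return s / (sxx + lam)


# Toy data (hypothetical): exactly y = 2 * x
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for lam in (0.0, 14.0, 1386.0):
    beta = ridge_one_feature(x, y, lam)
    y_hat = [beta * xi for xi in x]
    print(lam, beta, ridge_loss(y, y_hat, [beta], lam))
```

The coefficient drops from 2.0 to 1.0 to roughly 0.02 as $\lambda$ grows, but it never reaches exactly zero for any finite $\lambda$.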

# Conclusion
- In regularization we are just adding a penalty term to the error that we're trying to minimize
- $\lambda$ is the hyperparameter. $\lambda = 0$ is the same as not performing any regularization: the loss becomes just the Sum of Squared Residuals (SSR)

# Revisit on Feature Scaling
- Algorithms like gradient descent and KNN (which relies on a distance metric) require feature scaling to perform well
- In gradient descent, the coefficients of large-scale features are updated faster than the coefficients of small-scale features. Scaling the features allows gradient descent to converge efficiently.
- There are some algorithms in ML where feature scaling has no effect. These are generally the decision tree based algorithms (regression trees, decision trees, random forest, etc.)
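
To illustrate the point about distance metrics, here is a small sketch with made-up rows of (age, income). The data and the min/max ranges are hypothetical values chosen for the example. Without scaling, the income column dominates the Euclidean distance; after min-max scaling, both features contribute comparably.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Hypothetical feature ranges, assumed to come from the training data
AGE_MIN, AGE_MAX = 18.0, 80.0
INC_MIN, INC_MAX = 20_000.0, 120_000.0


def scale(row):
    """Min-max scale an (age, income) row using the ranges above."""
    age, income = row
    return (
        (age - AGE_MIN) / (AGE_MAX - AGE_MIN),
        (income - INC_MIN) / (INC_MAX - INC_MIN),
    )


p = (25.0, 50_000.0)
q = (60.0, 50_500.0)  # very different age, nearly the same income
r = (26.0, 51_000.0)  # nearly the same age, slightly different income

# Unscaled: q looks closer to p because income differences dominate
print(euclidean(p, q), euclidean(p, r))

# Scaled: r is closer to p, matching the intuition that p and r are similar
print(euclidean(scale(p), scale(q)), euclidean(scale(p), scale(r)))
```

A nearest-neighbour model would therefore pick a different neighbour for `p` depending on whether the features were scaled.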
> If we scale the training features, we'll have to scale the unseen data too before feeding it to the model

- Improves coefficient interpretability, meaning we can relate and compare coefficients and their impact.
- It can greatly improve the performance of scale-sensitive models

> If we're not sure whether we need to scale, we can usually scale anyway. It generally has no drawback for model performance. The only thing to remember is to scale the unseen data (with the same scaling factors that were fitted on the training data) before feeding it to the model

#### Ways to scale features
1. <b>Standardization</b>: Rescale data to have mean $(\mu) = 0$ and standard deviation $(\sigma) = 1$. It is also known as Z-score normalization
      $$x_{1,scaled} = \frac{x_1 - \mu_1}{\sigma_1}$$
2. <b>Normalization</b>: Rescale all the data values to be between $0$ and $1$. It is also known as min-max scaling
      $$x_{1,scaled} = \frac{x_1 - x_{1,min}}{x_{1,max} - x_{1,min}}$$
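
The two formulas above can be sketched in plain Python as follows. The function names and the sample values are illustrative assumptions, and the population standard deviation is assumed for $\sigma$.

```python
import statistics


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column with mean 5 and standard deviation 2
x1 = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z = standardize(x1)    # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
m = min_max_scale(x1)  # first value 0.0, last value 1.0
print(z)
print(m)
```

Both functions assume the column is not constant; a constant column would give a zero denominator.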
> While performing feature scaling, the mean, standard deviation, min, and max should be calculated on the training data only. Using the whole data set (both train and test) leaks information from the test data into training. We can use the .fit() and .transform() methods in scikit-learn.
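
The fit-on-train, transform-on-test pattern can be sketched without any library. `StandardScalerSketch` below is a hypothetical minimal class written for this example; it only mimics the `fit`/`transform` split that scikit-learn scalers such as `StandardScaler` and `MinMaxScaler` provide, and the data is made up.

```python
import statistics


class StandardScalerSketch:
    """Hypothetical minimal scaler: learn statistics in fit, apply in transform."""

    def fit(self, values):
        # Statistics come from the data passed here (the training set only)
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Reuse the stored training statistics; nothing is recomputed
        return [(v - self.mean_) / self.std_ for v in values]


x_train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std 2
x_test = [3.0, 11.0]

scaler = StandardScalerSketch().fit(x_train)   # fit on train only
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)       # [-1.0, 3.0]
print(x_test_scaled)
```

Note that the scaled test values are not forced to have mean 0 or to stay within the training range. That is expected, because the test data never influenced the scaling factors.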