# Regularization Overview
Regularization is a way to reduce model overfitting. It works by accepting a small amount of additional bias, and it requires searching for the optimal penalty hyperparameter.
#### How it is done:
- Minimizing model complexity
- Penalizing the loss function
- Reducing model overfitting (by adding some bias to reduce variance)

#### How overfitting happens:
- It happens when the model is too flexible and the training process adapts too closely to the training data, thereby losing predictive accuracy on new test data. The underlying cause is high variance combined with low bias.
- <b><u>Intuition</u></b>: the curve is too wiggly, passing through every data point in the training set
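
The intuition above can be sketched with a small experiment. This is only an illustration on made-up data (a noisy straight line), using NumPy's `polyfit`; the degrees 1 and 10 and the noise level are arbitrary assumptions, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a straight line plus noise
x_train = np.linspace(-1, 1, 15)
y_train = 2 * x_train + 1 + rng.normal(scale=0.3, size=x_train.size)
x_test = np.linspace(-1, 1, 200)
y_test = 2 * x_test + 1  # noise-free truth, used only for evaluation

def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 10):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

The more flexible degree-10 fit always reaches a lower training error than the straight line, because it can chase the noise; its error against the true line is typically worse, which is the overfitting described above.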

# Types of Regularization
1. L1 Regularization (or Lasso Regression)
2. L2 Regularization (or Ridge Regression)
3. Elastic Net Regularization (combination of L1 and L2)

# 1. L1 Regularization (Lasso Regression)
- Adds a penalty proportional to the sum of the absolute values of the coefficients to the loss function (aka cost function)
- It limits the size of coefficients in the regression equation
- It can yield sparse models in which some coefficients are exactly zero. (In polynomial regression we saw some coefficients so small that they were almost zero; L1 regularization can set such coefficients to exactly zero, eliminating those terms.)
$$ L1 = \sum_{i=0}^{m-1}(y_i - \hat y_i)^2 + \lambda \sum_{j=1}^{n} |\beta_j| $$
$$ \text{which is:} $$
$$ L1 = SSR + \lambda \sum_{j=1}^{n} |\beta_j| $$
- Expanding Sum of Squared Residuals (SSR)
$$ \hat y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_nx_{in}$$
$$ SSR = \sum_{i=0}^{m-1} \left( y_i - \beta_0 - \sum_{j=1}^{n}\beta_j x_{ij}\right)^2  $$
    - here, SSR = Sum of Squared Residuals
    - $\lambda$ is a hyperparameter
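
As a rough sketch, the L1 objective above can be computed directly with NumPy. The function name `lasso_loss` and the toy numbers are made up for illustration, and the sketch assumes the intercept $\beta_0$ is left out of the penalty, which is the common convention.

```python
import numpy as np

def lasso_loss(X, y, beta0, beta, lam):
    """SSR plus lam times the sum of absolute feature coefficients."""
    y_hat = beta0 + X @ beta                # predictions for all m rows
    ssr = np.sum((y - y_hat) ** 2)          # sum of squared residuals
    penalty = lam * np.sum(np.abs(beta))    # assumption: beta0 is not penalized
    return ssr + penalty

# Toy data, chosen only so the arithmetic is easy to follow
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -0.25])

print(lasso_loss(X, y, beta0=0.0, beta=beta, lam=0.0))  # plain SSR
print(lasso_loss(X, y, beta0=0.0, beta=beta, lam=2.0))  # SSR + 2 * (0.5 + 0.25)
```

With these numbers the residuals are 1, 1.5 and 2, so the SSR is 7.25, and the penalized loss with $\lambda = 2$ is 8.75.
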
# 2. L2 Regularization (Ridge Regression)
- Adds a penalty proportional to the sum of the squared coefficients
- All the coefficients are shrunk toward zero, but the penalty does not necessarily eliminate any of them.

$$ L2 = \sum_{i=0}^{m-1}(y_i - \hat y_i)^2 + \lambda \sum_{j=1}^{n} (\beta_j)^2$$
$$ \text{which is:} $$
$$ L2 = SSR + \lambda \sum_{j=1}^{n} (\beta_j)^2 $$

- Conceptually, Ridge regression minimizes:
$$ \text{Error} = SSR + \text{Shrinkage penalty}$$
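
To contrast the L1 and L2 penalties described so far, here is a minimal sketch using scikit-learn's `Ridge` and `Lasso` on made-up data in which only two of five features matter. The data and the penalty strengths are arbitrary assumptions. Note that scikit-learn calls the penalty strength `alpha`, and that its `Lasso` scales the SSR term by $1/(2m)$, so its `alpha` is not numerically identical to the $\lambda$ in the formulas above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)

# Hypothetical data: 5 features, only the first two influence y
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)    # no penalty, for reference
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty; `alpha` plays the role of lambda
lasso = Lasso(alpha=0.5).fit(X, y)    # L1 penalty

print("OLS  :", np.round(ols.coef_, 3))
print("Ridge:", np.round(ridge.coef_, 3))
print("Lasso:", np.round(lasso.coef_, 3))
print("Lasso coefficients that are exactly zero:", int(np.sum(lasso.coef_ == 0.0)))
```

Ridge pulls every coefficient toward zero without removing any, while Lasso sets the coefficients of the irrelevant features to exactly zero.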

# 3. Elastic Net (Combination of L1 and L2)
$$ Error = SSR + \lambda_1 \sum_{j=1}^{n}\beta_j^2 + \lambda_2 \sum_{j=1}^{n}|\beta_j| $$

- We can alternatively express this in terms of a ratio between L1 and L2
$$ Error = SSR + \lambda \left( \frac{1-\alpha}{2} \sum_{j=1}^{n} \beta_j^2 + \alpha \sum_{j=1}^{n}|\beta_j| \right)$$
    - here, $\alpha$ and $\lambda$ are the tunable parameters
    - $\alpha$ sets the ratio between the L1 and L2 penalties
    - when $\alpha = 0$ we're only considering L2
    - when $\alpha = 1$ we're only considering L1
    > scikit-learn's `ElasticNetCV` model represents $\lambda$ as the `alpha` parameter and uses `l1_ratio` for the $\alpha$ used here
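
The naming difference is easiest to see in code. Below is a minimal sketch with `ElasticNetCV` on made-up data; the candidate grids `l1_ratios` and `strengths` are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)

# Hypothetical data: 5 features, only the first two influence y
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

l1_ratios = [0.1, 0.5, 0.9, 1.0]   # candidates for the alpha of these notes
strengths = [0.01, 0.1, 1.0]       # candidates for the lambda of these notes

# scikit-learn: `l1_ratio` is our alpha, `alphas` are our lambda values
model = ElasticNetCV(l1_ratio=l1_ratios, alphas=strengths, cv=5)
model.fit(X, y)

print("chosen lambda (scikit-learn `alpha_`):", model.alpha_)
print("chosen alpha (scikit-learn `l1_ratio_`):", model.l1_ratio_)
```

The fitted attributes `alpha_` and `l1_ratio_` hold the combination that scored best under the 5-fold cross-validation.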

# Conclusion
- In regularization we are just adding a penalty term to the error that we're trying to minimize
- $\lambda$ is the hyperparameter. $\lambda = 0$ is the same as not performing any regularization: the error reduces to just the Sum of Squared Residuals (SSR)
> scikit-learn names the regularization strength `alpha` in its regularized linear models (for example `Ridge`, `Lasso` and `ElasticNet`). That `alpha` corresponds to $\lambda$ here, so do not confuse it with the $\alpha$ used in these notes
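
The claim about $\lambda = 0$ can be checked numerically. The sketch below uses the closed-form ridge solution $(X^TX + \lambda I)^{-1}X^Ty$ for a model without an intercept; the helper name `ridge_closed_form` and the data are made up for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize SSR + lam * sum(beta_j ** 2) for a model without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                     # hypothetical features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # plain least squares
beta_lam0 = ridge_closed_form(X, y, lam=0.0)     # no penalty
beta_lam10 = ridge_closed_form(X, y, lam=10.0)   # penalized

print("lambda = 0 matches least squares:", np.allclose(beta_ols, beta_lam0))
print("lambda = 10 shrinks the coefficients:",
      np.linalg.norm(beta_lam10) < np.linalg.norm(beta_ols))
```

With $\lambda = 0$ the penalty vanishes and the solution coincides with ordinary least squares; any positive $\lambda$ gives a coefficient vector with a smaller norm.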

# Revisit on Feature Scaling
- Algorithms like gradient descent and KNN (which relies on a distance metric) require feature scaling to perform well
- In gradient descent, the coefficients of large-scale features are updated faster than the coefficients of small-scale features. Scaling the features allows gradient descent to converge efficiently.
- For some ML algorithms feature scaling has no effect. These are generally the decision-tree-based algorithms (regression trees, decision trees, random forests, etc.)
> If we scale the training features, we'll have to scale the unseen data too before feeding it to the model

- Improves coefficient interpretability, meaning we can compare coefficients with each other and relate them to their impact.
- It can considerably improve the performance of scale-sensitive models

> If we're not sure whether we need to scale, we can scale anyway. It generally does no harm, and for models that are insensitive to scale it simply has no effect. The only thing to remember is to scale the unseen data (with the same scaling factors that were fitted on the training data) before feeding it to the model

#### Ways to scale features
1. <b>Standardization</b>: Rescale data to have mean $(\mu) = 0$ and standard deviation $(\sigma) = 1$. It is also known as Z-score normalization
      $$x_{1,scaled} = \frac{x_1 - \mu_1}{\sigma_1}$$
2. <b>Normalization</b>: Rescale all the data values to be between $0$ and $1$
      $$x_{1,scaled} = \frac{x_1 - x_{1,min}}{x_{1,max} - x_{1,min}}$$
> While performing feature scaling, the mean, standard deviation, min and max should be calculated from the training data only. Using the whole data set (both train and test) leaks information from the test set into training. In scikit-learn we can call `.fit()` on the training data and then `.transform()` on both the training and the test data.
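
The fit-on-train, transform-both pattern can be sketched with scikit-learn's `StandardScaler` and `MinMaxScaler`. The data here is randomly generated for illustration and the 70/30 split is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # hypothetical features

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean and std from train only
X_test_std = scaler.transform(X_test)        # reuse the train statistics

minmax = MinMaxScaler()
X_train_mm = minmax.fit_transform(X_train)   # learn min and max from train only
X_test_mm = minmax.transform(X_test)         # test values may fall outside [0, 1]

print("train mean after standardization:", X_train_std.mean(axis=0).round(6))
print("train std after standardization: ", X_train_std.std(axis=0).round(6))
```

Because the scalers only ever see the training rows during fitting, nothing about the test set influences the scaling factors.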

# Cross-Validation Overview
- It is a way to estimate error without relying on a single fixed split of the data set into train and test sets
- As we're not setting aside a fixed test portion, we have the advantage of more data points to train our model (with a classical split, generally about 30% of the data set would be reserved for testing).

### Types of Cross-Validation
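1. K-fold cross-validation:
    - The data is split into k folds; each fold is used once for testing while the remaining k - 1 folds are used for training. The largest possible value of k is the number of rows (m); we generally choose k = 10
2. Leave-one-out cross-validation:
    - The special case k = m. It is computationally expensive because the model is fit m times, but it can give a less biased error estimate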

> For most of the problems, K-fold cross-validation provides more than enough information.

> NOTE: scikit-learn's regularized estimators with a `CV` suffix, such as `RidgeCV`, `LassoCV` and `ElasticNetCV`, have built-in cross-validation. Learn more during implementations.
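
As a minimal sketch of K-fold cross-validation with k = 10, the snippet below uses scikit-learn's `KFold` and `cross_val_score` with a `Ridge` model. The data is made up, and the penalty strength `alpha=1.0` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical data: 5 features, only the first two influence y
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=0)

# scikit-learn maximizes scores, so the MSE is reported with a negative sign
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
mse_per_fold = -scores

print("MSE per fold:", mse_per_fold.round(4))
print("mean MSE:", mse_per_fold.mean().round(4))
```

Each of the 10 folds serves once as the test portion, and the mean over the folds is the cross-validated error estimate.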

# Hold-out test set
- Typically what we do is split our data into train and test sets.
- Alternatively, we can first set aside some data as a hold-out test set, and then perform the classical train-test split on the remaining data.
- There's a philosophical training strategy, beyond mathematical evaluation, in which we train our model on the training set, test it on the test set, and tune hyperparameters against `the test set`.
- This way we're using a slight workaround to make the model perform well on unseen data.
- Finally, to evaluate the model, we use the hold-out test set, which the model has never seen, neither during training nor during hyperparameter tuning with the test data.

    - This way the data set is divided into 3 portions, making the train set smaller.
    - To avoid the train-test split, we use cross-validation, so that we can train and test on the same data set while performing hyperparameter tuning
    - And for the final evaluation of the model, we use a small portion of the data set called the hold-out test set.

> This procedure is called a train-validation-test split or a train-validation-holdout test split

> It is philosophical in the sense that there is no mathematical proof that evaluating the model on the hold-out set is good and evaluating it on the test set used for hyperparameter tuning is bad. It is the general idea of evaluating the model honestly, on data it has never seen.
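
The workflow described in this section can be sketched as follows: set aside a hold-out set, tune the hyperparameter with cross-validation on the rest, and touch the hold-out set only once at the end. This is one possible arrangement, assuming a `Ridge` model, a 10% hold-out portion and a small made-up grid of penalty strengths.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 5 features, only the first two influence y
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# 1. Set aside the hold-out test set; it is not used again until step 3
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.1, random_state=0)

# 2. Tune the penalty strength with 5-fold cross-validation on the rest
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_rest, y_rest)

# 3. Final evaluation on data seen neither in training nor in tuning
holdout_mse = mean_squared_error(y_holdout, search.predict(X_holdout))

print("best alpha:", search.best_params_["alpha"])
print("hold-out MSE:", round(holdout_mse, 4))
```

By default `GridSearchCV` refits the best configuration on all of `X_rest`, so `search.predict` uses the tuned model for the final hold-out evaluation.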
