# Regularization Overview

Regularization is a way to reduce model overfitting. It requires adding some bias and searching for the optimal penalty hyperparameter.


### How it is done
- Minimizing model complexity
- Penalizing the loss function
- Reducing model overfitting (by adding some bias to reduce variance)

# Types of Regularization

1. <b>L1 regularization (or Lasso Regression)</b>
2. <b>L2 regularization (or Ridge Regression)</b>
3. <b>Elastic Net regularization (combination of L1 and L2)</b>

# 1. L1 Regularization (Lasso Regression)
- Adds a penalty equal to the absolute value of the magnitude of the coefficients to the loss function (i.e. the cost function).
- It limits the size of the coefficients in the regression equation.
- It can yield sparse models where some coefficients are exactly zero (in polynomial regression we saw that some coefficients were so small that they were almost zero; Lasso treats them as zero, which eliminates those terms).

$$ L1 = \sum_{i=0}^{m-1}(y_i-\hat y_i)^2 + \lambda \sum_{j=1}^{n}|\beta_j| $$
which is:
$$ L1 = SSR + \lambda \sum_{j=1}^{n}|\beta_j| $$

Expanding the Sum of Squared Residuals (SSR):
$$ \hat y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_nx_{in}$$
$$ SSR = \sum_{i=0}^{m-1} \left( y_i - \beta_0 - \sum_{j=1}^{n}\beta_j x_{ij}\right)^2  $$
- Here SSR = Sum of Squared Residuals, $m$ is the number of samples, and $n$ is the number of features.
- $\lambda$ is a hyperparameter that controls the strength of the penalty.
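
As a rough illustration of the L1 objective above, the following sketch evaluates SSR plus the absolute-value penalty in plain Python. The function names (`predict`, `ssr`, `lasso_loss`) and the toy data are made up for this example, and it assumes the intercept $\beta_0$ is left out of the penalty, which is a common convention rather than something these notes state.

```python
def predict(beta0, betas, x_row):
    """Linear prediction: beta_0 + sum_j beta_j * x_ij."""
    return beta0 + sum(b * x for b, x in zip(betas, x_row))


def ssr(beta0, betas, X, y):
    """Sum of Squared Residuals over all samples."""
    return sum((yi - predict(beta0, betas, row)) ** 2 for row, yi in zip(X, y))


def lasso_loss(beta0, betas, X, y, lam):
    """SSR plus lam times the sum of absolute coefficient values.

    Assumption: the intercept beta0 is not penalized.
    """
    l1_penalty = sum(abs(b) for b in betas)
    return ssr(beta0, betas, X, y) + lam * l1_penalty


# Toy data (illustrative only): 3 samples, 2 features.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [5.0, 4.0, 7.0]
beta0, betas = 1.0, [2.0, -0.5]

print("SSR only (lam = 0):", lasso_loss(beta0, betas, X, y, lam=0.0))
print("With penalty (lam = 2):", lasso_loss(beta0, betas, X, y, lam=2.0))
```

With `lam = 0` the loss is just the SSR; any positive `lam` adds a cost proportional to the total absolute size of the coefficients.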

# 2. L2 Regularization (Ridge Regression)
- Adds a penalty equal to the square of the magnitude of the coefficients.
- All the coefficients are shrunk toward zero, but they are not necessarily eliminated (set exactly to zero).
  $$ L2 = \sum_{i=0}^{m-1}(y_i - \hat y_i)^2 + \lambda \sum_{j=1}^{n} \beta_j^2 $$
  which is:
  $$ L2 = SSR + \lambda \sum_{j=1}^{n} \beta_j^2 $$

- Conceptually, the ridge regression objective is:
  $$ Error = SSR + \text{Shrinkage Penalty} $$
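
To make the difference between shrinking and eliminating concrete, here is a small sketch for the simplest possible case: one feature and no intercept. Under that assumption both penalties have a closed-form minimizer, obtained by setting the derivative of the objective to zero. The helper names `ridge_slope` and `lasso_slope` are hypothetical, and the data is a toy example.

```python
import math


def ridge_slope(x, y, lam):
    """Minimizer of sum((y_i - b*x_i)^2) + lam * b^2 (one feature, no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)


def lasso_slope(x, y, lam):
    """Minimizer of sum((y_i - b*x_i)^2) + lam * |b| (one feature, no intercept).

    This is the soft-thresholding solution: the slope is pulled toward
    zero by a fixed amount and clipped at exactly zero.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return math.copysign(shrunk, sxy) / sxx


# Toy data where the unregularized slope is exactly 2.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

for lam in (0.0, 14.0, 56.0, 1000.0):
    print(f"lam={lam:7.1f}  ridge={ridge_slope(x, y, lam):.4f}  "
          f"lasso={lasso_slope(x, y, lam):.4f}")
```

In this setting the ridge slope keeps getting smaller as `lam` grows but stays nonzero, while the lasso slope becomes exactly zero once `lam` is large enough.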


   




# 3. Elastic Net (Combination of L1 and L2)

$$ Error = SSR + \lambda_1 \sum_{j=1}^{n} |\beta_j| + \lambda_2 \sum_{j=1}^{n} \beta_j^2 $$
- We can alternatively express this with a single strength $\lambda$ and a ratio $\alpha$ between L1 and L2:
  $$ Error = SSR + \lambda \left( \frac{1-\alpha}{2} \sum_{j=1}^{n}\beta_j^2+\alpha\sum_{j=1}^{n}|\beta_j|\right) $$

  - Here, $\alpha$ and $\lambda$ are the tunable parameters.
    - $\alpha$ is the mixing ratio between the L1 and L2 penalties.
    - When $\alpha = 0$, only the L2 penalty is applied.
    - When $\alpha = 1$, only the L1 penalty is applied.
    > Scikit-learn's `ElasticNetCV` model calls $\lambda$ the `alpha` parameter and calls the $\alpha$ used here `l1_ratio`.
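
The ratio form of the penalty can be checked numerically. The sketch below computes only the penalty term (not the SSR) in plain Python; `elastic_net_penalty` is a made-up helper name, and its arguments follow the notation of these notes ($\lambda$ as `lam`, $\alpha$ as `alpha`), not scikit-learn's parameter names.

```python
def elastic_net_penalty(betas, lam, alpha):
    """lam * ((1 - alpha) / 2 * sum(beta_j^2) + alpha * sum(|beta_j|)).

    Uses the notation of these notes: lam is the overall strength and
    alpha is the L1/L2 mixing ratio (0 = pure L2, 1 = pure L1).
    """
    l1 = sum(abs(b) for b in betas)
    l2 = sum(b * b for b in betas)
    return lam * ((1 - alpha) / 2 * l2 + alpha * l1)


# Illustrative coefficients (intercept excluded by assumption).
betas = [2.0, -0.5, 0.0]

print("alpha = 0   (L2 only):", elastic_net_penalty(betas, lam=2.0, alpha=0.0))
print("alpha = 0.5 (mixed):  ", elastic_net_penalty(betas, lam=2.0, alpha=0.5))
print("alpha = 1   (L1 only):", elastic_net_penalty(betas, lam=2.0, alpha=1.0))
```

Note that at `alpha = 0` the L2 term carries the factor $\frac{1}{2}$ from the formula, so the value is half of `lam` times the sum of squares.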



# Conclusion
- In regularization, we simply add a penalty term to the error that we're trying to minimize.
- $\lambda$ is the hyperparameter. $\lambda = 0$ is the same as performing no regularization: the objective reduces to the Sum of Squared Residuals (SSR).
> Scikit-learn calls the penalty strength `alpha` in its regularized linear models (`Ridge`, `Lasso`, `ElasticNet`). That `alpha` corresponds to $\lambda$ here, so do not confuse it with the mixing ratio $\alpha$.

# Revisit on Feature Scaling
- Algorithms such as gradient descent and KNN (which relies on a distance metric) require feature scaling to perform well.
- In gradient descent, the coefficients of large-scale features are updated faster than those of small-scale features. Scaling the features lets gradient descent converge efficiently.
- Feature scaling has no effect on some ML algorithms, generally those based on decision trees (regression trees, decision trees, random forests, etc.).
> If we scale the training features, we must also scale unseen data before feeding it to the model.

- Scaling improves coefficient interpretability: we can relate and compare coefficients and their impact.
- It can greatly improve the performance of scale-sensitive models.

> If we're not sure whether we need to scale, we can scale anyway. It generally has no drawback and does not distort the data set. The only thing to remember is to scale unseen data (with the same scaling factors that were fit on the training data) before feeding it to the model.

#### Ways to scale features
1. <b>Standardization</b>: Rescale the data to have mean $(\mu) = 0$ and standard deviation $(\sigma) = 1$. It is also known as Z-score normalization.
      $$x_{1,scaled} = \frac{x_1 - \mu_1}{\sigma_1}$$
2. <b>Normalization</b>: Rescale all the data values to lie between $0$ and $1$.
      $$x_{1,scaled} = \frac{x_1 - x_{1,min}}{x_{1,max} - x_{1,min}}$$
> When performing feature scaling, the mean, standard deviation, min, and max should be calculated from the training data only. Computing them from the whole data set (both train and test) leaks information about the test set into training. In scikit-learn, we can call `.fit()` on the training data and then `.transform()` on both the training and the unseen data.
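
The fit-on-train, transform-everything pattern can be sketched without any library. The helper names below (`fit_standardizer`, `standardize`, `fit_minmax`, `minmax_scale`) are hypothetical and only mimic the idea behind `.fit()` and `.transform()`; the population standard deviation is used here as an assumption, and the values are a toy single feature.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Learn mu and sigma from the training data only."""
    return mean(train_values), pstdev(train_values)


def standardize(values, mu, sigma):
    """Apply (x - mu) / sigma using previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


def fit_minmax(train_values):
    """Learn min and max from the training data only."""
    return min(train_values), max(train_values)


def minmax_scale(values, lo, hi):
    """Apply (x - min) / (max - min) using previously fitted statistics."""
    return [(v - lo) / (hi - lo) for v in values]


# One feature, split into training and unseen values (toy numbers).
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
unseen = [7.0, 11.0]

mu, sigma = fit_standardizer(train)   # statistics come from train only
lo, hi = fit_minmax(train)

print("standardized train: ", standardize(train, mu, sigma))
print("standardized unseen:", standardize(unseen, mu, sigma))
print("normalized unseen:  ", minmax_scale(unseen, lo, hi))
```

Because the min and max come from the training data, a normalized unseen value can fall outside the range 0 to 1, as the value `11.0` does here. That is expected and is not a reason to refit on the unseen data.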

# Cross-Validation Overview
- It is a way to estimate errors without having to split the data set into separate train and test sets.
- Because we're not holding out part of the data set for testing, we have more data points to train our model (generally about 30% of the data set would otherwise have been split off for model testing).
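
One common way to do this is k-fold cross-validation, which these notes have not introduced yet, so the following is only a minimal sketch under that assumption. Every data point is used for training in some folds and for validation in exactly one fold. The helper names and the one-feature, no-intercept model are made up for illustration; in practice the data would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_slope(x, y, lam=0.0):
    """One-feature, no-intercept fit with an optional L2 penalty lam."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def cross_val_mse(x, y, k, lam=0.0):
    """Average validation MSE over k folds."""
    fold_errors = []
    for val_idx in k_fold_indices(len(x), k):
        held_out = set(val_idx)
        # Train on everything except the current fold.
        x_tr = [x[i] for i in range(len(x)) if i not in held_out]
        y_tr = [y[i] for i in range(len(y)) if i not in held_out]
        slope = fit_slope(x_tr, y_tr, lam)
        # Validate on the held-out fold only.
        mse = sum((y[i] - slope * x[i]) ** 2 for i in val_idx) / len(val_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / len(fold_errors)


# Toy data that follows y = 2x exactly.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

for lam in (0.0, 1.0, 10.0):
    print(f"lam={lam:5.1f}  CV MSE={cross_val_mse(x, y, k=3, lam=lam):.4f}")
```

Comparing the cross-validated error across several `lam` values, as in the loop above, is one way to search for the penalty hyperparameter.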