# <p style='text-align: center;'> Regularization in Machine Learning </p>

## Regularization :
- Regularization is a technique to prevent the model from overfitting by adding extra information to it.


- One problem that often occurs in practice with multiple linear regression is multicollinearity: two or more predictor variables are so highly correlated with each other that they do not provide unique or independent information in the regression model. This makes the coefficient estimates unreliable and gives them high variance, so the model is likely to perform poorly on data it has not seen before. Regularization is one way to reduce the effect of multicollinearity.


- Regularization is also used to reduce overfitting in linear regression: when a model overfits a dataset, the extra information added by regularization constrains the model.


- Regularization works by adding a penalty (complexity) term to the cost function of the model.


- In regularization, we reduce the magnitude of the feature coefficients while keeping the same number of features.


- Sometimes a machine learning model performs well on the training data but poorly on the test data. This means the model has learned the noise in the training data and cannot predict the output well for unseen data; such a model is said to be overfitted. Regularization helps to deal with this problem.




## Techniques of Regularization :
<b> There are mainly three types of regularization techniques, which are given below :
    
   - Ridge Regression (L2 regularization).
   - Lasso Regression (L1 regularization).
   - Elastic net Regression.
    

<b> Before discussing Techniques of Regularization, we need to discuss some topics :
    
    
    a) Cost function and Loss function.
    b) Bias-Variance Tradeoff.

### a) Cost function and Loss function :
- The cost function is the average error over all n samples in the data (that is, over the whole training dataset).


- The loss function captures the difference between the actual and predicted values for a single record.
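
The distinction can be illustrated with a minimal sketch. It assumes squared error as the loss and its mean as the cost; the function names and the data values are hypothetical and chosen only for illustration.

```python
def squared_error_loss(y_true, y_pred):
    """Loss: the error for a single record."""
    return (y_true - y_pred) ** 2


def mse_cost(y_true_all, y_pred_all):
    """Cost: the average loss over all n records."""
    losses = [squared_error_loss(t, p) for t, p in zip(y_true_all, y_pred_all)]
    return sum(losses) / len(losses)


# Hypothetical actual and predicted values, for illustration only.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

per_record_loss = [squared_error_loss(t, p) for t, p in zip(actual, predicted)]
cost = mse_cost(actual, predicted)

print(per_record_loss)  # one loss value per record: [1.0, 0.0, 4.0]
print(cost)             # a single number for the whole dataset (5 / 3)
```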

### b) Bias-Variance Trade-off  :
- While building the machine learning model, it is really important to take care of bias and variance in order to avoid overfitting and underfitting in the model. If the model is very simple with fewer parameters, it may have low variance and high bias. Whereas, if the model has a large number of parameters, it will have high variance and low bias. So, it is required to make a balance between bias and variance errors, and this balance between the bias error and variance error is known as the **Bias-Variance trade-off**.


- For accurate predictions, a model needs both low variance and low bias. In practice this is hard to achieve because bias and variance are related to each other :

    - If we decrease the variance, the bias usually increases.
    - If we decrease the bias, the variance usually increases.
    
    
- Bias-Variance trade-off is about finding the sweet spot to make a balance between bias and variance errors.


<b> Bias :
- While making predictions, there is a difference between the values predicted by the model and the actual (expected) values. This difference is known as bias error, or error due to bias.
    
    
- **Low Bias:** A low bias model will make fewer assumptions about the form of the target function.
    
    
- **High Bias:** A model with a high bias makes more assumptions, and the model becomes unable to capture the important features of our dataset. A high bias model also cannot perform well on new data.
    
    
High bias mainly occurs when the model is too simple. Below are some ways to reduce high bias :
    
   - Increase the input features as the model is underfitted.
   - Decrease the regularization term.
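   - Note that adding more training data alone usually does not reduce high bias, because the model is too simple to capture the pattern.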
   - Use more complex models, such as including some polynomial features.
    
    
High bias can be identified when the model has a high training error and a test error that is close to the training error.
    
    
<b> Variance :
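- Variance tells us how much the model's predictions would change if it were trained on a different training dataset. (In statistics, the variance of a random variable measures how far it spreads from its expected value.)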
    
    
- **Low variance** means there is a small variation in the prediction of the target function with changes in the training data set. 
    
    
- **High variance** shows a large variation in the prediction of the target function with changes in the training dataset.
    
    
- With high variance, the model learns too much from the training dataset, including its noise, which leads to overfitting. High variance is associated with the following :

    - Overfitting of the model.
    - Increased model complexity.
    
    
Below are some ways to reduce the high Variance :
    
   - Reduce the number of input features or parameters, as the model is overfitted.
   - Avoid an overly complex model.
   - Increase the Regularization term.
    
    
High variance can be identified if the model has: Low training error and high test error.

<b> There are four possible combinations of bias and variances, which are represented by the below diagram :
    
![image.png](attachment:image.png)
    
    
**1. Low-Bias, Low-Variance:** The combination of low bias and low variance shows an ideal (good) machine learning model. However, it is rarely achievable in practice.
    
    
**2. Low-Bias, High-Variance:** With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns with a large number of parameters and hence leads to **overfitting**.
    
    
**3. High-Bias, Low-Variance:** With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses too few parameters. It leads to **underfitting** problems in the model.
    
    
**4. High-Bias, High-Variance:** With high bias and high variance, predictions are inconsistent and also inaccurate on average. This is the worst machine learning model. 
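
The ideas of "consistent" and "accurate on average" can be made concrete with a small simulation. The following is only a sketch under stated assumptions: the true function, the noise level, and the two toy models (one that always predicts the training mean, one that copies the nearest training point) are all hypothetical and chosen for illustration. The same experiment is repeated on many freshly sampled training sets, and the squared bias and the variance of the prediction at one point are estimated.

```python
import random
import statistics


def true_function(x):
    # Assumed ground truth for the simulation.
    return x ** 2


def sample_training_set(rng, noise_sd=0.3):
    """Draw noisy observations of true_function on a fixed grid of x values."""
    xs = [i / 5 for i in range(-5, 6)]  # -1.0, -0.8, ..., 1.0
    ys = [true_function(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def predict_mean(xs, ys, x0):
    """Very simple model: ignores x0 and always predicts the training mean."""
    return statistics.mean(ys)


def predict_nearest(xs, ys, x0):
    """Very flexible model: copies the y value of the nearest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bias_sq_and_variance(predict, x0, n_trials=500, seed=0):
    """Estimate squared bias and variance of the prediction at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_trials):
        xs, ys = sample_training_set(rng)
        preds.append(predict(xs, ys, x0))
    bias_sq = (statistics.mean(preds) - true_function(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance


simple_bias_sq, simple_var = bias_sq_and_variance(predict_mean, x0=1.0)
flexible_bias_sq, flexible_var = bias_sq_and_variance(predict_nearest, x0=1.0)

print("simple model   : bias^2 =", simple_bias_sq, " variance =", simple_var)
print("flexible model : bias^2 =", flexible_bias_sq, " variance =", flexible_var)
```

In this setup the simple model corresponds to the high-bias, low-variance case and the flexible model to the low-bias, high-variance case.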

<b> Now, we are going to discuss the Techniques of Regularization

## Ridge Regression (L2 regularization) :
- Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It is also called L2 regularization.


- In this technique, the cost function is altered by adding a penalty term to it. The small amount of bias that this adds to the model is called the Ridge Regression penalty. It is calculated by multiplying alpha by the sum of the squared weights of the individual features.


- The equation for the Linear regression will be :

    sum of residuals squared (SSE) = ∑ (y - ŷ)^2
    
    
- The equation for the Ridge regression will be :

       sum of residuals squared (SSE) + Penalty Term = ∑ (y - ŷ)^2 + α ∑ Bn^2
    

- Here, α (alpha) is the penalty or tuning parameter, which balances the emphasis given to minimizing the SSE versus minimizing the sum of squared coefficients. In scikit-learn's `Ridge`, the default value of α is 1.


- In the above equation, the penalty term regularizes the coefficients of the model: ridge regression reduces the magnitudes of the coefficients, which decreases the complexity of the model.


- As we can see from the above equation, as α tends to zero, the equation becomes the cost function of the linear regression model.


- An ordinary linear or polynomial regression can give unstable coefficient estimates when there is high collinearity between the independent variables. Ridge regression can be used to solve such problems.
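
The effect of the penalty on collinear predictors can be sketched with two almost identical features. This is a minimal illustration, not a library implementation: it assumes centered data with no intercept, solves the ridge normal equations `(XᵀX + αI) b = Xᵀy` directly for the two-feature case, and uses hypothetical data values. In practice a library such as scikit-learn would normally be used.

```python
def ridge_two_features(x1, x2, y, alpha):
    """Solve (X^T X + alpha * I) b = X^T y for two features, no intercept."""
    a11 = sum(u * u for u in x1) + alpha
    a22 = sum(v * v for v in x2) + alpha
    a12 = sum(u * v for u, v in zip(x1, x2))
    c1 = sum(u * t for u, t in zip(x1, y))
    c2 = sum(v * t for v, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    # Cramer's rule for the 2x2 system.
    b1 = (c1 * a22 - a12 * c2) / det
    b2 = (a11 * c2 - a12 * c1) / det
    return b1, b2


# Hypothetical, already centered data: x2 is almost a copy of x1,
# and y is roughly x1 + x2.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.1, -0.9, 0.0, 1.1, 1.9]
y = [-4.3, -1.8, 0.1, 2.2, 3.8]

ols = ridge_two_features(x1, x2, y, alpha=0.0)    # alpha = 0: ordinary least squares
ridge = ridge_two_features(x1, x2, y, alpha=1.0)  # alpha = 1: ridge regression

print("OLS coefficients  :", ols)    # unbalanced, one coefficient is even negative
print("Ridge coefficients:", ridge)  # smaller and close to each other
```

With α = 0 the two correlated predictors receive very different coefficients, while the ridge penalty pulls them toward similar, smaller values.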


## Ridge Regression (L2 regularization) Intuition :

![image.png](attachment:image.png)

<b> Explanation :
- The least sum of squares is applied to obtain the best fit line.


<b> In the Fig.1 :
- Since the line passes through the 3 training dataset points, the sum of squared residuals = 0.


- However, for the testing dataset, the sum of squared residuals is large, so the line has a high variance.


- Variance means that there is a difference in fit (or variability) between the training dataset and the testing dataset.
    
    
- This regression model is overfitting the training dataset. 
    
    
<b> In the Fig.2 :
- Ridge regression works by slightly increasing the bias in order to reduce the variance (that is, to improve the generalization capability).
    
    
- This works by changing the slope of the line.
    
    
- The model performance might be slightly worse on the training set, but it will be more consistent across the training and testing datasets.


- The slope has been reduced by the ridge regression penalty, so the model becomes less sensitive to changes in the independent variable (# years of experience).
    
   
- <b> Least Squares Regression :
    
       Min(sum of the squared residuals)
    
- <b> Ridge Regression :
    
       Min(sum of the squared residuals + α slope^2)
    
    
<b> In the Fig.3 :
- As Alpha increases, the slope of the regression line is reduced and becomes more horizontal.
    
    
- As Alpha increases, the model becomes less sensitive to the variations of the independent variable (# years of experience).
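
The shrinking slope can be reproduced numerically. The sketch below assumes a single feature and an unpenalized intercept, in which case minimizing `sum of the squared residuals + α slope^2` gives `slope = Sxy / (Sxx + α)`, where `Sxy` and `Sxx` are the centered sums of cross-products and squares. The data values are hypothetical.

```python
def ridge_line(xs, ys, alpha):
    """Fit y = intercept + slope * x, penalizing only the slope with alpha * slope**2."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data: years of experience vs. salary (in thousands).
years = [1.0, 2.0, 3.0]
salary = [30.0, 55.0, 60.0]

alphas = [0, 1, 10, 100]
slopes = [ridge_line(years, salary, a)[1] for a in alphas]

for a, s in zip(alphas, slopes):
    print("alpha =", a, " slope =", s)
```

With α = 0 the result is the ordinary least squares line; larger values of α give a flatter line.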
    

## Lasso Regression (L1 regularization) :
- Lasso regression is a regularization technique, which is used to reduce the complexity of the model. It is also called L1 regularization.


- Lasso regression is a regularized regression algorithm that performs L1 regularization which adds penalty equal to the absolute value of the magnitude of coefficients. “LASSO” stands for Least Absolute Shrinkage and Selection Operator.


- It is similar to the Ridge Regression except that the penalty term contains only the absolute weights instead of a square of weights.


- Because it penalizes absolute values, it can shrink a slope exactly to 0, whereas Ridge Regression can only shrink it close to 0.


- Lasso regression helps to reduce overfitting and it is particularly useful for feature selection.


- Lasso regression can be useful if we have several independent variables that are useless.


- The equation for the Linear regression will be :

    sum of residuals squared (SSE) = ∑ (y - ŷ)^2
    
    
- The equation for the Lasso regression will be :

       sum of residuals squared (SSE) + Penalty Term = ∑ (y - ŷ)^2 + α ∑ |Bn|
       
       
- In the above equation, the penalty term regularizes the coefficients of the model: lasso regression reduces the magnitudes of the coefficients, which decreases the complexity of the model.


- As we can see from the above equation, if the values of α tend to zero, the equation becomes the cost function of the linear regression model.
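
Why the L1 penalty can produce a slope of exactly 0 can be seen in the single-feature case. The sketch below assumes one feature and an unpenalized intercept. Under that assumption, minimizing `∑ (y - ŷ)^2 + α |slope|` gives a soft-thresholded slope: the centered cross-product `Sxy` is moved toward zero by `α / 2` and set to zero if it does not exceed that threshold. The ridge slope is included for comparison. Function names and data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Move value toward zero by threshold; return 0.0 if it would cross zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def centered_sums(xs, ys):
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy, sxx


def lasso_slope(xs, ys, alpha):
    """Slope minimizing sum of squared residuals + alpha * |slope|."""
    sxy, sxx = centered_sums(xs, ys)
    return soft_threshold(sxy, alpha / 2) / sxx


def ridge_slope(xs, ys, alpha):
    """Slope minimizing sum of squared residuals + alpha * slope**2."""
    sxy, sxx = centered_sums(xs, ys)
    return sxy / (sxx + alpha)


# Hypothetical data: years of experience vs. salary (in thousands).
years = [1.0, 2.0, 3.0]
salary = [30.0, 55.0, 60.0]

for alpha in [0, 20, 100]:
    print("alpha =", alpha,
          " lasso slope =", lasso_slope(years, salary, alpha),
          " ridge slope =", ridge_slope(years, salary, alpha))
```

For a large enough α the lasso slope is exactly zero, while the ridge slope only gets smaller.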

## Key Difference between Ridge Regression and Lasso Regression :
<b> Ridge Regression :
1. Ridge regression is mostly used to reduce the overfitting in the model, and it includes all the features present in the model. It reduces the complexity of the model by shrinking the coefficients.
    
    
<b> Lasso Regression :
1. Lasso regression helps to reduce overfitting in the model and also performs feature selection.

## Is ridge or lasso regression better?
- In cases where only a small number of predictor variables are significant, lasso regression tends to perform better because it can shrink the coefficients of insignificant variables all the way to zero and so remove them from the model.


- However, when many predictor variables are significant in the model and their coefficients are roughly equal then ridge regression tends to perform better because it keeps all of the predictors in the model.


- To determine which model is better at making predictions, we typically perform k-fold cross-validation and choose whichever model produces the lowest test mean squared error.
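
The k-fold comparison can be sketched without any library. This is a simplified illustration under stated assumptions: a single feature, the closed-form ridge and lasso slopes for that case, folds formed by taking every k-th record, and hypothetical data. A real project would more likely rely on a library's cross-validation utilities and shuffle the data first.

```python
def fit_line(xs, ys, alpha, penalty):
    """Fit y = intercept + slope * x with a ridge ("l2") or lasso ("l1") penalty on the slope."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if penalty == "l2":
        slope = sxy / (sxx + alpha)
    else:  # "l1": soft-threshold the numerator by alpha / 2
        shrunk = max(abs(sxy) - alpha / 2, 0.0)
        slope = (shrunk if sxy > 0 else -shrunk) / sxx
    return y_mean - slope * x_mean, slope


def k_fold_mse(xs, ys, k, alpha, penalty):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for fold in range(k):
        # Fold `fold` holds out every k-th record, starting at index `fold`.
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        intercept, slope = fit_line(
            [xs[i] for i in train], [ys[i] for i in train], alpha, penalty
        )
        errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in test]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0]

results = {
    "ridge": k_fold_mse(xs, ys, k=3, alpha=1.0, penalty="l2"),
    "lasso": k_fold_mse(xs, ys, k=3, alpha=1.0, penalty="l1"),
}
best = min(results, key=results.get)

print(results)
print("Lowest cross-validated MSE:", best)
```

The same loop can be repeated over several values of α to tune the penalty strength for each model before comparing them.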

## Elastic net Regression :
-  Elastic net linear regression uses the penalties from both the lasso and ridge techniques to regularize regression models.


- Elastic net combines L1 and L2 with the addition of a λ parameter deciding the ratio between them.


- The coefficients of the variables should reflect relevant information. However, ridge regression does not remove irrelevant coefficients (it only shrinks them), which is one of its disadvantages compared with Elastic Net Regression (ENR).


- It uses both Lasso as well as Ridge Regression regularization in order to remove all unnecessary coefficients but not the informative ones.


- The equation for Elastic net Regression is given below :

    ∑ (y - ŷ)^2 + α ( (1 - λ) ∑ Bn^2 + λ ∑ |Bn| )
    
  
Where, 

   - α (alpha) is the penalty or tuning parameter, which controls the overall amount of regularization applied to the model, that is, the emphasis given to minimizing the SSE versus minimizing the penalty on the coefficients. Defaults to 1.0.
   - λ (`l1_ratio`) is the elastic net mixing parameter, with `0 <= λ <= 1`. For `λ = 0` the penalty is an L2 penalty. For `λ = 1` it is an L1 penalty. For `0 < λ < 1`, the penalty is a combination of L1 and L2.
   
   
- In other words, λ = 0 corresponds to ridge and λ = 1 to lasso. Simply put, if we plug in 0 for lambda, the penalty function reduces to the L2 (ridge) term, and if we set lambda to 1 we get the L1 (lasso) term. Therefore, we can choose a lambda value between 0 and 1 to tune the elastic net.
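
The two limiting cases can be checked directly on the penalty term `α ( (1 - λ) ∑ Bn^2 + λ ∑ |Bn| )`. The sketch below implements only that penalty as written in this section; the function name and the coefficient values are hypothetical. Note that library implementations may scale the terms differently (scikit-learn's `ElasticNet`, for example, divides the squared-error term by `2 * n_samples` and puts a factor of 0.5 on the L2 term), so the numbers are not directly comparable.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Penalty alpha * ((1 - l1_ratio) * sum(b^2) + l1_ratio * sum(|b|))."""
    l2_term = sum(b ** 2 for b in coefs)
    l1_term = sum(abs(b) for b in coefs)
    return alpha * ((1 - l1_ratio) * l2_term + l1_ratio * l1_term)


# Hypothetical coefficient values, for illustration only.
coefs = [3.0, -4.0]

ridge_like = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.0)  # pure L2: 9 + 16
lasso_like = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=1.0)  # pure L1: 3 + 4
mixed = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.5)       # half of each

print(ridge_like)  # 25.0
print(lasso_like)  # 7.0
print(mixed)       # 16.0
```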

## Difference between Ridge, Lasso and Elastic Net Regression :
1. In terms of handling bias, Elastic Net is often considered better than Ridge or Lasso regression alone, because the mix of the two penalties can be tuned. Elastic Net also tends to handle collinearity better than Lasso alone, since its L2 term tends to keep correlated predictors together instead of arbitrarily selecting one of them.


2. When it comes to complexity, Elastic Net can perform better than Ridge regression, because Ridge does not reduce the number of variables at all, whereas Elastic Net can remove some of them. Being unable to drop irrelevant variables can lower model accuracy.


3. Ridge and Elastic Net can be more accurate than Lasso Regression in some cases. Lasso keeps only the predictors with non-zero coefficients, and accuracy suffers when it shrinks the coefficient of a relevant predictor to zero.