# Regularization:

https://www.statisticshowto.com/regularization/  
Regularization is a <span class="mark">set of techniques that help avoid overfitting in neural networks, and hence help improve the accuracy</span> of deep learning models when new data is fed to them. There are various regularization techniques; some of the most popular are **L1, L2, dropout, early stopping, and data augmentation.**

Regularization is a <span class="mark">technique which makes slight modifications to the learning algorithm such that the model generalizes better</span>. This in turn improves the model’s performance on the unseen data as well.

![Screenshot%202021-06-02%20191558.png](attachment:Screenshot%202021-06-02%20191558.png)

Have you seen this image before? As <span class="girk">we move towards the right in this image, our model tries to learn too well the details and the noise from the training data, which ultimately results in poor performance on the unseen data.</span>

In other words, <span class="mark">while going towards the right, the complexity of the model increases such that the training error reduces but the testing error doesn’t. This is shown in the image below</span>

![Screenshot%202021-06-02%20192024.png](attachment:Screenshot%202021-06-02%20192024.png)

# Why is Regularization Required?
The characteristic of a good machine learning model is its ability to generalise well from the training data to any data from the problem domain; this allows it to make good predictions on the data that model has never seen. **To define generalisation, it refers to how well the model has learnt the concepts to apply to any data rather than just with the specific data it was trained on during the training process.**

On the flip side, if the **model is not generalised, a problem of overfitting emerges. In overfitting, the machine learning model fits the training data too well but fails when applied to the testing data. <span class="burk">It even picks up the noise and fluctuations in the training data and learns them as concepts.</span> This is where regularization steps in and makes slight changes to the learning algorithm so that the model generalises better.**
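A toy illustration of this behaviour, using made-up numbers (the data, the `memorizer` model and the straight-line model are all assumptions for the sketch, not taken from the source): a model that memorizes the noisy training points gets zero training error but a worse test error than a much simpler model.

```python
# Toy data: the true relationship is y = 2x; the training targets carry noise.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]            # 2x plus noise of +1 / -1
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [2 * x for x in test_x]        # noise-free targets from the same rule


def memorizer(x):
    """Overfit model: return the target of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# Simple model: least-squares line through the origin, y = w * x.
w = sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)


def line(x):
    return w * x


def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


print("memorizer train/test MSE:", mse(memorizer, train_x, train_y), mse(memorizer, test_x, test_y))
print("line      train/test MSE:", mse(line, train_x, train_y), mse(line, test_x, test_y))
```

The memorizer has learnt the noise as if it were a concept, so its perfect training score does not carry over to unseen points.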

# What does regularization do to the weights?
Regularization refers to the act of modifying a learning algorithm to favor “simpler” prediction rules to avoid overfitting. Most commonly, <span class="mark">regularization refers to modifying the loss function</span> <span class="mark">to penalize certain values of the weights you are learning</span>. Specifically, it <span class="mark">penalizes weights that are large, which limits the size of the coefficients</span>.

# How does Regularization help reduce Overfitting?

![Screenshot%202021-06-02%20221020.png](attachment:Screenshot%202021-06-02%20221020.png)

<span class="mark">If the penalty is too weak, the model will be allowed to overfit the training data.</span>

![Screenshot%202021-06-02%20221215.png](attachment:Screenshot%202021-06-02%20221215.png)

If the penalty is too strong, the model will underestimate the weights and underfit the problem.

![Screenshot%202021-06-02%20221400.png](attachment:Screenshot%202021-06-02%20221400.png)
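The effect of the penalty strength can be sketched with the smallest possible model: one weight `w`, a squared-error loss and an L2 penalty `lam * w**2`. The data values and the name `ridge_weight` are assumptions made for this example. For this objective the best weight has a closed form, so we can watch it move as the penalty grows.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2 for one weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x

# A larger penalty pulls the learned weight further towards zero.
for lam in [0.0, 1.0, 10.0, 1000.0]:
    print(f"lambda={lam:7.1f}  w={ridge_weight(xs, ys, lam):.3f}")
```

With `lam = 0` the weight is the plain least-squares fit (close to 2); with a very large `lam` the weight is pushed almost to zero and the model can no longer follow the data, which is the underfitting case described above.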

# i) L2 & L1 regularization

L1 and L2 are the most common types of regularization. These update the general cost function by adding another term known as the regularization term.

<div class="burk">
<span class="girk">Cost function = Loss (say, binary cross entropy) + Regularization term</span></div><i class="fa fa-lightbulb-o "></i>

Due to the addition of this regularization term, <span class="mark">the values of the weight matrices decrease. The underlying assumption is that a neural network with smaller weight matrices is a simpler model, so this also reduces overfitting to quite an extent.</span>
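As a minimal sketch of "cost = loss + regularization term", the functions below add an L1 or L2 penalty to an already computed data loss. The function names, the example weights and the value of `lam` are illustrative assumptions; some texts also divide the penalty by 2 or by the number of samples, which only rescales `lam`.

```python
def l1_penalty(weights, lam):
    """L1 term: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 term: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def cost(data_loss, weights, lam, kind="l2"):
    """Total cost = data loss (e.g. binary cross entropy) + regularization term."""
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return data_loss + penalty


weights = [0.5, -2.0, 3.0]
print(cost(0.30, weights, lam=0.01, kind="l1"))
print(cost(0.30, weights, lam=0.01, kind="l2"))
```

Because the penalty grows with the size of the weights, the optimizer can lower the total cost either by fitting the data better or by making the weights smaller.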

![Screenshot%202021-06-02%20223213.png](attachment:Screenshot%202021-06-02%20223213.png)

![Screenshot%202021-06-02%20223455.png](attachment:Screenshot%202021-06-02%20223455.png)

Unlike L2 regularization, where weights are shrunk but never reduced exactly to zero, L1 regularization penalises the absolute value of the weights. This technique is useful when the aim is to compress the model. Also called Lasso regularization, it assigns zero weight to insignificant input features and non-zero weight to useful features.
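Why L1 produces exact zeros while L2 does not can be seen on a single weight. The sketch below assumes the simplified one-weight objectives ½(w − w₀)² + λ|w| (L1) and ½(w − w₀)² + (λ/2)w² (L2), where w₀ is the weight the model would learn without any penalty; their minimizers are the two functions shown. The names and numbers are made up for illustration.

```python
def lasso_shrink(w0, lam):
    """Soft-thresholding: move w0 towards zero by lam, stopping at exactly zero."""
    if w0 > lam:
        return w0 - lam
    if w0 < -lam:
        return w0 + lam
    return 0.0


def ridge_shrink(w0, lam):
    """Proportional shrinkage: the weight gets smaller but stays non-zero."""
    return w0 / (1.0 + lam)


unregularized = [3.0, -1.2, 0.3, -0.1]
lam = 0.5
print("L1:", [lasso_shrink(w, lam) for w in unregularized])
print("L2:", [ridge_shrink(w, lam) for w in unregularized])
```

Under these assumptions, the two small weights become exactly zero with L1 (those features drop out of the model), while L2 keeps all four weights at reduced, non-zero values.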

# L1 vs L2 Regularization: The intuitive difference

![Screenshot%202021-06-02%20232001.png](attachment:Screenshot%202021-06-02%20232001.png)

The intuition usually given for the difference between L1 and L2 is that <span class="mark">minimizing a sum of absolute values leads to the median of the data, while minimizing a sum of squares leads to the mean of the data.</span>

As we can see from the formula of L1 and L2 regularization, L1 regularization adds the penalty term in cost function by adding the absolute value of weight(Wj) parameters, while L2 regularization adds the squared value of weights(Wj) in the cost function.   

When the derivative of an absolute-value (L1) term is taken, the minimum ends up at the median of the data. To see why, suppose the data is spread along a horizontal line and you pick an arbitrary point on it. If you move that point a distance d to the left, every data value on the left gets closer by d and contributes less to the loss, while every data value on the right gets farther away by d and contributes more.

So the total loss keeps decreasing only while more data values lie on the side you are moving towards. It stops decreasing when the same number of values lie on each side, which is the middle of the data distribution, i.e. the median.

With a squared (L2) term, the derivative is proportional to the distance from each data value, so setting it to zero gives the average of the data: the minimum is at the mean.
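The median/mean intuition can be checked numerically. This is a small sketch on made-up data with one large outlier: it searches a grid of integer candidates for the value that minimizes the sum of absolute differences and the value that minimizes the sum of squared differences.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 100]   # made-up sample with one outlier


def abs_loss(c):
    """Sum of absolute differences between the data and a candidate value c."""
    return sum(abs(x - c) for x in data)


def squared_loss(c):
    """Sum of squared differences between the data and a candidate value c."""
    return sum((x - c) ** 2 for x in data)


candidates = range(0, 101)
best_abs = min(candidates, key=abs_loss)
best_sq = min(candidates, key=squared_loss)

print("absolute loss minimizer:", best_abs, "median:", median(data))
print("squared loss minimizer: ", best_sq, "mean:  ", mean(data))
```

The absolute-value minimizer stays with the bulk of the data at the median, while the squared minimizer is dragged towards the outlier at the mean.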

That’s the main intuitive difference between the L1 (Lasso) and L2 (Ridge) regularization technique.   
<span class="mark">Another difference between them is that L1 regularization helps in feature selection by eliminating the features that are not important. This is helpful when the number of features is large.</span>

Regularized methods such as Lasso regression can be used to select only the relevant features in the training dataset, because they drive the coefficients of the remaining features to zero (Ridge regression shrinks coefficients but keeps every feature). Selecting only the relevant features needed for training is a form of dimensionality reduction.

**Advantages of Regularization**

- We can use a regularized model to reduce the dimensionality of the training dataset. Dimensionality reduction is important because of three main reasons:

- Prevents Overfitting: A high-dimensional dataset having too many features can sometimes lead to overfitting (model captures both real and random effects).
- Simplicity: An over-complex model having too many features can be hard to interpret especially when features are correlated with each other.
- Computational Efficiency: A model trained on a lower dimensional dataset is computationally efficient (execution of algorithm requires less computational time).

**Disadvantages of Regularization**

- <span class="mark">Regularization leads to dimensionality reduction,</span> <span class="mark">which means the machine learning model is built using a lower dimensional dataset</span>. This generally leads to a higher bias error.

- The regularization strength has to be chosen before training the model, and it must strike a good balance in the bias-variance tradeoff.


## Advantage and disadvantage of L2(ridge) regularization technique ?
**Advantages**
- <span class="mark">Avoids overfitting a model.</span>
- They do not require unbiased estimators.
- They add just enough bias to make the estimates reasonably reliable approximations to true population values.
- They still perform well on large multivariate data where the number of predictors (p) is larger than the number of observations (n).
- <span class="mark">The ridge estimator is preferably good at improving the least-squares estimate when there is multicollinearity</span>.

**Disadvantages**
- They include all the predictors in the final model.
- <span class="mark">They are unable to perform feature selection.</span>
- <span class="mark">They shrink the coefficients towards zero.</span>
- They trade the variance for bias.
- <span class="mark">They are not robust to outliers.</span>

## Advantage and disadvantage of L1(lasso) regularization technique ?
**Advantage**
- <span class="mark">Avoids overfitting the model.</span>
- <span class="mark">It is robust to outliers.</span>
- <span class="mark">Helps in feature selection.</span>

**disadvantage**
- The disadvantage is that the regularization parameter α (also written λ) has to be tuned properly, often manually, and a poor value for α might make the model performance worse.
- <span class="mark">Shrinks some coefficients all the way to zero.</span>
![Screenshot%202021-06-22%20095733.png](attachment:Screenshot%202021-06-22%20095733.png)

**What is a sparse solution?**
- <span class="mark">Once a variable has a 0 coefficient, it has no impact on the model anymore</span>. ... This is what we mean by a sparse solution - it only uses a few variables in the dataset. Other methods may produce a solution where many variables have small, but non-zero coefficients.

### What is the role of L1 and L2 norm Regularisation of weights in deep neural networks?
### Why do we penalize large weights?

<span class="mark">Large weights in a neural network are a sign of a more complex network that has overfit the training data. Penalizing a network based on the size of the network weights during training can reduce overfitting</span>. An L1 or L2 vector norm penalty can be added to the optimization of the network to encourage smaller weights

### What is L1 and L2 Penalty?

<span class="mark">L1 regularization adds an L1 penalty equal to the absolute value of the magnitude of coefficients. In other words, it limits the size of the coefficients. ... L2 regularization adds an L2 penalty equal to the square of the magnitude of coefficients</span>

### Why is L2 better than L1?

- L1 tends to shrink coefficients (weights) to zero, whereas L2 tends to shrink coefficients evenly. L1 is therefore useful for feature selection, as we can drop any variables associated with coefficients that go to zero. L2, on the other hand, is useful when you have collinear/codependent features.
- L2 regularization tries to reduce the possibility of overfitting by keeping the values of the weights and biases small.
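The different shrinking behaviour follows from the gradients of the two penalty terms. As a sketch, writing the penalties as $\lambda \sum_k |w_k|$ and $\lambda \sum_k w_k^2$ (the symbol $\lambda$ and this scaling are assumptions; other texts include a factor of ½), the derivative with respect to a single weight $w_j$ is:

$$
\frac{\partial}{\partial w_j}\Big(\lambda \sum_k |w_k|\Big) = \lambda\,\operatorname{sign}(w_j) \quad (w_j \neq 0),
\qquad
\frac{\partial}{\partial w_j}\Big(\lambda \sum_k w_k^2\Big) = 2\lambda\, w_j
$$

The L1 gradient has the same size $\lambda$ no matter how small the weight is, so small weights can be pushed all the way to zero. The L2 gradient is proportional to the weight itself, so the push fades as the weight shrinks and the weight gets small without reaching zero.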

# Elastic Net Regression 
So, what if I don’t want to choose? What if I don’t know what I want or need? Elastic Net regression was created in response to a criticism of Lasso regression: while Lasso helps in feature selection, sometimes you don’t want to remove features aggressively. As you may have guessed, Elastic Net is a combination of both Lasso and Ridge regressions. Since we have an idea of how the Ridge and Lasso regressions act, I will not go into details. Please refer to scikit-learn’s documentation.

![Screenshot%202021-06-03%20003105.png](attachment:Screenshot%202021-06-03%20003105.png)

As you can see in the picture above there are now two λ terms. λ₁ is the “alpha” value for the Lasso part of the regression and λ₂ is the “alpha” value for the Ridge part. In scikit-learn’s Elastic Net regression, alpha sets the overall penalty strength and l1_ratio sets the mix between the two penalties. When l1_ratio = 0 the penalty is a pure L2 (Ridge) penalty, and when l1_ratio = 1 it is a pure L1 (Lasso) penalty. Any value between 0 and 1 is a combination of Ridge and Lasso regression.
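A plain-Python sketch of a combined penalty, written in the `alpha` / `l1_ratio` parametrization that scikit-learn documents for its `ElasticNet` estimator (overall strength plus a mixing ratio, with a factor of ½ on the L2 part). The function itself is an illustration, not scikit-learn code, and the example weights are made up.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Mix of an L1 and an L2 penalty, controlled by l1_ratio in [0, 1]."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2


weights = [0.5, -2.0, 3.0]
for ratio in [0.0, 0.5, 1.0]:
    print(ratio, elastic_net_penalty(weights, alpha=0.1, l1_ratio=ratio))
```

At `l1_ratio = 1.0` only the L1 term is left (Lasso-like), at `l1_ratio = 0.0` only the L2 term is left (Ridge-like), and values in between blend the two.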

# ii) Dropout

- <span class="mark">Another frequently used regularization technique is dropout.</span> <span class="mark">It essentially means that during training, randomly selected neurons are turned off or ‘dropped’ out. They are temporarily prevented from influencing or activating downstream neurons in the forward pass, and no weight updates are applied to them in the backward pass.</span>

- <span class="mark">So if neurons are randomly dropped out of the network during training, the other neurons step in and make the predictions for the missing neurons</span>. <span class="mark">This results in independent internal representations being learned by the network,</span> making the network less sensitive to the specific weights of individual neurons. Such a network generalises better and is less likely to overfit.

![Screenshot%202021-06-03%20003655.png](attachment:Screenshot%202021-06-03%20003655.png)

![Screenshot%202021-06-03%20003745.png](attachment:Screenshot%202021-06-03%20003745.png)

So each iteration has a different set of nodes and this results in a different set of outputs. It can also be thought of as an ensemble technique in machine learning.  

Ensemble models usually perform better than a single model because averaging over many different models reduces variance. Similarly, a network trained with dropout often generalises better than the same network trained without it.

The probability with which each node is dropped is the hyperparameter of the dropout function. As seen in the image above, dropout can be applied to both the hidden layers as well as the input layers.
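A minimal sketch of dropout on one layer's activations, in plain Python. It uses the common "inverted dropout" variant, where the surviving activations are scaled up by `1 / (1 - drop_prob)` during training so nothing needs to change at prediction time; that scaling, the function name and the parameters are assumptions of this example and are not described in the notes above.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations during training; pass them through otherwise."""
    if not training or drop_prob == 0.0:
        return list(activations)
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    # Each unit is kept with probability keep_prob and rescaled, else set to 0.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)                    # seeded so runs are repeatable
layer_output = [0.2, 1.5, 0.7, 0.9, 1.1]
print(dropout(layer_output, drop_prob=0.5, rng=rng))             # training pass
print(dropout(layer_output, drop_prob=0.5, rng=rng, training=False))
```

Every training pass draws a fresh random mask, so each iteration effectively trains a different thinned network, which is where the ensemble view comes from.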

![Screenshot%202021-06-03%20003919.png](attachment:Screenshot%202021-06-03%20003919.png)

![Screenshot%202021-06-03%20003954.png](attachment:Screenshot%202021-06-03%20003954.png)

## Advantage and Disadvantage of dropout
**advantage**
- <span class="mark">The main advantage of this method is that it prevents all neurons in a layer from synchronously optimizing their weights.</span> <span class="mark">This adaptation, made in random groups, prevents all the neurons from converging to the same goal, thus decorrelating the weights.</span>

**disadvantage**
- <span class="mark">Right before the last layer. This is generally a bad place to apply dropout, because the network has no ability to "correct" errors induced by dropout before the classification happens.</span>
- <span class="mark">When the network is small relative to the dataset, regularization is usually unnecessary. If the model capacity is already low, lowering it further by adding regularization will hurt performance.</span>
- When training time is limited. If you don't train until convergence, dropout may give worse results. Usually dropout hurts performance at the start of training, but results in the final "converged" error being lower. Therefore, if you don't plan to train until convergence, you may not want to use dropout.

# iii) Data Augmentation

<span class="mark">The simplest way to reduce overfitting is to increase the data, and this technique helps in doing so.</span>   

Data augmentation is a regularization technique that is generally used when the dataset consists of images. It generates additional data artificially from the existing training data by making minor changes such as rotating, flipping, cropping, or blurring a few pixels in the image, and repeating this yields more and more data. Through this regularization technique, the model variance is reduced, which in turn decreases the generalization error.

<span class="burk">from keras.preprocessing.image import ImageDataGenerator  
datagen = ImageDataGenerator(horizontal_flip=True)  
datagen.fit(train)</span>  
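To show what a horizontal flip does to the data itself, here is a library-free sketch where an image is just a list of pixel rows. The helper names `horizontal_flip` and `augment` are made up for this example and are unrelated to the Keras API used above.

```python
def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def augment(images):
    """Return the original images followed by their flipped copies."""
    return images + [horizontal_flip(img) for img in images]


tiny_image = [
    [1, 2, 3],
    [4, 5, 6],
]
dataset = augment([tiny_image])
print(len(dataset))    # the dataset now holds the original plus one flipped copy
print(dataset[1])
```

The label of a flipped image is assumed to stay the same as the original's, which holds for many object-recognition tasks but not, for example, for text or digits where mirroring changes the meaning.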

## advantage and disadvantage of Data Augmentation
**advantage** 
- Improves model prediction accuracy by adding more training data to the model and preventing data scarcity. ...
- <span class="mark">reducing costs of collecting and labeling data.</span>

**disadvantage** 
- <span class="mark">The main limitation of data augmentation arises from the data bias, i.e. the augmented data distribution can be quite different from the original one. This data bias leads to a suboptimal performance of existing data augmentation methods</span>

# iv) Early Stopping

Early stopping is a kind of cross-validation strategy where we keep one part of the training set as the validation set. When we see that the performance on the validation set is getting worse, we immediately stop the training on the model. This is known as early stopping.
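The stopping rule can be sketched as a small function over a list of per-epoch validation losses. The `patience` parameter (how many non-improving epochs to tolerate) is an assumption added for this example; `patience=1` corresponds to stopping as soon as the validation loss gets worse, as described above. The loss values are made up.

```python
def early_stopping(val_losses, patience=1):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation improved: remember this epoch and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never triggered: training ran through every epoch.
    return best_epoch, len(val_losses) - 1


val_losses = [1.0, 0.8, 0.7, 0.75, 0.9, 0.6]
print(early_stopping(val_losses, patience=1))
print(early_stopping(val_losses, patience=3))
```

In practice the weights from `best_epoch` are usually the ones kept. A larger `patience` avoids stopping on a temporary bump in the validation curve, at the cost of training longer.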

![Screenshot%202021-06-03%20005729.png](attachment:Screenshot%202021-06-03%20005729.png)

![Screenshot%202021-06-03%20005927.png](attachment:Screenshot%202021-06-03%20005927.png)

## advantage and disadvantage of early stopping
**advantage**
- Early stopping halts the training of a neural network before it has overfit the training dataset, which reduces overfitting and improves the generalization of deep neural networks.

**disadvantage** 
- A problem with early stopping is that the model does not make use of all available training data. It may be desirable to avoid overfitting and to train on all possible data, especially on problems where the amount of training data is very limited

# Use Weight Regularization to Reduce Overfitting of Deep Learning Models

Neural networks learn a set of weights that best map inputs to outputs. 

A network with large network weights can be a sign of an unstable network where small changes in the input can lead to large changes in the output. This can be a sign that the network has overfit the training dataset and will likely perform poorly when making predictions on new data.  

A solution to this problem is to update the learning algorithm to encourage the network to keep the weights small. This is called weight regularization and it can be used as a general technique to reduce overfitting of the training dataset and improve the generalization of the model. 

<span class="mark">The key points about weight regularization as an approach to reduce overfitting for neural networks are:</span>

- Large weights in a neural network are a sign of a more complex network that has overfit the training data.
- Penalizing a network based on the size of the network weights during training can reduce overfitting.
- An L1 or L2 vector norm penalty can be added to the optimization of the network to encourage smaller weights.

### Problem With Large Weights

When fitting a neural network model, we must learn the weights of the network (i.e. the model parameters) using stochastic gradient descent and the training dataset.  

The longer we train the network, the more specialized the weights will become to the training data, overfitting the training data. The weights will grow in size in order to handle the specifics of the examples seen in the training data.  

Large weights make the network unstable. Although the weights will be specialized to the training dataset, minor variation or statistical noise in the expected inputs will result in large differences in the output.

<span class="mark">Large weights tend to cause sharp transitions in the node functions and thus large changes in output for small changes in the inputs.</span> 
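A single sigmoid neuron is enough to see this. In the sketch below (the weights 1 and 50 and the input step of 0.05 are arbitrary choices for illustration), the same tiny change in the input barely moves the output for a small weight but swings it across most of its range for a large weight.

```python
import math


def neuron(x, w):
    """One sigmoid unit with a single input and weight (no bias)."""
    return 1.0 / (1.0 + math.exp(-w * x))


def output_change(w, x=0.0, dx=0.05):
    """How much the output moves when the input shifts from x - dx to x + dx."""
    return neuron(x + dx, w) - neuron(x - dx, w)


print("small weight (w=1): ", round(output_change(1.0), 3))
print("large weight (w=50):", round(output_change(50.0), 3))
```

With the large weight the sigmoid behaves almost like a step function, which is the sharp transition mentioned above.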

Having small weights or even zero weights for less relevant or irrelevant inputs to the network will allow the model to focus learning. This too will result in a simpler model.

### Encourage Small Weights

Larger weights result in a larger penalty, in the form of a larger loss score. The optimization algorithm will then push the model to have smaller weights, i.e. weights no larger than needed to perform well on the training dataset.  
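How the penalty pushes weights down can be sketched as one gradient-descent step. The example assumes an L2 penalty of the form `lam * w**2` per weight, whose gradient is `2 * lam * w`; the function name, learning rate and numbers are illustrative assumptions.

```python
def sgd_step(weights, data_grads, lr, lam):
    """One gradient step on: data loss + lam * sum(w^2)."""
    # The penalty adds 2 * lam * w to each gradient, pulling w towards zero.
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, data_grads)]


weights = [1.0, -4.0]
no_data_signal = [0.0, 0.0]   # pretend the data loss has no opinion on these weights
print(sgd_step(weights, no_data_signal, lr=0.1, lam=0.5))
```

Even with a zero data gradient, both weights move towards zero, and the larger weight moves further. A weight therefore stays large only if the data loss keeps pushing back hard enough to justify it.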

Smaller weights are considered more regular or less specialized and as such, we refer to this penalty as weight regularization. 

When this approach of penalizing model coefficients is used in other machine learning models such as linear regression or logistic regression, it may be referred to as shrinkage, because the penalty encourages the coefficients to shrink during the optimization process.  

**Shrinkage:** This approach involves fitting a model involving all p predictors. However, the estimated coefficients are shrunken towards zero […] This shrinkage (also known as regularization) has the effect of reducing variance.

The addition of a weight size penalty or weight regularization to a neural network has the effect of reducing generalization error and of allowing the model to pay less attention to less relevant input variables

https://machinelearningmastery.com/weight-regularization-to-reduce-overfitting-of-deep-learning-models/