### Regularization:
Regularization is one way to help a model generalize to unseen data by reducing its reliance on less important features. It aims to lower the validation loss, and so improve accuracy on new data, by adding a penalty on model complexity to the training loss. This penalty counteracts overfitting in high-variance models by shrinking the coefficients toward zero. Common techniques include:
1. L2 Regularization
2. L1 Regularization
3. Elastic Net Regularization
4. Dropout Regularization

1. L2 Regularization:

It counteracts overfitting by adding a penalty proportional to the sum of the squares of the coefficients. Under this penalty, known as ridge regression in the linear case, the coefficients shrink toward zero but generally never reach exactly zero.

$$L = \frac{1}{N} \sum_{i=1}^{N} L_i\big(f(X_i, W), Y_i\big) + \lambda \sum_{j} \theta_j^2, \qquad 0 < \lambda < \infty$$
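
As a rough illustration of the loss above, the following sketch computes a data loss plus the L2 penalty. The choice of squared error for the per-example loss, the linear model, and the names `l2_regularized_loss`, `X`, `y`, `theta`, and `lam` are assumptions made for this example only.

```python
import numpy as np

def l2_regularized_loss(X, y, theta, lam):
    """Mean squared error plus lam * sum(theta_j^2) (hypothetical helper)."""
    residuals = X @ theta - y            # prediction error of a linear model
    data_loss = np.mean(residuals ** 2)  # (1/N) * sum of per-example losses
    penalty = lam * np.sum(theta ** 2)   # lambda * sum of squared coefficients
    return data_loss + penalty

# Tiny made-up dataset: 2 examples, 2 features
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
theta = np.array([0.5, 0.25])

print(l2_regularized_loss(X, y, theta, lam=0.0))  # data loss only
print(l2_regularized_loss(X, y, theta, lam=0.1))  # data loss plus penalty
```

With `lam=0.0` the penalty vanishes and only the data loss remains; increasing `lam` makes large coefficients more expensive.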


2. L1 Regularization:

It counteracts overfitting by adding a penalty proportional to the sum of the absolute values of the coefficients. Lasso regression also shrinks the coefficients, but instead of squaring them it takes their absolute values. This penalty can drive individual coefficients exactly to zero, which removes the corresponding features from the model. So Lasso regression both reduces overfitting and performs feature selection.

$$L = \frac{1}{N} \sum_{i=1}^{N} L_i\big(f(X_i, W), Y_i\big) + \lambda \sum_{j} |\theta_j|, \qquad 0 < \lambda < \infty$$
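
One way to see why an absolute-value penalty can set coefficients exactly to zero, while a squared penalty only shrinks them, is to compare the two shrinkage rules for a single coefficient. The sketch below is an illustration under simplifying assumptions: each coefficient is treated independently, and the function names, the vector `z`, and the value of `lam` are made up for the example.

```python
import numpy as np

def l1_shrink(z, lam):
    """Minimizer of 0.5*(theta - z)**2 + lam*|theta| (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_shrink(z, lam):
    """Minimizer of 0.5*(theta - z)**2 + lam*theta**2 (proportional shrinkage)."""
    return z / (1.0 + 2.0 * lam)

z = np.array([3.0, -0.2, 0.5, -2.0])  # hypothetical unregularized coefficients
lam = 0.5

theta_l1 = l1_shrink(z, lam)
theta_l2 = l2_shrink(z, lam)
print("L1:", theta_l1)  # small coefficients are set exactly to zero
print("L2:", theta_l2)  # every coefficient is scaled down but stays nonzero
```

Coefficients whose magnitude is at most `lam` are zeroed by the L1 rule, which is the feature-selection effect; the L2 rule scales every coefficient by the same factor.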

3. Elastic Net Regularization:

Elastic Net combines L1 and L2 regularization by adding a mixture of both penalty terms to the objective function. It offers a balance between the feature selection capability of L1 regularization and the coefficient shrinkage of L2 regularization.
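
The text gives no formula for Elastic Net, so the sketch below assumes one common parameterization: $\lambda \left( \alpha \sum_j |\theta_j| + (1 - \alpha) \sum_j \theta_j^2 \right)$, where the mixing weight $\alpha \in [0, 1]$ is introduced here for illustration. Libraries may scale the two terms differently.

```python
import numpy as np

def elastic_net_penalty(theta, lam, alpha):
    """lam * (alpha * L1 + (1 - alpha) * L2).

    alpha is a hypothetical mixing weight in [0, 1]:
    alpha = 1 gives a pure L1 penalty, alpha = 0 a pure L2 penalty.
    """
    l1 = np.sum(np.abs(theta))  # sum of absolute values
    l2 = np.sum(theta ** 2)     # sum of squares
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

theta = np.array([1.0, -2.0, 0.0])  # made-up coefficients
print(elastic_net_penalty(theta, lam=0.1, alpha=1.0))  # pure L1 penalty
print(elastic_net_penalty(theta, lam=0.1, alpha=0.0))  # pure L2 penalty
print(elastic_net_penalty(theta, lam=0.1, alpha=0.5))  # equal mix of both
```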



4. Dropout Regularization:

Dropout is a regularization technique commonly used in deep learning models, particularly neural networks. It helps prevent overfitting by randomly dropping out (deactivating) a fraction of neurons or connections during training, which forces the network to learn more robust, generalizable features. During training, each neuron is kept active with probability p (a hyperparameter) and set to zero otherwise. Drop rates of 0.2 to 0.5, that is p between 0.5 and 0.8, are typical.

1. During training:

    1. For each training example, during the forward pass, randomly set a fraction (dropout rate) of the neurons or connections to zero. This means these neurons or connections will be temporarily ignored and have no contribution to the subsequent layers' computations.
    2. Perform the forward pass as usual, computing the output of the model.
    3. During the backward pass, only update the weights of the active neurons or connections (the ones that were not dropped out).

2. During testing or inference:

    1. No dropout is applied. Instead, the full network is used to make predictions.

In [5]:
import numpy as np

# Inverted dropout: recommended implementation example.
# We drop and scale at train time and do nothing extra at test time.

p = 0.5  # probability of keeping a unit active. higher = less dropout

# Weights and biases of a toy 3-layer network with 2 units per layer
W1 = np.random.rand(2, 2)
W2 = np.random.rand(2, 2)
W3 = np.random.rand(2, 2)
b1 = np.random.rand(2)
b2 = np.random.rand(2)
b3 = np.random.rand(2)

x = np.random.rand(2)  # a single random input example

def train_step(X):
    # forward pass for example 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)  # ReLU hidden layer 1
    U1 = (np.random.rand(*H1.shape) < p) / p  # first dropout mask. Notice /p!
    H1 *= U1  # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)  # ReLU hidden layer 2
    U2 = (np.random.rand(*H2.shape) < p) / p  # second dropout mask. Notice /p!
    H2 *= U2  # drop!
    out = np.dot(W3, H2) + b3
    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)
    return out

def predict(X):
    # ensembled forward pass: no dropout mask is applied
    H1 = np.maximum(0, np.dot(W1, X) + b1)  # no scaling necessary
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
    return out

print(train_step(x))

[3.63092928 3.67980665]
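
The division by `p` in the dropout masks is what lets the test-time pass skip any rescaling: it keeps the expected activation the same with and without dropout. The sketch below checks this numerically. The all-ones activation vector `h`, its size, and the fixed seed are assumptions chosen only to make the check easy to read and repeat.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable
p = 0.5                         # keep probability, as in the code above
h = np.ones(100_000)            # hypothetical activations, all equal to 1

mask = (rng.random(h.shape) < p) / p  # inverted-dropout mask: entries are 0 or 1/p
h_dropped = h * mask

print(h.mean())                 # mean activation without dropout
print(h_dropped.mean())         # approximately the same mean, thanks to /p
print((h_dropped == 0).mean())  # fraction of dropped units, approximately 1 - p
```

Without the `/ p` factor, the mean of `h_dropped` would be roughly `p` times the original, and the test-time activations would have to be scaled by `p` to compensate.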
