# Regularization 
This notebook builds a mathematical understanding of explicit and implicit regularization.

## Explicit Regularization 
Explicit regularization adds a term to the training objective that penalizes less-preferred parameter values, with the aim of reducing overfitting to the training data and improving how well a neural network model generalizes. When the term is added to the loss function, the objective becomes:

$$
\hat{\phi}
= \arg\min_{\phi}
\left[
\sum_{i=1}^{I} \ell_i[\mathbf{x}_i, \mathbf{y}_i]
+ \lambda \cdot g[\phi]
\right]$$

Where:
-  $g[\phi]$ is the regularization term, a function that returns a larger value for parameters that are less preferred

-  $\lambda$ is a positive scalar that controls how strongly the regularization term $g[\phi]$ contributes to the objective relative to the original loss

Regularization can also be viewed probabilistically. Multiplying the likelihood by a prior over the parameters turns the maximum likelihood criterion into a maximum a posteriori (MAP) criterion:

$$
\hat{\phi} = \arg\max_{\phi}
\left[
\left(
\prod_{i=1}^{I}
\Pr(\mathbf{y}_i \mid \mathbf{x}_i, \phi)
\right)
\Pr(\phi)
\right].
$$

Where: 
- $\Pr(\phi)$ is the prior, which plays the role of the regularization term: it expresses assumptions about the distribution of the parameters before any data are observed
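
To connect this criterion with the loss-function form, one can take the negative logarithm, which turns the product into a sum and the maximization into a minimization. The following is a sketch of that step. The Gaussian prior with variance $\sigma^2$ used in the example afterwards is an illustrative assumption, not something fixed by the text above.

$$
\begin{aligned}
\hat{\phi}
&= \arg\max_{\phi}
\left[
\Pr(\phi) \prod_{i=1}^{I} \Pr(\mathbf{y}_i \mid \mathbf{x}_i, \phi)
\right] \\
&= \arg\min_{\phi}
\left[
-\sum_{i=1}^{I} \log \Pr(\mathbf{y}_i \mid \mathbf{x}_i, \phi)
\;-\; \log \Pr(\phi)
\right]
\end{aligned}
$$

Comparing this with the regularized loss, the individual losses correspond to $\ell_i = -\log \Pr(\mathbf{y}_i \mid \mathbf{x}_i, \phi)$ and the penalty corresponds to $\lambda \cdot g[\phi] = -\log \Pr(\phi)$. For example, if each parameter is assumed to have an independent zero-mean Gaussian prior with variance $\sigma^2$, then $-\log \Pr(\phi) = \frac{1}{2\sigma^2} \sum_j \phi_j^2 + \text{const}$, which is a sum-of-squares penalty with $\lambda = \frac{1}{2\sigma^2}$.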

The general forms above state that some parameter values are penalized, but they do not specify which ones. L2 regularization makes a concrete choice: it penalizes the sum of the squared parameter values, which discourages large weights. The L2-regularized objective is:

$$
\hat{\phi}
= \arg\min_{\phi}
\left[
\sum_{i=1}^{I} \ell_i[\mathbf{x}_i, \mathbf{y}_i]
\;+\;
\lambda \sum_{j} \phi_j^{2}
\right].
$$

Where: 
- $\;\lambda \sum_{j} \phi_j^{2}$ is the L2 regularization term: $\lambda$ times the squared L2 norm of the parameters, which measures the overall magnitude of the weights. Larger weights therefore increase the loss, so parameter sets with large weights are penalized.
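
As a rough illustration of the L2 objective, the sketch below evaluates it for a linear model with a sum-of-squares data loss and compares the minimizers with and without the penalty. The linear model, the synthetic data, and the helper names `l2_regularized_loss` and `fit_l2` are assumptions made for this example only.

```python
import numpy as np

def l2_regularized_loss(phi, X, y, lam):
    """Sum of squared errors of the linear model X @ phi, plus lam * sum(phi_j^2)."""
    residuals = X @ phi - y
    data_loss = np.sum(residuals ** 2)
    penalty = lam * np.sum(phi ** 2)
    return data_loss + penalty

def fit_l2(X, y, lam):
    """Closed-form minimizer of the loss above: solve (X^T X + lam * I) phi = X^T y."""
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_params), X.T @ y)

# Hypothetical synthetic data: 20 examples, 5 parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.1 * rng.normal(size=20)

phi_unreg = fit_l2(X, y, lam=0.0)   # no penalty
phi_reg = fit_l2(X, y, lam=10.0)    # with L2 penalty

# The penalized solution has a smaller sum of squared parameters
print("sum of squares, lam=0 :", np.sum(phi_unreg ** 2))
print("sum of squares, lam=10:", np.sum(phi_reg ** 2))
```

Increasing `lam` shrinks the fitted parameters toward zero at the cost of a larger data loss, which is the trade-off that $\lambda$ controls in the equation above.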



## Implicit Regularization 
With implicit regularization, gradient descent and stochastic gradient descent themselves favor certain parameter values over others, without a regularization term being added manually.

This preference arises because each update takes a step of finite size, so the parameters do not exactly follow the continuous path of steepest descent on the original loss. Instead, the discrete updates behave approximately as if they were following the continuous path on a modified loss function. For gradient descent, the modified loss function is:

$$
\tilde{L}_{GD}[\phi]
= L[\phi] + \frac{\alpha}{4} \left\lVert \frac{\partial L}{\partial \phi} \right\rVert^2
$$ 

Where: 
- $\alpha$ is the learning rate (step size)

- $\frac{\alpha}{4} \left\lVert \frac{\partial L}{\partial \phi} \right\rVert^2$ is the implicit regularization term

The term $\frac{\alpha}{4} \left\lVert \frac{\partial L}{\partial \phi} \right\rVert^2$ depends on the squared norm of the gradient $\frac{\partial L}{\partial \phi}$, so it is large where the loss surface is steep and small where it is flat. It therefore penalizes parameter values at which the gradient is large, and gradient descent implicitly favors flatter regions of the loss surface. The preference concerns the size of the gradient rather than the size of the parameters themselves, and it becomes stronger as the learning rate $\alpha$ increases.
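
The modified loss can be evaluated directly on a small example. The sketch below uses a hypothetical two-parameter quadratic loss with an analytic gradient, chosen only for illustration, and compares two points that have the same original loss but different gradient norms.

```python
import numpy as np

def loss(phi):
    """Hypothetical toy loss: a quadratic bowl that is steeper along phi[1]."""
    return 0.5 * (phi[0] ** 2 + 4.0 * phi[1] ** 2)

def grad(phi):
    """Analytic gradient of the toy loss."""
    return np.array([phi[0], 4.0 * phi[1]])

def modified_gd_loss(phi, alpha):
    """Original loss plus (alpha / 4) times the squared gradient norm."""
    g = grad(phi)
    return loss(phi) + (alpha / 4.0) * np.dot(g, g)

alpha = 0.1                      # learning rate (assumed value)
phi_a = np.array([2.0, 0.0])     # loss 2, squared gradient norm 4
phi_b = np.array([0.0, 1.0])     # loss 2, squared gradient norm 16

print("original loss:", loss(phi_a), loss(phi_b))
print("modified loss:", modified_gd_loss(phi_a, alpha), modified_gd_loss(phi_b, alpha))
```

Both points have the same original loss, so the difference between their modified losses comes entirely from the gradient-norm term.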

For stochastic gradient descent steps, the modified loss function is represented as:

$$
\tilde{L}_{SGD}[\phi]
=
\tilde{L}_{GD}[\phi]
+
\frac{\alpha}{4B}
\sum_{b=1}^{B}
\left\lVert
\frac{\partial L_b}{\partial \phi}
-
\frac{\partial L}{\partial \phi}
\right\rVert^2
$$

Where: 
- $B$ is the number of batches

- $\frac{\partial L}{\partial \phi}$ is the gradient of the loss function over the entire dataset

- $\frac{\partial L_b}{\partial \phi}$ is the gradient of the loss function over the $b$-th batch

The additional term $\frac{\alpha}{4B}\sum_{b=1}^{B}\left\lVert\frac{\partial L_b}{\partial \phi}-\frac{\partial L}{\partial \phi}\right\rVert^2$ averages the squared difference between each batch gradient $\frac{\partial L_b}{\partial \phi}$ and the full gradient $\frac{\partial L}{\partial \phi}$, so it corresponds to the variance of the batch gradients. Stochastic gradient descent therefore implicitly favors parameter values at which the batch gradients agree closely with one another and with the gradient over the whole dataset.
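
The batch-variance term can also be computed numerically. The sketch below does so for a linear model with a mean-squared-error loss. The model, the synthetic data, the helper names, and the choice of equally sized, non-overlapping batches are all assumptions made for this illustration.

```python
import numpy as np

def mse_grad(phi, X, y):
    """Gradient of the mean squared error of the linear model X @ phi."""
    return 2.0 * X.T @ (X @ phi - y) / len(y)

def sgd_implicit_penalty(phi, X, y, batch_size, alpha):
    """alpha / (4B) times the summed squared distance between each batch
    gradient and the full-dataset gradient. Assumes len(y) is divisible
    by batch_size so that all B batches have equal size."""
    n = len(y)
    full_grad = mse_grad(phi, X, y)
    batches = [slice(start, start + batch_size) for start in range(0, n, batch_size)]
    sq_dists = [
        np.sum((mse_grad(phi, X[b], y[b]) - full_grad) ** 2) for b in batches
    ]
    return alpha / (4.0 * len(batches)) * np.sum(sq_dists)

# Hypothetical synthetic data: 12 examples, 3 parameters
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 3))
phi_true = np.array([1.0, -1.0, 0.5])
y = X @ phi_true + 0.1 * rng.normal(size=12)

phi = np.zeros(3)
alpha = 0.1   # learning rate (assumed value)

penalty_small_batches = sgd_implicit_penalty(phi, X, y, batch_size=4, alpha=alpha)
penalty_full_batch = sgd_implicit_penalty(phi, X, y, batch_size=12, alpha=alpha)
print("penalty with batches of 4:", penalty_small_batches)
print("penalty with one full batch:", penalty_full_batch)
```

When a single batch covers the whole dataset, the batch gradient equals the full gradient and the term vanishes, which recovers the gradient descent case.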