A learned model has to perform well not only on the training data but also on unseen data. Regularization is a prominent way to keep the model from overfitting so that it generalizes well. To put it bluntly, regularization is any additional penalty, constraint, or noise added to the learning procedure to make the model more robust and general. 

Typically, regularization is a penalty added to the parameter space. Adding an extra term to the loss function puts an additional constraint on the optimization objective. Regularization is generally not applied to the bias term.

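$$ \widetilde J(\theta; X,y) = J(\theta; X,y) + \alpha * \Omega(\theta) $$

Here $ \Omega(\theta) $ is the penalty on the parameters and $ \alpha \geq 0 $ sets how strongly it is weighted against the original cost $ J $. In deep learning, it is quite common to use a different $ \alpha $ for each layer.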
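
As a minimal sketch of the penalized objective above, assume a squared $L^2$ penalty and a hypothetical two-layer parameter dictionary. The names `regularized_loss`, `params` and `alphas` are illustrative and not from the book. Each layer's weights get their own $ \alpha $, and the biases are left out of the penalty:

```python
import numpy as np

def regularized_loss(data_loss, params, alphas):
    """Return data_loss plus a per-layer penalty alpha_l * 0.5 * ||W_l||^2.

    params: dict mapping layer name -> {"W": weights, "b": biases}
    alphas: dict mapping layer name -> penalty strength for that layer
    Biases are deliberately left unpenalized.
    """
    penalty = 0.0
    for name, layer in params.items():
        penalty += alphas[name] * 0.5 * np.sum(layer["W"] ** 2)
    return data_loss + penalty

# Hypothetical two-layer model with made-up numbers.
params = {
    "hidden": {"W": np.array([[1.0, -2.0], [0.5, 0.0]]), "b": np.array([10.0, 10.0])},
    "output": {"W": np.array([[3.0], [-1.0]]), "b": np.array([10.0])},
}
alphas = {"hidden": 0.1, "output": 0.01}

total = regularized_loss(data_loss=2.0, params=params, alphas=alphas)
print(total)
```

The large bias values do not change the result, which is the point of excluding them from the penalty.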

### 7.1. $ L^2$ (Ridge) Regularization 

* The cost function for Ridge:

$$ \widetilde J(W; X,y) = J(W; X,y) + 0.5 * \alpha * W^T W $$

$$ \nabla_W  \widetilde J(W; X,y) = \alpha * W + \nabla_W  J(W; X,y) $$

* A single step of weight updates would need

$$ W \leftarrow W - \epsilon  ( \alpha * W + \nabla_W  J(W; X,y) ) $$

$$ W \leftarrow (1- \epsilon \alpha) W - \epsilon \nabla_W  J(W; X,y) $$


* This shrinks the weights by a constant factor $ (1 - \epsilon \alpha) $ at every step, before the usual gradient update. To get an intuition for what this extra term is doing, Ian Goodfellow follows an elaborate path: he considers a quadratic approximation of the loss function, which lets us look at how the eigenvalues of the Hessian matrix affect the parameters W. I think this approximation is a beautiful thing to look at, so we will have to dig a little bit more into the math.
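
Before the math, here is a quick numerical check of the single update step above. It is a minimal NumPy sketch with made-up numbers, and the function names `step_with_penalty` and `shrink_then_step` are illustrative only. It suggests that a gradient step on the regularized cost is the same as first shrinking the weights by $ (1 - \epsilon \alpha) $ and then taking the ordinary gradient step:

```python
import numpy as np

def step_with_penalty(W, grad_J, lr, alpha):
    """Gradient step on the regularized cost: W - lr * (alpha * W + grad_J)."""
    return W - lr * (alpha * W + grad_J)

def shrink_then_step(W, grad_J, lr, alpha):
    """Same update rewritten as weight decay: (1 - lr * alpha) * W - lr * grad_J."""
    return (1.0 - lr * alpha) * W - lr * grad_J

W = np.array([1.0, -2.0, 0.5])
grad_J = np.array([0.2, -0.1, 0.0])   # made-up gradient of the unregularized cost
lr, alpha = 0.1, 0.5                  # epsilon and alpha in the notation above

W_a = step_with_penalty(W, grad_J, lr, alpha)
W_b = shrink_then_step(W, grad_J, lr, alpha)
print(W_a, W_b)
```

With a zero data gradient the step reduces to multiplying every weight by $ 1 - \epsilon \alpha $, which is why this is often called weight decay.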

* **Note-1:** Jacobian and Hessian matrices generalize the derivatives of differential calculus to multivariate functions.

 Jacobian $ \rightarrow $ All the first order partial derivatives of a function whose input and output are both vectors. For a scalar-valued function it is simply the gradient.

 Hessian $ \rightarrow $ All the second order partial derivatives of a scalar-valued function with a vector input, or simply the rate of change (the Jacobian) of the gradient.

* **Note-2:** Quadratic approximation of a function with the Hessian matrix

 This is the second-order Taylor expansion for multivariate functions. Around a point $ x_0 $ the quadratic approximation is

$$ Q_f(x) = f(x_0) + \nabla f(x_0)^T (x-x_0) + 0.5 * (x-x_0)^T H_f(x_0) (x-x_0) $$

 where $H_f(x_0)$ is the Hessian at point $ x_0 $.
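
To see the quadratic approximation at work, here is a small NumPy sketch on a made-up function $ f(x) = e^{x_1} + x_1 x_2^2 $, with the gradient and Hessian written out by hand. The function and the test points are assumptions chosen only for illustration. The approximation should be very accurate close to $ x_0 $ and poor far away from it:

```python
import numpy as np

def f(x):
    return np.exp(x[0]) + x[0] * x[1] ** 2

def grad_f(x):
    # First order partial derivatives of f.
    return np.array([np.exp(x[0]) + x[1] ** 2, 2.0 * x[0] * x[1]])

def hess_f(x):
    # Second order partial derivatives of f (symmetric matrix).
    return np.array([[np.exp(x[0]), 2.0 * x[1]],
                     [2.0 * x[1], 2.0 * x[0]]])

def quad_approx(x, x0):
    """Second-order Taylor expansion of f around x0, evaluated at x."""
    d = x - x0
    return f(x0) + grad_f(x0) @ d + 0.5 * d @ hess_f(x0) @ d

x0 = np.array([0.5, 1.0])
near = x0 + np.array([0.01, -0.02])
far = x0 + np.array([1.0, -2.0])

err_near = abs(f(near) - quad_approx(near, x0))
err_far = abs(f(far) - quad_approx(far, x0))
print(err_near, err_far)
```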

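* Considering the quadratic approximation to the cost function around $ W^* $, the point at which the gradient vanishes (so the first order term drops out), gives

$$ \hat J(W) = J(W^*) + 0.5 * (W-W^*)^T H (W-W^*) $$

 where $ H $ is the Hessian of $ J $ at $ W^* $.

* $ \hat J(W) $ is minimized where its gradient vanishes: $ \nabla_W \hat J(W) = H(W-W^*) = 0 $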

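* When the $ L^2 $ penalty is added to this quadratic approximation, its gradient $ \alpha \widetilde W $ joins the condition above, and the regularized minimum $ \widetilde W $ satisfies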

$$ \alpha \widetilde W + H(\widetilde W - W^*) = 0 $$

$$ \widetilde W = (H+ \alpha I)^{-1} H W^* $$
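
The step between the two equations above, spelled out. This only rearranges the stationarity condition and assumes $ H + \alpha I $ is invertible, which holds when $ H $ is positive semidefinite and $ \alpha > 0 $:

$$ \alpha \widetilde W + H \widetilde W - H W^* = 0 $$

$$ (H + \alpha I) \widetilde W = H W^* $$

Multiplying both sides by $ (H + \alpha I)^{-1} $ gives the expression for $ \widetilde W $.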

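* So regularization moves the minimum from $ W^* $ to a rescaled version $ \widetilde W $.

* When $\alpha \rightarrow 0$, $ \widetilde W $ approaches the unregularized minimum $ W^* $. To see what happens when $\alpha$ increases, use the eigendecomposition of the Hessian, $ H = QDQ^T $ (it exists because $ H $ is real and symmetric), where D is the diagonal matrix of eigenvalues and the columns of Q are the orthonormal eigenvectors,

$$ \widetilde W = (QDQ^T + \alpha I)^{-1} QDQ^T W^* = Q (D + \alpha I)^{-1} D Q^T W^* $$ 

* Raising $ \alpha $ rescales the parameters along the axes defined by the eigenvectors: the component of $ W^* $ along the $i$-th eigenvector, with eigenvalue $ \lambda_i $, is multiplied by $ \frac{\lambda_i}{\lambda_i + \alpha} $.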

* **Keypoint:** Only the directions in which the parameters contribute significantly to reducing the objective function (large eigenvalue of the Hessian compared to $ \alpha $) are preserved. In the other directions the eigenvalue of the Hessian is small compared to $ \alpha $, and the corresponding components of the weights are decayed towards zero. This sort of regularizes the unimportant variables in the data.
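
The rescaling along eigenvector directions can be checked on a toy problem. This is a minimal NumPy sketch: the 2-D Hessian (eigenvalues 10 and 0.1), the minimum `W_star` and the value of `alpha` are all made up for illustration. It compares the direct formula $ (H + \alpha I)^{-1} H W^* $ with scaling each eigen-component of $ W^* $ by $ \lambda_i / (\lambda_i + \alpha) $:

```python
import numpy as np

# Hypothetical 2-D problem: one high-curvature and one low-curvature direction.
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthonormal eigenvectors
eigvals = np.array([10.0, 0.1])                   # eigenvalues of H
H = Q @ np.diag(eigvals) @ Q.T

W_star = np.array([1.0, 1.0])   # unregularized minimum (made up)
alpha = 1.0

# Direct formula: solve (H + alpha I) W_tilde = H W_star.
W_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ W_star)

# Same thing, direction by direction: scale each eigen-component by
# lambda_i / (lambda_i + alpha).
scale = eigvals / (eigvals + alpha)
W_tilde_eig = Q @ (scale * (Q.T @ W_star))

print(scale)
print(W_tilde, W_tilde_eig)
```

The high-curvature direction keeps most of its component, while the low-curvature direction is shrunk to less than a tenth of its original size.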

### Example of Regularization on Linear Regression:

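* Cost function of linear regression without regularization (the factor 0.5 only rescales the cost and keeps the gradient free of a factor of 2).

$$ J(W) = 0.5 * {(XW- Y)}^T(XW- Y) $$

* With regularization 

$$ \widetilde J(W) = 0.5 * {(XW- Y)}^T(XW- Y) + 0.5 * \alpha * W^T W $$

* Setting the gradient $ X^T(XW - Y) + \alpha W $ to zero gives the closed-form solution

$$  W = {(X^TX + \alpha I)}^{-1} X^TY $$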

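* Here $ X^TX $ is proportional to the covariance matrix of the (centered) inputs, and regularization adds $\alpha$ to its diagonal.
* This makes the learning algorithm perceive X as having higher variance, so it shrinks the weights of features whose covariance with the output is low compared to this added variance.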
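
The closed-form solution can be tried on synthetic data. This is a minimal NumPy sketch, not a reference implementation: the data, the seed and the helper name `ridge_closed_form` are made up, and it assumes the squared error carries the same factor 0.5 as the penalty, so that the gradient is $ X^T(XW - Y) + \alpha W $:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # synthetic design matrix
true_W = np.array([2.0, -1.0, 0.0])          # made-up ground truth
Y = X @ true_W + 0.1 * rng.normal(size=50)
alpha = 5.0

def ridge_closed_form(X, Y, alpha):
    """Solve (X^T X + alpha I) W = X^T Y without forming the inverse."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

W_ols = ridge_closed_form(X, Y, alpha=0.0)     # ordinary least squares
W_ridge = ridge_closed_form(X, Y, alpha=alpha)

# Gradient of 0.5 * ||XW - Y||^2 + 0.5 * alpha * ||W||^2 at the ridge solution
# (assumed scaling, see the note above).
grad = X.T @ (X @ W_ridge - Y) + alpha * W_ridge
print(np.linalg.norm(W_ols), np.linalg.norm(W_ridge), np.linalg.norm(grad))
```

The gradient norm should be numerically zero at the ridge solution, and the ridge weights should have a smaller norm than the least-squares ones.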

## $L^1$ Regularization

* The $ L^1 $ penalty is just the sum of absolute values of the parameters.

  $$ \| W \|_1 = \sum_i | W_i | $$

* The derivative (strictly, a subgradient, since $ |W_i| $ is not differentiable at zero) of the loss with respect to the weights becomes 

  $$  \nabla_W \widetilde J(W;X,Y) = \alpha \, \mathrm{sign}(W) + \nabla_W J(W;X,Y) $$
  

* The penalty gradient $ \alpha \, \mathrm{sign}(W) $ has the same magnitude no matter how large a weight is, unlike the $L^2$ case where it scales with $W$. When you use this loss function in the optimization objective, each parameter is therefore either pushed exactly to zero or shifted towards zero by a constant amount. The book goes into a little more depth about this, which I could not properly understand.

* This pushes the optimizer towards sparse solutions, hence $L^1$ is used as a feature selection method.

* **Some trivia:** $L^2$ regularization can be seen as MAP Bayesian inference with a Gaussian prior on the weights, and $L^1$ has a similar interpretation with a Laplace prior.
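
The sparsity effect can be seen in a simplified setting. This is a minimal NumPy sketch, assuming a quadratic cost with a diagonal Hessian so that each weight can be solved for independently. The helper names `l1_solution` and `l2_solution` and all the numbers are made up for illustration. Under that assumption the $L^1$ minimizer is a soft-thresholded version of the unregularized minimum, while the $L^2$ minimizer only rescales it:

```python
import numpy as np

def l1_solution(w_star, h_diag, alpha):
    """Minimizer of sum_i [0.5 * h_i * (w_i - w*_i)^2 + alpha * |w_i|] (soft thresholding)."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

def l2_solution(w_star, h_diag, alpha):
    """Minimizer of sum_i [0.5 * h_i * (w_i - w*_i)^2 + 0.5 * alpha * w_i^2]."""
    return h_diag / (h_diag + alpha) * w_star

w_star = np.array([3.0, -0.2, 0.05, -1.5])   # made-up unregularized minimum
h_diag = np.ones(4)                          # assumed diagonal Hessian entries
alpha = 0.5

w_l1 = l1_solution(w_star, h_diag, alpha)
w_l2 = l2_solution(w_star, h_diag, alpha)
print(w_l1)
print(w_l2)
```

The small weights are set exactly to zero by the $L^1$ penalty and the large ones are shifted by a constant amount, whereas the $L^2$ penalty shrinks every weight but leaves none of them exactly zero.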


## 7.2. Norm Penalties as Constrained Optimization 

I should have read the fourth chapter first; my non-linear reading style meant that I could not understand a thing in this section. To do later!

## 7.3. Regularization and under-constrained problems