## Forms of weight decay

These error functions all penalize "complex" hypotheses by punishing weights $\bar{w}$ with a large norm.

There are three basic strategies:

1. Punish $|\!|\bar{w}|\!|^2$ where $|\!|\bar{w}|\!|$ is the [L2 norm](https://mathworld.wolfram.com/L2-Norm.html)
2. Punish $|\!|\bar{w}|\!|_1$ where $|\!|\bar{w}|\!|_1$ is the [L1 norm](https://mathworld.wolfram.com/L1-Norm.html)
3. Punish a weighted combination of the options from (1) and (2)

Note that

$|\!|\bar{w}|\!|^2 = \sum_{i=0}^d w_i^2$

and

$|\!|\bar{w}|\!|_1 = \sum_{i=0}^d |w_i|$
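As a quick numerical check of the two sums above, here is a minimal sketch using NumPy. The weight vector `w` is a made-up example, not taken from any model in these notes.

```python
import numpy as np

# Hypothetical weight vector (w_0, ..., w_3), chosen only for illustration.
w = np.array([0.5, -2.0, 0.0, 3.0])

# Squared L2 norm: sum of squared weights.
l2_squared = np.sum(w ** 2)    # 0.25 + 4.0 + 0.0 + 9.0 = 13.25

# L1 norm: sum of absolute values of the weights.
l1 = np.sum(np.abs(w))         # 0.5 + 2.0 + 0.0 + 3.0 = 5.5

# The same quantities via np.linalg.norm.
assert np.isclose(l2_squared, np.linalg.norm(w, ord=2) ** 2)
assert np.isclose(l1, np.linalg.norm(w, ord=1))
```

Note that the L2 penalty uses the *squared* norm, so large weights are punished much more heavily than small ones, while the L1 penalty grows linearly in each weight.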

When the underlying model is linear regression, (1) is called Ridge regression, (2) is called LASSO, and (3) is called [Elastic Net](https://en.wikipedia.org/wiki/Elastic_net_regularization).

There is a nice discussion of Ridge and LASSO here: https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b


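The LASSO approach tends to perform "feature selection" because many components of the resulting weight vector are exactly zero, which effectively removes the corresponding features from the model. Ridge, by contrast, shrinks weights toward zero but rarely makes them exactly zero. (Picture from [here](https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b).)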

![img](L1_vs_L2.png)
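One way to see why L1 produces exact zeros is a simplified one-dimensional sketch. Assume (purely for illustration) that the in-sample error for a single weight is $(w - a)^2$, where $a$ is the unregularized optimum. Then the L2-penalized minimizer is $a/(1+\lambda)$, which only rescales $a$, while the L1-penalized minimizer is $\mathrm{sign}(a)\max(|a| - \lambda/2,\, 0)$, which snaps small values to zero. The function names and the values of `a` below are hypothetical.

```python
import numpy as np

def ridge_1d(a, lam):
    """Minimizer of (w - a)**2 + lam * w**2: a pure rescaling of a."""
    return a / (1.0 + lam)

def lasso_1d(a, lam):
    """Minimizer of (w - a)**2 + lam * |w|: soft-thresholding at lam / 2."""
    return np.sign(a) * np.maximum(np.abs(a) - lam / 2.0, 0.0)

# Hypothetical unregularized weights, treated independently per coordinate.
a = np.array([3.0, 0.4, -0.2, -1.5])
lam = 1.0

w_ridge = ridge_1d(a, lam)  # every entry shrinks, none becomes exactly zero
w_lasso = lasso_1d(a, lam)  # entries with |a| <= lam / 2 become exactly zero

print("ridge:", w_ridge)
print("lasso:", w_lasso)
```

In this sketch the two small weights (0.4 and -0.2) are zeroed by the L1 penalty but merely halved by the L2 penalty. Real problems couple the weights through the data, so this is an intuition aid rather than a general solution method.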


Many interesting and informative plots on regularization can be found here: https://github.com/ageron/handson-ml/blob/master/04_training_linear_models.ipynb



### Formulas

#### L2

$E_{aug}(\bar{w}) = E_{in}(\bar{w}) + \lambda\sum_{i=0}^d w_i^2$

#### L1

$E_{aug}(\bar{w}) = E_{in}(\bar{w}) + \lambda\sum_{i=0}^d |w_i|$

#### Elastic Net

$E_{aug}(\bar{w}) = E_{in}(\bar{w}) + \lambda(r\sum_{i=0}^d |w_i| +(1-r)\sum_{i=0}^d w_i^2)$

where $0 \leq r \leq 1$.
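The three formulas can be written as short functions. This is a minimal sketch that assumes $E_{in}$ is mean squared error for a linear model; the function names, data, and the value of `lam` are made up for illustration. As in the formulas above, the penalty includes every weight, $w_0$ among them.

```python
import numpy as np

def e_in(w, X, y):
    """In-sample error; assumed here to be mean squared error of a linear model."""
    return np.mean((X @ w - y) ** 2)

def e_aug_l2(w, X, y, lam):
    """E_in plus lam times the sum of squared weights."""
    return e_in(w, X, y) + lam * np.sum(w ** 2)

def e_aug_l1(w, X, y, lam):
    """E_in plus lam times the sum of absolute weights."""
    return e_in(w, X, y) + lam * np.sum(np.abs(w))

def e_aug_elastic(w, X, y, lam, r):
    """E_in plus lam times a mix of the L1 (weight r) and L2 (weight 1 - r) penalties."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    return e_in(w, X, y) + lam * (r * np.sum(np.abs(w)) + (1.0 - r) * np.sum(w ** 2))

# Hypothetical data: the first column of X is the constant 1 paired with w_0.
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
y = np.array([1.0, 2.0, 4.0])
w = np.array([0.5, -1.0])
lam = 0.1

# Elastic Net reduces to the L1 formula at r = 1 and to the L2 formula at r = 0.
assert np.isclose(e_aug_elastic(w, X, y, lam, r=1.0), e_aug_l1(w, X, y, lam))
assert np.isclose(e_aug_elastic(w, X, y, lam, r=0.0), e_aug_l2(w, X, y, lam))
```

Library implementations may scale these terms differently (for example, by extra constant factors on $E_{in}$ or the L2 term), so a given $\lambda$ here does not necessarily correspond to the same numeric setting elsewhere.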

