# Regularization

**Prerequisites**

- Linear Algebra
- Linear Models
- Overfitting and Model Selection


**Outcomes**

- Understand the concept of regularization
- Be familiar with common forms of regularization

## Review: Overfitting, Model Selection, Validation Procedures

- **Goal** of supervised learning is to maximize *out of sample* performance
- We fit a model (or many models) on training data and examine *metrics* or *losses*
- If we optimize for minimal loss on training data we risk **overfitting**
- When we overfit, the model performs **significantly** better on training data than on out of sample data
- This hurts our goal

### Hold Out Sets

- To minimize risk of overfitting, we hold back some data from training set
- We call this data either **validation** data or **test** data, depending on how it is used
    - Either way it is called **hold-out** data
- After training, we evaluate metrics on hold out data (gives estimate of out of sample performance)
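
The hold-out idea can be sketched in a few lines of NumPy. This is a minimal illustration, not a prescribed procedure: the helper `holdout_split`, the 20% hold-out fraction, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def holdout_split(X, y, holdout_fraction=0.2, seed=0):
    """Shuffle the rows and hold back a fraction of them from training."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    n_holdout = int(round(holdout_fraction * n))
    holdout_idx, train_idx = perm[:n_holdout], perm[n_holdout:]
    return X[train_idx], y[train_idx], X[holdout_idx], y[holdout_idx]

# Synthetic data (an assumption for illustration): y = 2x + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

X_train, y_train, X_hold, y_hold = holdout_split(X, y)

# Fit least squares on the training portion only
theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ theta - y_train) ** 2)
holdout_mse = np.mean((X_hold @ theta - y_hold) ** 2)  # estimate of out of sample MSE
print(train_mse, holdout_mse)
```

The hold-out rows never touch the fitting step, which is what makes `holdout_mse` usable as an estimate of out of sample performance.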

### Model Selection and Validation Procedures

- To increase chances of finding a model that meets our goal, we will train many models
- These models may vary in structure, distributional assumptions, or family 
- We use a **validation procedure** like the hold-out method or k-fold CV to select the champion model from our set of models
- Main intuition behind a validation procedure is:
    1. Accurately estimate out of sample performance for all models
    2. Select the model with the lowest hold-out set metric score as the champion model

### Types of Candidate Pools

-   The validation procedures begin with defining a set of candidate models
-   We call this the **candidate pool**
-   Common types of candidate pools are:
    -   Models of the same structure that consider different sets of features (here validation helps in **feature selection**)
    -   Family of models indexed by one or more hyperparameters (e.g. structure of neural network -- here validation addresses **model regularization**)
    -   A potpourri collection of models not belonging to a common family -- called a **bag of models**, for lack of a better term

## Regularization

-   **Model regularization** is an approach to *avoid overfitting*
    in models that are *over-parameterized*
-   In ML, regularization started as a heuristic approach to avoid overfitting, but later on theory was developed that justified its use
-   We will (mostly) approach regularization as a heuristic, but the theoretical side is fascinating

### Motivation

-   Suppose we have an over-parameterized linear regression model
-   In this case we know that the number of parameters $P$ is greater than or equal to the number of training samples $N$
-   We also know that the model can fit the training data exactly (training MSE will be 0, provided the feature matrix has full row rank), so overfitting is essentially guaranteed
-   How can we prevent overfitting from occurring?
-   **Idea**: Prevent the model parameters from taking values that minimize (zero-out) the training MSE

... But how?
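
Before answering, the overfitting claim can be checked numerically. The sketch below is an illustration under assumed settings ($N = 10$, $P = 30$, Gaussian features, and targets that are pure noise); it fits the minimum-norm least-squares solution via the pseudo-inverse, which is one of many parameter vectors that interpolate the training data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 10, 30  # fewer training samples than parameters

# Targets are pure noise, so there is nothing real to learn
X_train = rng.normal(size=(N, P))
y_train = rng.normal(size=N)
X_new = rng.normal(size=(200, P))
y_new = rng.normal(size=200)

# Minimum-norm solution of the under-determined least-squares problem
theta = np.linalg.pinv(X_train) @ y_train

train_mse = np.mean((X_train @ theta - y_train) ** 2)
new_mse = np.mean((X_new @ theta - y_new) ** 2)
print(f"training MSE:      {train_mse:.2e}")
print(f"out of sample MSE: {new_mse:.2e}")
```

The training MSE is zero up to floating-point error even though the targets are noise, while the MSE on fresh data is not.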

### Penalizing the Loss Function

-   Instead of minimizing the training MSE, minimize the sum of the training MSE plus a penalty (a.k.a. regularization) term

$$\min_{\theta \in \mathbb{R}^P} \bigg[\frac{1}{N} \sum_{n=1}^N \ell(y_n, f(x_n | \theta) ) + {\color{green} \mu} {\color{red}R(\theta)} \bigg], \quad \mu \ge 0$$

-   Notice the non-negative <span style="color: green">regularization parameter $\mu$</span>
-   Also the non-negative  <span style="color: red">regularization term $R(\theta)$</span>
    -   $R$ is increasing in the norm of $\theta$, i.e. $\frac{\partial R}{\partial \| \theta \|} > 0$
    -   Typically, $R( \cdot )$ is a function of one of $\theta$'s norms, i.e. it is really $R(\| \theta \|)$. We will make this assumption
    -   Typically $R(0)=0$

-   We will refer to the whole expression as **regularized (average)
    loss**
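
As a concrete reading of the formula, here is a minimal sketch of the regularized loss for a linear model with squared-error loss. The function names (`regularized_loss`, `l2_squared`, `l1`) and the toy data are hypothetical choices for this example; the formula itself allows any loss $\ell$ and model $f$.

```python
import numpy as np

def regularized_loss(theta, X, y, mu, R):
    """Average squared-error loss of the linear model x @ theta, plus mu * R(theta)."""
    if mu < 0:
        raise ValueError("mu must be non-negative")
    residuals = X @ theta - y
    return float(np.mean(residuals ** 2) + mu * R(theta))

def l2_squared(theta):
    """Squared Euclidean norm of theta."""
    return float(theta @ theta)

def l1(theta):
    """Sum of absolute values of theta."""
    return float(np.sum(np.abs(theta)))

# Toy data that theta = (1, 2) fits exactly, so the training MSE is 0
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])

print(regularized_loss(theta, X, y, mu=0.0, R=l2_squared))  # MSE only
print(regularized_loss(theta, X, y, mu=0.1, R=l2_squared))  # MSE + 0.1 * (1 + 4)
print(regularized_loss(theta, X, y, mu=0.1, R=l1))          # MSE + 0.1 * (1 + 2)
```

Even a parameter vector with zero training error pays a price under the penalty, and that price grows with its norm.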

### Summary

- We can minimize the penalized loss
$$\min_{\theta \in \mathbb{R}^P} \bigg[\frac{1}{N} \sum_{n=1}^N \ell(y_n, f(x_n | \theta)) + \mu R(\theta) \bigg], \quad \mu \ge 0$$
- This is one of the classic ways to apply regularization
- This type of regularization is called **Tikhonov regularization**

### Effect of Regularization Parameter ($\mu$)

- A quick definition:
$$\theta_{\mu}^* \triangleq \argmin_{\theta \in \mathbb{R}^P} \bigg[\frac{1}{N} \sum_{n=1}^N \ell(y_n, f(x_n | \theta)) + \mu R(\theta)\bigg], \quad \mu \ge 0$$
- Notice that, if $\mu=0$, we get the weight vector that minimizes the non-regularized training MSE, i.e. $\theta_0^* = \theta^*$

### Increasing $\mu$

-   If we scale a function by multiplying it by a positive constant, its minimizer(s) remain the same
-   We can divide the regularized loss by $\mu > 0$ to obtain:
$$\begin{aligned}
\argmin_{\theta \in \mathbb{R}^P} \bigg[ \frac{1}{N \mu} \sum_{n=1}^N \ell(y_n, f(x_n|\theta)) + R\big( \| \theta \| \big ) \bigg] = \theta_{\mu}^*
\end{aligned}$$
- Now consider increasing $\mu$ towards $\infty$:
$$\begin{aligned}
\theta_{\infty}^* & \triangleq \lim_{\mu \to +\infty} \argmin_{\theta \in \mathbb{R}^P} \bigg[ \frac{1}{N \mu} \sum_{n=1}^N \ell(y_n, f(x_n|\theta)) + R\big( \| \theta \| \big ) \bigg]\\
& = \arg \min_{\theta \in \mathbb{R}^P} R\big( \| \theta \| \big )\\
&= 0
\end{aligned}$$
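
The shrinkage towards zero can be observed numerically. The sketch below assumes squared-error loss, a linear model, and $R(\theta) = \|\theta\|_2^2$, for which setting the gradient of the regularized loss to zero gives $(X^T X + N \mu I)\,\theta = X^T y$. The data and the grid of $\mu$ values are made up for illustration.

```python
import numpy as np

def ridge_solution(X, y, mu):
    """Minimizer of (1/N) * ||X @ theta - y||^2 + mu * ||theta||_2^2."""
    N, P = X.shape
    return np.linalg.solve(X.T @ X + N * mu * np.eye(P), X.T @ y)

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
theta_true = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=50)

mus = [1e-3, 1e-1, 1e1, 1e3, 1e5]
norms = [float(np.linalg.norm(ridge_solution(X, y, mu))) for mu in mus]
for mu, norm in zip(mus, norms):
    print(f"mu = {mu:g}: ||theta_mu*|| = {norm:.6f}")
```

The norm of the solution decreases as $\mu$ grows and becomes negligible for very large $\mu$, in line with the limit derived above.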

### $\mu$ vs $\theta_{\mu}^*$


![Regularization Parameter](https://css-materials.s3.amazonaws.com/ML/regularization/regularization_parameter.png)

### Question -- what is $\mu$?

**What is $\mu$?**

1.  A model parameter
2.  A hyperparameter
3.  Other

### How to choose the best $\mu$?
-   By a validation procedure, of course!
    -   Q: What is the optimal value of $\mu$ that minimizes the
        regularized loss, and depends only on the training data?
    -   A: trick question -- we can't determine it from the training data!
-   $\mu$ is a hyperparameter
-   It is as if for each $\mu$ we have a different model, for which the optimal weights are $\theta_{\mu}^*$
-   Hence, $\mu$ indexes a family of models
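
A minimal sketch of choosing $\mu$ with the hold-out method follows. It assumes the squared $L_2$ penalty with its closed-form solution for linear regression, synthetic data, and an arbitrary grid `candidate_mus`; none of these choices come from the notes above.

```python
import numpy as np

def ridge_solution(X, y, mu):
    """Minimizer of (1/N) * ||X @ theta - y||^2 + mu * ||theta||_2^2."""
    N, P = X.shape
    return np.linalg.solve(X.T @ X + N * mu * np.eye(P), X.T @ y)

def mse(X, y, theta):
    return float(np.mean((X @ theta - y) ** 2))

# Synthetic data (assumed for illustration): few samples, many features
rng = np.random.default_rng(0)
N_train, N_val, P = 30, 200, 25
theta_true = np.zeros(P)
theta_true[:3] = [2.0, -1.0, 0.5]
X_train = rng.normal(size=(N_train, P))
y_train = X_train @ theta_true + rng.normal(size=N_train)
X_val = rng.normal(size=(N_val, P))
y_val = X_val @ theta_true + rng.normal(size=N_val)

candidate_mus = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]  # each mu is one candidate model
train_scores, val_scores = {}, {}
for mu in candidate_mus:
    theta_mu = ridge_solution(X_train, y_train, mu)  # fit on training data only
    train_scores[mu] = mse(X_train, y_train, theta_mu)
    val_scores[mu] = mse(X_val, y_val, theta_mu)     # score on held-out data

best_by_train = min(train_scores, key=train_scores.get)  # favors the smallest mu here
best_mu = min(val_scores, key=val_scores.get)            # champion under validation
print("best mu by training MSE:  ", best_by_train)
print("best mu by validation MSE:", best_mu)
```

Training MSE can only get worse as $\mu$ increases, so the training data alone always points to the least regularized candidate. The validation score is what distinguishes the candidates.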

### An Equivalent View

- Under some relatively mild regularity conditions, one can show that $\forall \mu \ge 0, \exists r \ge 0$ such that the two problems share the same minimizer:
$$\begin{aligned}
&\argmin_{\theta \in \mathbb{R}^P} \bigg[\frac{1}{N} \sum_{n=1}^N \ell(y_n, f(x_n | \theta)) + \mu R(\theta)\bigg] \\
= &\argmin_{\theta \in \mathbb{R}^P: \| \theta \| \leq r} \frac{1}{N} \sum_{n=1}^N \ell(y_n, f(x_n | \theta))
\end{aligned}$$
- The latter expression is called **Ivanov regularization**
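
A brief sketch of why the penalized solution also solves a constrained problem, assuming only that $R$ is non-decreasing in $\| \theta \|$. Write $L(\theta)$ for the average training loss (shorthand introduced here) and choose $r = \| \theta_{\mu}^* \|$. Then for any $\theta$ with $\| \theta \| \le r$:

$$\begin{aligned}
L(\theta_{\mu}^*) + \mu R(r) &\le L(\theta) + \mu R\big( \| \theta \| \big) && \text{(optimality of } \theta_{\mu}^* \text{ for the penalized problem)} \\
&\le L(\theta) + \mu R(r) && \text{(} R \text{ is non-decreasing and } \| \theta \| \le r \text{)}
\end{aligned}$$

Subtracting $\mu R(r)$ from both ends gives $L(\theta_{\mu}^*) \le L(\theta)$ for every feasible $\theta$, so $\theta_{\mu}^*$ minimizes the training loss over the ball of radius $r$. The converse direction (finding a $\mu$ for a given $r$) is where regularity conditions such as convexity typically come in.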

### Ivanov vs. Tikhonov

![Ivanov Vs Tikhonov](https://css-materials.s3.amazonaws.com/ML/regularization/ivanov_vs_tikhonov.png)

## Common Choices of Regularizers $R(\cdot)$

- Squared Euclidean ($L_2$) norm: $$R(\|\theta\|) = \|\theta\|_2^2 = \theta^T\theta$$
    - Differentiable (has gradient vector everywhere) 😊
    - Mathematically convenient (for linear regression we know the regularized solution in closed form) 😊
    - Does not promote parameter sparsity (will explain) 🙁

- $L_1$ norm: $$R(\|\theta\|) = \|\theta\|_1 = \sum_{p=1}^P | \theta_p |$$
    - Non-differentiable (no gradient at points where some $\theta_p = 0$) $\Longrightarrow$ no closed-form solution in general $\Longrightarrow$ necessitates more sophisticated algorithms to work with 🙁
    - Promotes **sparsity** 😊
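
The contrast between the two penalties can be seen on a small synthetic problem. This is a sketch under stated assumptions: linear regression with squared-error loss, the closed form $(X^T X + N \mu I)\,\theta = X^T y$ for the $L_2$ case, and proximal gradient descent with soft-thresholding (often called ISTA) for the $L_1$ case. ISTA is just one possible algorithm, chosen here because it is short; the data, $\mu = 0.5$, and the iteration count are arbitrary.

```python
import numpy as np

def ridge_solution(X, y, mu):
    """Closed-form minimizer of (1/N) * ||X @ theta - y||^2 + mu * ||theta||_2^2."""
    N, P = X.shape
    return np.linalg.solve(X.T @ X + N * mu * np.eye(P), X.T @ y)

def soft_threshold(v, t):
    """Shrink each entry of v towards zero by t; entries within t of zero become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, mu, n_iter=5000):
    """Proximal gradient descent for (1/N) * ||X @ theta - y||^2 + mu * ||theta||_1."""
    N, P = X.shape
    lipschitz = (2.0 / N) * np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the MSE gradient
    step = 1.0 / lipschitz
    theta = np.zeros(P)
    for _ in range(n_iter):
        grad = (2.0 / N) * X.T @ (X @ theta - y)
        theta = soft_threshold(theta - step * grad, step * mu)
    return theta

# Synthetic data (assumed): only the first 3 of 10 features matter
rng = np.random.default_rng(0)
N, P = 100, 10
theta_true = np.zeros(P)
theta_true[:3] = [3.0, -2.0, 1.5]
X = rng.normal(size=(N, P))
y = X @ theta_true + 0.1 * rng.normal(size=N)

mu = 0.5
theta_l2 = ridge_solution(X, y, mu)
theta_l1 = lasso_ista(X, y, mu)
print("exact zeros under L2:", int(np.sum(theta_l2 == 0.0)))
print("exact zeros under L1:", int(np.sum(theta_l1 == 0.0)))
```

The $L_2$ penalty shrinks all coefficients but leaves them non-zero, while the soft-thresholding step of the $L_1$ solver sets small coefficients to exactly zero.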

### Parameter Sparsity


![l1 vs l2 regularization](https://css-materials.s3.amazonaws.com/ML/regularization/l1_vs_l2.png)

### Other Forms of Regularization

-   Some more exotic approaches (won't show why they work, though)
    -   **Data augmentation**: Create new artificial training samples by randomly picking among the original ones and adding some noise to them (commonly used for image data)
    -   **Drop-out**: Randomly set some unit activations to zero on each iteration during training (dropout is pretty much standard practice in modern deep learning)
-   Both of the above can be shown, under certain assumptions, to be equivalent to special types of regularization
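
Both ideas can be sketched in NumPy. The helpers `augment_with_noise` and `dropout` are hypothetical names for this example, Gaussian noise is an assumed choice of perturbation, and the dropout shown is the common "inverted" variant that zeroes unit activations and rescales the survivors so their expected value is unchanged.

```python
import numpy as np

def augment_with_noise(X, y, n_new, noise_std, rng):
    """Sample rows of X with replacement, perturb them with Gaussian noise, and append them."""
    idx = rng.integers(0, X.shape[0], size=n_new)
    X_new = X[idx] + noise_std * rng.normal(size=(n_new, X.shape[1]))
    return np.vstack([X, X_new]), np.concatenate([y, y[idx]])  # labels are copied unchanged

def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each entry with probability drop_prob and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask / (1.0 - drop_prob)

rng = np.random.default_rng(0)

# Data augmentation: 20 original samples plus 40 noisy copies
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
X_aug, y_aug = augment_with_noise(X, y, n_new=40, noise_std=0.05, rng=rng)
print(X_aug.shape, y_aug.shape)

# Dropout applied to a vector of activations (training time only)
activations = np.ones(1000)
dropped = dropout(activations, drop_prob=0.5, rng=rng)
print("fraction zeroed:", float(np.mean(dropped == 0.0)))
print("mean activation after dropout:", float(dropped.mean()))
```

At prediction time dropout is switched off; the rescaling during training is what keeps the two regimes comparable.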