# Overfitting

- A model that underfits the data is said to have a high bias. 
- A model that overfits the data is said to have a high variance.

Overfitting can be reduced in several ways:
- Collecting more training data
- Using only a subset of the features
- Regularization: shrinking the values of the parameters instead of removing features outright

## Regularization

Regularization typically involves adding an extra term, called a regularizer $R(\theta)$, to the cost function $J(\theta)$ such that

$$J_{\lambda}(\theta)=J(\theta)+\lambda R(\theta), \quad \lambda \geq 0$$

$J_{\lambda}(\theta)$ is often called the regularized loss and $\lambda$ is called the regularization parameter. The regularizer is usually chosen to be a measure of the complexity of the model. Minimizing the regularized loss balances finding a model that fits the data well (small $J(\theta)$) against keeping the model complexity low (small $R(\theta)$). The balance is controlled by $\lambda$: a larger $\lambda$ penalizes complexity more heavily, and $\lambda = 0$ recovers the unregularized loss.
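As a minimal sketch of the definition above, the regularized loss can be written as a function that combines any loss with any regularizer. The toy data, the helper names (`squared_error`, `sum_of_squares`), and the choice of a sum-of-squares complexity measure are assumptions made purely for illustration.

```python
def regularized_loss(loss, regularizer, theta, lam):
    """Return J_lambda(theta) = J(theta) + lam * R(theta), with lam >= 0."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return loss(theta) + lam * regularizer(theta)


# Hypothetical toy problem: fit y = theta[0] * x to three points.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]


def squared_error(theta):
    # J(theta): halved mean squared error of the one-parameter linear model.
    return sum((theta[0] * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))


def sum_of_squares(theta):
    # R(theta): an illustrative complexity measure (sum of squared parameters).
    return sum(t * t for t in theta)


perfect_fit = [2.0]
# With lam = 0 only the data-fit term counts; with lam > 0 the size of
# the parameters is penalized even though the fit itself is exact.
print(regularized_loss(squared_error, sum_of_squares, perfect_fit, lam=0.0))
print(regularized_loss(squared_error, sum_of_squares, perfect_fit, lam=0.1))
```

Passing the loss and the regularizer as callables keeps the sketch independent of any particular model or penalty.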

A commonly used regularizer is $R(\theta)= \frac{1}{2}||\theta||_{2}^{2}$, which is called $\ell^{2}$ regularization (some formulations also divide by the number of training examples $m$). The notation $||x||_{p}$, also known as the $\ell^{p}$ norm, means

$$||x||_{p} = \left(\sum_{i=1}^{n} |x_i|^{p}\right)^{\frac{1}{p}} \rightarrow ||x||_{p}^{p} = \sum_{i=1}^{n} |x_i|^{p} $$
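The norm definition above translates almost directly into code. The following is a small sketch in plain Python; the function names `lp_norm` and `l2_regularizer` are illustrative choices, not part of any library.

```python
def lp_norm(x, p):
    """Return (sum_i |x_i|^p)^(1/p) for a sequence x and p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def l2_regularizer(theta):
    # R(theta) = 1/2 * ||theta||_2^2; the square cancels the outer root,
    # so this is just half the sum of squared parameters.
    return 0.5 * sum(t * t for t in theta)


vector = [3, -4]
print(lp_norm(vector, 1))      # sum of absolute values
print(lp_norm(vector, 2))      # Euclidean length
print(l2_regularizer(vector))  # half the squared Euclidean length
```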

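In deep learning, $\ell^{2}$ regularization is often referred to as weight decay. With $R(\theta)= \frac{1}{2}||\theta||_{2}^{2}$ the gradient of the regularizer is simply $\theta$, so a gradient descent step with learning rate $\alpha$ becomes: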

$$\theta := \theta - \alpha \nabla J_{\lambda}(\theta) = \theta - \alpha \lambda \theta - \alpha \nabla J(\theta) = (1-\alpha \lambda) \theta - \alpha \nabla J(\theta)$$

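The factor $(1-\alpha \lambda)$ acts as a decay term on the weights: provided $0 < \alpha \lambda < 1$, every update shrinks the weights toward zero in addition to the usual gradient step.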
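The equivalence in the update rule can be checked numerically. The sketch below compares a gradient step on the regularized loss with the weight-decay form; the parameter values and the gradient vector `grad_J` are made-up numbers for illustration, and both function names are hypothetical.

```python
def regularized_gradient_step(theta, grad_J, alpha, lam):
    # Step along the gradient of J_lambda, which is grad J + lam * theta
    # when R(theta) = 1/2 * ||theta||_2^2.
    return [t - alpha * (g + lam * t) for t, g in zip(theta, grad_J)]


def weight_decay_step(theta, grad_J, alpha, lam):
    # Shrink each weight by (1 - alpha * lam), then apply the plain
    # gradient step on the unregularized loss J.
    return [(1 - alpha * lam) * t - alpha * g for t, g in zip(theta, grad_J)]


theta = [0.5, -1.5, 2.0]    # illustrative parameter values
grad_J = [0.1, -0.2, 0.3]   # illustrative gradient of J at theta
alpha, lam = 0.1, 0.01

step_a = regularized_gradient_step(theta, grad_J, alpha, lam)
step_b = weight_decay_step(theta, grad_J, alpha, lam)

# The two forms agree up to floating-point rounding.
print(all(abs(a - b) < 1e-12 for a, b in zip(step_a, step_b)))
```

Note that this equivalence is specific to plain gradient descent with the $\ell^{2}$ regularizer; it is only a check of the algebra above, not a full training loop.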