# Lecture06 Regularization

## Overfitting

**Overfitting**: If there are too many features, the learned hypothesis may fit the training set very well (cost $J(\theta) \approx 0$) but fail to generalize to new examples.

There are two methods for addressing overfitting:
1. Reduce the number of features:
    * manually select which features to keep.
    * use a model selection algorithm.
2. Regularization
    * keep all the features, but reduce the magnitude of parameters $\theta_j$.
    * regularization works well when we have a lot of slightly useful features.


## Regularization

Regularization modifies the cost function by adding a single summation that penalizes the magnitude of every parameter except the bias term $\theta_0$:

$$J(\theta) = \frac{1}{2m}\sum^m_{i=1}(h_{\theta}(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum^{n}_{j=1}\theta_j^2$$

where $\lambda$ is called the **regularization parameter**. It controls how heavily large parameter values are penalized in the cost.

Minimizing this cost function, with its extra penalty term, smooths the output of the hypothesis function and so reduces overfitting. The choice of $\lambda$ matters:
* if $\lambda$ is too large, it may smooth out the function too much and cause underfitting;
* if $\lambda$ is too small, the penalty has little effect and the overfitting remains.
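The regularized cost can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `regularized_cost` and the toy data are made up for the example, `X` is assumed to carry a leading column of ones for $\theta_0$, and the penalty is assumed to be scaled by $\frac{\lambda}{2m}$, which is the scaling implied by the gradient descent update in the next section.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty on theta_1..theta_n.

    Assumes X has shape (m, n + 1) with a leading column of ones.
    """
    m = y.size
    residuals = X @ theta - y
    fit_term = (residuals @ residuals) / (2 * m)
    # theta[0] is the bias term and is left out of the penalty.
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return fit_term + penalty

# Toy data (hypothetical): y = x, which theta = [0, 1] fits exactly.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])

cost_no_penalty = regularized_cost(theta, X, y, lam=0.0)
cost_with_penalty = regularized_cost(theta, X, y, lam=3.0)
print(cost_no_penalty, cost_with_penalty)
```

With a perfect fit, the only cost left is the penalty, so raising `lam` makes the same `theta` more expensive and pushes the minimizer toward smaller parameters.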

## Regularized Linear Regression

**Gradient Descent**:

Repeat {

$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum^m_{i=1}(h_{\theta}(x^{(i)}) - y^{(i)})x_0^{(i)}$

$\theta_j := \theta_j - \alpha \left [ \left ( \frac{1}{m}\sum^m_{i=1}(h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j  \right ]$    $\space\space\space j \in \{1, \dots, n\}$

}
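The update rule above can be sketched as a vectorized loop. This is an illustrative sketch under stated assumptions: the function name `gradient_descent_regularized`, the default step size, the iteration count, and the toy data are all choices made for the example, and `X` is assumed to include a leading column of ones so that `theta[0]` plays the role of $\theta_0$.

```python
import numpy as np

def gradient_descent_regularized(X, y, lam, alpha=0.1, num_iters=5000):
    """Batch gradient descent for regularized linear regression.

    Assumes X has shape (m, n + 1) with a leading column of ones.
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        # Unregularized gradient, computed for all j at once.
        grad = X.T @ (X @ theta - y) / m
        # Penalty gradient (lambda / m) * theta_j, skipped for theta_0.
        reg = (lam / m) * theta
        reg[0] = 0.0
        theta = theta - alpha * (grad + reg)
    return theta

# Toy data (hypothetical): y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta_unreg = gradient_descent_regularized(X, y, lam=0.0)
theta_reg = gradient_descent_regularized(X, y, lam=10.0)
print(theta_unreg, theta_reg)
```

On this data the unregularized run recovers the slope of the generating line, while the regularized run settles on a noticeably smaller slope. Equivalently, the update for $j \geq 1$ can be read as $\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum^m_{i=1}(h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}$, so each step first shrinks $\theta_j$ slightly and then applies the usual gradient step.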

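**Normal Equation**: $\theta = (X^TX + \lambda \cdot L)^{-1}X^Ty$ where $L = \begin{bmatrix} 0 &  &  &  & \\  &  1 &  &  & \\  &  &  1 &  & \\  &  &  & \ddots & \\  &  &  &  & 1 \end{bmatrix}$ is the $(n+1) \times (n+1)$ identity matrix with its top-left entry set to $0$, so that $\theta_0$ is not penalized.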

Recall that if $m < n$, then $X^TX$ is non-invertible. Adding $\lambda L$ with $\lambda > 0$ makes $X^TX + \lambda L$ invertible.
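A short NumPy sketch of the closed-form solution, including a case with fewer examples than features. The function name `normal_equation_regularized` and the toy data are hypothetical, and `X` is assumed to include a leading column of ones. The sketch solves the linear system $(X^TX + \lambda L)\theta = X^Ty$ with `np.linalg.solve` instead of forming an explicit inverse, which is a common numerical choice and not something these notes prescribe.

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """Closed-form solution of regularized linear regression.

    Assumes X has shape (m, n + 1) with a leading column of ones.
    """
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # do not penalize the bias term theta_0
    # Solve (X^T X + lam * L) theta = X^T y.
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# Toy data (hypothetical): m = 2 examples, n = 3 features plus the bias column.
X = np.array([[1.0, 1.0, 2.0, 3.0],
              [1.0, 2.0, 0.0, 1.0]])
y = np.array([1.0, 2.0])

# X^T X is 4 x 4 but its rank is at most m = 2, so it is singular.
rank_unregularized = np.linalg.matrix_rank(X.T @ X)

# With lam > 0 the system has a unique solution.
theta = normal_equation_regularized(X, y, lam=1.0)
print(rank_unregularized, theta)
```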

## Regularized Logistic Regression

The cost function:

$$J(\theta) = \frac{1}{m}\sum^m_{i=1} \left [  -y^{(i)}\log(h_{\theta}(x^{(i)})) - (1-y^{(i)})\log(1-h_{\theta}(x^{(i)})) \right ] + \frac{\lambda}{2m}\sum^n_{j=1}\theta_j^2$$

The second sum, $\sum^n_{j=1}\theta_j^2$, starts at $j = 1$, so it explicitly excludes the bias term $\theta_0$.
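The regularized logistic cost can be sketched in NumPy as below. The names `sigmoid` and `logistic_cost_and_grad` and the toy data are hypothetical, and `X` is assumed to include a leading column of ones. The gradient is not written out in these notes; the sketch uses the standard derivative of this cost, which has the same form as the linear regression update with $h_{\theta}(x) = \frac{1}{1 + e^{-\theta^Tx}}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_and_grad(theta, X, y, lam):
    """Regularized logistic regression cost and its gradient.

    Assumes X has shape (m, n + 1) with a leading column of ones
    and y contains labels in {0, 1}.
    """
    m = y.size
    h = sigmoid(X @ theta)
    # Cross-entropy term, averaged over the m examples.
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    # L2 penalty on theta_1..theta_n only; theta_0 is excluded.
    cost += lam / (2 * m) * np.sum(theta[1:] ** 2)
    grad = X.T @ (h - y) / m
    grad[1:] += (lam / m) * theta[1:]
    return cost, grad

# Toy data (hypothetical).
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

# At theta = 0 every prediction is 0.5 and the penalty vanishes,
# so the cost equals log(2) regardless of lam.
cost0, grad0 = logistic_cost_and_grad(np.zeros(2), X, y, lam=1.0)
print(cost0, grad0)
```

Returning the cost and gradient together is a convenience for passing the function to a generic optimizer; a plain gradient descent loop would use only `grad`.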