## Overfitting
**Underfit**: the curve/hypothesis function cannot fit the training set well
- The algorithm has **high bias** (a strong preconception about the shape of the data)

**Overfit**: the learned hypothesis may fit the training set well, but fails to generalize to new examples
- The learned hypothesis has **high variance**

### Solutions
- Reduce number of features
    - Manually select or use selection algorithm
    - At the cost of throwing away some useful information
- Regularization: Keep all the features, but **reduce magnitude** of parameters $\theta_j$
    - Works well when we have lots of features, each of which contributes slightly (not largely) to predicting $y$
    
### Regularization
Use a **penalty (cost)** to keep parameters small: add a penalty term to the cost function
- Penalize a few selected parameters $\theta_j$ or all of them
    - $J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})^2 + \lambda \sum_{j=1}^n \theta^2_j\right]$, which we minimize over $\theta$
    - By convention, $\theta_0$ is not penalized (the penalty sum starts at $j=1$)
- The regularization parameter $\lambda$ controls the tradeoff between underfitting and overfitting
    - An extremely large $\lambda$ forces $\theta_1, ..., \theta_n$ toward 0, leaving $h_\theta(x) \approx \theta_0$, which results in underfitting
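
The regularized cost can be sketched in NumPy as follows. This is a minimal illustration, assuming the penalty is scaled by $\frac{1}{2m}$ (which matches the $\frac{\lambda}{m}\theta_j$ term in the gradient update), that the first column of `X` is the bias feature $x_0 = 1$, and using a hypothetical function name and toy data.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Squared-error cost with an L2 penalty on theta[1:] (theta[0] is not penalized)."""
    m = len(y)
    residuals = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)
    return (residuals @ residuals + penalty) / (2 * m)

# Hypothetical toy data: the first column of X is the bias feature x_0 = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.0, 1.0])  # fits the toy data exactly

cost_no_penalty = regularized_cost(theta, X, y, lam=0.0)    # data term only
cost_with_penalty = regularized_cost(theta, X, y, lam=3.0)  # adds 3 * 1^2 / (2 * 3)
```

With a perfect fit the data term is zero, so any remaining cost comes purely from the penalty on $\theta_1$.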
    
#### Regularized Linear Regression
##### Regularize the gradient descent
Repeat{

$\theta_0 := \theta_0 - \alpha*\frac{1}{m}\sum_{i=1}^m[(h_{\theta}(x^{(i)})-y^{(i)})*x_0^{(i)}]$
    
$\theta_j := \theta_j - \alpha*\{\frac{1}{m}\sum_{i=1}^m[(h_{\theta}(x^{(i)})-y^{(i)})*x_j^{(i)}]+\frac{\lambda}{m}\theta_j\} = \theta_j(1-\alpha*\frac{\lambda}{m}) - \alpha*\frac{1}{m}\sum_{i=1}^m[(h_{\theta}(x^{(i)})-y^{(i)})*x_j^{(i)}], \quad j \in \{1, 2, ..., n\}$

}

Observation: In addition to the regular reduction through gradient descent, each update also **shrinks** $\theta_j$ by the factor $(1-\alpha*\frac{\lambda}{m})$, which is slightly less than 1 when $\alpha*\frac{\lambda}{m}$ is small
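
A small NumPy sketch of one regularized update, written two ways: with the penalty added to the gradient, and with $\theta_j$ first multiplied by $1-\alpha\frac{\lambda}{m}$. The function names and toy data are hypothetical, and the first column of `X` is assumed to be $x_0 = 1$.

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One update with the penalty term (lam / m) * theta_j added to the gradient."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    reg = (lam / m) * theta
    reg[0] = 0.0  # theta_0 is not penalized
    return theta - alpha * (grad + reg)

def gradient_step_shrink(theta, X, y, alpha, lam):
    """The same update, written as 'shrink theta_j, then take the usual step'."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    shrink = np.full_like(theta, 1.0 - alpha * lam / m)
    shrink[0] = 1.0  # theta_0 is not shrunk
    return theta * shrink - alpha * grad

# Hypothetical toy data for comparing the two forms.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
theta = np.array([0.5, -1.0])

step_a = gradient_step(theta, X, y, alpha=0.1, lam=2.0)
step_b = gradient_step_shrink(theta, X, y, alpha=0.1, lam=2.0)
step_unreg = gradient_step(theta, X, y, alpha=0.1, lam=0.0)
```

Both forms should give the same result, and the $\theta_0$ component should match the unregularized step.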

##### Regularize the Normal Equation
![W3-REG-LP-NE](Plots/W3-REG-LP-NE.png)

Observation: $L$ is the $(n+1) \times (n+1)$ identity matrix with its top-left entry set to 0, so that $\theta_0$ is not penalized
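
A minimal sketch of the regularized normal equation, assuming the figure shows the usual closed form $\theta = (X^TX + \lambda L)^{-1}X^Ty$ and that the first column of `X` is $x_0 = 1$. The function name and toy data are hypothetical; `np.linalg.solve` is used instead of forming the inverse explicitly.

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """Solve (X^T X + lam * L) theta = X^T y, where L is the identity with L[0, 0] = 0."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0  # do not penalize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

# Hypothetical toy data lying exactly on y = x.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

theta_plain = regularized_normal_equation(X, y, lam=0.0)  # ordinary least squares
theta_reg = regularized_normal_equation(X, y, lam=1.0)    # slope is pulled toward 0
```

On this toy data the unregularized solution recovers intercept 0 and slope 1, while $\lambda = 1$ gives a smaller slope.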

#### Regularized Logistic Regression
$J(\theta)=-\frac{1}{m}\sum^m_{i=1}[y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))]+\frac{\lambda}{2m} \sum_{j=1}^n \theta^2_j$

##### Regularize the gradient descent
Repeat{

$\theta_0 := \theta_0 - \alpha*\frac{1}{m}\sum_{i=1}^m[(h_{\theta}(x^{(i)})-y^{(i)})*x_0^{(i)}]$
    
$\theta_j := \theta_j - \alpha*\{\frac{1}{m}\sum_{i=1}^m[(h_{\theta}(x^{(i)})-y^{(i)})*x_j^{(i)}]+\frac{\lambda}{m}\theta_j\}, \quad j \in \{1, 2, ..., n\}$

}
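
The update has the same form as for linear regression, but $h_\theta(x)$ is now the sigmoid of $\theta^Tx$. Below is a minimal NumPy sketch of the cost and one update step, assuming the penalty is scaled by $\frac{\lambda}{2m}$ (consistent with the $\frac{\lambda}{m}\theta_j$ term in the update) and that the first column of `X` is $x_0 = 1$. The function names and toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y, lam):
    """Cross-entropy cost plus (lam / 2m) * sum(theta_j^2) for j >= 1."""
    m = len(y)
    h = sigmoid(X @ theta)
    data_term = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    penalty = lam / (2 * m) * np.sum(theta[1:] ** 2)
    return data_term + penalty

def logistic_gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient descent update; theta_0 is not penalized."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    grad[1:] += (lam / m) * theta[1:]
    return theta - alpha * grad

# Hypothetical toy data: one feature plus the bias column.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta_init = np.zeros(2)
cost_init = logistic_cost(theta_init, X, y, lam=1.0)  # log(2) at theta = 0
theta_next = logistic_gradient_step(theta_init, X, y, alpha=0.1, lam=1.0)
cost_next = logistic_cost(theta_next, X, y, lam=1.0)
```

Starting from $\theta = 0$, a single step with a small $\alpha$ should lower the cost on this toy data.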