
# 1.3.4 The problem of overfitting

## 1.3.4.1 The problem of overfitting

### Regression Example

<img src='https://drive.google.com/uc?export=view&id=1g73vbNNJVusL9Rht4lG91wKFPVOcXS1a'>

- **high bias** refers to a strong preconception or bias that the housing prices are going to be a completely linear function
- **high variance** refers to the fact that a slightly different training set would likely give rise to a very different model
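As a rough illustration of these two failure modes, the sketch below fits polynomials of degree 1, 2, and 5 to a small made-up housing data set. The numbers and the helper names `train_mse` and `predict_at` are hypothetical and not taken from the course. It records the training error of each fit, and how far a prediction at a new size moves when a single training price changes by 10.

```python
import numpy as np

# Hypothetical toy data: size (1000 sq ft) and price (1000s of dollars)
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
y = np.array([200.0, 300.0, 370.0, 410.0, 430.0, 470.0])

def train_mse(x, y, degree):
    """Mean squared error on the training set for a polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

def predict_at(x, y, degree, x_new):
    """Fit a polynomial of the given degree and predict at x_new."""
    return float(np.polyval(np.polyfit(x, y, degree), x_new))

# Training error drops as the model gets more flexible
mse_by_degree = {d: train_mse(x, y, d) for d in (1, 2, 5)}

# Slightly different training set: raise the last price by 10
y_shifted = y.copy()
y_shifted[-1] += 10.0

# How much does the prediction at a new size (3.75) move?
shift_by_degree = {
    d: abs(predict_at(x, y_shifted, d, 3.75) - predict_at(x, y, d, 3.75))
    for d in (1, 5)
}

print(mse_by_degree)
print(shift_by_degree)
```

With six points, the degree 5 polynomial passes through every training example, yet its prediction reacts much more strongly to the small change in the data than the straight line does. That sensitivity is what "high variance" refers to.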

### Logistic Regression Example

<img src='https://drive.google.com/uc?export=view&id=1pptyAySCxsikh1qlU6ZZ_n8mW1NmZjtj'>



## 1.3.4.2 Addressing overfitting

### Ways to address overfitting

1. Collect more training examples
    - this should be tried first, when possible
2. Select features
    - remove polynomial features
    - just use a subset of the available features
        - keeping the most relevant features
        - this is called feature selection
        - use your intuition
    - **disadvantage** to this is that you are losing information
    - in the next course we learn more about how to choose the best features
3. Regularization
    - reduce size of parameters
    - setting a parameter to 0 is equivalent to eliminating a feature
    - regularization, which encourages the algorithm to shrink the values of the parameters, is a gentler alternative to eliminating a feature
    - by convention, regularization is applied only to the $w_j$ parameters; regularizing $b$ usually makes little difference

<img src='https://drive.google.com/uc?export=view&id=1aSoXIGqF6rGwwrTxYjZpD80vh_r_NxIY'>

## 1.3.4.3 Lab - Overfitting

https://colab.research.google.com/drive/1WuDJ8wVA3InRJf76sQQmnOqjT0MCsPY4

## 1.3.4.4 Cost function with regularization

<img src='https://drive.google.com/uc?export=view&id=1_jCe8DF0G749UyuuZyQacSEn0NtBl2RO'>

- 1000 is chosen here just because it is a big number
- with this modified cost function, the model is penalized if $w_3$ and $w_4$ are large
    - because the only way to minimize this function is to make $w_3$ and $w_4$ small
    - when this function is minimized you will end up with $w_3$ and $w_4$ close to 0
    - the result is a fit to the data that is much closer to the quadratic function, including a small contribution from features $x^3$ and $x^4$
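The effect of the two penalty terms can be sketched numerically. The example below assumes the modified cost is $\frac{1}{2m}\sum_i (f(x^{(i)}) - y^{(i)})^2 + 1000\,w_3^2 + 1000\,w_4^2$ with a degree 4 polynomial model. The data are made up, and the cost is minimized by solving the linear system obtained from setting its gradient to zero, which is a shortcut chosen for this illustration rather than the method used in the course.

```python
import numpy as np

# Hypothetical toy data (size and price in arbitrary scaled units)
x = np.linspace(0.5, 2.5, 8)
y = np.array([1.6, 2.6, 2.9, 3.6, 3.7, 4.3, 4.2, 4.6])
m = x.shape[0]

# Design matrix with columns x, x^2, x^3, x^4, 1 (the last column carries b)
A = np.column_stack([x, x**2, x**3, x**4, np.ones(m)])

def fit(penalty):
    """Minimize (1/2m)*||A @ theta - y||^2 + penalty*(w3^2 + w4^2).

    Setting the gradient to zero gives the linear system
    (A.T @ A / m + 2*P) @ theta = A.T @ y / m,
    where P is diagonal and only penalizes w3 and w4.
    """
    P = np.diag([0.0, 0.0, penalty, penalty, 0.0])
    return np.linalg.solve(A.T @ A / m + 2.0 * P, A.T @ y / m)

theta_plain = fit(0.0)         # ordinary degree 4 fit
theta_penalized = fit(1000.0)  # w3 and w4 heavily penalized

print("w3, w4 without penalty:", theta_plain[2:4])
print("w3, w4 with penalty:   ", theta_penalized[2:4])
```

With the penalty in place, $w_3$ and $w_4$ come out close to 0, so the fitted curve is dominated by the linear and quadratic terms.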

 <img src='https://drive.google.com/uc?export=view&id=1mk7E3PWazGDlGnhbIirPFXAVPDkwtAe9'>

- having small values for the parameters: $w_1, w_2, \cdots, w_n, b$ is a bit like having a simpler model
- in the last slide we regularized (penalized) only $w_3$ and $w_4$; in practice you may not know which features matter most, so you won't know which parameters to penalize, and you'd typically regularize all the $w_j$ parameters
- it is possible to show that this results in fitting a smoother, simpler, less wiggly function that is less prone to overfitting
- $\lambda$ is called the **regularization parameter**
    - $\lambda \gt 0$
- like the first part of the cost function, the regularization part is scaled by $\frac{1}{2m}$
    - by matching the scaling it becomes easier to choose a good value for $\lambda$
    - makes it more likely that the chosen value for $\lambda$ will continue to work even after adding additional training examples
- by convention, $b$ is not regularized


 <img src='https://drive.google.com/uc?export=view&id=1df43iYp_6NvatUqfe-IR0CeUNQrxiBzp'>

- the choice of $\lambda$ is how you balance between the two goals
    1. fitting the data
    2. keeping $w_j$ small
- when $\lambda = 0$ you are not using the regularization term at all
- when $\lambda = 10^{10}$ you are placing a very heavy weight on the regularization term, and in that case the only way to minimize the cost function is to have $w_j$ values very close to 0
    - this results in a horizontal line: $f_{\vec{w},b}(\vec{x}) = b$
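The two extremes of $\lambda$ can be checked on a small example. This is a minimal sketch on made-up data with features $x$ and $x^2$. It assumes the regularized squared-error cost with $b$ left unregularized and, for brevity, solves for the minimizer directly instead of running gradient descent. The function name `fit_regularized` is hypothetical.

```python
import numpy as np

# Hypothetical toy data
x = np.linspace(0.5, 2.5, 8)
y = np.array([1.6, 2.6, 2.9, 3.6, 3.7, 4.3, 4.2, 4.6])

# Columns x, x^2, 1 (the last column carries b)
A = np.column_stack([x, x**2, np.ones_like(x)])

def fit_regularized(lambda_):
    """Minimize (1/2m)*||A @ theta - y||^2 + (lambda/2m)*sum(w_j^2).

    Setting the gradient to zero gives (A.T @ A + lambda*D) @ theta = A.T @ y,
    where D has ones for the w_j entries and zero for b (b is not regularized).
    """
    D = np.diag([1.0, 1.0, 0.0])
    theta = np.linalg.solve(A.T @ A + lambda_ * D, A.T @ y)
    return theta[:2], theta[2]  # (w, b)

w_none, b_none = fit_regularized(0.0)   # no regularization
w_huge, b_huge = fit_regularized(1e10)  # very heavy regularization

print("lambda = 0:    w =", w_none, " b =", b_none)
print("lambda = 1e10: w =", w_huge, " b =", b_huge)
print("mean of y:", y.mean())
```

With $\lambda = 10^{10}$ the weights are driven to nearly 0 and $b$ ends up at about the mean of the targets, which is the horizontal line described above.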

## 1.3.4.5 Regularized linear regression

### Cost Function for Regularized Linear Regression

$$
J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 1}^{m} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2  + \frac{\lambda}{2m}  \sum_{j=1}^{n} w_j^2
$$

where:

$$
f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b
$$
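The cost above translates fairly directly into NumPy. The following is a minimal vectorized sketch; the function name `compute_cost_linear_reg` and the tiny data set are chosen here for illustration and are not necessarily what the course lab uses.

```python
import numpy as np

def compute_cost_linear_reg(X, y, w, b, lambda_=1.0):
    """Regularized squared-error cost for linear regression.

    X: (m, n) feature matrix, y: (m,) targets,
    w: (n,) weights, b: scalar bias, lambda_: regularization parameter.
    """
    m = X.shape[0]
    errors = X @ w + b - y                           # f(x^(i)) - y^(i) for every i
    fit_cost = np.sum(errors ** 2) / (2 * m)         # first term
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)  # second term; b is excluded
    return fit_cost + reg_cost

# Tiny made-up example
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
b = 1.0

cost = compute_cost_linear_reg(X, y, w, b, lambda_=1.0)
print(cost)  # 0.625 from the fit term plus 0.125 from the regularization term
```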

### Gradient Descent

The goal is to find parameters $w$ and $b$ to minimize the cost function.

$$
\text{minimize}_{w,b}\;J(w,b)
$$


repeat {
$$
\begin{align}
w_j &= w_j - \alpha \cdot \frac{\partial}{\partial w_j} J(\vec{w},b) \\
b &= b - \alpha \cdot \frac{\partial}{\partial b} J(\vec{w},b)
\end{align}
$$
}

### The Gradients

It is not necessary to regularize $b$.

$$
\begin{align*}
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  &= \frac{1}{m} \sum\limits_{i = 1}^{m} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}  +  \frac{\lambda}{m} w_j \\
\frac{\partial J(\mathbf{w},b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 1}^{m} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})
\end{align*}
$$
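These gradients plug into the gradient descent loop from the previous subsection. Below is a minimal sketch under the same model; the function names, learning rate, iteration count, and data are illustrative assumptions rather than the course lab's code.

```python
import numpy as np

def compute_gradient_linear_reg(X, y, w, b, lambda_=1.0):
    """Gradients of the regularized linear regression cost."""
    m = X.shape[0]
    errors = X @ w + b - y
    dj_dw = X.T @ errors / m + (lambda_ / m) * w  # includes the regularization term
    dj_db = np.sum(errors) / m                    # b is not regularized
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, lambda_, num_iters):
    """Repeat the simultaneous update of w and b num_iters times."""
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient_linear_reg(X, y, w, b, lambda_)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Tiny made-up example
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w0 = np.array([0.5, -0.5])
b0 = 1.0

dj_dw, dj_db = compute_gradient_linear_reg(X, y, w0, b0, lambda_=1.0)
print(dj_dw, dj_db)  # gradient at the starting point

w_fit, b_fit = gradient_descent(X, y, w0, b0, alpha=0.01, lambda_=1.0, num_iters=1000)
print(w_fit, b_fit)
```

Both gradients are computed from the current parameters before either parameter is changed, which is what makes the update simultaneous.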

### Another way to consider the simultaneous updates of $w_j$ in gradient descent

$$
\begin{align}
w_j &= w_j - \alpha\left(\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}\left(\vec{x}^{(i)}\right) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}w_j\right) \\
&= 1w_j - \alpha\frac{\lambda}{m}w_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}\left(\vec{x}^{(i)}\right) - y^{(i)}\right)x_j^{(i)} \\
&= w_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}\left(\vec{x}^{(i)}\right) - y^{(i)}\right)x_j^{(i)}
\end{align}
$$

- the second term is the term we've seen before
- the first term used to be $w_j$ but is now the following:

$$
w_j\left(1 - \alpha\frac{\lambda}{m}\right)
$$

- $\alpha$ is a very small positive number, say 0.01
- $\lambda$ is usually a small positive number, say 1 or 10
- $m$ is a positive number, say 50

```python
alpha = 0.01
lmbda = 1
m = 50

1 - alpha * (lmbda/m)
```

```
0.9998
```

- essentially, on every iteration we are multiplying $w_j$ by 0.9998, a number slightly less than 1
- this makes $w_j$ smaller
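To confirm that the rearranged update is the same update, the sketch below computes one gradient descent step for $w$ both ways on a tiny made-up data set. The variable names and values are illustrative assumptions.

```python
import numpy as np

alpha = 0.01
lambda_ = 1.0

# Tiny made-up example
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
b = 1.0
m = X.shape[0]

errors = X @ w + b - y
data_grad = X.T @ errors / m  # the usual, unregularized gradient term

# Form 1: subtract alpha times the full regularized gradient
w_direct = w - alpha * (data_grad + (lambda_ / m) * w)

# Form 2: shrink w first, then apply the usual update
shrink_factor = 1 - alpha * lambda_ / m
w_shrink = w * shrink_factor - alpha * data_grad

print(shrink_factor)
print(w_direct, w_shrink)
```

The two results agree up to floating-point rounding, and the shrink factor lies strictly between 0 and 1 for these values.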

### The Derivative Term

$$
\begin{align}
\frac{\partial}{\partial w_j}J\left(\vec{w},b\right) &= \frac{\partial}{\partial w_j}\left[\frac{1}{2m} \sum_{i = 1}^{m} \left(f_{\vec{w},b}\left(\vec{x}^{(i)}\right) - y^{(i)}\right)^2  + \frac{\lambda}{2m}  \sum_{j=1}^{n} w_j^2\right] \\
&= \frac{\partial}{\partial w_j}\left[\frac{1}{2m} \sum_{i = 1}^{m} \left(\vec{w} \cdot \vec{x}^{(i)} + b - y^{(i)}\right)^2  + \frac{\lambda}{2m}  \sum_{j=1}^{n} w_j^2\right] \\
&= \frac{1}{2m} \sum_{i=1}^{m}\left[\left(\vec{w}\cdot\vec{x}^{(i)} + b - y^{(i)}\right)2x_j^{(i)}\right] + \frac{\lambda}{2m}2 w_j \\
&= \frac{1}{m}\sum_{i=1}^{m}\left[\left(\vec{w}\cdot\vec{x}^{(i)} + b - y^{(i)}\right)x_j^{(i)}\right] + \frac{\lambda}{m}w_j \\
&= \frac{1}{m}\sum_{i=1}^{m}\left[\left(f_{\vec{w},b}\left(\vec{x}^{(i)}\right) - y^{(i)}\right)x_j^{(i)}\right] + \frac{\lambda}{m}w_j
\end{align}
$$
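One way to sanity-check a derivation like this is to compare the analytic gradient with a finite-difference approximation of the cost. The sketch below does that on a tiny made-up data set; the helper names and the step size `eps` are assumptions made for this illustration.

```python
import numpy as np

def cost(X, y, w, b, lambda_):
    """Regularized squared-error cost J(w, b)."""
    m = X.shape[0]
    errors = X @ w + b - y
    return np.sum(errors ** 2) / (2 * m) + (lambda_ / (2 * m)) * np.sum(w ** 2)

def analytic_dj_dw(X, y, w, b, lambda_):
    """Gradient with respect to w, as derived above."""
    m = X.shape[0]
    return X.T @ (X @ w + b - y) / m + (lambda_ / m) * w

def numeric_dj_dw(X, y, w, b, lambda_, eps=1e-6):
    """Central-difference approximation of dJ/dw_j for each j."""
    grad = np.zeros_like(w)
    for j in range(w.shape[0]):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (cost(X, y, w + step, b, lambda_)
                   - cost(X, y, w - step, b, lambda_)) / (2 * eps)
    return grad

# Tiny made-up example
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
b = 1.0

grad_analytic = analytic_dj_dw(X, y, w, b, lambda_=1.0)
grad_numeric = numeric_dj_dw(X, y, w, b, lambda_=1.0)
print(grad_analytic, grad_numeric)
```

If the derivation is right, the two gradients should agree to several decimal places.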

## 1.3.4.6 Regularized logistic regression

<img src='https://drive.google.com/uc?export=view&id=1J2fFjqSF79ZIagEWRTHHUxqRxeh0qxTF'>

- In general, when training a logistic regression model with a lot of features (polynomial or not), there is a high risk of overfitting

<img src='https://drive.google.com/uc?export=view&id=1sCTDSGzSuR6aAZVwt3Q7rC8x6Kwek2hF'>

- the gradient for regularized logistic regression looks exactly the same as that for regularized linear regression
    - the only difference is that $f$ is no longer a linear function
    - it is the logistic (sigmoid) function applied to $z = \vec{w} \cdot \vec{x} + b$
- as with regularized linear regression, the convention is to only regularize the $w_j$ parameters, not $b$.
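The points above can be sketched in code. The example assumes the usual regularized logistic cost, that is, the average log loss plus $\frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$, since the formula itself appears only in the slide image. The function names and data are illustrative and not taken from the course lab.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost_logistic_reg(X, y, w, b, lambda_=1.0):
    """Average log loss plus the regularization term (b is excluded)."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)
    return np.sum(loss) / m + (lambda_ / (2 * m)) * np.sum(w ** 2)

def compute_gradient_logistic_reg(X, y, w, b, lambda_=1.0):
    """Same form as the linear case, with f being the sigmoid of z."""
    m = X.shape[0]
    errors = sigmoid(X @ w + b) - y
    dj_dw = X.T @ errors / m + (lambda_ / m) * w
    dj_db = np.sum(errors) / m
    return dj_dw, dj_db

# Tiny made-up example, evaluated at w = 0, b = 0
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([0.0, 1.0])
w = np.zeros(2)
b = 0.0

cost = compute_cost_logistic_reg(X, y, w, b, lambda_=1.0)
dj_dw, dj_db = compute_gradient_logistic_reg(X, y, w, b, lambda_=1.0)
print(cost, dj_dw, dj_db)
```

At $w = 0$ and $b = 0$ every prediction is 0.5, so the cost is $\ln 2$ and the regularization term contributes nothing.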

## 1.3.4.7 Lab - Regularization

https://colab.research.google.com/drive/1NGGCi42u6B0uyIj7mrcSg8ygIWpuD2rB