# Worrying About Overfitting

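A big concern when training is making sure we don't overfit our model: it should learn patterns that generalize to new data, not memorize the training set.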

## Use Train-Validation-Test

- Think of **training** as the material you study for a test
- Think of **validation** as taking a practice test
- Think of **testing** as the real exam, which you use to judge the final model

> ***holdout*** is when your test dataset is never used for training (unlike in cross-validation)
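As a rough illustration of the three-way split, here is a minimal NumPy sketch. The helper name `train_val_test_split` and the 60/20/20 proportions are assumptions made for this example, not something prescribed by these notes.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut the data into three disjoint pieces."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random ordering of the row indices
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]                # held out, never used for training
    val_idx = idx[n_test:n_test + n_val]   # the "practice test"
    train_idx = idx[n_test + n_val:]       # the "study material"
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

X = np.arange(100).reshape(50, 2)   # 50 toy samples with 2 features each
y = np.arange(50)                   # toy labels
train, val, test = train_val_test_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))
```

The test piece is set aside once and only looked at when judging the final model.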

## Model Complexity Graph

- Underfitting
    + low complexity --> high bias, low variance
    + training error: large
    + testing error: large
- Overfitting
    + high complexity --> low bias, high variance
    + training error: low
    + testing error: large
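One way to see both regimes is to fit polynomials of increasing degree to noisy data. The sketch below is only an illustration: the sine-shaped toy data, the noise level, and the chosen degrees are all assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples from a smooth curve (toy data for illustration)."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.2, n)
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # larger held-out set

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):            # low, medium, high complexity
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train={results[degree][0]:.3f} "
          f"test={results[degree][1]:.3f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones. The test error is what reveals whether the extra complexity helped; the exact numbers depend on the seed.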

In [None]:
import numpy as np
import matplotlib.pyplot as plt

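# Illustrative error values recorded at a few points during training
test_error = np.array([5, 3.5, 2, 3, 4])
train_error = np.array([4.5, 3, 1.5, 1, 0.5])
n_epochs = np.array([5, 50, 100, 200, 300])

# Training error keeps falling, while test error bottoms out and then rises
plt.scatter(n_epochs, train_error, label='train error')
plt.scatter(n_epochs, test_error, label='test error')
plt.legend()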
plt.xlabel('Number of Epochs')
plt.ylabel('Error')
plt.show()

### Early Stopping 

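We can stop training early, at the point where the held-out error stops dropping and starts to rise (around 100 epochs in the plot above). Strictly speaking, this error should be monitored on the validation set, so that the test set stays untouched until the final evaluation.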
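A minimal sketch of the stopping rule, reusing the illustrative error values from the plot above as if they were validation errors. The function name `early_stopping_index` and the `patience` parameter are assumptions of this example.

```python
def early_stopping_index(val_errors, patience=1):
    """Return the index of the best checkpoint seen before stopping.

    Training "stops" once the error has failed to improve for
    `patience` consecutive checks.
    """
    best_idx = 0
    best_err = float("inf")
    waited = 0
    for i, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_idx, waited = err, i, 0   # new best: reset the counter
        else:
            waited += 1                              # no improvement this check
            if waited >= patience:
                break
    return best_idx

val_error = [5, 3.5, 2, 3, 4]
n_epochs = [5, 50, 100, 200, 300]
best = early_stopping_index(val_error)
print(n_epochs[best])   # 100
```

A larger `patience` tolerates a few noisy checks before giving up, at the cost of training a little longer.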

# When a Good Model Goes Bad

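When a model has large weights, the model is "too confident": large weights push its outputs toward extreme values.

We need to punish large (confident) weights by adding a penalty for them to the error function. The factor $\lambda$ controls how strong that penalty is.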

## L1 Regularization - Absolute Value

- Tends to produce sparse weight vectors (small weights go to 0)
- Effectively reduces the number of weights in use
- Good for feature selection: the weights that stay nonzero point to the important features

$$ J(W,b) = \dfrac{1}{m} \sum^m_{i=1}\mathcal{L}(\hat y_i, y_i) + \dfrac{\lambda}{m}\sum_{j}|w_j| $$
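As a small numeric sketch of an L1-penalized cost: the mean per-sample loss plus $\lambda/m$ times the sum of absolute weights. The function name `l1_regularized_cost` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def l1_regularized_cost(per_sample_losses, weights, lam):
    """Mean data loss plus (lam / m) * sum of |w_j|."""
    m = len(per_sample_losses)
    data_term = np.mean(per_sample_losses)           # average loss over samples
    penalty = (lam / m) * np.sum(np.abs(weights))    # L1 penalty over weights
    return float(data_term + penalty)

losses = np.array([0.2, 0.4, 0.6, 0.8])     # toy per-sample losses
weights = np.array([1.0, -2.0, 0.0, 3.0])   # toy weight vector

print(l1_regularized_cost(losses, weights, lam=0.0))   # data term only
print(l1_regularized_cost(losses, weights, lam=2.0))   # data term + penalty
```

Note that the penalty sums over the weights, not over the samples, and that a weight of exactly 0 contributes nothing to it.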


## L2 Regularization - Squared Value

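- Does not produce sparse vectors (weights stay homogeneous & small)
- Often gives better results when training models
    + subtle; consider vectors: [1,0] & [0.5, 0.5]
    + recall we want the smallest value for our penalty
    + L1 scores both the same: 1 + 0 = 0.5 + 0.5 = 1
    + L2 prefers [0.5,0.5] over [1,0], since 0.5² + 0.5² = 0.5 is less than 1² + 0² = 1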
    
$$ J(W,b) = \dfrac{1}{m} \sum^m_{i=1}\mathcal{L}(\hat y_i, y_i) + \dfrac{\lambda}{m}\sum_{j}w_j^2 $$
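The [1,0] versus [0.5, 0.5] comparison above can be checked directly. This is a minimal sketch; the helper names `l1_penalty` and `l2_penalty` are made up for this example, and the $\lambda/m$ factor is left out since it scales both vectors equally.

```python
import numpy as np

def l1_penalty(w):
    """Sum of absolute values of the weights."""
    return float(np.sum(np.abs(w)))

def l2_penalty(w):
    """Sum of squared weights."""
    return float(np.sum(np.square(w)))

concentrated = np.array([1.0, 0.0])   # all weight on one entry
spread_out = np.array([0.5, 0.5])     # weight shared evenly

for name, w in (("[1, 0]", concentrated), ("[0.5, 0.5]", spread_out)):
    print(name, "L1:", l1_penalty(w), "L2:", l2_penalty(w))
```

L1 gives 1.0 for both vectors, so it has no preference between them. L2 gives 1.0 for [1, 0] but 0.5 for [0.5, 0.5], so minimizing it favors the spread-out weights.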

# Dropout

You want to even out your workouts, otherwise you may have some strange results...

<img src='images/homer-dropout-comparison.jpg'/>

Well, our neural network models are the same way. The model should get _evenly_ trained. We don't want to train the same node/pathway over and over again while the others are neglected.

## Avoiding the Self-Perpetuating Strength Training

When working out, we'd train our left and right arms evenly and switch our exercise routine throughout the week.

In neural networks, we switch around which nodes we use during our training.

Assign each node a probability of being dropped (turned off) for that epoch, usually about 20%. Over many epochs the randomness evens out, so every node gets its share of training.

<img src='images/layered-neural-net.jpg'/>
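A minimal NumPy sketch of the idea: each node's activation is zeroed with probability `drop_prob` on every pass. The function name `dropout` is an assumption of this example, and the rescaling of the surviving nodes (often called inverted dropout) is an addition not covered in these notes; it keeps the expected activation the same with or without dropout.

```python
import numpy as np

def dropout(activations, drop_prob=0.2, rng=None):
    """Zero each node with probability drop_prob and rescale the survivors."""
    rng = np.random.default_rng() if rng is None else rng
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob   # True = node is kept
    return activations * mask / keep_prob              # rescale kept nodes

rng = np.random.default_rng(0)
hidden = np.ones((2, 5))          # toy activations of a 5-node hidden layer
for epoch in range(3):
    # A different random subset of nodes is switched off each time
    print(f"epoch {epoch}:\n{dropout(hidden, 0.2, rng)}")
```

Dropout is applied only during training; at prediction time all nodes are used.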