<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Worrying-About-Overfitting" data-toc-modified-id="Worrying-About-Overfitting-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Worrying About Overfitting</a></span><ul class="toc-item"><li><span><a href="#Use-Train-Validation-Test" data-toc-modified-id="Use-Train-Validation-Test-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Use Train-Validation-Test</a></span></li><li><span><a href="#Model-Complexity-Graph" data-toc-modified-id="Model-Complexity-Graph-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Model Complexity Graph</a></span><ul class="toc-item"><li><span><a href="#Early-Stopping" data-toc-modified-id="Early-Stopping-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Early Stopping</a></span></li></ul></li></ul></li><li><span><a href="#When-a-Good-Model-Goes-Bad" data-toc-modified-id="When-a-Good-Model-Goes-Bad-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>When a Good Model Goes Bad</a></span><ul class="toc-item"><li><span><a href="#L1-Regularization---Absolute-Value" data-toc-modified-id="L1-Regularization---Absolute-Value-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>L1 Regularization - Absolute Value</a></span></li><li><span><a href="#L2-Regularization---Squared-Value" data-toc-modified-id="L2-Regularization---Squared-Value-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>L2 Regularization - Squared Value</a></span></li><li><span><a href="#Comparing-L1-&amp;-L2-Regularization" data-toc-modified-id="Comparing-L1-&amp;-L2-Regularization-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Comparing L1 &amp; L2 Regularization</a></span></li><li><span><a href="#Code-Implementation" data-toc-modified-id="Code-Implementation-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Code Implementation</a></span><ul class="toc-item"><li><span><a href="#Overcomplicated-Model" data-toc-modified-id="Overcomplicated-Model-2.4.1"><span 
class="toc-item-num">2.4.1&nbsp;&nbsp;</span>Overcomplicated Model</a></span></li><li><span><a href="#Regulated-Model" data-toc-modified-id="Regulated-Model-2.4.2"><span class="toc-item-num">2.4.2&nbsp;&nbsp;</span>Regulated Model</a></span></li></ul></li></ul></li><li><span><a href="#Dropout" data-toc-modified-id="Dropout-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Dropout</a></span><ul class="toc-item"><li><span><a href="#Avoiding-the-Self-Perpetuating-Strength-Training" data-toc-modified-id="Avoiding-the-Self-Perpetuating-Strength-Training-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Avoiding the Self-Perpetuating Strength Training</a></span></li><li><span><a href="#Example-Code" data-toc-modified-id="Example-Code-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Example Code</a></span></li></ul></li></ul></div>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

import keras
from keras.layers import Dense, Dropout
from keras.models import Sequential  # same package as the layers above; mixing keras and tensorflow.keras can break


# Worrying About Overfitting

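A big concern when training is making sure we don't overfit our model: it should learn patterns that generalize to new data, not memorize the training set.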

## Use Train-Validation-Test

- Think of **training** as the material you study for a test
- Think of **validation** as taking a practice test (sometimes called the **dev** set)
- Think of **testing** as the real exam you use to judge the final model

> ***holdout*** is when your test dataset is never used for training (unlike in cross-validation)

> The **validation** & **test** sets should come from the same distribution.
>
> _Why would this matter?_

## Model Complexity Graph

- Underfitting
    + low complexity --> high bias, low variance
    + training error: large
    + testing error: large
- Overfitting
    + high complexity --> low bias, high variance
    + training error: low
    + testing error: large
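
The pattern in the list above can be reproduced on a toy problem. The sketch below is only an illustration: the noisy sine data, the polynomial degrees, and the helper `mse` are all assumptions made for this example and are not part of the digits workflow used later. It fits polynomials of increasing degree and reports the training and validation error for each.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: one period of a sine curve plus noise
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Alternate points between a training set and a validation set
x_tr, y_tr = x[::2], y[::2]
x_va, y_va = x[1::2], y[1::2]


def mse(coefs, x_vals, y_vals):
    """Mean squared error of a polynomial (given by its coefficients)."""
    return float(np.mean((np.polyval(coefs, x_vals) - y_vals) ** 2))


# degree -> (training error, validation error)
results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_tr, y_tr, degree)
    results[degree] = (mse(coefs, x_tr, y_tr), mse(coefs, x_va, y_va))
    print(f"degree={degree}: train={results[degree][0]:.3f}, "
          f"valid={results[degree][1]:.3f}")
```

The straight line (degree 1) should show large error on both sets, which is the underfitting case. As the degree grows the training error can only go down, so the validation error is the number to watch for signs of overfitting.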

In [None]:
validation_error = np.array([5,3.5,2,3,4])
train_error = np.array([4.5,3,1.5,1,0.5])
n_epochs = np.array([5,50,100,200,300])

plt.scatter(n_epochs, train_error, label='train error')
plt.scatter(n_epochs, validation_error, label='validation error')
plt.legend()
plt.xlabel('Number of Epochs')
plt.ylabel('Error')
plt.show()
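
One way to read the graph is to look for the point where the validation error is lowest and to watch the gap between the two curves. A minimal sketch using the same illustrative numbers as the plot (the names `best_epoch` and `gap` are just for this example):

```python
import numpy as np

validation_error = np.array([5, 3.5, 2, 3, 4])
train_error = np.array([4.5, 3, 1.5, 1, 0.5])
n_epochs = np.array([5, 50, 100, 200, 300])

# Epoch count with the lowest validation error
best_epoch = int(n_epochs[np.argmin(validation_error)])

# A growing gap between validation and training error hints at overfitting
gap = validation_error - train_error

print("best epoch count:", best_epoch)
print("validation - train gap:", gap)
```

With these numbers the validation error bottoms out at 100 epochs. After that the training error keeps falling while the gap widens.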

### Early Stopping 

Let's first create a model we can play around with:

In [None]:
# Get data to train with
digits = load_digits()
X = digits.data
y = digits.target # Note targets are simply 0-9 associated with class

# Convert target to one-hot encoded vector
y = keras.utils.to_categorical(y)
y.shape

> **NOTE**:
>
> We could have kept the targets as integers instead of using `to_categorical()` to make
> one-hot encoded vectors. In that case we would use [`SparseCategoricalCrossentropy`](https://keras.io/api/losses/probabilistic_losses/#sparsecategoricalcrossentropy-class)
>
> For more on Keras' different built-in losses, see the documentation: https://keras.io/api/losses/
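
To see why the two target formats are interchangeable, here is a small NumPy sketch of the underlying math. It is not the Keras implementation, and the labels and predicted probabilities are made up for the example.

```python
import numpy as np

# Hypothetical integer labels and predicted class probabilities (3 classes)
labels = np.array([2, 0, 1])
probs = np.array([[0.1, 0.2, 0.7],
                  [0.8, 0.1, 0.1],
                  [0.3, 0.4, 0.3]])

# One-hot route: multiply by the one-hot vector and sum over classes
one_hot = np.eye(3)[labels]
categorical = -np.sum(one_hot * np.log(probs), axis=1)

# Sparse route: pick out the probability of the true class directly
sparse = -np.log(probs[np.arange(len(labels)), labels])

print(categorical)
print(sparse)
```

Both routes give the same per-sample loss. The sparse version simply skips building the one-hot matrix.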

In [None]:
X_train, X_test, y_train, y_test =\
    train_test_split(X, y, random_state=27, test_size=0.2)

X_train, X_valid, y_train, y_valid =\
    train_test_split(X_train, y_train, random_state=27, test_size=0.2)    

In [None]:
X_train

In [None]:
y_train

In [None]:
model = Sequential()
model.add(Dense(12, activation='relu', input_dim=64))
model.add(Dense(10, activation='softmax'))  # softmax: one probability per digit class

# Note we use 'categorical_crossentropy' since target is one-hot encoded
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

We train our model but only keep the best version it comes across (by default, the one with the lowest validation loss). We can do this with a [ModelCheckpoint callback](https://keras.io/callbacks/#modelcheckpoint).

In [None]:
checkpoint = keras.callbacks.ModelCheckpoint("best_model.h5",
                                             save_best_only=True
)

history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint]
)

In [None]:
history.history.keys()

In [None]:
metrics = ['loss','val_loss']
for metric in metrics:
    plt.plot(history.history[metric], label=metric)

plt.legend()
plt.tight_layout()

In [None]:
# Now points to the best model found during the fit
model = keras.models.load_model("best_model.h5")

We can also stop our training early when our validation error isn't really improving anymore. We can do this with an [EarlyStopping callback](https://keras.io/callbacks/#earlystopping).

In [None]:
# Recreating/resetting the model
model = Sequential()
model.add(Dense(12, activation='relu', input_dim=64))
model.add(Dense(10, activation='softmax'))

# Targets are one-hot encoded, so we use 'categorical_crossentropy' again
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])


early_stopping = keras.callbacks.EarlyStopping(
                                monitor='val_loss', # What to watch
                                min_delta=0.1, # Smallest decrease that counts as an improvement
                                patience=5 # Stop after 5 epochs without improvement
)

history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[early_stopping]
)

In [None]:
metrics = ['loss','val_loss']
for metric in metrics:
    plt.plot(history.history[metric], label=metric)

plt.legend()
plt.tight_layout()
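
The `min_delta` and `patience` settings are easier to reason about with a plain-Python sketch of the stopping rule. This is only an approximation of what the callback does, written for illustration; the function `early_stopping_epoch` and the list of validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, min_delta=0.1, patience=5):
    """Return the 1-based epoch where training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last real improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improved by more than min_delta: remember it, reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: quick gains early, then tiny ones
val_losses = [1.0, 0.7, 0.5, 0.45, 0.44, 0.43, 0.43, 0.42, 0.42, 0.41]
stopped_at = early_stopping_epoch(val_losses)
print("stopped at epoch:", stopped_at)
```

Notice that the loss is still creeping down when training stops. With `min_delta=0.1`, improvements smaller than 0.1 do not count, so a smaller `min_delta` may be worth trying if training ends sooner than expected.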

# When a Good Model Goes Bad

When a model has large weights, the model is "too confident": small changes in the input can swing its predictions.

We punish large (confident) weights by adding a penalty on them to the error function.

![](images/punishing_model_metaphor.jpg)

## L1 Regularization - Absolute Value

- Tends to produce sparse weight vectors (small weights go to exactly 0)
- Effectively reduces the number of weights in use
- Works as feature selection, since only the most important inputs keep nonzero weights

$$ J(W,b) = \dfrac{1}{m} \sum^m_{i=1}\mathcal{L}(\hat y_i, y_i) + \dfrac{\lambda}{m}\sum_j |w_j| $$

Here $i$ runs over the $m$ training examples and $j$ runs over the weights.

## L2 Regularization - Squared Value

- Weight vectors are not sparse (weights stay small and evenly spread)
- Tends to give better results in training

$$ J(W,b) = \dfrac{1}{m} \sum^m_{i=1}\mathcal{L}(\hat y_i, y_i) + \dfrac{\lambda}{m}\sum_j w_j^2 $$
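
The different behavior of the two penalties shows up in their gradients. The NumPy sketch below uses a made-up weight vector and a made-up $\lambda$, and it drops the $1/m$ factor for simplicity, so treat it as an illustration of the idea, not as the exact cost function above.

```python
import numpy as np

# Hypothetical weights and regularization strength
w = np.array([2.0, -0.5, 0.01, 0.0])
lam = 0.1

# Penalty values
l1_penalty = lam * np.sum(np.abs(w))
l2_penalty = lam * np.sum(w ** 2)

# Gradients of each penalty with respect to the weights
l1_grad = lam * np.sign(w)  # same size push for every nonzero weight
l2_grad = 2 * lam * w       # push proportional to the weight itself

print("L1 penalty:", l1_penalty, "gradient:", l1_grad)
print("L2 penalty:", l2_penalty, "gradient:", l2_grad)
```

The L1 push on the tiny weight `0.01` is as strong as the push on `2.0`, which is why small weights tend to end up at zero. The L2 push fades as a weight shrinks, so weights get small but rarely reach exactly zero.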

## Comparing L1 & L2 Regularization

> Typically you'll want to use L2 regularization 

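+ The difference is subtle; consider the weight vectors [1,0] & [0.5, 0.5]
+ Recall we want the smallest value for our penalty term
+ L1 scores both vectors the same ($|1|+|0| = |0.5|+|0.5| = 1$)
+ L2 prefers [0.5,0.5] over [1,0] ($0.5^2+0.5^2 = 0.5$ versus $1^2+0^2 = 1$)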
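
The comparison can be checked in a few lines of NumPy. The helper names `l1` and `l2` are just for this sketch, and $\lambda$ is left out because it scales both vectors equally.

```python
import numpy as np

concentrated = np.array([1.0, 0.0])
spread_out = np.array([0.5, 0.5])


def l1(v):
    """Sum of absolute values."""
    return float(np.sum(np.abs(v)))


def l2(v):
    """Sum of squares."""
    return float(np.sum(v ** 2))


print("L1:", l1(concentrated), l1(spread_out))
print("L2:", l2(concentrated), l2(spread_out))
```

L1 cannot tell the two vectors apart, while L2 gives the spread-out vector half the penalty.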

## Code Implementation

### Overcomplicated Model

In [None]:
def build_complex_model():
    model = Sequential()
    model.add(Dense(32, activation='relu', input_dim=64))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(10, activation='softmax'))  # softmax pairs with categorical_crossentropy
    
    return model

In [None]:
model = build_complex_model()

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

In [None]:
history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid)
)

In [None]:
metrics = ['loss','val_loss']
for metric in metrics:
    plt.plot(history.history[metric], label=metric)

plt.legend()
plt.tight_layout()

We can see our overcomplicated model could use some regularization

### Regulated Model

In [None]:
def build_regulated_model():
    model = Sequential()
    model.add(
        Dense(
            32, 
            activation='relu',
            kernel_regularizer=keras.regularizers.l2(l2=0.01),
            input_dim=64)
    )
    model.add(
        Dense(
            24, 
            activation='relu',
            kernel_regularizer=keras.regularizers.l2(l2=0.01)
        )
    )
    model.add(
        Dense(
            24, 
            activation='relu',
            kernel_regularizer=keras.regularizers.l2(l2=0.01)
        )
    )
    model.add(Dense(10, activation='softmax'))

    return model

In [None]:
model = build_regulated_model()

# Same loss as the unregularized model so the two runs can be compared
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()

In [None]:
history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid)
)

In [None]:
metrics = ['loss','val_loss']
for metric in metrics:
    plt.plot(history.history[metric], label=metric)

plt.legend()
plt.tight_layout()
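
When reading the loss curves for the regulated model, it helps to remember that the penalty is part of the number being plotted. As far as I understand the Keras `l2` regularizer, each regularized kernel contributes `factor * sum(w**2)` to the reported loss (without the $1/m$ factor from the formula earlier). The sketch below mimics that with random stand-in weights; the kernel values and `data_loss` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Kernel shapes of the three regularized Dense layers (64 -> 32 -> 24 -> 24)
layer_shapes = [(64, 32), (32, 24), (24, 24)]

# Hypothetical kernels standing in for trained weights
kernels = [rng.normal(scale=0.1, size=shape) for shape in layer_shapes]


def l2_penalty(kernel, factor=0.01):
    """Penalty one layer adds to the loss: factor * sum of squared weights."""
    return factor * float(np.sum(kernel ** 2))


penalties = [l2_penalty(k) for k in kernels]
total_penalty = sum(penalties)

data_loss = 0.35  # hypothetical cross-entropy value, for illustration only
reported_loss = data_loss + total_penalty

print("per-layer penalties:", penalties)
print("loss including penalties:", reported_loss)
```

Because of this extra term, the regulated model's loss is not directly comparable to the unregulated model's loss. Comparing accuracy, or the gap between the training and validation curves, is usually more telling.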

# Dropout

You want to even out your workouts, otherwise you may have some strange results...

<img src='images/homer-dropout-comparison.jpg'/>

Well, our neural network models are the same way. The model should get _evenly_ trained. We don't want to train the same node/pathway over and over again

## Avoiding the Self-Perpetuating Strength Training

When working out, we'd train our left and right arms evenly and switch our exercise routine throughout the week.

In neural networks, we switch around which nodes we use during our training.

On each training step, every node is dropped with some probability (usually about 20%), so a different random subset of nodes gets trained each time. Over many epochs the randomness evens out, and at prediction time all nodes are used.

<img src='images/layered-neural-net.jpg'/>
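
The mechanics can be sketched in NumPy. This is a simplified stand-in for what a dropout layer does during training, not the Keras source: the function `dropout` and the all-ones activations are made up for the example. The surviving activations are scaled up by `1 / (1 - rate)` so the expected total stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)


def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` while training."""
    if not training or rate == 0.0:
        # At prediction time every node is used as-is
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    # Scale the survivors so the expected value is unchanged
    return activations * keep_mask / (1.0 - rate)


# Hypothetical activations from a 24-node layer for 1000 samples
activations = np.ones((1000, 24))

dropped = dropout(activations, rate=0.2, training=True, rng=rng)
at_inference = dropout(activations, rate=0.2, training=False, rng=rng)

frac_zero = float(np.mean(dropped == 0))
print("fraction dropped:", round(frac_zero, 3))
print("mean activation while training:", round(float(dropped.mean()), 3))
```

Roughly one in five activations is zeroed on each call, and a fresh mask is drawn every time, so no single node or pathway can dominate training.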

## Example Code

In [None]:
n_classes = 10

model = Sequential()

# Input Layer
model.add(Dense(32, input_dim=64, activation='relu', name='input_layer'))
model.add(Dropout(0.2, name='input_dropout'))
# Hidden Layer
model.add(Dense(24, activation='relu', name='hidden_layer1'))
model.add(Dropout(0.2, name='hidden_layer1_dropout'))
# Hidden Layer
model.add(Dense(24, activation='relu', name='hidden_layer2'))
model.add(Dropout(0.2, name='hidden_layer2_dropout'))
# Output Layer
model.add(Dense(n_classes, activation='softmax', name='output'))

model.summary()

In [None]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid)
)

In [None]:
metrics = ['loss','val_loss']
for metric in metrics:
    plt.plot(history.history[metric], label=metric)

plt.legend()
plt.tight_layout()