<!-- TITLE: Underfitting and Overfitting -->

- [x] Early Stopping
- [x] Adding Capacity
- [ ] Illustration: Learning curves
- [ ] Animation: Underfitting
- [ ] Animation: Fitting with more capacity
- [ ] Example discussion
- [ ] Conclusion

# Introduction #

Recall from the example in the previous lesson that Keras will keep a history of the training and validation loss over the epochs that it is training the model. In this lesson, we're going to learn how to interpret these learning curves and how we can use them to guide model development. In particular, we'll examine the learning curves for evidence of *underfitting* and *overfitting* and look at a couple of strategies for correcting them.

# Interpreting the Learning Curves #

You might remember graphs like these from Intro to ML when you were choosing hyperparameters for a decision tree. The learning curves play an especially important role in deep learning, so let's take a moment to review.

<figure style="padding: 1em;">
<img src="https://i.imgur.com/.gif" width="1200" alt="A graph of training and validation loss.">
<figcaption style="textalign: center; font-style: italic"><center>Learning curves. Underfitting. Overfitting a little. Overfitting a lot.
</center></figcaption>
</figure>

The first problem your model can have is underfitting the training data. **Underfitting** just means the network hasn't learned all it could have learned -- there's still good information in the training data the network didn't detect. The cause of underfitting is most often a model that's not flexible enough, that doesn't have enough *capacity*, in other words.

The second problem your model can have is overfitting, which is when it learns spurious patterns from the training data that don't generalize. The gap between the curves for training loss and validation loss gives you an estimate of the prediction error the model has created by learning these spurious patterns -- that is, the gap gives you evidence of **overfitting**.
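To make the gap concrete, here is a minimal sketch that measures it from per-epoch loss values. The numbers are made up for illustration, and `loss_gap` is a hypothetical helper, not part of Keras. In practice the two lists would come from the `loss` and `val_loss` entries of the training history.

```python
# Hypothetical per-epoch losses, invented for illustration only.
train_loss = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22]
val_loss = [0.95, 0.70, 0.58, 0.55, 0.57, 0.62]


def loss_gap(train, valid):
    """Return the per-epoch difference (validation loss - training loss)."""
    return [v - t for t, v in zip(train, valid)]


gaps = loss_gap(train_loss, val_loss)

# Epoch (0-indexed) at which the validation loss is lowest.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

for epoch, gap in enumerate(gaps):
    marker = " <- lowest validation loss" if epoch == best_epoch else ""
    print(f"epoch {epoch}: gap = {gap:.2f}{marker}")
```

In this made-up run the gap keeps widening while the validation loss bottoms out partway through training, which is the pattern to watch for.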

Overfitting, in small amounts, isn't necessarily bad. As long as the validation loss keeps going down, you can be confident that the model is still learning "good" information, even if it happens to pick up some of the bad as well. Your best performing model will often have a bit of a gap remaining.

It can happen though that, in its search to drive down the loss, the network will start *unlearning* the useful true patterns in preference for the false spurious patterns. It starts throwing out the good to make way for the bad. When this happens you need to take action.

Let's look at a couple of ways we can reach the kind of learning curves we want: *adding capacity* to fix underfitting and *early stopping* to fix overfitting.

# Adding Capacity #

A network's **capacity** refers to the size and complexity of the patterns it is able to learn. As a rule of thumb, more neural units in a network means more capacity. Underfitting occurs when a network lacks the capacity to learn all the useful information from the training data. The cure for underfitting, then, is to add more capacity.

<figure style="padding: 1em;">
<img src="https://i.imgur.com/.png" width="400" alt=" ">
<figcaption style="textalign: center; font-style: italic"><center>This model underfits the training data.
</center></figcaption>
</figure>

You can increase the capacity of a network either by making it *wider* (adding more units to existing layers) or by making it *deeper* (adding more layers). As a rough rule of thumb, wider networks have an easier time learning more linear relationships, while deeper networks are better suited to more nonlinear ones. We'll explore adding capacity to a network in the exercises.
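As a rough illustration of the two options, the sketch below counts the trainable parameters of fully connected networks of different shapes. Parameter count is only a crude proxy for capacity, and the layer sizes and the `dense_param_count` helper are hypothetical choices for this example. The 11 inputs match the feature count used in the example later in this lesson.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases in a stack of fully connected layers.

    Each dense layer has (inputs + 1) * units parameters: one weight per
    input-unit pair, plus one bias per unit.
    """
    total = 0
    for units in layer_units:
        total += (n_inputs + 1) * units
        n_inputs = units  # this layer's outputs feed the next layer
    return total


n_features = 11

base = dense_param_count(n_features, [16, 1])        # one hidden layer of 16 units
wider = dense_param_count(n_features, [32, 1])       # same depth, more units
deeper = dense_param_count(n_features, [16, 16, 1])  # same width, one more layer

print(f"base: {base}, wider: {wider}, deeper: {deeper}")
```

Both changes increase the number of parameters the network can adjust. Which one helps more depends on the dataset, so it is worth trying each.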

# Early Stopping #

We mentioned that when a model begins overfitting on the training set, it is in danger of "forgetting" the useful information it has already learned, causing the validation loss to start increasing. To prevent this, we can simply stop the training whenever it seems the validation loss isn't decreasing anymore. We know then that the network has learned everything useful that it can from the training set. Interrupting the training this way is called **early stopping**.

<figure style="padding: 1em;">
<img src="https://i.imgur.com/.png" width="400" alt=" ">
<figcaption style="textalign: center; font-style: italic"><center>Stop the training before the validation loss begins to rise.
</center></figcaption>
</figure>

One of the advantages of early stopping is that it gives you some leeway in increasing capacity. It won't matter so much if the model is a bit too big, since you can stop the training before anything bad happens. Early stopping, however, isn't the last word in overfitting. As we'll see, it can still be worthwhile to "close the gap." A large gap between training and validation loss can mean there is still useful information in the training set that the model hasn't learned. We'll explore this in the exercises and in the next lesson.

<figure style="padding: 1em;">
<img src="https://i.imgur.com/.png" width="400" alt=" ">
<figcaption style="textalign: center; font-style: italic"><center>We stop the training when the curve achieves its best fit to the validation data. <strong>Left: </strong>Without early stopping. <strong>Right: </strong>With early stopping.
</center></figcaption>
</figure>

## Adding Early Stopping ##

In Keras, we include early stopping in our training through a callback. A **callback** is just a function you want to run every so often while the network trains. The early stopping callback will run after every epoch. (Keras has [a variety of useful callbacks](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks) pre-defined, but you can [define your own](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LambdaCallback), too.)

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001,  # minimum amount of change to count as an improvement
    patience=20,  # how many epochs to wait before stopping
    restore_best_weights=True,  # roll back to the weights from the best epoch
)

These parameters say: "If there hasn't been at least an improvement of 0.001 in the validation loss over the previous 20 epochs, then stop the training and keep the best model you found." As we'll see in our example, we'll pass this callback to the `fit` method when we train the model.
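The stopping rule itself can be sketched in a few lines of plain Python. This is a simplified illustration of a patience-based rule, not Keras's actual implementation, which may differ in details. The function `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, min_delta=0.001, patience=20):
    """Simulate a patience-based stopping rule on a list of validation losses.

    Returns (stop_epoch, best_epoch), both 0-indexed. stop_epoch is None if
    the rule never triggers before the losses run out.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last sufficient improvement

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improved by more than min_delta: record it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Made-up validation losses: improving, then nearly flat, then rising.
losses = [0.50, 0.40, 0.35, 0.3495, 0.36, 0.37, 0.38]
stop_epoch, best_epoch = early_stopping_epoch(losses, min_delta=0.001, patience=3)
print(stop_epoch, best_epoch)
```

Note that the loss at epoch 3 is lower than the one at epoch 2, but by less than `min_delta`, so this sketch does not count it as an improvement. Restoring the best weights corresponds to going back to the model as it was at `best_epoch`.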

# Example - Train a Model with Early Stopping #

We'll use the *Red Wine* dataset again.

In [None]:
#$HIDE_INPUT$
import pandas as pd
from IPython.display import display

red_wine = pd.read_csv('../input/dl-course-data/dl-course-data/red-wine.csv')

# Create training and validation splits
df_train = red_wine.sample(frac=0.7, random_state=0)
df_valid = red_wine.drop(df_train.index)
display(df_train.head(4))

# Scale to [0, 1]
max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)

# Split features and target
X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']
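
The scaling step above takes its minimum and maximum from the training split and reuses them for the validation split. The sketch below repeats the same arithmetic on plain lists to show one consequence: validation values can land slightly outside [0, 1]. The values and the `min_max_scale` helper are hypothetical.

```python
def min_max_scale(values, lo, hi):
    """Rescale values so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]


# Made-up values for a single feature, for illustration only.
train_values = [2.0, 4.0, 6.0, 10.0]
valid_values = [1.0, 5.0, 12.0]

# Statistics come from the training split only.
lo, hi = min(train_values), max(train_values)

train_scaled = min_max_scale(train_values, lo, hi)
valid_scaled = min_max_scale(valid_values, lo, hi)

print(train_scaled)
print(valid_scaled)
```

Using only training statistics keeps information about the validation data out of the preprocessing, at the cost of an occasional out-of-range value.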

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(1024, activation='relu', input_shape=[11]),
    layers.Dense(1024, activation='relu'),
    layers.Dense(1024, activation='relu'),
    layers.Dense(1),
])
model.compile(
    optimizer='adam',
    loss='mae',
)

Add the callback as an argument in `fit` (you can have several, so put it in a list). Choose a large number of epochs when using early stopping, more than you'll need.

In [None]:
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=500,
    callbacks=[early_stopping],
    verbose=0,  # turn off training log
)

In [None]:
import pandas as pd
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();

# Conclusion #