# Too good to be true -- Combatting model overfitting

In our previous classification problems, the error on the training set substantially underestimates the true error on the test set.

In fact, using sufficiently many layers and hidden units, it is not uncommon to achieve close to 100% accuracy on the training set. We call this phenomenon __overfitting__.
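The gap between training and test error can be reproduced on a toy problem. The following is a minimal sketch that swaps the neural network for a polynomial fit: the data (a noisy sine curve), the polynomial degree, and all variable names are assumptions chosen for illustration and are not part of the cats-vs-dogs example.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(x):
    """Hypothetical toy target: a sine curve with Gaussian noise."""
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Ten training points and a larger held-out test set
x_train = np.linspace(0.0, 1.0, 10)
y_train = noisy_sine(x_train)
x_test = np.linspace(0.0, 1.0, 200)
y_test = noisy_sine(x_test)

# A degree-9 polynomial has as many coefficients as there are training
# points, so it can pass through every training point, noise included.
coeffs = np.polyfit(x_train, y_train, deg=9)

train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print('train MSE: {:.2e}'.format(train_mse))
print('test MSE:  {:.2e}'.format(test_mse))
```

The training error is essentially zero, while the error on fresh points stays at least as large as the noise level: the model has memorized the training sample rather than learned the underlying curve.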

This is not so surprising: VGG, for instance, has roughly 140 million parameters, which gives the network more than enough capacity to memorize the training set.
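The order of magnitude of VGG's parameter count can be checked by hand. The sketch below assumes the 16-layer VGG configuration with 3x3 convolutions, a 7x7x512 feature map in front of the dense layers, and 1000 output classes; the text itself only says "VGG", so treat the layer list as an assumption.

```python
# (input channels, output channels) of each 3x3 convolution,
# assuming the 16-layer VGG configuration
conv_layers = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

# (inputs, outputs) of each fully connected layer
dense_layers = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of a single convolutional layer."""
    return kernel * kernel * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Weights plus biases of a single fully connected layer."""
    return n_in * n_out + n_out

n_conv = sum(conv_params(c_in, c_out) for c_in, c_out in conv_layers)
n_dense = sum(dense_params(n_in, n_out) for n_in, n_out in dense_layers)

print('convolutional layers: {:,}'.format(n_conv))
print('dense layers:         {:,}'.format(n_dense))
print('total:                {:,}'.format(n_conv + n_dense))
```

Under these assumptions the total comes to about 138 million, and the dense layers account for the vast majority of it.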

We discuss three common approaches to reduce overfitting: *data augmentation*, *Dropout* and *Batch Normalization*.

## Data augmentation -- the art of inflating your training sample size

The fundamental problem of overfitting is that classifiers have difficulty generalizing to unseen data. Hence, a straightforward approach is to add more data.

In practice, adding more data is often expensive or even infeasible. In these cases, we can try to bootstrap new image data from what we already have. This approach is known as __data augmentation__.
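To make the idea concrete, here is a minimal NumPy sketch of two simple augmentations: a random horizontal flip and a small random shift. The function `augment` and its `max_shift` parameter are hypothetical names introduced for illustration; this is not how any particular library implements augmentation.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of an (H, W, C) image."""
    out = image
    # Flip left-right with probability 0.5
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Draw a vertical and a horizontal shift in [-max_shift, max_shift]
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # Pad by repeating the border pixels, then crop a window of the
    # original size at the shifted position
    pad = ((max_shift, max_shift), (max_shift, max_shift), (0, 0))
    padded = np.pad(out, pad, mode='edge')
    height, width = image.shape[:2]
    top, left = max_shift + dy, max_shift + dx
    return padded[top:top + height, left:left + width, :]

rng = np.random.default_rng(42)
image = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
augmented = augment(image, rng)
print(augmented.shape)
```

Each call produces a slightly different version of the same image with the same label, so the classifier rarely sees exactly the same input twice.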

Let's consider our by now familiar example of cats vs. dogs.

In [1]:
from tensorflow.contrib.keras.python.keras.preprocessing.image import ImageDataGenerator
import sys
import numpy as np
sys.path.insert(0, '../scripts')
import nn_helper
from nn_helper import show_array, show_array_list
from matplotlib import pyplot as plt
%matplotlib inline

Using TensorFlow backend.


In [2]:
# File paths
ROOT = '../data/processed/cats_vs_dogs'
FEATURE_PATH = '../features/cats_vs_dogs'

SEED = 42

Keras comes with an ImageDataGenerator -- a convenient tool to randomly warp given image data.

In [3]:
data_gen = ImageDataGenerator()

data_gen_aug = ImageDataGenerator(rotation_range=10, width_shift_range=0.05,
                                  height_shift_range=0.05, zoom_range=0.05,
                                  shear_range=0.05, channel_shift_range=10,
                                  horizontal_flip=True)

Next, we define generators to extract original and augmented images.

In [17]:
np.random.seed(SEED)
train_gen = data_gen.flow_from_directory('{}/train'.format(ROOT), shuffle = True)

np.random.seed(SEED)
train_gen_aug = data_gen_aug.flow_from_directory('{}/train'.format(ROOT), shuffle = True)


Found 24000 images belonging to 2 classes.
Found 24000 images belonging to 2 classes.


Now, we can compare the original images with the augmented ones.

In [None]:
np.random.seed(SEED)
a = next(train_gen)

np.random.seed(SEED)
b = next(train_gen_aug)

show_array_list([a[0][1,:,:,:], b[0][1,:,:,:]])

## Regularization via Dropout

Make sure to watch https://www.youtube.com/watch?v=DleXA5ADG78

Large neural networks tend to match training data remarkably well by creating highly elaborate interdependencies between different activation patterns.

When seeing a new image, these highly elaborate interdependencies break down and the model is lost. 

__Dropout__ prevents the development of such intricate dependencies by randomly setting the outputs of a subset of neurons to 0 during training.
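The mechanism fits in a few lines of NumPy. The following is a minimal sketch of the common "inverted dropout" variant, in which the surviving activations are rescaled during training so that nothing needs to change at prediction time. The function name `dropout_forward` and its arguments are assumptions made for illustration.

```python
import numpy as np

def dropout_forward(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training."""
    if not training or rate == 0.0:
        # At prediction time all neurons stay active
        return activations
    keep_prob = 1.0 - rate
    # Boolean mask: True for the neurons that survive this pass
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the survivors so the expected activation is unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
activations = np.ones((4, 5))
dropped = dropout_forward(activations, rate=0.2, rng=rng)
print(dropped)
```

With `rate=0.2`, roughly one in five entries is set to 0 and the rest are scaled up by a factor of 1.25. Since a fresh mask is drawn on every forward pass, no neuron can rely on a specific other neuron being present.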

### Example cats vs dogs

Let's see how this works for our fine-tuned cats-vs-dogs classifier.

In [80]:
from tensorflow.contrib.keras.python.keras.models import Sequential, Model
from tensorflow.contrib.keras.python.keras.layers import Dense, Dropout, Flatten
from tensorflow.contrib.keras.python.keras.optimizers import Adam

model = Sequential([
    Flatten(input_shape = (7, 7, 512)),
    
    Dense(32, activation = 'relu'),
    Dropout(0.2),
    
    Dense(32, activation = 'relu'),
    Dropout(0.2),
    
    Dense(1, activation='sigmoid')
    
])
model.compile(optimizer = Adam(lr = 1e-4), loss = 'binary_crossentropy', metrics = ['accuracy'])

We import the training data.

In [50]:
import numpy as np
FEATURE_PATH = '../features/cats_vs_dogs'

# Precomputed VGG convolutional features, in the order [train, valid]
vgg_conv_features = [np.load('{}/vgg_conv_features_{}.npy'.format(FEATURE_PATH, tv))
                     for tv in ['train', 'valid']]
# Labels in the same [train, valid] order
labels = [np.load('{}/vgg_features_names_{}.npy'.format(FEATURE_PATH, tv))
          for tv in ['train', 'valid']]


In [82]:
model.fit(vgg_conv_features[0], labels[0], epochs = 5,
          validation_data = (vgg_conv_features[1], labels[1]))

Train on 24000 samples, validate on 1000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f764e650358>

## Batch normalization

Batch normalization is based on a simple and widely accepted paradigm: Standardize your data!

Backpropagation trains all layers at the same time. This means that the input for higher layers is unstable for a long time, since it comes from lower hidden layers that are themselves subject to the training process. That is, we experience an *internal covariate shift*.

The most immediate approach is to standardize the inputs before activations are computed. In essence, batch normalization does just this, but in a way that makes the standardization part of backpropagation.

Mainly, this leads to a speed-up in the learning process. However, as the mean and standard deviation are batch-dependent, the predicted output of a single training example is subject to randomness. This is found to have a regularizing effect.
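The training-time computation can be sketched in NumPy as follows. This is a minimal illustration, not a full layer: the names `batch_norm_forward`, `gamma`, `beta`, and `eps` are assumptions, and the running averages that a real layer keeps for prediction time are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the batch, then rescale and shift.

    x has shape (batch_size, n_features); gamma and beta are the
    learnable scale and shift, each of shape (n_features,).
    """
    mean = x.mean(axis=0)  # per-feature mean over the batch
    var = x.var(axis=0)    # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(42)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
normalized = batch_norm_forward(batch, gamma=np.ones(4), beta=np.zeros(4))

print(normalized.mean(axis=0))
print(normalized.std(axis=0))
```

Because `mean` and `var` are computed from the current batch, the same example is normalized slightly differently depending on which other examples it is batched with. This is the source of the randomness mentioned above.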

### Example cats vs dogs

We add batchnorm after the dense layers.

In [84]:
from tensorflow.contrib.keras.python.keras.layers import BatchNormalization

model = Sequential([
    Flatten(input_shape = (7, 7, 512)),
    
    Dense(32, activation = 'relu'),
    BatchNormalization(),
    
    Dense(32, activation = 'relu'),
    BatchNormalization(),
    
    Dense(1, activation='sigmoid')    
])
model.compile(optimizer = Adam(lr = 1e-4), loss = 'binary_crossentropy', metrics = ['accuracy'])

Now, we fit to the data.

In [85]:
model.fit(vgg_conv_features[0], labels[0], epochs = 5,
          validation_data = (vgg_conv_features[1], labels[1]))

Train on 24000 samples, validate on 1000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f764e038a20>