Using Keras, let's look at the effect of regularization.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import math

import keras
from keras.datasets import mnist
from keras.layers import Input, Dense, Flatten, MaxPooling2D, MaxPooling1D, Conv2D, Reshape, Dropout
from keras.models import Model, Sequential
from keras import regularizers

Using TensorFlow backend.


Load up MNIST digits.

In [2]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

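Normalize the images by rescaling pixel values to the range 0 to 1, dividing by the maximum pixel value. The trailing axis added by `expand_dims` is the single grayscale channel that `Conv2D` expects.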

In [3]:
# Scale with the training-set maximum so train and test share the same transform.
pixel_max = np.max(x_train)
x_train = np.expand_dims(x_train / pixel_max, -1)
x_test = np.expand_dims(x_test / pixel_max, -1)

One-hot encode the digit labels for the ten classes 0 through 9.

In [4]:
train_labels = keras.utils.to_categorical(y_train, 10)
test_labels = keras.utils.to_categorical(y_test, 10)
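
To make the encoding concrete, here is a minimal NumPy sketch of what one-hot encoding produces. The helper name `one_hot` is hypothetical and is only meant to illustrate the idea; it is not how `to_categorical` is implemented.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Row i gets a 1.0 in the column given by labels[i].
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example = one_hot([5, 0, 9], 10)
print(example[0])  # 1.0 at index 5, zeros elsewhere
```

Each row sums to 1, which matches the shape of the softmax output the network is trained against.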

Here is a simple convolutional network with pooling and a dense output.

In [5]:
input_shape = x_train[0].shape


model = Sequential()
model.add(Reshape(input_shape, input_shape=input_shape))
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))


model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train, train_labels,
          batch_size=64,
          epochs=8,
          validation_data=(x_test, test_labels))

Train on 60000 samples, validate on 10000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7ff1d353b400>

Now look at the gap between `acc` and `val_acc` in the final epochs: training accuracy running ahead of validation accuracy is the sign of overfitting.
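
If you want a number rather than eyeballing the log, one option is to capture the return value of `fit` (for example `history = model.fit(...)`) and compare the per-epoch metrics in `history.history`. The sketch below assumes that dictionary layout; the helper name `generalization_gap` is hypothetical, and the metric key is `acc` or `accuracy` depending on the Keras version. The values in `fake_history` are made up for illustration and are not results from this notebook.

```python
def generalization_gap(history_dict):
    """Per-epoch training accuracy minus validation accuracy."""
    # Older Keras versions log 'acc'; newer ones log 'accuracy'.
    acc_key = "acc" if "acc" in history_dict else "accuracy"
    val_key = "val_" + acc_key
    return [train - val
            for train, val in zip(history_dict[acc_key], history_dict[val_key])]

# Made-up numbers for illustration only, not output from the runs here.
fake_history = {"acc": [0.90, 0.99], "val_acc": [0.91, 0.97]}
gaps = generalization_gap(fake_history)
print(gaps)
```

A positive and growing gap means the model keeps improving on the training set without a matching improvement on held-out data.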

Now, let's try the same model with L2 regularization. This adds a penalty on large weight values to the loss; you can think of it as a kind of additional loss term. The benefit is less overfitting.
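
As a rough sketch of the penalty, assuming it is the regularization factor times the sum of squared weights (which is how `regularizers.l2` is documented), here is the computation in plain NumPy. The function name `l2_penalty` and the weight values are hypothetical.

```python
import numpy as np

def l2_penalty(weights, factor=0.01):
    """L2 penalty: factor times the sum of squared weights."""
    return factor * np.sum(np.square(weights))

small = np.array([0.1, -0.2, 0.05])
large = np.array([1.0, -2.0, 0.5])

# Weights that are 10x larger incur a 100x larger penalty.
print(l2_penalty(small), l2_penalty(large))
```

Because the penalty grows with the square of each weight, the optimizer is nudged toward many small weights rather than a few large ones.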

In [6]:
kernel_regularizer = regularizers.l2(0.01)

regularized = Sequential()
regularized.add(Reshape(input_shape, input_shape=input_shape))
regularized.add(Conv2D(32, kernel_size=(3, 3), activation='relu', kernel_regularizer=kernel_regularizer))
regularized.add(Conv2D(64, (3, 3), activation='relu', kernel_regularizer=kernel_regularizer))
regularized.add(MaxPooling2D(pool_size=(2, 2)))
regularized.add(Flatten())
regularized.add(Dense(128, activation='relu', kernel_regularizer=kernel_regularizer))
regularized.add(Dense(128, activation='relu', kernel_regularizer=kernel_regularizer))
regularized.add(Dense(10, activation='softmax'))


regularized.compile(loss='categorical_crossentropy',
                    optimizer='adam',
                    metrics=['accuracy'])
regularized.fit(x_train, train_labels,
                batch_size=64,
                epochs=8,
                validation_data=(x_test, test_labels))

Train on 60000 samples, validate on 10000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7ff1d04f4f60>

You can see the model is slower to learn: the initial accuracy is lower and the loss is higher than in the non-regularized run. Part of that higher loss is the L2 penalty itself, which is added to the cross-entropy; the rest reflects a model that fits the training data less tightly.

And look at the result: `val_acc` is actually higher than `acc`. The model is somewhat less accurate on the training data, but it generalizes better, which is to say it has memorized less of the training data.

This is an interesting tradeoff, but it can be overcome with more training data and more epochs of training!

Now, a different approach: *Dropout*. This randomly deactivates individual activations (layer outputs) while training, but not while validating or predicting. The notion is that the network is forced to generalize because it cannot rely on any particular unit being active in a given training step.
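
The mechanism can be sketched in a few lines of NumPy. This is an illustration of "inverted" dropout, where the surviving values are scaled up by `1 / (1 - rate)` during training so the expected activation is unchanged; the function name `dropout` and its signature are hypothetical, not the Keras layer's internals.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Zero a random fraction `rate` of activations while training."""
    if not training or rate == 0.0:
        # Validation and prediction pass activations through untouched.
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    # Scale the survivors so the expected value matches the input.
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((4, 1000))
train_out = dropout(x, 0.5, True, rng)   # roughly half zeros, the rest 2.0
eval_out = dropout(x, 0.5, False, rng)   # identical to x
print(train_out.mean(), eval_out.mean())
```

A different random mask is drawn on every call, so no single unit can be relied upon from one training step to the next.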

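We'll add a 50% dropout after each learning layer. Note that the code below also keeps the L2 `kernel_regularizer` from the previous model, so the two techniques are combined.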

In [7]:

dropped = Sequential()
dropped.add(Reshape(input_shape, input_shape=input_shape))
dropped.add(Conv2D(32, kernel_size=(3, 3), activation='relu', kernel_regularizer=kernel_regularizer))
dropped.add(Dropout(0.5))
dropped.add(Conv2D(64, (3, 3), activation='relu', kernel_regularizer=kernel_regularizer))
dropped.add(Dropout(0.5))
dropped.add(MaxPooling2D(pool_size=(2, 2)))
dropped.add(Flatten())
dropped.add(Dense(128, activation='relu', kernel_regularizer=kernel_regularizer))
dropped.add(Dropout(0.5))
dropped.add(Dense(128, activation='relu', kernel_regularizer=kernel_regularizer))
dropped.add(Dropout(0.5))
dropped.add(Dense(10, activation='softmax'))


dropped.compile(loss='categorical_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])
dropped.fit(x_train, train_labels,
            batch_size=64,
            epochs=8,
            validation_data=(x_test, test_labels))

Train on 60000 samples, validate on 10000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7ff12f7ee5f8>

And again, slower learning but better generalization. Dropout is a bit easier to understand than L2 regularization, which is a bit *math-y* and interacts with the loss function and learning rate, so it can be difficult to reason about its parameter values.

Dropout is simpler: you can vary its effect by choosing different rates and by training for more epochs. Here it is with 20%.

In [8]:
dropped = Sequential()
dropped.add(Reshape(input_shape, input_shape=input_shape))
dropped.add(Conv2D(32, kernel_size=(3, 3), activation='relu', kernel_regularizer=kernel_regularizer))
dropped.add(Dropout(0.2))
dropped.add(Conv2D(64, (3, 3), activation='relu', kernel_regularizer=kernel_regularizer))
dropped.add(Dropout(0.2))
dropped.add(MaxPooling2D(pool_size=(2, 2)))
dropped.add(Flatten())
dropped.add(Dense(128, activation='relu', kernel_regularizer=kernel_regularizer))
dropped.add(Dropout(0.2))
dropped.add(Dense(128, activation='relu', kernel_regularizer=kernel_regularizer))
dropped.add(Dropout(0.2))
dropped.add(Dense(10, activation='softmax'))


dropped.compile(loss='categorical_crossentropy',
                optimizer='adam',
                metrics=['accuracy'])
dropped.fit(x_train, train_labels,
            batch_size=64,
            epochs=8,
            validation_data=(x_test, test_labels))

Train on 60000 samples, validate on 10000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7fee584bd320>

That's a pretty good compromise!