Let's see the effect of different weight initializations on gradients. We'll use Keras.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import math

In [10]:
import keras
from keras.datasets import mnist
from keras.layers import Input, Dense, Flatten, MaxPooling2D, MaxPooling1D, Conv2D, Reshape
from keras.models import Model, Sequential

Load up MNIST digits.

In [11]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

Normalize the images by rescaling them to the 0-1 range, dividing by the maximum pixel value.

In [12]:
x_train = np.expand_dims(x_train / np.max(x_train), -1)
x_test = np.expand_dims(x_test / np.max(x_test), -1)

One-hot encode the digit labels for the numbers 0...9.

In [13]:
train_labels = keras.utils.to_categorical(y_train, 10)
test_labels = keras.utils.to_categorical(y_test, 10)
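
To make the encoding concrete, here is a minimal NumPy sketch of the layout that one-hot encoding produces. The helper name `one_hot` is hypothetical and is not part of Keras; it only illustrates the idea that each label becomes a row with a single 1 in the column for its digit.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Row i gets a 1.0 in the column given by labels[i].
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example = one_hot([3, 0, 9], 10)
print(example)
```

Each row sums to 1, which is what `categorical_crossentropy` expects as a target distribution.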


Here is a simple convolutional network with pooling and a dense output.

In [18]:
input_shape = x_train[0].shape

def build_network(initializer):
    model = Sequential()
    model.add(Reshape(input_shape, input_shape=input_shape))
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', kernel_initializer = initializer))
    model.add(Conv2D(64, (3, 3), activation='relu', kernel_initializer = initializer))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer = initializer))
    model.add(Dense(128, activation='relu', kernel_initializer = initializer))
    model.add(Dense(10, activation='softmax', kernel_initializer = initializer))
    return model

Now, we've seen this kind of model work before, but let's change just one thing and initialize the weights with zeros.

Keras has named initializers, so all we do is pass in the string `'zeros'`.

In [19]:
model = build_network('zeros')
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, train_labels,
                    batch_size=64,
                    epochs=8,
                    validation_data=(x_test, test_labels))

Train on 60000 samples, validate on 10000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


Same network -- but we didn't learn anything. By setting all the initial weights to zero, we've run into an extreme form of the *vanishing gradients* problem. Without diving into the math, the simple way to think about it is that when every weight is 0, every layer outputs 0, and the gradient that flows back to each weight is 0 as well. The model gets no signal about whether it should increase or decrease a weight in order to learn. So it just gets stuck!
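
The zero-gradient effect can be checked by hand on a much smaller network. The sketch below is an illustration only, not the Keras model above: it assumes a tiny two-layer ReLU network on made-up data, with the forward and backward passes written out in NumPy. All names (`W1`, `W2`, `dW1`, and so on) are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((4, 5))           # 4 fake samples with 5 features each
y = np.eye(3)[[0, 1, 2, 1]]      # one-hot targets for 3 classes

# All weights and biases start at zero.
W1 = np.zeros((5, 6))
b1 = np.zeros(6)
W2 = np.zeros((6, 3))
b2 = np.zeros(3)

# Forward pass: dense + ReLU, then dense + softmax.
z1 = x @ W1 + b1
h = np.maximum(z1, 0)
logits = h @ W2 + b2
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)

# Backward pass for the mean cross-entropy loss.
dlogits = (p - y) / len(x)
dW2 = h.T @ dlogits              # zero, because the hidden activations are zero
db2 = dlogits.sum(axis=0)        # the output bias is the only thing with a gradient
dh = dlogits @ W2.T              # zero, because W2 is zero
dz1 = dh * (z1 > 0)
dW1 = x.T @ dz1                  # zero

print("max |dW1|:", np.abs(dW1).max())
print("max |dW2|:", np.abs(dW2).max())
print("db2:", db2)
```

In this sketch only the output bias receives a gradient, so the best such a network can do is learn the class frequencies, regardless of the input image.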

Let's try the same thing -- but now initialize to all ones.

In [20]:
model = build_network('ones')
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, train_labels,
                    batch_size=64,
                    epochs=8,
                    validation_data=(x_test, test_labels))

Train on 60000 samples, validate on 10000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


Again -- the network fails to learn. This is the *exploding gradients* problem: with every weight at 1, each unit simply adds up all of its inputs, so the activations grow rapidly from layer to layer and the gradients computed from them can become very large. Every unit in a layer also starts out identical, so the units cannot learn different features. In this case the model isn't so much stuck as it is wandering off in the wilderness.
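
To see how quickly things blow up, here is a small NumPy sketch that pushes one fake image through three dense ReLU layers whose weights are all ones. It is a simplified stand-in for the convolutional model above; the layer widths and the random input are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1, 784))         # one fake flattened "image" with values in [0, 1)

activations = x
means = []
for width in (128, 128, 128):
    # Every weight is 1, so each unit outputs the sum of all of its inputs.
    W = np.ones((activations.shape[1], width))
    activations = np.maximum(activations @ W, 0)
    means.append(float(activations.mean()))
    print("width", width, "mean activation", means[-1])
```

The mean activation grows by roughly the layer width at every step, and all units within a layer hold the same value (up to floating point rounding), so no unit can specialize.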

Now we'll use a proper initialization scheme -- Glorot Uniform -- which draws small random floating point numbers from a uniform distribution centered on zero. This gives us a mix of positive and negative parameters, with positive and negative gradients, allowing the model to learn.
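
As a rough sketch of what the named initializer does for a dense layer, Glorot Uniform samples from the range `[-limit, limit]` with `limit = sqrt(6 / (fan_in + fan_out))`, where `fan_in` and `fan_out` are the number of input and output units. The function `glorot_uniform` below is a hand-written NumPy illustration, not the Keras implementation, and the 128-to-10 layer shape is just an example.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = glorot_uniform(128, 10, rng)
limit = np.sqrt(6.0 / (128 + 10))

print("limit:", limit)
print("min, max, mean:", W.min(), W.max(), W.mean())
```

The weights are small, bounded by `limit`, and split between positive and negative values, which breaks the symmetry that held back the all-zeros and all-ones networks.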

In [21]:
model = build_network('glorot_uniform')
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, train_labels,
                    batch_size=64,
                    epochs=8,
                    validation_data=(x_test, test_labels))

Train on 60000 samples, validate on 10000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


And all we did was change the initialization! This is an important point to keep in mind: when we use stochastic methods, we need a random initialization that generates small numbers, so that the model neither gets 'stuck' at zero nor explodes off to very large values.

One final initialization method to consider is He Uniform. It is conceptually similar to Glorot Uniform, but generates slightly larger numeric values.
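
The difference in scale can be made concrete by comparing the sampling ranges. The `he_uniform` initializer used in the next cell draws from `[-limit, limit]` with `limit = sqrt(6 / fan_in)`, while Glorot Uniform uses `sqrt(6 / (fan_in + fan_out))`. The helper names and layer sizes below are assumptions chosen for illustration.

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    """Half-width of the Glorot Uniform sampling range."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_uniform_limit(fan_in):
    """Half-width of the He Uniform sampling range (ignores fan_out)."""
    return math.sqrt(6.0 / fan_in)

limits = {}
for fan_in, fan_out in [(128, 128), (128, 10)]:
    glorot = glorot_uniform_limit(fan_in, fan_out)
    he = he_uniform_limit(fan_in)
    limits[(fan_in, fan_out)] = (glorot, he)
    print(fan_in, fan_out, "glorot:", glorot, "he:", he)
```

When `fan_in` equals `fan_out`, the He limit is larger than the Glorot limit by a factor of `sqrt(2)`.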

In [22]:
model = build_network('he_uniform')
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(x_train, train_labels,
                    batch_size=64,
                    epochs=8,
                    validation_data=(x_test, test_labels))

Train on 60000 samples, validate on 10000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


Roughly the same -- still learning!