Using Keras, let's look at the effect of normalization.

In [10]:
import numpy as np
import matplotlib.pyplot as plt
import math

import keras
from keras.datasets import mnist
from keras.layers import Input, Dense, Flatten, MaxPooling2D, MaxPooling1D, Conv2D, Reshape, BatchNormalization
from keras.models import Model, Sequential
from keras import regularizers

Load up the MNIST digits, explicitly *not* normalizing the pixels to the range 0-1, so they keep their raw 0-255 values.

In [11]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train_not_normalized = np.expand_dims(x_train, -1)
x_test_not_normalized = np.expand_dims(x_test, -1)
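To see what the un-normalized input looks like, here is a minimal sketch that uses synthetic data in place of the real download. The `fake_images` array is a made-up stand-in, assuming MNIST images are 28x28 unsigned bytes; it only illustrates the shape change from `np.expand_dims` and the raw value range.

```python
import numpy as np

# Synthetic stand-in for a few MNIST images: 28x28 unsigned bytes in 0-255.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Add a trailing channel axis, as np.expand_dims(x_train, -1) does above.
with_channel = np.expand_dims(fake_images, -1)

print(fake_images.shape, "->", with_channel.shape)  # (4, 28, 28) -> (4, 28, 28, 1)
print(with_channel.dtype, with_channel.min(), with_channel.max())
```

The trailing axis of size 1 is the single grayscale channel that `Conv2D` expects.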

One-hot encode the digit labels for the numbers 0...9.

In [12]:
train_labels = keras.utils.to_categorical(y_train, 10)
test_labels = keras.utils.to_categorical(y_test, 10)
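For readers who have not met one-hot encoding before, this is roughly the idea in plain NumPy. The helper `one_hot` is a hypothetical name for illustration, not part of Keras, and it is only a sketch of the kind of array `to_categorical` produces.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a 1.0 at each label's index."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Row i gets a 1.0 in the column given by labels[i].
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example = one_hot([3, 0, 9], 10)
print(example.shape)  # (3, 10)
print(example[0])     # a single 1.0, at index 3
```

Each row sums to 1, which matches the shape of the softmax output the network is trained against.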

Here is a simple convolutional network with pooling and a dense output.

In [13]:
input_shape = x_train_not_normalized[0].shape


model = Sequential()
model.add(Reshape(input_shape, input_shape=input_shape))
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))


model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train_not_normalized, train_labels,
          batch_size=64,
          epochs=8,
          validation_data=(x_test_not_normalized, test_labels))

Train on 60000 samples, validate on 10000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f8d186ae438>

OK, that didn't learn. The gradients 'exploded': they went to very large values and the weights wandered off. It also doesn't help that the default learning rate is small and suited to values of roughly unit scale, so it does not work well when the numbers sit far outside 0-1. This is an important point, and a bit of a cookbook recipe: keep all your numbers in the range 0-1 to avoid trouble.
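A toy calculation shows why the input scale matters so much. This sketch is not the convolutional network above; it assumes a single linear unit with a squared loss, and the names `x_unit`, `x_raw`, and `grad_norm` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_unit = rng.random(784)        # stand-in "pixels" in [0, 1)
x_raw = x_unit * 255.0          # the same pixels on a 0-255 scale
w = rng.normal(0.0, 0.05, 784)  # small random weights

def grad_norm(x):
    """Norm of dL/dw for the toy loss L = 0.5 * (w . x)**2."""
    output = w @ x
    return np.linalg.norm(output * x)

ratio = grad_norm(x_raw) / grad_norm(x_unit)
print(ratio)  # about 255**2 = 65025
```

In this toy case, scaling the input by 255 scales both the output and the input term of the gradient, so the gradient grows by 255 squared while the learning rate stays the same.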

Now normalize the images by rescaling them to 0-1, dividing by the max image value.

In [15]:
# Scale both splits by the training maximum so they share one scale.
max_pixel = np.max(x_train)
x_train_normalized = np.expand_dims(x_train / max_pixel, -1)
x_test_normalized = np.expand_dims(x_test / max_pixel, -1)
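As a quick sanity check of the rescaling idea, here is a small sketch on made-up arrays. The helper `scale_with_train_max` and the tiny `train` and `test` arrays are hypothetical; the cast to `float32` is an assumption made here to keep the arrays compact, not something the cell above does.

```python
import numpy as np

def scale_with_train_max(train, test):
    """Divide both arrays by the training maximum so they share one scale."""
    max_value = float(np.max(train))
    train_scaled = train.astype("float32") / max_value
    test_scaled = test.astype("float32") / max_value
    return train_scaled, test_scaled

train = np.array([[0, 128, 255]], dtype=np.uint8)
test = np.array([[64, 255, 10]], dtype=np.uint8)
train_scaled, test_scaled = scale_with_train_max(train, test)
print(train_scaled.min(), train_scaled.max())  # 0.0 1.0
```

Using one maximum for both splits means the test images are transformed exactly the way the training images were.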

In [16]:
input_shape = x_train_normalized[0].shape


model = Sequential()
model.add(Reshape(input_shape, input_shape=input_shape))
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))


model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train_normalized, train_labels,
          batch_size=64,
          epochs=8,
          validation_data=(x_test_normalized, test_labels))

Train on 60000 samples, validate on 10000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f8d006c4c18>

Ahh -- back to decent results!

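And now, a slightly more advanced technique: *batch normalization*. This automatically normalizes the activations in each batch, shifting and scaling them to roughly zero mean and unit variance (followed by a learned scale and offset), rather than to a fixed range such as 0-1.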
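The core of the idea can be sketched in a few lines of NumPy. This is only the training-time forward pass under simplifying assumptions: the function name `batch_norm_forward` and the `eps` value are chosen here for illustration, and a real layer also learns `gamma` and `beta` and keeps moving averages to use at inference time.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Normalize each feature (column) over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# A batch of 64 samples with 5 features on a raw 0-255 scale.
batch = rng.integers(0, 256, size=(64, 5)).astype("float64")
out = batch_norm_forward(batch, gamma=np.ones(5), beta=np.zeros(5))

print(out.mean(axis=0))  # each entry close to 0
print(out.std(axis=0))   # each entry close to 1
```

With `gamma` at 1 and `beta` at 0 the output is simply the standardized batch, so values on a 0-255 scale come out centered near zero regardless of how the input was scaled.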

We'll work on the raw pixel input to illustrate the effect.

In [17]:
input_shape = x_train_not_normalized[0].shape


model = Sequential()
model.add(Reshape(input_shape, input_shape=input_shape))
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(10, activation='softmax'))


model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train_not_normalized, train_labels,
          batch_size=64,
          epochs=8,
          validation_data=(x_test_not_normalized, test_labels))

Train on 60000 samples, validate on 10000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f8d18138d68>

Now that's a pretty good layer. Even though we 'forgot' to normalize our input data, we made our network self-normalizing as it runs.

I tend to use this technique myself, as it avoids pre-processing the data, which tends to happen on the CPU and be much slower. `BatchNormalization`, when I run it on the GPU, does add some runtime to the model (notice the difference in seconds), but it is a lot more forgiving of your input data.