Simple CNN for notMNIST
=============

In [1]:
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle

This is a basic four-layer convolutional neural network originally taken from Udacity's deep learning course's *Assignment 4*.  I heavily reorganized it, added some extra machine learning tricks, updated it for TensorFlow 1.0, and replaced some pieces of it with bits of Google's *cifar10.py* code.

In version 4.4 I'll add more layers and turn it into LeNet-5.

In version 4.5 I'll add some more fancy tricks like (as the assignment suggests) dropout and learning rate decay.

## Read the data from a pickle I made earlier

In [2]:
pickle_file = 'notMNIST.pickle'

with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valdn_dataset = save['valid_dataset']
    valdn_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
    print('Training set', train_dataset.shape, train_labels.shape)
    print('Validation set', valdn_dataset.shape, valdn_labels.shape)
    print('Test set', test_dataset.shape, test_labels.shape)

Training set (200000, 28, 28) (200000,)
Validation set (10000, 28, 28) (10000,)
Test set (18724, 28, 28) (18724,)


## Preprocessing

### Preprocessing: Normalization
Apparently, these networks learn better if the input data (the pixel values, in this context) have a mean of 0 and a variance of 1.  So, we want to adjust the pixel values a bit before we dump them into the machine (currently they're integers ranging from 0 to 255 representing greyscale values).

Let $K = N \times H \times W$, where $N$ is the number of images, $H$ is the height of each image, and $W$ is the width.  Let $x_1, x_2, \ldots, x_K$ be the big list of all the pixel values from all the training images.

Getting a mean of 0 is easy.  Just compute the mean $\mu$ of the pixels in the training data and subtract that from each of them.  We'll also need to treat that $\mu$ as an estimator of the mean of our other datasets and subtract it from them too.

But how do we get variance 1?  Well, after that first step, our new pixel values will be $x_i' = (x_i-\mu)$, which have a mean of 0.  We can adjust the variance of those pixels by scaling them.  So, we're looking for a scalar $c$ that will make those values $c(x_i-\mu)$ have a variance of 1.  The variance of these adjusted pixels is given by this expression (where $\sigma^2$ is the variance of the original pixels):

$$\begin{eqnarray}
\frac{1}{K} \sum_{i=1}^{K} \left(c \left(x_i - \mu\right)\right)^2
&=& c^2 \cdot \frac{1}{K} \sum_{i=1}^{K} \left(x_i - \mu\right)^2\\
&=& c^2 \sigma^2
\end{eqnarray}$$

So, there you go.  Setting $c^2 \sigma^2 = 1$ gives our scalar: $c = \frac 1 \sigma$.
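
As a quick numerical check of that derivation, here's a minimal sketch on made-up data (the random `pixels` array is just a hypothetical stand-in for the real training images).  It confirms that scaling the centred pixels by any $c$ multiplies the variance by $c^2$, and that $c = \frac 1 \sigma$ lands on a variance of 1.

```python
import numpy as np

# Hypothetical stand-in for the training images: 100 random 28x28 "images"
# with integer greyscale values in [0, 255].
rng = np.random.RandomState(0)
pixels = rng.randint(0, 256, size=(100, 28, 28)).astype(np.float64)

mu = pixels.mean()     # mean over all K = N*H*W pixel values
sigma = pixels.std()   # population standard deviation (divides by K)

# Scaling the centred pixels by an arbitrary c multiplies the variance by c^2.
c_arbitrary = 3.0
scaled_var = (c_arbitrary * (pixels - mu)).var()
print(np.isclose(scaled_var, c_arbitrary ** 2 * sigma ** 2))

# Choosing c = 1/sigma therefore gives mean 0 and variance 1.
c = 1.0 / sigma
normalized = c * (pixels - mu)
print(np.isclose(normalized.mean(), 0.0), np.isclose(normalized.var(), 1.0))
```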

In [3]:
# Find mean of training data
mu = np.mean(train_dataset)

# Use mu as an estimator of the mean of all datasets and subtract it
train_dataset = train_dataset - mu
valdn_dataset = valdn_dataset - mu
test_dataset = test_dataset - mu

# Find standard deviation of training data
sigma = np.std(train_dataset)

# Divide by sigma to get training variance of 1 (and adjust other datasets)
train_dataset = train_dataset / sigma
valdn_dataset = valdn_dataset / sigma
test_dataset = test_dataset / sigma

#### Sanity Check

In [19]:
print("%f <-- should be 0" % np.mean(train_dataset))
print("%f <-- should be close to 0" % np.mean(valdn_dataset)) 
print("%f <-- should be close to 0" % np.mean(test_dataset)) 
print("%f <-- should be 1" % np.var(train_dataset))
print("%f <-- should be close to 1" % np.var(valdn_dataset)) 
print("%f <-- should be close to 1" % np.var(test_dataset)) 

0.000000 <-- should be 0
0.005162 <-- should be close to 0
0.015914 <-- should be close to 0
1.000001 <-- should be 1
1.002455 <-- should be close to 1
1.019278 <-- should be close to 1


### Preprocessing: Reformat data into TensorFlow-friendly shapes
- The design matrix is currently N x H x W, since they're greyscale images (no channel dimension like you'd get with colour images).  tf.nn.conv2d needs them formatted as N x H x W x #channels, however.  So we need to add a dimension at the end of the shape vector for the input data.
- Labels need to be one-hot encoded floats.

In [5]:
def one_hot_encode(labels, num_categories):
    '''Converts an array of labels to an array of one-hot-encoded labels.
    [NB: Just noticed there's now a "tf.one_hot" function.  Should switch to that.]
    Args:
        labels: list (row vector) of N labels (where N is the number of examples)
        num_categories: the number of possible labels
    Returns:
        N x num_categories array of one-hot encoded labels
    '''
    # IMPLEMENTATION
    # np.expand_dims(labels, 1) is an Nx1 array (a.k.a. a column vector) 
    # np.arange(num_categories) == k is evaluated (via broadcasting) for each label k in the column vector
    # This produces a one-hot vector [false, false, ..., true, false, ..., false] where "true" is in position k
    # Then astype(float32) converts the array of booleans to an array of 0s and 1s (as floats)
    return (np.arange(num_categories) == np.expand_dims(labels, 1)).astype(np.float32)

In [6]:
# Testing one_hot_encode
pretend_labels = np.random.randint(10, size=5)
print("pretend labels = ", pretend_labels)
one_hot_encode(pretend_labels, 10)

pretend labels =  [5 6 6 4 8]


array([[ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.]], dtype=float32)
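
If the broadcasting trick feels a bit opaque, here's an alternative sketch that stays in NumPy.  It assumes the labels are integers in `[0, num_categories)`; the names `one_hot_broadcast` and `one_hot_eye` are hypothetical, introduced only for this comparison.  Row `k` of an identity matrix is already the one-hot vector for label `k`, so indexing the identity matrix by the label array does the same job.

```python
import numpy as np

def one_hot_broadcast(labels, num_categories):
    # Same broadcasting comparison as one_hot_encode above.
    return (np.arange(num_categories) == np.expand_dims(labels, 1)).astype(np.float32)

def one_hot_eye(labels, num_categories):
    # Row k of the identity matrix is the one-hot vector for label k,
    # so fancy-indexing by the label array picks one row per example.
    return np.eye(num_categories, dtype=np.float32)[labels]

labels = np.array([5, 6, 6, 4, 8])
a = one_hot_broadcast(labels, 10)
b = one_hot_eye(labels, 10)

print(np.array_equal(a, b))           # both approaches agree
print(np.argmax(b, axis=1).tolist())  # argmax recovers the original labels
```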

In [7]:
image_size = 28  # notMNIST images are 28x28 (LeNet-5 expects 32x32 inputs)
num_labels = 10  # notMNIST is ABCDEFGHIJ, which is also 10 categories
num_channels = 1 # grayscale

with tf.name_scope("Training-Input") as scope:
    train_dataset = np.expand_dims(train_dataset, 3) # NxHxW --> NxHxWx1
    train_labels = one_hot_encode(train_labels, num_labels)
with tf.name_scope("Validation-Input") as scope:
    valdn_dataset = np.expand_dims(valdn_dataset, 3) # NxHxW --> NxHxWx1
    valdn_labels = one_hot_encode(valdn_labels, num_labels)
with tf.name_scope("Testing-Input") as scope:
    test_dataset = np.expand_dims(test_dataset, 3) # NxHxW --> NxHxWx1
    test_labels = one_hot_encode(test_labels, num_labels)
print('Training dataset\t', train_dataset.shape, "\tTraining labels\t\t", train_labels.shape)
print('Validation dataset\t', valdn_dataset.shape, "\tValidation labels\t", valdn_labels.shape)
print('Test dataset\t\t', test_dataset.shape, "\tTesting labels\t\t", test_labels.shape)

Training dataset	 (200000, 28, 28, 1) 	Training labels		 (200000, 10)
Validation dataset	 (10000, 28, 28, 1) 	Validation labels	 (10000, 10)
Test dataset		 (18724, 28, 28, 1) 	Testing labels		 (18724, 10)


## The Computation Graph

### Hyperparameters

In [9]:
with tf.name_scope("Hyperparameters") as scope:
    batch_size = 16
    patch_size = 5    # kernel 
    depth = 16        # depth of first hidden layer
    num_hidden = 64
    test_batch_size = 1000  # no need to overthink this one; just find a size that fits on the GPU

### Create Graph and Set Up Placeholder Nodes in GPU Memory for the Input Data

In [10]:
graph = tf.Graph()
with graph.as_default():
    with tf.name_scope('Training-Input-Data'):
        tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size, image_size, num_channels))
        tf_train_labels  = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    with tf.name_scope('Validation-Input-Data'):
        tf_valdn_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size, image_size, num_channels))
        tf_valdn_labels  = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    with tf.name_scope('Test-Input-Data'):
        tf_test_dataset = tf.placeholder(tf.float32, shape=(test_batch_size, image_size, image_size, num_channels))
        tf_test_labels  = tf.placeholder(tf.float32, shape=(test_batch_size, num_labels))

### Allocate and Initialize All Weights and Biases
We'll use tf.get_variable instead of tf.Variable to make it easier for multi-GPU runs. 

In [11]:
with graph.as_default():
    
    # First Conv Layer
    with tf.variable_scope('ConvPoolRelu1'):
        weights = tf.get_variable(name = "weights", 
                                  shape = (patch_size, patch_size, num_channels, depth), 
                                  initializer = tf.truncated_normal_initializer(mean = 0, 
                                                                               stddev = 0.1, 
                                                                               seed = None), 
                                  dtype = tf.float32)
        biases = tf.get_variable(name = "biases",
                                 shape = (depth),
                                 initializer = tf.constant_initializer(0),
                                 dtype = tf.float32)
    # Second Conv Layer
    with tf.variable_scope('ConvPoolRelu2'):  
        weights = tf.get_variable(name = "weights", 
                                  shape = (patch_size, patch_size, depth, depth), 
                                  initializer = tf.truncated_normal_initializer(mean = 0, 
                                                                               stddev = 0.1, 
                                                                               seed = None), 
                                  dtype = tf.float32)
        biases = tf.get_variable(name = "biases",
                                 shape = (depth),
                                 initializer = tf.constant_initializer(0),
                                 dtype = tf.float32)
    # First Fully-Connected Layer
    with tf.variable_scope('FullyConnected3'):
        weights = tf.get_variable(name = "weights", 
                                  shape = ((image_size//4)**2 * depth, num_hidden), 
                                  initializer = tf.truncated_normal_initializer(mean = 0, 
                                                                               stddev = 0.1, 
                                                                               seed = None), 
                                  dtype = tf.float32)
        biases = tf.get_variable(name = "biases",
                                 shape = (num_hidden),
                                 initializer = tf.constant_initializer(0),
                                 dtype = tf.float32)
    # Second Fully-Connected Layer
    with tf.variable_scope('FullyConnected4'):
        weights = tf.get_variable(name = "weights", 
                                  shape = (num_hidden, num_labels), 
                                  initializer = tf.truncated_normal_initializer(mean = 0, 
                                                                               stddev = 0.1, 
                                                                               seed = None), 
                                  dtype = tf.float32)
        biases = tf.get_variable(name = "biases",
                                 shape = (num_labels),
                                 initializer = tf.constant_initializer(0),
                                 dtype = tf.float32)
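
Where does the `(image_size//4)**2 * depth` in the first fully-connected layer come from?  Here's the arithmetic as a small sketch.  It assumes (as in the layer functions in this notebook) that the stride-1 `'SAME'` convolutions keep the spatial size and that each stride-2 `'SAME'` pool produces `ceil(size / 2)` outputs per spatial axis; `same_pool_output_size` is a hypothetical helper name.

```python
import math

def same_pool_output_size(size, stride):
    # With padding='SAME', the output size is ceil(input_size / stride).
    return int(math.ceil(size / float(stride)))

image_size, depth = 28, 16

after_pool_1 = same_pool_output_size(image_size, 2)    # 28 -> 14
after_pool_2 = same_pool_output_size(after_pool_1, 2)  # 14 -> 7
flat_features = after_pool_2 * after_pool_2 * depth    # 7 * 7 * 16

print(after_pool_1, after_pool_2, flat_features)
print(flat_features == (image_size // 4) ** 2 * depth)
```

The two expressions agree here because 28 is divisible by 4.  For an image size that isn't, the ceiling and the integer division would give different answers, and the weight shape would need the ceiling version.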

### Layer-building Functions

In [12]:
def conv_pool_relu_layer(X, W, b):
    conv_X = tf.nn.conv2d(X, W, [1, 1, 1, 1], padding='SAME')
    pool_conv_X = tf.nn.max_pool(conv_X, [1,2,2,1], [1,2,2,1], padding='SAME')
    relu_pool_conv_X = tf.nn.relu(pool_conv_X + b)
    return relu_pool_conv_X

def fc_layer(X, W, b):
    return tf.matmul(X, W) + b

def flatten_activation_map(x):
    """Converts a 4D tensor into a 2D tensor (so it can be fed to a fully-connected layer).
    Args:
        x: tensor with shape (N, H, W, C)
    Returns:
        tensor with shape (N, H*W*C)
    """
    s = x.get_shape().as_list()
    return tf.reshape(x, [s[0], -1])
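
To see what the pooling and flattening steps do to the shapes, here's a NumPy-only sketch.  It's an illustration, not the TensorFlow implementation: `max_pool_2x2` and `flatten` are hypothetical helpers, and the pooling version only handles even heights and widths (no `'SAME'` padding logic).

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 for an (N, H, W, C) array with even H and W."""
    n, h, w, c = x.shape
    # Split each spatial axis into (blocks, 2), then take the max inside each 2x2 block.
    return x.reshape(n, h // 2, 2, w // 2, 2, c).max(axis=(2, 4))

def flatten(x):
    """(N, H, W, C) -> (N, H*W*C), the same reshape flatten_activation_map performs."""
    return x.reshape(x.shape[0], -1)

# Two tiny 4x4 single-channel "images" holding the values 0..31.
x = np.arange(2 * 4 * 4 * 1, dtype=np.float32).reshape(2, 4, 4, 1)

pooled = max_pool_2x2(x)
flat = flatten(pooled)

print(pooled.shape)      # (2, 2, 2, 1)
print(flat.shape)        # (2, 4)
print(flat[0].tolist())  # [5.0, 7.0, 13.0, 15.0]
```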

### Network-assembly Function

In [13]:
def model(data):
    """Retrieves the weights and biases and constructs the computation graph with them.
    Args:
        data: A batch tensor of input examples K x H x W x D (where K is the batch size)
    Returns: the network output K x C (where C is the number of categories)
    """
    with tf.variable_scope('ConvPoolRelu1', reuse=True):
        W = tf.get_variable("weights")
        B = tf.get_variable("biases")
        layer_1_output = conv_pool_relu_layer(data, W, B)
    with tf.variable_scope('ConvPoolRelu2', reuse=True):
        W = tf.get_variable("weights")
        B = tf.get_variable("biases")
        layer_2_output = conv_pool_relu_layer(layer_1_output, W, B)
        flat_X = flatten_activation_map(layer_2_output)
    with tf.variable_scope('FullyConnected3', reuse=True):
        W = tf.get_variable("weights")
        B = tf.get_variable("biases")
        fc1_layer_output = tf.nn.relu(fc_layer(flat_X, W, B))
    with tf.variable_scope('FullyConnected4', reuse=True):
        W = tf.get_variable("weights")
        B = tf.get_variable("biases")
        fc2_layer_output = fc_layer(fc1_layer_output, W, B)
    return fc2_layer_output

### Interpret the Output, Set Up the Loss Function, Specify the Learning Algorithm 

In [15]:
with graph.as_default():
    # Training computation.
    logits = model(tf_train_dataset)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))
    tf.summary.scalar("Loss", loss)
    optimizer = tf.train.GradientDescentOptimizer(0.05).minimize(loss)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valdn_prediction = tf.nn.softmax(model(tf_valdn_dataset))
    test_prediction = tf.nn.softmax(model(tf_test_dataset))
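
For intuition about what the loss node computes, here's a NumPy sketch of softmax followed by mean cross-entropy against one-hot labels.  It's a stand-in written for illustration (the function names are hypothetical, and it makes no claim about how TensorFlow implements the op internally); it uses the usual max-subtraction trick for numerical stability.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max before exponentiating to avoid overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def mean_softmax_cross_entropy(logits, one_hot_labels):
    # log-softmax computed directly from the shifted logits (log-sum-exp).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross-entropy per example: minus the log-probability of the true class.
    per_example = -(one_hot_labels * log_probs).sum(axis=1)
    return per_example.mean()

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])

probs = softmax(logits)
loss = mean_softmax_cross_entropy(logits, labels)

print(probs.sum(axis=1))  # each row sums to 1
print(loss)
```

The second example has all-equal logits, so its softmax is uniform and its cross-entropy is $\log 3$ no matter which class is correct.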

### Training-related Functions

In [8]:
def accuracy(predictions, labels):
    '''
    "Grades" a list of predictions (like marking a multiple choice test).
    Accepts an array of network outputs and an array of ground truth labels (the right answers)
    Args:
        predictions: NxK float array of network outputs (e.g. softmax probabilities),
                     where N is the number of examples and K is the number of categories
        labels: NxK array of one-hot-encoded ground truth labels (why one-hot encode those??)
    Returns:
        Percentage of correct predictions.
    '''
    # argmax(predictions, 1) gives the "most likely" category ("one-hot-decodes" the network output)
    # argmax(labels, 1) does the same thing 
    #    Why did we even one-hot encode those to begin with??
    #    Ah... Right.  Because the loss function compares them to the network output.
    # By putting the == inside a sum, we're implicitly casting the equality test to an int.
    # Correct predictions yield 1, incorrect ones yield 0.
    # Then we divide the tally of 1s by the number of predictions in the set.
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) 
          / predictions.shape[0])

def variable_summaries(var, name):
    """Attach a lot of summaries to a Tensor."""
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)
        tf.summary.scalar('mean/' + name, mean)
        with tf.name_scope('stddev'):
            # Standard deviation is the root of the MEAN squared deviation (not the sum)
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.summary.scalar('stddev/' + name, stddev)
        tf.summary.scalar('max/' + name, tf.reduce_max(var))
        tf.summary.scalar('min/' + name, tf.reduce_min(var))
        tf.summary.histogram(name, var)
    
def clear_tensorboard_dir(path):
    if tf.gfile.Exists(path):
        tf.gfile.DeleteRecursively(path)
    tf.gfile.MakeDirs(path)

def make_batch(data, max_size, batch_num):
    n = data.shape[0]
    offset = (batch_num * max_size) % n
    return data[offset: offset+max_size]
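
One thing worth knowing about `make_batch`: the modulo wraps the *offset* back to the start of the data, but the slice itself doesn't wrap, so a batch that starts near the end of the array comes back short.  Here's a small sketch on a made-up 10-element array (the function body is copied from above).

```python
import numpy as np

def make_batch(data, max_size, batch_num):
    n = data.shape[0]
    offset = (batch_num * max_size) % n
    return data[offset: offset + max_size]

data = np.arange(10)

print(make_batch(data, 4, 0).tolist())  # [0, 1, 2, 3]
print(make_batch(data, 4, 1).tolist())  # [4, 5, 6, 7]
print(make_batch(data, 4, 2).tolist())  # [8, 9]  <- short: the slice stops at the end of the array
print(make_batch(data, 4, 3).tolist())  # [2, 3, 4, 5]  <- offset is (3 * 4) % 10 = 2
```

That matters here because the placeholders have a fixed batch dimension, so a short batch can't be fed to them.  The training set (200000 examples, batches of 16) divides evenly, so it never comes up during training.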

### Conduct the Training

In [16]:
num_steps = 5000

# Clear the tensorboard directory
clear_tensorboard_dir("tensorboard")

# config = tf.ConfigProto()
# config.gpu_options.allocator_type = 'BFC'
# with tf.Session(config = config, graph=graph) as session:
with tf.Session(graph=graph) as session:
    merged_summaries = tf.summary.merge_all()
    train_writer = tf.summary.FileWriter('./tensorboard', session.graph)
    tf.global_variables_initializer().run()
    print('Training...')
    for step in range(num_steps):
        batch_data = make_batch(train_dataset, batch_size, step)
        batch_labels = make_batch(train_labels, batch_size, step)
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        merged_sums, _, l, predictions = session.run([merged_summaries, optimizer, loss, train_prediction], 
                                                     feed_dict=feed_dict)
        if (step % 100 == 0):
            train_writer.add_summary(merged_sums, step)
            print('.', end="")
    
    #--------------------- Testing Phase -------------------
    total_accuracy = 0
    n_test_cases = int(test_dataset.shape[0])
    print("n_test_cases = %d" % n_test_cases)
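    # NOTE: only full batches are evaluated, so the last (n_test_cases % test_batch_size)
    # examples are skipped, yet the running average below still divides by n_test_cases.
    # The printed accuracy therefore understates accuracy on the examples actually tested.
    n_batches = int(n_test_cases / test_batch_size)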
    print("n_batches = ", str(n_batches))
    for step in range(n_batches):
        batch_data = make_batch(test_dataset, test_batch_size, step)
        batch_labels = make_batch(test_labels, test_batch_size, step)
        feed_dict = {tf_test_dataset: batch_data, tf_test_labels: batch_labels}
        test_output_batch = session.run([test_prediction], feed_dict=feed_dict)[0]
        this_batch_size = float(batch_labels.shape[0])
        batch_accuracy = accuracy(np.asarray(test_output_batch), np.asarray(batch_labels))
        total_accuracy = total_accuracy + batch_accuracy * this_batch_size / float(n_test_cases)

    print('Test accuracy: %.1f%%' % total_accuracy)
    train_writer.close()

Training...
..................................................n_test_cases = 18724
n_batches =  18
Test accuracy: 90.3%
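
A caveat on that number: the testing loop runs 18 full batches of 1000, so only 18000 of the 18724 test images are evaluated, while the running average divides by all 18724.  Here's a minimal sketch on made-up predictions showing how the choice of denominator changes the result.  `batched_accuracy` and its `denominator` argument are hypothetical names introduced for this illustration; `accuracy` follows the same definition as the notebook's.

```python
import numpy as np

def accuracy(predictions, labels):
    # Same definition as the notebook's accuracy(): percentage of matching argmaxes.
    return 100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) / predictions.shape[0]

def batched_accuracy(predictions, labels, batch_size, denominator):
    # Hypothetical helper: weighted sum of per-batch accuracy over full batches only.
    n_batches = predictions.shape[0] // batch_size
    total = 0.0
    for step in range(n_batches):
        lo, hi = step * batch_size, (step + 1) * batch_size
        total += accuracy(predictions[lo:hi], labels[lo:hi]) * batch_size / float(denominator)
    return total

# Made-up data: 10 examples, 3 classes, batch size 4 (so 2 full batches, 8 examples evaluated).
true_classes = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
pred_classes = np.array([0, 1, 2, 0, 1, 2, 1, 2, 2, 0])
labels = np.eye(3)[true_classes]
predictions = np.eye(3)[pred_classes]

n_evaluated = (len(true_classes) // 4) * 4
by_evaluated = batched_accuracy(predictions, labels, 4, denominator=n_evaluated)
by_total = batched_accuracy(predictions, labels, 4, denominator=len(true_classes))

print(by_evaluated)  # 75.0: 6 of the 8 evaluated examples are correct
print(by_total)      # 60.0: same tally, but divided by all 10 examples
```

Dividing by the number of examples actually evaluated gives the accuracy on those examples; dividing by the full count treats every skipped example as wrong.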
