## Convolutional Neural Network

Convolutional Neural Networks (CNNs) are similar to "ordinary" neural networks: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product, and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function, from the raw image pixels at one end to class scores at the other, and it still has a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer.

So how do they differ from other neural networks? ConvNet architectures explicitly assume that the inputs are images, which allows us to encode certain properties into the architecture. These properties make the forward function more efficient to implement and vastly reduce the number of parameters in the network.
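
To make the parameter reduction concrete, here is a minimal sketch that compares a fully-connected layer with a convolutional layer producing the same output volume. The layer sizes are hypothetical, chosen only to match a 28x28 grayscale image and 32 feature maps.

```python
# Hypothetical comparison: map a 28x28x1 image to a 28x28x32 output volume.
height, width, in_channels, out_channels = 28, 28, 1, 32

# Fully-connected: every output unit has its own weight for every input pixel,
# plus one bias per output unit.
n_inputs = height * width * in_channels
n_outputs = height * width * out_channels
dense_params = n_inputs * n_outputs + n_outputs

# Convolutional: one 5x5 filter per output channel, shared across all
# image positions, plus one bias per output channel.
kernel = 5
conv_params = kernel * kernel * in_channels * out_channels + out_channels

print(dense_params)  # 19694080
print(conv_params)   # 832
```

The saving comes from weight sharing: the same small filter is reused at every position of the image instead of learning a separate weight per pixel.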

In the previous session, Softmax Regression for MNIST classification, the model reached about 92% accuracy on MNIST, which is poor for this dataset. This session fixes that by moving from a very simple model to a more sophisticated one: a small convolutional neural network. It reaches an accuracy of around 99.2%, which is not state of the art, but is a substantial improvement.

Here is a diagram of the model we are going to build (created with TensorBoard):

![](../figures/mnist_deep_graph.png)

### Weight and Bias Initialization

To implement this model, we need to create many weights and biases. The weights should be initialized with a small amount of noise to break symmetry and to prevent 0 gradients (in simple terms, gradients can be described as the "flow" of learning). Since we use ReLU neurons, it is also good practice to initialize the biases with a slightly positive value to avoid "dead neurons". Lastly, instead of repeating this initialization for every layer, we write two helper functions to do it for us.

In [1]:
# import the TensorFlow library
import tensorflow as tf

In [2]:
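def weight_variable(shape):
    # small random values (truncated normal, stddev 0.1) to break symmetry
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    # slightly positive constant to help avoid "dead" ReLU neurons
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)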
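
For intuition about what `tf.truncated_normal` produces, the following is a rough pure-Python sketch, not TensorFlow's implementation. It draws from a normal distribution and re-draws any value that lies more than two standard deviations from the mean. The function name `truncated_normal` and its `seed` parameter are made up for this illustration.

```python
import random

def truncated_normal(n, stddev=0.1, mean=0.0, seed=None):
    """Draw n samples from a normal distribution, re-drawing any sample
    that lies more than two standard deviations from the mean."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:
            samples.append(value)
    return samples

weights = truncated_normal(1000, stddev=0.1, seed=0)
biases = [0.1] * 32  # small positive constant, as in bias_variable

# every weight stays within two standard deviations of the mean
print(max(abs(w) for w in weights) <= 0.2)  # True
```

Truncating the distribution keeps a few unlucky large initial weights from dominating the early training steps.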

### CNN Architecture

CNNs are currently the state-of-the-art model for image classification tasks. CNNs apply a series of filters to the raw pixel data of an image to extract and learn higher-level features, which the model can then use for classification. CNNs contain three kinds of components:

* __Convolutional layers__ apply a specified number of filters to the image. For each image subregion, this layer performs a set of mathematical operations to produce a single value in the output feature map (simply, a value to represent that subregion).

* __Pooling layers__ downsample the feature maps produced by the convolutional layers, reducing their dimensionality and hence the processing time of the network. The most common pooling algorithm is max pooling, which keeps the maximum value in each subregion (e.g. 2x2 pixels) of the feature map and discards the rest.

* __Dense (fully-connected) layers__ perform the classification on the features extracted by the convolutional layers and downsampled by the pooling layers.
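
The first two components can be sketched in plain Python on a tiny single-channel example. This is only an illustration under simplifying assumptions (one filter, no padding, no bias, no activation). The helper names `conv2d_valid` and `max_pool` and the toy `image` and `kernel` values are invented here. As deep learning libraries typically do, the filter is applied without flipping it.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1). Each output
    value is the sum of element-wise products over one subregion."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum value of each size x size subregion."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [[max(feature_map[i * stride + di][j * stride + dj]
                 for di in range(size) for dj in range(size))
             for j in range(out_w)]
            for i in range(out_h)]

image = [[1, 2, 0, 1, 3],
         [0, 1, 2, 3, 1],
         [1, 0, 1, 2, 2],
         [2, 1, 0, 1, 0],
         [0, 3, 1, 0, 1]]
kernel = [[1, 0],
          [0, -1]]

feature_map = conv2d_valid(image, kernel)  # 4x4 feature map
pooled = max_pool(feature_map)             # 2x2 after 2x2 pooling, stride 2

print(feature_map)  # [[0, 0, -3, 0], [0, 0, 0, 1], [0, 0, 0, 2], [-1, 0, 0, 0]]
print(pooled)       # [[0, 1], [0, 2]]
```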

Typically, a CNN is composed of a stack of convolutional modules that perform feature extraction. Each module consists of a convolutional layer followed by a pooling layer. The last convolutional module is followed by one or more dense layers that perform classification. The final dense layer in a CNN contains a single node for each target class in the model, conventionally with a softmax function that generates a value between 0 and 1 for each node (all of these values sum to 1). The softmax values for a given image can be interpreted as how likely it is that the image belongs to each target class.
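
As a small illustration of that last step, here is a sketch of the softmax function in plain Python. The three example scores are made up, and subtracting the maximum score is a common numerical-stability trick rather than part of the definition.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    largest = max(logits)
    # subtracting the largest score avoids overflow in exp()
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(logits)

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The largest score gets the largest probability, and the ordering of the classes is preserved.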

For this session, the following CNN architecture shall be implemented:

1. Convolutional layer #1. Applies 32 5x5 filters (extracting 5x5-pixel subregions), with ReLU activation function.
2. Pooling layer #1. Performs max pooling with a 2x2 filter and a stride of 2 (which specifies that pooled regions do not overlap).
3. Convolutional layer #2. Applies 64 5x5 filters, with ReLU activation function.
4. Pooling layer #2. Again, performs max pooling with a 2x2 filter and a stride of 2.
5. Dense layer #1. 1,024 neurons, with dropout regularization (during training, each element is kept with probability 0.5 and dropped otherwise, matching the `keep_prob` used in the training code).
6. Dense layer #2. 10 neurons, one for each digit target class (0-9).
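
To get a feel for where the capacity of this architecture sits, the sketch below counts the learnable parameters per layer. It assumes that the convolutions preserve the 28x28 spatial size and that each pooling layer halves it (28 to 14 to 7), so the first dense layer sees 7 * 7 * 64 inputs. The helper names are invented for this example.

```python
def conv_params(kernel, in_channels, out_channels):
    # one kernel x kernel filter per (input channel, output channel) pair,
    # plus one bias per output channel
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    # one weight per (input, output) pair, plus one bias per output
    return n_in * n_out + n_out

layers = {
    'conv1': conv_params(5, 1, 32),
    'conv2': conv_params(5, 32, 64),
    'dense1': dense_params(7 * 7 * 64, 1024),
    'dense2': dense_params(1024, 10),
}
total = sum(layers.values())

for name, count in layers.items():
    print(name, count)
print('total', total)  # total 3274634
```

Almost all of the parameters belong to the first dense layer, which is one reason dropout is applied there.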

First, let us import our input data.

In [3]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('/home/darth/MNIST_data', one_hot=True)

Extracting /home/darth/MNIST_data/train-images-idx3-ubyte.gz
Extracting /home/darth/MNIST_data/train-labels-idx1-ubyte.gz
Extracting /home/darth/MNIST_data/t10k-images-idx3-ubyte.gz
Extracting /home/darth/MNIST_data/t10k-labels-idx1-ubyte.gz


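Then, let us define the input placeholders: one for the input images, `x`, and one for the actual labels, `y`. Each image is a flattened 28x28 array (784 values), and each label is a one-hot vector over the 10 digit classes.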

In [4]:
x = tf.placeholder(dtype=tf.float32, shape=[None, 784], name='x_input')

y = tf.placeholder(dtype=tf.float32, shape=[None, 10], name='actual_labels')
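
The shapes `[None, 784]` and `[None, 10]` follow from how the data is laid out. The sketch below shows the two encodings in plain Python. The helpers `one_hot` and `flatten` are illustrative stand-ins, not functions from the MNIST loader.

```python
def one_hot(label, num_classes=10):
    """Encode a digit label as a vector with a single 1.0 at its index."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector

def flatten(image):
    """Turn a 2-D image (list of rows) into one flat list of pixels."""
    return [pixel for row in image for pixel in row]

label_vector = one_hot(3)
blank_image = [[0.0] * 28 for _ in range(28)]  # stand-in for one MNIST image
flat = flatten(blank_image)

print(label_vector)  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(len(flat))     # 784
```

The `None` in each placeholder shape leaves the batch dimension open, so the same graph accepts a batch of 50 training images or the whole test set.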

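Before building the specified architecture, let us define helper functions for the convolutional and pooling layers, so that we do not have to repeat their definitions. The convolutions use a stride of 1 and `SAME` padding, so the output has the same height and width as the input. The pooling is max pooling over 2x2 blocks with a stride of 2.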

In [5]:
def conv2d(x, w):
    return tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
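
The effect of `padding='SAME'` on the spatial size can be traced with a little arithmetic. The sketch below uses the output-size rules documented for TensorFlow's `SAME` and `VALID` padding. The function names are made up for this example.

```python
import math

def same_output_size(size, stride):
    # SAME padding: the output size depends only on the stride
    return math.ceil(size / stride)

def valid_output_size(size, kernel, stride):
    # VALID (no) padding: the kernel must fit entirely inside the input
    return math.ceil((size - kernel + 1) / stride)

size = 28                         # MNIST images are 28x28
size = same_output_size(size, 1)  # conv 1 keeps 28
size = same_output_size(size, 2)  # pool 1 halves it to 14
size = same_output_size(size, 1)  # conv 2 keeps 14
size = same_output_size(size, 2)  # pool 2 halves it to 7
flat_features = size * size * 64  # 64 feature maps after the second conv layer

print(size, flat_features)          # 7 3136
print(valid_output_size(28, 5, 1))  # 24, what a 5x5 filter would give without padding
```

This is where the `7 * 7 * 64` input size of the dense layer comes from.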

On to building the CNN architecture:

In [6]:
# First conv layer
w_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

x_image = tf.reshape(x, [-1, 28, 28, 1])

h_conv1 = tf.nn.relu(conv2d(x_image, w_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

# Second conv layer
w_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, w_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

# Fully-connected layer (dense layer)
w_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, w_fc1) + b_fc1)

# Dropout regularization
# To reduce overfitting, apply dropout before the readout layer
keep_prob = tf.placeholder(dtype=tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# Readout layer
# Layer that produces the classification,
# like Softmax regression from the previous session
w_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_ = tf.matmul(h_fc1_drop, w_fc2) + b_fc2
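
To see what the dropout step does to the activations, here is a plain-Python sketch of the behaviour documented for `tf.nn.dropout` in TensorFlow 1.x: each element is kept with probability `keep_prob` and scaled by `1 / keep_prob`, otherwise it is set to 0. The function below and its `rng` parameter are illustrative only.

```python
import random

def dropout(activations, keep_prob, rng=random):
    """Keep each element with probability keep_prob, scaled by 1 / keep_prob;
    set it to 0.0 otherwise."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, keep_prob=0.5, rng=rng)

# kept elements are doubled, so the average activation stays close to 1.0
mean_activation = sum(dropped) / len(dropped)
print(sorted(set(dropped)))  # [0.0, 2.0]

# with keep_prob = 1.0 nothing is dropped, which is what evaluation uses
unchanged = dropout(activations, keep_prob=1.0, rng=rng)
```

The scaling keeps the expected sum of the activations unchanged, so no extra rescaling is needed when dropout is switched off.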

The readout layer produces unnormalized scores (logits). Softmax is applied to them inside the loss function, which uses cross-entropy to measure the model's loss. To train the model, `Adam` will be used with a learning rate of `1e-4`.

In [7]:
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_))
train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)
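
For reference, the quantity being averaged can be written out in plain Python. This sketch computes softmax cross-entropy directly from the logits with the log-sum-exp trick. It illustrates the formula, not TensorFlow's implementation, and the example labels and logits are made up.

```python
import math

def softmax_cross_entropy(labels, logits):
    """Cross-entropy between one-hot labels and softmax(logits),
    computed from the logits without forming the probabilities."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    # -sum(label * log_softmax(logit)), with log_softmax = logit - log_sum_exp
    return sum(t * (log_sum_exp - z) for t, z in zip(labels, logits))

batch_labels = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
batch_logits = [[0.5, 2.0, -1.0], [0.1, 0.2, 0.3]]

losses = [softmax_cross_entropy(t, z)
          for t, z in zip(batch_labels, batch_logits)]
mean_loss = sum(losses) / len(losses)  # the role tf.reduce_mean plays above

# with equal scores for 3 classes the loss is log(3)
uniform_loss = softmax_cross_entropy([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

The loss is small when the logit of the correct class is much larger than the others, and grows as the model puts weight on the wrong classes.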

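To measure the accuracy of the model, we compare the predicted class (the index of the largest logit) with the actual label and average the matches: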

In [8]:
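# True where the predicted class matches the actual class
correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(y, 1))
# fraction of correct predictions; the float16 cast gives coarse values
# (e.g. 0.86 prints as 0.85986328125), tf.float32 is the more usual choice
accuracy = tf.reduce_mean(tf.cast(correct_prediction, dtype=tf.float16))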
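
The same computation on a toy batch, in plain Python, may make the two steps easier to follow. The scores and labels are invented, and the helper names are not part of TensorFlow.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def batch_accuracy(logits_batch, labels_batch):
    """Fraction of examples whose highest-scoring class matches the label."""
    hits = [argmax(scores) == argmax(label)
            for scores, label in zip(logits_batch, labels_batch)]
    return sum(hits) / len(hits)

# hypothetical scores for 4 examples and 3 classes
logits_batch = [[2.0, 0.1, 0.3],
                [0.2, 1.5, 0.1],
                [0.9, 0.8, 0.1],
                [0.1, 0.2, 3.0]]
labels_batch = [[1, 0, 0],
                [0, 1, 0],
                [0, 1, 0],   # the model predicts class 0 here, so this is a miss
                [0, 0, 1]]

print(batch_accuracy(logits_batch, labels_batch))  # 0.75
```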

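Now train the defined model for 20,000 steps with mini-batches of 50 images, printing the training accuracy on the current batch every 100 steps. Dropout is enabled during training (`keep_prob` of 0.5) and disabled for evaluation (`keep_prob` of 1.0).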

In [9]:
init_op = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init_op)
    
    for index in range(20000):
        # train by batch size of 50
        batch_x, batch_y = mnist.train.next_batch(50)
        
        # input data to train operation
        feed_dict = {x: batch_x, y: batch_y, keep_prob: 0.5}
        
        # run the training operation with the previously-defined input
        sess.run(train_op, feed_dict=feed_dict)
        
        # show training accuracy every 100 steps
        if index % 100 == 0: 
            # do not perform dropout
            feed_dict = {x: batch_x, y: batch_y, keep_prob: 1.0}
            
            train_accuracy = sess.run(accuracy, feed_dict=feed_dict)
            
            print('step : {}, training accuracy : {}'.format(index, train_accuracy))
        
    # input data for model testing (dropout disabled)
    feed_dict = {x: mnist.test.images, y: mnist.test.labels, keep_prob: 1.0}

    test_accuracy = sess.run(accuracy, feed_dict=feed_dict)

    # display the accuracy of the model
    # on unseen data, i.e. the test set
    print('Test Accuracy : {}'.format(test_accuracy))

step : 0, training accuracy : 0.1199951171875
step : 100, training accuracy : 0.85986328125
step : 200, training accuracy : 0.9599609375
step : 300, training accuracy : 0.93994140625
step : 400, training accuracy : 0.93994140625
step : 500, training accuracy : 0.8798828125
step : 600, training accuracy : 0.919921875
step : 700, training accuracy : 0.919921875
step : 800, training accuracy : 1.0
step : 900, training accuracy : 0.93994140625
step : 1000, training accuracy : 0.93994140625
step : 1100, training accuracy : 0.97998046875
step : 1200, training accuracy : 0.97998046875
step : 1300, training accuracy : 0.9599609375
step : 1400, training accuracy : 0.89990234375
step : 1500, training accuracy : 0.93994140625
step : 1600, training accuracy : 0.97998046875
step : 1700, training accuracy : 0.97998046875
step : 1800, training accuracy : 0.9599609375
step : 1900, training accuracy : 0.97998046875
step : 2000, training accuracy : 0.97998046875
step : 2100, training accuracy : 1.0
...

After all 20,000 training steps, the final test accuracy is approximately 99.2%.
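
As a rough sense of scale for the training run above, the arithmetic below converts the step count into passes over the training data. The figure of 55,000 training examples is an assumption based on the default split of the MNIST loader used here (5,000 of the 60,000 training images held out for validation).

```python
steps = 20000
batch_size = 50
train_examples = 55000  # assumption: default train split of the MNIST loader

examples_seen = steps * batch_size
epochs = examples_seen / train_examples

print(examples_seen)     # 1000000
print(round(epochs, 1))  # 18.2
```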