<H1> Deep Learning with TensorFlow </H1>

<h2> 1. Installing and importing TensorFlow </h2>

First, we must install the TensorFlow (TF) library on the Databricks cluster. We can then import it and access its functionality.<br>
The following block of code attempts to import TF. If it is not installed yet, an ImportError is raised; we catch that error and proceed to install TF on Databricks.

In [3]:
try:
    import tensorflow as tf
    print("TensorFlow is already installed!")
except ImportError:
    print("Installing TensorFlow...")
    import subprocess
    subprocess.check_call(["/databricks/python/bin/pip", "install", 
               "https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl"])
    print("TensorFlow has been successfully installed on this cluster")
    import tensorflow as tf

<h2> 2. 'Shallow' Multi-Layer Perceptron (MLP) </h2>

The goal of this section is to learn how to build and train a simple MLP architecture. Afterwards, we will apply it to the classification problem of Fisher's iris flower dataset, which we also used in previous lectures.<br>
Given that there are only four features available and 150 instances (50 from each of the three flower classes), the use of 'deep' ML solutions may not be the most appropriate choice for this situation. Thus, we will build a traditional <b><i>'shallow'</i></b> MLP with: one input layer of neurons, two hidden layers, and one output layer.

<h3> Part 0: Load and preprocess data </h3>

Divide the data into two subsets: 75% of the instances to train the MLP classifier, and the remaining 25% to test its accuracy on an independent set (once training is complete).<br>
Use the training data to obtain a <a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">scaling transformation</a> which <b>standardizes features</b>, so that each one has zero mean and unit standard deviation. Apply the same scaling to both the training and the test subsets. In this manner, we protect ourselves against potential issues caused by features having different units, very dissimilar ranges of variation, etc.<br>
Besides, we need to <a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.label_binarize.html">binarize</a> the labels that indicate which class each instance belongs to. For example, if an item belongs to the 3rd class (<i>'iris virginica'</i>), it is labeled with the number 2 in Y. We wish to obtain the <b>binarized equivalent</b> as a <b>vector</b> [0 0 1].
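
The two preprocessing steps described above can be sketched in plain Python. This is only an illustration on hypothetical toy data (not the iris set); the helper names `fit_standardizer`, `apply_standardizer` and `one_hot` are made up for this sketch and are not scikit-learn functions. The key point is that the mean and standard deviation are estimated on the training rows only and then reused for the test rows.

```python
import math

def fit_standardizer(rows):
    """Return per-feature means and standard deviations, computed from the given rows."""
    n = len(rows)
    n_feats = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_feats)]
    # Population standard deviation (divide by n)
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(n_feats)]
    return means, stds

def apply_standardizer(rows, means, stds):
    """Scale rows with statistics estimated elsewhere (e.g. on the training subset)."""
    return [[(r[j] - means[j]) / stds[j] for j in range(len(r))] for r in rows]

def one_hot(label, classes):
    """Binarize a label: 2 with classes [0, 1, 2] becomes [0, 0, 1]."""
    return [1 if label == c else 0 for c in classes]

# Hypothetical toy data: two training rows and one test row, two features each
toy_train = [[1.0, 10.0], [3.0, 30.0]]
toy_test = [[2.0, 40.0]]

means, stds = fit_standardizer(toy_train)                    # statistics from training data only
toy_test_scaled = apply_standardizer(toy_test, means, stds)  # same scaling reused on test data
```

Note that the scaled test row does not necessarily have zero mean or unit deviation: it is expressed in the units learnt from the training subset.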

In [6]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
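try:
    from sklearn.model_selection import train_test_split # recent scikit-learn releases
except ImportError:
    from sklearn.cross_validation import train_test_split # older scikit-learn releases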
from sklearn.preprocessing import label_binarize
from sklearn.preprocessing import StandardScaler

# Initialize the random generator seed to compare results
np.random.seed(0)

# Load the Fisher's iris flower dataset
iris = datasets.load_iris()
X = iris.data[:]
Y = iris.target

# Divide into subsets
X_train, X_test, Y_train, Y_test = train_test_split(<FILL_IN>)

# Apply feature standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(<FILL_IN>)
X_test = scaler.<FILL_IN>

# Binarize the labels
set_classes = np.unique(Y)
Y_train_bin = label_binarize(<FILL_IN>)
Y_test_bin = label_binarize(<FILL_IN>)

<h3> Part 1: Define the MLP architecture </h3>

In this stage we will choose the structure of our MLP neural network.<br>
For the <b>input layer, the number of neurons must always be equal to the number of features</b> in our problem (here 4). Similarly, for the <b>output layer there must be one neuron per class</b> (here 3). In this manner, when the network is fed an input that belongs to, say, the 3rd class, the 3rd neuron in the output layer should ideally show a high activation, whereas the other output neurons should not activate at all (or only very weakly).<br>
Besides, we need to manually <b>specify the structure of the inner layers</b> of the MLP network (traditionally called 'hidden' layers). For now, let's assume that we choose two hidden layers with 6 and 5 neurons, respectively.
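
To get a feeling for the size of this architecture, the following sketch counts its trainable parameters. The function name `count_mlp_parameters` is hypothetical; it assumes fully connected layers, each with one weight per input-output pair plus one bias per output neuron.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of neurons per layer, from input to output.
    """
    total = 0
    for n_prev, n_next in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_prev * n_next  # one weight per connection between the two layers
        total += n_next           # one bias per neuron in the receiving layer
    return total

# 4 inputs, hidden layers with 6 and 5 neurons, 3 outputs
n_params = count_mlp_parameters([4, 6, 5, 3])  # (4*6+6) + (6*5+5) + (5*3+3) = 83
```

With only 150 instances available, keeping the number of parameters small is one reason to prefer a 'shallow' network here.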

In [8]:
n_feats = X.shape[1]
n_class = set_classes.shape[0]

# Network structure
n_in = n_feats
n_out = n_class

n_hidden1 = 6
n_hidden2 = 5

The next step consists of creating some structures that will allow us to interact with TensorFlow.<br>
Our inputs and outputs will be implemented as <a href="https://www.tensorflow.org/programmers_guide/reading_data#feeding">'placeholders'</a>, which are special symbolic variables used by TF to manage data that must be fed at execution time.<br>
Besides, we need to store the weights and biases that characterize the behaviour of our MLP.

In [10]:
# TensorFlow inputs and outputs
x = tf.placeholder("float", [None, n_in]) # by setting None, we allow for a dynamic number of instances
y = tf.placeholder("float", [None, n_out])
                   
# Store weights and biases for each layer
weights = {
    'wh1': tf.Variable(tf.random_normal([n_in, n_hidden1])),
    'wh2': tf.Variable(tf.random_normal([n_hidden1, n_hidden2])),
    'wo': tf.Variable(tf.random_normal([n_hidden2, n_out]))
}
biases = {
    'bh1': tf.Variable(tf.random_normal([n_hidden1])),
    'bh2': tf.Variable(tf.random_normal([n_hidden2])),
    'bo': tf.Variable(tf.random_normal([n_out]))
}

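Practical note: To prevent convergence problems, the weights should be initialized with random, non-zero values. If all the weights in a layer started from the same value, its neurons would compute identical outputs and receive identical updates, so they could never learn different features. In the code above, we draw both weights and biases from a normal (i.e. Gaussian) distribution with zero mean and unit standard deviation.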

<h3> Part 2: MLP function </h3>

In the theoretical lectures we learnt that the behaviour of a neuron can be understood as having two stages:
 <ol>
  <li>the inputs to the neuron are merged by a <b>weighted linear combination, plus an additional bias term</b>;</li>
  <li>then the <b>activation</b> corresponds to a <b>non-linear</b> function of the result from step 1.</li>
</ol> 
However, an MLP contains more than a single neuron. It has sets of neurons arranged in layers, which all behave in a very similar manner (just with different weights and bias values). This simplifies the calculations, since they can be implemented efficiently as matrix operations, or more generally as <b>'tensor'</b> operations, where a <i>'tensor'</i> is a generalization of a matrix to more than two dimensions. In fact, this inspired the name TensorFlow.
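
The two stages can be sketched without TensorFlow for one layer and one input instance. This is a plain-Python illustration with hypothetical toy weights; `dense_layer` is a name invented for this sketch. The inner sum is exactly one column of the matrix product that `tf.matmul` computes for a whole batch at once.

```python
import math

def sigmoid(z):
    """Sigmoidal activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases, activation):
    """Compute the activations of one fully connected layer for a single instance.

    inputs:  list with n_in values
    weights: nested list of shape n_in x n_out
    biases:  list with n_out values
    """
    outputs = []
    for k in range(len(biases)):
        # Stage 1: weighted linear combination of the inputs, plus the bias term
        z = sum(inputs[j] * weights[j][k] for j in range(len(inputs))) + biases[k]
        # Stage 2: non-linear activation
        outputs.append(activation(z))
    return outputs

# Hypothetical toy layer: 2 inputs, 2 neurons
toy_out = dense_layer([1.0, 2.0], [[0.5, -1.0], [0.25, 0.5]], [-1.0, 0.0], sigmoid)
```

Both neurons in this toy layer happen to receive a pre-activation of exactly zero, so both activations equal 0.5.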

In [13]:
# Create our own MLP model with TensorFlow functions
def MLP_TF(x_in, weights, biases):
    # 1st hidden layer, with sigmoidal activation
    layer1 = tf.add(tf.matmul(x_in, weights['wh1']), biases['bh1']) # linear part: x_in*weights_hidden1 + biases_hidden1
    layer1 = tf.nn.sigmoid(layer1) # non-linear part: sigmoidal activation
    # 2nd hidden layer, again with sigmoidal activation
    layer2 = tf.add(tf.matmul(<FILL_IN>), <FILL_IN>)
    layer2 = tf.nn.sigmoid(<FILL_IN>)
    # Output layer, with linear activation
    y_out = tf.add(tf.matmul(layer2, weights['wo']), biases['bo'])
    return y_out

As you can see, we decided to <b>omit the non-linearity from the last layer (i.e. the output)</b>. This is common practice in TF for networks used in classification problems. As a result, the output neurons do not directly yield the predicted probability that the input belongs to each class; instead, they yield unnormalized scores (logits), which can be read as the <b>logarithm of that probability</b> up to an additive constant. Afterwards, the <a href="https://www.tensorflow.org/api_docs/python/tf/nn/softmax">softmax</a> function transforms these scores into class probabilities, and the class with the highest probability is taken as the prediction.<br>

In the following block of code, the MLP is set up.<br>
We must also specify our <b>loss function</b>. Here we choose to employ the <a href="https://www.tensorflow.org/api_docs/python/tf/reduce_mean">mean</a> across all instances of the <a href="https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits">cross-entropy</a> between:
<ol type="A">
<li><b>predictions (a.k.a. logits)</b>, and</li>
<li><b>actual classes</b>.</li>
</ol>
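
As a rough illustration of what this loss computes, here is a plain-Python sketch of softmax followed by cross-entropy, averaged over a batch. The function names are chosen for this sketch only; the TF function linked above fuses both steps internally for numerical stability, so this is not its actual implementation.

```python
import math

def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to one."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot_label):
    """Cross-entropy between a binarized label and the softmax of the logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))

def mean_cross_entropy(batch_logits, batch_labels):
    """Average the per-instance losses, as done with the mean in the cost definition."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

# With identical logits the network is undecided among 3 classes: loss = log(3)
uniform_loss = cross_entropy([0.0, 0.0, 0.0], [0, 0, 1])
# A large logit on the correct class gives a loss close to zero
confident_loss = cross_entropy([0.0, 0.0, 10.0], [0, 0, 1])
```

The loss decreases as the logit of the true class grows relative to the others, which is what training tries to achieve.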

In [15]:
# Construct model
y_pred = MLP_TF(x, weights, biases)

# Define our loss function, also called cost 
costs_all = tf.nn.softmax_cross_entropy_with_logits(logits=y_pred, labels=y)
cost = tf.reduce_mean(costs_all)

<h3> Part 3: Training our MLP </h3>

Training the network consists of an <b>iterative adjustment of the values of the neurons' weights and biases, aiming to minimize the total misclassification cost</b> that we defined above. To do so, we need to select an optimization algorithm, which is responsible for updating the weights and biases.<br>
TF includes several of these <b>optimizers</b>; in this example we will use the <a href="https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer">GradientDescentOptimizer</a>, because it is the simplest algorithm available. It computes the <b><i>'gradient'</i> of the cost function, i.e. the direction in which the cost increases fastest</b>, and updates the parameters in the opposite direction. The magnitude of the update depends on a <b>'learning rate'</b>: the higher this learning rate is, the more rapidly we expect to converge to an optimum (i.e. a minimum cost). However, excessive rates may lead to instabilities and non-convergence issues.
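
The effect of the learning rate can be seen on a one-dimensional toy problem. The sketch below minimizes the hypothetical cost (w - 3)^2, whose gradient is 2(w - 3), using the plain gradient descent update; it is an illustration only and does not involve the TF optimizer.

```python
def gradient_descent(grad_fn, w0, learn_rate, n_iterations):
    """Repeatedly move w against the gradient of the cost."""
    w = w0
    for _ in range(n_iterations):
        w = w - learn_rate * grad_fn(w)  # step in the direction opposite to the gradient
    return w

def toy_grad(w):
    """Gradient of the toy cost (w - 3)**2, which has its minimum at w = 3."""
    return 2.0 * (w - 3.0)

w_small = gradient_descent(toy_grad, 0.0, 0.1, 100)  # moderate rate: approaches 3
w_large = gradient_descent(toy_grad, 0.0, 1.1, 100)  # excessive rate: moves away from 3
```

With a rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 per iteration, whereas with 1.1 it grows by a factor of 1.2 per iteration, so the second run diverges.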

In [17]:
# Training and optimizer parameters
learn_rate = 0.02
train_iterations = 2500 
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learn_rate).minimize(cost)

TF works with <i>'sessions'</i>, meaning that we need to create a session and initialize all our variables before we can actually run the code.

In [19]:
# Initialize TF session and its variables
init = tf.initialize_all_variables()

sess = tf.Session() # new TF session
sess.run(init)

display_step = 100 # just for showing texts indicating progress
# Training cycle
for it in range(train_iterations):
  # Run optimization (backpropagation) and the calculation of costs (to get loss values)
  _, c = sess.run([optimizer, cost], feed_dict={x: X_train, y: Y_train_bin})
  
  # Display logs per step
  if it % display_step == 0:
    print("Iteration: %d Cost= %0.6f" %(it, c))
  
print("Optimization finished!")

<h3> Part 4: Evaluating our MLP </h3>
Let's now compute the performance of the trained model when applied to the separate testing set.

In [21]:
# Compare model predictions against 'ground truth' labels
is_pred_correct = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y, 1))

# Calculate overall accuracy
accuracy = tf.reduce_mean(tf.cast(is_pred_correct, "float"))
accur_test = accuracy.eval({x: X_test, y: Y_test_bin}, session=sess)
print("Accuracy test: %2.2f%%" %(100*accur_test))

<h3> Advanced work </h3>
Instead of having a fixed number of neurons in the hidden layers, adapt the code above to perform an <b>automated selection of the MLP structure</b>. You will need not only a training subset (e.g. 60%) and a testing subset (e.g. 20%) to assess the final performance of the MLP, but also a validation subset (e.g. 20%) to select the most suitable number of neurons in each of the two hidden layers. Assume that the other important aspects of the network (e.g. activation functions) and of the training procedure (loss function, type of optimization algorithm, learning rate, number of training epochs, etc.) remain unaltered.
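
The role of each subset in this selection can be sketched as follows. The accuracy values below are invented purely for illustration, and `select_by_validation` is a hypothetical helper: the candidate is chosen by looking at the validation scores only, and the test score of that candidate is merely reported afterwards.

```python
import itertools

def select_by_validation(options, valid_scores, test_scores):
    """Pick the option with the best validation score and report its test score."""
    best_idx = max(range(len(options)), key=lambda i: valid_scores[i])
    return options[best_idx], valid_scores[best_idx], test_scores[best_idx]

# Hypothetical grid: 4 or 5 neurons in each hidden layer
toy_options = list(itertools.product(range(4, 6), range(4, 6)))
# Hypothetical accuracies, one per option (made-up numbers)
toy_valid = [0.80, 0.90, 0.85, 0.90]
toy_test = [0.95, 0.85, 0.80, 0.99]

best = select_by_validation(toy_options, toy_valid, toy_test)
```

In this toy case the architecture (4, 5) is selected even though another option has a higher test score: choosing by test accuracy would make the final performance estimate optimistic.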

In [23]:
import itertools

from sklearn import datasets
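try:
    from sklearn.model_selection import train_test_split # recent scikit-learn releases
except ImportError:
    from sklearn.cross_validation import train_test_split # older scikit-learn releases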
from sklearn.preprocessing import label_binarize
from sklearn.preprocessing import StandardScaler

# Initialize the random generator seed to compare results
np.random.seed(0)

# Load the Fisher's iris flower dataset
iris = datasets.load_iris()
X = iris.data[:]
Y = iris.target

# Divide into subsets
X_trainval, X_test, Y_trainval, Y_test = train_test_split(<FILL_IN>)
X_train, X_val, Y_train, Y_val = train_test_split(<FILL_IN>)

# Apply feature standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(<FILL_IN>)
X_valid = scaler.<FILL_IN>
X_test = scaler.<FILL_IN>

# Binarize the labels
set_classes = np.unique(Y)
Y_train_bin = label_binarize(<FILL_IN>)
Y_valid_bin = label_binarize(<FILL_IN>)
Y_test_bin = label_binarize(<FILL_IN>)

# Set network structure
n_feats = X.shape[1]
n_class = set_classes.shape[0]
n_in = n_feats
n_out = n_class

# TensorFlow inputs and outputs
x = tf.placeholder(<FILL_IN>)
y = tf.placeholder(<FILL_IN>)

# Create our own MLP model with TensorFlow functions
def MLP_TF(x_in, weights, biases):
  # 1st hidden layer, with sigmoidal activation
  layer1 = tf.add(tf.matmul(<FILL_IN>), <FILL_IN>)
  layer1 = tf.nn.sigmoid(<FILL_IN>)
  # 2nd hidden layer, again with sigmoidal activation
  layer2 = tf.add(tf.matmul(<FILL_IN>), <FILL_IN>)
  layer2 = tf.nn.sigmoid(<FILL_IN>)
  # Output layer, with linear activation
  y_out = tf.add(tf.matmul(<FILL_IN>), <FILL_IN>)
  return y_out

# Training and optimizer parameters
learn_rate = 0.02
train_iterations = 2500

# Manual implementation of a grid search strategy
n_hidden1_opts = range(4,9) # options to explore: neurons in the 1st hidden layer
n_hidden2_opts = range(4,9) # options to explore: neurons in the 2nd hidden layer
options = list(itertools.product(n_hidden1_opts, n_hidden2_opts)) # all possible combinations of options
accur_train, accur_valid, accur_test = [], [], []
print("Please be patient, this code may take some minutes to run!")
for opt in options:
  # Candidate MLP architecture
  nh1, nh2 = opt
  weights = {
    'wh1': tf.Variable(<FILL_IN>),
    'wh2': tf.Variable(<FILL_IN>),
    'wo': tf.Variable(<FILL_IN>)
  }
  biases = {
    'bh1': tf.Variable(<FILL_IN>),
    'bh2': tf.Variable(<FILL_IN>),
    'bo': tf.Variable(<FILL_IN>)
  }
  
  # Construct model
  y_pred = MLP_TF(<FILL_IN>)

  # Define loss function and optimizer
  costs_all = tf.nn.softmax_cross_entropy_with_logits(logits=<FILL_IN>, labels=<FILL_IN>)
  cost = tf.reduce_mean(<FILL_IN>)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate=<FILL_IN>).minimize(<FILL_IN>)
  
  # Initialize TF session and its variables
  init = tf.initialize_all_variables()
  sess = tf.Session() # new TF session
  sess.run(init)

  # Training
  for it in range(train_iterations):
    # Run optimization (backpropagation) and cost (to get loss value)
    sess.run([optimizer, cost], feed_dict={x: <FILL_IN>, y: <FILL_IN>})
    
  # Evaluation cycle
  is_pred_correct = tf.equal(tf.argmax(<FILL_IN>, 1), tf.argmax(<FILL_IN>, 1))
  accuracy = tf.reduce_mean(tf.cast(<FILL_IN>, "float"))
  
  accur_train.append( accuracy.eval({x: <FILL_IN>, y: <FILL_IN>}, session=sess) )
  accur_valid.append( accuracy.eval({x: <FILL_IN>, y: <FILL_IN>}, session=sess) )
  accur_test.append( accuracy.eval({x: <FILL_IN>, y: <FILL_IN>}, session=sess) )
  sess.close() # release the resources held by this candidate's session
  
# Once all possible combinations have been explored, find which yields the maximal accuracy on the validation set
print("Optimization finished!")
#idx_best = np.argmax(np.array(<FILL_IN>))
idx_best = np.argmax(np.array(accur_valid))
print("Valid accuracy @ best valid option: %2.2f%%" %(100*accur_valid[<FILL_IN>]))
print("Test accuracy @ best valid option: %2.2f%%" %(100*accur_test[<FILL_IN>]))
opt_best = options[<FILL_IN>]
print("MLP architecture @ best valid option: HiddenLayer#1 = %d, HiddenLayer#2 = %d" %(<FILL_IN>, <FILL_IN>))

<h2> 3. Convolutional Neural Network (CNN) </h2>

In this section we will create a 'deep' CNN to classify small images. The aim is to discern digits present in pictures of handwritten text.<br>
Thus, the problem will be formulated as a classification task with ten possible classes (i.e. digits from 0 to 9).

<h3> Part 0: Dataset </h3>
We will use LeCun's <b>MNIST</b> <a href="http://yann.lecun.com/exdb/mnist/">database</a>, a common dataset for benchmarking purposes. It contains 60k training instances, plus another 10k for testing. Each instance is a 28x28 pixel greyscale image matrix, or equivalently a flattened vector of 784 dimensions (components).

The following block of code will download, read and format the dataset for our use. Note that the training (55k), validation (5k) and testing (10k) subsets are already prepared.

In [26]:
import numpy as np
import matplotlib.pyplot as plt

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

# The MNIST dataset has 10 classes, representing the digits 0 through 9.
NUM_CLASSES = 10

# The MNIST images are always 28x28 pixels.
IMAGE_SIZE = 28
IMAGE_PIXELS = IMAGE_SIZE * IMAGE_SIZE

# Initialize the random generator seed to compare results
np.random.seed(0)

Let's now plot a few examples.

In [28]:
examples_first = 0
examples_num_cols, examples_num_rows = 8, 8

fig, axes = plt.subplots(nrows=examples_num_rows, ncols=examples_num_cols)
count = 0
for ax in axes.flat:
  example = mnist.train.images[examples_first+count]
  example = np.reshape(example, (<FILL_IN>, <FILL_IN>))
  ax.imshow(example, cmap='gray')
  ax.axis('off')
  count += 1

plt.show()
display(fig)

<h3> Part 1: CNN architecture and function </h3>

Firstly, we must define the placeholders for our TF inputs and outputs.<br>
Since the MNIST dataset contains a large number of instances, in practice it is not convenient to process all of them at once. Instead, from now on we will work with smaller subsets, known as <b>batches</b> (<b><i>'lotes'</i></b> in Spanish).
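
As a small illustration of the idea, the sketch below splits a dataset into consecutive batches by index. The generator `iterate_batches` is a hypothetical helper written for this example; it is not how the MNIST reader used in this notebook selects its batches.

```python
def iterate_batches(n_items, batch_size):
    """Yield (start, end) index pairs for consecutive full batches.

    Items left over after the last full batch are skipped.
    """
    n_batches = n_items // batch_size  # integer division: number of full batches
    for b in range(n_batches):
        yield b * batch_size, (b + 1) * batch_size

# 55k training images in batches of 1000 images each
toy_batches = list(iterate_batches(55000, 1000))
```

Each pair can then be used to slice the arrays of images and labels, e.g. `images[start:end]`.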

In [30]:
BATCH_SIZE = 1000

x = tf.placeholder("float", [BATCH_SIZE, IMAGE_PIXELS])
y = tf.placeholder("float", [BATCH_SIZE, NUM_CLASSES])

The <b>convolutional</b> behaviour of the CNN resides in exploiting the <b>spatial information</b> present in the images, i.e. relationships among the greyscale value of a certain pixel and those of its local neighbours.<br>
For this reason, we must reshape each individual data instance into a matrix image of size 28x28 pixels. The following code does that efficiently, processing the whole batch at once. By specifying the first dimension as -1, we let TF infer it from the total number of elements, so the batch size is left unchanged. By setting the last dimension to 1, we indicate that there is a single color channel (greyscale, in our case). Note that colour images (e.g. RGB) generally need several channels, typically 3.<br>
Thus, in the end we will obtain a <b>4D-tensor</b> of size [1000 (batch size), 28, 28, 1].
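
The meaning of the -1 entry can be made explicit with a few lines of plain Python. The helper `infer_reshape` is hypothetical and only mimics the shape arithmetic of a reshape: the unknown dimension is whatever makes the total number of elements match.

```python
def infer_reshape(n_elements, shape):
    """Replace a single -1 in shape by the size implied by the element count."""
    if shape.count(-1) > 1:
        raise ValueError("at most one dimension can be inferred")
    known = 1
    for d in shape:
        if d != -1:
            known *= d  # product of the explicitly given dimensions
    if n_elements % known != 0:
        raise ValueError("element count is not compatible with the requested shape")
    return [n_elements // known if d == -1 else d for d in shape]

# A batch of 1000 flattened images with 784 pixels each
resolved = infer_reshape(1000 * 784, [-1, 28, 28, 1])
```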

In [32]:
# Convert the batch of images into a 4D-tensor
x_image = tf.reshape(x, [-1,IMAGE_SIZE,IMAGE_SIZE,1])

The response of the CNN architecture (in its different layers) will be characterized by a series of <b>variables</b>, analogous to the weights and biases in the MLP function from the previous section. For convenience, we will now define a couple of auxiliary functions to ease the creation of such variables.<br>
In most circumstances, we would initialize them with random values, e.g. following a <b>normal</b> distribution. However, for the specific case of biases that enter a neuron with ReLU activation, it is more practical to start with small positive values. The <b>constant</b> generator below will deal with that task.

In [34]:
def init_random_normal_variable(shape, std_dev):
  initial = tf.random_normal(shape, stddev=std_dev)
  return tf.Variable(initial)

def init_constant_variable(shape, const_val):
  initial = tf.constant(const_val, shape=shape)
  return tf.Variable(initial)

As we saw in the theoretical lectures, the basic operations in CNNs are <b>convolution</b> and <b>pooling</b>, which will be repeated through the convolutional layers. Therefore, it is also practical to define these schemes as function blocks.<br>
Our choice below is to perform a 'classical' 2D convolution, but TF has other <a href="https://www.tensorflow.org/versions/r0.10/api_docs/python/nn/convolution">convolution possibilities</a> available. For instance, 3D convolution may be of high relevance when dealing with video or 3D images (e.g. medical volumetric scans). As for the pooling operation, 'max' is the most common choice in various applications, but again TF has implemented <a href="https://www.tensorflow.org/versions/r0.10/api_docs/python/nn/pooling">other pooling options</a>, e.g. average.

In [36]:
def conv2d_fcn(x_img, filt_weights):
  return tf.nn.conv2d(x_img, filt_weights, strides=[1, 1, 1, 1], padding='SAME')
  
def max_pool_fcn(x_img, pool_size):
  return tf.nn.max_pool(x_img, ksize=[1, pool_size, pool_size, 1], strides=[1, pool_size, pool_size, 1], padding='SAME')

Note that the <b>stride</b> establishes how much the convolutional/pooling block is moved in each dimension after one iteration.<br>
<b>Padding</b> specifies how the edges of the image are treated when doing the operations. For TF, <i>'SAME'</i> means zero padding (i.e. values beyond the edges are assumed to equal zero); whereas <i>'VALID'</i> indicates that we do not operate beyond the edges.
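
Stride and padding together determine the spatial size of the output. The sketch below applies the output-size rules that TF documents for its two padding modes to one spatial dimension; the function name `output_size` is invented for this example.

```python
import math

def output_size(in_size, kernel_size, stride, padding):
    """Spatial output size of a convolution or pooling along one dimension."""
    if padding == 'SAME':
        # Zero padding is added as needed, so only the stride reduces the size
        return int(math.ceil(in_size / float(stride)))
    if padding == 'VALID':
        # No padding: the block must fit entirely inside the input
        return int(math.ceil((in_size - kernel_size + 1) / float(stride)))
    raise ValueError("padding must be 'SAME' or 'VALID'")

after_conv = output_size(28, 5, 1, 'SAME')          # 5x5 convolution, stride 1
after_pool = output_size(after_conv, 2, 2, 'SAME')  # 2x2 pooling, stride 2
valid_conv = output_size(28, 5, 1, 'VALID')         # same convolution without padding
```

For a 28-pixel side, the padded convolution keeps 28 pixels, the 2x2 pooling halves that to 14, and the unpadded convolution would shrink the side to 24.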

Now we have defined the main building pieces for a CNN layer. Let's proceed to create one.

In [39]:
CONV1_SIZE, CONV1_NUM_FEATS = 5, 32
stdev_w1, const_b1 = 0.2, 0.1 # for example
# Create a bank of 32 convolutional filters of size 5x5 pixels
conv1_weights = init_random_normal_variable([CONV1_SIZE, CONV1_SIZE, 1, CONV1_NUM_FEATS], stdev_w1)
conv1_bias = init_constant_variable([CONV1_NUM_FEATS], const_b1)

Besides, we must choose an <a href="https://www.tensorflow.org/versions/r0.10/api_docs/python/nn/activation_functions_">activation function</a>. In this case, let's use the rectified linear unit (ReLU), since it tends to converge quickly in a variety of scenarios. Nonetheless, bear in mind that other activations may be more suitable for different problems.

In [41]:
conv1_in = x_image
conv1_activ = tf.add(conv2d_fcn(conv1_in, conv1_weights), conv1_bias) # convolution 2D and biases added
conv1_activ = tf.nn.relu(conv1_activ) # ReLU activation

Next, let's incorporate the pooling stage.

In [43]:
POOL1_SIZE = 2
pool1 = max_pool_fcn(conv1_activ, POOL1_SIZE)

<b>Task</b>: Create a similar second convolutional layer yourself, with:
* Convolutional blocks of size 3x3, 48 features and the appropriate number of channels.
* Softplus activation (which implies random initialization of biases).
* Average pooling with size 2x2.

In [45]:
CONV2_SIZE, CONV2_NUM_FEATS = <FILL_IN>, <FILL_IN>
stdev_w2, stdev_b2 = <FILL_IN>, <FILL_IN>

conv2_weights = <FILL_IN>
conv2_bias = <FILL_IN>

conv2_in = <FILL_IN>
conv2_activ = tf.add(conv2d_fcn(<FILL_IN>, <FILL_IN>), <FILL_IN>) # convolution 2D and biases added
conv2_activ = tf.nn.<FILL_IN> # softplus activation

def avg_pool_fcn(x_img, pool_size):
  return tf.nn.<FILL_IN>

POOL2_SIZE = <FILL_IN>
pool2 = <FILL_IN>

Finally, it is common practice to add a final fully connected layer (as in a 'classical' MLP), with softmax activation.

In [47]:
# Flatten the input tensor (i.e. the output of the 2nd convolutional layer) into a vector
full_in = pool2
full_in_flat = tf.reshape(full_in, [BATCH_SIZE, -1])
NUM_FULL_INPUTS = full_in_flat.get_shape()[1].value

# Hidden layer
NUM_HIDDEN_NEURONS = 128
stdev_wfih, stdev_bfih = 0.2, 0.2 # for example
full_in_hid_weights = init_random_normal_variable([NUM_FULL_INPUTS, NUM_HIDDEN_NEURONS], stdev_wfih)
full_in_hid_bias = init_random_normal_variable([NUM_HIDDEN_NEURONS], stdev_bfih)

full_hid_activ = tf.add(tf.matmul(full_in_flat, full_in_hid_weights), full_in_hid_bias)
full_hid_activ = tf.nn.sigmoid(full_hid_activ) # sigmoid activation

# Output layer
NUM_FULL_OUTPUTS = NUM_CLASSES
stdev_wfho, stdev_bfho = 0.2, 0.2 # for example
full_hid_out_weights = init_random_normal_variable([NUM_HIDDEN_NEURONS, NUM_FULL_OUTPUTS], stdev_wfho)
full_hid_out_bias = init_random_normal_variable([NUM_FULL_OUTPUTS], stdev_bfho)

full_out_activ = tf.add(tf.matmul(full_hid_activ, full_hid_out_weights), full_hid_out_bias) # linear activation

# Final, global output
y_cnn = full_out_activ # remember that softmax will be applied later on

That completes the structure of our CNN. Next, the overall <b>cost function</b> will be as follows:

In [49]:
# Define our loss function 
costs_all_cnn = tf.nn.softmax_cross_entropy_with_logits(logits=y_cnn, labels=y)
cost_cnn = tf.reduce_mean(costs_all_cnn)

<h3> Part 2: Training our CNN </h3>

As in the previous case of the MLP architecture, training the CNN aims at <b>minimizing the misclassification cost</b> (cross-entropy, here) over the training subset of data. It consists of an <b>iterative adjustment</b> of the values of the constituent pieces of the network: the coefficients of the convolutional filters, the biases of the activation functions, the weights and biases of the fully connected part, etc.<br>

We must again select an <b>optimization algorithm</b> which updates these elements. We could use the same GradientDescentOptimizer that we suggested for the MLP, but let's now explore another computationally efficient method called <a href="https://www.tensorflow.org/versions/r0.10/api_docs/python/train/optimizers#AdamOptimizer">AdamOptimizer</a> (if interested, you can find a detailed theoretical description of the algorithm <a href="https://arxiv.org/abs/1412.6980">here</a>).<br>

Besides, there is a <a href="https://www.tensorflow.org/versions/r0.10/api_docs/python/train/optimizers">list</a> of optimizers available in TF.

Adam has a series of <b>hyper-parameters</b> which define its behaviour as an optimizer. In broad terms, it requires us to set a learning rate, as well as two exponential decay rates (beta1, beta2) for its running averages of the gradient and of the squared gradient. Let's now set these three hyper-parameters to their default values.
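
To see how these three hyper-parameters enter the update, here is a minimal scalar sketch of the Adam rule as described in the paper linked above. It is written for illustration on a hypothetical one-dimensional cost and is not TF's implementation; `epsilon` is the small constant that the algorithm adds to avoid division by zero.

```python
import math

def adam_minimize(grad_fn, w0, learn_rate=0.001, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, n_iterations=1):
    """Run a number of Adam updates on a single scalar parameter."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, n_iterations + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g      # running average of the gradient
        v = beta2 * v + (1.0 - beta2) * g * g  # running average of the squared gradient
        m_hat = m / (1.0 - beta1 ** t)         # bias corrections for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        w = w - learn_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w

def toy_grad(w):
    """Gradient of the toy cost (w - 3)**2."""
    return 2.0 * (w - 3.0)

def toy_grad_scaled(w):
    """Gradient of the same cost multiplied by 100."""
    return 100.0 * toy_grad(w)

w_one_step = adam_minimize(toy_grad, 0.0)
w_one_step_scaled = adam_minimize(toy_grad_scaled, 0.0)
```

In both runs the first step has a size close to the learning rate (0.001), even though the gradients differ by a factor of 100. This normalization by the gradient magnitude is one practical difference with respect to plain gradient descent.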

In [52]:
adam_learn_rate = 0.001
adam_beta1, adam_beta2 = 0.9, 0.999
# Creation of the optimizer
adam_optimizer = tf.train.AdamOptimizer(learning_rate=adam_learn_rate, beta1=adam_beta1, beta2=adam_beta2).minimize(cost_cnn)

Let's also add some code to calculate the total results in terms of classification performance.

In [54]:
# Compute a vector of classification error indicators, comparing true labels with CNN predictions
is_pred_correct_cnn = tf.equal(tf.argmax(<FILL_IN>, 1), tf.argmax(<FILL_IN>, 1))

# Estimate accuracy as the fraction of correct predictions in the given dataset
accuracy_cnn = tf.reduce_mean(tf.cast(<FILL_IN>, "float"))

We are ready to run the TF session. We will see how to apply batch gradient updates:

In [56]:
NUM_IMAGES_TRAIN = mnist.train.images.shape[0] # 55 thousand
NUM_BATCHES_TRAIN = NUM_IMAGES_TRAIN // BATCH_SIZE # integer division, so that range() receives an int

sess = tf.Session()
sess.run(tf.initialize_all_variables())

adam_train_iterations = 50
display_step = 5
print("Please be patient, this code will need a very long time to run (approx. 60 minutes in DataBricks)!")
for i in range(adam_train_iterations):
  for b in range(NUM_BATCHES_TRAIN):
    batch_train = mnist.train.next_batch(BATCH_SIZE) # get the following batch of training images
    # Perform batch training
    sess.run(adam_optimizer, feed_dict = {x: batch_train[0], y: batch_train[1]})
    
  if i % display_step == 0:
    accur_train = sess.run(accuracy_cnn, feed_dict = {x: batch_train[0], y: batch_train[1]})
    print("Iteration %d, Training accuracy = %2.2f%%" %(i, 100*accur_train))
    
print("Optimization finished!")

<h3> Part 3: CNN evaluation </h3>
Finally, we should check how accurately the trained CNN performs when applied to the testing set.

In [58]:
NUM_IMAGES_TEST = mnist.test.images.shape[0] # 10 thousand
NUM_BATCHES_TEST = NUM_IMAGES_TEST // BATCH_SIZE # integer division, so that range() receives an int

true_labels, pred_labels = np.empty([0, 1]), np.empty([0, 1])
for b in range(NUM_BATCHES_TEST):
  batch_init, batch_end = BATCH_SIZE*b, BATCH_SIZE*(b+1)
  # Get the next batch of test images
  x_batch_test = mnist.test.images[batch_init:batch_end]
  y_batch_test = mnist.test.labels[batch_init:batch_end]
  # Perform evaluation on this test batch
  batch_pred = sess.run(tf.argmax(y_cnn, 1), feed_dict = {x: x_batch_test})
  batch_true = sess.run(tf.argmax(y, 1), feed_dict = {y: y_batch_test})
  # Store the results together with those from previous test batches
  pred_labels = np.vstack((pred_labels, batch_pred.reshape([-1, 1])))
  true_labels = np.vstack((true_labels, batch_true.reshape([-1, 1])))

accur_test = np.mean(np.equal(true_labels, pred_labels))
print("Test accuracy = %2.2f%%" %(100*accur_test))

<b>Task 1</b>: Check the dimensions of each of the tensors involved in the CNN. To do so, use TF's get_shape() function. Explain why each tensor has those dimensions.

In [60]:
<FILL_IN>

<b>Task 2</b>: Check the CNN decisions for the wrongly classified digits in the testing set. Find out which examples had a failure. Plot them, indicating both the true digit and the predicted label.

In [62]:
idx_wrong_test = np.nonzero(np.not_equal(<FILL_IN>, <FILL_IN>))[0]
num_wrong_test = <FILL_IN>
print("Number of wrongly classified test instances: %d" %(num_wrong_test))

num_cols = 10 # to arrange the plots
num_rows = int(np.ceil(np.true_divide(num_wrong_test, num_cols))) # plain int avoids the overflow of int8 above 127 rows

fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols)
count = 0
for ax in axes.flat:
  if count >= num_wrong_test:
    ax.axis('off')
    continue # no more wrong items to plot
  
  idx = idx_wrong_test[count]
  # Get the image
  example = mnist.test.images[<FILL_IN>]
  example = np.reshape(<FILL_IN>, <FILL_IN>)
  ax.imshow(example, cmap='gray')
  ax.axis('off')
  
  # Get the labels
  true_label = <FILL_IN>
  pred_label = <FILL_IN>
  example_title = "True: %d; Pred: %d" %(true_label, pred_label)
  ax.title.set_text(example_title)
  ax.title.set_fontsize(5)
  
  count += 1

plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=1) # this line is just for a more comfortable display of images and titles
plt.show()
display(fig)

<h3> Advanced work </h3>
<b>Task 1</b>: Plot the weights of the trained convolutional filters for layers 1 and 2. This will give you a visual hint as to why the CNN can be interpreted as performing some kind of automated feature extraction.

In [64]:
# Convolutional layer 1
<FILL_IN>

In [65]:
# Convolutional layer 2
<FILL_IN>

<b>Task 2</b>: Create your own CNN architecture. Be creative: add or remove layers, change the size and/or the number of convolutional filters, the type and size of pooling, the non-linear activation functions, etc. Try modifying the architecture of the fully connected part of the network. Explore different optimizer algorithms.

In [67]:
<FILL_IN>