# Deep Neural Networks in TensorFlow

In this lab session, we will use the knowledge acquired in the tutorial to implement and train a deep neural network. In particular, we will:

1. Implement a neural network with *one hidden layer* for multiclass classification using MNIST.

2. Implement a neural network with *two hidden layers* for multiclass classification using MNIST.

## Load the Data / Auxiliary Functions

We copy here the auxiliary functions that we already used in the tutorial, including those to load the data.

In [1]:
import tensorflow as tf
import os, struct
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
from sklearn.datasets import fetch_mldata

In [2]:
# Load the dataset
mnist_dict = fetch_mldata("MNIST original")

# Get the data
X_all = mnist_dict['data']
Y_all = mnist_dict['target']
print("Shape of X_all: ", X_all.shape)
print("Shape of Y_all: ", Y_all.shape)

# Get the number of classes (10) and the dimensionality of the input data (28*28)
num_classes = len(np.unique(Y_all))
num_pixels = X_all.shape[1]
print("Number of classes: ", num_classes)
print("Number of pixels: ", num_pixels)

Shape of X_all:  (70000, 784)
Shape of Y_all:  (70000,)
Number of classes:  10
Number of pixels:  784


In [3]:
# Number of observations in each group
n_training_cases = 60000
n_test_cases = 8000

# Create a permutation vector to shuffle the images
perm = np.random.permutation(X_all.shape[0])

# Split into training/test/validation
X_train = X_all[perm[:n_training_cases], :].astype(np.float32)
Y_train = Y_all[perm[:n_training_cases], None].astype(np.int32)

X_test = X_all[perm[n_training_cases:n_training_cases+n_test_cases], :].astype(np.float32)
Y_test = Y_all[perm[n_training_cases:n_training_cases+n_test_cases], None].astype(np.int32)

X_val = X_all[perm[n_training_cases+n_test_cases:], :].astype(np.float32)
Y_val = Y_all[perm[n_training_cases+n_test_cases:], None].astype(np.int32)

print("Shape of X_train: ", X_train.shape, "\t Shape of Y_train: ", Y_train.shape)
print("Shape of X_test: ", X_test.shape, "\t Shape of Y_test: ", Y_test.shape)
print("Shape of X_val: ", X_val.shape, "\t Shape of Y_val: ", Y_val.shape)

Shape of X_train:  (60000, 784) 	 Shape of Y_train:  (60000, 1)
Shape of X_test:  (8000, 784) 	 Shape of Y_test:  (8000, 1)
Shape of X_val:  (2000, 784) 	 Shape of Y_val:  (2000, 1)


In [4]:
# [In case you want to switch to a smaller dataset]
train_subset = 60000   # <--- Replace with some other number if you wish to select a smaller dataset

X_train_small = X_train[:train_subset, :]  # First images:
Y_train_small = Y_train[:train_subset, :]  # We don't have to choose a random set because images have already been shuffled

print("Shape of X_train_small: ", X_train_small.shape, "\t Shape of Y_train_small: ", Y_train_small.shape)

Shape of X_train_small:  (60000, 784) 	 Shape of Y_train_small:  (60000, 1)


In [5]:
def accuracy(predictions, labels):
    # Return % of correctly classified images
    return (100.0 * np.sum(np.argmax(predictions, axis=1) == np.squeeze(labels)) / predictions.shape[0])
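
As a quick sanity check of `accuracy()`, the sketch below applies it to a few made-up predictions (the `toy_` arrays are illustrative and do not come from MNIST). It also shows why the labels are squeezed: comparing a length-`N` vector with an `N x 1` column would broadcast to an `N x N` matrix instead of an element-wise comparison.

```python
import numpy as np

def accuracy(predictions, labels):
    # Percentage of rows whose arg-max class matches the label
    return 100.0 * np.sum(np.argmax(predictions, axis=1) == np.squeeze(labels)) / predictions.shape[0]

# Made-up class probabilities for 4 images and 3 classes
toy_predictions = np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.8, 0.1],
                            [0.3, 0.3, 0.4],
                            [0.6, 0.3, 0.1]])
# Labels stored as a column vector, like Y_train above
toy_labels = np.array([[0], [1], [2], [1]])

toy_accuracy = accuracy(toy_predictions, toy_labels)
print(toy_accuracy)  # 3 of the 4 rows are correct -> 75.0

# Without np.squeeze the comparison broadcasts to a (4, 4) matrix
unsqueezed_shape = (np.argmax(toy_predictions, axis=1) == toy_labels).shape
print(unsqueezed_shape)
```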

In [6]:
def next_minibatch(X_, Y_, batch_size):
    # Create a vector with batch_size random integers
    perm = np.random.permutation(X_.shape[0])
    perm = perm[:batch_size]
    # Generate the minibatch
    X_batch = X_[perm, :]
    Y_batch = Y_[perm, :]
    # Return the images and the labels
    return X_batch, Y_batch
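
The sketch below exercises `next_minibatch()` on a tiny synthetic dataset (the `X_toy` and `Y_toy` arrays and the seed are assumptions made for illustration). Because the function keeps the first `batch_size` entries of a fresh permutation, a single minibatch never contains the same image twice, but successive minibatches are drawn independently of each other.

```python
import numpy as np

def next_minibatch(X_, Y_, batch_size):
    # Keep the first batch_size indices of a random permutation of the rows
    perm = np.random.permutation(X_.shape[0])[:batch_size]
    return X_[perm, :], Y_[perm, :]

np.random.seed(0)  # seed chosen only to make this sketch repeatable

# Row i of X_toy is [2i, 2i+1] and its label is i, so alignment is easy to check
X_toy = np.arange(20, dtype=np.float32).reshape(10, 2)
Y_toy = np.arange(10, dtype=np.int32).reshape(10, 1)

X_batch, Y_batch = next_minibatch(X_toy, Y_toy, batch_size=4)
print(X_batch.shape, Y_batch.shape)

# Images and labels stay aligned, and no row is repeated within the batch
rows_aligned = bool(np.all(X_batch[:, 0] == 2 * Y_batch[:, 0]))
n_distinct = len(np.unique(Y_batch))
print(rows_aligned, n_distinct)
```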

In [7]:
def create_weight(shape):
    # creates and initializes a weight matrix of the specified size
    return tf.Variable( tf.truncated_normal(shape, stddev=0.01) )

def create_bias(shape):
    # creates and initializes a bias term of the specified size
    return tf.Variable( tf.constant(0.1, shape=shape) )
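
For intuition about the weight initializer, `tf.truncated_normal` draws from a normal distribution and re-draws any value that lies more than two standard deviations from the mean. The NumPy function below is a rough sketch of that behaviour under this assumption, not TensorFlow's implementation; the name `truncated_normal` and the `rng` argument are introduced here for illustration only.

```python
import numpy as np

def truncated_normal(shape, stddev=0.01, rng=None):
    # Illustrative stand-in: zero-mean normal samples, re-drawn until all lie within 2 * stddev
    rng = np.random.default_rng() if rng is None else rng
    values = rng.normal(0.0, stddev, size=shape)
    outside = np.abs(values) > 2.0 * stddev
    while np.any(outside):
        # Re-draw only the entries that fell outside the allowed range
        values[outside] = rng.normal(0.0, stddev, size=int(outside.sum()))
        outside = np.abs(values) > 2.0 * stddev
    return values

W_demo = truncated_normal((784, 20), stddev=0.01, rng=np.random.default_rng(0))
max_abs = float(np.abs(W_demo).max())
print(W_demo.shape, max_abs <= 0.02)
```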

## 1. Multinomial Logistic Regression (One Hidden Layer)

Modify the code from the tutorial to implement a neural network with one hidden layer. In particular, we will consider the following model. Let $\mathbf{X}$ be the matrix containing the images (with one row per image).

**First layer (hidden layer).** The first layer of the NN computes the auxiliary variables $\mathbf{Z}^{(0)}$ as
$$
\mathbf{Z}^{(0)} = f\left(\mathbf{X}\mathbf{W}^{(0)} + \mathbf{b}^{(0)}\right),
$$
where

- $\mathbf{W}^{(0)}$ denotes the weights of the first (hidden) layer. The size of this matrix is $D\times K$, where $D$ is the dimensionality of the inputs (i.e., the number of pixels of each image), and $K$ is the number of hidden units. You can use $K=2000$.

- $\mathbf{b}^{(0)}$ denotes the biases of the first (hidden) layer (a vector of length $K$).

- $f(\cdot)$ is a non-linear function. For concreteness, we will use a rectified linear unit (ReLU), which is defined as
$$
f(x) = \max(0,x)
$$
In TensorFlow, we can use the ReLU function `tf.nn.relu()`.
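
To see what the ReLU does to individual values, here is a small NumPy sketch (the helper name `relu` and the sample vector are chosen for illustration): negative entries are clipped to zero and non-negative entries pass through unchanged.

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x)
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
y = relu(x)
print(y)
```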

*Note:* $\mathbf{Z}^{(0)}$ is an auxiliary variable, but it is not a variable to be optimized. This means that it must *not* be declared in block b (Variables) of the computational graph. The only variables that must appear in block b (Variables) are the weights and biases of both layers.

**Second layer (output layer).** The second layer of the NN is a softmax layer like the one implemented in the previous tutorial. The only difference is that the input to this layer is $\mathbf{Z}^{(0)}$ instead of $\mathbf{X}$. That is,
$$
p(y_n=j\;|\; \mathbf{z}_n^{(0)}) = \frac{e^{{\mathbf{z}_n^{(0)}}\mathbf{w}_j^{(1)} + b_j^{(1)}} }{\sum_{j^\prime=1}^J e^{{\mathbf{z}_n^{(0)}}\mathbf{w}_{j^\prime}^{(1)} + b_{j^\prime}^{(1)} }},\qquad j=1,\ldots,J
$$
Similarly to the tutorial, we need $J$ weight vectors $\mathbf{w}_j^{(1)}$ (each of length $K$) and $J$ biases $b_j^{(1)}$, which we can group together in a $K\times J$ matrix $\mathbf{W}^{(1)}$ and a $J$-vector $\mathbf{b}^{(1)}$. This is as shown in the tutorial; the only difference is the size of $\mathbf{W}^{(1)}$, which is now $K\times J$ instead of $D\times J$ to match the size of the input $\mathbf{Z}^{(0)}$.
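
The shapes in the two layers can be checked with a small NumPy sketch of the forward pass. This is only an illustration with random inputs and a reduced hidden size (`K = 32` rather than the suggested 2000); the variable names are not part of the lab code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K, J = 5, 784, 32, 10   # K is kept small here for illustration

X = rng.random((N, D))                      # N fake images, one per row
W0 = rng.normal(0.0, 0.01, size=(D, K))     # first-layer weights, D x K
b0 = np.full(K, 0.1)                        # first-layer biases, length K
W1 = rng.normal(0.0, 0.01, size=(K, J))     # output-layer weights, K x J
b1 = np.full(J, 0.1)                        # output-layer biases, length J

Z0 = np.maximum(0.0, X @ W0 + b0)           # hidden activations, shape (N, K)
logits = Z0 @ W1 + b1                       # shape (N, J)

# Softmax over classes; subtracting the row maximum does not change the
# result but avoids overflow in the exponential
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(Z0.shape, logits.shape, probs.sum(axis=1))
```

Each row of `probs` sums to one, matching the softmax expression above.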

**[Task]** Implement and train the network. Find the validation and test accuracy. Is it better or worse than in the tutorial?

Some hints:

1. Use 10000 iterations of SGD with a minibatch size of 100. Use AdamOptimizer with a learning rate of $10^{-5}$.

2. Think about how many latent variables there are and what their sizes are.

3. You may use the following function that encapsulates the operations across the entire network:

```python
def all_nn_computations(X, weights_0, biases_0, weights_1, biases_1):
    Z_0 = tf.nn.relu( tf.matmul(X, weights_0) + biases_0 )
    logits = tf.matmul(Z_0, weights_1) + biases_1
    return logits
```
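
One way to double-check the sizes of the trainable variables is to count the entries of every weight matrix and bias vector. The helper `count_parameters` below is a hypothetical utility written for this purpose in plain Python; it is not part of the tutorial code.

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases in a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus bias vector
    return total

D, K, J = 784, 2000, 10
n_params = count_parameters([D, K, J])
print(n_params)  # 784*2000 + 2000 + 2000*10 + 10 = 1590010
```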

In [8]:
def all_nn_computations(X, weights_0, biases_0, weights_1, biases_1):
    Z_0 = tf.nn.relu( tf.matmul(X, weights_0) + biases_0 )
    logits = tf.matmul(Z_0, weights_1) + biases_1
    return logits

# Create the computational graph
learning_rate = 1.0e-5
batch_size = 100
max_iterations = 10000
K = 2000 # number of neurons in the hidden layer

graph_MLR_2layers = tf.Graph()
with graph_MLR_2layers.as_default():
    # (a) Input data
    tf_train_data = tf.placeholder(tf.float32, shape=(batch_size, num_pixels))
    tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size,))  # 1-D vector of labels
    tf_test_data = tf.constant(X_test)
    tf_val_data = tf.constant(X_val)

    # (b) Variables
    weights_0 = create_weight([num_pixels, K])
    biases_0 = create_bias([K])
    weights_1 = create_weight([K, num_classes])
    biases_1 = create_bias([num_classes])

    # (c) Computations
    train_logits = all_nn_computations(tf_train_data, weights_0, biases_0, weights_1, biases_1)
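    # Keyword arguments avoid mixing up the argument order (and are required from TensorFlow 1.0 on)
    loss = tf.reduce_mean( tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_logits, labels=tf_train_labels) )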

    # (d) Optimizer
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

    # (e) Other tasks
    train_prediction = tf.nn.softmax( train_logits )
    val_prediction = tf.nn.softmax( all_nn_computations(tf_val_data, weights_0, biases_0, weights_1, biases_1) )
    test_prediction = tf.nn.softmax( all_nn_computations(tf_test_data, weights_0, biases_0, weights_1, biases_1) )
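
The loss in block (c) is the average, over the minibatch, of the negative log-probability that the softmax assigns to the correct class. The NumPy function below is a sketch of that quantity, assuming integer class labels; the function name and the toy inputs are illustrative and this is not TensorFlow's implementation.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    # Log-softmax computed with the row maximum subtracted for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick, for each row, the log-probability of its true class
    return -log_probs[np.arange(logits.shape[0]), labels]

toy_logits = np.array([[2.0, 0.0, 0.0],    # favours class 0
                       [0.0, 0.0, 0.0]])   # uniform over the 3 classes
toy_labels = np.array([0, 2])

per_example = sparse_softmax_cross_entropy(toy_logits, toy_labels)
mean_loss = per_example.mean()
print(per_example, mean_loss)
```

For the uniform row the loss equals the logarithm of the number of classes; with $J=10$ classes that value is $\log 10 \approx 2.30$.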

In [9]:
with tf.Session(graph=graph_MLR_2layers) as session:
    # 1. We initialize the weights and biases. This is a one-time operation
    tf.initialize_all_variables().run()
    print("Initialized")

    # 2. Run SGD
    for step in range(max_iterations):
        # Get a new minibatch of data
        X_batch, Y_batch = next_minibatch(X_train_small, Y_train_small, batch_size)

        # Prepare a dictionary telling the session where to feed the minibatch
        feed_dict = { tf_train_data   : X_batch,
                      tf_train_labels : np.squeeze(Y_batch) }

        # Run the computations
        _, l, predictions = session.run( [optimizer, loss, train_prediction], feed_dict=feed_dict )

        # Every 500 iterations
        if (step % 500 == 0):
            # Print the loss
            print("Minibatch loss at step %d: %f" % (step, l))
            # Obtain and print the accuracy on the training set
            print(" +Minibatch accuracy: %.1f%%" % accuracy(predictions, Y_batch))
            # Obtain and print the accuracy on the validation set
            print(" +Validation accuracy: %.1f%%" % accuracy(val_prediction.eval(), Y_val))

    # 3. Accuracy on the test set
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), Y_test))

Initialized
Minibatch loss at step 0: 6.597075
 +Minibatch accuracy: 5.0%
 +Validation accuracy: 12.2%
Minibatch loss at step 500: 0.265746
 +Minibatch accuracy: 92.0%
 +Validation accuracy: 91.4%
Minibatch loss at step 1000: 0.114103
 +Minibatch accuracy: 96.0%
 +Validation accuracy: 94.1%
Minibatch loss at step 1500: 0.110969
 +Minibatch accuracy: 97.0%
 +Validation accuracy: 95.0%
Minibatch loss at step 2000: 0.036320
 +Minibatch accuracy: 98.0%
 +Validation accuracy: 96.0%
Minibatch loss at step 2500: 0.116913
 +Minibatch accuracy: 96.0%
 +Validation accuracy: 96.5%
Minibatch loss at step 3000: 0.085171
 +Minibatch accuracy: 98.0%
 +Validation accuracy: 97.2%
Minibatch loss at step 3500: 0.052354
 +Minibatch accuracy: 99.0%
 +Validation accuracy: 97.0%
Minibatch loss at step 4000: 0.017966
 +Minibatch accuracy: 99.0%
 +Validation accuracy: 97.0%
Minibatch loss at step 4500: 0.033881
 +Minibatch accuracy: 99.0%
 +Validation accuracy: 97.2%
Minibatch loss at step 5000: 0.009183
 +Min

## 2. Multinomial Logistic Regression (Two Hidden Layers)

Now, extend the model with a second hidden layer. The resulting model is described below.

**First layer (hidden layer).** We will use a standard linear layer with a ReLU output,
$$
\mathbf{Z}^{(0)} = f\left(\mathbf{X}\mathbf{W}^{(0)} + \mathbf{b}^{(0)}\right),
$$
where $f(\cdot)$ is the ReLU function.

**Second layer (hidden layer).** Again, we will use a standard linear layer with a ReLU output,
$$
\mathbf{Z}^{(1)} = f\left(\mathbf{Z}^{(0)}\mathbf{W}^{(1)} + \mathbf{b}^{(1)}\right)
$$

**Third layer (output layer).** As before, we will use a softmax function,
$$
p(y_n=j\;|\; \mathbf{z}_n^{(1)}) = \frac{e^{{\mathbf{z}_n^{(1)}}\mathbf{w}_j^{(2)} + b_j^{(2)}} }{\sum_{j^\prime=1}^J e^{{\mathbf{z}_n^{(1)}}\mathbf{w}_{j^\prime}^{(2)} + b_{j^\prime}^{(2)} }},\qquad j=1,\ldots,J
$$

Now, the size of $\mathbf{W}^{(0)}$ is $D\times K_0$, the size of $\mathbf{W}^{(1)}$ is $K_0\times K_1$, and the size of $\mathbf{W}^{(2)}$ is $K_1\times J$. You may use $K_0=1000$ and $K_1=100$. The bias vectors $\mathbf{b}^{(0)}$, $\mathbf{b}^{(1)}$, and $\mathbf{b}^{(2)}$ have length $K_0$, $K_1$, and $J$, respectively.
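
The chain of matrix sizes can be verified with a NumPy sketch that loops over an arbitrary list of layers. The function `forward` and the random inputs are assumptions made for illustration; the lab itself builds the same computation with TensorFlow operations.

```python
import numpy as np

def forward(X, weights, biases):
    """ReLU on every layer except the last, which returns raw logits."""
    Z = X
    for W, b in zip(weights[:-1], biases[:-1]):
        Z = np.maximum(0.0, Z @ W + b)
    return Z @ weights[-1] + biases[-1]

rng = np.random.default_rng(0)
N, D, K_0, K_1, J = 4, 784, 1000, 100, 10
sizes = [D, K_0, K_1, J]

# One weight matrix and one bias vector per pair of consecutive layers
weights = [rng.normal(0.0, 0.01, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.full(n, 0.1) for n in sizes[1:]]

logits = forward(rng.random((N, D)), weights, biases)
n_params = sum(W.size + b.size for W, b in zip(weights, biases))
print([W.shape for W in weights], logits.shape, n_params)
```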

**[Task]** Implement and train the network.

Hints:

1. Use SGD (15000 iterations, batch size of 100) and AdamOptimizer. Set the learning rate to $10^{-5}$.

2. You need to re-define `all_nn_computations()`. Make sure you do that appropriately.

In [10]:
def all_nn_computations(X, weights_0, biases_0, weights_1, biases_1, weights_2, biases_2):
    Z_0 = tf.nn.relu( tf.matmul(X, weights_0) + biases_0 )
    Z_1 = tf.nn.relu( tf.matmul(Z_0, weights_1) + biases_1 )
    logits = tf.matmul(Z_1, weights_2) + biases_2
    return logits

# Create the computational graph
learning_rate = 1.0e-5
batch_size = 100
max_iterations = 15000
K_0 = 1000 # number of neurons in the first hidden layer
K_1 = 100  # number of neurons in the second hidden layer

graph_MLR_3layers = tf.Graph()
with graph_MLR_3layers.as_default():
    # (a) Input data
    tf_train_data = tf.placeholder(tf.float32, shape=(batch_size, num_pixels))
    tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size,))  # 1-D vector of labels
    tf_test_data = tf.constant(X_test)
    tf_val_data = tf.constant(X_val)

    # (b) Variables
    weights_0 = create_weight([num_pixels, K_0])
    biases_0 = create_bias([K_0])
    weights_1 = create_weight([K_0, K_1])
    biases_1 = create_bias([K_1])
    weights_2 = create_weight([K_1, num_classes])
    biases_2 = create_bias([num_classes])

    # (c) Computations
    train_logits = all_nn_computations(tf_train_data, weights_0, biases_0, weights_1, biases_1, weights_2, biases_2)
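    # Keyword arguments avoid mixing up the argument order (and are required from TensorFlow 1.0 on)
    loss = tf.reduce_mean( tf.nn.sparse_softmax_cross_entropy_with_logits(logits=train_logits, labels=tf_train_labels) )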

    # (d) Optimizer
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

    # (e) Other tasks
    train_prediction = tf.nn.softmax( train_logits )
    val_prediction = tf.nn.softmax( all_nn_computations(tf_val_data, weights_0, biases_0, weights_1, biases_1, weights_2, biases_2) )
    test_prediction = tf.nn.softmax( all_nn_computations(tf_test_data, weights_0, biases_0, weights_1, biases_1, weights_2, biases_2) )
    

In [11]:
with tf.Session(graph=graph_MLR_3layers) as session:
    # 1. We initialize the weights and biases. This is a one-time operation
    tf.initialize_all_variables().run()
    print("Initialized")

    # 2. Run SGD
    for step in range(max_iterations):
        # Get a new minibatch of data
        X_batch, Y_batch = next_minibatch(X_train_small, Y_train_small, batch_size)

        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = { tf_train_data   : X_batch,
                      tf_train_labels : np.squeeze(Y_batch) }

        # Run the computations
        _, l, predictions = session.run( [optimizer, loss, train_prediction], feed_dict=feed_dict )

        # Every 1000 iterations
        if (step % 1000 == 0):
            # Print the loss
            print("Minibatch loss at step %d: %f" % (step, l))
            # Obtain and print the accuracy on the training set
            print(" +Minibatch accuracy: %.1f%%" % accuracy(predictions, Y_batch))
            # Obtain and print the accuracy on the validation set
            print(" +Validation accuracy: %.1f%%" % accuracy(val_prediction.eval(), Y_val))

    # 3. Accuracy on the test set
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), Y_test))

Initialized
Minibatch loss at step 0: 2.336803
 +Minibatch accuracy: 10.0%
 +Validation accuracy: 6.8%
Minibatch loss at step 1000: 0.321411
 +Minibatch accuracy: 91.0%
 +Validation accuracy: 93.3%
Minibatch loss at step 2000: 0.091415
 +Minibatch accuracy: 98.0%
 +Validation accuracy: 95.0%
Minibatch loss at step 3000: 0.085133
 +Minibatch accuracy: 99.0%
 +Validation accuracy: 96.0%
Minibatch loss at step 4000: 0.066861
 +Minibatch accuracy: 99.0%
 +Validation accuracy: 96.7%
Minibatch loss at step 5000: 0.068224
 +Minibatch accuracy: 97.0%
 +Validation accuracy: 97.0%
Minibatch loss at step 6000: 0.066474
 +Minibatch accuracy: 98.0%
 +Validation accuracy: 97.9%
Minibatch loss at step 7000: 0.032801
 +Minibatch accuracy: 100.0%
 +Validation accuracy: 98.0%
Minibatch loss at step 8000: 0.038483
 +Minibatch accuracy: 99.0%
 +Validation accuracy: 97.9%
Minibatch loss at step 9000: 0.039843
 +Minibatch accuracy: 99.0%
 +Validation accuracy: 98.0%
Minibatch loss at step 10000: 0.085233
 +