# Intro to Deep Learning, HW3 
# Yifan Li, yl506 

**Problem 2: Implementation of a 2-layer CNN with TensorFlow (24 points)**

To gain experience with the implementation of CNNs in TensorFlow, please implement a 2-layer
CNN followed by 2 fully connected layers as an architecture for a model to classify MNIST digits.

a) Specify the network without stride and 3x3 or 5x5 filter sizes. You can choose the number
of filters in each layer and the number of hidden units of the first fully connected layer.

b) Calculate the receptive field of your 2-layer CNN for a 28x28 MNIST image.

Notes:

• For the implementation, make sure to use TensorFlow’s Core API. You are welcome to
try other implementations using higher level APIs (e.g., layers, Slim).

• You are only required to use the built-in cross entropy with logits and stochastic gradient
descent. If you would like to try any of the optimization rules discussed earlier, you are
welcome to.

• When initializing the weights of the model (not including biases) it is important to set their
initial values to random values as discussed in class. You can use truncated normal
distributions as discussed in class.
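
The truncated normal initialization mentioned in the last note can be illustrated without TensorFlow. The following is a minimal NumPy sketch, assuming the usual convention (the one `tf.truncated_normal` documents) that samples more than two standard deviations from the mean are re-drawn. The helper name `truncated_normal` and the seed are illustrative, not part of the assignment.

```python
import numpy as np

def truncated_normal(shape, stddev=0.1, mean=0.0, seed=None):
    """Sample a normal distribution, re-drawing values beyond 2 stddev."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(mean, stddev, size=shape)
    # Re-draw any sample that falls more than two standard deviations out.
    outliers = np.abs(samples - mean) > 2 * stddev
    while outliers.any():
        samples[outliers] = rng.normal(mean, stddev, size=int(outliers.sum()))
        outliers = np.abs(samples - mean) > 2 * stddev
    return samples

# Same shape as a 5x5 filter bank with 1 input channel and 32 output channels.
w = truncated_normal((5, 5, 1, 32), stddev=0.1, seed=0)
print(w.shape, float(np.abs(w).max()) <= 0.2)
```

Bounding the initial weights this way avoids the occasional large value that a plain normal draw can produce.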

In [17]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Import data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# Split the dataset
x_train = np.concatenate((mnist.train.images, mnist.validation.images), axis=0)
y_train = np.concatenate((mnist.train.labels, mnist.validation.labels), axis=0)
print(x_train.shape)


Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
(60000, 784)


In [12]:
#Q2(a). TF Core Implementation
# filter size 5*5, no pooling, no dropout
num_epochs = 10
batch_size = 100
n_classes = 10
# TF Graph Input
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
def maxpool2d(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')


weights = {
    # First 5*5 convolution, 1 input image, 32 outputs
    'W_conv1': tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1)),
    # Second 5*5 convolution, 32 inputs, 64 outputs
    'W_conv2': tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1)),
    # First fully connected, 28*28*64 inputs, 1024 outputs
    'W_fc': tf.Variable(tf.truncated_normal([28*28*64, 1024], stddev=0.1)),
    # Output fully connected, 1024 inputs, 10 outputs
    'out': tf.Variable(tf.truncated_normal([1024, n_classes], stddev=0.1))
}

biases = {
    'b_conv1': tf.Variable(tf.random_normal([32])),
    'b_conv2': tf.Variable(tf.random_normal([64])),
    'b_fc': tf.Variable(tf.constant(0.0, shape=[1024])),
    'out': tf.Variable(tf.constant(0.0, shape=[10]))
}

# Reshape input to a 4D tensor 
inputs = tf.reshape(x, shape=[-1, 28, 28, 1])
# Convolution Layer
conv1 = tf.nn.relu(conv2d(inputs, weights['W_conv1']) + biases['b_conv1'])
conv2 = tf.nn.relu(conv2d(conv1, weights['W_conv2']) + biases['b_conv2'])
# Fully-connected Layer
fc = tf.reshape(conv2, [-1, 28*28*64])
fc = tf.nn.relu(tf.matmul(fc, weights['W_fc']) + biases['b_fc'])
predictions = tf.matmul(fc, weights['out']) + biases['out']

  
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=predictions))
optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy) 

sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Build the evaluation ops once, outside the loop, so the graph does not grow.
# argmax over the logits gives the same class as argmax over softmax(logits).
correct_prediction = tf.equal(tf.argmax(predictions, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

num_batches = x_train.shape[0] // batch_size  # 60000 // 100 = 600
for epoch in range(num_epochs):
    print('----------------Epoch {}--------------------'.format(epoch+1))
    for i in range(num_batches):
        batch_xs = x_train[i*batch_size : (i+1)*batch_size, :]
        batch_ys = y_train[i*batch_size : (i+1)*batch_size, :]
        sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys})
        if i % 100 == 0 or i == num_batches - 1:
            # mnist.test serves as the validation set in this homework
            val_acc = sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})
            print('Iteration {}, Validation Accuracy: {}'.format(i, val_acc))


sess.close()



----------------Epoch 1--------------------
Iteration 0, Validation Accuracy: 0.09749999642372131
Iteration 100, Validation Accuracy: 0.3068999946117401
Iteration 200, Validation Accuracy: 0.4569999873638153
Iteration 300, Validation Accuracy: 0.6028000116348267
Iteration 400, Validation Accuracy: 0.6535999774932861
Iteration 500, Validation Accuracy: 0.7569000124931335
Iteration 599, Validation Accuracy: 0.7842000126838684
----------------Epoch 2--------------------
Iteration 0, Validation Accuracy: 0.7703999876976013
Iteration 100, Validation Accuracy: 0.8151999711990356
Iteration 200, Validation Accuracy: 0.8288999795913696
Iteration 300, Validation Accuracy: 0.838699996471405
Iteration 400, Validation Accuracy: 0.8439000248908997
Iteration 500, Validation Accuracy: 0.8605999946594238
Iteration 599, Validation Accuracy: 0.8677999973297119
----------------Epoch 3--------------------
Iteration 0, Validation Accuracy: 0.8689000010490417
Iteration 100, Validation Accuracy: 0.87559998035

**Q2(b) Answer**

To calculate the receptive field, assume that the input and the filters are square. Denote the filter size by $k \times k$, the stride of the layer by $s$, the current receptive field size by $r \times r$, and the effective stride (the distance between two adjacent features, measured in input pixels) by $j$. The expressions derived in Problem 1 are:

$j_{out} = j_{in} * s$

$r_{out} = r_{in} + (k-1)j_{in}$

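For the input image, $j_0 = r_0 = 1$. The network I am using has $s = 1$ and $k = 5$, and with 'SAME' padding each output feature map has the same spatial size as its input.

So, after the first convolutional layer, $j_1 = j_0 s = 1$ and $r_1 = r_0 + (k-1)j_0 = 5$.
After the second convolutional layer, $j_2 = j_1 s = 1$ and $r_2 = r_1 + (k-1)j_1 = 9$, so each unit in the second convolutional layer sees a $9 \times 9$ region of the $28 \times 28$ image.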
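
The recurrence above is easy to check in code. This is a small sketch, assuming each layer is described by a hypothetical `(kernel_size, stride)` pair; the function name `receptive_field` is illustrative.

```python
def receptive_field(layers):
    """Return (r, j) after applying layers given as (kernel_size, stride)."""
    r, j = 1, 1  # r_0 = j_0 = 1 for the input image
    for k, s in layers:
        r = r + (k - 1) * j  # r_out = r_in + (k - 1) * j_in
        j = j * s            # j_out = j_in * s
    return r, j

print(receptive_field([(5, 1)]))          # (5, 1)
print(receptive_field([(5, 1), (5, 1)]))  # (9, 1)
# Treating each 2x2, stride-2 pooling as a layer with k = 2, s = 2:
print(receptive_field([(5, 1), (2, 2), (5, 1), (2, 2)]))  # (16, 4)
```

The last call applies the same recurrence to a conv, pool, conv, pool stack; it is shown only to illustrate how stride enters through $j$.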

**Problem 3: Adding pooling and dropout to a 2-layer CNN with TensorFlow (24 points)**

To gain a better understanding of how pooling and dropout affect the performance of CNNs,
please try the following:

a) Add 2x2 pooling layers after each convolutional layer for the specification in Problem 2.

b) Add dropout after the first fully connected layer for the specification in Problem 2. You
can choose the probability of keeping (not setting to zero) hidden unit elements.

Notes:
• For the pooling layer, you are free to choose between max and average pooling, however
we encourage you to use max pooling.

In [14]:
#Q3(a). TF Core Implementation
# filter size 5*5, 2*2 pooling with stride 2, no dropout
num_epochs = 10
batch_size = 100
n_classes = 10
# TF Graph Input

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
def maxpool2d(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')


weights = {
    # First 5*5 convolution, 1 input image, 32 outputs
    'W_conv1': tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1)),
    # Second 5*5 convolution, 32 inputs, 64 outputs
    'W_conv2': tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1)),
    # First fully connected, 7*7*64 inputs, 1024 outputs
    'W_fc': tf.Variable(tf.truncated_normal([7*7*64, 1024], stddev=0.1)),
    # Output fully connected, 1024 inputs, 10 outputs
    'out': tf.Variable(tf.truncated_normal([1024, n_classes], stddev=0.1))
}

biases = {
    'b_conv1': tf.Variable(tf.random_normal([32])),
    'b_conv2': tf.Variable(tf.random_normal([64])),
    'b_fc': tf.Variable(tf.constant(0.0, shape=[1024])),
    'out': tf.Variable(tf.constant(0.0, shape=[10]))
}

# Reshape input to a 4D tensor 
inputs = tf.reshape(x, shape=[-1, 28, 28, 1])
# Convolution Layer
conv1 = tf.nn.relu(conv2d(inputs, weights['W_conv1']) + biases['b_conv1'])
pool1 = maxpool2d(conv1)
conv2 = tf.nn.relu(conv2d(pool1, weights['W_conv2']) + biases['b_conv2'])
pool2 = maxpool2d(conv2)
# Fully-connected Layer
fc = tf.reshape(pool2, [-1, 7*7*64])
fc = tf.nn.relu(tf.matmul(fc, weights['W_fc']) + biases['b_fc'])
predictions = tf.matmul(fc, weights['out']) + biases['out']

  
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=predictions))
optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy) 

sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Build the evaluation ops once, outside the loop, so the graph does not grow.
# argmax over the logits gives the same class as argmax over softmax(logits).
correct_prediction = tf.equal(tf.argmax(predictions, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

num_batches = x_train.shape[0] // batch_size  # 60000 // 100 = 600
for epoch in range(num_epochs):
    print('----------------Epoch {}--------------------'.format(epoch+1))
    for i in range(num_batches):
        batch_xs = x_train[i*batch_size : (i+1)*batch_size, :]
        batch_ys = y_train[i*batch_size : (i+1)*batch_size, :]
        sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys})
        if i % 100 == 0 or i == num_batches - 1:
            # mnist.test serves as the validation set in this homework
            val_acc = sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})
            print('Iteration {}, Validation Accuracy: {}'.format(i, val_acc))

sess.close()

----------------Epoch 1--------------------
Iteration 0, Validation Accuracy: 0.11349999904632568
Iteration 100, Validation Accuracy: 0.5482000112533569
Iteration 200, Validation Accuracy: 0.7017999887466431
Iteration 300, Validation Accuracy: 0.7153000235557556
Iteration 400, Validation Accuracy: 0.8154000043869019
Iteration 500, Validation Accuracy: 0.8363000154495239
Iteration 599, Validation Accuracy: 0.8562999963760376
----------------Epoch 2--------------------
Iteration 0, Validation Accuracy: 0.8567000031471252
Iteration 100, Validation Accuracy: 0.8679999709129333
Iteration 200, Validation Accuracy: 0.8641999959945679
Iteration 300, Validation Accuracy: 0.871399998664856
Iteration 400, Validation Accuracy: 0.8835999965667725
Iteration 500, Validation Accuracy: 0.8863999843597412
Iteration 599, Validation Accuracy: 0.891700029373169
----------------Epoch 3--------------------
Iteration 0, Validation Accuracy: 0.8937000036239624
Iteration 100, Validation Accuracy: 0.899200022220

In [15]:
#Q3(b). TF Core Implementation
# filter size 5*5, 2*2 pooling with stride 2, dropout rate 0.1
num_epochs = 10
batch_size = 100
n_classes = 10
# TF Graph Input
keep_prob = tf.placeholder(tf.float32)
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
def maxpool2d(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')


weights = {
    # First 5*5 convolution, 1 input image, 32 outputs
    'W_conv1': tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1)),
    # Second 5*5 convolution, 32 inputs, 64 outputs
    'W_conv2': tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1)),
    # First fully connected, 7*7*64 inputs, 1024 outputs
    'W_fc': tf.Variable(tf.truncated_normal([7*7*64, 1024], stddev=0.1)),
    # Output fully connected, 1024 inputs, 10 outputs
    'out': tf.Variable(tf.truncated_normal([1024, n_classes], stddev=0.1))
}

biases = {
    'b_conv1': tf.Variable(tf.random_normal([32])),
    'b_conv2': tf.Variable(tf.random_normal([64])),
    'b_fc': tf.Variable(tf.constant(0.0, shape=[1024])),
    'out': tf.Variable(tf.constant(0.0, shape=[10]))
}

# Reshape input to a 4D tensor 
inputs = tf.reshape(x, shape=[-1, 28, 28, 1])
# Convolution Layer
conv1 = tf.nn.relu(conv2d(inputs, weights['W_conv1']) + biases['b_conv1'])
pool1 = maxpool2d(conv1)
conv2 = tf.nn.relu(conv2d(pool1, weights['W_conv2']) + biases['b_conv2'])
pool2 = maxpool2d(conv2)
# Fully-connected Layer
fc = tf.reshape(pool2, [-1, 7*7*64])
fc = tf.nn.relu(tf.matmul(fc, weights['W_fc']) + biases['b_fc'])
fc_drop = tf.nn.dropout(fc, keep_prob)
predictions = tf.matmul(fc_drop, weights['out']) + biases['out']

  
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=predictions))
optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy) 

sess = tf.Session()
sess.run(tf.global_variables_initializer())

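# Build the evaluation ops once, outside the loop, so the graph does not grow.
# argmax over the logits gives the same class as argmax over softmax(logits).
correct_prediction = tf.equal(tf.argmax(predictions, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

num_batches = x_train.shape[0] // batch_size  # 60000 // 100 = 600
for epoch in range(num_epochs):
    print('----------------Epoch {}--------------------'.format(epoch+1))
    for i in range(num_batches):
        batch_xs = x_train[i*batch_size : (i+1)*batch_size, :]
        batch_ys = y_train[i*batch_size : (i+1)*batch_size, :]
        # Dropout is active during training (keep 90% of the hidden units)
        sess.run(optimizer, feed_dict={x: batch_xs, y: batch_ys, keep_prob: 0.9})
        if i % 100 == 0 or i == num_batches - 1:
            # Dropout is disabled for evaluation; mnist.test serves as the validation set
            val_acc = sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels, keep_prob: 1.0})
            print('Iteration {}, Validation Accuracy: {}'.format(i, val_acc))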

sess.close()

----------------Epoch 1--------------------
Iteration 0, Validation Accuracy: 0.07900000363588333
Iteration 100, Validation Accuracy: 0.45210000872612
Iteration 200, Validation Accuracy: 0.6517000198364258
Iteration 300, Validation Accuracy: 0.7516000270843506
Iteration 400, Validation Accuracy: 0.7817000150680542
Iteration 500, Validation Accuracy: 0.8209999799728394
Iteration 599, Validation Accuracy: 0.8327999711036682
----------------Epoch 2--------------------
Iteration 0, Validation Accuracy: 0.8422999978065491
Iteration 100, Validation Accuracy: 0.8568999767303467
Iteration 200, Validation Accuracy: 0.866100013256073
Iteration 300, Validation Accuracy: 0.8787000179290771
Iteration 400, Validation Accuracy: 0.8847000002861023
Iteration 500, Validation Accuracy: 0.8895000219345093
Iteration 599, Validation Accuracy: 0.88919997215271
----------------Epoch 3--------------------
Iteration 0, Validation Accuracy: 0.8939999938011169
Iteration 100, Validation Accuracy: 0.900799989700317

**Problem 4: Performance comparison (24 points)**

a) What is the validation accuracy of the CNN with and without pooling?

b) Did you observe any performance improvements after adding dropout?

c) How does the CNN model compare, in terms of performance, to the multi-class logistic
regression and multi-class MLP from HW2?

d) How does the number of trainable parameters in the CNN models compare to that of the
multi-class logistic regression and multi-class MLP from HW2?


**Q4 Answers:**

The CNNs I built have the following parameters: 

*   The training set contains 60,000 images, combining the original training and validation sets. The validation set contains the 10,000 images of the original test set.
*   Convolutional filter size = 5*5, stride = 1.
*   Padding is set to 'SAME' for all convolutional layers.
*   Batch size = 100, number of epochs = 10, so there are 600 iterations per epoch.
*   Stochastic gradient descent with a learning rate of 0.001 is used as the optimizer, and cross-entropy is used as the loss function.
*   Weights are initialized from a truncated normal distribution with stddev = 0.1. The biases of the convolutional filters are initialized from a normal distribution, and the biases of the fully connected layers are initialized to 0.
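
For reference, the loss named in the list above can be written out directly. This is a minimal NumPy sketch of softmax cross-entropy computed from logits, assuming one-hot labels as in the code cells; it mirrors what `tf.nn.softmax_cross_entropy_with_logits` computes per example but is not TensorFlow's implementation.

```python
import numpy as np

def softmax_cross_entropy_with_logits(labels, logits):
    """Per-example cross-entropy between one-hot labels and softmax(logits)."""
    # Subtract the row maximum so that exp() cannot overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_softmax).sum(axis=1)

logits = np.zeros((2, 10))   # a model that predicts all 10 classes equally
labels = np.eye(10)[[3, 7]]  # one-hot labels for digits 3 and 7
loss = softmax_cross_entropy_with_logits(labels, logits).mean()
print(loss)  # equals log(10) for uniform predictions
```

A freshly initialized classifier typically starts near this value, which is a useful sanity check before training.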

**(a)**
The validation accuracy of the vanilla CNN (2 convolutional layers followed by 2 fully connected layers) is 94.55%, as shown before. This is a decent result given that the CNN is very simple and was trained only briefly. The accuracy improves slightly, to 94.77%, after pooling layers are added. I also recorded the training time: the CNN without pooling layers took about 353 seconds, while the CNN with pooling layers took only about 89 seconds, roughly a 4x speedup.
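
One plausible contributor to the shorter training time is the size of the tensor fed to the first fully connected layer. The sketch below traces spatial sizes under `padding='SAME'`, where the output size is `ceil(size / stride)`. The helper names are illustrative, and the link to wall-clock time is an assumption rather than something measured here.

```python
import math

def same_out(size, stride):
    """Spatial output size for padding='SAME': ceil(size / stride)."""
    return math.ceil(size / stride)

def flatten_size(input_size, strides, channels):
    """Number of features after applying layers with the given strides."""
    size = input_size
    for s in strides:
        size = same_out(size, s)
    return size * size * channels

no_pool = flatten_size(28, [1, 1], 64)          # conv, conv
with_pool = flatten_size(28, [1, 2, 1, 2], 64)  # conv, pool, conv, pool
print(no_pool, with_pool, no_pool // with_pool)  # 50176 3136 16
```

These match the `28*28*64` and `7*7*64` input sizes of `W_fc` in the two models, so the first fully connected layer is 16 times smaller with pooling.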

**(b)**
Yes. The validation accuracy improves from 94.77% to 95.35% after adding dropout (keep probability 0.9), and the training time is almost the same as without dropout.
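
To make the dropout step concrete, here is a minimal NumPy sketch of inverted dropout, the scheme `tf.nn.dropout` uses: kept elements are scaled by `1 / keep_prob` so the expected activation is unchanged. The function name and the random generator argument are illustrative assumptions.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Inverted dropout: zero each element with prob 1 - keep_prob, scale the rest."""
    if keep_prob >= 1.0:
        return x  # evaluation mode: nothing is dropped
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
fc = np.ones((4, 1024))            # stand-in for the first FC layer's activations
train_out = dropout(fc, 0.9, rng)  # training: keep_prob = 0.9
eval_out = dropout(fc, 1.0, rng)   # evaluation: keep_prob = 1
print(train_out.mean(), eval_out.mean())
```

Because of the rescaling, no separate correction is needed at evaluation time; feeding `keep_prob = 1` is enough.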

**(c)**
The CNN models are clearly better than the logistic regression models from the previous HW. However, my most complex multi-class MLP performs slightly better than the CNN models, by about 1%. Several factors might cause this. For example, the learning rate of the MLP was explored intensively and the best value was chosen, while the learning rate of 0.001 might not be optimal for the CNNs. Also, the MLP has more layers than the CNNs, and since the MNIST dataset is not extremely complex, MLPs are also able to perform well.

**(d)**
The CNN models have far more trainable parameters than the multi-class MLPs or the logistic regression classifiers. CNNs are computationally expensive: they were trained on GPUs and still took much longer than the MLPs trained on CPUs. But we expect these additional costs to pay off when CNNs are applied to larger-scale problems or networks.
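
The parameter counts of the two CNNs follow directly from the variable shapes in the code cells. A short sketch: the CNN shapes are taken from the `weights` and `biases` dictionaries above, while the logistic regression layout (784 inputs, 10 outputs, with biases) is an assumption about HW2, which is not shown here.

```python
from math import prod

def count_params(shapes):
    """Total number of scalars across a list of tensor shapes."""
    return sum(prod(shape) for shape in shapes)

conv_shapes = [(5, 5, 1, 32), (32,), (5, 5, 32, 64), (64,)]
head = [(1024, 10), (10,)]
cnn_no_pool = count_params(conv_shapes + [(28 * 28 * 64, 1024), (1024,)] + head)
cnn_pool = count_params(conv_shapes + [(7 * 7 * 64, 1024), (1024,)] + head)
logreg = count_params([(784, 10), (10,)])  # assumed HW2 logistic regression layout
print(cnn_no_pool, cnn_pool, logreg)  # 51443594 3274634 7850
```

In both CNNs the first fully connected layer dominates the total, which is why adding pooling cuts the count by more than an order of magnitude.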


**Problem 5: Bookkeeping (4 points)**

**(a)** About 10 hours. It took some time to get familiar with DCC, and I was actually having some trouble requesting a GPU on DCC (I did manage to run programs on DCC a few weeks ago), so I completed the homework on Google Colab.

**(b)** I adhered to the Duke Community Standard in the completion of this assignment.