In [12]:
from __future__ import division, print_function, unicode_literals
import numpy as np
import tensorflow as tf


# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# Tensorflow and Deep Learning

In this lab assignment, first you will learn how to build and train a neural network that recognises handwritten digits, and then you will build LeNet-5 CNN architecture, which is widely used for handwritten digit recognition. At the end of this lab assignment, you will make AlexNet CNN architecture, which won the 2012 ImageNet ILSVRC challenge.

---
# 1. Dataset
In the first part of the assignment, we use the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents. There are 70,000 images, and each image has 784 features. This is because each image is 28×28=784 pixels, and each feature simply represents one pixel's intensity, from 0 (white) to 255 (black). The following figure shows a few images from the MNIST dataset to give you a feel for the complexity of the classification task.

<img src="figs/1-mnist.png" style="width: 300px;"/>

To begin the assignment, first, use `mnist_data.read_data_sets` and download images and labels. It return two lists, called `mnist.test` with 10K images+labels, and `mnist.train` with 60K images+labels.

In [13]:
# TODO: Replace <FILL IN> with appropriate code

from tensorflow.examples.tutorials.mnist import input_data as mnist_data

mnist = mnist_data.read_data_sets("official/mnist/dataset.py", one_hot=True)

print('Number of training examples: ' + str(mnist.train.num_examples))
print('Number of test examples: ' + str(mnist.test.num_examples))

Extracting official/mnist/dataset.py/train-images-idx3-ubyte.gz
Extracting official/mnist/dataset.py/train-labels-idx1-ubyte.gz
Extracting official/mnist/dataset.py/t10k-images-idx3-ubyte.gz
Extracting official/mnist/dataset.py/t10k-labels-idx1-ubyte.gz
Number of training examples: 55000
Number of test examples: 10000


---
# 2. A One-Layer Neural Network
<img src="figs/2-comic1.png" style="width: 500px;"/>

Let's start by building a one-layer neural network. Handwritten digits in the MNIST dataset are 28x28 pixel greyscale images. The simplest approach for classifying them is to use the 28x28=784 pixels as inputs for a **one-layer neural network**. Each neuron in the network does a weighted sum of all of its inputs, adds a bias and then feeds the result through some non-linear activation function. Here we design a one-layer neural network with 10 output neurons since we want to classify digits into 10 classes (0 to 9).
<img src="figs/3-one_layer.png" style="width: 400px;"/>


For a classification problem, an *activation function* that works well is **softmax**. Applying softmax on a vector is done by taking the exponential of each element and then normalising the vector.
<img src="figs/4-softmax.png" style="width: 300px;"/>

We can summarise the behaviour of this single layer of neurons into a simple formula using a *matrix multiply*. If we give input data into the network in *mini-batch* of 100 images, it produces 100 predictions as the output. We define the **weights matrix $W$** with 10 columns, in which each column indicates the weight of a one class (a single digit), from 0 to 9. Using the first column of $W$, we can compute the weighted sum of all the pixels of the first image. This sum corresponds to the first neuron that points to the number 0. Using the second column of $W$, we do the same for the second neuron (number 1) and so on until the 10th neuron. We can then repeat the operation for the remaining 99 images in the mini-batch. If we call $X$ the matrix containing our 100 images (each row corresponds to one digit), all the weighted sums for our 10 neurons, computed on 100 images are simply $X.W$. Each neuron must now add its bias. Since we have 10 neurons, we have 10 bias constants. We finally apply the **softmax activation function** and obtain the formula describing a one-layer neural network, applied to 100 images.
<img src="figs/5-xw.png" style="width: 600px;"/>
<img src="figs/6-softmax2.png" style="width: 500px;"/>

Then, we need to use the **cross-entropy** to measure how good the predictions are, i.e., the distance between what the network tells us and what we know to be the truth. The cross-entropy is a function of weights, biases, pixels of the training image and its known label. If we compute the partial derivatives of the cross-entropy relatively to all the weights and all the biases, we obtain a **gradient**, computed for a given image, label and present value of weights and biases. We can update weights and biases by a fraction of the gradient and do the same thing again using the next batch of training images.
<img src="figs/7-cross_entropy.png" style="width: 600px;"/>

### Define Variables and Placeholders
First we define TensorFlow **variables** and **placeholders**. *Variables* are all the parameters that you want the training algorithm to determine for you (e.g., weights and biases). *Placeholders* are parameters that will be filled with actual data during training (e.g., training images). The shape of the tensor holding the training images is [None, 28, 28, 1] which stands for:
  - 28, 28, 1: our images are 28x28 (784) pixels x 1 value per pixel (grayscale). The last number would be 3 for color images and is not really necessary here.
  - None: this dimension will be the number of images in the mini-batch. It will be known at training time.

We also need an additional placeholder for the training labels that will be provided alongside training images.

In [14]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with 1 layer of 10 softmax neurons
#
# · · · · · · · · · ·       (input data, flattened pixels)       X [batch, 784] 
# \x/x\x/x\x/x\x/x\x/    -- fully connected layer (softmax)      W      b[10]
#   · · · · · · · ·                                              Y_hat 

# input X: 28x28 grayscale images, the first dimension (None) will index the images in the mini-batch
X = tf.placeholder(tf.float32, [None, 28,28,1], name="X")
print(X)


# weights W[784, 10], 784 = 28 * 28
W = tf.Variable(tf.zeros([784, 10]))
                  

# biases b[10]
b = tf.Variable(tf.zeros([10]))

# correct answers will go here                  
Y_true= tf.placeholder(tf.float32, [None, 10], name="Y_true")
print(Y_true)

Tensor("X_1:0", shape=(?, 28, 28, 1), dtype=float32)
Tensor("Y_true:0", shape=(?, 10), dtype=float32)


### Build The Model
Now, we can make a **model** for a one-layer neural network. The formula is the one we explained before, i.e., $\hat{Y} = softmax(X . W + b)$. You can use the `tf.nn.softmax` and `tf.matmul` to build the model. Here, we need to use the `tf.reshape` to transform our 28x28 images into single vectors of 784 pixels.

In [15]:
# TODO: Replace <FILL IN> with appropriate code

# flatten the images into a single line of pixels
XX = tf.reshape(X, [100, 784])

# The model
z = tf.matmul(XX,W)+b
Y_hat = tf.nn.softmax(z)


### Define The Cost Function
Now, we have model predictions $\hat{Y}$ and correct labels $Y$, so for each instance $i$ (image) we can compute the cross-entropy as the **cost function**: $cross\_entropy = -\sum(Y_i * log(\hat{Y}i))$. You can use `reduce_mean` to add all the components in a tensor.

In [16]:
# TODO: Replace <FILL IN> with appropriate code
cros_entropy = -Y_true * tf.log(Y_hat) - (1 - Y_true) * tf.log(1 - Y_hat)
cros_entropy = tf.reduce_mean(cros_entropy)

### Traine the Model
Now, select the gradient descent optimiser `GradientDescentOptimizer` and ask it to minimise the cross-entropy cost. In this step, TensorFlow computes the partial derivatives of the cost function relatively to all the weights and all the biases (the gradient). The gradient is then used to update the weights and biases. Set the learning rate is $0.005$.

In [17]:
# TODO: Replace <FILL IN> with appropriate code

learning_rate = 0.005
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_step = optimizer.minimize(cros_entropy)


### Execute the Model
It is time to run the training loop. All the TensorFlow instructions up to this point have been preparing a computation graph in memory but nothing has been computed yet. The computation requires actual data to be fed into the placeholders. This is supplied in the form of a Python dictionary, where the keys are the names of the placeholders. During the trainig print out the cost every 200 steps. Moreove, after training the model, print out the accurray of the model by testing it on the test data.

In [26]:
# TODO: Replace <FILL IN> with appropriate code

# init
init = tf.global_variables_initializer()


batch_size=100
n_epochs = 300 #reduce for a quick execution, cost vanishes at epoch 2000 aprox.
slope=mnist.train.num_examples // batch_size
print(mnist.train.num_examples)
print(mnist.test.num_examples)
print(slope)

with tf.Session() as sess:
    init.run()
    
    for epoch in range(n_epochs):
        for iteration in range(slope):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            #print(y_batch.size)
            #print(X_batch.size)
            _, loss_val = sess.run([train_step,cros_entropy], feed_dict={XX: X_batch, Y_true: y_batch}) #run the train step and cost function for the input data
            #feed_dict feeds the placeholders with data
            if iteration%200==0:
                print(iteration)
                print( "Cost = %s" % loss_val)#Cost at a certain step
        print(epoch)#print the epoch in which the function is  
    
    #accuracy test    
    acc=tf.reduce_max(Y_hat, 1)
    test_X, test_y= mnist.test.next_batch(batch_size)
    print(acc.eval(feed_dict={XX:test_X, Y:test_y }))  #test the data with the test set to see the accuracy of the model

   
    
         

55000
10000
550
0
Cost = 0.32508314
200
Cost = 0.31210315
400
Cost = 0.29987627
0
0
Cost = 0.29213715
200
Cost = 0.28010443
400
Cost = 0.26891443
1
0
Cost = 0.26218578
200
Cost = 0.2565016
400
Cost = 0.2415565
2
0
Cost = 0.24660689
200
Cost = 0.22682707
400
Cost = 0.23079038
3
0
Cost = 0.23057832
200
Cost = 0.21536942
400
Cost = 0.19280526
4
0
Cost = 0.20838009
200
Cost = 0.19906619
400
Cost = 0.20144755
5
0
Cost = 0.18333569
200
Cost = 0.18855198
400
Cost = 0.1790279
6
0
Cost = 0.16281362
200
Cost = 0.1753126
400
Cost = 0.16258904
7
0
Cost = 0.16687177
200
Cost = 0.1608328
400
Cost = 0.15971848
8
0
Cost = 0.16770692
200
Cost = 0.14212179
400
Cost = 0.17602938
9
0
Cost = 0.16602331
200
Cost = 0.15733893
400
Cost = 0.17334458
10
0
Cost = 0.13741134
200
Cost = 0.12526998
400
Cost = 0.14444046
11
0
Cost = 0.1281639
200
Cost = 0.14091326
400
Cost = 0.14127633
12
0
Cost = 0.12905674
200
Cost = 0.13145027
400
Cost = 0.12779611
13
0
Cost = 0.13750698
200
Cost = 0.11705674
400
Cost = 0.1231396

200
Cost = 0.055380926
400
Cost = 0.06120811
121
0
Cost = 0.0599365
200
Cost = 0.080620624
400
Cost = 0.07950398
122
0
Cost = 0.082313016
200
Cost = 0.06288645
400
Cost = 0.046082366
123
0
Cost = 0.074027404
200
Cost = 0.08885612
400
Cost = 0.06578214
124
0
Cost = 0.07393935
200
Cost = 0.05366336
400
Cost = 0.056770198
125
0
Cost = 0.05885835
200
Cost = 0.0965691
400
Cost = 0.050256904
126
0
Cost = 0.05783076
200
Cost = 0.07351293
400
Cost = 0.075822264
127
0
Cost = 0.0699606
200
Cost = 0.056903016
400
Cost = 0.08180697
128
0
Cost = 0.059116118
200
Cost = 0.07804675
400
Cost = 0.046790138
129
0
Cost = 0.05947342
200
Cost = 0.069321945
400
Cost = 0.0638456
130
0
Cost = 0.06617261
200
Cost = 0.06543575
400
Cost = 0.059919637
131
0
Cost = 0.060813714
200
Cost = 0.06117127
400
Cost = 0.060949832
132
0
Cost = 0.07952425
200
Cost = 0.07647197
400
Cost = 0.061985042
133
0
Cost = 0.05619465
200
Cost = 0.06515457
400
Cost = 0.0753953
134
0
Cost = 0.03792768
200
Cost = 0.07368756
400
Cost = 0.05

400
Cost = 0.05083828
239
0
Cost = 0.05500557
200
Cost = 0.061244503
400
Cost = 0.05907586
240
0
Cost = 0.05704519
200
Cost = 0.045893677
400
Cost = 0.061222907
241
0
Cost = 0.076169446
200
Cost = 0.044740517
400
Cost = 0.047478728
242
0
Cost = 0.0777979
200
Cost = 0.053856656
400
Cost = 0.04294823
243
0
Cost = 0.07162991
200
Cost = 0.06693899
400
Cost = 0.043948386
244
0
Cost = 0.070287414
200
Cost = 0.057280786
400
Cost = 0.05975762
245
0
Cost = 0.046964426
200
Cost = 0.05832266
400
Cost = 0.07150976
246
0
Cost = 0.05375249
200
Cost = 0.08889392
400
Cost = 0.07568609
247
0
Cost = 0.03641413
200
Cost = 0.06203193
400
Cost = 0.06373622
248
0
Cost = 0.07181938
200
Cost = 0.07716192
400
Cost = 0.06043835
249
0
Cost = 0.060638797
200
Cost = 0.060573403
400
Cost = 0.04682659
250
0
Cost = 0.09325732
200
Cost = 0.04720495
400
Cost = 0.07184334
251
0
Cost = 0.07291748
200
Cost = 0.044299584
400
Cost = 0.07713153
252
0
Cost = 0.04280526
200
Cost = 0.050750867
400
Cost = 0.062167794
253
0
Cost 

In [29]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with five layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (sigmoid)      W1 [784, 200]      B1 [200]
#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (sigmoid)      W2 [200, 100]      B2 [100]
#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (sigmoid)      W3 [100, 60]       B3 [60]
#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (sigmoid)      W4 [60, 30]        B4 [30]
#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5 [10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, [None, 28,28,1], name="X")
Y = tf.placeholder(tf.float32, [None, 10], name="Y")

# five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10
W1 = tf.get_variable("W1", dtype=tf.float32,initializer=tf.zeros((784, 200)))
B1 = tf.get_variable("B1", dtype=tf.float32, initializer=tf.zeros((200)))

W2 = tf.get_variable("W2", dtype=tf.float32,initializer=tf.zeros((200, 100)))
B2 = tf.get_variable("B2", dtype=tf.float32, initializer=tf.zeros((100)))

W3 = tf.get_variable("W3", dtype=tf.float32,initializer=tf.zeros((100, 60)))
B3 = tf.get_variable("B3", dtype=tf.float32, initializer=tf.zeros((60)))

W4 = tf.get_variable("W4", dtype=tf.float32,initializer=tf.zeros((60, 30)))
B4 = tf.get_variable("B4", dtype=tf.float32, initializer=tf.zeros((30)))

W5 = tf.get_variable("W5", dtype=tf.float32,initializer=tf.zeros((30, 10)))
B5 = tf.get_variable("B5", dtype=tf.float32, initializer=tf.zeros((10)))

########################################
# build the model
########################################
XX = tf.reshape(X, [100, 784])

# make the network
Y1_hat = tf.nn.sigmoid(tf.matmul(XX, W1) + B1)
Y2_hat = tf.nn.sigmoid(tf.matmul(Y1_hat, W2) + B2)
Y3_hat = tf.nn.sigmoid(tf.matmul(Y2_hat, W3) + B3)
Y4_hat = tf.nn.sigmoid(tf.matmul(Y3_hat, W4) + B4)
Y_hat = tf.nn.softmax(tf.matmul(Y4_hat, W5) + B5)

########################################
# define the cost function
########################################
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(logits=Y_hat, labels=Y)
cost = tf.reduce_mean(cross_entropy)
########################################
# define the optimizer
########################################
learning_rate = 0.005
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_step = optimizer.minimize(cost)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()

batch_size = 100

n_epochs = 300
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        training_X, training_y = mnist.train.next_batch(batch_size)
        sess.run([train_step, cost], feed_dict={XX: training_X, Y: training_y})#run the model with training data
    acc=tf.reduce_max(Y_hat, 1)
    test_X, test_y= mnist.test.next_batch(batch_size)#test the accuraccy with test data
    print(acc.eval(feed_dict={XX:test_X, Y:test_y })) # accuracy is stucck at 0.1

[0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229
 0.10018229 0.10018229 0.10018229 0.10018229 0.10018229 0.1001

---
# 4. Special Care for Deep Networks
As layers were added, neural networks tended to converge with more difficulties. For example, the accuracy could stuck at 0.1. Here, we want to apply some updates to the network we built in the previous part to improve its performance. 

### ReLU Activation Function
<img src="figs/10-comic3.png" style="width: 500px;"/>
The sigmoid activation function is actually quite problematic in deep networks. It squashes all values between 0 and 1 and when you do so repeatedly, neuron outputs and their gradients can vanish entirely. An alternative activation function is **ReLU** that shows better performance compare to sigmoid. It looks like as below:
<img src="figs/11-relu.png" style="width: 300px;"/>

### A Better Optimizer
In very high dimensional spaces like here, **saddle points** are frequent. These are points that are not local minima, but where the gradient is nevertheless zero and the gradient descent optimizer stays stuck there. One possible solution to tackle this probelm is to use better optimizers, such as Adam optimizer `tf.train.AdamOptimizer`.

### Random Initialisations
When working with ReLUs, the best practice is to initialise bias values to small positive values, so that neurons operate in the non-zero range of the ReLU initially.

### Learning Rate
<img src="figs/12-comic4.png" style="width: 500px;"/>
With two, three or four intermediate layers, you can now get close to 98% accuracy, if you push the iterations to 5000 or beyond. But, the results are not very consistent, and the curves jump up and down by a whole percent. A good solution is to start fast and decay the learning rate exponentially from $0.005$ to $0.0001$ for example. In order to pass a different learning rate to the `AdamOptimizer` at each iteration, you will need to define a new placeholder and feed it a new value at each iteration through `feed_dict`. Here is the formula for exponential decay: $learning\_rate = lr\_min + (lr\_max - lr\_min) * e^{\frac{-i}{2000}}$, where $i$ is the iteration number.

### NaN?
In the network you built in the last section, you might see accuracy curve crashes and the console outputs NaN for the cross-entropy. It may happen, because you are attempting to compute a $log(0)$, which is indeed Not A Number (NaN). Remember that the cross-entropy involves a log, computed on the output of the softmax layer. Since softmax is essentially an exponential, which is never zero, we should be fine, but with 32 bit precision floating-point operations, exp(-100) is already a genuine zero. TensorFlow has a handy function that computes the softmax and the cross-entropy in a single step, implemented in a numerically stable way. To use it, you will need to separate the weighted sum plus bias on the last layer, before softmax is applied and then give it with the true values to the function `tf.nn.softmax_cross_entropy_with_logits`.

In the code below, apply the following changes and show their impact on the accuracy of the model on training data, as well as the test data:
* Replace the sigmoid activation function with ReLU
* Use the Adam optimizer
* Initialize weights with small random values between -0.2 and +0.2, and make sure biases are initialised with small positive values, for example 0.1
* Update the learning rate in different iterations. Start fast and decay the learning rate exponentially from $0.005$ to $0.0001$, i.e., 
```
max_learning_rate = 0.005
min_learning_rate = 0.0001
decay_speed = 2000.0
```
* Use `tf.nn.softmax_cross_entropy_with_logits` to prevent getting NaN in output.

In [30]:
# TODO: Replace <FILL IN> with appropriate code
import math

# neural network with 5 layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (sigmoid)      W1 [784, 200]      B1[200]
#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (sigmoid)      W2 [200, 100]      B2[100]
#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (sigmoid)      W3 [100, 60]       B3[60]
#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (sigmoid)      W4 [60, 30]        B4[30]
#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5[10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, [None, 28,28,1], name="X")
Y = tf.placeholder(tf.float32, [None, 10], name="Y")
learnint_rate = tf.placeholder(tf.float32, shape=[], name = "learning_rate")
# variable learning rate
max_learning_rate = 0.005
min_learning_rate = 0.0001
decay_speed = 2000.0


# five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10
# when using RELUs, make sure biases are initialised with small positive values, for example 0.1
W1 = tf.Variable(tf.random_uniform([784, 200], -0.2, 0.2), name="W1")
B1 = tf.Variable(tf.random_uniform([200], 0.1, 0.1), name="B1")

W2 = tf.Variable(tf.random_uniform([200, 100], -0.2, 0.2), name="W2")
B2 = tf.Variable(tf.random_uniform([100], 0.1, 0.1), name="B2")

W3 = tf.Variable(tf.random_uniform([100, 60], -0.2, 0.2), name="W3")
B3 = tf.Variable(tf.random_uniform([60], 0.1, 0.1), name="B3")

W4 = tf.Variable(tf.random_uniform([60, 30], -0.2, 0.2), name="W4")
B4 = tf.Variable(tf.random_uniform([30], 0.1, 0.1), name="B4")

W5 = tf.Variable(tf.random_uniform([30, 10], -0.2, 0.2), name="W5")
B5 = tf.Variable(tf.random_uniform([10], 0.1, 0.1), name="B5")

########################################
# build the model
########################################
XX = tf.reshape(X, [100, 784])

Y1_hat = tf.nn.relu(tf.matmul(XX, W1) + B1)
Y2_hat = tf.nn.relu(tf.matmul(Y1_hat, W2) + B2)
Y3_hat = tf.nn.relu(tf.matmul(Y2_hat, W3) + B3)
Y4_hat = tf.nn.relu(tf.matmul(Y3_hat, W4) + B4)
Y_hat = tf.nn.softmax(tf.matmul(Y4_hat, W5) + B5)

########################################
# defining the cost function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=Y_hat, labels=Y) 
cost = tf.reduce_mean(cross_entropy) * 100

########################################
# define the optimizer
########################################
optimizer = tf.train.AdamOptimizer(min_learning_rate)
train_step = optimizer.minimize(cross_entropy)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()

batch_size=100
slope=mnist.train.num_examples // batch_size

n_epochs = 100
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(slope):
            #print(iteration) 
            learning_r = min_learning_rate + (max_learning_rate-min_learning_rate)*math.exp((-iteration)/decay_speed)
            feed_dict={learning_rate:learning_r} 
            training_X, training_y = mnist.train.next_batch(batch_size)
            sess.run([train_step, cost], feed_dict={XX: training_X, Y: training_y})
        print(epoch) 
    acc=tf.reduce_max(Y_hat, 1)
    test_X, test_y= mnist.test.next_batch(batch_size)#test the accuraccy with test data
    print(acc.eval(feed_dict={XX:test_X, Y:test_y })) #test accuracy with test data

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
[1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         0.9680848  1.
 1.         1.         0.9999999  1.         1.         0.9999988
 0.99998844 1.         1.         1.         1.         1.
 1.         0.96208954 1.         1.         1.         1.
 1.         1.         1.         1.         1.         0.9999974
 0.9999976  1.         1.         0.99999964 1.         1.
 1.         1.         1.         0.9999999  1.         1.
 1.         1.         0.99999917 0.99999905 1.         1.
 1.         1.         1.         1.         1.         1.
 0.9999212  1.         1.         1.         0.99999917 1.
 1.         1.         1.         1.         1.

---
# 5. Overfitting and Dropout
<img src="figs/13-comic5.png" style="width: 500px;"/>
You will have noticed that cross-entropy curves for test and training data start disconnecting after a couple thousand iterations. The learning algorithm works on training data only and optimises the training cross-entropy accordingly. It never sees test data so it is not surprising that after a while its work no longer has an effect on the test cross-entropy which stops dropping and sometimes even bounces back up. 
<img src="figs/14-overfit.png" style="width: 500px;"/>
This disconnect is usually labeled **overfitting** and when you see it, you can try to apply a regularisation technique called **dropout**. In dropout, at each training iteration, you drop random neurons from the network. You choose a probability `pkeep` for a neuron to be kept, usually between 50% and 75%, and then at each iteration of the training loop, you randomly remove neurons with all their weights and biases. Different neurons will be dropped at each iteration. When testing the performance of your network of course you put all the neurons back (`pkeep = 1`).
<img src="figs/15-dropout.png" style="width: 500px;"/>
TensorFlow offers a dropout function to be used on the outputs of a layer of neurons. It randomly zeroes-out some of the outputs and boosts the remaining ones by `1 / pkeep`. You can add dropout after each intermediate layer in the network now. 

In the following code, use the dropout between each layer during the training, and set the probability `pkeep` once to $50%$ and another time to $75%$ and compare their results.

In [31]:
# TODO: Replace <FILL IN> with appropriate code

# neural network with 5 layers
#
# · · · · · · · · · ·          (input data, flattened pixels)       X [batch, 784]   
# \x/x\x/x\x/x\x/x\x/       -- fully connected layer (sigmoid)      W1 [784, 200]      B1[200]
#  · · · · · · · · ·                                                Y1_hat [batch, 200]
#   \x/x\x/x\x/x\x/         -- fully connected layer (sigmoid)      W2 [200, 100]      B2[100]
#    · · · · · · ·                                                  Y2_hat [batch, 100]
#     \x/x\x/x\x/           -- fully connected layer (sigmoid)      W3 [100, 60]       B3[60]
#      · · · · ·                                                    Y3_hat [batch, 60]
#       \x/x\x/             -- fully connected layer (sigmoid)      W4 [60, 30]        B4[30]
#        · · ·                                                      Y4_hat [batch, 30]
#         \x/               -- fully connected layer (softmax)      W5 [30, 10]        B5[10]
#          ·                                                        Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, [None, 28,28,1], name="X")
Y = tf.placeholder(tf.float32, [None, 10], name="Y")

# variable learning rate
max_learning_rate = 0.005
min_learning_rate = 0.0001
decay_speed = 2000.0

# probability of keeping a node during dropout = 1.0 at test time (no dropout) and 0.75 at training time
pkeep = 0.75

# five layers and their number of neurons, i.e., 200, 100, 60, 30, and 10
# when using RELUs, make sure biases are initialised with small positive values, for example 0.1
W1 = tf.Variable(tf.random_uniform([784, 200], -0.2, 0.2), name="W1")
B1 = tf.Variable(tf.random_uniform([200], 0.1, 0.1), name="B1")

W2 = tf.Variable(tf.random_uniform([200, 100], -0.2, 0.2), name="W2")
B2 = tf.Variable(tf.random_uniform([100], 0.1, 0.1), name="B2")

W3 = tf.Variable(tf.random_uniform([100, 60], -0.2, 0.2), name="W3")
B3 = tf.Variable(tf.random_uniform([60], 0.1, 0.1), name="B3")

W4 = tf.Variable(tf.random_uniform([60, 30], -0.2, 0.2), name="W4")
B4 = tf.Variable(tf.random_uniform([30], 0.1, 0.1), name="B4")

W5 = tf.Variable(tf.random_uniform([30, 10], -0.2, 0.2), name="W5")
B5 = tf.Variable(tf.random_uniform([10], 0.1, 0.1), name="B5")

########################################
# build the model
########################################
XX = tf.reshape(X, [100, 784])

Y1_hat = tf.nn.relu(tf.matmul(XX, W1) + B1)
Y1_hat_dropout = tf.nn.dropout(Y1_hat, pkeep)
Y2_hat_dropout = tf.nn.dropout(tf.nn.relu(tf.matmul(Y1_hat_dropout, W2) + B2), pkeep)
Y3_hat_dropout = tf.nn.dropout(tf.nn.relu(tf.matmul(Y2_hat_dropout, W3) + B3), pkeep)
Y4_hat_dropout = tf.nn.dropout(tf.nn.relu(tf.matmul(Y3_hat_dropout, W4) + B4), pkeep)
Y_hat = y = tf.nn.softmax(tf.nn.relu(tf.matmul(Y4_hat_dropout, W5) + B5))

########################################
# define the cost function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=Y_hat, labels=Y) 
cost = tf.reduce_mean(cross_entropy)

########################################
# define the optimizer
########################################
optimizer = tf.train.AdamOptimizer(min_learning_rate)
train_step = optimizer.minimize(cost)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()

batch_size=100
slope=mnist.train.num_examples//batch_size

n_epochs = 10
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(slope):
            #print(iteration) 
            learning_r = min_learning_rate + (max_learning_rate-min_learning_rate)*math.exp((-iteration)/decay_speed)
            feed_dict={learning_rate:learning_r} 
            training_X, training_y = mnist.train.next_batch(batch_size)
            sess.run([train_step, cost], feed_dict={XX: training_X, Y: training_y})
        print(epoch)
    acc=tf.reduce_max(Y_hat, 1)
    test_X, test_y= mnist.test.next_batch(batch_size)#test the accuraccy with test data
    print(acc.eval(feed_dict={XX:test_X , Y:test_y })) #test accuracy with test data

0
1
2
3
4
5
6
7
8
9
[0.9751526  0.9998053  0.7996005  0.77306837 0.99988174 0.84628135
 1.         0.9999033  0.99983096 0.9996511  0.99974924 1.
 0.9943494  0.99896955 1.         0.9990095  0.79495865 0.9996693
 0.99981993 0.9997708  0.9999801  0.9999722  1.         0.9998932
 0.99999475 1.         1.         1.         0.95756626 1.
 0.9671855  0.93731266 0.9997944  0.9999616  0.9999695  1.
 0.9996511  0.99882144 0.99999917 0.9986759  0.99982786 0.9993949
 0.5741834  0.881731   0.9998621  0.99959034 0.9697167  0.9979095
 1.         0.9999964  0.9999999  1.         0.98137313 0.99146795
 1.         1.         0.8848164  0.99956995 0.98467755 0.9998356
 0.9995757  0.9999988  0.99733436 1.         0.9999629  0.9999826
 0.99788374 0.9949293  1.         0.9797335  0.9997662  1.
 1.         0.99999416 1.         0.9992094  0.9999963  1.
 0.9998017  0.9999963  0.99137115 0.9751642  0.9999906  0.67751634
 0.8899663  0.9999999  0.8158268  0.999956   0.6064799  0.9999845
 0.89847803 0.9999814 

---
# 6. Convolutional Network
<img src="figs/16-comic6.png" style="width: 500px;"/>
In the previous sections, all pixels of images flattened into a single vector, which was a really bad idea. Handwritten digits are made of shapes and we discarded the shape information when we flattened the pixels. However, we can use **convolutional neural networks (CNN)** to take advantage of shape information. CNNs apply *a series of filters* to the raw pixel data of an image to extract and learn higher-level features, which the model can then use for classification. CNNs contains three components:
  - **Convolutional layers**: apply a specified number of convolution filters to the image. For each subregion, the layer performs a set of mathematical operations to produce a single value in the output feature map. Convolutional layers then typically apply a ReLU activation function to the output to introduce nonlinearities into the model.
  - **Pooling layers**: downsample the image data extracted by the convolutional layers to reduce the dimensionality of the feature map in order to decrease processing time. A commonly used pooling algorithm is max pooling, which extracts subregions of the feature map (e.g., 2x2-pixel tiles), keeps their maximum value, and discards all other values.
  - **Dense (fully connected) layers**: perform classification on the features extracted by the convolutional layers and downsampled by the pooling layers. In a dense layer, every node in the layer is connected to every node in the preceding layer.
  
Typically, a CNN is composed of a *stack of **convolutional modules*** that perform feature extraction. Each *module* consists of a *convolutional layer* followed by a *pooling layer*. The last convolutional module is followed by one or more dense layers that perform classification. The final dense layer in a CNN contains a single neuron for each target class in the model, with a softmax activation function to generate a value between 0-1 for each neuron. We can interpret the softmax values for a given image as relative measurements of how likely it is that the image falls into each target class.

Now, let us build a convolutional network for handwritten digit recognition. In this assignment, we will use the architecture shown in the following figure that has three convolutional layers, one fully-connected layer, and one softmax layer. Notice that the second and third convolutional layers have a stride of two that explains why they bring the number of output values down from 28x28 to 14x14 and then 7x7. A convolutional layer requires a weights tensor like `[4, 4, 3, 2]`, in which the first two numbers define the size of a filter (map), the third number shows the *depth* of the filter that is the number of *input channel*, and the last number shows the number of *output channel*. The output channel defines the number of times that we repeat the same thing with a different set of weights in one layer. In our implementation, we assume the output depth of first three convolutional layers, are 4, 8, 12, and the size of fully connected layer is 200.
<img src="figs/17-arch1.png" style="width: 600px;"/>

Convolutional layers can be implemented in TensorFlow using the `tf.nn.conv2d` function, which performs the scanning of the input image in both directions using the supplied weights. This is only the weighted sum part of the neuron. You still need to add a bias and feed the result through an activation function. The padding strategy that works here is to copy pixels from the sides of the image. All digits are on a uniform background so this just extends the background and should not add any unwanted shapes.

In [33]:
# · · · · · · · · · ·      (input data, 1-deep)               X [batch, 28, 28, 1]
# @ @ @ @ @ @ @ @ @ @   -- conv. layer 5x5x1=>4 stride 1      W1 [5, 5, 1, 4]        B1 [4]
# ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                         Y1_hat [batch, 28, 28, 4]
#   @ @ @ @ @ @ @ @     -- conv. layer 5x5x4=>8 stride 2      W2 [5, 5, 4, 8]        B2 [8]
#   ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                           Y2_hat [batch, 14, 14, 8]
#     @ @ @ @ @ @       -- conv. layer 4x4x8=>12 stride 2     W3 [4, 4, 8, 12]       B3 [12]
#     ∶∶∶∶∶∶∶∶∶∶∶                                             Y3_hat [batch, 7, 7, 12] => reshaped to YY [batch, 7*7*12]
#      \x/x\x\x/        -- fully connected layer (relu)       W4 [7*7*12, 200]       B4 [200]
#       · · · ·                                               Y4_hat [batch, 200]
#       \x/x\x/         -- fully connected layer (softmax)    W5 [200, 10]           B5 [10]
#        · · ·                                                Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

# load data
X_train = mnist.train.images
Y_train= mnist.train.labels
X_test = mnist.test.images
Y_test = mnist.test.labels

# Reshaping to format which CNN expects (batch, height, width, channels)
X_train = X_train.reshape(X_train.shape[0],28, 28, 1).astype('float32')
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32')


########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, [None, 28, 28, 1])
Y = tf.placeholder(tf.float32, [None, 10])
learning_rate = 0.003

# three convolutional layers with their channel counts, and a fully connected layer 
# (the last layer has 10 softmax neurons)
# the output depth of first three convolutional layers, are 4, 8, 12, and the size of fully connected
# layer is 200
W1 = tf.Variable(tf.truncated_normal([5, 5, 1, 4], stddev=0.1))
B1 = tf.Variable(tf.constant(0.1, tf.float32, [4])) #layer depth of 4

W2 = tf.Variable(tf.truncated_normal([5, 5, 4, 8], stddev=0.1))
B2 = tf.Variable(tf.constant(0.1, tf.float32, [8])) #layer depth of 8

W3 = tf.Variable(tf.truncated_normal([4, 4, 8, 12], stddev=0.1))
B3 = tf.Variable(tf.constant(0.1, tf.float32, [12])) #layer depth of 12

W4 = tf.Variable(tf.truncated_normal([7*7*12, 200], stddev=0.1))
B4 = tf.Variable(tf.constant(0.1, tf.float32, [200]))

W5 = tf.Variable(tf.truncated_normal([200, 10], stddev=0.1))
B5 = tf.Variable(tf.constant(0.1, tf.float32, [10]))

########################################
# build the model
########################################
stride = 1  # output is 28x28
Y1_hat = tf.nn.relu(tf.nn.conv2d(X, W1, strides=[1, stride, stride, 1], padding='SAME') + B1)

stride = 2  # output is 14x14
Y2_hat = tf.nn.relu(tf.nn.conv2d(Y1_hat, W2, strides=[1, stride, stride, 1], padding='SAME') + B2)

stride = 2  # output is 7x7
Y3_hat = tf.nn.relu(tf.nn.conv2d(Y2_hat, W3, strides=[1, stride, stride, 1], padding='SAME') + B3)

# reshape the output from the third convolution for the fully connected layer
YY_hat = tf.reshape(Y3_hat, shape=[-1, 7*7*12])
Y4_hat = tf.nn.relu(tf.matmul(YY_hat, W4) + B4) #ReLu from 3rd conv reshaped and the dense layer
Ylogits = tf.matmul(Y4_hat, W5) + B5
Y_hat = tf.nn.softmax(Ylogits)

########################################
# define the cost function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=Ylogits, labels= Y)
cost = tf.reduce_mean(cross_entropy) * 100

########################################
# define the optmizer
########################################
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(cost)

# accuracy of the trained model, between 0 (worst) and 1 (best)
correct_prediction = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_hat, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
predictions = tf.argmax(Y, 1)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()
batch_size=100
n_epochs = 10 #reduce for a quick execution, cost vanishes at epoch 2000 aprox.
slope=mnist.train.num_examples // batch_size

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        epoch_loss =sess.run([train_step], feed_dict={X: X_train, Y: Y_train})
        print(epoch)
    correct = tf.equal(tf.argmax(Y_hat, 1), tf.argmax(Y, 1))

    accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
    print('Accuracy:',accuracy.eval({X:X_test, Y:Y_test}))

0
1
2
3
4
5
6
7
8
9
Accuracy: 0.6859


# 7. Improve The Performance
A good approach to sizing your neural networks is to implement a network that is a little too constrained, then give it a bit more degrees of freedom and add dropout to make sure it is not overfitting. This ends up with a fairly optimal network for your problem. In the above model, we set the output channel to 4 in the first convolutional layer, which means that we repeat the same filter shape (but with different weights) four times. If we assume that those filters evolve during training into shape recognisers, you can intuitively see that this might not be enough for our problem. Handwritten digits are made from more than 4 elemental shapes. So let us bump up the filter sizes a little, and also increase the number of filters in our convolutional layers from 4, 8, 12 to 6, 12, 24 and then add dropout on the fully-connected layer. The following figure shows the new architecture you should build. Please complete the following code based on the given architecture and dropout technique.
<img src="figs/18-arch2.png" style="width: 600px;"/>

In [34]:
# TODO: Replace <FILL IN> with appropriate code

# · · · · · · · · · ·    (input data, 1-deep)                 X [batch, 28, 28, 1]
# @ @ @ @ @ @ @ @ @ @ -- conv. layer 6x6x1=>6 stride 1        W1 [6, 6, 1, 6]        B1 [6]
# ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                         Y1_hat [batch, 28, 28, 6]
#   @ @ @ @ @ @ @ @   -- conv. layer 5x5x6=>12 stride 2       W2 [5, 5, 6, 12]        B2 [12]
#   ∶∶∶∶∶∶∶∶∶∶∶∶∶∶∶                                           Y2_hat [batch, 14, 14, 12]
#     @ @ @ @ @ @     -- conv. layer 4x4x12=>24 stride 2      W3 [4, 4, 12, 24]       B3 [24]
#     ∶∶∶∶∶∶∶∶∶∶∶                                             Y3_hat [batch, 7, 7, 24] => reshaped to YY [batch, 7*7*24]
#      \x/x\x\x/ ✞    -- fully connected layer (relu+dropout) W4 [7*7*24, 200]       B4 [200]
#       · · · ·                                               Y4_hat [batch, 200]
#       \x/x\x/       -- fully connected layer (softmax)      W5 [200, 10]           B5 [10]
#        · · ·                                                Y_hat [batch, 10]

# to reset the Tensorflow default graph
reset_graph()

# load data
X_train = mnist.train.images
Y_train= mnist.train.labels
X_test = mnist.test.images
Y_test = mnist.test.labels

# Reshaping to format which CNN expects (batch, height, width, channels)
X_train = X_train.reshape(X_train.shape[0],28, 28, 1).astype('float32')
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32')

########################################
# define variables and placeholders
########################################
X = tf.placeholder(tf.float32, [None, 28, 28, 1])
Y = tf.placeholder(tf.int32, [None, 10])
lr = 0.003

# probability of keeping a node during dropout = 1.0 at test time (no dropout) and 0.75 at training time
pkeep = 0.75

# three convolutional layers with their channel counts, and a fully connected layer 
# (the last layer has 10 softmax neurons)
# the output depth of first three convolutional layers, are 6, 12, 24, and the size of fully connected
# layer is 200
W1 = tf.Variable(tf.truncated_normal([6, 6, 1, 6], stddev=0.1))
B1 = tf.Variable(tf.constant(0.1, tf.float32, [6])) #layer depth of 6

W2 = tf.Variable(tf.truncated_normal([5, 5, 6, 12], stddev=0.1))
B2 = tf.Variable(tf.constant(0.1, tf.float32, [12])) #layer depth of 12

W3 = tf.Variable(tf.truncated_normal([4, 4, 12, 24], stddev=0.1))
B3 = tf.Variable(tf.constant(0.1, tf.float32, [24])) #layer depth of 24

W4 = tf.Variable(tf.truncated_normal([7*7*24, 200], stddev=0.1))
B4 = tf.Variable(tf.constant(0.1, tf.float32, [200]))

W5 = tf.Variable(tf.truncated_normal([200, 10], stddev=0.1))
B5 = tf.Variable(tf.constant(0.1, tf.float32, [10]))

########################################
# build the model

########################################
stride = 1  # output is 28x28
Y1_hat = tf.nn.relu(tf.nn.conv2d(X, W1, strides=[1, stride, stride, 1], padding='SAME') + B1)# use tf.nn.conv2d

stride = 2  # output is 14x14
Y2_hat = tf.nn.relu(tf.nn.conv2d(Y1_hat, W2, strides=[1, stride, stride, 1], padding='SAME') + B2)

stride = 2  # output is 7x7
Y3_hat = tf.nn.relu(tf.nn.conv2d(Y2_hat, W3, strides=[1, stride, stride, 1], padding='SAME') + B3)

# reshape the output from the third convolution for the fully connected layer
YY_hat = tf.reshape(Y3_hat, shape=[-1, 7*7*24])
Y4_hat_dropout = tf.nn.dropout(tf.nn.relu(tf.matmul(YY_hat, W4) + B4), pkeep) #we add dropout to the function
Ylogits = tf.matmul(Y4_hat_dropout, W5) + B5
Y_hat = tf.nn.softmax(Ylogits)

########################################
# define the Loss function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=Ylogits, labels= Y)
cross_entropy = tf.reduce_mean(cross_entropy) * 100


########################################
# define the optmizer
########################################
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(cross_entropy)

# accuracy of the trained model, between 0 (worst) and 1 (best)
correct_prediction = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_hat, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
predictions = tf.argmax(Y, 1)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()
batch_size=100
n_epochs = 10 #reduce for a quick execution, cost vanishes at epoch 2000 aprox.
slope=mnist.train.num_examples // batch_size

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        epoch_loss =sess.run([train_step], feed_dict={X: X_train, Y: Y_train})
        print(epoch)
    correct = tf.equal(tf.argmax(Y_hat, 1), tf.argmax(Y, 1))

    accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
    print('Accuracy:',accuracy.eval({X:X_test, Y:Y_test}))

0
1
2
3
4
5
6
7
8
9
Accuracy: 0.6892


---
# 8. Tensorflow Layers Module
The TensorFlow **layers** `tf.layers` module provides a high-level API that makes it easy to construct a neural network. It provides methods that facilitate: (i) the creation of dense (fully connected) layers and convolutional layers, (ii) adding activation functions, and (iii) applying dropout regularization. In this section use the module `tf.layers` to build the network you made in section 7.

In [120]:
# TODO: Replace <FILL IN> with appropriate code
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

# to reset the Tensorflow default graph
reset_graph()

# load data
X_train = mnist.train.images
Y_train= mnist.train.labels
X_test = mnist.test.images
Y_test = mnist.test.labels

# Reshaping to format which CNN expects (batch, height, width, channels)
X_train = X_train.reshape(X_train.shape[0],28, 28, 1).astype('float32')
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32')

X = tf.placeholder(tf.float32, [None, 28, 28, 1])
y_true = tf.placeholder(tf.int32, [None, 10])

y1 = tf.layers.conv2d(inputs=X, filters=6, kernel_size=[6, 6], padding="same",strides=(1, 1),
  activation=tf.nn.relu)
y2 = tf.layers.conv2d(inputs=y1, filters=12, kernel_size=[5, 5], padding="same", strides=(2, 2),
  activation=tf.nn.relu)
y3 = tf.layers.conv2d(inputs=y2, filters=24, kernel_size=[4, 4], padding="same", strides=(2, 2),
  activation=tf.nn.relu)

conv3re= tf.reshape(y3, [-1, 7 * 7 * 24])

y4 = tf.layers.dense(inputs=conv3re, units=200, activation=tf.nn.relu)

dropout = tf.layers.dropout(inputs=y4, rate=0.75)

logits = tf.layers.dense(inputs=dropout, units=10)
    
# define the cost and accuracy functions
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=y_true)
cost = tf.reduce_mean(cross_entropy) * 100

# define the optimizer
lr = 0.003
optimizer = tf.train.AdamOptimizer(lr)
train_step = optimizer.minimize(cost)
                         
# execute the model
init = tf.global_variables_initializer()
n_epochs = 10
print(n_epochs)
with tf.Session() as sess:
    sess.run(init)
    for i in range(n_epochs):
        print(i)
        sess.run(train_step, feed_dict={X: X_train, y_true:Y_train})
        correct = tf.equal(tf.argmax(logits, 1), tf.argmax(y_true, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
    print('Accuracy:',accuracy.eval({X:X_test, y_true:Y_test}))

10
0
1
2
3
4
5
6
7
8
9
Accuracy: 0.7502


---
# 9. Keras
Keras is a high-level API to build and train deep learning models. It's used for fast prototyping, advanced research, and production. `tf.keras` is TensorFlow's implementation of the Keras API specification. To work with Keras, you need to import `tf.keras` as part of your TensorFlow program setup.
```
import tensorflow as tf
from tensorflow.keras import layers
```
#### Build a model
In Keras, you assemble **layers** to build a model, i.e., a graph of layers. The most common type of model is a stack of layers: the `tf.keras.Sequential` model. For example, the following code builds a simple, fully-connected network (i.e., multi-layer perceptron):
```
model = tf.keras.Sequential()
# adds a densely-connected layer with 64 units to the model:
model.add(layers.Dense(64, activation='relu'))
# add another
model.add(layers.Dense(64, activation='relu'))
# add a softmax layer with 10 output units:
model.add(layers.Dense(10, activation='softmax'))
```
There are many `tf.keras.layers` available with some common constructor parameters:
* `activation`: set the activation function for the layer, which is specified by the name of a built-in function or as a callable object.
* `kernel_initializer` and `bias_initializer`: the initialization schemes that create the layer's weights (weight and bias).
* `kernel_regularizer` and `bias_regularizer`: the regularization schemes that apply the layer's weights (weight and bias), such as L1 or L2 regularization.

#### Train and evaluate
After you construct a model, you can configure its learning process by calling the `compile` method:
```
model.compile(optimizer=tf.train.AdamOptimizer(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```
The method `tf.keras.Model.compile` takes three important arguments:
* `optimizer`: it specifies the training procedure, e.g., `tf.train.AdamOptimizer` and `tf.train.GradientDescentOptimizer`.
* `loss`: the cost function to minimize during optimization, e.g., mean square error (mse), categorical_crossentropy, and binary_crossentropy.
* `metrics`: used to monitor training, e.g., `accuracy`.

The next step after confiuring the model is to train it by calling the `model.fit` method and giving it training data as its input. After training the model you can call `tf.keras.Model.evaluate` and `tf.keras.Model.predict` methods to evaluate the inference-mode loss and metrics for the data provided or predict the output of the last layer in inference for the data provided, respectively.

You can read more about Keras [here](https://www.tensorflow.org/guide/keras).

In this task, please use Keras to rebuild the network you made in section 7.

In [4]:
# TODO: Replace <FILL IN> with appropriate code

# to reset the Tensorflow default graph
reset_graph()
def __init__(self, model_path): 
     self.model = load_model(model_path) 
     self.session = K.get_session() 
     self.graph = tf.get_default_graph() 
     self.graph.finalize() 

# :)

import tensorflow as tf
from tensorflow.keras import layers
from keras.datasets import mnist
from keras import utils

from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten, Dropout

learning_rate = 0.003
epochs = 2
batch_size = 2000
outputs = 10

# loading datasets
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0],28, 28, 1).astype('float32')
Y_train = utils.to_categorical(Y_train, outputs)
X_test = X_test.reshape(X_test.shape[0],28, 28, 1).astype('float32')
Y_test = utils.to_categorical(Y_test, outputs)

#create model
model = Sequential()
#add model layers
model.add(Conv2D(6, kernel_size=6, activation='relu'))
model.add(Conv2D(12, kernel_size=5, activation='relu'))
model.add(Conv2D(24, kernel_size=4, activation='relu'))
model.add(Flatten())
model.add(Dense(200, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))


#compile model using accuracy to measure model performance
model.compile(optimizer=tf.train.AdamOptimizer(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

#train the model
model.fit(X_train, Y_train, batch_size=batch_size, epochs=epochs)

##fter the first batch is trained Keras estimates the training duration (ETA: estimated time of arrival) of one epoch which is equivalent to one round of training with all your samples.
##In addition to that you get the losses (the difference between prediction and true labels) and your metric (in your case the accuracy) for both the training and the validation samples.

#evaluate the model
model.evaluate(X_test, Y_test, verbose=0)

#predict first 4 images in the test set
test_loss =model.predict(X_test, batch_size=batch_size)
print(test_loss)
#actual results for first 4 images in test set
Y_test[:4]

Epoch 1/2
Epoch 2/2
[[8.3665280e-10 1.4173106e-09 4.1102592e-07 ... 9.9999940e-01
  5.4476477e-08 1.7147254e-07]
 [3.0031853e-08 1.6074511e-06 9.9998808e-01 ... 1.6587054e-09
  1.3437347e-09 2.9114142e-12]
 [6.5978412e-10 9.9989915e-01 9.9145552e-07 ... 2.8224300e-05
  2.4966599e-05 2.3126846e-07]
 ...
 [1.3910916e-07 3.3140129e-08 2.0044492e-08 ... 1.5181646e-06
  9.2020291e-06 6.7442779e-05]
 [7.0996586e-10 6.6988054e-11 3.9498498e-11 ... 7.8231831e-12
  4.4656736e-06 1.6719638e-12]
 [3.1997774e-13 5.2060715e-16 2.5328225e-12 ... 3.7876082e-13
  1.0716174e-13 1.6577902e-15]]


array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

---
# 10. Implement LeNet-5
In this section, you should implement **LeNet-5** either using Tensorflow or Keras. Please take a look at its [paper](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) before starting to implement it.
The LeNet-5 architecture is perhaps the most widely known CNN architecture. It was created by Yann LeCun in 1998 and widely used for handwritten digit recognition (MNIST). It is composed of the layers shown in the following table.
<img src="figs/19-letnet5.png" style="width: 600px;"/>
There are a few extra details to be noted:
* MNIST images are 28×28 pixels, but they are zero-padded to 32×32 pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size keeps shrinking as the image progresses through the network.
* The average pooling layers are slightly more complex than usual: each neuron computes the mean of its inputs, then multiplies the result by a learnable coefficient and adds a learnable bias term, then finally applies the activation function.
* Most neurons in layer C3 maps are connected to neurons in only three or four S2 maps (instead of all six S2 maps). See table 1 in the [paper](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf) for details.
* The output layer is a bit special: instead of computing the dot product of the inputs and the weight vector, each neuron outputs the square of the Euclidian distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. The cross-entropy cost function is now preferred, as it penalizes bad predictions much more, producing larger gradients and thus converging faster.

In [38]:
# to reset the Tensorflow default graph // https://github.com/sujaybabruwad/LeNet-in-Tensorflow/blob/master/LeNet-Lab.ipynb
reset_graph()



# load data
X_train = mnist.train.images
Y_train= mnist.train.labels
X_test = mnist.test.images
Y_test = mnist.test.labels

# Reshaping to format which CNN expects (batch, height, width, channels)
X_train1 = X_train.reshape(X_train.shape[0],28, 28, 1).astype('float32')
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32')

p = np.shape(X_train1)
print(p)

max_h = 32 
max_w = 32
pad_h = (max_h-p[1])//2
pad_w = (max_w-p[2])//2 

paddings = ((pad_h,pad_h),(pad_w,pad_w))

X_train      = np.pad(X_train1,[[0, 0], [2,2], [2,2], [0,0]], "constant")
X_test       = np.pad(X_test, [[0, 0], [2,2], [2,2], [0,0]], "constant")

########################################
# define variables and placeholders
########################################

X = tf.placeholder(tf.float32, [None, 32, 32, 1])
Y = tf.placeholder(tf.int32, [None, 10])
one_hot_y = tf.one_hot(Y, 10)

learning_rate = 0.003


W1 = tf.Variable(tf.truncated_normal([5, 5, 1, 6], stddev=1, mean = 0))
B1 = tf.get_variable(name="conv1_biases", shape=[6], initializer=tf.random_normal_initializer(stddev=0.3)) #layer depth of 6

WS2 = tf.get_variable(name="pool1_weights", shape=[6], initializer=tf.random_normal_initializer(stddev=0.3))
BS2 = tf.get_variable(name="pool1_biases", shape=[6], initializer=tf.random_normal_initializer(stddev=0.3))

##C3-s3 connections

W3 = tf.Variable(tf.truncated_normal([5, 5, 6, 16], stddev=1, mean = 0))
B3 = tf.get_variable(name="conv2_biases", shape=[16], initializer=tf.random_normal_initializer(stddev=0.3)) #layer depth of 16

WS4 = tf.get_variable(name="pool2_weights", shape=[16], initializer=tf.random_normal_initializer(stddev=0.3))
BS4 = tf.get_variable(name="pool2_biases", shape=[16], initializer=tf.random_normal_initializer(stddev=0.3))

W5 = tf.Variable(tf.truncated_normal(shape=[5,5,16,120], mean=0, stddev=1))
B5 = tf.get_variable(name="conv3_biases", shape=[120], initializer=tf.random_normal_initializer(stddev=0.3)) 

W6 = tf.Variable(tf.truncated_normal(shape=[120,84], mean=0, stddev=1))
B6 = tf.get_variable(name="f6_biases", shape=[84], initializer=tf.random_normal_initializer(stddev=0.3))

WOut = tf.Variable(tf.truncated_normal(shape=[84,10], mean=0, stddev=1))
BOut = tf.get_variable(name="Out_biases", shape=[10], initializer=tf.random_normal_initializer(stddev=0.3))

########################################
# build the model

########################################
stride = 1  
C1 = tf.nn.tanh(tf.nn.conv2d(X, W1, strides=[1, stride, stride, 1], padding='VALID') + B1)

S2 = tf.nn.tanh(tf.nn.avg_pool(C1, ksize = [1,2,2,1], strides = [1,2,2,1], padding = 'VALID')*WS2)+BS2

#S2 = tf.nn.tanh(tf.nn.avg_pool(C1, ksize = [1,2,2,1], strides = [1,2,2,1], padding = 'VALID'))

stride = 1  
C3 = tf.nn.tanh(tf.nn.conv2d(S2, W3, strides=[1, stride, stride, 1], padding='VALID') + B3)

S4 = tf.nn.tanh(tf.nn.avg_pool(C3, ksize = [1,2,2,1], strides = [1,2,2,1], padding = 'VALID')*WS4)+BS4

#S4 = tf.nn.tanh(tf.nn.avg_pool(C3, ksize = [1,2,2,1], strides = [1,2,2,1], padding = 'VALID'))
                  
stride = 1  
C5 = tf.nn.tanh(tf.nn.conv2d(S4, W5, strides=[1, stride, stride, 1], padding='VALID') + B5)

# reshape the output from the third convolution for the fully connected layer

F6in=tf.reshape(C5, [-1, 1*1*120])

F6out = tf.nn.tanh(tf.matmul(F6in, W6) + B6) 

Output = tf.nn.tanh(tf.matmul(F6out, WOut) + BOut) 

#Ylogits = (F6out - W6)**2
Y_hat = tf.nn.softmax(Output)

########################################
# define the Loss function
########################################
cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=Y_hat, labels= Y)
cross_entropy = tf.reduce_mean(cross_entropy) * 100


########################################
# define the optmizer
########################################
optimizer = tf.train.AdamOptimizer(learning_rate)
train_step = optimizer.minimize(cross_entropy)

# accuracy of the trained model, between 0 (worst) and 1 (best)
correct_prediction = tf.equal(tf.argmax(Y, 1), tf.argmax(Y_hat, 1))
accuracy_operation = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
predictions = tf.argmax(Y, 1)

########################################
# execute the model
########################################
init = tf.global_variables_initializer()
batch_size=1000
epochs = 10 #reduce for a quick execution
slope=mnist.train.num_examples // batch_size

def evaluate(X_data, y_data):
    num_examples = len(X_data)
    total_accuracy = 0
    sess = tf.get_default_session()
    for offset in range(0, num_examples, batch_size):
        batch_x, batch_y = X_data[offset:offset+batch_size], y_data[offset:offset+batch_size]
        accuracy = sess.run(accuracy_operation, feed_dict={X: X_data, Y: y_data})
        total_accuracy += (accuracy * len(batch_x))
        #print( total_accuracy)
    return total_accuracy / num_examples


with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    num_examples = 55000
    
    print("Training...")
    print()
    for i in range(epochs):
        #X_train, y_train = shuffle(X_train, Y_train)
        for offset in range(0, num_examples, batch_size):
            end = offset + batch_size
            batch_x, batch_y = X_train[offset:end], Y_train[offset:end]
            sess.run(train_step, feed_dict={X: batch_x, Y: batch_y})
            #print(offset)

        ##evaluate the model
        #print(X_test[1,1])
        #print(Y_test)
        test_accuracy = evaluate(X_test, Y_test)
        print("Test Accuracy = {:.3f}".format(test_accuracy))


(55000, 28, 28, 1)
Training...

Test Accuracy = 0.542
Test Accuracy = 0.595
Test Accuracy = 0.616
Test Accuracy = 0.627
Test Accuracy = 0.637
Test Accuracy = 0.640
Test Accuracy = 0.641
Test Accuracy = 0.647
Test Accuracy = 0.650
Test Accuracy = 0.651


---
# 11. Implement AlexNet
In the last section, you should implement **AlexNet** either using Tensorflow or Keras. Again, please take a look at its [paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) before start to implement it.
The AlexNet CNN architecture won the [ImageNet ILSVRC challenge](http://www.image-net.org/challenges/LSVRC/2012/) in 2012 by a large margin. It was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It is quite similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. The following table presents this architecture.
<img src="figs/20-alexnet.png" style="width: 600px;"/>
To train the model, we need a big dataset, however, in this assignment you are going to to assign the pretrained weights to your model, using `tf.Variable.assign`. You can download the pretrained weights from [bvlc_alexnet.npy](https://www.cs.toronto.edu/~guerzhoy/tf_alexnet/bvlc_alexnet.npy). This file is a NumPy array file created by the python. After you read this file, you will receive a python dictionary with a <key, value> pair for each layer. Each key is one of the layers names, e.g., `conv1`, and each value is a list of two values: (1) weights, and (2) biases of that layer. Part of the function to load the weights and biases to your model is given, and you need to complete it.

Here is what you see if you read and print the shape of each layer from the file:
```
weight_dic = np.load("bvlc_alexnet.npy", encoding="bytes").item()
for layer in weights_dic:
    print("-" * 20)
    print(layer)
    for wb in weights_dic[layer]:
        print(wb.shape)

#--------------------
# fc8
# (4096, 1000) # weights
# (1000,) # bias
#--------------------
# fc7
# (4096, 4096) # weights
# (4096,) # bias
#--------------------
# fc6
# (9216, 4096) # weights
# (4096,) # bias
#--------------------
# conv5
# (3, 3, 192, 256) # weights
# (256,) # bias
#--------------------
# conv4
# (3, 3, 192, 384) # weights
# (384,) # bias
#--------------------
# conv3
# (3, 3, 256, 384) # weights
# (384,) # bias
#--------------------
# conv2
# (5, 5, 48, 256) # weights
# (256,) # bias
#--------------------
# conv1
# (11, 11, 3, 96) # weights
# (96,) # bias
```


In [3]:
# TODO: Replace <FILL IN> with appropriate code

# to reset the Tensorflow default graph
reset_graph()

def maxPoolLayer(x, kHeight, kWidth, strideX, strideY, name, padding = "SAME"):
    return tf.nn.max_pool(x, ksize = [1, kHeight, kWidth, 1],
                          strides = [1, strideX, strideY, 1], padding = padding, name = name)
 
def dropout(x, keepPro, name = None):
    return tf.nn.dropout(x, keepPro, name)
 
def LRN(x, R, alpha, beta, name = None, bias = 1.0):
    return tf.nn.local_response_normalization(x, depth_radius = R, alpha = alpha,
                                              beta = beta, bias = bias, name = name)
 
def fcLayer(x, inputD, outputD, reluFlag, name):
    """fully-connect"""
    with tf.variable_scope(name) as scope:
        w = tf.get_variable("w", shape = [inputD, outputD], dtype = "float")
        b = tf.get_variable("b", [outputD], dtype = "float")
        out = tf.nn.xw_plus_b(x, w, b, name = scope.name)
        if reluFlag:
            return tf.nn.relu(out)
        else:
            return out
def convLayer(x, kHeight, kWidth, strideX, strideY,
              featureNum, name, padding = "SAME", groups = 1): #group=2 means the second part of AlexNet
    """convlutional"""
    channel = int(x.get_shape()[-1]) #get channel
    conv = lambda a, b: tf.nn.conv2d(a, b, strides = [1, strideY, strideX, 1], padding = padding)
    with tf.variable_scope(name) as scope:
        w = tf.get_variable("w", shape = [kHeight, kWidth, channel/groups, featureNum])
        b = tf.get_variable("b", shape = [featureNum])
 
        xNew = tf.split(value = x, num_or_size_splits = groups, axis = 3)#input and weights after split
        wNew = tf.split(value = w, num_or_size_splits = groups, axis = 3)
 
        featureMap = [conv(t1, t2) for t1, t2 in zip(xNew, wNew)] #retriving the feature map separately
        mergeFeatureMap = tf.concat(axis = 3, values = featureMap) #concatnating feature map 
        # print mergeFeatureMap.shape
        out = tf.nn.bias_add(mergeFeatureMap, b)
        return tf.nn.relu(tf.reshape(out, mergeFeatureMap.get_shape().as_list()), name = scope.name)


# build the AlexNet model
def __init__(self, x, keepPro, classNum, skip, modelPath = "bvlc_alexnet.npy"):
        self.X = x
        self.KEEPPRO = keepPro
        self.CLASSNUM = classNum
        self.SKIP = skip
        self.MODELPATH = modelPath
 #build CNN
        self.buildCNN()
 
        def buildCNN(self):
        
            conv1 = convLayer(self.X, 11, 11, 4, 4, 96, "conv1", "VALID")
            lrn1 = LRN(conv1, 2, 2e-05, 0.75, "norm1")
            pool1 = maxPoolLayer(lrn1, 3, 3, 2, 2, "pool1", "VALID")
 
            conv2 = convLayer(pool1, 5, 5, 1, 1, 256, "conv2", groups = 2)
            lrn2 = LRN(conv2, 2, 2e-05, 0.75, "lrn2")
            pool2 = maxPoolLayer(lrn2, 3, 3, 2, 2, "pool2", "VALID")
 
            conv3 = convLayer(pool2, 3, 3, 1, 1, 384, "conv3")
 
            conv4 = convLayer(conv3, 3, 3, 1, 1, 384, "conv4", groups = 2)
 
            conv5 = convLayer(conv4, 3, 3, 1, 1, 256, "conv5", groups = 2)
            pool5 = maxPoolLayer(conv5, 3, 3, 2, 2, "pool5", "VALID")
 
            fcIn = tf.reshape(pool5, [-1, 256 * 6 * 6])
            fc1 = fcLayer(fcIn, 256 * 6 * 6, 4096, True, "fc6")
            dropout1 = dropout(fc1, self.KEEPPRO)
 
            fc2 = fcLayer(dropout1, 4096, 4096, True, "fc7")
            dropout2 = dropout(fc2, self.KEEPPRO)
 
            self.fc3 = fcLayer(dropout2, 4096, self.CLASSNUM, True, "fc8")   
        
        
# load inital weights and biases to the model
        def load_initial_weights(self, session):
    # load the weights into memory
           weights_dic = np.load(self.MODELPATH, encoding='bytes').item()

    # loop over all layer names stored in the weights dict
        for layer in weights_dict:
            if name not in self.SKIP:
                with tf.variable_scope(layer, reuse=True):
            # loop over list of weights/biases and assign them to their corresponding tf variable
                  for wb in weights_dict[layer]:
                # biases
                     if len(wb.shape) == 1:
                        bias = tf.get_variable('b', trainable = False)
                        session.run(bias.assign(wb))
                # weights
                     else:
                        weight = tf.get_variable('w', trainable = False)
                        session.run(weight.assign(wb))
                

#### Test the model
After building the AlexNet model, you can test it on different images and present the accuracy of the model. To do so, first you need to use **OpenCV** library to make the images ready to give as input to the model. OpenCV is a library used for image processing. Below you can see how to read an image file and pre-process it using OpenCV to give it to the model. However, you need to complete the code and test the accuracy of your model. The teset images (shown below) are available in the `test_images` folder.
<table width="100%">
<tr>
<td><img src="test_images/test_image1.jpg" style="width:200px;"></td>
<td><p align="center"><img src="test_images/test_image2.jpg" style="width:200px;"></td>
<td align="right"><img src="test_images/test_image3.jpg" style="width:200px;"></td>
</tr>

In [4]:
# TODO: Replace <FILL IN> with appropriate code
# test the AlexNet model on the given images

import cv2 as cv
import os as os


dropoutPro = 1
classNum = 1000
skip = []
#get testImage
testPath = "testModel"
testImg = []

#get list of all images
current_dir = os.getcwd()
image_path = os.path.join(current_dir, 'test_images')
img_files = [os.path.join(image_path, f) for f in os.listdir(image_path) if f.endswith('.jpg')]

#load all images
imgs = []
for f in img_files:
    imgs.append(cv2.imread(f))
    
x = tf.placeholder("float", [1, 227, 227, 3])
 
model = alexnet.alexNet(x, dropoutPro, classNum, skip)
score = model.fc3
softmax = tf.nn.softmax(score)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    model.loadModel(sess)
    
    # loop over all images
    for i, image in enumerate(imgs):
        # convert image to float32 and resize to (227x227)
        img = cv2.resize(image.astype(np.float32), (227, 227))
        
        # subtract the ImageNet mean
        # Mean subtraction per channel was used to center the data around zero mean for each channel (R, G, B).
        # This typically helps the network to learn faster since gradients act uniformly for each channel.
        imagenet_mean = np.array([104., 117., 124.], dtype=np.float32)
        img -= imagenet_mean
        
        # reshape as needed to feed into model
        img = img.reshape((1, 227, 227, 3))
        maxx = np.argmax(sess.run(softmax, feed_dict = {x: test}))
        res = caffe_classes.class_names[maxx] #find the max probility
        #print(res)
        font = cv2.FONT_HERSHEY_SIMPLEX
        cv2.putText(img, res, (int(img.shape[0]/3), int(img.shape[1]/3)), font, 1, (0, 0, 255), 2) #putting on the labels
        cv2.imshow("demo", img) 
        cv2.waitKey(5000)
        
    

FileNotFoundError: [Errno 2] No such file or directory: '/extra/hops/staging/private_dirs/7b6aa3aa8952aae29861a4c8dc26ca53937fa1540825ee67a8b7056f27483449/test_images'