---

# Assignment 10: CNN part 2 (deadline: 20 Jan, 23:59)

## 1. Implementing CNN (10 points)

Here you will implement a simple, but efficient Convolutional Neural Network (CNN) in TensorFlow for image classification task on CIFAR-10 dataset ([link and description](https://www.cs.toronto.edu/~kriz/cifar.html), but please note that the section "Baseline results" is completely outdated). 

We strongly recommend you first to go through a very good [tutorial from TensorFlow](https://www.tensorflow.org/get_started/mnist/pros). You can borrow implementation details from them, especially for convolutions and pooling layers, but make sure to implement our architecture! If you just apply the architecture from that tutorial to CIFAR-10 dataset, you will not receive many points. And please note, that this tutorial considers MNIST dataset, which has grayscale digits. However, CIFAR-10 has images in RGB. There is also another [tutorial by TensorFlow](https://www.tensorflow.org/tutorials/deep_cnn), which shows AlexNet architecture (now outdated!) applied to CIFAR-10. But it is much more advanced than the first one.

Also you may find useful to check these excellent materials http://cs231n.github.io/convolutional-networks/ and http://cs231n.github.io/understanding-cnn/ from Stanford University for general explanations about CNNs.


Specifically for this exercise we decided to invent a slightly new CNN architecture, which combines some ideas from recent research papers. Here it is:

| Layer | Number of units |
|-------| ---|
| input layer   | 32x32x3 units |
| 5x5 conv+ReLU, stride 1, 10 filters| 32x32x10 units |
| 3x3 max pool, stride 2 | 16x16x10 units |
| 4x4 conv+ReLU, stride 1, 20 filters | 16x16x20 units | 
| 3x3 max pool, stride 2 | 8x8x20 units |
| 3x3 conv+ReLU, stride 1, 30 filters | 8x8x30 units | 
| global average pooling | 30 units |
| fully-connected | 10 units (classes) | 
| softmax | 10 units (classes) |
| cross-entropy loss with L2 regularization | 1 unit (objective) |

**Our main expectation for this task is "good" test error (around 35-40% after 10 epochs) and correct implementation of the given architecture.** Of course, it is very far away from the state-of-the-art for this dataset, which is around 3%. But unfortunately, really deep CNNs cannot be trained on ordinary laptops.


Implementation details:
- Initialize your weights with He initialization (for your interest, [original paper](https://arxiv.org/pdf/1502.01852.pdf)), which is simply a normal distribution with mean=0 and std=sqrt(2/n), where n is the number of units on the previous layer (e.g. for 1st conv. layer we have $n=32 \cdot 32 \cdot 3$, for 2nd conv. layer $n=16 \cdot 16 \cdot 10$).
- Convolution should preserve dimensions (use 'SAME'). 
- Global average pooling means that we simply average all our values inside each feature map. So on the 3-rd convolutional layer we have 30 feature maps of size 8x8. After global average pooling we get only 30 values of averaged 8x8 feature maps.
- Use ReLU as activaton function
- We use softmax + cross-entropy loss, and no elementwise sigmoids. Make sure that you use average cross-entropy per batch in order to make an update! Otherwise recommended lambda and learning rate will not be a good choice.
- For optimization you can use Adam with the learning rate given above (0.005), and all other parameters set by default. 
- Please train the model for 10 epochs with batch size = 100. Please note, that 10 epochs is not actually a lot, and we can get substantial improvement if we go beyond 10 epochs. We just want to be sure that 10 epochs can be done in a reasonable amount of time on any kind of laptop that you may have.
- Use TensorBoard to **display the computation graph** defined by your model. Additonally, use TensorBoard to **log the test error, training error and training loss function** of your network for each epoch (again, you can use [here](https://www.tensorflow.org/get_started/summaries_and_tensorboard) for reference). To speed up the training process, it's not necessary to report training accuracy on all 50000 examples, you can just take first 1000 examples.
- Don't forget about L2 regularization of all weights (both convolutional and fully connected). Recommended $\lambda$ is 0.0001.
- Of course, feel free to use different hyperparameters if you find better ones.

Here is the code to load and standardize features of CIFAR-10 dataset. So the simplest version is to load CIFAR-10 using [Tensorpack](https://github.com/ppwwyyxx/tensorpack), which you have to install first. 

In [1]:
import tensorpack.dataflow.dataset as dataset
import numpy as np
train, test = dataset.Cifar10('train'), dataset.Cifar10('test')

# useful to reduce this number to 1000 for debugging purposes
n = 50000
x_train = np.array([train.data[i][0] for i in range(n)], dtype=np.float32)
y_train = np.array([train.data[i][1] for i in range(n)], dtype=np.int32)
x_test = np.array([ex[0] for ex in test.data], dtype=np.float32)
y_test = np.array([ex[1] for ex in test.data], dtype=np.int32)

del(train, test)  # frees approximately 180 MB

# standardization
x_train_pixel_mean = x_train.mean(axis=0)  # per-pixel mean
x_train_pixel_std = x_train.std(axis=0)   # per-pixel std
x_train -= x_train_pixel_mean
x_train /= x_train_pixel_std
x_test -= x_train_pixel_mean
x_test /= x_train_pixel_std

[32m[0120 15:50:10 @fs.py:89][0m [5m[31mWRN[0m Env var $TENSORPACK_DATASET not set, using /home/tatjana/tensorpack_data for datasets.
[32m[0120 15:50:10 @cifar.py:33][0m Found cifar10 data in /home/tatjana/tensorpack_data/cifar10_data.
[32m[0120 15:50:12 @cifar.py:33][0m Found cifar10 data in /home/tatjana/tensorpack_data/cifar10_data.


Note, that you do not have a validation set here, only training and test sets. Of course, a more principled way is to select your hyperparameters based on the validation set and then retrain the model on training and validation sets combined and then report the final performance on the test set. But it would be too complicated for this assignment.

*Hint1: it should be possible to train this CNN on an average laptop (for reference, let's take the processor Intel® Core™ i5-3210M CPU @ 2.50GHz × 4) in 12 minutes and using up to 1 GB RAM (only the dataset needs 60000 images * 32 pixels * 32 pixels * 3 colors * 4 bytes $\approx$ 737 MB after standardization which necessarily converts uint8 to float32). The final test error after 10 epochs should be around 35-40%. For the reference, you can see the optimization log of the correct solution (but of course, you will have random initialization and random shuffling of batches, which makes exactly these numbers not reproducible):*

| Epoch | Test error | Train error |
| --- | -- |
| 1  | 49.300%  | 50.200% |
| 2  | 44.310%  | 45.000% |
| 3  | 43.110%  | 40.700% |
| 4  | 39.550%  | 38.700% |
| 5  | 39.820%  | 39.900% |
| 6  | 38.480%  | 37.000% |
| 7  | 36.180%  | 35.100% |
| 8  | 36.810%  | 37.600% |
| 9  | 35.340%  | 34.800% |
| 10 | 35.210%  | 34.900% |

*Hint2: you may find `tf.nn.sparse_softmax_cross_entropy_with_logits` useful if you don't want to convert the labels into 1-hot encoding.*

*Hint3: you should shuffle batches on each iteration, so you can make use of this function:*
```
def get_next_batch(X, Y, batch_size):
    n_batches = len(X) // batch_size
    rand_idx = np.random.permutation(len(X))[:n_batches * batch_size]
    for batch_idx in rand_idx.reshape([n_batches, batch_size]):
        batch_x, batch_y = X[batch_idx], Y[batch_idx]
        yield batch_x, batch_y
```
        
*Hint4: if you have limited amount of RAM on your laptop (if you don't have any problems with memory, then it's not necessary), you can evaluate test and training errors also by batches and then combine them into the final test/train error. For example like this:*
```
def eval_error(X_np, Y_np, sess, batch_size):
    """Get all predictions for a dataset by running it in small batches."""
    n_batches = len(X_np) // batch_size
    err = 0.0
    for batch_x, batch_y in get_next_batch(X_np, Y_np, batch_size):
        err += sess.run(error_rate, feed_dict={x: batch_x, y: batch_y})
    return err / n_batches
```

*Hint5: you can implement the weight decay in the following way (if you named all your weights variable with the name that contains some `var_pattern`):*
```
def weight_decay(var_pattern):
    """
    L2 weight decay loss, based on all weights that have var_pattern in their name

    var_pattern - a substring of a name of weights variables that we want to use in Weight Decay.
    """
    costs = []
    for var in tf.trainable_variables():
        if var.op.name.find(var_pattern) != -1:
            costs.append(tf.nn.l2_loss(var))
    return tf.add_n(costs)
```

*Hint6: it is a very good idea to debug your code only on a subset of CIFAR-10 (let's say, 1000 examples), so you can quickly see if things go wrong.*


In [2]:
import tensorflow as tf

batch_size = 100
height = 32
width = 32
depth = 3

# placeholders for train data and variables for weights and biases
x = tf.placeholder(tf.float32, (None, height, width, depth), name="x")
y = tf.placeholder(tf.float32, (None,), name="y")

# weights and biases

def conv2d(x_in, W):
    return tf.nn.conv2d(x_in, W, strides=[1, 1, 1, 1], padding='SAME')

w_1 = tf.Variable(tf.random_normal([5,5,3,10], mean=0,stddev=tf.sqrt(2/3072)), name="w_1")
w_2 = tf.Variable(tf.random_normal([4,4,10,20], mean=0,stddev=tf.sqrt(2/2560)), name="w_2")
w_3 = tf.Variable(tf.random_normal([3,3,20,30], mean=0,stddev=tf.sqrt(2/1280)), name="w_3")
w_4 = tf.Variable(tf.random_normal([30,10], mean=0,stddev=tf.sqrt(2/30)), name="w_4")

bias_1 = tf.Variable(tf.zeros([10], tf.float32), name="bias_1")
bias_2 = tf.Variable(tf.zeros([20], tf.float32), name="bias_2")
bias_3 = tf.Variable(tf.zeros([30], tf.float32), name="bias_3")
bias_4 = tf.Variable(tf.zeros([10], tf.float32), name="bias_4")

weights = [w_1, w_2, w_3, w_4]
biases = [bias_1, bias_2, bias_3, bias_4]

In [None]:
def get_next_batch(X, Y, batch_size):
    n_batches = len(X) // batch_size
    rand_idx = np.random.permutation(len(X))[:n_batches * batch_size]
    for batch_idx in rand_idx.reshape([n_batches, batch_size]):
        batch_x, batch_y = X[batch_idx], Y[batch_idx]
        yield batch_x, batch_y
        
def weight_decay(var_pattern):
    """
    L2 weight decay loss, based on all weights that have var_pattern in their name

    var_pattern - a substring of a name of weights variables that we want to use in Weight Decay.
    """
    costs = []
    for var in tf.trainable_variables():
        if var.op.name.find(var_pattern) != -1:
            costs.append(tf.nn.l2_loss(var))
    return tf.add_n(costs)

def forward(x, weights, biases):
    conv_1 = tf.nn.relu(conv2d(x, weights[0]) + biases[0])
    pool_1 = tf.nn.max_pool(conv_1, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
    conv_2 = tf.nn.relu(conv2d(pool_1, weights[1]) + biases[1])
    pool_2 = tf.nn.max_pool(conv_2, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
    conv_3 = tf.nn.relu(conv2d(pool_2, weights[2]) + biases[2])
    avg_pool = tf.nn.avg_pool(conv_3, [1, 8, 8, 1], [1, 1, 1, 1], padding='VALID')  
    pool_flat = tf.reshape(avg_pool, [-1, 1 * 1 * 30])
    logits = tf.layers.dense(inputs=pool_flat, units=10)
    return(logits, tf.nn.softmax(logits))
    
def compute_cost(unscaled_logits, y, weights, alpha):
    l2 = tf.reduce_sum(weight_decay("w"))
    y = tf.one_hot(indices=tf.cast(y, tf.int32), depth=10)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=unscaled_logits))
    loss = tf.add(loss, alpha*l2)
    return loss

def compute_accuracy(predictions, ground_truth):
    """
    compute & return accuracy (in percentage).
    predictions: predictions from the network
    ground_truth: 1-hot array of ground truth labels
    """
    ground_truth = ground_truth.reshape(-1).astype(int)
    ground_truth = np.eye(10)[ground_truth]        
    return np.sum([np.argmax(ground_truth[x]) != np.argmax(predictions[x]) for x in range(len(predictions))]) / predictions.shape[0]

In [None]:
# construct model, define loss and optimizer
unscaled_logits, predictions = forward(x, weights, biases)
cost = compute_cost(unscaled_logits, y, weights, 0.0001)
optimizer = tf.train.AdamOptimizer(0.005).minimize(cost)

# perform training
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    for epoch in range(int(n/batch_size)):
        err = 0
        nb = 0
        for x_batch, y_batch in get_next_batch(x_train, y_train, batch_size):
            nb += 1
            _, pred, tar = sess.run([optimizer, predictions, y], feed_dict={x: x_batch, y: y_batch})
            err += compute_accuracy(pred, tar)
        tr_err = (err/nb)*batch_size
        
        _, test_pred, test_tar = sess.run([optimizer, predictions, y], feed_dict={x: x_test, y: y_test})
        test_accuracy = compute_accuracy(test_pred, test_tar)
        print("Epoch:",str(epoch+1), "Train error:", tr_err, "Test error:", str(test_accuracy*100))

Epoch: 1 Train error: 65.382 Test error: 55.73
Epoch: 2 Train error: 51.326 Test error: 49.78
Epoch: 3 Train error: 47.174 Test error: 46.99
Epoch: 4 Train error: 44.114 Test error: 42.13
Epoch: 5 Train error: 42.798 Test error: 43.94
Epoch: 6 Train error: 41.486 Test error: 41.72
Epoch: 7 Train error: 40.604 Test error: 41.64
Epoch: 8 Train error: 39.704 Test error: 39.96
Epoch: 9 Train error: 39.352 Test error: 38.84
Epoch: 10 Train error: 38.674 Test error: 38.35
Epoch: 11 Train error: 38.432 Test error: 38.04
Epoch: 12 Train error: 38.17 Test error: 37.63
Epoch: 13 Train error: 37.316 Test error: 37.32
Epoch: 14 Train error: 37.424 Test error: 37.55
Epoch: 15 Train error: 37.074 Test error: 37.61
Epoch: 16 Train error: 36.752 Test error: 36.89
Epoch: 17 Train error: 36.28 Test error: 36.29
Epoch: 18 Train error: 36.054 Test error: 36.11
Epoch: 19 Train error: 35.966 Test error: 36.02
Epoch: 20 Train error: 35.968 Test error: 36.22
Epoch: 21 Train error: 35.824 Test error: 35.52
Epo

## 2. Understanding CNN (5 points + 3 bonus points)
** You should do all these tasks using TensorBoard! It is much more convenient than trying to achieve the same in numpy+matplotlib. To get visualizations all you need is to use tf.summary.image() with correct tensor, which is properly reshaped if it's needed.**

a. Now we will try to understand what CNNs actually learned. Let's do the following:
- Visualize all convolutional filters on all the layers. So you have to visualize 10 conv. filters (5x5) on the 1st layer, 20 conv. filters (4x4) on the 2nd layer and 30 conv. filters (3x3) on the 3-rd layer. Note, that each filter on the 1st layer has 3 input channels, which correspond to red, green and blue colors, so you can display them **in color**. And each filter on other layers has more input channels that have nothing to do with color. So you can just pick 1 of the input channels and display all conv. filters (again, 20 on 2nd layer and 30 on 3rd layer) **in grayscale** that correspond to a particular input channel. Please, provide here a few received images that you find interesting. (1 point)
- Visualize all feature maps (of size 32x32, 16x16, 8x8), i.e. values of neurons after applying activation function (and before the pooling operation) for each convolutional layer for any 5 training examples. Please, provide here a few received images that you find interesting. (1 point)
- Can you notice some interesting patterns on the visualizations? Are all images perfectly interpretable? Discuss both images of convolutional filters and of feature maps on different layers. Are there duplicate filters that look almost identical? (1 point)

b. And now about the optimization process:
- Modify your code to plot the L2 norm of flattened gradients of the weights of each layer after each epoch. So your gradient for a convolutional layer is a tensor of dimension n_in x width x height x n_out (obtained by Optimizer.compute_gradients(loss)), so you should flatten it into a long vector and then take L2 norm of it. Please, also include the gradient magnitude immediately after random initialization. Make sure to calculate gradient magnitudes on the training set (or again, you can use only a subset of 1000 examples). (2 points)
- (BONUS, 3 points) Can you explain why the gradient magnitude doesn't always decrease over time? Can you confirm the picture from Deep Learning book on the page 281, figure 8.1, where the gradient magnitude strictly grows with training? 


## 3. Small questions about CNNs (5 points)
**Please, make sure to provide enough arguments and explanations to your answers.**


a. How many parameters do you have in total? In which layer you have the most of them? Is there any redundancy? (1 point)

On the first convolutional layer we have 10 filters with a size of 5x5 and stride=1. The input dimension is 3:

conv1_params = $(5*5*3+1)*10 = 750$ (for weights and biases)

The second convolutional layer has 20 filters with a size of 4x4:

conv2_params = $(4*4*10+1)*20 = 3220$

On the third convolutional layer we find 30 3x3 filters:

conv3_params = $(3*3*20+1)*30 = 5430$

Finally, we have a fully connected layer:

fc_params = $(30+1)*10 = 310$

We get 9720 parameters in total.

The third convolutional layer has most of the parameters (5430). Since we use parameter sharing among spatial positions we get optimal number of parameters at each layer.  

b. Is it possible to represent a 3x3 convolution operation (for simplicity, let’s assume that we have a single color channel) as a matrix multiplication? If yes, state the form of the matrix. If no, explain why. (2 point)

Yes, 3x3 convolution operation can be represented as a matrix multiplication which has a special form. For 3x3 matrices, convolution can be described as a process of flipping both the rows and columns of the kernel and then multiplying entries at corresponding positions and taking sum over them.

$
 \begin{bmatrix}
  k_{1,1} & k_{1,2} & k_{1,3} \\
  k_{2,1} & k_{2,2} & k_{2,3} \\
  k_{3,1} & k_{3,2} & k_{3,3} 
  \end{bmatrix}
  \cdot
   \begin{bmatrix}
  i_{1,1} & i_{1,2} & i_{1,3} \\
  i_{2,1} & i_{2,2} & i_{2,3} \\
  i_{3,1} & i_{3,2} & i_{3,3} 
  \end{bmatrix}
  =
  \begin{bmatrix}
  k_{3,3} & k_{3,2} & k_{3,1} \\
  k_{2,3} & k_{2,2} & k_{2,1} \\
  k_{1,3} & k_{1,2} & k_{1,1} 
  \end{bmatrix}
  \cdot
   \begin{bmatrix}
  i_{1,1} & i_{1,2} & i_{1,3} \\
  i_{2,1} & i_{2,2} & i_{2,3} \\
  i_{3,1} & i_{3,2} & i_{3,3} 
  \end{bmatrix}
$


E.g. for the central entry (position [2,2]) we get:
$
k_{3,3}*i_{1,1}+k_{3,2}*i_{1,2}+k_{3,1}*i_{1,3}+k_{2,3}*i_{2,1}+k_{2,2}*i_{2,2}+k_{2,1}*i_{2,3}+k_{1,3}*i_{3,1}+k_{1,2}*i_{3,2}+k_{1,1}*i_{3,3}
$

Therefore, the element at position [2, 2] can be represented as a combination of weighted entries of the original image matrix.


c. Applying dropout to convolutional layers may improve your test performance (for you interest, e.g. here [Wide Residual Networks](https://arxiv.org/pdf/1605.07146.pdf)). How could you explain this? (1 point)

Dropout deletes a random sample of the activations in the training phase and helps to prevent overfitting by eliminating the co-adaptation of units. It regularizes the network by adding some noise to the output feature maps on each layer. This allows to achieve better performance because the network becomes more robust to the variation of input features.

However, since dropout causes the loss of information it's usually a good practice to start with small values in the first layer and then gradually increase them. 

d. What is the role of Global Average Pooling that you already implemented in the task 1 (if you are curious, it was proposed in [Network in Network](https://arxiv.org/pdf/1312.4400.pdf))? What is the advantage of using it instead of concatenating all feature maps from the last convolutional layer and using a fully-connected layer on top of them? (1 point)

Global Average Pooling is a structural regularizer and it treats feature maps as confidence maps for given categories. Instead of adding fully connected layers on top of the feature maps, we can take the average of each feature map, and pass the result directly to the activation function (e.g. softmax).  This helps to emphasize correspondences between feature maps and categories. Our feature maps can be interpreted as confidence maps of the categories.

Furthermore, we don't need to optimize any parameters on the pooling layer and this can help us with overfitting. We also sum out the spatial information with Global Average Pooling. Therefore, neural network becomes more robust to the spatial transformations of the input images.

## Bonus task (1 place: 15 pts, 2 place: 10 pts, 3 place: 5 pts)
Now you are free to implement any CNN architecture in order to get **as low test error rate as possible** on CIFAR-10. You can train the model for arbitrary number of epochs, it can have any number of layers and neurons, you can use different data preprocessing techniques, data augmentation, pooling, dropout, model averaging and much more. 

You also can use research papers for inspiration. For example, here is a [collection](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html) of best published results, however it is slightly outdated.

#### Main rules:
- Your final test error should be **reproducible**.
- You must describe your main architecture decisions and why they make sense for the given task.
- You are not allowed to copy and paste CNN architectures from the internet. It should be self-written TF code. We will check this.
- Obviously, you are not allowed to train on test data.
- If you want to take part in the competition, you should submit a **separate python file** which does the training of your model and that outputs test errors for each epoch (evaluated on all 10000 test examples). It will be more convenient format for us to reproduce your results.

*Note: of course, people who have an access to more computational resources have an advantage in this task. However, even if you have an old laptop, you still can perform well and test many models. For example, you can train and evaluate your model only on a subset of the dataset (e.g. 1000 examples), which significantly speeds up the process. You can leave expensive hyperparameter tuning overnight. And of course, you are encouraged to collaborate with your teammates, which also have computational resources.*

## Submission instructions
You should provide a single Jupyter notebook as a solution. The naming should include the assignment number and matriculation IDs of all team members in the following format:
**assignment-10_matriculation1_matriculation_2_matriculation3.ipynb** (in case of 3 team members). 
Make sure to keep the order matriculation1_matriculation_2_matriculation3 the same for all assignments.

Please, submit your solution to your tutor (with **[NNIA][assignment-10]** in email subject):
1. Maksym Andriushchenko <s8mmandr@stud.uni-saarland.de>
2. Marius Mosbach <s9msmosb@stud.uni-saarland.de>
3. Rajarshi Biswas <rbisw17@gmail.com>
4. Marimuthu Kalimuthu <s8makali@stud.uni-saarland.de>

**If you are in a team, please submit only 1 solution to only 1 tutor.**