# Convolutional Neural Networks

## Part 1: Convolutions

In this first part you will see what convolutions do and how they work. 

### Load the Python libraries

Let us start by loading the necessary Python libraries and setting a few parameters for the notebook. The `misc` module from `scipy` will allow us to do some elementary image manipulation.

In [0]:
import numpy as np
import matplotlib.pyplot as plt

# for elementary image manipulation
from scipy import misc

%matplotlib inline

# specifies the default figure size for this notebook
plt.rcParams['figure.figsize'] = (10, 10)
# specifies the default color map
plt.rcParams['image.cmap'] = 'gray'

### Import an image

To explore how convolutions work, you will use the portrait of [Grace Hopper](https://en.wikipedia.org/wiki/Grace_Hopper).
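Use the `imread` function from `misc` to load the image, specifying the option `flatten=True`, which combines the three colour channels into a single grey-scale layer for a black-and-white photo (see the [imread documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.imread.html)). Note that `scipy.misc.imread` was deprecated in SciPy 1.0 and removed in SciPy 1.2, so with a recent SciPy you will need another image reader (for example `imageio.imread`) and will have to convert the result to grey-scale yourself.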

The image corresponds to a matrix of dimension H x W (height by width) where each entry corresponds to a pixel value between 0 (black) and 255 (white).

Use `imshow` from `plt` to display the image. 
Since it's a matrix, you can use standard numpy style indexing to only show a region.
Show the first 200 rows and the first 600 columns.

In [0]:
# Load the grace_hopper.jpg image from the data folder

# Show the image

# Show another figure with only the first 200 rows / 600 cols


Alternatively, you can view the pixel values directly: for example, select the first five rows and columns (the **top left corner**) and print the corresponding matrix.

In [0]:
# Print the pixel values of a region in the top left corner
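Because the loaded image is just a 2-D NumPy array, both the cropping and the corner inspection above reduce to slicing, with rows first and columns second. A minimal sketch follows; it uses a random array as a hypothetical stand-in for the portrait so that it runs without the `data` folder.

```python
import numpy as np

# Hypothetical stand-in for the grey-scale portrait: an H x W array of values in [0, 255]
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(600, 800)).astype(np.float64)

# First 200 rows and first 600 columns (rows come first, then columns)
region = image[:200, :600]
print(region.shape)  # (200, 600)

# Top-left 5 x 5 corner, printed as a matrix of pixel values
corner = image[:5, :5]
print(corner)
```

With the real image, passing `region` to `plt.imshow` displays the cropped area.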


### Define and apply a convolution function

Now let's define a convolution function. It must traverse the image, apply the filter at every position and return the results as a filtered image. Calculating the size of the filtered image along each dimension can be a little tricky; the formula is:

$$
\text{(size of the filtered image)} = \text{(input image size)} - \text{(filter size)} + 1
$$

For example, a 3x3 filter fits in only 3 positions along each axis of a 5x5 image, so the filtered image is 3x3 (5 - 3 + 1 = 3).

The `convolve` function takes as input an image and a filter matrix, and returns
the output of applying the filter at each position in the image through a function `multiply_sum`, which you will implement first.

The function `multiply_sum` corresponds to the following operation for two matrices $A$ and $B$ of identical dimensions:

$$
\text{multiply_sum}(A, B) = \sum_{i,j} A_{ij}B_{ij}
$$

i.e. you form a matrix $C$ whose entries are the entry-wise products $C_{ij} = A_{ij}B_{ij}$, and you sum all the entries of $C$.

**Note**: the implementations here are computationally inefficient (you will see that applying the convolution to the image takes a second or so). In practice, when dealing with thousands of images, you really don't want sub-optimal operations, which is why libraries like TensorFlow hide away *a lot* of optimisations to make such operations as fast as possible and to leverage the hardware available to you (e.g. GPUs).

In [0]:
# add your code to define the multiply_sum function (check that it works on a simple example)
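One possible implementation, as a sketch rather than the only option: NumPy's `*` on two arrays of the same shape forms the matrix $C$ of entry-wise products, and `np.sum` adds up all of its entries. Because `np.sum` sums over every axis, the same function also works unchanged for 3-D (colour) patches.

```python
import numpy as np

def multiply_sum(A, B):
    """Sum of the entry-wise products of two arrays of identical shape."""
    assert A.shape == B.shape, "multiply_sum expects arrays of identical dimensions"
    # A * B is the matrix C of entry-wise products; np.sum adds up all of its entries
    return np.sum(A * B)

# Simple check on 2 x 2 matrices: 1*5 + 2*6 + 3*7 + 4*8 = 70
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(multiply_sum(A, B))  # 70
```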


The function below defines the convolution operation. Go through the code and make sure you understand what is happening (a bit of fiddling is required to apply the operations at the right place and to store the results appropriately).

In [0]:
# Convolution function
def convolve(image, filter_matrix):
    
    # get the dimension of the filter
    filter_height = filter_matrix.shape[0]
    filter_width  = filter_matrix.shape[1]
    
    # allocate an empty array for the filtered image using the formula
    # this is the array we'll use to store the result of the convolution
    filtered_image = np.ndarray(shape=(image.shape[0] - filter_height + 1, 
                                       image.shape[1] - filter_width + 1))
    
    # go through rows
    for row in range(filtered_image.shape[0]):
        # go through columns
        for col in range(filtered_image.shape[1]):
            # select a local patch of the image
            patch = image[row:(row + filter_height), 
                          col:(col + filter_width)]

            # apply the multiply_sum operation
            ms = multiply_sum(patch, filter_matrix)
            
            # store it at the right location
            filtered_image[row, col] = ms
            
    return filtered_image
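The loop above visits every output pixel from Python, which is why it is slow. A vectorised sketch, assuming NumPy 1.20 or later for `numpy.lib.stride_tricks.sliding_window_view`, builds all patches at once and reduces them with `np.einsum`. The name `convolve_fast` is hypothetical, and the check compares it against a direct loop on a small random image and filter. SciPy's `scipy.signal.correlate2d(image, filter_matrix, mode='valid')` computes the same quantity.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def convolve_fast(image, filter_matrix):
    """Vectorised 'valid' filtering of a 2-D image, same output as the loop version."""
    # windows[r, c] is the filter-sized patch whose top-left corner is at (r, c);
    # its shape is (H - fh + 1, W - fw + 1, fh, fw)
    windows = sliding_window_view(image, filter_matrix.shape)
    # multiply each patch entry-wise with the filter and sum over the patch axes
    return np.einsum('ijkl,kl->ij', windows, filter_matrix)

rng = np.random.default_rng(1)
image = rng.random((8, 10))
filt = rng.random((3, 3))

fast = convolve_fast(image, filt)

# Reference: the same multiply-sum at every position, written as a compact loop
reference = np.array([[np.sum(image[r:r + 3, c:c + 3] * filt)
                       for c in range(10 - 3 + 1)]
                      for r in range(8 - 3 + 1)])

print(fast.shape)                    # (6, 8): (8 - 3 + 1, 10 - 3 + 1)
print(np.allclose(fast, reference))  # True
```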

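And there you have it, a convolution operator! You can now apply a filter to an image and see the result. (Strictly speaking, since the filter is not flipped, this operation is a cross-correlation; deep-learning libraries follow the same convention and still call it a convolution.)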

Define a filter matrix corresponding to

$$
\left(\begin{array}{ccc}
    -1&-1&-1\\ 
    2&2&2\\ 
    -1&-1&-1
\end{array}\right)
$$

Apply it to Grace Hopper's portrait and display the result with `imshow`.

In [0]:
# Add your code here...
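To build intuition alongside the portrait, here is a sketch that applies the same filter to two hypothetical 9x9 test images: one containing a thin horizontal line and one a thin vertical line. The helper `valid_filter` is an illustrative stand-in that computes the same thing as `convolve`.

```python
import numpy as np

def valid_filter(image, filt):
    """Illustrative loop: multiply-sum the filter with every fully-contained patch."""
    fh, fw = filt.shape
    out = np.zeros((image.shape[0] - fh + 1, image.shape[1] - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + fh, c:c + fw] * filt)
    return out

line_filter = np.array([[-1., -1., -1.],
                        [ 2.,  2.,  2.],
                        [-1., -1., -1.]])

# Hypothetical 9 x 9 test images on a black background
horizontal = np.zeros((9, 9))
horizontal[4, :] = 1.0          # a thin bright horizontal line in row 4
vertical = np.zeros((9, 9))
vertical[:, 4] = 1.0            # a thin bright vertical line in column 4

h_response = valid_filter(horizontal, line_filter)
v_response = valid_filter(vertical, line_filter)

# Output row 3 is centred on image row 4 and responds with 2 * 3 = 6; its neighbours give -3.
# The vertical line gives 0 everywhere, since each column of the filter sums to -1 + 2 - 1 = 0.
print(h_response[:, 0])
print(np.abs(v_response).max())
```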


### Quiz: What did our filter do?

1) By looking at the image, can you tell what kind of pattern the filter detected?

2) How would you design a filter which detects vertical edges?

3) What would the following filter (the [Prewitt operator](https://en.wikipedia.org/wiki/Prewitt_operator)) do?

$$
\left(\begin{array}{ccc}
    1&1&1\\ 
    0&0&0\\ 
    -1&-1&-1
\end{array}\right)
$$


How about its transpose? And what if you swap the first and last rows?

Try variations until you get an intuition for what these operators do.

In [0]:
# Apply the filter and see what it does then transpose it and check again

# apply the transpose

# flip first and last row
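One way to explore these variations is on a hypothetical image containing a single horizontal step edge (bright top half, dark bottom half). The sketch below uses an illustrative loop-based helper, `valid_filter`, equivalent to `convolve`, and compares the Prewitt filter, its transpose and its row-flipped version.

```python
import numpy as np

def valid_filter(image, filt):
    """Illustrative loop: multiply-sum the filter with every fully-contained patch."""
    fh, fw = filt.shape
    out = np.zeros((image.shape[0] - fh + 1, image.shape[1] - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + fh, c:c + fw] * filt)
    return out

prewitt = np.array([[ 1.,  1.,  1.],
                    [ 0.,  0.,  0.],
                    [-1., -1., -1.]])

# Hypothetical 6 x 6 image: bright top half, dark bottom half (one horizontal edge)
step = np.zeros((6, 6))
step[:3, :] = 1.0

out_prewitt = valid_filter(step, prewitt)        # positive along the edge: [0, 3, 3, 0] down each column
out_transpose = valid_filter(step, prewitt.T)    # the transpose compares left with right: 0 here
out_flipped = valid_filter(step, prewitt[::-1])  # swapping first and last rows flips the sign

print(out_prewitt[:, 0], out_transpose.max(), out_flipped[:, 0])
```

Trying the same three filters on an image with a vertical step edge shows the roles of the Prewitt filter and its transpose swapping.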


### Convolutions with colour

Very good! But what if we had a coloured image? How would we use that extra information to detect useful patterns?
The idea is simple: instead of having one weight for each pixel position in the filter, we have one weight for each colour channel at each position.
Filters become stacks of kernels (usually 3, one for each of the channels R, G and B).

An example is the following filter, which detects regions of the image that are mostly brown.

In [0]:
brown_filter = np.array(
      [[[ 0.13871045,  0.17157242,  0.12934428], # Red channel
        [ 0.16168842,  0.20229845,  0.14835016],
        [ 0.135694  ,  0.16206263,  0.11727387]],

       [[ 0.04231958,  0.05471011,  0.03167877], # Green channel
        [ 0.0462575 ,  0.06581022,  0.03104937],
        [ 0.04185439,  0.04734124,  0.02087744]],

       [[-0.15704881, -0.16666673, -0.16600266], # Blue channel
        [-0.17439997, -0.17757156, -0.18760149],
        [-0.15435153, -0.17037505, -0.17269668]]])

print(brown_filter.shape)

The **first** dimension corresponds to the three channels (R, G, B) (try `brown_filter[1, :, :]` to see the filter values for the green channel).

Looking at the values, you can see that the filter responds strongly to red regions (positive, reasonably large values), a little to green (positive, quite small values) and negatively to blue regions (negative values).

To see it in practice, use `imread` from `misc` without `flatten=True` to load the coloured image of Grace Hopper.

Show the image and its dimensions.

In [0]:
# Load and display the coloured image of Grace Hopper

# Show the shape of the image


Now we would like to apply the brown filter and see the result.
One thing needs to be done first. It is a bit annoying, but it happens *all the time* in CNNs (and other non-trivial NNs): you need to adjust dimensions.
Currently, the **first** dimension of the brown filter corresponds to the colour channels, while you saw that it is the **last** dimension of Grace Hopper's image that corresponds to the channels.

Therefore you need to re-arrange dimensions from 

```
(0, 1, 2) -----> (1, 2, 0)
```

this can be done via the `transpose` method: `array.transpose((1, 2, 0))`.

Adjust the brown filter and apply it. 

In [0]:
# Adjust the dimensions of the brown filter and apply it to the image of GH
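The sketch below illustrates both steps on a hypothetical, simplified filter (weights of +1 on red, 0 on green and -1 on blue, loosely in the spirit of `brown_filter`) and on two hypothetical single-colour images. For a 3x3x3 filter the transpose leaves the shape unchanged, so the check compares channel slices instead. The loop-based `convolve_colour` is illustrative; the `convolve` function above should also handle an (H, W, 3) image unchanged, provided `multiply_sum` sums over all axes.

```python
import numpy as np

# Hypothetical channels-first (C, H, W) colour filter:
# positive weights on red, zero on green, negative on blue
filt_chw = np.stack([np.full((3, 3), 1.0),    # red channel
                     np.zeros((3, 3)),        # green channel
                     np.full((3, 3), -1.0)])  # blue channel

# Images are (H, W, C), so move the channel axis last: (0, 1, 2) -> (1, 2, 0)
filt_hwc = filt_chw.transpose((1, 2, 0))
print(np.array_equal(filt_hwc[:, :, 0], filt_chw[0]))  # True: red is now the last-axis slice 0

def convolve_colour(image, filt):
    """'Valid' filtering of an (H, W, 3) image with an (h, w, 3) filter."""
    fh, fw = filt.shape[:2]
    out = np.zeros((image.shape[0] - fh + 1, image.shape[1] - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # the patch is (fh, fw, 3): summing over all axes also sums over the channels
            out[r, c] = np.sum(image[r:r + fh, c:c + fw] * filt)
    return out

red_img = np.zeros((5, 5, 3))
red_img[..., 0] = 255.0    # pure red image
blue_img = np.zeros((5, 5, 3))
blue_img[..., 2] = 255.0   # pure blue image

print(convolve_colour(red_img, filt_hwc)[0, 0])   # 2295.0 (9 pixels x 255 x weight +1)
print(convolve_colour(blue_img, filt_hwc)[0, 0])  # -2295.0 (9 pixels x 255 x weight -1)
```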


### Quiz

Can you design a filter which will detect the edge between the background (blue) and Grace Hopper's left shoulder (black)?

**Note**: it is good practice to have the weights in your filter sum to 0, and don't forget to re-arrange the dimensions.

In [0]:
# devise a left_shoulder_filter and apply it


## Convolutional Neural Networks

Now let's load an already trained network into our environment. This network (VGG-16) has been trained on the ImageNet dataset, where the goal is to classify pictures into one of one thousand categories.
When it came out in 2014, it took second place in the classification task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and first place in the localisation task, with a top-5 accuracy of about 93% on the test set (the correct label was among its five best guesses for about 93% of the images).
For comparison, humans have been estimated to achieve around 95% top-5 accuracy.
It is also conceptually simple: it only uses 3x3 convolutions (like the ones you have used before)!

However, it is rather deep, and training it takes **2 to 3 weeks on 4 GPUs**...

To load the model, you must first define its architecture.
You're going to do this step by step as you learn the components of convolutional neural networks. 

**IMPORTANT NOTE**: it is extremely difficult to come up with an architecture "that works". So while people successfully adapt existing neural nets such as VGG16 to their needs (and, in fact, a lot of people use it without its last layer as an *image-embedding operator*), architecture design is the realm of research (and much head scratching).

Recently, the Google Brain team has applied brute-force style search not only to find the best architectures for a given problem but also to learn which activation functions and which step-size (optimiser update) rules to use (a "learn everything" approach).
This required an obscene amount of resources, and the results were far from intuitive (*search for "Google Brain 2017 year review" for a discussion of this and many other interesting results*).

### Load the Python libraries

Let's load the necessary libraries. We are again going to use the `Keras` library with the TensorFlow backend. You will also use `cv2` (OpenCV) for image manipulations. Keras and TensorFlow can be installed via `pip`; so can `cv2`, although the package to install is called `opencv-python`:

```bash
pip install opencv-python
```

**NOTE**: these libraries are computationally fairly advanced, so they can be a bit fiddly to install on your computer (especially TensorFlow and OpenCV). If the installation fails, it is very likely that someone has had your problem before, so copy and paste the error message into Google and go from there. For this notebook, though, we recommend that you pair up with someone who has a working installation, so that you don't waste too much time reading Stack Overflow and GitHub posts. You may also get a few warnings from Python, but if they don't look too scary, you're probably fine.


In [0]:
import cv2  # for image manipulations

from keras.models import Sequential

# importing the layers from keras.layers works across Keras 2 versions
# (the keras.layers.core / keras.layers.convolutional sub-modules were reorganised)
from keras.layers import Flatten, Dense, Dropout
from keras.layers import Conv2D, MaxPooling2D, ZeroPadding2D

from keras.optimizers import SGD
from keras.datasets import cifar10
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import np_utils

### Implementing a convolutional layer

You are going to define the first convolutional layer of the network. But first, you will add some padding to the image so that the convolutions also apply at the outer edges.

In what follows you don't have to modify the cells; just run them, making sure you understand what is being done. Do not tune the parameters, as we will load pre-trained weights into the architecture!

In [0]:
# Create the model, it's a Sequential model (stack of layers one after the other)
vgg_model = Sequential()

# On the very first layer, you must specify the input shape
# ZeroPadding2D adds a frame of 0 (column left and right, row top and bottom)
# the tuple (1, 1) indicates it's one pixel and symmetric.
vgg_model.add(ZeroPadding2D((1, 1), input_shape=(224, 224, 3))) 

# Your first convolutional layer will have 64 3x3 filters, 
# and will use a relu activation function
vgg_model.add(Conv2D(64, (3, 3), activation='relu', name='conv1_1'))
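What `ZeroPadding2D((1, 1))` followed by an unpadded 3x3 `Conv2D` does to the spatial size can be checked with NumPy alone; the sketch uses a hypothetical all-ones input. `Conv2D` uses `padding='valid'` by default, so without the zero frame each convolution would shrink the image by 2 pixels in each direction.

```python
import numpy as np

# Hypothetical stand-in for one 224 x 224 RGB input image
image = np.ones((224, 224, 3))

# Like ZeroPadding2D((1, 1)): one row of zeros top and bottom, one column left and right,
# and no padding along the channel axis
padded = np.pad(image, ((1, 1), (1, 1), (0, 0)), mode='constant', constant_values=0)
print(padded.shape)  # (226, 226, 3)

# An unpadded ('valid') 3x3 convolution then yields 226 - 3 + 1 = 224 rows and columns,
# so padding + convolution preserves the 224 x 224 spatial size
conv_size = padded.shape[0] - 3 + 1
print(conv_size)     # 224
```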

### Stacking layers

Now you're going to stack another convolutional layer. Remember, the output of a convolutional layer is a 3-D tensor, just like the input image, although with a much greater depth (64 channels here, one per filter, instead of 3).

In [0]:
# Once again you must add padding
vgg_model.add(ZeroPadding2D((1, 1)))
vgg_model.add(Conv2D(64, (3, 3), activation='relu', name='conv1_2'))
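Each filter of a convolutional layer spans all the channels of its input, so a layer with $f$ filters of size $k_h \times k_w$ on an input with $c$ channels has $k_h \cdot k_w \cdot c \cdot f$ weights plus $f$ biases. A small helper (hypothetical name `conv_params`) applies this to the first two VGG layers; `vgg_model.summary()` should report the same counts.

```python
def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus biases of a Conv2D layer: each filter spans all input channels."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

print(conv_params(3, 3, 3, 64))   # conv1_1 sees the 3 colour channels: 1792
print(conv_params(3, 3, 64, 64))  # conv1_2 sees the 64 channels output by conv1_1: 36928
```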

### Adding pooling layers

Now let's add your first pooling layer. Pooling reduces the width and height of the input by aggregating adjacent cells together (here, by keeping the maximum of each 2x2 window).


In [0]:
# Add a pooling layer with window size 2x2
# The stride indicates the distance between each pooled window
vgg_model.add(MaxPooling2D((2, 2), strides=(2, 2)))
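A NumPy sketch of what `MaxPooling2D((2, 2), strides=(2, 2))` computes on a channels-last array: the maximum of each non-overlapping 2x2 window, taken separately for every channel. The reshape trick in the hypothetical helper `max_pool_2x2` assumes even height and width; Keras's default `'valid'` padding instead drops a leftover row or column.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W, C) array; H and W assumed even."""
    h, w, c = x.shape
    # split each spatial axis into (blocks, 2) and take the max inside each 2x2 block
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
print(x[:, :, 0])                # rows [0..3], [4..7], [8..11], [12..15]
print(max_pool_2x2(x)[:, :, 0])  # maxima of the four 2x2 blocks: 5, 7, 13, 15
```

Applied to the 224 x 224 x 64 output of `conv1_2`, this gives 112 x 112 x 64.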

### Adding more convolutions for VGG

Now you can stack many more of these! Remember not to change the parameters as we are about to load the weights of an already trained version of this network.

Also, as you will quickly realise, Keras for practitioners usually means a lot of copy-pasting...

In [0]:
# second set of Padding - Conv - Padding - Conv - Pooling
vgg_model.add(ZeroPadding2D((1, 1)))
vgg_model.add(Conv2D(128, (3, 3), activation='relu', name='conv2_1'))
vgg_model.add(ZeroPadding2D((1, 1)))
vgg_model.add(Conv2D(128, (3, 3), activation='relu', name='conv2_2'))
vgg_model.add(MaxPooling2D((2, 2), strides=(2, 2)))

# third set
vgg_model.add(ZeroPadding2D((1, 1)))
vgg_model.add(Conv2D(256, (3, 3), activation='relu', name='conv3_1'))
vgg_model.add(ZeroPadding2D((1, 1)))
vgg_model.add(Conv2D(256, (3, 3), activation='relu', name='conv3_2'))
vgg_model.add(ZeroPadding2D((1, 1)))
vgg_model.add(Conv2D(256, (3, 3), activation='relu', name='conv3_3'))
vgg_model.add(MaxPooling2D((2, 2), strides=(2, 2)))

# fourth set
vgg_model.add(ZeroPadding2D((1, 1)))
vgg_model.add(Conv2D(512, (3, 3), activation='relu', name='conv4_1'))
vgg_model.add(ZeroPadding2D((1, 1)))
vgg_model.add(Conv2D(512, (3, 3), activation='relu', name='conv4_2'))
vgg_model.add(ZeroPadding2D((1, 1)))
vgg_model.add(Conv2D(512, (3, 3), activation='relu', name='conv4_3'))
vgg_model.add(MaxPooling2D((2, 2), strides=(2, 2)))

# fifth set
vgg_model.add(ZeroPadding2D((1, 1)))
vgg_model.add(Conv2D(512, (3, 3), activation='relu', name='conv5_1'))
vgg_model.add(ZeroPadding2D((1, 1)))
vgg_model.add(Conv2D(512, (3, 3), activation='relu', name='conv5_2'))
vgg_model.add(ZeroPadding2D((1, 1)))
vgg_model.add(Conv2D(512, (3, 3), activation='relu', name='conv5_3'))
vgg_model.add(MaxPooling2D((2, 2), strides=(2, 2)))

As you can see, the depth of the layers gets progressively larger, up to 512 for the last layers.
This means that, as we go along, each layer can detect a greater number of features.

On the other hand, each max-pooling layer halves the height and width of the layer outputs. 
Starting from images of dimensions 224x224, the final outputs are only of size 7x7.
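These sizes can be traced with a few lines of arithmetic: each padded 3x3 convolution keeps the spatial size, and each of the five pooling layers halves it. The sketch also computes the length of the vector that a `Flatten` layer produces from the final output, and the parameter count of a 4096-neuron `Dense` layer on top of it, which holds most of the network's weights.

```python
size, depth = 224, 3
sizes = []
for n_filters in [64, 128, 256, 512, 512]:  # one entry per convolutional block
    depth = n_filters     # the block's convolutions output n_filters channels
    size = size // 2      # padded 3x3 convs keep the size; the 2x2, stride-2 pooling halves it
    sizes.append(size)

flat_length = size * size * depth           # 7 * 7 * 512 values once flattened
fc1_params = flat_length * 4096 + 4096      # weights + biases of a Dense(4096) layer on top

print(sizes)        # [112, 56, 28, 14, 7]
print(flat_length)  # 25088
print(fc1_params)   # 102764544
```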

Now you're about to add some fully connected layers, which can learn more abstract features of the image.
But first you must change the layout of the input so that it is a 1-D tensor (vector).

In [0]:
# Flatten the output
vgg_model.add(Flatten())

# Add a fully connected layer with 4096 neurons
vgg_model.add(Dense(4096, activation='relu'))

The `Flatten` layer removes the spatial dimensions of the previous layer's output, which becomes a simple 1-D vector of numbers. This means we can no longer apply 2D convolution layers as before, but we can apply fully connected layers like those of the perceptron from the previous module.

`Dense` layers are fully connected layers. You used them in the previous module.

### Preventing overfitting with Dropout

`Dropout` is a method used at training time to prevent overfitting. As a layer, it randomly sets a fraction of its inputs to zero,
so that the neural network learns to be robust to these changes. Although you won't actually use it
now, you must define it to correctly load the pre-trained weights, as it was part of the original network.

In [0]:
# Add a dropout layer
vgg_model.add(Dropout(0.5))

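The argument 0.5 is the dropout rate: the fraction of input units randomly set to 0 at each training step (0.0 means nothing is dropped). Keras scales the remaining units by 1/(1 - rate) during training so that their expected sum is unchanged, and at inference time the layer passes its input through unchanged.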
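A NumPy sketch of this "inverted dropout" behaviour, using a hypothetical `dropout` helper; it illustrates the idea rather than reproducing the Keras implementation.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of entries, rescale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return x  # at inference time the input passes through unchanged
    keep = rng.random(x.shape) >= rate  # each entry is kept with probability 1 - rate
    return np.where(keep, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
x = np.ones(100_000)
y = dropout(x, 0.5, rng)
print((y == 0).mean())  # close to 0.5: about half the units are dropped
print(y.mean())         # close to 1.0: rescaling keeps the expected value unchanged
```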


Add one more fully connected layer:

In [0]:
vgg_model.add(Dense(4096, activation='relu'))
vgg_model.add(Dropout(0.5))

Finally, add a softmax layer to predict the categories. There are 1000 categories and hence 1000 neurons.

In [0]:
vgg_model.add(Dense(1000, activation='softmax'))
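Softmax exponentiates the scores computed by the last layer and normalises them so that they sum to 1, which lets the 1000 outputs be read as probabilities. A minimal NumPy sketch, with a hypothetical helper name `softmax` and made-up scores for 3 classes:

```python
import numpy as np

def softmax(z):
    """Map a vector of scores to probabilities that are positive and sum to 1."""
    z = z - np.max(z)  # shifting by the max avoids overflow in exp and leaves the result unchanged
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores for 3 classes
p = softmax(scores)
print(p, p.sum())                   # the largest score gets the largest probability; the sum is 1
```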

### Loading the weights

And you're all set with the architecture! Let's load the weights of the network. For this, use

```python
vgg_model.load_weights(path)
```

where `path` is the path to `vgg16_weights_tf_dim_ordering_tf_kernels.h5` which you can get [here](https://github.com/fchollet/deep-learning-models/releases): 

In [0]:
# add your code here


Compile the network (no need to worry about this step for now):

In [0]:
sgd = SGD()
vgg_model.compile(optimizer=sgd, loss='categorical_crossentropy')

### Preprocessing the data

Let's feed an image to your model. The model takes as input a slightly transformed version of the image; in the VGG network, the only transformation is zero-centering.

1. use `cv2` to read the `cat.jpg` image in `data/`, as well as `puppy.jpg` and any other simple photo you found online
2. resize the images to 224 by 224 pixels using `cv2.resize`
3. (optional) show the photos; since `cv2.imread` returns the channels in BGR order, reverse them (e.g. `img[:, :, ::-1]`) before calling `imshow` to get the right colours

In [0]:
# add your code here


The VGG16 network assumes that its input is an array of 224 by 224 colour images with the channels in BGR order (the order `cv2.imread` uses), each channel having been *centered* by subtracting its mean value over the ImageNet training set.

The function below applies the centering and makes sure the dimensions are right. 

In [0]:
# This transformation performs the 0-centering
def transform_image(image):
    image_t = np.copy(image).astype(np.float32) # Avoids modifying the original
    image_t[:, :, 0] -= 103.939                 # Subtracts mean Blue (cv2 loads images as BGR)
    image_t[:, :, 1] -= 116.779                 # Subtracts mean Green
    image_t[:, :, 2] -= 123.68                  # Subtracts mean Red
    image_t = np.expand_dims(image_t, axis=0)   # The network takes batches of images as input
    return image_t

img_t = transform_image(img)
img2_t = transform_image(img2)

print(img_t.shape)

The first dimension is a "dummy" batch dimension of size 1, because the network expects an array of images as input.

The three subsequent dimensions are the image dimensions with the three colour channels at the end. 
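The same preprocessing can also be written with broadcasting, since subtracting a length-3 array from an (H, W, 3) array subtracts one value per channel. The sketch below uses a hypothetical name, `transform_image_broadcast`, and constant stand-in images. Its last lines show how to reverse the channel order if an image was loaded in RGB order by a reader other than OpenCV, which is an assumption about your own loading code.

```python
import numpy as np

# Per-channel means in BGR order, the same values subtracted by transform_image
BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def transform_image_broadcast(image):
    """Hypothetical broadcasting variant of transform_image for an (H, W, 3) BGR image."""
    # astype returns a new float32 array, so the original image is not modified;
    # subtracting a length-3 array removes one mean per channel (last axis)
    image_t = image.astype(np.float32) - BGR_MEAN
    return np.expand_dims(image_t, axis=0)  # add the batch dimension

# Hypothetical stand-in for a resized image as OpenCV would return it (uint8, BGR)
img = np.full((224, 224, 3), 128, dtype=np.uint8)
img_t = transform_image_broadcast(img)
print(img_t.shape)  # (1, 224, 224, 3)

# A pure red image in RGB order; reversing the last axis moves red to channel 2 (BGR)
rgb_img = np.zeros((224, 224, 3), dtype=np.uint8)
rgb_img[..., 0] = 255
bgr_img = rgb_img[:, :, ::-1]
```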

### Getting an output from the network

Let's push the images through the network and see what happens!

In [0]:
# Push the image through the network using vgg_model.predict and store the result

# the output is for a batch of images, but you only gave one, so extract the first element

# now plot the output, xlabel=Categories, ylabel=Probabilities


The network seems pretty confident! Let's look at its top 5 guesses for each of the two images.

1. load the labels from `data/synset_words`
2. sort the probabilities and keep the top 5
3. display the corresponding categories

In [0]:
# add your code here
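A sketch of the top-5 step, using a random probability vector as a hypothetical stand-in for the output of `vgg_model.predict` and placeholder category names. The line-parsing example assumes each line of the labels file is a WordNet ID followed by a space and the category names, as in the commonly distributed `synset_words.txt`; check your file before relying on it.

```python
import numpy as np

# Hypothetical stand-in for vgg_model.predict(img_t): a batch holding one probability vector
rng = np.random.default_rng(0)
preds = rng.dirichlet(np.ones(1000), size=1)  # shape (1, 1000); each row sums to 1
probs = preds[0]                              # one image in the batch -> take the first row

# Placeholder category names; the real ones come from the labels file
labels = [f"category_{i}" for i in range(1000)]

# argsort sorts ascending, so reverse it and keep the first 5 indices
top5 = np.argsort(probs)[::-1][:5]
for idx in top5:
    print(f"{probs[idx]:.4f}  {labels[idx]}")

# Assumed line format of the labels file: a WordNet ID, a space, then the names
line = "n01440764 tench, Tinca tinca\n"
name = line.strip().split(" ", 1)[1]
print(name)  # tench, Tinca tinca
```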


Hurray! Our network knows what it's talking about. Let's have a closer look at what goes on inside.

### Looking inside the network

In a convolutional neural network, there's an easy way to visualise the filters learned at the very first layer. We can print each filter to show which colours it responds to.

In [0]:
# This is a helper function to let you visualise what goes on inside the network

def vis_square(weights, padsize=1, padval=0, activation=False):
    # Avoids modifying the network weights
    data = np.copy(weights)
    
    # Normalise the inputs
    data -= data.min()
    data /= data.max()
    
    # Let's tile the inputs
    # How many inputs per row (e.g.: if 64 filters then 8x8)
    
    n = int(np.ceil(np.sqrt(data.shape[0]))) 
    
    # add padding between inputs (you can safely ignore this)
    padding = ((0, n ** 2 - data.shape[0]), (0, padsize), (0, padsize)) + ((0, 0),) * (data.ndim - 3)
    data = np.pad(data, padding, mode='constant', constant_values=(padval, padval))
    
    # place the filters on an n by n grid
    data = data.reshape((n, n) + data.shape[1:])
    
    # merge the filters contents onto a single image
    data = data.transpose((0, 2, 1, 3) + tuple(range(4, data.ndim)))
    data = data.reshape((n * data.shape[1], n * data.shape[3]) + data.shape[4:])
        
    # show the filter. In the activation case we don't have a colour channel (see further)
    plt.figure()
    if activation:
        plt.imshow(data[:, :, 0])
    else:
        plt.imshow(data)
    
    plt.axis('off')
    
# Get the weights of the first convolutional layer
# (index 1: layer 0 is the ZeroPadding2D layer)
first_layer_weights = vgg_model.layers[1].get_weights()

# first_layer_weights[0] stores the connection weights of the layer
# first_layer_weights[1] stores the biases of the layer
# For now we're just interested in the connections
filters = first_layer_weights[0]

# Visualise the filters 
# (swapaxes to get it to be in the appropriate ordering of dimensions)
vis_square(np.swapaxes(filters, 0, 3))

You can see how each filter detects a different property of the input image. Some respond to certain colours, while others (the greyscale-looking ones) detect changes in brightness, such as edges.

Another way of visualising the network is to see which neurons get activated as an image traverses the network. A neuron outputting a high value means that the pattern it has learnt to detect has been observed.
Let's apply this to our kitten image.

In [0]:
# Keras can run on top of TensorFlow or Theano; the backend module K offers a common interface to both
import keras.backend as K

def get_layer_output(model, image, layer):
    output_fn = K.function([model.layers[0].input], [model.layers[layer].output])
    return output_fn([image])[0]

# retrieve the activation of the first convolutional layer
layer_output = get_layer_output(vgg_model, img_t, 1)

# visualise
vis_square(np.swapaxes(layer_output, 0, 3), padsize=5, padval=1, activation=True)

**It's worth spending a moment to understand what is going on here. Each pixel in this image is a different neuron in the neural network. Neurons in the same tile share the same weights and therefore detect the same feature. You can compare the visualised filters above with their corresponding tiles. Which filter helps detect grass?**



Using this method, it is possible to visualise the deeper parts of the neural network, although they become much harder to interpret. You can visualise the output of the second convolutional layer:

In [0]:
# add your code here


And the eighth layer:

In [0]:
# add your code here


As we go further down the network, the representations become spatially smaller because of the pooling layers. The final convolutional layers only have a spatial size of 14 by 14.

In [0]:
# add your code here to visualise the 28th layer


## Training your own network

Let's train a network! We're going to use the CIFAR10 dataset, in which the goal is to classify images into one of 10 categories ([more information](https://www.cs.toronto.edu/~kriz/cifar.html)).

If you're interested in benchmarks, have a look [here](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html#43494641522d3130). 

### Loading the CIFAR10 dataset

Load the CIFAR10 dataset. The steps are similar to those for MNIST:

* use `cifar10.load_data()` to load the data (hopefully you have done that earlier)
* separate the train and test sets
* normalise the values by dividing them by 255
* there are 10 categories, so use `np_utils.to_categorical` to one-hot encode the labels into 10 classes


In [0]:
# Add your code here to load and prepare the cifar10 data
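A sketch of the preparation steps on random arrays shaped like the CIFAR10 data (`cifar10.load_data()` returns images of shape (N, 32, 32, 3) and labels of shape (N, 1)). The arrays are hypothetical stand-ins, and the one-hot encoding via `np.eye` is meant to match what `np_utils.to_categorical(y, 10)` produces.

```python
import numpy as np

# Hypothetical stand-ins with the CIFAR10 layout: uint8 images and (N, 1) integer labels in 0..9
rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(100, 32, 32, 3)).astype(np.uint8)
y_train = rng.integers(0, 10, size=(100, 1))

# Normalise pixel values to [0, 1]
X_train = X_train.astype('float32') / 255.0

# One-hot encode: row i of the 10 x 10 identity matrix for label i
Y_train = np.eye(10, dtype='float32')[y_train.ravel()]
print(X_train.shape, Y_train.shape)  # (100, 32, 32, 3) (100, 10)
```

The test set is prepared in exactly the same way.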


### Building the model

Let's define a model. You will use a small model so that training is not too slow. You should be able to recognise the key parts.

In [0]:
cifar10_model = Sequential()

cifar10_model.add(Conv2D(32, (3, 3), 
                         padding='same', 
                         input_shape=(32, 32, 3), activation='relu'))

cifar10_model.add(Conv2D(32, (3, 3), activation='relu'))
cifar10_model.add(MaxPooling2D(pool_size=(2, 2)))
cifar10_model.add(Dropout(0.25))

cifar10_model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
cifar10_model.add(Conv2D(64, (3, 3), activation='relu'))
cifar10_model.add(MaxPooling2D(pool_size=(2, 2)))
cifar10_model.add(Dropout(0.25))

cifar10_model.add(Flatten())
cifar10_model.add(Dense(512, activation='relu'))
cifar10_model.add(Dropout(0.5))
cifar10_model.add(Dense(10, activation='softmax'))
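Tracing the spatial size through this model uses the output-size formula from Part 1 for the unpadded (`'valid'`) convolutions, while `padding='same'` keeps the size. The helpers below are hypothetical; the final number is the length of the vector produced by `Flatten`.

```python
def conv_out(size, kernel=3, padding='valid'):
    # 'same' keeps the size (stride 1); 'valid' uses (input size) - (filter size) + 1
    return size if padding == 'same' else size - kernel + 1

def pool_out(size, pool=2):
    # MaxPooling2D with the default stride (= pool size) and 'valid' padding rounds down
    return size // pool

s = 32
s = conv_out(s, padding='same')  # 32
s = conv_out(s)                  # 30
s = pool_out(s)                  # 15
s = conv_out(s, padding='same')  # 15
s = conv_out(s)                  # 13
s = pool_out(s)                  # 6
print(s, s * s * 64)             # 6 2304
```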

**Quiz: HOW MANY WEIGHTS IN THE NETWORK?**

- How many convolution weights does the first layer contain? What about the second layer?
- Are there any other weights in those layers?

### Define the training schedule

Using the Adam optimizer, you can compile the model.

In [0]:
# Using the Adam optimizer, as before

cifar10_model.compile(loss='categorical_crossentropy', 
                      optimizer='adam', metrics=['accuracy'])

### Image pre-processing

We have to define the preprocessing for the images. Here we define:

* 0-mean across all images ("feature-centering")
* variance 1 across all images ("normalization")
* we introduce a random horizontal and vertical shift to create more ("perturbed") training samples (makes the NN more robust as well)
* randomly flip images horizontally

You get the idea of what can be done. Keep in mind that random augmentation can improve results significantly, but it also means you effectively have many (many) more samples to train on. It also introduces (a lot of) correlation into your dataset, hence the need for (more) regularisation such as dropout...

In [0]:
# Preprocessing, does both normalization and augmentation
datagen = ImageDataGenerator(
        featurewise_center=True,                 # set input mean to 0 over the dataset
        samplewise_center=False,                 # set each sample mean to 0
        featurewise_std_normalization=True,      # divide inputs by std of the dataset
        samplewise_std_normalization=False,      # divide each input by its std
        rotation_range=0,                        # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,                   # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,                  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,                    # randomly flip images horizontally
        vertical_flip=False)                     # randomly flip images vertically

# Compute quantities required for featurewise normalization (std, mean)
datagen.fit(X_train)
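Roughly what the featurewise options and `horizontal_flip` do, as a NumPy sketch on a hypothetical random batch: the mean and standard deviation are computed per colour channel over the whole dataset (which is why `datagen.fit` must see the training data), then a random subset of images is mirrored left to right. The small epsilon is an assumption to avoid division by zero, and Keras applies the flips per image on the fly rather than to a stored array.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 32, 32, 3)).astype('float32')  # hypothetical training batch

# featurewise_center / featurewise_std_normalization: one mean and std per colour channel,
# computed over all images and all pixel positions
mean = X.mean(axis=(0, 1, 2), keepdims=True)
std = X.std(axis=(0, 1, 2), keepdims=True)
X_std = (X - mean) / (std + 1e-6)

# horizontal_flip: mirror the column (width) axis of a randomly chosen half of the images
flip = rng.random(len(X)) < 0.5
X_aug = X_std.copy()
X_aug[flip] = X_aug[flip, :, ::-1, :]
```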

And you're set! You can start training and watch the accuracy improve! (You don't have to wait all the way until the end: we ran it for you, and it reaches around 80% accuracy after a bit over an hour of training, which is reasonable but quite far from state-of-the-art results on this dataset.)

In [0]:
batch_size = 32
nb_epoch = 200

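cifar10_model.fit_generator(
    datagen.flow(X_train, Y_train, batch_size=batch_size),
    steps_per_epoch=int(np.ceil(X_train.shape[0] / batch_size)),  # must be an integer
    epochs=nb_epoch,
    # apply the same featurewise normalisation to the test set as to the training set
    validation_data=(datagen.standardize(np.copy(X_test)), Y_test))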