<a href="https://colab.research.google.com/github/mtwenzel/image-video-understanding/blob/master/Session_1_CNN_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### This is a Jupyter notebook.  
### The most important keyboard shortcuts (cf. the "Help" menu) are
* **cursor keys** to select cells
* **Enter** to go from command mode to edit mode (for changing cell contents)
  * (**Esc** would go back to command mode.)
* **Shift+Enter** to *execute and advance* a cell
    
### If you execute it in Google Colab, some extra functions are provided:

* Cells can **hide the code**. This is the case for the "Imports" cell below. Double-clicking still gets you to the code directly. The code can be hidden again with a double-click.
* Cells can provide convenient **parameter interfaces**, like drop-down lists, sliders, and input fields. You will see this in the "Initialize random data" cell below. Again, double-clicking brings up the code.

# Image and Video Understanding -- Session 1 (Classification)


## 1. First Experiments with Random Data
* Start by importing the required Python modules that implement the layers we will use to build the network.
* We also need a "container" to connect the layers: the "Model"

In [None]:
#@title Imports
#@markdown To edit the imports, double-click on the cell

#@markdown We set TensorFlow 2.x as default for this notebook. This includes Keras.

#%tensorflow_version 2.x

from tensorflow.keras.layers import InputLayer, Conv2D, MaxPool2D, Flatten, Dense, UpSampling2D, Dropout
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import optimizers

import numpy as np

### Create random data

In these examples, we'll use artificial data first, and then switch to real data.

Run the code in the following cell, which will create a pair of input data `x_train` and corresponding output data `y_train` for training a classifier. `x_train` contains the chosen number of training examples (or instances), each with the chosen number of features. `y_train` contains the labels: `1` for the first half of the examples and `0` for the second half. Our goal is to train a model that takes the input data `x` and maps it to one of the two classes `0` or `1` contained in `y`.

In [None]:
#@title Initialize random data
#@markdown Create random data sampled from uniform distribution.
#@markdown Set the desired number of instances
NUM_INSTANCES = 100 #@param {type:'slider', min:0, max:10000, step:100}
#@markdown Set the desired number of features (random from uniform distribution)
NUM_FEATURES = 1000 #@param {type:'slider', min:0, max:10000, step:100}
x_train = np.random.random((NUM_INSTANCES, NUM_FEATURES)) 
y_train = np.zeros((NUM_INSTANCES,)) # Label vector (initialized with 0s)
y_train[:int(NUM_INSTANCES/2)] = 1 # set first half of vector to 1


### Define the model
We define the deep learning model below in a _sequential_ fashion, which means that the individual layers are added one after another. Here we use a simple fully-connected model built from `Dense` layers. Each feature of the input vector `x_train` is connected to every neuron (called a unit) in the following `Dense` layer. The input layer must be told the shape of the data that will be fed to the network.

In [None]:
model = Sequential() # We choose a simple sequential model without branching
model.add(InputLayer(input_shape = (NUM_FEATURES,)))
#@markdown Play with the number of neurons
NUM_NEURONS = 256 #@param {type:'integer'}
model.add(Dense(units=NUM_NEURONS, name="Hidden")) 

#@markdown Optionally increase the number of layers.
#model.add(Dense(units=128))
#model.add(Dense(units=64))
model.add(Dense(units=1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adadelta')
model.summary()
# @markdown If only interested in the number of parameters, use this:
# @markdown `print("Model parameters: {0:,}".format(model.count_params()))`
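
The parameter count reported by `model.summary()` can be checked by hand: a fully-connected layer has one weight per input-unit pair plus one bias per unit. The following is a minimal sketch assuming the default settings above (1000 features, 256 hidden units, 1 output unit); the helper `dense_param_count` is a hypothetical name introduced only for this illustration.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Values assumed from the default slider and field settings above.
num_features, num_neurons = 1000, 256

hidden = dense_param_count(num_features, num_neurons)  # the "Hidden" layer
output = dense_param_count(num_neurons, 1)             # the sigmoid output layer
print(f"Hidden: {hidden:,}  Output: {output:,}  Total: {hidden + output:,}")
```

If you change the number of features or neurons, the total should change accordingly.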

### Training the network

Train the network by executing the following. 

Clicking once to the left of the output toggles the display between a scrollable field and the full output. Double-clicking collapses the output so that it takes up less space.
In Google Colab, you can safely close the output by clicking the `x` in its top left corner. This removes the printout, but not the cell results.

In [None]:
history = model.fit(x_train, y_train, batch_size=10, epochs=100)


### Investigate the "history" object you created
The training stores its history and important parameters in the _history_ object that `model.fit()` returned.
* Try out the following commands and inspect the variables.
* Make use of tab completion, e.g. by typing `hidden_layer.` and pressing `<tab>`

In [None]:
loss_history = history.history['loss']
print(f'Loss history: {loss_history}')
weights = history.model.get_weights()
hidden_layer = history.model.get_layer("Hidden")
for w in weights:
    print(w.shape)

* We can also display the learning success as measured by the loss by plotting it.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

fig,ax = plt.subplots(figsize=(18, 4), dpi= 80, facecolor='w', edgecolor='k')
ax.plot(history.history['loss'])
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.grid()
plt.show()

### Interpreting the result
* What can you observe regarding the loss?
* Why is that possible?
* Change the number of training instances to 1000. Ensure that the classes are equally frequent again. What can you observe?
* Be reminded that you have to re-create the model to reset the weights. To do this, execute the cell with the model definition (important is the `model.compile()` call)


### A remark on optimization
* Optimizers like Adam, Adagrad, and Adadelta are variants of Stochastic Gradient Descent (SGD).
* SGD estimates the gradient of the loss with respect to the parameters based on a batch of examples.
    * The larger the batch, the better the estimated gradient approximates the gradient for the whole dataset.
* With 1000 instances, it takes about 300 epochs to converge.
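
The effect of the batch size on the gradient estimate can be illustrated with a small NumPy experiment. This is only a sketch on a hypothetical toy problem (a linear model with mean squared error, not the network above); the names `mse_gradient` and `mean_batch_error` are introduced for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: linear model with mean squared error loss.
n, d = 2000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)  # current parameter values

def mse_gradient(X_batch, y_batch, w):
    """Gradient of mean((X w - y)^2) with respect to w."""
    residual = X_batch @ w - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)

full_gradient = mse_gradient(X, y, w)  # gradient over the whole dataset

def mean_batch_error(batch_size, trials=200):
    """Average distance between a random batch gradient and the full gradient."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        batch_gradient = mse_gradient(X[idx], y[idx], w)
        errors.append(np.linalg.norm(batch_gradient - full_gradient))
    return float(np.mean(errors))

batch_errors = {b: mean_batch_error(b) for b in (10, 100, 1000)}
for b, err in batch_errors.items():
    print(f"batch size {b:4d}: mean gradient error {err:.3f}")
```

The printed error should shrink as the batch size grows, at the cost of more computation per update.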

## 2. Image Classification: _MNIST handwritten digits_

### Read the data

* We want to work on images: MNIST is a public dataset which contains images of handwritten digits. 
* You can import them from Keras with one line, because it is one of the standard datasets used for machine learning.

In [None]:
#@title Import MNIST data
#@markdown If you execute this cell, you will overwrite the data `x_train` and `y_train` above. In addition, it gives you test data in `x_test` and `y_test`.
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# reduce data by factor 10 / 20 for fast execution during course
x_train = x_train[::10]
y_train = y_train[::10]
x_test = x_test[::20]
y_test = y_test[::20]

# verify resulting array shapes
x_train.shape, y_train.shape, x_test.shape, y_test.shape

### Inspecting the data

Look at the shape of the `x_train` variable to understand how the data is organised.

In [None]:
# Inspect the shape of x_train
print(x_train.shape)

* You can see that the data has 6000 training examples, each of shape 28x28.
* These are images of size 28x28 pixels.


Look at the shape and values of `y_train` to understand the output.

In [None]:
print(f'Shape of y: {y_train.shape}, Minimum of y: {y_train.min()}, Maximum of y: {y_train.max()}')

* The corresponding output is just a long vector of labels in the range [0...9], referring to the displayed digit.

As we are dealing with *images* now, we want to display them.
* `matplotlib` is a Python package well suited for plotting data and displaying images.
* Change the index of `x_train[index]` in the cell below to have a look at different images from the dataset. 

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Look at an image
plt.imshow(x_train[700], cmap='gray')

Also, we could be interested in the distribution of labels in our data so we plot a histogram:

In [None]:
plt.hist(y_train)
plt.xlabel('Digits')
plt.ylabel('Frequency')
plt.show()
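
If you prefer exact numbers over a plot, the label frequencies can also be counted directly. A minimal sketch, using a short hypothetical label array as a stand-in for `y_train`:

```python
import numpy as np

labels = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])  # hypothetical stand-in for y_train

# counts[d] is the number of times digit d occurs; minlength keeps all 10 digits.
counts = np.bincount(labels, minlength=10)
for digit, count in enumerate(counts):
    print(f"digit {digit}: {count}")
```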

### Preparing the labels for a classification network
We want to convert the numeric labels to so-called *"one-hot vectors"*.
* One-hot means that the network does not directly output a number between 0 and 9 representing the digit.
* Rather, we want a vector with 10 entries, in which only one entry is 1, all others 0, e.g. `[0, 0, 1, 0, 0, 0, 0, 0, 0, 0]` to label a "2".
* *Rationale:* The digits represent different categorical classes, and we want to penalize all confused digits the same; it is not "better" or "closer" if the network outputs 4.2 given an image depicting a "6" than if the output is 1.
* In general, the one-hot encoding helps with classification problems and lets the neuron with maximal activation "win".

In [None]:
num_labels = 10
# Code to convert labels
y_train_one_hot = (np.arange(num_labels) == y_train[:,np.newaxis]).astype(np.float32)

In [None]:
# Keras offers a convenience function to achieve the same:
from tensorflow.keras.utils import to_categorical
y_train_one_hot = to_categorical(y_train, num_classes=num_labels)
# Same for the testing data
y_test_one_hot  = to_categorical(y_test, num_classes=num_labels)

Look at the shape of the one-hot-converted vector `y_train_one_hot` to confirm that each training example now has 10 entries. As an example, compare the digit representation of one label with its one-hot representation.

In [None]:
print(y_train_one_hot.shape)
print(f'Digit {y_train[10]} is converted to {y_train_one_hot[10,:]}')
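
The conversion can be checked with a round trip: encoding labels as one-hot vectors and taking the index of the maximum should give back the original labels. This sketch uses a few hypothetical labels and `np.eye` as one more way to build the encoding.

```python
import numpy as np

labels = np.array([2, 0, 9, 5])       # hypothetical digit labels
one_hot = np.eye(10)[labels]          # row i of the identity matrix is the one-hot vector of class i
recovered = one_hot.argmax(axis=-1)   # position of the single 1 in each row

print(one_hot[0])
print(recovered)
```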

### Image classification with a simple neural network
We now want to train the above network on this data. It should take the images in `x_train` as input and predict the correct digit as stored in `y_train_one_hot`. We have to adapt the model to accept inputs of shape 28 x 28 and to produce vector outputs. We have prepared this below:
* Modified the parameter `input_shape=(...)` to adapt to the new data
* Modified the number of dense units in the output layer to reflect the number of classes; 10 in the digits example
* Modified the loss function to deal with multiple classes

In [None]:
model = Sequential()
model.add(InputLayer(input_shape=(28,28)))
model.add(Flatten()) # Layer reshaping the 28x28 arrays into vectors of length 28*28=784
model.add(Dense(units=128)) # Try higher or lower numbers of hidden units!
# Try adding more layers!
model.add(Dense(units=128))
model.add(Dropout(0.5))

#model.add(Dense(units=128))
#model.add(Dropout(0.5))

model.add(Dense(units=10, activation='softmax', name='output')) # The number of units in the output layer equals the number of classes

model.compile(loss='categorical_crossentropy', optimizer='adadelta') # The categorical crossentropy loss can deal with multiple classes
model.summary()
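
To see what the softmax output and the categorical cross-entropy loss compute, here is a plain NumPy sketch. It illustrates the formulas only; the Keras implementation may differ in details such as clipping and averaging over the batch. The example values are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true_one_hot, y_prob, eps=1e-7):
    """Negative log-probability assigned to the true class, per example."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return -np.sum(y_true_one_hot * np.log(y_prob), axis=-1)

logits = np.array([[2.0, 1.0, 0.1],    # highest score on class 0
                   [0.1, 1.0, 2.0]])   # highest score on class 2
targets = np.array([[1.0, 0.0, 0.0],   # true class is 0 in both cases
                    [1.0, 0.0, 0.0]])

probs = softmax(logits)
losses = categorical_crossentropy(targets, probs)
print(probs)
print(losses)
```

The second example puts most probability on the wrong class, so its loss is larger than that of the first.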

In [None]:
# This experiment takes about 1 sec per epoch on an older MacBook Pro.
history = model.fit(x_train, y_train_one_hot, batch_size=500, epochs=100) # In this example, you'll no longer want batches of size 10...

### Evaluate the model on independent test data
* The following cell executes the model on the test data using `model.predict()`
* The result is a list of 10-vectors (recall the one-hot encoding), only this time the entries can take any value between 0 and 1.
* How can we compare these with the true labels in `y_test_one_hot`? There are many possible ways to evaluate classifiers; in general, you want to define some kind of error, usually based on differences.

In [None]:
pred = model.predict(x_test)
print(f'Shape of the test input: {x_test.shape}, shape of the predicted output: {pred.shape}')
print(f'Exemplary prediction: {pred[0]}')

The `argmax()` function may come in handy, which converts from the one-hot representation back to integer indices of the maximally activated classes:

In [None]:
pred_integer_indices = pred.argmax(axis = -1)
print(f'Exemplary prediction: {pred_integer_indices[0]}')

Now that we have the prediction of our network on the test dataset: How well does this prediction fit the real classes of the data? Let's compare the predictions with the true classes in `y_test`.

In [None]:
diff = y_test - pred_integer_indices
correctly_classified_examples = np.where(diff == 0)[0].shape[0]
num_examples = y_test.shape[0]
wrongly_classified_examples = num_examples - correctly_classified_examples
print(f'{correctly_classified_examples} ({correctly_classified_examples/num_examples * 100} %) of the examples are classified correctly while {wrongly_classified_examples} ({wrongly_classified_examples/num_examples * 100}%) are classified wrong.')
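
The fraction of correctly classified examples (the accuracy) can also be written more compactly as the mean of an element-wise comparison. A minimal sketch with hypothetical labels and predictions:

```python
import numpy as np

y_true = np.array([7, 2, 1, 0, 4, 1])   # hypothetical true labels
y_pred = np.array([7, 2, 1, 0, 9, 1])   # hypothetical predicted labels

# The comparison yields booleans; their mean is the fraction of matches.
accuracy = np.mean(y_true == y_pred)
print(f"Accuracy: {accuracy:.1%}")
```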

Are you satisfied with the performance of the classifier? If not, play around with the parameters and try to get a better result. You can, for example, do one of the following:
* Take a look at the loss curve using the code in the following cell:
    * Has the model finished its training? You can increase the number of epochs in the model training above.
    * Increase the batch size of the training.
    
* Increase the number of neurons (units) in the model. 
* Add more layers to your model.

In [None]:
fig,ax = plt.subplots(figsize=(18, 4), dpi= 80, facecolor='w', edgecolor='k')
ax.plot(history.history['loss'])
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.grid()
plt.show()

## 3. Image classification with a simple convolutional neural network (CNN)
We now try to solve the MNIST classification using a convolutional neural network instead of a fully-connected network. 

### Define the network
* Instead of using `Dense` layers as before, we now use `Conv2D` layers. These layers perform convolutions across their input (you can find a visualization of this [here](https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d)). One can also view this convolution operation as filtering with different kernels. The number of filters that are learned and the size of these filters are parameters of the `Conv2D` layer. Filtering with, for example, 32 filters adds a new dimension to the data, which is often called the "channel dimension" (with `padding='same'`, the 28x28 MNIST images have shape (28, 28, 32) after filtering).
* To learn patterns on different spatial resolutions, we downsample between the convolutions using the `MaxPool2D` layer, which only keeps the maximum value in each 2x2 neighborhood.
* To remove the 2D nature again and end up with an output vector of 10 entries, we use `Flatten()`, which converts the 2D image into a 1D vector by writing all values in one row.
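
The pooling step described above can be sketched in a few lines of NumPy for a single-channel image. This is an illustration of 2x2 max pooling with stride 2, not the Keras implementation; `max_pool_2x2` is a hypothetical helper and assumes even height and width.

```python
import numpy as np

def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = image.shape
    # Split the image into non-overlapping 2x2 blocks, then take each block's maximum.
    blocks = image.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

image = np.arange(16).reshape(4, 4)   # small hypothetical "image"
pooled = max_pool_2x2(image)
print(image)
print(pooled)
```

Each output pixel summarizes one 2x2 block, so height and width are halved.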

In [None]:
import tensorflow
tensorflow.keras.backend.image_data_format()
tensorflow.keras.backend.set_image_data_format('channels_last')

In [None]:
convnet = Sequential()
convnet.add(InputLayer(input_shape=(28,28,1)))
convnet.add(Conv2D(filters=32, kernel_size=(3,3), padding='same'))
convnet.add(Conv2D(filters=32, kernel_size=(3,3), padding='same'))
convnet.add(MaxPool2D())
convnet.add(Conv2D(filters=32, kernel_size=(3,3), padding='same'))
convnet.add(Conv2D(filters=32, kernel_size=(3,3), padding='same'))
convnet.add(MaxPool2D())
convnet.add(Flatten())
convnet.add(Dense(units=128))
convnet.add(Dropout(0.5))
convnet.add(Dense(units=10, activation='softmax'))
convnet.compile(loss='categorical_crossentropy', optimizer='adadelta')
print("convnet parameters: {0:,}".format(convnet.count_params()))
convnet.summary()
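
The shapes and parameter counts in the summary can be traced by hand. The sketch below assumes the architecture defined above (3x3 kernels, 32 filters, `padding='same'`, 2x2 pooling); the helper names are hypothetical and introduced only for this illustration.

```python
def conv2d_params(kernel, in_channels, filters):
    """One kernel x kernel x in_channels weight block plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(n_inputs, n_units):
    """One weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

size, channels = 28, 1   # MNIST input: 28x28 pixels, 1 channel
total = 0
for block in range(2):           # two blocks of Conv2D, Conv2D, MaxPool2D
    for _ in range(2):
        total += conv2d_params(3, channels, 32)
        channels = 32            # 'same' padding keeps the spatial size
    size //= 2                   # 2x2 max pooling halves height and width

flat = size * size * channels    # length of the vector produced by Flatten()
total += dense_params(flat, 128) + dense_params(128, 10)
print(f"Feature map before Flatten: {size}x{size}x{channels}, flattened: {flat}")
print(f"Total parameters: {total:,}")
```

Most of the parameters sit in the first `Dense` layer after `Flatten()`, not in the convolutions.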

### Train 
* Your input data now needs to have a "channel" dimension, because `Conv2D` layers work on multi-channel images. A grayscale image has a single channel.

In [None]:
convnet_history = convnet.fit(x_train[...,np.newaxis], y_train_one_hot, batch_size=500, epochs=100)
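
The indexing expression `[..., np.newaxis]` appends an axis of length 1 without copying or changing the pixel values. A small sketch with a hypothetical array standing in for `x_train`:

```python
import numpy as np

images = np.zeros((5, 28, 28))                       # hypothetical stand-in for x_train
with_channel = images[..., np.newaxis]               # append a channel axis of length 1
same_result = np.expand_dims(images, axis=-1)        # equivalent, more explicit spelling

print(images.shape, '->', with_channel.shape)
```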

### Compare the training performance

Take a look at the loss plot. You can also compare it directly with the loss from the fully-connected network above by plotting both into the same figure. What do you observe?

In [None]:
fig,ax = plt.subplots(figsize=(18, 4), dpi= 80, facecolor='w', edgecolor='k')
ax.plot(convnet_history.history['loss'], label='CNN')
# Remove the # to plot both loss curves together
#ax.plot(history.history['loss'], label='FCN')
ax.set_xlabel('Epoch')
ax.set_ylabel('Loss')
ax.grid()
ax.legend()
plt.show()

### Evaluate on the test data
Apply the model to the test data and check how many cases are classified correctly:

In [None]:
pred = convnet.predict(x_test[...,np.newaxis])
pred_integer_indices = pred.argmax(axis = -1)

In [None]:
diff = y_test - pred_integer_indices
correctly_classified_examples = np.where(diff == 0)[0].shape[0]
num_examples = y_test.shape[0]
wrongly_classified_examples = num_examples - correctly_classified_examples
print(f'{correctly_classified_examples} ({correctly_classified_examples/num_examples * 100} %) of the examples are classified correctly while {wrongly_classified_examples} ({wrongly_classified_examples/num_examples * 100}%) are classified wrong.')

To inspect the prediction performance in more detail, the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) is a useful evaluation tool. For every true class, it displays the number of examples that are classified as each of the given classes.

In [None]:
import sklearn.metrics
pred = convnet.predict(x_test[...,np.newaxis])
# confusion_matrix expects (y_true, y_pred): rows are true digits, columns are predicted digits
cm = sklearn.metrics.confusion_matrix(y_test, pred.argmax(axis=-1))
cm
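
To make the definition concrete, a confusion matrix can also be built by hand. The sketch below uses hypothetical labels for three classes and follows the convention that rows are true classes and columns are predicted classes; `confusion_matrix_np` is a name introduced only for this illustration.

```python
import numpy as np

def confusion_matrix_np(y_true, y_pred, num_classes):
    """cm[i, j] counts the examples of true class i that were predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # add 1 for every (true, predicted) pair
    return cm

y_true_small = np.array([0, 0, 1, 1, 2, 2, 2])   # hypothetical true labels
y_pred_small = np.array([0, 1, 1, 1, 2, 0, 2])   # hypothetical predictions
cm_small = confusion_matrix_np(y_true_small, y_pred_small, num_classes=3)
print(cm_small)
```

Correct predictions end up on the diagonal; every off-diagonal entry is a confusion between two classes.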

It's more intuitive to look at it as a heat map.

In [None]:
plt.matshow(cm)
plt.show()

Side note: Numpy and Matplotlib are two important, central libraries for numeric computing with Python. In addition, there are also more advanced libraries such as Seaborn, which build upon the things introduced above and offer dedicated functions for complex graphics, such as a combined version of the above matrix + heatmap.

In [None]:
import seaborn as sns
ax = sns.heatmap(cm, annot=True)

Looking at the confusion matrix:
* Which classes are easy for the classifier and which are hard?
* Which classes get mixed up a lot? Can you think about a reason for that? 
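
These questions can also be answered numerically. The following sketch uses a small hypothetical confusion matrix and assumes that rows are true classes and columns are predicted classes, as when calling `confusion_matrix(y_true, y_pred)`.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
cm_example = np.array([[8, 0, 2],
                       [1, 9, 0],
                       [3, 0, 7]])

# Fraction of each true class that was classified correctly.
per_class_recall = cm_example.diagonal() / cm_example.sum(axis=1)

# The largest off-diagonal entry marks the most frequent confusion.
off_diagonal = cm_example.copy()
np.fill_diagonal(off_diagonal, 0)
true_cls, pred_cls = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)

print(f"Per-class recall: {per_class_recall}")
print(f"Most frequent confusion: true class {true_cls} predicted as {pred_cls}")
```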

Are you satisfied with the results of the classifier? Try out different configurations:
* Change the number of epochs used during training
* Change the batch size.
* Add more layers to the model.
* Change the number of filters or the kernel_size of the model
* ...