<img src="images/thro.png" align="right"> 
# A2I2 - Artificial Neural Networks (ANN)

## Lecture

## Introduction

After working through homework notebooks 1 to 3, you are now familiar with artificial neurons, artificial neural networks, the feedforward algorithm, gradient descent and the backpropagation algorithm, and you know how alcohol can make training ANNs much faster (the drunk man stumbling down the hill of stochastic gradient descent).

### Any questions regarding this material?

## Content of this Lecture

**1) More on the cost function**

**2) Optimizations regarding the training algorithm**

**3) Training in batches and epochs**

**4) Popular neural network setups for different problems**

**5) MNIST - code example**

**6) Exercise**

## 1) More on the cost function

In the videos you learned about the ***cost function*** (which is often also called ***loss function*** or ***objective function***). Apart from the (mean) ***squared error*** that was introduced in the video, there is another very popular loss function, the ***cross-entropy*** (sometimes also called ***log loss***), which measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. For example, predicting a probability of 0.017 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.

Cross-entropy for a training example $h$ is defined as
\begin{equation*}
\epsilon(h)= - \sum_j y_j(h) \mathrm{ld}(d_j(h))
\end{equation*}

where $d_j(h)$ is the output of neuron $j$ in the output layer and $y_j(h)$ is the desired output, which usually is $0$ for all but the one neuron encoding the correct class (for which it is $1$). This definition is derived from the definition of entropy in information theory. The original definition uses the logarithm to base 2 ($\mathrm{ld}$), but any logarithm can be used; often the natural logarithm $\ln$ (base $e$) is chosen.
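
The definition above can be tried out on a small example. The following is a minimal sketch in plain Python, not library code; the function name `cross_entropy`, the three-class setup and the output values are assumptions chosen for illustration. Terms with $y_j(h) = 0$ are skipped, since they contribute nothing to the sum.

```python
import math

def cross_entropy(y, d, log=math.log2):
    """Cross-entropy of one training example (illustrative helper).

    y: desired outputs (one-hot list), d: network outputs (probabilities).
    Terms with y_j == 0 contribute nothing and are skipped.
    """
    return -sum(y_j * log(d_j) for y_j, d_j in zip(y, d) if y_j != 0)

# Three classes, the correct class is the second one.
y = [0, 1, 0]
confident = cross_entropy(y, [0.05, 0.90, 0.05])   # correct class gets 0.9
wrong = cross_entropy(y, [0.90, 0.017, 0.083])     # correct class gets 0.017
perfect = cross_entropy(y, [0.0, 1.0, 0.0])        # correct class gets 1.0

print(confident, wrong, perfect)

# The same example with the natural logarithm instead of base 2.
wrong_ln = cross_entropy(y, [0.90, 0.017, 0.083], log=math.log)
print(wrong_ln)
```

Switching the base of the logarithm only rescales the loss by a constant factor, so it does not change where the minimum lies.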

In the special case of binary classification, in which there is only one output neuron, the binary cross-entropy is used:

\begin{equation*}
\epsilon(h)= - (y(h) \mathrm{ld}(d(h)) + (1-y(h)) \mathrm{ld}(1-d(h)))
\end{equation*}
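
To see that the binary form is just the general cross-entropy written for a single output neuron, the two can be compared numerically. This is a small sketch under assumptions: the helper names, the clipping margin `eps` and the output value 0.8 are made up for illustration.

```python
import math

def binary_cross_entropy(y, d, eps=1e-12):
    """Binary cross-entropy for one output neuron (illustrative helper).

    y is the desired output (0 or 1), d the network output in (0, 1).
    eps is an assumed safety margin that keeps the logarithm away from 0.
    """
    d = min(max(d, eps), 1.0 - eps)
    return -(y * math.log2(d) + (1 - y) * math.log2(1 - d))

def cross_entropy(y, d):
    """General cross-entropy over several output neurons."""
    return -sum(y_j * math.log2(d_j) for y_j, d_j in zip(y, d) if y_j != 0)

# The same prediction, written once with one neuron and once with two
# neurons whose outputs are d and 1 - d.
d = 0.8
bce = binary_cross_entropy(1, d)
ce = cross_entropy([1, 0], [d, 1 - d])
print(bce, ce)

# If the desired output is 0 instead, the same prediction is penalized more.
bce_wrong_label = binary_cross_entropy(0, d)
print(bce_wrong_label)
```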


## 2) Optimizations regarding the training algorithm

Instead of the simple ***SGD optimizer*** (stochastic gradient descent) introduced in the videos, most modern networks use more advanced ***optimizer*** algorithms. Two ideas in particular have been shown to be very beneficial:

* **Adapting the learning rate during training**: SGD takes a step of a fixed size in the direction of the negative gradient. It has been observed that it is good to take large steps at the beginning of the training, when you are still far away from the local minimum. Once you get closer to the minimum, it is better to take smaller steps in order not to "overshoot" it. This can be achieved by adapting the learning rate dynamically during training.

* **Using momentum**: The idea of this optimization is to use not only the current gradient as the direction of descent, but also to factor in the gradients of the previous steps with exponentially decreasing strength. One could say that instead of simply stumbling down the hill, our drunk man tries to stumble a little bit straighter.
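
Both ideas can be illustrated on a toy problem. The sketch below minimizes the one-dimensional cost $f(x) = x^2$; the decay schedule `base_rate / (1 + decay * t)`, the function names and all numeric settings are assumptions picked for illustration and are not taken from any particular library.

```python
def gradient(x):
    # Gradient of the toy cost f(x) = x**2, whose minimum is at x = 0.
    return 2.0 * x

def descend(x, steps, base_rate, decay=0.0, momentum=0.0):
    """Gradient descent with optional learning-rate decay and momentum."""
    velocity = 0.0
    for t in range(steps):
        # With decay > 0 the step size shrinks as training progresses.
        rate = base_rate / (1.0 + decay * t)
        # With momentum > 0 earlier gradients keep contributing, each one
        # weakened by another factor of `momentum` per step.
        velocity = momentum * velocity - rate * gradient(x)
        x += velocity
    return x

# Idea 1: a large fixed step keeps overshooting, a decaying one settles.
plain = descend(5.0, steps=20, base_rate=0.9)
decayed = descend(5.0, steps=20, base_rate=0.9, decay=0.5)
print(abs(plain), abs(decayed))

# Idea 2: with a small step size, momentum speeds up the descent.
slow = descend(5.0, steps=50, base_rate=0.01)
with_momentum = descend(5.0, steps=50, base_rate=0.01, momentum=0.9)
print(abs(slow), abs(with_momentum))
```

In both comparisons the second run should end up closer to the minimum at $x = 0$ than the first.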

A number of optimizers have been proposed. Particularly popular is the **RMSprop optimizer**, which adapts the step size for each weight individually and can additionally be combined with momentum.
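
To give an impression of how RMSprop adapts the step size, here is a sketch of the update rule as it is commonly described, for a single parameter. It is not necessarily identical to the Keras implementation, and the function name as well as the default values for `rate`, `rho` and `eps` are assumptions.

```python
import math

def rmsprop_step(x, grad, avg_sq, rate=0.01, rho=0.9, eps=1e-7):
    """One RMSprop-style update for a single parameter x (sketch)."""
    # Keep a running average of the squared gradients.
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2
    # Divide the step by the root of that average, so that large
    # gradients do not automatically lead to large steps.
    x = x - rate * grad / (math.sqrt(avg_sq) + eps)
    return x, avg_sq

# First update from x = 5 with a small and with a very large gradient.
x_small, _ = rmsprop_step(5.0, grad=10.0, avg_sq=0.0)
x_large, _ = rmsprop_step(5.0, grad=1000.0, avg_sq=0.0)
print(5.0 - x_small, 5.0 - x_large)
```

Although the gradients differ by a factor of 100, both first steps have almost the same length, because the gradient is normalized by its own recent magnitude.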

## 3) Training in batches and epochs

The video already introduced the concept of using only a subset (called a **batch** or **mini-batch**) of the training examples to compute the gradient, leading to the SGD optimizer. In addition, the available data is usually divided into three sets: the **training set** used for the actual training (about 80% of the data), the **validation set** used for validation during training (about 10% of the data) and the **test set** used for evaluation after training (about 10% of the data). This results in the following process:

1) divide the *training set* into the batches (e.g. 128 training examples per batch)

2) train the network on one batch

3) once all batches have been used for training (which is called an **epoch**), compute loss and some quality metric (e.g. accuracy) based on the *validation set*

4) stop the training if either the maximal number of epochs has been reached or the quality metric has reached a given threshold, otherwise repeat from step 2

After the training, compute the quality metric (and maybe the loss) on the *test set* in order to get a more realistic estimate.
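
The process above can be sketched in plain Python. This is only an outline under assumptions: `train_on_batch` and `evaluate` are hypothetical callbacks standing in for the real network, and the function names, the 80/10/10 split helper and the threshold `target_metric` are made up for illustration.

```python
import random

def split_data(examples, seed=0):
    """Shuffle and split into 80% training, 10% validation, 10% test."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def batches(training_set, batch_size):
    """Step 1: yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(training_set), batch_size):
        yield training_set[start:start + batch_size]

def train(training_set, validation_set, train_on_batch, evaluate,
          batch_size=128, max_epochs=10, target_metric=0.99):
    """Steps 2 to 4; returns the number of epochs that were run."""
    for epoch in range(1, max_epochs + 1):
        for batch in batches(training_set, batch_size):
            train_on_batch(batch)              # step 2
        metric = evaluate(validation_set)      # step 3
        if metric >= target_metric:            # step 4
            break
    return epoch

# Dummy run: record the batch sizes, never reach the target metric.
train_set, val_set, test_set = split_data(list(range(1000)))
seen = []
epochs_used = train(train_set, val_set,
                    train_on_batch=lambda batch: seen.append(len(batch)),
                    evaluate=lambda val: 0.5,
                    max_epochs=3)
print(len(train_set), len(val_set), len(test_set), epochs_used, len(seen))
```

The test set is deliberately not touched inside `train`; it is only used once after training has finished.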

## 4) Popular neural network setups for different problems

Depending on the problem at hand, different choices for the activation function of the output layer and the loss function can be recommended:

**binary classification**: the output layer consists of only one neuron, which is supposed to be $0$ or $1$. Sigmoid as activation function and binary cross-entropy as loss function often work quite well.

**multiple disjoint classes**: the output layer has one neuron per class. Exactly one of these is supposed to be $1$, all others should be $0$. Softmax as activation function and cross-entropy as loss function often work quite well.

**multiple non-disjoint classes**: the output layer again has one neuron per class, but now multiple neurons can and should be $1$. Sigmoid as activation function and the sum of binary cross-entropy as loss function (each neuron can be considered a binary classification problem) often work well.

**regression**: instead of predicting a class label, the network has to compute a (numeric) function value. The output layer consists of one neuron. The identity function as activation function and the squared error as loss function often work well.

For the hidden layers, pretty much all networks today use ReLU or one of its variants (leakyReLU, Softplus) as activation function.
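
The activation functions mentioned in this section are easy to write down. The following plain-Python sketch uses the standard definitions; the input values are arbitrary and only serve to show the difference between sigmoid and softmax on the same numbers.

```python
import math

def sigmoid(z):
    """Squashes a single value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turns a list of values into probabilities that sum to 1."""
    m = max(zs)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def relu(z):
    """Passes positive values through and maps negative values to 0."""
    return max(0.0, z)

z = [2.0, 1.0, -1.0]
sig = [sigmoid(v) for v in z]
soft = softmax(z)
print(sig, sum(sig))     # each value is independent of the others
print(soft, sum(soft))   # values compete, the sum is 1
print([relu(v) for v in z])
```

This is why softmax fits disjoint classes (exactly one class is correct), while independent sigmoid outputs fit non-disjoint classes.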

## 5) MNIST - code example

We will be using TensorFlow/Keras, a very popular library for building neural networks. You can find the Keras documentation at https://keras.io/

In [None]:
# if you are using Azure Notebooks, please make sure to use a kernel that 
# includes tensorflow v2 or larger (e.g. the Python 3.6 kernel)
import tensorflow as tf
print(tf.__version__)

In [None]:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.optimizers import RMSprop, SGD
from cycler import cycler

import matplotlib.pyplot as plt

import numpy as np
from itertools import cycle
from sklearn.metrics import roc_curve, auc

print(keras.__version__)

In [None]:
# load the data and split into train/test
mnist = keras.datasets.mnist
num_classes = 10
print('Loading...')
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print('...finished')

In [None]:
# show a few digits...
fig = plt.figure()
for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(x_train[i], cmap='gray', interpolation='none')
    plt.title("Digit: {}".format(y_train[i]))
    plt.xticks([])
    plt.yticks([])
# adjust the spacing once, after all subplots have been created
plt.tight_layout()
plt.show()

In [None]:
x_train.shape

In [None]:
x_train[0]

In [None]:
y_train

In [None]:
# munge the data
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

# Reserve 10,000 samples for validation
x_val = x_train[-10000:]
y_val = y_train[-10000:]
x_train = x_train[:-10000]
y_train = y_train[:-10000]

print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_val = keras.utils.to_categorical(y_val, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
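
The conversion to "binary class matrices" is also known as one-hot encoding. The following plain-Python sketch shows the idea; the helper `one_hot` and the three example labels are assumptions for illustration and do not replace `keras.utils.to_categorical`.

```python
def one_hot(labels, num_classes):
    """Turn each class label into a row with a single 1.0 entry."""
    return [[1.0 if j == label else 0.0 for j in range(num_classes)]
            for label in labels]

# Three hypothetical digit labels.
encoded = one_hot([5, 0, 4], num_classes=10)
for row in encoded:
    print(row)
```

Each row then has the same shape as the output layer of the network, with one entry per class.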

In [None]:
def myeval(model, history):
    # evaluate the model quality
    epochs = len(history.history['loss'])
    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])
    
    # plot the metrics
    # Create cycler object for the graphs. Use any styling you please
    monochrome = (cycler('color', ['k']) * cycler('linestyle', ['-', '--', ':', '-.']))

    fig = plt.figure()
    ax = plt.axes()
    ax.set_prop_cycle(monochrome)
    plt.plot(history.history['accuracy'],linestyle='--')
    plt.plot(history.history['val_accuracy'],linestyle='-')
    startacc = min([min(history.history['accuracy']), min(history.history['val_accuracy'])])
    plt.ylim([startacc, 1])
    plt.xticks(range(0,epochs))
    ax.xaxis.set_major_locator(plt.MultipleLocator(10))
    plt.title('model accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Training', 'Validation'], loc='lower right')

    fig = plt.figure()
    ax = plt.axes()
    ax.set_prop_cycle(monochrome)
    plt.plot(history.history['loss'],linestyle='--')
    plt.plot(history.history['val_loss'],linestyle='-')
    plt.xticks(range(0,epochs))
    ax.xaxis.set_major_locator(plt.MultipleLocator(10))
    plt.title('model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Training', 'Validation'], loc='upper right')

    plt.show()

---
### A model very close to the one from the video

In [None]:
model = Sequential()
model.add(Dense(16, activation='sigmoid', input_shape=(784,)))
model.add(Dense(16, activation='sigmoid'))
model.add(Dense(num_classes, activation='sigmoid'))
model.summary()

model.compile(loss='mean_squared_error', optimizer=SGD(), metrics=['accuracy'])
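
The parameter counts listed by `model.summary()` can be checked by hand: a dense layer has one weight per input for every neuron, plus one bias per neuron. A small sketch, with the helper name `dense_params` chosen for illustration:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer sizes of the model above: 784 inputs, then 16, 16 and 10 neurons.
layer_sizes = [784, 16, 16, 10]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```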

In [None]:
epochs = 48
batch_size = 128
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_val, y_val))
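
With these settings it is easy to work out how many weight updates the training performs. The sketch below assumes the 50,000 training samples that remain after reserving the validation samples, and that the last, smaller batch of each epoch is also used.

```python
import math

train_samples = 50000   # 60,000 MNIST training images minus 10,000 for validation
batch_size = 128
epochs = 48

# The last batch of an epoch is smaller, hence the rounding up.
batches_per_epoch = math.ceil(train_samples / batch_size)
total_updates = batches_per_epoch * epochs
print(batches_per_epoch, total_updates)
```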

In [None]:
myeval(model, history)

---
### Improving the activation functions

In [None]:
model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(784,)))
model.add(Dense(16, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.summary()

model.compile(loss='mean_squared_error', optimizer=SGD(), metrics=['accuracy'])

In [None]:
epochs = 48
batch_size = 128
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_val, y_val))

In [None]:
myeval(model, history)

## 6) Exercises

#### Exercise 1: Switch optimizer from SGD to RMSprop and loss function from mean squared error to (categorical) cross-entropy. How does the quality of the network change? Discuss/interpret your results. 

In [None]:
## your code goes here

#### Exercise 2: Increase the number of neurons in the hidden layers to 256 and the epochs to 128. Again, how does the quality of the network change? Discuss/interpret your results.

In [None]:
## your code goes here
