
# CF969 - Big Data for Computational Finance
## Lab 9: Convolutional and Recurrent Neural Networks

In this lab we will see how to use two of the main network architectures in deep learning:
* Convolutional Networks,
* Recurrent Neural Networks.

When going through these notes, please experiment with the pieces of code, take your time with them, and look up the meaning of the various statements in the online TensorFlow documentation whenever you do not fully understand what is happening.

## Convolutional Networks

LeNet-5 is an example of a simple convolutional network for handwritten character recognition. Its architecture is summarised in the following picture.

![title](imagesCNN/lenet5.png)

There are 7 layers in LeNet-5: convolution layers (C1, C3, C5) alternate with pooling layers (S2, S4), followed by a fully connected layer (F6) and an output layer. The output layer is fully connected and uses a rather unusual activation function, with weights and biases that were set manually. Another peculiarity is that the kernels of the second convolutional layer (C3) do not always use all of the features computed in the previous layer (S2): you can see in the image above that the kernel count in C3 is 16 while the feature count in S2 is 6, which is not a divisor of 16.

This well-known convolutional network achieves an accuracy of over 99% on the MNIST dataset.

For more information on this network, see the original publication:
* Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. _"Gradient-Based Learning Applied to Document Recognition."_ Proceedings of the IEEE. 1998. http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

As a reminder, the convolution operation is summarised in the following picture. Also see the last set of slides of the module and Chapter 6 of Nielsen's book.

![title](imagesCNN/convolution.png)

A convolution operation takes an input matrix (typically representing an image) and computes, for all entries at a certain interval distance in the matrix, a weighted sum (plus possibly a bias) of the entries around it (including itself). The result is a smaller matrix containing all these weighted sums as entries. The interval distance is called the stride length. In the above picture the weights correspond to performing an embossing operation on the image. The same weights and bias are applied at every pixel. Usually, multiple such convolution operations are applied in parallel in a single convolutional layer, which produces multiple separate matrices. For example, in the first layer of LeNet-5 you can see that six convolution operations are applied.
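As a minimal sketch of the operation just described, the following NumPy function slides a single kernel over a matrix with a given stride and without padding. The function name `convolve2d` and the kernel values are illustrative assumptions (the kernel is emboss-like, but its values are not read from the picture). Like `tf.nn.conv2d`, the sketch does not flip the kernel.

```python
import numpy as np

def convolve2d(image, kernel, bias=0.0, stride=1):
    """Weighted sum of each kernel-sized window, visited every `stride` entries."""
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = image[r:r + k_h, c:c + k_w]
            # The same kernel and bias are reused at every position.
            out[i, j] = np.sum(window * kernel) + bias
    return out

# A 6x6 "image" whose entry at (i, j) is 6*i + j.
image = np.arange(36, dtype=float).reshape(6, 6)
# Emboss-like kernel with mostly zero weights (illustrative values).
kernel = np.array([[-1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])

result = convolve2d(image, kernel, stride=1)   # shape (4, 4)
strided = convolve2d(image, kernel, stride=2)  # shape (2, 2)
print(result.shape, strided.shape)
```

With stride 1 the 6 by 6 input shrinks to 4 by 4, and with stride 2 it shrinks to 2 by 2, which illustrates why the result is a smaller matrix.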

A pooling operation shrinks such a matrix further by partitioning it in blocks, and summarising each of the blocks in some way by a single value. A typical choice here is the max-function, where simply the max value of such a block is output. The result of a pooling operation is a significantly smaller matrix where each entry represents the pooling result of each separate block.

Using convolutions and pooling instead of simply applying a fully connected layer on our inputs has some advantages:
* There are relatively few weights to train, as the weights are shared across all pixels.
* These operations assume that the input is in the form of a matrix, which can be a natural choice of input depending on the data at hand. In the case of input in the form of an image, this certainly is true, and the convolution operation implicitly takes into account the spatial structure of the input.
* Sometimes certain sparseness restrictions are also imposed on the weights of a convolution layer. In the convolution operation depicted above, almost all weights are zero. Imposing this explicitly can further reduce the size of the parameter space, and will significantly speed up training.
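To make the first advantage concrete, here is a small sketch that counts trainable parameters for the first layer of LeNet-5 and for a hypothetical fully connected layer producing the same number of outputs. The sizes (a 32 by 32 input, 5 by 5 kernels, 28 by 28 output maps) are taken from the LeNet-5 paper rather than from the text above, and the helper names are assumptions made for this example.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """One kernel per (input channel, output channel) pair, plus one bias per output channel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

def dense_params(n_inputs, n_outputs):
    """One weight per (input, output) pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

# First LeNet-5 layer: six 5x5 kernels applied to a single-channel image.
conv_count = conv_params(kernel_size=5, in_channels=1, out_channels=6)

# Hypothetical fully connected layer mapping 32*32 inputs to 28*28*6 outputs.
dense_count = dense_params(32 * 32, 28 * 28 * 6)

print(conv_count, dense_count)
```

The convolutional layer needs 156 parameters, while the fully connected alternative would need several million.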

In TensorFlow, as you may have guessed, the convolution operation is built-in. We can apply it as follows, where we assume that we have 28 by 28 input pictures, as in the MNIST dataset. (Do not run this code: it will not work by itself. Just inspect it for now.)

In [None]:
x = tf.nn.conv2d(x, W, strides=[1, stride_length, stride_length,1], padding='SAME')

Note that here, x is a tensor of rank 4. _stride_length_ is the stride length associated to the convolution operation.
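The four axes of x are, in TensorFlow's default layout, batch, height, width, and channels, and W is also a rank 4 tensor with axes filter height, filter width, input channels, and output channels. The sketch below works out the shape of the result for both padding modes. The helper name `conv_output_shape` is an assumption made for this example, and the filter shape used is the one that appears later in this lab.

```python
import math

def conv_output_shape(input_shape, filter_shape, stride, padding):
    """Shape produced by a 2-D convolution on [batch, height, width, channels] input.

    filter_shape is [filter_height, filter_width, in_channels, out_channels].
    """
    batch, height, width, in_channels = input_shape
    f_h, f_w, f_in, out_channels = filter_shape
    if f_in != in_channels:
        raise ValueError("filter and input disagree on the number of channels")
    if padding == "SAME":
        # Zero padding is added so that only the stride shrinks the output.
        out_h = math.ceil(height / stride)
        out_w = math.ceil(width / stride)
    elif padding == "VALID":
        # No padding: the kernel must fit entirely inside the input.
        out_h = (height - f_h) // stride + 1
        out_w = (width - f_w) // stride + 1
    else:
        raise ValueError("padding must be 'SAME' or 'VALID'")
    return [batch, out_h, out_w, out_channels]

same_1 = conv_output_shape([128, 28, 28, 1], [5, 5, 1, 32], stride=1, padding="SAME")
same_2 = conv_output_shape([128, 28, 28, 1], [5, 5, 1, 32], stride=2, padding="SAME")
valid_1 = conv_output_shape([128, 28, 28, 1], [5, 5, 1, 32], stride=1, padding="VALID")
print(same_1, same_2, valid_1)
```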

What would happen if our input pictures were provided to us in full color, rather than just greyscale? In that case, each pixel would probably be represented on three channels: one for red, one for green, one for blue. This could be represented as three separate 28 by 28 matrices, or even better, as a rank 3 tensor with 3 entries along the third axis. The convolution operation would also become a 3-dimensional one: for a single pixel, a weighted sum is taken over all surrounding pixels, where each surrounding pixel has 3 values. For example, if we assume 28 by 28 input images with colored pixels encoded over three channels, and we apply five convolution operations in parallel, then the operation can be depicted as follows.

![title](imagesCNN/colorconvolution.png)

The above code would be rewritten into the following. Again, do not run this code, but do inspect and compare with the above variant of the code, so as to understand what is the role of the various values in the vectors that we specify.
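As a sketch of the color case, written in NumPy rather than TensorFlow so that it is self-contained and can be run: the call to the convolution stays the same, and what changes is the third axis of the filter tensor (3 input channels) and its fourth axis (5 parallel operations). The 5 by 5 kernel size and the function name `conv2d_same` are assumptions made for this example; the text only fixes the 28 by 28 input, the three channels, and the five operations.

```python
import numpy as np

def conv2d_same(x, W):
    """Stride-1 convolution with zero padding so that height and width are preserved.

    x: [batch, height, width, in_channels]
    W: [k, k, in_channels, out_channels] with odd k
    """
    k = W.shape[0]
    p = k // 2
    # Pad only the two spatial axes with zeros.
    xp = np.pad(x, ((0, 0), (p, p), (p, p), (0, 0)), mode="constant")
    batch, height, width, _ = x.shape
    out = np.zeros((batch, height, width, W.shape[3]))
    for i in range(height):
        for j in range(width):
            patch = xp[:, i:i + k, j:j + k, :]  # [batch, k, k, in_channels]
            # Weighted sum over the window AND over all input channels.
            out[:, i, j, :] = np.tensordot(patch, W, axes=([1, 2, 3], [0, 1, 2]))
    return out

x = np.ones((2, 28, 28, 3))   # two 28x28 images with three color channels
W = np.ones((5, 5, 3, 5))     # five 5x5 kernels, each spanning all three channels
y = conv2d_same(x, W)
print(y.shape)
```

Each of the five output matrices is computed from all three color channels at once, so one kernel here holds 5 * 5 * 3 weights.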

As I explained in the lecture, convolution operations are usually alternated with pooling operations to reduce the size of the data output by the convolutional layer. Pooling is also called _subsampling_ or _downsampling_. The following picture displays how 2 by 2 max-pooling works on a 4 by 4 matrix. The output has half the height and half the width of the input.

![title](imagesCNN/pooling.png)
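The same operation can be sketched in a few lines of NumPy. The function name `max_pool` and the example values are assumptions made for this illustration (the values are not those in the picture), and the sketch assumes that the block size divides the matrix size exactly.

```python
import numpy as np

def max_pool(x, k=2):
    """Partition a matrix into non-overlapping k-by-k blocks and keep each block's maximum."""
    h, w = x.shape
    # Axis 1 and axis 3 index the position inside each block.
    blocks = x.reshape(h // k, k, w // k, k)
    return blocks.max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool(x, k=2)
print(pooled)
```

The 4 by 4 input is reduced to the 2 by 2 matrix of block maxima, here 6, 8, 3, and 4.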

In TensorFlow, max-pooling is again built-in. Max-pooling of 2 by 2 can be called as follows. Again, this line of code is not intended to be run. We will combine all of the code later on into a piece of code that actually works.

In [None]:
tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

When we train convolutional networks, we see that often the final result will be a network in which the convolutional weights represent clear geometric shapes that the network is trying to detect. Below is a visualisation of the various convolution weights that result from training a famous neural network for ImageNet, coming from the following paper:

* A. Krizhevsky, I. Sutskever, G. E. Hinton. _"ImageNet Classification with Deep Convolutional Neural Networks."_ Advances in Neural Information Processing Systems. 2012. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

![title](imagesCNN/imagenetconvolution.png)

You can see here that some of the convolution operations are testing the input for patterns that resemble edges along various orientations. This is a great example of how modern deep learning derives its success from the ability to decompose a task into subtasks, and to identify the correct subtasks through a learning algorithm. In this case the subtasks consist of identifying edges with a variety of orientations in local regions of the picture.

This particular convolutional network by Krizhevsky et al. was trained on 1.2 million high-resolution images from the ImageNet LSVRC-2010 contest. The task for the network is to identify which of 1000 possible types of objects appears in the input picture, i.e., to classify the dataset into 1000 different classes. The accuracy is measured using the likelihood ranking that the network outputs: the top-1 error is the fraction of test images for which the correct class is not the one ranked most likely, and the top-5 error is the fraction for which it is not among the five classes ranked most likely. On the test data, they achieved top-1 and top-5 error rates of 37.5% and 17.0%.
Training from scratch took "five to six days on two NVIDIA GTX 580 3GB GPUs."
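As a small illustration of how top-1 and top-5 style error rates are computed from a network's output scores, here is a NumPy sketch on a toy problem with three classes. The function name `top_k_error` and the scores are made up for this example.

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    # Class indices sorted from highest to lowest score, truncated to the first k.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])

top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
print(top1, top2)
```

Only the first example is classified correctly at rank 1, while the second example is recovered once the two most likely classes are considered, so the top-2 error is lower than the top-1 error.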

**Exercise:** We now combine the TensorFlow functions introduced above into a single convolutional network for handwritten digit classification on the MNIST dataset. The code also uses a couple of other techniques that we have learned throughout our journey: a stochastic version of gradient descent, and dropout. Your task is to run the code and to work out the answers to the following questions:

* How many convolutional layers are there?
* How many pooling layers are used?
* How many fully connected layers are there?
* Are there any further layers?
* How many convolution operations per layer are applied?
* What type of activation functions are used?
* How many weights and biases are trained in each of the convolution layers?

Running the code probably gives you an accuracy of over 96%. Note that this is a substantial piece of code and it might take you quite a bit of time to understand it. Please tell me in case you would like to discuss some of this code or have something clarified.

In [None]:
import tensorflow as tf
from tensorflow.keras import Model, layers
import numpy as np

In [None]:
# Network Parameters
num_classes = 10 # MNIST total classes (0-9 digits)
dropout = 0.75 # Dropout, probability to keep units

# Training parameters
learning_rate = 0.001
training_iters = 20000
batch_size = 128
display_step = 10

In previous lab scripts, we trained on the entire input at every step. This is time-consuming, so it is advisable to train on chunks (batches) of the input instead. The next code snippet allows us to break the data into batches of a given size. Recall that, by doing so, we lose some of the nice theoretical learning-related properties. Also, we make sure to shuffle the data when forming a batch, so that a batch is not biased by the order in which the examples are stored.

In [None]:
def next_batch(num, data, labels):
    idx = np.arange(0, len(data))
    np.random.shuffle(idx)
    idx = idx[:num] # pull off first num items only
    data_shuffle = data[idx] # pull off only selected rows
    labels_shuffle = labels[idx] # pull off only selected rows
    return data_shuffle, labels_shuffle
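Note that _next_batch_ draws every batch independently of the previous ones, so over many steps some examples may be used several times and others not at all. A common alternative, sketched below under the assumption that we want each example to be used exactly once per pass over the data, is to shuffle once and then cut the shuffled order into consecutive batches. The generator name `epoch_batches` and the toy data are hypothetical and not part of the lab code.

```python
import numpy as np

def epoch_batches(data, labels, batch_size, rng):
    """Yield shuffled mini-batches that together cover the data exactly once."""
    idx = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        chosen = idx[start:start + batch_size]
        yield data[chosen], labels[chosen]

rng = np.random.default_rng(0)
data = np.arange(10).reshape(10, 1)   # ten toy examples
labels = np.arange(10)

batches = list(epoch_batches(data, labels, batch_size=4, rng=rng))
sizes = [len(batch_labels) for _, batch_labels in batches]
seen = sorted(int(v) for _, batch_labels in batches for v in batch_labels)
print(sizes, seen)
```

The last batch is smaller when the batch size does not divide the number of examples.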

Time to load and manipulate the data.

In [None]:
# Load dataset
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Convert to float32
train_images, test_images = np.array(train_images, np.float32), np.array(test_images, np.float32)

# Normalize images value from [0, 255] to [0, 1]
train_images, test_images = train_images / 255., test_images / 255.

We now define the key tools for the neural network. The main two operations, convolution and max-pooling, are defined as follows.

In [None]:
# Create some wrappers for simplicity
def conv2d(x, W, b, strides=1):
    # Conv2D wrapper, with bias and relu activation
    x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
    x = tf.nn.bias_add(x, b)
    return tf.nn.relu(x)

def maxpool2d(x, k=2):
    # MaxPool2D wrapper
    return tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')

We are now ready to create the CNN.

In [None]:
# Create model (CNN network)
def conv_net(x):
    # Reshape input picture
    # Input shape: [-1, 28, 28, 1]. A batch of 28x28x1 (grayscale) images.
    x = tf.reshape(x, shape=[-1, 28, 28, 1])

    # Convolution Layer. Output shape: [-1, 28, 28, 32].
    conv1 = conv2d(x, weights['wc1'], biases['bc1'])

    # Max Pooling (down-sampling). Output shape: [-1, 14, 14, 32].
    conv1 = maxpool2d(conv1, k=2)

    # Convolution Layer. Output shape: [-1, 14, 14, 64].
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])

    # Max Pooling (down-sampling). Output shape: [-1, 7, 7, 64].
    conv2 = maxpool2d(conv2, k=2)

    # Reshape conv2 output to fit fully connected layer input, Output shape: [-1, 7*7*64].
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])

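    # Fully connected layer, Output shape: [-1, 1024].
    fc1 = tf.nn.relu(tf.matmul(fc1, weights['wd1']) + biases['bd1'])
    # Apply Dropout. tf.nn.dropout expects the fraction of units to DROP,
    # while `dropout` above is the probability to KEEP, so we convert.
    fc1 = tf.nn.dropout(fc1, rate=1 - dropout)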

    # Output, class prediction
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    return tf.nn.softmax(out)
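A remark on dropout: in TensorFlow 2, the second argument of _tf.nn.dropout_ is the rate at which units are dropped, not the probability of keeping them, so a keep probability p corresponds to a rate of 1 - p. The surviving units are scaled up by 1 / (1 - rate) so that the expected value of each unit is unchanged. The NumPy sketch below illustrates this behaviour; the function and variable names are assumptions made for this example, and it is not TensorFlow's actual implementation.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero each entry with probability `rate` and scale the survivors by 1 / (1 - rate)."""
    keep_mask = rng.random(x.shape) >= rate
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
x = np.ones(100_000)
keep_prob = 0.75

y = dropout(x, rate=1.0 - keep_prob, rng=rng)
kept_fraction = np.mean(y != 0)
print(kept_fraction, y.mean())
```

Roughly three quarters of the entries survive, and the mean of the output stays close to the mean of the input.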

In [None]:
# Store layers weight & bias

# A random value generator to initialize weights.
random_normal = tf.initializers.RandomNormal()

weights = {
    # Conv Layer 1: 5x5 conv, 1 input, 32 filters (MNIST has 1 color channel only).
    'wc1': tf.Variable(random_normal([5, 5, 1, 32])),
    # Conv Layer 2: 5x5 conv, 32 inputs, 64 filters.
    'wc2': tf.Variable(random_normal([5, 5, 32, 64])),
    # FC Layer 1: 7*7*64 inputs, 1024 units.
    'wd1': tf.Variable(random_normal([7*7*64, 1024])),
    # FC Out Layer: 1024 inputs, 10 units (total number of classes)
    'out': tf.Variable(random_normal([1024, num_classes]))
}

biases = {
    'bc1': tf.Variable(tf.zeros([32])),
    'bc2': tf.Variable(tf.zeros([64])),
    'bd1': tf.Variable(tf.zeros([1024])),
    'out': tf.Variable(tf.zeros([num_classes]))
}


In [None]:
# Define loss and optimizer
# Cross-Entropy loss function
def cross_entropy(y_pred, y_true):
    # Encode label to a one hot vector.
    y_true = tf.one_hot(y_true, depth=num_classes)
    # Clip prediction values to avoid log(0) error.
    y_pred = tf.clip_by_value(y_pred, 1e-9, 1.)
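    # Compute cross-entropy. reduce_sum without an axis sums over the whole
    # batch and returns a scalar, so the outer reduce_mean leaves it unchanged.
    return tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred)))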

# Accuracy metric
def accuracy(y_pred, y_true):
    # Predicted class is the index of highest score in prediction vector (i.e. argmax).
    correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true, tf.int64))
    return tf.reduce_mean(tf.cast(correct_prediction, tf.float32), axis=-1)

# Stochastic Gradient Descent optimizer
optimizer = tf.optimizers.SGD(learning_rate)

In [None]:
# Optimization process.
def run_optimization(x, y):
    # Wrap computation inside a GradientTape for automatic differentiation.
    with tf.GradientTape() as g:
        pred = conv_net(x)
        loss = cross_entropy(pred, y)

    # Variables to update, i.e. trainable variables.
    trainable_variables = list(weights.values()) + list(biases.values())

    # Compute gradients.
    gradients = g.gradient(loss, trainable_variables)

    # Update W and b following gradients.
    optimizer.apply_gradients(zip(gradients, trainable_variables))

All done! Time to train.

In [None]:
# Run training
step = 1
while step * batch_size < training_iters:
    batch_x, batch_targets = next_batch(batch_size, train_images, train_labels)

    # Run the optimization to update W and b values.
    run_optimization(batch_x, batch_targets)
    step += 1
    if step % display_step == 0:
        pred = conv_net(batch_x)
        loss = cross_entropy(pred, batch_targets)
        acc = accuracy(pred, batch_targets)
        print("step: %i, loss: %f, accuracy: %f" % (step, loss, acc))
print("Training finished!")

In [None]:
# Test model on the test set.
pred = conv_net(test_images)
print("Test Accuracy: %f" % accuracy(pred, test_labels))

In [None]:
# Visualize predictions.
import matplotlib.pyplot as plt

In [None]:
# Predict 5 images from the test set.
n_images = 5
test_set = test_images[:n_images]
predictions = conv_net(test_set)

# Display image and model prediction.
for i in range(n_images):
    plt.imshow(np.reshape(test_images[i], [28, 28]), cmap='gray')
    plt.show()
    print("Model prediction: %i" % np.argmax(predictions.numpy()[i]))

**Exercise:** Modify and re-run the code so that there is one instead of two convolutional layers (and therefore one instead of two pooling layers), a higher stride length, a lower keep probability for dropout, different activation functions, and a higher number of features in the first convolutional layer.

For additional reading and practice with convolutional networks, please also take a look at https://cs231n.github.io/convolutional-networks/

## Simplifying the CNN
The code above is rather low-level. This has the benefit that it provides more power and control to the programmer. On the other hand, it might be easier to make a mistake or fail to implement it properly.

The following two code snippets offer simplifications. The first one replaces setting weights and biases, defining wrappers, and defining the model. The second one replaces the cross-entropy error function.

How can you combine the previous code with the snippets below? Please try the combination in a new notebook. Apart from adding the following two snippets, _trainable_variables_ in _run_optimization_ should now be set as _trainable_variables_ = _conv_net.trainable_variables_.


In [None]:
# Create TF Model
class ConvNet(Model):
    # Set layers
    def __init__(self):
        super(ConvNet, self).__init__()
        # Convolution Layer with 32 filters and a kernel size of 5
        self.conv1 = layers.Conv2D(32, kernel_size=5, activation=tf.nn.relu)
        # Max Pooling (down-sampling) with kernel size of 2 and strides of 2
        self.maxpool1 = layers.MaxPool2D(2, strides=2)

        # Convolution Layer with 64 filters and a kernel size of 3
        self.conv2 = layers.Conv2D(64, kernel_size=3, activation=tf.nn.relu)
        # Max Pooling (down-sampling) with kernel size of 2 and strides of 2
        self.maxpool2 = layers.MaxPool2D(2, strides=2)

        # Flatten the data to a 1-D vector for the fully connected layer
        self.flatten = layers.Flatten()

        # Fully connected layer
        self.fc1 = layers.Dense(1024)
        # Apply Dropout (if is_training is False, dropout is not applied)
        self.dropout = layers.Dropout(rate=0.5)

        # Output layer, class prediction
        self.out = layers.Dense(num_classes)

    # Set forward pass
    def call(self, x, is_training=False):
        x = tf.reshape(x, [-1, 28, 28, 1])
        x = self.conv1(x)
        x = self.maxpool1(x)
        x = self.conv2(x)
        x = self.maxpool2(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.dropout(x, training=is_training)
        x = self.out(x)
        if not is_training:
            # The TF cross-entropy function expects logits (no softmax),
            # so only apply softmax when not training
            x = tf.nn.softmax(x)
        return x

# Build neural network model
conv_net = ConvNet()

In [None]:

# Cross-Entropy Loss
# Note that this will apply 'softmax' to the logits
def cross_entropy_loss(x, y):
    # Convert labels to int 64 for tf cross-entropy function
    y = tf.cast(y, tf.int64)
    # Apply softmax to logits and compute cross-entropy
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=x)
    # Average loss across the batch
    return tf.reduce_mean(loss)
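To see what this loss computes, and why it is given logits rather than softmax outputs, here is a NumPy sketch of the same quantity. It works directly with log-probabilities, which avoids taking the logarithm of a probability that has underflowed to zero. The function name and the example values are assumptions made for this illustration; it is not TensorFlow's implementation.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between softmax(logits) and integer class labels."""
    # Subtracting the row maximum does not change the softmax but prevents overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for every example.
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])

loss = sparse_softmax_cross_entropy(logits, labels)
mean_loss = loss.mean()  # average across the batch, as in cross_entropy_loss

# Very large logits still give a finite loss.
extreme = sparse_softmax_cross_entropy(np.array([[1000.0, 0.0]]), np.array([0]))
print(loss, mean_loss, extreme)
```

For the second example all classes are equally likely, so its loss is log(3).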

## Recurrent Neural Networks

Now, I would like to cover an example of a recurrent neural network with LSTM units. RNNs involve a notion of time: the input is provided over a sequence of time steps. In the case of this network, for each training example the input is provided over a period of 28 time steps, where at each step the network receives a row of 28 pixels of the input.
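To illustrate what "providing the input over 28 time steps" means, the following NumPy sketch feeds one image row by row through a single LSTM layer using the standard LSTM update equations. The weight names `W`, `U`, and `b`, the random initialisation, and the ordering of the gates inside the weight matrices are assumptions made for this sketch and need not match the internal layout of the Keras layer used below.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x_seq, W, U, b):
    """Run an LSTM over x_seq of shape [timesteps, num_input]; return the final hidden state.

    W: [num_input, 4 * units], U: [units, 4 * units], b: [4 * units]
    """
    units = U.shape[0]
    h = np.zeros(units)  # hidden state
    c = np.zeros(units)  # cell state
    for x_t in x_seq:    # one image row per time step
        z = x_t @ W + h @ U + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
        g = np.tanh(g)                                # candidate cell update
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

rng = np.random.default_rng(0)
timesteps, num_input, units = 28, 28, 32
W = rng.normal(scale=0.1, size=(num_input, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

image = rng.random((timesteps, num_input))  # stand-in for one normalised MNIST image
h_final = lstm_forward(image, W, U, b)
n_params = W.size + U.size + b.size
print(h_final.shape, n_params)
```

Only the final hidden state, a vector with one entry per unit, is passed on to the output layer. The same weights are reused at every time step, so the number of parameters does not depend on the number of rows.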

Please inspect the code, and consult the TensorFlow reference to obtain information and insight on the various functions that are used here.

In [None]:
import tensorflow as tf
from tensorflow.keras import Model, layers
import numpy as np

In [None]:
# Network Parameters
num_classes = 10 # MNIST total classes (0-9 digits)
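dropout = 0.75 # Dropout, probability to keep units (not used by the LSTM model below)
num_input = 28 # number of inputs per time step (pixels in one image row)
timesteps = 28 # number of time steps (rows per image)
num_units = 32 # number of neurons for the LSTM layer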

# Training parameters
learning_rate = 0.001
training_iters = 20000
batch_size = 128
display_step = 10

In [None]:
def next_batch(num, data, labels):
    idx = np.arange(0, len(data))
    np.random.shuffle(idx)
    idx = idx[:num] # pull off first num items only
    data_shuffle = data[idx] # pull off only selected rows
    labels_shuffle = labels[idx] # pull off only selected rows
    return data_shuffle, labels_shuffle

In [None]:
# Load dataset
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Convert to float32
train_images, test_images = np.array(train_images, np.float32), np.array(test_images, np.float32)

# Normalize images value from [0, 255] to [0, 1]
train_images, test_images = train_images / 255., test_images / 255.

In [None]:
# Create LSTM Model
class LSTM(Model):
    # Set layers.
    def __init__(self):
        super(LSTM, self).__init__()
        # RNN (LSTM) hidden layer
        self.lstm_layer = layers.LSTM(units=num_units)
        self.out = layers.Dense(num_classes)

    # Set forward pass
    def call(self, x, is_training=False):
        # LSTM layer
        x = self.lstm_layer(x)
        # Output layer (num_classes)
        x = self.out(x)
        if not is_training:
            # The TF cross-entropy function expects logits (no softmax),
            # so only apply softmax when not training
            x = tf.nn.softmax(x)
        return x

# Build LSTM model
lstm_net = LSTM()

In [None]:
# Cross-Entropy Loss
# Note that this will apply 'softmax' to the logits
def cross_entropy_loss(x, y):
    # Convert labels to int 64 for tf cross-entropy function
    y = tf.cast(y, tf.int64)
    # Apply softmax to logits and compute cross-entropy
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=x)
    # Average loss across the batch
    return tf.reduce_mean(loss)

# Accuracy metric
def accuracy(y_pred, y_true):
    # Predicted class is the index of highest score in prediction vector (i.e. argmax)
    correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true, tf.int64))
    return tf.reduce_mean(tf.cast(correct_prediction, tf.float32), axis=-1)

# Adam optimizer
optimizer = tf.optimizers.Adam(learning_rate)

In [None]:
# Optimization process
def run_optimization(x, y):
    # Wrap computation inside a GradientTape for automatic differentiation
    with tf.GradientTape() as g:
        # Forward pass.
        pred = lstm_net(x, is_training=True)
        # Compute loss.
        loss = cross_entropy_loss(pred, y)

    # Variables to update, i.e. trainable variables
    trainable_variables = lstm_net.trainable_variables

    # Compute gradients
    gradients = g.gradient(loss, trainable_variables)

    # Update weights following gradients
    optimizer.apply_gradients(zip(gradients, trainable_variables))

In [None]:
# Run training
step = 1
while step * batch_size < training_iters:
    batch_x, batch_targets = next_batch(batch_size, train_images, train_labels)

    # Run the optimization to update W and b values.
    run_optimization(batch_x, batch_targets)
    step += 1
    if step % display_step == 0:
        pred = lstm_net(batch_x, is_training=True)
        loss = cross_entropy_loss(pred, batch_targets)
        acc = accuracy(pred, batch_targets)
        print("step: %i, loss: %f, accuracy: %f" % (step, loss, acc))
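print("Training finished!")

# Test model on the test set.
pred = lstm_net(test_images, is_training=False)
print("Test Accuracy: %f" % accuracy(pred, test_labels))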

Using an RNN to solve a static image classification problem, which does not naturally involve any notion of time, might not seem like the most natural choice. Nonetheless, please run the code and take a look at the resulting test accuracy.

For more practice with LSTMs, see the following three resources.
* Tutorial on word2vec, which can be used with RNNs to solve language tasks: https://www.tensorflow.org/tutorials/representation/word2vec
* Wikipedia page on LSTMs: https://en.wikipedia.org/wiki/Long_short-term_memory
* _The Unreasonable Effectiveness of Recurrent Neural Networks_ by Andrej Karpathy: http://karpathy.github.io/2015/05/21/rnn-effectiveness/