In [3]:
import tensorflow as tf
from tensorflow.keras import layers

# Helper libraries
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

print(tf.__version__)
print(tf.keras.__version__)


2.6.0
2.6.0


# Q1 

In the lecture we used a code snippet similar to the following:

In [4]:
max_features = 5000
maxlen = 400
batch_size = 32
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 2

imdb = keras.datasets.imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

model = keras.Sequential()
model.add(keras.layers.Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))
model.add(keras.layers.Dropout(0.2))

model.add(keras.layers.Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(keras.layers.GlobalMaxPooling1D())
model.add(keras.layers.Dense(hidden_dims))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Activation('relu'))
model.add(keras.layers.Dense(1))
model.add(keras.layers.Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs, 
          validation_split=0.1)
model.summary()

2022-03-31 16:07:22.363611: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-31 16:07:22.370913: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2022-03-31 16:07:23.482863: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


Epoch 1/2
Epoch 2/2
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 400, 50)           250000    
_________________________________________________________________
dropout (Dropout)            (None, 400, 50)           0         
_________________________________________________________________
conv1d (Conv1D)              (None, 398, 250)          37750     
_________________________________________________________________
global_max_pooling1d (Global (None, 250)               0         
_________________________________________________________________
dense (Dense)                (None, 250)               62750     
_________________________________________________________________
dropout_1 (Dropout)          (None, 250)               0         
_________________________________________________________________
activation (Activation)      (None, 

a) Explain this code snippet in detail. Do not use more than 800 words. (8 points)

```max_features = 5000``` sets the vocabulary size: only the 5000 most frequent words are considered.
```maxlen = 400``` the maximum length of a review (in words)
```batch_size = 32``` the batch size
```embedding_dims = 50``` the dimension of the embedded vectors
```filters = 250``` the number of filters applied in the convolutional layer
```kernel_size = 3``` the kernel size in the convolutional layer
```hidden_dims = 250``` the number of nodes in the hidden dense layer
```epochs = 2``` the number of epochs

`imdb = keras.datasets.imdb` 
`(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)` 
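reads in the IMDB dataset and splits it into the predefined training and test data-sets, keeping only the 5000 most frequent words in the vocabulary. Each review is a list of integer word indices and each label is 0 (negative) or 1 (positive). Less frequent words are replaced by a special out-of-vocabulary index.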

`x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)`
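`pad_sequences` brings all reviews to the same length maxlen. Reviews that are too short are padded with 0's, and reviews that are longer than maxlen are cut to length maxlen. By default both the padding and the truncation happen at the beginning of the sequence (`padding='pre'`, `truncating='pre'`). Unknown (out-of-vocabulary) words are not handled here; they were already replaced by `load_data`.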

`x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)`
The same padding and truncation is applied to the test data.
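
The padding behaviour described above can be imitated in plain NumPy. This is only a minimal sketch and not the Keras implementation: the helper name `pad_pre` is hypothetical, and it assumes the default settings `padding='pre'` and `truncating='pre'`.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Hypothetical helper imitating pad_sequences with the assumed
    defaults padding='pre' and truncating='pre'."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                      # an empty sequence stays all padding
        trunc = seq[-maxlen:]             # keep only the last maxlen tokens
        out[i, -len(trunc):] = trunc      # right-align, padding values in front
    return out

# One sequence that is too short and one that is too long
padded = pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

The short sequence receives zeros in front, and the long one loses its first two entries.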

`model = keras.Sequential()` organises the model such that the layers are sequentially stacked. 

`keras.layers.Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen)`
The embedding layer maps each integer word index (between 0 and max_features - 1) to a trainable dense vector of length 'embedding_dims'. Conceptually this is equivalent to one-hot encoding each index as a vector of length max_features (a 1 at the position of the word in the vocabulary, 0 at every other position) and multiplying that vector by a max_features x embedding_dims weight matrix. In practice the layer simply looks up the corresponding row of the weight matrix.
The embedding layer can only be used on integer-valued inputs of fixed range. `input_length=maxlen` states that every input sequence has 400 entries, so the output for one review has shape 400 x 50.
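
The equivalence between "one-hot vector times weight matrix" and a row lookup can be checked numerically. The sketch below uses a random matrix `W` as a stand-in for trained embedding weights and three made-up word indices; none of these values come from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
max_features, embedding_dims = 5000, 50              # values from the snippet
W = rng.normal(size=(max_features, embedding_dims))  # stand-in for trained weights

tokens = np.array([12, 7, 4999])                     # hypothetical word indices

# Route 1: explicit one-hot vectors multiplied by the weight matrix
one_hot = np.zeros((len(tokens), max_features))
one_hot[np.arange(len(tokens)), tokens] = 1.0
via_matmul = one_hot @ W

# Route 2: direct row lookup, which avoids building the large one-hot vectors
via_lookup = W[tokens]

print(via_matmul.shape)
print(np.allclose(via_matmul, via_lookup))
```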
 
 
 `model.add(keras.layers.Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))` 
A 1-dimensional convolution slides along the sequence axis of the previous layer's output. Each of the 'filters' = 250 kernels has shape 3 x 50 ('kernel_size' x number of input channels, i.e. the embedding dimension), so it looks at 3 consecutive word vectors at a time and produces one number per position from its own weights and bias.
`strides=1` indicates that the kernel moves one position at a time, rather than skipping positions.
`padding='valid'` implies that no extra 0's are added to the ends of the input, so the output length is 400 - 3 + 1 = 398 (hence the output dimensionality is 398x250 instead of 400x250).
`activation='relu'` applies the ReLU function to the output of each filter.
                 
`model.add(keras.layers.GlobalMaxPooling1D())` 
Takes the maximum over all 398 positions for each of the 250 filters, and outputs a vector of these 250 values.
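
The convolution and pooling steps can be sketched in NumPy to confirm the shapes and the parameter count. This is an illustrative loop-based version with random stand-in weights, not how Keras computes the layer; the function name `conv1d_valid` is made up for this example.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """x: (steps, channels); kernels: (kernel_size, channels, filters);
    bias: (filters,). 'valid' padding, stride 1, ReLU activation."""
    kernel_size, _, n_filters = kernels.shape
    n_out = x.shape[0] - kernel_size + 1
    out = np.empty((n_out, n_filters))
    for t in range(n_out):
        window = x[t:t + kernel_size]                 # (kernel_size, channels)
        # Sum over window positions and channels for every filter at once
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(400, 50))            # stand-in for one embedded review
kernels = rng.normal(size=(3, 50, 250))   # stand-in for the 250 filters
bias = np.zeros(250)

features = conv1d_valid(x, kernels, bias)
pooled = features.max(axis=0)             # global max pooling over positions

print(features.shape, pooled.shape)
print(kernels.size + bias.size)           # trainable parameters of the conv layer
```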

`model.add(keras.layers.Dense(hidden_dims))` defines a fully-connected layer consisting of 250 nodes. Each of the outputs from the previous layer is connected to all of the nodes in the dense layer by a weight, and each node adds a bias to its weighted sum.

`model.add(keras.layers.Dropout(0.2))`
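During training, the dropout layer sets each of its inputs to 0 at random with the probability given in the function, so in this case on average 0.2 of the 250 outputs of the dense layer are dropped in each training step. The remaining values are scaled up by 1/(1 - 0.2) so that the expected sum is unchanged. At inference time the layer does nothing. The `Dropout(0.2)` layer directly after the embedding works in the same way on the embedded vectors.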
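
A minimal NumPy sketch of this "inverted dropout" behaviour is given below. The function is written for illustration only (it is not the Keras layer), and the input of ones is chosen so that the effect on the mean is easy to see.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Illustrative inverted dropout: zero each entry with probability
    `rate` and rescale the survivors by 1 / (1 - rate)."""
    if not training:
        return x                          # no effect at inference time
    keep = rng.random(x.shape) >= rate    # True for the entries that survive
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones(100_000)
y = dropout(x, 0.2, rng)

dropped_fraction = np.mean(y == 0)
print(dropped_fraction)   # close to the dropout rate
print(y.mean())           # close to the mean of the input
```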

`model.add(keras.layers.Activation('relu'))` is an activation layer applied to the dense layer's nodes using the ReLU activation function.  The ReLU activation function returns 0 for input values x <= 0, and x for x>0.

    $$ f(x) = \begin{cases} 0 & x\leq 0 \\ x  & x>0 \end{cases} $$

`keras.layers.Dense(1)` defines a fully-connected layer of one node, where each neuron input will be connected to every output of the previous layer, with weights and biases applied. 

`model.add(keras.layers.Activation('sigmoid'))` is an activation layer applied to the output of the final dense layer using the sigmoid function as the activation function. The sigmoid function returns values between 0 and 1, so the single output can be interpreted as the probability that the review is positive. The sigmoid function is the usual final activation function in binary classification problems.

`model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])`
Compiles the model with the binary cross-entropy function as the loss function, and the *Adam* optimiser. The accuracy is also reported so we know to what extent the model is predicting correctly.
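
The loss and the metric named in `compile` can be written out in NumPy for a handful of made-up predictions. This is a sketch of the underlying formulas only; the clipping constant `eps` and the 0.5 decision threshold are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    # eps is an assumed clipping constant that keeps log() finite
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def binary_accuracy(y_true, p, threshold=0.5):
    # A prediction counts as class 1 when its probability exceeds the threshold
    return np.mean((p > threshold) == (y_true == 1))

logits = np.array([2.0, -1.0, 0.5, -3.0])   # hypothetical outputs of Dense(1)
y_true = np.array([1, 0, 0, 0])             # hypothetical labels

p = sigmoid(logits)
loss = binary_crossentropy(y_true, p)
acc = binary_accuracy(y_true, p)
print(loss, acc)
```

The third sample is misclassified (its probability is above 0.5 while its label is 0), so three of the four predictions are correct.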
              
`model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs, 
          validation_split=0.1)`
Trains the model on the training data, using a batch size (the number of reviews considered in one training step) of batch_size, and the number of epochs (the number of times the entire training data is passed through the model) = epochs. The validation split = 0.1 means that 0.1 of the training data (Keras takes the last 10% of the samples, before any shuffling) is not used to train the model, but rather to validate how well the model is performing on unseen reviews.
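
The effect of these settings on the amount of data per epoch can be worked out with a few lines of arithmetic. The sketch assumes the 25,000 training reviews that `imdb.load_data` returns; the variable names are chosen for this example.

```python
import math

n_train_total = 25000    # assumed size of the IMDB training set
validation_split = 0.1
batch_size = 32

n_fit = int(n_train_total * (1 - validation_split))  # samples used for training
n_val = n_train_total - n_fit                        # samples held out for validation
steps_per_epoch = math.ceil(n_fit / batch_size)      # the last batch may be smaller

print(n_val, n_fit, steps_per_epoch)   # 2500 22500 704
```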

b) The line `model.summary()` displays the number of trainable parameters in each layer. Precisely explain those numbers for each layer. Do not use more than 300 words. (3 points)

In [5]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 400, 50)           250000    
_________________________________________________________________
dropout (Dropout)            (None, 400, 50)           0         
_________________________________________________________________
conv1d (Conv1D)              (None, 398, 250)          37750     
_________________________________________________________________
global_max_pooling1d (Global (None, 250)               0         
_________________________________________________________________
dense (Dense)                (None, 250)               62750     
_________________________________________________________________
dropout_1 (Dropout)          (None, 250)               0         
_________________________________________________________________
activation (Activation)      (None, 250)               0

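    Embedding: has 5000 (vocabulary size) x 50 (embedding dimension) = 250,000 weights to train; it has no biases.
    Dropout, max_pooling and activation layers have no parameters to train.
    conv1D has 3 (kernel) x 50 (input channels) x 250 (filters) = 37,500 weights + 250 biases = 37,750 parameters to train.
    dense(250) has 250x250 = 62,500 weights + 250 biases = 62,750 parameters to train.
    dense(1) has 1x250 weights + 1 bias = 251 parameters to train.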

In [2]:
3*50*250 +250

37750
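
The same arithmetic can be done for every layer at once, using the hyperparameters from the snippet. The dictionary keys below are just labels chosen for this sketch.

```python
max_features, embedding_dims = 5000, 50
filters, kernel_size, hidden_dims = 250, 3, 250

params = {
    "embedding": max_features * embedding_dims,                 # lookup table, no bias
    "conv1d": kernel_size * embedding_dims * filters + filters, # weights + biases
    "dense": filters * hidden_dims + hidden_dims,               # weights + biases
    "dense_1": hidden_dims * 1 + 1,                             # weights + bias
}

for name, count in params.items():
    print(f"{name:10s} {count:>8,d}")

total = sum(params.values())
print("total", total)
```

The total agrees with the `Total params: 350,751` line that `model.summary()` prints for this architecture.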

c) Rewrite the above snippet using the Keras functional API. (3 points)

In [10]:
# YOUR CODE HERE
inputs = keras.Input(shape=(maxlen,), name="img")
x = keras.layers.Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen)(inputs)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation="relu",
                 strides=1)(x)
block_1_output = keras.layers.GlobalMaxPooling1D()(x)

# ReLU is applied inside the Dense layer; dropout after ReLU gives the same
# result as dropout before ReLU, because the dropout mask is non-negative
x = keras.layers.Dense(hidden_dims, activation = "relu")(block_1_output)
x = keras.layers.Dropout(0.2)(x)

outputs = keras.layers.Dense(1, activation ="sigmoid")(x)

model = keras.Model(inputs, outputs, name="api_imdb")
#keras.utils.plot_model(model, show_shapes=True)

In [11]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs, 
          validation_split=0.1)
model.summary()

Epoch 1/2
Epoch 2/2
Model: "api_imdb"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
img (InputLayer)             [(None, 400)]             0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 400, 50)           250000    
_________________________________________________________________
dropout_4 (Dropout)          (None, 400, 50)           0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 398, 250)          37750     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 250)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 250)               62750     
_________________________________________________________________
dropout_5 (Dropout)          (None, 25

What is the functional API? Recall that we used a different API from `Sequential` to build models in which two layers co-exist side by side, rather than one layer strictly following the other. The functional API is covered in lecture L07. The course is designed so that following the links a little bit deeper should give you a better understanding.

# Q2 Overfitting

a) Explain the problem of *overfitting* in the context of Machine Learning. Do not use more than 400 words. (4 points)

Over-fitting occurs when a model fits the patterns (including the noise) in the training data-set so closely that it predicts the training data with a noticeably higher accuracy than the validation and test data-sets. The model has memorised the training examples rather than learned patterns that generalise.

For example, suppose a student practised solving one type of question over and over again in preparation for an exam, and became excellent at doing that question. The exam then contained a different question, and the student performed only moderately, because the student couldn't adapt to the differences in the question.

Similarly, an overfitted model performs extremely well on the data on which it is trained, but its accuracy is lower and its loss is higher on unseen data. During training this typically shows up as the training loss continuing to fall while the validation loss stops improving or starts to rise.
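
A small curve-fitting experiment makes the gap between training and test error concrete. This is a toy sketch with synthetic data (a noisy sine curve and polynomial fits), not one of the course examples; the data, the noise level and the polynomial degrees are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic data: a sine curve plus Gaussian noise
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_tr, y_tr = make_data(15)     # small training set
x_te, y_te = make_data(200)    # larger unseen test set

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)
    results[degree] = (mse(coeffs, x_tr, y_tr), mse(coeffs, x_te, y_te))
    print(degree, results[degree])
```

The training error can only go down as the degree grows, because a higher-degree polynomial can always reproduce a lower-degree fit. The test error typically does not follow it, and a large gap between the two numbers is the signature of over-fitting.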

b) Explain three different methods, which are used to address the problem of overfitting. Do not use more than 600 words. (6 points)

### Dropout layers
Drop-out layers set the outputs of a specified fraction of the nodes in a layer to zero at random during training. For example, 20% of the nodes could be dropped from a layer in each training step (their outputs are not passed on, and they take no part in that weight update). At inference time all nodes are used. Drop-out reduces over-fitting because the model cannot become overly reliant on any individual node, and is therefore less able to memorise the training data-set.
### Data Augmentation
Training a model on too little data can cause overfitting. Data augmentation supplements data-sets which aren’t large enough to train a model effectively by generating modified copies of the available data, in a way that's appropriate to the dataset. For example, data augmentation of a data-set of images can involve rotating the available images within a fixed range (e.g. up to 5 degrees), as well as flipping the images about the vertical and horizontal axes. The transformations must not change the label of the example.
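
The flipping operations can be illustrated on a tiny array that stands in for an image. This sketch only shows the geometric transformations in NumPy; a real pipeline would apply them to actual image tensors.

```python
import numpy as np

image = np.arange(12).reshape(3, 4)    # toy 3x4 "image"

augmented = [
    image,
    np.fliplr(image),   # mirror left-right (flip about the vertical axis)
    np.flipud(image),   # mirror top-bottom (flip about the horizontal axis)
]

print(len(augmented))       # one original plus two new training examples
print(augmented[1][0])      # first row of the left-right flip
```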
### Early stopping
Early stopping halts training when the validation accuracy starts decreasing (or when the validation loss starts increasing). After a certain number of iterations (epochs) the performance of the model on the validation data stops improving and then begins to get worse, even though the performance on the training data keeps improving. Early stopping can be applied either automatically (for example with a callback that monitors the validation loss), or by hand by examining the plots of the validation accuracy and loss over a large number of epochs, and choosing by eye the epoch at which to stop.
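
The stopping rule itself is simple enough to write in plain Python. The sketch below uses a made-up list of validation losses and a hypothetical `patience` parameter (the number of epochs without improvement that are tolerated); it illustrates the logic rather than any particular library's implementation.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 0-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch     # new best, keep going
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch                # no improvement for too long
    return len(val_losses) - 1, best_epoch          # never triggered

# Hypothetical validation losses: improving at first, then getting worse
val_losses = [0.60, 0.45, 0.38, 0.36, 0.37, 0.40, 0.44, 0.50]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)   # 5 3
```

Training would stop at epoch 5, and the weights from epoch 3 (the lowest validation loss) are the ones worth keeping.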

c) Provide a working code example, which illustrates the effectiveness of one of the methods discussed in the previous subquestion. (4 points)

In [12]:
# YOUR CODE HERE
inputs = keras.Input(shape=(maxlen,), name="img")
x = keras.layers.Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen)(inputs)
# Dropout is deliberately disabled here for comparison with the model from Q1 c)
#x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Conv1D(filters, kernel_size, activation="relu")(x)
block_1_output = keras.layers.GlobalMaxPooling1D()(x)

x = keras.layers.Dense(hidden_dims, activation = "relu")(block_1_output)
#x = keras.layers.Dropout(0.2)(x)

outputs = keras.layers.Dense(1, activation ="sigmoid")(x)

model2 = keras.Model(inputs, outputs, name="api_imdb")

model2.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model2.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs, 
          validation_split=0.1)
model2.summary()

Epoch 1/2
Epoch 2/2
Model: "api_imdb"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
img (InputLayer)             [(None, 400)]             0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 400, 50)           250000    
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 398, 250)          37750     
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 250)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 250)               62750     
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 251       
Total params: 350,751
Trainable params: 350,751
Non-trainable params: 0
________________________________

You can insert snippets from throughout the course, or of a similar standard, which demonstrate that one of the methods mentioned in b) is effective. It should be a simple example; you shouldn't be waiting 10 mins for the model to train.

# Q3 

a) Explain the following related concepts and the connections between them:
  - One-hot encoding
  - softmax layer
  - sparse categorical crossentropy

Do not use more than 800 words. 
(8 points)

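### One-hot encoding
An integer class label k (one of C classes) is converted to a vector of length C, where the vector consists of one 1 (at position k) and the rest are 0s. This avoids suggesting an ordering between the classes, and the vector can be read as a probability distribution which puts all of its mass on the true class.

### Softmax layer
At the end of a classification network we want a layer whose output can be interpreted as a probability distribution over the C classes. The softmax function exponentiates each of the C outputs and divides by the sum of the exponentials, so all entries lie between 0 and 1 and sum to 1:

$$ \mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}} $$

### Sparse categorical crossentropy
If we use a softmax layer to categorise the data at the end, we need to tell the system how wrong it is when it comes up with the wrong probability distribution. The categorical crossentropy compares the predicted distribution with the one-hot encoded label and punishes the system for putting little probability on the true class: for one sample the loss is minus the logarithm of the predicted probability of the true class. The sparse version computes exactly the same number, but takes the labels as integers instead of one-hot vectors, so the one-hot vectors never have to be stored.

The connection: the one-hot vector is the target distribution, the softmax output is the predicted distribution, and the (sparse) categorical crossentropy measures the distance between the two.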
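
The relationship between the three concepts can be verified numerically on a tiny batch. The logits and labels below are made up for this sketch.

```python
import numpy as np

labels = np.array([2, 0])             # hypothetical integer labels, 3 classes
n_classes = 3
one_hot = np.eye(n_classes)[labels]   # one-hot encoding of the labels

logits = np.array([[1.0, 2.0, 3.0],
                   [1.0, 1.0, 1.0]])  # hypothetical network outputs

# Softmax over the class axis (shifted by the row maximum for stability)
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Categorical crossentropy with one-hot labels
categorical = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

# Sparse categorical crossentropy with integer labels
sparse = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

print(one_hot)
print(np.isclose(categorical, sparse))
```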

b) Using only elementary Python or the Numpy library, define a function `softmax(myarray)` which calculates the softmax function for a batch of samples. Here `myarray` is a rank 2 numpy array, where `myarray.shape[0]` is the number of samples and `myarray.shape[1]` is the number of categories. The function is supposed to return an array `mysoftmax`, which has the same shape as `myarray`. (2 points)

In [23]:
def softmax(myarray):
    mysoftmax = np.zeros(np.shape(myarray))
    for i in range(myarray.shape[0]):
        total = np.sum(np.exp(myarray[i,:]))
        for j in range(myarray.shape[1]):
            mysoftmax[i,j] = np.exp(myarray[i,j])/total
    
    return mysoftmax
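
For large inputs `np.exp` overflows, so the quotient above can turn into `nan`. A common remedy is to subtract the row maximum before exponentiating, which leaves the result unchanged mathematically. The vectorised variant below is an optional sketch; the name `softmax_stable` is chosen for this example.

```python
import numpy as np

def softmax_stable(myarray):
    # Subtracting the row-wise maximum does not change the softmax value
    # but keeps every exponent <= 0, so np.exp cannot overflow
    shifted = myarray - np.max(myarray, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

big = np.array([[1000.0, 1001.0, 1002.0],
                [1.0, 2.0, 3.0]])
result = softmax_stable(big)

print(result.sum(axis=1))
print(np.allclose(result[0], result[1]))
```

Both rows differ only by a constant shift, so they give the same probabilities.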

c) Using only elementary Python or the Numpy library, define a function `sparse_categorical_crossentropy(mysoftmax, mylabels)` which calculates the sparse categorical crossentropy error function for a batch of samples. Here `mysoftmax` is an array of the same shape as `myarray` in the previous subquestion. `mylabels` is a numpy array of `mysoftmax.shape[0]` integers between 0 and `mysoftmax.shape[1]-1`. The function is supposed to return a single number. (2 points)

In [43]:
def sparse_categorical_crossentropy(mysoftmax, mylabels):
    loss = np.zeros(len(mylabels))
    for j in range(len(mylabels)):
        # Per-sample loss: -log of the probability assigned to the true class
        loss[j] = -np.log(mysoftmax[j, mylabels[j]])
    # Average over the batch
    return np.mean(loss)

Define a function for sparse categorical crossentropy using elementary Python and NumPy.

d) Using only elementary Python or the Numpy library, define a function `accuracy(mysoftmax, mylabels)` which calculates the accuracy of the predictions in `mysoftmax` when compared to the actual labels in `mylabels`. The shape of these two arrays are as in the previous subquestion. The function is supposed to return a single number. (2 points)

In [54]:
def accuracy(mysoftmax, mylabels):
    counter = 0
    for i in range(len(mylabels)):
        if np.argmax(mysoftmax[i,:]) == mylabels[i]:
            counter= counter+1
    acc = counter / len(mylabels)    
    return acc

In [55]:
a = np.array([[1,2,3,4],[4,6,8,10]])
alab = np.array([3,3])

In [56]:
accuracy(softmax(a),alab)

1.0

In [53]:
softmax(a)
np.argmax(softmax(a)[0,:])

3

# Q4 Backpropagation

a) Explain the concept of *backpropagation* in the context of *Artificial Neural Networks* in your own words.  Address the following points in your answer:
- What is the background of the problem to be addressed?
- What problem is addressed?
- How does it work?
- What are the numerical challenges and how can they be addressed?
- What factors influence the performance?

Provide links to all sources used in your answer. Do not use more than 700 words. (7 points)

YOUR ANSWER HERE

b) Explain the concept of *backpropagation through time* in the context of *Recurrent Neural Networks*.   Address the following points in your answer:
- What is the background of the problem to be addressed?
- What problem is addressed?
- How does it work?
- What are the numerical challenges and how can they be addressed?
- What factors influence the performance?

Provide links to all sources used in your answer. Do not use more than 700 words. (7 points) 

YOUR ANSWER HERE

Topics have changed a bit over the years, but most of them should be familiar to us. Allow 30 mins per question.