<!---
Latex Macros
-->
$$
\newcommand{\bar}{\,|\,}
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\weights}{\mathbf{w}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\aligns}{\mathbf{a}}
\newcommand{\align}{a}
\newcommand{\source}{\mathbf{s}}
\newcommand{\target}{\mathbf{t}}
\newcommand{\ssource}{s}
\newcommand{\starget}{t}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\prob}{p}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\length}[1]{\text{length}(#1) }
\newcommand{\indi}{\mathbb{I}}
$$

This notebook is the final submission for SNLP at UCL 2017/18. It was completed in groups of 3 with scores for both code and understanding. It achieved a final mark of 90%. It focused on Deep Learning for NLP story understanding, with the use of GloVe and LSTMs amongst other techniques.

# Assignment 3

## Introduction

In the last assignment, you will apply deep learning methods to solve a particular story understanding problem. Automatic understanding of stories is an important task in natural language understanding [[1]](http://anthology.aclweb.org/D/D13/D13-1020.pdf). Specifically, you will develop a model that, given a sequence of sentences, learns to sort these sentences so that they yield a coherent story [[2]](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/short-commonsense-stories.pdf). This sounds (and to an extent is) trivial for humans, but it is quite a difficult task for machines, as it involves commonsense knowledge and temporal understanding.

## Goal

You are given a dataset of 45502 instances, each consisting of 5 sentences. Your system needs to output a sequence of numbers which represent the predicted order of these sentences. For example, given a story:

    He went to the store.
    He found a lamp he liked.
    He bought the lamp.
    Jan decided to get a new lamp.
    Jan's lamp broke.

your system needs to provide an answer in the following form:

    2	3	4	1	0

where the numbers correspond to the zero-based index of each sentence in the correctly ordered story. So "`2`" for "`He went to the store.`" means that this sentence should come 3rd in the correctly ordered target story. In this particular example, this order of indices corresponds to the following target story:

    Jan's lamp broke.
    Jan decided to get a new lamp.
    He went to the store.
    He found a lamp he liked.
    He bought the lamp.
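
The mapping from the shuffled input to the target story can be written in a few lines of Python. This is only an illustrative sketch; the helper name `restore_story` is ours and is not part of the course library.

```python
def restore_story(sentences, order):
    """Place each shuffled sentence at its zero-based target position."""
    ordered = [None] * len(sentences)
    for sentence, position in zip(sentences, order):
        ordered[position] = sentence
    return ordered


shuffled = [
    "He went to the store.",
    "He found a lamp he liked.",
    "He bought the lamp.",
    "Jan decided to get a new lamp.",
    "Jan's lamp broke.",
]
order = [2, 3, 4, 1, 0]

restored = restore_story(shuffled, order)
print("\n".join(restored))
```

Note that `order[i]` is the position of sentence `i` in the target story, not the index of the sentence that belongs at position `i`.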

## Setup Instructions
It is important that this file is placed in the **correct directory**. It will not run otherwise. The correct directory is

    DIRECTORY_OF_YOUR_BOOK/assignments/2017/assignment3/problem/group_X/
    
where `DIRECTORY_OF_YOUR_BOOK` is a placeholder for the directory you downloaded the book to, and `X` in `group_X` is the number of your group.

After you placed it there, **rename the notebook file** to `group_X.ipynb`.

The notebook is pre-set to save models in

    DIRECTORY_OF_YOUR_BOOK/assignments/2017/assignment3/problem/group_X/model/

Be sure not to tinker with that directory - we expect your submission to contain a `model` subdirectory with a single saved model! 
The saving procedure might overwrite the latest save, or not. Make sure you understand what it does, and upload only a single model! (for more details check tf.train.Saver)

## General Instructions
This notebook will be used by you to provide your solution, and by us to both assess your solution and enter your marks. It contains three types of sections:

1. **Setup** Sections: these sections set up code and resources for assessment. **Do not edit, move nor copy these cells**.
2. **Assessment** Sections: these sections are used for both evaluating the output of your code, and for markers to enter their marks. **Do not edit, move, nor copy these cells**.
3. **Task** Sections: these sections require your solutions. They may contain stub code, and you are expected to edit this code. For free text answers simply edit the markdown field.  

**If you edit, move or copy any of the setup, assessments and mark cells, you will be penalised with -20 points**.

Note that you are free to **create additional notebook cells** within a task section. 

Please **do not share** this assignment nor the dataset publicly, by uploading it online, emailing it to friends etc.

## <font color='green'>Setup 1</font>: Load Libraries
This cell loads libraries important for evaluation and assessment of your model. **Do not change, move or copy it.**

In [1]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
#! SETUP 1 - DO NOT CHANGE, MOVE NOR COPY
import sys, os
_snlp_book_dir = "../../../../../"
sys.path.append(_snlp_book_dir)
# docker image contains tensorflow 0.10.0rc0. We will support execution of only that version!
import statnlpbook.nn as nn

import tensorflow as tf
import numpy as np

## <font color='green'>Setup 2</font>: Load Training Data

This cell loads the training data. **Do not edit the next cell, nor copy/duplicate it**. Instead refer to the variables in your own code, and slice and dice them as you see fit (but do not change their values). 
For example, no one stops you from introducing, in the corresponding task section, `my_train` and `my_dev` variables that split the data into different folds.   

In [2]:
#! SETUP 2 - DO NOT CHANGE, MOVE NOR COPY
data_path = _snlp_book_dir + "data/nn/"
data_train = nn.load_corpus(data_path + "train.tsv")
data_dev = nn.load_corpus(data_path + "dev.tsv")
assert(len(data_train) == 45502)

### Data Structures

Notice that the data is loaded from tab-separated files. The files are easy to read, and we provide the loading functions that load it into a simple data structure. Feel free to check details of the loading.

The data structure at hand is an array of dictionaries, each containing a `story` and the `order` entry. `story` is a list of strings, and `order` is a list of integer indices:

In [3]:
data_train[0]

{'order': [3, 2, 1, 0, 4],
 'story': ['His parents understood and decided to make a change.',
  'The doctors told his parents it was unhealthy.',
  'Dan was overweight as well.',
  "Dan's parents were overweight.",
  'They got themselves and Dan on a diet.']}

## <font color='blue'>Task 1</font>: Model implementation

Your primary task in this assignment is to implement a model that produces the right order of the sentences in the dataset.

### Preprocessing pipeline

First, we construct a preprocessing pipeline, in our case `pipeline` function which takes care of:
- out-of-vocabulary words
- building a vocabulary (on the train set), and applying the same unaltered vocabulary on other sets (dev and test)
- making sure that the length of input is the same for the train and dev/test sets (for fixed-sized models)

You are free (and encouraged!) to do your own input processing function. Should you experiment with recurrent neural networks, you will find that you will need to do so.

In [4]:
# convert train set to integer IDs
train_stories, train_orders, vocab = nn.pipeline(data_train)

You need to make sure that the `pipeline` function returns the necessary data for your computational graph feed - the required inputs in this case, as we will call this function to process your dev and test data. If you do not make sure that the same pipeline applied to the train set is applied to other datasets, your model may not work with that data!

In [5]:
# get the length of the longest sentence
max_sent_len = train_stories.shape[2]

# convert dev set to integer IDs, based on the train vocabulary and max_sent_len
dev_stories, dev_orders, _ = nn.pipeline(data_dev, vocab=vocab, max_sent_len_=max_sent_len)

You can take a look at the result of the `pipeline` with the `show_data_instance` function to make sure that your data loaded correctly:

In [6]:
nn.show_data_instance(dev_stories, dev_orders, vocab, 155)

Input:
 Story:
  The manager decided to offer John the job.
  During the interview he was very <OOV> and <OOV>
  He went to the interview very prepared and nicely dressed.
  John was excited to have a job interview.
  The manager of the company was really impressed by John's comments.
 Order:
  [4 2 1 0 3]

Desired story:
  John was excited to have a job interview.
  He went to the interview very prepared and nicely dressed.
  During the interview he was very <OOV> and <OOV>
  The manager of the company was really impressed by John's comments.
  The manager decided to offer John the job.


In [None]:
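def convertGloveTF(vocab, vocab_key):
    """Build an embedding matrix for `vocab` from pre-trained GloVe vectors.

    `vocab_key` maps each integer ID to its word (ID 0 is <PAD>, ID 1 is <OOV>).
    """
    print("Begin")

    # The vector size must match the GloVe file that is loaded. 100 matches
    # the model cell below (input_size = 100, './embeddings100.npy').
    vector_dim = 100
    pretrainedVectors = loadGlovePretrainedVectors('glove.6B.%dd.txt' % vector_dim)

    # Randomly initialise one vector shared by all OOV words
    oov_vec = np.random.rand(vector_dim)

    embeddings = np.zeros((len(vocab), vector_dim))

    # deal with <PAD> and <OOV>
    embeddings[0, :] = np.ones(vector_dim)  # <PAD> - set as ones
    embeddings[1, :] = oov_vec  # <OOV>

    for i in range(2, len(vocab_key)):
        word = vocab_key[i]
        # words missing from GloVe fall back to the shared <OOV> vector
        embeddings[i, :] = pretrainedVectors.get(word, oov_vec)

    # np.save appends '.npy', so this writes 'embeddings100.npy'
    np.save('embeddings%d' % vector_dim, embeddings)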


def loadGlovePretrainedVectors(file):
    """Read a GloVe text file into a dict mapping word -> list of floats."""
    print("Loading Glove pre-trained word vectors")
    vectors = {}
    # each line holds a word followed by its space-separated vector components
    with open(file, 'r', encoding='utf8') as f:
        for line in f:
            lineSegment = line.split()
            word = lineSegment[0]
            vectors[word] = [float(value) for value in lineSegment[1:]]

    print("Finished loading")
    return vectors


In [None]:
# import re
# token = re.compile("[\w-]+|'t|'ll|’re|’n|'ve|'d|'m|'s|\'")
# def tokenize(input):
#     return [word.lower() for word in token.findall(input)]

# def pipeline(data, vocab=None, max_sent_len_=None):
#     is_ext_vocab = True
#     if vocab is None:
#         is_ext_vocab = False
#         vocab = {'<PAD>': 0, '<OOV>': 1}

#     max_sent_len = -1
#     data_sentences = []
#     data_orders = []
#     for instance in data:
#         sents = []
#         for sentence in instance['story']:
#             sent = []
#             tokenized = tokenize(sentence)
#             for token in tokenized:
#                 if not is_ext_vocab and token not in vocab:
#                     vocab[token] = len(vocab)
#                 if token not in vocab:
#                     token_id = vocab['<OOV>']
#                 else:
#                     token_id = vocab[token]
#                 sent.append(token_id)
#             if len(sent) > max_sent_len:
#                 max_sent_len = len(sent)
#             sents.append(sent)
#         data_sentences.append(sents)
#         data_orders.append(instance['order'])

#     if max_sent_len_ is not None:
#         max_sent_len = max_sent_len_
#     out_sentences = np.full([len(data_sentences), 5, max_sent_len], vocab['<PAD>'], dtype=np.int32)

#     for i, elem in enumerate(data_sentences):
#         for j, sent in enumerate(elem):
#             out_sentences[i, j, 0:len(sent)] = sent

#     out_orders = np.array(data_orders, dtype=np.int32)

#     return out_sentences, out_orders, vocab

### Model

The model we provide is a rudimentary, non-optimised model that essentially represents every word in a sentence with a fixed vector, sums these vectors up (per sentence) and puts a softmax at the end which aims to guess the order of sentences independently.
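
To make the description concrete, here is a NumPy sketch of one possible reading of such a baseline. It is not the provided TensorFlow stub: the sizes, the random parameters and the choice to concatenate the five sentence vectors before a single linear layer are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, n_positions = 50, 8, 5

# Hypothetical parameters; in a real model these would be learned.
embeddings = rng.normal(size=(vocab_size, embed_dim))
W = rng.normal(size=(5 * embed_dim, 5 * n_positions))


def baseline_predict(story_ids):
    """story_ids: int array [5, max_len] of token IDs for one story."""
    # Sum the word vectors of each sentence (word order is lost here).
    sent_vecs = embeddings[story_ids].sum(axis=1)          # [5, embed_dim]
    # One linear layer over the concatenated sentence vectors.
    logits = (sent_vecs.reshape(-1) @ W).reshape(5, n_positions)
    # Independent softmax over target positions for each sentence.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs


story_ids = rng.integers(0, vocab_size, size=(5, 6))
pred, probs = baseline_predict(story_ids)
print(pred)
```

Because each sentence gets its own softmax, nothing forces the five predictions to form a valid permutation.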

First we define the model parameters:

In [7]:
data_train = data_train + data_dev
train_stories, train_orders, vocab = nn.pipeline(data_train)
max_sent_len = train_stories.shape[2]

# convert dev set to integer IDs, based on the train vocabulary and max_sent_len
dev_stories, dev_orders, _ = nn.pipeline(data_dev, vocab=vocab, max_sent_len_=max_sent_len)

In [8]:
# Configuration
BATCH_SIZE = 25
EPOCHS = 5  # number of training epochs; KEEP_PROB needs an entry for every epoch

NUM_LAYERS = 3  # number of stacked LSTM layers (1, 2 or 3)
NUM_UNITS = 200  # hidden units per LSTM layer
# dropout keep probability, keyed by epoch
KEEP_PROB = { 0: 0.5, 1: 0.5, 2: 0.5, 3: 0.3, 4: 0.3, 5: 0.3}
TARGET_SIZE = 5

vocab_size = len(vocab)
input_size = 100  # embedding dimension; must match the loaded embedding matrix
output_size = 5


and then we define the model

In [9]:
### MODEL ###

#max_sent_len is the maximum sentence length

#We are going to concatenate the 5 sentences of each story
total_sequence = 5 * max_sent_len   

## Placeholders
story = tf.placeholder(tf.int64, [None, None, None], "story")        # [batch_size x 5 x max_length]
order = tf.placeholder(tf.int64, [None, None], "order")              # [batch_size x 5]

batch_size = tf.shape(story)[0]
keep_prob = tf.placeholder(tf.float32) #Value of probability to keep during dropout

sentences = [tf.reshape(x, [batch_size, -1]) for x in tf.split(axis = 1, num_or_size_splits=5, value = story)]  # 5 times [batch_size x max_length]

# Word embeddings

embeddingArray = np.load('./embeddings100.npy') ### commented for inference
embeddings_row, embeddings_col = embeddingArray.shape
initializer = tf.constant_initializer(embeddingArray)


embeddings = tf.get_variable("x", [vocab_size, input_size], initializer=initializer, dtype=tf.float64)

sentences_embedded = [tf.nn.embedding_lookup(embeddings, sentence)   # 5 X[batch_size x max_seq_length x input_size]
                      for sentence in sentences]

# Model architecture

#Concatenate all sentences
inputs = tf.cast(tf.concat(sentences_embedded,axis = 1), tf.float32)
# [batch_size x (5 * max_seq_length) x input_size]

def build_cell(lstm_size, keep_prob):
    # Use a basic LSTM cell
    lstm = tf.nn.rnn_cell.LSTMCell(lstm_size, state_is_tuple=True)
        
    # Add dropout to the cell
    drop = tf.nn.rnn_cell.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    return drop

#Lstm cell
#lstm = tf.nn.rnn_cell.LSTMCell(NUM_UNITS, state_is_tuple=True) #I changed this in order to be compatible

#dropout 
#dropout = tf.nn.rnn_cell.DropoutWrapper(lstm, output_keep_prob = keep_prob) #I changed this in order to be compatible

#stack of cells
stack = tf.nn.rnn_cell.MultiRNNCell([build_cell(NUM_UNITS, keep_prob) for _ in range(NUM_LAYERS)], state_is_tuple=True)
#stack = tf.nn.rnn_cell.MultiRNNCell([dropout for _ in range(NUM_LAYERS)], state_is_tuple=True)

#outputs, after the sequence of RNN
outputs, state = tf.nn.dynamic_rnn(stack, inputs, dtype=tf.float32)

#reshape outputs to shape [batch_size, total_sequence * NUM_UNITS]
output_h = tf.reshape(outputs, [-1, total_sequence * NUM_UNITS])

#weights of last layer, shape[total_sequence * num_units, 5 * target_size]
W = tf.get_variable("w", [total_sequence * NUM_UNITS, 5 * TARGET_SIZE], dtype=tf.float32)

#biases of last layer, shape[5 * target_size]
b = tf.get_variable("b", [5 * TARGET_SIZE], dtype=tf.float32)

#get logits after linear combination shape [batch_size, (5 * target_size) ]
logits_flat = tf.add(tf.matmul(output_h, W),b) 

#reshape to the correct size
logits = tf.reshape(logits_flat, [-1, 5, TARGET_SIZE])        # [batch_size x 5 x target_size]

# loss 
loss = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=order))

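# L2 penalty on the weights themselves (not on the loss value);
# variables with "noreg" or "b" in their name are skipped.
# The cast is needed because the embedding matrix is float64.
l2 = 0.005 * sum(tf.cast(tf.nn.l2_loss(tf_var), tf.float32) for tf_var in tf.trainable_variables()
                 if not ("noreg" in tf_var.name or "b" in tf_var.name))
loss += (l2)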

# prediction function
unpacked_logits = [tensor for tensor in tf.unstack(logits, axis=1)]
softmaxes = [tf.nn.softmax(tensor) for tensor in unpacked_logits]
softmaxed_logits = tf.stack(softmaxes, axis=1)
predict = tf.arg_max(softmaxed_logits, 2)

We have built our model, together with the loss and the prediction function; all that is left is to build an optimiser on the loss:

In [10]:
optimizer = tf.train.AdamOptimizer(learning_rate=0.0005, beta1=0.9, beta2=0.99, epsilon=1e-8)
gvs = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gvs]
opt_op = optimizer.apply_gradients(capped_gvs)

### Model training 

We have defined the preprocessing pipeline and set the model up, so we can finally train the model:

In [11]:
TRAIN = False

In [12]:
if TRAIN:
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        n = train_stories.shape[0]

        for epoch in range(EPOCHS):
            print('----- Epoch', epoch, '-----')
            # visit the fixed-size batches in a random order each epoch
            indices = np.random.permutation(n // BATCH_SIZE)
            total_loss = 0
            for i in range(n // BATCH_SIZE):
                inst_story = train_stories[indices[i] * BATCH_SIZE: (indices[i] + 1) * BATCH_SIZE]
                inst_order = train_orders[indices[i] * BATCH_SIZE: (indices[i] + 1) * BATCH_SIZE]
                feed_dict = {story: inst_story, order: inst_order, keep_prob: KEEP_PROB[epoch]}
                _, current_loss = sess.run([opt_op, loss], feed_dict=feed_dict)
                total_loss += current_loss

            print(' Train loss:', total_loss / n)

            train_feed_dict = {story: train_stories, order: train_orders, keep_prob: 1.0}
            train_predicted = sess.run(predict, feed_dict=train_feed_dict)
            train_accuracy = nn.calculate_accuracy(train_orders, train_predicted)
            print(' Train accuracy:', train_accuracy)

            dev_feed_dict = {story: dev_stories, order: dev_orders, keep_prob:1.0}
            dev_predicted = sess.run(predict, feed_dict=dev_feed_dict)
            dev_accuracy = nn.calculate_accuracy(dev_orders, dev_predicted)

            print(' Dev accuracy:', dev_accuracy)



        nn.save_model(sess)

## <font color='red'>Assessment 1</font>: Assess Accuracy (40 pts) 

We assess how well your model performs on an unseen test set. We will look at the accuracy of the predicted sentence order, on sentence level, and will score them as follows:

* 0 - 10 pts: 45% <= accuracy < 50%, linear
* 10 - 20 pts: 50% <= accuracy < 55%, linear
* 20 - 40 pts: 55% <= accuracy < 60%, linear
* extra 0 - 10 pts: 60% <= accuracy < 70%, linear

The **linear** mapping maps any accuracy value between the lower and upper bound linearly to a score. For example, if your model's accuracy score is $acc=54.5\%$, then your score is $10 + 10\frac{acc-50}{55-50}$.
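
The bands above can be expressed as a small piecewise-linear function. This is a sketch, not the markers' actual script: the function name is ours, and returning 0 below 45% and capping at 50 points from 70% upwards are assumptions, since the text does not state what happens outside the listed bands.

```python
def accuracy_to_points(acc):
    """Map an accuracy in percent to points using the bands listed above."""
    bands = [
        # (lower acc, upper acc, points at lower, points at upper)
        (45.0, 50.0, 0.0, 10.0),
        (50.0, 55.0, 10.0, 20.0),
        (55.0, 60.0, 20.0, 40.0),
        (60.0, 70.0, 40.0, 50.0),  # 40 points plus 0-10 extra
    ]
    if acc < 45.0:
        return 0.0  # assumption: no points below the first band
    for lo, hi, p_lo, p_hi in bands:
        if lo <= acc < hi:
            return p_lo + (p_hi - p_lo) * (acc - lo) / (hi - lo)
    return 50.0  # assumption: capped once accuracy reaches 70%


print(accuracy_to_points(54.5))  # the worked example from the text
```

For the worked example, $10 + 10 \cdot \frac{54.5 - 50}{55 - 50} = 19$ points.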

Change the following lines so that they construct the test set in the same way you constructed the dev set in the code above. We will insert the test set instead of the dev set here. **`test_feed_dict` variable must stay named the same**.

In [13]:
# LOAD THE DATA
data_test = nn.load_corpus(data_path + "dev.tsv")
# make sure you process this with the same pipeline as you processed your dev set
test_stories, test_orders, _ = nn.pipeline(data_test, vocab=vocab, max_sent_len_=max_sent_len)

# THIS VARIABLE MUST BE NAMED `test_feed_dict`
test_feed_dict = {story: test_stories, order: test_orders, keep_prob: 1.0}

The following code loads your model, computes accuracy, and exports the result. **DO NOT** change this code.

In [14]:
#! ASSESSMENT 1 - DO NOT CHANGE, MOVE NOR COPY
with tf.Session() as sess:
    # LOAD THE MODEL
    saver = tf.train.Saver()
    saver.restore(sess, './model/model.checkpoint')
    
    # RUN TEST SET EVALUATION
    dev_predicted = sess.run(predict, feed_dict=test_feed_dict)
    dev_accuracy = nn.calculate_accuracy(dev_orders, dev_predicted)

dev_accuracy

0.64489577765900585

## <font color='orange'>Mark</font>:  Your solution to Task 1 is marked with ** __ points**. 
---

## <font color='blue'>Task 2</font>: Describe your Approach

Enter a 1000 words max description of your approach **in this cell**.
Make sure to provide:
- an **error analysis** of the types of errors your system makes
- a **comparison** of your system with the model we provide: focus on the differences and draw useful comparisons between them

Should you need to include figures in your report, make sure they are Python-generated (matplotlib, seaborn, bokeh are all included in the stat-nlp-book Docker image). For that, feel free to create new cells after this cell (before Assessment 2 cell). Link online images at your risk.

## Motivations

The aim of this assignment is to construct deep learning models that determine the correct order of the sentences in a story. One of the first successful applications of deep learning to sequence-to-sequence learning tasks was by [Sutskever et al. (2014)](https://arxiv.org/pdf/1409.3215.pdf). That paper used a multi-layered Long Short-Term Memory (LSTM) network to map the input sequence to a vector of fixed dimensionality, and then another deep LSTM to decode the target sequence from that vector. Since then, LSTMs and their variants have become the de facto approach to sequence learning and mapping problems.

## Approach

**1) Preprocessing**

**1.1) Word Embedding**

The baseline model presented in our stub represents words as discrete IDs. This leads to a data sparsity problem when we try to model words in the context of a sentence. To overcome this, we represent our words with GloVe embeddings. Dense vectors alleviate the data sparsity problem because words with similar meanings are mapped to similar vectors.

Given the size constraint on our assignment file, we used the GloVe model (pre-trained on the Wikipedia and Gigaword datasets) with 100 dimensions. We do not use all of the pre-trained word embeddings when training, only the embeddings for the words in our vocabulary. We deal with OOVs by mapping such words to the `<OOV>` token and assigning the same vector to all of them. We experimented with 200 and 300 dimensions, but memory constraints made further experimentation infeasible; for the epochs that did train successfully, we noted no significant difference. Using GloVe boosted the accuracy of our LSTM by 0.5 - 1%.

We analysed how many tokens in our vocabulary were missing from the pre-trained word embeddings; fortunately, it was a small number. Had a large proportion of our vocabulary been missing, we would have attempted to train our own word embeddings.
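
A coverage check of this kind can be sketched as follows. The function name and the toy vocabulary are illustrative only; `pretrained` stands in for the word-to-vector dictionary returned by a GloVe loader.

```python
def embedding_coverage(vocab, pretrained):
    """Return the fraction of vocabulary words that have a pre-trained vector,
    together with the list of words that do not."""
    special = {"<PAD>", "<OOV>"}
    words = [w for w in vocab if w not in special]
    missing = [w for w in words if w not in pretrained]
    coverage = 1.0 - len(missing) / len(words) if words else 1.0
    return coverage, missing


# Toy data, not taken from the real dataset.
vocab = {"<PAD>": 0, "<OOV>": 1, "lamp": 2, "store": 3, "zzyzx": 4}
pretrained = {"lamp": [0.1, 0.2], "store": [0.3, 0.4]}

coverage, missing = embedding_coverage(vocab, pretrained)
print("coverage: %.2f, missing: %s" % (coverage, missing))
```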


**1.2) Model**

![model](https://scontent-lht6-1.xx.fbcdn.net/v/t1.0-9/26219123_10155922352942969_8715478189856124103_n.jpg?oh=d8b7eee8f9e8909ed13b61a6f2f41714&oe=5AFD8190)

The baseline model is very simple: it sums the word vectors of each input sentence into a single representation, which discards word order and depends strongly on the length of the sentence. As a result, the baseline mostly learns to order the sentences by length. We implement an LSTM model, which instead focuses on the sequence of words in the sentences. This is the critical difference between our model and the baseline: capturing sequential relationships helps our model perform better.

Our final model consists of stacked layers of LSTM cells. The inputs to the model are word embeddings; these were originally randomly initialised and later replaced with the output of the GloVe preprocessing. To capture the dependencies between the sentences in a story, we concatenate the five embedded sentences into a single input sequence. After each LSTM layer we apply dropout to reduce overfitting, before feeding the outputs into the next layer in the stack. This is done with `MultiRNNCell` and `dynamic_rnn`. Many LSTM models use between 100 and 200 units; we found that 200 units performed slightly better than 100. Using more LSTM layers resulted in more overfitting, which is why we decided to stick with only 3 layers. We train with the sparse softmax cross-entropy loss plus an L2 regularization term, minimised with the Adam optimiser.
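
The loss and prediction step at the end of the network can be illustrated in NumPy. This sketch mirrors what `tf.nn.sparse_softmax_cross_entropy_with_logits` followed by a sum and an argmax compute; the all-zero logits are a toy input chosen so the result is easy to check by hand.

```python
import numpy as np


def sparse_softmax_cross_entropy(logits, labels):
    """logits: [batch, 5, 5] scores; labels: [batch, 5] integer positions.
    Returns the summed cross-entropy over all sentences in the batch."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.sum()


logits = np.zeros((1, 5, 5))           # an untrained model: uniform scores
labels = np.array([[2, 3, 4, 1, 0]])

uniform_loss = sparse_softmax_cross_entropy(logits, labels)  # 5 * ln(5)
predicted = logits.argmax(axis=-1)     # one independent argmax per sentence
print(uniform_loss, predicted)
```

With uniform scores every sentence is assigned position 0, which shows that the five independent softmaxes do not guarantee a valid permutation.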

As for similarities between our model and the original: like the original, we use word embeddings and so move to a numerical representation of our data. In our experience, vectorising helps the model learn better.


**1.3) Training**

As well as implementing a new model, we also improve upon the default implementation with better training procedures.

**Gradient Clipping**

Since our model has 3 LSTM layers of 200 units each, unrolled over many time steps, its gradients can become either very large or very small. Gradient clipping is a technique that helps prevent exploding gradients in such networks ([*Pascanu et al., 2013*](https://arxiv.org/abs/1211.5063)). Our training code uses the simple element-wise variant, `tf.clip_by_value`, which clips every gradient component to $[-1, 1]$. The norm-based variant instead rescales the gradient of a parameter vector whenever its norm exceeds a certain threshold:

\begin{equation*}
\text{new gradient} = \text{old gradient} \cdot \frac{\text{threshold}}{\lVert \text{gradient} \rVert_2}
\end{equation*}
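
The difference between norm-based and element-wise clipping is easy to see on a small example. The NumPy sketch below is illustrative only; the function names are ours, and the element-wise version corresponds to what `tf.clip_by_value` does in the optimiser cell.

```python
import numpy as np


def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad


def clip_by_value(grad, low, high):
    """Clip every component of grad independently."""
    return np.clip(grad, low, high)


grad = np.array([3.0, 4.0])                 # L2 norm is 5
by_norm = clip_by_norm(grad, 1.0)           # direction preserved
by_value = clip_by_value(grad, -1.0, 1.0)   # direction may change
print(by_norm, by_value)
```

Norm clipping keeps the direction of the update and only shrinks its length, whereas value clipping can change the direction when components are clipped by different amounts.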

**Random Batching**

To reduce the correlation between consecutive mini-batches, we shuffle our data using random batching. This also helps prevent the oscillation of weight values from one epoch to the next and makes the sequence of updates more stochastic ([*Goodfellow et al., 2016*](http://www.deeplearningbook.org/)). In practical terms, we shuffle the batch indices every epoch before drawing batches. Random batching gave a marginal improvement in accuracy of 0.5%.
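
A minimal sketch of per-epoch shuffling is shown below. Unlike the training cell, which permutes the order of fixed batches, this variant permutes individual examples; the function name is ours and the sizes are toy values.

```python
import numpy as np


def shuffled_batches(n_examples, batch_size, rng):
    """Yield index arrays that cover a fresh permutation of the data.
    Examples that do not fill a complete batch are dropped."""
    perm = rng.permutation(n_examples)
    for start in range(0, n_examples - batch_size + 1, batch_size):
        yield perm[start:start + batch_size]


rng = np.random.default_rng(0)
batches = list(shuffled_batches(n_examples=103, batch_size=25, rng=rng))
print(len(batches), [len(b) for b in batches])
```

Calling the generator once per epoch gives a different batch composition each time, while every example is still used at most once per epoch.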

**Dropout**

Adding extra layers to our deep architecture increases the risk of overfitting the training data. One way to counter this is dropout, which keeps each unit's output active only with a certain probability during training. We used a dictionary of keep probabilities, keyed by epoch, in order to slow down overfitting as training progressed through the epochs.

**L2-Regularization**

To complement our stochastic regularizer (dropout), we also added non-stochastic L2 regularization. This adds a penalty on the squared values of our LSTM weights, so that weights stay small unless the loss gradient is large enough to counteract the penalty. This increased both the consistency and accuracy of our final model, bringing us to a final accuracy of 55.5% on the dev set.
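
Both regularizers can be sketched in NumPy. The `KEEP_PROB` schedule is copied from the configuration cell; the helper names are ours, and `l2_penalty` follows the convention of `tf.nn.l2_loss`, which computes half the sum of squares.

```python
import numpy as np

# keep probability per epoch, as in the configuration cell
KEEP_PROB = {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.3, 4: 0.3, 5: 0.3}


def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero units at random and rescale the survivors."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


def l2_penalty(weights, scale=0.005):
    """scale * sum over weight arrays of (sum of squares) / 2."""
    return scale * sum(0.5 * np.sum(w ** 2) for w in weights)


rng = np.random.default_rng(0)
dropped = dropout(np.ones(1000), KEEP_PROB[0], rng)
penalty = l2_penalty([np.array([3.0, 4.0])])
print(dropped[:5], penalty)
```

At evaluation time the keep probability is set to 1.0, so no units are dropped.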


**Optimisation**

To tune our final model we used grid search over the Adam hyperparameters, the number of layers and the word embedding dimension. We found that 3 layers worked best and kept the model small enough to stay within the size constraints. Adam seemed to overfit with a learning rate higher than 0.0005; we also decayed the learning rate as the model trained in order to avoid overfitting. These were all optimised using cross-validation, although the initial grid searches had to cover only a few parameter values because training LSTMs is time-consuming.
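
A grid search of this kind can be sketched as a loop over all combinations of candidate values. The `toy_evaluate` function below is a made-up stand-in for training a model and returning its validation accuracy; it does not reproduce our experiments.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Evaluate every combination in grid and return the best one."""
    best_score, best_config = float("-inf"), None
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        score = evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score


def toy_evaluate(config):
    # Hypothetical score surface, used only so the example can run.
    return -abs(config["num_layers"] - 3) - 1000 * abs(config["learning_rate"] - 0.0005)


grid = {"num_layers": [1, 2, 3], "learning_rate": [0.0005, 0.001, 0.005]}
best_config, best_score = grid_search(toy_evaluate, grid)
print(best_config)
```

The number of runs grows multiplicatively with each added parameter, which is why slow models force small grids.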

**Checkpoints**

Our final model achieved its maximum accuracy at epoch 4 of 20, so we used early stopping with checkpoints to save the best model for inference. We also added a patience period that stops training once the loss stops decreasing; the final training run was stopped at epoch 12.
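
Early stopping with a patience period can be sketched as follows. The loss values are invented for the example, and the function only simulates the stopping rule; a real training loop would save a checkpoint whenever the best loss improves.

```python
def early_stopping(dev_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch losses.
    Training stops once `patience` epochs pass without a new best loss."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            # a real loop would save a checkpoint here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(dev_losses) - 1


losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.95]   # made-up values
best, stopped = early_stopping(losses, patience=3)
print(best, stopped)
```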


**2) Failed Approaches**

We applied a number of additional techniques that were not in the baseline code; however, these did not improve our final score. The first was regex tokenization: we used regular expressions to split up the sentences, hoping this would perform better than splitting on spaces. We tried several regular expressions before settling on one that worked well with apostrophes. This increased our validation accuracy by 1%. However, we were not able to get GloVe working in conjunction with the custom tokenization and therefore decided to omit it.

We switched from a uniform initializer to truncated normal before switching over to GloVe. Truncated normal proved better than uniform, achieving 53.5% on the dev set with basic testing as opposed to 52.5% for uniform. It also appeared to converge faster.

We also investigated activation functions. The final model used tanh; we tested ReLU and sigmoid, but neither improved our accuracy, and gradient clipping was more effective at reducing the gradients.

We tried GRU cells; these gave little change compared to the LSTM cells used in the final model, and often achieved a lower score while being more consistent between epochs. We therefore did not use them in the final model.

## Comparison with Baseline Model

| Model | Accuracy |
| :---         |         ---: |
| LSTM (Final)     | 55.6%   |
| Baseline     | 40%     |

## Other Models

| Model |  Accuracy |
| :---         |        ---: |
| LSTM + GloVe + L2 | 55.3%   |
| LSTM + GloVe     | 54%     |
| LSTM + regex token     | 53%     |
| basic LSTM/GRU   | 51%     |
| MLP     | 32%     |


**Error analysis**

Our model predicted the first (72%) and final (63%) sentences most accurately, which we attribute to the nature of those sentences. In English stories, the first and last sentences are usually the shortest, often containing just a main clause. Taking the mean sentence length at each of the 5 positions across the dataset, we found this to be the case.

Another reason the first sentence is easy to predict is that it introduces information that is referred to by a personal pronoun later in the story. These are short stories, so names are usually given in the first sentence. The following stories show where the model correctly predicts the first sentence, but also where it struggles.

Story 1: “Frank had been drinking beer. He got a call from his girlfriend, asking where he was. Frank suddenly realized he had a date that night. Since Frank was already a bit drunk, he could not drive. Frank spent the rest of the night drinking more beers.”

This follows one of the two rules described above: the first sentence is short. However, the model was unable to correctly order the rest of the sentences because of the repeated use of “Frank” and the similarity of these sentences.

Story 2: “Josh had a parrot that talked. During show and tell, Josh's parrot said a bad word. He brought his parrot to school. The teacher told Joshua not to bring his bird again. When Josh got home, he was grounded.”

This story has a simpler format than the previous one, and our final model ordered it correctly. Again it follows the rules outlined at the start of the error analysis: Josh is the subject of the story, and this is easy to see.

Beyond these rules, English sentence structure allows for compound and complex sentences, which contain a main clause together with other clauses, usually joined by connectives such as “and” or “because”. These usually appear in sentences 2, 3 or 4, as is apparent in Story 2.

All of these patterns affect the length of the sentences, which is very easy for the model to pick up; the more apparent these rules are, the higher our accuracy.
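
The two statistics used in this analysis, accuracy per target position and mean sentence length per target position, can be computed as sketched below. The function names and the toy arrays are ours; `stories` is assumed to be a padded ID array shaped like the output of the preprocessing pipeline, with 0 as the padding ID.

```python
import numpy as np


def per_position_accuracy(gold_orders, predicted_orders, n_positions=5):
    """Accuracy over the sentences whose gold position is p, for each p."""
    gold = np.asarray(gold_orders)
    pred = np.asarray(predicted_orders)
    correct = gold == pred
    return [float(correct[gold == p].mean()) for p in range(n_positions)]


def mean_length_by_position(stories, orders, pad_id=0, n_positions=5):
    """Mean number of non-padding tokens of the sentence at each gold position."""
    lengths = (np.asarray(stories) != pad_id).sum(axis=2)    # [n, 5]
    orders = np.asarray(orders)
    return [float(lengths[orders == p].mean()) for p in range(n_positions)]


# Toy data, not taken from the real dataset.
gold = [[2, 3, 4, 1, 0], [0, 1, 2, 3, 4]]
pred = [[2, 3, 4, 0, 1], [0, 1, 2, 3, 4]]
stories = [[[5, 6, 0], [7, 0, 0], [8, 9, 10], [4, 0, 0], [3, 2, 0]]]

accuracy = per_position_accuracy(gold, pred)
lengths = mean_length_by_position(stories, [gold[0]])
print(accuracy, lengths)
```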


## <font color='red'>Assessment 2</font>: Assess Description (60 pts) 

We will mark the description along the following dimensions: 

* Clarity (10pts: very clear, 0pts: we can't figure out what you did, or you did nothing)
* Creativity (25pts: we could not have come up with this, 0pts: Use only the provided model)
* Substance (25pts: implemented complex state-of-the-art classifier, compared it to a simpler model, 0pts: Only use what is already there)

## <font color='orange'>Mark</font>:  Your solution to Task 2 is marked with ** __ points**.
---

## <font color='orange'>Final mark</font>: Your solution to Assignment 3 is marked with ** __points**. 