# HA2 - Recurrent Neural Networks for NLP

This assignment is done in collaboration with Daniel Langkilde from Recorded Future. For questions, contact him on daniel@recordedfuture.com.

### Named Entity Recognition
The goal of this assignment is to build and train a model for [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition). An important part of understanding natural language is being able to accurately locate and classify named entities. Named entities are nouns such as persons, organizations, locations etc. We will treat NER as a sequence labelling problem, and use Recurrent Neural Networks to solve it.

This assignment is broken down into 6 main tasks:
1. Loading the dataset and understanding it
2. Computing the word embeddings
3. Preprocessing the dataset
4. Training a vanilla RNN for this task
5. Training a deep LSTM for this task
6. Evaluating the best model

## 1. Dataset
In this assignment we will use a dataset called CoNLL-2002 Shared Task for Named Entity Recognition. It consists of 47 959 sentences, or 1 048 576 words including punctuation, with corresponding entity labels.

In [None]:
import pandas as pd
import numpy as np
conll_data = pd.read_csv("./conll_2002_ner_dataset.csv")

### 1.1. Description of dataset

In [None]:
print("Shape of dataset: "+str(conll_data.shape))
conll_data.head(25)

The first column in the dataset is used to indicate when a new sentence begins.

The second column holds the word.

The third column holds the Part-of-Speech tag. The POS tag describes the role of the word in the sentence (adjective, noun, verb, etc.). More details are available at https://en.wikipedia.org/wiki/Part_of_speech

The last column contains the Named Entity Tags. These are the tags we want to be able to predict for a given sentence. The tags in the dataset are:

In [None]:
tags = list(conll_data.Tag.unique())
tags.sort(key = lambda x: x[2:]) # sorting them to make it easier to read the list
NUM_CLASSES = len(tags)
print("Number of different tags: "+str(NUM_CLASSES))
tags

As you can see, there are 9 different types represented in the dataset: 8 entity types plus `O` for tokens that are not part of any entity. The B- and I- prefixes indicate whether the word is the beginning of an entity or is inside an entity, which puts the total number of classes at 17 (8 × 2 prefixed tags plus `O`). For example, "Barack Obama" would get the labels "B-per I-per". The entities represented are:

O&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;- No type<br>
Art&nbsp;&nbsp;&nbsp;&nbsp;- Artifact<br>
Eve&nbsp;&nbsp;&nbsp;- Event<br>
Geo &nbsp;- Geographical Entity<br>
Gpe &nbsp;- Geopolitical Entity<br>
Nat &nbsp;&nbsp;- Natural Phenomenon<br>
Org &nbsp;&nbsp;- Organization<br>
Per &nbsp;&nbsp;- Person<br>
Tim &nbsp;&nbsp;- Time Expression<br>
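
To make the B-/I- scheme concrete, here is a minimal sketch that groups tagged tokens back into entities. The function name `extract_entities` and the toy sentence are illustrative assumptions and are not part of the assignment code.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity (closing any open one first)
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

# Toy example (made up for illustration)
tokens = ["Barack", "Obama", "visited", "London", "."]
tags = ["B-per", "I-per", "O", "B-geo", "O"]
print(extract_entities(tokens, tags))
# [('Barack Obama', 'per'), ('London', 'geo')]
```

The B- prefix is what allows two adjacent entities of the same type to be told apart, since a new B- tag always starts a new group.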

## 2. Computing the word embeddings

The biggest challenge with NLP is that language data is sparse. This is sometimes known as the curse of dimensionality. There are many ways to say the same thing, and the meaning of a word is highly dependent on its context. For a computer, representing words as discrete atoms is a blunt tool, because two words have no inherent notion of similarity in such a representation.


To express this more formally, let's assume we represent words as vectors whose size equals the size of our vocabulary. Each word is then represented by a 1 in one specific position and zeros in all others (a one-hot vector). There is no meaningful way to compute the similarity between two words in this representation, since any two distinct one-hot vectors are orthogonal. Instead, we move from this symbolic representation to a distributed representation that inherently allows for similarity comparisons.

We base this representation on the assumption that the meaning of a word is implicit in the contexts in which it appears. Or, in the words of the linguist J.R. Firth, "You shall know a word by the company it keeps". The idea behind word embeddings is to find a dense vector representation for words such that the cosine similarity between two words is high when those words tend to appear in similar contexts.
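
As a small illustration of the difference between the two representations, the sketch below compares cosine similarities for one-hot vectors and for dense vectors. The three-word vocabulary and the dense vectors are made up for illustration; they are not output from a trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vocab = ["paris", "berlin", "banana"]

# One-hot representation: one row of the identity matrix per word
one_hot = np.eye(len(vocab))
sim_one_hot = cosine_similarity(one_hot[0], one_hot[1])

# Hypothetical dense vectors, chosen by hand so that the two cities are close
dense = {
    "paris": np.array([0.9, 0.1, 0.0]),
    "berlin": np.array([0.8, 0.2, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9]),
}
sim_cities = cosine_similarity(dense["paris"], dense["berlin"])
sim_fruit = cosine_similarity(dense["paris"], dense["banana"])

print("one-hot paris/berlin:", sim_one_hot)
print("dense   paris/berlin:", sim_cities)
print("dense   paris/banana:", sim_fruit)
```

With one-hot vectors every pair of distinct words has similarity 0, while the dense vectors can express that "paris" is closer to "berlin" than to "banana".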

We will now compute such dense vectors, or word embeddings as they are also known, for the words in our dataset. Rather than write everything ourselves we will use a package called gensim. There are some hyperparameters to choose for this, and default values are supplied.

First we want to group the words by which sentence they occur in. We will drop the POS tag for this exercise.
The goal is to get an ndarray with 47958 rows, each with two columns containing the array of words and the array of their labels, respectively. Fill in your code in the next cell.

In [None]:
def read_conll_data(conll_data):
    # complete

In [None]:
raw_data = read_conll_data(conll_data)

The required output is a numpy.ndarray containing one numpy.ndarray for each sentence. Inside each numpy.ndarray sentence are two numpy.ndarray containing the words and labels respectively. The first numpy.ndarray sentence should look like this:

    array([ array(['Thousands', 'of', 'demonstrators', 'have', 'marched', 'through',
           'London', 'to', 'protest', 'the', 'war', 'in', 'Iraq', 'and',
           'demand', 'the', 'withdrawal', 'of', 'British', 'troops', 'from',
           'that', 'country', '.'], 
          dtype='<U13'),
           array(['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O',
           'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O'], 
          dtype='<U5')], dtype=object)
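
The grouping idea can be sketched on a handful of toy rows in plain Python. This is only an illustration under stated assumptions: the rows, the marker values, and the POS tags are made up, the marker is assumed to be set only on the first row of each sentence, and `group_rows_by_sentence` is a hypothetical helper that returns lists rather than the ndarray structure required above.

```python
# Toy rows laid out as (sentence marker, word, POS tag, entity tag).
# Assumption: the marker is present only on the first row of each sentence.
rows = [
    ("Sentence: 1", "Thousands", "NNS", "O"),
    (None, "marched", "VBN", "O"),
    (None, "through", "IN", "O"),
    (None, "London", "NNP", "B-geo"),
    ("Sentence: 2", "Iranian", "JJ", "B-gpe"),
    (None, "officials", "NNS", "O"),
]

def group_rows_by_sentence(rows):
    """Return one (words, tags) pair per sentence, dropping the POS tag."""
    grouped = []
    for marker, word, _pos, tag in rows:
        # A marker (or the very first row) starts a new sentence
        if marker is not None or not grouped:
            grouped.append(([], []))
        words, tags = grouped[-1]
        words.append(word)
        tags.append(tag)
    return grouped

for words, tags in group_rows_by_sentence(rows):
    print(words, tags)
```

The same effect can be obtained on a DataFrame by propagating the sentence marker down to the following rows and then grouping on it.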

Now we'll use gensim to compute the embeddings. Some hyperparameters are shown here for reference.

In [None]:
from gensim.models import Word2Vec
VECTOR_SIZE = 100
MIN_COUNT = 5
MIN_ALPHA = 0.0001
WINDOW = 3

Given our raw dataset your task is to compute a gensim Word2Vec model.

In [None]:
def get_sentences(data):
    # complete
    
def compute_embeddings_model(data):
    # complete

sentences = get_sentences(raw_data)
embeddings = compute_embeddings_model(sentences)

Here `sentences` is of type `list` and `embeddings` is of type `gensim.models.word2vec.Word2Vec`. If done correctly, the following commands:

In [None]:
embeddings.wv.most_similar('Germany')

In [None]:
embeddings.wv.most_similar(positive=['Paris', 'France'], negative=['Berlin'], topn=10)

should return something like the following (expect some variation due to the stochastic nature of Word2Vec training):

    [('France', 0.9594805240631104),
     ('Britain', 0.8995000123977661),
     ('Brazil', 0.8921124935150146),
     ('Italy', 0.85428786277771),
     ('Spain', 0.8505465984344482),
     ('Canada', 0.8474853038787842),
     ('Japan', 0.8394404053688049),
     ('Argentina', 0.8286175727844238),
     ('Netherlands', 0.8081429600715637),
     ('Russia', 0.8042771220207214)]

and

    [('Germany', 0.7740741968154907),
     ('Britain', 0.708018958568573),
     ('Spain', 0.7058088779449463),
     ('Japan', 0.7046102285385132),
     ('Italy', 0.7018724679946899),
     ('Asia', 0.6579268574714661),
     ('Brazil', 0.6521944403648376),
     ('Canada', 0.6436143517494202),
     ('Vietnam', 0.6399591565132141),
     ('Australia', 0.63737952709198)]

### Visualizing the word embeddings
Before we move on to the neural networks, let's also plot the embeddings we've computed. To do this we will use a method called t-SNE, which lets us find a nice projection of our high-dimensional embeddings onto a two-dimensional surface. The code is ready to run; there is no need to add anything to it.

(this might take some minutes to run, depending on your hardware)

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def compute_tsne(embeddings_model):
    embeddings_model.init_sims(replace=True)
    X = embeddings_model[embeddings_model.wv.vocab]
    tsne = TSNE(n_components=2)
    return tsne.fit_transform(X)

tsne = compute_tsne(embeddings)

In [None]:
# save the resulting projection
plt.rcParams["figure.figsize"] = (50,50)
plt.scatter(tsne[:, 0], tsne[:, 1])
labels = list(embeddings.wv.vocab.keys())
for label, x, y in zip(labels, tsne[:, 0], tsne[:, 1]):
    plt.annotate(
        label,
        xy=(x, y), xytext=(-1, -1),
        textcoords='offset points', ha='right', va='bottom')

plt.savefig('tsne.png')

Why is this visualization helpful? What sort of information does it provide us?

**Your answer**: (fill in here)

## 3. Preprocessing

We will need to do some more preprocessing of our data for it to be useful to train an RNN. The following code defines some helpful variables for you.

In [None]:
NUM_SENTENCES = len(sentences)
MAX_SENT_LENGTH = 0

for sentence in sentences:
    length = len(sentence)
    if length > MAX_SENT_LENGTH: 
        MAX_SENT_LENGTH = length
        
print("Number of sentences: "+str(NUM_SENTENCES))
print("Maximum sentence length: "+str(MAX_SENT_LENGTH))
print("Number of target classes: "+str(NUM_CLASSES))

There are 5 things we need to do before we are ready to train our Recurrent Neural Network. We will need to

1. <b>Replace words with their embeddings</b><br>
We've previously computed word embeddings, and we now need to transform our sequences of tokens into sequences of word vectors.
<br>
2. <b>One-hot encode targets</b><br>
Each token can have any of our 17 classes. We want to represent the class of a token with a one-hot encoded vector.
<br>
3. <b>Zero-pad sentences and targets</b><br>
We need all our sequences to be of equal length in order for the matrix algebra to work out. To achieve this we will zero pad each sentence with zero vectors, and each target with zero tags.
<br>
4. <b>Check dimensions of our data</b><br>
It's a good practice to always check the dimensionality of your dataset before you start training your model.
5. <b>Split into train and test set</b><br>
Finally we want to split our dataset into a train and test set.

### 3.1 Replace words with their embeddings
We will now replace each word in our list of sentences with its word vector. If a word is not in our vocabulary, we will sample its vector representation from a normal distribution instead (a common practice for these problems).

In [None]:
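def get_vector_for_unknown_word(dim):
    # Sample a random vector; 2*randn - 1 draws each entry from a normal
    # distribution with mean -1 and standard deviation 2
    unk_vec = 2*np.random.randn(dim)-1
    # Rescale the vector to unit length
    norm_const = np.linalg.norm(unk_vec)
    unk_vec /= norm_const
    return unk_vec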

def replace_words_with_embeddings(sentences, embeddings):
    # complete

We will use the list of sentences and our computed embeddings as input.

In [None]:
data = replace_words_with_embeddings(sentences, embeddings)

Our output here should be of type List where each List entry is a List of word vectors. Word vectors will be of type `numpy.ndarray`.

### 3.2 One-hot encode targets
Similar to the previous steps we will now replace our sequence of entity tags with vectors. Each entity tag will be one-hot encoded, which means that it will be represented by a vector with dimension equal to the number of classes. The entity type of a specific token will be represented by a 1, with all other classes 0.
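
As a toy illustration of one-hot encoding (not the required implementation), the sketch below encodes three labels against a shortened, hypothetical tag list. The names `toy_tags`, `tag_to_index` and `one_hot` are assumptions made for this example only.

```python
import numpy as np

# Hypothetical, shortened tag list; the real dataset has 17 classes
toy_tags = ["O", "B-geo", "I-geo"]
tag_to_index = {tag: i for i, tag in enumerate(toy_tags)}

def one_hot(labels, tag_to_index):
    """Return an array of shape (len(labels), num_classes) with a single 1 per row."""
    encoded = np.zeros((len(labels), len(tag_to_index)))
    columns = [tag_to_index[label] for label in labels]
    # Set exactly one entry per row: row i gets a 1 in the column of its tag
    encoded[np.arange(len(labels)), columns] = 1.0
    return encoded

print(one_hot(["O", "B-geo", "O"], tag_to_index))
```

Each row sums to 1, and taking the `argmax` of a row recovers the index of the original tag.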

In [None]:
def get_targets(data):
    return list(map(lambda pair: pair[1], data))

def one_hot_encode_targets(targets, tags):
    # complete

In [None]:
raw_targets = get_targets(raw_data)
targets = one_hot_encode_targets(raw_targets, tags)

Similar to the sentence case, raw_targets here will be of type List. Each entry in this list will be a numpy.ndarray of labels.

`targets` will be of type List, with each sentence represented by a numpy.ndarray of numpy.ndarrays holding 1s and 0s. For the first sentence this means 24 one-hot vectors (one per token), each of length 17.

### 3.3 Zero-pad sentences and targets
We need all our sequences to be of equal length in order for the matrix algebra to work out. To achieve this we will zero-pad each sentence with zero vectors.
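
The padding idea can be sketched on toy data as follows. Note that this hypothetical `pad_to_length` helper returns a new array, whereas the notebook's `zero_pad_sentence` is called for its in-place effect on a list, so treat it as an illustration of the idea rather than a drop-in solution.

```python
import numpy as np

def pad_to_length(vectors, max_length, dim):
    """Return a (max_length, dim) array: the given vectors followed by zero rows."""
    padded = np.zeros((max_length, dim))
    if len(vectors) > 0:
        padded[:len(vectors)] = np.asarray(vectors)
    return padded

# Toy sentence with 2 word vectors of dimension 3, padded to length 4
toy_sentence = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]
padded = pad_to_length(toy_sentence, max_length=4, dim=3)
print(padded.shape)  # (4, 3)
print(padded)
```

After padding, all sentences share the same shape and can be stacked into a single three-dimensional array.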

In [None]:
def zero_pad_sentence(sentence):
    # complete

In [None]:
for sentence in data:
    zero_pad_sentence(sentence)
data = np.array(data)

Once we're finished with our preprocessing we transform our list of data points to a numpy.ndarray. That means we should now have data of type numpy.ndarray, where each sentence is represented by a numpy.ndarray of word vectors in the form of numpy.ndarrays. The shape should be `(47958, 104, 100)`.

Next we will zero-pad our targets with the "no entity" type label.

In [None]:
def zero_pad_targets(targets):
    # complete

In [None]:
targets = np.array(zero_pad_targets(targets))

Similar to our data our targets should now be a numpy.ndarray of numpy.ndarray of numpy.ndarray. The shape should be `(47958, 104, 17)`.

### 3.4 Check data dimensions
To ensure that we've succeeded in preprocessing all our data appropriately, we will check a few of its dimensions.

In [None]:
NUM_SENT_OK = targets.shape[0] == data.shape[0]
TARGET_LENGTH_OK = MAX_SENT_LENGTH == targets.shape[1]
DATA_LENGTH_OK = MAX_SENT_LENGTH == data.shape[1]
VECTOR_SIZE_OK = data.shape[2] == VECTOR_SIZE
print("Target length ok: "+str(TARGET_LENGTH_OK))
print("Data length ok: "+str(DATA_LENGTH_OK))
print("Vector size ok: "+str(VECTOR_SIZE_OK))
print("Num sent ok: "+str(NUM_SENT_OK))

The booleans should all be True to proceed.

### 3.5 Split into train and test data
We will split our dataset into a train and a test part. We will use 20% of the data to test.

In [None]:
import math 
TEST_FRACTION = 0.2
NUM_TEST = math.ceil(NUM_SENTENCES * TEST_FRACTION)
NUM_TRAIN = NUM_SENTENCES - NUM_TEST
x_train = data[:NUM_TRAIN]
y_train = targets[:NUM_TRAIN]
x_test = data[NUM_TRAIN:]  # start right after the last training sample
y_test = targets[NUM_TRAIN:]

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

## 4. Training a simple RNN
We are finally ready to train our first Recurrent Neural Network. 

### 4.1 Training

Design a single-layer vanilla RNN to predict the class of each word in the sentences.

Hints:
- Keras has a layer called `SimpleRNN` for implementing this type of RNN.
- You may also want to take a look at the `TimeDistributed` layer.
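
To build intuition for what such a model computes, here is a minimal NumPy sketch of a vanilla RNN forward pass with a softmax applied at every time step. It is not Keras code and not a solution to the task: the weights are random, the function names are hypothetical, and the sizes (hidden size 8, 5 time steps) are arbitrary assumptions. Conceptually, the recurrence plays the role of the recurrent layer, and the per-step softmax plays the role of a dense output layer applied to each time step.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rnn_forward(x, W_xh, W_hh, b_h, W_hy, b_y):
    """x has shape (timesteps, input_dim); returns (timesteps, num_classes)."""
    h = np.zeros(W_hh.shape[0])  # initial hidden state
    outputs = []
    for x_t in x:
        # The new hidden state depends on the current input and the previous state
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        # The same output weights are applied at every time step
        outputs.append(softmax(h @ W_hy + b_y))
    return np.stack(outputs)

rng = np.random.default_rng(0)
timesteps, input_dim, hidden, num_classes = 5, 100, 8, 17
x = rng.standard_normal((timesteps, input_dim))
probs = rnn_forward(
    x,
    rng.standard_normal((input_dim, hidden)) * 0.1,
    rng.standard_normal((hidden, hidden)) * 0.1,
    np.zeros(hidden),
    rng.standard_normal((hidden, num_classes)) * 0.1,
    np.zeros(num_classes),
)
print(probs.shape)  # (5, 17)
```

Each row of `probs` is a probability distribution over the 17 classes for one word, which is the per-word output format the rest of this section works with.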

Compile and train it. Motivate your choice of loss and metrics for this particular problem.

**Motivations**: (fill in here)

### 4.2 Testing new sentences
Now that we have a model, let's check how it performs on some sentences. Create a function that takes as input a string, which will be the sentence we want to perform NER on, and a Keras model for doing it. The output should be the predicted probabilities for each label, for each of the words in the provided sentence. More specifically, it should be a `numpy.ndarray` with shape (`num_words`, `num_classes`), where `num_words` is the number of words in the input sentence and `num_classes` is the number of classes we can predict.

In [None]:
def perform_ner(sentence, model):
    # Complete:
    
    # Preprocess the sentence into what your model expects
    
    # Predict and return the result

Test it on the following sentence:

In [None]:
sentence = "The constitution states that thou shalt not kill ."
prediction = perform_ner(sentence, simple_model)

To make it easier to visualize the predicted probability mass function (pmf) for each word, create a function that receives your prediction and plots the pmf for each word separately.

Requirements:
- All plots should have titles showing the word associated to that pmf.
- The labels for the x-axis should be the names of the tags, aligned vertically (i.e. `rotation='vertical'` when calling the `set_xticklabels` method).

In [None]:
def plot_prediction(sentence, prediction):
    # complete

In [None]:
plot_prediction(sentence, prediction)

Finally, write a function that takes your pmf predictions and converts them to hard predictions (i.e. returns a list with the most probable label for each word).

In [None]:
def decode_prediction(prediction):
    # complete

In [None]:
decode_prediction(prediction)

Now we perform the same steps again, but for another sentence:

In [None]:
sentence2 = "Barack Obama was the president of the United states ."
prediction2 = perform_ner(sentence2, simple_model)
plot_prediction(sentence2, prediction2)
decode_prediction(prediction2)

What can you observe from the output from these two sentences? 

**Your answer**: (fill in here)

How was the word "states" classified in each sentence? Why do you think this happened?

**Your answer**: (fill in here)

## 5. Deep LSTM RNN
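As we stated before, dependencies between distant words are difficult to capture with vanilla RNNs. The underlying cause is referred to as the vanishing/exploding gradient problem: gradients that are propagated back through many time steps tend to either shrink towards zero or grow without bound.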
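
The effect can be illustrated numerically with a deliberately simplified sketch. It assumes a purely linear recurrence (no activation function) and a recurrent matrix whose singular values are all equal to a chosen `scale`; the helper `gradient_norm_after` is hypothetical and only meant to show the trend.

```python
import numpy as np

def gradient_norm_after(steps, scale, hidden=8, seed=0):
    """Norm of a gradient vector after being propagated back through `steps` time steps."""
    rng = np.random.default_rng(seed)
    # Orthogonal matrix times `scale`: every singular value equals `scale`
    q, _ = np.linalg.qr(rng.standard_normal((hidden, hidden)))
    W = scale * q
    grad = np.ones(hidden)
    for _ in range(steps):
        # Backpropagation through one linear recurrent step
        grad = W.T @ grad
    return np.linalg.norm(grad)

for scale in (0.9, 1.0, 1.1):
    print(scale, gradient_norm_after(steps=50, scale=scale))
```

In this setting the gradient norm changes by a factor of `scale` at every step, so over 50 steps it shrinks by several orders of magnitude for `scale = 0.9` and grows by several orders of magnitude for `scale = 1.1`.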

Several different architectures for RNNs have been proposed to deal with the vanishing/exploding gradient problem. One of the most successful in recent years has been the Long Short-Term Memory network (LSTM). LSTMs are capable of learning long-term dependencies. They were introduced in 1997, but have been popularized through the general increase in interest in deep learning. The difference between an LSTM and a vanilla RNN is the structure of the repeating module. The core idea is that an LSTM has the ability to remove or add information to the cell state using structures called gates.
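
A single LSTM step can be sketched in NumPy as follows. This is a common textbook formulation, not necessarily how any particular library lays out its weights: the gate order (input, forget, candidate, output) inside the stacked matrices and the name `lstm_step` are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (input_dim, 4*hidden), U: (hidden, 4*hidden), b: (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[:hidden])                  # input gate: how much new information to add
    f = sigmoid(z[hidden:2 * hidden])        # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * hidden:3 * hidden])    # candidate values to add to the cell state
    o = sigmoid(z[3 * hidden:])              # output gate: how much of the cell state to expose
    c = f * c_prev + i * g                   # updated cell state
    h = o * np.tanh(c)                       # updated hidden state
    return h, c

hidden, input_dim = 3, 2
W = np.zeros((input_dim, 4 * hidden))
U = np.zeros((hidden, 4 * hidden))
b = np.zeros(4 * hidden)
b[:hidden] = -10.0            # input gate almost closed
b[hidden:2 * hidden] = 10.0   # forget gate almost fully open
h, c = lstm_step(np.ones(input_dim), np.zeros(hidden), np.ones(hidden), W, U, b)
print(c)  # stays close to the previous cell state of ones
```

With the forget gate open and the input gate closed, the cell state is carried over almost unchanged, which is the mechanism that lets information survive across many time steps.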

We can further improve our model by introducing Long Short-Term Memory cells and adding multiple layers. Design a new model using LSTMs. This time, you're free to add more layers and/or use other layer wrappers (like `Bidirectional`, for instance).

Compile and train your model. As before, motivate your choice of loss and metrics for this problem.

**Motivations**: (fill in here)

Test it on the same sentences as before.

Test on new sentences you find relevant.

## 6. Evaluate your best model

Which model performed best, the vanilla RNN or the one using LSTMs? How do you evaluate this?

**Your answer**: (fill in here)

Using the best model you obtained, evaluate its performance using the data from the test set.

In order to contextualize the metrics you computed on the test set, it's helpful to have a baseline to compare to. Can you come up with a simple way to obtain a baseline performance for this problem? Explain.

**Your answer**: (fill in here)

If you managed to solve the last question, implement it here.

Given your baseline performance, do you think your model performed well?

**Your answer**: (fill in here)