# A3: Word Embeddings and Language Modelling

Adam Ek

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Write all your answers and the code in the appropriate boxes below.

In this lab we will explore constructing *static* word embeddings (i.e. word2vec) and building language models. We'll also evaluate these systems on intermediate tasks, namely word similarity and identifying "good" and "bad" sentences.

* For this we'll use PyTorch. Some basic operations that will be useful can be found here: https://jhui.github.io/2018/02/09/PyTorch-Basic-operations
* In general, we are not interested in state-of-the-art performance :) Focus on the implementation and not on the results of your model. For this reason you can use a subset of the dataset, e.g. the first 5 000 to 10 000 sentences. On Linux/macOS: ```head -n 10000 inputfile > outputfile```.
* If possible, use the MLTGpu, it will make everything faster :)
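
If `head` is not available (for example on Windows), the same subset can be produced from Python. This is a minimal sketch; the function name `take_first_lines` and the file names are placeholders, not part of the lab material.

```python
from itertools import islice


def take_first_lines(src_path, dst_path, n):
    """Copy the first n lines of src_path to dst_path; return how many were written."""
    written = 0
    with open(src_path, encoding='UTF-8') as src, \
         open(dst_path, 'w', encoding='UTF-8') as dst:
        # islice stops after n lines without reading the rest of the file
        for line in islice(src, n):
            dst.write(line)
            written += 1
    return written
```

For example, `take_first_lines('inputfile', 'outputfile', 10000)` mirrors the `head` command above.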

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import math
from collections import namedtuple
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
from torch.nn import CrossEntropyLoss
import torch.optim as optim
import json


# for gpu, use "cuda:n" where n is the index of the GPU; falls back to "cpu" if CUDA is not available
device = torch.device('cuda:2' if torch.cuda.is_available() else 'cpu')
torch.cuda.is_available()

In [None]:
# CONSTANTS
data_path = 'data/'
UNKNOWN_TOKEN = '<UNK>'
START_TOKEN = '<START>'
END_TOKEN = '<END>'
PADDING_TOKEN = '<PAD>'

# Word2Vec embeddings

In this first part we'll construct a word2vec model which will give us *static* word embeddings (that is, they are fixed after training).

After we've trained our model we will evaluate the embeddings obtained on a word similarity task.

## Formatting data


First we need to load some data. You can download the file on Canvas under files/assignments/03-lab-data/wiki-corpus.50000.txt. The file contains 50 000 sentences randomly selected from the complete Wikipedia. Each line in the file contains one sentence. The sentences are whitespace tokenized.

Your first task is to create a dataset suitable for word2vec. That is, we define some ```window_size``` and then iterate over all sentences in the dataset, putting the center word in one field and the context words in another (separate the fields with ```tab```).

For example, the sentence "this is a lab" with ```window_size = 4``` (two words on each side of the center word) will be formatted as:
```
center, context
---------------------
this    is a
is      this a lab
a       this is lab
lab     is a
```

These will be our training examples when training the word2vec model.
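
The table above can be reproduced with a few lines of plain Python. This is only a sketch of the windowing step: the helper name `context_pairs` is made up for illustration, and it leaves out the frequency threshold and `<UNK>` handling.

```python
def context_pairs(tokens, window_size):
    """Return (center, context) pairs with window_size // 2 words on each side."""
    before = window_size // 2
    after = window_size - before
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - before):i]      # up to `before` words to the left
        right = tokens[i + 1:i + 1 + after]      # up to `after` words to the right
        pairs.append((center, left + right))
    return pairs


for center, context in context_pairs('this is a lab'.split(), 4):
    print(center, ' '.join(context), sep='\t')
```

Slicing takes care of the sentence boundaries: words near the start or end simply get a shorter context.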

[3 marks]

In [None]:
WINDOW_SIZE = 4
OCCURRENCE_THRESHOLD = 1


def corpus_reader(data_path, number_sentences):
    before = math.floor(WINDOW_SIZE / 2)
    after = WINDOW_SIZE - before

    samples = []
    with open(data_path, encoding='UTF-8') as f:
        lines = f.readlines()[:number_sentences]
        occurrences = {}
        for line in lines:
            tokenized = [word.lower() for word in line.rstrip('\n').split(' ')]
            for word in tokenized:
                occurrences[word] = occurrences.get(word, 0) + 1

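        # a set makes the membership test below O(1) instead of a scan over a list
        unknown_words = {word for word in occurrences if occurrences[word] <= OCCURRENCE_THRESHOLD}

        for line_index, line in enumerate(lines):
            # lower-case before the lookup: occurrences is keyed by lower-cased words
            tokenized = [UNKNOWN_TOKEN if word.lower() in unknown_words else word.lower() for word in line.rstrip('\n').split(' ')]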
            for index, word in enumerate(tokenized):
                if word != UNKNOWN_TOKEN:
                    context = []
                    for i in range(index - before, index):
                        if i >= 0:
                            context.append(tokenized[i])
                    for i in range(index + 1, index + after + 1):
                        if i < len(tokenized):
                            context.append(tokenized[i])
                    samples.append((word, context))
    return samples

In [None]:
#print(corpus_reader(data_path + 'wiki-corpus.50000.txt', 10000)[:1000])

We sampled 50 000 sentences completely at random from the *whole* of Wikipedia for our training data. Give some reasons why this is good, and why it might be bad. (*note*: We'll have a few questions like these, one or two reasons for and against is sufficient)

[2 marks]

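### Good:
- the sentences cover many different subjects
- this gives a big vocabulary
- because the text is informative, words are mostly used in their literal (not metaphorical) meaning and in grammatical sentences

### Bad:
- only one type of language is covered (informative, encyclopedic writing)
- we may lose the variety found in other kinds of text
- whether this matters depends on the goal the embeddings are used for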

### Loading the data

We now need to load the data in an appropriate format for torchtext (https://torchtext.readthedocs.io/en/latest/). We'll use PyText for this and it'll follow the same structure as I showed you in the lecture (remember to lower-case all tokens). Create a function which returns a (bucket)iterator of the training data, and the vocabulary object (```Field```). 

(*hint1*: you can format the data such that the center word always is first, then you only need to use one field)

(*hint2*: the code I showed you during the lecture is available in /files/pytorch_tutorial/ on Canvas)

[4 marks]

In [None]:
# defined once at module level; creating the namedtuple class inside
# __getitem__ would rebuild it for every sample
CBOWSample = namedtuple('CBOWSample', 'target_word context')


class CBOWDataset(Dataset):
    def __init__(self, data):
        # reserve indices for padding and unknown words up front, so that
        # get_encoded_word can always fall back to UNKNOWN_TOKEN
        self.vocab = {PADDING_TOKEN: 0, UNKNOWN_TOKEN: 1}
        for target_word, context in data:
            for word in (target_word, *context):
                if word not in self.vocab:
                    self.vocab[word] = len(self.vocab)

        self.data = data

    def __getitem__(self, idx):
        target_word, context = self.data[idx]

        encoded_context = [self.get_encoded_word(word) for word in context]
        # contexts at sentence boundaries are shorter than WINDOW_SIZE:
        # pad them with the dedicated padding index, not with a real word's index
        encoded_context.extend([self.vocab[PADDING_TOKEN]] * (WINDOW_SIZE - len(encoded_context)))

        return CBOWSample(self.get_encoded_word(target_word), encoded_context)

    def __len__(self):
        return len(self.data)

    def get_encoded_word(self, word):
        return self.vocab.get(word, self.vocab[UNKNOWN_TOKEN])

In [None]:
def get_data(path, batch_size):
    samples = corpus_reader(path, 10000)
    dataset = CBOWDataset(samples)
    dataloader = DataLoader(dataset,
                            batch_size=batch_size,
                            shuffle=True,
                            collate_fn=lambda x: x)

    return dataloader

We lower-cased all tokens above; give some reasons why this is a good idea, and why it may be harmful to our embeddings.

[2 marks]

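### Good
- words capitalised only because they start a sentence are merged with their lower-case form
- no duplicate entries for the same word, so the vocabulary is smaller and each word gets more training examples

### Bad
- proper nouns that have a common-noun counterpart (and vice versa) can no longer be told apart (e.g. Mark/mark)
- it can create problems in other languages where capitalisation carries information (e.g. German)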
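
Both effects can be seen on a tiny made-up token list (the sentence below is invented for illustration and is not from the corpus): lower-casing shrinks the vocabulary, but it also merges `Mark` and `mark` into one entry.

```python
from collections import Counter

# invented example tokens, whitespace tokenized
tokens = 'The mark was set by Mark . the Mark family left a mark'.split()

cased = Counter(tokens)
lowered = Counter(token.lower() for token in tokens)

# vocabulary size with and without lower-casing
print(len(cased), len(lowered))
# 'Mark' and 'mark' are separate entries before, one entry after
print(cased['Mark'], cased['mark'], lowered['mark'])
```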


## Word Embeddings Model

We will implement the CBOW model for constructing word embedding models.

In the CBOW model we try to predict the center word based on the context. That is, we take as input ```n``` context words, encode them as vectors, then combine them by summation. This will give us one embedding. We then use this embedding to predict *which* word in our vocabulary is the most likely center word.

Implement this model.
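
To make the data flow concrete, here is the same computation in plain Python on hand-picked toy numbers. It is a sketch, not the model to hand in: the name `cbow_scores` and all values are made up, and the bias term that `nn.Linear` adds is left out.

```python
def cbow_scores(context_ids, embeddings, output_weights):
    """Sum the context embeddings and score every vocabulary word.

    embeddings:      V rows of D floats (one vector per word)
    output_weights:  V rows of D floats (one scoring vector per word)
    """
    dims = len(embeddings[0])
    # combine the context by summation -> one vector of size D
    summed = [sum(embeddings[i][d] for i in context_ids) for d in range(dims)]
    # one unnormalised score per vocabulary word (dot product with the sum)
    return [sum(w * s for w, s in zip(row, summed)) for row in output_weights]


# toy numbers: vocabulary of 3 words, 2 dimensions
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output_weights = [[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]

scores = cbow_scores([0, 1], embeddings, output_weights)
# the predicted center word is the index with the highest score
predicted = max(range(len(scores)), key=scores.__getitem__)
```

In the PyTorch version the lookup is `nn.Embedding`, the summation is a `torch.sum` over the window dimension, and the scoring is a linear layer.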

[7 marks]

In [None]:
class CBOWModel(nn.Module):
    def __init__(self, vocab_size, dimensions):
        super(CBOWModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, dimensions)
        self.prediction = nn.Linear(dimensions, vocab_size)

    def forward(self, context):
        embedded_context = self.embeddings(context)
        projection = self.projection_function(embedded_context)
        predictions = self.prediction(projection)

        return predictions

    def projection_function(self, xs):
        """
        This function will take as input a tensor of size (B, S, D)
        where B is the batch_size, S the window size, and D the dimensionality of embeddings
        this function should compute the sum over the embedding dimensions of the input, 
        that is, we transform (B, S, D) to (B, 1, D) or (B, D) 
        """
        xs_sum = torch.sum(xs, dim=1)
        return xs_sum

Now we need to train the models. First we define which hyperparameters to use. (You can change these, for example when *developing* your model you can use a batch size of 2 and a very low dimensionality (say 10), just to speed things up). When actually training your model *for real*, you can use a batch size from [8, 16, 32, 64] and an embedding dimensionality from [128, 256].

In [None]:
# you can change these numbers to suit your needs :)
word_embeddings_hyperparameters = {'epochs': 3,
                                   'batch_size': 8,
                                   'embedding_size': 128,
                                   'learning_rate': 0.001}

In [None]:
cbow_dataloader = get_data(data_path + 'wiki-corpus.50000.txt', word_embeddings_hyperparameters['batch_size'])

In [None]:
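# NOTE: len(cbow_dataloader.dataset) is the number of training samples, not the vocabulary size.
# It works here because it is larger than the largest word index, but it makes the embedding
# and output layers far larger than needed. len(cbow_dataloader.dataset.vocab) is the tight
# value (the loss call in the training loop must then use the same size).
cbow_model = CBOWModel(len(cbow_dataloader.dataset), word_embeddings_hyperparameters['embedding_size'])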
cbow_model.to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(cbow_model.parameters(), lr=word_embeddings_hyperparameters['learning_rate'])

Train your model. Iterate over the dataset, get outputs from your model, calculate loss and backpropagate.

We mentioned in the lecture that we use Negative Log Likelihood (https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) loss to train Word2Vec model. In this lab we'll take a shortcut when *training* and use Cross Entropy Loss (https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), basically it combines ```log_softmax``` and ```NLLLoss```. So what your model should output is a *score* for each word in our vocabulary. The ```CrossEntropyLoss``` will then assign probabilities and calculate the negative log likelihood loss.
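
The relation between the two losses can be checked by hand. The sketch below does the computation for a single example in plain Python, with made-up scores for a three-word vocabulary; it illustrates the arithmetic and is not a replacement for the PyTorch classes.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    highest = max(scores)
    log_norm = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return [s - log_norm for s in scores]


def cross_entropy(scores, target):
    """Negative log-likelihood of the target index after log-softmax."""
    return -log_softmax(scores)[target]


scores = [2.0, 0.5, -1.0]   # made-up raw scores, one per vocabulary word
loss = cross_entropy(scores, target=0)
```

The loss is small when the target word has the highest score and grows as the target's score falls behind the others.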

[3 marks]

In [None]:
%%time

# start training loop
for epoch in range(word_embeddings_hyperparameters['epochs']):
    total_loss = 0
    print(f'batch count: {math.floor(len(cbow_dataloader.dataset) / cbow_dataloader.batch_size)}')
    for i, batch in enumerate(cbow_dataloader):
        torch.cuda.empty_cache()
        contexts = torch.tensor([sample.context for sample in batch], device=device)
        target_words = torch.tensor([sample.target_word for sample in batch], device=device)

        # send your batch of sentences to the model
        output = cbow_model(contexts)

        # compute the loss: CrossEntropyLoss expects scores of shape (N, V)
        # and targets of shape (N); you can read more about this in the
        # documentation for CrossEntropyLoss
        loss = loss_fn(output.view(-1, output.size(-1)), target_words.view(-1))
        total_loss += loss.item()

        # print average loss for the epoch
        print(f'epoch {epoch},', f'batch {i}:', np.round(total_loss / (i + 1), 4), end='\r')

        # compute gradients
        loss.backward()

        # update parameters
        optimizer.step()

        # reset gradients
        optimizer.zero_grad()
        
        del contexts
        del target_words
    print()


## Evaluating the model

In [None]:
# your code goes here

def read_wordsim(path, dataset, embeddings):
    count = 0
    dataset_sims = []
    model_sims = []
    with open(path) as f:
        for line in f:
            word1, word2, score = line.split()

            score = float(score)

            # get the index for the word
            count += 1
            print(count, end='\r')
            dataset_sims.append(score)
            # the vocabulary is lower-cased, so lower-case the words before the lookup
            word1_idx = dataset.get_encoded_word(word1.lower())
            word2_idx = dataset.get_encoded_word(word2.lower())

            # get the embedding of the word
            word1_emb = embeddings.weight[word1_idx]
            word2_emb = embeddings.weight[word2_idx]
            # compute cosine similarity, we'll use the version included in pytorch functional
            # https://pytorch.org/docs/master/generated/torch.nn.functional.cosine_similarity.html
            cosine_similarity = F.cosine_similarity(word1_emb, word2_emb, dim=0)

            model_sims.append(cosine_similarity.item())

    return dataset_sims, model_sims


path = 'wordsim_similarity_goldstandard.txt'
data, model = read_wordsim(data_path + path, cbow_dataloader.dataset, cbow_model.embeddings)
pearson_correlation = np.corrcoef(data, model)

# the non-diagonals give the pearson correlation,
print(pearson_correlation)

### Results
#### full wiki-50000
```
Epoch 0: 7.5331
Epoch 1: 7.1656
Epoch 2: 7.0064
CPU times: user 11h 28min 53s, sys: 6h 9min 52s, total: 17h 38min 45s
Wall time: 17h 46min 41s

[[1.         0.22844377]
 [0.22844377 1.        ]]
```
#### wiki-50000 (min-freq 1)
```
batch number: 155520.125
epoch 0, batch 155520: 7.2749
batch number: 155520.125
epoch 1, batch 155520: 6.8959
batch number: 155520.125
epoch 2, batch 155520: 6.7349
CPU times: user 11h 53min 38s, sys: 6h 1min 25s, total: 17h 55min 3s
Wall time: 17h 58min 4s

[[1.         0.15954327]
 [0.15954327 1.        ]]
```

#### wiki-50000 (min-freq=1, sentences=10000)
```
batch count: 33071
epoch 0, batch 33071: 7.6517
batch count: 33071
epoch 1, batch 33071: 6.7243
batch count: 33071
epoch 2, batch 33071: 6.4683
CPU times: user 42min 29s, sys: 15min 27s, total: 57min 57s
Wall time: 59min 12s

[[1.         0.20549521]
 [0.20549521 1.        ]]
```

In [None]:
import dill as pickle

with open('cbow.pickle', 'wb') as f:
    pickle.dump((cbow_model, cbow_dataloader), f)

In [None]:
import dill as pickle

with open('cbow.pickle', 'rb') as f:
    cbow_model, cbow_dataloader = pickle.load(f)

Do you think the model performs good or bad? Why?

[3 marks]

- Training takes a long time.
- Judging by the Pearson correlation (between about 0.16 and 0.23 in our runs), the model does not perform too well, but it still finds some similarities.
- 13 words of the gold standard do not appear in wiki-corpus.50000. They are all mapped to `<UNK>`, which makes the Pearson correlation less informative.
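
One way to see how much the out-of-vocabulary words affect the score would be to compute the correlation a second time on only those pairs where both words are in the vocabulary. The sketch below is an assumption about how this could be done, not something we ran: `pearson` and `in_vocab_only` are hypothetical helpers written in plain Python.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def in_vocab_only(pairs, gold, predicted, vocab):
    """Keep only the scores whose two words are both in the vocabulary."""
    kept = [(g, p) for (w1, w2), g, p in zip(pairs, gold, predicted)
            if w1 in vocab and w2 in vocab]
    return [g for g, _ in kept], [p for _, p in kept]
```

Calling `pearson(*in_vocab_only(pairs, gold, predicted, vocab))` next to `pearson(gold, predicted)` would show how much of the score comes from `<UNK>` lookups.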

Select the 10 best and 10 worst performing word pairs, can you see any patterns that explain why *these* are the best and worst word pairs?

[3 marks]

In [None]:
from pprint import pprint

similarities = []

with open(data_path + 'wordsim_similarity_goldstandard.txt') as f:
    lines = f.readlines()
    for index, line in enumerate(lines):
        word1, word2, _ = line.rstrip('\n').split()
        similarities.append(((word1, word2), model[index]))

similarities = sorted(similarities, key=lambda x: x[1], reverse=True)

print('Best 10:')
pprint(similarities[:10])
print()
print('Worst 10:')
pprint(similarities[-10:])

### Results:
```
Best 10:
[(('tiger', 'tiger'), 1.0),
 (('Arafat', 'Jackson'), 1.0),
 (('vodka', 'brandy'), 1.0),
 (('gem', 'jewel'), 1.0),
 (('tiger', 'jaguar'), 1.0),
 (('tiger', 'feline'), 1.0),
 (('tiger', 'carnivore'), 1.0),
 (('Japanese', 'American'), 1.0),
 (('Harvard', 'Yale'), 1.0),
 (('marathon', 'sprint'), 1.0)]

Worst 10:
[(('doctor', 'personnel'), -0.125071719288826),
 (('rock', 'jazz'), -0.12563274800777435),
 (('football', 'soccer'), -0.13329313695430756),
 (('cucumber', 'potato'), -0.14495040476322174),
 (('seven', 'series'), -0.14708052575588226),
 (('dollar', 'yen'), -0.1481156051158905),
 (('physics', 'chemistry'), -0.15008291602134705),
 (('cup', 'artifact'), -0.15044575929641724),
 (('cup', 'tableware'), -0.16372840106487274),
 (('professor', 'doctor'), -0.1771547794342041)]
```

The worst 10 pairs can be read in a consistent way (e.g. jazz/rock are 'opposite' genres and get a low score). The best 10 are harder to interpret: apart from tiger/tiger, which is trivially 1.0, every pair has a similarity of exactly 1.0 although the two words differ. This most likely means that both words were mapped to the same `<UNK>` vector, not that the model learned them to be identical. It would also explain apparent inconsistencies: 'vodka'/'brandy' and 'Japanese'/'American' score 1.0, while 'dollar'/'yen' score low, even though the relation is very similar.

On the other hand, it is striking that proper nouns appear a lot in the best pairs and not at all in the worst 10. The gold standard spells them with capitals, while our vocabulary is lower-cased, so in the run shown above they were never found in the vocabulary.
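
A similarity of exactly 1.0 between two different words is worth checking against the fallback rule in `get_encoded_word`. The sketch below uses an invented three-entry vocabulary and made-up vectors to show that any two out-of-vocabulary words end up with cosine similarity 1.0, whatever the model has learned.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# invented vocabulary and vectors, for illustration only
vocab = {'<UNK>': 0, 'tiger': 1, 'cat': 2}
vectors = [[0.3, -1.2, 0.7], [1.0, 0.2, -0.4], [0.9, 0.1, -0.5]]


def lookup(word):
    # same fallback rule as get_encoded_word: unknown words share one vector
    return vectors[vocab.get(word, vocab['<UNK>'])]


same_fallback = cosine(lookup('Arafat'), lookup('Jackson'))   # both out of vocabulary
real_pair = cosine(lookup('tiger'), lookup('cat'))            # both in vocabulary
```

Here `same_fallback` is 1.0 (up to floating-point rounding) because both lookups return the same vector, while `real_pair` is high but below 1.0.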



Suggest some ways of improving the model we apply to WordSim353.

[3 marks]

- use larger corpus
- more epochs to better memorize the dataset (if we only want to focus on WordSim353 and not generalize)
- change hyperparameters (higher learning rate, ...)
- refine model (add dropout, ...)

Consider a scenario where we use these embeddings in a downstream task, for example sentiment analysis (roughly: determining whether a sentence is positive or negative).

Give some examples why the sentiment analysis model would benefit from our embeddings and one example why our embeddings could hurt the performance of the sentiment model.

[3 marks]

In [None]:
sentiments = [
    ('good', 'bad'),
    ('ugly', 'beautiful'),
    ('beautiful', 'amazing'),
    ('bad', 'evil'),
    
    ('funny', 'weird'),
    ('funny', 'good'),
    
    # these synonyms are not appearing in the corpus and are assigned to the <UNK> token
    ('funny', 'curious'),
    ('funny', 'amusiing'),
    ('funny', 'hilarious'),
    ('funny', 'entertaining')
]

for pair in sentiments:
    word1, word2 = pair
    word1_idx = cbow_dataloader.dataset.get_encoded_word(word1)
    word2_idx = cbow_dataloader.dataset.get_encoded_word(word2)
    word1_emb = cbow_model.embeddings.weight[word1_idx]
    word2_emb = cbow_model.embeddings.weight[word2_idx]
    cosine_similarity = F.cosine_similarity(word1_emb, word2_emb, dim=0)
    
    print(f"{word1} ({word1_idx}), {word2} ({word2_idx}): {cosine_similarity}")

The cosine similarities for these words reflect the relations between them: opposites get lower scores than synonyms. Still, the scores are not that far apart from each other, so they may not be a good basis for drawing a conclusion.

Words like 'funny' can express a good sentiment **and**, used sarcastically, a bad one. We tried to compare synonyms for both sentiments, but found only 'bad' synonyms like 'weird' in the corpus and no 'good' ones. Hence 'funny' has a rather high cosine similarity with 'bad' words, and its 'good' meaning is not represented by the embedding.

A bigger corpus would produce fewer unknown words and might yield better embeddings for sentiment analysis.

# Language modeling

In this second part we'll build a simple LSTM language model. Your task is to construct a model which takes a sentence as input and predicts the next word for each word in the sentence. For this you'll use the ```LSTM``` class provided by PyTorch (https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html). You can read more about the LSTM here: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

NOTE!!!: Use the same dataset (wiki-corpus.50000.txt) as before.

Our setup is similar to before, we first encode the words as distributed representations then pass these to the LSTM and for each output we predict the next word.

For this we'll build a new dataloader with torchtext, the file we pass to the dataloader should contain one sentence per line, with words separated by whitespace.

```
word_1, ..., word_n
word_1, ..., word_k
...
```

In this dataloader you want to make sure that each sentence begins with a ```<start>``` token and ends with an ```<end>``` token; there is a keyword argument in ```Field``` for this :). Other than that, as before, you read the dataset and output an iterator over the dataset and a vocabulary.

Implement the dataloader, language model and the training loop (the training loop will basically be the same as for word2vec).
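
The input/target layout for next-word prediction can be sketched without PyTorch. The helper `shift_and_pad` and the toy vocabulary below are made up for illustration; in the real dataloader the padding would typically be done by `pad_sequence`.

```python
def shift_and_pad(sentences, vocab, pad='<PAD>', start='<START>', end='<END>'):
    """Build (inputs, targets) id lists, right-padded to the longest sentence."""
    # the input starts with <START>; the target is shifted by one and ends with <END>
    inputs = [[vocab[start]] + [vocab[w] for w in s] for s in sentences]
    targets = [[vocab[w] for w in s] + [vocab[end]] for s in sentences]
    width = max(len(row) for row in inputs)

    def pad_row(row):
        return row + [vocab[pad]] * (width - len(row))

    return [pad_row(r) for r in inputs], [pad_row(r) for r in targets]


# toy vocabulary, for illustration only
vocab = {'<PAD>': 0, '<START>': 1, '<END>': 2, 'this': 3, 'is': 4, 'a': 5, 'lab': 6}
inputs, targets = shift_and_pad([['this', 'is', 'a', 'lab'], ['a', 'lab']], vocab)
```

At every position `t`, `targets[i][t]` is the word that follows `inputs[i][t]`, so the model never receives `<END>` as input and never has to predict `<START>`.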

[12 marks]

In [None]:
# you can change these numbers to suit your needs as before :)
lm_hyperparameters = {'epochs': 3,
                      'batch_size': 8,
                      'learning_rate': 0.001,
                      'embedding_dim': 128,
                      'output_dim': 128}

In [None]:
OCCURRENCE_THRESHOLD = 1

class LMDataset(Dataset):
    def __init__(self, file, sentence_number):
        samples = []
        with open(file, encoding='UTF-8') as f:
            lines = f.readlines()[:sentence_number]
            occurrences = {}
            for line in lines:
                tokenized = [word.lower() for word in line.rstrip('\n').split(' ')]
                for word in tokenized:
                    occurrences[word] = occurrences.get(word, 0) + 1

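            # a set makes the membership test below O(1) instead of a scan over a list
            unknown_words = {word for word in occurrences if occurrences[word] <= OCCURRENCE_THRESHOLD}

            for line in lines:
                # lower-case before the lookup: occurrences is keyed by lower-cased words
                tokenized = [UNKNOWN_TOKEN if word.lower() in unknown_words else word.lower() for word in line.rstrip('\n').split(' ')]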
                samples.append(tokenized)


        vocab_list = [PADDING_TOKEN, *list(occurrences.keys()), UNKNOWN_TOKEN, START_TOKEN, END_TOKEN]
        self.vocab = {word: index for index, word in enumerate(vocab_list)}
        self.data = samples

    def __getitem__(self, idx):
        sentence = self.data[idx]
        encoded_sentence = [self.vocab[word] for word in sentence]
        Sample = namedtuple('Sample', 'train eval')

        train = [self.vocab[START_TOKEN], *encoded_sentence]
        evaluation = [*encoded_sentence, self.vocab[END_TOKEN]]
        return Sample(torch.tensor(train, device=device), torch.tensor(evaluation, device=device))

    def __len__(self):
        return len(self.data)

    def get_encoded_word(self, word):
        if word in self.vocab:
            return self.vocab[word]
        else:
            return self.vocab[UNKNOWN_TOKEN]
        
    def get_vocab_size(self):
        return len(self.vocab)

In [None]:
#dataset = LMDataset('data/wiki-corpus.50000.txt', 10000)
#dataset[1]

In [None]:
def get_lm_data(path, batch_size):
    dataset = LMDataset(path, 10000)
    dataloader = DataLoader(dataset,
                            batch_size=batch_size,
                            shuffle=True,
                            collate_fn=lambda x: x)

    return dataloader

In [None]:
class LM_withLSTM(nn.Module):
    def __init__(self, vocab_size, dimensions, out_dim, num_layers):
        super(LM_withLSTM, self).__init__()
        self.embeddings =  nn.Embedding(vocab_size, dimensions)
        self.LSTM = nn.LSTM(dimensions, out_dim, num_layers=num_layers)
        self.predict_word = nn.Linear(out_dim, vocab_size)

    def forward(self, seq):
        embedded_seq = self.embeddings(seq)
        # nn.LSTM returns (output, (h_n, c_n)); only the per-timestep output is needed
        timestep_representation, *_ = self.LSTM(embedded_seq)
        predicted_words = self.predict_word(timestep_representation)

        return predicted_words

In [None]:
# load data
lm_dataloader = get_lm_data(data_path + 'wiki-corpus.50000.txt', lm_hyperparameters['batch_size'])


# build model and construct loss/optimizer
lm_model = LM_withLSTM(lm_dataloader.dataset.get_vocab_size(),
                       lm_hyperparameters['embedding_dim'],
                       lm_hyperparameters['output_dim'],
                       1)
lm_model.to(device)

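# padded positions should not count towards the loss; pad_sequence pads with 0,
# which is the index of PADDING_TOKEN in the vocabulary
loss_fn = CrossEntropyLoss(ignore_index=lm_dataloader.dataset.vocab[PADDING_TOKEN])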
optimizer = optim.Adam(lm_model.parameters(), lr=lm_hyperparameters['learning_rate'])

In [None]:
%%time

# start training loop
for epoch in range(lm_hyperparameters['epochs']):
    total_loss = 0
    print(f'batch count: {math.floor(len(lm_dataloader.dataset) / lm_dataloader.batch_size)}')
    for i, batch in enumerate(lm_dataloader):
        torch.cuda.empty_cache()
        # each sample comes as two sequences (see LMDataset.__getitem__):
        #   train: <start>, w0, ..., wn
        #   eval:  w0, ..., wn, <end>
        # at each input token we predict the *NEXT* token, so <end> is never
        # given as input and <start> is never predicted.
        # pad_sequence pads to the longest sentence in the batch and
        # returns tensors of shape (S, B)
        input_sentence = pad_sequence([sample.train for sample in batch])
        gold_data = pad_sequence([sample.eval for sample in batch])
        
        input_sentence = input_sentence.to(device)
        gold_data = gold_data.to(device)
        # send your batch of sentences to the model
        output = lm_model(input_sentence)

        # on timestep t we evaluate on token t+1, which is exactly what
        # gold_data holds at position t.
        # CrossEntropyLoss expects scores of shape (N, V) and targets of
        # shape (N), so both tensors are flattened with .view(...)
        loss = loss_fn(output.view(-1, lm_dataloader.dataset.get_vocab_size()), gold_data.view(-1))
        total_loss += loss.item()

        # print average loss for the epoch
        print(f'epoch {epoch},', f'batch {i}:', np.round(total_loss / (i + 1), 4), end='\r')

        # compute gradients
        loss.backward()

        # update parameters
        optimizer.step()

        # reset gradients
        optimizer.zero_grad()
        del input_sentence
        del gold_data
    print()

### Evaluating the language model

We'll evaluate our model using the BLiMP dataset (https://github.com/alexwarstadt/blimp). The BLiMP dataset contains sets of linguistic minimal pairs for various syntactic and semantic phenomena. We'll evaluate our model on *existential quantifiers* (link: https://github.com/alexwarstadt/blimp/blob/master/data/existential_there_quantifiers_1.jsonl). This data, as the name suggests, investigates whether language models assign higher probability to *correct* usage of there-quantifiers.

An example entry in the dataset is: 

```
{"sentence_good": "There was a documentary about music irritating Allison.", "sentence_bad": "There was each documentary about music irritating Allison.", "field": "semantics", "linguistics_term": "quantifiers", "UID": "existential_there_quantifiers_1", "simple_LM_method": true, "one_prefix_method": false, "two_prefix_method": false, "lexically_identical": false, "pairID": "0"}
```

Download the dataset and build a datareader (similar to what you did for word2vec). The dataset structure you should aim for is (you don't need to worry about the other keys for this assignment):

```
good_sentence_1, bad_sentence_1
...
```

Your task now is to compare the probability assigned to the good sentence with the probability assigned to the bad sentence. To compute a probability for a sentence we consider the product of the probabilities assigned to the *gold* tokens. Remember, at timestep ```t``` we're predicting which token comes *next*, i.e. ```t+1``` (basically, you do the same thing as you did when training).

In rough pseudo code what your code should do is:

```
accuracy = []
for good_sentence, bad_sentence in dataset:
    gs_lm_output = LanguageModel(good_sentence)
    gs_token_probabilities = softmax(gs_lm_output)
    gs_sentence_probability = product(gs_token_probabilities[GOLD_TOKENS])

    bs_lm_output = LanguageModel(bad_sentence)
    bs_token_probabilities = softmax(bs_lm_output)
    bs_sentence_probability = product(bs_token_probabilities[GOLD_TOKENS])

    # int(True) = 1 and int(False) = 0
    is_correct = int(gs_sentence_probability > bs_sentence_probability)
    accuracy.append(is_correct)

print(numpy.mean(accuracy))
    
```
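
In practice the product is usually computed as a sum of log-probabilities, which ranks sentences the same way but does not underflow for long sentences. A small sketch with made-up token probabilities (the helper name `sum_log_probs` is hypothetical; `math.prod` needs Python 3.8 or later):

```python
import math


def sum_log_probs(token_probs):
    """Sum of log-probabilities, i.e. the log of the product of token_probs."""
    return sum(math.log(p) for p in token_probs)


good = [0.2, 0.1, 0.3]     # made-up per-token probabilities
bad = [0.2, 0.01, 0.3]

# comparing sums of logs gives the same ranking as comparing products
same_ranking = (sum_log_probs(good) > sum_log_probs(bad)) == (math.prod(good) > math.prod(bad))

# a long sentence underflows to 0.0 as a product, but not as a sum of logs
long_sentence = [1e-5] * 100
print(math.prod(long_sentence), sum_log_probs(long_sentence))
```

Note that a *sum* of raw probabilities is not a substitute for the product: it grows with sentence length instead of shrinking.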

[6 marks]

In [None]:
# your code goes here


def evaluate_model(path, dataloader, lm_model):
    accuracy = []
    dataset = dataloader.dataset
    # no gradients are needed for evaluation
    with open(path) as f, torch.no_grad():
        # iterate over one pair of sentences at a time
        for line in f:
            # load the data
            data = json.loads(line)

            # the data is tokenized as whitespace; encode the words as integers
            # using the vocab from the dataloader (lower-cased, as in training)
            enc_good_s = [dataset.get_encoded_word(x.lower()) for x in data['sentence_good'].split(' ')]
            enc_bad_s = [dataset.get_encoded_word(x.lower()) for x in data['sentence_bad'].split(' ')]

            # log-probability of each sentence under the model
            gs_sent_prob = sentence_log_prob(lm_model, dataset, enc_good_s)
            bs_sent_prob = sentence_log_prob(lm_model, dataset, enc_bad_s)

            accuracy.append(int(gs_sent_prob > bs_sent_prob))

    return accuracy


def sentence_log_prob(lm_model, dataset, encoded_sentence):
    # same layout as in training: the input starts with <start>, and the
    # gold tokens are the input shifted by one, ending in <end>
    inputs = torch.tensor([dataset.vocab[START_TOKEN], *encoded_sentence], device=device)
    gold = torch.tensor([*encoded_sentence, dataset.vocab[END_TOKEN]], device=device)

    # the LSTM is not batch_first, so the model expects (S, B):
    # unsqueeze(1) turns (S) into (S, 1), a batch holding ONE sentence
    output = lm_model(inputs.unsqueeze(1)).squeeze(1)  # (S, V)

    # log-probabilities over the vocabulary at each timestep
    log_probs = F.log_softmax(output, dim=1)

    # select the log-probability of each gold token; their sum is the log of
    # the product of the token probabilities, and it does not underflow
    return log_probs.gather(1, gold.unsqueeze(1)).sum().item()


path = 'existential_there_quantifiers_1.jsonl'
accuracy = evaluate_model(data_path + path, lm_dataloader, lm_model)

print('Final accuracy:')
print(np.round(np.mean(accuracy), 3))


### Results
#### full wiki-50000 (min-freq=1, number_sentences=10000)
```
batch count: 1250
epoch 0, batch 1249: 3.8109
batch count: 1250
epoch 1, batch 1249: 3.3558
batch count: 1250
epoch 2, batch 1249: 3.1512
CPU times: user 40.3 s, sys: 25.3 s, total: 1min 5s
Wall time: 1min 6s

Final accuracy:
0.677
```

In [None]:
import dill as pickle

with open('lm.pickle', 'wb') as f:
    pickle.dump((lm_model, lm_dataloader), f)

In [None]:
import dill as pickle

with open('lm.pickle', 'rb') as f:
    lm_model, lm_dataloader = pickle.load(f)

### Analysis

Our model gets some score, say, 55% correct predictions. Is this good? Suggest some *baseline* (i.e. a stupid "model" we hope ours is better than) we can compare the model against.

[3 marks]

Trained on this small dataset of only 10 000 sentences, our model still ranks the good sentence above the bad one in 67.7% of the pairs. This seems quite good in our view. Whether it is good enough depends on the task the model is used for: some critical tasks (e.g. in the medical area) require a very high accuracy, and this model would not be enough.

Random guessing would get 50% of the pairs right, since each pair contains exactly one good and one bad sentence. Any model that scores clearly above this baseline has learned at least part of the contrast between the good and the bad sentences.
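
A slightly stronger baseline than random guessing would be a unigram model, which scores a sentence from word frequencies alone and ignores word order. The sketch below is an assumption about how such a baseline could look, with add-one smoothing and an invented toy corpus; we did not run it on BLiMP.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Add-one smoothed unigram model; returns a log-probability function."""
    counts = Counter(token for sentence in sentences for token in sentence)
    total = sum(counts.values())
    vocab_size = len(counts) + 1          # +1 reserves mass for unseen words

    def log_prob(sentence):
        return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in sentence)

    return log_prob


def pair_accuracy(log_prob, pairs):
    """Fraction of (good, bad) pairs where the good sentence scores higher."""
    return sum(log_prob(g) > log_prob(b) for g, b in pairs) / len(pairs)


# invented toy data, for illustration only
corpus = [['there', 'was', 'a', 'dog'], ['a', 'dog', 'barked'], ['each', 'dog', 'barked']]
pairs = [(['there', 'was', 'a', 'dog'], ['there', 'was', 'each', 'dog'])]
baseline_accuracy = pair_accuracy(train_unigram(corpus), pairs)
```

If the LSTM does not beat this kind of baseline, its score may come from word frequencies and not from anything it learned about sentence structure.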

Suggest some improvements you could make to your language model.

[3 marks]

- use a larger dataset (the language model trains on whole sentences, so it sees far fewer training samples than the CBOW model, which gets one sample per word)
- train for more epochs (but since the dataset is small and we want the model to generalize, we should not use too many)
- add dropout (as suggested for the CBOW model)

Suggest some other metrics we can use to evaluate our system

[2 marks]

We could use precision and recall. These would give more insight for specific tasks (e.g. critical tasks in the medical area).
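
A further metric that is commonly reported for language models, in addition to the ones named above, is perplexity: the exponential of the average negative log-likelihood per token on held-out text. A minimal sketch, assuming the per-token log-probabilities (natural log) have already been collected; the function name is made up:

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

As a sanity check, a model that assigns probability 0.25 to every token has a perplexity of 4: it is as uncertain as a uniform choice among four words. Lower is better.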

# Literature


Neural architectures:

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(6):1137–1155, 2003. (Sections 3 and 4 are less relevant today and hence you can glance through them quickly. Instead, look at the Mikolov papers where they describe training word embeddings with the current neural network architectures.)

[2] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
    


## Statement of contribution

Briefly state how many times you have met for discussions, who was present, to what degree each member contributed to the discussion and the final answers you are submitting.

## Marks

This assignment has a total of 63 marks.
