# Lab 3: Word Embeddings and Language Modelling

Adam Ek

In this lab we'll explore constructing *static* word embeddings (i.e. word2vec) and building language models. We'll also evaluate these systems on intermediate tasks, namely word similarity and identifying "good" and "bad" sentences.

* For this we'll use pytorch. Some basic operations that will be useful can be found here: https://jhui.github.io/2018/02/09/PyTorch-Basic-operations
* In general: we are not interested in getting state-of-the-art performance :) focus on the implementation and not results of your model. For this reason, you can use a subset of the dataset: the first 5000-10 000 sentences or so, on linux/mac: ```head -n 10000 inputfile > outputfile```. 
* If possible, use the MLTGpu, it will make everything faster :)

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import os

# for gpu, replace "cpu" with "cuda:n" where n is the index of the GPU
#device = torch.device('cpu')
device = torch.device('cuda:0')

# Word2Vec embeddings

In this first part we'll construct a word2vec model which will give us *static* word embeddings (that is, they are fixed after training).

After we've trained our model we will evaluate the embeddings obtained on a word similarity task.

## Formatting data


First we need to load some data, you can download the file on canvas under files/03-lab-data/wiki-corpus.txt. The file contains 50 000 sentences randomly selected from the complete wikipedia. Each line in the file contains one sentence. The sentences are whitespace tokenized.

Your first task is to create a dataset suitable for word2vec. That is, we define some ```window_size``` then iterate over all sentences in the dataset, putting the center word in one field and the context words in another (separate the fields with ```tab```).

For example, the sentece "this is a lab" with ```window size = 4``` will be formatted as:
```
center, context
---------------------
this    is a
is      this a lab
a       this is lab
lab     is a
```

this will be our training examples when training the word2vec model.

[3 marks]

In [2]:
import os

# Imports
import pandas as pd
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset
import torch.optim as optim

import time
from collections import namedtuple
import collections
import toolz
from toolz import keyfilter, valfilter

In [3]:
# read data
def corpus_reader(data_path):
    with open(data_path) as f:
        data = f.read().splitlines()

    return data

In [4]:
# Implement custom torch Dataset
class ContextDataset(Dataset):
    
    def __init__(self,
                 raw_texts,
                 window_size=4,
                 tokenizer=None,
                 lower=True,
                 min_freq=0,
                 unk_label='<unk>',
                 pad_label='<pad>',
                 verbose=False):
        
        """
        raw_texts: list of texts
        """
        
        self.verbose=verbose
        
        self.window_size= 4
        self.min_freq = min_freq
        self.unk_label, self.unk_idx = unk_label, 0
        self.pad_label, self.pad_idx = pad_label, 1
        
        
        if tokenizer is None:
            self.tokenizer = lambda x: x.replace('.',' ').replace(',', ' ').replace('"', ' ').split() #simple punctuation and whitespace
        self.lower = lower
        
        # Load Data
        self.center, self.context = self.create_dataset(raw_texts)
        
        # Vocabulary
        self.word_to_idx = dict()
        self.word_to_idx[self.unk_label] = self.unk_idx
        self.word_to_idx[self.pad_label] = self.pad_idx
        self.word_to_idx.update({word:idx+max(self.word_to_idx.values())+1 for idx, word in enumerate(np.unique(self.center))})

        # Word Counts
        # center represents every instance of every word in the text
        self.word_counts = pd.value_counts(self.center).to_dict()
        
        # Apply min_freq - this was Adam's fix, we do not use it though because it was implemented after we had trained our model
        # on 50k sentences and we did not want to spend this much time and energy on retraining it for this assignment.
        if min_freq:
            to_remove = valfilter(lambda x: x <= min_freq, self.word_counts)
            self.word_to_idx = keyfilter(lambda x: x not in to_remove.keys(), self.word_to_idx)
            self.word_to_idx = {k:i for i, k in enumerate(self.word_to_idx.keys())}
            
        self.idx_to_word = {v:k for k,v in self.word_to_idx.items()}   
    
    def get_window_context(self, position_index, sentence_array, window_size):
        """
        returns: center_word, context_array

        sentence_array = ['this', 'is', 'a', 'lab', 'filler', 'filler']
        window_size = 4
        position_index = 1

        reutrns: is, [this, a, lab]
        """
        window_range = int(np.ceil(window_size/2)) # Assume window_size is even otherwise round up. (3 -> 4)
        context_indices = [int(i) for i in range(position_index-window_range, position_index+window_range+1) if (i>=0 and i!=position_index and i<len(sentence_array))] 
        context_array = [sentence_array[i] for i in context_indices]

        return sentence_array[position_index], context_array
    
    def create_dataset(self, raw_texts):
        
        t1 = time.time()
        c = 0
        contexts_dfs = []
        for sent_id, text in enumerate(raw_texts):
            
            if self.verbose:
                c+=1
                if c%1000 == 0:
                    t2 = time.time()
                    print(f'{c}/{len(raw_texts)}, dt = {round(t2-t1,3)}')
                    t1 = time.time()
                
            text = self.tokenize(text)
            
            windowed_contexts = pd.DataFrame([self.get_window_context(i, text, self.window_size) for i,_ in enumerate(text)], columns = ['center', 'context'])
            windowed_contexts['sentence_id'] = sent_id
            contexts_dfs.append(windowed_contexts) 
        
        dataset = pd.concat(contexts_dfs, axis=0, ignore_index=True)
        
        return np.array(dataset['center']), np.array(dataset['context'])
    
    def tokenize(self, string):
        if self.lower:
            string = string.lower()
        return np.array(self.tokenizer(string))

    def idx2word(self, idx_or_list):
        try:
            len(idx_or_list)
            return [self.idx_to_word.get(idx, self.unk_label) for idx in idx_or_list]
        except:
            return self.idx_to_word.get(idx_or_list, self.unk_label)
    
    def word2idx(self, word_or_list):
        
        if isinstance(word_or_list, str):
            return self.word_to_idx.get(word_or_list, self.unk_idx)
        else:
            return [self.word_to_idx.get(w, self.unk_idx) for w in word_or_list]
        
    def get_shuffled_data(self, seed=None): 
        np.random.seed(seed)
        
        rng = np.arange(len(self))
        np.random.shuffle(rng)
        return self.center[rng], self.context[rng]
        
    def __getitem__(self, idx):
        return self.center[idx], self.context[idx]
    
    def __len__(self):
        return len(self.center)
    


In [5]:
print('TEST data parser')
test_dataset = ContextDataset(['A bomb was thrown by an unknown party', 'coherence theory hypothesizes that a limited'], window_size=4)
print('----')
print('get_window_context test:', test_dataset.get_window_context(1, ['this', 'is', 'a', 'lab', 'filler', 'filler'], 4))
        

print('center:context')
for i,j in zip(test_dataset.center[0:5], test_dataset.context[0:5]):
    print(i,':', j)

print(test_dataset.word_to_idx)

TEST data parser
----
get_window_context test: ('is', ['this', 'a', 'lab'])
center:context
a : ['bomb', 'was']
bomb : ['a', 'was', 'thrown']
was : ['a', 'bomb', 'thrown', 'by']
thrown : ['bomb', 'was', 'by', 'an']
by : ['was', 'thrown', 'an', 'unknown']
{'<unk>': 0, '<pad>': 1, 'a': 2, 'an': 3, 'bomb': 4, 'by': 5, 'coherence': 6, 'hypothesizes': 7, 'limited': 8, 'party': 9, 'that': 10, 'theory': 11, 'thrown': 12, 'unknown': 13, 'was': 14}


In [6]:
# Actual Data

data_path = './wiki-corpus.50000.txt'
raw_texts = corpus_reader(data_path)
dataset = ContextDataset(raw_texts, window_size=4, verbose=True)


1000/50000, dt = 1.074
2000/50000, dt = 1.019
3000/50000, dt = 1.094
4000/50000, dt = 1.019
5000/50000, dt = 1.149
6000/50000, dt = 0.995
7000/50000, dt = 1.209
8000/50000, dt = 1.008
9000/50000, dt = 1.25
10000/50000, dt = 1.021
11000/50000, dt = 1.071
12000/50000, dt = 1.175
13000/50000, dt = 1.001
14000/50000, dt = 0.999
15000/50000, dt = 1.007
16000/50000, dt = 1.261
17000/50000, dt = 1.0
18000/50000, dt = 1.026
19000/50000, dt = 1.003
20000/50000, dt = 1.42
21000/50000, dt = 1.035
22000/50000, dt = 1.009
23000/50000, dt = 0.973
24000/50000, dt = 0.944
25000/50000, dt = 0.966
26000/50000, dt = 1.281
27000/50000, dt = 0.971
28000/50000, dt = 0.941
29000/50000, dt = 0.937
30000/50000, dt = 0.929
31000/50000, dt = 0.949
32000/50000, dt = 0.937
33000/50000, dt = 0.922
34000/50000, dt = 1.526
35000/50000, dt = 0.93
36000/50000, dt = 0.977
37000/50000, dt = 0.954
38000/50000, dt = 0.969
39000/50000, dt = 0.966
40000/50000, dt = 0.973
41000/50000, dt = 0.98
42000/50000, dt = 0.959
43000/5

In [7]:
for i,j in zip(dataset.center[0:10], dataset.context[0:10]):
    print(i,':', j)

anarchist : ['historian', 'george']
historian : ['anarchist', 'george', 'woodcock']
george : ['anarchist', 'historian', 'woodcock', 'reports']
woodcock : ['historian', 'george', 'reports', 'that']
reports : ['george', 'woodcock', 'that', 'the']
that : ['woodcock', 'reports', 'the', 'annual']
the : ['reports', 'that', 'annual', 'congress']
annual : ['that', 'the', 'congress', 'of']
congress : ['the', 'annual', 'of', 'the']
of : ['annual', 'congress', 'the', 'international']


We sampled 50 000 senteces completely random from the *whole* wikipedia for our training data. Give some reasons why this is good, and why it might be bad. (*note*: We'll have a few questions like these, one or two reasons for and against is sufficient)

[2 marks]

**GOOD**
- It is fair too assume wikipedia doesnt contain too much human bias (opinions, racist/discriminatory/abusive language).
- There is a chance for a big variety of topics, ergo, a big variety in vocabulary.  

**BAD**
- 50k sentences spread out over all of wikipedia might too small a sample, as the contexts can vary a lot.
- We may only get some contexts for certain words that have more than one meaning (e.g. if we do not get sentences about banks as financial institution but only about riverbanks, then the "meaning" of bank will be skewed). We do not get the full range of the use of a word. 
- The model may be missing a lot of the basic vocabulary if the sentences happen to describe highly abstract or complicated topics (so we will have the word for photosynthesis, but not for chair, for example).
- There is a small chance of ending up with sentences from only one semantic area which would also limit how general our model can be.
- The writing in Wikipedia reflects only one style of language really (this can be a + too depending on what the aim of the model is).
- We get a lot of numbers, symbols, and non-English words which is not great for English language modelling.

### Loading the data

We now need to load the data in an appropriate format for torchtext (https://torchtext.readthedocs.io/en/latest/). We'll use PyText for this and it'll follow the same structure as I showed you in the lecture (remember to lower-case all tokens). Create a function which returns a (bucket)iterator of the training data, and the vocabulary object (```Field```). 

(*hint1*: you can format the data such that the center word always is first, then you only need to use one field)

(*hint2*: the code I showed you during the leture is available in /files/pytorch_tutorial/ on canvas)

[4 marks]

In [8]:
# Custom loader

from collections import namedtuple
from toolz import take
from itertools import tee

def get_loader(dataset, batch_size, shuffle=True):
    
    if shuffle:
        centers, contexts = dataset.get_shuffled_data() 
    else:
        centers = dataset.center
        contexts = dataset.context
        
    centers = iter(centers)
    contexts = iter(contexts)
    Batch = namedtuple('Batch', ['contexts', 'centers'])
    for i in range(int(len(dataset)/batch_size)+1):
        batch_centers = take(batch_size, centers)
        if not batch_centers:
            continue
        batch_contexts, bc2 = tee(take(batch_size, contexts))
        
        #padding and word2idx - this part we think is refered to as Collate in torch.
        batch_centers = torch.tensor(dataset.word2idx(batch_centers))
        max_len = max([len(x) for x in bc2])
        batch_contexts = torch.tensor([dataset.word2idx(s)+[dataset.pad_idx]*(max_len-len(s)) for s in batch_contexts])

        yield Batch(batch_contexts, batch_centers)

In [9]:
print('TEST (CUSTOM) LOADER')
test_loader = get_loader(test_dataset, 5, shuffle=False)
for cont, cent in test_loader:
    print(np.shape(cont)) # Note: batch_first
    print(cent.tolist())
    print(test_dataset.idx2word(cent.tolist()))
    print([test_dataset.idx2word(c) for c in cont.tolist()])
    print()

TEST (CUSTOM) LOADER
torch.Size([5, 4])
[2, 4, 14, 12, 5]
['a', 'bomb', 'was', 'thrown', 'by']
[['bomb', 'was', '<pad>', '<pad>'], ['a', 'was', 'thrown', '<pad>'], ['a', 'bomb', 'thrown', 'by'], ['bomb', 'was', 'by', 'an'], ['was', 'thrown', 'an', 'unknown']]

torch.Size([5, 4])
[3, 13, 9, 6, 11]
['an', 'unknown', 'party', 'coherence', 'theory']
[['thrown', 'by', 'unknown', 'party'], ['by', 'an', 'party', '<pad>'], ['an', 'unknown', '<pad>', '<pad>'], ['theory', 'hypothesizes', '<pad>', '<pad>'], ['coherence', 'hypothesizes', 'that', '<pad>']]

torch.Size([4, 4])
[7, 10, 2, 8]
['hypothesizes', 'that', 'a', 'limited']
[['coherence', 'theory', 'that', 'a'], ['theory', 'hypothesizes', 'a', 'limited'], ['hypothesizes', 'that', 'limited', '<pad>'], ['that', 'a', '<pad>', '<pad>']]



We lower-cased all tokens above; give some reasons why this is a good idea, and why it may be harmful to our embeddings.

[2 marks]

**Good**

+ We will not get separate instances for the same word depending on whether or not it was the first word in the sentence - e.g. dog and Dog will not be separate embeddings, which is good, because it is the same word.
+ If we use a more advanced tokenizer, it may be able to keep proper nouns capitalized to avoid the phenomenon described in the "bad" section.

**Bad**

+ There may be cases where the uppercase and the lowercase of the same spelling are, in fact, different words: Patty is short for Patricia, but a patty is what you put in a burger; Billy is a name, but a billy goat is a male goat; Brown is a surname, but brown is a color; New York is a city, but a new york is someone's new Yorkshire terrier perhaps (funnily enough in Poland there is a big city the name of which literally translates to "boat" - prime example of a situation where context and uppercase matters). Or, even better, it and IT will be grouped together when they are definitely not the same thing!

## Word Embeddings Model

We will implement the CBOW model for constructing word embedding models.

In the CBOW model we try to predict the center word based on the context. That is, we take as input ```n``` context words, encode them as vectors, then combine them by summation. This will give us one embedding. We then use this embedding to predict *which* word in our vocabuary is the most likely center word. 

Implement this model 

[7 marks]

In [10]:
class CBOWModel(nn.Module):
    def __init__(self, dataset, vocab_size, embedding_dim, padding_idx=None):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        self.prediction = nn.Linear(embedding_dim, vocab_size) # we dont need context_size*embedding_dim because we will apply summation before.
        self.dataset = dataset
    
    def forward(self, context):
        # context (B, S)
        embedded_context = self.embeddings(context) #(B, S, D)
        projection = self.projection_function(embedded_context)
        predictions = self.prediction(projection)
        #return predictions
        output = F.softmax(predictions, dim=0) # why do we need this? what about log softmax?
        return output
        
    def projection_function(self, xs):
        #This is alternative to another linear layer. Basically linear layer with same coefficient.
        """
        This function will take as input a tensor of size (B, S, D)
        where B is the batch_size, S the window size, and D the dimensionality of embeddings
        this function should compute the sum over the embedding dimensions of the input, 
        that is, we transform (B, S, D) to (B, 1, D) or (B, D) 
        """
        xs_sum = torch.sum(xs, dim=1) 
        return xs_sum
    
    def get_embedding(self, word):
        idx_word = self.dataset.word2idx(word)
        word_embedding = self.embeddings(torch.tensor(idx_word))
        
        return word_embedding

In [11]:
# you can change these numbers to suit your needs :)
# what do these actually mean and how do changes in them affect our results? what are optimal/commonly used values? 
word_embeddings_hyperparameters = {'epochs':3,
                                   'batch_size':16,
                                   'embedding_size':128, #What is this?
                                   'learning_rate':0.001,
                                   'embedding_dim':128}

word_embeddings_hyperparameters = {'epochs':5,
                                   'batch_size':4096,
                                   'learning_rate':0.005,
                                   'embedding_dim':128}

Train your model. Iterate over the dataset, get outputs from your model, calculate loss and backpropagate.

We mentioned in the lecture that we use Negative Log Likelihood (https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) loss to train Word2Vec model. In this lab we'll take a shortcut when *training* and use Cross Entropy Loss (https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), basically it combines ```log_softmax``` and ```NLLLoss```. So what your model should output is a *score* for each word in our vocabulary. The ```CrossEntropyLoss``` will then assign probabilities and calculate the negative log likelihood loss.

[3 marks]

In [12]:
vocab = dataset.word_to_idx
len_vocab = len(vocab)
#dataset, vocab = get_data(...)

# build model and construct loss/optimizer
cbow_model = CBOWModel(dataset, len_vocab, word_embeddings_hyperparameters['embedding_dim'], padding_idx=dataset.pad_idx)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(cbow_model.parameters(), lr=word_embeddings_hyperparameters['learning_rate'])

# start training loop

for epoch in range(word_embeddings_hyperparameters['epochs']):
    loader =  get_loader(dataset, word_embeddings_hyperparameters['batch_size']) # in your script it was outside of the loop, it did not work here, is it because it's custom?
    total_loss = 0
    for i, batch in enumerate(loader):
        context = batch.contexts
        #target_word = batch.target_word
        centers = batch.centers
        #Note: this is not nessecary - the loss function understands.
        #targets = torch.zeros(len(centers), len_vocab)
        #for j, idx in enumerate(centers): #one hot encoding
         #   targets[j][idx] = 1
        
        
        # send your batch of sentences to the model
        output = cbow_model(context)
        
        loss = loss_fn(output, centers)
        total_loss += loss.item()
        print(f'Batch {i} : Average Loss = {round(total_loss/(i+1),5)}')#, end='\r')
        
        # calculate gradients
        loss.backward()
        # update model weights
        optimizer.step()
        # reset gradients
        optimizer.zero_grad()

    print(f'Epoch {epoch} : Average Loss = {round(total_loss/(i+1),5)}')#, end='\r')

Batch 0 : Average Loss = 11.22902
Batch 1 : Average Loss = 11.22902
Batch 2 : Average Loss = 11.22901
Batch 3 : Average Loss = 11.22898
Batch 4 : Average Loss = 11.22897
Batch 5 : Average Loss = 11.22894
Batch 6 : Average Loss = 11.22891
Batch 7 : Average Loss = 11.22888
Batch 8 : Average Loss = 11.22882
Batch 9 : Average Loss = 11.22874
Batch 10 : Average Loss = 11.22862
Batch 11 : Average Loss = 11.22846
Batch 12 : Average Loss = 11.22826
Batch 13 : Average Loss = 11.22805
Batch 14 : Average Loss = 11.22782
Batch 15 : Average Loss = 11.22753
Batch 16 : Average Loss = 11.22723
Batch 17 : Average Loss = 11.22686
Batch 18 : Average Loss = 11.22647
Batch 19 : Average Loss = 11.22604
Batch 20 : Average Loss = 11.22565
Batch 21 : Average Loss = 11.22527
Batch 22 : Average Loss = 11.22495
Batch 23 : Average Loss = 11.22461
Batch 24 : Average Loss = 11.22419
Batch 25 : Average Loss = 11.22381
Batch 26 : Average Loss = 11.22345
Batch 27 : Average Loss = 11.2231
Batch 28 : Average Loss = 11.22

## Evaluating the model

We will evaluate the model on a dataset of word similarities, WordSim353 (http://alfonseca.org/eng/research/wordsim353.html , also avalable in vanvas under files/03-l). The first thing we need to do is read the dataset and translate it to integers. What we'll do is to reuse the ```Field``` that records word indexes (the second output of ```get_data()```) and use it to parse the file.

The wordsim data is structured as follows:

```
word1 word2 score
...
```


The ```Field``` we got from ```read_data()``` has two built-in functions, ```stoi``` which maps a string to an integer and ```itos``` which maps an integer to a string. 

What our datareader needs to do is: 

```
for line in file:
    word1, word2, score = file.split()
    # encode word1 and word2 as integers
    word1_idx = vocab.vocab.stoi[word1]
    word2_idx = vocab.vocab.stoi[word2]
```

when we have the integers for ```word_1``` and ```word2``` we'll compute the similarity between their word embeddings with *cosine simlarity*. We can obtain the embeddings by querying the embedding layer of the model.

We calculate the cosine similarity for each word pair in the dataset, then compute the pearson correlation between the similarities we obtained with the scores given in the dataset. 

[4 marks]

In [13]:
# your code goes here
gs_path = 'wordsim_similarity_goldstandard.txt'
vocab = dataset
embeddings = cbow_model.embeddings(centers)

def read_wordsim(path, vocab, embeddings):
    dataset_sims = []
    model_sims = []
    with open(path) as f:
        for line in f:
            word1, word2, score = line.split()
        
            score = float(score)
            dataset_sims.append(score)
            
            # get the index for the word - not needed
            # word1_idx = vocab.word2idx(word1)     
            # word2_idx = vocab.word2idx(word2)

            # get the embedding of the word
            word1_emb = cbow_model.get_embedding(word1) # we do not need word indices
            word2_emb = cbow_model.get_embedding(word2) 
            
            cosine_similarity = F.cosine_similarity(word1_emb, word2_emb, dim=0)
            
            model_sims.append(cosine_similarity)
    
    return dataset_sims, model_sims

data, model = read_wordsim(gs_path, vocab, embeddings)

# checking if they are of the same length
print(len(data))
print(len(model))

pearson_correlation = np.corrcoef(np.array(data).astype(float), np.array(model).astype(float))
            
# the non-diagonals give the pearson correlation,
print(pearson_correlation)

203
203
[[1.         0.13155658]
 [0.13155658 1.        ]]


Do you think the model performs good or bad? Why?

[3 marks]

Pearson's coefficient spans between -1 and 1, with -1 being a strong negative association, 1 being strong positive association, and 0 being no association. The association here is around 0.13. According to [https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php] it is in the lower range of small positive association. This means that there is some level of similarity between the gold standard judgements and the word embedding similarities. Since the embedding models are supposed to reflect similarities between words, this is not terribly bad, but naturally the higher the performance, the better. The difference may be due to the kind of data we created our embeddings on or the low number of epochs, but also due to human judgements (which I assume the gold standard to be) taking into account features other than the embeddings do.  

What is worth pointing out is that when we ran the model on a smaller collection (5k sentences), we got correlation scores between 0.2 and 0.3, so higher. It is really interesting that the smaller model got better scores.

Select the 10 best and 10 worst performing word pairs, can you see any patterns that explain why *these* are the best and worst word pairs?

[3 marks]

In [14]:
lens = len(data)
sim_dict = {}
word_pairs = []
sorted_list = []

with open(gs_path) as f:
    for line in f:
        word1, word2, score = line.split()
        word_pairs.append((word1, word2))

for i in range(0, lens):
    sim_dict[word_pairs[i]] = model[i].tolist()
    
sorted_dict = dict(sorted(sim_dict.items(), key=lambda item: item[1]))
for k,v in sorted_dict.items():
    sorted_list.append((k,v))

bottom10 = sorted_list[:9]
top10 = sorted_list[-9:]

print(f'The top 10 best performing pairs of words are:')
for element in top10:
    words, score = element
    word1, word2 = words
    print(f'\t{word1}, {word2}: {score}')

print()
print(f'The bottom 10 worst performing pairs of words are:')
for element in bottom10:
    words, score = element
    word1, word2 = words
    print(f'\t{word1}, {word2}: {score}')
    

The top 10 best performing pairs of words are:
	plane, car: 0.19662246108055115
	furnace, stove: 0.20392639935016632
	lobster, wine: 0.20648574829101562
	street, block: 0.21518373489379883
	tiger, tiger: 1.0
	Arafat, Jackson: 1.0
	Japanese, American: 1.0
	Harvard, Yale: 1.0
	Mexico, Brazil: 1.0

The bottom 10 worst performing pairs of words are:
	wood, forest: -0.24051402509212494
	student, professor: -0.21034862101078033
	situation, conclusion: -0.20367661118507385
	king, queen: -0.19686846435070038
	precedent, collection: -0.16645506024360657
	atmosphere, landscape: -0.16415929794311523
	physics, chemistry: -0.16200031340122223
	problem, airport: -0.16017237305641174
	asylum, madhouse: -0.15234695374965668


As for the best performing pairs, there is quite some semantic similarity in the pairs, although they are not synonymous, except for the tiger-tiger pair. Japanese and American are both nationalities (or nation-related adjectives). Harvard and Yale are both solid American universities. Marathon and sprint are types of races. Mexico and Brazil are both countries. From the pairs that do not have a 1:1 similarity, street and block are both city planning elements, lobster and wine go well together during a fancy dinner, a furnace is like a big stove, and both a plane and a car are vehicles.

In the bottom pairs there are some surprising ones, like wood and forest, king and queen, physics and chemisty, asylum and madhouse. The rest seem less related indeed. I would guess that the contexts in which these words appeared in the training data made them distinctive enough not to be classified together. 

It is worth noting that re-running the model gives slightly different results. The discussion pertains to the first time we ran and evaluated it. This description pertains to the result we got with a 50k sample:  

The top 10 best performing pairs of words are:
+ plane, car: 0.19662246108055115
+ furnace, stove: 0.20392639935016632
+ lobster, wine: 0.20648574829101562
+ street, block: 0.21518373489379883
+ tiger, tiger: 1.0
+ Arafat, Jackson: 1.0
+ Japanese, American: 1.0
+ Harvard, Yale: 1.0
+ Mexico, Brazil: 1.0  

The bottom 10 worst performing pairs of words are:
+ wood, forest: -0.24051402509212494
+ student, professor: -0.21034862101078033
+ situation, conclusion: -0.20367661118507385
+ king, queen: -0.19686846435070038
+ precedent, collection: -0.16645506024360657
+ atmosphere, landscape: -0.16415929794311523
+ physics, chemistry: -0.16200031340122223
+ problem, airport: -0.16017237305641174
+ asylum, madhouse: -0.15234695374965668

Suggest some ways of improving the model we apply to WordSim353.

[3 marks]

+ More attention could be paid to the kind of sources we use for training data (more data, make sure that many semantic fields are covered by it).
+ Having the model go through more epochs could also help (but not too many to avoid overfitting!).
+ We did not implement dropout here, and that perhaps could also affect the performance.
+ We should curate the things in the dataset more, eliminate more symbols, non-English words, etc.

If we consider a scenario where we use these embeddings in a downstream task, for example sentiment analysis (roughly: determining whether a sentence is positive or negative). 

Give some examples why the sentiment analysis model would benefit from our embeddnings and one examples why our embeddings could hur the performance of the sentiment model.

[3 marks]

In [15]:
print(sorted_dict)



In [16]:
word_pairs = [('happy','sad'),('happy','glad'),('good','bad'),('good','great'),('useful','useless'),('useful','handy'),
              ('smart','stupid'),('smart','brilliant')]

for pair in word_pairs:
    word1, word2 = pair
    word1_emb = cbow_model.get_embedding(word1)  
    word2_emb = cbow_model.get_embedding(word2) 
            
    cosine_similarity = F.cosine_similarity(word1_emb, word2_emb, dim=0)

    print(f'{word1}, {word2}: {cosine_similarity}')

happy, sad: -0.1403135359287262
happy, glad: 0.11902550607919693
good, bad: -0.0582507848739624
good, great: -0.0650825947523117
useful, useless: -0.10068221390247345
useful, handy: -0.07618600130081177
smart, stupid: -0.0995870903134346
smart, brilliant: -0.22326377034187317


If we print the whole similarities dictionary, we can notice that there are pairs that are rated as very similar, but that would have different polarity when it comes to sentiment analysis (one is positive, one is negative), e.g. smart and stupid (1.0), profit and loss (0.05 - okay, this is not much but it's positive at least). I have also constructed some word pairs where the first (positive word) gets paired with a negative word, and later with a positive word close in meaning.  
+ happy vs. sad vs. glad: the scores are mediocre, with the good words pair having a higher score (the only one).
+ good vs. bad vs. great: the scores are low, and the pair with both positive words seems even more dissimilar.
+ useful vs. useless vs. handy: the scores are low, the pair of negative words has a slightly better score.
+ smart vs. stupid vs. brilliant: the scores are low, and the pair of both positive words has a lower score.  

It seems like the model is not good at recognizing positive or negative sentiment, so in this way it would be bad.

If we were trying to determine the semantic field of the comment or text, our model would be better, as in numerous cases it rates words from the same semantic field pretty high.

# Language modeling

In this second part we'll build a simple LSTM language model. Your task is to construct a model which takes a sentence as input and predict the next word for each word in the sentence. For this you'll use the ```LSTM``` class provided by PyTorch (https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html). You can read more about the LSTM here: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

NOTE!!!: Use the same dataset (wiki-cropus.txt) as before.

Our setup is similar to before, we first encode the words as distributed representations then pass these to the LSTM and for each output we predict the next word.

For this we'll build a new dataloader with torchtext, the file we pass to the dataloader should contain one sentence per line, with words separated by whitespace.

```
word_1, ..., word_n
word_1, ..., word_k
...
```

in this dataloader you want to make sure that each sentence begins with a ```<start>``` token and ends with a ```<end>``` token, there is a keyword argument in ```Field``` for this :). But other than that, as before you read the dataset and output a iterator over the dataset and a vocabulary. 

Implement the dataloader, language model and the training loop (the training loop will basically be the same as for word2vec).

[12 marks]

In [4]:
data_path = './wiki-corpus.10000.txt'
raw_texts = corpus_reader(data_path)

In [5]:
# each row in our raw_texts contain one sentence.
raw_texts[0:10]

['Anarchist historian George Woodcock reports that " The annual Congress of the International had not taken place in 1870 owing to the outbreak of the Paris Commune , and in 1871 the General Council called only a special conference in London .',
 'A bomb was thrown by an unknown party near the conclusion of the rally , killing an officer .',
 'In the ensuing panic , police opened fire on the crowd and each other .',
 'Josiah Warren is widely regarded as the first American anarchist , and the four-page weekly paper he edited during 1833 , The Peaceful Revolutionist , was the first anarchist periodical published .',
 'Weak central coherence theory hypothesizes that a limited ability to see the big picture underlies the central disturbance in autism .',
 "The word autism first took its modern sense in 1938 when Hans Asperger of the Vienna University Hospital adopted Bleuler 's terminology autistic psychopaths in a lecture in German about child psychology .",
 'Most land areas are in an al

In [6]:
# Implement custom torch Dataset
class LMDataset(Dataset):
    
    def __init__(self,
                 sentences,
                 tokenizer=None,
                 lower=True,
                 min_freq=0,  # it is not implemented for this Dataset
                 unk_label='<unk>',
                 pad_label='<pad>',
                 start_label='<start>',
                 end_label='<end>'):
        
        """
        raw_texts: list of texts
        """
        
        self.window_size= 4
        self.min_freq = min_freq
        self.unk_label, self.unk_idx = unk_label, 0
        self.pad_label, self.pad_idx = pad_label, 1
        self.start_label, self.start_idx = start_label, 2
        self.end_label, self.end_idx = end_label, 3
        
        if tokenizer is None:
            self.tokenizer = lambda x: x.replace('.',' ').replace(',', ' ').replace('"', ' ').split() #simple punctuation and whitespace
        self.lower = lower
        
        # Load Data
        # we first tokenize the sentences and then create a unique words vocab BEFORE adding start and end labels, since otherwise
        # we were getting the nasty error that took us so long to figure out how to fix (and what it was in general)
        # this way our vocab is all the unique words in the sentences excluding these special tokens.
        self.sentences = [self.tokenize(s) for s in sentences]
        self.vocab = self.unique_words()
        self.sentences = [[self.start_label]+s+[self.end_label] for s in self.sentences]
        
        # Vocabulary
        self.word_to_idx = dict()
        self.word_to_idx[self.unk_label] = self.unk_idx
        self.word_to_idx[self.pad_label] = self.pad_idx
        self.word_to_idx[self.start_label] = self.start_idx
        self.word_to_idx[self.end_label] = self.end_idx
        self.word_to_idx.update({word:idx+max(self.word_to_idx.values())+1 for idx, word in enumerate(self.vocab)})

        self.idx_to_word = {v:k for k,v in self.word_to_idx.items()}
        
    def unique_words(self):
        all_words = []
        for s in self.sentences:
            all_words += s
        return np.unique(all_words)
    
    def tokenize(self, string):
        if self.lower:
            string = string.lower()
        return self.tokenizer(string)

    def idx2word(self, idx_or_list):
        try:
            len(idx_or_list)
            return [self.idx_to_word.get(idx, self.unk_label) for idx in idx_or_list]
        except:
            return self.idx_to_word.get(idx_or_list, self.unk_label)
    
    def word2idx(self, word_or_list):
        
        if isinstance(word_or_list, str):
            return self.word_to_idx.get(word_or_list, self.unk_idx)
        else:
            return [self.word_to_idx.get(w, self.unk_idx) for w in word_or_list]
        
    def get_shuffled_data(self, seed=None): 
        np.random.seed(seed)
        
        rng = np.arange(len(self))
        np.random.shuffle(rng)
        return self.sentences[rng]
        
    def __getitem__(self, idx):
        return self.sentences[idx]
    
    def __len__(self):
        return len(self.sentences)
    


In [7]:
test_sentences = ['A bomb was thrown by an unknown party.', 'coherence theory, hypothesizes that a limited.', 'A bomb was thrown by an unknown party.', 'coherence theory, hypothesizes that a limited.']
test_dataset = LMDataset(test_sentences)
print(test_dataset.sentences)
print(test_dataset.word_to_idx)

[['<start>', 'a', 'bomb', 'was', 'thrown', 'by', 'an', 'unknown', 'party', '<end>'], ['<start>', 'coherence', 'theory', 'hypothesizes', 'that', 'a', 'limited', '<end>'], ['<start>', 'a', 'bomb', 'was', 'thrown', 'by', 'an', 'unknown', 'party', '<end>'], ['<start>', 'coherence', 'theory', 'hypothesizes', 'that', 'a', 'limited', '<end>']]
{'<unk>': 0, '<pad>': 1, '<start>': 2, '<end>': 3, 'a': 4, 'an': 5, 'bomb': 6, 'by': 7, 'coherence': 8, 'hypothesizes': 9, 'limited': 10, 'party': 11, 'that': 12, 'theory': 13, 'thrown': 14, 'unknown': 15, 'was': 16}


In [8]:
# comparing the vocabulary with and without added tokens
print(test_dataset.unique_words())
print(test_dataset.vocab)

['<end>' '<start>' 'a' 'an' 'bomb' 'by' 'coherence' 'hypothesizes'
 'limited' 'party' 'that' 'theory' 'thrown' 'unknown' 'was']
['a' 'an' 'bomb' 'by' 'coherence' 'hypothesizes' 'limited' 'party' 'that'
 'theory' 'thrown' 'unknown' 'was']


In [9]:
# Try using pytorch DataLoader
from torch.nn.utils.rnn import pad_sequence 
class Collate():
    def __init__(self, word_to_idx, pad_idx, batch_first=True):
        self.pad_idx = pad_idx
        self.batch_first = batch_first
        self.word_to_idx = word_to_idx
    def __call__(self, batch):
        batch = [torch.tensor([self.word_to_idx[w] for w in s], device=device) for s in batch]
        batch = pad_sequence(batch, batch_first=self.batch_first, padding_value=self.pad_idx)
        return batch

def get_loader(dataset, batch_size=16, shuffle=True):
    return DataLoader(dataset, shuffle=shuffle, batch_size=batch_size, collate_fn=Collate(dataset.word_to_idx, dataset.pad_idx))

In [10]:
test_loader = get_loader(test_dataset, batch_size=2, shuffle=False)
for i in test_loader:
    print(i)
    
# pad_idx = 1

tensor([[ 2,  4,  6, 16, 14,  7,  5, 15, 11,  3],
        [ 2,  8, 13,  9, 12,  4, 10,  3,  1,  1]], device='cuda:0')
tensor([[ 2,  4,  6, 16, 14,  7,  5, 15, 11,  3],
        [ 2,  8, 13,  9, 12,  4, 10,  3,  1,  1]], device='cuda:0')


In [24]:
# you can change these numbers to suit your needs as before :)
lm_hyperparameters = {'epochs':3,
                      'batch_size':8,
                      'learning_rate':0.0001,
                      'embedding_dim':128,
                      'output_dim':128}

In [12]:
class LM_withLSTM(nn.Module):
    def __init__(self, dataset, vocab_size, embedding_dim, output_dim, padding_idx=None):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        self.LSTM = nn.LSTM(input_size=embedding_dim, hidden_size=output_dim, num_layers=1)
        self.predict_word = nn.Linear(output_dim, vocab_size)
        self.dataset = dataset
    
    def forward(self, seq):
        embedded_seq = self.embeddings(seq)
        timestep_representation, *_ = self.LSTM(embedded_seq)
        predicted_words = self.predict_word(timestep_representation)
        
        return predicted_words
    
    # This was useful for evaluatin the previous model so we implemented it here too, even though it turns out not to have
    # been useful at all.
    def get_embedding(self, word):
        idx_word = self.dataset.word2idx(word)
        word_embedding = self.embeddings(torch.tensor(idx_word))
        
        return word_embedding

In [30]:
# load data
LM_data = LMDataset(raw_texts) 
vocab = LM_data.word_to_idx

# build model and construct loss/optimizer
lm_model = LM_withLSTM(LM_data,
                       len(vocab), 
                       lm_hyperparameters['embedding_dim'],
                       lm_hyperparameters['output_dim'])
lm_model.to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(lm_model.parameters(), lr=lm_hyperparameters['learning_rate'])

# start training loop
for epoch in range(lm_hyperparameters['epochs']):
    total_loss = 0
    dataset = get_loader(LM_data, batch_size=lm_hyperparameters["batch_size"], shuffle=True)
    for i, sentence in enumerate(dataset):
        
        # we remove the <end> token, as per instructions
        input_sentence = sentence[:, 0:-1]
        
        # send your batch of sentences to the model
        try:
            output = lm_model(input_sentence)
        except:
            print("Something went wrong with this batch")
            continue
        
        # we remove the <start> token, as per instructions
        gold_data = sentence[:, 1:]
        
        # the output and sentence variables are reshaped
        loss = loss_fn(output.view(lm_hyperparameters['batch_size'], len(vocab), -1), gold_data)
        total_loss += loss.item()
        
        # print average loss for every 100 batches, and later for the epoch too
        print_every=100
        if i%print_every==0:
            print(f'Batch {i} : Average Loss = {round(total_loss/(i+1),5)}')#, end='\r')
            
        # calculate gradients
        loss.backward()
        # update model weights
        optimizer.step()
        # reset gradients
        optimizer.zero_grad()

    print(f'Epoch {epoch} : Average Loss = {round(total_loss/(i+1),5)}')#, end='\r')

Batch 0 : Average Loss = 9.93662
Batch 100 : Average Loss = 8.18665
Batch 200 : Average Loss = 7.44131
Batch 300 : Average Loss = 7.15346
Batch 400 : Average Loss = 6.95966
Batch 500 : Average Loss = 6.86735
Batch 600 : Average Loss = 6.80722
Epoch 0 : Average Loss = 6.79817
Batch 0 : Average Loss = 5.95479
Batch 100 : Average Loss = 6.46978
Batch 200 : Average Loss = 6.4737
Batch 300 : Average Loss = 6.46158
Batch 400 : Average Loss = 6.45408
Batch 500 : Average Loss = 6.4607
Batch 600 : Average Loss = 6.4484
Epoch 1 : Average Loss = 6.45905
Batch 0 : Average Loss = 5.88237
Batch 100 : Average Loss = 6.34776
Batch 200 : Average Loss = 6.33271
Batch 300 : Average Loss = 6.42052
Batch 400 : Average Loss = 6.41153
Batch 500 : Average Loss = 6.42455
Batch 600 : Average Loss = 6.42659
Epoch 2 : Average Loss = 6.42386


### Evaluating the language model

We'll evaluate our model using the BLiMP dataset (https://github.com/alexwarstadt/blimp). The BLiMP dataset contains sets of linguistic minimal pairs for various syntactic and semantic phenomena, We'll evaluate our model on *existential quantifiers* (link: https://github.com/alexwarstadt/blimp/blob/master/data/existential_there_quantifiers_1.jsonl). This data, as the name suggests, investigate whether language models assign higher probability to *correct* usage of there-quantifiers. 

An example entry in the dataset is: 

```
{"sentence_good": "There was a documentary about music irritating Allison.", "sentence_bad": "There was each documentary about music irritating Allison.", "field": "semantics", "linguistics_term": "quantifiers", "UID": "existential_there_quantifiers_1", "simple_LM_method": true, "one_prefix_method": false, "two_prefix_method": false, "lexically_identical": false, "pairID": "0"}
```

Download the dataset and build a datareader (similar to what you did for word2vec). The dataset structure you should aim for is (you don't need to worry about the other keys for this assignment):

```
good_sentence_1, bad_sentence_1
...
```

your task now is to compare the probability assigned to the good sentence with to the probability assigned to the bad sentence. To compute a probability for a sentence we consider the product of the probabilities assigned to the *gold* tokens, remember, at timestep ```t``` we're predicting which token comes *next* e.g. ```t+1``` (basically, you do the same thing as you did when training).

In rough pseudo code what your code should do is:

```
accuracy = []
for good_sentence, bad_sentence in dataset:
    gs_lm_output = LanguageModel(good_sentence)
    gs_token_probabilities = softmax(gs_lm_output)
    gs_sentence_probability = product(gs_token_probabilities[GOLD_TOKENS])

    bs_lm_output = LanguageModel(bad_sentence)
    bs_token_probabilities = softmax(bs_lm_output)
    bs_sentence_probability = product(bs_token_probabilities[GOLD_TOKENS])

    # int(True) = 1 and int(False) = 0
    is_correct = int(gs_sentence_probability > bs_sentence_probability)
    accuracy.append(is_correct)

print(numpy.mean(accuracy))
    
```

[6 marks]

In [15]:
# your code goes here
import json

def evaluate_model(path, vocab, model):
    
    accuracy = []
    with open(path) as f:
        # iterate over one pair of sentences at a time
        for line in f:
            # load the data
            data = json.loads(line)
            good_s = data['sentence_good']
            bad_s = data['sentence_bad']
            
            # the data is tokenized as whitespace
            tok_good_s = tokenize(good_s)
            tok_bad_s = tokenize(bad_s)
            
            # encode your words as integers using the vocab from the dataloader, size is (S)
            # we use unsqueeze to create the batch dimension 
            # in this case our input is only ONE batch, so the size of the tensor becomes: 
            # (S) -> (1, S) as the model expects batches
            enc_good_s = torch.tensor([vocab.word2idx(x) for x in tok_good_s], device=device).unsqueeze(0)
            enc_bad_s = torch.tensor([vocab.word2idx(x) for x in tok_bad_s], device=device).unsqueeze(0)
            
            # pass your encoded sentences to the model and predict the next tokens
            # we use torch.no_grad() so we do not alter the model by running it on these sentences
            with torch.no_grad():
                good_s = model(enc_good_s)
                bad_s = model(enc_bad_s)
            
            # get probabilities with softmax
            gs_probs = F.softmax(good_s, dim=2)
            bs_probs = F.softmax(bad_s, dim=2)
            
            # select the probability of the gold tokens
            gs_sent_prob = find_token_probs(gs_probs, enc_good_s)
            bs_sent_prob = find_token_probs(bs_probs, enc_bad_s)
            
            accuracy.append(int(gs_sent_prob>bs_sent_prob))
            
    return accuracy

def tokenize(string):
    # we need to add the start token so that we can get the probability for the first word of the sentence, but we do not need 
    # any end token.
    new_string = ["<start>"] + string.replace('.',' ').replace(',', ' ').replace('"', ' ').lower().split()
    return new_string

def find_token_probs(model_probs, encoded_sentence):
    probs = []

    # iterate over the tokens in your encoded sentence
    # we iterate over the sentence WITHOUT the start token so that we get the token number (the row of the matrix) and the NEXT
    # word relative to the model_probs (i.e. the 0th row in model_probs includes the probabilities of the words that follow
    # <start>, so within that we want to look up the entry for the next word in the sentence, and not for <start>. This is perhaps
    # hard to explain but it is analogous to what we did when training the model with the start and end tokens and offsetting the
    # predictions and the gold standard).
    for token_nr, gold_token in enumerate(encoded_sentence[0, 1:]):

        # select the probability of the gold tokens and save
        # hint: pytorch indexing is helpful here ;)
        prob = model_probs[0, token_nr, gold_token]
        probs.append(float(prob))
        
    sentence_prob = np.cumprod(probs)[-1]
    
    return sentence_prob




In [31]:
path = 'existential_there_quantifiers_1.jsonl'
accuracy = evaluate_model(path, LM_data, lm_model)

print('Final accuracy:')
print(np.round(np.mean(accuracy), 3))

Final accuracy:
0.662


### Analysis

Our model get some score, say, 55% correct predictions. Is this good? Suggest some *baseline* (i.e. a stupid "model" we hope ours is better than) we can compare the model against.

[3 marks]

We agreed that the stupidest model we can get is one that gets 50% accuracy, as we have a 50% chance of choosing the "right" answer just by choosing randomly. Anything above that will be an improvement relative to that, so any model that has more than 50% accuracy will be "good".  

For a really good model we thought we could compare its performance against some state-of-the-art models (contemporary and past ones). However that performance will be hard to match, and we are not sure where to draw a realistic benchmark for a decent model, that is not just barely better than random guessing.

When we ran our model on 5000 sentences, we had a 78.8% accuracy, which is way more than we expected. We have tried to re-run the model with the 50000 sentences dataset, but that was causing the notebook to freeze when done locally on a CPU, and caused GPU memory issues on the server. We tested some other sets (10k, 15k) and depending on the size of the embeddings they sometimes also caused those issues. Running them with smaller embeddings or fewer epochs yielded really unsatisfactory results, which is why we are submitting this one where we re-ran the model on 5k and got 66.2% accuracy. It seems like a significant part of how good the model is here is up to us being lucky with the random embeddings at the start. Please do not hesistate to let us know if you want us to try to re-run it with any other data set.

Suggest some improvements you could make to your language model.

[3 marks]

We could try to pay more attention to or implement the following:  
+ dropout - as mentioned for the other model, we did not implement that, which means that our models may be prone to overfitting, learning its traning data rather than learning something from it. This sounds like a very reasonable thing to use; we did not, however, use it, since we prioritized getting the models to work and it was not a required element of the assignment. It may be worth seeing what difference it makes for a VG project, perhaps.
+ rerunning with different hyperparameters - as we have heard in the lectures, models tend to be tested for what the best hyperparameters for them are: perhaps our model will learn better with a different batch size, embedding size, or learning rate; perhaps we should have it go through a few more epochs? It is a lot of variables and they all can improve (or hurt) the performance of the model.
+ min_freq - we did not, in the end, use a minimum frequency (although we acknowledge that Adam did fix our version of it in the CBOW model). Eliminating very rare words could perhaps help our model.
+ curated data - once again, we are not really sure what our model is learning from. As per one of the readings for the following week, it is important to be aware of that! Perhaps better training data would yield us a better model.

Suggest some other metrics we can use to evaluate our system

[2 marks]

We are unsure if we understood properly what is meant by this question.
For just the type of evaluation that we just implemented:
+ We could normalize the probability over the length of the sentence and establish some baseline for "grammatical" vs. "ungrammatical" sentences and then calculate precision, recall, F1 score.

If it is about evaluating the whole system, not just this evaluation:
+ We could use an evaluation set (split the 50k sentences into training and testing data).
+ We could use other pairs of sentences from the same github repository and test how the model performs with other types of "incorrectness".
+ We could take sentences and mask words in them, and then compare the model predictions to the standard (inspired by MLM).    

# Literature


Neural architectures:
* Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. (Links to an external site.) Journal of Machine Learning Research, 3(6):1137–1155, 2003. (Sections 3 and 4 are less relevant today and hence you can glance through them quickly. Instead, look at the Mikolov papers where they describe training word embeddings with the current neural network architectures.)
* T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
* T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
    


Total marks: 63