# Lab 4: Recurrent models

This lab gives you practice with embeddings (word vectors, in this case) and recurrent neural models in NLP. The first part focuses on embeddings as input to a recurrent model, and the second part focuses on embeddings derived from recurrent models, applying them to the task of word sense disambiguation following the approach from the original paper by Peters et al.

Machines differ, and the neural computations required for this lab are more demanding than those in the other assignments in this course. For this reason, we advise using Google Colab, which guarantees a minimum level of performance. GPU acceleration is also recommended; on Colab, it can be turned on via <code>Runtime>Change Runtime type>GPU</code>.

## Part 1 (45 points)

In the first part of Lab 4, we will train a recurrent model for part-of-speech tagging. As an easy exercise, you will observe what happens when you plug pretrained word embeddings into a neural NLP model and experiment with different sizes of training data.

If you use Google Colab (which we recommend), it may be easiest to place this notebook and <code>lstm_tutorial.py</code> in the <code>/Colab Notebooks</code> directory of your Google Drive. Run the code in the cell just below to give Colab access to the files on Google Drive. This will open a pop-up window where you can allow Colab to access your Google Drive.

In [1]:
#RUN THIS CELL IF USING COLAB TO USE GOOGLE DRIVE FOR STORING lstm_tutorial.py AND/OR DATA FILES
from google.colab import drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google.colab'

In [None]:
# Set the global variable for your environment to the directory where lstm_tutorial.py is located
# This is important for portability of the notebook and grading.   
WORKING_DIR = '/content/drive/My Drive/Colab Notebooks' #Feel free to change this
%cd $WORKING_DIR

The neural network solutions in this lab rely on the AllenNLP library, version 0.9.0 (other code in this lab may work incorrectly with more recent versions); <code>overrides</code> version 3.1.0 is required for compatibility. Linguistic resources come from NLTK version 3.6.2 and might work incorrectly with other versions. Install these before proceeding; the installation process may vary depending on your system. On Colab, this can be done with the following command:

In [None]:
# IF USING COLAB, INSTALL allennlp AND nltk AS FOLLOWS
!pip install -U overrides==3.1.0 nltk==3.6.2 allennlp==0.9.0 
# This might require restart of the runtime (Runtime>restart runtime)
# After restart no need to run this cell again

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting overrides==3.1.0
  Downloading overrides-3.1.0.tar.gz (11 kB)
Collecting nltk==3.6.2
  Downloading nltk-3.6.2-py3-none-any.whl (1.5 MB)
Collecting allennlp==0.9.0
  Downloading allennlp-0.9.0-py3-none-any.whl (7.6 MB)
Collecting word2number>=1.1
  Downloading word2number-1.1.zip (9.7 kB)
Collecting tensorboardX>=1.2
  Downloading tensorboardX-2.5.1-py2.py3-none-any.whl (125 kB)
Collecting flaky
  Downloading flaky-3.7.0-py2.py3-none-any.whl (22 kB)
Collecting conllu==1.3.1
  Downloading conllu-1.3.1-py2.py3-none-any.whl (9.3 kB)
Collecting responses>=0.7
  Downloading responses-0.21.0-py3-none-any.whl (45 kB)
Collecting pyt

**Before you start**,  import required modules:

In [9]:
import random
import nltk
import allennlp

In [2]:
print(f"NLTK version: {nltk.__version__}")
print(f"AllenNLP version: {allennlp.__version__}")

NLTK version: 3.6.2
AllenNLP version: 0.9.0


If you run this for the first time, you may need to download various data using NLTK:

In [3]:
nltk.download('brown')
nltk.download('semcor')
nltk.download('wordnet')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\frans\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package semcor to
[nltk_data]     C:\Users\frans\AppData\Roaming\nltk_data...
[nltk_data]   Package semcor is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\frans\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Exercise 1: prepare the data (5 points)

Linguistic data come in a variety of formats. You already had a chance to play with POS-annotated corpus data in Lab 1.

In the first exercise, you will access POS-annotated data in one format (NLTK) and save it to disk in a text format. Start with the tagged sentences from the Brown corpus, which can be retrieved as below:

In [4]:
nltk.corpus.brown.tagged_sents()

[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlant

Now randomize the order of all sentences in the corpus using the <code>random.shuffle()</code> function with seed `42` (to make the code's behaviour reproducible), and select the first 50K sentences for training and the next 5K for validation.

In [6]:
# #YOUR CODE HERE
# It is important to keep intended values in the following vars
# DON'T CHANGE VARIABLE NAMES
random.seed(42)

data = list(nltk.corpus.brown.tagged_sents())
random.shuffle(data)
print(data[0])

training_brown = data[0:50000]
validation_brown = data[50000:55000] 
testing_brown = "FILL"

print("Length of training set: ", len(training_brown), "Length of validation set: ", len(validation_brown))


[('He', 'PPS'), ('let', 'VBD'), ('her', 'PPO'), ('tell', 'VB'), ('him', 'PPO'), ('all', 'ABN'), ('about', 'IN'), ('the', 'AT'), ('church', 'NN'), ('.', '.')]
Length of training set:  50000 Length of validation set:  5000


Define a function for saving your datasets to a text file in the following format:
* one sentence per line
* tokens separated by spaces
* POS tag separated from the token by "###", for example <code>said###VBD</code>.
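To make the format concrete, here is a minimal sketch that formats one tagged sentence as a line and parses it back. The helper names `format_sentence` and `parse_sentence` are hypothetical and used only for illustration; they are not part of `lstm_tutorial.py`.

```python
def format_sentence(tagged_sentence, sep="###"):
    """Join (token, tag) pairs into one line: token###TAG token###TAG ..."""
    return " ".join(f"{token}{sep}{tag}" for token, tag in tagged_sentence)


def parse_sentence(line, sep="###"):
    """Invert format_sentence: split on whitespace, then on the last separator."""
    pairs = []
    for item in line.split():
        token, tag = item.rsplit(sep, 1)
        pairs.append((token, tag))
    return pairs


example = [("The", "AT"), ("jury", "NN"), ("said", "VBD"), (".", ".")]
line = format_sentence(example)
print(line)                             # The###AT jury###NN said###VBD .###.
print(parse_sentence(line) == example)  # True
```

Splitting on the last separator keeps the round trip intact even if a token itself happened to contain `###`.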

In [7]:
def write_posdata(sentences, outfile):
    # One sentence per line; tokens separated by spaces;
    # each token is written as word###TAG
    lines = [" ".join(word + "###" + tag for word, tag in sent) for sent in sentences]
    # The context manager closes the file even if writing fails
    with open(outfile, "w", encoding="utf-8") as data_file:
        data_file.write("\n".join(lines))



Now save your data partitions in different sizes. We will start with small data samples since training on a large dataset may be very slow depending on the machine. We won't use the full 50K sentence training set in this lab since this might take too long.

In [8]:
write_posdata(training_brown,"train_brown.txt")
write_posdata(validation_brown,"validation_brown.txt")
write_posdata(training_brown[:50],"train_brown_50.txt")
write_posdata(validation_brown[:50],"validation_brown_50.txt")
write_posdata(training_brown[:500],"train_brown_500.txt")
write_posdata(validation_brown[:500],"validation_brown_500.txt")
write_posdata(training_brown[:5000],"train_brown_5000.txt")
write_posdata(validation_brown[:5000], "validation_brown_5000.txt")

Congratulations, you have now saved the POS tagged data for model training purposes!

## Exercise 2: train neural POS tagger models (15 points)

We will now play with a neural model. You have installed <code>allennlp</code>, which contains all the necessary components for this; the training code for an LSTM model, which follows an old AllenNLP tutorial, is contained in <code>lstm_tutorial.py</code>. Place the latter in the same directory as this notebook. Let us start by loading the model code and data, beginning with a tiny sample for demonstration purposes.

In [9]:
from lstm_tutorial import *

train_dataset_tiny = reader.read("train_brown_50.txt")
validation_dataset_tiny = reader.read("validation_brown_50.txt")

50it [00:00, 3333.31it/s]
50it [00:00, 5378.00it/s]


First of all we need to initialize the vocabulary and define an embedding (vector) for each token. We set the embedding size at 300, common in realistic applications. By default, the embeddings are initialized randomly and updated during training (this can be changed but we start with a standard configuration). We also need to specify the <code>HIDDEN_DIM</code> parameter: the dimensionality of the hidden vector representations in the LSTM cell.
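To get a feel for what these two sizes imply, the following sketch counts trainable parameters, assuming a single-layer, unidirectional `torch.nn.LSTM` (which keeps two bias vectors per gate). The function names and the 1000-token vocabulary are hypothetical and only serve the illustration.

```python
def lstm_parameter_count(input_dim, hidden_dim):
    """Parameters of a single-layer, unidirectional torch.nn.LSTM.

    Each of the four gates has an input weight matrix, a recurrent weight
    matrix, and two bias vectors (PyTorch stores bias_ih and bias_hh separately).
    """
    per_gate = input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim
    return 4 * per_gate


def embedding_parameter_count(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim


EMBEDDING_DIM = 300
HIDDEN_DIM = 20

print(lstm_parameter_count(EMBEDDING_DIM, HIDDEN_DIM))   # 25760
print(embedding_parameter_count(1000, EMBEDDING_DIM))    # 300000 (hypothetical 1000-token vocabulary)
```

Even for a small vocabulary, the embedding table holds far more parameters than the LSTM itself, which is one reason the initialization of the embeddings can matter when training data is scarce.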

Download the smallest pretrained word vector model from https://nlp.stanford.edu/projects/glove/, unzip it, and extract the relevant file <code>'glove.6B.300d.txt'</code> into your working directory. The file is about 1GB; if using Google Drive with Colab, make sure you have sufficient space. Downloading and uploading the file might take a few minutes. You can <b>either</b> upload the file from your personal machine <b>or</b> run the code below directly on Colab:

In [None]:
#THIS CELL IS OPTIONAL, TO BE USED ON COLAB. YOU CAN USE wget AS BELOW OR ALTERNATIVELY UPLOAD GloVe EMBEDDINGS TO GOOGLE DRIVE FROM YOUR MACHINE
# download the file
!wget http://nlp.stanford.edu/data/glove.6B.zip
# unzip the file
!unzip -d . 'glove.6B.zip'
# remove useless contents
!rm 'glove.6B.200d.txt' 'glove.6B.100d.txt' 'glove.6B.50d.txt'

In [10]:
vocab_tiny = Vocabulary.from_instances(train_dataset_tiny + validation_dataset_tiny)

EMBEDDING_DIM = 300 # dimensionality of each embedding vector
HIDDEN_DIM = 20 # dimensionality of the LSTM hidden state

# Create an embedding object by specifying the number of
# embeddings (one per vocabulary token) and their dimensionality.
# These embeddings are initialised randomly.
token_embedding_tiny = Embedding(num_embeddings=vocab_tiny.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)

100%|██████████| 100/100 [00:00<00:00, 14921.57it/s]


Initialize token embeddings with values from pretrained GloVe model:


In [11]:
# Read out glove file and fill in 
# token embeddings for each
glove_token_embedding_tiny = Embedding.from_params(vocab=vocab_tiny,
                            params=Params({'pretrained_file':'glove.6B.300d.txt',
                                           'embedding_dim' : EMBEDDING_DIM}))

400000it [00:01, 222883.43it/s]
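Conceptually, initializing from a pretrained file means reading one vector per line and copying it into the embedding table for every vocabulary word that appears in the file. The sketch below illustrates this idea in plain Python on a toy 3-dimensional "file"; it is not AllenNLP's implementation, and `load_pretrained_vectors` is a hypothetical helper. It also assumes that lines with the wrong number of values (for example from a truncated upload) are simply skipped.

```python
import io
import random


def load_pretrained_vectors(lines, vocabulary, embedding_dim, seed=0):
    """Build one vector per vocabulary word from GloVe-style text lines.

    Each line is expected to look like: word v1 v2 ... v_embedding_dim.
    Words missing from the file keep a small random vector.
    """
    rng = random.Random(seed)
    # Start from random vectors so that every vocabulary word has an embedding
    table = {word: [rng.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
             for word in vocabulary}
    found = set()
    for line in lines:
        fields = line.rstrip().split(" ")
        word, values = fields[0], fields[1:]
        if len(values) != embedding_dim:
            continue  # skip malformed or truncated lines
        if word in table:
            table[word] = [float(v) for v in values]
            found.add(word)
    return table, found


# Toy file; the real glove.6B.300d.txt has 300 values per line
toy_file = io.StringIO("the 0.1 0.2 0.3\njury 0.4 0.5 0.6\nbroken 0.7 0.8\n")
vectors, found = load_pretrained_vectors(toy_file, ["the", "jury", "presentments"], 3)
print(sorted(found))     # ['jury', 'the']
print(vectors["jury"])   # [0.4, 0.5, 0.6]
```

Here `presentments` is not in the toy file, so it keeps its random vector, just as rare corpus words without a pretrained vector would.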


Now from embedding a single word with <code>token_embedding_tiny</code> we can proceed to mapping a word sequence into a sequence of vectors:

In [12]:
# Convert the randomly initialised embedder
# to a sequence of vectors
word_embeddings_tiny = BasicTextFieldEmbedder({"tokens": token_embedding_tiny}) # Use random embedding
glove_word_embeddings_tiny = BasicTextFieldEmbedder({"tokens": glove_token_embedding_tiny}) # Use glove embedding

The following initializes the parameters of an LSTM model that uses the <code>word_embeddings_tiny</code> input encoding:

In [13]:
# Use the random (non-glove based) embedding to train a model
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
model_tiny = LstmTagger(word_embeddings_tiny, lstm, vocab_tiny)

Now define an LSTM model called <code>glove_model_tiny</code> that uses <code>glove_token_embedding_tiny</code>:

In [14]:
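# Use the pretrained GloVe embeddings to train a model.
# A separate LSTM is created so that no weights are shared with model_tiny.
glove_lstm_tiny = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
glove_model_tiny = LstmTagger(glove_word_embeddings_tiny, glove_lstm_tiny, vocab_tiny)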


Train the **basic model** for the tiny dataset. **<font color="red">Do not clear the output of this cell in the submitted version.</font>**



In [15]:
basic_trainer_tiny=initialize_trainer(model_tiny, vocab_tiny, train_dataset_tiny, validation_dataset_tiny, batch_size=50)
basic_trainer_tiny.train()


accuracy: 0.0042, loss: 4.4558 ||: 100%|██████████| 1/1 [00:00<00:00,  6.14it/s]
accuracy: 0.0029, loss: 4.4593 ||: 100%|██████████| 1/1 [00:00<00:00, 40.35it/s]
accuracy: 0.0042, loss: 4.4498 ||: 100%|██████████| 1/1 [00:00<00:00, 19.21it/s]
accuracy: 0.0029, loss: 4.4541 ||: 100%|██████████| 1/1 [00:00<00:00, 35.89it/s]
accuracy: 0.0042, loss: 4.4438 ||: 100%|██████████| 1/1 [00:00<00:00, 18.66it/s]
accuracy: 0.0049, loss: 4.4489 ||: 100%|██████████| 1/1 [00:00<00:00, 29.70it/s]
accuracy: 0.0052, loss: 4.4379 ||: 100%|██████████| 1/1 [00:00<00:00, 20.83it/s]
accuracy: 0.0236, loss: 4.4438 ||: 100%|██████████| 1/1 [00:00<00:00, 49.00it/s]
accuracy: 0.0220, loss: 4.4319 ||: 100%|██████████| 1/1 [00:00<00:00, 16.02it/s]
accuracy: 0.0619, loss: 4.4386 ||: 100%|██████████| 1/1 [00:00<00:00, 42.11it/s]
accuracy: 0.0639, loss: 4.4259 ||: 100%|██████████| 1/1 [00:00<00:00, 18.59it/s]
accuracy: 0.1013, loss: 4.4335 ||: 100%|██████████| 1/1 [00:00<00:00, 37.61it/s]
accuracy: 0.1122, loss: 4.42

{'best_epoch': 999,
 'peak_cpu_memory_MB': 0,
 'peak_gpu_0_memory_MB': 0,
 'training_duration': '0:03:34.803111',
 'training_start_epoch': 0,
 'training_epochs': 999,
 'epoch': 999,
 'training_accuracy': 0.5041928721174004,
 'training_loss': 2.2626748085021973,
 'training_cpu_memory_MB': 0.0,
 'training_gpu_0_memory_MB': 0,
 'validation_accuracy': 0.4631268436578171,
 'validation_loss': 2.7383158206939697,
 'best_validation_accuracy': 0.4631268436578171,
 'best_validation_loss': 2.7383158206939697}

You have trained an LSTM POS tagger for the basic model. Now train the <code>glove_model_tiny</code>. **<font color="red">Do not clear the output of this cell in the submitted version.</font>**

In [16]:
#YOUR CODE HERE
basic_trainer_glove_tiny=initialize_trainer(glove_model_tiny, vocab_tiny, train_dataset_tiny, validation_dataset_tiny, batch_size=50)
basic_trainer_glove_tiny.train()



accuracy: 0.0010, loss: 4.4993 ||: 100%|██████████| 1/1 [00:00<00:00, 20.82it/s]
accuracy: 0.0029, loss: 4.4762 ||: 100%|██████████| 1/1 [00:00<00:00, 48.04it/s]
accuracy: 0.0021, loss: 4.4725 ||: 100%|██████████| 1/1 [00:00<00:00, 18.90it/s]
accuracy: 0.0039, loss: 4.4526 ||: 100%|██████████| 1/1 [00:00<00:00, 57.66it/s]
accuracy: 0.0021, loss: 4.4463 ||: 100%|██████████| 1/1 [00:00<00:00, 21.28it/s]
accuracy: 0.0069, loss: 4.4296 ||: 100%|██████████| 1/1 [00:00<00:00, 48.71it/s]
accuracy: 0.0052, loss: 4.4207 ||: 100%|██████████| 1/1 [00:00<00:00, 24.40it/s]
accuracy: 0.0098, loss: 4.4070 ||: 100%|██████████| 1/1 [00:00<00:00, 50.94it/s]
accuracy: 0.0126, loss: 4.3954 ||: 100%|██████████| 1/1 [00:00<00:00, 23.30it/s]
accuracy: 0.0216, loss: 4.3847 ||: 100%|██████████| 1/1 [00:00<00:00, 45.35it/s]
accuracy: 0.0210, loss: 4.3704 ||: 100%|██████████| 1/1 [00:00<00:00, 21.93it/s]
accuracy: 0.0472, loss: 4.3627 ||: 100%|██████████| 1/1 [00:00<00:00, 47.97it/s]
accuracy: 0.0409, loss: 4.34

{'best_epoch': 999,
 'peak_cpu_memory_MB': 0,
 'peak_gpu_0_memory_MB': 0,
 'training_duration': '0:01:48.141779',
 'training_start_epoch': 0,
 'training_epochs': 999,
 'epoch': 999,
 'training_accuracy': 0.7348008385744235,
 'training_loss': 1.3145437240600586,
 'training_cpu_memory_MB': 0.0,
 'training_gpu_0_memory_MB': 0,
 'validation_accuracy': 0.5447394296951819,
 'validation_loss': 2.320211887359619,
 'best_validation_accuracy': 0.5447394296951819,
 'best_validation_loss': 2.320211887359619}

## Exercise 3: Explore training parameters (25 points)

Create separate models on bigger datasets: 500 training and 500 validation sentences, and 5000 training and 5000 validation sentences. Using the full training set (50K sentences) is optional (your machine might be too slow). Initialize and train the **basic model** on the 500-sentence training and 500-sentence validation data. **<font color="red">Do not clear the output of this cell in the submitted version.</font>**  
Each cell takes at most 10-15 minutes on Colab's GPU; only the final two cells are likely to take that long.

In [17]:
# Shared code between Glove and Basic 
train_dataset_500 = reader.read("train_brown_500.txt") # TRAIN SET
validation_dataset_500 = reader.read("validation_brown_500.txt") # VAL SET
vocab_500 = Vocabulary.from_instances(train_dataset_500 + validation_dataset_500) # FULL VOCAB

train_dataset_5000 = reader.read("train_brown_5000.txt") # TRAIN SET
validation_dataset_5000 = reader.read("validation_brown_5000.txt") # VAL SET
vocab_5000 = Vocabulary.from_instances(train_dataset_5000 + validation_dataset_5000) # FULL VOCAB

lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True)) # LSTM instance

500it [00:00, 22960.35it/s]
500it [00:00, 2474.01it/s]
100%|██████████| 1000/1000 [00:00<00:00, 76945.59it/s]
5000it [00:00, 18049.94it/s]
5000it [00:00, 16047.41it/s]
100%|██████████| 10000/10000 [00:00<00:00, 76339.19it/s]


In [None]:
#train the basic model on 500 sentences
#YOUR CODE HERE

basic_token_embedding_500 = Embedding(num_embeddings=vocab_500.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)

# Create sequence of vectors based on the embedding
basic_word_embeddings_500 = BasicTextFieldEmbedder({"tokens": basic_token_embedding_500}) # Use random embedding

# Create and train the actual model
basic_model_500 = LstmTagger(basic_word_embeddings_500, lstm, vocab_500)
trainer_basic = initialize_trainer(basic_model_500, vocab_500, train_dataset_500, validation_dataset_500, batch_size=50)
trainer_basic.train()


accuracy: 0.0198, loss: 5.1112 ||: 100%|██████████| 10/10 [00:00<00:00, 51.58it/s]
accuracy: 0.0510, loss: 5.0805 ||: 100%|██████████| 10/10 [00:00<00:00, 63.39it/s]
accuracy: 0.0549, loss: 5.0544 ||: 100%|██████████| 10/10 [00:00<00:00, 109.82it/s]
accuracy: 0.0506, loss: 5.0234 ||: 100%|██████████| 10/10 [00:00<00:00, 167.91it/s]
accuracy: 0.0541, loss: 4.9972 ||: 100%|██████████| 10/10 [00:00<00:00, 109.74it/s]
accuracy: 0.0544, loss: 4.9653 ||: 100%|██████████| 10/10 [00:00<00:00, 173.41it/s]
accuracy: 0.0576, loss: 4.9384 ||: 100%|██████████| 10/10 [00:00<00:00, 100.37it/s]
accuracy: 0.0659, loss: 4.9049 ||: 100%|██████████| 10/10 [00:00<00:00, 176.54it/s]
accuracy: 0.0998, loss: 4.8769 ||: 100%|██████████| 10/10 [00:00<00:00, 104.08it/s]
accuracy: 0.1624, loss: 4.8410 ||: 100%|██████████| 10/10 [00:00<00:00, 178.92it/s]
accuracy: 0.1369, loss: 4.8115 ||: 100%|██████████| 10/10 [00:00<00:00, 108.70it/s]
accuracy: 0.1376, loss: 4.7727 ||: 100%|██████████| 10/10 [00:00<00:00, 186.86

{'best_epoch': 999,
 'best_validation_accuracy': 0.7272297808012094,
 'best_validation_loss': 1.3427927136421203,
 'epoch': 999,
 'peak_cpu_memory_MB': 4203.48,
 'peak_gpu_0_memory_MB': 1332,
 'training_accuracy': 0.9134796861031091,
 'training_cpu_memory_MB': 4203.48,
 'training_duration': '0:04:39.531892',
 'training_epochs': 999,
 'training_gpu_0_memory_MB': 1332,
 'training_loss': 0.4832109808921814,
 'training_start_epoch': 0,
 'validation_accuracy': 0.7272297808012094,
 'validation_loss': 1.3427927136421203}

Now do the same training (500-sentence training and 500-sentence validation sets) with GloVe embeddings. **<font color="red">Do not clear the output of this cell in the submitted version.</font>**

In [None]:
glove_token_embedding_500 = Embedding.from_params(vocab=vocab_500,
                                  params=Params({'pretrained_file':'glove.6B.300d.txt', 'embedding_dim' : EMBEDDING_DIM}))

# Create sequence of vectors based on the embedding
glove_word_embeddings_500 = BasicTextFieldEmbedder({"tokens": glove_token_embedding_500}) # Use glove embedding

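# Create and train the actual model with its own LSTM,
# so no weights are shared with the previously trained basic model
glove_lstm_500 = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
glove_model_500 = LstmTagger(glove_word_embeddings_500, glove_lstm_500, vocab_500)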
trainer_glove = initialize_trainer(glove_model_500, vocab_500, train_dataset_500, validation_dataset_500, batch_size=50)
trainer_glove.train()


19403it [00:00, 87465.10it/s]
accuracy: 0.0220, loss: 5.0580 ||: 100%|██████████| 10/10 [00:00<00:00, 101.03it/s]
accuracy: 0.0929, loss: 4.9036 ||: 100%|██████████| 10/10 [00:00<00:00, 167.40it/s]
accuracy: 0.1275, loss: 4.7727 ||: 100%|██████████| 10/10 [00:00<00:00, 91.05it/s]
accuracy: 0.1576, loss: 4.6308 ||: 100%|██████████| 10/10 [00:00<00:00, 140.60it/s]
accuracy: 0.1601, loss: 4.5040 ||: 100%|██████████| 10/10 [00:00<00:00, 100.04it/s]
accuracy: 0.1794, loss: 4.3716 ||: 100%|██████████| 10/10 [00:00<00:00, 154.44it/s]
accuracy: 0.1735, loss: 4.2607 ||: 100%|██████████| 10/10 [00:00<00:00, 102.59it/s]
accuracy: 0.1931, loss: 4.1530 ||: 100%|██████████| 10/10 [00:00<00:00, 149.00it/s]
accuracy: 0.1870, loss: 4.0641 ||: 100%|██████████| 10/10 [00:00<00:00, 110.06it/s]
accuracy: 0.2359, loss: 3.9867 ||: 100%|██████████| 10/10 [00:00<00:00, 175.44it/s]
accuracy: 0.2503, loss: 3.9157 ||: 100%|██████████| 10/10 [00:00<00:00, 105.96it/s]
accuracy: 0.2967, loss: 3.8630 ||: 100%|███████

{'best_epoch': 949,
 'best_validation_accuracy': 0.7508503401360545,
 'best_validation_loss': 1.245854139328003,
 'epoch': 958,
 'peak_cpu_memory_MB': 4215.464,
 'peak_gpu_0_memory_MB': 1352,
 'training_accuracy': 0.8871560544352836,
 'training_cpu_memory_MB': 4215.464,
 'training_duration': '0:04:33.569502',
 'training_epochs': 958,
 'training_gpu_0_memory_MB': 1352,
 'training_loss': 0.44819373190402984,
 'training_start_epoch': 0,
 'validation_accuracy': 0.7517006802721088,
 'validation_loss': 1.2459042489528656}

Use a bigger training set now with 5K sentence training and 5K sentence validation sets and random initial embeddings. **<font color="red">Do not clear the output of this cell in the submitted version.</font>**

In [None]:
#train the basic model on 5000 sentences
#YOUR CODE HERE

basic_token_embedding_5000 = Embedding(num_embeddings=vocab_5000.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)

# Create sequence of vectors based on the embedding
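basic_word_embeddings_5000 = BasicTextFieldEmbedder({"tokens": basic_token_embedding_5000}) # Use random embedding

# Create and train the actual model with its own LSTM,
# so no weights are shared with the previously trained models
basic_lstm_5000 = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
basic_model_5000 = LstmTagger(basic_word_embeddings_5000, basic_lstm_5000, vocab_5000)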
trainer_basic = initialize_trainer(basic_model_5000, vocab_5000, train_dataset_5000, validation_dataset_5000, batch_size=50)
trainer_basic.train()


accuracy: 0.0733, loss: 5.4954 ||: 100%|██████████| 100/100 [00:01<00:00, 93.66it/s]
accuracy: 0.1329, loss: 5.1081 ||: 100%|██████████| 100/100 [00:00<00:00, 133.49it/s]
accuracy: 0.1307, loss: 4.5852 ||: 100%|██████████| 100/100 [00:01<00:00, 81.24it/s]
accuracy: 0.1329, loss: 4.1627 ||: 100%|██████████| 100/100 [00:00<00:00, 169.87it/s]
accuracy: 0.1325, loss: 4.0267 ||: 100%|██████████| 100/100 [00:00<00:00, 108.60it/s]
accuracy: 0.1329, loss: 3.8944 ||: 100%|██████████| 100/100 [00:00<00:00, 175.87it/s]
accuracy: 0.1326, loss: 3.8627 ||: 100%|██████████| 100/100 [00:00<00:00, 110.32it/s]
accuracy: 0.1329, loss: 3.7878 ||: 100%|██████████| 100/100 [00:00<00:00, 170.98it/s]
accuracy: 0.1343, loss: 3.7812 ||: 100%|██████████| 100/100 [00:00<00:00, 109.40it/s]
accuracy: 0.1735, loss: 3.7224 ||: 100%|██████████| 100/100 [00:00<00:00, 167.05it/s]
accuracy: 0.1519, loss: 3.7255 ||: 100%|██████████| 100/100 [00:00<00:00, 107.12it/s]
accuracy: 0.1748, loss: 3.6715 ||: 100%|██████████| 100/

{'best_epoch': 248,
 'best_validation_accuracy': 0.8560853966251379,
 'best_validation_loss': 0.7326847392320633,
 'epoch': 257,
 'peak_cpu_memory_MB': 4320.536,
 'peak_gpu_0_memory_MB': 1438,
 'training_accuracy': 0.9314940636886422,
 'training_cpu_memory_MB': 4320.536,
 'training_duration': '0:06:59.799362',
 'training_epochs': 257,
 'training_gpu_0_memory_MB': 1438,
 'training_loss': 0.3680126863718033,
 'training_start_epoch': 0,
 'validation_accuracy': 0.8568049203595648,
 'validation_loss': 0.734617463350296}

Now do the same training (5K-sentence training and 5K-sentence validation sets) with GloVe embeddings. **<font color="red">Do not clear the output of this cell in the submitted version.</font>**

In [None]:
glove_token_embedding_5000 = Embedding.from_params(vocab=vocab_5000,
                                  params=Params({'pretrained_file':'glove.6B.300d.txt', 'embedding_dim' : EMBEDDING_DIM}))

# Create sequence of vectors based on the embedding
glove_word_embeddings_5000 = BasicTextFieldEmbedder({"tokens": glove_token_embedding_5000}) # Use glove embedding

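# Create and train the actual model with its own LSTM,
# so no weights are shared with the previously trained models
glove_lstm_5000 = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
glove_model_5000 = LstmTagger(glove_word_embeddings_5000, glove_lstm_5000, vocab_5000)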
trainer_glove = initialize_trainer(glove_model_5000, vocab_5000, train_dataset_5000, validation_dataset_5000, batch_size=50)
trainer_glove.train()


15525it [00:00, 30605.00it/s]Found line with wrong number of dimensions (expected: 300; actual: 62): disapproval -0.7224 -0.20175 0.47879 -0.047104 -0.5342 0.2427 -0.044323 -0.42326 -0.43648 -0.67577 -0.44901 0.53004 0.12845 -0.45036 0.51075 0.56344 0.24403 -0.93492 -0.0024757 -0.21545 0.50213 -0.24325 0.24024 -0.039066 -0.081791 0.46241 0.55508 -0.23364 0.33477 0.41562 -0.48405 -0.053207 -0.28972 -0.13055 -0.28959 -0.0068973 -0.70443 -0.62953 -0.0099722 -0.32955 0.13376 0.7939 -0.067696 -0.50736 0.22804 0.24775 -0.349 -0.18843 -0.10325 -0.31933 -0.096132 -0.020843 -0.2721 0.14379 -0.25857 -0.08733 -0.1319 0.29552 -0.3407 0.38866 -0.11296 0.3
19403it [00:00, 27984.09it/s]


NameError: ignored

For each trained model, record validation accuracy and training duration (they are returned along with other training stats after training a model) and accuracy on the training set. Fill in the numbers in the table below:

| model | validation accuracy | training accuracy | training duration |
|-------|---------------------|-------------------|-------------------|
| basic model on 50 sentences||||
| glove model on 50 sentences||||
| basic model on 500 sentences||||
| glove model on 500 sentences||||
| basic model on 5000 sentences||||
| glove model on 5000 sentences||||
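
The dictionary returned by `train()` already contains the three numbers the table asks for. As a convenience, a small hypothetical helper such as `metrics_to_row` could format them as a table row; the sketch below uses the keys visible in the training output and the values recorded for the basic 50-sentence run earlier in this notebook. Whether to report `validation_accuracy` or `best_validation_accuracy` is a choice; this sketch assumes the former.

```python
def metrics_to_row(name, metrics):
    """Format one markdown table row from a trainer metrics dictionary."""
    return "| {} | {:.4f} | {:.4f} | {} |".format(
        name,
        metrics["validation_accuracy"],
        metrics["training_accuracy"],
        metrics["training_duration"],
    )


# Values copied from the basic 50-sentence run shown earlier in this notebook
example_metrics = {
    "validation_accuracy": 0.4631268436578171,
    "training_accuracy": 0.5041928721174004,
    "training_duration": "0:03:34.803111",
}
print(metrics_to_row("basic model on 50 sentences", example_metrics))
# | basic model on 50 sentences | 0.4631 | 0.5042 | 0:03:34.803111 |
```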

**Question.** What do you conclude from these comparisons? When can it be especially beneficial to initialize a model with pretrained embeddings?

**Answer.** <font color="red">WRITE YOUR ANSWER HERE</font>

## Comment 
In this lab we used pretrained GloVe embeddings in a model for part-of-speech tagging. GloVe is itself a word embedding model, but it was trained on a completely different objective. GloVe vectors were optimised to factorise a word co-occurrence matrix, i.e. to predict which words tend to occur with which other words. Part of speech certainly plays a role in determining the statistical co-occurrence of words, but this role is indirect, and explicit part-of-speech information was not used in training GloVe.

This makes our application an example of **transfer learning**, whereby a model trained on one objective (e.g. word co-occurrence) can benefit a different application (e.g. POS tagging), because some information is shared between them. 

## Part 2 - ELMo vectors (55 points)




In the second part of this lab we will reproduce the word sense disambiguation strategy that the authors of the ELMo vectors explored. The strategy consists in the following:

- create ELMo embeddings for all tokens in a sense-annotated corpus
- calculate mean sense vectors for each word sense in the training partition of the corpus
- for each sense-annotated token in the test partition of the corpus, assign the sense whose mean sense vector is closest to the token's ELMo vector according to the cosine distance metric
- as a backup strategy, use the 1st sense of the word by default.
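
The steps above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the reference solution: the 2-dimensional vectors stand in for 1024-dimensional ELMo vectors, the sense names are plain strings rather than WordNet lemmas, and the helpers `mean_vector`, `cosine_similarity` and `disambiguate` are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(token_vector, candidate_senses, sense_means):
    """Pick the candidate whose mean vector is most similar to the token vector.

    candidate_senses lists the word's senses with the first sense first; that
    first sense is the fallback when no candidate was seen in training.
    """
    seen = [s for s in candidate_senses if s in sense_means]
    if not seen:
        return candidate_senses[0]
    return max(seen, key=lambda s: cosine_similarity(token_vector, sense_means[s]))


# Toy training vectors grouped by sense
training = {
    "bank.n.01": [[1.0, 0.0], [0.8, 0.2]],
    "bank.n.02": [[0.0, 1.0], [0.2, 0.8]],
}
sense_means = {sense: mean_vector(vs) for sense, vs in training.items()}

print(disambiguate([0.9, 0.1], ["bank.n.01", "bank.n.02"], sense_means))    # bank.n.01
print(disambiguate([0.5, 0.5], ["plant.n.01", "plant.n.02"], sense_means))  # plant.n.01
```

Minimising cosine distance is the same as maximising cosine similarity, which is why the sketch uses `max`. The second call shows the backup strategy: neither sense of the word was seen in training, so the first sense is returned.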

As a sense annotated corpus, we can use SemCor, conveniently available within NLTK. <code>semcor.sents()</code> iterates over all sentences represented as lists of tokens, while <code>semcor.tagged_sents()</code> iterates over the same sentences with additional annotation including WordNet lemma identifiers (lemmas in WordNet stand for a word taken in a specific sense).

In [5]:
from nltk.corpus import wordnet as wn  
from nltk.corpus import semcor
import random
semcor.sents()

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', 'Atlanta', "'s", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term', 'end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [6]:
semcor.tagged_sents(tag="sem")
print(list(semcor.tagged_sents()[:100]))



## Exercise 1. Extract relevant data from SemCor (5 points)

Let's prepare SemCor data for the disambiguation task. Since this is just an educational exercise and we don't aim at replicating the full results, we can use only a subset of SemCor. Take the first 10K sentences of SemCor and split them **randomly** (with a seed=`42`) into 90% training and 10% testing partitions:

In [32]:
#YOUR CODE HERE
# Don't change variable names!

# semcor_list = list(semcor.tagged_sents(tag='sem')[:10000])
# random.shuffle(semcor_list)
#
# print(semcor_list[0])
# print(semcor_list[1])
#
# semcor_train = semcor_list[:9000]
# semcor_train_untagged = semcor_list[:9000]
# semcor_test = semcor_list[9000:10000]
#
# print(len(semcor_train), len(semcor_test))




data = list(zip(semcor.tagged_sents(tag="sem")[:10000], semcor.sents()[:10000]))
random.seed(42)
random.shuffle(data)

semcor_data_tagged, semcor_data = zip(*data)
semcor_train = semcor_data[:9000]
semcor_test = semcor_data[9000:]
semcor_train_tagged = semcor_data_tagged[:9000]
semcor_test_tagged = semcor_data_tagged[9000:]

Create a function that takes as input a sentence from SemCor and extracts a list which contains, for each token of the sentence, either the corresponding WordNet Lemma (e.g. <code>Lemma('friday.n.01.Friday')</code>) or <code>None</code>. <code>None</code> corresponds to tokens that are either 1) not annotated for word senses (e.g. articles); 2) are marked up as (part of) a named entity (e.g. "City of Atlanta" or placename "Fulton" annotated as  <code>Tree(Lemma('location.n.01.location'), [Tree('NE', ['Fulton'])])</code>).

In [33]:
def get_lemmas(semcor_sentence):
    result = []
    for tree in semcor_sentence:
        # Named entities are wrapped in an inner Tree labelled 'NE':
        # emit None for each token of the entity
        if hasattr(tree[0], 'label') and tree[0].label() == 'NE':
            result.extend([None] * len(tree[0]))
            continue

        if isinstance(tree, nltk.tree.Tree):
            # Sense-annotated chunk: repeat its label (the WordNet lemma) for each token
            result.extend([tree.label()] * len(tree))
        else:
            # Plain token list without sense annotation
            result.extend([None] * len(tree))
    return result
        
result = get_lemmas(semcor.tagged_sents(tag='sem')[0])

a = get_lemmas(semcor_train_tagged[22])
print(len(a))
print(len(semcor_train_tagged[22]))
print(semcor_train_tagged[22])

11
11
[Tree(Lemma('capital.n.01.capital'), ['Capital']), Tree(Lemma('flow.n.03.flow'), ['flows']), ['must'], ['be'], Tree(Lemma('align.v.04.coordinate'), ['coordinated']), ['with'], Tree(Lemma('national.a.02.national'), ['national']), Tree(Lemma('need.n.02.need'), ['needs']), ['and'], Tree(Lemma('planning.n.01.planning'), ['planning']), ['.']]


In [14]:
# TEST
get_lemmas(semcor.tagged_sents(tag='sem')[0])

[None,
 None,
 None,
 None,
 None,
 Lemma('state.v.01.say'),
 Lemma('friday.n.01.Friday'),
 None,
 Lemma('probe.n.01.investigation'),
 None,
 Lemma('atlanta.n.01.Atlanta'),
 None,
 Lemma('late.s.03.recent'),
 Lemma('primary.n.01.primary_election'),
 Lemma('primary.n.01.primary_election'),
 Lemma('produce.v.04.produce'),
 None,
 None,
 Lemma('evidence.n.01.evidence'),
 None,
 None,
 None,
 Lemma('abnormality.n.04.irregularity'),
 Lemma('happen.v.01.take_place'),
 Lemma('happen.v.01.take_place'),
 None]

You are now able to extract word senses (instantiated by WordNet lemmas) from the corpus. The next step is to associate senses with ELMo vectors. Create a dictionary of contextualized token embeddings from the training corpus grouped by the WordNet sense:

In [28]:
from collections import defaultdict

# DON'T CHANGE THE VARIABLE NAME
Train_embeddings = defaultdict(list)

Now let's create contextualized ELMo word embeddings for the tokens in this corpus. We can load the pretrained ELMo model and define a function <code>sentences_to_elmo()</code> that receives a list of tokenized sentences as input and produces their ELMo vectors.

In [22]:
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://allennlp.s3.amazonaws.com/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"
elmo = Elmo(options_file, weight_file, 1, dropout=0)

def sentences_to_elmo(sentences):
    character_ids = batch_to_ids(sentences)
    embeddings = elmo(character_ids)
    return embeddings

Now you can process the corpus sentences and produce their ELMo vectors. It is recommended to pass the input to the ELMo encoder in batches; a suggested batch size is 50 sentences. For example, the code below processes the first 50 sentences from the corpus:

In [23]:
sentences=semcor.sents()[:50]
embeddings=sentences_to_elmo(sentences)
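To go beyond the first 50 sentences, the corpus has to be cut into consecutive batches. The helper below is a minimal sketch of that idea; the name `iter_batches` and the toy corpus are assumptions made for illustration and are not part of the lab code.

```python
def iter_batches(items, batch_size=50):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy corpus of 120 one-token "sentences" standing in for real tokenized text
toy_corpus = [[f"token{i}"] for i in range(120)]
batch_sizes = [len(batch) for batch in iter_batches(toy_corpus)]
print(batch_sizes)  # [50, 50, 20]
```

Each yielded batch could then be passed to <code>sentences_to_elmo()</code> in turn; note that the last batch is usually shorter than the others.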

The <code>embeddings</code> object that we obtained is a dictionary with two entries: <code>elmo_representations</code>, a list of ELMo embedding tensors, and <code>mask</code>. The mask tells us which embeddings correspond to tokens in the original input sentences and which correspond to padding (introduced to give all sentences in the batch the same length).
All embeddings are stored in PyTorch tensors so that they can be used in bigger neural models, although we are not going to do that here. Note that a PyTorch tensor can be converted to a NumPy array with `pyTorch_tensor.detach().numpy()`. 
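As a small illustration of the conversion mentioned above, here is a sketch on a toy tensor (the variable names are hypothetical, and the tensor is assumed to live on the CPU):

```python
import torch

# Toy tensor that tracks gradients, as the output of a neural module might
toy_tensor = torch.ones(2, 3, requires_grad=True)

# detach() drops the gradient history; numpy() then exposes the data as an array
toy_array = toy_tensor.detach().numpy()
print(type(toy_array).__name__, toy_array.shape)  # ndarray (2, 3)
```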

In [24]:
embeddings['elmo_representations'][0]

tensor([[[-6.4617e-03,  6.0215e-03, -3.5598e-01,  ..., -1.1715e-02,
           7.0427e-02, -4.1873e-01],
         [-3.7781e-01,  2.8141e-01, -2.5836e-01,  ..., -4.8547e-01,
           2.5508e-01,  3.6381e-02],
         [ 9.1191e-01,  1.1779e+00, -8.4833e-01,  ...,  9.8472e-01,
           3.3675e-01,  1.6172e-01],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]],

        [[-6.4617e-03,  6.0215e-03, -3.5598e-01,  ..., -4.4876e-02,
           1.1313e-01, -9.9628e-02],
         [ 1.3721e-01, -2.0003e-01, -1.3074e-01,  ...,  5.9482e-01,
           9.3386e-01, -2.6757e-01],
         [ 1.7280e-01,  1.0801e+00, -5.4539e-01,  ...,  3.1966e-01,
          -5.6408e-01,  3.2461e-01],
         ...,
         [ 0.0000e+00,  0

We can check the size of the embeddings we got. The tensor has three dimensions: 1) the number of sentences, 2) the number of tokens (corresponding to the number of tokens in the longest original sentence of the batch; shorter ones were padded), and 3) the dimensionality of the ELMo vector (1024).

In [25]:
embeddings['elmo_representations'][0].detach().size()

# ELMo representation of sentence 0; its two dimensions are:
# 1) Number of tokens in the longest sentence of the batch
# 2) Dimensionality of the vector
embeddings['elmo_representations'][0][0].detach().size()

# ELMo representation of sentence 0, word 0
embeddings['elmo_representations'][0][0][0].detach().size()

torch.Size([1024])

Another thing contained in the <code>embeddings</code> is the mask, a tensor encoding which token vectors correspond to original tokens and which are paddings. It has two dimensions, one corresponding to the sentences in the batch (50) and one corresponding to the token positions:

In [None]:
# Mask: Which vectors correspond to original tokens and which are paddings

print(embeddings['mask'].size())

# Each 1 means original token, 0 means padding
print(embeddings['mask'][0])
print(semcor.sents()[0])

## Exercise 2. Extract ELMo encoding of sentences using a mask (5 points)  

Now define a function <code>get_masked_vectors(embeddings)</code> that takes embeddings as input and returns a list of ELMo sentence encodings to which the mask has been applied.  The output should be a list of Torch tensors, where the padding vectors have been removed so each sentence is represented by an $n \times 1024$ tensor where $n$ is sentence length.

In [29]:
def get_masked_vectors(embeddings):
    """Return a list with one (n x 1024) tensor per sentence, padding removed."""
    representations = embeddings['elmo_representations'][0]
    masked = []
    for sent_vectors, mask in zip(representations, embeddings['mask']):
        # Boolean indexing keeps only the rows whose mask entry is 1 (real tokens)
        masked.append(sent_vectors[mask.bool()])
    return masked



tensors = get_masked_vectors(embeddings)
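
A quick way to convince yourself that masking behaves as expected is to try it on a toy batch where the sentence lengths are known in advance. The sketch below uses made-up tensors (all names are hypothetical) instead of real ELMo output:

```python
import torch

# Toy batch: 2 sentences, padded to 3 token positions, 4-dimensional vectors
toy_vectors = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)
# Sentence 0 has 3 real tokens; sentence 1 has 2 real tokens and 1 padding slot
toy_mask = torch.tensor([[1, 1, 1], [1, 1, 0]])

# Indexing with a boolean mask keeps only the rows marked as real tokens
unpadded = [vectors[mask.bool()] for vectors, mask in zip(toy_vectors, toy_mask)]
shapes = [tuple(t.shape) for t in unpadded]
print(shapes)  # [(3, 4), (2, 4)]
```

Each resulting tensor has as many rows as the sentence has tokens, which is the $n \times 1024$ shape the exercise asks for (with 4 in place of 1024 in this toy case).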


## Exercise 3. Collect ELMo vectors from the training corpus (20 points)

Process the corpus, updating your train word sense vectors in the dictionary. Iterate over all the train sentences in the corpus and retrieve, for each lemma-annotated token (where the lemma is not <code>None</code>), the corresponding ELMo vector. Store the ELMo sense embeddings that correspond to each lemma in the dictionary <code>Train_embeddings</code>. Processing the training corpus with ELMo is the most time-consuming part of this assignment. However, it should not take forever. If this computation takes more than an hour, you may want to optimize your code or make sure you are using GPU acceleration. For the purposes of developing and debugging your solution, you may start by using a sample of 100 sentences, but then switch to the full 9K-sentence training set. 

In [31]:
# might take ~25min on Colab's GPU
import torch

BATCH_SIZE = 50  # batch size suggested above

# Process the shuffled training split in batches; semcor_train holds the plain
# token lists and semcor_train_tagged the matching sense-annotated sentences
for start in range(0, len(semcor_train), BATCH_SIZE):
    batch_sents = [list(sent) for sent in semcor_train[start:start + BATCH_SIZE]]
    batch_tagged = semcor_train_tagged[start:start + BATCH_SIZE]
    with torch.no_grad():  # inference only, no gradients needed
        batch_embeddings = sentences_to_elmo(batch_sents)
    batch_vectors = get_masked_vectors(batch_embeddings)

    for tagged_sent, sent_vectors in zip(batch_tagged, batch_vectors):
        lemma_list = get_lemmas(tagged_sent)  # one entry per token
        for position, lemma in enumerate(lemma_list):
            if lemma is not None:
                # Key by str(lemma): == on Lemma objects ignores the synset
                Train_embeddings[str(lemma)].append(sent_vectors[position].clone())

How many senses does your Train_embeddings contain? **<font color="red">Do not clear the output of this cell in the submitted version.</font>**

In [None]:
print(len(Train_embeddings))

## Exercise 4. Vector averaging (5 points)

Your <code>Train_embeddings</code> now maps each word sense to a list of all its vectors in the training corpus. For our purposes, we do not need the full list but only the mean vector for each sense. For each sense in <code>Train_embeddings</code>, replace the list with the average of the ELMo vectors in it. One efficient way to do this is to convert the list to a tensor via the <code>stack</code> function and use Torch's <code>mean</code> function. Below is an example of how the average of two (random) vectors stored in a tensor can be computed in PyTorch: 

In [None]:
import torch

randtensor = torch.randn(2, 4)
print("Tensor storing two 4-dimensional vectors:\n", randtensor)
print("Average vector: \n", randtensor.mean(dim=0))
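
The example above averages the rows of a single tensor. When the vectors are stored as separate tensors in a Python list, they can be stacked first. The sketch below shows this on a toy dictionary; the keys and values are made up for illustration and do not come from the corpus.

```python
import torch

# Hypothetical toy dictionary: sense key -> list of 3-dimensional vectors
toy_embeddings = {
    "sense_a": [torch.tensor([1.0, 2.0, 3.0]), torch.tensor([3.0, 4.0, 5.0])],
    "sense_b": [torch.tensor([0.0, 0.0, 6.0])],
}

# stack turns a list of k vectors into a (k x 3) tensor; mean over dim 0 averages them
toy_means = {
    sense: torch.stack(vectors).mean(dim=0)
    for sense, vectors in toy_embeddings.items()
}
print(toy_means["sense_a"])  # tensor([2., 3., 4.])
```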

Now you are ready to update your <code>Train_embeddings</code> so that it maps lemmas not to lists but to averaged vectors.

In [None]:
#YOUR CODE HERE


## Exercise 5. Testing the sense vectors (20 points)

Test your sense embeddings on your test data, which is a subset of the SemCor corpus. Use the strategy outlined above, with 1st WordNet sense as a fallback: 

- rely on mean sense vectors for each word sense in the training partition of the corpus, as stored in <code>Train_embeddings</code>
- for each sense-annotated token <i>t</i> (e.g. the verb "run") in the test partition of the corpus, assign the sense of that word (e.g. "Lemma('X.v.n.run')") whose mean vector is closest to the ELMo vector of <i>t</i> according to the cosine distance metric
- as a backup strategy, use the 1st sense of the word (e.g. <code>Lemma('run.n.01.run')</code>) from WordNet. You can look it up using a built-in function from NLTK (e.g. <code>wn.lemmas('run')</code>). More on usage of WordNet with NLTK [here](https://www.nltk.org/howto/wordnet.html).
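
The decision rule in the list above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the function name `predict_sense`, the string sense keys, and the 2-dimensional vectors are all hypothetical, and collecting the candidate senses of a word from <code>Train_embeddings</code> and WordNet is left out.

```python
import torch
from torch.nn.functional import cosine_similarity


def predict_sense(token_vector, candidate_means, fallback):
    """Return the candidate sense whose mean vector is most similar to token_vector.

    candidate_means: dict mapping sense keys to mean vectors (may be empty).
    fallback: value returned when no candidate is available.
    """
    if not candidate_means:
        return fallback
    senses = list(candidate_means)
    means = torch.stack([candidate_means[s] for s in senses])  # (k x d)
    query = token_vector.unsqueeze(0).expand_as(means)         # (k x d)
    sims = cosine_similarity(means, query, dim=1)              # (k,)
    # Highest cosine similarity corresponds to smallest cosine distance
    return senses[int(sims.argmax())]


# Hypothetical mean vectors for two senses of one word
toy_means = {
    "run.v.01": torch.tensor([1.0, 0.0]),
    "run.n.01": torch.tensor([0.0, 1.0]),
}
prediction = predict_sense(torch.tensor([0.9, 0.1]), toy_means, fallback="run.n.01")
backup = predict_sense(torch.tensor([0.9, 0.1]), {}, fallback="run.n.01")
print(prediction, backup)  # run.v.01 run.n.01
```

When no training vector exists for any sense of the word, the candidate dictionary is empty and the fallback (the first WordNet sense in the lab's setting) is returned.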

Calculate WSD accuracy in percentage points on your test data. Report four numbers:
- overall accuracy (proportion of times the ELMo method+WordNet backup results in the correct sense annotation)
- WordNet baseline accuracy: what if you always select the first WordNet sense, ignoring the ELMo embedding?
- accuracy of the ELMo method just for the instances in which ELMo strategy is applicable
- accuracy of the WordNet baseline just for the instances in which ELMo strategy is applicable
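
One possible way to organize the bookkeeping for these four numbers is to record, for each test token, the gold sense, the ELMo prediction (or <code>None</code> when the ELMo strategy is not applicable), and the first WordNet sense. The sketch below assumes this record layout and uses made-up sense strings; it returns proportions, which would still need to be multiplied by 100 to obtain percentage points.

```python
def accuracy_report(outcomes):
    """outcomes: list of (gold, elmo_pred, first_sense) string triples;
    elmo_pred is None when the ELMo strategy is not applicable."""
    def share(pairs):
        pairs = list(pairs)
        if not pairs:
            return float("nan")
        return sum(pred == gold for gold, pred in pairs) / len(pairs)

    elmo_cases = [o for o in outcomes if o[1] is not None]
    return {
        # ELMo prediction where available, first WordNet sense otherwise
        "overall": share((g, e if e is not None else f) for g, e, f in outcomes),
        "baseline": share((g, f) for g, e, f in outcomes),
        "elmo_only": share((g, e) for g, e, f in elmo_cases),
        "baseline_on_elmo_cases": share((g, f) for g, e, f in elmo_cases),
    }


toy_outcomes = [
    ("a.1", "a.1", "a.1"),  # ELMo right, baseline right
    ("b.2", "b.2", "b.1"),  # ELMo right, baseline wrong
    ("c.1", "c.2", "c.1"),  # ELMo wrong, baseline right
    ("d.1", None, "d.1"),   # ELMo not applicable, backup right
]
report = accuracy_report(toy_outcomes)
print(report["overall"], report["baseline"])  # 0.75 0.75
```

Storing the senses as strings follows the advice on comparing lemmas given below.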

For the purpose of testing the model, it is important to implement comparison of predicted and ground truth synsets correctly. To do this, use a string conversion, because ```==``` applied to WordNet lemmas only compares the words that express the two lemmas, ignoring the synsets. See the code below:

In [None]:
word="toy"
toy1 = wn.lemmas(word)[0]
toy2 = wn.lemmas(word)[6]
print("toy1 (noun):", toy1)
print("toy2 (verb):", toy2)
print("direct equality comparison of toy1 and toy2: toy1==toy2", toy1==toy2)
print("string-based comparison of toy1 and toy2: str(toy1)==str(toy2)",
      str(toy1)==str(toy2))

In [None]:
from torch.nn.functional import cosine_similarity

all_outcomes = []
#YOUR CODE HERE



Make sure you have the following variables defined so that this cell runs smoothly.  
**<font color="red">Do not delete the output of this cell in the submitted version.</font>**

In [None]:
# Don't round the numbers
print("Overall accuracy:", accuracy) 
print("WordNet baseline:", baseline_accuracy)
print("Accuracy in cases where ELMo method is used", elmo_accuracy)
print("Accuracy of the baseline in cases where ELMo method is applicable", baseline_on_elmo_data_accuracy)

If you reached this point, you were able to evaluate ELMo as a model of contextual semantic similarity of word usages. The idea behind the vector averaging is that a word when used in the same sense should have similar vector representations, while usages in distinct senses should have different vector representations.

Analyze the numbers above. What do they tell you?

**<font color="red">WRITE YOUR ANALYSIS HERE</font>**


## The end
Congratulations! This is the end of Lab 4.

**Acknowledgements**: Tejaswini Deoskar has given valuable comments that helped improve this lab assignment. Timothee Mickus helped to test this assignment and gave extensive feedback on the instructions. Many thanks to both.