# Word Sense Disambiguation using Neural Networks
Adam Ek

----------------------------------

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on [how to work on group assignments](https://github.com/sdobnik/computational-semantics/blob/master/README.md).

Write all your answers and the code in the appropriate boxes below.

----------------------------------

A problem with static distributional vectors is the difficulty of distinguishing between different *word senses*. We will continue our exploration of word vectors by considering *trainable vectors* or *word embeddings* for Word Sense Disambiguation (WSD).

The goal of word sense disambiguation is to train a model to find the sense of a word (homonyms of a word-form). For example, the word "bank" can mean "sloping land" or "financial institution". 

(a) "I deposited my money in the **bank**" (financial institution)

(b) "I swam from the river **bank**" (sloping land)

In case a) and b) we can determine that the meaning of "bank" based on the *context*. To utilize context in a semantic model we use *contextualized word representations*. Previously we worked with *static word representations*, i.e. the representation does not depend on the context. To illustrate we can consider sentences (a) and (b), the word **bank** would have the same static representation in both sentences, which means that it becomes difficult for us to predict its sense. What we want is to create representations that depend on the context, i.e. *contextualized embeddings*. 

We will create contextualized embeddings with Recurrent Neural Networks. You can read more about recurrent neural netoworks [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). Your overall task in this lab is to create a neural network model that can disambiguate the word sense of 30 different words. 

In [30]:
# first we import some packages that we need
import torch
import torch.nn as nn
import torchtext

# our hyperparameters (add more when/if you need them)
device = torch.device('cuda:3')

batch_size = 8
learning_rate = 0.001
epochs = 3

In [2]:
#other packages that we are going to use
import csv
import random
import math

import numpy as np
import pandas as pd

# 1. Working with data

A central part of any machine learning system is the data we're working with. In this section we will split the data (the dataset is located here: ``wsd-data/wsd_data.txt``) into a training set and a test set. We will also create a baseline to compare our model against. Finally, we will use TorchText to transform our data (raw text) into a convenient format that our neural network can work with.

## Data

The dataset we will use contain different word sense for 30 different words. The data is organized as follows (values separated by tabs): 
- Column 1: word-sense
- Column 2: word-form
- Column 3: index of word
- Column 4: white-space tokenized context

### Splitting the data

Your first task is to seperate the data into a *training set* and a *test set*. The training set should contain 80% of the examples and the test set the remaining 20%. The examples for the test/training set should be selected **randomly**. Save each dataset into a .csv file for loading later. **[2 marks]**

### DO NOT RE-RUN THESE CELLS!  
They are here ONLY to make the data split once and save it. Re-running them will mess up the data split and will render them useless for testing anything on the saved BERT model. Continue from "creating a baseline".

In [3]:
def data_split(path_to_dataset):
    # your code goes here
    with open(path_to_dataset) as file:
        lines = file.readlines()
    
    clean_lines = []
    for line in lines:
        clean_line = line.strip().split("\t")
        clean_lines.append(clean_line)
        
    shuffled_lines = random.sample(clean_lines, len(clean_lines))
    
    # it will not be an exact 80/20 unless the set allows for it, this way the training set is marginally larger than 80%
    training_cutoff = math.ceil(len(clean_lines)*0.8)  
    
    training = shuffled_lines[:training_cutoff]
    testing = shuffled_lines[training_cutoff:]
        
    return training, testing

In [4]:
train, test = data_split('wsd_data.txt')

In [5]:
print(len(train))
print(len(test))

print(len(test)/(len(test)+len(train)))

60840
15209
0.1999894804665413


In [6]:
def data_save(train, test, train_file, test_file):
    
    with open(train_file, 'w', encoding='UTF8') as f:
        writer = csv.writer(f)
        writer.writerows(train)
        
    with open(test_file, 'w', encoding='UTF8') as f:
        writer = csv.writer(f)
        writer.writerows(test)

In [7]:
data_save(train, test, 'wsd_train.csv', 'wsd_test.csv')

In [8]:
with open('wsd_test.csv') as f:
    reader = csv.reader(f)
    rows = []
    for row in reader:
        rows.append(row)
        
for row in rows:
    if len(row) != 4:
        print("oopsie, your file does not work")

### Creating a baseline

Your second task is to create a *baseline* for the task. A baseline is a "reality check" for a model, given a very simple heuristic/algorithmic/model solution to the problem, can our neural network perform better than this?
The baseline you are to create is the "most common sense" (MCS) baseline. For each word form, find the most commonly assigned sense to the word, and label a words with that sense. **[2 marks]**

E.g. In a fictional dataset, "bank" have two senses, "financial institution" which occur 5 times and "side of river" 3 times. Thus, all 8 occurences of bank is labeled "financial institution" and this yields an MCS accuracy of 5/8 = 62.5%. If a model obtain a higher score than this, we can conclude that the model *at least* is better than selecting the most frequent word sense.

In [3]:
train_data = pd.read_csv('wsd_train.csv', header=None)
train_data.columns = ['word-sense', 'word-form', 'index', 'context']
test_data = pd.read_csv('wsd_test.csv', header=None)
test_data.columns = ['word-sense', 'word-form', 'index', 'context']

In [4]:
def mcs_baseline_2(train_data, test_data):
    msc_mapping = {}
    for word_form, df in train_data.groupby('word-form'):
        msc_mapping[word_form] = df['word-sense'].value_counts().idxmax()

    #'missing' incase (unlikely) all observations of one word-form ends up in test_data.
    msc_test_results = test_data['word-form'].apply(lambda x: msc_mapping.get(x, 'missing'))

    #return msc_test_results
    accuracy = np.sum(msc_test_results == test_data['word-sense'])/len(test_data)
    return accuracy      

In [5]:
print(f'MSC Baseline Accuracy: {round(mcs_baseline_2(train_data, test_data)*100,2)}%')

MSC Baseline Accuracy: 31.09%


These two yield similar results, but there is a bit of a difference between them, which is why we wanted to show both.

### What happens if we remove the word from the word sense? 
Example: keep%2:42:07:: -> 2:42:07::

In [6]:
data = pd.read_csv('wsd_data.txt', sep='\t', header=None)
data.columns = ['word-sense', 'word-form', 'index', 'context']

In [7]:
# this function will be needed for getting per-word-form accuracy
def get_wf(w):
    # keep%2:42:07:: -> keep
    splt = w.split('%')
    if len(splt)!=2:
        raise Exception('Unknown word-sense format')
    return splt[0]

In [8]:
def remove_word(w):
    # keep%2:42:07:: -> 2:42:07::
    splt = w.split('%')
    if len(splt)!=2:
        raise Exception('Unknown word-sense format')
    return splt[-1]

print(len(np.unique(data['word-sense'])))
print(len(np.unique(data['word-sense'].apply(remove_word))))

222
135


In [9]:
data['word-sense-generalized'] = data['word-sense'].apply(remove_word)

Observation/Idea: From 222 unique instances to 135, i.e there is some overlap between the word-sense codes. The result implies that we can possibly improve generalization by removing the word from word-sense. As the word itself is not unknown and can be readded after prediction.

### Creating data iterators

To train a neural network, we first need to prepare the data. This involves converting words (and labels) to a number, and organizing the data into batches. We also want the ability to shuffle the examples such that they appear in a random order.  

To do all of this we will use the torchtext library (https://torchtext.readthedocs.io/en/latest/index.html). In addition to converting our data into numerical form and creating batches, it will generate a word and label vocabulary, and data iterators than can sort and shuffle the examples. 

Your task is to create a dataloader for the training and test set you created previously. So, how do we go about doing this?

1) First we create a ``Field`` for each of our columns. A field is a function which tokenize the input, keep a dictionary of word-to-numbers, and fix paddings. So, we need four fields, one for the word-sense, one for the position, one for the lemma and one for the context. 

2) After we have our fields, we need to process the data. For this we use the ``TabularDataset`` class. We pass the name and path of the training and test files we created previously, then we assign which field to use in each column. The result is that each column will be processed by the field indicated. So, the context column will be tokenized and processed by the context field and so on. 

3) After we have processed the dataset we need to build the vocabulary, for this we call the function ``build_vocab()`` on the different ``Fields`` with the output from ``TabularDataset`` as input. This looks at our dataset and creates the necessary vocabularies (word-to-number mappings). 

4) Finally, the last step. In the last step we load the data objects given by the ``TabularDataset`` and pass it to the ``BucketIterator`` class. This class will organize our examples into batches and shuffle them around (such that for each epoch the model observe the examples in a different order). When we are done with this we can let our function return the data iterators and vocabularies, then we are ready to train and test our model!

Implement the dataloader. [**2 marks**]

*hint: for TabularDataset and BucketIterator use the class function splits()* 

We did not use torchtext as in the class it was suggested that doing custom datasets and loaders is an option.

In [16]:
from torch.utils.data import DataLoader, Dataset

In [17]:
# I implement a Dataset to keep track of vocab, word2idx, idx2word
# Dataset can also be used in DataLoader which gives batch loading, etc, for free.

class WSDDataset(Dataset):

    def __init__(self, data, unk_label='<unk>', pad_label='<pad>', generalize_word_sense=False):
        
        self.unk_idx, self.unk_label = 0, unk_label
        self.pad_idx, self.pad_label = 1, pad_label

        self.data = data.copy()
        self.data['context'] = self.data['context'].apply(self.tokenize)
        if generalize_word_sense:
            self.data['word-sense'] = self.data['word-sense'].apply(self.__remove_word)

        self.vocab = self.__unique_words()
        
        self.word2idx = dict()
        self.idx2word = dict()
        self.word2idx[self.unk_label] = self.unk_idx
        self.word2idx[self.pad_label] = self.pad_idx
        self.word2idx.update({word:idx+max(self.word2idx.values())+1 for idx, word in enumerate(self.vocab)})

        self.idx2word = {v:k for k,v in self.word2idx.items()}

        self.labels = list(np.unique(self.data['word-sense']))

    def __unique_words(self):
        all_words = []
        for s in self.data['context']:
            all_words += s
        return np.unique(all_words)
        
    def tokenize(self, string):
        # The tokenizer was given as a whitespace tokenizer
        return string.lower().split()

    # generalized word-sense tags
    def __remove_word(self, w):
        # keep%2:42:07:: -> 2:42:07::
        splt = w.split('%')
        if len(splt)!=2:
            raise Exception('Unknown word-sense format')
        return splt[-1]

    def __getitem__(self, idx):
        #x = self.data.iloc[0] #for test
        x = self.data.iloc[idx]
        out = (x['index'], x['context'], x['word-sense'])
        return out
        
    def __len__(self):
        return len(self.data)

In [18]:
from collections import namedtuple
from torch.nn.utils.rnn import pad_sequence 
#Q:Should target word be inlcuded in contexts or not?

word_sense_to_idx = {k:v for v,k in enumerate(sorted(np.unique(data['word-sense'])))}
idx_to_word_sense = {v:k for k,v in word_sense_to_idx.items()}

class Collate():
    def __init__(self, word_to_idx, pad_idx=1, unk_idx=0, word_sense_to_idx=word_sense_to_idx):
        self.pad_idx = pad_idx
        self.unk_idx = unk_idx
        self.word_to_idx = word_to_idx
        self.word_sense_to_idx = word_sense_to_idx
    def __call__(self, batch):
        batch = np.transpose(batch)
        indices = batch[0]
        word_senses = [self.word_sense_to_idx[ws] for ws in batch[2]]

        contexts = np.transpose(batch[1]) #batch first
        contexts = [torch.tensor([self.word_to_idx.get(w, self.unk_idx) for w in s], device=device) for s in contexts]
        contexts = pad_sequence(contexts, batch_first=True, padding_value=self.pad_idx)

        return indices, contexts, word_senses


def dataloader(dataset, batch_size=32, shuffle=True):
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        collate_fn=Collate(dataset.word2idx, dataset.pad_idx, dataset.unk_idx)
    )
    return loader


In [19]:
## -- testing dataset--
tt = train_data.iloc[0:4].copy()
display(tt)

tt_dset =  WSDDataset(tt)

print(tt_dset[0])

print(len(tt_dset))

print('uniq word sense', (tt_dset.labels))

print(len(tt_dset.word2idx))


Unnamed: 0,word-sense,word-form,index,context
0,positive%3:00:01::,positive.a,66,[ Original : English ] If the verb “ is locate...
1,professional%3:01:01::,professional.a,76,There is lack of professionals and human resou...
2,point%1:09:02::,point.n,33,"You do n't need worry , Angelo . I tell you , ..."
3,keep%2:41:01::,keep.v,41,The strength of the dollar vis-à-vis other maj...


(66, ['[', 'original', ':', 'english', ']', 'if', 'the', 'verb', '“', 'is', 'located', '”', 'means', 'a', 'real', 'flight', 'of', 'a', 'craft', 'in', 'airspace', 'on', 'the', 'basis', 'of', 'principles', 'and', 'technology', 'of', 'aeronautics', 'on', 'the', 'one', 'hand', 'and', 'the', 'movement', 'of', 'an', 'object', 'to', ',', 'in', 'and', 'from', 'orbit', 'on', 'the', 'basis', 'of', 'principles', 'and', 'technology', 'of', 'astronautics', 'on', 'the', 'other', ',', 'the', 'reply', 'to', 'this', 'question', 'should', 'be', 'positive', '.', 'this', 'answer', ',', 'however', ',', 'is', 'subject', 'to', 'further', 'consideration', 'taking', 'into', 'account', 'the', 'purposes', 'served', 'by', 'each', 'aerospace', 'object', '(', 'see', 'below', 'replies', 'to', 'questions', '3', 'and', '4', ')', '.'], 'positive%3:00:01::')
4
uniq word sense ['keep%2:41:01::', 'point%1:09:02::', 'positive%3:00:01::', 'professional%3:01:01::']
186


In [20]:
tt2 = train_data.iloc[0:3].copy()
tt2 =  WSDDataset(tt2)

loader = dataloader(tt2, batch_size=3,shuffle=False)
for b in loader:
    #print()
    print(b)

## Note: I should probably clump together contexts of small, medium and large size to reduce padding. I remember Nikolai talking about this in a previous course.
# Its possible that BatchIterator does this automatically which would be an advantage of using it.

  return array(a, dtype, copy=False, order=order)


(array([66, 76, 33], dtype=object), tensor([[ 13, 103,  11,  52,  14,  71, 140, 149, 153,  79,  85, 154,  89,  15,
         118,  56,  97,  15,  43,  73,  20,  98, 140,  29,  97, 110,  22, 137,
          97,  18,  98, 140,  99,  65,  22, 140,  92,  97,  21,  96, 143,   6,
          73,  22,  60, 101,  98, 140,  29,  97, 110,  22, 137,  97,  26,  98,
         140, 104,   6, 140, 121, 143, 142, 116, 126,  30, 108,   8, 142,  24,
           6,  68,   6,  79, 132, 143,  61,  42, 136,  78,  17, 140, 115, 124,
          35,  49,  19,  96,   4, 123,  32, 120, 143, 117,   9,  22,  10,   5,
           8,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
           1,   1,   1,   1,   1],
        [141,  79,  84,  97, 113,  22,  69, 122, 143,  63, 140,  41,  28, 125,
          97,  91,  66,   8, 142,  75, 139,  48,  42, 126,  30,  64, 143, 140,
          57,  80,  11, 143,  87,  91,  66, 125,  41,  22,  40,  28,  12, 143,
          63, 130,  27, 143, 140,  90, 109,  60,  91,  22,  

In [21]:
print(b[2])
b[1].size()

[172, 180, 160]


torch.Size([3, 117])

# 2.1 Creating and running a Neural Network for WSD

In this section we will create and run a neural network to predict word senses based on *contextualized representations*.

### Model

We will use a bidirectional Long-Short-Term Memory (LSTM) network to create a representation for the sentences and a Linear classifier to predict the sense of each word.

When we initialize the model, we need a few things:

    1) An embedding layer: a dictionary from which we can obtain word embeddings
    2) A LSTM-module to obtain contextual representations
    3) A classifier that compute scores for each word-sense given *some* input


The general procedure is the following:

    1) For each word in the sentence, obtain word embeddings
    2) Run the embedded sentences through the RNN
    3) Select the appropriate hidden state
    4) Predict the word-sense 

**Suggestion for efficiency:**  *Use a low dimensionality (32) for word embeddings and the LSTM when developing and testing the code, then scale up when running the full training/tests*
    
Your tasks will be to create two different models (both follow the two outlines described above), described below:

In the first approach to WSD, you are to select the index of our target word (column 3 in the dataset) and predict the word sense. **[5 marks]**


In [22]:
class WSDModel_approach1(nn.Module): 
    def __init__(self, word2idx, wordsense2idx, embedding_dim=32, hidden_size=128,padding_idx=1):
        super().__init__()
        self.vocab_size = len(word2idx)
        self.output_dim = len(wordsense2idx)
        self.hidden_size = hidden_size
        
        # your code goes here
        self.embeddings = nn.Embedding(self.vocab_size, embedding_dim, padding_idx=padding_idx)
        self.LSTM = nn.LSTM(input_size=embedding_dim, hidden_size=self.hidden_size, num_layers=1, bidirectional=True)
        self.classifier = nn.Linear(self.hidden_size*2, self.output_dim)
    
    def forward(self, contexts, indices):
        # your code goes here
        embedded_contexts = self.embeddings(contexts)

        timestep_representation, *_ = self.LSTM(embedded_contexts)

        timestep_representation = torch.stack(
            [tp[i] for i, tp in zip(indices, timestep_representation)])
        
        predictions = self.classifier(timestep_representation)
        
        return predictions

In the second approach to WSD, you are to predict the word sense based on the final hidden state given by the RNN. **[5 marks]**

In [23]:
class WSDModel_approach2(nn.Module):
    def __init__(self, word2idx, wordsense2idx, embedding_dim=32, hidden_size=128, padding_idx=1):
        super().__init__()
        self.vocab_size = len(word2idx)
        self.output_dim = len(wordsense2idx)
        self.hidden_size = hidden_size

        # your code goes here
        self.embeddings = nn.Embedding(
            self.vocab_size, embedding_dim, padding_idx=padding_idx)
        self.LSTM = nn.LSTM(input_size=embedding_dim,
                            hidden_size=self.hidden_size, num_layers=1, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(self.hidden_size*2, self.output_dim)

    def forward(self, contexts):
        # your code goes here
        embedded_contexts = self.embeddings(contexts)

        timestep_representation, (final_hidden, final_cell) = self.LSTM(embedded_contexts)
        hidden = torch.cat((final_hidden[0, :, :], final_hidden[1, :, :]), dim=1)
        # timestep_representation = torch.stack([tp[i] for i, tp in zip(indices, timestep_representation)])

        predictions = self.classifier(hidden)

        return predictions


### Training and testing the model

Now we are ready to train and test our model. What we need now is a loss function, an optimizer, and our data. 

- First, create the loss function and the optimizer.
- Next, we iterate over the number of epochs (i.e. how many times we let the model see our data). 
- For each epoch, iterate over the dataset (``train_iter``) to obtain batches. Use the batch as input to the model, and let the model output scores for the different word senses.
- For each model output, calculate the loss (and print the loss) on the output and update the model parameters.
- Reset the gradients and repeat.
- After all epochs are done, test your trained model on the test set (``test_iter``) and calculate the total and per-word-form accuracy of your model.

Implement the training and testing of the model **[4 marks]**

**Suggestion for efficiency:** *when developing your model, try training and testing the model on one or two batches (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*

In [24]:
train_dataset = WSDDataset(train_data)
# is the test dataset encoded using the same indices? 
test_dataset = WSDDataset(test_data)

vocab = train_dataset.vocab
labels = train_dataset.labels

In [25]:
import torch.optim as optim


params = {'lr':0.001, 'batch_size':128, 'hidden_size':128, 'embedding_dim':32, 'epochs':2}

model = WSDModel_approach2(train_dataset.word2idx, labels)
model.to(device)
loss_function = nn.CrossEntropyLoss(reduction='mean')
optimizer = optim.Adam(model.parameters(), lr=params['lr'])

for epoch in range(1,params['epochs']+1):
    
    train_iter = dataloader(train_dataset, batch_size=params['batch_size'])
    test_iter = dataloader(test_dataset, batch_size=128)
    total_loss = 0
    for i, batch in enumerate(train_iter):
        
        if True: #i < 50: # For testing
            contexts = batch[1]
            indices = batch[0]
            # this causes an error when we run it on cuda, not on cpu!!!
            word_senses = torch.Tensor(batch[2]).long().to(device)
            
            # send your batch of sentences to the model

            # output = model(contexts, indices)
            output = model(contexts)
            
            loss = loss_function(output, word_senses)
            total_loss += loss.item()

            if i%10==0:
                print(f' Batch {i} : Average Loss = {round(total_loss/(i+1),5)}')#, end='\r')
            
            # calculate gradients
            loss.backward()
            # update model weights
            optimizer.step()
            # reset gradients
            optimizer.zero_grad()
        

    print(f'Epoch {epoch} : Average Training Loss = {round(total_loss/(i+1),5)}')#, end='\r')
    test = next(iter(test_iter))
    with torch.no_grad():
        correct = 0
        for batch in test_iter:
            # test_output = model(batch[1], batch[0])
            test_output = model(batch[1])
            test_output = torch.argmax(test_output, dim=1)
            targets = torch.tensor(batch[2], device=device)
            correct += torch.sum(test_output == targets)
        test_accu = correct/len(test_dataset)
    print(f'Epoch {epoch} : Test Accuracy = {round(float(test_accu), 5)}')

 Batch 0 : Average Loss = 5.40113
 Batch 10 : Average Loss = 5.36726
 Batch 20 : Average Loss = 5.25468
 Batch 30 : Average Loss = 5.18754
 Batch 40 : Average Loss = 5.1327
 Batch 50 : Average Loss = 5.10392
 Batch 60 : Average Loss = 5.07561
 Batch 70 : Average Loss = 5.06603
 Batch 80 : Average Loss = 5.061
 Batch 90 : Average Loss = 5.0512
 Batch 100 : Average Loss = 5.04151
 Batch 110 : Average Loss = 5.03208
 Batch 120 : Average Loss = 5.02628
 Batch 130 : Average Loss = 5.01858
 Batch 140 : Average Loss = 5.01475
 Batch 150 : Average Loss = 5.01063
 Batch 160 : Average Loss = 5.00838
 Batch 170 : Average Loss = 5.00775
 Batch 180 : Average Loss = 5.00612
 Batch 190 : Average Loss = 5.00291
 Batch 200 : Average Loss = 5.00007
 Batch 210 : Average Loss = 5.0
 Batch 220 : Average Loss = 4.99772
 Batch 230 : Average Loss = 4.99622
 Batch 240 : Average Loss = 4.99559
 Batch 250 : Average Loss = 4.99503
 Batch 260 : Average Loss = 4.99182
 Batch 270 : Average Loss = 4.99111
 Batch 280 

In [26]:
import torch.optim as optim


params = {'lr':0.001, 'batch_size':128, 'hidden_size':128, 'embedding_dim':32, 'epochs':2}

model = WSDModel_approach1(train_dataset.word2idx, labels)
model.to(device)
loss_function = nn.CrossEntropyLoss(reduction='mean')
optimizer = optim.Adam(model.parameters(), lr=params['lr'])

for epoch in range(1,params['epochs']+1):
    
    train_iter = dataloader(train_dataset, batch_size=params['batch_size'])
    test_iter = dataloader(test_dataset, batch_size=128)
    total_loss = 0
    for i, batch in enumerate(train_iter):
        
        if True: #i < 50: # For testing
            contexts = batch[1]
            indices = batch[0]
            # this causes an error when we run it on cuda, not on cpu!!!
            word_senses = torch.Tensor(batch[2]).long().to(device)
            
            # send your batch of sentences to the model

            output = model(contexts, indices)
            # output = model(contexts)
            
            loss = loss_function(output, word_senses)
            total_loss += loss.item()

            if i%10==0:
                print(f' Batch {i} : Average Loss = {round(total_loss/(i+1),5)}')#, end='\r')
            
            # calculate gradients
            loss.backward()
            # update model weights
            optimizer.step()
            # reset gradients
            optimizer.zero_grad()
        

    print(f'Epoch {epoch} : Average Training Loss = {round(total_loss/(i+1),5)}')#, end='\r')
    test = next(iter(test_iter))
    with torch.no_grad():
        correct = 0
        for batch in test_iter:
            test_output = model(batch[1], batch[0])
            # test_output = model(batch[1])
            test_output = torch.argmax(test_output, dim=1)
            targets = torch.tensor(batch[2], device=device)
            correct += torch.sum(test_output == targets)
        test_accu = correct/len(test_dataset)
    print(f'Epoch {epoch} : Test Accuracy = {round(float(test_accu), 5)}')

 Batch 0 : Average Loss = 5.4024
 Batch 10 : Average Loss = 5.33116
 Batch 20 : Average Loss = 5.22887
 Batch 30 : Average Loss = 5.08474
 Batch 40 : Average Loss = 4.94669
 Batch 50 : Average Loss = 4.81926
 Batch 60 : Average Loss = 4.67897
 Batch 70 : Average Loss = 4.52248
 Batch 80 : Average Loss = 4.36471
 Batch 90 : Average Loss = 4.21292
 Batch 100 : Average Loss = 4.07355
 Batch 110 : Average Loss = 3.93886
 Batch 120 : Average Loss = 3.81154
 Batch 130 : Average Loss = 3.69018
 Batch 140 : Average Loss = 3.57506
 Batch 150 : Average Loss = 3.47335
 Batch 160 : Average Loss = 3.37907
 Batch 170 : Average Loss = 3.29552
 Batch 180 : Average Loss = 3.2181
 Batch 190 : Average Loss = 3.14448
 Batch 200 : Average Loss = 3.08398
 Batch 210 : Average Loss = 3.0249
 Batch 220 : Average Loss = 2.96933
 Batch 230 : Average Loss = 2.91595
 Batch 240 : Average Loss = 2.86733
 Batch 250 : Average Loss = 2.82181
 Batch 260 : Average Loss = 2.77947
 Batch 270 : Average Loss = 2.73752
 Batch

# 2.2 Running a transformer for WSD

In this section of the lab you'll try out the transformer, specifically the BERT model. For this we'll use the huggingface library (https://huggingface.co/).

You can find the documentation for the BERT model here (https://huggingface.co/transformers/model_doc/bert.html) and a general usage guide here (https://huggingface.co/transformers/quickstart.html).

What we're going to do is *fine-tune* the BERT model, i.e. update the weights of a pre-trained model. That is, we have a model that is trained on language modeling, but now we apply it to word sense disambiguation with the word representations it learnt from language modeling.

We'll use the same data splits for training and testing as before, but this time you'll not use a torchtext dataloader. Rather now you create an iterator that collects N sentences (where N is the batch size) then use the BertTokenizer to transform the sentence into integers. For your dataloader, remember to:
* Shuffle the data in each batch
* Make sure you get a new iterator for each *epoch*
* Create a vocabulary of *sense-labels* so you can calculate accuracy 

We then pass this batch into the BERT model and train as before. The BERT model will encode the sentence, then we send this encoded sentence into a prediction layer (you can either the the sentence-representation from bert, or the ambiguous word) like before and collect sense predictions.

About the hyperparameters and training:
* For BERT, usually a lower learning rate works best, between 0.0001-0.000001.
* BERT takes alot of resources, running it on CPU will take ages, utilize the GPUs :)
* Since BERT takes alot of resources, use a small batch size (4-8)
* Computing the BERT representation, make sure you pass the mask

**[10 marks]**

In [10]:
from torch.utils.data import Dataset
import pandas as pd

In [11]:
import transformers

2022-05-23 20:26:01.244884: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0


#### Huggingface documentation used:
+ https://huggingface.co/docs/transformers/preprocessing  
+ https://huggingface.co/docs/transformers/training  

#### Tutorials used:  
+ The dataset, dataloader, and the implementation (including certain decisions made there) were heavily inspired by the following video:  https://youtu.be/vNKIg8rXK6w?t=678, from the timestamped section to around 35:40. 

In [12]:
# we start with making a list of possible classes (word senses), so that it is
# the same no matter if some of the classes are missing from, say, the test set.
# We tried doing this with a set of classes, but this yielded inconsistent results when re-running the notebook with the same
# data splits, so despite iterating through a list being more time consuming, we settled on this one.
data = pd.read_csv('wsd_test.csv', header=None)
data2 = pd.read_csv('wsd_train.csv', header=None)
        
classes = []
for i in range(0,len(data)):
    item = data.iloc[i]
    word_sense = str(item[0])
    if word_sense not in classes:
        classes.append(word_sense)
    else:
        continue
        
for i in range(0,len(data2)):
    item = data2.iloc[i]
    word_sense = str(item[0])
    if word_sense not in classes:
        classes.append(word_sense)
    else:
        continue

In [13]:
print(classes)

['lead%2:41:12::', 'time%1:11:00::', 'line%1:09:00::', 'find%2:40:02::', 'position%1:26:00::', 'see%2:39:02::', 'line%1:04:01::', 'critical%5:00:00:indispensable:00', 'serve%2:41:00::', 'common%5:00:00:shared:00', 'extend%2:30:02::', 'keep%2:42:00::', 'extend%2:30:01::', 'life%1:28:01::', 'serve%2:41:02::', 'professional%5:00:00:white-collar:00', 'national%3:00:00::', 'build%2:36:00::', 'national%5:00:00:public:00', 'professional%3:00:02::', 'see%2:31:00::', 'hold%2:31:10::', 'case%1:11:00::', 'build%2:30:02::', 'line%1:15:02::', 'security%1:14:00::', 'serve%2:42:03::', 'time%1:28:05::', 'active%3:00:07::', 'positive%5:00:00:plus:00', 'force%1:04:01::', 'keep%2:41:03::', 'point%1:10:01::', 'build%2:36:09::', 'place%1:15:04::', 'bring%2:36:01::', 'find%2:31:10::', 'force%1:14:02::', 'serve%2:33:00::', 'physical%5:00:02:material:01', 'national%3:01:01::', 'active%3:00:03::', 'build%2:31:03::', 'place%1:04:01::', 'see%2:39:00::', 'lead%2:38:01::', 'follow%2:41:00::', 'position%1:04:01::',

In [14]:
class BERT_Dataset(Dataset):
    def __init__(self, data_path, tokenizer, attributes, max_token_len=128):
        self.data_path = data_path
        self.tokenizer = tokenizer
        self.attributes = list(attributes)
        self.max_token_len = max_token_len
        self.prepare_data()
        
    def prepare_data(self):
        data = pd.read_csv(self.data_path, header=None)
        self.data = data
        
    def __len__(self):
        return(len(self.data))
    
    def __getitem__(self, index):
        item = self.data.iloc[index]
        word_sense = str(item[0])
        ### the following are unnecessary
        #lemma = str(item[1]) 
        #position = int(item[2])
        context = str(item[3])
        attributes = torch.FloatTensor([1 if x == word_sense else 0 for x in self.attributes])
        tokens = self.tokenizer.encode_plus(context, 
                                            add_special_tokens=True, 
                                            return_tensors='pt',
                                            truncation=True, 
                                            max_length=self.max_token_len, 
                                            padding='max_length', 
                                            return_attention_mask=True)
        
        return {
            'input_ids': tokens.input_ids.flatten().to(device), 
            'attention_mask': tokens.attention_mask.flatten().to(device), 
            'labels': attributes.to(device)
        }
        

In [15]:
from transformers import BertTokenizerFast
model_name = 'bert-base-uncased'
tokenizer = BertTokenizerFast.from_pretrained(model_name)

In [16]:
from torch.utils.data import DataLoader

In [17]:
def dataloader_for_bert(path_to_file, tokenizer, classes, batch_size=4, max_token_len=128, epochs=3):
    dataset = BERT_Dataset(path_to_file, tokenizer, classes, max_token_len=max_token_len)
    loaders = []
    for i in range(0, epochs):
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        loaders.append(loader)
    
    if len(loaders) == 1:
        return loaders[0]
    else:
        return loaders

In [33]:
from transformers import BertModel

learning_rate = 0.000001
batch_size = 4
epochs = 3

In [34]:
class BERT_WSD(nn.Module):
    def __init__(self, model_name, classes):
        super().__init__()
        # your code goes here
        self.bert = BertModel.from_pretrained(model_name, return_dict=True)
        self.classifier = nn.Linear(self.bert.config.hidden_size, len(classes))

    def forward(self, batch):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels'] 
        #bert
        output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        #getting a sentence representation
        pooled_output = torch.mean(output.last_hidden_state, 1)  # so we get 1 representation for every sentence
        #classification
        predictions = self.classifier(pooled_output)
        
        return predictions

In [35]:
# IMPORTANT: the model that we fine-tuned is saved as trained_BERT.model, see below. Do not re-run this
# code if you need to access anything from the model. Instead, move below to where it is loaded before
# evaluation and run it from there.

loss_fn = nn.BCEWithLogitsLoss()
model = BERT_WSD(model_name, classes)
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
loaders = dataloader_for_bert('wsd_train.csv', tokenizer, classes, batch_size=batch_size, max_token_len=128, epochs=epochs)

for epoch in range(0, epochs):
    loader = loaders[epoch]
    total_loss = 0
    for i, batch in enumerate(loader):
        labels = batch['labels'] 
        output = model(batch)
        
        loss = loss_fn(output, labels)
        total_loss += loss.item()
        print_every=100
        if i%print_every==0:
            print(f'Batch {i} : Average Loss = {round(total_loss/(i+1),5)}')#, end='\r')
        
        # calculate gradients
        loss.backward()
        # update model weights
        optimizer.step()
        # reset gradients
        optimizer.zero_grad()

    print(f'Epoch {epoch} : Average Loss = {round(total_loss/(i+1),5)}')#, end='\r')
    
# test model after all epochs are completed

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Batch 0 : Average Loss = 0.69477
Batch 100 : Average Loss = 0.67001
Batch 200 : Average Loss = 0.62396
Batch 300 : Average Loss = 0.57158
Batch 400 : Average Loss = 0.53222
Batch 500 : Average Loss = 0.50294
Batch 600 : Average Loss = 0.48033
Batch 700 : Average Loss = 0.46211
Batch 800 : Average Loss = 0.44687
Batch 900 : Average Loss = 0.43371
Batch 1000 : Average Loss = 0.42208
Batch 1100 : Average Loss = 0.4116
Batch 1200 : Average Loss = 0.40203
Batch 1300 : Average Loss = 0.39317
Batch 1400 : Average Loss = 0.38489
Batch 1500 : Average Loss = 0.37712
Batch 1600 : Average Loss = 0.36976
Batch 1700 : Average Loss = 0.36278
Batch 1800 : Average Loss = 0.35612
Batch 1900 : Average Loss = 0.34974
Batch 2000 : Average Loss = 0.34362
Batch 2100 : Average Loss = 0.33773
Batch 2200 : Average Loss = 0.33206
Batch 2300 : Average Loss = 0.32658
Batch 2400 : Average Loss = 0.32128
Batch 2500 : Average Loss = 0.31616
Batch 2600 : Average Loss = 0.3112
Batch 2700 : Average Loss = 0.30638
Batch 

In [36]:
# This saves the fine-tuned model for later reusing; please avoid fully re-running the code above
torch.save(model, 'trained_BERT.model')

In [37]:
bert_loaded = torch.load('trained_BERT.model')

In [38]:
def find_wordsense(labels, classes):
    for i in range(0, len(classes)):
        if labels[i] == 1:
            return classes[i]

In [39]:
bert_loaded.eval()
test_loader = dataloader_for_bert('wsd_test.csv', tokenizer, classes, batch_size=1, max_token_len=128, epochs=1)
loss_fn = nn.BCEWithLogitsLoss()
total_acc = 0
total_samples = 0
test_loss = 0
per_wf_acc = {}

for ws in classes:
    wf = get_wf(ws)
    if wf not in per_wf_acc:
        per_wf_acc[wf] = {'sum': 0, 'count': 0}
    else:
        continue

for i, batch in enumerate(test_loader):
    labels = batch['labels'][0]
    ws = find_wordsense(labels, classes)
    wf = get_wf(ws)
    
    with torch.no_grad():
        predictions = bert_loaded(batch)[0]
     
    # calculating the loss 
    batch_loss = loss_fn(predictions, labels)
    test_loss += batch_loss.item()

    # one-hot encoding predictions
    top_pred = torch.max(predictions)
    one_hot_preds = []
    for prediction in predictions:
        if prediction == top_pred:
            one_hot_preds.append(1)
        else:
            one_hot_preds.append(0)
            
    # doing accuracy        
    if one_hot_preds == list(labels):
        total_acc += 1
        total_samples += 1
        per_wf_acc[wf]['sum'] += 1
        per_wf_acc[wf]['count'] +=1
    else:
        total_acc += 0
        total_samples += 1
        per_wf_acc[wf]['sum'] += 0
        per_wf_acc[wf]['count'] +=1
    
    
    print_every=1000
    if i%print_every==0:
        print(f'At batch {i} : Average Accuracy = {total_acc/total_samples}, Average Loss = {np.round(test_loss/(i+1), 4)}')#, end='\r')

per_wf_acc_calc = {}
for k,v in per_wf_acc.items():
    summed = v['sum']
    count = v['count']
    accuracy = summed / count
    per_wf_acc_calc[k] = accuracy
        
full_accuracy = total_acc / total_samples
    

print(f'Test set accuracy = {full_accuracy}')
print(f'Test set loss = {np.round(test_loss/(i+1), 4)}')
print('Per word form accuracy:')
for k,v in per_wf_acc_calc.items():
    print(f'\tThe accuracy for the word form "{k}" = {v}')

At batch 0 : Average Accuracy = 0.0, Average Loss = 0.021
At batch 1000 : Average Accuracy = 0.5404595404595405, Average Loss = 0.016
At batch 2000 : Average Accuracy = 0.5402298850574713, Average Loss = 0.0159
At batch 3000 : Average Accuracy = 0.5438187270909697, Average Loss = 0.0158
At batch 4000 : Average Accuracy = 0.5476130967258186, Average Loss = 0.0158
At batch 5000 : Average Accuracy = 0.5512897420515896, Average Loss = 0.0158
At batch 6000 : Average Accuracy = 0.5539076820529911, Average Loss = 0.0157
At batch 7000 : Average Accuracy = 0.5513498071704043, Average Loss = 0.0158
At batch 8000 : Average Accuracy = 0.5511811023622047, Average Loss = 0.0158
At batch 9000 : Average Accuracy = 0.5504943895122764, Average Loss = 0.0158
At batch 10000 : Average Accuracy = 0.5508449155084492, Average Loss = 0.0158
At batch 11000 : Average Accuracy = 0.5514044177802018, Average Loss = 0.0158
At batch 12000 : Average Accuracy = 0.5535372052328973, Average Loss = 0.0157
At batch 13000 :

# 3. Evaluation

Explain the difference between the first and second approach. What kind of representations are the different approaches using to predict word-senses? **[4 marks]**

LSTM - predicting for the word index: we look at the context, both sides, we predict a word at the given index; it's predicting the best fitting word if word senses were separate words. This is a bit like BOW modelling, but we have bidirectional context.
LSTM - final hidden state: we have representations of the whole sentence, from both directions, and we use that to try to predict the right word sense. It's a bit more like a classification task. 
Extra:
BERT - sentence representation (pooled word representations): similar to the final hidden state one, we have a representation of the sentence based on the words in the sentence, we try to predict a "class" based on that.

Evaluate your model with per-word-form *accuracy* and comment on the results you get, how does the model perform in comparison to the baseline, and how do the models compare to each other? 

Expand on the evaluation by sorting the word-forms by the number of senses they have. Are word-forms with fewer senses easier to predict? Give a short explanation of the results you get based on the number of senses per word.

**[6 marks]**

### BERT  
As seen in the code and results below, the trend is not 100% consistent, but it seems like the more senses a given word has, the better the model is at predicting them correctly, which I found to be surprising. It is the words with 10 or 11 forms that have the highest performance (70-80%), although this is by no means fully consistent across the board. I would have expected the words with fewer senses to be better recognized, as there are fewer senses to choose from, ergo, even when randomly selected, there is a higher chance of being correct. However, we always predict from all 222 classes, so that probably influences it. In addition, the words with more senses may be better featured in the data on which the model is fine-tuned, so it is just better at recognizing them.

In [40]:
words = {}
for cls in classes:
    form = get_wf(cls)
    if form not in words:
        words[form] = 1
    else:  # if it is
        words[form] += 1

sorted_words = dict(sorted(words.items(), key=lambda item: item[1]))  # now we have the word forms sorted by how many senses they have

for wf, nr_senses in sorted_words.items():
    print(f'The word form "{wf}", with {nr_senses} word senses had the accuracy of {per_wf_acc_calc[wf]}')

The word form "common", with 4 word senses had the accuracy of 0.4556213017751479
The word form "major", with 4 word senses had the accuracy of 0.42038216560509556
The word form "bad", with 4 word senses had the accuracy of 0.5457142857142857
The word form "time", with 5 word senses had the accuracy of 0.3688888888888889
The word form "critical", with 5 word senses had the accuracy of 0.5258064516129032
The word form "professional", with 5 word senses had the accuracy of 0.5242494226327945
The word form "active", with 5 word senses had the accuracy of 0.5905797101449275
The word form "positive", with 5 word senses had the accuracy of 0.3147410358565737
The word form "order", with 5 word senses had the accuracy of 0.6380697050938338
The word form "position", with 6 word senses had the accuracy of 0.5786026200873362
The word form "national", with 6 word senses had the accuracy of 0.3348519362186788
The word form "physical", with 6 word senses had the accuracy of 0.5888594164456233
The wo

How does the LSTMs perform in comparison to BERT? What's the difference between representations obtained by the LSTMs and BERT? **[2 marks]**

LSTMs are worse than BERT. 
For LSTMs we do word embeddings from scratch so they are worse than BERT ones, where we fine-tune the existing ones. 
(rephrase the first answer)

What could we do to improve our LSTM word sense disambiguation models and our BERT model? **[4 marks]**

LSTMs:
+ experimenting more with hyperparameters
+ combining the two approaches (use both the index prediction and the final hidden state to predict the word sense)

BERT:
+ we could test using the initial sentence representation (on the first token) and see if that gives better results

Both:
+ dropout
+ for now the models choose, regardless of the word we are predicting the sense of, a class/sense out of 222 available ones. This means that "keep" can get not only 'keep%2:41:03::' but also 'force%1:18:00::' predicted. Perhaps it would work better if the classifier could only pick from the word senses applicable for the given word. For now just choosing blindly gives us a 1 in 222 chance of being correct. If we narrow it down, we get a, say, 1 in 4 chance. That is a rather big change.

# Readings:

[1] Kågebäck, M., & Salomonsson, H. (2016). Word Sense Disambiguation using a Bidirectional LSTM. arXiv preprint arXiv:1606.03568.

[2] https://cl.lingfil.uu.se/~nivre/master/NLP-LexSem.pdf