# Word Sense Disambiguation using Neural Networks
Adam Ek

----------------------------------

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on [how to work on group assignments](https://github.com/sdobnik/computational-semantics/blob/master/README.md).

Write all your answers and the code in the appropriate boxes below.

----------------------------------

A problem with static distributional vectors is the difficulty of distinguishing between different *word senses*. We will continue our exploration of word vectors by considering *trainable vectors* or *word embeddings* for Word Sense Disambiguation (WSD).

The goal of word sense disambiguation is to train a model that identifies which sense of a word-form is intended when that form has several senses (for instance, homonyms). For example, the word "bank" can mean "sloping land" or "financial institution". 

(a) "I deposited my money in the **bank**" (financial institution)

(b) "I swam from the river **bank**" (sloping land)

In cases (a) and (b) we can determine the meaning of "bank" from the *context*. To utilize context in a semantic model we use *contextualized word representations*. Previously we worked with *static word representations*, i.e. representations that do not depend on the context. To illustrate, consider sentences (a) and (b): the word **bank** would have the same static representation in both sentences, which makes it difficult to predict its sense. What we want is to create representations that depend on the context, i.e. *contextualized embeddings*. 
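
To make the difference concrete, here is a toy sketch in plain Python. The two-dimensional vectors and the helper names `static_vector` and `contextual_vector` are invented for illustration, and averaging with the neighbouring words is only a crude stand-in for what an RNN does, not the method used in this lab.

```python
# Toy static embeddings (made-up 2-d vectors, for illustration only).
static = {
    "money": [1.0, 0.0],
    "river": [0.0, 1.0],
    "bank": [0.5, 0.5],
}

def static_vector(tokens, i):
    # A static representation ignores the rest of the sentence.
    return static[tokens[i]]

def contextual_vector(tokens, i):
    # Crude stand-in for contextualization: average the target vector
    # with the vectors of the other known words in the sentence.
    vecs = [static[tokens[i]]]
    vecs += [static[t] for j, t in enumerate(tokens) if j != i and t in static]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(2)]

a = "i deposited my money in the bank".split()
b = "i swam from the river bank".split()

print(static_vector(a, a.index("bank")) == static_vector(b, b.index("bank")))  # True
print(contextual_vector(a, a.index("bank")))  # [0.75, 0.25]
print(contextual_vector(b, b.index("bank")))  # [0.25, 0.75]
```

The static lookup returns the same vector in both sentences, while the context-dependent version moves "bank" towards "money" in (a) and towards "river" in (b).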

We will create contextualized embeddings with Recurrent Neural Networks. You can read more about recurrent neural networks [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/). Your overall task in this lab is to create a neural network model that can disambiguate the word sense of 30 different words. 

In [105]:
# first we import some packages that we need
import torch
import torch.nn as nn
import torchtext

# our hyperparameters (add more when/if you need them)
# fall back to the CPU when no GPU is available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

batch_size = 8
learning_rate = 0.001
epochs = 3

In [107]:
#other packages that we are going to use
import csv
import random
import math

import numpy as np
import pandas as pd

# 1. Working with data

A central part of any machine learning system is the data we're working with. In this section we will split the data (the dataset is located here: ``wsd-data/wsd_data.txt``) into a training set and a test set. We will also create a baseline to compare our model against. Finally, we will use TorchText to transform our data (raw text) into a convenient format that our neural network can work with.

## Data

The dataset we will use contains different word senses for 30 different words. The data is organized as follows (values separated by tabs): 
- Column 1: word-sense
- Column 2: word-form
- Column 3: index of word
- Column 4: white-space tokenized context
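
As a small illustration of this layout, the sketch below parses one row with the standard `csv` module. The row itself is made up (including the sense key), and `csv.QUOTE_NONE` is our own choice so that quotation marks inside a context are kept as ordinary characters.

```python
import csv
import io

# One made-up row in the four-column layout described above
# (sense, word form, target index, tokenized context).
sample = "bank%1:14:00::\tbank.n\t6\tI deposited my money in the bank\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
sense, form, index, context = next(reader)
tokens = context.split()      # the context is already white-space tokenized
target = tokens[int(index)]   # column 3 is a 0-based position in the context

print(sense, form, target)    # bank%1:14:00:: bank.n bank
```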

### Splitting the data

Your first task is to separate the data into a *training set* and a *test set*. The training set should contain 80% of the examples and the test set the remaining 20%. The examples for the test/training set should be selected **randomly**. Save each dataset into a .csv file for loading later. **[2 marks]**

#### Approach used for 2.2

In [3]:
def data_split(path_to_dataset):
    # your code goes here
    with open(path_to_dataset) as file:
        lines = file.readlines()
    
    clean_lines = []
    for line in lines:
        clean_line = line.strip().split("\t")
        clean_lines.append(clean_line)
        
    shuffled_lines = random.sample(clean_lines, len(clean_lines))
    
    # it will not be an exact 80/20 unless the set allows for it, this way the training set is marginally larger than 80%
    training_cutoff = math.ceil(len(clean_lines)*0.8)  
    
    training = shuffled_lines[:training_cutoff]
    testing = shuffled_lines[training_cutoff:]
        
    return training, testing

In [4]:
train, test = data_split('wsd_data.txt')

In [5]:
print(len(train))
print(len(test))

print(len(test)/(len(test)+len(train)))

60840
15209
0.1999894804665413


In [6]:
def data_save(train, test, train_file, test_file):
    
    with open(train_file, 'w', encoding='UTF8') as f:
        writer = csv.writer(f)
        writer.writerows(train)
        
    with open(test_file, 'w', encoding='UTF8') as f:
        writer = csv.writer(f)
        writer.writerows(test)

In [7]:
# I had to create the appropriate folder first on the server at least
data_save(train, test, 'wsd_data/wsd_train.csv', 'wsd_data/wsd_test.csv')

In [8]:
# read back the file written above (same path as in data_save)
with open('wsd_data/wsd_test.csv') as f:
    reader = csv.reader(f)
    rows = []
    for row in reader:
        rows.append(row)
        
for row in rows:
    if len(row) != 4:
        print("oopsie, your file does not work")
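
The comment above mentions that the output folder had to be created by hand. A possible way to avoid that step is sketched below; `save_rows` and `load_rows` are hypothetical helpers (not part of the lab code), and the example writes to a temporary directory so it does not touch the real data.

```python
import csv
import os
import tempfile

def save_rows(rows, path):
    # Create the parent folder if it is missing; exist_ok avoids an error on reruns.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # newline='' is what the csv module documentation recommends for csv files.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)

def load_rows(path):
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))

# A made-up row whose context contains commas and quotation marks.
rows = [["bank%1:14:00::", "bank.n", "8", 'He said , " meet me at the bank "']]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wsd_data", "demo.csv")
    save_rows(rows, path)
    restored = load_rows(path)

print(restored == rows)  # True
```

The round trip also shows that commas and quotation marks inside a context survive, because `csv.writer` quotes such fields and `csv.reader` undoes the quoting.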

#### Another approach, used for 2.1 

In [110]:
data_path = 'wsd_data.txt'
data = pd.read_csv(data_path, sep='\t', header=None)
data.columns = ['word-sense', 'word-form', 'index', 'context']
data.head()

| | word-sense | word-form | index | context |
|---|---|---|---|---|
| 0 | keep%2:42:07:: | keep.v | 15 | Action by the Committee In pursuance of its ma... |
| 1 | national%3:01:00:: | national.a | 25 | A guard of honour stood in formation in honour... |
| 2 | build%2:31:03:: | build.v | 38 | The principle that statistics should be timely... |
| 3 | place%1:04:00:: | place.n | 36 | Again , he appealed for additional support for... |
| 4 | position%1:04:01:: | position.n | 76 | Also , the IAEA has the lowest number of women... |


In [122]:
def data_split_lstm(path_to_dataset, frac=0.8, seed=None):
    # Load data
    data = pd.read_csv(path_to_dataset, sep='\t', header=None)
    data.columns = ['word-sense', 'word-form', 'index', 'context']

    data = data.sample(frac=1, random_state=seed)  # shuffle; pass a seed for a reproducible split
    N = int(len(data)*frac)
    train_data = data.iloc[0:N]
    test_data = data.iloc[N:]
    return train_data, test_data

train_data_lstm, test_data_lstm = data_split_lstm(data_path)

### Creating a baseline

Your second task is to create a *baseline* for the task. A baseline is a "reality check" for a model: given a very simple heuristic/algorithmic/model solution to the problem, can our neural network perform better than this?
The baseline you are to create is the "most common sense" (MCS) baseline. For each word form, find the sense most commonly assigned to it, and label every occurrence of the word with that sense. **[2 marks]**

E.g. in a fictional dataset, "bank" has two senses: "financial institution", which occurs 5 times, and "side of river", which occurs 3 times. Thus, all 8 occurrences of "bank" are labeled "financial institution", which yields an MCS accuracy of 5/8 = 62.5%. If a model obtains a higher score than this, we can conclude that the model is *at least* better than selecting the most frequent word sense.
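
The fictional example can be checked with a few lines of plain Python. This is only a sketch of the idea on the made-up "bank" data; the variable names are our own.

```python
from collections import Counter, defaultdict

# Fictional data from the example above: (sense, word form) pairs.
examples = [("financial institution", "bank")] * 5 + [("side of river", "bank")] * 3

# Count senses per word form, then keep the most frequent one.
counts = defaultdict(Counter)
for sense, form in examples:
    counts[form][sense] += 1
most_common = {form: c.most_common(1)[0][0] for form, c in counts.items()}

# Label every occurrence with the most common sense and measure accuracy.
correct = sum(sense == most_common[form] for sense, form in examples)
accuracy = correct / len(examples)
print(most_common["bank"], accuracy)  # financial institution 0.625
```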

#### Approach 1, with all the data pooled together

In [9]:
def mcs_baseline(data):
    # your code goes here
    # I assume we are to do it on all of the data.        
    sense_counts = {}
    for line in data:
        sense = line[0]
        form = line[1]
        if form not in sense_counts:
            sense_counts[form] = {}
        if sense not in sense_counts[form]:
            sense_counts[form][sense] = 1
        else:  # if they are in
            sense_counts[form][sense] += 1
            
    most_common_senses = {}
    for k,v in sense_counts.items():
        most_common_sense = max(v, key=v.get)
        most_common_senses[k] = most_common_sense
        
    accuracy = []
    for line in data:
        sense = line[0]
        form = line[1]
        if sense == most_common_senses[form]:
            accuracy.append(1)
        else:
            accuracy.append(0)
        
    rounded_acc = np.round(np.mean(accuracy), 3)
            
    return rounded_acc

In [10]:
full_data = train + test
accuracy = mcs_baseline(full_data)
print(accuracy)

0.32


#### Approach 2, with training data and test data split

In [112]:
def mcs_baseline_2(train_data, test_data):
    # most common sense (MCS) per word form, estimated on the training data only
    mcs_mapping = {}
    for word_form, df in train_data.groupby('word-form'):
        mcs_mapping[word_form] = df['word-sense'].value_counts().idxmax()

    # 'missing' in case (unlikely) all observations of one word-form end up in test_data.
    mcs_test_results = test_data['word-form'].apply(lambda x: mcs_mapping.get(x, 'missing'))

    accuracy = np.sum(mcs_test_results == test_data['word-sense'])/len(test_data)
    return accuracy

In [124]:
print(f'MCS Baseline Accuracy: {round(mcs_baseline_2(train_data_lstm, test_data_lstm)*100,2)}%')

MCS Baseline Accuracy: 31.34%


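These two approaches yield similar results (32.0% vs. 31.34%). They differ because approach 1 both selects the most common sense and evaluates it on the full dataset, whereas approach 2 selects it on the training split and evaluates it on the held-out test split, which is why we wanted to show both.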

### What happens if we remove the word from the word sense? 
Example: keep%2:42:07:: -> 2:42:07::

In [125]:
def remove_word(w):
    # keep%2:42:07:: -> 2:42:07::
    splt = w.split('%')
    if len(splt)!=2:
        raise Exception('Unknown word-sense format')
    return splt[-1]

print(len(np.unique(data['word-sense'])))
print(len(np.unique(data['word-sense'].apply(remove_word))))

222
135


In [126]:
data['word-sense-generalized'] = data['word-sense'].apply(remove_word)

Observation/Idea: Removing the word reduces the number of unique labels from 222 to 135, i.e. there is some overlap between the word-sense codes of different words. This suggests that we could improve generalization by removing the word from the word-sense label: the word itself is always known, so it can be re-added after prediction.
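
A minimal sketch of this idea follows. The helpers `split_sense` and `restore_sense` are hypothetical, and the sketch assumes that the lemma in the sense key equals the word-form without its part-of-speech suffix (as in `keep%2:42:07::` and `keep.v`).

```python
def split_sense(sense_key):
    # 'keep%2:42:07::' -> ('keep', '2:42:07::')
    lemma, _, code = sense_key.partition("%")
    if not lemma or not code:
        raise ValueError(f"Unknown word-sense format: {sense_key!r}")
    return lemma, code

def restore_sense(word_form, code):
    # 'keep.v' + '2:42:07::' -> 'keep%2:42:07::'
    lemma = word_form.rsplit(".", 1)[0]
    return f"{lemma}%{code}"

lemma, code = split_sense("keep%2:42:07::")
print(code)                           # 2:42:07::
print(restore_sense("keep.v", code))  # keep%2:42:07::
```

A model trained on the generalized codes would predict `code`, and `restore_sense` would turn the prediction back into a full sense key for evaluation.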

### Creating data iterators

To train a neural network, we first need to prepare the data. This involves converting words (and labels) to a number, and organizing the data into batches. We also want the ability to shuffle the examples such that they appear in a random order.  

To do all of this we will use the torchtext library (https://torchtext.readthedocs.io/en/latest/index.html). In addition to converting our data into numerical form and creating batches, it will generate a word and label vocabulary, and data iterators that can sort and shuffle the examples. 

Your task is to create a dataloader for the training and test set you created previously. So, how do we go about doing this?

1) First we create a ``Field`` for each of our columns. A field is an object that tokenizes the input, keeps a dictionary of word-to-number mappings, and handles padding. So, we need four fields: one for the word-sense, one for the position, one for the lemma and one for the context. 

2) After we have our fields, we need to process the data. For this we use the ``TabularDataset`` class. We pass the name and path of the training and test files we created previously, then we assign which field to use in each column. The result is that each column will be processed by the field indicated. So, the context column will be tokenized and processed by the context field and so on. 

3) After we have processed the dataset we need to build the vocabulary, for this we call the function ``build_vocab()`` on the different ``Fields`` with the output from ``TabularDataset`` as input. This looks at our dataset and creates the necessary vocabularies (word-to-number mappings). 

4) Finally, we pass the data objects given by the ``TabularDataset`` to the ``BucketIterator`` class. This class will organize our examples into batches and shuffle them (such that for each epoch the model observes the examples in a different order). When we are done with this we can let our function return the data iterators and vocabularies, and then we are ready to train and test our model!

Implement the dataloader. [**2 marks**]

*hint: for TabularDataset and BucketIterator use the class function splits()* 
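
To see what these steps amount to, the sketch below imitates them in plain Python: build a vocabulary, convert tokens to numbers, and pad a batch. The functions `build_vocab`, `numericalize` and `pad_batch` are our own illustrative helpers, not the torchtext API, and the two sentences are made up.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    # Special tokens first, then tokens by descending frequency (ties alphabetical).
    counts = Counter(tok for toks in token_lists for tok in toks)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens, stoi):
    # Unknown tokens fall back to the <unk> index.
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

def pad_batch(seqs, pad_idx):
    # Pad every sequence to the length of the longest one in the batch.
    width = max(len(s) for s in seqs)
    return [s + [pad_idx] * (width - len(s)) for s in seqs]

train_contexts = [["the", "river", "bank"], ["the", "bank", "of", "england"]]
stoi = build_vocab(train_contexts)
batch = pad_batch([numericalize(s, stoi) for s in train_contexts], stoi["<pad>"])

print(batch)                            # [[3, 6, 2, 1], [3, 2, 5, 4]]
print(numericalize(["unseen"], stoi))   # [0]
```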

#### Approach 1, using torchtext, not used in 2.1

In [127]:
def dataloader_torchtext(path):
    # your code goes here
    field_sense = torchtext.data.Field(lower=True, batch_first=True)
    field_lemma = torchtext.data.Field(lower=True, batch_first=True)
    field_position = torchtext.data.Field(lower=True, batch_first=True)
    field_context = torchtext.data.Field(lower=True, batch_first=True)
    
    train_dataset, test_dataset = torchtext.data.TabularDataset.splits(path,
                                                   train = 'wsd_train.csv',
                                                   test = 'wsd_test.csv',
                                                   format = 'csv', 
                                                   fields = [('sense', field_sense),
                                                    ('lemma', field_lemma),
                                                    ('position', field_position), 
                                                    ('context', field_context)],)
    # adding the csv_reader_params as per your POS tutorial caused issues, so we decided not to include that in this notebook.
    
    field_sense.build_vocab(train_dataset)
    field_lemma.build_vocab(train_dataset)
    field_position.build_vocab(train_dataset)
    field_context.build_vocab(train_dataset)
    
    train_iterator, test_iterator = torchtext.data.BucketIterator.splits((train_dataset, test_dataset), 
                                                                         batch_size = batch_size,
                                                                         # should this be set to true? what key? 
                                                                         # it did not work without it
                                                                         sort=False,
                                                                         repeat=True, 
                                                                         shuffle=True, 
                                                                         device=device)
    
    return train_iterator, test_iterator, field_sense, field_lemma, field_position, field_context

In [128]:
train, test, senses, lemmas, positions, contexts = dataloader_torchtext('wsd_data/')

In [129]:
print(len(senses.vocab))  # so I know it works :) 

224


#### Approach 2, using custom datasets and loaders (based on pytorch), used in part 2.1

In [130]:
from torch.utils.data import DataLoader, Dataset

In [131]:
# I implement a Dataset to keep track of vocab, word2idx, idx2word
# Dataset can also be used in DataLoader which gives batch loading, etc, for free.

class WSDDataset(Dataset):

    def __init__(self, data, unk_label='<unk>', pad_label='<pad>', generalize_word_sense=False):
        
        self.unk_idx, self.unk_label = 0, unk_label
        self.pad_idx, self.pad_label = 1, pad_label

        self.data = data.copy()
        self.data['context'] = self.data['context'].apply(self.tokenize)
        if generalize_word_sense:
            self.data['word-sense'] = self.data['word-sense'].apply(self.__remove_word)

        self.vocab = self.__unique_words()
        
        self.word2idx = dict()
        self.idx2word = dict()
        self.word2idx[self.unk_label] = self.unk_idx
        self.word2idx[self.pad_label] = self.pad_idx
        self.word2idx.update({word:idx+max(self.word2idx.values())+1 for idx, word in enumerate(self.vocab)})

        self.idx2word = {v:k for k,v in self.word2idx.items()}

        self.labels = list(np.unique(self.data['word-sense']))

    def __unique_words(self):
        all_words = []
        for s in self.data['context']:
            all_words += s
        return np.unique(all_words)
        
    def tokenize(self, string):
        # The tokenizer was given as a whitespace tokenizer
        return string.lower().split()

    # generalized word-sense tags
    def __remove_word(self, w):
        # keep%2:42:07:: -> 2:42:07::
        splt = w.split('%')
        if len(splt)!=2:
            raise Exception('Unknown word-sense format')
        return splt[-1]

    def __getitem__(self, idx):
        #x = self.data.iloc[0] #for test
        x = self.data.iloc[idx]
        out = (x['index'], x['context'], x['word-sense'])
        return out
        
    def __len__(self):
        return len(self.data)

In [132]:
from collections import namedtuple
from torch.nn.utils.rnn import pad_sequence 
#Q: Should the target word be included in contexts or not?

word_sense_to_idx = {k:v for v,k in enumerate(sorted(np.unique(data['word-sense'])))}
idx_to_word_sense = {v:k for k,v in word_sense_to_idx.items()}

class Collate():
    def __init__(self, word_to_idx, pad_idx=1, unk_idx=0, word_sense_to_idx=word_sense_to_idx):
        self.pad_idx = pad_idx
        self.unk_idx = unk_idx
        self.word_to_idx = word_to_idx
        self.word_sense_to_idx = word_sense_to_idx
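    def __call__(self, batch):
        # batch is a list of (index, tokens, word-sense) tuples; zip(*batch) regroups it
        # by field without building a ragged numpy array (np.transpose(batch) warns or
        # fails on ragged input, depending on the numpy version)
        indices, contexts, word_senses = zip(*batch)
        word_senses = [self.word_sense_to_idx[ws] for ws in word_senses]

        # map tokens to indices (unknown words -> unk_idx) and pad to the longest context
        contexts = [torch.tensor([self.word_to_idx.get(w, self.unk_idx) for w in s], device=device) for s in contexts]
        contexts = pad_sequence(contexts, batch_first=True, padding_value=self.pad_idx)

        return indices, contexts, word_senses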


def dataloader(dataset, batch_size=32, shuffle=True):
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        collate_fn=Collate(dataset.word2idx, dataset.pad_idx, dataset.unk_idx)
    )
    return loader


In [133]:
## -- testing dataset--
tt = train_data_lstm.iloc[0:4].copy()
display(tt)

tt_dset =  WSDDataset(tt)

print(tt_dset[0])

print(len(tt_dset))

print('uniq word sense', (tt_dset.labels))

print(len(tt_dset.word2idx))


| | word-sense | word-form | index | context |
|---|---|---|---|---|
| 59525 | line%1:04:01:: | line.n | 54 | Owing to insufficient information and the lack... |
| 45474 | point%1:10:01:: | point.n | 42 | There is an increasing recognition of the role... |
| 39271 | line%1:04:01:: | line.n | 29 | Paragraph Recommendation of the Board of Audit... |
| 49420 | active%3:00:01:: | active.a | 27 | Even his neck seemed thicker and therefore sho... |


(54, ['owing', 'to', 'insufficient', 'information', 'and', 'the', 'lack', 'of', 'technical', 'aerial', 'surveillance', 'equipment', ',', 'minurso', 'could', 'neither', 'confirm', 'nor', 'refute', 'these', 'allegations', '.', 'since', 'my', 'last', 'report', 'to', 'the', 'council', ',', 'minurso', 'has', 'reviewed', 'the', 'manner', 'in', 'which', 'restrictions', 'on', 'the', 'freedom', 'of', 'movement', 'of', 'minurso', 'military', 'observers', 'are', 'assessed', 'to', 'bring', 'it', 'more', 'in', 'line', 'with', 'the', 'exact', 'terms', 'of', 'military', 'agreement', 'no.', '1', '.', 'following', 'this', 'review', ',', 'in', 'october', '2007', ',', 'minurso', 'military', 'observers', 'began', 'recording', 'restrictions', 'on', 'their', 'freedom', 'to', 'visit', 'strong', 'points', 'and', 'units', 'as', 'freedom-of-movement', 'violations', '.'], 'line%1:04:01::')
4
uniq word sense ['active%3:00:01::', 'line%1:04:01::', 'point%1:10:01::']
155


In [134]:
tt2 = train_data_lstm.iloc[0:3].copy()
tt2 =  WSDDataset(tt2)

loader = dataloader(tt2, batch_size=3,shuffle=False)
for b in loader:
    #print()
    print(b)

## Note: I should probably group contexts of small, medium and large size together to reduce padding. I remember Nikolai talking about this in a previous course.
# It's possible that BucketIterator does this automatically, which would be an advantage of using it.

(array([54, 42, 29], dtype=object), tensor([[ 73, 115,  49,  48,  15, 109,  52,  70, 106,   9, 105,  37,   3,  58,
          32,  64,  30,  67,  86, 113,  13,   4, 101,  62,  53,  89, 115, 109,
          33,   3,  58,  43,  93, 109,  56,  46, 120,  90,  71, 109,  41,  70,
          60,  70,  58,  57,  68,  16,  18, 115,  25,  51,  59,  46,  54, 121,
         109,  38, 107,  70,  57,  11,  65,   5,   4,  40, 114,  92,   3,  46,
          69,   6,   3,  58,  57,  68,  23,  85,  90,  71, 110,  41, 115, 119,
         104,  76,  15, 116,  17,  42, 118,   4],
        [111,  50,  14,  47,  82,  70, 109,  97,  70, 109,  95, 115,  44, 122,
         109,  44,  15,  45,  96,  60,   4,  66,   3,  29, 102, 122, 109,  44,
          15,  45,  96,  60,  32,  15, 100,  22,  36,  61,  59,   3,   8,  34,
          75, 108, 114,  98,  91, 115,  99,   4,  28,  72, 109,  95, 115,  44,
           7, 109, 103,  80,   2,  31,   1,   1,   1,   1,   1,   1,   1,   1,
           1,   1,   1,   1,   1,   1,   1,  



In [136]:
print(b[2])
b[1].size()

[120, 161, 120]


torch.Size([3, 92])
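
The note above about grouping contexts of similar length can be sketched in plain Python. `bucketed_batches` and `padding_needed` are hypothetical helpers and the lengths are made up; the sketch only illustrates the idea and does not claim to reproduce what torchtext's ``BucketIterator`` does internally.

```python
import random

def bucketed_batches(lengths, batch_size, rng):
    # Sort example ids by length so each batch holds similar-sized contexts,
    # cut into batches, then shuffle the order of the batches.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    rng.shuffle(batches)
    return batches

def padding_needed(batches, lengths):
    # Number of pad positions when every batch is padded to its longest member.
    return sum(
        max(lengths[i] for i in b) * len(b) - sum(lengths[i] for i in b)
        for b in batches
    )

lengths = [92, 62, 75, 10, 12, 88]   # made-up context lengths
naive = [[0, 1, 2], [3, 4, 5]]       # batches in the original order
bucketed = bucketed_batches(lengths, 3, random.Random(0))

print(padding_needed(naive, lengths), padding_needed(bucketed, lengths))  # 201 123
```

On this toy input, bucketing reduces the padded positions from 201 to 123. Shuffling only the batch order keeps some randomness between epochs, although examples of similar length always end up together.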

# 2.1 Creating and running a Neural Network for WSD

In this section we will create and run a neural network to predict word senses based on *contextualized representations*.

### Model

We will use a bidirectional Long Short-Term Memory (LSTM) network to create a representation for the sentences and a linear classifier to predict the sense of each word.

When we initialize the model, we need a few things:

    1) An embedding layer: a dictionary from which we can obtain word embeddings
    2) An LSTM module to obtain contextual representations
    3) A classifier that computes scores for each word-sense given *some* input


The general procedure is the following:

    1) For each word in the sentence, obtain word embeddings
    2) Run the embedded sentences through the RNN
    3) Select the appropriate hidden state
    4) Predict the word-sense 

**Suggestion for efficiency:**  *Use a low dimensionality (32) for word embeddings and the LSTM when developing and testing the code, then scale up when running the full training/tests*
    
Your task is to create the two models described below; both follow the outlines above.

In the first approach to WSD, you are to select the index of our target word (column 3 in the dataset) and predict the word sense. **[5 marks]**
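
The selection step of this first approach can be illustrated without a network. In the sketch below, nested lists stand in for LSTM outputs laid out as [batch][time][features]; the numbers are arbitrary and only encode the (sentence, time step) position.

```python
# Stand-in for LSTM outputs with layout [batch][time][features].
outputs = [
    [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]],   # sentence 0, time steps 0..2
    [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]],   # sentence 1, time steps 0..2
]
indices = [2, 0]   # position of the ambiguous word in each sentence

# Approach 1: pick, per sentence, the output at the target position.
selected = [sentence[i] for sentence, i in zip(outputs, indices)]
print(selected)  # [[0.0, 2.0], [1.0, 0.0]]
```

Note that this pairing of sentences and indices only makes sense when the first dimension is the batch, so the LSTM outputs need to be in a batch-first layout.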


In [137]:
class WSDModel_approach1(nn.Module): 
    def __init__(self, word2idx, wordsense2idx, embedding_dim=32, hidden_size=128,padding_idx=1):
        super().__init__()
        self.vocab_size = len(word2idx)
        self.output_dim = len(wordsense2idx)
        self.hidden_size = hidden_size
        
        # your code goes here
        self.embeddings = nn.Embedding(self.vocab_size, embedding_dim, padding_idx=padding_idx)
        # batch_first=True because the contexts arrive padded as [batch, time]
        self.LSTM = nn.LSTM(input_size=embedding_dim, hidden_size=self.hidden_size, num_layers=1, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(self.hidden_size*2, self.output_dim)
    
    def forward(self, contexts, indices):
        # your code goes here
        embedded_contexts = self.embeddings(contexts)

        timestep_representation, *_ = self.LSTM(embedded_contexts)

        timestep_representation = torch.stack([tp[i] for i, tp in zip(indices, timestep_representation)])
        
        predictions = self.classifier(timestep_representation)
        
        return predictions

In the second approach to WSD, you are to predict the word sense based on the final hidden state given by the RNN. **[5 marks]**

In [None]:
### not tested yet!

# inspired by: https://www.analyticsvidhya.com/blog/2020/01/first-text-classification-in-pytorch/
class WSDModel_approach2(nn.Module):
    def __init__(self, num_words, num_labels, i_dim, o_dim):
        super().__init__()
        # matrix containing the word embeddings
        self.embeddings = nn.Embedding(num_words, i_dim)
        self.lstm = nn.LSTM(i_dim, o_dim, batch_first=True, bidirectional=True)  # rnn module
        # predict which class an example belongs to
        self.classifier = nn.Linear(o_dim*2, num_labels)
        # your code goes here
    
    def forward(self, batch):
        embedded_batch = self.embeddings(batch.context)
        contextualized_representation, (final_hidden, final_cell) = self.lstm(embedded_batch)
        # final_hidden = [num layers * num directions, batch size, hid dim] (also with batch_first=True)
        # final_cell = [num layers * num directions, batch size, hid dim]
        # in the nn.LSTM documentation: For bidirectional LSTMs, forward and backward are directions 0 and 1 respectively,
        # so I chose to use them instead of what in the example in the web page (mainly because I felt insecure about them)
        # with a single layer, index 0 is the forward and index 1 the backward final state
        hidden = torch.cat((final_hidden[0,:,:], final_hidden[1,:,:]), dim=1)
        predictions = self.classifier(hidden)
        
        return predictions
    
    # We can use the training and testing steps in POS model given by Adam to train this part
    # in case it doesn't work with what we already have now

### Training and testing the model

Now we are ready to train and test our model. What we need now is a loss function, an optimizer, and our data. 

- First, create the loss function and the optimizer.
- Next, we iterate over the number of epochs (i.e. how many times we let the model see our data). 
- For each epoch, iterate over the dataset (``train_iter``) to obtain batches. Use the batch as input to the model, and let the model output scores for the different word senses.
- For each model output, calculate the loss (and print the loss) on the output and update the model parameters.
- Reset the gradients and repeat.
- After all epochs are done, test your trained model on the test set (``test_iter``) and calculate the total and per-word-form accuracy of your model.

Implement the training and testing of the model **[4 marks]**

**Suggestion for efficiency:** *when developing your model, try training and testing the model on one or two batches (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*
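
The total and per-word-form accuracy asked for above can be computed from three parallel lists. The function `accuracies` below is a hypothetical helper and the labels are made up; it is a sketch of the bookkeeping, independent of the model.

```python
from collections import defaultdict

def accuracies(word_forms, gold, predicted):
    # Returns (total accuracy, {word form: accuracy}).
    hits, totals = defaultdict(int), defaultdict(int)
    for form, g, p in zip(word_forms, gold, predicted):
        totals[form] += 1
        hits[form] += int(g == p)
    per_form = {form: hits[form] / totals[form] for form in totals}
    total = sum(hits.values()) / sum(totals.values())
    return total, per_form

forms = ["bank.n", "bank.n", "keep.v", "keep.v"]
gold = ["s1", "s2", "s3", "s3"]
pred = ["s1", "s1", "s3", "s3"]
total, per_form = accuracies(forms, gold, pred)
print(total, per_form)  # 0.75 {'bank.n': 0.5, 'keep.v': 1.0}
```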

In [139]:
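train_dataset = WSDDataset(train_data_lstm)
test_dataset = WSDDataset(test_data_lstm)
# reuse the training vocabulary for the test set, so that word indices match the
# model's embedding layer; words unseen in training are mapped to <unk> by Collate
test_dataset.word2idx = train_dataset.word2idx
test_dataset.idx2word = train_dataset.idx2word

vocab = train_dataset.vocab
labels = train_dataset.labels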

In [142]:
import torch.optim as optim


params = {'lr':0.001, 'batch_size':128, 'hidden_size':128, 'embedding_dim':32, 'epochs':2}

# size the output layer by the global label mapping used in Collate,
# and move the model to the same device as the batches
model = WSDModel_approach1(train_dataset.word2idx, word_sense_to_idx,
                           embedding_dim=params['embedding_dim'],
                           hidden_size=params['hidden_size']).to(device)
loss_function = nn.CrossEntropyLoss(reduction='mean')
optimizer = optim.Adam(model.parameters(), lr=params['lr'])

for epoch in range(1,params['epochs']+1):
    
    train_iter = dataloader(train_dataset, batch_size=params['batch_size'])
    test_iter = dataloader(test_dataset, batch_size=128)
    total_loss = 0
    for i, batch in enumerate(train_iter):
        
        if True: #i < 50: # For testing
            contexts = batch[1]
            indices = batch[0]
            # torch.Tensor(..., device=device) is the legacy constructor; on cuda it raised
            # "RuntimeError: legacy constructor expects device type: cpu but device type: cuda was passed".
            # torch.tensor accepts both dtype and device.
            word_senses = torch.tensor(batch[2], dtype=torch.long, device=device)
            
            # send your batch of sentences to the model

            output = model(contexts, indices)
            
            loss = loss_function(output, word_senses)
            total_loss += loss.item()

            if i%10==0:
                print(f' Batch {i} : Average Loss = {round(total_loss/(i+1),5)}')#, end='\r')
            
            # calculate gradients
            loss.backward()
            # update model weights
            optimizer.step()
            # reset gradients
            optimizer.zero_grad()
        

    print(f'Epoch {epoch} : Average Training Loss = {round(total_loss/(i+1),5)}')#, end='\r')
    with torch.no_grad():
        correct = 0
        for batch in test_iter:
            test_output = model(batch[1], batch[0])
            test_output = torch.argmax(test_output, dim=1)
            targets = torch.tensor(batch[2], device=device)
            correct += torch.sum(test_output == targets).item()
        test_accu = correct/len(test_dataset)
    print(f'Epoch {epoch} : Test Accuracy = {round(float(test_accu), 5)}')

# 2.2 Running a transformer for WSD

In this section of the lab you'll try out the transformer, specifically the BERT model. For this we'll use the huggingface library (https://huggingface.co/).

You can find the documentation for the BERT model here (https://huggingface.co/transformers/model_doc/bert.html) and a general usage guide here (https://huggingface.co/transformers/quickstart.html).

What we're going to do is *fine-tune* the BERT model, i.e. update the weights of a pre-trained model. That is, we have a model that is trained on language modeling, but now we apply it to word sense disambiguation with the word representations it learnt from language modeling.

We'll use the same data splits for training and testing as before, but this time you'll not use a torchtext dataloader. Rather now you create an iterator that collects N sentences (where N is the batch size) then use the BertTokenizer to transform the sentence into integers. For your dataloader, remember to:
* Shuffle the data in each batch
* Make sure you get a new iterator for each *epoch*
* Create a vocabulary of *sense-labels* so you can calculate accuracy 

We then pass this batch into the BERT model and train as before. The BERT model will encode the sentence, then we send this encoded sentence into a prediction layer (you can use either the sentence representation from BERT, or the representation of the ambiguous word) like before and collect sense predictions.
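
The dataloader requirements listed above can be sketched in plain Python, leaving out the tokenizer and the model. `make_label_vocab` and `batch_iterator` are hypothetical helpers and the three examples are made up; calling `batch_iterator` again at the start of every epoch gives a freshly shuffled iterator.

```python
import random

def make_label_vocab(senses):
    # Sorted so that the label indices are identical on every run.
    return {s: i for i, s in enumerate(sorted(set(senses)))}

def batch_iterator(examples, batch_size, rng):
    # Call once per epoch: each call reshuffles and yields fresh batches.
    order = list(range(len(examples)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [examples[i] for i in order[start:start + batch_size]]

examples = [("bank%a", "sentence one"), ("bank%b", "sentence two"), ("bank%a", "sentence three")]
label_vocab = make_label_vocab(sense for sense, _ in examples)
rng = random.Random(0)

seen = []
for epoch in range(2):
    for batch in batch_iterator(examples, 2, rng):
        labels = [label_vocab[sense] for sense, _ in batch]
        sentences = [text for _, text in batch]
        # here the sentences would go through the tokenizer and the model
        seen.append((epoch, len(sentences), labels))

print(label_vocab)  # {'bank%a': 0, 'bank%b': 1}
```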

About the hyperparameters and training:
* For BERT, a lower learning rate usually works best, between 0.0001 and 0.000001.
* BERT takes a lot of resources; running it on CPU will take ages, so utilize the GPUs :)
* Since BERT takes a lot of resources, use a small batch size (4-8)
* When computing the BERT representation, make sure you pass the attention mask

**[10 marks]**
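
The mask mentioned in the last point marks which positions hold real tokens and which are padding. The sketch below builds it by hand on made-up token ids; `pad_with_mask` is a hypothetical helper and `pad_id=0` is an assumed default. In practice the tokenizer can return the mask itself.

```python
def pad_with_mask(id_lists, pad_id=0):
    # Pad every sequence to the longest one; the mask is 1 for real tokens
    # and 0 for padding, so the model can ignore the padded positions.
    width = max(len(ids) for ids in id_lists)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in id_lists]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in id_lists]
    return input_ids, attention_mask

# Made-up token ids, not real BERT vocabulary entries.
input_ids, attention_mask = pad_with_mask([[11, 12, 13, 14], [21, 22]])
print(input_ids)       # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```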

In [90]:
from torch.utils.data import Dataset
import pandas as pd

In [91]:
import transformers

#### Huggingface documentation used:
+ https://huggingface.co/docs/transformers/preprocessing  
+ https://huggingface.co/docs/transformers/training  

#### Tutorials used:  
+ The dataset, dataloader, and the implementation (including certain decisions made there) were heavily inspired by the following video:  https://youtu.be/vNKIg8rXK6w?t=678, from the timestamped section to around 35:40. 

In [92]:
# we start with making a list of possible classes (word senses), so that it is
# the same no matter if some of the classes are missing from, say, the test set.
data = pd.read_csv('wsd_data/wsd_test.csv', header=None)
data2 = pd.read_csv('wsd_data/wsd_train.csv', header=None)
        
classes_set = set()
for frame in (data, data2):
    # column 0 holds the word sense; a set ignores duplicates on its own
    classes_set.update(frame[0].astype(str))

# note: the order of list(set) can change between Python runs, so the class
# indices are only stable within one session; sorted(classes_set) would fix the order
classes = list(classes_set)

In [93]:
print(classes)

['national%5:00:00:domestic:00', 'common%5:00:00:shared:00', 'bad%3:00:00::', 'case%1:09:00::', 'build%2:36:00::', 'build%2:36:09::', 'security%1:04:00::', 'build%2:30:00::', 'serve%2:33:00::', 'follow%2:41:00::', 'critical%3:00:02::', 'force%1:04:01::', 'build%2:30:10::', 'point%1:15:00::', 'hold%2:31:10::', 'order%1:10:00::', 'serve%2:41:13::', 'line%1:04:00::', 'hold%2:31:01::', 'physical%5:00:00:natural:03', 'line%1:14:02::', 'order%1:26:00::', 'position%1:26:02::', 'force%1:07:00::', 'major%3:00:01::', 'lead%2:41:00::', 'place%1:15:04::', 'keep%2:40:09::', 'find%2:40:01::', 'hold%2:41:15::', 'see%2:31:00::', 'security%1:21:04::', 'point%1:09:02::', 'case%1:26:00::', 'find%2:32:01::', 'case%1:10:01::', 'keep%2:42:07::', 'see%2:39:02::', 'bring%2:35:04::', 'hold%2:36:00::', 'extend%2:40:05::', 'critical%5:00:00:indispensable:00', 'positive%5:00:00:formal:01', 'keep%2:35:10::', 'force%1:14:02::', 'place%1:10:02::', 'build%2:36:04::', 'extend%2:40:04::', 'follow%2:42:02::', 'find%2:40

In [94]:
class BERT_Dataset(Dataset):
    def __init__(self, data_path, tokenizer, attributes, max_token_len=128):
        self.data_path = data_path
        self.tokenizer = tokenizer
        self.attributes = list(attributes)
        self.max_token_len = max_token_len
        self.prepare_data()
        
    def prepare_data(self):
        # the file has no header row; columns are: word sense, lemma, position, context
        self.data = pd.read_csv(self.data_path, header=None)
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        item = self.data.iloc[index]
        word_sense = str(item[0])
        # the lemma (item[1]) and the position (item[2]) are not needed here
        context = str(item[3])
        # one-hot vector over all word senses, with a 1 at the gold sense
        attributes = torch.FloatTensor([1 if x == word_sense else 0 for x in self.attributes])
        tokens = self.tokenizer.encode_plus(context, 
                                            add_special_tokens=True, 
                                            return_tensors='pt',
                                            truncation=True, 
                                            max_length=self.max_token_len, 
                                            padding='max_length', 
                                            return_attention_mask=True)
        
        return {
            'input_ids': tokens.input_ids.flatten().to(device), 
            'attention_mask': tokens.attention_mask.flatten().to(device), 
            'labels': attributes.to(device)
        }
        

In [95]:
from transformers import BertTokenizerFast
model_name = 'bert-base-uncased'
tokenizer = BertTokenizerFast.from_pretrained(model_name)

In [96]:
from torch.utils.data import DataLoader

In [97]:
def dataloader_for_bert(path_to_file, tokenizer, classes, batch_size=4, max_token_len=128, epochs=3):
    """Return one shuffled DataLoader per epoch (a single loader if epochs == 1)."""
    dataset = BERT_Dataset(path_to_file, tokenizer, classes, max_token_len=max_token_len)
    loaders = [DataLoader(dataset, batch_size=batch_size, shuffle=True) for _ in range(epochs)]

    if len(loaders) == 1:
        return loaders[0]
    return loaders

In [98]:
from transformers import BertModel

learning_rate = 0.000001
batch_size = 4
epochs = 3

In [99]:
class BERT_WSD(nn.Module):
    def __init__(self, model_name, classes):
        super().__init__()
        # pre-trained BERT encoder followed by a linear layer over all word senses
        self.bert = BertModel.from_pretrained(model_name, return_dict=True)
        self.classifier = nn.Linear(self.bert.config.hidden_size, len(classes))

    def forward(self, batch):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        # contextualized token representations from BERT
        output = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # sentence representation: mean over all token positions (padding included)
        pooled_output = torch.mean(output.last_hidden_state, 1)
        # classification: one logit per word sense
        predictions = self.classifier(pooled_output)
        
        return predictions
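
Note that `torch.mean(output.last_hidden_state, 1)` averages over every position, including the padding tokens added by `padding='max_length'`. A common alternative is to average only over the real tokens by using the attention mask. The following is a minimal sketch of that idea on toy tensors. It is not the pooling used for the results reported in this notebook, and `masked_mean` is a hypothetical helper name.

```python
import torch


def masked_mean(last_hidden_state, attention_mask):
    """Average token vectors over non-padding positions only.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor, 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# toy batch: 2 sequences of length 4 with hidden size 3
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 0, 0],   # last two positions are padding
                     [1, 1, 1, 1]])  # no padding

pooled = masked_mean(hidden, mask)
print(pooled)
```

For a sequence without padding, the result is the same as the plain mean. For padded sequences, the padding vectors no longer dilute the sentence representation.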

In [100]:
# IMPORTANT: the model that we fine-tuned is saved as trained_BERT.model, see below. Do not re-run this
# code if you need to access anything from the model. Instead, move below to where it is loaded before
# evaluation and run it from there.

loss_fn = nn.BCEWithLogitsLoss()
model = BERT_WSD(model_name, classes)
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
loaders = dataloader_for_bert('wsd_data/wsd_train.csv', tokenizer, classes, batch_size=batch_size, max_token_len=128, epochs=epochs)

print_every = 100

for epoch in range(epochs):
    loader = loaders[epoch]
    total_loss = 0
    for i, batch in enumerate(loader):
        labels = batch['labels']
        output = model(batch)

        loss = loss_fn(output, labels)
        total_loss += loss.item()
        if i % print_every == 0:
            print(f'Batch {i} : Average Loss = {round(total_loss/(i+1),5)}')

        # calculate gradients
        loss.backward()
        # update model weights
        optimizer.step()
        # reset gradients
        optimizer.zero_grad()

    print(f'Epoch {epoch} : Average Loss = {round(total_loss/(i+1),5)}')

# the model is evaluated on the test set in a later cell, after it is loaded from disk

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Batch 0 : Average Loss = 0.70404
Batch 100 : Average Loss = 0.67569
Batch 200 : Average Loss = 0.62259
Batch 300 : Average Loss = 0.56707
Batch 400 : Average Loss = 0.52717
Batch 500 : Average Loss = 0.49792
Batch 600 : Average Loss = 0.47547
Batch 700 : Average Loss = 0.45743
Batch 800 : Average Loss = 0.44239
Batch 900 : Average Loss = 0.4294
Batch 1000 : Average Loss = 0.41794
Batch 1100 : Average Loss = 0.40762
Batch 1200 : Average Loss = 0.39819
Batch 1300 : Average Loss = 0.38947
Batch 1400 : Average Loss = 0.38134
Batch 1500 : Average Loss = 0.37369
Batch 1600 : Average Loss = 0.36645
Batch 1700 : Average Loss = 0.35958
Batch 1800 : Average Loss = 0.35302
Batch 1900 : Average Loss = 0.34674
Batch 2000 : Average Loss = 0.34071
Batch 2100 : Average Loss = 0.3349
Batch 2200 : Average Loss = 0.32931
Batch 2300 : Average Loss = 0.32391
Batch 2400 : Average Loss = 0.31869
Batch 2500 : Average Loss = 0.31364
Batch 2600 : Average Loss = 0.30874
Batch 2700 : Average Loss = 0.30399
Batch 

In [101]:
# This saves the fine-tuned model for later reuse; please avoid fully re-running the code above
torch.save(model, 'trained_BERT.model')

In [102]:
bert_loaded = torch.load('trained_BERT.model')

In [103]:
def find_wordform(labels, classes):
    # return the class (sense key) at the position where the one-hot label vector is 1
    for i in range(0, len(classes)):
        if labels[i] == 1:
            return classes[i]

In [104]:
bert_loaded.eval()
test_loader = dataloader_for_bert('wsd_data/wsd_test.csv', tokenizer, classes, batch_size=1, max_token_len=128, epochs=1)
total_acc = 0
total_samples = 0
test_loss = 0
per_wf_acc = {}

for wf in classes:
    per_wf_acc[wf] = {'sum': 0, 'count': 0}

for i, batch in enumerate(test_loader):
    labels = batch['labels'][0]
    wf = find_wordform(labels, classes)
    
    with torch.no_grad():
        predictions = bert_loaded(batch)[0]
     
    # calculating the loss 
    batch_loss = loss_fn(predictions, labels)
    test_loss += batch_loss.item()

    # a prediction is correct if the highest-scoring class is the gold class
    correct = int(torch.argmax(predictions) == torch.argmax(labels))

    # doing accuracy
    total_acc += correct
    total_samples += 1
    per_wf_acc[wf]['sum'] += correct
    per_wf_acc[wf]['count'] += 1
    
    
    print_every = 1000
    if i % print_every == 0:
        print(f'At batch {i} : Average Accuracy = {total_acc/total_samples}, Average Loss = {np.round(test_loss/(i+1), 4)}')

per_wf_acc_calc = {}
for k, v in per_wf_acc.items():
    summed = v['sum']
    count = v['count']
    # classes that never occur in the test set have no accuracy to report
    if count == 0:
        continue
    per_wf_acc_calc[k] = summed / count
        
full_accuracy = total_acc / total_samples
    

print(f'Test set accuracy = {full_accuracy}')
print(f'Test set loss = {np.round(test_loss/(i+1), 4)}')
print('Per word form accuracy:')
for k,v in per_wf_acc_calc.items():
    print(f'\tThe accuracy for the word form {k} = {v}')

At batch 0 : Average Accuracy = 1.0, Average Loss = 0.0098
At batch 1000 : Average Accuracy = 0.5304695304695305, Average Loss = 0.016
At batch 2000 : Average Accuracy = 0.5227386306846576, Average Loss = 0.016
At batch 3000 : Average Accuracy = 0.5348217260913029, Average Loss = 0.0157
At batch 4000 : Average Accuracy = 0.5396150962259435, Average Loss = 0.0156
At batch 5000 : Average Accuracy = 0.5416916616676665, Average Loss = 0.0156
At batch 6000 : Average Accuracy = 0.544075987335444, Average Loss = 0.0156
At batch 7000 : Average Accuracy = 0.5437794600771318, Average Loss = 0.0156
At batch 8000 : Average Accuracy = 0.545681789776278, Average Loss = 0.0156
At batch 9000 : Average Accuracy = 0.5454949450061104, Average Loss = 0.0156
At batch 10000 : Average Accuracy = 0.5483451654834517, Average Loss = 0.0155
At batch 11000 : Average Accuracy = 0.548677392964276, Average Loss = 0.0155
At batch 12000 : Average Accuracy = 0.5467877676860262, Average Loss = 0.0155
At batch 13000 : Av

# 3. Evaluation

Explain the difference between the first and the second approach. What kind of representations do the different approaches use to predict word senses? **[4 marks]**

Evaluate your model with per-word-form *accuracy* and comment on the results you get. How does the model perform in comparison to the baseline, and how do the models compare to each other?

Expand on the evaluation by sorting the word-forms by the number of senses they have. Are word-forms with fewer senses easier to predict? Give a short explanation of the results you get based on the number of senses per word.

**[6 marks]**
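
One way to approach the sorting step is to group the sense keys by lemma, count the senses per lemma, and aggregate the accuracy per lemma. The sketch below assumes that the lemma is the part of the sense key before `%` (as in `build%2:36:00::`) and that per-sense counts are stored in the `{'sum': ..., 'count': ...}` format used by `per_wf_acc` in the evaluation cell. The counts are made up for illustration, and `accuracy_by_sense_count` is a hypothetical helper.

```python
from collections import defaultdict


def accuracy_by_sense_count(per_sense_stats):
    """Aggregate per-sense correct/total counts into per-lemma accuracy.

    Returns (lemma, number_of_senses, accuracy) tuples sorted by the number
    of senses, then by lemma. Lemmas without any test examples are skipped.
    """
    per_lemma = defaultdict(lambda: {'sum': 0, 'count': 0, 'senses': 0})
    for sense_key, stats in per_sense_stats.items():
        lemma = sense_key.split('%')[0]
        per_lemma[lemma]['sum'] += stats['sum']
        per_lemma[lemma]['count'] += stats['count']
        per_lemma[lemma]['senses'] += 1

    rows = [
        (lemma, s['senses'], s['sum'] / s['count'])
        for lemma, s in per_lemma.items()
        if s['count'] > 0
    ]
    return sorted(rows, key=lambda row: (row[1], row[0]))


# made-up counts, only to show the shape of the input
toy_stats = {
    'bad%3:00:00::': {'sum': 8, 'count': 10},
    'case%1:09:00::': {'sum': 3, 'count': 5},
    'case%1:26:00::': {'sum': 1, 'count': 5},
    'build%2:36:00::': {'sum': 2, 'count': 4},
    'build%2:30:00::': {'sum': 1, 'count': 4},
    'build%2:36:09::': {'sum': 0, 'count': 2},
}

rows = accuracy_by_sense_count(toy_stats)
for lemma, n_senses, acc in rows:
    print(f'{lemma}: {n_senses} senses, accuracy = {acc:.2f}')
```

The number of senses counted this way is the number of senses present in the class list, which may be smaller than the number of senses the word has in a full sense inventory.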

How do the LSTMs perform in comparison to BERT? What's the difference between representations obtained by the LSTMs and BERT? **[2 marks]**

What could we do to improve our LSTM word sense disambiguation models and our BERT model? **[4 marks]**

# Readings:

[1] Kågebäck, M., & Salomonsson, H. (2016). Word Sense Disambiguation using a Bidirectional LSTM. arXiv preprint arXiv:1606.03568.

[2] https://cl.lingfil.uu.se/~nivre/master/NLP-LexSem.pdf