# Sense Embeddings for Word Sense Disambiguation


In [0]:
# Install necessary libraries at the outset
!pip install -q gensim
!pip install -q conllu
!pip install torch==1.2



In [0]:
import collections
import copy
import gensim
import numpy
import os
import random
import torch

import torch.nn as nn

from conllu import parse
from sklearn.metrics.pairwise import cosine_similarity
from torch.utils.data import Dataset

In [0]:
from google.colab import drive
drive.mount('/content/drive')



In [0]:
embeddings_path = "/content/drive/My Drive/Session-2A/Embeddings/WordNet-Wikipedia-embeddings.txt"
train_data_path = "/content/drive/My Drive/Session-2A/Data/semcor.tsv" 
test_data_path = "/content/drive/My Drive/Session-2A/Data/senseval2.tsv" 
lexicon_path = "/content/drive/My Drive/Session-2A/Data/wn30.txt" 
f_sensekey2synset = "/content/drive/My Drive/Session-2A/Data/sensekey2synset.pkl" 


# Simple mapping between POS tags
POS_MAP = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}
CUSTOM_FIELDS = ('form', 'lemma', 'pos', 'synsets')

## Load the embeddings
We want to load the pretrained embeddings for synsets and lemmas. This includes getting the vectors, as well as dictionaries mapping integer indices to the string representation of words, and 
vice versa. We will use the gensim library to load the pretrained model.
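The index bookkeeping described above can be illustrated on a toy vocabulary. This is a minimal sketch with made-up words and vectors (all names prefixed with `toy_` and the helper `lookup` are hypothetical, not part of the notebook); it only shows how a padding row at index 0 and an `<UNK>` row keep the dictionaries and the embedding matrix aligned.

```python
import numpy

# Toy stand-ins for the pretrained vocabulary and vectors (made-up values)
toy_words = ["bear", "00000001-n", "sleep"]
toy_vectors = numpy.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], dtype=numpy.float32)

# Reserve index 0 for padding and add a final index for unknown inputs
toy_id2src = ["<PAD>"] + toy_words + ["<UNK>"]
toy_src2id = {word: idx for idx, word in enumerate(toy_id2src)}

# Add matching zero rows, so that row i of the matrix belongs to toy_id2src[i]
zero_row = numpy.zeros((1, toy_vectors.shape[1]), dtype=numpy.float32)
toy_matrix = numpy.concatenate((zero_row, toy_vectors, zero_row))

def lookup(word):
  """Return the vector for a word, falling back to the <UNK> row."""
  return toy_matrix[toy_src2id.get(word, toy_src2id["<UNK>"])]

print(toy_src2id["bear"], lookup("bear"))
print(lookup("unseen-word"))
```

The point of the sketch is that every integer ID must index the same position in both the word list and the matrix; shifting one without the other silently returns the wrong vector.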


In [0]:
def load_embeddings(embeddings_path):
  # Load with gensim
  # Input file can be in binary or text format (here: text)
  model = gensim.models.KeyedVectors.load_word2vec_format(embeddings_path, 
                                                          binary=False, 
                                                          datatype=numpy.float32)
  embeddings = model.vectors
  zeros = numpy.zeros(len(embeddings[0]), dtype=numpy.float32)
  # Gensim provides a list mapping ints to words (strings)
  # (note that gensim >= 4.0 renamed index2word to index_to_key)
  # Put the padding symbol at index 0, which shifts all words by one
  id2src = ["<PAD>"] + list(model.index2word)
  # We want to be able to go from words to ints as well
  src2id = {v: k for k, v in enumerate(id2src)}
  # Insert a zero vector for the padding symbol
  embeddings = numpy.insert(embeddings, 0, copy.copy(zeros), axis=0)
  # Make sure we have a vector for unknown inputs
  if "<UNK>" not in src2id:
      if "unk" in src2id:
          # Reuse the pretrained "unk" vector under the name "<UNK>"
          src2id["<UNK>"] = src2id["unk"]
          id2src[src2id["<UNK>"]] = "<UNK>"
          del src2id["unk"]
      else:
          unk = numpy.zeros(len(embeddings[0]), dtype=numpy.float32)
          src2id["<UNK>"] = len(src2id)
          id2src.append("<UNK>")
          embeddings = numpy.concatenate((embeddings, [unk]))
  return embeddings, src2id, id2src
  
embeddings, src2id, id2src = load_embeddings(embeddings_path)
embeddings = torch.Tensor(embeddings)
embeddings_dim = embeddings.shape[1]



## Constructing a dictionary of the lexicon
In order to do WSD with a predefined set of word senses, we first need a
suitable representation of the lexicon. We get ours from 
the WordNet (WN) resource (http://wordnetweb.princeton.edu/perl/webwn). We will
read a WN dictionary file and extract a mapping between each lemma in the 
dictionary and all of its possible word senses (synsets). Note that the
function below allows us to include POS information per lemma, i.e. to
differentiate between instances of the same string when they are used as 
different POS categories. E.g. "sleep" as a verb will match only verbal synsets
in the dictionary, and "sleep" as a noun will match only nominal synsets.

*Sleep-n:*

* sleep, slumber (a natural and periodic state of rest during which consciousness of the world is suspended) "he didn't get enough sleep last night"; "calm as a child in dreamless slumber"
* sleep, sopor (a torpid state resembling deep sleep)
* sleep, nap (a period of time spent sleeping) "he felt better after a little sleep"; "there wasn't time for a nap"
* rest, eternal rest, sleep, eternal sleep, quietus (euphemisms for death (based on an analogy between lying in a bed and in a tomb)) "she was laid to rest beside her husband"; "they had to put their family pet to sleep"

*Sleep-v:*

* sleep, kip, slumber, log Z's, catch some Z's (be asleep)
* sleep (be able to accommodate for sleeping) "This tent sleeps six people"


Whether we want to use this option depends on whether POS information is available 
from a previous processing step. In our case it is (encoded in the data sets),
so we will use it.
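The effect of the POS option can be seen on a couple of made-up dictionary lines. The sketch below is only an illustration: the synset IDs are invented, and `build_lexicon` is a hypothetical helper that mimics the `lemma synset:count ...` layout, not the notebook's own loader.

```python
import collections

# Two made-up dictionary lines in the "lemma synset:count ..." layout
toy_lines = [
  "sleep 00000001-n:5 00000002-n:1 00000003-v:7",
  "tent 00000004-n:2",
]

def build_lexicon(lines, pos_filter):
  """Map each lemma (optionally suffixed with its POS) to its synset IDs."""
  lexicon = collections.defaultdict(list)
  for line in lines:
    lemma, *entries = line.split()
    for entry in entries:
      synset = entry.split(":")[0]
      # The last character of the synset ID encodes the POS, e.g. "n" or "v"
      key = (lemma + "-" + synset[-1]) if pos_filter else lemma
      lexicon[key].append(synset)
  return dict(lexicon)

with_pos = build_lexicon(toy_lines, pos_filter=True)
without_pos = build_lexicon(toy_lines, pos_filter=False)
print(with_pos["sleep-v"])   # only the verbal candidate
print(without_pos["sleep"])  # nominal and verbal candidates together
```

With the POS suffix the candidate list for a tagged token is shorter, which makes the later choice among synsets easier.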

In [0]:
def get_wordnet_lexicon(lexicon_path, pos_filter=False):
  lemma2synsets = {}
  with open(lexicon_path, "r") as lexicon:
    # Example line from the dictionary file:
    # language 06282651-n:48 07109196-n:5 07051975-n:2 05808557-n:1 05650820-n:1 06304059-n:0  
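    for line in lexicon:
      # split() without arguments also drops the newline and any empty fields
      fields = line.split()
      if not fields:
        continue
      lemma_base, synsets = fields[0], fields[1:]
      for entry in synsets:
        # Keep only the synset ID, e.g. "06282651-n" (drop the ":48" suffix)
        synset = entry[:10].strip()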
        if pos_filter:
          pos = synset[-1]
          lemma = lemma_base + "-" + pos
        else:
          lemma = lemma_base
        if lemma not in lemma2synsets:
          lemma2synsets[lemma] = [synset]
        else:
          lemma2synsets[lemma].append(synset)
  lemma2synsets = collections.OrderedDict(sorted(lemma2synsets.items()))
  return lemma2synsets
  
lemma2synsets = get_wordnet_lexicon(lexicon_path, pos_filter=True)

In [0]:
### TO DO 1 ###
### Find the synsets IDs matching the lemma "bear" ###
### Get a dictionary that discriminates between "bear" as verb, and as noun ###

## Dataset formats
WSD data comes in various formats. One of the most popular formats lately has been that provided by the Universal Evaluation Framework (UEF): http://lcl.uniroma1.it/wsdeval/. The training and test datasets are available in UEF format in the Drive folder, in case you want to play with them, but today we are going to use a simpler format with tab-separated columns holding the following features per word:

* wordform    
* lemma    
* POS    
* synset-ID
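To make the column layout concrete, here is a small sketch that reads a made-up snippet in this format. It assumes that sentences are separated by an empty line, that `_` marks tokens without a sense label, and that multiple gold synsets are comma-separated; the tokens, synset IDs, and the helper `read_tsv` are hypothetical.

```python
# A made-up snippet: one token per line, columns separated by tabs
toy_tsv = (
  "The\tthe\tDET\t_\n"
  "tent\ttent\tNOUN\t00000004-n\n"
  "sleeps\tsleep\tVERB\t00000003-v,00000005-v\n"
  "six\tsix\tNUM\t_\n"
  "\n"
  "He\the\tPRON\t_\n"
  "slept\tsleep\tVERB\t00000006-v\n"
)

def read_tsv(text):
  """Split the text into sentences, each a list of token dictionaries."""
  sentences = []
  for block in text.strip().split("\n\n"):
    tokens = []
    for line in block.split("\n"):
      form, lemma, pos, synsets = line.split("\t")
      tokens.append({"form": form, "lemma": lemma, "pos": pos,
                     # None for untagged tokens, else a list of gold synsets
                     "synsets": None if synsets == "_" else synsets.split(",")})
    sentences.append(tokens)
  return sentences

toy_sentences = read_tsv(toy_tsv)
print(len(toy_sentences), toy_sentences[0][2])
```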

The .tsv files are already in the folder, but the transformation function is included here. You can examine and use it to transform the rest of the UEF files, if you want to play with them later on (you can download those from the UEF website). Note that you will need to provide the filepath to the sensekey2synset dictionary -- this is just a mapping between different IDs in WordNet.

In [0]:
import pickle
import xml.etree.ElementTree as ET

def transform_uef2tsv(path_to_data, path_to_keys, output_path,
                      path_to_sensekey2synset=f_sensekey2synset):
  # Mapping from WordNet sense keys to synset IDs
  with open(path_to_sensekey2synset, "rb") as f:
    sensekey2synset = pickle.load(f)
  # Mapping from instance IDs to their gold sense keys
  codes2keys = {}
  with open(path_to_keys, "r") as f_codes2keys:
    for line in f_codes2keys:
      fields = line.strip().split()
      if fields:
        codes2keys[fields[0]] = fields[1:]
  root = ET.parse(path_to_data).getroot()
  sentence_str = []
  # iter() finds the sentences at any depth, so every corpus and text is covered
  for sentence in root.iter("sentence"):
    this_sent = ""
    # Every element inside a sentence holds one token
    for element in sentence.findall(".//*"):
      wordform = element.text
      lemma = element.get("lemma")
      pos = element.get("pos")
      if element.tag == "instance":
        # Only "instance" elements carry a sense annotation
        synsets = [sensekey2synset[key] for key in codes2keys[element.get("id")]]
      else:
        synsets = ["_"]
      this_sent += "\t".join([wordform, lemma, pos, ",".join(synsets)]) + "\n"
    sentence_str.append(this_sent)
  # Sentences are separated by an empty line
  dataset_str = "\n".join(sentence_str)
  # Name the output after the input file, e.g. "corpus.data.xml" -> "corpus.tsv"
  basename = os.path.basename(path_to_data).split(".")[0]
  with open(os.path.join(output_path, basename + ".tsv"), "w") as f:
    f.write(dataset_str)

## Creating Dataset objects
Now we need to read the data that we will be using for training and testing.
PyTorch has the class Dataset which allows for data to be sampled on the fly,
so that not all of the information needs to be kept in memory. We will extend 
the Dataset class to WSDataset, which simply parses the .tsv data, and then 
can supply samples from it which contain all the features needed by the 
neural network to train and classify senses. 

The helper function parse_tsv is used to read the .tsv file and the Sample
class simply stores the different kinds of information per sample from the data 
set.
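Two ingredients of each sample are padding to a fixed length and a mask that marks which positions carry a sense label. The sketch below shows both in plain Python, under the assumption that the sentence is not longer than `max_length`; the helper `pad_sample` and the example tokens are hypothetical.

```python
def pad_sample(lemmas, synsets, max_length):
  """Pad one sentence to max_length and mark the sense-tagged positions."""
  length = len(lemmas)
  padding = max_length - length
  padded_lemmas = lemmas + ["<PAD>"] * padding
  padded_synsets = synsets + ["_"] * padding
  # True only where a gold synset is present; padding is never selected
  mask = [label != "_" for label in padded_synsets]
  return {"lemmas": padded_lemmas, "synsets": padded_synsets,
          "length": length, "mask": mask}

toy_sample = pad_sample(["the", "tent", "sleep", "six"],
                        ["_", "00000004-n", "00000003-v", "_"],
                        max_length=6)
print(toy_sample["mask"])
```

Keeping the true `length` next to the padded lists is what later allows the padded positions to be skipped by the recurrent layer.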


In [0]:
class WSDataset(Dataset):

  def __init__(self, tsv_file, src2id, embeddings, embeddings_dim, lemma2synsets):
    # Our data has some pretty long sentences, so we will set a large max length
    # Alternatively, we can throw them out or truncate them
    with open(tsv_file, "r") as f:
      self.data = parse_tsv(f.read(), max_length=300)
    self.src2id = src2id
    self.embeddings = embeddings
    self.embeddings_dim = embeddings_dim
    self.lemma2synsets = lemma2synsets

  def __len__(self):
      return len(self.data)

  def __getitem__(self, idx):
    ''' Prepare one sample sentence of the data '''
    # Get a sentence
    sample = self.data[idx]
    # Get an integer ID for each lemma in the sentence (<UNK> if unfamiliar)
    # Note that we are working with lemmas for the input, not the word forms
    inputs = [self.src2id[lemma] if lemma in self.src2id 
              else self.src2id["<UNK>"] for lemma in sample.lemmas]
    targets, neg_targets, mask = [], [], []
    for i, label in enumerate(sample.synsets):
      target, neg_target = torch.zeros(self.embeddings_dim), torch.zeros(self.embeddings_dim)
      if label == "_":
        mask.append(False)
      else:
        mask.append(True)
        lemma_pos = sample.lemmas[i] + "-" + POS_MAP[sample.pos[i]] # e.g. "bear-n"
        # Take care of cases of multiple labels, e.g. "01104026-a,00357790-a"
        these_synsets = label.split(",")
        for synset in these_synsets:
          if synset in self.src2id:
            synset_embedding = torch.Tensor(self.embeddings[self.src2id[synset]])
            target += synset_embedding
        target /= len(these_synsets)
        # Pick negative targets too
        # Copy the list of synsets, so that we don't change the dict
        neg_options = copy.copy(self.lemma2synsets[lemma_pos])
        for synset in these_synsets:
          # Get rid of the gold synsets (skip any that the lexicon does not list)
          if synset in neg_options:
            neg_options.remove(synset)
        while True:
          # If no synsets remain in the list, pick any synset at random
          if len(neg_options) == 0:
            neg_synset = random.choice(list(self.src2id))
            break
          neg_synset = random.choice(neg_options)
          # Make sure the chosen synset has a matching embedding, else remove from list
          if neg_synset in self.src2id:
            break
          else:
            neg_options.remove(neg_synset)
        neg_target = torch.Tensor(self.embeddings[self.src2id[neg_synset]])
      targets.append(target)
      neg_targets.append(neg_target)
    data = {"lemmas": sample.lemmas,
            "length": sample.length,
            "pos": sample.pos,
            "synsets": sample.synsets,
            "inputs": torch.tensor(inputs, dtype=torch.long),
            "targets": torch.stack(targets).clone().detach(),
            "neg_targets": torch.stack(neg_targets).clone().detach(),
            "mask": torch.tensor(mask, dtype=torch.bool)}
    return data

class Sample():

  def __init__(self):
    ''' Simple structure to hold the sentence features '''
    self.length = 0
    self.forms = []
    self.lemmas = []
    self.pos = []
    self.synsets = []

def parse_tsv(f_dataset, max_length):
  sentences = parse(f_dataset, CUSTOM_FIELDS)
  data = []
  for sentence in sentences:
    sample = Sample()
    for token in sentence:
      sample.forms.append(token["form"])
      sample.lemmas.append(token["lemma"])
      sample.pos.append(token["pos"])
      sample.synsets.append(token["synsets"])
    sample.length = len(sample.forms)
    # Take care to pad all sequences to the same length
    sample.forms += (max_length - len(sample.forms)) * ["<PAD>"]
    sample.lemmas += (max_length - len(sample.lemmas)) * ["<PAD>"]
    sample.pos += (max_length - len(sample.pos)) * ["_"]
    sample.synsets += (max_length - len(sample.synsets)) * ["_"]
    data.append(sample)
  return data

In [0]:
### TO DO 2 ###

# Create datasets per the training and evaluation data 
trainset = WSDataset()
testset = WSDataset()
# Associate the datasets with dataloaders, specifying batch size and shuffling
batch_size = 128
trainloader = torch.utils.data.DataLoader()
testloader = torch.utils.data.DataLoader()

## Defining the WSD architecture

In [0]:
# Let's now define the WSD model itself
class WSDModel(nn.Module):

  def __init__(self, embeddings_dim, embedding_weights, hidden_dim, 
               hidden_layers, dropout):
      super(WSDModel, self).__init__()
      self.hidden_layers = hidden_layers
      self.hidden_dim = hidden_dim
      # We can also initialize new embeddings here (see Sequence Tagging session)
      self.word_embeddings = nn.Embedding.from_pretrained(embedding_weights, 
                                                          freeze=False)
      self.lstm = nn.LSTM(embeddings_dim, 
                          hidden_dim, 
                          hidden_layers, 
                          bidirectional=True, 
                          batch_first=True, 
                          dropout=dropout)
      ### TO DO 3 ###
      # We want output with the size of the lemma&synset embeddings
      self.output = 

  def forward(self, X, X_lengths, mask):
      X = self.word_embeddings(X) # shape is [batch_size,max_length,embeddings_dim]
      X = torch.nn.utils.rnn.pack_padded_sequence(X, 
                                                  X_lengths, 
                                                  batch_first=True, 
                                                  enforce_sorted=False)
      X, _ = self.lstm(X)
      # pad_packed_sequence cuts the sequences in the batch to the greatest sequence length
      X, _ = torch.nn.utils.rnn.pad_packed_sequence(X, batch_first=True) # shape is [batch_size, max_length_of_X, 2 * hidden_dim]
      # Therefore, make sure mask has the same shape as X
      mask = mask[:, :X.shape[1]] # shape is [batch_size, max_length_of_X]
      # Make mask broadcastable to X
      mask = torch.reshape(mask, (mask.shape[0], mask.shape[1], 1))
      # Select only RNN outputs that correspond to synset-tagged words in the data
      X = torch.masked_select(X, mask)
      # masked_select flattens the tensor, but we need it as matrix 
      X = X.view(-1, 2 * self.hidden_dim) # shape is [num_labels, 2*hidden_dim]
      outputs = self.output(X)
      return outputs

## Specifying the model parameters

In [0]:
n_hidden = 1000  
hidden_layers = 1
dropout = 0.2
learning_rate = 0.2  
alpha = 0.85 
epochs = 100 
### TO DO 4 ###
model =  
loss_function = torch.nn.MSELoss()
# optimizer = torch.optim.Adam(model.parameters())
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)



## Define a method for choosing synset based on the RNN outputs and calculating accuracy
The RNN model only returns vectors at the output step; it doesn't tell us which synset to choose in context. But we have taken care to compute output vectors of the same size as the lemma and synset embeddings in the pretrained vector space model. Therefore we can use a similarity measure, such as cosine similarity, to compare the output vector to the vectors of the candidate synsets. To get the candidates, we consult the lexicon dictionary. 
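The selection step can be sketched in a few lines of numpy. Everything here is made up for illustration: the three-dimensional vectors, the synset IDs, and the helpers `cosine` and `choose_synset` are assumptions, and the notebook's own function additionally handles synsets that have no embedding.

```python
import numpy

def cosine(u, v):
  """Cosine similarity between two non-zero vectors."""
  return float(numpy.dot(u, v) / (numpy.linalg.norm(u) * numpy.linalg.norm(v)))

# Made-up synset vectors and a one-entry lexicon
toy_synset_vectors = {
  "00000001-n": numpy.array([1.0, 0.0, 0.0]),
  "00000002-n": numpy.array([0.0, 1.0, 0.0]),
}
toy_lexicon = {"sleep-n": ["00000001-n", "00000002-n"]}

def choose_synset(output, lemma_pos):
  """Pick the candidate synset whose vector is most similar to the output."""
  candidates = toy_lexicon[lemma_pos]
  return max(candidates, key=lambda s: cosine(output, toy_synset_vectors[s]))

# A made-up model output that points mostly along the second axis
model_output = numpy.array([0.2, 0.9, 0.1])
print(choose_synset(model_output, "sleep-n"))
```

Restricting the comparison to the candidates listed for the lemma means the model never has to rank the whole synset inventory, only the senses that the lexicon allows.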

In [0]:
def calculate_accuracy(outputs, lemmas, pos, gold_synsets, lemma2synsets, embeddings, src2id, pos_filter=True):
  matches, total = 0, 0
  for i, output in enumerate(torch.unbind(outputs)):
    if pos_filter:
      lemma = lemmas[i] + "-" + POS_MAP[pos[i]]
    else:
      lemma = lemmas[i]
    possible_synsets = lemma2synsets[lemma]
    synset_choice, max_similarity = "", -100.0
    for synset in possible_synsets:
      synset_embedding = embeddings[src2id[synset]] if synset in src2id else embeddings[src2id["<UNK>"]]
      cos_sim = cosine_similarity(output.view(1, -1).detach().numpy(),
                                  synset_embedding.view(1, -1).detach().numpy())[0][0]
      if cos_sim > max_similarity:
        max_similarity = cos_sim
        synset_choice = synset
    # Recall that sometime we have more than 1 gold synset, separated by commas
    if synset_choice in gold_synsets[i].split(","):
      matches += 1
    total += 1
  return matches, total

## Constructing the training and development loops

In [23]:
# How often do we want to evaluate on the test data?
eval_at = 10
for epoch in range(epochs):
  print("***** Start of epoch " + str(epoch) + " *****")
  # Keep track how the training loss changes across all steps
  average_loss = 0.0
  for step, data in enumerate(trainloader):
    # Tell the model it is in training mode (so that e.g. dropout is applied)
    model.train()
    optimizer.zero_grad()
    outputs = model(data["inputs"], data["length"], data["mask"])
    # We want to use mask again in order to get only the targets and neg_targets for synsets
    mask = torch.reshape(data["mask"], (data["mask"].shape[0], data["mask"].shape[1], 1))
    targets = torch.masked_select(data["targets"], mask)
    targets = targets.view(-1, embeddings_dim) # shape is [num_labels, embeddings_dim]
    neg_targets = torch.masked_select(data["neg_targets"], mask) 
    neg_targets = neg_targets.view(-1, embeddings_dim) # ditto
    # A straightforward loss function would be to just compute it on the outputs and gold label vectors:
    # loss = loss_function(outputs, targets)
    # But we can use the negative example as well to learn finer distinctions
    ### TO DO 5 ###
    loss = 
    loss.backward()
    # Detach the loss, so that the running total does not keep each step's graph alive
    average_loss += loss.detach()
    optimizer.step()
    if step % eval_at == 0:
      # Tell the model we are in eval mode
      model.eval()
      print("Step " + str(step))
      # In numpy we can use mask as an index into arrays
      # But be careful to give them the same dimensions
      lemmas = numpy.asarray(data['lemmas']).transpose()[data["mask"]]
      synsets = numpy.asarray(data['synsets']).transpose()[data["mask"]]
      pos = numpy.asarray(data['pos']).transpose()[data["mask"]]
      matches, total = calculate_accuracy(outputs, lemmas, pos, synsets, lemma2synsets,
                                          embeddings, src2id, pos_filter=True)
      train_accuracy = matches * 1.0 / total
      print("Training accuracy at step " + str(step) + ": " + str(train_accuracy))
      average_loss /= (eval_at if step != 0 else 1)
      print("Average loss (training) is: " + str(average_loss.detach().numpy()))
      average_loss = 0.0
      # Loop over the eval data
      matches_all, total_all = 0, 0
      for eval_data in testloader:
        outputs = model(eval_data["inputs"], eval_data["length"], eval_data["mask"])
        lemmas = numpy.asarray(eval_data['lemmas']).transpose()[eval_data["mask"]]
        synsets = numpy.asarray(eval_data['synsets']).transpose()[eval_data["mask"]]
        pos = numpy.asarray(eval_data['pos']).transpose()[eval_data["mask"]]
        matches, total = calculate_accuracy(outputs, lemmas, pos, synsets, lemma2synsets,
                                            embeddings, src2id, pos_filter=True)
        matches_all += matches
        total_all += total
      test_accuracy = matches_all * 1.0 / total_all
      print("Test accuracy at step " + str(step) + " is: " + str(test_accuracy))

***** Start of epoch 0 *****
Step 0
Training accuracy at step 0: 0.35844471445929527
Average loss (training) is: 0.1929859
Test accuracy at step 0 is: 0.40709903593339175
Step 10
Training accuracy at step 10: 0.44740177439797213
Average loss (training) is: 0.1919314
Test accuracy at step 10 is: 0.45135845749342685
Step 20
Training accuracy at step 20: 0.48353096179183136
Average loss (training) is: 0.19166212
Test accuracy at step 20 is: 0.48115687992988604
Step 30
Training accuracy at step 30: 0.493844049247606
Average loss (training) is: 0.19092858
Test accuracy at step 30 is: 0.49780893952673094
Step 40
Training accuracy at step 40: 0.48435374149659866
Average loss (training) is: 0.19016786
Test accuracy at step 40 is: 0.5144609991235758
Step 50
Training accuracy at step 50: 0.5289672544080605
Average loss (training) is: 0.18943569
Test accuracy at step 50 is: 0.5311130587204207
Step 60
Training accuracy at step 60: 0.5655172413793104
Average loss (training) is: 0.18892235
Test accu

KeyboardInterrupt: ignored
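The comment in the training loop about using the negative example "to learn finer distinctions" can be illustrated with one possible formulation. The sketch below is an assumption, not necessarily the loss this notebook was run with: it weights the error against the gold vector by `alpha` and adds a term that shrinks as the output moves away from the negative vector. It uses numpy for brevity; the helpers `mse` and `contrastive_mse` and the toy vectors are hypothetical.

```python
import numpy

def mse(a, b):
  """Mean squared error between two arrays of the same shape."""
  return float(numpy.mean((a - b) ** 2))

def contrastive_mse(outputs, targets, neg_targets, alpha):
  """Weighted sum: stay close to the gold vectors, far from the negatives."""
  pull = mse(outputs, targets)             # small when outputs match the gold
  push = 1.0 - mse(outputs, neg_targets)   # small when outputs avoid the negative
  return alpha * pull + (1.0 - alpha) * push

gold = numpy.array([[0.5, 0.5]])
negative = numpy.array([[-0.5, 0.5]])

# An output equal to the gold vector versus one equal to the negative vector
loss_good = contrastive_mse(gold, gold, negative, alpha=0.85)
loss_bad = contrastive_mse(negative, gold, negative, alpha=0.85)
print(loss_good, loss_bad)
```

Other choices, such as a margin-based ranking loss, would serve the same purpose; the weighted form is shown only because it needs a single extra hyperparameter.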

## Stuff to try out


* Tuning the hyperparameters
* Modifying the loss and changing the optimizer
* Adding attention layers
* Including additional tasks in a multi-task learning setup (e.g. WSD classifier, POS tagging, dependency parsing, semantic role labeler, etc.)
* Dynamically swapping some of the inputs with gold synsets
* Training on synthetic data (see http://ixa2.si.ehu.es/ukb/ for a software tool to generate artificial corpora)







