# 4. Get sense vectors

This notebook creates a reference vector representation for each WordNet sense in the dataset.

## Importing libraries

In [4]:
import pandas as pd
from glob import glob
from tqdm.auto import tqdm
import json
import os
from pathlib import Path
import numpy as np

We use [anthevec](https://github.com/AntheSevenants/anthevec) to get vectors for whole words.

In [1]:
from anthevec.anthevec.embedding_retriever import EmbeddingRetriever

In [2]:
from constants import *

## Load senses

We will create representations for all corpus examples found for a specific sense.

In [5]:
sense_example_filenames = glob(f"{SENSE_EX}*.csv")

## Transformer stuff

In [5]:
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification, RobertaModel, RobertaConfig

MODEL_NAME = "pdelobelle/robbert-v2-dutch-base"

tokenizer = RobertaTokenizerFast.from_pretrained(MODEL_NAME)
config = RobertaConfig.from_pretrained(MODEL_NAME, output_hidden_states=True)
model = RobertaForSequenceClassification.from_pretrained(MODEL_NAME, config=config)

Some weights of the model checkpoint at pdelobelle/robbert-v2-dutch-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.bias', 'lm_head.decoder.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at pdelobelle/robbert-v2-dutch-base and are newly initialized: ['classifier.dense.weight', 'class

In [6]:
import spacy
# We need to load spacy, since the BERT tokeniser will split into word pieces,
# and not individual words. We're interested in words, so we get an "alternative"
# tokeniser, which we'll map the word pieces on.
nlp = spacy.load("nl_core_news_sm")
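To illustrate why a word-level tokeniser is needed, here is a minimal sketch of how word-piece vectors could be mapped back onto whole words. It assumes that the pieces of one word are simply averaged. This is an illustration only: `pool_word_pieces` and the toy inputs are hypothetical names, and anthevec's actual internals may differ.

```python
from statistics import fmean

def pool_word_pieces(piece_vectors, piece_to_word):
    """Average the vectors of all word pieces that belong to the same word.

    piece_vectors: list of equal-length lists of floats, one per word piece
    piece_to_word: list of word indices, one per word piece
    (Hypothetical helper; averaging is an assumption, not anthevec's documented method.)
    """
    grouped = {}
    for vector, word_index in zip(piece_vectors, piece_to_word):
        grouped.setdefault(word_index, []).append(vector)
    # For each word, average every dimension over that word's pieces
    return {
        word_index: [fmean(dimension) for dimension in zip(*vectors)]
        for word_index, vectors in grouped.items()
    }

# Toy example: word 0 is split into two pieces, word 1 is a single piece
piece_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
piece_to_word = [0, 0, 1]
word_vectors = pool_word_pieces(piece_vectors, piece_to_word)
```

In this toy example, word 0 ends up with the mean of its two piece vectors, while word 1 keeps its single piece vector unchanged.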

## Creating representations

Now, we find all senses in our dataset and create vectors for each corpus example of each sense.

In [None]:
for sense_example_filename in tqdm(sense_example_filenames):
    sense_vectors = {} # will hold all vectors for this sense
    
    # Find all example sentences for this sense
    try:
        corpus_examples = pd.read_csv(sense_example_filename)
    except pd.errors.EmptyDataError:
        # The file has no contents, so there is nothing to process
        continue
        
    output_path = f"{SENSE_EX_VEC}/{corpus_examples.iloc[0]['sense']}.json"
    if os.path.exists(output_path):
        continue
    
    # For each example sentence, compute a vector
    for index, row in corpus_examples.iterrows():
        key = row["docid"] + "_" + row["xmlid"]
        
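        word_index = None # defined up front so the error handler can always print it
        try:
            embedding_retriever = EmbeddingRetriever(model, tokenizer, nlp, [ row["center_sentence"].split(" ") ])
            
            # Word indices of lassy tokens are unreliable
            # If no index is given, we find it ourselves
            if pd.isna(row["word_index"]):
                # Only one sentence is passed in, so its sentence index is 0
                lemmas = [token.lemma_ for token in embedding_retriever.tokens[0]]
                # NOTE: assumes the CSV provides the target lemma in a "lemma" column
                word_index = lemmas.index(row["lemma"])
            else:
                # Shift the stored index down by one (it appears to be 1-based)
                word_index = int(row["word_index"]) - 1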

            # We create vectors for each layer separately...
            vectors = []
            for layer_index in range(1, 13):
                vector = embedding_retriever.get_hidden_state(0, word_index, [ layer_index ])
                vector = list(vector.astype('float64'))
                vectors.append(vector)
                
            # ...and also append the layer average 
            vectors.append(list(embedding_retriever.get_hidden_state(0, word_index, list(range(1, 13))).astype('float64')))
                
            sense_vectors[key] = vectors
        # Sometimes there will be errors, because this methodology isn't perfect. But it won't matter in the end because we have enough representations.
        except Exception as e:
            print(e)
            print(word_index, row["xmlid"], row["center_sentence"])
            sense_vectors[key] = None
    
    with open(output_path, "wt") as writer:
        writer.write(json.dumps(sense_vectors))

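Each sense gets its own JSON file, which maps every corpus example key (`docid` and `xmlid` joined by an underscore) to a list of thirteen vectors. The first twelve elements represent layers 1 to 12, and the final element represents the average over those layers. Examples for which no vector could be computed are stored as `null`.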
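As a rough sketch of how such a file could be read back, the snippet below selects one position per corpus example and skips failed examples. `load_layer_vectors` is a hypothetical helper and the toy file is illustrative only; real files live under `SENSE_EX_VEC` and contain much longer vectors.

```python
import json
import os
import tempfile

def load_layer_vectors(path, position=-1):
    """Return {example_key: vector} for one position in the stored list.

    Positions 0 to 11 hold layers 1 to 12; the last position holds the layer average.
    Examples stored as null (None after loading) are skipped.
    (Hypothetical helper, not part of the original notebook.)
    """
    with open(path, "rt") as reader:
        sense_vectors = json.load(reader)
    return {
        key: vectors[position]
        for key, vectors in sense_vectors.items()
        if vectors is not None
    }

# Toy file with placeholder two-dimensional vectors
toy = {
    "doc1_s1": [[float(i), float(i)] for i in range(13)],
    "doc1_s2": None,  # extraction failed for this example
}
with tempfile.TemporaryDirectory() as directory:
    toy_path = os.path.join(directory, "toy_sense.json")
    with open(toy_path, "wt") as writer:
        json.dump(toy, writer)
    averages = load_layer_vectors(toy_path)        # last element of each list
    first_layer = load_layer_vectors(toy_path, 0)  # position 0, i.e. layer 1
```

Using `position=-1` as the default means callers get the layer average unless they explicitly ask for a single layer.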