## WSD using BERT Masked Language Model
This notebook explores the a part of the idea proposed by Ajit Rakasekharan in his blog post 
[Examining BERT raw embeddings.](https://towardsdatascience.com/examining-berts-raw-embeddings-fd905cb22df7) 

The idea is that examining the predictions of a masked language model for a masked ambiguous word can yield insights into the semantic meaning of the ambiguous word.

We use the HuggingFace BERT for Masked LM with weights from a bert-base-cased pre-trained model for our experiment.

We mask the ambiguous word (here we have used bank for our test) in sentences, and then send them through a BERT MLM model. Output is an array of logits for each position of the input sequence. So assuming a sentence with T tokens and a vocabulary size of V, the predictions of the MLM is (1, T, V) where 1 is the batch size (1 input sentence at a time in our experiment).

In order to find the top k predictions, the logits for the masked position is softmaxed and the top k values chosen.



## Prepare your environment

As always, we highly recommend that you install all packages with a virtual environment manager, like [venv](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/) or [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html), to prevent version conflicts of different packages.  

### Masked LM Model and Tokenizer 
[tutorial](https://huggingface.co/docs/transformers/tasks/language_modeling)  
Task is to predict words that are masked using BERT, so we will use BERTMaskedLM model and BERTTokenizer and use the pre-trained bert-base-uncased model.

In [1]:
import pandas as pd
import torch
from transformers import BertTokenizer, BertForMaskedLM

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertForMaskedLM.from_pretrained('bert-base-cased', return_dict=True)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


We are going to use the pre-trained BERT language model in inference mode only.

The tokenizer tokenizes the input sequence and pads it with the [CLS] and [SEP] tokens.

The output produced by the model has two components, loss and logits. The logits component has shape (1, number_of_tokens, vocab_size) where the leading 1 represents the single input sentence.

We will identify the logits corresponding to the position of our masked token, identify the top 5 vocabulary words predicted for that position, and return the softmax probabilities for each of the top 5 predicted words.

In [3]:
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)

In [4]:
tokenizer.convert_ids_to_tokens(inputs.input_ids[0])


['[CLS]', 'The', 'capital', 'of', 'France', 'is', '[MASK]', '.', '[SEP]']

In [5]:
outputs

MaskedLMOutput(loss=None, logits=tensor([[[ -7.1545,  -6.9931,  -7.1826,  ...,  -5.9124,  -5.6733,  -5.9854],
         [ -8.0190,  -8.1319,  -8.0509,  ...,  -6.5679,  -6.4058,  -6.8998],
         [ -4.9772,  -6.1781,  -6.0669,  ...,  -5.6362,  -4.6603,  -5.1241],
         ...,
         [ -3.4420,  -3.2557,  -3.5733,  ...,  -2.4606,  -2.6495,  -3.1952],
         [-10.5890, -10.4620, -11.7181,  ...,  -7.4646,  -9.9542,  -8.3927],
         [-14.8900, -14.8873, -14.4569,  ..., -11.6588, -13.0151, -11.6073]]],
       grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)

In [6]:
def get_mask_index(input_ids, tokenizer):
    x = input_ids[0]
    is_masked = torch.where(x == tokenizer.mask_token_id, x, 0)
    mask_idx = torch.nonzero(is_masked)
    return mask_idx.item()

mask_idx = get_mask_index(inputs.input_ids, tokenizer)
mask_idx

6

In [7]:
def get_top_k_predictions(pred_logits, mask_idx, top_k):
    probs = torch.nn.functional.softmax(pred_logits[0, mask_idx, :], dim=-1)
    top_k_weights, top_k_indices = torch.topk(probs, top_k, sorted=True)
    top_k_pct_weights = [100 * x.item() for x in top_k_weights]
    top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_indices)
    return list(zip(top_k_tokens, top_k_pct_weights))


get_top_k_predictions(outputs.logits, mask_idx, 5)

[('Paris', 44.46825087070465),
 ('Lyon', 9.396003931760788),
 ('Toulouse', 8.234518766403198),
 ('Lille', 7.515139877796173),
 ('Marseille', 5.692288279533386)]

### WSD Test Sentences
We take our pair of sentences for disambiguating the word bank and mask them, and extract the top 20 predictions from the pre-trained BERT MLM model.

As expected, the first set of predictions predominantly point to some sort of financial institution, whereas the second set of predictions predominantly point to some geographical formation around bodies of water.

In [8]:
sentences = [
  "Go to the [MASK] and deposit your pay check.",
  "Jim and Janet went down to the river [MASK] to admire the swans."
]

In [9]:
def get_predictions(sentence, tokenizer, model):
    inputs = tokenizer(sentence, return_tensors="pt")
    outputs = model(**inputs)
    mask_idx = get_mask_index(inputs.input_ids, tokenizer)
    top_preds = get_top_k_predictions(outputs.logits, mask_idx, 20)
    return top_preds

In [10]:
get_predictions(sentences[0], tokenizer, model)

[('bank', 70.31391263008118),
 ('office', 10.280615836381912),
 ('register', 1.7452005296945572),
 ('store', 1.6284782439470291),
 ('bathroom', 0.9394792839884758),
 ('library', 0.8934846147894859),
 ('desk', 0.8724375627934933),
 ('counter', 0.7977348752319813),
 ('hotel', 0.5163734778761864),
 ('lobby', 0.49569723196327686),
 ('kitchen', 0.3637079382315278),
 ('garage', 0.34799312707036734),
 ('door', 0.34127470571547747),
 ('car', 0.3311377251520753),
 ('house', 0.2649053931236267),
 ('airport', 0.2547033363953233),
 ('elevator', 0.24911393411457539),
 ('back', 0.24807692971080542),
 ('computer', 0.24019642733037472),
 ('banks', 0.23491440806537867)]

In [11]:
get_predictions(sentences[1], tokenizer, model)

[('##bank', 32.602137327194214),
 ('below', 13.03199678659439),
 ('bank', 11.940894275903702),
 (',', 5.626494437456131),
 ('##boat', 3.1638894230127335),
 ('##front', 2.7332261204719543),
 ('basin', 1.621054857969284),
 ('##bed', 1.2178409844636917),
 ('together', 1.184169389307499),
 ('bed', 0.9657169692218304),
 ('again', 0.8369819261133671),
 ('deck', 0.8356173522770405),
 ('valley', 0.7271395064890385),
 ('mouth', 0.7227548863738775),
 ('boat', 0.7151047699153423),
 ('pier', 0.6493300199508667),
 ('house', 0.6301576271653175),
 ('banks', 0.5700556561350822),
 ('pool', 0.5345691461116076),
 ('Thames', 0.49955458380281925)]

## Assignment
In this week's assignment, you are tasked with processing SemCor data and feed the data into BERT masked-LM. After that, use the predictions to find the most likely sense of the target word using WordNet similarity.

### Data Preprocessing 
You can find a sample of SemCor dataset [here](https://drive.google.com/file/d/1inmv3rUcGrtiS4VQwTMsT9HF-iL8jc5V/view?usp=sharing) and load the data using the following methods.

In [12]:
import json
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
sents = []
tokens = []
wn_id = []
lemmatizer = WordNetLemmatizer()

with open('semcor.sample.jsonl') as f:
    for line in f:
        data = json.loads(line)
        sents.append(data['sent'])
        tokens.append(data['tokens'])
        wn_id.append(data['wnid'])


In [13]:
print(sents[10])
print(tokens[10])
print(wn_id[10])

implementation of georgia 's automobile title law was also recommended by the outgoing jury . 
['implementation', 'of', 'georgia', "'s", 'automobile', 'title', 'law', 'was', 'also', 'recommended', 'by', 'the', 'outgoing', 'jury', '.']
['implementation%1:04:01::', 0, 'georgia%1:15:00::', 0, 'automobile%1:06:00::', 'title%1:10:04::', 'law%1:10:00::', 0, 'also%4:02:00::', 'recommend%2:32:01::', 0, 0, 'outgoing%3:00:00::', 'jury%1:14:00::', 0]


In [14]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\love4\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\love4\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [54]:
# The WordNet ID can be converted to NLTK Lemma using the following function
wn.lemma_from_key('implementation%1:04:01::')

Lemma('execution.n.06.implementation')

### TODO 
Please implement a method to convert the data to BERT Masked-LM format and keep track of the headword. Store the data into the following lists

word[i] = 'implementation'  
ground_truth[i] = 'implementation%1:04:01::'  
sent[i] = "[MASK] of georgia 's automobile title law was also recommended by the outgoing jury ."  



In [129]:
word = []
ground_truth = []
sent = []
for i in range(len(sents)):
    for j in range(len(wn_id[i])):
        if (wn_id[i][j] != 0):
            word.append(tokens[i][j])
            ground_truth.append(wn_id[i][j])
            sent.append(sents[i].replace(tokens[i][j], '[MASK]', 1))

In [127]:
print(word[0])
print(ground_truth[0])
print(sent[0])

said
say%2:32:00::
the fulton_county_grand_jury [MASK] friday an investigation of atlanta 's recent primary_election produced " no evidence " that any irregularities took_place . 


#### Identify the top 5 predictions other than the headword using Masked-LM 
1. Use get_predictions to get the predicted words
2. Use lemmatizer to lemmatize the prediction
3. Remove headword
4. Keep top 5 unique predictions

In [136]:
import pandas as pd
candidate_lemmas = []
for i in range(len(word)):

    df = pd.DataFrame(get_predictions(sent[i], tokenizer, model), columns=['word', 'probability'])
    df = df.loc[df['word'] != word[i]].reset_index(drop=True)
    df = df.iloc[0:5]
    for w in df['word']:
        df['word', i] = lemmatizer.lemmatize(w)
    candidate_lemmas.append(df['word'].tolist())

In [20]:
# candidate_lemmas = [[lemmatizer.lemmatize(word) for word, probability in get_predictions(sentence, tokenizer, model)[1:6]] for sentence in sent]

example:  
candidate_lemmas = ['office', 'register', 'store', 'bathroom', 'library']


Identify the most similar sense of headword with relation to the 5 unique candidates

In [None]:
# No synset found for key 'recent%3:00:00:past:00'
# 用lemma 去找synset 有很多不存在
# 另外有些word 沒有synsets (8個)
# 有些candidate lemma 也沒有synsets (32個)

In [174]:
predicted_sense = []
for i in range(len(word)):
    if wn.synsets(word[i]):
        tmp = {}
        for lemma in candidate_lemmas[i]:
            if wn.synsets(lemma):
                tmp[lemma] = max([wn.synsets(word[i])[0].wup_similarity(synset) for synset in wn.synsets(lemma)])
                # tmp[lemma] = max([wn.lemma_from_key(key).synset().wup_similarity(synset) for synset in wn.synsets(lemma)])
        if tmp:
            predicted_sense.append(max(tmp, key=tmp.get))
        else:
            predicted_sense.append(candidate_lemmas[i][0])
    else :
        predicted_sense.append(candidate_lemmas[i][0])

For evaluation purpose, for i = 50, please run the process and print out the following:  
1. word[50]
2. ground_truth[50] (in synset or lemma)
3. sent[50]
4. candidate_lemmas
5. predicted_sense (in synset or lemma)    

Also, please print out the accuracy of the process over our dataset

In [176]:
print(word[50])
print(wn.lemma_from_key(ground_truth[50]))
print(sent[50])
print(candidate_lemmas[50])
print(predicted_sense[50])

size
Lemma('size.n.01.size')
" only a relative handful of such reports was received " , the jury said , " considering the widespread interest in the election , the number of voters and the [MASK] of_this city " . 
['population', 'status', 'reputation', 'character', 'state']
character


In [189]:
score = 0
for i in range(len(word)):
    if wn.synsets(word[i]) and wn.synsets(predicted_sense[i]):
        score += wn.synsets(word[i])[0].wup_similarity(wn.synsets(predicted_sense[i])[0])
score / len(word)

0.44761938338005225

## TA's Note

Congratulations, you made it to the end of the tutorial! Make sure you make an appointment to show your work and turn in your finished assignment before next week's lesson. We will ask you to run your code, so double check that everything is working and that your model is saved. Don't worry if you didn't pass the evaluation requirements, you'll still get partial points for trying.