## Supervised WSD with Bayes

We will use the Naive Bayes classifier to label (classify) words with a WordNet sense. For this we need a context window surrounding the target word (the word whose sense we are looking for). The context window should contain only "content words": words that carry meaning, such as nouns, verbs, adjectives, and adverbs.

We denote by P(s|c) the probability of sense s given the context c. We compute this probability for every sense of the target word and choose the sense with the highest probability.

To compute the probability P(s|c), we use Bayes' rule: P(s|c) = (P(c|s) * P(s)) / P(c). P(s) is the prior probability of a sense, without any context. P(c) is the same for every sense, so it can be ignored when we only compare senses. To estimate P(c|s) and P(s) we need a training set (texts that contain the target word, already labeled with its correct sense).
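As a rough illustration of the formula above, the sketch below scores two senses on a tiny hand-made training set. The contexts, the sense names, and the add-one smoothing are assumptions made for this example and do not come from senseval. P(c) is dropped because it is identical for all senses, and P(c|s) is approximated as a product of per-word probabilities (the "naive" independence assumption).

```python
import math
from collections import Counter, defaultdict

# Toy labeled contexts (invented for illustration, not taken from senseval).
training = [
    (["bank", "loan", "rate"], "interest_money"),
    (["pay", "rate", "loan"], "interest_money"),
    (["hobby", "music", "great"], "interest_curiosity"),
]

sense_counts = Counter(sense for _, sense in training)
word_counts = defaultdict(Counter)
for words, sense in training:
    word_counts[sense].update(words)
vocab = {w for words, _ in training for w in words}

def log_score(context, sense):
    """Return log P(s) + sum of log P(w|s), using add-one smoothing."""
    score = math.log(sense_counts[sense] / len(training))
    total = sum(word_counts[sense].values())
    for w in context:
        score += math.log((word_counts[sense][w] + 1) / (total + len(vocab)))
    return score

def best_sense(context):
    """Pick the sense with the highest (log) posterior score."""
    return max(sense_counts, key=lambda s: log_score(context, s))

print(best_sense(["loan", "rate"]))
```

Working with logarithms avoids numerical underflow when many small probabilities are multiplied; the ranking of the senses is unchanged.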

NLTK already has the classifier implemented. We can use the NLTK NaiveBayesClassifier: https://www.nltk.org/_modules/nltk/classify/naivebayes.html

The Naive Bayes classifier will first compute the prior probability of the senses (or, generally speaking, of the class labels), which is determined by each label's frequency in the training set. The features are then used to estimate how likely each label is in a given context.

nltk.NaiveBayesClassifier.train(train_set)

**train_set** must be a list of labeled examples. Each element of the list is a tuple of two elements: the first is a dictionary with the features (name and value of each feature), the second is the class label.
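A minimal sketch of the expected **train_set** format follows. The feature names (`contains(...)`) and the labels `money` and `curiosity` are invented for this example; real features would be extracted from the senseval contexts.

```python
import nltk

# Each training item is a (feature_dict, label) tuple.
# Feature names and labels are hypothetical, chosen only for illustration.
train_set = [
    ({"contains(loan)": True, "contains(rate)": True}, "money"),
    ({"contains(bank)": True, "contains(rate)": True}, "money"),
    ({"contains(hobby)": True, "contains(music)": True}, "curiosity"),
    ({"contains(music)": True, "contains(art)": True}, "curiosity"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify a new context described by the same kind of feature dictionary.
prediction = classifier.classify({"contains(rate)": True})
print(prediction)
```

The same dictionary format is used at prediction time, so the function that builds the features should be shared between training and testing.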

You can also use the Naive Bayes classifier from sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).

Useful link: https://www.nltk.org/book/ch06.html

For today's task, you need to train the NLTK Bayes classifier on senseval, on a word of your choice.

In [None]:
import nltk
nltk.download('senseval')

[nltk_data] Downloading package senseval to /root/nltk_data...
[nltk_data]   Package senseval is already up-to-date!


True

In [None]:
from nltk.corpus import senseval  # the corpus reader must be imported before use
senseval.fileids()

['hard.pos', 'interest.pos', 'line.pos', 'serve.pos']

In [None]:
from nltk.corpus import senseval
instances = senseval.instances('hard.pos')
print(instances[2])
print(instances[2].senses)
print(instances[3500])
print(instances[3500].senses)

SensevalInstance(word='hard-a', position=3, context=[('i', 'PRP'), ('find', 'VBP'), ('it', 'PRP'), ('hard', 'JJ'), ('to', 'TO'), ('believe', 'VB'), ('that', 'IN'), ('the', 'DT'), ('sacramento', 'NNP'), ('river', 'NNP'), ('will', 'MD'), ('ever', 'RB'), ('be', 'VB'), ('quite', 'RB'), ('the', 'DT'), ('same', 'JJ'), (',', ','), ('although', 'IN'), ('i', 'PRP'), ('certainly', 'RB'), ('wish', 'VBP'), ('that', 'IN'), ('i', 'PRP'), ("'m", 'VBP'), ('wrong', 'JJ'), ('.', '.')], senses=('HARD1',))
('HARD1',)
SensevalInstance(word='hard-a', position=6, context=[('he', 'PRP'), ('said', 'VBD'), ('tuesday', 'NNP'), ('there', 'EX'), ('are', 'VBP'), ('no', 'DT'), ('hard', 'JJ'), ('feelings', 'NNS'), ('.', '.')], senses=('HARD2',))
('HARD2',)


The classes used for training are the senses, and the features are the surrounding words.
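One possible way to turn the surrounding words into features is sketched below. The function name `context_features`, the window size, and the tag-prefix filter used to approximate "content words" are assumptions of this sketch, not part of NLTK. It expects a context in the (word, tag) format shown in the instances above.

```python
def context_features(context, position, window=3):
    """Build a bag-of-words feature dict from the tokens around the target.

    `context` is a list of (word, tag) pairs as in a SensevalInstance;
    `window` and the tag filter are choices made for this sketch.
    """
    content_tags = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs
    lo = max(0, position - window)
    hi = min(len(context), position + window + 1)
    features = {}
    for i in range(lo, hi):
        if i == position:
            continue  # skip the target word itself
        token = context[i]
        if not isinstance(token, tuple):
            continue  # skip untagged tokens, should any occur
        word, tag = token
        if tag.startswith(content_tags):
            features[f"contains({word.lower()})"] = True
    return features

# Context of the second instance printed above (target word: 'hard').
context = [('he', 'PRP'), ('said', 'VBD'), ('tuesday', 'NNP'), ('there', 'EX'),
           ('are', 'VBP'), ('no', 'DT'), ('hard', 'JJ'), ('feelings', 'NNS'),
           ('.', '.')]
features = context_features(context, position=6)
print(features)
```

Pairing each feature dictionary with `inst.senses[0]` would give the tuples needed for training. Removing stop words such as "are" would be a natural refinement.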

### Exercise (1 point)

Apply the Naive Bayes classifier to one of the files in the senseval dataset (for example, for the word interest). Use 90% of the phrases for training and 10% for testing the classifier. The phrases should be taken in a random order (shuffle the phrases before training and testing). For the test set, print the predictions of the classifier and the correct labels from the corpus, and also print "true" if they are the same and "false" if they are different. At the end, print the accuracy of the classifier. Write the output to a txt file.

In [None]:
import random
import nltk
from nltk.corpus import senseval
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

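# Use the same word as in the output file name below.
instances = senseval.instances('interest.pos')
# inst.context is a list of (word, tag) pairs; keep the words of the whole
# sentence (inst.context[0] would be only the first token and its tag).
contexts = [" ".join(tok[0] if isinstance(tok, tuple) else tok for tok in inst.context)
            for inst in instances]
labels   = [inst.senses[0] for inst in instances]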
random.seed(13)
idx = list(range(len(contexts)))
random.shuffle(idx)
split = int(0.9 * len(idx))
train_idx, test_idx = idx[:split], idx[split:]

train_texts = [contexts[i] for i in train_idx]
train_lbls  = [labels[i]   for i in train_idx]
test_texts  = [contexts[i] for i in test_idx]
test_lbls   = [labels[i]   for i in test_idx]

vect = CountVectorizer()
X_train = vect.fit_transform(train_texts)
X_test  = vect.transform(test_texts)

clf = MultinomialNB()
clf.fit(X_train, train_lbls)
preds = clf.predict(X_test)

out_file = 'naive_bayes_interest_output.txt'
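with open(out_file, 'w') as f:
    for p, t in zip(preds, test_lbls):
        line = f"{p}\t{t}\t{str(p == t).lower()}"
        print(line)  # the exercise asks for the results to be printed as well
        f.write(line + "\n")
    acc = sum(p == t for p, t in zip(preds, test_lbls)) / len(test_lbls)
    print(f"Accuracy: {acc:.4f}")
    f.write(f"Accuracy: {acc:.4f}\n")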

print(f"Wrote results to ./{out_file}")


## BERT

BERT (Bidirectional Encoder Representations from Transformers) was introduced in 2018. Unlike word2vec, which is a context-free model, BERT is a contextual model trained with the Masked Language Modeling (MLM) technique: some of the tokens in the input are masked (treated as unknown) and BERT is trained to predict them from their context in the input text. It is also trained on Next Sentence Prediction (NSP): given a pair of sentences, it predicts whether the second one actually follows the first in the original text. As a result of this training process, BERT learns representations of tokens that depend on their context. BERT is **bidirectional** because it considers both the left and the right context of each word in the input during training.
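The masking step of MLM can be sketched in a few lines. This is a simplification: the function name `mask_tokens` is hypothetical, whitespace-separated words stand in for real tokens, and every selected token is replaced by `[MASK]`, whereas the original BERT recipe selects about 15% of the tokens and replaces only most of them with `[MASK]` (some are swapped for a random token or left unchanged).

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets).

    `targets` maps each masked position to the original token, which is
    what the model would be trained to predict.
    """
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = "[MASK]"
            targets[i] = tok
    return masked, targets

sentence = "the bank raised its interest rate again this year".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3)
print(masked)
print(targets)
```

During pretraining the loss is computed only on the masked positions, so the model has to rely on both the left and the right context to recover them.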

**Transformers** are a deep learning architecture that uses the attention mechanism (self-attention) to create vector encodings of the words in a text, based on their surrounding text (context). BERT has an encoder-only transformer architecture: the encoder receives the input and turns it into embeddings that the rest of the network can use.

**The attention mechanism** assigns weights to each token in the input in relation to the other tokens in the context of the input, thus obtaining a contextual model. Attention helps decide how important a token is in an input, and how it depends on the other tokens in the input. BERT uses a multi-head attention mechanism, which means that several attention heads run in parallel, each one free to focus on a different kind of relationship between the tokens.
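The weighting described above can be illustrated with a single attention head written in plain Python. This is only a sketch of scaled dot-product attention: the token vectors are made up, and the same vectors are used directly as queries, keys, and values, whereas in BERT these come from learned linear projections of the embeddings.

```python
import math

def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query with every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up 2-dimensional token vectors (an assumption of this sketch).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how much one token attends to every token in the input. A multi-head layer would run several such computations in parallel on different projections and concatenate the results.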

We will use BERT from [Hugging Face](https://huggingface.co/google-bert/bert-base-uncased).

### BERT tokenizer

The tokenizer is in charge of preparing the inputs for the model. The BERT tokenizer uses subword-based tokenization, which splits words that are not in the vocabulary into smaller subwords or characters, so that the model can still derive some meaning from the tokens. BERT uses the [WordPiece](https://paperswithcode.com/method/wordpiece) algorithm to generate the vocabulary. The WordPiece algorithm builds subwords based on the likelihood of characters occurring together.
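To make the subword splitting concrete, here is a sketch of how a word can be split once a vocabulary is fixed, using greedy longest-match-first lookup with the `##` prefix marking word-internal pieces. The tiny vocabulary and the function name `wordpiece_tokenize` are assumptions for this example. The sketch covers only the application of an existing vocabulary, not how WordPiece learns it.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shorten it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary.
vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
print(wordpiece_tokenize("playing", vocab))
print(wordpiece_tokenize("unplayable", vocab))
```

With the real tokenizer the vocabulary contains tens of thousands of pieces, so frequent words usually stay whole and only rarer words are split.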

### Training and fine tuning

BERT is usually used as a pretrained model. If we want to specialize it for a task, we fine-tune it by training it on our data for a small number of epochs (usually 2-4), with a learning rate chosen from the following: 3e-4, 1e-4, 5e-5, 3e-5.

### Important terms

- **Logits** - the raw, unnormalized scores produced by the last layer of the model, one per class; applying softmax turns them into probabilities
- **Loss** - a measure of how far the predictions are from the true values (for classification, typically the cross-entropy)
- **Batch** - input data is divided into batches that are delivered to the network; recommended batch sizes: 8, 16, 32, 64, 128
- **Optimizer** - adjusts the parameters of the model in order to minimize the loss function
- **Gradients** - the partial derivatives of the loss function with respect to the model's parameters (weights and biases). Gradients are used to update the parameters in a way that minimizes the loss
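The terms above can be tied together with a small numeric sketch. It assumes a cross-entropy loss and, to keep things short, treats the logits themselves as the parameters being updated and uses one plain gradient-descent step; a real model updates its weights through backpropagation, and AdamW is more elaborate than this.

```python
import math

def softmax(logits):
    """Convert logits into probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Loss: negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[true_class])

logits = [0.2, 1.5, -0.3]   # made-up scores for 3 classes
true_class = 0
loss = cross_entropy(logits, true_class)

# Gradient of the cross-entropy with respect to the logits:
# softmax(logits) minus the one-hot vector of the true class.
probs = softmax(logits)
grads = [p - (1.0 if i == true_class else 0.0) for i, p in enumerate(probs)]

# One gradient-descent step with a hypothetical learning rate.
lr = 0.5
new_logits = [z - lr * g for z, g in zip(logits, grads)]
new_loss = cross_entropy(new_logits, true_class)

print(f"loss before: {loss:.4f}, loss after: {new_loss:.4f}")
```

After the update the score of the correct class goes up and the loss goes down, which is the effect an optimizer step is meant to have on each batch.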

References:
- https://www.projectpro.io/article/bert-nlp-model-explained/558
- https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
- https://huggingface.co/docs/transformers/main_classes/tokenizer

### Exercises (2 points in total)

Note: Exercises 10 - 14 are closely related. Read all of them before starting exercise 10.

1. Install the transformers module.
2. Open the corpus file from senseval for a word of your choice. Print the number of phrases (instances).
3. Print the first 10 phrases as raw text (untagged).
4. Create a DataFrame based on the sentences in the text with the columns: text, label, id. Each row will correspond to an entry in the corpus. The column text would contain the phrase from the entry (as raw text - untagged), the label will be the sense for the target word (the label should be an int), the id is the number of the phrase.
5. Shuffle the data.
6. For the training set, use 90% of instances to train a BERT classifier. Try to find the sense of the word on the rest of 10% of instances (test set). The classes used for training are the senses.
7. Create a pretrained classification model with BertForSequenceClassification. The number of labels is the number of classes. Set the output of attentions and hidden states to false.
8. Use BertTokenizer from pretrained model "bert-base-uncased", in lowercase.
9. Use batch_encode_plus method of the tokenizer to encode the texts. Use the longest padding and return the attention mask.
10. Use 2 epochs to train the data using AdamW optimizer with learning rate 5e-5 and with batch sizes of:
    - 16
    - 32
    - 64
11. For each epoch load the batches and train the model with the batches (this is the step where we do the fine tuning). Don't forget to use zero_grad() on the optimizer.
12. For each batch size, try training with 2, 3 and 4 epochs. Print the loss both for training and testing. Choose the best number of epochs for each batch size based on the values of the loss functions.
13. Print how much time each training epoch took.
14. Justify the choices at ex. 12 with 3 tables (columns will correspond to the number of epochs and rows to the loss function values for training and testing).
15. Classify the test set using the best model you trained (i.e. the one with the best loss function value). Load the data in batches again and check the predictions.
16. Print the test accuracy, precision, recall, and weighted f1-score of the best model. You can use sklearn.
17. Plot the confusion matrix. You can again use sklearn.
18. Change the code so that the computations are made on GPU (for NVIDIA, method cuda())

# Each Cell represents one exercise


In [None]:
# ! pip install transformers
import time
import nltk
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
# AdamW is taken from torch.optim (torch.optim.AdamW), so no extra import is needed.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
import matplotlib.pyplot as plt

In [None]:
nltk.download('senseval')
from nltk.corpus import senseval

word = 'interest'
instances = senseval.instances(f"{word}.pos")
print(f"Total instances for '{word}': {len(instances)}")

In [None]:
for i, inst in enumerate(instances[:10]):
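    # inst.context is a list of (word, tag) pairs; join the words of the whole sentence.
    context = ' '.join(tok[0] if isinstance(tok, tuple) else tok for tok in inst.context)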
    senses = inst.senses
    print(f"{i+1}. Context: {context}\n   Senses: {senses}\n")

In [None]:
data = []
for i, inst in enumerate(instances):
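    # Raw (untagged) text: keep only the word from each (word, tag) pair.
    text = ' '.join(tok[0] if isinstance(tok, tuple) else tok for tok in inst.context)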
    label = inst.senses[0]
    data.append({'text': text, 'label': label, 'id': i})
df = pd.DataFrame(data)
df['label'], label_names = pd.factorize(df['label'])
print(f"Label mapping: {dict(enumerate(label_names))}")

In [None]:
df = df.sample(frac=1, random_state=13).reset_index(drop=True)

In [None]:
train_df, test_df = train_test_split(
    df, test_size=0.1, stratify=df['label'], random_state=13
)

In [None]:
num_labels = len(label_names)

tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased', do_lower_case=True
)
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=num_labels,
    output_attentions=False, output_hidden_states=False
)


In [None]:
def encode_texts(texts):
    return tokenizer.batch_encode_plus(
        texts, add_special_tokens=True, max_length=128,
        padding='longest', truncation=True,
        return_attention_mask=True, return_tensors='pt'
    )

train_enc = encode_texts(train_df['text'].tolist())
test_enc = encode_texts(test_df['text'].tolist())

class WSDDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item


In [None]:
batch_sizes = [16, 32, 64]
results = {}
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for batch_size in batch_sizes:
    train_ds = WSDDataset(train_enc, train_df['label'].tolist())
    test_ds  = WSDDataset(test_enc,  test_df['label'].tolist())
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    test_loader  = DataLoader(test_ds,  batch_size=batch_size)
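    for epochs in [2, 3, 4]:
        # Start each configuration from the same pretrained weights; otherwise
        # training would continue from the previous run and the losses of
        # different configurations would not be comparable.
        model = BertForSequenceClassification.from_pretrained(
            'bert-base-uncased', num_labels=num_labels,
            output_attentions=False, output_hidden_states=False
        ).to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)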
        train_losses, test_losses, times = [], [], []
        for epoch in range(epochs):
            t0 = time.time()
            model.train()
            total_train_loss = 0
            for batch in train_loader:
                b_ids = batch['input_ids'].to(device)
                b_mask = batch['attention_mask'].to(device)
                b_labels = batch['labels'].to(device)
                optimizer.zero_grad()
                outputs = model(
                    input_ids=b_ids,
                    attention_mask=b_mask,
                    labels=b_labels
                )
                loss = outputs.loss
                total_train_loss += loss.item()
                loss.backward()
                optimizer.step()
            avg_train = total_train_loss / len(train_loader)
            train_losses.append(avg_train)
            model.eval()
            total_eval_loss = 0
            for batch in test_loader:
                b_ids = batch['input_ids'].to(device)
                b_mask = batch['attention_mask'].to(device)
                b_labels = batch['labels'].to(device)
                with torch.no_grad():
                    outputs = model(
                        input_ids=b_ids,
                        attention_mask=b_mask,
                        labels=b_labels
                    )
                total_eval_loss += outputs.loss.item()
            avg_test = total_eval_loss / len(test_loader)
            test_losses.append(avg_test)

            epoch_time = time.time() - t0
            times.append(epoch_time)
            print(f"Batch {batch_size} | Epoch {epoch+1}/{epochs} | Train Loss: {avg_train:.4f} | Test Loss: {avg_test:.4f} | Time: {epoch_time:.1f}s")
        results[(batch_size, epochs)] = {
            'train_losses': train_losses,
            'test_losses': test_losses,
            'times': times
        }

In [None]:
for batch_size in batch_sizes:
    df_summary = pd.DataFrame({
        'Epochs': [2, 3, 4],
        'Train Loss': [results[(batch_size,e)]['train_losses'][-1] for e in [2,3,4]],
        'Test Loss':  [results[(batch_size,e)]['test_losses'][-1]  for e in [2,3,4]]
    })
    print(f"Batch size: {batch_size}")
    print(df_summary)

# This final part covers exercises 15-18, as splitting them apart would have been awkward

In [None]:
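# Configuration chosen from the loss tables above.
best_bs = 16
best_ep = 2
# Note: `model` holds the weights of the last configuration trained in the loop,
# which is not necessarily (best_bs, best_ep). If they differ, retrain with these
# settings (or save and reload a checkpoint) before evaluating.
model.eval()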

test_loader = DataLoader(
    WSDDataset(test_enc, test_df['label'].tolist()),
    batch_size=best_bs
)
all_preds, all_labels = [], []
for batch in test_loader:
    b_ids = batch['input_ids'].to(device)
    b_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].numpy()
    with torch.no_grad():
        logits = model(input_ids=b_ids, attention_mask=b_mask).logits
    preds = torch.argmax(logits, dim=1).cpu().numpy()
    all_preds.extend(preds)
    all_labels.extend(labels)
acc = accuracy_score(all_labels, all_preds)
precision, recall, f1, _ = precision_recall_fscore_support(
    all_labels, all_preds, average='weighted'
)
cm = confusion_matrix(all_labels, all_preds)
print(f"\nTest Accuracy:  {acc:.4f}")
print(f"Precision:      {precision:.4f}")
print(f"Recall:         {recall:.4f}")
print(f"Weighted F1:    {f1:.4f}")


plt.figure(figsize=(6,6))
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
plt.colorbar()
ticks = range(len(label_names))
plt.xticks(ticks, label_names, rotation=45)
plt.yticks(ticks, label_names)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.tight_layout()
plt.show()

print(f"Using device: {device}")