# Week 5 Exercise

In this exercise we will fine-tune a vanilla transformer on the Multi-NLI dataset and compare it to an already fine-tuned model.

In this exercise, we will:

- load the vanilla BART-large model as well as the Multi-NLI dataset from huggingface
- define a tokenization function to encode our input
- define an evaluation function to use during fine-tuning
- fine-tune and evaluate the vanilla model
- compare your fine-tuned model to an already existing one
- analyze the pretrained model on a set of linguistic phenomena

In [None]:
# install requirements (not that many this time)
!pip install transformers torch datasets evaluate

In [None]:
# import relevant libraries
import evaluate
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoConfig,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
)
import numpy as np

### Task 1: Fine-tune the vanilla BART-large model on MNLI

Check out this page to understand how to fine-tune a model on huggingface: https://huggingface.co/docs/transformers/training

First, let's load the model that we want to fine-tune. We will use BART-large (https://huggingface.co/facebook/bart-large).

In [None]:
# this is the model we will use
model_name = "facebook/bart-large"

# TODO: load tokenizer and model here

# your code here

Now let's load the MNLI dataset:

In [None]:
# load the MNLI dataset
dataset = load_dataset("multi_nli")

Next, let's define a tokenization function to tokenize our input.

Some advice on this:

- use max_length=128 if memory is an issue
- remember that your input consists of two components (premise and hypothesis), which you should give to the tokenizer at the same time
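
To make the second point concrete, here is a toy sketch of the calling convention, not the solution. With `map(..., batched=True)`, the function receives a dictionary of columns (lists) and returns a dictionary of new columns. `toy_tokenize` and the `[SEP]` marker are hypothetical stand-ins; a real tokenizer splits text into subwords, inserts its own special tokens, and truncates differently.

```python
def toy_tokenize(examples, max_length=10):
    """Toy stand-in for a tokenization function (illustration only)."""
    encoded = []
    # premise and hypothesis are processed together, pair by pair
    for premise, hypothesis in zip(examples["premise"], examples["hypothesis"]):
        tokens = premise.split() + ["[SEP]"] + hypothesis.split()
        encoded.append(tokens[:max_length])  # naive truncation to max_length
    return {"tokens": encoded}


# a batch looks like a dict of lists, one list per column
toy_batch = {
    "premise": ["A man is playing a guitar on stage.", "Two dogs run."],
    "hypothesis": ["A person is making music.", "The animals are asleep."],
}
toy_output = toy_tokenize(toy_batch, max_length=10)
for tokens in toy_output["tokens"]:
    print(len(tokens), tokens)
```

Note how the first pair is cut off inside the hypothesis: a small `max_length` saves memory but can remove part of the input.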

In [None]:
# TODO: implement the tokenization function
def tokenize(examples):

    # your code here
    
    return

Now that we can tokenize the input, let's tokenize the dataset using the `map` function: 

In [None]:
# TODO: encode dataset using the `tokenize` function

# your code here

encoded_dataset = "placeholder"

In [None]:
# define smaller train and validation dataset for speed
# note that we will test the model on the validation dataset in this assignment as we're not going to report the numbers
train_mnli = encoded_dataset["train"].select(range(20000))
val_mnli = encoded_dataset["validation_matched"].select(range(1000))

We still need to define a metric to evaluate our model and also define a function that the trainer can use to evaluate the model. We'll use accuracy as our evaluation metric:
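
The idea behind such an evaluation function can be sketched in plain Python, independent of the `Trainer`: take the highest-scoring class per row of logits and compare it to the gold label. The helper names `argmax` and `toy_accuracy` and all numbers below are made up for illustration. In the notebook you would more likely work with NumPy arrays and the loaded `metric`.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def toy_accuracy(logits, labels):
    # some models return a tuple; the first element holds the class logits
    if isinstance(logits, tuple):
        logits = logits[0]
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# made-up logits for four samples and three classes, wrapped in a tuple
toy_logits = (
    [[2.0, 0.1, -1.0], [0.2, 0.3, 1.5], [0.0, 0.9, 0.4], [1.2, 0.5, 0.1]],
    "other model output",
)
toy_labels = [0, 2, 2, 0]
toy_result = toy_accuracy(toy_logits, toy_labels)
print(toy_result)
```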

In [None]:
# define evaluation metric
metric = evaluate.load("accuracy")

In [None]:
# TODO: define the function that the trainer can use to compute and report the metric
# note that BART-large may return its logits in a tuple whose first element holds the relevant logits
# so: if isinstance(logits, tuple): logits = logits[0]

def compute_metrics(eval_pred):

    # your code here

    return 

In [None]:
# TODO: define training arguments here 

# training_args = TrainingArguments(…)

# your code here

In [None]:
# TODO: define trainer here 
# hint: create a data collator (DataCollatorWithPadding) for dynamic padding, yields better performance
#       if you do, set data_collator=data_collator in your Trainer

# trainer = Trainer(…)

# your code here 

trainer = "placeholder"
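
The hint about dynamic padding can be illustrated with a small plain-Python sketch. Padding every sequence to a fixed length wastes computation on padding tokens, while padding each batch only to its longest sequence keeps batches small. `PAD_ID`, `pad_batch`, and `dynamic_pad` are hypothetical names and values for illustration; this is not how `DataCollatorWithPadding` is implemented.

```python
PAD_ID = 1  # placeholder padding id, chosen for illustration only


def pad_batch(sequences, length, pad_id=PAD_ID):
    """Pad every sequence on the right up to `length`."""
    return [seq + [pad_id] * (length - len(seq)) for seq in sequences]


def dynamic_pad(sequences, pad_id=PAD_ID):
    """Pad only up to the longest sequence in this batch."""
    longest = max(len(seq) for seq in sequences)
    return pad_batch(sequences, longest, pad_id)


toy_sequences = [[0, 11, 12, 2], [0, 21, 2], [0, 31, 32, 33, 34, 2]]

fixed = pad_batch(toy_sequences, length=128)
dynamic = dynamic_pad(toy_sequences)

fixed_tokens = sum(len(seq) for seq in fixed)
dynamic_tokens = sum(len(seq) for seq in dynamic)
print(f"fixed padding:   {fixed_tokens} token slots")
print(f"dynamic padding: {dynamic_tokens} token slots")
```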

In [None]:
# fine-tune the model
trainer.train()

In [None]:
# evaluate fine-tuned model
results = trainer.evaluate()
print(results)

### Task 2: Compare to an already fine-tuned model

Now we want to compare your fine-tuned model to an already fine-tuned model that we can find on huggingface. As luck would have it, there is an already fine-tuned BART-large model for MNLI available, so let's compare our model to this one!

You can find the model here: https://huggingface.co/facebook/bart-large-mnli

First, load the model and tokenizer. See this blog post for help (Step 1): https://www.geeksforgeeks.org/deep-learning/how-to-use-hugging-face-pretrained-model

In [None]:
# TODO: load model and tokenizer
# note: assign the tokenizer to the same variable name as the previous one so that your tokenize function uses this tokenizer from now on

# your code here

Before we evaluate this model, let's check the label map (always a good idea to do that):

In [None]:
config = AutoConfig.from_pretrained('facebook/bart-large-mnli')
print(config.id2label)

And now let's check the MNLI dataset that we just used:

In [None]:
dataset["train"].features

Apparently, the label mapping of Facebook's fine-tuned BART-large model is the reverse of the MNLI dataset's label mapping. Make sure to account for this when evaluating the model on MNLI!
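
One way to guard against such mismatches is to derive the index mapping from the label names rather than writing it by hand. The sketch below assumes the two mappings are as described in this exercise (model: 0 contradiction, 1 neutral, 2 entailment; dataset: 0 entailment, 1 neutral, 2 contradiction). The variable names and the prediction values are made up for illustration.

```python
# assumed label order of the fine-tuned model (as printed from its config)
model_id2label = {0: "contradiction", 1: "neutral", 2: "entailment"}
# assumed label order of the MNLI dataset (position in the list = label index)
dataset_names = ["entailment", "neutral", "contradiction"]

# map each model index to the dataset index that carries the same name
model_to_dataset = {
    idx: dataset_names.index(name.lower()) for idx, name in model_id2label.items()
}
print(model_to_dataset)

# made-up raw model predictions, converted to dataset label indices
raw_predictions = [2, 0, 1, 2]
remapped = [model_to_dataset[p] for p in raw_predictions]
print(remapped)
```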

Run inference with Facebook's fine-tuned model on MNLI's validation dataset (the same subset that you used to evaluate your fine-tuned model!). Evaluate with accuracy: count the correctly classified items in your validation set and divide that count by the total number of items in the validation set. Check Step 2 from the blog post to see how you can use a pretrained model for inference.

In [None]:
# TODO: write a function that takes the premise and the hypothesis as well as the pretrained model as input and outputs the model's prediction 

# uncomment the lines below if you want to use GPU
# device = "cuda" if torch.cuda.is_available() else "cpu"
# facebook_model.to(device)

# use this label map to make sure that the indices point to the same label for the dataset and the model
# MNLI mapping: 0: Entailment,      1: Neutral,     2: Contradiction
# BART mapping: 0: Contradiction,   1: Neutral,     2: Entailment
label_map = {0: 2, 1: 1, 2: 0}

def run_finetuned_model(premise, hypothesis, model):

    return 

Now use `run_finetuned_model` on `val_mnli` and evaluate using accuracy.

Remember: accuracy is just `correctly_classified_samples` / `all_samples`
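
The counting loop itself can be sketched without a model. `stub_predict` is a hypothetical stand-in that always predicts label 0, and `toy_val` is made-up data that only mimics the `premise`, `hypothesis`, and `label` fields. In the exercise you would call `run_finetuned_model` on the samples in `val_mnli` instead.

```python
def stub_predict(premise, hypothesis):
    """Hypothetical stand-in for a model call: always predicts label 0."""
    return 0


# made-up samples with the same field names as the MNLI dataset
toy_val = [
    {"premise": "All teachers were dancing.", "hypothesis": "Some people were dancing.", "label": 0},
    {"premise": "The cafe is closed.", "hypothesis": "The cafe is open.", "label": 2},
    {"premise": "She bought a book.", "hypothesis": "She bought a novel.", "label": 1},
    {"premise": "Two dogs are running.", "hypothesis": "Animals are moving.", "label": 0},
]

correct = 0
for sample in toy_val:
    prediction = stub_predict(sample["premise"], sample["hypothesis"])
    correct += int(prediction == sample["label"])

toy_acc = correct / len(toy_val)
print(f"accuracy: {toy_acc:.2f}")
```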

In [None]:
# TODO: loop over the samples in `val_mnli` and get the prediction for each sample
# evaluate using the accuracy measure

# your code here

### Task 3: Analyse Linguistic Phenomena in MNLI

In the paper that you read for this session, the authors analyze their data by looking at some linguistic phenomena that are potentially difficult for an NLI model. We will do the same here. Choose at least two phenomena from the paper that you want to investigate and test some premise-hypothesis pairs on the `facebook/bart-large-mnli` model. For each of your categories, come up with at least 5 samples. Test the model on your samples and share your observations below.

The paper lists the following categories (see the second paragraph of Section 4.3 on p. 1119):

- Quantifiers
- Belief Verbs
- Time Terms
- Discourse Markers
- Presupposition Triggers
- Conditionals

Additionally, come up with at least one other category that is not in the paper and investigate how well the model is doing on examples from this category (at least 5 samples).

In [None]:
# this is an example, you don't have to use quantifiers but you may of course
# TODO: add your samples
quantifier_example = [("All teachers were dancing.", "Some people were dancing.")] # entailment

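# label names in the model's own index order (BART-large-MNLI uses the reverse of the MNLI dataset order)
# note: this overwrites the integer `label_map` from Task 2 and expects the model's raw class index as `prediction`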
label_map = {0: "Contradiction", 1: "Neutral", 2: "Entailment"}


for (premise, hypothesis) in quantifier_example:
    prediction = run_finetuned_model(premise, hypothesis, "facebook_model") # change "facebook_model" to your model variable
    print(f"{'Premise:':12}{premise}\n{'Hypothesis:':12}{hypothesis}\n{'Prediction:':12}{label_map[prediction]}")

\# Your observations here