# Natural Language Processing Assignment 3: The Notebook

This is the notebook for the third and final hand-in assignment for Natural Language Processing. The notebook counts for 100% of the assignment, and the assignment counts towards 20% of the final grade.

In this notebook, you will be using the Huggingface Transformers library to work with pretrained transformer-based language models. Our running task will be: Natural Language Inference.

The assignment broadly consists of four parts:

1. Data preparation: where you will learn about the task, and prepare the data to be consumed by your PyTorch model.
2. Model finetuning: where you finetune a `transformers` model on the task.
3. Multilingual comparison: where you will compare the results on the English dataset to results on its Dutch incarnation.
4. In-context learning: where you see how well a non-finetuned generative model like GPT-2 works on the same task.


### Note
When finetuning HuggingFace models, the models are saved to your computer. These files can be big (500MB-1GB), so do not hand them in! Instead, make sure that all cell outputs are visible (so: not cleared) when you hand in the assignment; this way we can see that you've done the training.


## Part 1 (14 points): Data preparation

In this part you will familiarize yourself with the task at hand: Natural Language Inference. Recall from the course lectures that Natural Language Inference is a three-way sequence classification task over two sentences. Given a premise and a hypothesis, the task is to decide whether the premise Entails, Contradicts, or is Neutral with respect to the hypothesis. We will work with the SICK (Sentences Involving Compositional Knowledge) dataset of Marelli et al. (2014) and its Dutch incarnation (Wijnholds & Moortgat, 2021).

But first, we need to ensure that we have all the right packages installed, and then make some initial package imports, as usual. We assume that by now you have `torch` already installed.

In [3]:
# HuggingFace Transformers library ([torch] is used to get the correct version of accelerate)
!pip install transformers[torch]
# HuggingFace Datasets library
!pip install datasets
# HuggingFace Evaluate library
!pip install evaluate
# Scikit Learn, for evaluation metrics and confusion matrix
!pip install scikit-learn
# Seaborn, for nice plots
!pip install seaborn


In [25]:
import torch
import numpy as np
from datasets import load_dataset
import evaluate
import transformers

### The SICK Dataset

The SICK dataset was introduced in 2014 as one of the first datasets to measure relatedness between full sentences, and it is additionally annotated with Natural Language Inference labels. The good news for us is that the Dutch version of SICK, the SICK-NL dataset, is actually on the HuggingFace Hub: ['maximedb/sick_nl'](https://huggingface.co/datasets/maximedb/sick_nl). You can go ahead and check out some samples through the link, or check out the code below; loading the data is now incredibly simple:

In [26]:
raw_datasets = load_dataset('maximedb/sick_nl')
display(type(raw_datasets['train']))
for i in range(20):
    display(raw_datasets['train'][i])


{'pair_ID': 1,
 'sentence_A': 'Een groepje kinderen speelt in een tuin en een oude man staat op de achtergrond',
 'sentence_B': 'Een groep jongens in een tuin is aan het spelen en een man staat op de achtergrond',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 4.5,
 'entailment_AB': 'A_neutral_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'A group of kids is playing in a yard and an old man is standing in the background',
 'sentence_B_original': 'A group of boys in a yard is playing and a man is standing in the background',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 2,
 'sentence_A': 'Een groep kinderen speelt in het huis en er staat geen man op de achtergrond',
 'sentence_B': 'Een groepje kinderen speelt in een tuin en een oude man staat op de achtergrond',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 3.2,
 'entailment_AB': 'A_contradicts_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'A group of children is playing in the house and there is no man standing in the background',
 'sentence_B_original': 'A group of kids is playing in a yard and an old man is standing in the background',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 3,
 'sentence_A': 'De jonge jongens spelen buiten en de man lacht in de buurt',
 'sentence_B': 'De kinderen spelen buiten in de buurt van een man met een glimlach',
 'entailment_label': 'ENTAILMENT',
 'relatedness_score': 4.7,
 'entailment_AB': 'A_entails_B',
 'entailment_BA': 'B_entails_A',
 'sentence_A_original': 'The young boys are playing outdoors and the man is smiling nearby',
 'sentence_B_original': 'The kids are playing outdoors near a man with a smile',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 2,
 'label_seq2seq': '3'}

{'pair_ID': 5,
 'sentence_A': 'De kinderen spelen buiten in de buurt van een man met een glimlach',
 'sentence_B': 'Een groepje kinderen speelt in een tuin en een oude man staat op de achtergrond',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 3.4,
 'entailment_AB': 'A_neutral_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'The kids are playing outdoors near a man with a smile',
 'sentence_B_original': 'A group of kids is playing in a yard and an old man is standing in the background',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 9,
 'sentence_A': 'De jonge jongens spelen buiten en de man lacht in de buurt',
 'sentence_B': 'Een groepje kinderen speelt in een tuin en een oude man staat op de achtergrond',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 3.7,
 'entailment_AB': 'A_neutral_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'The young boys are playing outdoors and the man is smiling nearby',
 'sentence_B_original': 'A group of kids is playing in a yard and an old man is standing in the background',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 12,
 'sentence_A': 'Twee honden zijn aan het vechten',
 'sentence_B': 'Twee honden zijn aan het worstelen en knuffelen',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 4.0,
 'entailment_AB': 'A_neutral_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'Two dogs are fighting',
 'sentence_B_original': 'Two dogs are wrestling and hugging',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 14,
 'sentence_A': 'Een bruine hond valt een ander dier aan voor de man in een broek',
 'sentence_B': 'Twee honden zijn aan het vechten',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 3.5,
 'entailment_AB': 'A_neutral_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'A brown dog is attacking another animal in front of the man in pants',
 'sentence_B_original': 'Two dogs are fighting',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 18,
 'sentence_A': 'Een bruine hond valt een ander dier aan voor de man in een broek',
 'sentence_B': 'Twee honden zijn aan het worstelen en knuffelen',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 3.2,
 'entailment_AB': 'A_neutral_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'A brown dog is attacking another animal in front of the man in pants',
 'sentence_B_original': 'Two dogs are wrestling and hugging',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 25,
 'sentence_A': 'Niemand rijdt op de fiets op één wiel',
 'sentence_B': 'Iemand in een zwart jasje doet trucjes op een motor',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 2.8,
 'entailment_AB': 'A_contradicts_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'Nobody is riding the bicycle on one wheel',
 'sentence_B_original': 'A person in a black jacket is doing tricks on a motorbike',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 26,
 'sentence_A': 'Een persoon rijdt op de fiets op één wiel',
 'sentence_B': 'Een man in een zwart jasje doet trucjes op een motor',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 3.7,
 'entailment_AB': 'A_neutral_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'A person is riding the bicycle on one wheel',
 'sentence_B_original': 'A man in a black jacket is doing tricks on a motorbike',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 28,
 'sentence_A': 'Een persoon op een zwarte motor doet trucjes met een jasje',
 'sentence_B': 'Een persoon rijdt op de fiets op één wiel',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 3.4,
 'entailment_AB': 'A_neutral_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'A person on a black motorbike is doing tricks with a jacket',
 'sentence_B_original': 'A person is riding the bicycle on one wheel',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 30,
 'sentence_A': 'Een man met een trui is de bal aan het dunken bij een basketbalwedstrijd',
 'sentence_B': 'De bal wordt gedunkt door een man met een trui bij een basketbalwedstrijd',
 'entailment_label': 'ENTAILMENT',
 'relatedness_score': 4.9,
 'entailment_AB': 'A_entails_B',
 'entailment_BA': 'B_entails_A',
 'sentence_A_original': 'A man with a jersey is dunking the ball at a basketball game',
 'sentence_B_original': 'The ball is being dunked by a man with a jersey at a basketball game',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 2,
 'label_seq2seq': '3'}

{'pair_ID': 35,
 'sentence_A': 'Een man met een trui is de bal aan het dunken bij een basketbalwedstrijd',
 'sentence_B': 'Een man die speelt, dunkt de basketbal in het net en het publiek is op de achtergrond',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 3.6,
 'entailment_AB': 'A_neutral_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'A man with a jersey is dunking the ball at a basketball game',
 'sentence_B_original': 'A man who is playing dunks the basketball into the net and a crowd is in background',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 40,
 'sentence_A': 'De speler is de basketbal in het net aan het dunken en er is een menigte op de achtergrond',
 'sentence_B': 'Een man met een trui is de bal aan het dunken bij een basketbalwedstrijd',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 3.8,
 'entailment_AB': 'A_neutral_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'The player is dunking the basketball into the net and a crowd is in background',
 'sentence_B_original': 'A man with a jersey is dunking the ball at a basketball game',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 42,
 'sentence_A': 'Twee mensen zijn aan het kickboksen en toeschouwers kijken niet',
 'sentence_B': 'Twee mensen zijn aan het kickboksen en toeschouwers kijken toe',
 'entailment_label': 'CONTRADICTION',
 'relatedness_score': 3.4,
 'entailment_AB': 'A_contradicts_B',
 'entailment_BA': 'B_contradicts_A',
 'sentence_A_original': 'Two people are kickboxing and spectators are not watching',
 'sentence_B_original': 'Two people are kickboxing and spectators are watching',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 0,
 'label_seq2seq': '1'}

{'pair_ID': 44,
 'sentence_A': 'Twee jonge vrouwen zijn aan het sparren in een kickboksgevecht',
 'sentence_B': 'Twee vrouwen zijn aan het sparren in een kickbokswedstrijd',
 'entailment_label': 'ENTAILMENT',
 'relatedness_score': 4.9,
 'entailment_AB': 'A_entails_B',
 'entailment_BA': 'B_entails_A',
 'sentence_A_original': 'Two young women are sparring in a kickboxing fight',
 'sentence_B_original': 'Two women are sparring in a kickboxing match',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 2,
 'label_seq2seq': '3'}

{'pair_ID': 45,
 'sentence_A': 'Twee jonge vrouwen zijn niet aan het sparren in een kickboksgevecht',
 'sentence_B': 'Twee vrouwen zijn aan het sparren in een kickbokswedstrijd',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 3.9,
 'entailment_AB': 'A_contradicts_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'Two young women are not sparring in a kickboxing fight',
 'sentence_B_original': 'Two women are sparring in a kickboxing match',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 47,
 'sentence_A': 'Twee mensen zijn aan het kickboksen en toeschouwers kijken toe',
 'sentence_B': 'Twee jonge vrouwen zijn niet aan het sparren in een kickboksgevecht',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 3.415,
 'entailment_AB': 'A_neutral_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'Two people are kickboxing and spectators are watching',
 'sentence_B_original': 'Two young women are not sparring in a kickboxing fight',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 49,
 'sentence_A': 'Twee vrouwen zijn aan het sparren in een kickbokswedstrijd',
 'sentence_B': 'Twee mensen zijn aan het kickboksen en toeschouwers kijken niet',
 'entailment_label': 'NEUTRAL',
 'relatedness_score': 3.7,
 'entailment_AB': 'A_neutral_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'Two women are sparring in a kickboxing match',
 'sentence_B_original': 'Two people are kickboxing and spectators are not watching',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 1,
 'label_seq2seq': '2'}

{'pair_ID': 55,
 'sentence_A': 'Drie jongens springen in de bladeren',
 'sentence_B': 'Drie kinderen springen in de bladeren',
 'entailment_label': 'ENTAILMENT',
 'relatedness_score': 4.4,
 'entailment_AB': 'A_entails_B',
 'entailment_BA': 'B_neutral_A',
 'sentence_A_original': 'Three boys are jumping in the leaves',
 'sentence_B_original': 'Three kids are jumping in the leaves',
 'sentence_A_dataset': 'FLICKR',
 'sentence_B_dataset': 'FLICKR',
 'SemEval_set': 'TRAIN',
 'label': 2,
 'label_seq2seq': '3'}

Isn't that sick? The examples above show the structure of the data: each item contains a premise and a hypothesis in Dutch (`sentence_A`, `sentence_B`) and in English (`sentence_A_original`, `sentence_B_original`). You may also notice both `entailment_label` and `label`; the latter is just the integer version of the former.

### Part 1.1 (6 points):

Check out some more examples of the train data, until you find out the correspondence between entailment labels and the integer versions of them. Prove it by finishing the implementation below:

In [27]:
label2id = {'CONTRADICTION': 0, 'NEUTRAL': 1, 'ENTAILMENT': 2}
id2label = {0: 'CONTRADICTION', 1: 'NEUTRAL', 2: 'ENTAILMENT'}

# Flag any training example whose string label does not map to its integer label
mismatch_found = False
for example in raw_datasets['train']:
    if label2id[example['entailment_label']] != example['label']:
        mismatch_found = True
        print("incorrect mapping or mistake")

if not mismatch_found:
    print("correct mapping")
else:
    print("incorrect mapping")




correct mapping
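
Instead of guessing the mapping and then checking it, the correspondence can also be read off the data directly. The following is a minimal sketch on a handful of toy records (values copied from the samples printed earlier); `infer_label_mapping` is a hypothetical helper, not part of the assignment template, and the same loop would work on `raw_datasets['train']`.

```python
# Toy records standing in for rows of raw_datasets['train']
records = [
    {'entailment_label': 'NEUTRAL', 'label': 1},
    {'entailment_label': 'ENTAILMENT', 'label': 2},
    {'entailment_label': 'CONTRADICTION', 'label': 0},
    {'entailment_label': 'NEUTRAL', 'label': 1},
]

def infer_label_mapping(rows):
    """Build label2id from the data; raise if one string label maps to two ids."""
    mapping = {}
    for row in rows:
        name, idx = row['entailment_label'], row['label']
        # setdefault stores idx on first sight and returns the stored id afterwards
        if mapping.setdefault(name, idx) != idx:
            raise ValueError(f"{name!r} maps to both {mapping[name]} and {idx}")
    return mapping

label2id_inferred = infer_label_mapping(records)
id2label_inferred = {idx: name for name, idx in label2id_inferred.items()}
print(label2id_inferred)
print(id2label_inferred)
```

Because the helper raises on the first inconsistent pair, a clean run over the whole training set doubles as the proof that the mapping is one-to-one.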


### Part 1.2 (8 points): Tokenization

Now, as we've seen in the lectures and in the HuggingFace NLP course, we need to prepare the raw data in a form that `transformers` models will understand. Unlike in the previous hand-in assignment, preparing the data is very simple. The code below loads a tokenizer for a BERT base model (uncased), creates a tokenized version of the data, and prepares a data collator, which we will need to wrap everything up properly during training.

The only missing part is the implementation of `tokenize_function` below. It takes in a data point (like the example printed above) and returns a tokenized model input, ready to pass to a BERT model. Finish its implementation, ensuring that it returns the correct tokenized input (refer to the slides if you need to recall how you tokenize a pair of sentences for BERT). You can run the test code underneath to verify your implementation.

In [35]:
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding

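# Define the tokenization function
def tokenize_function(tokenizer, example):
    # Encode premise and hypothesis together as one sentence pair.
    # Note: `sentence_A`/`sentence_B` hold the Dutch text; the English originals
    # are in `sentence_A_original`/`sentence_B_original`.
    return tokenizer(
        text=example['sentence_A'],
        text_pair=example.get('sentence_B', None),
        truncation=True,   # cut off pairs longer than max_length
        padding=False,     # padding is done per batch by the data collator
        max_length=512     # maximum sequence length of BERT base
    )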

bert_name = 'bert-base-uncased'
bert_tokenizer = AutoTokenizer.from_pretrained(bert_name)
tokenized_datasets = raw_datasets.map(lambda x: tokenize_function(bert_tokenizer, x), batched=True)
data_collator = DataCollatorWithPadding(tokenizer=bert_tokenizer)


In [33]:
print(tokenized_datasets['train'][0])
print(tokenized_datasets['train'][0].get('input_ids'))  # Safely access input_ids
print(raw_datasets['train'].column_names)


{'pair_ID': 1, 'sentence_A': 'Een groepje kinderen speelt in een tuin en een oude man staat op de achtergrond', 'sentence_B': 'Een groep jongens in een tuin is aan het spelen en een man staat op de achtergrond', 'entailment_label': 'NEUTRAL', 'relatedness_score': 4.5, 'entailment_AB': 'A_neutral_B', 'entailment_BA': 'B_neutral_A', 'sentence_A_original': 'A group of kids is playing in a yard and an old man is standing in the background', 'sentence_B_original': 'A group of boys in a yard is playing and a man is standing in the background', 'sentence_A_dataset': 'FLICKR', 'sentence_B_dataset': 'FLICKR', 'SemEval_set': 'TRAIN', 'label': 1, 'label_seq2seq': '2', 'input_ids': [101, 25212, 2078, 24665, 8913, 2361, 6460, 2785, 7869, 2078, 11867, 4402, 7096, 1999, 25212, 2078, 10722, 2378, 4372, 25212, 2078, 15068, 3207, 2158, 2358, 11057, 2102, 6728, 2139, 9353, 11039, 2121, 16523, 15422, 102, 25212, 2078, 24665, 8913, 2361, 18528, 6132, 1999, 25212, 2078, 10722, 2378, 2003, 9779, 2078, 21770,

In [36]:
print('Input IDs:')
print(tokenized_datasets['train'][0]['input_ids'])
print('\nToken Type IDs:')
print(tokenized_datasets['train'][0]['token_type_ids'])
print('\nAttention Mask:')
print(tokenized_datasets['train'][0]['attention_mask'])

Input IDs:
[101, 25212, 2078, 24665, 8913, 2361, 6460, 2785, 7869, 2078, 11867, 4402, 7096, 1999, 25212, 2078, 10722, 2378, 4372, 25212, 2078, 15068, 3207, 2158, 2358, 11057, 2102, 6728, 2139, 9353, 11039, 2121, 16523, 15422, 102, 25212, 2078, 24665, 8913, 2361, 18528, 6132, 1999, 25212, 2078, 10722, 2378, 2003, 9779, 2078, 21770, 11867, 12260, 2078, 4372, 25212, 2078, 2158, 2358, 11057, 2102, 6728, 2139, 9353, 11039, 2121, 16523, 15422, 102]

Token Type IDs:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Attention Mask:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
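
The three lists printed above follow a fixed layout: `[CLS] A [SEP] B [SEP]`, with segment id 0 for the first sentence (including `[CLS]` and the first `[SEP]`) and segment id 1 for the second. The sketch below reproduces that layout, plus per-batch padding, in plain Python. It uses a toy whitespace tokenizer rather than WordPiece, and `encode_pair` and `pad_batch` are hypothetical names chosen for illustration; it is not how the `transformers` tokenizer or `DataCollatorWithPadding` are implemented.

```python
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"

def encode_pair(premise, hypothesis):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP]."""
    tokens_a = premise.lower().split()      # toy whitespace tokenizer, NOT WordPiece
    tokens_b = hypothesis.lower().split()
    tokens = [CLS] + tokens_a + [SEP] + tokens_b + [SEP]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)      # no padding yet, so every position is real
    return {"tokens": tokens,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}

def pad_batch(encodings):
    """Pad every encoding to the length of the longest one in the batch."""
    longest = max(len(e["tokens"]) for e in encodings)
    padded = []
    for e in encodings:
        extra = longest - len(e["tokens"])
        padded.append({
            "tokens": e["tokens"] + [PAD] * extra,
            "token_type_ids": e["token_type_ids"] + [0] * extra,
            "attention_mask": e["attention_mask"] + [0] * extra,  # 0 = ignore padding
        })
    return padded

batch = pad_batch([
    encode_pair("Two dogs are fighting", "Two dogs are wrestling and hugging"),
    encode_pair("Three boys are jumping in the leaves",
                "Three kids are jumping in the leaves"),
])
for item in batch:
    print(item["tokens"])
    print(item["token_type_ids"])
    print(item["attention_mask"])
```

Padding per batch rather than to a global maximum is why `padding=False` is a reasonable choice at tokenization time: short pairs are only padded up to the longest pair they happen to be batched with.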


## Part 2 (30 points): Finetuning BERT

So far so good. Now it's time to finetune a BERT model! Don't worry, the dataset was chosen to be small enough for you to finetune on your own machine (and on CPU), if you have 8GB+ of working memory available. Okay let's get to it.

### Part 2.1 (6 Points)

Given that we have the name of the BERT model we want to train, we need to load in a pretrained model. Finish the one-liner below to set up a model for three-way classification so you can finetune for Natural Language Inference:

In [37]:
from transformers import AutoModelForSequenceClassification

bert_model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Wow, a one-liner to define a whole model! Let's continue with the model training.

### Part 2.2 (12 Points)

Let's get to training. Again, HuggingFace provides us with a lot of built-in functionality. The code below sets everything up: `compute_metrics` implements the method for calculating accuracy during training, using the `evaluate` library. Then, we have to set up a `Trainer` with a number of `TrainingArguments`. Finish the implementation so that we will run for 3 epochs, with a training batch size low enough for your machine (on the test machine, an M1 MacBook Air 2020 with 16GB working memory, a batch size of 16 was used). Check what device the implementation is going to use (CPU, CUDA, MPS?).

In [42]:
from transformers import TrainingArguments, Trainer

# def compute_metrics(eval_preds):
#     accuracy = evaluate.load("accuracy")
#     logits, labels = eval_preds
#     predictions = np.argmax(logits, axis=-1)
#     return accuracy.compute(predictions=predictions, references=labels)

# # If you're running on an M1/M2 MacBook, with MPS backend support,
# # you can replace "TrainingArguments" by "TrainingArgumentsWithMPSSupport"
# # If not, just ignore this Python class!
# class TrainingArgumentsWithMPSSupport(TrainingArguments):

#     @property
#     def device(self) -> torch.device:
#         if torch.cuda.is_available():
#             return torch.device("cuda")
#         elif torch.backends.mps.is_available():
#             return torch.device("mps")
#         else:
#             return torch.device("cpu")

# training_args = TrainingArguments("my-trainer",
#                                   per_device_train_batch_size=NotImplemented,
#                                   num_train_epochs=NotImplemented,
#                                   logging_strategy="epoch",
#                                   evaluation_strategy="epoch",
#                                   save_strategy="epoch",
#                                   dataloader_num_workers=0,
#                                   load_best_model_at_end=True,
#                                   save_total_limit=2)

# trainer = Trainer(
#     bert_model,
#     training_args,
#     train_dataset=NotImplemented,
#     eval_dataset=NotImplemented,
#     data_collator=data_collator,
#     tokenizer=bert_tokenizer,
#     compute_metrics=compute_metrics
# )
# display(training_args.device)




def compute_metrics(eval_preds):
    accuracy = evaluate.load("accuracy")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# Use the built-in TrainingArguments without modification
training_args = TrainingArguments(
    "my-trainer",                    # Directory for saving models and checkpoints
    per_device_train_batch_size=16,  # Adjust batch size based on your machine
    num_train_epochs=3,              # Train for 3 epochs
    logging_strategy="epoch",        # Log metrics at the end of each epoch
    evaluation_strategy="epoch",     # Evaluate the model at the end of each epoch
    save_strategy="epoch",           # Save a checkpoint at the end of each epoch
    dataloader_num_workers=0,        # Load data in the main process
    load_best_model_at_end=True,     # Reload the best checkpoint after training
    save_total_limit=2,              # Keep at most 2 checkpoints on disk
    push_to_hub=False                # Do not push to the Hugging Face Hub
)

# Initialize Trainer
trainer = Trainer(
    model=bert_model,                               # The model to train
    args=training_args,                             # Training arguments
    train_dataset=tokenized_datasets["train"],      # Training dataset
    eval_dataset=tokenized_datasets["validation"],  # Evaluation dataset
    data_collator=data_collator,                    # Pads each batch dynamically
    compute_metrics=compute_metrics                 # Accuracy on the evaluation set
)

# Check which device (CPU, CUDA, MPS) training will run on
display(training_args.device)
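
To make explicit what `compute_metrics` does with the model output, here is a dependency-free sketch of the same computation on toy numbers: take the argmax over the three logits of each example and compare it with the gold label. The helper names and the logits are made up for illustration; the notebook itself relies on `np.argmax` and the `evaluate` accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the gold label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# One row of logits per example, one column per class (0, 1, 2)
toy_logits = [
    [2.1, 0.3, -1.0],   # highest score at index 0
    [0.2, 1.5, 0.1],    # highest score at index 1
    [-0.5, 0.9, 0.4],   # highest score at index 1
    [0.0, 0.1, 3.2],    # highest score at index 2
]
toy_labels = [0, 1, 2, 2]

print(accuracy_from_logits(toy_logits, toy_labels))  # 3 of 4 correct
```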




Now, press the button on the cell below, and make some tea while you wait for the finetuning to finish :-)

In [43]:
trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.6912        | 0.606709        | 0.727273 |
| 2     | 0.4323        | 0.534876        | 0.779798 |
| 3     | 0.2972        | 0.623345        | 0.773737 |


TrainOutput(global_step=834, training_loss=0.47354761759440106, metrics={'train_runtime': 10808.152, 'train_samples_per_second': 1.232, 'train_steps_per_second': 0.077, 'total_flos': 452362882362804.0, 'train_loss': 0.47354761759440106, 'epoch': 3.0})
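
The numbers in the training log can be sanity-checked with a little arithmetic. The sketch below uses only values copied from the log above and assumes a single device and no gradient accumulation, so that one optimizer step equals one batch. It also picks the best epoch the way `load_best_model_at_end=True` does when `metric_for_best_model` is left unset, namely by lowest validation loss.

```python
# Values copied from the training log above
global_step = 834
num_train_epochs = 3
per_device_train_batch_size = 16

# Assumption: one device, no gradient accumulation, last partial batch is kept
steps_per_epoch = global_step // num_train_epochs
# ceil(n / batch_size) == steps_per_epoch bounds the number of training examples n
min_examples = (steps_per_epoch - 1) * per_device_train_batch_size + 1
max_examples = steps_per_epoch * per_device_train_batch_size
print(steps_per_epoch, min_examples, max_examples)

# Per-epoch validation results, also copied from the log
epochs = [
    {"epoch": 1, "eval_loss": 0.606709, "eval_accuracy": 0.727273},
    {"epoch": 2, "eval_loss": 0.534876, "eval_accuracy": 0.779798},
    {"epoch": 3, "eval_loss": 0.623345, "eval_accuracy": 0.773737},
]
best_by_loss = min(epochs, key=lambda e: e["eval_loss"])
best_by_accuracy = max(epochs, key=lambda e: e["eval_accuracy"])
print(best_by_loss["epoch"], best_by_accuracy["epoch"])
```

In this run both criteria select epoch 2. The training loss keeps falling in epoch 3 while the validation loss rises, which is the usual sign that the model is starting to overfit.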

Finally, use the `Trainer` (which has already loaded the best-performing checkpoint/epoch) to evaluate on the test set and display the test accuracy. It should be above 80%.

In [None]:
# 2.2b Solution:
# `predict` returns a (predictions, label_ids, metrics) tuple; metric names get a "test_" prefix
test_predictions = trainer.predict(test_dataset=tokenized_datasets["test"])
print('Test accuracy: ', test_predictions.metrics['test_accuracy'])


### Part 2.3 (12 Points)

Wasn't that incredibly easy? However, we would now like a bit more insight into the model's predictions. For this, we are going to look into precision, recall, and F1 score for the different classes.
First, complete the implementation below to retrieve, for the test set, the model's predicted labels and the correct labels. Then inspect the resulting confusion matrix and its pretty-printed heatmap version.
Finally, the precision, recall and F1 score are also printed. Use those to explain the confusion matrix: are the model's predictions in the rows and the correct answers in the columns, or the other way around?

*If you're confused about what a confusion matrix is, check out [Scikit Learn's documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) or review the slides of Week 3 (the part on multiclass evaluation and micro/macro-averaging).*

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
import seaborn as sns

preds = NotImplemented
trues = NotImplemented

cf_matrix = confusion_matrix(trues, preds)
display(cf_matrix)

sns.heatmap(cf_matrix, annot=True, cmap='Blues', fmt='g')

precision, recall, f1, support = precision_recall_fscore_support(trues, preds)
display(precision)
display(recall)
display(f1)
display(support)
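
One way to settle the orientation question is to recompute precision and recall from the matrix by hand under one assumed layout and compare them with the values that are printed. The sketch below does this on toy labels in plain Python, assuming rows hold the true classes and columns the predicted ones; `confusion` and `per_class_scores` are hypothetical helpers, not scikit-learn functions. If the hand-computed numbers match the printed ones, the assumed layout is right; if precision and recall come out swapped, it is the other way around.

```python
def confusion(trues, preds, num_classes):
    """matrix[i][j] counts examples with true class i that were predicted as class j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(trues, preds):
        matrix[t][p] += 1
    return matrix

def per_class_scores(matrix):
    """Under the rows=true layout: precision uses column sums, recall uses row sums."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append((precision, recall, f1))
    return scores

# Toy gold labels and predictions (0 = contradiction, 1 = neutral, 2 = entailment)
toy_trues = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
toy_preds = [0, 1, 1, 1, 1, 2, 2, 2, 1, 2]

toy_matrix = confusion(toy_trues, toy_preds, num_classes=3)
toy_scores = per_class_scores(toy_matrix)
for row in toy_matrix:
    print(row)
for k, (p, r, f) in enumerate(toy_scores):
    print(k, round(p, 3), round(r, 3), round(f, 3))
```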

## Part 3 (24 points): Multilingual comparison

Hey, this dataset you're using... it contains English and Dutch! In fact, let's revisit an example:

In [None]:
display(raw_datasets['train'][0])

In fact, the items in the dataset are all *aligned*. That is, each `sentence_A` is a Dutch translation of `sentence_A_original`, and each `sentence_B` is a translation of `sentence_B_original`. That means we could also finetune a Dutch BERT model on the same dataset! Your task for this part is to do exactly this, and then compare results.

### Part 3.1 (12 points)

In this part, your task is quite simple: repeat the finetuning exactly as before, but now use:

- (a) the Dutch sentences instead of the English ones
- (b) a Dutch tokenizer and BERT model, as indicated below

In the end, report the test set accuracy, and the other evaluation metrics (precision, recall, F1) exactly as above, again plotting the confusion matrix.

In [None]:
nl_bert_name = 'GroNLP/bert-base-dutch-cased'
nl_tokenizer = NotImplemented
nl_model = NotImplemented
NotImplemented

### Part 3.2 (12 points)
Now we wish to quantify the difference between the Dutch and English model results. Execute the following:

1. Gather those predictions and true labels for which the English and Dutch model disagree, quantifying the percentage of cases where they disagree.
2. Then, calculate and display the English confusion matrix, and Dutch confusion matrix for these cases.
3. Then report on your findings. For example, when the models disagree, does the English model have a stronger tendency to classify sentence pairs as Neutral?

*Note: the heatmap plots are in separate cells to prevent Seaborn from plotting them on top of each other :)*

In [None]:
dis_trues, dis_en_preds, dis_nl_preds = NotImplemented

dis_en_cf_matrix = confusion_matrix(dis_trues, dis_en_preds)
dis_nl_cf_matrix = confusion_matrix(dis_trues, dis_nl_preds)
sns.heatmap(dis_en_cf_matrix, annot=True, cmap='Blues', fmt='g')

In [None]:
sns.heatmap(dis_nl_cf_matrix, annot=True, cmap='Blues', fmt='g')
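
The filtering step in item 1 above can be sketched on toy lists as follows. `split_disagreements` is a hypothetical helper and the label lists are invented; in the notebook, the inputs would be the test labels and the two models' predicted labels, all in the same example order.

```python
def split_disagreements(trues, en_preds, nl_preds):
    """Keep only the positions where the two models predict different labels."""
    kept = [(t, e, n) for t, e, n in zip(trues, en_preds, nl_preds) if e != n]
    rate = 100.0 * len(kept) / len(trues)   # percentage of disagreements
    if kept:
        dis_trues, dis_en, dis_nl = (list(col) for col in zip(*kept))
    else:
        dis_trues, dis_en, dis_nl = [], [], []
    return dis_trues, dis_en, dis_nl, rate

# Toy labels: 0 = contradiction, 1 = neutral, 2 = entailment
toy_trues    = [0, 1, 1, 2, 2, 1, 0, 2]
toy_en_preds = [0, 1, 2, 2, 1, 1, 1, 2]
toy_nl_preds = [0, 1, 1, 2, 2, 1, 0, 1]

toy_dis_trues, toy_dis_en, toy_dis_nl, toy_rate = split_disagreements(
    toy_trues, toy_en_preds, toy_nl_preds)
print(f"Models disagree on {toy_rate:.1f}% of the examples")
print(toy_dis_trues, toy_dis_en, toy_dis_nl)
```

Note that on the disagreement subset at most one of the two models can be right for any example, so the two confusion matrices are best read side by side rather than in isolation.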

#### Explanation
[Answer for 3. here]

## Part 4 (32 points): In-context learning

Okay, while it's great that we can reach high accuracy with low effort by using built-in functionality from HuggingFace, let's try and see if we can do without fine-tuning at all, and use *in-context learning* with a generative model on the exact same task.

Recall that for in-context learning, we take a large pretrained generative model (like GPT-3) and give it a prompt that specifies our task format. We then hope that, for a new case, it generates text that corresponds to the correct answer! In this way we can do classification as well.

Now for the bad news: OpenAI never publicly released the weights of their GPT-3+ models, so we will make do with the last openly available version, GPT-2*. No worries though: since this model is so much smaller, we can actually use it on our own machines ;-) In the end, you just need to change the model's name to try out a larger model as soon as you get your hands on a powerful enough computing device.

Let's start with setting up the GPT-2 model in the right setting: text generation. We will make use of HuggingFace's built-in `pipeline` for this. Start by running the below code that sets up the generative model in text generation mode.

**In fact, the version of GPT-2 we'll be using is not even the largest GPT-2 model around. But hey, you get the general idea right? Just swapping a model's name will allow us to perform the exact same task, just with a larger model.*

In [None]:
from transformers import pipeline, AutoTokenizer
import torch

torch.manual_seed(0)
model = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(model)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    pad_token_id=tokenizer.eos_token_id)

Now let's see how generation works with some starting prompt. Note that we're setting a seed to guarantee that the model will give the same output.

In [None]:
from transformers import set_seed
set_seed(5287935)

prompt = "My incomplete sentence is"
sequences_full = pipe(prompt, max_new_tokens=15, return_full_text=True)
sequences_generated = pipe(prompt, max_new_tokens=15, return_full_text=False)
print(sequences_full[0]['generated_text'])
print('\n\n')
print(sequences_generated[0]['generated_text'])

Okay, let's try a serious NLI prompt. We will try a few-shot setting in which the model will have seen one example for each NLI label. The code below grabs one example for each label from the validation data, and places them in a prompt, which contains a test pair (Does "This is difficult." entail "This is easy."?). Then, we ask the model to generate the next few tokens to see if it gives some sensible prediction.

In [None]:
co_example = raw_datasets['validation'][0]
co_example_sentence_a = co_example['sentence_A_original']
co_example_sentence_b = co_example['sentence_B_original']

ne_example = raw_datasets['validation'][1]
ne_example_sentence_a = ne_example['sentence_A_original']
ne_example_sentence_b = ne_example['sentence_B_original']

en_example = raw_datasets['validation'][7]
en_example_sentence_a = en_example['sentence_A_original']
en_example_sentence_b = en_example['sentence_B_original']

prompt = f"For Sentence A and Sentence B, classify as Entailment, Neutral, or Contradiction.\n\
Sentence A: {co_example_sentence_a}\nSentence B: {co_example_sentence_b}\nNLI Label: Contradiction.\n\
Sentence A: {ne_example_sentence_a}\nSentence B: {ne_example_sentence_b}\nNLI Label: Neutral.\n\
Sentence A: {en_example_sentence_a}\nSentence B: {en_example_sentence_b}\nNLI Label: Entailment.\n\
Sentence A: This is difficult.\nSentence B: This is easy.\nNLI Label:"

prompting_examples = pipe(prompt, max_new_tokens=3, return_full_text=False)
print(prompting_examples[0]['generated_text'])

Pretty cool, right? Is the answer correct?

If we want to systematically assess how well the model does on the full dataset, we need three ingredients, which are also the steps you will follow:

1. A way to encode each sentence pair as a prompt.
2. A loop to run the model on all of the prompts.
3. Functionality to decode the model's output back to an NLI label.

### Part 4.1 (10 points)

First, implement the function `create_prompt` below so that it returns an NLI prompt like the one above, but with the two given sentences (A and B) as the test pair. Then run the code underneath to see what happens.

*Hint: you can re-use the prompt above in your solution.*

In [None]:
def create_prompt(sentence_a, sentence_b):
    NotImplemented

In [None]:
ex = raw_datasets['test'][0]
prompt = create_prompt(ex['sentence_A_original'], ex['sentence_B_original'])
print(prompt)
sequences = pipe(prompt, max_new_tokens=3, return_full_text=False)
display(sequences)

### Part 4.2 (10 points)

You may notice that the output is not exactly clean, and it could even be a completely different text than an NLI label! So you'll need to finish the function `decode_prompting_result` below, that will take the output of the generation and return an actual label. Note that the function should return the correct label if the output corresponds to an NLI label, and a fourth label in case the output is something different.
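
As an illustration of what such a decoding step could look like, here is a small sketch of one possible approach: take the first alphabetic word of the generated text and look it up in a label mapping. The names `toy_label2id`, `OTHER` and `toy_decode` are hypothetical, and the mapping below is made up for the example; the notebook's own mapping may use different names and ids.

```python
import re

# Hypothetical mapping for illustration only; not the notebook's own mapping.
toy_label2id = {"Entailment": 0, "Neutral": 1, "Contradiction": 2}
OTHER = len(toy_label2id)  # id of the fourth label, used for anything else


def toy_decode(generated: str) -> int:
    # Find the first run of letters, skipping leading spaces and newlines.
    match = re.search(r"[A-Za-z]+", generated)
    # Normalise the casing so that "neutral" and "NEUTRAL" both match.
    word = match.group(0).capitalize() if match else ""
    # Fall back to the fourth label when the word is not an NLI label.
    return toy_label2id.get(word, OTHER)


print(toy_decode(" Contradiction.\n"))
print(toy_decode("\nSentence A"))
```

Under this scheme, a label that is cut off by the token limit also ends up in the fourth class; matching on a prefix of each label name would be a more lenient alternative.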

In [None]:
def decode_prompting_result(result: str) -> int:
    result_label = NotImplemented
    if result_label in label2id:
        return label2id[result_label]
    else:
        return NotImplemented

### Part 4.3 (5 points)

Now to actually run the whole thing: for each item in the test data we want to create a prompt, feed it to the model, and transform the result into a label. We want to end up with a list of predictions, just like with the finetuned models before. The only difference is that we now have a fourth possible label. Run the code below as is (don't forget to make tea while you wait!), and then run the code underneath to see a sample of the predictions.

In [None]:
from tqdm import tqdm

generation_preds = []
for d in tqdm(raw_datasets['test']):
    prompt = create_prompt(d['sentence_A_original'], d['sentence_B_original'])
    results = pipe(prompt, max_new_tokens=3, return_full_text=False)
    label = decode_prompting_result(results[0]['generated_text'])
    generation_preds.append(label)

In [None]:
print(generation_preds[:50])

### Part 4.4 (7 points)

As a last step, do what you do best and display the confusion matrix, precision, recall and F1 score for the prompting setup.
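
To see how the fourth label affects the scores, the sketch below computes per-class precision, recall and F1 by hand on made-up data. The function `per_class_scores` and the toy lists are hypothetical and only meant to show the arithmetic; in practice a library routine does the same job.

```python
def per_class_scores(trues, preds, labels):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    scores = {}
    for label in labels:
        pairs = list(zip(trues, preds))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Define a score as 0.0 when its denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy data: label 3 plays the role of the extra "other" class.
demo_trues = [0, 1, 2, 1, 0, 2]
demo_preds = [0, 3, 2, 1, 1, 3]
demo_scores = per_class_scores(demo_trues, demo_preds, labels=[0, 1, 2, 3])
for label, (p, r, f) in demo_scores.items():
    print(f"label {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Because the fourth label never occurs in the true labels, its recall has a zero denominator. Scikit-learn's `precision_recall_fscore_support` and `classification_report` expose a `labels` argument and a `zero_division` argument for exactly this situation.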

In [None]:
# Your 4.4 Solution:

NotImplemented

### Bonus Exercise: Decoding strategies

If you are unsatisfied with the result, you may be happy to know that the decoding strategies you saw in the lecture (such as beam search, top-k sampling, and top-p (nucleus) sampling) can also be applied here, by passing the corresponding generation arguments to `pipe` when you run it on a prompt. You will get bonus points for trying out at least two different generation strategies and seeing how this affects the result.
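
As a starting point, the sketch below collects a few decoding configurations in a dictionary and forwards them as keyword arguments. `num_beams`, `do_sample`, `top_k` and `top_p` are generation arguments of the `transformers` library; the names `strategies` and `run_with_strategy` are hypothetical, and the values are arbitrary illustrations rather than tuned settings.

```python
# Generation keyword arguments per strategy; values are illustrative only.
strategies = {
    "greedy": dict(do_sample=False),
    "beam_search": dict(num_beams=5, do_sample=False),
    "top_k": dict(do_sample=True, top_k=50),
    # top_k=0 switches off top-k filtering so only top-p is applied.
    "top_p": dict(do_sample=True, top_p=0.9, top_k=0),
}


def run_with_strategy(pipe, prompt, name):
    # `pipe` is any callable with the text-generation pipeline's interface:
    # it returns a list of dicts with a 'generated_text' entry.
    out = pipe(prompt, max_new_tokens=3, return_full_text=False,
               **strategies[name])
    return out[0]["generated_text"]
```

Calling `run_with_strategy(pipe, prompt, "beam_search")` with the notebook's `pipe` would then generate with beam search; swapping the name is enough to compare strategies on the same prompt.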

### Alternative Bonus Exercise: Seed-averaging

You may notice that text generation can be different on the same prompt each time that you run the model. You can score bonus points by running the model over the dataset three times and aggregating the results in a way you choose.
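
One simple way to aggregate, sketched here on made-up data, is a majority vote per test item. The function `majority_vote` and the toy runs are hypothetical; it assumes each run is a list of label ids of equal length, and that items without a strict majority should receive a fallback label. Other aggregation schemes are equally valid.

```python
from collections import Counter


def majority_vote(runs, fallback):
    """Combine several prediction lists into one by per-item majority vote."""
    aggregated = []
    # zip(*runs) yields, for each test item, the labels from all runs.
    for votes in zip(*runs):
        label, count = Counter(votes).most_common(1)[0]
        # Require a strict majority; otherwise use the fallback label.
        aggregated.append(label if count > len(votes) / 2 else fallback)
    return aggregated


# Three toy runs over four items; label 3 stands for "other".
toy_runs = [
    [0, 1, 2, 3],
    [0, 2, 2, 1],
    [0, 1, 3, 2],
]
print(majority_vote(toy_runs, fallback=3))
```

For the runs to differ at all, each one needs sampling enabled and its own seed, for example by calling `set_seed` with a different value before each pass over the dataset.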