---

<div align='center'>
<font size="+2">

Text Mining and Natural Language Processing  
2023-2024

<b>SelectWise</b>

Alessandro Ghiotto 513944

</font>
</div>

---

# Notebook 3 - Transformer Encoder Only:

1. BERT:
    - Binary classification - NextSentencePrediction
    - Multiclass classification - MultipleChoice
2. Different ways of tuning a pretrained model:
    - Linear probing
    - Mixed method

---

# **BERT**

BERT is a pretrained encoder-only transformer: it encodes the input sequence, and the output vector of the `[CLS]` token is fed to a classification head. Every input token gets an output vector, which can be seen as a contextualized embedding of that token. Here I use the `'bert-base-uncased'` model.

Each sample produces 8 inputs, one per choice, of this form:

**"[CLS] fact_1 fact_2 [SEP] question [SEP] choice_i [SEP]"** for each choice i in ['A','B','C','D','E','F','G','H']

![](../imgs/3_BERT_multiplechoice_crop.png "BERT for multiple choice")
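As a purely illustrative sketch of this layout, the eight inputs for one sample could be assembled as plain strings like this. In the notebook the special tokens are inserted by the tokenizer, not by string formatting, and the helper name `build_choice_inputs` is hypothetical. The values are those of the first training sample.

```python
def build_choice_inputs(fact1, fact2, question, choices):
    """Return one input string per choice, mimicking the layout above."""
    context = f"{fact1} {fact2}"
    return [
        f"[CLS] {context} [SEP] {question} [SEP] {choice} [SEP]"
        for choice in choices
    ]

# Values of the first training sample of the dataset
choices = ["pearls", "streams", "shells", "diamonds",
           "rain", "beads", "cooled", "liquid"]
inputs = build_choice_inputs(
    "beads of water are formed by water vapor condensing",
    "Clouds are made of water vapor.",
    "What type of water formation is formed by clouds?",
    choices,
)

# Show the first two of the eight inputs
for text in inputs[:2]:
    print(text)
```

The context (the two facts) and the question are repeated in all eight inputs; only the last segment changes.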

Load Data

In [1]:
from datasets import load_dataset
import pandas as pd
import numpy as np
import random
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import Dataset
sns.set_theme(style="darkgrid")

# SEED
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.random.manual_seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
seed = 8
set_seed(seed)

# DEVICE and DTYPE
mydevice = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.set_default_device(mydevice) # default tensor device
torch.set_default_dtype(torch.float32) # default tensor dtype

# DATASET
dataset = load_dataset("allenai/qasc")
n_train_sample = 7323
dataset_train = dataset['train'].select(range(n_train_sample))
dataset_val = dataset['train'].select(range(n_train_sample, len(dataset['train'])))
dataset_test = dataset['validation']

def format_choices(example):
    # transform the choices from a dictionary to a list of strings
    # drop the labels, provided the order is always the same
    # (check that every sample lists its choices in the same order)
    if example['choices']['label'] == ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']:
        # get the text of the choices
        example['choices'] = example['choices']['text']
    else:
        print("The order of the choices is not the same for all the examples")
    
    # transform the answerKey from a string to an integer
    example['answerKey_int'] = ord(example['answerKey']) - 65
    return example

dataset_train = dataset_train.map(format_choices)
dataset_val = dataset_val.map(format_choices)

# Display the dataset
dataset_train[0]

{'id': '3E7TUJ2EGCLQNOV1WEAJ2NN9ROPD9K',
 'question': 'What type of water formation is formed by clouds?',
 'choices': ['pearls',
  'streams',
  'shells',
  'diamonds',
  'rain',
  'beads',
  'cooled',
  'liquid'],
 'answerKey': 'F',
 'fact1': 'beads of water are formed by water vapor condensing',
 'fact2': 'Clouds are made of water vapor.',
 'combinedfact': 'Beads of water can be formed by clouds.',
 'formatted_question': 'What type of water formation is formed by clouds? (A) pearls (B) streams (C) shells (D) diamonds (E) rain (F) beads (G) cooled (H) liquid',
 'answerKey_int': 5}
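The `answerKey_int` field relies on character codes: `'A'` has code point 65, so subtracting 65 maps the letters `'A'`..`'H'` to the integers 0..7. A small sketch of the mapping in both directions (the two helper names are hypothetical):

```python
def answer_key_to_int(answer_key):
    # 'A' has code point 65, so 'A' -> 0, 'B' -> 1, ..., 'H' -> 7
    return ord(answer_key) - ord("A")

def int_to_answer_key(index):
    # inverse mapping: 0 -> 'A', ..., 7 -> 'H'
    return chr(ord("A") + index)

print(answer_key_to_int("F"))  # 5, as in the 'answerKey_int' field of the sample above
print(int_to_answer_key(5))    # F
```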

## **Binary Classification**

I train the model as a NextSentencePrediction task (with `AutoModelForNextSentencePrediction`): for each input I simply ask whether the given choice is correct or not. The output therefore has just two values, one for the positive class and one for the negative class.

![picture](../imgs/3_BERTbinary.png)
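To make the binary setup concrete, here is a minimal sketch of how a single answer key turns into eight binary labels, one per input. The helper name `binary_labels` is hypothetical; the label convention (1 = correct choice, 0 = incorrect choice) is the one used in this notebook.

```python
def binary_labels(answer_key, num_choices=8):
    """One label per choice: 1 for the correct choice, 0 for the others."""
    correct = ord(answer_key) - ord("A")
    return [1 if i == correct else 0 for i in range(num_choices)]

labels = binary_labels("F")
print(labels)  # [0, 0, 0, 0, 0, 1, 0, 0]
```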

Get the `tokenizer` and the `model`

In [11]:
from transformers import AutoTokenizer

mydevice = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# EXAMPLE
res = tokenizer('fact1 and fact2', 'question [SEP] choice')
print(res)
print(tokenizer.decode(res['input_ids']))

{'input_ids': [101, 2755, 2487, 1998, 2755, 2475, 102, 3160, 102, 3601, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] fact1 and fact2 [SEP] question [SEP] choice [SEP]


In [12]:
from transformers import AutoModelForNextSentencePrediction

model = AutoModelForNextSentencePrediction.from_pretrained(model_name)

model.resize_token_embeddings(len(tokenizer))



Embedding(30522, 768, padding_idx=0)

In [13]:
print(model)

BertForNextSentencePrediction(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [10]:
# in particular we can see the last layer -> classifier
named_layers = list(model.named_children())
num_last_layers = 1
for name, layer in named_layers[-num_last_layers:]:
    print(f"Layer Name: {name}, Layer: {layer}")

Layer Name: classifier, Layer: Linear(in_features=768, out_features=1, bias=True)


Create the `preprocess_function`, which takes batches of the dataset as input and transforms them into 8 inputs per sample, each with `['input_ids', 'token_type_ids', 'attention_mask', 'labels']`

In [15]:
# I don't return torch tensors or pad here:
# padding is left to the data collator, which is more memory efficient.
# Since the sentences are short, there is no need to pad to the max length

def preprocess_function_NextSentecePrediction(examples):
    # attach fact1 and fact2
    # and repeat each sentence 8 times to go with the 8 choices
    first_sentences = [[f"{examples['fact1'][i]} {examples['fact2'][i]}"] * 8 for i in range(len(examples["fact1"]))]
    # Grab all second sentences, the questions.
    questions = examples["question"]
    second_sentences = [
        [f"{question} [SEP] {examples['choices'][i][choice_idx]}" for choice_idx in range(8)]
        for i, question in enumerate(questions)
    ]

    # Flatten everything
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, max_length=512,  truncation=True)

    # Create the labels
    # 1: correct choice, 0: incorrect choice
    answerKeys = examples['answerKey'] # list of correct choices ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
    new_labels = [] # length: 8 * len(answerKeys)
    for answerKey in answerKeys:
        new_labels.extend([1 if chr(65+i) == answerKey else 0 for i in range(8)])

    tokenized_examples['labels'] = new_labels

    return tokenized_examples

In [16]:
# Apply the preprocessing function to the dataset
dataset_train_encoded = dataset_train.map(preprocess_function_NextSentecePrediction, 
                                          batched=True, remove_columns=dataset_train.column_names)
dataset_val_encoded = dataset_val.map(preprocess_function_NextSentecePrediction, 
                                      batched=True, remove_columns=dataset_val.column_names)
# we got the input_ids, token_type_ids and attention_mask
dataset_train_encoded.column_names

['input_ids', 'token_type_ids', 'attention_mask', 'labels']

Create the `DataCollator`, which dynamically pads the sequences in each batch; this is more memory efficient than padding the whole dataset to the longest sequence.
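The idea behind dynamic padding can be sketched without any library. This is a simplified illustration, not the collator used below: the function `pad_batch` and the toy token ids are made up, and `pad_id=0` is assumed to match the `[PAD]` id of the tokenizer.

```python
def pad_batch(sequences, pad_id=0, target_length=None):
    """Right-pad every sequence to target_length (default: longest in the batch)."""
    if target_length is None:
        target_length = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (target_length - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (target_length - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

# Toy token ids (made up for illustration)
batch = [[101, 7, 8, 102], [101, 7, 102], [101, 7, 8, 9, 10, 102]]

# Dynamic padding: pad only up to the longest sequence of this batch
dynamic_ids, dynamic_mask = pad_batch(batch)
# Static padding: pad everything to a fixed global length
static_ids, _ = pad_batch(batch, target_length=512)

print(len(dynamic_ids[0]), len(static_ids[0]))
print(dynamic_mask[1])
```

With short sentences, as in this dataset, the dynamically padded batch is far smaller than one padded to a fixed length of 512.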

In [17]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForNextSentencePrediction:
    """
    Data collator that dynamically pads the next-sentence-prediction inputs it receives.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        # take the labels out
        label_name = "labels"
        labels = [feature.pop(label_name) for feature in features]

        # pad
        batch = self.tokenizer.pad(
            features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)

        return batch

In [18]:
# EXAMPLE
# the idx-th sample in the train set yields these 8 inputs
idx = 0
margin = 10 # create a larger batch, so we can see the padding
features = [{k: v for k, v in dataset_train_encoded[i].items()} for i in range(idx*8, idx*8+8 + margin)]
batch = DataCollatorForNextSentencePrediction(tokenizer)(features)
# the first 8 rows of the batch are the 8 choices of sample idx
decoded_input_ids = [tokenizer.decode(batch["input_ids"][i].tolist()) for i in range(8)]
print(f'RESULT OBTAINED FROM THE PREPROCESS FUNCTION - sample {idx}\nWE CAN SEE THE PADDING AT THE END GIVEN BY THE DATACOLLATOR\n')
for i in range(8):
    print(decoded_input_ids[i])
    print(f'label: {batch["labels"][i]}')
    print()
print(f"CORRECT LABEL: {dataset_train['answerKey'][idx]} -> CHOICE: {dataset_train['choices'][idx][ord(dataset_train['answerKey'][idx])-65]}")


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


RESULT OBTAINED FROM THE PREPROCESS FUNCTION - sample 0
WE CAN SEE THE PADDING AT THE END GIVEN BY THE DATACOLLATOR

[CLS] beads of water are formed by water vapor condensing clouds are made of water vapor. [SEP] what type of water formation is formed by clouds? [SEP] pearls [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
label: 0

[CLS] beads of water are formed by water vapor condensing clouds are made of water vapor. [SEP] what type of water formation is formed by clouds? [SEP] streams [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
label: 0

[CLS] beads of water are formed by water vapor condensing clouds are made of water vapor. [SEP] what type of water formation is formed by clouds? [SEP] shells [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
label: 0

[CLS] beads of water are formed by water vapor condensing clouds are made of water vapor. [SEP] what type of water formation is formed by clouds? [SEP] diamonds [SEP] [PAD] [PAD] [PAD] [PAD] [PAD

Now we are ready to train the model

In [19]:
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler

# HYPERPARAMETERS
batch_size = 16
learning_rate = 5e-5
num_train_epochs = 1
# ---------------------------------------------

# DATALOADERS
generator = torch.Generator(device=mydevice)
train_dataloader = DataLoader(dataset_train_encoded, batch_size=batch_size, shuffle=True, 
                              collate_fn=DataCollatorForNextSentencePrediction(tokenizer), generator=generator)
val_dataloader = DataLoader(dataset_val_encoded, batch_size=batch_size, shuffle=True,
                            collate_fn=DataCollatorForNextSentencePrediction(tokenizer), generator=generator)

# OPTIMIZER
optimizer = AdamW(model.parameters(), lr=learning_rate)

# SCHEDULER
num_training_steps = num_train_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
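The number of training steps and the shape of the schedule can be checked with a little arithmetic. The sketch below assumes that the `"linear"` schedule is a linear warmup followed by a linear decay to zero, which is my understanding of `get_scheduler`; the helper names are hypothetical.

```python
import math

def steps_per_epoch(num_rows, batch_size):
    # A DataLoader that keeps the last partial batch yields ceil(rows / batch_size) batches
    return math.ceil(num_rows / batch_size)

def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    # Linear warmup (if any), then linear decay to zero
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining

# 7323 training samples, each split into 8 binary inputs, batch size 16, 1 epoch
n_samples, n_choices, batch_size = 7323, 8, 16
total_steps = 1 * steps_per_epoch(n_samples * n_choices, batch_size)
print(total_steps)  # 3662

# Learning rate at the start, in the middle and at the end of training
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 5e-5, total_steps))
```

The 3662 steps match the total shown by the progress bar of the training cell.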

In [20]:
from tqdm.auto import tqdm

mydevice = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(mydevice)

progress_bar = tqdm(range(num_training_steps))

model.train()

for epoch in range(num_train_epochs):
    for batch in train_dataloader:
        # move the batch to cuda
        batch = {k: v.to(mydevice) for k, v in batch.items()}
        
        # forward pass
        outputs = model(**batch)

        # compute the loss
        loss = outputs.loss

        # backward pass
        loss.backward()

        # update the weights
        optimizer.step()

        # update the learning rate
        lr_scheduler.step()

        # zero the gradients
        optimizer.zero_grad()

        # update the progress bar
        progress_bar.update(1)
        progress_bar.set_postfix({'loss': loss.item()})

# save the model
# model.save_pretrained(f"../models/{model_name}-NextSentencePrediction-finetuned")

  0%|          | 0/3662 [00:00<?, ?it/s]

In [21]:
import evaluate

# load the model
model = AutoModelForNextSentencePrediction.from_pretrained(f"../models/{model_name}-NextSentencePrediction-finetuned")
model.to(mydevice)

metric = evaluate.load("accuracy")
model.eval()
for batch in val_dataloader:
    batch = {k: v.to(mydevice) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.9824290998766955}

In [22]:
# EXAMPLE
batch = next(iter(val_dataloader))
batch = {k: v.to(mydevice) for k, v in batch.items()}
with torch.no_grad():
    outputs = model(**batch)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
print(logits)
print(predictions)
print(batch["labels"])
print(predictions == batch["labels"])


tensor([[-0.1405,  1.0166],
        [ 3.3760, -2.8730],
        [ 3.3582, -2.7215],
        [ 4.1679, -3.7248],
        [ 3.6158, -3.1926],
        [ 3.9187, -3.4710],
        [ 3.3933, -2.9008],
        [ 3.5048, -2.9808],
        [-0.8759,  2.9226],
        [ 4.1225, -3.7218],
        [ 3.1081, -2.4220],
        [ 4.1965, -3.7751],
        [ 3.1352, -2.5844],
        [ 3.1694, -2.6081],
        [ 4.3199, -3.9138],
        [-0.3883,  1.6641]], device='cuda:0')
tensor([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1], device='cuda:0')
tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1], device='cuda:0')
tensor([False,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True], device='cuda:0')


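The results are very good: the validation accuracy is about 98.2%. Keep in mind that only one input in eight is positive, so always predicting the negative class would already give 87.5%. The logit values also show that the model is quite confident in most of its predictions.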
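One way to read the confidence off the logits is to pass each row through a softmax. A minimal sketch with two rows taken from the output printed above (the `softmax` helper is written by hand here, it is not a call used in the notebook):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Second row of the printed logits: a correctly predicted negative input
confident_negative = softmax([3.3760, -2.8730])
# First row of the printed logits: the only misclassified input of the batch
uncertain_positive = softmax([-0.1405, 1.0166])

print(confident_negative)
print(uncertain_positive)
```

The misclassified row has a much smaller gap between its two logits than the correctly classified one, so its predicted probability is far less extreme.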

So far we have looked at the prediction for each individual input, obtained by splitting every sample into eight. The actual task, however, is not to judge each choice separately: for every sample I pick the choice the model considers most likely to be correct, and check whether it is the right answer.

For each sample I choose the answer with the highest `logits[1]` (the input most likely to be correct). I look at the second component of the logits because it is the one associated with the positive label.
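The selection rule can be sketched on plain lists: split the positive-class logits into consecutive groups of eight and take the argmax of each group. The function name `pick_answers` and the logit values are made up for illustration.

```python
def pick_answers(positive_logits, num_choices=8):
    """Split a flat list of positive-class logits into groups and argmax each group."""
    answers = []
    for start in range(0, len(positive_logits), num_choices):
        group = positive_logits[start:start + num_choices]
        answers.append(max(range(len(group)), key=group.__getitem__))
    return answers

# Made-up positive-class logits for two samples (2 x 8 inputs)
flat = [-2.9, -2.7, -3.7, -3.2, -3.5, 1.0, -2.9, -3.0,
        2.9, -3.7, -2.4, -3.8, -2.6, -2.6, -3.9, 1.7]
print(pick_answers(flat))  # [5, 0]
```

This only works if the eight inputs of each sample are kept together and in order, which is why the evaluation must not shuffle the data.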

In [23]:
def answerKey_predictions(dataset_encoded, model):
    # I use dataloaders with shuffle=False and batch_size=8
    # so each batch is exactly one of the original samples
    batch_size = 8
    generator = torch.Generator(device=mydevice)
    dataloader = DataLoader(dataset_encoded, batch_size=batch_size, shuffle=False, 
                            collate_fn=DataCollatorForNextSentencePrediction(tokenizer), generator=generator)
    
    predictions = []
    model.eval()
    for batch in dataloader:
        batch = {k: v.to(mydevice) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)

        logits = outputs.logits
        logits_for_positive_class = logits[:, 1]
        # index (0..7) of the choice with the highest positive-class logit
        prediction = torch.argmax(logits_for_positive_class)
        predictions.append(prediction.item())
        
    return predictions

from sklearn.metrics import accuracy_score, f1_score
def evaluate_predictions(true_labels, predictions, dataset_label=None):
    # dataset_label is a string that specifies the dataset{'train', 'validation', 'test'}
    accuracy = accuracy_score(true_labels, predictions)
    f1 = f1_score(true_labels, predictions, average='macro')
    print(f"{dataset_label} Accuracy:", accuracy)
    print(f"{dataset_label} F1 Score:", f1)
    return accuracy, f1

In [24]:
# train
predictions_train = answerKey_predictions(dataset_train_encoded, model)
train_accuracy, train_f1 = evaluate_predictions(dataset_train['answerKey_int'], predictions_train, 'train')

# validation
predictions_val = answerKey_predictions(dataset_val_encoded, model)
val_accuracy, val_f1 = evaluate_predictions(dataset_val['answerKey_int'], predictions_val, 'validation')

train Accuracy: 0.9786972552232691
train F1 Score: 0.9786639385715545
validation Accuracy: 0.9741060419235512
validation F1 Score: 0.9736600249668403


---
## **Multiclass Classification**

I train the model as a single-label multi-class classification task (with `AutoModelForMultipleChoice`): each choice gets one output score, and the choice with the highest score is the model's answer. The output now has 8 values, and taking the argmax gives a label in {0, 1, 2, 3, 4, 5, 6, 7}.

![picture](../imgs/3_BERTmulti.png)
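A minimal sketch of what the eight scores are used for, assuming the training loss is the standard cross-entropy over the per-choice scores (which is my understanding of what the multiple-choice head computes when `labels` are passed). The scores are made up and the helper is hand-written.

```python
import math

def cross_entropy(scores, label):
    """Negative log-probability of `label` under a softmax over `scores`."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[label]

# Made-up scores, one per choice A..H
scores = [-4.1, 3.9, -3.9, -4.3, -2.5, 5.5, -3.2, -4.0]

predicted = max(range(len(scores)), key=scores.__getitem__)
print(predicted)                        # index of the highest score
print(cross_entropy(scores, label=5))   # small loss: label agrees with the argmax
print(cross_entropy(scores, label=1))   # larger loss: label is the runner-up
```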

In [2]:
from transformers import AutoTokenizer, AutoModelForMultipleChoice

mydevice = 'cuda' if torch.cuda.is_available() else 'cpu'
set_seed(seed)

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

model = AutoModelForMultipleChoice.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Embedding(30522, 768, padding_idx=0)

In [4]:
print(model)

BertForMultipleChoice(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, ele

In [5]:
# in particular we can see the last layer -> classifier
named_layers = list(model.named_children())
num_last_layers = 1
for name, layer in named_layers[-num_last_layers:]:
    print(f"Layer Name: {name}, Layer: {layer}")

Layer Name: classifier, Layer: Linear(in_features=768, out_features=1, bias=True)


In [3]:

def preprocess_function_MultipleChoice(examples):
    # attach fact1 and fact2
    # and repeat each sentence 8 times to go with the 8 choices
    first_sentences = [[f"{examples['fact1'][i]} {examples['fact2'][i]}"] * 8 for i in range(len(examples["fact1"]))]
    # Grab all second sentences, the questions.
    questions = examples["question"]
    second_sentences = [
        [f"{question} [SEP] {examples['choices'][i][choice_idx]}" for choice_idx in range(8)]
        for i, question in enumerate(questions)
    ]

    # Flatten everything
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Un-flatten -> each example has 8 choices
    tokenized_examples = {k: [v[i:i+8] for i in range(0, len(v), 8)] for k, v in tokenized_examples.items()}

    # Create the labels
    # ['A','B','C','D','E','F','G','H'] -> [0, 1, 2, 3, 4, 5, 6, 7]
    answerKeys = examples['answerKey'] 
    tokenized_examples['labels'] = [ord(answerKey) - ord('A') for answerKey in answerKeys]

    return tokenized_examples

# Apply the preprocessing function to the dataset
dataset_train_encoded = dataset_train.map(preprocess_function_MultipleChoice, 
                                          batched=True, remove_columns=dataset_train.column_names)
dataset_val_encoded = dataset_val.map(preprocess_function_MultipleChoice, 
                                      batched=True, remove_columns=dataset_val.column_names)
# we got the input_ids, token_type_ids and attention_mask
dataset_train_encoded.column_names

['input_ids', 'token_type_ids', 'attention_mask', 'labels']

In [4]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        # take the labels out
        label_name = "labels"
        labels = [feature.pop(label_name) for feature in features]

        # flatten (because now I have a list of 8 choices for each example)
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])

        # pad
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
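The flatten, pad, un-flatten sequence of this collator can be sketched on nested Python lists. This is a simplified stand-in that handles only the token ids; the function name, the toy ids and `pad_id=0` are assumptions made for the illustration.

```python
def collate_multiple_choice(features, pad_id=0):
    batch_size = len(features)
    num_choices = len(features[0])
    # Flatten: (batch, choices) -> batch * choices sequences
    flat = [seq for sample in features for seq in sample]
    # Pad every sequence to the longest one in the whole batch
    longest = max(len(seq) for seq in flat)
    padded = [seq + [pad_id] * (longest - len(seq)) for seq in flat]
    # Un-flatten: back to (batch, choices, sequence length)
    return [padded[i * num_choices:(i + 1) * num_choices] for i in range(batch_size)]

# Two samples with two choices each (toy token ids)
features = [
    [[101, 5, 102], [101, 5, 6, 102]],
    [[101, 7, 8, 9, 102], [101, 102]],
]
batch = collate_multiple_choice(features)
print(len(batch), len(batch[0]), len(batch[0][0]))  # 2 2 5
```

Flattening first means that all choices of all samples share the same padded length, so the result can be reshaped into a single rectangular tensor.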


In [8]:
# The difference now is that all the choices are stored in the same sample
# EXAMPLE
idx = 0
margin = 10 # create a larger batch, so we can see the padding
features = [{k: v for k, v in dataset_train_encoded[i].items()} for i in range(idx, idx + margin)]
batch = DataCollatorForMultipleChoice(tokenizer)(features)
decoded_input_ids = [tokenizer.decode(batch["input_ids"][idx][i].tolist()) for i in range(8)]
print(f'RESULT OBTAINED FROM THE PREPROCESS FUNCTION - sample {idx}\nWE CAN SEE THE PADDING AT THE END GIVEN BY THE DATACOLLATOR\n')
for i in range(8):
    print(decoded_input_ids[i])
    print(f'label: {batch["labels"][idx]}')
    print()
print(f"CORRECT LABEL: {dataset_train['answerKey'][idx]} -> CHOICE: {dataset_train['choices'][idx][ord(dataset_train['answerKey'][idx])-65]}")


You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


RESULT OBTAINED FROM THE PREPROCESS FUNCTION - sample 0
WE CAN SEE THE PADDING AT THE END GIVEN BY THE DATACOLLATOR

[CLS] beads of water are formed by water vapor condensing clouds are made of water vapor. [SEP] what type of water formation is formed by clouds? [SEP] pearls [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
label: 5

[CLS] beads of water are formed by water vapor condensing clouds are made of water vapor. [SEP] what type of water formation is formed by clouds? [SEP] streams [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
label: 5

[CLS] beads of water are formed by water vapor condensing clouds are made of water vapor. [SEP] what type of water formation is formed by clouds? [SEP] shells [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [

Train the model

In [5]:
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler

# HYPERPARAMETERS
batch_size = 16
learning_rate = 5e-5
num_train_epochs = 1
# ---------------------------------------------

# DATALOADERS
generator = torch.Generator(device=mydevice)
train_dataloader = DataLoader(dataset_train_encoded, batch_size=batch_size, shuffle=True, 
                              collate_fn=DataCollatorForMultipleChoice(tokenizer), generator=generator)
val_dataloader = DataLoader(dataset_val_encoded, batch_size=batch_size, shuffle=True,
                            collate_fn=DataCollatorForMultipleChoice(tokenizer), generator=generator)

# OPTIMIZER
optimizer = AdamW(model.parameters(), lr=learning_rate)

# SCHEDULER
num_training_steps = num_train_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

In [6]:
from tqdm.auto import tqdm

mydevice = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(mydevice)

progress_bar = tqdm(range(num_training_steps))

model.train()

for epoch in range(num_train_epochs):
    for batch in train_dataloader:
        # move the batch to cuda
        batch = {k: v.to(mydevice) for k, v in batch.items()}
        
        # forward pass
        outputs = model(**batch)

        # compute the loss
        loss = outputs.loss

        # backward pass
        loss.backward()

        # update the weights
        optimizer.step()

        # update the learning rate
        lr_scheduler.step()

        # zero the gradients
        optimizer.zero_grad()

        # update the progress bar
        progress_bar.update(1)
        progress_bar.set_postfix({'loss': loss.item()})

# save the model
# model.save_pretrained(f"../models/{model_name}-MultipleChoice-finetuned")

  0%|          | 0/458 [00:00<?, ?it/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [52]:
import evaluate

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def evaluate_model(dataloader, model, dataset_label=None):
    model.eval()
    for batch in dataloader:
        batch = {k: v.to(mydevice) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        # Add the batch predictions and references to the metrics
        accuracy_metric.add_batch(predictions=predictions, references=batch["labels"])
        f1_metric.add_batch(predictions=predictions, references=batch["labels"])

    # Compute and print the results
    accuracy_result = accuracy_metric.compute()
    f1_result = f1_metric.compute(average='macro')
    print(f"{dataset_label} Accuracy: {accuracy_result['accuracy']}")
    print(f"{dataset_label} F1 Score: {f1_result['f1']}")
    return accuracy_result['accuracy'], f1_result['f1']

model = AutoModelForMultipleChoice.from_pretrained(f"../models/{model_name}-MultipleChoice-finetuned")
model.to(mydevice)

# train
train_accuracy, train_f1 = evaluate_model(train_dataloader, model, 'train')

# validation
val_accuracy, val_f1 = evaluate_model(val_dataloader, model, 'validation')


train Accuracy: 0.9793800355045746
train F1 Score: 0.9793611480842821
validation Accuracy: 0.9790382244143033
validation F1 Score: 0.9791321003700275


With this classification head it is sufficient to take the argmax for each sample: the model directly gives the answer we want.

In [53]:
# EXAMPLE
batch = next(iter(val_dataloader))
batch = {k: v.to(mydevice) for k, v in batch.items()}
with torch.no_grad():
    outputs = model(**batch)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
print(logits)
print(predictions)
print(batch["labels"])
print(predictions == batch["labels"])


tensor([[-4.0935,  3.9343, -3.8873, -4.2520, -2.5385,  5.5411, -3.2184, -3.9984],
        [-4.4275,  5.1825, -3.8262, -4.0298, -4.8384, -3.1587, -2.5576, -3.8295],
        [-4.7901, -3.9886, -2.1845,  7.3923, -4.1852, -4.0817, -2.0705, -3.7491],
        [-3.4828, -4.6891,  4.8587, -4.9498, -3.7593, -2.3122, -3.7485, -2.1878],
        [-2.8561, -4.0060,  7.3995, -2.7483, -4.4307, -3.8597, -4.5980, -2.7486],
        [-3.4716, -3.6285, -3.3278, -1.8094, -4.0485, -1.5041,  2.7307,  5.0338],
        [-3.6403,  7.3583, -3.5429, -1.0634, -4.0818, -2.4636, -3.0010, -1.3780],
        [-2.0641,  7.4481, -3.8136, -2.2357, -4.2170, -3.6363, -2.1736, -4.5679],
        [ 4.0983, -3.1642, -4.0581, -0.6500, -3.7150, -4.4371, -3.6834,  6.6779],
        [-4.4480, -4.3171, -1.9642, -4.2828,  5.6323, -4.5545, -4.6879, -4.2238],
        [ 5.4727, -3.9283, -4.5002, -3.8473, -2.7156, -2.2504, -4.0748, -4.0479],
        [-3.3253,  4.6846, -3.9729, -3.9657, -3.6748, -4.0582, -4.5269, -4.4246],
        [-2.5622
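As a small worked example, the first row of the logits printed above can be ranked by hand to see which choice wins and by how much. The row is copied from the output; everything else is plain Python.

```python
# First row of the logits printed above
row = [-4.0935, 3.9343, -3.8873, -4.2520, -2.5385, 5.5411, -3.2184, -3.9984]

# Choice indices sorted from the highest to the lowest score
ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
best, runner_up = ranked[0], ranked[1]

# Map the indices back to the answer letters
print(chr(ord("A") + best), chr(ord("A") + runner_up))  # F B
# Gap between the best and the second-best score
print(row[best] - row[runner_up])
```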

### Results:

Binary Classification: validation accuracy = $0.97410$

Multiclass Classification: validation accuracy = $0.97903$

The results are almost identical. For the next steps I will use the second approach (`AutoModelForMultipleChoice`), which is much easier and more comfortable to handle: with all the choices in one sample, the code is much cleaner (also for the predictions).


---

# **Different ways of tuning a pretrained model**

![picture](../imgs/3_LinearProbing_FineTuning.png "LinearProbing & FineTuning ")

Until now I have **fine-tuned** the models, that is, I have updated all the weights during training.

## **Linear probing**

With linear probing we train only the weights of the classification head. The representations produced by the pretrained model do not change: each input stays fixed in the embedding space.

Fine-tuning usually performs better than linear probing in distribution, because all the weights are trained.

Linear probing should perform better out of distribution, because the pretrained features are kept fixed.

The `DataCollatorForMultipleChoice` and the encoded datasets are taken from the previous part.

In [31]:
from transformers import AutoTokenizer, AutoModelForMultipleChoice

mydevice = 'cuda' if torch.cuda.is_available() else 'cpu'
set_seed(seed)

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

model = AutoModelForMultipleChoice.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Embedding(30522, 768, padding_idx=0)

In [32]:
for name, param in model.named_parameters():
    # WHAT WE DON'T TRAIN -> FREEZE
    if 'classifier' not in name:
        param.requires_grad = False
    # THE OTHERS ARE THE ONES TO BE TRAINED
    else:  # classifier layer
        print(name)

classifier.weight
classifier.bias


In [42]:
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler

# HYPERPARAMETERS
batch_size = 16
learning_rate = 5e-5
num_train_epochs = 2
# ---------------------------------------------

# DATALOADERS
generator = torch.Generator(device=mydevice)
train_dataloader = DataLoader(dataset_train_encoded, batch_size=batch_size, shuffle=True, 
                              collate_fn=DataCollatorForMultipleChoice(tokenizer), generator=generator)
val_dataloader = DataLoader(dataset_val_encoded, batch_size=batch_size, shuffle=True,
                            collate_fn=DataCollatorForMultipleChoice(tokenizer), generator=generator)

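# OPTIMIZER (only the parameters left trainable above, i.e. the classification head)
optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=learning_rate)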

# SCHEDULER
num_training_steps = num_train_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
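
The number of training steps and the learning-rate decay can be checked by hand. The sketch below assumes that the `"linear"` schedule with zero warmup multiplies the base learning rate by `(T - t) / T` at step `t`, which is my understanding of the `transformers` implementation. `linear_lr` is a hypothetical helper, not a library function.

```python
import math

n_train_sample = 7323   # training examples, as set at the top of the notebook
batch_size = 16
num_train_epochs = 2
learning_rate = 5e-5

# len(train_dataloader) is the number of batches per epoch (the last one may be smaller)
steps_per_epoch = math.ceil(n_train_sample / batch_size)
num_training_steps = num_train_epochs * steps_per_epoch

def linear_lr(step, base_lr=learning_rate, total_steps=num_training_steps):
    """Learning rate after `step` optimizer steps, assuming linear decay and no warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

print(steps_per_epoch, num_training_steps)
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step))
```

With these values the total is 916 steps, which matches the length of the progress bar printed by the training cell.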

In [43]:
from tqdm.auto import tqdm

mydevice = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(mydevice)

progress_bar = tqdm(range(num_training_steps))

model.train()

for epoch in range(num_train_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(mydevice) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
        progress_bar.set_postfix({'loss': loss.item()})

# save the model (uncomment on the first run: the evaluation cell below loads it from this path)
# model.save_pretrained(f"../models/{model_name}-MultipleChoice-linearprobing")

  0%|          | 0/916 [00:00<?, ?it/s]

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [16]:
import evaluate

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def evaluate_model(dataloader, model, dataset_label=None):
    """Run the model over the dataloader, then print and return (accuracy, macro F1)."""
    model.eval()
    for batch in dataloader:
        batch = {k: v.to(mydevice) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        # Add the batch predictions and references to the metrics
        accuracy_metric.add_batch(predictions=predictions, references=batch["labels"])
        f1_metric.add_batch(predictions=predictions, references=batch["labels"])

    # Compute and print the results
    accuracy_result = accuracy_metric.compute()
    f1_result = f1_metric.compute(average='macro')
    print(f"{dataset_label} Accuracy: {accuracy_result['accuracy']}")
    print(f"{dataset_label} F1 Score: {f1_result['f1']}")
    return accuracy_result['accuracy'], f1_result['f1']
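
For reference, the two metrics returned by `evaluate_model` can be reproduced by hand. This is a minimal sketch in plain Python, assuming that `average='macro'` means the unweighted mean of the per-class F1 scores over the classes that appear in the labels or in the predictions. The helper names and the toy labels are hypothetical.

```python
def accuracy(predictions, references):
    """Fraction of positions where prediction and reference agree."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def macro_f1(predictions, references):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(predictions) | set(references))
    f1_scores = []
    for c in classes:
        tp = sum(p == c and r == c for p, r in zip(predictions, references))
        fp = sum(p == c and r != c for p, r in zip(predictions, references))
        fn = sum(p != c and r == c for p, r in zip(predictions, references))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN), set to 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy example with 3 classes (made-up labels, for illustration only)
references = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]

print(accuracy(predictions, references))
print(macro_f1(predictions, references))
```

Since macro F1 gives every class the same weight, it can differ from the accuracy even when the two are close, as in the results reported in this notebook.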

In [49]:
model = AutoModelForMultipleChoice.from_pretrained(f"../models/{model_name}-MultipleChoice-linearprobing")
model.to(mydevice)

# train
train_accuracy, train_f1 = evaluate_model(train_dataloader, model, 'train')

# validation
val_accuracy, val_f1 = evaluate_model(val_dataloader, model, 'validation')


train Accuracy: 0.9980882152123447
train F1 Score: 0.9980920146845049
validation Accuracy: 0.9790382244143033
validation F1 Score: 0.978762918439323


In [45]:
# EXAMPLE
batch = next(iter(val_dataloader))
batch = {k: v.to(mydevice) for k, v in batch.items()}
with torch.no_grad():
    outputs = model(**batch)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)
print(logits)
print(predictions)
print(batch["labels"])
print(predictions == batch["labels"])


tensor([[-8.0635, -8.1475, -8.2077, -7.7835,  7.1177, -7.9665, -8.1542, -3.8058],
        [-7.8387, -1.3881,  7.9918,  7.8251, -8.2508, -7.2913, -8.2615, -8.1033],
        [-7.1352, -7.9718, -8.0846, -8.1383,  5.0578, -8.2606, -8.3322, -7.4598],
        [ 6.5435,  1.7675, -6.9395, -7.8207, -8.0814, -7.0317, -8.0152, -7.5562],
        [-7.3153, -5.1316, -8.0707, -7.9625, -8.0620,  7.6504, -8.2321, -7.9300],
        [ 5.2820, -8.2117, -8.1995, -8.0062, -8.1868, -7.9580, -7.9542, -7.7910],
        [-8.2587, -8.2831, -8.1234,  8.3343, -8.2293, -8.1069, -8.0794, -8.2049],
        [-8.1586, -8.2209, -8.1853,  8.5138, -8.2739, -8.2566, -6.7375, -8.0082],
        [ 0.0820, -0.1621, -0.6542,  6.5795, -5.5367, -6.3054, -4.4811,  3.0971],
        [-8.2234, -8.1691, -8.1770, -8.2145, -7.7564,  7.8036, -8.2401, -8.1932],
        [-8.1832, -8.2644, -8.1885, -8.2319, -8.1982, -7.3070,  7.8420, -8.1723],
        [-7.8365, -8.2693, -8.2553, -8.2627,  7.0586, -8.1292, -8.2239, -8.1727],
        [ 8.4885

## **Combined method**

![picture](../imgs/3_BERT_combinedmethod.png)

This method takes the best of the two previous methods:

1) We train the classification head only, and we learn the linear classification boundary (I simply take the model trained with linear probing)

2) We fix the classification head, and train the rest

In this way we should have nothing to lose: first we create the boundary, then we try to adapt the features to it.
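
A minimal sketch of the two phases on a toy model rather than BERT. The modules, the random data, the learning rates and the number of steps are hypothetical and chosen only for illustration. The sketch checks that each phase updates only the part it is supposed to.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the pretrained encoder and the classification head
encoder = nn.Sequential(nn.Linear(16, 32), nn.Tanh())
classifier = nn.Linear(32, 8)
toy_model = nn.Sequential(encoder, classifier)

x = torch.randn(64, 16)          # random toy inputs
y = torch.randint(0, 8, (64,))   # random toy labels

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def train(model, lr, steps):
    """A few full-batch AdamW steps over the currently trainable parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=lr)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()

def snapshot(module):
    return [p.detach().clone() for p in module.parameters()]

def same(module, saved):
    return all(torch.equal(a, b) for a, b in zip(saved, module.parameters()))

# Phase 1 (linear probing): frozen encoder, trainable head
set_trainable(encoder, False)
set_trainable(classifier, True)
encoder_saved = snapshot(encoder)
train(toy_model, lr=1e-2, steps=20)
encoder_fixed_in_phase1 = same(encoder, encoder_saved)

# Phase 2: frozen head, trainable encoder, smaller learning rate
set_trainable(encoder, True)
set_trainable(classifier, False)
head_saved = snapshot(classifier)
encoder_saved = snapshot(encoder)
train(toy_model, lr=1e-3, steps=20)
head_fixed_in_phase2 = same(classifier, head_saved)
encoder_moved_in_phase2 = not same(encoder, encoder_saved)

print(encoder_fixed_in_phase1, head_fixed_in_phase2, encoder_moved_in_phase2)
```

In the notebook the first phase is not repeated: the model saved after linear probing is loaded and only the second phase is run.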

In [10]:
from transformers import AutoTokenizer, AutoModelForMultipleChoice

mydevice = 'cuda' if torch.cuda.is_available() else 'cpu'
set_seed(seed)

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

model = AutoModelForMultipleChoice.from_pretrained(f"../models/{model_name}-MultipleChoice-linearprobing")
model.resize_token_embeddings(len(tokenizer))

for name, param in model.named_parameters():
    # SET THE WEIGHTS TRAINABLE IN ALL THE OTHER LAYERS
    if 'classifier' not in name:
        param.requires_grad = True
    # FREEZE THE CLASSIFIER
    else:
        param.requires_grad = False

In [18]:
from tqdm.auto import tqdm

# HYPERPARAMETERS
batch_size = 16
learning_rate = 5e-6 # just a little adjustment
num_train_epochs = 1
# ---------------------------------------------

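# OPTIMIZER (only the parameters left trainable above, i.e. everything except the classifier)
optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=learning_rate)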

# SCHEDULER
num_training_steps = num_train_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

model.train()

progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_train_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(mydevice) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
        progress_bar.set_postfix({'loss': loss.item()})

# save the model (uncomment on the first run: the evaluation cell below loads it from this path)
# model.save_pretrained(f"../models/{model_name}-MultipleChoice-combinedmethod")

  0%|          | 0/458 [00:00<?, ?it/s]

In [20]:
model = AutoModelForMultipleChoice.from_pretrained(f"../models/{model_name}-MultipleChoice-combinedmethod")
model.to(mydevice)

# train
train_accuracy, train_f1 = evaluate_model(train_dataloader, model, 'train')

# validation
val_accuracy, val_f1 = evaluate_model(val_dataloader, model, 'validation')


train Accuracy: 0.9989075515499113
train F1 Score: 0.9989121147081581
validation Accuracy: 0.9815043156596794
validation F1 Score: 0.9813588250400052


### **Results**

| Metric       | Fine-tuning | Linear probing | Combined method |
|--------------|-------------|----------------|-----------------|
| Val accuracy | $0.97903$   | $0.97903$      | $0.98150$       |
| Val F1 score | $0.97913$   | $0.97876$      | $0.98135$       |


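Ranking (based on the accuracy and F1 on the validation set):

1) combined method
2) fine-tuning
3) linear probing

Fine-tuning and linear probing have the same validation accuracy, so their order is decided by the F1 score alone. In the end, however, the results are almost identical.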
