# Introduction

In this tutorial, we introduce the HeadQA dataset and fine-tune a BertForMultipleChoice model on it.

This tutorial is adapted from the [bert for multiple-choice](https://github.com/huggingface/notebooks/blob/master/examples/multiple_choice.ipynb) notebook in Hugging Face's GitHub repository.

This notebook uses Python 3 and transformers 4.12.3.

# Preparation

In [1]:
### Google Colab Mount Drive ###

# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd drive/MyDrive

/content/drive/MyDrive


In [None]:
! pip install datasets transformers

Sign up for a Hugging Face account [here](https://huggingface.co/welcome). Then run the following cell and enter your username and password when prompted.

In [4]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-crendential store but this isn't the helper defined on your machine.
You will have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal to set it as the default

git config --global credential.helper store


In [5]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 0s (18.7 MB/s)
Selecting previously unselected package git-lfs.
(Reading database ... 155219 files and directories currently installed.)
Preparing to unpack .../git-lfs_2.3.4-1_amd64.deb ...
Unpacking git-lfs (2.3.4-1) ...
Setting up git-lfs (2.3.4-1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


Check transformers version.

In [6]:
import transformers

print(transformers.__version__)

4.12.3


Check availability of GPU.

In [7]:
import torch
# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


# Loading and Inspecting HeadQA Dataset
[HeadQA](https://aghie.github.io/head-qa/) is a set of multiple-choice questions covering Medicine, Nursing, Psychology, Chemistry, Pharmacology, and Biology. The questions come from exams for access to specialized positions in the Spanish healthcare system. The dataset can be downloaded from [Hugging Face datasets](https://huggingface.co/datasets/head_qa). Loading and inspecting HeadQA is shown below.

In [28]:
from datasets import load_dataset

The questions and answers are available in both Spanish and English. The default language is Spanish. <br>
For the Spanish version, load the dataset with `headqa = load_dataset("head_qa")`.

For the English version, load the dataset with `headqa = load_dataset("head_qa", "en")`.

In this example, we use the English version.



In [29]:
headqa = load_dataset("head_qa", "en")

Downloading:   0%|          | 0.00/2.28k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.51k [00:00<?, ?B/s]

Downloading and preparing dataset head_qa/en (download: 1.67 MiB, generated: 2.65 MiB, post-processed: Unknown size, total: 4.31 MiB) to /root/.cache/huggingface/datasets/head_qa/en/1.1.0/d6803d1e84273cdc4a2cf3c5102945d166555f47b299ecbc5266d582f408f8e2...


Downloading:   0%|          | 0.00/1.75M [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset head_qa downloaded and prepared to /root/.cache/huggingface/datasets/head_qa/en/1.1.0/d6803d1e84273cdc4a2cf3c5102945d166555f47b299ecbc5266d582f408f8e2. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

The `headqa` object itself is a [DatasetDict](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation, and test sets. The value for each key is a [Dataset](https://huggingface.co/docs/datasets/package_reference/main_classes.html#dataset).

In [None]:
headqa

DatasetDict({
    train: Dataset({
        features: ['name', 'year', 'category', 'qid', 'qtext', 'ra', 'image', 'answers'],
        num_rows: 2657
    })
    test: Dataset({
        features: ['name', 'year', 'category', 'qid', 'qtext', 'ra', 'image', 'answers'],
        num_rows: 2742
    })
    validation: Dataset({
        features: ['name', 'year', 'category', 'qid', 'qtext', 'ra', 'image', 'answers'],
        num_rows: 1366
    })
})

To view an actual data instance, select one of the splits and then specify an index.

In [None]:
# display the first training data instance
headqa["train"][0]

{'answers': [{'aid': 1, 'atext': 'They are all or nothing.'},
  {'aid': 2, 'atext': 'They are hyperpolarizing.'},
  {'aid': 3, 'atext': 'They can be added.'},
  {'aid': 4, 'atext': 'They spread long distances.'},
  {'aid': 5, 'atext': 'They present a refractory period.'}],
 'category': 'biology',
 'image': '',
 'name': 'Cuaderno_2013_1_B',
 'qid': 1,
 'qtext': 'The excitatory postsynaptic potentials:',
 'ra': 3,
 'year': '2013'}

In [None]:
# display the first validation data instance
headqa["validation"][0]

{'answers': [{'aid': 1, 'atext': 'The balance of Gibbs-Donnan.'},
  {'aid': 2, 'atext': 'The Goldman-Hodgkin-Katz equation.'},
  {'aid': 3, 'atext': 'The Ohm equation.'},
  {'aid': 4, 'atext': 'The Nernst equation.'}],
 'category': 'biology',
 'image': '',
 'name': 'Cuaderno_2015_1_B',
 'qid': 1,
 'qtext': 'The equilibrium potential for a permeant ion through a membrane is calculated by:',
 'ra': 4,
 'year': '2015'}

In [None]:
# display the first test data instance
headqa["test"][0]

{'answers': [{'aid': 1, 'atext': 'Fibronectin'},
  {'aid': 2, 'atext': 'Collagen'},
  {'aid': 3, 'atext': 'Integrins'},
  {'aid': 4, 'atext': 'Proteoglycans'}],
 'category': 'biology',
 'image': '',
 'name': 'Cuaderno_2016_1_B',
 'qid': 1,
 'qtext': 'Form extracellular fibers with high tensile strength:',
 'ra': 2,
 'year': '2016'}

Notice that each training question has 5 choices, while each validation and test question has 4. Let's confirm it.

In [None]:
from collections import Counter
def check_num_choices(dataset):
  train_nums = [len(x['answers']) for x in dataset['train']]
  val_nums = [len(x['answers']) for x in dataset['validation']]
  test_nums = [len(x['answers']) for x in dataset['test']]
  return (Counter(train_nums), Counter(val_nums), Counter(test_nums))

Questions in the same split have the same number of choices.

In [None]:
check_num_choices(headqa)

(Counter({5: 2657}), Counter({4: 1366}), Counter({4: 2742}))

To get a better sense of what the data looks like, the following function will show some examples picked randomly from the dataset.

In [None]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(headqa["test"], 2)

| | name | year | category | qid | qtext | ra | image | answers |
|---|---|---|---|---|---|---|---|---|
| 0 | Cuaderno_2016_1_P | 2016 | psychology | 206 | What therapy, in addition to cognitive-behavioral therapy, is a well-established treatment for adolescent depression ?: | 4 | | [{'aid': 1, 'atext': 'The psychoanalytic therapy.'}, {'aid': 2, 'atext': 'The humanistic therapy.'}, {'aid': 3, 'atext': 'Systemic therapy'}, {'aid': 4, 'atext': 'Interpersonal therapy'}] |
| 1 | Cuaderno_2016_1_P | 2016 | psychology | 83 | What is the personality disorder classified as such in the (DSM-IV-TR) that does not appear in the section on personality disorders in the ICD-10 ?: | 1 | | [{'aid': 1, 'atext': 'Schizotypic'}, {'aid': 2, 'atext': 'Schizoid.'}, {'aid': 3, 'atext': 'Histrionic.'}, {'aid': 4, 'atext': 'Paranoid'}] |


In each example, the question text is contained in the field `qtext`. The `answers` field is a list of dictionaries, each with two keys: `aid` contains the index of the choice and `atext` contains the text of the choice. The field `ra` contains the index of the right answer. <br>
The following function helps to better visualize each question.

In [44]:
def show_one(example):
    print(f"Question: {example['qtext']}")
    for i, answer in enumerate(example['answers']):
      print(f"  {i + 1} - {answer['atext']}")
    print(f"\nGround truth: option {example['ra']}")

In [None]:
show_one(headqa["train"][0])

Question: The excitatory postsynaptic potentials:
  1 - They are all or nothing.
  2 - They are hyperpolarizing.
  3 - They can be added.
  4 - They spread long distances.
  5 - They present a refractory period.

Ground truth: option 3


In [None]:
show_one(headqa["test"][0])

Question: Form extracellular fibers with high tensile strength:
  1 - Fibronectin
  2 - Collagen
  3 - Integrins
  4 - Proteoglycans

Ground truth: option 2
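
As a small illustration of the record layout described above, the sketch below looks up the text of the correct choice by matching `ra` against each `aid`, and converts `ra` to a zero-based label. The record is copied from the first training example; the helper name `correct_answer_text` is hypothetical and not part of the dataset or the library.

```python
# Record copied from the first training example shown earlier.
example = {
    "qtext": "The excitatory postsynaptic potentials:",
    "ra": 3,
    "answers": [
        {"aid": 1, "atext": "They are all or nothing."},
        {"aid": 2, "atext": "They are hyperpolarizing."},
        {"aid": 3, "atext": "They can be added."},
        {"aid": 4, "atext": "They spread long distances."},
        {"aid": 5, "atext": "They present a refractory period."},
    ],
}

def correct_answer_text(example):
    """Return the text of the choice whose `aid` equals `ra` (hypothetical helper)."""
    for answer in example["answers"]:
        if answer["aid"] == example["ra"]:
            return answer["atext"]
    raise ValueError("no choice matches `ra`")

right_text = correct_answer_text(example)
zero_based_label = example["ra"] - 1  # `ra` and `aid` are 1-based

print(right_text)
print(zero_based_label)
```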


# Preprocessing Data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers tokenizer, which (as the name indicates) tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which ensures that:


*   we get a tokenizer that corresponds to the model architecture we want to use,
*   we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

Select a pretrained model. In this example, we use microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext. <br>
Make sure the checkpoint's config file lists `"architectures": ["BertForMaskedLM"]`. <br>
Some model choices: <br>
[bert-base-uncased](https://huggingface.co/bert-base-uncased) <br>
PubmedBERT ([abstract](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) or [fulltext](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext)) <br>


In [30]:
# model_checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
# model_checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
# model_checkpoint = "bert-base-uncased"

# in this example, we use PubMedBert fulltext
model_checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

We pass `use_fast=True` to the call below to use one of the fast tokenizers (backed by Rust) from the 🤗 Tokenizers library. Fast tokenizers are available for almost all models; if the call raises an error, remove that argument.

In [31]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext/resolve/main/tokenizer_config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpi1oqen3f


Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

storing https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/7affa2581f6363d963d4a0b175be381c8b97435cc21001e8a900a048d3042dd9.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79
creating metadata file for /root/.cache/huggingface/transformers/7affa2581f6363d963d4a0b175be381c8b97435cc21001e8a900a048d3042dd9.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79
https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpignn12bp


Downloading:   0%|          | 0.00/385 [00:00<?, ?B/s]

storing https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/76e7b0967140f134278c3209cffe98f69eb013b9de505a434b3359c057aedaa3.2411d0fafcf181e9b95d9cb7972d93b27c57a2cb75819924f8fc7ec848b708f2
creating metadata file for /root/.cache/huggingface/transformers/76e7b0967140f134278c3209cffe98f69eb013b9de505a434b3359c057aedaa3.2411d0fafcf181e9b95d9cb7972d93b27c57a2cb75819924f8fc7ec848b708f2
loading configuration file https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/76e7b0967140f134278c3209cffe98f69eb013b9de505a434b3359c057aedaa3.2411d0fafcf181e9b95d9cb7972d93b27c57a2cb75819924f8fc7ec848b708f2
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_pro

Downloading:   0%|          | 0.00/221k [00:00<?, ?B/s]

storing https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext/resolve/main/vocab.txt in cache at /root/.cache/huggingface/transformers/0261d31dbdad58a6abf6186e318acd19cb579fc6ce43976bd6bf8ced89f69bde.166b46119509a6e652fe317ce8edceca50b52d70bd2a126e1ad846abd3ccb82f
creating metadata file for /root/.cache/huggingface/transformers/0261d31dbdad58a6abf6186e318acd19cb579fc6ce43976bd6bf8ced89f69bde.166b46119509a6e652fe317ce8edceca50b52d70bd2a126e1ad846abd3ccb82f
loading file https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/0261d31dbdad58a6abf6186e318acd19cb579fc6ce43976bd6bf8ced89f69bde.166b46119509a6e652fe317ce8edceca50b52d70bd2a126e1ad846abd3ccb82f
loading file https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext/resolve/main/tokenizer.json from cache at None
loading file https://huggingface.co/microsoft/BiomedNLP-PubMe

The following function preprocesses a batch of examples.

In [32]:
def preprocess_function(examples):
    # get num_choice
    num_choice = len(examples['answers'][0])
    question_headers = examples["answers"]
    # Repeat each question num_choice times, once for each candidate answer.
    first_sentences = [[context] * num_choice for context in examples["qtext"]]
    # Grab all second sentences possible for each context.
    second_sentences = [[ans[i]['atext'] for i in range(num_choice)] for ans in question_headers]
    
    # Flatten everything
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])
    
    # Tokenize
    tokenized_examples = tokenizer(first_sentences, second_sentences, max_length=128, truncation=True)
    # Un-flatten
    dic =  {k: [v[i:i+num_choice] for i in range(0, len(v), num_choice)] for k, v in tokenized_examples.items()}
    # Labels must start at 0 rather than 1; otherwise training fails with
    # "CUDA error: device-side assert triggered".
    dic['label'] = [label - 1 for label in examples['ra']]
    return dic
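
To make the flatten and un-flatten steps in `preprocess_function` concrete, here is a minimal sketch on toy strings. It skips tokenization entirely and simply pairs each question with each of its choices; the helper names `flatten_pairs` and `unflatten` and the toy data are assumptions for illustration only.

```python
def flatten_pairs(questions, answer_lists):
    """Repeat each question once per choice and flatten the choice texts."""
    num_choice = len(answer_lists[0])
    first = [q for q in questions for _ in range(num_choice)]
    second = [a["atext"] for answers in answer_lists for a in answers]
    return first, second, num_choice

def unflatten(values, num_choice):
    """Regroup a flat list into consecutive chunks of num_choice items."""
    return [values[i:i + num_choice] for i in range(0, len(values), num_choice)]

# Toy batch: two questions with two choices each.
questions = ["Q1", "Q2"]
answer_lists = [
    [{"aid": 1, "atext": "a"}, {"aid": 2, "atext": "b"}],
    [{"aid": 1, "atext": "c"}, {"aid": 2, "atext": "d"}],
]

first, second, num_choice = flatten_pairs(questions, answer_lists)
pairs = list(zip(first, second))       # stand-in for the tokenizer output
grouped = unflatten(pairs, num_choice)  # one group of pairs per question
print(grouped)
```

Each inner list holds all (question, choice) pairs for one question, which mirrors the nested `input_ids` structure that the real function returns.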

Inspect processing results on a small subset of data.

In [None]:
split = "test"
examples = headqa[split][:5]
features = preprocess_function(examples)
print(len(features["input_ids"]), len(features["input_ids"][0]), [len(x) for x in features["input_ids"][0]])

5 4 [12, 12, 12, 12]


In [None]:
idx = 3
split_num_choice = {'train': 5, 'validation': 4, 'test': 4}
[tokenizer.decode(features["input_ids"][idx][i]) for i in range(split_num_choice[split])]

['[CLS] the multivesicular bodies are : [SEP] peroxisomes [SEP]',
 '[CLS] the multivesicular bodies are : [SEP] mitochondria [SEP]',
 '[CLS] the multivesicular bodies are : [SEP] polysomes [SEP]',
 '[CLS] the multivesicular bodies are : [SEP] endosomes [SEP]']

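Compare with the original question and choices. Note that the cell below displays training example 3, which is a different question from the decoded test example above; to compare like with like, call `show_one(headqa[split][idx])` instead.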

In [None]:
show_one(headqa["train"][3])

Question: In the initiation of voluntary movements the first area that is activated is:
  1 - Premotor cortex.
  2 - Primary motor cortex.
  3 - Brain stem
  4 - Cerebellum.
  5 - Basal ganglia

Ground truth: option 1


Encode the entire dataset.

In [33]:
encoded_datasets = headqa.map(preprocess_function, batched=True)

  0%|          | 0/3 [00:00<?, ?ba/s]

  0%|          | 0/3 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

To choose `max_length`, set it to a large value (e.g. 500) in `preprocess_function`, re-encode the dataset, and inspect the lengths of the tokenized inputs.

In [None]:
def find_max_length(encoded_datasets, split):
  # Each example holds one token-id list per choice; the lists have different
  # lengths, so iterate over them directly instead of building a ragged array.
  return max(len(ids) for choices in encoded_datasets[split]['input_ids'] for ids in choices)
  

In [None]:
print(find_max_length(encoded_datasets, 'train')) # 325
print(find_max_length(encoded_datasets, 'validation')) # 382
print(find_max_length(encoded_datasets, 'test')) # 279
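
The maximum alone can be dominated by a few very long questions. As a rough sketch of one way to weigh the trade-off, the snippet below computes the fraction of sequences that a candidate `max_length` would truncate. The helper `truncated_fraction` and the toy lengths are assumptions for illustration, not measurements from HeadQA.

```python
def truncated_fraction(lengths, max_length):
    """Fraction of sequences longer than max_length (these would be truncated)."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > max_length) / len(lengths)

# Toy token counts; in practice, collect len(ids) for every encoded choice.
toy_lengths = [12, 40, 90, 128, 200, 325]

for candidate in (64, 128, 256, 512):
    print(candidate, truncated_fraction(toy_lengths, candidate))
```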

# Fine-tuning the Model

The warning printed when the model is loaded below tells us that some weights are thrown away (the `cls.predictions` and `cls.seq_relationship` layers) and that others are randomly initialized (the new `classifier` head). This is absolutely normal in this case: we are removing the heads used to pretrain the model and replacing them with a new head for which we don't have pretrained weights. The library therefore warns us that we should fine-tune this model before using it for inference, which is exactly what we are going to do.

In [None]:
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint, num_labels=5)

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertForMultipleChoice: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultipleChoice were 

In [None]:
batch_size = 8

To instantiate a `Trainer`, we will need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [None]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    output_dir=f"./headqa-finetune",
    do_train=True,
    do_eval=True,
    logging_strategy = "epoch",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    save_strategy="epoch",
    metric_for_best_model='accuracy'
)

Then we need to tell our `Trainer` how to form batches from the pre-processed inputs. We haven't done any padding yet because we will pad each batch to the maximum length inside the batch (instead of doing so with the maximum length of the whole dataset). This will be the job of the *data collator*. A data collator takes a list of examples and converts them to a batch (by, in our case, applying padding). Since there is no data collator in the library that works on our specific problem, we will write one, adapted from the `DataCollatorWithPadding`:

In [None]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        #print(features[0])
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch

When called on a list of examples, it will flatten all the inputs, attention masks, etc. into big lists that it passes to the `tokenizer.pad` method. This returns a dictionary of big tensors (of shape `(batch_size * num_choice) x seq_length`) that we then un-flatten.

We can check that this data collator works on a list of features; we just have to make sure to remove all features that are not inputs accepted by our model (something the `Trainer` will do automatically for us later):
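
The padding and regrouping can be sketched without tensors. The snippet below is a simplified stand-in for what the collator does with `input_ids` only, assuming a pad id of 0 and right-padding; the function name `pad_and_group` and the toy ids are hypothetical.

```python
def pad_and_group(features, pad_id=0):
    """Pad every choice to the longest one in the batch, then regroup by example."""
    num_choices = len(features[0])
    # Flatten: one list of token ids per (example, choice) pair.
    flat = [ids for feature in features for ids in feature]
    longest = max(len(ids) for ids in flat)
    padded = [ids + [pad_id] * (longest - len(ids)) for ids in flat]
    # Un-flatten back to batch_size groups of num_choices rows.
    return [padded[i:i + num_choices] for i in range(0, len(padded), num_choices)]

# Toy batch: two examples with two choices each, of varying length.
features = [
    [[101, 7, 102], [101, 7, 8, 102]],
    [[101, 9, 102], [101, 102]],
]
batch = pad_and_group(features)
print(batch)
```

Every row now has the same length, so the result could be stored as a `batch_size x num_choices x seq_length` tensor.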

In [None]:
accepted_keys = ["input_ids", "attention_mask", "label"]
split = 'validation'
features = [{k: v for k, v in encoded_datasets[split][i].items() if k in accepted_keys} for i in range(10)]
print(features[0].keys())
batch = DataCollatorForMultipleChoice(tokenizer)(features)

dict_keys(['label', 'attention_mask', 'input_ids'])


Again, all those flatten/un-flatten are sources of potential errors so let's make another sanity check on our inputs:

In [None]:
[tokenizer.decode(batch["input_ids"][8][i].tolist()) for i in range(split_num_choice[split])]

['[CLS] in the cerebellum, it has a specific function on the control of posture : [SEP] the dentate nucleus. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] in the cerebellum, it has a specific function on the control of posture : [SEP] the lateral areas of the cerebellar hemispheres. [SEP] [PAD] [PAD]',
 '[CLS] in the cerebellum, it has a specific function on the control of posture : [SEP] the intermediate core. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] in the cerebellum, it has a specific function on the control of posture : [SEP] the flocculonodular lobe. [SEP]']

In [None]:
show_one(headqa[split][8])

Question: In the cerebellum, it has a specific function on the control of posture:
  1 - The dentate nucleus.
  2 - The lateral areas of the cerebellar hemispheres.
  3 - The intermediate core.
  4 - The flocculonodular lobe.

Ground truth: option 4


Looks good! <br>

The last thing to define for our `Trainer` is how to compute the metrics from the predictions. We define a function for this that computes accuracy directly with NumPy; the only preprocessing needed is to take the argmax of the predicted logits:

In [None]:
import numpy as np

def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}
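
For intuition, the same argmax-then-compare computation can be traced by hand on a few toy logits. This is a plain-Python sketch with made-up numbers; the helper names `argmax` and `accuracy` are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(logits, label_ids):
    """Share of examples whose highest-scoring choice equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, label_ids))
    return correct / len(label_ids)

# Three toy questions with four choices each; labels are zero-based.
toy_logits = [
    [0.1, 2.0, -1.0, 0.3],
    [1.5, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.2, 3.0],
]
toy_labels = [1, 0, 2]

toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)
```

The first two predictions match their labels and the third does not, so two of the three examples count as correct.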

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets['train'],
    eval_dataset=encoded_datasets['validation'],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics
)

On a Tesla K80 GPU, training for 5 epochs takes around 30 minutes. Because `load_best_model_at_end=True`, the checkpoint with the best validation accuracy is loaded at the end of training and used for evaluation and prediction.


In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: ra, answers, name, qid, qtext, category, image, year.
***** Running training *****
  Num examples = 2657
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1665


Epoch,Training Loss,Validation Loss,Accuracy
1,1.5527,1.327476,0.367496
2,1.1742,1.538116,0.409956
3,0.6206,2.159842,0.374085
4,0.3112,2.889148,0.377745
5,0.1962,3.266858,0.378477


The following columns in the evaluation set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: ra, answers, name, qid, qtext, category, image, year.
***** Running Evaluation *****
  Num examples = 1366
  Batch size = 8
Saving model checkpoint to ./headqa-finetune/checkpoint-333
Configuration saved in ./headqa-finetune/checkpoint-333/config.json
Model weights saved in ./headqa-finetune/checkpoint-333/pytorch_model.bin
tokenizer config file saved in ./headqa-finetune/checkpoint-333/tokenizer_config.json
Special tokens file saved in ./headqa-finetune/checkpoint-333/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: ra, answers, name, qid, qtext, category, image, year.
***** Running Evaluation *****
  Num examples = 1366
  Batch size = 8
Saving model checkpoint to ./headqa-finetune/checkpoint-666
Configuration saved in ./headqa-finetune

TrainOutput(global_step=1665, training_loss=0.7709866314678937, metrics={'train_runtime': 1953.4702, 'train_samples_per_second': 6.801, 'train_steps_per_second': 0.852, 'total_flos': 2805853795934940.0, 'train_loss': 0.7709866314678937, 'epoch': 5.0})
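
The step counts and checkpoint names in the log follow from simple arithmetic. The sketch below reproduces them from the numbers reported above, assuming one checkpoint is saved at the end of each epoch and named after the global step, as the log lines suggest.

```python
import math

num_examples = 2657
batch_size = 8
num_epochs = 5

# The last, partially filled batch still counts as one optimization step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Validation accuracy per epoch, copied from the training table above.
val_accuracy = {1: 0.367496, 2: 0.409956, 3: 0.374085, 4: 0.377745, 5: 0.378477}
best_epoch = max(val_accuracy, key=val_accuracy.get)
best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"

print(steps_per_epoch, total_steps, best_epoch, best_checkpoint)
```

Validation accuracy peaks at epoch 2 while validation loss keeps rising afterwards, which is why the epoch-2 checkpoint is the one kept as the best model.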

# Evaluation and Prediction

After fine-tuning the model, we want to inspect its performance on the test split of HeadQA.

Check the HeadQA [leaderboard](https://aghie.github.io/head-qa/) for state-of-the-art performance. <br>

At the time this notebook was created, the best reported accuracy was 46.7% for the supervised setting in the general category.

First, we inspect the model's performance on the training set.

In [None]:
predictions = trainer.predict(encoded_datasets['train'])

The following columns in the test set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: ra, answers, name, qid, qtext, category, image, year.
***** Running Prediction *****
  Num examples = 2657
  Batch size = 8


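The best model achieves 87.9% accuracy on the training set. `trainer.predict` prefixes metric names with `test_` by default, so `test_accuracy` below is the accuracy on the training set.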

In [None]:
predictions[2]

{'test_accuracy': 0.878810703754425,
 'test_loss': 0.4831354022026062,
 'test_runtime': 94.9949,
 'test_samples_per_second': 27.97,
 'test_steps_per_second': 3.505}

Now we check the model's performance on the test set.

In [None]:
test_predictions = trainer.predict(encoded_datasets['test'])

Test accuracy is 42.4%.

In [None]:
test_predictions[2]

{'test_accuracy': 0.4237782657146454,
 'test_loss': 1.4385887384414673,
 'test_runtime': 83.6883,
 'test_samples_per_second': 32.764,
 'test_steps_per_second': 4.099}

# Load saved model for evaluation and prediction

If the model is saved locally as a checkpoint, use the following code to load it and run evaluation.

In [21]:
from transformers import AutoModelForMultipleChoice, AutoTokenizer, TrainingArguments, Trainer

Specify the directory of the best checkpoint.

In [35]:
model_checkpoint = './headqa-finetune/checkpoint-666'
model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

loading configuration file ./headqa-finetune/checkpoint-666/config.json
Model config BertConfig {
  "_name_or_path": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
  "architectures": [
    "BertForMultipleChoice"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.12.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 305

In [36]:
batch_size = 8
args = TrainingArguments(
    output_dir=f"./headqa-finetune",
    do_train=True,
    do_eval=True,
    do_predict=True,
    logging_strategy = "epoch",
    evaluation_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    save_strategy="epoch",
    metric_for_best_model='accuracy'
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [37]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_datasets['train'],
    eval_dataset=encoded_datasets['validation'],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics
)

In [38]:
test_predictions = trainer.evaluate(encoded_datasets['test'])

The following columns in the evaluation set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: ra, year, qtext, qid, image, category, answers, name.
***** Running Evaluation *****
  Num examples = 2742
  Batch size = 8


In [39]:
test_predictions

{'eval_accuracy': 0.4237782657146454,
 'eval_loss': 1.4385887384414673,
 'eval_runtime': 88.2401,
 'eval_samples_per_second': 31.074,
 'eval_steps_per_second': 3.887}

Inspect the predictions. Use `trainer.predict` to get the prediction for each individual example.

In [48]:
predictions = trainer.predict(encoded_datasets['test'])

The following columns in the test set  don't have a corresponding argument in `BertForMultipleChoice.forward` and have been ignored: ra, year, qtext, qid, image, category, answers, name.
***** Running Prediction *****
  Num examples = 2742
  Batch size = 8


`trainer.predict` returns a tuple of length 3: the logits, the labels, and the metrics.

In [56]:
import numpy as np
def show_prediction(dataset, logits, index):
  show_one(dataset['test'][index])
  # `logits` has one row per example and one column per choice; options are numbered from 1.
  prediction = np.argmax(logits[index]) + 1
  print(f"Predicted choice: option {prediction}")

Pick a data index to inspect.

In [62]:
index = 1
show_prediction(headqa, predictions[0], index)

Question: The cardiolipin phospholipid is abundant in the membrane:
  1 - Internal mitochondrial
  2 - External mitochondrial
  3 - Plasma.
  4 - Lysosomal

Ground truth: option 1
Predicted choice: option 1
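
Beyond the single predicted option, the logits of one question can be turned into per-choice probabilities with a softmax, which gives a rough sense of how confident the model is. The sketch below uses made-up logits rather than real model output, and the helper `softmax` is written by hand for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one four-choice question.
toy_logits = [2.0, 0.5, -1.0, 0.1]

probs = softmax(toy_logits)
predicted_option = probs.index(max(probs)) + 1  # options are numbered from 1

for option, p in enumerate(probs, start=1):
    print(f"option {option}: {p:.3f}")
print(f"Predicted choice: option {predicted_option}")
```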
