# Train a Q&A model
In this notebook we fine tune a transformer model on a dataset using the HuggingFace datasets and transformers library.

In [1]:
! pip3 install datasets transformers

Collecting datasets
  Downloading datasets-1.15.1-py3-none-any.whl (290 kB)
[K     |████████████████████████████████| 290 kB 5.1 MB/s 
[?25hCollecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
[K     |████████████████████████████████| 3.1 MB 39.1 MB/s 
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 37.7 MB/s 
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
[K     |████████████████████████████████| 243 kB 45.0 MB/s 
[?25hCollecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
[K     |████████████████████████████████| 59 kB 7.0 MB/s 
[?25hCollecting fsspec[http]>=2021.05.0
  Downloading fsspec-2021.11.0-py3-none-any.whl (132 kB)
[K     |████████████████████████████████| 132 kB 52.2 MB/s 
Collecting sacremoses


# Training a question answering model

In this notebook, we will see how to fine-tune one of the 🤗 Transformers model to a question answering task, which is the task of extracting the answer to a question from a given context. We will see how to easily load a dataset for these kinds of tasks and use the Trainer API to fine-tune a model on it.

In [2]:
import transformers

print(transformers.__version__)

4.12.5


In [3]:
squad_v2 = True
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

## Loading the dataset

In [4]:
from datasets import load_dataset, load_metric, DatasetDict

In [5]:
train, validation = load_dataset("squad_v2" if squad_v2 else "squad", split=['train[:10%]', 'validation'])

Downloading:   0%|          | 0.00/1.87k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

Downloading and preparing dataset squad_v2/squad_v2 (download: 44.34 MiB, generated: 122.41 MiB, post-processed: Unknown size, total: 166.75 MiB) to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d...


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/9.55M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/801k [00:00<?, ?B/s]

  0%|          | 0/2 [00:00<?, ?it/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset squad_v2 downloaded and prepared to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [6]:
datasets = DatasetDict()

In [7]:
datasets["train"] = train
datasets["validation"] = validation

In [8]:
datasets["validation"][5]

{'answers': {'answer_start': [], 'text': []},
 'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.',
 'id': '5ad39d53604f3c001a3fe8d1',
 'question': "Who gave their name to Normandy in the 1000's and 1100's",
 'title': 'Normans'}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing).


In [59]:
from utils import *

In [10]:
show_random_elements(datasets["train"])

Unnamed: 0,id,title,context,question,answers
0,5a6b045ca9e0c9001a4e9e6b,Wayback_Machine,"Snapshots usually become available more than six months after they are archived or, in some cases, even later; it can take twenty-four months or longer. The frequency of snapshots is variable, so not all tracked web site updates are recorded. Sometimes there are intervals of several weeks or years between snapshots.",What is the minimum amount of time that elapses before most snapshots are released for recording?,"{'text': [], 'answer_start': []}"
1,56bfd351a10cfb1400551312,Beyoncé,"Beyoncé's work has influenced numerous artists including Adele, Ariana Grande, Lady Gaga, Bridgit Mendler, Rihanna, Kelly Rowland, Sam Smith, Meghan Trainor, Nicole Scherzinger, Rita Ora, Zendaya, Cheryl Cole, JoJo, Alexis Jordan, Jessica Sanchez, and Azealia Banks. American indie rock band White Rabbits also cited her an inspiration for their third album Milk Famous (2012), friend Gwyneth Paltrow studied Beyoncé at her live concerts while learning to become a musical performer for the 2010 film Country Strong. Nicki Minaj has stated that seeing Beyoncé's Pepsi commercial influenced her decision to appear in the company's 2012 global campaign.",What about Beyonce has influenced many entertainers?,"{'text': ['work'], 'answer_start': [10]}"
2,5a18d1639aa02b0018605ed2,Iranian_languages,"Proto-Iranian thus dates to some time after Proto-Indo-Iranian break-up, or the early second millennium BCE, as the Old Iranian languages began to break off and evolve separately as the various Iranian tribes migrated and settled in vast areas of southeastern Europe, the Iranian plateau, and Central Asia.",What language came sometime after the breakup of Proto-Iranian?,"{'text': [], 'answer_start': []}"
3,5a63251e68151a001a92222b,Southern_Europe,"Beginning roughly in the 14th century in Florence, and later spreading through Europe with the development of the printing press, a Renaissance of knowledge challenged traditional doctrines in science and theology, with the Arabic texts and thought bringing about rediscovery of classical Greek and Roman knowledge.",The encounter with the Florence printing press put Renaissance thinkers back in touch with the teachings of which civilizations?,"{'text': [], 'answer_start': []}"
4,56cbef3a6d243a140015edfd,Frédéric_Chopin,"Chopin's successes as a composer and performer opened the door to western Europe for him, and on 2 November 1830, he set out, in the words of Zdzisław Jachimecki, ""into the wide world, with no very clearly defined aim, forever."" With Woyciechowski, he headed for Austria, intending to go on to Italy. Later that month, in Warsaw, the November 1830 Uprising broke out, and Woyciechowski returned to Poland to enlist. Chopin, now alone in Vienna, was nostalgic for his homeland, and wrote to a friend, ""I curse the moment of my departure."" When in September 1831 he learned, while travelling from Vienna to Paris, that the uprising had been crushed, he expressed his anguish in the pages of his private journal: ""Oh God! ... You are there, and yet you do not take vengeance!"" Jachimecki ascribes to these events the composer's maturing ""into an inspired national bard who intuited the past, present and future of his native Poland.""",Which country did Frédéric go to first after setting out for Western Europe?,"{'text': ['Austria'], 'answer_start': [263]}"
5,5a18c2e19aa02b0018605eac,Iranian_languages,"The Iranian languages or Iranic languages form a branch of the Indo-Iranian languages, which in turn are a branch of the Indo-European language family. The speakers of Iranian languages are known as Iranian peoples. Historical Iranian languages are grouped in three stages: Old Iranian (until 400 BCE), Middle Iranian (400 BCE – 900 CE), and New Iranian (since 900 CE). Of the Old Iranian languages, the better understood and recorded ones are Old Persian (a language of Achaemenid Iran) and Avestan (the language of the Avesta). Middle Iranian languages included Middle Persian (a language of Sassanid Iran), Parthian, and Bactrian.",When did the change from old Iranian to new Iranian occur?,"{'text': [], 'answer_start': []}"
6,56df28373277331400b4d9c3,Sony_Music_Entertainment,"Sony renamed the record company Sony Music Entertainment (SME) on January 1, 1991, fulfilling the terms set under the 1988 buyout, which granted only a transitional license to the CBS trademark. The CBS Associated label was renamed Epic Associated. Also on January 1, 1991, to replace the CBS label, Sony reintroduced the Columbia label worldwide, which it previously held in the United States and Canada only, after it acquired the international rights to the trademark from EMI in 1990. Japan is the only country where Sony does not have rights to the Columbia name as it is controlled by Nippon Columbia, an unrelated company. Thus, until this day, Sony Music Entertainment Japan does not use the Columbia trademark for Columbia label recordings from outside Japan which are issued in Japan. The Columbia Records trademark's rightsholder in Spain was Bertelsmann Music Group, Germany, which Sony Music subsequently subsumed via a 2004 merger, followed by a 2008 buyout.",In what year did the name Sony Music Entertainment become the new name of Sony's record label?,"{'text': ['1991'], 'answer_start': [77]}"
7,56d9e3e8dc89441400fdb8aa,Dog,"In Greek mythology, Cerberus is a three-headed watchdog who guards the gates of Hades. In Norse mythology, a bloody, four-eyed dog called Garmr guards Helheim. In Persian mythology, two four-eyed dogs guard the Chinvat Bridge. In Philippine mythology, Kimat who is the pet of Tadaklan, god of thunder, is responsible for lightning. In Welsh mythology, Annwn is guarded by Cŵn Annwn.",Who is the dog that guards Helheim?,"{'text': ['Garmr'], 'answer_start': [138]}"
8,5ad30d26604f3c001a3fdb0b,Cardinal_(Catholicism),"The Cardinal Camerlengo of the Holy Roman Church, assisted by the Vice-Camerlengo and the other prelates of the office known as the Apostolic Camera, has functions that in essence are limited to a period of sede vacante of the papacy. He is to collate information about the financial situation of all administrations dependent on the Holy See and present the results to the College of Cardinals, as they gather for the papal conclave.",Who does nor collate information about the financial situation of all dependent administrations of the Holy See?,"{'text': [], 'answer_start': []}"
9,56d64d231c85041400947089,2008_Sichuan_earthquake,"In the China Digital Times an article reports a close analysis by an alleged Chinese construction engineer known online as “Book Blade” (书剑子), who stated:",Where was an article reported about the scandal?,"{'text': ['China Digital Times'], 'answer_start': [7]}"


## Preprocessing the training data

In [11]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

In [12]:
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [13]:
pad_on_right = tokenizer.padding_side == "right"

In [14]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two part of the context when splitting it is needed.

In [15]:
features = prepare_train_features(datasets["train"][:5], tokenizer, pad_on_right, max_length, doc_stride)

In [16]:
tokenized_datasets = datasets.map(lambda x: prepare_train_features(x, tokenizer, pad_on_right, max_length, doc_stride), batched=True, remove_columns=datasets["train"].column_names)

  0%|          | 0/14 [00:00<?, ?ba/s]

  0%|          | 0/12 [00:00<?, ?ba/s]

In [17]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'end_positions', 'input_ids', 'start_positions'],
        num_rows: 13186
    })
    validation: Dataset({
        features: ['attention_mask', 'end_positions', 'input_ids', 'start_positions'],
        num_rows: 12134
    })
})

## Fine-tuning the model

In [18]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this mode

In [19]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-squad2",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

In [20]:
from transformers import default_data_collator

data_collator = default_data_collator

In [21]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [22]:
trainer.train()

***** Running training *****
  Num examples = 13186
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2475


Epoch,Training Loss,Validation Loss
1,2.7762,1.899838
2,1.2887,1.998723
3,1.066,2.127269


Saving model checkpoint to distilbert-base-uncased-finetuned-squad2/checkpoint-500
Configuration saved in distilbert-base-uncased-finetuned-squad2/checkpoint-500/config.json
Model weights saved in distilbert-base-uncased-finetuned-squad2/checkpoint-500/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-squad2/checkpoint-500/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-finetuned-squad2/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 12134
  Batch size = 16
Saving model checkpoint to distilbert-base-uncased-finetuned-squad2/checkpoint-1000
Configuration saved in distilbert-base-uncased-finetuned-squad2/checkpoint-1000/config.json
Model weights saved in distilbert-base-uncased-finetuned-squad2/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-squad2/checkpoint-1000/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-fi

TrainOutput(global_step=2475, training_loss=1.553262236624053, metrics={'train_runtime': 3483.3124, 'train_samples_per_second': 11.356, 'train_steps_per_second': 0.711, 'total_flos': 3876281498299392.0, 'train_loss': 1.553262236624053, 'epoch': 3.0})

In [23]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
  
tokenizer = AutoTokenizer.from_pretrained("mvonwyl/distilbert-base-uncased-finetuned-squad2")

model = AutoModelForQuestionAnswering.from_pretrained("mvonwyl/distilbert-base-uncased-finetuned-squad2")

https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/tokenizer_config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpwp0cn19f


Downloading:   0%|          | 0.00/333 [00:00<?, ?B/s]

storing https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/0eefcbdecb44adc8a8a31203c6fc6081e4abee730e80b3d7c73ae8149d76268e.42154c5fd30bfa7e34941d0d8ad26f8a3936990926fbe06b2da76dd749b1c6d4
creating metadata file for /root/.cache/huggingface/transformers/0eefcbdecb44adc8a8a31203c6fc6081e4abee730e80b3d7c73ae8149d76268e.42154c5fd30bfa7e34941d0d8ad26f8a3936990926fbe06b2da76dd749b1c6d4
https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/vocab.txt not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp0atgso96


Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

storing https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/vocab.txt in cache at /root/.cache/huggingface/transformers/571aebbe8ed96147d4b98134c0a95beec9029368b0c35a1190f40094ee186478.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
creating metadata file for /root/.cache/huggingface/transformers/571aebbe8ed96147d4b98134c0a95beec9029368b0c35a1190f40094ee186478.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/tokenizer.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpm00nh40m


Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

storing https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/tokenizer.json in cache at /root/.cache/huggingface/transformers/28675318b6e2e081432f0e2fa9539dd176e7765623809f220b03cbafaaf810ad.3d26c56b358cca40b16bcdc782f4b906e5c295e21e24fae62803895dae040f25
creating metadata file for /root/.cache/huggingface/transformers/28675318b6e2e081432f0e2fa9539dd176e7765623809f220b03cbafaaf810ad.3d26c56b358cca40b16bcdc782f4b906e5c295e21e24fae62803895dae040f25
https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/special_tokens_map.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpqhahh_1q


Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

storing https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/special_tokens_map.json in cache at /root/.cache/huggingface/transformers/8059125746dc8b7c4ab020d7078038ce3ba711f8ee2be055be536ac398300fe1.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
creating metadata file for /root/.cache/huggingface/transformers/8059125746dc8b7c4ab020d7078038ce3ba711f8ee2be055be536ac398300fe1.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
loading file https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/571aebbe8ed96147d4b98134c0a95beec9029368b0c35a1190f40094ee186478.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
loading file https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/28675318b6e2e081432f0e2fa9539dd176e7765623809f220b03c

Downloading:   0%|          | 0.00/561 [00:00<?, ?B/s]

storing https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/cffbca5d7cd141d50b708ec6a05c79978d19c4dd3074fcd116c149cd355221f8.a540da103e8b3d40c8db787d1fe802d9215f260969fa27deaa13b88795d8181b
creating metadata file for /root/.cache/huggingface/transformers/cffbca5d7cd141d50b708ec6a05c79978d19c4dd3074fcd116c149cd355221f8.a540da103e8b3d40c8db787d1fe802d9215f260969fa27deaa13b88795d8181b
loading configuration file https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/cffbca5d7cd141d50b708ec6a05c79978d19c4dd3074fcd116c149cd355221f8.a540da103e8b3d40c8db787d1fe802d9215f260969fa27deaa13b88795d8181b
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForQuestionAnswering"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "drop

Downloading:   0%|          | 0.00/253M [00:00<?, ?B/s]

storing https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/b5b35581fa082df7d2ff2451f726fd2ebd8be192e2a873d4f7e53285f94db045.19e0e53a6167116b32114f34d67c198cfdaa42db408fe90a32add525213a6fc1
creating metadata file for /root/.cache/huggingface/transformers/b5b35581fa082df7d2ff2451f726fd2ebd8be192e2a873d4f7e53285f94db045.19e0e53a6167116b32114f34d67c198cfdaa42db408fe90a32add525213a6fc1
loading weights file https://huggingface.co/mvonwyl/distilbert-base-uncased-finetuned-squad2/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/b5b35581fa082df7d2ff2451f726fd2ebd8be192e2a873d4f7e53285f94db045.19e0e53a6167116b32114f34d67c198cfdaa42db408fe90a32add525213a6fc1
All model checkpoint weights were used when initializing DistilBertForQuestionAnswering.

All the weights of DistilBertForQuestionAnswering were initialized from the model checkpoint at mvonwyl/distilbert-bas

In [24]:
trainer_fine_tuned = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# Evaluation

In [25]:
import torch

# for batch in trainer.get_eval_dataloader():
#     break
# batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
# with torch.no_grad():
#     output = trainer.model(**batch)
# output.keys()

In [26]:
# output.start_logits.shape, output.end_logits.shape

In [27]:
# output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

In [28]:
# n_best_size = 20
import numpy as np

# start_logits = output.start_logits[0].cpu().numpy()
# end_logits = output.end_logits[0].cpu().numpy()
# # Gather the indices the best start/end logits:
# start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
# end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
# valid_answers = []
# for start_index in start_indexes:
#     for end_index in end_indexes:
#         if start_index <= end_index: # We need to refine that test to check the answer is inside the context
#             valid_answers.append(
#                 {
#                     "score": start_logits[start_index] + end_logits[end_index],
#                     "text": "" # We need to find a way to get back the original substring corresponding to the answer in the context
#                 }
#             )

In [29]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lots of space). So we remove that
    # left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possible giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

In [30]:
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)

  0%|          | 0/12 [00:00<?, ?ba/s]

In [31]:
type(model)

transformers.models.distilbert.modeling_distilbert.DistilBertForQuestionAnswering

In [32]:
raw_predictions = trainer.predict(validation_features)
raw_predictions_fine_tuned = trainer_fine_tuned.predict(validation_features)

The following columns in the test set  don't have a corresponding argument in `DistilBertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping.
***** Running Prediction *****
  Num examples = 12134
  Batch size = 16


The following columns in the test set  don't have a corresponding argument in `DistilBertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping.
***** Running Prediction *****
  Num examples = 12134
  Batch size = 16


In [33]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

In [34]:
# max_answer_length = 30
# start_logits = output.start_logits[0].cpu().numpy()
# end_logits = output.end_logits[0].cpu().numpy()
# offset_mapping = validation_features[0]["offset_mapping"]
# # The first feature comes from the first example. For the more general case, we will need to be match the example_id to
# # an example index
# context = datasets["validation"][0]["context"]

# # Gather the indices the best start/end logits:
# start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
# end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
# valid_answers = []
# for start_index in start_indexes:
#     for end_index in end_indexes:
#         # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
#         # to part of the input_ids that are not in the context.
#         if (
#             start_index >= len(offset_mapping)
#             or end_index >= len(offset_mapping)
#             or offset_mapping[start_index] is None
#             or offset_mapping[end_index] is None
#         ):
#             continue
#         # Don't consider answers with a length that is either < 0 or > max_answer_length.
#         if end_index < start_index or end_index - start_index + 1 > max_answer_length:
#             continue
#         if start_index <= end_index: # We need to refine that test to check the answer is inside the context
#             start_char = offset_mapping[start_index][0]
#             end_char = offset_mapping[end_index][1]
#             valid_answers.append(
#                 {
#                     "score": start_logits[start_index] + end_logits[end_index],
#                     "text": context[start_char: end_char]
#                 }
#             )

# valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
# valid_answers

In [35]:
# datasets["validation"][0]["answers"]

In [36]:
import collections

# examples = datasets["validation"]
# features = validation_features

# example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
# features_per_example = collections.defaultdict(list)
# for i, feature in enumerate(features):
#     features_per_example[example_id_to_index[feature["example_id"]]].append(i)

In [37]:
from tqdm.auto import tqdm
import numpy as np

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some the positions in our logits to span of texts in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions

In [48]:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)
final_predictions_fine_tuned = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions_fine_tuned.predictions)

Post-processing 11873 example predictions split into 12134 features.


  0%|          | 0/11873 [00:00<?, ?it/s]

Post-processing 11873 example predictions split into 12134 features.


  0%|          | 0/11873 [00:00<?, ?it/s]

In [49]:
metric = load_metric("squad_v2" if squad_v2 else "squad")

In [50]:
if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

{'HasAns_exact': 41.98717948717949,
 'HasAns_f1': 48.36979810628287,
 'HasAns_total': 5928,
 'NoAns_exact': 44.49116904962153,
 'NoAns_f1': 44.49116904962153,
 'NoAns_total': 5945,
 'best_exact': 50.31584266823886,
 'best_exact_thresh': 0.0,
 'best_f1': 50.470898344844585,
 'best_f1_thresh': 0.0,
 'exact': 43.24096689968837,
 'f1': 46.42770682843805,
 'total': 11873}

In [51]:
if squad_v2:
    formatted_predictions_fine_tuned = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions_fine_tuned.items()]
else:
    formatted_predictions_fine_tuned = [{"id": k, "prediction_text": v} for k, v in final_predictions_fine_tuned.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions_fine_tuned, references=references)

{'HasAns_exact': 66.41363022941971,
 'HasAns_f1': 73.5008343719444,
 'HasAns_total': 5928,
 'NoAns_exact': 63.33052985702271,
 'NoAns_f1': 63.33052985702271,
 'NoAns_total': 5945,
 'best_exact': 64.86987282068559,
 'best_exact_thresh': 0.0,
 'best_f1': 68.40840109129043,
 'best_f1_thresh': 0.0,
 'exact': 64.86987282068559,
 'f1': 68.4084010912902,
 'total': 11873}

## Observations
We can clearly see that the distillbert fine tuned model performs better than the default distillbert model, with a difference of f1 score of around 22. Let's dive further to see what caused this difference.

In [72]:
show_first_elements(datasets["validation"], 10)

NameError: ignored

In [61]:
datasets["validation"][5]

{'answers': {'answer_start': [], 'text': []},
 'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.',
 'id': '5ad39d53604f3c001a3fe8d1',
 'question': "Who gave their name to Normandy in the 1000's and 1100's",
 'title': 'Normans'}

if we take a close look at the 5th row in the validation dataset, we can clearly see that the answer is not the present

In [62]:
formatted_predictions_fine_tuned[5]

{'id': '5ad39d53604f3c001a3fe8d1',
 'no_answer_probability': 0.0,
 'prediction_text': 'The Normans'}

In [63]:
formatted_predictions[5]

{'id': '5ad39d53604f3c001a3fe8d1',
 'no_answer_probability': 0.0,
 'prediction_text': ''}

In fact, the default model didnt pick up the answer as the validation set. But the fine tuned model, which was smarter at finding the answer wasnt gonna get it correctly because the validation example is also empty. That is maybe due to the ambiguity in the question which specified years (validation question) instead of centuries (validation example). But as humans we can easily figure out that the correct answer should indeed be 'Normans'.

In [66]:
n_empty_answers = len([v for v in datasets["validation"] if v["answers"]["text"] == []])
n_empty_answers

5945

In [71]:
print(int(n_empty_answers / len(datasets["validation"]) * 100), "%")

50 %


Now we can see that almost 50% of the validation dataset contains no answers. And since the model is sometimes able to predict complexe questions, it is bound to be a source of error in the metrics computation which may seem to lower the performance of the model at first sight

In [98]:
def show_gt_vs_predicted(dataset, num_examples=10, predictions=None):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = [i for i in range(num_examples)]
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    df["predicted"] = predictions[:num_examples]
    display(HTML(df.to_html()))

In [99]:
show_gt_vs_predicted(datasets["validation"], 20, formatted_predictions)

Unnamed: 0,id,title,context,question,answers,predicted
0,56ddde6b9a695914005b9628,Normans,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",In what country is Normandy located?,"{'text': ['France', 'France', 'France', 'France'], 'answer_start': [159, 159, 159, 159]}","{'id': '56ddde6b9a695914005b9628', 'prediction_text': 'France', 'no_answer_probability': 0.0}"
1,56ddde6b9a695914005b9629,Normans,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",When were the Normans in Normandy?,"{'text': ['10th and 11th centuries', 'in the 10th and 11th centuries', '10th and 11th centuries', '10th and 11th centuries'], 'answer_start': [94, 87, 94, 94]}","{'id': '56ddde6b9a695914005b9629', 'prediction_text': '10th and 11th centuries', 'no_answer_probability': 0.0}"
2,56ddde6b9a695914005b962a,Normans,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",From which countries did the Norse originate?,"{'text': ['Denmark, Iceland and Norway', 'Denmark, Iceland and Norway', 'Denmark, Iceland and Norway', 'Denmark, Iceland and Norway'], 'answer_start': [256, 256, 256, 256]}","{'id': '56ddde6b9a695914005b962a', 'prediction_text': 'Denmark, Iceland and Norway', 'no_answer_probability': 0.0}"
3,56ddde6b9a695914005b962b,Normans,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",Who was the Norse leader?,"{'text': ['Rollo', 'Rollo', 'Rollo', 'Rollo'], 'answer_start': [308, 308, 308, 308]}","{'id': '56ddde6b9a695914005b962b', 'prediction_text': 'Rollo', 'no_answer_probability': 0.0}"
4,56ddde6b9a695914005b962c,Normans,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",What century did the Normans first gain their separate identity?,"{'text': ['10th century', 'the first half of the 10th century', '10th', '10th'], 'answer_start': [671, 649, 671, 671]}","{'id': '56ddde6b9a695914005b962c', 'prediction_text': '10th century', 'no_answer_probability': 0.0}"
5,5ad39d53604f3c001a3fe8d1,Normans,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",Who gave their name to Normandy in the 1000's and 1100's,"{'text': [], 'answer_start': []}","{'id': '5ad39d53604f3c001a3fe8d1', 'prediction_text': '', 'no_answer_probability': 0.0}"
6,5ad39d53604f3c001a3fe8d2,Normans,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",What is France a region of?,"{'text': [], 'answer_start': []}","{'id': '5ad39d53604f3c001a3fe8d2', 'prediction_text': '', 'no_answer_probability': 0.0}"
7,5ad39d53604f3c001a3fe8d3,Normans,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",Who did King Charles III swear fealty to?,"{'text': [], 'answer_start': []}","{'id': '5ad39d53604f3c001a3fe8d3', 'prediction_text': '', 'no_answer_probability': 0.0}"
8,5ad39d53604f3c001a3fe8d4,Normans,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",When did the Frankish identity emerge?,"{'text': [], 'answer_start': []}","{'id': '5ad39d53604f3c001a3fe8d4', 'prediction_text': '', 'no_answer_probability': 0.0}"
9,56dddf4066d3e219004dad5f,Normans,"The Norman dynasty had a major political, cultural and military impact on medieval Europe and even the Near East. The Normans were famed for their martial spirit and eventually for their Christian piety, becoming exponents of the Catholic orthodoxy into which they assimilated. They adopted the Gallo-Romance language of the Frankish land they settled, their dialect becoming known as Norman, Normaund or Norman French, an important literary language. The Duchy of Normandy, which they formed by treaty with the French crown, was a great fief of medieval France, and under Richard I of Normandy was forged into a cohesive and formidable principality in feudal tenure. The Normans are noted both for their culture, such as their unique Romanesque architecture and musical traditions, and for their significant military accomplishments and innovations. Norman adventurers founded the Kingdom of Sicily under Roger II after conquering southern Italy on the Saracens and Byzantines, and an expedition on behalf of their duke, William the Conqueror, led to the Norman conquest of England at the Battle of Hastings in 1066. Norman cultural and military influence spread from these new European centres to the Crusader states of the Near East, where their prince Bohemond I founded the Principality of Antioch in the Levant, to Scotland and Wales in Great Britain, to Ireland, and to the coasts of north Africa and the Canary Islands.",Who was the duke in the battle of Hastings?,"{'text': ['William the Conqueror', 'William the Conqueror', 'William the Conqueror'], 'answer_start': [1022, 1022, 1022]}","{'id': '56dddf4066d3e219004dad5f', 'prediction_text': 'William the Conqueror', 'no_answer_probability': 0.0}"


In [84]:
datasets["validation"][0]["answers"]["text"]

['France', 'France', 'France', 'France']

In [94]:
i = 0
while(formatted_predictions[i]["prediction_text"] in datasets["validation"][i]["answers"]["text"] or datasets["validation"][i]["answers"]["text"] == []):
  i +=1

In [95]:
formatted_predictions[i]["prediction_text"]

'Christian piety, becoming exponents of the Catholic orthodoxy'

In [96]:
datasets["validation"][i]["answers"]["text"]

['Catholic', 'Catholic orthodoxy', 'Catholic']

Here we also see an error caused by the fact that the model took the whole sentence containing the answer, instead of the just capturing the part of text that truly holds the answer, and therefore the noise in the model's answer is in fact an error since it does not corresponds with the validation set's answer.