# Text Mining Project
## Question Answering on SQUAD

- autor:
Samuele Marino

### Introduction

Question Answering (QA) models can automate the response to frequently asked questions by using a knowledge base (documents) as context.
Exist several different QA variants based on the inputs and outputs, for example:
- __Extractive QA__: The model extracts the answer from a context. The context here could be a provided text, a table or even HTML.
- __Open Generative QA__: The model generates free text directly based on the context.
- __Closed Generative QA__: In this case, no context is provided. The answer is completely generated by a model.

In this notebook I will provide a __Extractive QA__ model based on __SQuAD__ dataset.

## Import

In [2]:
#Colab
!pip install transformers datasets accelerate --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m77.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m462.8/462.8 KB[0m [31m47.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m191.5/191.5 KB[0m [31m24.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m106.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m24.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.0/132.0 KB[0m [31m18.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m213.0/213.0 KB[0m [31m8.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m140.6/140.6 KB[0m [31m18.0 MB/s[0m eta [36m0:00:00[0m
[?25h

In [3]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, default_data_collator, get_scheduler
from datasets import load_dataset, load_metric, ClassLabel, Sequence, DatasetDict
from IPython.display import display, HTML
from torch.utils.data import DataLoader
from accelerate import Accelerator
from torch.optim import AdamW
from functools import partial
from tqdm.auto import tqdm
import pandas as pd
import numpy as np
import collections
import random
import torch
import os

## Datasets

Stanford Question Answering Dataset (__SQuAD__) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
- __SQuAD1.1__, the previous version of the SQuAD dataset, contains 100,000+ question-answer pairs on 500+ articles.
- __SQuAD2.0__ combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

Download dataset

In [4]:
datasets = load_dataset("squad")
datasets_v2 = load_dataset("squad_v2")

Downloading builder script:   0%|          | 0.00/5.27k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.36k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.67k [00:00<?, ?B/s]

Downloading and preparing dataset squad/plain_text to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading builder script:   0%|          | 0.00/5.28k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.40k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.02k [00:00<?, ?B/s]

Downloading and preparing dataset squad_v2/squad_v2 to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/9.55M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/801k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/130319 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11873 [00:00<?, ? examples/s]

Dataset squad_v2 downloaded and prepared to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Have a quick look on the data

In [5]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [6]:
datasets_v2

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

Only the train and validation sets are provided. I will use the original validation set as the test set, and then I will split the test set to obtain our validation set.

In [7]:
train_valid = datasets['train'].train_test_split(test_size=0.2)
squad = DatasetDict({
    'train': train_valid['train'],
    'validation': train_valid['test'],
    'test': datasets['validation']})
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 70079
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 17520
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [8]:
train_valid_v2 = datasets_v2['train'].train_test_split(test_size=0.2)
squad_v2 = DatasetDict({
    'train': train_valid_v2['train'],
    'validation': train_valid_v2['test'],
    'test': datasets_v2['validation']})
squad_v2

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 104255
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 26064
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

Show some elements of the datasets

In [9]:
def show_random_elements(dataset, num_examples=5):
    # Can't pick more elements than there are in the dataset
    assert num_examples <= len(dataset)
    
    picks =  random.sample(range(1, len(dataset)-1), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [10]:
show_random_elements(squad["train"])

Unnamed: 0,id,title,context,question,answers
0,570d7708b3d812140066d9b2,Anti-aircraft_warfare,"Smaller boats and ships typically have machine-guns or fast cannons, which can often be deadly to low-flying aircraft if linked to a radar-directed fire-control system radar-controlled cannon for point defence. Some vessels like Aegis cruisers are as much a threat to aircraft as any land-based air defence system. In general, naval vessels should be treated with respect by aircraft, however the reverse is equally true. Carrier battle groups are especially well defended, as not only do they typically consist of many vessels with heavy air defence armament but they are also able to launch fighter jets for combat air patrol overhead to intercept incoming airborne threats.",Carrier battle groups can launch what to intercept incoming threats?,"{'text': ['fighter jets'], 'answer_start': [593]}"
1,570a84444103511400d597ef,Everton_F.C.,"There have been indications since 1996 that Everton will move to a new stadium. The original plan was for a new 60,000-seat stadium to be built, but in 2000 a proposal was submitted to build a 55,000 seat stadium as part of the King's Dock regeneration. This was unsuccessful as Everton failed to generate the £30 million needed for a half stake in the stadium project, with the city council rejecting the proposal in 2003. Late in 2004, driven by Liverpool Council and the Northwest Development Corporation, the club entered talks with Liverpool F.C. about sharing a proposed stadium on Stanley Park. Negotiations broke down as Everton failed to raise 50% of the costs. On 11 January 2005, Liverpool announced that ground-sharing was not a possibility, proceeding to plan their own Stanley Park Stadium.",How much money did Everton FC need to generate for a half-stake in the new stadium project in 2000?,"{'text': ['£30 million'], 'answer_start': [310]}"
2,57267c12dd62a815002e86c3,British_Empire,"The last decades of the 19th century saw concerted political campaigns for Irish home rule. Ireland had been united with Britain into the United Kingdom of Great Britain and Ireland with the Act of Union 1800 after the Irish Rebellion of 1798, and had suffered a severe famine between 1845 and 1852. Home rule was supported by the British Prime minister, William Gladstone, who hoped that Ireland might follow in Canada's footsteps as a Dominion within the empire, but his 1886 Home Rule bill was defeated in Parliament. Although the bill, if passed, would have granted Ireland less autonomy within the UK than the Canadian provinces had within their own federation, many MPs feared that a partially independent Ireland might pose a security threat to Great Britain or mark the beginning of the break-up of the empire. A second Home Rule bill was also defeated for similar reasons. A third bill was passed by Parliament in 1914, but not implemented because of the outbreak of the First World War leading to the 1916 Easter Rising.",The first Home Rule bill would have given Ireland less self-control than what other territory?,"{'text': ['Canada'], 'answer_start': [413]}"
3,5726bffb5951b619008f7d2d,Nigeria,"The Kingdom of Nri of the Igbo people consolidated in the 10th century and continued until it lost its sovereignty to the British in 1911. Nri was ruled by the Eze Nri, and the city of Nri is considered to be the foundation of Igbo culture. Nri and Aguleri, where the Igbo creation myth originates, are in the territory of the Umeuri clan. Members of the clan trace their lineages back to the patriarchal king-figure Eri. In West Africa, the oldest bronzes made using the lost-wax process were from Igbo Ukwu, a city under Nri influence.",Which tribe ran the city of Nri?,"{'text': ['Igbo'], 'answer_start': [227]}"
4,57304af1069b531400832002,Windows_8,"Several notable video game developers criticized Microsoft for making its Windows Store a closed platform subject to its own regulations, as it conflicted with their view of the PC as an open platform. Markus ""Notch"" Persson (creator of the indie game Minecraft), Gabe Newell (co-founder of Valve Corporation and developer of software distribution platform Steam), and Rob Pardo from Activision Blizzard voiced concern about the closed nature of the Windows Store. However, Tom Warren of The Verge stated that Microsoft's addition of the Store was simply responding to the success of both Apple and Google in pursuing the ""curated application store approach.""",What company is Rob Pardo associated with?,"{'text': ['Activision Blizzard'], 'answer_start': [384]}"


In [11]:
show_random_elements(squad_v2["train"])

Unnamed: 0,id,title,context,question,answers
0,5a42d4e74a4859001aac7359,Philosophy_of_space_and_time,"The positions on the persistence of objects are somewhat similar. An endurantist holds that for an object to persist through time is for it to exist completely at different times (each instance of existence we can regard as somehow separate from previous and future instances, though still numerically identical with them). A perdurantist on the other hand holds that for a thing to exist through time is for it to exist as a continuous reality, and that when we consider the thing as a whole we must consider an aggregate of all its ""temporal parts"" or instances of existing. Endurantism is seen as the conventional view and flows out of our pre-philosophical ideas (when I talk to somebody I think I am talking to that person as a complete object, and not just a part of a cross-temporal being), but perdurantists have attacked this position. (An example of a perdurantist is David Lewis.) One argument perdurantists use to state the superiority of their view is that perdurantism is able to take account of change in objects.",Who says that objects existin incompletely in the past present and future?,"{'text': [], 'answer_start': []}"
1,56e10f57cd28a01900c67503,Canon_law,"In the Church of England, the ecclesiastical courts that formerly decided many matters such as disputes relating to marriage, divorce, wills, and defamation, still have jurisdiction of certain church-related matters (e.g. discipline of clergy, alteration of church property, and issues related to churchyards). Their separate status dates back to the 12th century when the Normans split them off from the mixed secular/religious county and local courts used by the Saxons. In contrast to the other courts of England the law used in ecclesiastical matters is at least partially a civil law system, not common law, although heavily governed by parliamentary statutes. Since the Reformation, ecclesiastical courts in England have been royal courts. The teaching of canon law at the Universities of Oxford and Cambridge was abrogated by Henry VIII; thereafter practitioners in the ecclesiastical courts were trained in civil law, receiving a Doctor of Civil Law (D.C.L.) degree from Oxford, or a Doctor of Laws (LL.D.) degree from Cambridge. Such lawyers (called ""doctors"" and ""civilians"") were centered at ""Doctors Commons"", a few streets south of St Paul's Cathedral in London, where they monopolized probate, matrimonial, and admiralty cases until their jurisdiction was removed to the common law courts in the mid-19th century.",Who was responsible for banning canon law education from Oxford and Cambridge?,"{'text': ['Henry VIII'], 'answer_start': [833]}"
2,5a8008cf8f0597001ac0014f,Symbiosis,"Commensalism describes a relationship between two living organisms where one benefits and the other is not significantly harmed or helped. It is derived from the English word commensal used of human social interaction. The word derives from the medieval Latin word, formed from com- and mensa, meaning ""sharing a table"".",What type of symbiotic relationship happens when there is a major affect on the other organism?,"{'text': [], 'answer_start': []}"
3,573146e605b4da19006bcfad,Qing_dynasty,"Ratification of the treaty the following year led to resumption of hostilities and in 1860, with Anglo-French forces marching on Beijing, the emperor and his court fled the capital for the imperial hunting lodge at Rehe. Once in Beijing, the Anglo-French forces looted the Old Summer Palace, and in an act of revenge for the arrest of several Englishmen, burnt it to the ground. Prince Gong, a younger half-brother of the emperor, who had been left as his brother's proxy in the capital, was forced to sign the Convention of Beijing. Meanwhile, the humiliated emperor died the following year at Rehe.",What did Prince Gong sign?,"{'text': ['Convention of Beijing'], 'answer_start': [511]}"
4,56e104b7e3433e1400422acb,Canon_law,"The Greek-speaking Orthodox have collected canons and commentaries upon them in a work known as the Pēdálion (Greek: Πηδάλιον, ""Rudder""), so named because it is meant to ""steer"" the Church. The Orthodox Christian tradition in general treats its canons more as guidelines than as laws, the bishops adjusting them to cultural and other local circumstances. Some Orthodox canon scholars point out that, had the Ecumenical Councils (which deliberated in Greek) meant for the canons to be used as laws, they would have called them nómoi/νόμοι (laws) rather than kanónes/κανόνες (rules), but almost all Orthodox conform to them. The dogmatic decisions of the Councils, though, are to be obeyed rather than to be treated as guidelines, since they are essential for the Church's unity.",What are the constituents of the Pēdálion?,"{'text': ['canons and commentaries upon them'], 'answer_start': [43]}"


## Model definition

In [12]:
model_checkpoint = "distilroberta-base"
#model_checkpoint = 'bert-base-uncased'
#model_checkpoint = "prajjwal1/bert-mini"
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
#Span Model
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
print(f"\nROBERTA is trained for sequences up to {model.config.max_position_embeddings} tokens")

Downloading (…)lve/main/config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/331M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForQuestionAnswering: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be 


ROBERTA is trained for sequences up to 514 tokens


## Prepare features 

Let’s start with preprocessing the training data. The hard part will be to generate labels for the question’s answer, which will be the start and end positions of the tokens corresponding to the answer inside the context.

One specific thing for the preprocessing in question answering is how to deal with very long documents. It is usual to truncate them in other tasks, when they are longer than the model maximum sentence length, but here, removing part of the the context might result in losing the answer we are looking for. To deal with this, we will allow one (long) example in __SQuAD__ dataset to give several input features, each of length shorter than the maximum length of the model. Also, just in case the answer lies at the point we split a long context, we allow some overlap between the features we generate controlled by the hyper-parameter `doc_stride`.

It is important to know that we never want to truncate the question, only the context, else the `only_second` truncation picked. Now, the tokenizer can automatically return us a list of features capped by a certain maximum length, with the specific overlap. This information is stored in `overflow_to_sample_mapping`.

Now we need to find in which of those features the answer actually is, and where exactly in that feature. The models we will use require the start and end positions of these answers in the tokens, so we will also need to to map parts of the original context to some tokens. The tokenizer returning an `offset_mapping` that gives, for each index of our input IDS, the corresponding start and end character in the original text that gave our token. So we can use this mapping to find the position of the start and end tokens of our answer in a given feature. We just have to distinguish which parts of the offsets correspond to the question and which part correspond to the context.

For that reason we use the `sequence_ids` method that returns `None` for the special tokens, then 0 or 1 depending on whether the corresponding token comes from the first sentence past (the question) or the second (the context). 

Now with all of this, we can find the first and last token of the answer in one of our input feature or if the answer is not in this feature.

In [13]:
def prepare_train_features(examples, max_length=384, doc_stride=128):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the truncation of the context fail (the tokenized question will take a lots of space). So we remove that left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results in one example possible giving several features when a context is long, each of those features having a context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

Preprocessing the validation data will be slightly easier as we don’t need to generate labels because we don't want to compute a validation loss. This because the loss don’t really help to understand how good the model is. The real joy will be to interpret the predictions of the model into spans of the original context. For this, we will just need to store both the offset mappings and some way to match each created feature to the original example it comes from. Since there is an ID column in the original dataset, we’ll use that ID.

The only thing we’ll add here is a tiny bit of cleanup of the offset mappings. They will contain offsets for the question and the context, but once we’re in the post-processing stage we won’t have any way to know which part of the input IDs corresponded to the context and which part was the question.

In [14]:
def prepare_test_features(examples, max_length=384, doc_stride=128):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the truncation of the context fail (the tokenized question will take a lots of space). So we remove that left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results in one example possible giving several features when a context is long, each of those features having a context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

## Post process and Metric

Once that the model generate the output, we will need to map the predictions back to parts of the context. The model itself predicts logits for the start and en position of our answers. The output of the model is a dict-like object.

We have one logit for each feature and each token. The most obvious thing to predict an answer for each feature is to take the index for the maximum of the start logits as a start position and the index of the maximum of the end logits as an end position. This will work great in a lot of cases, but what if this prediction gives us something impossible: the start position could be greater than the end position, or point to a span of text in the question instead of the answer. 

To choose the best start and end logits we pick the highest `n_best_size` logits and attributed a score to each (start_token, end_token) pair. An idea for the score can be the product but because we are working with logits the score will be the sum of the start and end logits. After checking if each one pairis valid, we will sort them by their score and keep the best one. A pair is not valid if give:
- Answer that wouldn’t be inside the context
- An answer with negative length
- An answer that is too long (we limit the possibilities at `max_answer_length`=30)

The only point left is how to check a given span is inside the context (and not the question) and how to get back the text inside. So we will need a map between examples and their corresponding features. Also, since one example can give several features, we will need to gather together all the answers in all the features generated by a given example, then pick the best one.

Finally we need to also grab the score for the impossible answer (which has start and end indices corresponding to the index of the CLS token). When one example gives several features, we have to predict the impossible answer when all the features give a high score to the impossible answer (since one feature could predict the impossible answer just because the answer isn't in the part of the context it has access too), which is why the score of the impossible answer for one example is the minimum of the scores for the impossible answer in each feature generated by the example. We then predict the impossible answer when that score is greater than the score of the best non-impossible answer.

In [15]:
def postprocess_qa_predictions(all_start_logits, all_end_logits, examples, features, n_best_size=20, max_answer_length=30, squad_v2=False):
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some the positions in our logits to span of texts in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                        or len(offset_mapping[start_index]) == 0
                        or len(offset_mapping[end_index]) == 0
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions

To compute the metric we just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary. 

In [16]:
def compute_metrics(predictions, examples, squad_v2=False):
    if squad_v2:
        formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in predictions.items()]
    else:
        formatted_predictions = [{"id": k, "prediction_text": v} for k, v in predictions.items()]

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    
    metric = load_metric("squad_v2" if squad_v2 else "squad") 
    
    return metric.compute(predictions=formatted_predictions, references=theoretical_answers)

## Train and Evaluation


In [17]:
def train(model,
          tokenized_train_dataset, 
          tokenized_val_dataset, 
          raw_val_dataset,
          folder,
          num_train_epochs=3,
          learning_rate=2e-5,
          batch_size=64,
          squad_v2=False):

    tokenized_train_dataset.set_format("torch")
    val_for_model = tokenized_val_dataset.remove_columns(["example_id", "offset_mapping"])
    val_for_model.set_format("torch")

    train_dataloader = DataLoader(tokenized_train_dataset,
                                  shuffle=True,
                                  collate_fn=default_data_collator,
                                  batch_size=batch_size)
    
    eval_dataloader = DataLoader(val_for_model,
                                 collate_fn=default_data_collator,
                                 batch_size=batch_size)

    optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-08)

    accelerator = Accelerator()
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(model, optimizer, train_dataloader, eval_dataloader)

    lr_scheduler = get_scheduler("linear",
                                 optimizer=optimizer,
                                 num_warmup_steps=0,
                                 num_training_steps=num_train_epochs*len(train_dataloader))

    for epoch in range(1, num_train_epochs + 1):
        # Training
        model.train()
        train_loss = 0 # cumulative loss
        loop = tqdm(train_dataloader)
        for batch in loop:
            # Forward Pass
            outputs = model(**batch)
            # Find the Loss
            loss = outputs.loss
            # Calculate gradients 
            accelerator.backward(loss)
            # Update Weights
            optimizer.step()
            lr_scheduler.step()
            # Clear the gradients
            optimizer.zero_grad()
            # Calculate Loss
            train_loss += loss.item()

            loop.set_description(f'Epoch {epoch}')
            loop.set_postfix(loss=loss.item())

        # Compute average loss per epoch
        avg_train_loss = train_loss / len(train_dataloader)

        # Evaluation
        model.eval()
        exact_match = 0
        f1 = 0 
        start_logits = []
        end_logits = []

        loop = tqdm(eval_dataloader)
        for batch in loop:
            with torch.no_grad():
                # Forward Pass
                outputs = model(**batch)
                loop.set_description(f'Valid {epoch}')

            start_logits.append(accelerator.gather(outputs.start_logits).cpu().numpy())
            end_logits.append(accelerator.gather(outputs.end_logits).cpu().numpy())


        start_logits = np.concatenate(start_logits)
        end_logits = np.concatenate(end_logits)

        prediction = postprocess_qa_predictions(start_logits, end_logits, raw_val_dataset, tokenized_val_dataset, squad_v2=squad_v2)

        metrics = compute_metrics(prediction, raw_val_dataset, squad_v2)
        f1_score = metrics['f1']
        exact_match_score = metrics['exact'] if squad_v2 else metrics['exact_match']

        print(f'Epoch {epoch}:\t train-loss = {avg_train_loss:.2f}\t val-f1 = {f1_score:.2f}\t exact_match = {exact_match_score:.2f}')


        # Save and upload
        accelerator.wait_for_everyone()
        unwrapped_model = accelerator.unwrap_model(model)
        path = ''.join([folder, '/', str(epoch), '/'])
        if not os.path.exists(path):
            os.makedirs(path)
        unwrapped_model.save_pretrained(path, save_function=accelerator.save)

In [18]:
def generate(model, tokenized_test, raw_test_dataset, batch_size=128, squad_v2=False):
  test_for_model = tokenized_test.remove_columns(["example_id", "offset_mapping"])
  test_for_model.set_format("torch")

  test_dataloader = DataLoader(test_for_model, 
                               collate_fn=default_data_collator, 
                               batch_size=batch_size)

  model, test_dataloader = Accelerator().prepare(model, test_dataloader)

  model.eval()

  start_logits = []
  end_logits = []

  for batch in tqdm(test_dataloader):
      with torch.no_grad():
          outputs = model(**batch)

      start_logits.append((outputs.start_logits).cpu().numpy())
      end_logits.append((outputs.end_logits).cpu().numpy())


  start_logits = np.concatenate(start_logits)
  end_logits = np.concatenate(end_logits)

  return postprocess_qa_predictions(start_logits, end_logits, raw_test_dataset, tokenized_test, squad_v2=squad_v2)

## Result

In [19]:
max_length = 256 # The maximum length of a feature (question and context)
doc_stride = 64 # The authorized overlap between two part of the context when splitting it is needed.
batch_size = 32
path='/content/squad'
path_v2='/content/squad_v2'

### SQuAD

In [None]:
tokenized_train_dataset = squad["train"].map(
    partial(prepare_train_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    num_proc=3,
    remove_columns=squad["train"].column_names,
)

tokenized_val_dataset = squad["validation"].map(
    partial(prepare_test_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    num_proc=3,
    remove_columns=squad["validation"].column_names
)

      

#2:   0%|          | 0/24 [00:00<?, ?ba/s]

#1:   0%|          | 0/24 [00:00<?, ?ba/s]

#0:   0%|          | 0/24 [00:00<?, ?ba/s]

      

#0:   0%|          | 0/6 [00:00<?, ?ba/s]

#1:   0%|          | 0/6 [00:00<?, ?ba/s]

#2:   0%|          | 0/6 [00:00<?, ?ba/s]

In [None]:
train(model,
      tokenized_train_dataset, 
      tokenized_val_dataset, 
      squad["validation"],
      num_train_epochs=3,
      folder=path,
      learning_rate=2e-5,
      batch_size=batch_size,
      squad_v2=False)

  0%|          | 0/2425 [00:00<?, ?it/s]

  0%|          | 0/607 [00:00<?, ?it/s]

Post-processing 17520 example predictions split into 19418 features.


  0%|          | 0/17520 [00:00<?, ?it/s]

  metric = load_metric("squad_v2" if squad_v2 else "squad")


Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

Epoch 1:	 train-loss = 1.41	 val-f1 = 78.40	 exact_match = 64.79


  0%|          | 0/2425 [00:00<?, ?it/s]

  0%|          | 0/607 [00:00<?, ?it/s]

Post-processing 17520 example predictions split into 19418 features.


  0%|          | 0/17520 [00:00<?, ?it/s]

Epoch 2:	 train-loss = 0.95	 val-f1 = 80.01	 exact_match = 66.62


  0%|          | 0/2425 [00:00<?, ?it/s]

  0%|          | 0/607 [00:00<?, ?it/s]

Post-processing 17520 example predictions split into 19418 features.


  0%|          | 0/17520 [00:00<?, ?it/s]

Epoch 3:	 train-loss = 0.82	 val-f1 = 80.67	 exact_match = 67.21


In [None]:
torch.save(model.state_dict(), '/content/squad.pth')

In [None]:
tokenized_test_dataset = squad['test'].map(
    partial(prepare_test_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    remove_columns=squad["test"].column_names,
)

  0%|          | 0/11 [00:00<?, ?ba/s]

In [None]:
prediction = generate(model, tokenized_test_dataset, squad['test'], batch_size=256)

  0%|          | 0/47 [00:00<?, ?it/s]

Post-processing 10570 example predictions split into 11912 features.


  0%|          | 0/10570 [00:00<?, ?it/s]

In [None]:
compute_metrics(prediction, squad['test'], squad_v2=False)

{'exact_match': 79.75402081362347, 'f1': 87.22165222785182}

### SQuAD2

In [20]:
tokenized_train_dataset_v2 = squad_v2["train"].map(
    partial(prepare_train_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    num_proc=3,
    remove_columns=squad_v2["train"].column_names,
)

tokenized_val_dataset_v2 = squad_v2["validation"].map(
    partial(prepare_test_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    num_proc=3,
    remove_columns=squad_v2["validation"].column_names
)

      

#0:   0%|          | 0/35 [00:00<?, ?ba/s]

#2:   0%|          | 0/35 [00:00<?, ?ba/s]

#1:   0%|          | 0/35 [00:00<?, ?ba/s]

      

#0:   0%|          | 0/9 [00:00<?, ?ba/s]

#1:   0%|          | 0/9 [00:00<?, ?ba/s]

#2:   0%|          | 0/9 [00:00<?, ?ba/s]

In [None]:
train(model,
      tokenized_train_dataset_v2, 
      tokenized_val_dataset_v2, 
      squad_v2["validation"],
      folder=path_v2,
      num_train_epochs=3,
      learning_rate=2e-5,
      batch_size=batch_size,
      squad_v2=True)

  0%|          | 0/3604 [00:00<?, ?it/s]

  0%|          | 0/900 [00:00<?, ?it/s]

Post-processing 26064 example predictions split into 28790 features.


  0%|          | 0/26064 [00:00<?, ?it/s]

  metric = load_metric("squad_v2" if squad_v2 else "squad")


Downloading builder script:   0%|          | 0.00/2.25k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.19k [00:00<?, ?B/s]

Epoch 1:	 train-loss = 1.40	 val-f1 = 68.86	 exact_match = 60.93


  0%|          | 0/3604 [00:00<?, ?it/s]

  0%|          | 0/900 [00:00<?, ?it/s]

Post-processing 26064 example predictions split into 28790 features.


  0%|          | 0/26064 [00:00<?, ?it/s]

Epoch 2:	 train-loss = 0.98	 val-f1 = 73.34	 exact_match = 65.77


  0%|          | 0/3604 [00:00<?, ?it/s]

In [None]:
torch.save(model.state_dict(), '/content/squad_v2.pth')

In [None]:
tokenized_test_dataset_v2 = squad_v2['test'].map(
    partial(prepare_test_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    remove_columns=squad_v2["test"].column_names,
)

In [None]:
prediction_v2 = generate(model, tokenized_test_dataset_v2, squad_v2['test'], batch_size=256, squad_v2=True)

In [None]:
compute_metrics(prediction_v2, squad_v2['test'], squad_v2=True)

In [None]:
compute_metrics(prediction_v2, squad_v2['test'], squad_v2=True)