# Text Mining Project
## Question Answering on SQUAD

- autor:
Samuele Marino

### Introduction

Question Answering (QA) models can automate the response to frequently asked questions by using a knowledge base (documents) as context.
Exist several different QA variants based on the inputs and outputs, for example:
- __Extractive QA__: The model extracts the answer from a context. The context here could be a provided text, a table or even HTML.
- __Open Generative QA__: The model generates free text directly based on the context.
- __Closed Generative QA__: In this case, no context is provided. The answer is completely generated by a model.

In this notebook I will provide a __Extractive QA__ model based on __SQuAD__ dataset.

## Import

In [1]:
#Colab
!pip install transformers datasets accelerate --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m96.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m452.9/452.9 KB[0m [31m38.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m191.5/191.5 KB[0m [31m25.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m111.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m21.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m213.0/213.0 KB[0m [31m22.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.0/132.0 KB[0m [31m16.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m140.6/140.6 KB[0m [31m14.5 MB/s[0m eta [36m0:00:00[0m
[?25h

In [2]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, default_data_collator, get_scheduler
from datasets import load_dataset, load_metric, ClassLabel, Sequence, DatasetDict
from IPython.display import display, HTML
from torch.utils.data import DataLoader
from accelerate import Accelerator
from torch.optim import AdamW
from functools import partial
from tqdm.auto import tqdm
import pandas as pd
import numpy as np
import collections
import random
import torch
import os

## Datasets

Stanford Question Answering Dataset (__SQuAD__) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
- __SQuAD1.1__, the previous version of the SQuAD dataset, contains 100,000+ question-answer pairs on 500+ articles.
- __SQuAD2.0__ combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

Download dataset

In [3]:
datasets = load_dataset("squad")
datasets_v2 = load_dataset("squad_v2")

Downloading builder script:   0%|          | 0.00/5.27k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.36k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.67k [00:00<?, ?B/s]

Downloading and preparing dataset squad/plain_text to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading builder script:   0%|          | 0.00/5.28k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.40k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.02k [00:00<?, ?B/s]

Downloading and preparing dataset squad_v2/squad_v2 to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/9.55M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/801k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/130319 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11873 [00:00<?, ? examples/s]

Dataset squad_v2 downloaded and prepared to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Have a quick look on the data

In [4]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [5]:
datasets_v2

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

Only the train and validation sets are provided. I will use the original validation set as the test set, and then I will split the test set to obtain our validation set.

In [6]:
train_valid = datasets['train'].train_test_split(test_size=0.2)
squad = DatasetDict({
    'train': train_valid['train'],
    'validation': train_valid['test'],
    'test': datasets['validation']})
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 70079
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 17520
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [7]:
train_valid_v2 = datasets_v2['train'].train_test_split(test_size=0.2)
squad_v2 = DatasetDict({
    'train': train_valid_v2['train'],
    'validation': train_valid_v2['test'],
    'test': datasets_v2['validation']})
squad_v2

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 104255
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 26064
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

Show some elements of the datasets

In [8]:
def show_random_elements(dataset, num_examples=5):
    # Can't pick more elements than there are in the dataset
    assert num_examples <= len(dataset)
    
    picks =  random.sample(range(1, len(dataset)-1), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [9]:
show_random_elements(squad["train"])

Unnamed: 0,id,title,context,question,answers
0,56fb88a0b28b3419009f1e02,Middle_Ages,"Society throughout Europe was disturbed by the dislocations caused by the Black Death. Lands that had been marginally productive were abandoned, as the survivors were able to acquire more fertile areas. Although serfdom declined in Western Europe it became more common in Eastern Europe, as landlords imposed it on those of their tenants who had previously been free. Most peasants in Western Europe managed to change the work they had previously owed to their landlords into cash rents. The percentage of serfs amongst the peasantry declined from a high of 90 to closer to 50 per cent by the end of the period. Landlords also became more conscious of common interests with other landholders, and they joined together to extort privileges from their governments. Partly at the urging of landlords, governments attempted to legislate a return to the economic conditions that existed before the Black Death. Non-clergy became increasingly literate, and urban populations began to imitate the nobility's interest in chivalry.","By the end of this period, about what percentage of Western Europeans were serfs?","{'text': ['50'], 'answer_start': [574]}"
1,56cc306b6d243a140015eec6,Sino-Tibetan_relations_during_the_Ming_dynasty,"The Ming initiated sporadic armed intervention in Tibet during the 14th century, but did not garrison permanent troops there. At times the Tibetans also used armed resistance against Ming forays. The Wanli Emperor (r. 1572–1620) made attempts to reestablish Sino-Tibetan relations after the Mongol-Tibetan alliance initiated in 1578, which affected the foreign policy of the subsequent Qing dynasty (1644–1912) of China in their support for the Dalai Lama of the Gelug school. By the late 16th century, the Mongols were successful armed protectors of the Gelug Dalai Lama, after increasing their presence in the Amdo region. This culminated in Güshi Khan's (1582–1655) conquest of Tibet from 1637–1642 and the establishment of the Ganden Phodrang regime by the 5th Dalai Lama with his help.",Who were the armed protectors for the Gelug Dalai Lama?,"{'text': ['the Mongols'], 'answer_start': [503]}"
2,5727cc862ca10214002d96a1,London,"Two recent discoveries indicate probable very early settlements near the Thames in the London area. In 1999, the remains of a Bronze Age bridge were found on the foreshore north of Vauxhall Bridge. This bridge either crossed the Thames, or went to a now lost island in the river. Dendrology dated the timbers to 1500 BC. In 2010 the foundations of a large timber structure, dated to 4500 BC, were found on the Thames foreshore, south of Vauxhall Bridge. The function of the mesolithic structure is not known. Both structures are on South Bank, at a natural crossing point where the River Effra flows into the River Thames.",How many ancient structures' ruins have been found near the River Thames in recent history?,"{'text': ['Two'], 'answer_start': [0]}"
3,5726e6bcdd62a815002e947b,Yale_University,"Yale's Office of Sustainability develops and implements sustainability practices at Yale. Yale is committed to reduce its greenhouse gas emissions 10% below 1990 levels by the year 2020. As part of this commitment, the university allocates renewable energy credits to offset some of the energy used by residential colleges. Eleven campus buildings are candidates for LEED design and certification. Yale Sustainable Food Project initiated the introduction of local, organic vegetables, fruits, and beef to all residential college dining halls. Yale was listed as a Campus Sustainability Leader on the Sustainable Endowments Institute’s College Sustainability Report Card 2008, and received a ""B+"" grade overall.",What project is bringing organic food to all of Yale's residential college dining areas?,"{'text': ['Yale Sustainable Food Project'], 'answer_start': [398]}"
4,5727f8794b864d19001640ea,Strasbourg,"During the Franco-Prussian War and the Siege of Strasbourg, the city was heavily bombarded by the Prussian army. The bombardment of the city was meant to break the morale of the people of Strasbourg. On 24 and 26 August 1870, the Museum of Fine Arts was destroyed by fire, as was the Municipal Library housed in the Gothic former Dominican church, with its unique collection of medieval manuscripts (most famously the Hortus deliciarum), rare Renaissance books, archeological finds and historical artifacts. The gothic cathedral was damaged as well as the medieval church of Temple Neuf, the theatre, the city hall, the court of justice and many houses. At the end of the siege 10,000 inhabitants were left without shelter; over 600 died, including 261 civilians, and 3200 were injured, including 1,100 civilians.",What cathedral was damaged along with the medieval church of Temple Neuf?,"{'text': ['gothic'], 'answer_start': [512]}"


In [10]:
show_random_elements(squad_v2["train"])

Unnamed: 0,id,title,context,question,answers
0,572821522ca10214002d9e8d,London,"Summers are generally warm and sometimes hot. London's average July high is 24 °C (75.2 °F). On average London will see 31 days above 25 °C (77.0 °F) each year, and 4.2 days above 30.0 °C (86.0 °F) every year. During the 2003 European heat wave there were 14 consecutive days above 30 °C (86.0 °F) and 2 consecutive days where temperatures reached 38 °C (100.4 °F), leading to hundreds of heat related deaths. Winters are generally cool and damp with little temperature variation. Snowfall does occur from time to time, and can cause travel disruption when this happens. Spring and autumn are mixed seasons and can be pleasant. As a large city, London has a considerable urban heat island effect, making the centre of London at times 5 °C (9 °F) warmer than the suburbs and outskirts. The effect of this can be seen below when comparing London Heathrow, 15 miles west of London, with the London Weather Centre, in the city centre.",What is London's average high temperature in July?,"{'text': ['24 °C (75.2 °F)'], 'answer_start': [76]}"
1,5ad259d1d7d075001a428e16,Dominican_Order,"Another who contributed significantly to the spirituality of the order is Albertus Magnus, the only person of the period to be given the appellation ""Great"". His influence on the brotherhood permeated nearly every aspect of Dominican life. Albert was a scientist, philosopher, astrologer, theologian, spiritual writer, ecumenist, and diplomat. Under the auspices of Humbert of Romans, Albert molded the curriculum of studies for all Dominican students, introduced Aristotle to the classroom and probed the work of Neoplatonists, such as Plotinus. Indeed, it was the thirty years of work done by Thomas Aquinas and himself (1245–1274) that allowed for the inclusion of Aristotelian study in the curriculum of Dominican schools.",What was not a discipline of Albert the Great?,"{'text': [], 'answer_start': []}"
2,5a8ddb86df8bba001a0f9cb2,Richard_Feynman,"In 1974, Feynman delivered the Caltech commencement address on the topic of cargo cult science, which has the semblance of science, but is only pseudoscience due to a lack of ""a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty"" on the part of the scientist. He instructed the graduating class that ""The first principle is that you must not fool yourself—and you are the easiest person to fool. So you have to be very careful about that. After you've not fooled yourself, it's easy not to fool other scientists. You just have to be honest in a conventional way after that.""",What did Feynman tell the class that they must do to themselves?,"{'text': [], 'answer_start': []}"
3,5ad3d344604f3c001a3ff27f,Yale_University,"The Yale Report of 1828 was a dogmatic defense of the Latin and Greek curriculum against critics who wanted more courses in modern languages, mathematics, and science. Unlike higher education in Europe, there was no national curriculum for colleges and universities in the United States. In the competition for students and financial support, college leaders strove to keep current with demands for innovation. At the same time, they realized that a significant portion of their students and prospective students demanded a classical background. The Yale report meant the classics would not be abandoned. All institutions experimented with changes in the curriculum, often resulting in a dual track. In the decentralized environment of higher education in the United States, balancing change with tradition was a common challenge because no one could afford to be completely modern or completely classical. A group of professors at Yale and New Haven Congregationalist ministers articulated a conservative response to the changes brought about by the Victorian culture. They concentrated on developing a whole man possessed of religious values sufficiently strong to resist temptations from within, yet flexible enough to adjust to the 'isms' (professionalism, materialism, individualism, and consumerism) tempting him from without. William Graham Sumner, professor from 1872 to 1909, taught in the emerging disciplines of economics and sociology to overflowing classrooms. He bested President Noah Porter, who disliked social science and wanted Yale to lock into its traditions of classical education. Porter objected to Sumner's use of a textbook by Herbert Spencer that espoused agnostic materialism because it might harm students.",When did William Graham Winter teach?,"{'text': [], 'answer_start': []}"
4,572673f0dd62a815002e8568,Mexico_City,"Mexico City is located in the Valley of Mexico, sometimes called the Basin of Mexico. This valley is located in the Trans-Mexican Volcanic Belt in the high plateaus of south-central Mexico. It has a minimum altitude of 2,200 meters (7,200 feet) above sea level and is surrounded by mountains and volcanoes that reach elevations of over 5,000 metres (16,000 feet). This valley has no natural drainage outlet for the waters that flow from the mountainsides, making the city vulnerable to flooding. Drainage was engineered through the use of canals and tunnels starting in the 17th century.",How high do the mountains get in Mexico City's region?,"{'text': ['5,000 metres (16,000 feet)'], 'answer_start': [336]}"


## Model definition

In [11]:
model_checkpoint = "distilroberta-base"
#model_checkpoint = 'bert-base-uncased'
#model_checkpoint = "prajjwal1/bert-mini"
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
#Span Model
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
print(f"\nROBERTA is trained for sequences up to {model.config.max_position_embeddings} tokens")

Downloading (…)lve/main/config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/331M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForQuestionAnswering: ['lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be 


ROBERTA is trained for sequences up to 514 tokens


## Prepare features 

Let’s start with preprocessing the training data. The hard part will be to generate labels for the question’s answer, which will be the start and end positions of the tokens corresponding to the answer inside the context.

One specific thing for the preprocessing in question answering is how to deal with very long documents. It is usual to truncate them in other tasks, when they are longer than the model maximum sentence length, but here, removing part of the the context might result in losing the answer we are looking for. To deal with this, we will allow one (long) example in __SQuAD__ dataset to give several input features, each of length shorter than the maximum length of the model. Also, just in case the answer lies at the point we split a long context, we allow some overlap between the features we generate controlled by the hyper-parameter `doc_stride`.

It is important to know that we never want to truncate the question, only the context, else the `only_second` truncation picked. Now, the tokenizer can automatically return us a list of features capped by a certain maximum length, with the specific overlap. This information is stored in `overflow_to_sample_mapping`.

Now we need to find in which of those features the answer actually is, and where exactly in that feature. The models we will use require the start and end positions of these answers in the tokens, so we will also need to to map parts of the original context to some tokens. The tokenizer returning an `offset_mapping` that gives, for each index of our input IDS, the corresponding start and end character in the original text that gave our token. So we can use this mapping to find the position of the start and end tokens of our answer in a given feature. We just have to distinguish which parts of the offsets correspond to the question and which part correspond to the context.

For that reason we use the `sequence_ids` method that returns `None` for the special tokens, then 0 or 1 depending on whether the corresponding token comes from the first sentence past (the question) or the second (the context). 

Now with all of this, we can find the first and last token of the answer in one of our input feature or if the answer is not in this feature.

In [12]:
def prepare_train_features(examples, max_length=384, doc_stride=128):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the truncation of the context fail (the tokenized question will take a lots of space). So we remove that left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results in one example possible giving several features when a context is long, each of those features having a context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

Preprocessing the validation data will be slightly easier as we don’t need to generate labels because we don't want to compute a validation loss. This because the loss don’t really help to understand how good the model is. The real joy will be to interpret the predictions of the model into spans of the original context. For this, we will just need to store both the offset mappings and some way to match each created feature to the original example it comes from. Since there is an ID column in the original dataset, we’ll use that ID.

The only thing we’ll add here is a tiny bit of cleanup of the offset mappings. They will contain offsets for the question and the context, but once we’re in the post-processing stage we won’t have any way to know which part of the input IDs corresponded to the context and which part was the question.

In [13]:
def prepare_test_features(examples, max_length=384, doc_stride=128):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the truncation of the context fail (the tokenized question will take a lots of space). So we remove that left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results in one example possible giving several features when a context is long, each of those features having a context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

## Post process and Metric

In [14]:
def postprocess_qa_predictions(all_start_logits, all_end_logits, examples, features, n_best_size=20, max_answer_length=30, squad_v2=False):
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some the positions in our logits to span of texts in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                        or len(offset_mapping[start_index]) == 0
                        or len(offset_mapping[end_index]) == 0
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions

In [15]:
def compute_metrics(predictions, examples, squad_v2=False):
    if squad_v2:
        formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in predictions.items()]
    else:
        formatted_predictions = [{"id": k, "prediction_text": v} for k, v in predictions.items()]

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    
    metric = load_metric("squad_v2" if squad_v2 else "squad")
    
    return metric.compute(predictions=formatted_predictions, references=theoretical_answers)

## Train and Evaluation


In [16]:
def train(model,
          tokenized_train_dataset, 
          tokenized_val_dataset, 
          raw_val_dataset,
          folder,
          num_train_epochs=3,
          learning_rate=2e-5,
          batch_size=64,
          squad_v2=False):

    tokenized_train_dataset.set_format("torch")
    val_for_model = tokenized_val_dataset.remove_columns(["example_id", "offset_mapping"])
    val_for_model.set_format("torch")

    train_dataloader = DataLoader(tokenized_train_dataset,
                                  shuffle=True,
                                  collate_fn=default_data_collator,
                                  batch_size=batch_size)
    
    eval_dataloader = DataLoader(val_for_model,
                                 collate_fn=default_data_collator,
                                 batch_size=batch_size)

    optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-08)

    accelerator = Accelerator()
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(model, optimizer, train_dataloader, eval_dataloader)

    lr_scheduler = get_scheduler("linear",
                                 optimizer=optimizer,
                                 num_warmup_steps=0,
                                 num_training_steps=num_train_epochs*len(train_dataloader))

    for epoch in range(1, num_train_epochs + 1):
        # Training
        model.train()
        train_loss = 0 # cumulative loss
        loop = tqdm(train_dataloader)
        for batch in loop:
            # Forward Pass
            outputs = model(**batch)
            # Find the Loss
            loss = outputs.loss
            # Calculate gradients 
            accelerator.backward(loss)
            # Update Weights
            optimizer.step()
            lr_scheduler.step()
            # Clear the gradients
            optimizer.zero_grad()
            # Calculate Loss
            train_loss += loss.item()

            loop.set_description(f'Epoch {epoch}')
            loop.set_postfix(loss=loss.item())

        # Compute average loss per epoch
        avg_train_loss = train_loss / len(train_dataloader)

        # Evaluation
        model.eval()
        exact_match = 0
        f1 = 0 
        start_logits = []
        end_logits = []

        loop = tqdm(eval_dataloader)
        for batch in loop:
            with torch.no_grad():
                # Forward Pass
                outputs = model(**batch)
                loop.set_description(f'Valid {epoch}')

            start_logits.append(accelerator.gather(outputs.start_logits).cpu().numpy())
            end_logits.append(accelerator.gather(outputs.end_logits).cpu().numpy())


        start_logits = np.concatenate(start_logits)
        end_logits = np.concatenate(end_logits)

        prediction = postprocess_qa_predictions(start_logits, end_logits, raw_val_dataset, tokenized_val_dataset, squad_v2=squad_v2)

        metrics = compute_metrics(prediction, raw_val_dataset, squad_v2)
        f1_score = metrics['f1']
        exact_match_score = metrics['exact'] if squad_v2 else metrics['exact_match']

        print(f'Epoch {epoch}:\t train-loss = {avg_train_loss:.2f}\t val-f1 = {f1_score:.2f}\t exact_match = {exact_match_score:.2f}')


        # Save and upload
        accelerator.wait_for_everyone()
        unwrapped_model = accelerator.unwrap_model(model)
        path = ''.join([folder, '/', str(epoch), '/'])
        if not os.path.exists(path):
            os.makedirs(path)
        unwrapped_model.save_pretrained(path, save_function=accelerator.save)

In [17]:
def generate(model, tokenized_test, raw_test_dataset, batch_size=128, squad_v2=False):
  test_for_model = tokenized_test.remove_columns(["example_id", "offset_mapping"])
  test_for_model.set_format("torch")

  test_dataloader = DataLoader(test_for_model, 
                               collate_fn=default_data_collator, 
                               batch_size=batch_size)

  model, test_dataloader = Accelerator().prepare(model, test_dataloader)

  model.eval()

  start_logits = []
  end_logits = []

  for batch in tqdm(test_dataloader):
      with torch.no_grad():
          outputs = model(**batch)

      start_logits.append((outputs.start_logits).cpu().numpy())
      end_logits.append((outputs.end_logits).cpu().numpy())


  start_logits = np.concatenate(start_logits)
  end_logits = np.concatenate(end_logits)

  return postprocess_qa_predictions(start_logits, end_logits, raw_test_dataset, tokenized_test, squad_v2=squad_v2)

## Result

In [18]:
max_length = 256 # The maximum length of a feature (question and context)
doc_stride = 64 # The authorized overlap between two part of the context when splitting it is needed.
batch_size = 32
path='/content/squad'
path_v2='/content/squad_v2'

### SQuAD

In [19]:
tokenized_train_dataset = squad["train"].map(
    partial(prepare_train_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    num_proc=3,
    remove_columns=squad["train"].column_names,
)

tokenized_val_dataset = squad["validation"].map(
    partial(prepare_test_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    num_proc=3,
    remove_columns=squad["validation"].column_names
)

      

#2:   0%|          | 0/24 [00:00<?, ?ba/s]

#1:   0%|          | 0/24 [00:00<?, ?ba/s]

#0:   0%|          | 0/24 [00:00<?, ?ba/s]

      

#0:   0%|          | 0/6 [00:00<?, ?ba/s]

#1:   0%|          | 0/6 [00:00<?, ?ba/s]

#2:   0%|          | 0/6 [00:00<?, ?ba/s]

In [20]:
train(model,
      tokenized_train_dataset, 
      tokenized_val_dataset, 
      squad["validation"],
      num_train_epochs=3,
      folder=path,
      learning_rate=2e-5,
      batch_size=batch_size,
      squad_v2=False)

  0%|          | 0/2425 [00:00<?, ?it/s]

  0%|          | 0/607 [00:00<?, ?it/s]

Post-processing 17520 example predictions split into 19418 features.


  0%|          | 0/17520 [00:00<?, ?it/s]

  metric = load_metric("squad_v2" if squad_v2 else "squad")


Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

Epoch 1:	 train-loss = 1.41	 val-f1 = 78.40	 exact_match = 64.79


  0%|          | 0/2425 [00:00<?, ?it/s]

  0%|          | 0/607 [00:00<?, ?it/s]

Post-processing 17520 example predictions split into 19418 features.


  0%|          | 0/17520 [00:00<?, ?it/s]

Epoch 2:	 train-loss = 0.95	 val-f1 = 80.01	 exact_match = 66.62


  0%|          | 0/2425 [00:00<?, ?it/s]

  0%|          | 0/607 [00:00<?, ?it/s]

Post-processing 17520 example predictions split into 19418 features.


  0%|          | 0/17520 [00:00<?, ?it/s]

Epoch 3:	 train-loss = 0.82	 val-f1 = 80.67	 exact_match = 67.21


In [21]:
torch.save(model.state_dict(), '/content/squad.pth')

In [22]:
tokenized_test_dataset = squad['test'].map(
    partial(prepare_test_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    remove_columns=squad["test"].column_names,
)

  0%|          | 0/11 [00:00<?, ?ba/s]

In [23]:
prediction = generate(model, tokenized_test_dataset, squad['test'], batch_size=256)

  0%|          | 0/47 [00:00<?, ?it/s]

Post-processing 10570 example predictions split into 11912 features.


  0%|          | 0/10570 [00:00<?, ?it/s]

In [24]:
compute_metrics(prediction, squad['test'], squad_v2=False)

{'exact_match': 79.75402081362347, 'f1': 87.22165222785182}

### SQuAD2

In [25]:
tokenized_train_dataset_v2 = squad_v2["train"].map(
    partial(prepare_train_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    num_proc=3,
    remove_columns=squad_v2["train"].column_names,
)

tokenized_val_dataset_v2 = squad_v2["validation"].map(
    partial(prepare_test_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    num_proc=3,
    remove_columns=squad_v2["validation"].column_names
)

      

#1:   0%|          | 0/35 [00:00<?, ?ba/s]

#0:   0%|          | 0/35 [00:00<?, ?ba/s]

#2:   0%|          | 0/35 [00:00<?, ?ba/s]

      

#1:   0%|          | 0/9 [00:00<?, ?ba/s]

#2:   0%|          | 0/9 [00:00<?, ?ba/s]

#0:   0%|          | 0/9 [00:00<?, ?ba/s]

In [None]:
train(model,
      tokenized_train_dataset_v2, 
      tokenized_val_dataset_v2, 
      squad_v2["validation"],
      folder=path_v2,
      num_train_epochs=3,
      learning_rate=2e-5,
      batch_size=batch_size,
      squad_v2=True)

  0%|          | 0/3599 [00:00<?, ?it/s]

  0%|          | 0/905 [00:00<?, ?it/s]

Post-processing 26064 example predictions split into 28943 features.


  0%|          | 0/26064 [00:00<?, ?it/s]

Downloading builder script:   0%|          | 0.00/2.25k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.19k [00:00<?, ?B/s]

Epoch 1:	 train-loss = 1.02	 val-f1 = 75.03	 exact_match = 67.91


  0%|          | 0/3599 [00:00<?, ?it/s]

  0%|          | 0/905 [00:00<?, ?it/s]

Post-processing 26064 example predictions split into 28943 features.


  0%|          | 0/26064 [00:00<?, ?it/s]

Epoch 2:	 train-loss = 0.79	 val-f1 = 76.81	 exact_match = 69.54


  0%|          | 0/3599 [00:00<?, ?it/s]

  0%|          | 0/905 [00:00<?, ?it/s]

Post-processing 26064 example predictions split into 28943 features.


  0%|          | 0/26064 [00:00<?, ?it/s]

In [None]:
torch.save(model.state_dict(), '/content/squad_v2.pth')

In [None]:
tokenized_test_dataset_v2 = squad_v2['test'].map(
    partial(prepare_test_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    remove_columns=squad_v2["test"].column_names,
)

In [None]:
prediction_v2 = generate(model, tokenized_test_dataset_v2, squad_v2['test'], batch_size=256, squad_v2=True)

In [None]:
compute_metrics(prediction_v2, squad_v2['test'], squad_v2=True)

In [None]:
compute_metrics(prediction_v2, squad_v2['test'], squad_v2=True)