# Text Mining Project
## Question Answering on SQUAD

- autor:
Samuele Marino

### Introduction

Question Answering (QA) models can automate the response to frequently asked questions by using a knowledge base (documents) as context.
Exist several different QA variants based on the inputs and outputs, for example:
- __Extractive QA__: The model extracts the answer from a context. The context here could be a provided text, a table or even HTML.
- __Open Generative QA__: The model generates free text directly based on the context.
- __Closed Generative QA__: In this case, no context is provided. The answer is completely generated by a model.

In this notebook we will provide a __Extractive QA__ model based on __SQuAD__ dataset.

## Mount Drive


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import

In [2]:
#Colab
!pip install transformers datasets accelerate --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m74.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m462.8/462.8 KB[0m [31m48.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m191.5/191.5 KB[0m [31m24.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m106.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m14.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m213.0/213.0 KB[0m [31m25.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.0/132.0 KB[0m [31m18.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m140.6/140.6 KB[0m [31m19.9 MB/s[0m eta [36m0:00:00[0m
[?25h

In [3]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, default_data_collator, get_scheduler
from datasets import load_dataset, load_metric, ClassLabel, Sequence, DatasetDict
from IPython.display import display, HTML
from torch.utils.data import DataLoader
from accelerate import Accelerator
from torch.optim import AdamW
from functools import partial
from tqdm.auto import tqdm
import pandas as pd
import numpy as np
import collections
import random
import torch
import os

## Datasets

Stanford Question Answering Dataset (__SQuAD__) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
- __SQuAD1.1__, the previous version of the SQuAD dataset, contains 100,000+ question-answer pairs on 500+ articles.
- __SQuAD2.0__ combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

Download dataset

In [4]:
datasets = load_dataset("squad")
datasets_v2 = load_dataset("squad_v2")

Downloading builder script:   0%|          | 0.00/5.27k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.36k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.67k [00:00<?, ?B/s]

Downloading and preparing dataset squad/plain_text to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Downloading builder script:   0%|          | 0.00/5.28k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.40k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.02k [00:00<?, ?B/s]

Downloading and preparing dataset squad_v2/squad_v2 to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/9.55M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/801k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/130319 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11873 [00:00<?, ? examples/s]

Dataset squad_v2 downloaded and prepared to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Have a quick look on the data

In [5]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [6]:
datasets_v2

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

Only the train and validation sets are provided. I will use the original validation set as the test set, and then I will split the test set to obtain our validation set.

In [7]:
train_valid = datasets['train'].train_test_split(test_size=0.2)
squad = DatasetDict({
    'train': train_valid['train'],
    'validation': train_valid['test'],
    'test': datasets['validation']})
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 70079
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 17520
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [8]:
train_valid_v2 = datasets_v2['train'].train_test_split(test_size=0.2)
squad_v2 = DatasetDict({
    'train': train_valid_v2['train'],
    'validation': train_valid_v2['test'],
    'test': datasets_v2['validation']})
squad_v2

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 104255
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 26064
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

Show some elements of the datasets

In [9]:
def show_random_elements(dataset, num_examples=5):
    # Can't pick more elements than there are in the dataset
    assert num_examples <= len(dataset)
    
    picks =  random.sample(range(1, len(dataset)-1), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [10]:
show_random_elements(squad["train"])

Unnamed: 0,id,title,context,question,answers
0,5731ae4a0fdd8d15006c644a,Indigenous_peoples_of_the_Americas,"A route through Beringia is seen as more likely than the Solutrean hypothesis. Kashani et al. 2012 state that ""The similarities in ages and geographical distributions for C4c and the previously analyzed X2a lineage provide support to the scenario of a dual origin for Paleo-Indians. Taking into account that C4c is deeply rooted in the Asian portion of the mtDNA phylogeny and is indubitably of Asian origin, the finding that C4c and X2a are characterized by parallel genetic histories definitively dismisses the controversial hypothesis of an Atlantic glacial entry route into North America.""",When did Kashani and others make their statement regarding the similarities for C4c distributions?,"{'text': ['2012'], 'answer_start': [94]}"
1,56e16690cd28a01900c67870,Boston,"Boston Common, located near the Financial District and Beacon Hill, is the oldest public park in the United States. Along with the adjacent Boston Public Garden, it is part of the Emerald Necklace, a string of parks designed by Frederick Law Olmsted to encircle the city. The Emerald Necklace includes Jamaica Pond, Boston's largest body of freshwater, and Franklin Park, the city's largest park and home of the Franklin Park Zoo. Another major park is the Esplanade, located along the banks of the Charles River. The Hatch Shell, an outdoor concert venue, is located adjacent to the Charles River Esplanade. Other parks are scattered throughout the city, with the major parks and beaches located near Castle Island; in Charlestown; and along the Dorchester, South Boston, and East Boston shorelines.",The Emerald necklace is a string of what?,"{'text': ['parks'], 'answer_start': [210]}"
2,572716baf1498d1400e8f37e,Comcast,"Critics noted in 2013 that Tom Wheeler, the head of the FCC, which has to approve the deal, is the former head of both the largest cable lobbying organization, the National Cable & Telecommunications Association, and as largest wireless lobby, CTIA – The Wireless Association. According to Politico, Comcast ""donated to almost every member of Congress who has a hand in regulating it."" The US Senate Judiciary Committee held a hearing on the deal on April 9, 2014. The House Judiciary Committee planned its own hearing. On March 6, 2014 the United States Department of Justice Antitrust Division confirmed it was investigating the deal. In March 2014, the division's chairman, William Baer, recused himself because he was involved in a prior Comcast NBCUniversal acquisition. Several states' attorneys general have announced support for the federal investigation. On April 24, 2015, Jonathan Sallet, general counsel of the F.C.C., said that he was going to recommend a hearing before an administrative law judge, equivalent to a collapse of the deal.",What Senate group held hearings on the purchase?,"{'text': ['US Senate Judiciary Committee'], 'answer_start': [390]}"
3,570d5555b3d812140066d6d9,Valencia,"The decline of the city reached its nadir with the War of Spanish Succession (1702–1709) that marked the end of the political and legal independence of the Kingdom of Valencia. During the War of the Spanish Succession, Valencia sided with Charles of Austria. On 24 January 1706, Charles Mordaunt, 3rd Earl of Peterborough, 1st Earl of Monmouth, led a handful of English cavalrymen into the city after riding south from Barcelona, capturing the nearby fortress at Sagunt, and bluffing the Spanish Bourbon army into withdrawal.",What war took place from 1702-1709?,"{'text': ['War of Spanish Succession'], 'answer_start': [51]}"
4,56d3887c59d6e4140014667c,American_Idol,"The show pushed Fox to become the number one U.S. TV network amongst adults 18–49, the key demographic coveted by advertisers, for an unprecedented eight consecutive years by 2012. Its success also helped lift the ratings of other shows that were scheduled around it such as House and Bones, and Idol, for years, had become Fox's strongest platform primetime television program for promoting eventual hit shows of the 2010s (of the same network) such as Glee and New Girl. The show, its creator Simon Fuller claimed, ""saved Fox"".",What television network originally aired the show Glee?,"{'text': ['Fox'], 'answer_start': [16]}"


In [11]:
show_random_elements(squad_v2["train"])

Unnamed: 0,id,title,context,question,answers
0,5727aa68ff5b5019007d9224,"New_Haven,_Connecticut","Hopkins School, a private school, was founded in 1660 and is the fifth-oldest educational institution in the United States. New Haven is home to a number of other private schools as well as public magnet schools, including Metropolitan Business Academy, High School in the Community, Hill Regional Career High School, Co-op High School, New Haven Academy, ACES Educational Center for the Arts, the Foote School and the Sound School, all of which draw students from New Haven and suburban towns. New Haven is also home to two Achievement First charter schools, Amistad Academy and Elm City College Prep, and to Common Ground, an environmental charter school.",What magnet school in New Haven is centered around arts education?,"{'text': ['ACES Educational Center for the Arts'], 'answer_start': [356]}"
1,5729ffedaf94a219006aa743,Energy,"According to conservation of energy, energy can neither be created (produced) nor destroyed by itself. It can only be transformed. The total inflow of energy into a system must equal the total outflow of energy from the system, plus the change in the energy contained within the system. Energy is subject to a strict global conservation law; that is, whenever one measures (or calculates) the total energy of a system of particles whose interactions do not depend explicitly on time, it is found that the total energy of the system always remains constant.","According to what, energy can neither be created nor destroyed by itself?","{'text': ['conservation of energy'], 'answer_start': [13]}"
2,57340e5dd058e614000b68b9,Genocide,"There has been much debate over categorizing the situation in Darfur as genocide. The ongoing conflict in Darfur, Sudan, which started in 2003, was declared a ""genocide"" by United States Secretary of State Colin Powell on 9 September 2004 in testimony before the Senate Foreign Relations Committee. Since that time however, no other permanent member of the UN Security Council followed suit. In fact, in January 2005, an International Commission of Inquiry on Darfur, authorized by UN Security Council Resolution 1564 of 2004, issued a report to the Secretary-General stating that ""the Government of the Sudan has not pursued a policy of genocide."" Nevertheless, the Commission cautioned that ""The conclusion that no genocidal policy has been pursued and implemented in Darfur by the Government authorities, directly or through the militias under their control, should not be taken in any way as detracting from the gravity of the crimes perpetrated in that region. International offences such as the crimes against humanity and war crimes that have been committed in Darfur may be no less serious and heinous than genocide.""",What has been widely debated as a possible act of genocide in Sudan?,"{'text': ['situation in Darfur'], 'answer_start': [49]}"
3,5ad16586645df0001a2d199a,Labour_Party_(UK),"The party's decision-making bodies on a national level formally include the National Executive Committee (NEC), Labour Party Conference and National Policy Forum (NPF)—although in practice the Parliamentary leadership has the final say on policy. The 2008 Labour Party Conference was the first at which affiliated trade unions and Constituency Labour Parties did not have the right to submit motions on contemporary issues that would previously have been debated. Labour Party conferences now include more ""keynote"" addresses, guest speakers and question-and-answer sessions, while specific discussion of policy now takes place in the National Policy Forum.",Where does discussion of policy never take place?,"{'text': [], 'answer_start': []}"
4,5a84a37a7cf838001a46a9ec,Party_leaders_of_the_United_States_House_of_Representatives,"In addition, the minority leader has a number of other institutional functions. For instance, the minority leader is sometimes statutorily authorized to appoint individuals to certain federal entities; he or she and the majority leader each name three Members to serve as Private Calendar objectors; he or she is consulted with respect to reconvening the House per the usual formulation of conditional concurrent adjournment resolutions; he or she is a traditional member of the House Office Building Commission; he or she is a member of the United States Capitol Preservation Commission; and he or she may, after consultation with the Speaker, convene an early organizational party caucus or conference. Informally, the minority leader maintains ties with majority party leaders to learn about the schedule and other House matters and forges agreements or understandings with them insofar as feasible.",How many institutional functions does the House office Building Commission have?,"{'text': [], 'answer_start': []}"


## Model definition

In [12]:
model_checkpoint = "distilroberta-base"
#model_checkpoint = 'bert-base-uncased'
#model_checkpoint = "prajjwal1/bert-mini"
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
#Span Model
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
print(f"\nROBERTA is trained for sequences up to {model.config.max_position_embeddings} tokens")

Downloading (…)lve/main/config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/331M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForQuestionAnswering: ['lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be 


ROBERTA is trained for sequences up to 514 tokens


## Prepare features 

Let’s start with preprocessing the training data. The hard part will be to generate labels for the question’s answer, which will be the start and end positions of the tokens corresponding to the answer inside the context.

One specific thing for the preprocessing in question answering is how to deal with very long documents. It is usual to truncate them in other tasks, when they are longer than the model maximum sentence length, but here, removing part of the the context might result in losing the answer we are looking for. To deal with this, we will allow one (long) example in __SQuAD__ dataset to give several input features, each of length shorter than the maximum length of the model. Also, just in case the answer lies at the point we split a long context, we allow some overlap between the features we generate controlled by the hyper-parameter `doc_stride`.

It is important to know that we never want to truncate the question, only the context, else the `only_second` truncation picked. Now, the tokenizer can automatically return us a list of features capped by a certain maximum length, with the specific overlap. This information is stored in `overflow_to_sample_mapping`.

Now we need to find in which of those features the answer actually is, and where exactly in that feature. The models we will use require the start and end positions of these answers in the tokens, so we will also need to to map parts of the original context to some tokens. The tokenizer returning an `offset_mapping` that gives, for each index of our input IDS, the corresponding start and end character in the original text that gave our token. So we can use this mapping to find the position of the start and end tokens of our answer in a given feature. We just have to distinguish which parts of the offsets correspond to the question and which part correspond to the context.

For that reason we use the `sequence_ids` method that returns `None` for the special tokens, then 0 or 1 depending on whether the corresponding token comes from the first sentence past (the question) or the second (the context). 

Now with all of this, we can find the first and last token of the answer in one of our input feature or if the answer is not in this feature.

In [13]:
def prepare_train_features(examples, max_length=384, doc_stride=128):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the truncation of the context fail (the tokenized question will take a lots of space). So we remove that left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results in one example possible giving several features when a context is long, each of those features having a context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

Preprocessing the validation data will be slightly easier as we don’t need to generate labels because we don't want to compute a validation loss. This because the loss don’t really help to understand how good the model is. The real joy will be to interpret the predictions of the model into spans of the original context. For this, we will just need to store both the offset mappings and some way to match each created feature to the original example it comes from. Since there is an ID column in the original dataset, we’ll use that ID.

The only thing we’ll add here is a tiny bit of cleanup of the offset mappings. They will contain offsets for the question and the context, but once we’re in the post-processing stage we won’t have any way to know which part of the input IDs corresponded to the context and which part was the question.

In [14]:
def prepare_test_features(examples, max_length=384, doc_stride=128):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the truncation of the context fail (the tokenized question will take a lots of space). So we remove that left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results in one example possible giving several features when a context is long, each of those features having a context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

## Post process and Metric

Once that the model generate the output, we will need to map the predictions back to parts of the context. The model itself predicts logits for the start and en position of our answers. The output of the model is a dict-like object.

We have one logit for each feature and each token. The most obvious thing to predict an answer for each feature is to take the index for the maximum of the start logits as a start position and the index of the maximum of the end logits as an end position. This will work great in a lot of cases, but what if this prediction gives us something impossible: the start position could be greater than the end position, or point to a span of text in the question instead of the answer. 

To choose the best start and end logits we pick the highest `n_best_size` logits and attributed a score to each (start_token, end_token) pair. An idea for the score can be the product but because we are working with logits the score will be the sum of the start and end logits. After checking if each one pairis valid, we will sort them by their score and keep the best one. A pair is not valid if give:
- Answer that wouldn’t be inside the context
- An answer with negative length
- An answer that is too long (we limit the possibilities at `max_answer_length`=30)

The only point left is how to check a given span is inside the context (and not the question) and how to get back the text inside. So we will need a map between examples and their corresponding features. Also, since one example can give several features, we will need to gather together all the answers in all the features generated by a given example, then pick the best one.

Finally we need to also grab the score for the impossible answer (which has start and end indices corresponding to the index of the CLS token). When one example gives several features, we have to predict the impossible answer when all the features give a high score to the impossible answer (since one feature could predict the impossible answer just because the answer isn't in the part of the context it has access too), which is why the score of the impossible answer for one example is the minimum of the scores for the impossible answer in each feature generated by the example. We then predict the impossible answer when that score is greater than the score of the best non-impossible answer.

In [15]:
def postprocess_qa_predictions(all_start_logits, all_end_logits, examples, features, n_best_size=20, max_answer_length=30, squad_v2=False):
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some the positions in our logits to span of texts in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                        or len(offset_mapping[start_index]) == 0
                        or len(offset_mapping[end_index]) == 0
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions

To compute the metric we just need to format predictions and labels a bit as it expects a list of dictionaries and not one big dictionary. 

In [16]:
def compute_metrics(predictions, examples, squad_v2=False):
    if squad_v2:
        formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in predictions.items()]
    else:
        formatted_predictions = [{"id": k, "prediction_text": v} for k, v in predictions.items()]

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    
    metric = load_metric("squad_v2" if squad_v2 else "squad") 
    
    return metric.compute(predictions=formatted_predictions, references=theoretical_answers)

## Train and Evaluation


In [17]:
def train(model,
          tokenized_train_dataset, 
          tokenized_val_dataset, 
          raw_val_dataset,
          folder,
          num_train_epochs=3,
          learning_rate=2e-5,
          batch_size=64,
          squad_v2=False):

    tokenized_train_dataset.set_format("torch")
    val_for_model = tokenized_val_dataset.remove_columns(["example_id", "offset_mapping"])
    val_for_model.set_format("torch")

    train_dataloader = DataLoader(tokenized_train_dataset,
                                  shuffle=True,
                                  collate_fn=default_data_collator,
                                  batch_size=batch_size)
    
    eval_dataloader = DataLoader(val_for_model,
                                 collate_fn=default_data_collator,
                                 batch_size=batch_size)

    optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-08)

    accelerator = Accelerator()
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(model, optimizer, train_dataloader, eval_dataloader)

    lr_scheduler = get_scheduler("linear",
                                 optimizer=optimizer,
                                 num_warmup_steps=0,
                                 num_training_steps=num_train_epochs*len(train_dataloader))

    for epoch in range(1, num_train_epochs + 1):
        # Training
        model.train()
        train_loss = 0 # cumulative loss
        loop = tqdm(train_dataloader)
        for batch in loop:
            # Forward Pass
            outputs = model(**batch)
            # Find the Loss
            loss = outputs.loss
            # Calculate gradients 
            accelerator.backward(loss)
            # Update Weights
            optimizer.step()
            lr_scheduler.step()
            # Clear the gradients
            optimizer.zero_grad()
            # Calculate Loss
            train_loss += loss.item()

            loop.set_description(f'Epoch {epoch}')
            loop.set_postfix(loss=loss.item())

        # Compute average loss per epoch
        avg_train_loss = train_loss / len(train_dataloader)

        # Evaluation
        model.eval()
        exact_match = 0
        f1 = 0 
        start_logits = []
        end_logits = []

        loop = tqdm(eval_dataloader)
        for batch in loop:
            with torch.no_grad():
                # Forward Pass
                outputs = model(**batch)
                loop.set_description(f'Valid {epoch}')

            start_logits.append(accelerator.gather(outputs.start_logits).cpu().numpy())
            end_logits.append(accelerator.gather(outputs.end_logits).cpu().numpy())


        start_logits = np.concatenate(start_logits)
        end_logits = np.concatenate(end_logits)

        prediction = postprocess_qa_predictions(start_logits, end_logits, raw_val_dataset, tokenized_val_dataset, squad_v2=squad_v2)

        metrics = compute_metrics(prediction, raw_val_dataset, squad_v2)
        f1_score = metrics['f1']
        exact_match_score = metrics['exact'] if squad_v2 else metrics['exact_match']

        print(f'Epoch {epoch}:\t train-loss = {avg_train_loss:.2f}\t val-f1 = {f1_score:.2f}\t exact_match = {exact_match_score:.2f}')


        # Save and upload
        accelerator.wait_for_everyone()
        unwrapped_model = accelerator.unwrap_model(model)
        path = ''.join([folder, '/', str(epoch), '/'])
        if not os.path.exists(path):
            os.makedirs(path)
        unwrapped_model.save_pretrained(path, save_function=accelerator.save)

In [18]:
def generate(model, tokenized_test, raw_test_dataset, batch_size=128, squad_v2=False):
  test_for_model = tokenized_test.remove_columns(["example_id", "offset_mapping"])
  test_for_model.set_format("torch")

  test_dataloader = DataLoader(test_for_model, 
                               collate_fn=default_data_collator, 
                               batch_size=batch_size)

  model, test_dataloader = Accelerator().prepare(model, test_dataloader)

  model.eval()

  start_logits = []
  end_logits = []

  for batch in tqdm(test_dataloader):
      with torch.no_grad():
          outputs = model(**batch)

      start_logits.append((outputs.start_logits).cpu().numpy())
      end_logits.append((outputs.end_logits).cpu().numpy())


  start_logits = np.concatenate(start_logits)
  end_logits = np.concatenate(end_logits)

  return postprocess_qa_predictions(start_logits, end_logits, raw_test_dataset, tokenized_test, squad_v2=squad_v2)

## Result

In [19]:
max_length = 256 # The maximum length of a feature (question and context)
doc_stride = 64 # The authorized overlap between two part of the context when splitting it is needed.
batch_size = 32
path='/content/drive/MyDrive/Text_Mining/Weight/SQuAD/'
path_v2='/content/drive/MyDrive/Text_Mining/Weight/SQuAD2/'

### SQuAD

In [None]:
tokenized_train_dataset = squad["train"].map(
    partial(prepare_train_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    num_proc=3,
    remove_columns=squad["train"].column_names,
)

tokenized_val_dataset = squad["validation"].map(
    partial(prepare_test_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    num_proc=3,
    remove_columns=squad["validation"].column_names
)

      

#2:   0%|          | 0/24 [00:00<?, ?ba/s]

#1:   0%|          | 0/24 [00:00<?, ?ba/s]

#0:   0%|          | 0/24 [00:00<?, ?ba/s]

      

#0:   0%|          | 0/6 [00:00<?, ?ba/s]

#1:   0%|          | 0/6 [00:00<?, ?ba/s]

#2:   0%|          | 0/6 [00:00<?, ?ba/s]

In [None]:
train(model,
      tokenized_train_dataset, 
      tokenized_val_dataset, 
      squad["validation"],
      num_train_epochs=3,
      folder=path,
      learning_rate=2e-5,
      batch_size=batch_size,
      squad_v2=False)

  0%|          | 0/2425 [00:00<?, ?it/s]

  0%|          | 0/607 [00:00<?, ?it/s]

Post-processing 17520 example predictions split into 19418 features.


  0%|          | 0/17520 [00:00<?, ?it/s]

  metric = load_metric("squad_v2" if squad_v2 else "squad")


Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

Epoch 1:	 train-loss = 1.41	 val-f1 = 78.40	 exact_match = 64.79


  0%|          | 0/2425 [00:00<?, ?it/s]

  0%|          | 0/607 [00:00<?, ?it/s]

Post-processing 17520 example predictions split into 19418 features.


  0%|          | 0/17520 [00:00<?, ?it/s]

Epoch 2:	 train-loss = 0.95	 val-f1 = 80.01	 exact_match = 66.62


  0%|          | 0/2425 [00:00<?, ?it/s]

  0%|          | 0/607 [00:00<?, ?it/s]

Post-processing 17520 example predictions split into 19418 features.


  0%|          | 0/17520 [00:00<?, ?it/s]

Epoch 3:	 train-loss = 0.82	 val-f1 = 80.67	 exact_match = 67.21


In [None]:
#save model
#torch.save(model.state_dict(), ''.join([path, 'squad.pth']))
#load model
#model.load_state_dict(torch.load(''.join([path, 'squad.pth'])))

In [None]:
tokenized_test_dataset = squad['test'].map(
    partial(prepare_test_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    remove_columns=squad["test"].column_names,
)

  0%|          | 0/11 [00:00<?, ?ba/s]

In [None]:
prediction = generate(model, tokenized_test_dataset, squad['test'], batch_size=256)

  0%|          | 0/47 [00:00<?, ?it/s]

Post-processing 10570 example predictions split into 11912 features.


  0%|          | 0/10570 [00:00<?, ?it/s]

In [None]:
compute_metrics(prediction, squad['test'], squad_v2=False)

{'exact_match': 79.75402081362347, 'f1': 87.22165222785182}

### SQuAD2

In [None]:
tokenized_train_dataset_v2 = squad_v2["train"].map(
    partial(prepare_train_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    num_proc=3,
    remove_columns=squad_v2["train"].column_names,
)

tokenized_val_dataset_v2 = squad_v2["validation"].map(
    partial(prepare_test_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    num_proc=3,
    remove_columns=squad_v2["validation"].column_names
)

      

#2:   0%|          | 0/35 [00:00<?, ?ba/s]

#1:   0%|          | 0/35 [00:00<?, ?ba/s]

#0:   0%|          | 0/35 [00:00<?, ?ba/s]

      

#0:   0%|          | 0/9 [00:00<?, ?ba/s]

#1:   0%|          | 0/9 [00:00<?, ?ba/s]

#2:   0%|          | 0/9 [00:00<?, ?ba/s]

In [None]:
train(model,
      tokenized_train_dataset_v2, 
      tokenized_val_dataset_v2, 
      squad_v2["validation"],
      folder=path_v2,
      num_train_epochs=3,
      learning_rate=2e-5,
      batch_size=batch_size,
      squad_v2=True)

  0%|          | 0/3604 [00:00<?, ?it/s]

  0%|          | 0/900 [00:00<?, ?it/s]

Post-processing 26064 example predictions split into 28781 features.


  0%|          | 0/26064 [00:00<?, ?it/s]

  metric = load_metric("squad_v2" if squad_v2 else "squad")


Downloading builder script:   0%|          | 0.00/2.25k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.19k [00:00<?, ?B/s]

Epoch 1:	 train-loss = 1.41	 val-f1 = 69.22	 exact_match = 62.07


  0%|          | 0/3604 [00:00<?, ?it/s]

  0%|          | 0/900 [00:00<?, ?it/s]

Post-processing 26064 example predictions split into 28781 features.


  0%|          | 0/26064 [00:00<?, ?it/s]

Epoch 2:	 train-loss = 0.99	 val-f1 = 72.90	 exact_match = 65.44


  0%|          | 0/3604 [00:00<?, ?it/s]

  0%|          | 0/900 [00:00<?, ?it/s]

Post-processing 26064 example predictions split into 28781 features.


  0%|          | 0/26064 [00:00<?, ?it/s]

Epoch 3:	 train-loss = 0.83	 val-f1 = 73.75	 exact_match = 66.19


In [None]:
#save model
#torch.save(model.state_dict(), ''.join([path_v2, 'squad_v2.pth']))
#load model
#model.load_state_dict(torch.load(''.join([path_v2, 'squad_v2.pth'])))

In [22]:
tokenized_test_dataset_v2 = squad_v2['test'].map(
    partial(prepare_test_features, max_length=max_length, doc_stride=doc_stride),
    batched=True,
    remove_columns=squad_v2["test"].column_names,
)

  0%|          | 0/12 [00:00<?, ?ba/s]

In [23]:
prediction_v2 = generate(model, tokenized_test_dataset_v2, squad_v2['test'], batch_size=256, squad_v2=True)

  0%|          | 0/53 [00:00<?, ?it/s]

Post-processing 11873 example predictions split into 13502 features.


  0%|          | 0/11873 [00:00<?, ?it/s]

In [25]:
compute_metrics(prediction_v2, squad_v2['test'], squad_v2=True)

{'exact': 70.97616440663691,
 'f1': 73.96106146741363,
 'total': 11873,
 'HasAns_exact': 70.41160593792172,
 'HasAns_f1': 76.38995998694364,
 'HasAns_total': 5928,
 'NoAns_exact': 71.53910849453322,
 'NoAns_f1': 71.53910849453322,
 'NoAns_total': 5945,
 'best_exact': 70.98458687778994,
 'best_exact_thresh': 0.0,
 'best_f1': 73.96948393856671,
 'best_f1_thresh': 0.0}