# Training a question answering model

In this notebook, we will see how to fine-tune one of the 🤗 Transformers models on a question answering task, that is, the task of extracting the answer to a question from a given context. We will see how to load a dataset for this kind of task and use the Trainer API to fine-tune a model on it.

In [2]:
import transformers

print(transformers.__version__)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


4.12.3


In [3]:
squad_v2 = True
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

## Loading the dataset

In [4]:
from datasets import load_dataset, load_metric

In [5]:
datasets = load_dataset("squad_v2" if squad_v2 else "squad")

Downloading and preparing dataset squad_v2/squad_v2 (download: 44.34 MiB, generated: 122.41 MiB, post-processed: Unknown size, total: 166.75 MiB) to /home/niss/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d...


Dataset squad_v2 downloaded and prepared to /home/niss/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d. Subsequent calls will reuse this data.


In [6]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

In [7]:
datasets["train"][0]

{'id': '56be85543aeaaa14008c9063',
 'title': 'Beyoncé',
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'question': 'When did Beyonce start becoming popular?',
 'answers': {'text': ['in the late 1990s'], 'answer_start': [269]}}
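
As the preprocessing code further down relies on, `answer_start` is a character index into `context`, and `text` is the answer string found at that position. With SQuAD v2, an unanswerable question has empty lists in both fields. The following is a minimal sketch on a made-up record with the same shape as the example above; the helper name `answer_span` and the toy text are assumptions introduced for illustration only.

```python
# Toy record shaped like a SQuAD example (the text is made up for illustration).
context = "The program was founded in 1962 in Oklahoma."
example = {
    "context": context,
    "question": "When was the program founded?",
    # answer_start is the character offset of the answer inside the context.
    "answers": {"text": ["1962"], "answer_start": [context.index("1962")]},
}

# Same shape, but unanswerable (empty lists), as happens in SQuAD v2.
impossible = {
    "context": context,
    "question": "Who founded the program?",
    "answers": {"text": [], "answer_start": []},
}


def answer_span(record):
    """Return (start_char, end_char) of the first answer, or None if there is none."""
    answers = record["answers"]
    if len(answers["answer_start"]) == 0:
        return None
    start = answers["answer_start"][0]
    return start, start + len(answers["text"][0])


span = answer_span(example)
recovered = example["context"][span[0]:span[1]]  # slicing gives back the answer text
```

Slicing the context with the returned span recovers the answer text, which is the property the labeling step depends on.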

To get a sense of what the data looks like, the following function shows a few examples picked at random from the dataset, decoding any class labels into their names along the way.


In [8]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [9]:
show_random_elements(datasets["train"])

index,id,title,context,question,answers
0,5727d72bff5b5019007d96a5,Oklahoma,"The state has a rich history in ballet with five Native American ballerinas attaining worldwide fame. These were Yvonne Chouteau, sisters Marjorie and Maria Tallchief, Rosella Hightower and Moscelyne Larkin, known collectively as the Five Moons. The New York Times rates the Tulsa Ballet as one of the top ballet companies in the United States. The Oklahoma City Ballet and University of Oklahoma's dance program were formed by ballerina Yvonne Chouteau and husband Miguel Terekhov. The University program was founded in 1962 and was the first fully accredited program of its kind in the United States.",When did the University of Oklahoma's dance program begin?,"{'text': ['1962'], 'answer_start': [521]}"
1,56cf84c1234ae51400d9bdef,Kanye_West,"On September 11, 2008, West and his road manager/bodyguard Don ""Don C."" Crowley were arrested at Los Angeles International Airport and booked on charges of felony vandalism after an altercation with the paparazzi in which West and Crowley broke the photographers' cameras. West was later released from the Los Angeles Police Department's Pacific Division station in Culver City on $20,000 bail bond. On September 26, 2008, the Los Angeles County District Attorney's Office said it would not file felony counts against West over the incident. Instead the case file was forwarded to the city attorney's office, which charged West with one count of misdemeanor vandalism, one count of grand theft and one count of battery and his manager with three counts of each on March 18, 2009. West's and Crowley's arraignment was delayed from an original date of April 14, 2009.",What was Kanye arrested for in 2008?,"{'text': ['felony vandalism'], 'answer_start': [156]}"
2,56d13543e7d4791d00902004,IPod,"In 2010, a number of workers committed suicide at a Foxconn operations in China. Apple, HP, and others stated that they were investigating the situation. Foxconn guards have been videotaped beating employees. Another employee killed himself in 2009 when an Apple prototype went missing, and claimed in messages to friends, that he had been beaten and interrogated.",In what year did several Foxconn workers commit suicide?,"{'text': ['2010'], 'answer_start': [3]}"
3,5707070890286e26004fc809,Chihuahua_(state),"During the 14th century in the northeastern part of the state nomad tribes by the name of Jornado hunted bison along the Rio Grande; they left numerous rock paintings throughout the northeastern part of the state. When the Spanish explorers reached this area they found their descendants, Suma and Manso tribes. In the southern part of the state, in a region known as Aridoamerica, Chichimeca people survived by hunting, gathering, and farming between AD 300 and 1300. The Chichimeca are the ancestors of the Tepehuan people.",The Jornado painted onto what surface?,"{'text': ['rock'], 'answer_start': [152]}"
4,5ad0ceb6645df0001a2d043e,Gothic_architecture,"The pointed arch, one of the defining attributes of Gothic, was earlier incorporated into Islamic architecture following the Islamic conquests of Roman Syria and the Sassanid Empire in the Seventh Century. The pointed arch and its precursors had been employed in Late Roman and Sassanian architecture; within the Roman context, evidenced in early church building in Syria and occasional secular structures, like the Roman Karamagara Bridge; in Sassanid architecture, in the parabolic and pointed arches employed in palace and sacred construction.",What other type of architecture also made use of the curved arch?,"{'text': [], 'answer_start': []}"
5,5a8db847df8bba001a0f9ba2,The_Legend_of_Zelda:_Twilight_Princess,"A high-definition remaster of the game, The Legend of Zelda: Twilight Princess HD, is being developed by Tantalus Media for the Wii U. Officially announced during a Nintendo Direct presentation on November 12, 2015, it features enhanced graphics and Amiibo functionality. The game will be released in North America and Europe on March 4, 2016; in Australia on March 5, 2016; and in Japan on March 10, 2016.",For which console is Nintendo Direct being made?,"{'text': [], 'answer_start': []}"
6,57296b6f1d046914007793f9,"United_States_presidential_election,_2004","The 2004 election was the first to be affected by the campaign finance reforms mandated by the Bipartisan Campaign Reform Act of 2002 (also known as the McCain–Feingold Bill for its sponsors in the United States Senate). Because of the Act's restrictions on candidates' and parties' fundraising, a large number of so-called 527 groups emerged. Named for a section of the Internal Revenue Code, these groups were able to raise large amounts of money for various political causes as long as they do not coordinate their activities with political campaigns. Examples of 527s include Swift Boat Veterans for Truth, MoveOn.org, the Media Fund, and America Coming Together. Many such groups were active throughout the campaign season. (There was some similar activity, although on a much lesser scale, during the 2000 campaign.)",What finance act affected the 2004 election?,"{'text': ['the Bipartisan Campaign Reform Act of 2002'], 'answer_start': [91]}"
7,56e197efcd28a01900c67a0e,Catalan_language,"In verbs, 1st person present indicative desinence is -e (∅ in verbs of the 2nd and 3rd conjugation), or -o.\nE.g. parle, tem, sent (Valencian); parlo, temo, sento (Northwestern). In verbs, 1st person present indicative desinence is -o, -i or ∅ in all conjugations.\nE.g. parlo (Central), parl (Balearic), parli (Northern), ('I speak').",What language form is parli?,"{'text': ['Northern'], 'answer_start': [310]}"
8,56f6f4963d8e2e1400e372ea,Slavs,"Present-day Slavic people are classified into West Slavic (chiefly Poles, Czechs and Slovaks), East Slavic (chiefly Russians, Belarusians, and Ukrainians), and South Slavic (chiefly Serbs, Bulgarians, Croats, Bosniaks, Macedonians, Slovenes, and Montenegrins), though sometimes the West Slavs and East Slavs are combined into a single group known as North Slavs. For a more comprehensive list, see the ethnocultural subdivisions. Modern Slavic nations and ethnic groups are considerably diverse both genetically and culturally, and relations between them – even within the individual ethnic groups themselves – are varied, ranging from a sense of connection to mutual feelings of hostility.",West Slavic people consist of which nationalities?,"{'text': ['Poles, Czechs and Slovaks'], 'answer_start': [67]}"
9,5acebb2432bba1001ae4b211,Athanasius_of_Alexandria,"After the death of the replacement bishop Gregory in 345, Constans used his influence to allow Athanasius to return to Alexandria in October 345, amidst the enthusiastic demonstrations of the populace. This began a ""golden decade"" of peace and prosperity, during which time Athanasius assembled several documents relating to his exiles and returns from exile in the Apology Against the Arians. However, upon Constans's death in 350, another civil war broke out, which left pro-Arian Constantius as sole emperor. An Alexandria local council in 350 replaced (or reaffirmed) Athanasius in his see.",When did the uncivil war happen?,"{'text': [], 'answer_start': []}"


## Preprocessing the training data

In [12]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [13]:
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [16]:
pad_on_right = tokenizer.padding_side == "right"

In [19]:
max_length = 384  # Maximum length of a feature (question and context together), in tokens.
doc_stride = 128  # Allowed overlap, in tokens, between two consecutive parts of the context when it has to be split.
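
To make the role of these two values more concrete, here is a simplified, pure-Python sketch of splitting a long token sequence into overlapping windows. It is not the tokenizer's implementation: the function name `split_with_stride` is hypothetical, and it ignores the question and special tokens, which in practice also count toward `max_length`.

```python
def split_with_stride(tokens, window, stride):
    """Split `tokens` into chunks of at most `window` items.

    Consecutive chunks share `stride` items, so an answer cut off at the
    end of one chunk has a chance of appearing whole in the next one.
    """
    assert 0 <= stride < window, "the overlap must be smaller than the window"
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this chunk reached the end of the sequence
        start += window - stride  # step forward, keeping `stride` items of overlap
    return chunks


# Ten dummy "tokens", windows of 4 with an overlap of 2.
chunks = split_with_stride(list(range(10)), window=4, stride=2)
```

With these toy values the chunks are `[0, 1, 2, 3]`, `[2, 3, 4, 5]`, `[4, 5, 6, 7]` and `[6, 7, 8, 9]`: each one repeats the last two items of the previous chunk.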

In [20]:
def prepare_train_features(examples):
    # Some of the questions have a lot of whitespace on the left, which is not useful and makes the
    # truncation of the context fail (the tokenized question would take up a lot of space). So we
    # remove that leading whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. As a result,
    # one example can produce several features when its context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans; this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

In [21]:
features = prepare_train_features(datasets['train'][:5])
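
The last two `while` loops of `prepare_train_features` convert a character span into token indices using the offset mapping. The sketch below reproduces that logic on a toy input, assuming whitespace-separated words in place of real tokens; the helper `char_span_to_token_span` and the toy offsets are introduced here for illustration and are not part of the tokenizer API.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive (start_token, end_token) indices.

    `offsets` holds one (char_start, char_end) pair per token. The span is
    assumed to be fully covered by the tokens, as checked beforehand in
    prepare_train_features.
    """
    token_start = 0
    # Walk right past every token that starts at or before the answer start.
    while token_start < len(offsets) and offsets[token_start][0] <= start_char:
        token_start += 1
    token_end = len(offsets) - 1
    # Walk left past every token that ends at or after the answer end.
    while token_end >= 0 and offsets[token_end][1] >= end_char:
        token_end -= 1
    # Both loops overshoot by one token, hence the corrections.
    return token_start - 1, token_end + 1


# Toy "tokenization": whitespace-separated words with their character offsets.
context = "born in the late 1990s"
words = context.split()
offsets = []
position = 0
for word in words:
    start = context.index(word, position)
    offsets.append((start, start + len(word)))
    position = start + len(word)

answer = "the late"
start_char = context.index(answer)
end_char = start_char + len(answer)
token_span = char_span_to_token_span(offsets, start_char, end_char)
answer_words = words[token_span[0]:token_span[1] + 1]
```

Here the answer covers the third and fourth words, so the returned token span is `(2, 3)`. In the notebook, the same role is played by the `start_positions` and `end_positions` entries of `features`.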