In [3]:
from huggingface_hub import notebook_login

notebook_login()


# Training a question answering model

In this notebook, we will see how to fine-tune one of the 🤗 Transformers models on a question answering task, which is the task of extracting the answer to a question from a given context. We will see how to load a dataset for this kind of task and use the Trainer API to fine-tune a model on it.

In [4]:
import transformers

print(transformers.__version__)

4.12.3


In [5]:
squad_v2 = True
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

## Loading the dataset

In [6]:
from datasets import load_dataset, load_metric

In [7]:
datasets = load_dataset("squad_v2" if squad_v2 else "squad")

Reusing dataset squad_v2 (/home/niss/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d)



In [8]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

In [9]:
datasets["train"][0]

{'id': '56be85543aeaaa14008c9063',
 'title': 'Beyoncé',
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'question': 'When did Beyonce start becoming popular?',
 'answers': {'text': ['in the late 1990s'], 'answer_start': [269]}}
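
In the record above, `answer_start` is a character offset into `context`, and `text` holds the answer string itself. The following is a minimal sketch of how the two fields relate, using a shortened toy context rather than the real record; the helper `answer_span` is hypothetical and not part of 🤗 Datasets. In SQuAD v2, unanswerable questions have empty `text` and `answer_start` lists, which the sketch treats as "no span".

```python
def answer_span(example):
    """Return (start, end) character offsets of the first answer, or None if unanswerable."""
    answers = example["answers"]
    if len(answers["answer_start"]) == 0:
        return None
    start = answers["answer_start"][0]
    return start, start + len(answers["text"][0])


# Toy data in the SQuAD layout (shortened for illustration).
context = "She rose to fame in the late 1990s as lead singer of Destiny's Child."
answer_text = "in the late 1990s"

answerable = {
    "context": context,
    "answers": {"text": [answer_text], "answer_start": [context.index(answer_text)]},
}
unanswerable = {"context": context, "answers": {"text": [], "answer_start": []}}

start, end = answer_span(answerable)
print(context[start:end])          # the slice recovers the answer text
print(answer_span(unanswerable))   # no span for an unanswerable question
```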

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset (automatically decoding the labels in passing).


In [10]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw num_examples distinct row indices.
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [11]:
show_random_elements(datasets["train"])

Unnamed: 0,id,title,context,question,answers
0,57301a69947a6a140053d0fb,Iran,"Iran consists of the Iranian Plateau with the exception of the coasts of the Caspian Sea and Khuzestan Province. It is one of the world's most mountainous countries, its landscape dominated by rugged mountain ranges that separate various basins or plateaux from one another. The populous western part is the most mountainous, with ranges such as the Caucasus, Zagros and Alborz Mountains; the last contains Iran's highest point, Mount Damavand at 5,610 m (18,406 ft), which is also the highest mountain on the Eurasian landmass west of the Hindu Kush.",Mount Damavand is located in what range?,"{'text': ['Alborz Mountains'], 'answer_start': [371]}"
1,56d12d3c17492d1400aabb6b,Kanye_West,"Adams sent condolences to Donda West's family but declined to publicly discuss the procedure, citing confidentiality. West’s family, through celebrity attorney Ed McPherson, filed complaints with the Medical Board against Adams and Aboolian for violating patient confidentiality following her death. Adams had previously been under scrutiny by the medical board. He appeared on Larry King Live on November 20, 2007, but left before speaking. Two days later, he appeared again, with his attorney, stating he was there to ""defend himself"". He said that the recently released autopsy results ""spoke for themselves"". The final coroner's report January 10, 2008, concluded that Donda West died of ""coronary artery disease and multiple post-operative factors due to or as a consequence of liposuction and mammoplasty"".",On what day did the final coroner's report show that Donda died from heart disease and complications from surgery?,"{'text': ['January 10, 2008'], 'answer_start': [640]}"
2,572ebaa3cb0c0d14000f14d5,Vacuum,"Medieval thought experiments into the idea of a vacuum considered whether a vacuum was present, if only for an instant, between two flat plates when they were rapidly separated. There was much discussion of whether the air moved in quickly enough as the plates were separated, or, as Walter Burley postulated, whether a 'celestial agent' prevented the vacuum arising. The commonly held view that nature abhorred a vacuum was called horror vacui. Speculation that even God could not create a vacuum if he wanted to was shut down[clarification needed] by the 1277 Paris condemnations of Bishop Etienne Tempier, which required there to be no restrictions on the powers of God, which led to the conclusion that God could create a vacuum if he so wished. Jean Buridan reported in the 14th century that teams of ten horses could not pull open bellows when the port was sealed.",When did Buridan state that teams of ten horses could not open a bellow with a sealed port?,"{'text': ['14th century'], 'answer_start': [779]}"
3,572fcc4ba23a5019007fca03,Pacific_War,"Japan sponsored several puppet governments, one of which was headed by Wang Jingwei. However, its policies of brutality toward the Chinese population, of not yielding any real power to these regimes, and of supporting several rival governments failed to make any of them a viable alternative to the Nationalist government led by Chiang Kai-shek. Conflicts between Chinese communist and nationalist forces vying for territory control behind enemy lines culminated in a major armed clash in January 1941, effectively ending their co-operation.",Who was the leader of the Nationalist government?,"{'text': ['Chiang Kai-shek'], 'answer_start': [329]}"
4,572a5969b8ce0319002e2ad6,Ottoman_Empire,"The Ottomans absorbed some of the traditions, art and institutions of cultures in the regions they conquered, and added new dimensions to them. Numerous traditions and cultural traits of previous empires (in fields such as architecture, cuisine, music, leisure and government) were adopted by the Ottoman Turks, who elaborated them into new forms, which resulted in a new and distinctively Ottoman cultural identity. Despite newer added amalgamations, the Ottoman dynasty, like their predecessors in the Sultanate of Rum and the Seljuk Empire, were thoroughly Persianised in their culture, language, habits and customs, and therefore, the empire has been described as a Persianate empire. Intercultural marriages also played their part in creating the characteristic Ottoman elite culture. When compared to the Turkish folk culture, the influence of these new cultures in creating the culture of the Ottoman elite was clear.",What is one way that the Empire was described as it related to culture?,"{'text': ['Persianate empire'], 'answer_start': [670]}"
5,5726e7d55951b619008f820b,Madonna_(entertainer),"Life with My Sister Madonna, a book by Madonna's brother Christopher, debuted at number two on The New York Times bestseller list. The book caused some friction between Madonna and her brother, because of the unsolicited publication. Problems also arose between Madonna and Ritchie, with the media reporting that they were on the verge of separation. Ultimately, Madonna filed for divorce from Ritchie, citing irreconcilable differences, which was finalized in December 2008. She decided to adopt from Malawi. The country's High Court initially approved the adoption of Chifundo ""Mercy"" James; however, the application was rejected because Madonna was not a resident of the country. Madonna appealed, and on June 12, 2009, the Supreme Court of Malawi granted Madonna the right to adopt Mercy James. She also released Celebration, her third greatest-hits album and final release with Warner. It contained the new songs ""Celebration"" and ""Revolver"" along with 34 hits spanning her career. Celebration reached number one in the UK, tying her with Elvis Presley as the solo act with most number one albums in the British chart history. She appeared at the 2009 MTV Video Music Awards on September 13, 2009, to speak in tribute to deceased pop star Michael Jackson.",When did Madonna appear in MTV for the tribute to Michael Jackson?,"{'text': ['September 13, 2009'], 'answer_start': [1183]}"
6,57283db82ca10214002da15e,Federalism,"Some federal constitutions also provide that certain constitutional amendments cannot occur without the unanimous consent of all states or of a particular state. The US constitution provides that no state may be deprived of equal representation in the senate without its consent. In Australia, if a proposed amendment will specifically impact one or more states, then it must be endorsed in the referendum held in each of those states. Any amendment to the Canadian constitution that would modify the role of the monarchy would require unanimous consent of the provinces. The German Basic Law provides that no amendment is admissible at all that would abolish the federal system.",The US Constitution says what to amendments?,"{'text': ['provides that no state may be deprived of equal representation in the senate without its consent'], 'answer_start': [182]}"
7,572954a83f37b31900478262,Software_testing,"It has been proved that each class is strictly included into the next. For instance, testing when we assume that the behavior of the implementation under test can be denoted by a deterministic finite-state machine for some known finite sets of inputs and outputs and with some known number of states belongs to Class I (and all subsequent classes). However, if the number of states is not known, then it only belongs to all classes from Class II on. If the implementation under test must be a deterministic finite-state machine failing the specification for a single trace (and its continuations), and its number of states is unknown, then it only belongs to classes from Class III on. Testing temporal machines where transitions are triggered if inputs are produced within some real-bounded interval only belongs to classes from Class IV on, whereas testing many non-deterministic systems only belongs to Class V (but not all, and some even belong to Class I). The inclusion into Class I does not require the simplicity of the assumed computation model, as some testing cases involving implementations written in any programming language, and testing implementations defined as machines depending on continuous magnitudes, have been proved to be in Class I. Other elaborated cases, such as the testing framework by Matthew Hennessy under must semantics, and temporal machines with rational timeouts, belong to Class II.","There are three classes, what has been concluded and proven for all classes?","{'text': ['each class is strictly included into the next'], 'answer_start': [24]}"
8,5706b5912eaba6190074ac58,House_music,"In the late 1980s, Nu Groove Records prolonged, if not launched the careers of Rheji Burrell & Rhano Burrell, collectively known as Burrell (after a brief stay on Virgin America via Timmy Regisford and Frank Mendez), along with basically every relevant DJ and Producer in the NY underground scene. The Burrell's are responsible for the ""New York Underground"" sound and are the undisputed champions of this style of house. Their 30+ releases on this label alone seems to support that fact. In today's market Nu Groove Record releases like the Burrells' enjoy a cult-like following and mint vinyl can fetch $100 U.S. or more in the open market.",what label launched the careers of burrell?,"{'text': ['Nu Groove Records'], 'answer_start': [19]}"
9,57291d851d0469140077906e,Race_(human_categorization),"In his 2003 paper, ""Human Genetic Diversity: Lewontin's Fallacy"", A. W. F. Edwards argued that rather than using a locus-by-locus analysis of variation to derive taxonomy, it is possible to construct a human classification system based on characteristic genetic patterns, or clusters inferred from multilocus genetic data. Geographically based human studies since have shown that such genetic clusters can be derived from analyzing of a large number of loci which can assort individuals sampled into groups analogous to traditional continental racial groups. Joanna Mountain and Neil Risch cautioned that while genetic clusters may one day be shown to correspond to phenotypic variations between groups, such assumptions were premature as the relationship between genes and complex traits remains poorly understood. However, Risch denied such limitations render the analysis useless: ""Perhaps just using someone's actual birth year is not a very good way of measuring age. Does that mean we should throw it out? ... Any category you come up with is going to be imperfect, but that doesn't preclude you from using it or the fact that it has utility.""",Risch feels any category someone comes up with will be what?,"{'text': ['imperfect'], 'answer_start': [1061]}"


## Preprocessing the training data

In [12]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [13]:
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [14]:
pad_on_right = tokenizer.padding_side == "right"

In [15]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The allowed overlap between two parts of the context when it has to be split.
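
To make the roles of `max_length` and `doc_stride` concrete, here is a simplified sketch of how a long context could be cut into overlapping windows. It is not the tokenizer's actual implementation: the function `context_windows`, the question length of 12 tokens, and the count of 3 special tokens are assumptions chosen for illustration.

```python
def context_windows(num_context_tokens, capacity, stride):
    """Split range(num_context_tokens) into (start, end) windows of at most
    `capacity` tokens, where consecutive windows share `stride` tokens."""
    if not 0 <= stride < capacity:
        raise ValueError("stride must be non-negative and smaller than capacity")
    windows = []
    start = 0
    while True:
        end = min(start + capacity, num_context_tokens)
        windows.append((start, end))
        if end == num_context_tokens:
            break
        # Step forward so that the next window overlaps this one by `stride` tokens.
        start += capacity - stride
    return windows


max_length = 384
doc_stride = 128
question_tokens = 12   # assumed question length
special_tokens = 3     # assumed: [CLS] question [SEP] context [SEP]
capacity = max_length - question_tokens - special_tokens  # room left for context

windows = context_windows(800, capacity, doc_stride)
print(windows)
```

With these assumed numbers, an 800-token context needs three windows, and each window starts 128 tokens before the previous one ends.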

In [16]:
def prepare_train_features(examples):
    # Some questions have a lot of leading whitespace, which is useless and makes the truncation of the
    # context fail (the tokenized question would take up too much space), so we strip it.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. As a result,
    # one example may yield several features when its context is long, each feature having a context that
    # slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence ids of that feature (to tell which tokens belong to the context and which to the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans; this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
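
The labeling logic in `prepare_train_features` can be followed on a small hand-built feature. The sketch below mirrors the two `while` loops on a list of character offsets; the word-level tokens, the offsets, and the helper `label_feature` are made up for illustration and do not come from the real tokenizer.

```python
def label_feature(offsets, context_range, start_char, end_char, cls_index=0):
    """Map a character span to (start_token, end_token) inside one feature.

    `context_range` is the inclusive (first, last) token index of the context.
    Returns (cls_index, cls_index) when the answer is not fully inside the feature.
    """
    first, last = context_range
    if not (offsets[first][0] <= start_char and offsets[last][1] >= end_char):
        return cls_index, cls_index
    token_start = first
    while token_start <= last and offsets[token_start][0] <= start_char:
        token_start += 1
    token_end = last
    while token_end >= first and offsets[token_end][1] >= end_char:
        token_end -= 1
    return token_start - 1, token_end + 1


context = "She rose to fame in the late 1990s."
answer = "in the late 1990s"
start_char = context.index(answer)
end_char = start_char + len(answer)

# Hypothetical feature: [CLS] when ? [SEP] She rose to fame in the late 1990s . [SEP]
# Special tokens get (0, 0); context offsets are relative to `context`.
offsets = [
    (0, 0), (0, 4), (4, 5), (0, 0),
    (0, 3), (4, 8), (9, 11), (12, 16), (17, 19), (20, 23), (24, 28), (29, 34), (34, 35),
    (0, 0),
]

print(label_feature(offsets, (4, 12), start_char, end_char))  # answer inside the feature
print(label_feature(offsets, (4, 7), start_char, end_char))   # feature ends before the answer
```

In the first call the returned token indices cover the tokens "in the late 1990s"; in the second call the feature is cut off after "fame", so the label falls back to the CLS index.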

In [17]:
features = prepare_train_features(datasets['train'][:5])

In [18]:
tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)

Loading cached processed dataset at /home/niss/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d/cache-4c9013c1c29b5287.arrow
Loading cached processed dataset at /home/niss/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d/cache-c2d531cef43c9cce.arrow


In [19]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'end_positions', 'input_ids', 'start_positions'],
        num_rows: 131754
    })
    validation: Dataset({
        features: ['attention_mask', 'end_positions', 'input_ids', 'start_positions'],
        num_rows: 12134
    })
})
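
The tokenized splits have more rows than the raw ones (131754 features for 130319 training examples, 12134 for 11873 validation examples) because a long context yields several features. The sketch below illustrates the idea behind `overflow_to_sample_mapping` with made-up context lengths; `num_features` and the capacity of 369 context tokens are assumptions for illustration, not the tokenizer's implementation.

```python
import math


def num_features(num_context_tokens, capacity, stride):
    """Number of overlapping windows needed to cover a context (simplified)."""
    if num_context_tokens <= capacity:
        return 1
    return 1 + math.ceil((num_context_tokens - capacity) / (capacity - stride))


# Hypothetical context lengths (in tokens) for four examples.
context_lengths = [150, 800, 369, 400]

# One entry per feature, holding the index of the example it came from.
sample_mapping = [
    example_index
    for example_index, length in enumerate(context_lengths)
    for _ in range(num_features(length, capacity=369, stride=128))
]
print(sample_mapping)
print(len(context_lengths), "examples ->", len(sample_mapping), "features")
```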

## Fine-tuning the model

In [20]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

In [21]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)
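
As a rough check on these arguments, the number of optimizer updates can be estimated from the dataset size, the batch size, and the number of epochs. This is a sketch under the assumption of a single device; `total_optimization_steps` is a hypothetical helper, not a 🤗 Transformers function.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum_steps=1):
    """Estimate the number of optimizer updates for a full training run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


# 5000 training features, batch size 16, 3 epochs.
print(total_optimization_steps(5000, batch_size=16, num_epochs=3))
# 4 "examples", as reported in the training log of this notebook.
print(total_optimization_steps(4, batch_size=16, num_epochs=3))
```

The second call reproduces the `Total optimization steps = 3` line reported when `trainer.train()` is run in this notebook.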

In [22]:
from transformers import default_data_collator

data_collator = default_data_collator

In [23]:
trainer = Trainer(
    model,
    args,
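    # select() returns a Dataset; slicing with [:5000] returns a plain dict of columns,
    # which is what produced the "Num examples = 4" / "KeyError: 0" output recorded below.
    train_dataset=tokenized_datasets["train"].select(range(5000)),
    eval_dataset=tokenized_datasets["validation"].select(range(1500)),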
    data_collator=data_collator,
    tokenizer=tokenizer,
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


/home/niss/nlp/QA_engine_nlp/distilbert-base-uncased-finetuned-squad is already a clone of https://huggingface.co/niss/distilbert-base-uncased-finetuned-squad. Make sure you pull the latest changes with `repo.git_pull()`.



In [24]:
trainer.train()

***** Running training *****
  Num examples = 4
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3


KeyError: 0
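
The `Num examples = 4` and `KeyError: 0` in the output above are consistent with `Trainer` receiving a plain dict rather than a `Dataset`: indexing a `Dataset` with a slice such as `[:5000]` returns a dict mapping column names to lists. The sketch below uses a hand-written dict with made-up values as a stand-in for such a slice, so it runs without downloading anything.

```python
# Stand-in for the result of slicing a tokenized Dataset: a dict of columns.
sliced = {
    "attention_mask": [[1, 1, 0], [1, 1, 1]],
    "end_positions": [0, 2],
    "input_ids": [[101, 2054, 0], [101, 2040, 102]],
    "start_positions": [0, 1],
}

# len() counts the columns, not the rows, hence "Num examples = 4".
print(len(sliced))

# Fetching "row 0" looks up the key 0, which does not exist.
try:
    sliced[0]
except KeyError as err:
    print(f"KeyError: {err}")

# Keeping a Dataset instead (not executed here, requires the datasets library):
# train_subset = tokenized_datasets["train"].select(range(5000))
```

`Dataset.select` takes an iterable of row indices and returns a new `Dataset`, so the subset still supports integer row indexing and reports its length in rows.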