## Introduction to Question Answering

Question answering is a common NLP task with several variants. In some variants, the task
is multiple-choice: a list of possible answers is supplied with each question, and the
model simply needs to return a probability distribution over the options. A more
challenging variant of
question answering, which is more applicable to real-life tasks, is when the options are
not provided. Instead, the model is given an input document -- called context -- and a
question about the document, and it must extract the span of text in the document that
contains the answer. In this case, the model is not computing a probability distribution
over answers, but two probability distributions over the tokens in the document text,
representing the start and end of the span containing the answer. This variant is called
"extractive question answering".

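To make the two distributions concrete, here is a minimal sketch in plain Python of how a start distribution and an end distribution over tokens can be turned into a predicted span. The names `softmax` and `best_span`, the `max_answer_len` cap and the toy logits are illustrative assumptions, not part of 🤗 Transformers; real post-processing also has to map token indices back to characters.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing P(start) * P(end) with start <= end."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best = (0, 0, 0.0)
    for s, p_start in enumerate(start_probs):
        # Only consider ends at or after the start, within the length cap.
        for e in range(s, min(s + max_answer_len, len(end_probs))):
            score = p_start * end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy example: token 1 is the most likely start, token 2 the most likely end.
start, end, score = best_span([0.1, 5.0, 0.2, 0.1], [0.1, 0.2, 4.0, 0.3])
print(start, end)  # 1 2
```
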
Extractive question answering is a very challenging NLP task, and the dataset size
required to train such a model from scratch on natural-language questions and answers
is prohibitively large. As a result, question answering (like almost all NLP tasks)
benefits enormously from starting from a strong pretrained language model: doing so can
reduce the dataset size required to reach a given accuracy by multiple orders of
magnitude, letting you reach very strong performance with surprisingly modest datasets.

Starting with a pretrained model adds difficulties, though - where do you get the model
from? How do you ensure that your input data is preprocessed and tokenized the same way
as the original model? How do you modify the model to add an output head that matches
your task of interest?

In this example, we'll show you how to load a model from the Hugging Face
[🤗Transformers](https://github.com/huggingface/transformers) library to tackle this
challenge. We'll also load a benchmark question answering dataset from the
[🤗Datasets](https://github.com/huggingface/datasets) library - this is another open-source
repository containing a wide range of datasets across many modalities, from NLP to vision
and beyond. Note, though, that there is no requirement that these libraries must be used
with each other. If you want to train a model from
[🤗Transformers](https://github.com/huggingface/transformers) on your own data, or you want
to load data from [🤗 Datasets](https://github.com/huggingface/datasets) and train your
own entirely unrelated models with it, that is of course possible (and highly
encouraged!).

## Installing the requirements

In [1]:
!pip install git+https://github.com/huggingface/transformers.git
!pip install datasets
!pip install huggingface-hub



## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download
the UIT-Vi question answering dataset using `load_dataset()`.

In [2]:
from datasets import load_dataset

datasets = load_dataset("trungds10/UIT-Vi-Squad")




The `datasets` object itself is a `DatasetDict`, which contains one key per split
(training, validation and test). Each split has a column for the context, the question
and the answers to that question. To access an actual element, you need to select a
split first, then give an index. The answers are indicated by their start position in
the context (a character offset) and their full text, which is a substring of the
context as we mentioned above. Let's take a look at what a single training example
looks like.

In [3]:
print(datasets["train"][0])

{'id': 'uit_000001', 'title': 'Phạm Văn Đồng', 'context': 'Phạm Văn Đồng (1 tháng 3 năm 1906 – 29 tháng 4 năm 2000) là Thủ tướng đầu tiên của nước Cộng hòa Xã hội chủ nghĩa Việt Nam từ năm 1976 (từ năm 1981 gọi là Chủ tịch Hội đồng Bộ trưởng) cho đến khi nghỉ hưu năm 1987. Trước đó ông từng giữ chức vụ Thủ tướng Chính phủ Việt Nam Dân chủ Cộng hòa từ năm 1955 đến năm 1976. Ông là vị Thủ tướng Việt Nam tại vị lâu nhất (1955–1987). Ông là học trò, cộng sự của Chủ tịch Hồ Chí Minh. Ông có tên gọi thân mật là Tô, đây từng là bí danh của ông. Ông còn có tên gọi là Lâm Bá Kiệt khi làm Phó chủ nhiệm cơ quan Biện sự xứ tại Quế Lâm (Chủ nhiệm là Hồ Học Lãm).', 'question': 'Tên gọi nào được Phạm Văn Đồng sử dụng khi làm Phó chủ nhiệm cơ quan Biện sự xứ tại Quế Lâm?', 'answers': {'text': ['Lâm Bá Kiệt'], 'answer_start': [507]}}


## Preprocessing the training data

Before we can feed those texts to our model, we need to preprocess them. This is done by
a 🤗 Transformers `Tokenizer` which will (as the name indicates) tokenize the inputs
(including converting the tokens to their corresponding IDs in the pretrained vocabulary)
and put them in the format the model expects, as well as generate the other inputs that
the model requires.

To do all of this, we instantiate our tokenizer with the `XLMRobertaTokenizerFast.from_pretrained`
method, which will ensure:

- We get a tokenizer that corresponds to the model architecture we want to use.
- We download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the
cell.

The `from_pretrained()` method expects the name of a model. If you're unsure which model to
pick, don't panic! The list of models to choose from can be bewildering, but in general
there is a simple tradeoff: Larger models are slower and consume more memory, but usually
yield slightly better final accuracies after fine-tuning. 

In [4]:
from transformers import XLMRobertaTokenizerFast

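# Note: the model is later loaded from "xlm-roberta-base". The two checkpoints are
# expected to share one vocabulary, but matching the tokenizer to the model's own
# checkpoint is the safer habit.
tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")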



Some contexts are longer than the maximum input length we allow. If we simply truncate
them at a fixed size (`max_length`), we may lose the part that contains the answer. We
want to avoid truncating the question, and instead only truncate the context so that the
task remains solvable. To do that, we'll set `truncation` to `"only_second"`, so that
only the second sequence (the context) in each pair is truncated. To keep the truncated
part as additional, overlapping features capped at the maximum length, we set
`return_overflowing_tokens` to `True` and pass `doc_stride` as `stride`. To work out
which feature of the original context contains the answer, we also ask the tokenizer for
the offset mapping with `return_offsets_mapping=True`.

In [5]:
max_length = 384  # The maximum length of a feature (question and context)
# The allowed overlap between two parts of the context when splitting is needed.
doc_stride = 128
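
The effect of `max_length` and `doc_stride` can be sketched without a tokenizer. The toy function below is a hypothetical helper, not a 🤗 Transformers API: it splits a list of token IDs into windows so that consecutive windows share `overlap` tokens. It ignores the question and the special tokens, which the real tokenizer also has to fit into each feature.

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items that share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the context
        start += step
    return chunks


# Ten toy "tokens", windows of 4 with an overlap of 2.
for chunk in split_with_overlap(list(range(10)), window=4, overlap=2):
    print(chunk)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

Because of the overlap, an answer that straddles the boundary of one window is usually fully contained in a neighbouring one.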

In the case of impossible answers (the answer is in another feature given by an example
with a long context), we set the CLS index for both the start and end position. We could
also simply discard those examples from the training set if the flag
`allow_impossible_answers` is `False`. Since the preprocessing is already complex enough
as it is, we've kept it simple for this part.

In [6]:

def prepare_train_features(examples):
    # Tokenize our examples with truncation and padding, but keep the overflows using a
    # stride. As a result, one example may give several features when its context is long,
    # each of those features having a context that slightly overlaps the context of the
    # previous feature.
    examples["question"] = [q.lstrip() for q in examples["question"]]
    examples["context"] = [c.lstrip() for c in examples["context"]]
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a
    # map from a feature to its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original
    # context. This will help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what
        # is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this
        # span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the
            # CLS index).
            if not (
                offsets[token_start_index][0] <= start_char
                and offsets[token_end_index][1] >= end_char
            ):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the
                # answer.
                # Note: we could go after the last offset if the answer is the last word (edge
                # case).
                while (
                    token_start_index < len(offsets)
                    and offsets[token_start_index][0] <= start_char
                ):
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
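
To make the offset arithmetic in `prepare_train_features` easier to follow, here is a toy version of the labelling step with hand-written offsets. The helper `char_span_to_token_span` is hypothetical, and marking non-context tokens with `None` is a simplification assumed for this sketch (the function above uses `sequence_ids` for that purpose).

```python
def char_span_to_token_span(offsets, start_char, end_char, cls_index=0):
    """Map a character span to (start_token, end_token), or to CLS if it does not fit.

    `offsets` holds one (start, end) character pair per context token and
    None for every token that is not part of the context.
    """
    context_tokens = [i for i, off in enumerate(offsets) if off is not None]
    first, last = context_tokens[0], context_tokens[-1]

    # The answer is not fully inside this feature: label it with the CLS index.
    if offsets[first][0] > start_char or offsets[last][1] < end_char:
        return cls_index, cls_index

    # Last token that starts at or before the answer start.
    start_token = max(i for i in context_tokens if offsets[i][0] <= start_char)
    # First token that ends at or after the answer end.
    end_token = min(i for i in context_tokens if offsets[i][1] >= end_char)
    return start_token, end_token


context = "Paris is big"
# [CLS] Paris is big [SEP]
offsets = [None, (0, 5), (6, 8), (9, 12), None]

answer = "is big"
start_char = context.index(answer)
print(char_span_to_token_span(offsets, start_char, start_char + len(answer)))  # (2, 3)
print(char_span_to_token_span(offsets, 20, 25))  # (0, 0): answer outside this feature
```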


To apply this function to every example in our dataset, we just use the `map()` method
of our `DatasetDict` object, which will apply the function to all the elements of every
split.

We'll use `batched=True` to encode the texts in batches together. This is to leverage the
full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to
treat the texts in a batch concurrently. We also use the `remove_columns` argument to
remove the columns that existed before tokenization was applied - this ensures that the
only features remaining are the ones we actually want to pass to our model.

In [7]:
tokenized_datasets = datasets.map(
    prepare_train_features,
    batched=True,
    remove_columns=datasets["train"].column_names,
    num_proc=3,
)

Even better, the results are automatically cached by the 🤗 Datasets library to avoid
spending time on this step the next time you run your notebook. The 🤗 Datasets library is
normally smart enough to detect when the function you pass to `map()` has changed (and
the cached data therefore must not be reused). For instance, it will detect an edit to
`prepare_train_features` and rerun the preprocessing. 🤗 Datasets warns you when it uses
cached files; you can pass `load_from_cache_file=False` in the call to `map()` to ignore
the cached files and force the preprocessing to be applied again.

Because all our data has been padded or truncated to the same length, and it is not too
large, we can now simply convert it to a dict of NumPy arrays. Note that the `Trainer`
used below consumes `tokenized_datasets` directly, so these arrays are only a
convenience, for example for inspecting the features.

Although we will not use it here, 🤗 Datasets has a `to_tf_dataset()` helper method
designed to assist you when the data cannot be easily converted to arrays, such as when
it has variable sequence lengths, or is too large to fit in memory. This method wraps a
`tf.data.Dataset` around the underlying 🤗 Dataset, streaming samples from the underlying
dataset and batching them on the fly, thus minimizing wasted memory and computation from
unnecessary padding. If your use case requires it, please see the
[docs](https://huggingface.co/docs/transformers/custom_datasets#finetune-with-tensorflow)
on `to_tf_dataset()` and data collators for an example. If not, feel free to follow this
example and simply convert to dicts!

In [8]:
train_set = tokenized_datasets["train"].with_format("numpy")[
    :
]  # Load the whole dataset as a dict of numpy arrays
validation_set = tokenized_datasets["validation"].with_format("numpy")[:]

## Fine-tuning the model

That was a lot of work! But now that our data is ready, everything is going to run very
smoothly. First, we download the pretrained model so that we can fine-tune it. Since our
task is question answering, we use the `AutoModelForQuestionAnswering` class. Like with
the tokenizer, the `from_pretrained()` method will download and cache the model for us:

In [9]:
from transformers import AutoConfig, AutoModelForQuestionAnswering

# see https://huggingface.co/docs/transformers/main_classes/configuration
config = AutoConfig.from_pretrained(
    "xlm-roberta-base",
    num_labels=2,
    hidden_size=768,
)
model = (AutoModelForQuestionAnswering
         .from_pretrained("xlm-roberta-base", config=config)
         )

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForQuestionAnswering: ['lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing XLMRobertaForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForQuestionAnswering were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream tas

In [10]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

The warning is telling us we are throwing away some weights and newly initializing some
others. Don't panic! This is absolutely normal. Recall that models like BERT and
XLM-RoBERTa are pretrained on a **language modeling** task, but we're loading the model
with `AutoModelForQuestionAnswering`, which means we want the model to perform a
**question answering** task. This change requires the final output layer or "head" (the
`lm_head.*` weights in the warning) to be removed and replaced with a new head suited
for the new task (the `qa_outputs.*` weights). The `from_pretrained` method will handle
all of this for us, and the warning is there simply to remind us that some model surgery
has been performed, and that the model will not generate useful predictions until the
newly-initialized layers have been fine-tuned on some data.
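
As a conceptual sketch of what the new head does, assume it is a single linear layer that maps each token's hidden vector to two numbers, a start logit and an end logit. That assumption is suggested by `num_labels=2` and by the `qa_outputs.weight` / `qa_outputs.bias` names in the warning, but the function below is a plain-Python illustration with made-up numbers, not the library's implementation.

```python
def qa_head(hidden_states, weight, bias):
    """Project each token vector to a (start_logit, end_logit) pair.

    hidden_states: one vector per token, each of length hidden_size
    weight:        2 rows of length hidden_size (row 0 = start, row 1 = end)
    bias:          2 numbers
    """
    start_logits, end_logits = [], []
    for h in hidden_states:
        start_logits.append(sum(w * x for w, x in zip(weight[0], h)) + bias[0])
        end_logits.append(sum(w * x for w, x in zip(weight[1], h)) + bias[1])
    return start_logits, end_logits


# Two tokens with a toy hidden size of 2.
hidden = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, 2.0], [3.0, 4.0]]
bias = [0.5, -0.5]
print(qa_head(hidden, weight, bias))  # ([1.5, 2.5], [2.5, 3.5])
```

With freshly initialized `weight` and `bias`, these logits carry no information about where the answer is, which is why fine-tuning is needed.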

Next, we configure the training run with `TrainingArguments`. The `Trainer` creates the
optimizer for us (AdamW by default), and here we combine it with 100 warmup steps
followed by linear learning rate decay. With `per_device_train_batch_size = 2` and
`gradient_accumulation_steps = 16`, each optimizer step sees 32 features per device.
Note that when fine-tuning a pretrained transformer model you will generally want to use
a low learning rate! We find the best results are obtained with values in the range
1e-5 to 1e-4, and training may completely diverge at the commonly used Adam default of
1e-3.

In [11]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir = "trungds",
    log_level = "error",
    num_train_epochs = 3,
    learning_rate = 7e-5,
    lr_scheduler_type = "linear",
    warmup_steps = 100,
    per_device_train_batch_size = 2,
    per_device_eval_batch_size = 1,
    gradient_accumulation_steps = 16,
    evaluation_strategy = "steps",
    eval_steps = 150,
    save_steps = 500,
    logging_steps = 50,
    push_to_hub = False
)
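
The combination of `warmup_steps = 100` and `lr_scheduler_type = "linear"` can be sketched as a small function. This is a plain-Python approximation of a linear warmup followed by linear decay, written for illustration rather than copied from the library; the total of 2850 optimizer steps is taken from the training output reported further down.

```python
def linear_schedule_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


base_lr = 7e-5       # learning_rate
warmup_steps = 100   # warmup_steps
total_steps = 2850   # global_step reported after training

for step in (0, 50, 100, 1475, 2850):
    lr = linear_schedule_with_warmup(step, base_lr, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")

# Features seen per optimizer step on one device:
# per_device_train_batch_size * gradient_accumulation_steps
print(2 * 16)  # 32
```

The learning rate peaks at `7e-5` once warmup ends and reaches zero at the last step.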

In [12]:
from transformers import Trainer

trainer = Trainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["validation"].select(range(100)),
    tokenizer = tokenizer,
)

And now we just call `trainer.train()`. As a convenience, all 🤗 Transformers models
come with a default loss which matches their output head, although you're of course free
to use your own. Because our features include `start_positions` and `end_positions`, the
model computes this built-in loss during the forward pass and the `Trainer` optimizes it
directly.

We have not passed a `compute_metrics` function, so only the training and validation
losses are reported. Note also that the validation loss is computed on just the first
100 validation features, which keeps the periodic evaluation fast.

In [13]:
trainer.train()



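| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 150  | 2.5725        | 2.597296        |
| 300  | 1.9174        | 1.995494        |
| 450  | 1.8511        | 2.023351        |
| 600  | 1.7636        | 1.898478        |
| 750  | 1.6584        | 1.825791        |
| 900  | 1.6208        | 1.847045        |
| 1050 | 1.2985        | 1.77604         |
| 1200 | 1.2834        | 1.757387        |
| 1350 | 1.2775        | 1.651042        |
| 1500 | 1.2609        | 1.605081        |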


TrainOutput(global_step=2850, training_loss=1.4242638651529949, metrics={'train_runtime': 7442.1333, 'train_samples_per_second': 12.255, 'train_steps_per_second': 0.383, 'total_flos': 1.7873482051012608e+16, 'train_loss': 1.4242638651529949, 'epoch': 3.0})

In [19]:
import os

os.makedirs("./xlm-vie-trung", exist_ok=True)
if hasattr(trainer.model, "module"):
    trainer.model.module.save_pretrained("./xlm-vie-trung")
else:
    trainer.model.save_pretrained("./xlm-vie-trung")

In [20]:
from transformers import AutoModelForQuestionAnswering

model = (AutoModelForQuestionAnswering
         .from_pretrained("./xlm-vie-trung")
         )

In [21]:
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    device=0)

idx = 0
print("***** context *****")
print(datasets["validation"]["context"][idx])
print("")
print("***** question *****")
print(datasets["validation"]["question"][idx])
print("")
print("***** true answer *****")
print(datasets["validation"]["answers"][idx]["text"][0])
print("")
print("***** predicted top3 answer *****")
qa_pipeline(
    question = datasets["validation"]["question"][idx],
    context = datasets["validation"]["context"][idx],
    align_to_words = False,
    top_k=3,
)

***** context *****
Paris nằm ở điểm gặp nhau của các hành trình thương mại đường bộ và đường sông, và là trung tâm của một vùng nông nghiệp giàu có. Vào thế kỷ 10, Paris đã là một trong những thành phố chính của Pháp cùng các cung điện hoàng gia, các tu viện và nhà thờ. Từ thế kỷ 12, Paris trở thành một trong những trung tâm của châu Âu về giáo dục và nghệ thuật. Thế kỷ 14, Paris là thành phố quan trọng bậc nhất của Cơ Đốc giáo và trong các thế kỷ 16, 17, đây là nơi diễn ra Cách mạng Pháp cùng nhiều sự kiện lịch sử quan trọng của Pháp và châu Âu. Đến thế kỷ 19 và 20, thành phố trở thành một trong những trung tâm văn hóa của thế giới, thủ đô của nghệ thuật và giải trí.

***** question *****
Paris đạt được thành quả gì sau khoảng 4 thế kỷ tính từ ngày Cách mạng Pháp diễn ra?

***** true answer *****
trở thành một trong những trung tâm văn hóa của thế giới, thủ đô của nghệ thuật và giải trí

***** predicted top3 answer *****


[{'score': 0.001704279100522399,
  'start': 555,
  'end': 621,
  'answer': 'thành phố trở thành một trong những trung tâm văn hóa của thế giới'},
 {'score': 0.0008764154626987875,
  'start': 623,
  'end': 656,
  'answer': 'thủ đô của nghệ thuật và giải trí'},
 {'score': 0.00015209632692858577,
  'start': 565,
  'end': 621,
  'answer': 'trở thành một trong những trung tâm văn hóa của thế giới'}]
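
One simple way to compare a predicted span with the true answer is a token-overlap F1 score, in the spirit of the metric commonly reported for SQuAD-style datasets. The sketch below is an illustrative assumption, not the official evaluation script: `normalize` and `token_f1` are hypothetical helpers that lowercase the text, strip ASCII punctuation and split on whitespace.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def token_f1(prediction, truth):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = normalize(prediction)
    true_tokens = normalize(truth)
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    # Multiset intersection: each shared token counts at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


truth = (
    "trở thành một trong những trung tâm văn hóa của thế giới, "
    "thủ đô của nghệ thuật và giải trí"
)
third_candidate = "trở thành một trong những trung tâm văn hóa của thế giới"
print(round(token_f1(third_candidate, truth), 2))  # 0.75
```

Here the third candidate covers the first 12 of the 20 tokens in the true answer, so precision is 1.0 and recall is 0.6.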