**Question answering** comes in many forms. In this example, we’ll look at the particular type of extractive QA that involves answering a question about a passage by highlighting the segment of the passage that answers the question. This involves fine-tuning a model which predicts a start position and an end position in the passage. We will use the Stanford Question Answering Dataset (SQuAD) 2.0.

We will start by downloading the data:

## **Note :**

Please write your code in the cells with the "**Your code here**" placeholder.

## **Download SQuAD 2.0 Data**

Note: This dataset can be explored in the Hugging Face model hub (SQuAD V2), and can alternatively be downloaded with the 🤗 NLP library with load_dataset("squad_v2").

In [1]:
!pip install transformers==4.0.1

Collecting transformers==4.0.1
  Downloading transformers-4.0.1-py3-none-any.whl (1.4 MB)
Collecting tokenizers==0.9.4
  Downloading tokenizers-0.9.4-cp37-cp37m-manylinux2010_x86_64.whl (2.9 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.9.3
    Uninstalling tokenizers-0.9.3:
      Successfully uninstalled tokenizers-0.9.3
  Attempting uninstall: transformers
    Found existing installation: transformers 3.5.1
    Uninstalling transformers-3.5.1:
      Successfully uninstalled transformers-3.5.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is

In [2]:
!mkdir squad
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O squad/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O squad/dev-v2.0.json

--2021-01-04 08:51:47--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.110.153, 185.199.109.153, 185.199.111.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.110.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘squad/train-v2.0.json’


2021-01-04 08:51:48 (120 MB/s) - ‘squad/train-v2.0.json’ saved [42123633/42123633]

--2021-01-04 08:51:49--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.108.153, 185.199.111.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘squad/dev-v2.0.json’


2021-01-04 08:51:49 (58.6 MB/s) - ‘squad/dev-v2.0.json’ saved [437

Each split is in a structured JSON file with a number of questions and answers for each passage (or context). We’ll take this apart into parallel lists of contexts, questions, and answers (note that the contexts here are repeated since there are multiple questions per context):

In [3]:
import json
from pathlib import Path
from tqdm import tqdm

def read_squad(path):
    """Read a SQuAD JSON file into parallel lists of contexts, questions, and answers."""
    # Your code here

    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    for group in tqdm(squad_dict['data']):
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                # one row per answer; questions with an empty answers list add no rows
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers

train_contexts, train_questions, train_answers = read_squad('squad/train-v2.0.json')
val_contexts, val_questions, val_answers = read_squad('squad/dev-v2.0.json')


100%|██████████| 442/442 [00:00<00:00, 6462.85it/s]
100%|██████████| 35/35 [00:00<00:00, 3300.30it/s]
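
To make the nesting concrete, here is a minimal sketch that flattens a tiny hand-written record laid out like the SQuAD 2.0 files (`data`, then `paragraphs`, then `qas`, then `answers`). The record contents and the helper name `flatten_squad` are invented for illustration; only the key names come from the cell above. It also shows that a question with an empty `answers` list contributes no rows to the parallel lists.

```python
# Hypothetical miniature record in the SQuAD 2.0 layout (not real dataset content).
toy_squad = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "The sky is blue. Grass is green.",
                    "qas": [
                        {
                            "question": "What color is the sky?",
                            "answers": [{"text": "blue", "answer_start": 11}],
                        },
                        {
                            # An unanswerable question has an empty answers list.
                            "question": "What color is the sea?",
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}


def flatten_squad(squad_dict):
    """Flatten the nested dict into parallel lists, one row per answer."""
    contexts, questions, answers = [], [], []
    for group in squad_dict["data"]:
        for passage in group["paragraphs"]:
            for qa in passage["qas"]:
                for answer in qa["answers"]:
                    contexts.append(passage["context"])
                    questions.append(qa["question"])
                    answers.append(answer)
    return contexts, questions, answers


contexts, questions, answers = flatten_squad(toy_squad)
print(len(contexts), questions, answers)
```

Only the answerable question survives the flattening, so the three lists each have a single entry here.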


The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the correct answer as well as an integer indicating the character at which the answer begins. In order to train a model on this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which token positions the answer begins and ends.

First, let’s get the character position at which the answer ends in the passage (we are given the starting position). Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.

In [4]:
def add_end_idx(answers, contexts):
    """Add 'answer_end' to each answer in place, shifting 'answer_start' when the label is off."""
    for answer, context in zip(answers, contexts):

        # Your code here
        gold_text = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(gold_text)

        # sometimes SQuAD answers are off by a character or two, so fix this
        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
        elif context[start_idx-1:end_idx-1] == gold_text:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
        elif context[start_idx-2:end_idx-2] == gold_text:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters

add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)
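
As a small illustration of the off-by-one correction, the sketch below applies the same idea to a made-up sentence. The helper `align_answer` is hypothetical: unlike the cell above, it returns a copy instead of modifying the answer in place and it guards against negative offsets, but it tries the same three alignments (recorded offset, one character left, two characters left).

```python
def align_answer(answer, context):
    """Return a copy of answer with answer_end set and answer_start shifted if needed."""
    gold_text = answer["text"]
    start = answer["answer_start"]
    end = start + len(gold_text)
    # Try the recorded offset first, then one and two characters to the left.
    for shift in (0, 1, 2):
        if start - shift >= 0 and context[start - shift:end - shift] == gold_text:
            return {**answer, "answer_start": start - shift, "answer_end": end - shift}
    return dict(answer)  # no alignment found: the copy has no answer_end key


context = "Paris is the capital of France."

# Correctly labelled answer: the end index is just start + length.
good = align_answer({"text": "Paris", "answer_start": 0}, context)

# Label that is one character too far to the right ("capital" really starts at 13).
shifted = align_answer({"text": "capital", "answer_start": 14}, context)

print(good)
print(shifted)
```

The end index is exclusive in both cases, so `context[answer_start:answer_end]` reproduces the answer text.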

Now train_answers and val_answers include the character end positions and the corrected start positions. Next, let’s tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode them together as sequence pairs.

In [5]:
# !pip install transformers
from transformers import DistilBertTokenizerFast
# from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Your code here
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)

# Your code here
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast Tokenizers, we can use the **built-in char_to_token()** method.

In [6]:
tokenizer.model_max_length

512

In [7]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    
    # Your code here
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
        # if None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length - 1
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length - 1

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
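
To show what the character-to-token conversion does conceptually, here is a toy sketch that uses a whitespace tokenizer instead of the real WordPiece tokenizer. The helpers `whitespace_offsets` and `char_to_token_toy` are invented for this illustration and are not part of 🤗 Transformers; the real tokenizer also adds special tokens and splits words into subwords, so its indices will differ.

```python
import re


def whitespace_offsets(text):
    """Return (start, end) character spans for each whitespace-delimited token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_to_token_toy(offsets, char_index):
    """Index of the token whose span covers char_index, or None if no token does."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


context = "Paris is the capital of France."
offsets = whitespace_offsets(context)

# Character span of "capital"; the end index is exclusive.
answer_start, answer_end = 13, 20

start_token = char_to_token_toy(offsets, answer_start)
# Look up the last character of the answer, not the position after it.
end_token = char_to_token_toy(offsets, answer_end - 1)

print(start_token, end_token)
```

Looking up `answer_end` itself would land on the space after the answer, which belongs to no token. This is why the cell above passes `answer_end - 1` and then handles the remaining `None` cases separately.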

Our data is ready. Let’s just put it in a PyTorch/TensorFlow dataset so that we can easily use it for training. In PyTorch, we define a custom Dataset class. In TensorFlow, we pass a tuple of (inputs_dict, labels_dict) to the from_tensor_slices method.

In [8]:
import tensorflow as tf

# Your code here
train_dataset = tf.data.Dataset.from_tensor_slices((
    {key: train_encodings[key] for key in ['input_ids', 'attention_mask']},
    {key: train_encodings[key] for key in ['start_positions', 'end_positions']}
))

# Your code here
val_dataset = tf.data.Dataset.from_tensor_slices((
    {key: val_encodings[key] for key in ['input_ids', 'attention_mask']},
    {key: val_encodings[key] for key in ['start_positions', 'end_positions']}
))

Now we can use a DistilBERT model with a QA head for training:

In [9]:
from transformers import TFDistilBertForQuestionAnswering

# Your code here
model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForQuestionAnswering: ['vocab_projector', 'vocab_transform', 'vocab_layer_norm', 'activation_13']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The data and model are both ready to go. You can train the model with Trainer/TFTrainer exactly as in the sequence classification example above. If using native PyTorch, replace labels with start_positions and end_positions in the training example. If using Keras’s fit, we need to make a minor modification to handle this example since it involves multiple model outputs.

In [10]:
# Keras will expect a tuple when dealing with labels

# Write your code here to replace labels with start_positions and end_positions in the training example
train_dataset_2 = train_dataset.map(lambda x, y: (x, (y['start_positions'], y['end_positions'])))

# Keras will assign a separate loss for each output and add them together. So we'll just use the standard CE loss
# instead of using the built-in model.compute_loss, which expects a dict of outputs and averages the two terms.
# Note that this means the loss will be 2x of when using TFTrainer since we're adding instead of averaging them.

# Your code here
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.distilbert.return_dict = False # if using 🤗 Transformers >3.02, make sure outputs are tuples

# Your code here
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

model.compile(optimizer=optimizer, loss=loss) # can also use any keras loss fn
# the dataset is already batched, so batch_size is not passed to fit
model.fit(train_dataset_2.shuffle(1000).batch(16), epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7fbbc178ead0>

### Save the model and tokenizer

In [11]:
model.save_pretrained("models")
tokenizer.save_pretrained("tokenizers")

('tokenizers/tokenizer_config.json',
 'tokenizers/special_tokens_map.json',
 'tokenizers/vocab.txt',
 'tokenizers/added_tokens.json')

## Model Validation

In [12]:
# replace labels with start_positions and end_positions in the validation example
val_dataset_2 = val_dataset.map(lambda x, y: (x, (y['start_positions'], y['end_positions'])))

In [13]:
# evaluate on the validation dataset (already batched; shuffling is not needed for evaluation)
model.evaluate(val_dataset_2.batch(16))



[2.8981881141662598, 1.4972575902938843, 1.4009298086166382]
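
The three numbers can be checked against the earlier note that Keras adds the per-output losses. The sketch below assumes the list is ordered as total loss followed by one loss per model output (start position, then end position); that ordering is an assumption about the Keras output rather than something printed by the cell.

```python
import math

# Values copied from the evaluation output above.
total_loss = 2.8981881141662598
start_loss = 1.4972575902938843  # assumed: loss on the start-position output
end_loss = 1.4009298086166382    # assumed: loss on the end-position output

# Keras sums the per-output losses to get the total.
summed = start_loss + end_loss

# Averaging instead of summing gives a figure on the scale the training cell
# describes for TFTrainer (half of the Keras total).
averaged = summed / 2

print(summed, averaged)
print(math.isclose(summed, total_loss, abs_tol=1e-5))
```

The sum matches the reported total up to float32 rounding, and the averaged value is roughly 1.449.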

### Evaluate using the SQuAD processor

In [14]:
from transformers.data.processors.squad import SquadV2Processor
processor = SquadV2Processor()
examples = processor.get_dev_examples("squad/", filename="dev-v2.0.json")

100%|██████████| 35/35 [00:08<00:00,  3.98it/s]


In [15]:
question_id_example_index_mapping = {example.qas_id: i for i, example in enumerate(examples)}
question_id_to_has_answer = {example.qas_id: bool(example.answers) for example in examples}
answer_question_ids = [question_id for question_id, has_answer in question_id_to_has_answer.items() if has_answer]
no_answer_question_ids = [question_id for question_id, has_answer in question_id_to_has_answer.items() if not has_answer]

# get prediction for a specific question
def get_pred(question_id):
    question = examples[question_id_example_index_mapping[question_id]].question_text
    context = examples[question_id_example_index_mapping[question_id]].context_text
    
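    # note: the training cells encode pairs as (context, question),
    # while this call passes (question, context)
    inputs = tokenizer.encode_plus(question, context, return_tensors='tf', truncation=True, padding=True)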
    
    start_scores, end_scores = model(inputs)
    
    answer_start = tf.argmax(start_scores, axis=1).numpy()[0]
    answer_end = (tf.argmax(end_scores, axis=1) + 1).numpy()[0]
    
    answer =  tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))
    
    return answer
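
The span selection inside get_pred can be sketched without a model. The example below uses made-up tokens and scores, and the helper name `decode_span` is hypothetical. It takes the highest-scoring start index and the highest-scoring end index independently, adds one to the end so the slice includes the end token, and returns the tokens in between.

```python
def decode_span(tokens, start_scores, end_scores):
    """Pick the argmax start and end independently and return the tokens between them."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    # +1 makes the end index exclusive so the end token is included in the slice.
    end = max(range(len(end_scores)), key=end_scores.__getitem__) + 1
    return tokens[start:end]


# Made-up tokens and scores for illustration only.
tokens = ["[CLS]", "who", "?", "[SEP]", "marie", "curie", "won", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.0, 2.5, 0.3, 0.1, 0.0]
end_scores = [0.1, 0.0, 0.0, 0.0, 0.2, 3.1, 0.1, 0.0]

answer = " ".join(decode_span(tokens, start_scores, end_scores))

# If the best end comes before the best start, the slice is empty.
inverted = decode_span(tokens, end_scores, start_scores)

print(repr(answer), inverted)
```

Because the two argmax operations are independent, nothing prevents the predicted end from preceding the predicted start; in that case the decoded answer is empty.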

In [16]:

from tqdm import tqdm

predictions = {}

# generate predictions for questions with answers
for question_id in tqdm(answer_question_ids):
    predictions[question_id] = get_pred(question_id)

# generate predictions for questions with no answers
for question_id in tqdm(no_answer_question_ids):
    predictions[question_id] = get_pred(question_id)

100%|██████████| 5928/5928 [05:33<00:00, 17.78it/s]
100%|██████████| 5945/5945 [05:40<00:00, 17.45it/s]


In [17]:
from transformers.data.metrics.squad_metrics import squad_evaluate

In [18]:
# generate the SQuAD evaluation report with the default no-answer threshold
squad_evaluate(examples, predictions)

OrderedDict([('exact', 21.8563126421292),
             ('f1', 26.530124192115917),
             ('total', 11873),
             ('HasAns_exact', 31.140350877192983),
             ('HasAns_f1', 40.50137728289351),
             ('HasAns_total', 5928),
             ('NoAns_exact', 12.598822539949538),
             ('NoAns_f1', 12.598822539949538),
             ('NoAns_total', 5945),
             ('best_exact', 65.61947275330581),
             ('best_exact_thresh', 0.0),
             ('best_f1', 70.29328430329252),
             ('best_f1_thresh', 0.0)])

In [19]:
# generate the SQuAD evaluation report with a specific no_answer_probability_threshold
squad_evaluate(examples, predictions, no_answer_probability_threshold=-1.15)

OrderedDict([('exact', 50.07159100480081),
             ('f1', 50.07159100480081),
             ('total', 11873),
             ('HasAns_exact', 0.0),
             ('HasAns_f1', 0.0),
             ('HasAns_total', 5928),
             ('NoAns_exact', 100.0),
             ('NoAns_f1', 100.0),
             ('NoAns_total', 5945),
             ('best_exact', 65.61947275330581),
             ('best_exact_thresh', 0.0),
             ('best_f1', 70.29328430329252),
             ('best_f1_thresh', 0.0)])
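
The second report, with every HasAns score at 0 and every NoAns score at 100, is what one would expect if each question's no-answer probability is treated as 0.0 when none is supplied, so that any negative threshold marks every question as unanswerable. The sketch below is a simplified reimplementation of that thresholding idea under this assumption. It is not the library's code, and the helper name and toy data are invented.

```python
def apply_no_answer_threshold(scores, na_probs, has_answer, threshold):
    """Overwrite a question's score when its no-answer probability exceeds threshold."""
    adjusted = {}
    for qid, score in scores.items():
        if na_probs[qid] > threshold:
            # Predicted "no answer": full credit only if the question truly has none.
            adjusted[qid] = float(not has_answer[qid])
        else:
            adjusted[qid] = score
    return adjusted


# Toy data: q1 is answerable and was answered correctly; q2 is unanswerable,
# but the model still returned some span, so its raw score is 0.
has_answer = {"q1": True, "q2": False}
raw_scores = {"q1": 1.0, "q2": 0.0}
na_probs = {"q1": 0.0, "q2": 0.0}  # assumed default when no probabilities are given

default = apply_no_answer_threshold(raw_scores, na_probs, has_answer, threshold=1.0)
negative = apply_no_answer_threshold(raw_scores, na_probs, has_answer, threshold=-1.15)

print(default)
print(negative)
```

With a threshold of 1.0 the raw scores pass through unchanged. With -1.15 every question is treated as unanswerable, which flips the toy scores in the same way the HasAns and NoAns rows flip in the report above.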