# The SQuAD Dataset
The SQuAD dataset comes in two flavors: SQuAD1.1 and SQuAD2.0. The latter contains the same questions and answers as the former, but also includes additional questions that cannot be answered from the accompanying passage. This is intended to create a more realistic question answering task. Identifying unanswerable questions is considerably harder for Transformer models, which is why we focus on SQuAD2.0 rather than SQuAD1.1.

SQuAD2.0 consists of over 150k questions, of which more than 35% are unanswerable in relation to their associated passage. We previously fine-tuned on the train set (130k examples); here we focus on the dev set, which contains nearly 12k examples. Only about half of these examples are answerable questions. In the following section, we'll look at a couple of these examples to get a feel for them.

In [None]:
# # collapse-hide

# # use this cell to install packages if needed
# !pip install torch  torchvision -f https://download.pytorch.org/whl/torch_stable.html
# !pip install transformers

In [1]:
# collapse-hide
import json
import collections
from pprint import pprint
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# This is the directory in which we'll store all evaluation output
model_dir = "models/distilbert/twmkn9_distilbert-base-uncased-squad2/"

In [2]:
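import urllib.request  # `import urllib` alone does not guarantee the request submodule is loaded

# download the SQuAD2.0 dev set into the current directory
urllib.request.urlretrieve("https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json", "dev-v2.0.json")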


('dev-v2.0.json', <http.client.HTTPMessage at 0x16654644b20>)

In [3]:
# collapse-hide

# Download the SQuAD2.0 dev set
#!wget -O data/squad/ https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

In [4]:
from transformers.data.processors.squad import SquadV2Processor

# this processor loads the SQuAD2.0 dev set examples
processor = SquadV2Processor()
examples = processor.get_dev_examples("./", filename="dev-v2.0.json")
print(len(examples))

100%|██████████| 35/35 [00:05<00:00,  6.40it/s]

11873





In [5]:
# generate some maps to help us identify examples of interest
qid_to_example_index = {example.qas_id: i for i, example in enumerate(examples)}
qid_to_has_answer = {example.qas_id: bool(example.answers) for example in examples}
answer_qids = [qas_id for qas_id, has_answer in qid_to_has_answer.items() if has_answer]
no_answer_qids = [qas_id for qas_id, has_answer in qid_to_has_answer.items() if not has_answer]

In [6]:
def display_example(qid):
    """Print the question, context, and true answers for the example with this qas_id."""

    idx = qid_to_example_index[qid]
    q = examples[idx].question_text
    c = examples[idx].context_text
    a = [answer['text'] for answer in examples[idx].answers]
    
    print(f'Example {idx} of {len(examples)}\n---------------------')
    print(f"Q: {q}\n")
    print("Context:")
    pprint(c)
    print(f"\nTrue Answers:\n{a}")

#### A positive example 

In [7]:
display_example(answer_qids[1300])

Example 2548 of 11873
---------------------
Q: Where on Earth is free oxygen found?

Context:
("Free oxygen also occurs in solution in the world's water bodies. The "
 'increased solubility of O\n'
 '2 at lower temperatures (see Physical properties) has important implications '
 'for ocean life, as polar oceans support a much higher density of life due to '
 'their higher oxygen content. Water polluted with plant nutrients such as '
 'nitrates or phosphates may stimulate growth of algae by a process called '
 'eutrophication and the decay of these organisms and other biomaterials may '
 'reduce amounts of O\n'
 '2 in eutrophic water bodies. Scientists assess this aspect of water quality '
 "by measuring the water's biochemical oxygen demand, or the amount of O\n"
 '2 needed to restore it to a normal concentration.')

True Answers:
['water', "in solution in the world's water bodies", "the world's water bodies"]


#### A negative example

In [8]:
display_example(no_answer_qids[1254])

Example 2564 of 11873
---------------------
Q: What happened 3.7-2 billion years ago?

Context:
("Free oxygen gas was almost nonexistent in Earth's atmosphere before "
 'photosynthetic archaea and bacteria evolved, probably about 3.5 billion '
 'years ago. Free oxygen first appeared in significant quantities during the '
 'Paleoproterozoic eon (between 3.0 and 2.3 billion years ago). For the first '
 'billion years, any free oxygen produced by these organisms combined with '
 'dissolved iron in the oceans to form banded iron formations. When such '
 'oxygen sinks became saturated, free oxygen began to outgas from the oceans '
 '3–2.7 billion years ago, reaching 10% of its present level around 1.7 '
 'billion years ago.')

True Answers:
[]


### Load a Transformer model fine-tuned on SQuAD 2.0

In [9]:
tokenizer = AutoTokenizer.from_pretrained("twmkn9/distilbert-base-uncased-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("twmkn9/distilbert-base-uncased-squad2")

In [10]:
def get_prediction(qid):
    # given a question id (qas_id or qid), load the example, get the model outputs and generate an answer
    question = examples[qid_to_example_index[qid]].question_text
    context = examples[qid_to_example_index[qid]].context_text

    inputs = tokenizer.encode_plus(question, context, return_tensors='pt')

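    with torch.no_grad():  # inference only, so no gradients are needed
        outputs = model(**inputs)
    answer_start = torch.argmax(outputs[0])  # index of the most likely start token (argmax of the start scores)
    answer_end = torch.argmax(outputs[1]) + 1  # index of the most likely end token, +1 so the slice below includes it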

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))

    return answer

In [11]:
# these functions are heavily influenced by the HF squad_metrics.py script
def normalize_text(s):
    """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
    import string, re

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def compute_exact_match(prediction, truth):
    return int(normalize_text(prediction) == normalize_text(truth))

def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    
    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)
    
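    # count shared tokens as a multiset so that repeated tokens are not undercounted
    common_tokens = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_common = sum(common_tokens.values())
    
    # if there are no common tokens then f1 = 0
    if num_common == 0:
        return 0
    
    prec = num_common / len(pred_tokens)
    rec = num_common / len(truth_tokens)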
    
    return 2 * (prec * rec) / (prec + rec)

def get_gold_answers(example):
    """helper function that retrieves all possible true answers from a squad2.0 example"""
    
    gold_answers = [answer["text"] for answer in example.answers if answer["text"]]

    # if gold_answers doesn't exist it's because this is a negative example - 
    # the only correct answer is an empty string
    if not gold_answers:
        gold_answers = [""]
        
    return gold_answers
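
To make the F1 arithmetic concrete, here is a minimal, self-contained sketch of the token-overlap calculation. It assumes the tokens have already been normalized (lowercased, with punctuation and articles removed) and counts shared tokens as a multiset. The name `token_f1` is a hypothetical helper used only for this illustration.

```python
from collections import Counter

def token_f1(pred_tokens, truth_tokens):
    """Token-level F1 between two lists of already-normalized tokens (illustrative helper)."""
    # An empty side means "no answer": score 1 only if both sides are empty
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)

    # Multiset intersection: each shared token counts as often as it appears on both sides
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)   # share of predicted tokens that are correct
    recall = overlap / len(truth_tokens)     # share of true tokens that were predicted
    return 2 * precision * recall / (precision + recall)

# "water bodies" vs. "the world's water bodies" after normalization
pred = ["water", "bodies"]
truth = ["worlds", "water", "bodies"]
print(round(token_f1(pred, truth), 4))
```

Here both predicted tokens appear in the truth (precision 2/2) while the truth has one extra token (recall 2/3), so F1 comes out at approximately 0.8.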

In [12]:
prediction = get_prediction(answer_qids[1300])
example = examples[qid_to_example_index[answer_qids[1300]]]

gold_answers = get_gold_answers(example)

em_score = max((compute_exact_match(prediction, answer)) for answer in gold_answers)
f1_score = max((compute_f1(prediction, answer)) for answer in gold_answers)

print(f"Question: {example.question_text}")
print(f"Prediction: {prediction}")
print(f"True Answers: {gold_answers}")
print(f"EM: {em_score} \t F1: {f1_score}")

Question: Where on Earth is free oxygen found?
Prediction: water bodies
True Answers: ['water', "in solution in the world's water bodies", "the world's water bodies"]
EM: 0 	 F1: 0.8


In [13]:
prediction = get_prediction(no_answer_qids[1254])
example = examples[qid_to_example_index[no_answer_qids[1254]]]

gold_answers = get_gold_answers(example)

em_score = max((compute_exact_match(prediction, answer)) for answer in gold_answers)
f1_score = max((compute_f1(prediction, answer)) for answer in gold_answers)

print(f"Question: {example.question_text}")
print(f"Prediction: {prediction}")
print(f"True Answers: {gold_answers}")
print(f"EM: {em_score} \t F1: {f1_score}")

Question: What happened 3.7-2 billion years ago?
Prediction: [CLS] what happened 3. 7 - 2 billion years ago? [SEP] free oxygen gas was almost nonexistent in earth's atmosphere before photosynthetic archaea and bacteria evolved, probably about 3. 5 billion years ago. free oxygen first appeared in significant quantities during the paleoproterozoic eon ( between 3. 0 and 2. 3 billion years ago ). for the first billion years, any free oxygen produced by these organisms combined with dissolved iron in the oceans to form banded iron formations. when such oxygen sinks became saturated, free oxygen began to outgas from the oceans
True Answers: ['']
EM: 0 	 F1: 0
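
The prediction above begins with `[CLS]`, which means the start argmax landed on index 0 while the end argmax landed deep inside the context. `get_prediction` picks the two indices independently and never considers the no-answer case. A common convention for SQuAD2.0-style models is to treat a span that points at `[CLS]` as "no answer" and to compare its score against the best real span. The following is a minimal sketch of that idea on plain Python lists. `pick_answer_span`, `max_answer_len`, and `null_threshold` are hypothetical names chosen for illustration, not part of the notebook or of the transformers API.

```python
def pick_answer_span(start_logits, end_logits, max_answer_len=30, null_threshold=0.0):
    """Return (start, end) token indices (end inclusive) of the best span, or None for no-answer."""
    # Score of the no-answer prediction: both pointers on the [CLS] token at index 0
    null_score = start_logits[0] + end_logits[0]

    best_score, best_span = float("-inf"), None
    # Search only valid spans: start after [CLS], end >= start, limited length
    for start in range(1, len(start_logits)):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best_score, best_span = score, (start, end)

    # Prefer no-answer when its score beats the best real span by more than the threshold
    if best_span is None or null_score - best_score > null_threshold:
        return None
    return best_span

# Toy scores: the start pointer strongly prefers [CLS], as in the negative example
print(pick_answer_span([5.0, 0.1, 0.2, 0.3], [1.0, 0.2, 0.1, 4.0]))
# Toy scores: a clear answer spanning tokens 1..2
print(pick_answer_span([0.1, 3.0, 0.2, 0.1], [0.1, 0.3, 2.5, 0.2]))
```

When the function returns `None`, the prediction would be the empty string, which is what `get_gold_answers` uses as the truth for negative examples. Note that the end index here is inclusive, so slicing the input ids would need `end + 1`.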


# Evaluating a model on the SQuAD2.0 dev set with HF

The same `run_squad.py` script we used to fine-tune a Transformer for question answering can also be used to evaluate the model. (You can grab the script [here](https://github.com/huggingface/transformers/blob/master/examples/question-answering/run_squad.py) or run the hidden cell below.)


In [14]:
# # collapse-hide

# # Grab the run_squad.py script
# !curl -L -O https://raw.githubusercontent.com/huggingface/transformers/master/examples/question-answering/run_squad.py

In [15]:
Results = {
    # a) scores averaged over all examples in the dev set
    'exact': 66.25958056093658,         
    'f1': 69.66994428499025,            
    'total': 11873,  # number of examples in the dev set
    
    # b) scores averaged over only positive examples (have answers)
    'HasAns_exact': 68.91025641025641,  
    'HasAns_f1': 75.74076391627662,     
    'HasAns_total': 5928, # number of positive examples
    
    # c) scores averaged over only negative examples (no answers)
    'NoAns_exact': 63.61648444070648, 
    'NoAns_f1': 63.61648444070648, 
    'NoAns_total': 5945, # number of negative examples
    
    # d) given probabilities of no-answer for each example, what would the best scores and thresholds be?
    'best_exact': 66.25958056093658, 
    'best_exact_thresh': 0.0, 
    'best_f1': 69.66994428499046, 
    'best_f1_thresh': 0.0
}
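
As a sanity check on how these numbers relate, the overall `exact` and `f1` scores in group (a) should be the example-weighted averages of the `HasAns` and `NoAns` scores in groups (b) and (c). The sketch below recomputes them from the values reported above. `weighted_average` is a hypothetical helper written for this illustration.

```python
# Values copied from the results reported above
results = {
    "exact": 66.25958056093658,
    "f1": 69.66994428499025,
    "HasAns_exact": 68.91025641025641,
    "HasAns_f1": 75.74076391627662,
    "HasAns_total": 5928,
    "NoAns_exact": 63.61648444070648,
    "NoAns_f1": 63.61648444070648,
    "NoAns_total": 5945,
}

def weighted_average(results, metric):
    """Combine the HasAns and NoAns scores for `metric`, weighted by example counts."""
    has_n, no_n = results["HasAns_total"], results["NoAns_total"]
    weighted_sum = results[f"HasAns_{metric}"] * has_n + results[f"NoAns_{metric}"] * no_n
    return weighted_sum / (has_n + no_n)

for metric in ("exact", "f1"):
    recomputed = weighted_average(results, metric)
    print(f"{metric}: reported={results[metric]:.4f} recomputed={recomputed:.4f}")
```

Note that `NoAns_exact` and `NoAns_f1` are identical: for a negative example the only correct answer is the empty string, so both metrics are either 1 or 0.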


# Final Thoughts 
This walkthrough demonstrated how a Transformer handles question answering, using a pretrained DistilBERT model fine-tuned on SQuAD2.0.