#Part 2: Answer a question
For this part we again load some required elements of our model. First we bring in the transformers library to use its tokenizer. Then we load our fine-tuned model from the Google Drive file where we saved it. PyTorch, string and re are also imported.

We set our model to evaluation mode.

In [None]:
!pip install transformers 

import torch
import string, re

from google.colab import drive
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

drive.mount('/content/drive')

model = torch.load("/content/drive/MyDrive/NLU/model220123/model.bin",map_location=torch.device('cpu'))
model.eval()


Then three useful functions are defined: one to normalize the text (lowercase it, remove punctuation, remove articles and fix the white space between words), one to compute the answer's F1 score, and a third that uses the other two and actually predicts an answer to the given question.

In [8]:
def normalize_text(s):
    """Lowercase s, then strip punctuation, articles and extra white space."""
    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)
    def white_space_fix(text):
        return " ".join(text.split())
    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)
    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


from collections import Counter

def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()

    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)

    # count shared tokens with multiplicity: a plain set intersection would
    # undercount tokens that are repeated in both strings
    num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())

    # if there are no common tokens then f1 = 0
    if num_common == 0:
        return 0

    prec = num_common / len(pred_tokens)
    rec = num_common / len(truth_tokens)

    return 2 * (prec * rec) / (prec + rec)

def give_an_answer(context, query, answer):
    # Predict: encode the question/context pair and run the model without tracking gradients
    inputs = tokenizer.encode_plus(query, context, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    answer_start = torch.argmax(outputs[0])    # most likely first token of the answer (argmax of the start scores)
    answer_end = torch.argmax(outputs[1]) + 1  # most likely last token; +1 because slicing excludes the end index
    prediction = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))

    # Exact match compares the normalized strings; F1 measures their token overlap
    em_score = int(normalize_text(prediction) == normalize_text(answer))
    f1_score = compute_f1(prediction, answer)

    print(f"Question: {query}")
    print(f"Prediction: {prediction}")
    print(f"True Answer: {answer}")
    print(f"EM: {em_score}")
    print(f"F1: {f1_score}")
    print("\n")
    return f1_score

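In give_an_answer the start and end positions are chosen independently, so the end can land before the start; the slice is then empty, which is one plausible cause of the blank predictions that show up for some questions. The sketch below, in plain Python on made-up scores, picks the best span under the constraint that the start does not come after the end. The names best_span and max_answer_len are hypothetical and not part of the notebook or of transformers.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end), end inclusive, maximizing
    start_scores[start] + end_scores[end] with start <= end < start + max_answer_len."""
    best = (0, 0)
    best_score = float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Made-up scores for a 5-token input (illustration only).
start_scores = [0.1, 0.2, 2.0, 0.3, 2.5]
end_scores = [0.0, 3.0, 0.4, 2.8, 0.2]

# Independent argmaxes, as in give_an_answer: start=4, end=1.
naive_start = max(range(len(start_scores)), key=start_scores.__getitem__)
naive_end = max(range(len(end_scores)), key=end_scores.__getitem__)
print(naive_start, naive_end)  # slicing [4:2] would give an empty answer

# Joint search over valid spans only.
start, end = best_span(start_scores, end_scores)
print(start, end)  # tokens[start:end + 1] is the answer span
```

With the real model, the two score lists would presumably come from `outputs[0][0].tolist()` and `outputs[1][0].tolist()`. A further refinement, not shown here, would be to restrict candidates to context positions so that the span cannot fall inside the question.
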
In [None]:
context = "Batman is a superhero who appears in American comic books published by DC Comics. The character was created by artist Bob Kane and writer Bill Finger, and debuted in the 27th issue of the comic book Detective Comics on March 30, 1939. In the DC Universe continuity, Batman is the alias of Bruce Wayne, a wealthy American playboy, philanthropist, and industrialist who resides in Gotham City."

queries = ["In which comics does Batman appear?", "Who created the character?", "When did Batman debut?"]

answers = ["DC Comics", "Bob Kane", "March 30, 1939"]

for q,a in zip(queries,answers):
  give_an_answer(context,q,a)


#Part 3: 200 Questions from the SQuAD training set
We import the SQuAD training set and randomly check 200 questions and their answers.

In [10]:
%%capture
# Import the train set and load it (%%capture has to be the first line of the cell)
!mkdir -p squad
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json -O squad/train-v1.1.json

from fastai.imports import *  # json, Path and random used below come from this star import

path = Path('squad/train-v1.1.json')

# Open .json file
with open(path, 'rb') as f:
    train_dict = json.load(f)

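The file loaded above follows the SQuAD v1.1 layout: a list of articles under `data`, each with `paragraphs`, each paragraph with a `context` and a list of `qas`, and each question with one or more `answers` holding the answer `text` and its character offset `answer_start`. The miniature below is hand-written for illustration (it is not real dataset content) and shows how one context/question/answer row is read out of that nesting.

```python
# Hand-written miniature of the SQuAD v1.1 structure (illustrative content only).
toy_squad = {
    "version": "1.1",
    "data": [
        {
            "title": "Example_article",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "toy-0001",
                            "question": "What is the capital of France?",
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ],
}

# Walk the nesting: article -> paragraph -> question -> answer.
rows = []
for article in toy_squad["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                rows.append((paragraph["context"], qa["question"], answer["text"]))

# answer_start is a character offset into the context.
ctx, question, text = rows[0]
start = toy_squad["data"][0]["paragraphs"][0]["qas"][0]["answers"][0]["answer_start"]
print(ctx[start:start + len(text)] == text)
```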

This set undergoes the same procedure as before so that its data can be extracted from the JSON file. To make sure that the context/question/answer triplets are not mixed up, we create a list of dictionaries called triplets, from which we choose 200 at random. Then we split them back into separate lists so that they can be passed to give_an_answer.

In [11]:
test_contexts  = []
test_questions = []
test_answers   = []

# Search for context, question and answer in each passage and append to respective lists
for group in train_dict['data']:
    for passage in group['paragraphs']:
        context = passage['context']
        for qa in passage['qas']:
            question = qa['question']
            for answer in qa['answers']:
                test_contexts.append(context)
                test_questions.append(question)
                test_answers.append(answer)

# Store information triplets in dictionaries and append to a list to randomize
triplets = []
for c,q,a in zip(test_contexts, test_questions, test_answers):
    instance = {}
    instance['context'] = c
    instance['question'] = q
    instance['answer'] = a
    triplets.append(instance)    
random_set = random.sample(triplets, k=200)  # sample without replacement so no triplet is drawn twice

# Separate back to lists to use in give_an_answer function
random_contexts  = []
random_questions = []
random_answers   = []

for i in random_set:
    random_contexts.append(i['context'])
    random_questions.append(i['question'])
    random_answers.append(i['answer'])

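Python's random module offers two ways to draw the 200 triplets, and they differ in one detail that matters for evaluation: random.choices draws with replacement, so the same triplet can appear more than once, while random.sample draws without replacement. A minimal sketch on a hypothetical stand-in list (the placeholder contents and the seed value are assumptions made for illustration):

```python
import random

# Hypothetical stand-in for the list of context/question/answer dictionaries.
triplets = [
    {"context": f"c{i}", "question": f"q{i}", "answer": f"a{i}"}
    for i in range(1000)
]

rng = random.Random(0)  # fixed, arbitrary seed so the draw can be repeated

with_replacement = rng.choices(triplets, k=200)    # an item may be drawn more than once
without_replacement = rng.sample(triplets, k=200)  # every drawn item is distinct


def n_unique(items):
    """Number of distinct questions among the drawn items."""
    return len({item["question"] for item in items})


print(n_unique(with_replacement), n_unique(without_replacement))
```

Seeding a dedicated `random.Random` instance also makes the selection reproducible, which helps when the mean F1 of two runs is compared.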

The give_an_answer function returns the F1 score it computes, which we use to calculate the mean F1 score for the 200 answers. Here we try it on the first example:

In [15]:
f1_score = give_an_answer(random_contexts[0],random_questions[0],random_answers[0]['text'])
print('F1 score:', f1_score)

Question: Which police force in London does not typically engage in police activity with the general public?
Prediction: which police force
True Answer: the Ministry of Defence Police
EM: 0
F1: 0.28571428571428575


F1 score: 0.28571428571428575

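The F1 of about 0.2857 reported for this first example can be reproduced by hand. After normalization the prediction has three tokens and the true answer has four (the article is dropped), and they share one token, so precision is 1/3, recall is 1/4 and F1 is 2/7. The sketch below re-implements the normalization steps in a compact, self-contained form. The shared tokens are counted with multiplicity, which for this example gives the same count as a set intersection.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation, drop articles, squeeze white space."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


pred_tokens = normalize("which police force").split()
truth_tokens = normalize("the Ministry of Defence Police").split()

# Multiset intersection: how many tokens the two strings share.
num_common = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())

precision = num_common / len(pred_tokens)   # 1 shared token out of 3 predicted
recall = num_common / len(truth_tokens)     # 1 shared token out of 4 in the truth
f1 = 2 * (precision * recall) / (precision + recall)

print(pred_tokens, truth_tokens)
print(precision, recall, f1)
```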

As in Part 2, we use the give_an_answer function to check the answers for the 200 random selections from the training set. 

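Some of the zero scores printed by this loop come from surface differences rather than wrong answers: `3. 5` against `3.5` (after punctuation removal these become the tokens `3 5` and `35`), and `sanha` against `Sanhá` (likely because the uncased tokenizer strips accents). The sketch below shows one possible way to loosen the comparison for diagnostic purposes. The helper names are hypothetical, the number rule can misfire on text such as a sentence ending in a year followed by a digit, and scores computed this way are no longer comparable to the standard SQuAD metric.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove combining marks, e.g. 'Sanhá' -> 'Sanha'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def rejoin_numbers(text):
    """Remove spaces around '.' or ',' that sit between two digits, e.g. '3. 5' -> '3.5'."""
    return re.sub(r"(?<=\d)\s*([.,])\s*(?=\d)", r"\1", text)


def loosen(text):
    """Apply both adjustments before the usual normalization."""
    return rejoin_numbers(strip_accents(text))


print(loosen("Sanhá").lower() == loosen("sanha").lower())
print(loosen("3. 5") == loosen("3.5"))
```

Applied to both the prediction and the true answer before normalize_text, this would let the two examples above count as matches.
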
In [13]:
totalF1 = 0

for c, q, a in zip(random_contexts, random_questions, random_answers):
    f = give_an_answer(c, q, a['text'])
    totalF1 += f

print(totalF1 / len(random_contexts))  # mean F1 over the sampled questions

Question: Which police force in London does not typically engage in police activity with the general public?
Prediction: which police force
True Answer: the Ministry of Defence Police
EM: 0
F1: 0.28571428571428575


Question: About how many cubic kilometers of the vast stock forest's wood were harvested in 1991?
Prediction: 3. 5
True Answer: 3.5
EM: 0
F1: 0


Question: In the 2009 election, who was the candidate of the PAIGC?
Prediction: sanha
True Answer: Sanhá
EM: 0
F1: 0


Question: What was the name of the a cappella musical that first opened 28 January 1994?
Prediction: avenue x
True Answer: Avenue X
EM: 1
F1: 1.0


Question: What political idealogy did Gaddafi not believe in?
Prediction: arab nationalist activism
True Answer: factionalism
EM: 0
F1: 0


Question: What element made Oasis unique among 1990s Britpop bands?
Prediction: 
True Answer: a hard rock sound
EM: 0
F1: 0


Question: How did the Methodist movement spread so far and wide?
Prediction: because of vigorous missiona