# Milestone 3

We start by deciding which model to use; all available models are listed at `huggingface.co/models`. I have chosen `distilbert` fine-tuned on SQuAD, since it does not require a lot of RAM to run and still performs quite well. Second, we decide how many examples we want to test the model on.

In [1]:
MODEL = "distilbert-base-uncased-distilled-squad"
TEST_SAMPLE_SIZE = 1000

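We load the SQuAD dataset and store it as a list of triples, each consisting of the question, its reference answers and the context. Since this is the v2.0 dev set, unanswerable questions have an empty list of answers. Then we take a random sample of that list.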

In [2]:
import random
import json

with open("data/dev-v2.0.json") as f:
    data = json.load(f)

def get_question_answers_context(data):
    # Flatten the nested SQuAD JSON into (question, answers, context) triples.
    qac = []
    for idata in data["data"]:
        for paragraph in idata["paragraphs"]:
            for question in paragraph["qas"]:
                # Unanswerable questions have an empty list of answers.
                answers = [answer["text"] for answer in question["answers"]]
                qac.append((question["question"], answers, paragraph["context"]))
    return qac

qac = random.sample(get_question_answers_context(data), TEST_SAMPLE_SIZE)
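
To make the nesting of the SQuAD JSON concrete, here is a minimal sketch on a hand-made toy dictionary. The text in it is invented for illustration; only the key names follow the SQuAD v2.0 layout that the loader above relies on. The helper name `flatten_squad` is hypothetical.

```python
# Toy data that mimics the SQuAD v2.0 layout; the content itself is made up.
toy = {
    "data": [
        {
            "title": "Toy article",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "question": "What is the capital of France?",
                            "is_impossible": False,
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        },
                        {
                            "question": "What is the capital of Mars?",
                            "is_impossible": True,
                            "answers": [],  # unanswerable: no reference answers
                        },
                    ],
                }
            ],
        }
    ]
}


def flatten_squad(data):
    """Hypothetical helper: return (question, answers, context) triples."""
    triples = []
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                texts = [a["text"] for a in qa["answers"]]
                triples.append((qa["question"], texts, paragraph["context"]))
    return triples


toy_triples = flatten_squad(toy)
for question, answers, context in toy_triples:
    print(question, answers)
```

The second triple has an empty answer list, which is the case the scoring function has to treat as "the model should return nothing".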

Next we build our Exact Matching scoring function. It takes the list of triples and the Question Answering model (which we will set up later) and checks, for each question, whether the model's answer exactly matches one of the reference answers.

In [3]:
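def get_em_scores(qac, qa_model):
    # One boolean per question: True if the prediction counts as an exact match.
    score = []
    for question, answers, context in qac:
        answer = qa_model(question=question, context=context)
        if not answer and not answers:
            # No reference answers and the model abstained (empty string).
            score.append(True)
        else:
            # Case-insensitive comparison against every reference answer.
            score.append(any(answer.lower() == ans.lower() for ans in answers))
    return score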
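
The comparison above is only case-insensitive. The official SQuAD evaluation script goes further and also removes punctuation, English articles and extra whitespace before comparing. Below is a minimal sketch in that spirit, not a copy of the official script; the names `normalize_answer` and `exact_match` are illustrative assumptions.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """True if the normalized prediction equals any normalized reference.

    With no references (unanswerable question), only an empty prediction matches.
    """
    pred = normalize_answer(prediction)
    if not references:
        return pred == ""
    return any(pred == normalize_answer(ref) for ref in references)


print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
print(exact_match("", []))
print(exact_match("Paris", []))
```

With this kind of normalization, much of the manual stripping of leading and trailing characters becomes unnecessary, at the cost of a slightly more lenient notion of "exact".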


Using the `pipeline` requires little set-up, and the model can be used as a black box. We filter out answers with a low score, since if the model is not certain then perhaps there is no answer. The extracted answer might have undesirable characters at the beginning or end, so we strip off some likely ones.

In [4]:
from transformers import pipeline

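# device=-1 runs the pipeline on the CPU.
qamodel = pipeline("question-answering", model=MODEL, tokenizer=MODEL, device=-1)

def get_answer_pipeline(question, context):
    answer = qamodel(question=question, context=context)
    # Treat low-confidence predictions as "no answer".
    if answer["score"] < 0.6:
        return ""
    else:
        # Strip punctuation that often clings to the start or end of the extracted span.
        return answer["answer"].rstrip(".").rstrip(",").lstrip("(").rstrip(")").rstrip(".").strip("'").strip(":")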
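
To see how the threshold and the stripping interact without loading a model, here is a small sketch that works on hand-written dictionaries in the shape the question-answering pipeline returns (`score`, `start`, `end`, `answer`). The values are invented, the function name `postprocess` is hypothetical, and the single `strip` call is a simplified variant that removes the whole character set from both ends, so it is slightly more aggressive than the chain above.

```python
def postprocess(result, threshold=0.6, strip_chars=".,()':"):
    """Return the cleaned answer, or "" if the score is below the threshold."""
    if result["score"] < threshold:
        return ""
    return result["answer"].strip(strip_chars)


# Invented example outputs, shaped like a question-answering pipeline result.
confident = {"score": 0.93, "start": 0, "end": 8, "answer": "(Paris)."}
uncertain = {"score": 0.21, "start": 0, "end": 5, "answer": "Paris"}

print(repr(postprocess(confident)))
print(repr(postprocess(uncertain)))
```

Making the threshold a parameter also makes it easy to try several values and see how the Exact Matching score changes.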



The final step here is to compute the Exact Matching score.

In [5]:
scores = get_em_scores(qac, get_answer_pipeline)
print(sum(scores)/len(scores))

0.53


If you choose not to use the `pipeline` and want to set up the Question Answering model yourself with `AutoTokenizer` and `AutoModelForQuestionAnswering`, you can look at the example below. This requires a bit more work but gives you more control over the parameters. We estimate the score here by taking the mean of the start and end probabilities.

In [6]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL)


def get_answer(question, context):
    inputs = tokenizer.encode_plus(question, 
                                   context, 
                                   add_special_tokens=True, 
                                   return_tensors="pt", 
                                   max_length=tokenizer.model_max_length, truncation=True)
    input_ids = inputs["input_ids"].tolist()[0]

    with torch.no_grad():
        answer_start_scores, answer_end_scores = model(**inputs, return_dict=False)
        answer_start_scores, answer_end_scores = answer_start_scores.cpu().numpy(), answer_end_scores.cpu().numpy()
        
    # Most likely start token: argmax of the start logits.
    answer_start = np.argmax(answer_start_scores)
    # Most likely end token: argmax of the end logits, plus one so the slice includes it.
    answer_end = np.argmax(answer_end_scores) + 1
    
    # Softmax over the start and end logits to turn them into probabilities.
    start_ = np.exp(answer_start_scores - np.log(np.sum(np.exp(answer_start_scores), axis=-1, keepdims=True)))
    end_ = np.exp(answer_end_scores - np.log(np.sum(np.exp(answer_end_scores), axis=-1, keepdims=True)))
    # Confidence: mean of the probabilities of the chosen start and end tokens.
    score = np.mean([start_[0][answer_start], end_[0][answer_end-1]])
    
    if score > 0.9:
        answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
        return answer
    else:
        return ""
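
The normalization in the function above computes `exp(logit - log(sum(exp(logits))))`, which is a softmax. As a minimal sketch of the same score on toy numbers, the version below subtracts the maximum logit first so that large logits cannot overflow. The logits are invented, and the names `softmax` and `span_confidence` are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def span_confidence(start_logits, end_logits):
    """Return (start, end_exclusive, score) for the argmax start and end tokens."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    start = int(np.argmax(start_probs))
    end = int(np.argmax(end_probs))
    # Mean of the two probabilities, as in the function above.
    score = float((start_probs[start] + end_probs[end]) / 2)
    return start, end + 1, score


# Invented logits for a four-token input.
start_logits = np.array([0.1, 4.0, 0.3, 0.2])
end_logits = np.array([0.0, 0.5, 5.0, 0.1])

start, end, score = span_confidence(start_logits, end_logits)
print(start, end, round(score, 3))
```

Because the start and end positions are chosen independently, the end can land before the start; the slice is then empty and the decoded answer is an empty string.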

And here is the Exact Matching score for the model above:

In [7]:
scores = get_em_scores(qac, get_answer)
print(sum(scores)/len(scores))


0.529
