<a href="https://colab.research.google.com/github/Spartan-119/A-B-Testing-Approach-for-Comparing-Performance-of-ML-Models/blob/main/a_b_testing_implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# installing all the necessary packages
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116 --upgrade
!pip install transformers --upgrade
!pip install tqdm
!pip install tensorboard

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu116
Collecting transformers
  Downloading transformers-4.32.1-py3-none-any.whl (7.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.5/7.5 MB[0m [31m20.2 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m21.4 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m51.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━

In [2]:
# importing all the necessary libraries
import json
import os
import timeit
import collections
import time
from pprint import pprint
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, squad_convert_examples_to_features
from transformers.data.processors.squad import SquadV2Processor,SquadResult
from transformers.data.metrics.squad_metrics import (
    compute_predictions_log_probs,
    compute_predictions_logits,
    squad_evaluate,
)

In [3]:
DO_LOWER_CASE = True
NBEST_SIZE = 20
DOC_STRIDE = 128
MAX_SEQ_LENGTH = 384
MAX_QUERY_LENGTH = 64
MAX_ANSWER_LENGTH = 30
DATA_DIR = 'data/squad'
PREDICT_FILE = 'dev-v2.0.json'

BERT_MODEL_TYPE = 'bert'
BERT_MODEL_HF_PATH = 'twmkn9/bert-base-uncased-squad2'
BERT_OUTPUT_DIR = 'models/bert/twmkn9_bert-case-uncased-squad2'

DISTILBERT_MODEL_TYPE = 'distilbert'
DISTILBERT_MODEL_HF_PATH = 'twmkn9/distilbert-base-uncased-squad2'
DISTILBERT_OUTPUT_DIR = 'models/distilbert/twmkn9_distilbert-base-uncased-squad2'

# Downloading and Exploring the dataset

In [4]:
# downloading the dataset
!wget -P data/squad/ https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2023-08-31 08:42:47--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘data/squad/dev-v2.0.json’


2023-08-31 08:42:47 (76.9 MB/s) - ‘data/squad/dev-v2.0.json’ saved [4370528/4370528]



#### <i>Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

#### SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.</i>


## Loading the DEV set using Hugging Face's data processors

I am going to make use of [Processors](https://huggingface.co/transformers/main_classes/processors.html) to facilitate basic processing tasks with some canonical NLP datasets. The processors can be used for loading datasets and converting their examples to features for direct use in the model. More specifically, we will be using the [SQuAD processors](https://huggingface.co/transformers/main_classes/processors.html#squad)

In [5]:
def to_list(tensor):
  return tensor.detach().cpu().tolist()

In [6]:
def load_and_cache_examples(model_name_or_path,
                            data_dir= DATA_DIR,
                            predict_file=PREDICT_FILE,
                            max_seq_length=MAX_SEQ_LENGTH,
                            doc_stride=DOC_STRIDE,
                            max_query_length=MAX_QUERY_LENGTH,
                            overwrite_cache=True):

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
    # Load data features from cache or dataset file
    input_dir = data_dir if data_dir else "."
    cached_features_file = os.path.join(
        input_dir,
        "cached_{}_{}_{}".format(
            "dev",
            list(filter(None, model_name_or_path.split("/"))).pop(),
            str(max_seq_length),
        ),
    )

    # Init features and dataset from cache if it exists
    if os.path.exists(cached_features_file) and not overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features_and_dataset = torch.load(cached_features_file)
        features, dataset, examples = (
            features_and_dataset["features"],
            features_and_dataset["dataset"],
            features_and_dataset["examples"],
        )
    else:

        processor = SquadV2Processor()

        examples = processor.get_dev_examples(data_dir, filename=predict_file)

        features, dataset = squad_convert_examples_to_features(
            examples=examples,
            tokenizer=tokenizer,
            max_seq_length=max_seq_length,
            doc_stride=doc_stride,
            max_query_length=max_query_length,
            is_training=False,
            return_dataset="pt",
            threads=1,
        )


    return dataset, examples, features

In [7]:
dataset, examples, features = load_and_cache_examples(BERT_MODEL_HF_PATH)

Downloading (…)okenizer_config.json:   0%|          | 0.00/39.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/465 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

100%|██████████| 35/35 [00:04<00:00,  7.76it/s]
convert squad examples to features: 100%|██████████| 11873/11873 [01:44<00:00, 113.23it/s]
add example index and unique id: 100%|██████████| 11873/11873 [00:00<00:00, 1023133.39it/s]


In [8]:
print(f'There are {len(examples)} examples in the dev dataset.')

There are 11873 examples in the dev dataset.


This list of examples contains objects of type transformers.data.processors.squad.SquadExample.
We use the functionbelow to extract the information we want from such objects.
More specifically: 'qid', 'question_text', 'context_text' and 'answer'.

I will first create some extra variables to help on manipulation of data.

In [9]:
# generating some maps to help identify examples of interest.
qid_to_example_index = {example.qas_id: i for i, example in enumerate(examples)}
qid_to_has_answer = {example.qas_id: bool(example.answers) for example in examples}
answer_qids = [qas_id for qas_id, has_answer in qid_to_has_answer.items() if has_answer]
no_answer_qids = [qas_id for qas_id, has_answer in qid_to_has_answer.items() if not has_answer]

And also, the function below to help on extracting information given a `qid` (question unique identifier)

In [10]:
def display_example(qid: str) -> None:
  idx = qid_to_example_index[qid]
  q = examples[idx].question_text
  c = examples[idx].context_text
  a = [answer['text'] for answer in examples[idx].answers]

  print(f'Examples {idx} of {len(examples)}\n------------------')
  print(f'Q: {q}\n')
  print('Context:')
  pprint(c)
  print(f'\nTrue Answers:\n{a}')

## Positive Example

50% of the examples in the test set are questionst hat have answers contained within their corresponding passage. In these cases, up to 5 possible correct answers are provided. Such answers must come directly from the passage, we will see later, however, that there are several ways to arrive at a "correct" answer.

In [11]:
display_example(answer_qids[2456])

Examples 4959 of 11873
------------------
Q: It is now possible to convert old relative ages into what type of ages using isotopic dating?

Context:
('At the beginning of the 20th century, important advancement in geological '
 'science was facilitated by the ability to obtain accurate absolute dates to '
 'geologic events using radioactive isotopes and other methods. This changed '
 'the understanding of geologic time. Previously, geologists could only use '
 'fossils and stratigraphic correlation to date sections of rock relative to '
 'one another. With isotopic dates it became possible to assign absolute ages '
 'to rock units, and these absolute dates could be applied to fossil sequences '
 'in which there was datable material, converting the old relative ages into '
 'new absolute ages.')

True Answers:
['absolute ages', 'rock units', 'new absolute']


## Negative Example

The remaining 50% of the questions in the test set do not have an answer. This is important as in a real life Q&A system, the model needs to learn when **NOT TO ANSWER.**

In [12]:
display_example(no_answer_qids[1235])

Examples 2520 of 11873
------------------
Q: What is difficult with a satellite-to-noise ratio?

Context:
('Oxygen presents two spectrophotometric absorption bands peaking at the '
 'wavelengths 687 and 760 nm. Some remote sensing scientists have proposed '
 'using the measurement of the radiance coming from vegetation canopies in '
 'those bands to characterize plant health status from a satellite platform. '
 'This approach exploits the fact that in those bands it is possible to '
 "discriminate the vegetation's reflectance from its fluorescence, which is "
 'much weaker. The measurement is technically difficult owing to the low '
 'signal-to-noise ratio and the physical structure of vegetation; but it has '
 'been proposed as a possible method of monitoring the carbon cycle from '
 'satellites on a global scale.')

True Answers:
[]


## Metrics for Q&A Systems

When measuring the performance of a machine learning system, we need to think about both **model** and **customer metrics.**

Q&A systems are usually measured by two dominant metrics: **F1** and **Exact Match (EM)**. They are computed on individual question and answer **pairs**. When multiple correct answers are available for a given question the maximum score over all possible correct answers is computed. Overall EM and F1 scores are computed for a model by averaging over the individual example scores.

### Exact Match:

For each Q&A pair, if the characters of the model's prediction are an exact matvh of the characters of any of the True Answer(s), **EM = 1**, else, **EM = 0**. This is a strict all-or-nothing metric, which may have little value for final customers of a Q&A system. It may be beneficial only when assessing against a negative example; if the model predicts any text at all, it automatically receives a 0 for that example.

### F1 Score:

Almost all classification problems rely on F1 score to measure the model performance. It is mostly appropriate when we care equally about precision and recall. On an QnA system, however, it is computed over the individual words in the prediction against those in the True answer. The number of shared words between the prediction and the trust is the basis of F1 score. While **precision** is the ratio between the number of shared words to the total number of words in the prediction; **Recall** is the ratio of the number of shared words to the toal number of words in the ground truth.

### Latency

Latency is an important metric for ML systems. in the QnA example, it is of the utmost importance when the system is used in a conversational application. <br> For Example, Alexa and Google home are devices that have very strict latency constraints as the users expect an answer within a few seconds or macroseconds after the question is asked. When updating models we should take this dimension according to the application of the system.

### Answer Rate

In the QnA systems, models that attempt to answer every question are often perceived inaccurate. The system should only provide an output when it is confident enough to do so. In other words, when the probabiliries of predictions are above a certain threshold. <br> In some applications, a model should be able to say *I don't know* or *The context has not enough information to answer the question.*

# Q&A Models

Question and Answering make use of Large Language Models (LLMs) as any other classification problem in NLP. The main difference relies on how the input and the output is provided to the model. Generally speaking models are trained to match the true answer to the question as they are provided together as an input to the model.

# Bert

BERT is a neural approach to pre-train language representations which obtains near SOTA results on a wide array of NLP tasks, including SQuAD Question Answering dataset <br><br>

Developed in 2019, BERT achieves 80.422% in the EM score and 83.118% in the F1 score.<br><br>

BERT-base has 110 million parameters and BERT-large has 340 million parameters.

## Model Parameters comparison

As LLMs were developed, the amount of parameters in these models have grown exponentially. Although this improves model performance, it comes at a cost: Latency. As it will be discussed, for use cases where inference is done on batches that may have less impact, however, on real time systems such as voice assistants or web search, latency plays a major role on deciding whether one model is better than the other.

# BERT Input

[CLS] context [SEP] question [SEP] [PAD] [PAD] [PAD]

**context** = "The Intergovernmental Panel on Climate Change (IPCC) is a scientific intergovernmental body under the auspices of the United Nations."

**question** = "What organization is the IPCC a part of?"

**after being merged by the tokenizer**:
```
"[CLS] The Intergovernmental Panel on Climate Change (IPCC) is a scientific intergovernmental body under the auspices of the United Nations. [SEP] What organization is the IPCC a part of? [SEP] [PAD] [PAD] [PAD]"
```


**token-id format**:

[101, 1109, 11300, 2758, 24472, 15595, 20339, 1113, 13540, 9091, 113, 14274, 12096, 114, 1110, 170, 3812,
 9455, 2758, 24472, 15595, 1404, 1223, 1103, 22105, 1104, 1103, 1244, 3854, 119, 102, 1327, 2369, 1110, 1103,
 14274, 12096, 170, 1226, 1104, 136, 102, 0, 0, 0]

# Loading Pre-Trained BERT from Hugging Face's repository.

In [13]:
tokenizer = AutoTokenizer.from_pretrained(BERT_MODEL_HF_PATH, use_fast = False)
model = AutoModelForQuestionAnswering.from_pretrained(BERT_MODEL_HF_PATH)

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Some weights of the model checkpoint at twmkn9/bert-base-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# Utility Functions

Given a question_id, model and tokenizer, we get an answer text. In here we get the maximum probability of beginning and end for the answer in the softmax output.

In [14]:
def get_prediction(qid: str, model: AutoModelForQuestionAnswering, tokenizer: AutoTokenizer):
  # given a question_id (qas_id or qid), load the example, get the model outputs and generate an answer
  question = examples[qid_to_example_index[qid]].question_text
  context = examples[qid_to_example_index[qid]].context_text

  inputs = tokenizer.encode_plus(question, context, return_tensors = 'pt')

  outputs = model(**inputs)

  answer_start = torch.argmax(outputs[0])       # get the most likely beginning of answer with the argmax of the score
  answer_end = torch.argmax(outputs[1]) + 1

  answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start: answer_end]))

  return answer

I am creatinga simple function that given an example list, it extracts the gold answers.

In [15]:
def get_gold_answers(example):
  """
  helper function that retrieves all possible true answers from a squad2.0 example.
  """

  gold_answers = [answer['text'] for answer in example.answers if answer['text']]

  # if gold_answers doesn't exist, it's because this is a negative example.
  # the only correct answer is an empty string then in this case.
  if not gold_answers:
    gold_answers = [""]

  return gold_answers

For metrics like EM, we need to make sure that texts are normalized so we can compare on a character level.

In [16]:
# these functions are heavily influenced by the HF squad_metrics.py script
def normalize_text(s: str):
  """
  removing articles and punctuation, and standardizing whitespace are all typical text processing steps.
  """
  import string, re

  def remove_articles(text):
    regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
    return re.sub(regex, " ", text)

  def white_space_fix(text):
    return " ".join(text.split())

  def remove_punc(text):
    exclude = set(string.punctuation)
    return "".join(ch for ch in text if ch not in exclude)

  def lower(text):
    return text.lower()

  return white_space_fix(remove_articles(remove_punc(lower(s))))

# Metrics Calculation

## Exact Match (EM)

In [17]:
def compute_exact_match(prediction, truth):
  return int(normalize_text(prediction) == normalize_text(truth))

## F1 Score

In [18]:
def compute_f1(prediction, truth):
  pred_tokens = normalize_text(prediction).split()
  truth_tokens = normalize_text(truth).split()

  # if either the prediction or the truth is no-answer then f1 = 1 if they agree, else 0
  if len(pred_tokens) == 0 or len(truth_tokens) == 0:
    return int(pred_tokens == truth_tokens)

  common_tokens = set(pred_tokens) & set(truth_tokens)

  # if there are no common tokens then f1 = 0
  if len(common_tokens) ==  0:
    return 0

  prec = len(common_tokens) / len(pred_tokens)
  rec = len(common_tokens) / len(truth_tokens)

  return 2 * (prec * rec) / (prec + rec)

Computing EM and F1 for an example with a gold answer

In [19]:
prediction = get_prediction(answer_qids[1303], model, tokenizer, )
example = examples[qid_to_example_index[answer_qids[1303]]]

gold_answers = get_gold_answers(example)

em_score = max((compute_exact_match(prediction, answer)) for answer in gold_answers)
f1_score = max((compute_f1(prediction, answer)) for answer in gold_answers)

print(f"Question: {example.question_text}")
print(f"Prediction: {prediction}")
print(f"True Answers: {gold_answers}")
print(f"EM: {em_score} \t F1: {f1_score}")

Question: What measurement do scientists used to determine the quality of water?
Prediction: biochemical oxygen demand
True Answers: ['biochemical oxygen demand', 'biochemical oxygen demand', "measuring the water's biochemical oxygen demand", 'biochemical oxygen demand', "measuring the water's biochemical oxygen demand"]
EM: 1 	 F1: 1.0


Now let's try and compute an example without answer

In [20]:
prediction = get_prediction(no_answer_qids[1254], model, tokenizer)
example = examples[qid_to_example_index[no_answer_qids[1254]]]

gold_answers = get_gold_answers(example)

em_score = max((compute_exact_match(prediction, answer)) for answer in gold_answers)
f1_score = max((compute_f1(prediction, answer)) for answer in gold_answers)

print(f"Question: {example.question_text}")
print(f"Prediction: {prediction}")
print(f"True Answers: {gold_answers}")
print(f"EM: {em_score} \t F1: {f1_score}")

Question: What happened 3.7-2 billion years ago?
Prediction: [CLS]
True Answers: ['']
EM: 0 	 F1: 0


Both metrics are zero, this model does not correctly assess that this question is unanswerable. It predicts the [CLS] token (it means it considers the entire context as an answer to the question)

## Putting it all together

In [21]:
def get_answers_metrics(
    model: AutoModelForQuestionAnswering,
    tokenizer: AutoTokenizer,
    answer_qids = answer_qids,
    examples = examples):

  answers_arr = []
  start_time = time.time()
  errors = []

  for qid in tqdm(answer_qids):
    try:
      prediction = get_prediction(qid, model, tokenizer)
      example = examples[qid_to_example_index[qid]]

      gold_answers = get_gold_answers(example)

      em_score = max((compute_exact_match(prediction, answer)) for answer in gold_answers)
      f1_score = max((compute_f1(prediction, answer)) for answer in gold_answers)

      result_dict = {}
      result_dict["qid"] = qid
      result_dict["question"] = example.question_text
      result_dict["prediction"] = prediction
      result_dict["true_answers"] = ';'.join(gold_answers)
      result_dict["f1"] = f1_score
      result_dict["em"] = em_score
      answers_arr.append(result_dict)
    except:
      errors.append(qid)

  end_time = time.time()

  return pd.DataFrame(answers_arr), end_time - start_time, errors

In [22]:
metrics_df, total_time, errors = get_answers_metrics(model, tokenizer, answer_qids[: 100])

100%|██████████| 100/100 [00:38<00:00,  2.61it/s]


In [23]:
metrics_df.head()

Unnamed: 0,qid,question,prediction,true_answers,f1,em
0,56ddde6b9a695914005b9628,In what country is Normandy located?,france,France;France;France;France,1.0,1
1,56ddde6b9a695914005b9629,When were the Normans in Normandy?,10th and 11th centuries,10th and 11th centuries;in the 10th and 11th c...,1.0,1
2,56ddde6b9a695914005b962a,From which countries did the Norse originate?,"denmark , iceland and norway","Denmark, Iceland and Norway;Denmark, Iceland a...",1.0,1
3,56ddde6b9a695914005b962b,Who was the Norse leader?,rollo,Rollo;Rollo;Rollo;Rollo,1.0,1
4,56ddde6b9a695914005b962c,What century did the Normans first gain their ...,10th,10th century;the first half of the 10th centur...,1.0,1


In [24]:
metrics_df['f1'].mean()

0.7866849965752404

In [25]:
metrics_df['em'].mean()

0.72

## Improving measurement functions through model thresholding

When we tokenize a question and context, and we pass it to the model, the output consists of two probabilities (logits). One is the start of the answer span, the other for the end of the answer span. <br><br>

Every tokent that is passed to the model is assigned a logit, and tokens corresponding to the question itself. <br><br>

Let's have a look at what this means, using a previous question ("What happened 3.7 - 2 billion years ago?"):

In [26]:
inputs = tokenizer.encode_plus(example.question_text, example.context_text, return_tensors = 'pt')
output = model(**inputs)

Looking below, we can observe how large is the first position of the array, this is the [CLS] token position. This has a strong probability that this question has no answer, but we answered it anyway.

In [27]:
start_logits = output.start_logits
end_logits = output.end_logits

In [28]:
start_logits

tensor([[  5.1171,  -8.3404,  -9.2660,  -8.0987,  -9.1736,  -9.5905,  -9.6239,
          -9.4974,  -9.7725,  -9.9778, -10.1417,  -9.6171,  -8.7191,  -3.9945,
          -5.9036,  -6.7555,  -8.8160,  -7.9155,  -8.7225,  -9.7324,  -9.5704,
         -10.2445,  -9.0550,  -7.5630,  -9.8712, -10.0757,  -7.6297,  -7.4481,
          -6.5011,  -9.7018, -10.0513,  -9.4285,  -8.6840, -10.2489,  -9.9774,
          -8.0076,  -7.7707,  -8.0321,  -6.8884,  -6.6425,  -6.5631,  -9.4518,
          -8.6434,  -9.5573,  -9.8626,  -9.7323,  -7.0476,  -4.5778,  -6.4825,
          -6.7371,  -6.9433,  -9.1568,  -6.9630,  -8.9800,  -6.9723,  -7.3322,
          -5.2532,  -9.6134,  -9.4807, -10.0298,  -9.8842,  -8.8732,  -7.9342,
          -8.2085,  -8.0398,  -7.8698,  -7.2027,  -9.6577,  -9.1055,  -9.7395,
          -7.7438,  -9.5543,  -9.0663,  -9.2180,  -9.8046,  -9.6983,  -8.9082,
          -6.9894,  -6.7306,  -7.6100,  -6.7132,  -8.6220,  -9.4562,  -8.3209,
          -5.3854,  -6.1157,  -6.8288,  -8.7172,  -9

Our model gets predictions by selecting the start and end tokens with the largest logits. It would be more sensible to choose any sensible start + end combinations as possible to answer the question.

These combinations can be scored independently and the one with the highest score would be considered the best answer.

A possible (candidate) answer is scored as the sum of its start and end logits

## Calculating possible combinations

We start by taking the n largest start and end logits. Any sensible combination can be considered an answer, however, some consistency checks must first be performed.

For instance:
    
    - End token must fall after the start token
    - Candidate answers wherein the start or end tokens are associated with question tokens

[CLS] is not removed from the answers as it can indicate null answer

In [29]:
# convert our start and end logit tensors to lists
start_logits = to_list(start_logits)[0]
end_logits = to_list(end_logits)[0]

In [30]:
# sort our start and end logits from the largest to the smallest, keeping track of the index
start_idx_and_logit = sorted(enumerate(start_logits), key = lambda x: x[1], reverse = True)
end_idx_and_logit = sorted(enumerate(end_logits), key = lambda x: x[1], reverse = True)

In [31]:
# select the top n (in this case, 5)
print(start_idx_and_logit[: 5])
print(end_idx_and_logit[: 5])

[(0, 5.117067337036133), (111, 1.3977444171905518), (104, 0.6027640700340271), (106, -1.1286717653274536), (113, -1.7321603298187256)]
[(0, 6.168288707733154), (119, 3.2872860431671143), (109, 0.9794887900352478), (135, 0.30854496359825134), (116, -0.20684294402599335)]


The null answer token (index 0) is in the top five of both the start and end logit lists.

In order to eventually predict a text answer (or an empty string), we need to keep track of the indexes which will be use to pull the corresponding token ids later on. We'll also need to identify which indexes correspond to the question tokens, so we can ensure we don't alow a non-sensical prediction.

In [32]:
start_indexes = [idx for idx, logit in start_idx_and_logit[: 5]]
end_indexes = [idx for idx, logit in end_idx_and_logit[: 5]]

In [33]:
# convert the token ids from a tensor to a list
tokens = to_list(inputs['input_ids'])[0]

In [34]:
# question tokens are defined as those between the CLS token (101, at position 0) and first SEP (102) token
question_indexes = [i + 1 for i, token in enumerate(tokens[1: tokens.index(102)])]
question_indexes

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [35]:
# keep track of all preliminary predictions
PrelimPrediction = collections.namedtuple(
    "PrelimPrediction", ["start_index", "end_index", "start_logit", "end_logit"]
)

We will generate a list of candidate predictions by looping through all combinations of the start and end token indexes, excluding non-sensical combinations.

In [36]:
prelim_preds = []

for start_index in start_indexes:
  for end_index in end_indexes:
    # throw out invalid predictions
    if start_index in question_indexes:
      continue
    if end_index in question_indexes:
      continue
    if end_index < start_index:
      continue

    prelim_preds.append(
        PrelimPrediction(
            start_index = start_index,
            end_index = end_index,
            start_logit = start_logits[start_index],
            end_logit = end_logits[end_index]
        )
    )

With a list of sensible candidate predictions, it's time to score them now.

For a candidate answer, score = start_logit + end_logit. Below, we sort our candidate predictions by their score.

In [37]:
# sort preliminary predictions by their score
prelim_preds = sorted(prelim_preds, key = lambda x: (x.start_logit + x.end_logit), reverse = True)
print(prelim_preds[: 5])

[PrelimPrediction(start_index=0, end_index=0, start_logit=5.117067337036133, end_logit=6.168288707733154), PrelimPrediction(start_index=0, end_index=119, start_logit=5.117067337036133, end_logit=3.2872860431671143), PrelimPrediction(start_index=0, end_index=109, start_logit=5.117067337036133, end_logit=0.9794887900352478), PrelimPrediction(start_index=0, end_index=135, start_logit=5.117067337036133, end_logit=0.30854496359825134), PrelimPrediction(start_index=0, end_index=116, start_logit=5.117067337036133, end_logit=-0.20684294402599335)]


We need to convert our preliminary predictions into actual text (or the empty string, if null). We will keep track of text predictions we've seen, because different token combinations can result in the same text prediction and we only want to keep the one with the highest score (we're looping in descending score order). Finally, we'll trim this list down to the best 5 predictions.

In [38]:
# keep track of all the best predictions
BestPrediction = collections.namedtuple( # pylint: disable = invalid-name
                                       "BestPrediction", ["text", "start_logit", "end_logit"]
                                        )

In [39]:
nbest = []
seen_predictions = []

for pred in prelim_preds:
  # for now we only care about the top 5 best predictions
  if len(nbest) >= 5:
    break

  # loop through the predictions according to their start index
  if pred.start_index > 0: # non-null answers have start_index > 0
    text = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(
            tokens[pred.start_index: pred.end_index + 1]
        )
    )

    # clean the whitespace
    text = text.strip()
    text = " ".join(text.split())

    if text in seen_predictions:
      continue

    # flag this text as being seen - if we see it again, don't add it to the nbest list
    seen_predictions.append(text)

    # add this text prediction to a pruned list of the top 5 best predictions
    nbest.append(BestPrediction(text = text, start_logit = pred.start_logit, end_logit = pred.end_logit))

In [40]:
nbest

[BestPrediction(text='free oxygen began to outgas from the oceans', start_logit=1.3977444171905518, end_logit=3.2872860431671143),
 BestPrediction(text='when such oxygen sinks became saturated , free oxygen began to outgas from the oceans', start_logit=0.6027640700340271, end_logit=3.2872860431671143),
 BestPrediction(text='oxygen sinks became saturated , free oxygen began to outgas from the oceans', start_logit=-1.1286717653274536, end_logit=3.2872860431671143),
 BestPrediction(text='free oxygen began to outgas from the oceans 3 – 2 . 7 billion years ago , reaching 10 % of its present level', start_logit=1.3977444171905518, end_logit=0.30854496359825134),
 BestPrediction(text='when such oxygen sinks became saturated', start_logit=0.6027640700340271, end_logit=0.9794887900352478)]

At this point, we have a neat list of the top 5 best predictions for this question, let's now also add the null answer.

In [41]:
# and don't forget -- include the null answer!
nbest.append(BestPrediction(text = "", start_logit = start_logits[0], end_logit = end_logits[0]))
nbest

[BestPrediction(text='free oxygen began to outgas from the oceans', start_logit=1.3977444171905518, end_logit=3.2872860431671143),
 BestPrediction(text='when such oxygen sinks became saturated , free oxygen began to outgas from the oceans', start_logit=0.6027640700340271, end_logit=3.2872860431671143),
 BestPrediction(text='oxygen sinks became saturated , free oxygen began to outgas from the oceans', start_logit=-1.1286717653274536, end_logit=3.2872860431671143),
 BestPrediction(text='free oxygen began to outgas from the oceans 3 – 2 . 7 billion years ago , reaching 10 % of its present level', start_logit=1.3977444171905518, end_logit=0.30854496359825134),
 BestPrediction(text='when such oxygen sinks became saturated', start_logit=0.6027640700340271, end_logit=0.9794887900352478),
 BestPrediction(text='', start_logit=5.117067337036133, end_logit=6.168288707733154)]

The null answer is scored as the sum of the start_logit and end_logit associated with the [CLS] token.

The last step is to compute the null score - more specifically, the difference between the null score and the best non-null score as shown below.

In [42]:
# compute the null score as the sum of the [CLS] token logits
score_null = start_logits[0] + end_logits[0]
score_null

11.285356044769287

In [43]:
nbest[0].start_logit + nbest[0].end_logit

4.685030460357666

In [44]:
# compute the difference between the null score and tehbest non-null score
score_diff = score_null - nbest[0].start_logit - nbest[0].end_logit
score_diff

6.600325584411621

# SQuAD Evaluation

In [45]:
def evaluate(model_name_or_path,
             dataset,
             output_dir,
             per_gpu_eval_batch_size = 12,
             n_gpu = 1,
             model_type = BERT_MODEL_TYPE,
             do_lower_case = DO_LOWER_CASE,
             nbest_size = NBEST_SIZE,
             max_answer_length = MAX_ANSWER_LENGTH,
             null_score_diff_threshold = 0.0):

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast = False)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name_or_path)

    model.to(device)

    eval_batch_size = per_gpu_eval_batch_size * max(1, n_gpu)

    # Note that DistributedSampler samples randomly
    eval_sampler = SequentialSampler(dataset)
    eval_dataloader = DataLoader(dataset, sampler = eval_sampler, batch_size = eval_batch_size)

    # multi-gpu evaluate
    if n_gpu > 1 and not isinstance(model, torch.nn.DataParallel):
        model = torch.nn.DataParallel(model)

    all_results = []
    start_time = timeit.default_timer()

    for batch in tqdm(eval_dataloader, desc = "Evaluating"):
        model.eval()
        batch = tuple(t.to(device) for t in batch)

        with torch.no_grad():
            inputs = {
                "input_ids": batch[0],
                "attention_mask": batch[1],
                "token_type_ids": batch[2],
            }

            if model_type in ["xlm", "roberta", "distilbert", "camembert", "bart", "longformer"]:
                del inputs["token_type_ids"]

            feature_indices = batch[3]

            outputs = model(**inputs)

        for i, feature_index in enumerate(feature_indices):
            eval_feature = features[feature_index.item()]
            unique_id = int(eval_feature.unique_id)

            output = [to_list(output[i]) for output in outputs.to_tuple()]

            start_logits, end_logits = output
            result = SquadResult(unique_id, start_logits, end_logits)

            all_results.append(result)

    evalTime = timeit.default_timer() - start_time
    print(f"Evaluation done in total {evalTime} seconds ({evalTime / len(dataset)} seconds per example)")

    # compute predictions
    os.makedirs(output_dir, exist_ok = True)

    output_prediction_file = os.path.join(output_dir, "predictions.json")
    output_nbest_file = os.path.join(ouptut_dir, "nbest_predictions.json")

    output_null_log_odds_file = os.path.join(ouptut_dir, "null_odds.json")

    predictions = compute_predictions_logits(
        examples,
        features,
        all_results,
        nbest_size,
        max_answer_length,
        do_lower_case,
        output_prediction_file,
        output_nbest_file,
        output_null_log_odds_file,
        False,
        True,
        null_score_diff_threshold,
        tokenizer,
    )

    # compute the F1 and EM scores
    results = squad_evaluate(examples, predictions)

    results.update({"eval_time": evalTime, "prediction_time": evalTime / len(dataset)})

    return results

In [46]:
result = evaluate(BERT_MODEL_HF_PATH, dataset, BERT_OUTPUT_DIR)
result

Some weights of the model checkpoint at twmkn9/bert-base-uncased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Evaluating: 100%|██████████| 1020/1020 [3:19:51<00:00, 11.76s/it]


Evaluation done in total 11991.023680532 seconds (0.9802995160670372 seconds per example)


NameError: ignored