<a href="https://colab.research.google.com/github/Spartan-119/A-B-Testing-Approach-for-Comparing-Performance-of-ML-Models/blob/main/a_b_testing_implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# installing all the necessary packages
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116 --upgrade
!pip install transformers --upgrade
!pip install tqdm
!pip install tensorboard


In [2]:
# importing all the necessary libraries
import json
import os
import timeit
import collections
import time
from pprint import pprint
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, squad_convert_examples_to_features
from transformers.data.processors.squad import SquadV2Processor, SquadResult
from transformers.data.metrics.squad_metrics import (
    compute_predictions_log_probs,
    compute_predictions_logits,
    squad_evaluate,
)

In [3]:
DO_LOWER_CASE = True
NBEST_SIZE = 20
DOC_STRIDE = 128
MAX_SEQ_LENGTH = 384
MAX_QUERY_LENGTH = 64
MAX_ANSWER_LENGTH = 30
DATA_DIR = 'data/squad'
PREDICT_FILE = 'dev-v2.0.json'

BERT_MODEL_TYPE = 'bert'
BERT_MODEL_HF_PATH = 'twmkn9/bert-base-uncased-squad2'
BERT_OUTPUT_DIR = 'models/bert/twmkn9_bert-case-uncased-squad2'

DISTILBERT_MODEL_TYPE = 'distilbert'
DISTILBERT_MODEL_HF_PATH = 'twmkn9/distilbert-base-uncased-squad2'
DISTILBERT_OUTPUT_DIR = 'models/distilbert/twmkn9_distilbert-base-uncased-squad2'

# Downloading and Exploring the dataset

In [4]:
# downloading the dataset
!wget -P data/squad/ https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

--2023-08-19 20:48:54--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘data/squad/dev-v2.0.json’


2023-08-19 20:48:55 (50.1 MB/s) - ‘data/squad/dev-v2.0.json’ saved [4370528/4370528]



#### <i>Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

#### SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.</i>
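
The two question types described above can be seen directly in the raw file. Here is a minimal sketch, assuming the standard SQuAD 2.0 JSON layout (`data` → `paragraphs` → `qas`, with an `is_impossible` flag on each question). The helper name `count_question_types` and the tiny inline sample are illustrative only and are not part of the notebook.

```python
def count_question_types(squad: dict) -> dict:
    """Count answerable vs. unanswerable questions in a SQuAD-style dict."""
    counts = {"answerable": 0, "unanswerable": 0}
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # SQuAD 1.1-style entries have no flag; treat them as answerable.
                if qa.get("is_impossible", False):
                    counts["unanswerable"] += 1
                else:
                    counts["answerable"] += 1
    return counts


# Hypothetical two-question sample that mimics the file structure.
sample = {
    "version": "v2.0",
    "data": [{
        "title": "Geology",
        "paragraphs": [{
            "context": "With isotopic dates it became possible to assign absolute ages.",
            "qas": [
                {"id": "q1", "question": "What kind of ages can be assigned?",
                 "answers": [{"text": "absolute ages", "answer_start": 49}],
                 "is_impossible": False},
                {"id": "q2", "question": "Who invented isotopes?",
                 "answers": [], "is_impossible": True},
            ],
        }],
    }],
}

print(count_question_types(sample))
```

To run it on the downloaded file, load `data/squad/dev-v2.0.json` with `json.load` and pass the resulting dictionary to the function.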


## Loading the DEV set using Hugging Face's data processors

I am going to use Hugging Face [Processors](https://huggingface.co/transformers/main_classes/processors.html) to handle basic processing for some canonical NLP datasets. Processors load a dataset and convert its examples into features that can be fed directly to the model. More specifically, we will be using the [SQuAD processors](https://huggingface.co/transformers/main_classes/processors.html#squad).

In [5]:
def to_list(tensor):
  return tensor.detach().cpu().tolist()

In [6]:
def load_and_cache_examples(model_name_or_path,
                            data_dir=DATA_DIR,
                            predict_file=PREDICT_FILE,
                            max_seq_length=MAX_SEQ_LENGTH,
                            doc_stride=DOC_STRIDE,
                            max_query_length=MAX_QUERY_LENGTH,
                            overwrite_cache=True):

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)
    # Load data features from cache or dataset file
    input_dir = data_dir if data_dir else "."
    cached_features_file = os.path.join(
        input_dir,
        "cached_{}_{}_{}".format(
            "dev",
            list(filter(None, model_name_or_path.split("/"))).pop(),
            str(max_seq_length),
        ),
    )

    # Init features and dataset from cache if it exists
    if os.path.exists(cached_features_file) and not overwrite_cache:
        # `logger` is never defined in this notebook, so report with print instead
        print(f"Loading features from cached file {cached_features_file}")
        features_and_dataset = torch.load(cached_features_file)
        features, dataset, examples = (
            features_and_dataset["features"],
            features_and_dataset["dataset"],
            features_and_dataset["examples"],
        )
    else:

        processor = SquadV2Processor()

        examples = processor.get_dev_examples(data_dir, filename=predict_file)

        features, dataset = squad_convert_examples_to_features(
            examples=examples,
            tokenizer=tokenizer,
            max_seq_length=max_seq_length,
            doc_stride=doc_stride,
            max_query_length=max_query_length,
            is_training=False,
            return_dataset="pt",
            threads=1,
        )


    return dataset, examples, features
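
The `max_seq_length` and `doc_stride` arguments matter because a passage that does not fit in one input sequence is split into overlapping windows, so one example can yield several features. The sketch below shows that idea in plain Python, in the spirit of the original BERT SQuAD preprocessing. It is a simplification and not the library's actual code: the function name `sliding_windows` is hypothetical, and `max_tokens_for_doc` stands for whatever room is left after the question and special tokens.

```python
from typing import List, Tuple


def sliding_windows(num_doc_tokens: int,
                    max_tokens_for_doc: int,
                    doc_stride: int) -> List[Tuple[int, int]]:
    """Return (start, length) pairs that cover a document with overlap."""
    if max_tokens_for_doc <= 0 or doc_stride <= 0:
        raise ValueError("window size and stride must be positive")
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        # Advance by the stride, so consecutive windows share tokens.
        start += min(length, doc_stride)
    return spans


# A 10-token passage, 4 tokens per window, stride of 2.
print(sliding_windows(10, 4, 2))
```

Each window is later scored independently, and the best-scoring span across all windows of an example becomes the prediction.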

In [7]:
dataset, examples, features = load_and_cache_examples(BERT_MODEL_HF_PATH)

100%|██████████| 35/35 [00:13<00:00,  2.61it/s]
convert squad examples to features: 100%|██████████| 11873/11873 [02:41<00:00, 73.70it/s]
add example index and unique id: 100%|██████████| 11873/11873 [00:00<00:00, 597633.08it/s]


In [8]:
print(f'There are {len(examples)} examples in the dev dataset.')

There are 11873 examples in the dev dataset.


This list of examples contains objects of type `transformers.data.processors.squad.SquadExample`.
We use the function below to extract the information we want from such objects:
the question ID (`qas_id`), `question_text`, `context_text`, and the reference `answers`.

I will first create some extra variables to help manipulate the data.

In [10]:
# generating some maps to help identify examples of interest.
qid_to_example_index = {example.qas_id: i for i, example in enumerate(examples)}
qid_to_has_answer = {example.qas_id: bool(example.answers) for example in examples}
answer_qids = [qas_id for qas_id, has_answer in qid_to_has_answer.items() if has_answer]
no_answer_qids = [qas_id for qas_id, has_answer in qid_to_has_answer.items() if not has_answer]

The function below displays the information for an example given its `qid` (the question's unique identifier).

In [11]:
def display_example(qid: str) -> None:
  idx = qid_to_example_index[qid]
  q = examples[idx].question_text
  c = examples[idx].context_text
  a = [answer['text'] for answer in examples[idx].answers]

  print(f'Examples {idx} of {len(examples)}\n------------------')
  print(f'Q: {q}\n')
  print('Context:')
  pprint(c)
  print(f'\nTrue Answers:\n{a}')

## Positive Example

About half of the examples in the dev set are questions that have answers contained within their corresponding passage. In these cases, up to 5 possible correct answers are provided, and each must come directly from the passage. As we will see later, however, there are several ways to arrive at a "correct" answer.

In [14]:
display_example(answer_qids[2456])

Examples 4959 of 11873
------------------
Q: It is now possible to convert old relative ages into what type of ages using isotopic dating?

Context:
('At the beginning of the 20th century, important advancement in geological '
 'science was facilitated by the ability to obtain accurate absolute dates to '
 'geologic events using radioactive isotopes and other methods. This changed '
 'the understanding of geologic time. Previously, geologists could only use '
 'fossils and stratigraphic correlation to date sections of rock relative to '
 'one another. With isotopic dates it became possible to assign absolute ages '
 'to rock units, and these absolute dates could be applied to fossil sequences '
 'in which there was datable material, converting the old relative ages into '
 'new absolute ages.')

True Answers:
['absolute ages', 'rock units', 'new absolute']
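
Because a question can have several reference answers, a prediction is usually scored against each one and the best score is kept. The sketch below follows the normalization and token-overlap F1 used by the official SQuAD evaluation script (lowercase, strip punctuation and articles, collapse whitespace). It is a simplified re-implementation for illustration and is not the `squad_evaluate` code imported earlier; the function names are my own.

```python
import collections
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> int:
    return int(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # For empty answers, F1 is 1 only if both sides are empty.
        return float(pred_tokens == ref_tokens)
    overlap = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_scores(prediction: str, references: list) -> tuple:
    """Best exact-match and F1 over all reference answers."""
    refs = references or [""]  # unanswerable questions expect an empty answer
    em = max(exact_match(prediction, r) for r in refs)
    f1 = max(f1_score(prediction, r) for r in refs)
    return em, f1


references = ["absolute ages", "rock units", "new absolute"]
print(best_scores("The absolute ages.", references))
print(best_scores("new absolute ages", references))
```

The second call shows partial credit: the prediction matches no reference exactly, yet it shares tokens with two of them and so earns a non-zero F1.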


## Negative Example

The remaining questions in the dev set, again about half, have no answer in their passage. This matters because, in a real-life Q&A system, the model needs to learn when **NOT TO ANSWER.**

In [15]:
display_example(no_answer_qids[1235])

Examples 2520 of 11873
------------------
Q: What is difficult with a satellite-to-noise ratio?

Context:
('Oxygen presents two spectrophotometric absorption bands peaking at the '
 'wavelengths 687 and 760 nm. Some remote sensing scientists have proposed '
 'using the measurement of the radiance coming from vegetation canopies in '
 'those bands to characterize plant health status from a satellite platform. '
 'This approach exploits the fact that in those bands it is possible to '
 "discriminate the vegetation's reflectance from its fluorescence, which is "
 'much weaker. The measurement is technically difficult owing to the low '
 'signal-to-noise ratio and the physical structure of vegetation; but it has '
 'been proposed as a possible method of monitoring the carbon cycle from '
 'satellites on a global scale.')

True Answers:
[]
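
One common way for an extractive model to decide when not to answer is to compare a "null" score with the score of its best candidate span. The sketch below assumes a BERT-style model in which position 0 holds the `[CLS]` token and the null score is the sum of the start and end logits at that position. The function names and the `threshold` parameter are illustrative; this is a simplified view of the idea, not the exact logic of `compute_predictions_logits`.

```python
from typing import List, Optional, Tuple


def best_span(start_logits: List[float],
              end_logits: List[float],
              max_answer_length: int = 30) -> Tuple[float, Optional[int], Optional[int]]:
    """Return (score, start, end) of the best non-null span."""
    best = (float("-inf"), None, None)
    for s in range(1, len(start_logits)):  # skip position 0 ([CLS])
        last = min(s + max_answer_length, len(end_logits))
        for e in range(s, last):
            score = start_logits[s] + end_logits[e]
            if score > best[0]:
                best = (score, s, e)
    return best


def predict_or_abstain(start_logits: List[float],
                       end_logits: List[float],
                       threshold: float = 0.0,
                       max_answer_length: int = 30) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices, or None to abstain."""
    null_score = start_logits[0] + end_logits[0]
    score, s, e = best_span(start_logits, end_logits, max_answer_length)
    if null_score - score > threshold:
        return None  # "no answer" outscores every span by more than the threshold
    return (s, e)


# Toy logits: in the first case the [CLS] position dominates.
print(predict_or_abstain([5.0, 0.1, 0.2], [5.0, 0.1, 0.3]))
print(predict_or_abstain([0.0, 3.0, 0.1], [0.0, 0.2, 4.0]))
```

Raising `threshold` makes the model answer more often, while lowering it makes the model abstain more readily. The value is typically tuned on held-out data.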
