ODQA (Open-Domain Question Answering) is the task of answering questions on a broad range of topics by drawing on large corpora of unstructured text, without being restricted to a specific domain. \
In this project, we build an end-to-end QA system that retrieves relevant contexts and extracts the best answer for a given question. The questions can come from many different topics.

### 1. Install and import libraries

In [1]:
!pip install -qq datasets==2.16.1 evaluate==0.4.1 transformers[sentencepiece]==4.35.2
!pip install accelerate==0.26.1
# apt exists only on Debian/Ubuntu systems (e.g. Colab); on Windows, install Git LFS separately
!apt install git-lfs



'apt' is not recognized as an internal or external command,
operable program or batch file.


In [2]:
from huggingface_hub import notebook_login
notebook_login()


In [3]:
import numpy as np
from tqdm.auto import tqdm
import collections

import torch

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModelForQuestionAnswering
from transformers import TrainingArguments
from transformers import Trainer
import evaluate

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")



### 2. Setup config

In [4]:
MODEL_NAME = "distilbert-base-uncased"
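MAX_LENGTH = 384  # maximum number of tokens per feature (question + context window)

STRIDE = 128  # number of overlapping tokens between consecutive context windows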
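To see how `MAX_LENGTH` and `STRIDE` interact, here is a minimal sketch of the sliding-window arithmetic, assuming the stride is the number of tokens shared by two consecutive windows. The function `window_spans` and its `capacity` parameter (the tokens left for the context once the question and special tokens are counted) are hypothetical names used only for illustration. This is a simplified model, not the tokenizer's own implementation.

```python
def window_spans(n_context_tokens, capacity, stride):
    """Return (start, end) token ranges of overlapping context windows."""
    if stride >= capacity:
        raise ValueError("stride must be smaller than capacity")
    spans = []
    start = 0
    while True:
        end = min(start + capacity, n_context_tokens)
        spans.append((start, end))
        if end == n_context_tokens:
            break  # the last window reaches the end of the context
        # Step forward so that `stride` tokens are repeated in the next window
        start += capacity - stride
    return spans


# Toy numbers: 10 context tokens, room for 4 per window, 2 tokens of overlap
print(window_spans(10, 4, 2))  # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

A context that fits in one window produces a single span, which is why most examples in the dataset map to exactly one feature.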

### 3. Setup dataset

#### 3.1 Download dataset

In [5]:
# Download the dataset from the Hugging Face Hub
DATASET_NAME = 'squad_v2'
raw_datasets = load_dataset(DATASET_NAME)

#### 3.2 EDA dataset

In [6]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

In [7]:
print("Context: ", raw_datasets["train"][0]["context"])
print("Question: ", raw_datasets["train"][0]["question"])
print("Answer: ", raw_datasets["train"][0]["answers"])

Context:  Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Question:  When did Beyonce start becoming popular?
Answer:  {'text': ['in the late 1990s'], 'answer_start': [269]}
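The `answer_start` value is a character offset into `context`, so the answer text can be recovered by slicing. The sketch below shows this on a made-up sentence; the helper `answer_span` and the toy context are assumptions for illustration, not part of the dataset.

```python
def answer_span(answer_text, answer_start):
    """Return the (start_char, end_char) range of an answer inside its context."""
    return answer_start, answer_start + len(answer_text)


# Toy record that mimics the structure of a SQuAD entry
context = "She rose to fame in the late 1990s as a lead singer."
answers = {"text": ["in the late 1990s"], "answer_start": [17]}

start_char, end_char = answer_span(answers["text"][0], answers["answer_start"][0])
print(context[start_char:end_char])  # in the late 1990s
```

The same `start_char` and `end_char` pair is what the training preprocessing later converts into token positions.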


In [8]:
# Keep only the unanswerable questions (empty answer list)
non_answer = raw_datasets["train"].filter(
    lambda x: len(x['answers']['text']) == 0
)

#### 3.3 Load tokenizer and run some examples

In [9]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [10]:
context = raw_datasets["train"][0]["context"]
question = raw_datasets["train"][0]["question"]

In [11]:
inputs = tokenizer(
    question,
    context,
    max_length=MAX_LENGTH,
    truncation="only_second",
    stride=STRIDE,
    return_overflowing_tokens=True,  # return the extra windows when the input exceeds max_length
    return_offsets_mapping=True,  # return each token's character span in the original text
    padding="max_length"  # pad every window to max_length
)

In [12]:
inputs.keys()

dict_keys(['input_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [None]:
# input_ids: token ids of the input, laid out as [CLS] question [SEP] context [SEP]
# offset_mapping: (start, end) character span in the original text for each token
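To make the idea of an offset mapping concrete, the sketch below builds one with a toy whitespace tokenizer. `toy_offsets` is a hypothetical stand-in: the real tokenizer splits text into subwords, but the `(start, end)` character spans it returns are read the same way.

```python
import re


def toy_offsets(text):
    """Split on whitespace and return (token, (start_char, end_char)) pairs."""
    return [(m.group(), m.span()) for m in re.finditer(r"\S+", text)]


text = "rose to fame"
pairs = toy_offsets(text)
for token, (start, end) in pairs:
    # Slicing the original text with the span gives back the token
    assert text[start:end] == token
    print(token, (start, end))
```

Because each span points back into the original string, a predicted token range can later be turned into answer text by slicing the context.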

In [13]:
inputs

{'input_ids': [[101, 2043, 2106, 20773, 2707, 3352, 2759, 1029, 102, 20773, 21025, 19358, 22815, 1011, 5708, 1006, 1013, 12170, 23432, 29715, 3501, 29678, 12325, 29685, 1013, 10506, 1011, 10930, 2078, 1011, 2360, 1007, 1006, 2141, 2244, 1018, 1010, 3261, 1007, 2003, 2019, 2137, 3220, 1010, 6009, 1010, 2501, 3135, 1998, 3883, 1012, 2141, 1998, 2992, 1999, 5395, 1010, 3146, 1010, 2016, 2864, 1999, 2536, 4823, 1998, 5613, 6479, 2004, 1037, 2775, 1010, 1998, 3123, 2000, 4476, 1999, 1996, 2397, 4134, 2004, 2599, 3220, 1997, 1054, 1004, 1038, 2611, 1011, 2177, 10461, 1005, 1055, 2775, 1012, 3266, 2011, 2014, 2269, 1010, 25436, 22815, 1010, 1996, 2177, 2150, 2028, 1997, 1996, 2088, 1005, 1055, 2190, 1011, 4855, 2611, 2967, 1997, 2035, 2051, 1012, 2037, 14221, 2387, 1996, 2713, 1997, 20773, 1005, 1055, 2834, 2201, 1010, 20754, 1999, 2293, 1006, 2494, 1007, 1010, 2029, 2511, 2014, 2004, 1037, 3948, 3063, 4969, 1010, 3687, 2274, 8922, 2982, 1998, 2956, 1996, 4908, 2980, 2531, 2193, 1011, 2028, 3

In [14]:
tokenizer.decode(inputs['input_ids'][0])

'[CLS] when did beyonce start becoming popular? [SEP] beyonce giselle knowles - carter ( / biːˈjɒnseɪ / bee - yon - say ) ( born september 4, 1981 ) is an american singer, songwriter, record producer and actress. born and raised in houston, texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of r & b girl - group destiny\'s child. managed by her father, mathew knowles, the group became one of the world\'s best - selling girl groups of all time. their hiatus saw the release of beyonce\'s debut album, dangerously in love ( 2003 ), which established her as a solo artist worldwide, earned five grammy awards and featured the billboard hot 100 number - one singles " crazy in love " and " baby boy ". [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

### 4. Tokenize dataset

#### 4.1 Tokenize train set

In [15]:
# Create the preprocessing function for the training set
def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]

    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=MAX_LENGTH,
        truncation="only_second",
        stride=STRIDE,
        return_overflowing_tokens=True, 
        return_offsets_mapping=True, 
        padding="max_length" ,
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]

    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        # sequence_ids: None for special tokens, 0 for question tokens, 1 for context tokens
        sequence_ids = inputs.sequence_ids(i)
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx 
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        answer = answers[sample_idx]

        if len(answer['text']) == 0:
            # Unanswerable question: label both positions with index 0 ([CLS])
            start_positions.append(0)
            end_positions.append(0)
        else:
            start_char = answer["answer_start"][0]
            end_char = answer["answer_start"][0] + len(answer["text"][0])

            # If the answer is not fully inside this window, label it (0, 0)
            if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
                start_positions.append(0)
                end_positions.append(0)
            else:
                idx = context_start
                while idx <= context_end and offset[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)

                idx = context_end
                while idx >= context_start and offset[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)
    # Attach the start and end token positions to the inputs
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions

    return inputs

train_dataset = raw_datasets["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

len(raw_datasets["train"]), len(train_dataset)


(130319, 131754)
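The labelling logic in `preprocess_training_examples` can be checked on a small hand-made feature. The sketch below copies the two scanning loops into a standalone function; `char_span_to_token_span` and the toy offsets are assumptions for illustration, with index 0 standing in for the `[CLS]` token.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Convert an answer's character range into (start_token, end_token)."""
    # Answer not fully inside this window: label it (0, 0)
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Move right until a token starts after the answer start, then step back
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Move left until a token ends before the answer end, then step forward
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Toy feature: tokens 1..4 cover the text "rose to fame quickly"
offsets = [(0, 0), (0, 4), (5, 7), (8, 12), (13, 20)]
print(char_span_to_token_span(offsets, 1, 4, 5, 12))   # (2, 3), i.e. "to fame"
print(char_span_to_token_span(offsets, 1, 4, 25, 30))  # (0, 0), answer outside window
```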

In [16]:
print(train_dataset[0:10]['start_positions'])
print(train_dataset[0:10]['end_positions'])

[75, 68, 143, 58, 78, 94, 134, 101, 77, 84]
[78, 70, 143, 60, 79, 97, 136, 102, 78, 85]

#### 4.2 Tokenize validation set

In [17]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]

    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=MAX_LENGTH,
        truncation="only_second",
        stride=STRIDE,
        return_overflowing_tokens=True, 
        return_offsets_mapping=True, 
        padding="max_length" ,
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]

        # Keep offsets only for context tokens; question and special tokens become None
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

validation_dataset = raw_datasets["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names
)

len(raw_datasets["validation"]), len(validation_dataset)

(11873, 12134)
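The validation set has more features (12134) than examples (11873) because long contexts are split into several windows. The `example_id` column stored above makes it possible to regroup those windows by question. Here is a minimal sketch with made-up ids; `group_features` is a hypothetical helper name.

```python
import collections


def group_features(features):
    """Map each example id to the indices of the features built from it."""
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)
    return dict(example_to_features)


# Toy features: example "a" was split into two windows, "b" fits in one
features = [{"example_id": "a"}, {"example_id": "a"}, {"example_id": "b"}]
print(group_features(features))  # {'a': [0, 1], 'b': [2]}
```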

### 5. Train model

In [18]:
#Load model
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
args = TrainingArguments(
    output_dir="distilbert-finetuned-squadv2", 
    #output_dir="distilbert-base-cased-distilled-squad",
    evaluation_strategy="no",  # no automatic evaluation during training
    save_strategy="epoch",  # save a checkpoint after each epoch
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,  # weight decay regularization to reduce overfitting
    #fp16=True,  # half-precision (16-bit) training to reduce memory use on GPU
    push_to_hub=True,
)

In [22]:
trainer = Trainer(
    model = model,
    args = args,
    train_dataset = train_dataset,
    eval_dataset = validation_dataset,
    tokenizer = tokenizer,
)
trainer.train()

  0%|          | 0/49410 [00:00<?, ?it/s]
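The total of 49410 steps shown in the progress bar can be reproduced by hand. The sketch below assumes the `Trainer` default batch size of 8 per device (no batch size is set in the arguments above), a single device, and no gradient accumulation.

```python
import math

num_features = 131754  # len(train_dataset) reported in section 4.1
batch_size = 8         # assumed: default per_device_train_batch_size, one device
num_epochs = 3         # num_train_epochs from TrainingArguments

steps_per_epoch = math.ceil(num_features / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 16470 49410
```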

In [None]:
trainer.push_to_hub(commit_message="Training complete")

### 6. Evaluate model

In [None]:
metric = evaluate.load("squad_v2")

In [None]:
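N_BEST = 20  # number of top start/end candidates considered per feature
MAX_ANS_LENGTH = 30  # maximum length (in tokens) of a predicted answer

def compute_metrics(start_logits, end_logits, features, examples):
    # Map each example id to the indices of the features (windows) built from it
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature['example_id']].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example['id']
        context = example['context']
        answers = []

        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]['offset_mapping']

            # Indices of the N_BEST highest start and end logits, best first
            start_indexes = np.argsort(start_logit)[-1 : -N_BEST - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -N_BEST - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip spans that are not fully inside the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip reversed spans and spans longer than MAX_ANS_LENGTH
                    if end_index < start_index or end_index - start_index + 1 > MAX_ANS_LENGTH:
                        continue
                    # The source leaves this dict empty; the two fields below are an
                    # assumed completion (answer text cut from the context, summed logits)
                    answer = {
                        'text': context[offsets[start_index][0] : offsets[end_index][1]],
                        'logit_score': start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)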
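The candidate search inside `compute_metrics` can be tried on toy logits without a trained model. The sketch below is a simplified, pure-Python version, assuming the best span is the valid pair with the highest summed logit; `best_span` is a hypothetical helper, and `sorted(...)[:n_best]` plays the role of the `np.argsort` slice used above.

```python
def best_span(start_logit, end_logit, n_best=20, max_answer_length=30):
    """Return (start_index, end_index, score) of the best valid span, or None."""

    def top(scores):
        # Indices of the n_best highest scores, best first
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_best]

    best = None
    for s in top(start_logit):
        for e in top(end_logit):
            # Skip reversed spans and spans that are too long
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logit[s] + end_logit[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Toy logits over four tokens
start_logit = [0.1, 2.0, 0.3, 0.2]
end_logit = [0.0, 0.5, 3.0, 0.1]
print(best_span(start_logit, end_logit, n_best=2))  # (1, 2, 5.0)
```

Limiting the search to the top candidates keeps the number of pairs at most `n_best * n_best` per feature instead of one pair for every combination of token positions.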

In [None]:
predicted_answers = 