### Question Answering using BERT on the SQuAD Dataset

For question answering, BERT takes two inputs, the question and the text that contains the answer, packed into a single sequence. In this blog we will take the SQuAD dataset and train a question answering system on it. We will use the Hugging Face library to solve our problem.
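
To make the idea of a packed sequence concrete, here is a minimal sketch with toy, already-tokenized inputs. The helper name `pack_question_and_context` is hypothetical and only illustrates the layout; the real tokenizer does this for us.

```python
def pack_question_and_context(question_tokens, context_tokens):
    """Toy illustration of BERT's packed input: [CLS] question [SEP] context [SEP]."""
    tokens = ['[CLS]'] + question_tokens + ['[SEP]'] + context_tokens + ['[SEP]']
    # segment ids: 0 for [CLS], the question and the first [SEP]; 1 for the rest
    n_first = len(question_tokens) + 2
    segment_ids = [0] * n_first + [1] * (len(tokens) - n_first)
    return tokens, segment_ids

# made-up tokens, only to show the layout
tokens, segment_ids = pack_question_and_context(['how', 'many'], ['bert', 'is', 'big'])
print(tokens)
print(segment_ids)
```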


In [2]:
import torch
from transformers import BertTokenizer, BertForQuestionAnswering
import warnings
warnings.simplefilter('ignore')

In [3]:
weight_path = 'kaporter/bert-base-uncased-finetuned-squad'
tokenizer = BertTokenizer.from_pretrained(weight_path)
model = BertForQuestionAnswering.from_pretrained(weight_path)

#### Taking an Example:
Generate token IDs using the tokenizer.

In [4]:
question = "How many parameters does BERT-large have?"
context = "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance."

input_ids = tokenizer.encode(question, context)
print(f'We have about {len(input_ids)} tokens generated')

tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(' ')
print('Some examples of token-input_id pairs:')
# avoid shadowing the built-in id(); the loop index was unused
for token, token_id in zip(tokens, input_ids):
    print(f'{token} : {token_id}')

We have about 70 tokens generated
 
Some examples of token-input_id pairs:
[CLS] : 101
how : 2129
many : 2116
parameters : 11709
does : 2515
bert : 14324
- : 1011
large : 2312
have : 2031
? : 1029
[SEP] : 102
bert : 14324
- : 1011
large : 2312
is : 2003
really : 2428
big : 2502
. : 1012
. : 1012
. : 1012
it : 2009
has : 2038
24 : 2484
- : 1011
layers : 9014
and : 1998
an : 2019
em : 7861
##bed : 8270
##ding : 4667
size : 2946
of : 1997
1 : 1015
, : 1010
02 : 6185
##4 : 2549
, : 1010
for : 2005
a : 1037
total : 2561
of : 1997
340 : 16029
##m : 2213
parameters : 11709
! : 999
altogether : 10462
it : 2009
is : 2003
1 : 1015
. : 1012
34 : 4090
##gb : 18259
, : 1010
so : 2061
expect : 5987
it : 2009
to : 2000
take : 2202
a : 1037
couple : 3232
minutes : 2781
to : 2000
download : 8816
to : 2000
your : 2115
cola : 15270
##b : 2497
instance : 6013
. : 1012
[SEP] : 102


#### Generate segment embedding
Segment IDs are 0 for all tokens related to the question (including [CLS] and the first [SEP]) and 1 for all tokens related to the context.

In [5]:
sep_idx = tokens.index('[SEP]')

token_type_ids = [0 for i in range(sep_idx + 1)] + [1 for i in range(sep_idx+1, len(tokens))]

print(token_type_ids)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Now let's pass our input through the model and see the output.

In [6]:
out = model(torch.tensor([input_ids]),
            token_type_ids=torch.tensor([token_type_ids]))

start_logits, end_logits = out.start_logits, out.end_logits

# Find the tokens with the highest 'start' and 'end' scores
answer_start = torch.argmax(start_logits)
answer_end = torch.argmax(end_logits)

# Combine the tokens in the answer and print it out
answer = ' '.join(tokens[answer_start:answer_end+1])

print(f'Predicted Answer: {answer}')


Predicted Answer: 340 ##m
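
The predicted answer still contains the WordPiece continuation marker `##`, because the tokens were joined with spaces. A minimal sketch of how the pieces could be merged back into words (the helper `merge_wordpieces` is a hypothetical name, not part of the library):

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens, gluing '##' continuation pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith('##') and words:
            # continuation piece: append to the previous word without the marker
            words[-1] += token[2:]
        else:
            words.append(token)
    return ' '.join(words)

print(merge_wordpieces(['340', '##m']))  # 340m
```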


In [7]:
del model
del tokenizer

### Train a model on the SQuAD dataset

#### Data Preprocessing

In [8]:
import transformers
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.simplefilter('ignore')

**About dataset**

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. In SQuAD 2.0 a question might also be unanswerable; the `squad` dataset loaded here is version 1.1, in which every question has an answer.

##### Loading Dataset

In [9]:
# !pip install datasets

In [10]:
from datasets import load_dataset
squad_dataset = load_dataset('squad')
squad_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [11]:
# to make text bold
s_bold = '\033[1m'
e_bold = '\033[0;0m'

print(s_bold + 'Train Data Sample.....' + e_bold)
train_data = squad_dataset["train"]
for data in train_data:
    print(' ')
    print(s_bold + 'ID -' + e_bold, data['id'])
    print(s_bold +'TITLE - '+ e_bold, data['title'])
    print(s_bold + 'CONTEXT - '+ e_bold,data['context'])
    print(s_bold + 'ANSWERS - ' + e_bold,data['answers']['text'])
    print(s_bold + 'ANSWERS START INDEX - ' + e_bold,data['answers']['answer_start'])
    print(' ')
    break

print('---'*30)
print(s_bold + 'Validation Data Sample.....' + e_bold)
train_data = squad_dataset["validation"]
for data in train_data:
    print(' ')
    print(s_bold + 'ID -' + e_bold, data['id'])
    print(s_bold +'TITLE - '+ e_bold, data['title'])
    print(s_bold + 'CONTEXT - '+ e_bold,data['context'])
    print(s_bold + 'ANSWERS - ' + e_bold,data['answers']['text'])
    print(s_bold + 'ANSWERS START INDEX - ' + e_bold,data['answers']['answer_start'])
    print(' ')
    break

Train Data Sample.....
 
ID - 5733be284776f41900661182
TITLE -  University_of_Notre_Dame
CONTEXT -  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
ANSWERS -  ['Saint Bernadette Soubirous']
ANSWERS START INDEX -  [515]
 
-----------------------------------------------------------------------

Some validation samples have multiple answers. Let's count the samples in each split whose number of answers is not exactly one.

In [12]:
squad_dataset['train'].filter(lambda x: len(x['answers']['text']) != 1)

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

In [13]:
squad_dataset['validation'].filter(lambda x: len(x['answers']['text']) != 1)

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10567
})

In the validation data, 10567 of the 10570 samples have multiple answers, while every training sample has exactly one.

In [14]:
# Sample some dataset to reduce training time
squad_dataset['train'] = squad_dataset['train'].select(range(8000))
squad_dataset['validation'] = squad_dataset['validation'].select(range(2000))

###### Labelling the dataset
Conceptually, each input token gets a label: (1,0) for the token where the answer starts, (0,1) for the token where the answer ends, and (0,0) for all other tokens.

Note:

With the Hugging Face library we do not need to build these per-token labels. We only need to provide the start and end token positions of the answer. The model returns start logits and end logits as output, and we can apply argmax to them to find the predicted start and end positions.
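
Taking the argmax of the start and end logits independently can produce an end position that comes before the start. A hedged sketch, in plain Python with made-up logits, of picking the best valid span instead (`best_span` and its `max_answer_length` parameter are illustrative names, not library API):

```python
def best_span(start_logits, end_logits, max_answer_length=30):
    """Return (start, end) maximising start_logits[start] + end_logits[end], with start <= end."""
    best, best_score = (0, 0), float('-inf')
    for s, s_logit in enumerate(start_logits):
        # only consider ends at or after the start, within the length limit
        last = min(len(end_logits), s + max_answer_length)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# toy logits for a 5-token input (made-up numbers)
start_logits = [0.1, 2.5, 0.3, 0.2, 0.0]
end_logits = [0.0, 0.4, 0.1, 3.0, 0.2]
print(best_span(start_logits, end_logits))  # (1, 3)
```

Both indexes in the returned pair are inclusive token positions.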

#### Handling long contexts

In [15]:
from transformers import AutoTokenizer

trained_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)

context = squad_dataset['train'][0]['context']
question = squad_dataset['train'][0]['question']
answer = squad_dataset['train'][0]['answers']['text']

inputs = tokenizer(
    question,
    context,
    max_length=160,
    truncation='only_second',
    stride=70,
    return_overflowing_tokens=True,
)

print(f"The example gave {len(inputs['input_ids'])} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")

print('Question: ',question)
print(' ')
print('Context : ',context)
print(' ')
print('Answer: ', answer)
print('--'*25)

for i, ids in enumerate(inputs['input_ids']):
  print('Context piece', i+1)
  # decode from the first [SEP] onward, i.e. the context part of the feature
  print(tokenizer.decode(ids[ids.index(tokenizer.sep_token_id):]))
  print(' ')

The example gave 2 features.
Here is where each comes from: [0, 0].
Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
 
Context :  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
 
Answer:  ['Saint Bernadette Soubirous']
--------------------------------------------------
Context piece 1
[SEP] architecturally, the s
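
The `stride` argument sets how many tokens consecutive pieces share, so an answer that is cut off at the end of one piece can still appear whole in the next. A simplified sketch of that windowing on a plain list (the function `split_with_overlap` is hypothetical, and it ignores the question and special tokens that the real tokenizer also has to fit into `max_length`):

```python
def split_with_overlap(tokens, window, overlap):
    """Split tokens into chunks of at most `window` items; neighbours share `overlap` items."""
    if overlap >= window:
        raise ValueError('overlap must be smaller than window')
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this chunk already reaches the end of the sequence
        start += step
    return chunks

for chunk in split_with_overlap(list(range(10)), window=4, overlap=2):
    print(chunk)
```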

For the entire dataset:

In [18]:
from transformers import AutoTokenizer

del tokenizer
trained_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)

def train_data_preprocess(examples):
  """
  generate start and end indexes of answer in context
  """
  def find_context_start_end_index(sequence_ids):
    """
    returns the indexes of the first and last context tokens
    """
    token_idx = 0
    while sequence_ids[token_idx] != 1:
      token_idx += 1
    context_start_idx = token_idx

    while token_idx < len(sequence_ids) and sequence_ids[token_idx] == 1:
      token_idx += 1
    # step back: token_idx now points at the [SEP] after the context,
    # whose offset is (0, 0)
    context_end_idx = token_idx - 1
    return context_start_idx, context_end_idx


  questions = [q.strip() for q in examples["question"]]
  context = examples["context"]
  answers = examples["answers"]

  inputs = tokenizer(
      questions,
      context,
      max_length=512,
      truncation="only_second",
      stride=128,
      return_overflowing_tokens=True,  #returns id of base context
      return_offsets_mapping=True,  # returns (start_index,end_index) of each token
      padding="max_length"
  )

  start_positions = []
  end_positions = []


  for i,mapping_idx_pairs in enumerate(inputs['offset_mapping']):
      context_idx = inputs['overflow_to_sample_mapping'][i]

      # from main context
      answer = answers[context_idx]
      answer_start_char_idx = answer['answer_start'][0]
      answer_end_char_idx = answer_start_char_idx + len(answer['text'][0])

      # sub contexts
      tokens = inputs['input_ids'][i]
      sequence_ids = inputs.sequence_ids(i)

      # find context start and end index wrt subcontexts
      context_start_idx, context_end_idx = find_context_start_end_index(sequence_ids)
      context_start_char_index = mapping_idx_pairs[context_start_idx][0]
      context_end_char_index = mapping_idx_pairs[context_end_idx][1]

      # If the answer is not fully inside the context, label is (0,0)
      if (context_start_char_index > answer_start_char_idx) or (context_end_char_index < answer_end_char_idx):
        start_positions.append(0)
        end_positions.append(0)

      else:
        # walk forward to the token that contains the first answer character
        idx = context_start_idx
        while idx <= context_end_idx and mapping_idx_pairs[idx][0] <= answer_start_char_idx:
          idx += 1
        start_positions.append(idx - 1)

        # walk backward to the last answer token; idx + 1 is one past it,
        # so input_ids[start:end] slices out the answer
        idx = context_end_idx
        while idx >= context_start_idx and mapping_idx_pairs[idx][1] > answer_end_char_idx:
          idx -= 1
        end_positions.append(idx + 1)


  inputs["start_positions"] = start_positions
  inputs["end_positions"] = end_positions
  return inputs

train_sample = squad_dataset['train'].select(range(200))

train_dataset = train_sample.map(train_data_preprocess,
                                 batched=True,
                                 remove_columns=train_sample.column_names)

len(squad_dataset['train']), len(train_dataset)


(8000, 200)
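
The core of the preprocessing is turning the answer's character span into token positions using the offset mapping. A toy sketch of that step, assuming a whitespace "tokenizer" in place of the real one (`char_span_to_token_span` is a hypothetical helper, and unlike the function above it returns an inclusive end index):

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map an answer's character span [start_char, end_char) to inclusive token indexes.

    offsets holds one (start, end) character pair per token.
    Returns None if the answer is not covered by the tokens.
    """
    start_tok = end_tok = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start <= start_char < tok_end:
            start_tok = i  # token containing the first answer character
        if tok_start < end_char <= tok_end:
            end_tok = i    # token containing the last answer character
    if start_tok is None or end_tok is None:
        return None
    return start_tok, end_tok

context = 'the cat sat on the mat'
# whitespace tokens with character offsets, standing in for offset_mapping
offsets, pos = [], 0
for word in context.split():
    start = context.index(word, pos)
    offsets.append((start, start + len(word)))
    pos = start + len(word)

answer = 'sat on'
start_char = context.index(answer)
span = char_span_to_token_span(offsets, start_char, start_char + len(answer))
print(span)  # (2, 3)
```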

Comparing the values before and after tokenization.

In [20]:
def print_context_and_answer(idx,mini_ds=squad_dataset["train"]):

    print(idx)
    print('----')
    question = mini_ds[idx]['question']
    context = mini_ds[idx]['context']
    answer = mini_ds[idx]['answers']['text']
    print('Theoretical values :')
    print(' ')
    print('Question: ')
    print(question)
    print(' ')
    print('Context: ')
    print(context)
    print(' ')
    print('Answer: ')
    print(answer)
    print(' ')
    answer_start_char_idx = mini_ds[idx]['answers']['answer_start'][0]
    answer_end_char_idx = answer_start_char_idx + len(mini_ds[idx]['answers']['text'][0])
    print('Start and end index of text: ',answer_start_char_idx,answer_end_char_idx)
    print('----'*20)
    print('Values after tokenization:')


    #answer
    sep_tok_index = train_dataset[idx]['input_ids'].index(102) #get index for [SEP]
    question_ = train_dataset[idx]['input_ids'][:sep_tok_index+1]
    question_decoded = tokenizer.decode(question_)
    context_ = train_dataset[idx]['input_ids'][sep_tok_index+1:]
    context_decoded = tokenizer.decode(context_)
    start_idx = train_dataset[idx]['start_positions']
    end_idx = train_dataset[idx]['end_positions']
    answer_toks = train_dataset[idx]['input_ids'][start_idx:end_idx]
    answer_decoded = tokenizer.decode(answer_toks)
    print(' ')
    print('Question: ')
    print(question_decoded)
    print(' ')
    print('Context: ')
    print(context_decoded)
    print(' ')
    print('Answer: ')
    print(answer_decoded)
    print(' ')
    print('Start pos and end pos of tokens: ',train_dataset[idx]['start_positions'],train_dataset[idx]['end_positions'])
    print('____'*20)


print_context_and_answer(0)
print_context_and_answer(1)
print_context_and_answer(2)
print_context_and_answer(3)

0
----
Theoretical values :
 
Question: 
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
 
Context: 
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
 
Answer: 
['Saint Bernadette Soubirous']
 
Start and end index of text:  515 541
--------------------------------------------------------------------------------
Values after tok

#### Model Evaluation

In [22]:
from transformers import AutoTokenizer

def preprocess_validation_examples(examples):
  """
  preprocessing validation data
  """

  questions = [q.strip() for q in examples["question"]]
  inputs = tokenizer(
      questions,
      examples["context"],
      max_length=512,
      truncation="only_second",
      stride=128,
      return_overflowing_tokens=True,
      return_offsets_mapping=True,
      padding="max_length"
  )

  sample_map = inputs.pop("overflow_to_sample_mapping")
  base_ids = []

  for i in range(len(inputs["input_ids"])):
      base_context_idx = sample_map[i]
      base_ids.append(examples["id"][base_context_idx])

      # sequence id indicates the input. 0 for first input and 1 for second input
      # and None for special tokens by default
      sequence_ids = inputs.sequence_ids(i)
      offset = inputs["offset_mapping"][i]
      # for Question tokens provide offset_mapping as None
      inputs["offset_mapping"][i] = [
          o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
      ]

  inputs["base_id"] = base_ids
  return inputs

trained_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)

data_val_sample = squad_dataset["validation"].select(range(100))
eval_set = data_val_sample.map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=squad_dataset["validation"].column_names,
)
len(eval_set)


100

In [23]:
import torch
from transformers import DistilBertForQuestionAnswering

# del tokenizer
# take a small sample

eval_set_for_model = eval_set.remove_columns(["base_id", "offset_mapping"])
eval_set_for_model.set_format("torch")

checkpoint =  "distilbert-base-uncased"
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

batch = {k: eval_set_for_model[k].to(device) for k in eval_set_for_model.column_names}

model = DistilBertForQuestionAnswering.from_pretrained(checkpoint).to(
    device
)


with torch.no_grad():
    outputs = model(**batch)

start_logits = outputs.start_logits.cpu().numpy()
end_logits = outputs.end_logits.cpu().numpy()

start_logits.shape,end_logits.shape

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


((100, 512), (100, 512))

#### Model Evaluation
We will evaluate our model using the Evaluate library. We use two metrics for evaluation.

1. Exact match
2. F1 score
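
For intuition, here is a simplified sketch of how these two metrics can be computed for one prediction. It follows the usual SQuAD recipe (normalize the text, then compare strings or overlapping tokens, and take the best score over all reference answers), but it is an illustration with hypothetical helper names, not the Evaluate library's implementation.

```python
import collections
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(true_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

def best_over_answers(metric_fn, prediction, truths):
    """Score against every reference answer and keep the best."""
    return max(metric_fn(prediction, t) for t in truths)

print(exact_match('The Cat.', 'cat'))
print(f1_score('the cat sat', 'cat sat down'))
```

Taking the maximum over reference answers matters for the validation split, where most samples have several answers.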

In [24]:
!pip install evaluate


In [25]:
import collections
import evaluate

def predict_answers_and_evaluate(start_logits,end_logits,eval_set,examples):
    """
    make predictions and compute the SQuAD metrics
    Args:
    start_logits : start_position prediction logits
    end_logits: end_position prediction logits
    eval_set: processed val data
    examples: unprocessed val data with context text
    """
    # appending all id's corresponding to the base context id
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(eval_set):
        example_to_features[feature["base_id"]].append(idx)

    n_best = 20
    max_answer_length = 30
    predicted_answers = []

    for example in examples:
        example_id = example["id"]
        context = example["context"]
        answers = []

        # looping through each sub contexts corresponding to a context and finding
        # answers
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = eval_set["offset_mapping"][feature_index]

            # sort the logits over all token positions and keep the indexes
            # of the n_best (here 20) highest-scoring tokens
            start_indexes = np.argsort(start_logit).tolist()[::-1][:n_best]
            end_indexes = np.argsort(end_logit).tolist()[::-1][:n_best]


            for start_index in start_indexes:
                for end_index in end_indexes:

                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length.
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                       ):
                        continue

                    answers.append({
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                        })


        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    metric = evaluate.load("squad")

    theoretical_answers = [
            {"id": ex["id"], "answers": ex["answers"]} for ex in examples
    ]

    metric_ = metric.compute(predictions=predicted_answers, references=theoretical_answers)
    return predicted_answers,metric_

In [26]:
pred_answers,metrics_ = predict_answers_and_evaluate(start_logits,end_logits,eval_set,data_val_sample)
metrics_

{'exact_match': 0.0, 'f1': 8.271020793825185}