# Building Your First Question-Answering Model

This notebook introduces you to using a pre-trained BERT model for a question-answering task and fine-tuning that model on the Stanford Question Answering Dataset (SQuAD). Our focus is on understanding the fundamentals of NLP models for question answering, including pre-processing techniques and the impact of fine-tuning.

#### Installation Guide

In [None]:
#!pip install transformers datasets pillow torch

<img src="https://lh3.googleusercontent.com/d/1nRaX21am1QvUUXlp1YE2tE9a5q96SG5-" alt="drawing" width="650">

### Using a pre-trained model for Question-Answering

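First, let's see how the pre-trained BERT model performs on a QA task without any fine-tuning. BERT is a deep learning model developed by Google that understands the context of words in text by looking at the words that come before and after them. In our example the model needs to find an answer within a given passage. Note that `bert-base-uncased` has not been trained for question answering, so the question-answering output layer added by `BertForQuestionAnswering` starts with newly initialized weights, as the warning below the next cell reports.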

In [1]:
from transformers import BertForQuestionAnswering, AutoTokenizer, DefaultDataCollator, TrainingArguments, Trainer, BertTokenizer
import torch

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_out

Define a context and a question where the model might initially struggle:

In [2]:
context = "The University of California was founded in 1868, located in Berkeley."
question = "When was the University of California established?"

#### Model Prediction
Tokenize the input, make a prediction, and decode the answer:

In [3]:
inputs = tokenizer(question, context, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Find the tokens with the highest `start` and `end` scores
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1

# Convert tokens to answer string
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs.input_ids[0, answer_start:answer_end]))
print("Answer:", answer)

Answer: 
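The empty answer is consistent with how the span is selected: the start and end indices are chosen independently, so with an untrained output layer the end index can land before the start index, and the slice between them is empty. The sketch below illustrates this with made-up scores in plain Python lists (not real model output). `best_valid_span` is a hypothetical helper, not part of `transformers`, showing one common alternative that only scores spans whose start comes at or before their end.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def best_valid_span(start_logits, end_logits, max_answer_len=30):
    """Return inclusive (start, end) with the highest summed score and start <= end.

    Hypothetical helper for illustration only.
    """
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Made-up scores for a 6-token input (not real model output).
tokens = ["[CLS]", "when", "[SEP]", "founded", "in", "1868"]
start_logits = [0.1, 0.2, 0.0, 0.3, 0.1, 2.0]  # highest at index 5
end_logits = [0.0, 1.5, 0.1, 0.2, 0.3, 1.0]    # highest at index 1

# Independent argmax, as in the cell above: the end (1 + 1 = 2) precedes the start (5).
naive_start = argmax(start_logits)
naive_end = argmax(end_logits) + 1
naive_answer = tokens[naive_start:naive_end]   # empty slice

# The constrained search only considers spans where start <= end.
start, end = best_valid_span(start_logits, end_logits)
constrained_answer = tokens[start:end + 1]

print("naive:", naive_answer)
print("constrained:", constrained_answer)
```

Constraining the span does not make an untrained model correct; it only guarantees that the decoded answer is a well-formed span.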


## With Fine-Tuning

Now, let's fine-tune this BERT model to improve its performance on the task. We will use the Stanford Question Answering Dataset ([SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)), a large-scale dataset designed to test the reading comprehension ability of machine learning models. It contains more than 100,000 question-answer pairs created by humans on a range of Wikipedia articles, similar to the example below.

<img src="https://lh3.googleusercontent.com/d/1wMv0dnLe2VhULsJxtGdxZPM9kkkRrKV1" alt="drawing" width="400">

### Load SQuAD dataset

In [4]:
from datasets import load_dataset
squad = load_dataset("squad", split="train[:100]")
squad = squad.train_test_split(test_size=0.2)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Let's take a look at an example:

In [5]:
squad["train"][0]

{'id': '5733b2fe4776f41900661093',
 'title': 'University_of_Notre_Dame',
 'context': "The Lobund Institute grew out of pioneering research in germ-free-life which began in 1928. This area of research originated in a question posed by Pasteur as to whether animal life was possible without bacteria. Though others had taken up this idea, their research was short lived and inconclusive. Lobund was the first research organization to answer definitively, that such life is possible and that it can be prolonged through generations. But the objective was not merely to answer Pasteur's question but also to produce the germ free animal as a new tool for biological and medical research. This objective was reached and for years Lobund was a unique center for the study and production of germ free animals and for their use in biological and medical investigations. Today the work has spread to other universities. In the beginning it was under the Department of Biology and a program leading to the mast

There are three important fields here:

- `answers`: the answer text and the character position in the context where the answer starts.
- `context`: background information from which the model needs to extract the answer.
- `question`: the question a model should answer.
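The relationship between these fields can be checked directly: `answer_start` is a character offset into `context`, so slicing the context at that offset for the length of the answer text should reproduce the answer. The sketch below uses a hand-written record in the same layout as a SQuAD example; the record itself is hypothetical (built from the sentence used earlier) and is not an actual dataset row.

```python
# A hand-written record in the same layout as a SQuAD example.
# (Hypothetical data for illustration; not an actual dataset row.)
context = "The University of California was founded in 1868, located in Berkeley."
example = {
    "id": "example-0",
    "title": "University_of_California",
    "context": context,
    "question": "When was the University of California established?",
    "answers": {"text": ["1868"], "answer_start": [context.index("1868")]},
}

answer_text = example["answers"]["text"][0]
start_char = example["answers"]["answer_start"][0]
end_char = start_char + len(answer_text)

# Slicing the context by character offsets should give back the answer text.
recovered = example["context"][start_char:end_char]
print(recovered)
```

Both `text` and `answer_start` are lists because an example can carry more than one reference answer; the preprocessing in this notebook uses only the first.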

#### Preprocessing
There are a few preprocessing steps particular to question answering tasks you should be aware of:

- Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the context by setting `truncation="only_second"`.
- Next, map the start and end positions of the answer to the original context by setting `return_offsets_mapping=True`.
- With the mapping in hand, now you can find the start and end tokens of the answer. Use the `sequence_ids` method to find which part of the offset corresponds to the question and which corresponds to the context.
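The steps above can be sketched on a toy input. The offsets and sequence ids here are written by hand as stand-ins for tokenizer output (hypothetical whitespace-level tokens, not real WordPiece pieces), and the search is a simplified formulation of the same idea: locate the context tokens through the sequence ids, then pick the first and last tokens whose character ranges overlap the answer.

```python
# Hand-written stand-ins for tokenizer output (hypothetical, whitespace-level
# tokens rather than real WordPiece pieces).
context = "Founded in 1868 in Berkeley"
tokens = ["[CLS]", "when?", "[SEP]", "Founded", "in", "1868", "in", "Berkeley", "[SEP]"]
# (start_char, end_char) of each token within its own sequence; (0, 0) for special tokens.
offsets = [(0, 0), (0, 5), (0, 0), (0, 7), (8, 10), (11, 15), (16, 18), (19, 27), (0, 0)]
# None = special token, 0 = question, 1 = context.
sequence_ids = [None, 0, None, 1, 1, 1, 1, 1, None]

answer_text = "1868"
start_char = context.index(answer_text)
end_char = start_char + len(answer_text)

# 1. Locate the first and last context tokens (sequence id 1).
context_start = sequence_ids.index(1)
context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

# 2. If the answer lies outside the (possibly truncated) context, label it (0, 0).
if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
    start_token, end_token = 0, 0
else:
    # 3. First context token that ends after the answer starts ...
    start_token = next(
        i for i in range(context_start, context_end + 1)
        if offsets[i][1] > start_char
    )
    # ... and last context token that starts before the answer ends.
    end_token = next(
        i for i in range(context_end, context_start - 1, -1)
        if offsets[i][0] < end_char
    )

print(start_token, end_token, tokens[start_token:end_token + 1])
```

The resulting token indices are what the model is trained to predict as `start_positions` and `end_positions`.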

To apply the preprocessing function over the entire dataset, use the Datasets `map` function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once.

In [6]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=128,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

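        # Find the first and last token of the context (sequence id 1)
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer lies outside the (possibly truncated) context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise, find the token that contains the first answer character
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            # ... and the token that contains the last answer character
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)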

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

tokenized_squad = squad.map(preprocess_function, batched=True)
data_collator = DefaultDataCollator()

Map:   0%|          | 0/80 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

### Train
We can now fine-tune the model:

In [7]:
training_args = TrainingArguments(
    output_dir="qa_model",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()



  0%|          | 0/25 [00:00<?, ?it/s]

  0%|          | 0/2 [00:00<?, ?it/s]

{'eval_loss': 4.2972588539123535, 'eval_runtime': 1.0973, 'eval_samples_per_second': 18.226, 'eval_steps_per_second': 1.823, 'epoch': 1.0}


  0%|          | 0/2 [00:00<?, ?it/s]

{'eval_loss': 3.9105992317199707, 'eval_runtime': 1.1842, 'eval_samples_per_second': 16.889, 'eval_steps_per_second': 1.689, 'epoch': 2.0}


  0%|          | 0/2 [00:00<?, ?it/s]

{'eval_loss': 3.7255935668945312, 'eval_runtime': 1.1474, 'eval_samples_per_second': 17.431, 'eval_steps_per_second': 1.743, 'epoch': 3.0}


  0%|          | 0/2 [00:00<?, ?it/s]

{'eval_loss': 3.6754119396209717, 'eval_runtime': 1.2219, 'eval_samples_per_second': 16.368, 'eval_steps_per_second': 1.637, 'epoch': 4.0}


  0%|          | 0/2 [00:00<?, ?it/s]

{'eval_loss': 3.671262264251709, 'eval_runtime': 1.1975, 'eval_samples_per_second': 16.701, 'eval_steps_per_second': 1.67, 'epoch': 5.0}
{'train_runtime': 102.9251, 'train_samples_per_second': 3.886, 'train_steps_per_second': 0.243, 'train_loss': 3.5192266845703126, 'epoch': 5.0}


TrainOutput(global_step=25, training_loss=3.5192266845703126, metrics={'train_runtime': 102.9251, 'train_samples_per_second': 3.886, 'train_steps_per_second': 0.243, 'train_loss': 3.5192266845703126, 'epoch': 5.0})
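The step counts in the output above follow from the dataset sizes and the batch size. As a quick check, assuming a single device and no gradient accumulation (neither is set explicitly in the arguments above):

```python
import math

train_examples = 80    # 100 loaded examples with test_size=0.2
eval_examples = 20
batch_size = 16        # per_device_train_batch_size and per_device_eval_batch_size
num_train_epochs = 5

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_train_steps = steps_per_epoch * num_train_epochs
eval_steps = math.ceil(eval_examples / batch_size)

print(steps_per_epoch, total_train_steps, eval_steps)
```

The totals match `global_step=25` and the two-step evaluation progress bars, which also shows how few weight updates this small run performs.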

### Evaluate

Now let's ask the fine-tuned model the same question as before and see whether its answer has changed.

In [9]:
context = "The University of California was founded in 1868, located in Berkeley."
question = "When was the University of California established?"

# Tokenize the question and context together, and move the tensors to the model's device
encoded = tokenizer.encode_plus(question, context, return_tensors="pt").to(model.device)
input_ids = encoded["input_ids"].tolist()[0]

model.eval()
with torch.no_grad():
    outputs = model(**encoded)

answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1

# Convert tokens to answer string
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
print("Answer:", answer)

Answer: 1868


We have seen how even a few epochs of fine-tuning on a small sample can change the model's answer to this question from an empty string to the correct year. How much fine-tuning improves accuracy depends on the nature and amount of the fine-tuning data. In practice, we would use a more robust training process, for example by increasing the number of training examples, the number of epochs, and the `max_length` parameter.
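A single question says little about overall accuracy. One way to quantify it over many examples is with exact match and token-level F1, the two scores commonly reported for SQuAD. The sketch below is a simplified approximation: the normalization (lowercasing, stripping punctuation and English articles) and the function names are assumptions for illustration, not the official evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("1868", "1868"))   # answer after fine-tuning
print(exact_match("", "1868"))       # empty answer before fine-tuning
print(token_f1("in 1868", "1868"))   # partially correct span
```

Averaging these scores over the held-out split would give a more reliable picture than inspecting one prediction.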

### Conclusion

In this workshop we've covered how to take a pre-trained BERT model and fine-tune it on the SQuAD dataset to enhance its answering capabilities. We discussed the importance of proper preprocessing, observed the model's behavior with and without fine-tuning, and highlighted key techniques in managing large-scale NLP datasets.