# **Mastering Fine-Tuning: BERT for Question Answering on SQuAD**
---

## **Overview**
This notebook demonstrates how to fine-tune a BERT model for extractive question answering on SQuAD (the Stanford Question Answering Dataset). The model is trained to locate the answer to a question as a span of text inside a given context passage. The process involves loading the dataset, tokenizing the inputs, preparing training features, and training with the `Trainer` class from the `transformers` library.

## **Install Required Libraries**
This cell installs `transformers`, which provides the pre-trained models and tokenizers, and `datasets`, which gives access to datasets such as SQuAD. In notebook environments like Google Colab, the command can be run directly in a code cell. In a local environment, make sure `pip` is up to date and that you have internet access to download the packages.

In [1]:
!pip install transformers datasets


## **1. Import Libraries**
This section imports necessary libraries:

- `torch`: PyTorch library for tensor computation and model training.
- `load_dataset`: Function to load datasets from the Hugging Face Hub.
- `AutoTokenizer` and `AutoModelForQuestionAnswering`: Classes to automatically fetch the appropriate tokenizer and model for question answering.
- `TrainingArguments` and `Trainer`: Classes used to define training parameters and handle the training loop.
- `DefaultDataCollator`: A class that prepares batches of data during training.

In [23]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import TrainingArguments, Trainer
from transformers import DefaultDataCollator

## **2. Load Dataset**
This cell loads the SQuAD dataset, in which each example pairs a question with a context passage that contains the answer. `load_dataset` returns a dictionary-like object with `train` and `validation` splits.

In [24]:
data = load_dataset("squad")

## **3. Initialize Tokenizer and Model**
The BERT tokenizer and model are instantiated using a pre-trained version of BERT (specifically the uncased version). The tokenizer converts text into a format suitable for the model (e.g., token IDs).

**Options and Alternatives**: <br>
You can choose other models like "distilbert-base-uncased" for a lighter model or "roberta-base" if you prefer a different architecture. Ensure the selected model supports the task of question answering.

In [25]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## **4. Preparing Training Features for Question-Answering Models**
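The `prepare_train_features` function preprocesses the raw examples for training. It tokenizes each question and context pair, converts the character position of each answer into start and end token positions, and returns the result in the form the model expects. The steps below walk through the function piece by piece; the complete definition follows at the end of this section.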

### **1. Tokenization**
- **Function Call**: The tokenizer is called with the questions and contexts from the `examples` batch.

- **Parameters**:
  - `max_length=384`: Caps each tokenized sequence (question, context, and special tokens together) at 384 tokens. Because `return_overflowing_tokens=True` is also set, a longer input is split into several windows instead of having the excess discarded.
  - `truncation="only_second"`: If truncation is needed, only the second sequence (the context) is truncated, so the question is always kept whole.
  - `stride=128`: The number of tokens shared by consecutive windows of a long context. The overlap makes it likely that an answer near a window boundary appears whole in at least one window.
  - `return_overflowing_tokens=True`: Returns every window of an input that exceeds the maximum length, so one example can produce several tokenized features.
  - `return_offsets_mapping=True`: Returns, for each token, its start and end character offsets in the original text.
  - `padding="max_length"`: Pads every sequence to the maximum length, ensuring uniform input size for the model.

```python
  tokenized_examples = tokenizer(
    examples["question"],
    examples["context"],
    max_length=384,
    truncation="only_second",
    stride=128,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    padding="max_length",
  )
```
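The way `max_length`, `stride`, and `return_overflowing_tokens` interact is easier to see on a toy sequence. The sketch below is plain Python and is not the tokenizer's implementation; the helper name `sliding_windows` and its `window` argument (the room left for context tokens once the question and special tokens are accounted for) are illustrative assumptions.

```python
def sliding_windows(tokens, window, stride):
    """Split `tokens` into chunks of at most `window` items,
    where consecutive chunks share `stride` items."""
    if stride >= window:
        raise ValueError("stride must be smaller than window")
    step = window - stride  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this chunk reached the end of the sequence
        start += step
    return chunks


context_tokens = list(range(10))  # stand-in for 10 context tokens
chunks = sliding_windows(context_tokens, window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two items of the previous one, which is the role `stride=128` plays for the 384-token windows used in this notebook.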

### **2. Overflow Mapping and Offsets**
- **Overflow Mapping**: The `overflow_to_sample_mapping` is extracted and removed from `tokenized_examples`. Because a long context can produce several tokenized features, this mapping records which original example each feature came from.
- **Offsets**: The `offset_mapping` is extracted and removed. It holds the start and end character positions of each token in the original text, which is what allows a character-level answer position to be converted into token positions.

```python
  sample_map = tokenized_examples.pop("overflow_to_sample_mapping")
  offset_mapping = tokenized_examples.pop("offset_mapping")
```
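To make the two mappings concrete, here is a minimal sketch that builds them by hand. It uses a whitespace splitter as a stand-in for the real tokenizer, and the names `whitespace_offsets` and `windows_per_example` are hypothetical; only the shape of the data mirrors what the tokenizer returns.

```python
import re


def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-separated word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


context = "BERT was released by Google in 2018"
offsets = whitespace_offsets(context)

# Slicing the original text with an offset pair recovers the token text.
tokens = [context[start:end] for start, end in offsets]
print(tokens)

# Suppose example 0 fits in one window and example 1 needs three windows.
# The sample map then repeats each example index once per window.
windows_per_example = [1, 3]
sample_map = [i for i, n in enumerate(windows_per_example) for _ in range(n)]
print(sample_map)
```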

### **3. Position Initialization**
- Two empty lists, `start_positions` and `end_positions`, are created to store the token indices for the beginning and ending positions of the answers within the context. These will be populated later in the function.

```python
  tokenized_examples["start_positions"] = []
  tokenized_examples["end_positions"] = []
```

### **4. Example Iteration**
- **Loop**: The function loops over the entries of `offset_mapping`, one per tokenized feature. The index `i` identifies the current feature, and `offsets` contains the character offsets of its tokens.
- **Input IDs**: The `input_ids` of the current feature are retrieved. These are the token IDs that will be fed to the model.
- **CLS Index**: The index of the `[CLS]` token, which marks the start of the input, is found. It serves as the target position when the feature contains no valid answer span.

```python
  for i, offsets in enumerate(offset_mapping):
    input_ids = tokenized_examples["input_ids"][i]
    cls_index = input_ids.index(tokenizer.cls_token_id)
```

### **5. Sequence and Answers**
- **Sequence IDs**: The `sequence_ids` method returns, for each token of the current feature, which input sequence it belongs to: 0 for the question, 1 for the context, and `None` for special tokens such as `[CLS]`, `[SEP]`, and padding.
- **Sample Index**: The index of the original example that produced the current feature is retrieved from `sample_map`.
- **Answers**: The answers of that original example are accessed. They contain the character position where the answer starts and the answer text itself.

```python
    sequence_ids = tokenized_examples.sequence_ids(i)
    sample_index = sample_map[i]
    answers = examples["answers"][sample_index]
```

### **6. No Answers Case**
- This block handles cases where there are no answers provided (i.e., the answer start list is empty).
- **Default Positions**: If no answers exist, the CLS token index is appended to both `start_positions` and `end_positions`, indicating that there is no valid answer span.

```python
    if len(answers["answer_start"]) == 0:
      tokenized_examples["start_positions"].append(cls_index)
      tokenized_examples["end_positions"].append(cls_index)
```

### **7. Valid Answers**
- If answers exist, the start and end character positions of the answer text are calculated. `start_char` is the starting character index of the answer, while `end_char` is computed by adding the length of the answer text to the start index.

```python
    else:
      start_char = answers["answer_start"][0]
      end_char = start_char + len(answers["text"][0])
```
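As a quick illustration of the character span, the sketch below uses a made-up SQuAD-style record (the context and answer are hypothetical, not taken from the dataset). Note that `end_char` is exclusive, so slicing the context with the two values returns exactly the answer text.

```python
# Hypothetical record in the same layout as a SQuAD example.
context = "The Eiffel Tower is located in Paris, France."
answers = {"text": ["Paris"], "answer_start": [context.index("Paris")]}

start_char = answers["answer_start"][0]
end_char = start_char + len(answers["text"][0])  # exclusive end position

print(start_char, end_char, repr(context[start_char:end_char]))
```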

### **8. Start Token Index**
- **Token Start Index**: This loop walks forward through `sequence_ids` until it reaches the first context token (where the value is 1). This marks where the context begins in the tokenized input, and it is the point from which the search for the answer starts.

```python
      token_start_index = 0
      while sequence_ids[token_start_index] != 1:
        token_start_index += 1
```

### **9. End Token Index**
- **Token End Index**: A similar loop starts at the last position of `input_ids` and moves backwards, skipping padding and the final `[SEP]`, until it reaches the last context token.

```python
      token_end_index = len(input_ids) - 1
      while sequence_ids[token_end_index] != 1:
        token_end_index -= 1
```
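The two loops can be tried on a hand-written `sequence_ids` list. This is a sketch under the assumption that the feature is laid out as `[CLS] question [SEP] context [SEP] padding`; the helper name `context_token_bounds` is illustrative.

```python
def context_token_bounds(sequence_ids):
    """Return the indices of the first and last context tokens (sequence id 1)."""
    start = 0
    while sequence_ids[start] != 1:
        start += 1
    end = len(sequence_ids) - 1
    while sequence_ids[end] != 1:
        end -= 1
    return start, end


# Layout: [CLS] q q q [SEP] c c c c [SEP] [PAD] [PAD]
sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, None, None, None]
first, last = context_token_bounds(sequence_ids)
print(first, last)
```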

### **10. Offset Validation**
- This conditional checks whether the answer's character span lies entirely within the part of the context covered by the current feature. If it does not (for example, because the answer fell into a different window of a long context), the CLS index is appended to both the start and end positions, indicating that this feature has no valid answer span.

```python
      if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
        tokenized_examples["start_positions"].append(cls_index)
        tokenized_examples["end_positions"].append(cls_index)
```

### **11. Valid Answer Position Calculation**
- If the character positions are valid, the function iterates through the offsets to find the token index for the start position, moving forward until it finds the correct offset that corresponds to the start character index.
- The calculated `token_start_index` is then appended to the `start_positions` list.

```python
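      else:
        # Stop at the last context token so that trailing special tokens,
        # whose offsets are (0, 0), are never scanned.
        while token_start_index <= token_end_index and offsets[token_start_index][0] <= start_char:
          token_start_index += 1
        tokenized_examples["start_positions"].append(token_start_index - 1)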
```

### **12. End Position Calculation**
- A similar process finds the end token index by iterating backward through the offsets until it passes the token that contains the end character index. The resulting `token_end_index` is adjusted by one and appended to the `end_positions` list.

```python
        while offsets[token_end_index][1] >= end_char:
          token_end_index -= 1
        tokenized_examples["end_positions"].append(token_end_index + 1)
```
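Steps 8 to 12 can be exercised together on a small hand-built feature. The sketch below is a standalone helper, not part of the notebook's function; the name `char_span_to_token_span` and the toy offsets are assumptions for illustration. It bounds the forward scan at the last context token, so an answer sitting in the final context token is not overshot by special tokens whose offsets are `(0, 0)`.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char, cls_index=0):
    """Map a character span in the context to (start_token, end_token).
    Returns (cls_index, cls_index) when the span is not inside this feature."""
    token_start = 0
    while sequence_ids[token_start] != 1:
        token_start += 1
    token_end = len(sequence_ids) - 1
    while sequence_ids[token_end] != 1:
        token_end -= 1

    # The answer must lie fully inside the context covered by this feature.
    if not (offsets[token_start][0] <= start_char and offsets[token_end][1] >= end_char):
        return cls_index, cls_index

    # Move forward past every token that starts at or before the answer start.
    while token_start <= token_end and offsets[token_start][0] <= start_char:
        token_start += 1
    # Move backward past every token that ends at or after the answer end.
    while offsets[token_end][1] >= end_char:
        token_end -= 1
    return token_start - 1, token_end + 1


# Toy feature: [CLS] where [SEP] Paris is in France [SEP]
sequence_ids = [None, 0, None, 1, 1, 1, 1, None]
offsets = [(0, 0), (0, 5), (0, 0), (0, 5), (6, 8), (9, 11), (12, 18), (0, 0)]

print(char_span_to_token_span(offsets, sequence_ids, 6, 11))   # "is in"
print(char_span_to_token_span(offsets, sequence_ids, 12, 18))  # "France"
print(char_span_to_token_span(offsets, sequence_ids, 20, 25))  # outside the window
```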

### **13. Function Return**
Finally, the function returns the modified `tokenized_examples`, which now contains the tokenized inputs along with the calculated start and end positions for the answers.

```python
  return tokenized_examples
```


In [26]:
def prepare_train_features(examples):
  tokenized_examples = tokenizer(
      examples["question"],
      examples["context"],
      max_length=384,
      truncation="only_second",
      stride=128,
      return_overflowing_tokens=True,
      return_offsets_mapping=True,
      padding="max_length",
  )

  sample_map = tokenized_examples.pop("overflow_to_sample_mapping")
  offset_mapping = tokenized_examples.pop("offset_mapping")

  tokenized_examples["start_positions"] = []
  tokenized_examples["end_positions"] = []

  for i, offsets in enumerate(offset_mapping):
    input_ids = tokenized_examples["input_ids"][i]
    cls_index = input_ids.index(tokenizer.cls_token_id)

    sequence_ids = tokenized_examples.sequence_ids(i)
    sample_index = sample_map[i]
    answers = examples["answers"][sample_index]

    if len(answers["answer_start"]) == 0:
      tokenized_examples["start_positions"].append(cls_index)
      tokenized_examples["end_positions"].append(cls_index)
    else:
      start_char = answers["answer_start"][0]
      end_char = start_char + len(answers["text"][0])

      token_start_index = 0
      while sequence_ids[token_start_index] != 1:
        token_start_index += 1

      token_end_index = len(input_ids) - 1
      while sequence_ids[token_end_index] != 1:
        token_end_index -= 1

      if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
        tokenized_examples["start_positions"].append(cls_index)
        tokenized_examples["end_positions"].append(cls_index)
      else:
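        # Stop at the last context token so that trailing special tokens,
        # whose offsets are (0, 0), are never scanned.
        while token_start_index <= token_end_index and offsets[token_start_index][0] <= start_char:
          token_start_index += 1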

        tokenized_examples["start_positions"].append(token_start_index - 1)

        while offsets[token_end_index][1] >= end_char:
          token_end_index -= 1

        tokenized_examples["end_positions"].append(token_end_index + 1)

  return tokenized_examples

## **5. Map and Tokenize the Dataset**
This line applies `prepare_train_features` to every split of the dataset. `batched=True` passes the examples to the function in batches, which is faster and also lets the function return more rows than it receives (one per window of a long context). `remove_columns` drops the original columns, which is needed here because they would no longer line up with the new number of rows.

In [27]:
tokenized_dataset = data.map(prepare_train_features, batched=True, remove_columns=data["train"].column_names)

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]
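The reason a batched function can change the number of rows is that it receives a dictionary of lists and returns a dictionary of lists, and the two do not have to be the same length. The sketch below is a plain-Python stand-in for that contract, not code from the `datasets` library; `split_long_rows` and its `max_words` parameter are hypothetical.

```python
def split_long_rows(batch, max_words=3):
    """Batched function: dict of lists in, dict of lists out (possibly more rows)."""
    pieces = []
    for text in batch["text"]:
        words = text.split()
        # A long row is split into several pieces, each becoming its own row.
        for i in range(0, len(words), max_words):
            pieces.append(" ".join(words[i:i + max_words]))
    return {"piece": pieces}


batch = {"text": ["a b c d e", "f g"]}
out = split_long_rows(batch)
print(len(batch["text"]), "rows in ->", len(out["piece"]), "rows out")
```

Since two input rows became three output rows, the original `text` column could not be kept next to `piece`, which is the situation `remove_columns` resolves.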

## **6. Set Training Arguments**
The training parameters are defined using the `TrainingArguments` class. The `output_dir` specifies the directory where model checkpoints and results will be saved. Setting `evaluation_strategy` to `"epoch"` evaluates the model at the end of each epoch (recent `transformers` releases rename this argument to `eval_strategy`). The `learning_rate` parameter sets the initial step size for the optimizer. `per_device_train_batch_size` and `per_device_eval_batch_size` define the batch sizes for training and evaluation, respectively. `num_train_epochs` is the number of passes over the training data, and `weight_decay` is a regularization parameter that helps prevent overfitting.

In [28]:
args = TrainingArguments(
    output_dir="finetune-BERT-squad",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)
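The `learning_rate` above is only the starting value. Assuming the `Trainer` defaults of a linear schedule with no warmup (an assumption about the installed version, since the cell does not set a scheduler), the rate decays to zero over the run. A minimal sketch of that schedule, with a hypothetical helper name `linear_lr`:

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 375  # illustrative run length
for step in (0, 150, 375):
    print(step, linear_lr(step, total_steps))
```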



## **7. Create a Data Collator**
A data collator assembles individual features into batches. `DefaultDataCollator` does not pad; it simply stacks the features into tensors. That is sufficient here because every sequence was already padded to `max_length` during tokenization, so all rows have the same length.

In [29]:
data_collator = DefaultDataCollator()
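Conceptually, collating equally sized features amounts to turning a list of per-example dictionaries into one dictionary of batched values. The sketch below shows that idea with plain lists; it is a simplified stand-in (the real collator returns tensors), and `simple_collate` and the toy token IDs are illustrative.

```python
def simple_collate(features):
    """Turn a list of per-example dicts into one dict of lists (no padding)."""
    keys = features[0].keys()
    return {key: [feature[key] for feature in features] for key in keys}


# Two toy features that already have the same length.
features = [
    {"input_ids": [101, 2054, 102, 0], "start_positions": 1, "end_positions": 1},
    {"input_ids": [101, 2073, 102, 0], "start_positions": 0, "end_positions": 0},
]
batch = simple_collate(features)
print(batch)
```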

## **8. Initialize the Trainer**
The `Trainer` class is instantiated with the model, the training arguments, and the datasets. The training set is limited to the first 1000 tokenized features for quicker experimentation, and the evaluation set is limited in the same way. The `data_collator` handles batch creation, and the `tokenizer` is passed so that it is saved alongside the model checkpoints.

In [30]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"].select(range(1000)),
    eval_dataset=tokenized_dataset["validation"].select(range(1000)),
    data_collator=data_collator,
    tokenizer=tokenizer,
)

## **9. Train the Model**
This command initiates the training process for the model using the previously defined parameters and datasets. During this step, the model will learn to answer questions based on the provided contexts, optimizing its parameters to improve performance on the question-answering task.
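The quantity being minimized is the average of two cross-entropy losses: one over the predicted start position and one over the predicted end position. The following is a single-example, plain-Python sketch of that computation rather than the library code; the function names are illustrative.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def qa_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start-position and end-position cross-entropy losses."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2


# A model with no preference among 384 positions scores ln(384) per example.
uniform = [0.0] * 384
loss_uniform = qa_loss(uniform, uniform, start_position=10, end_position=12)
print(round(loss_uniform, 3))
```

This gives a reference point for reading the loss values logged during training: anything well below the uniform baseline means the model is concentrating probability on the right tokens.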

In [31]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 3.241208 |
| 2 | No log | 2.411355 |
| 3 | No log | 2.328903 |


TrainOutput(global_step=375, training_loss=2.8479749348958334, metrics={'train_runtime': 399.4811, 'train_samples_per_second': 7.51, 'train_steps_per_second': 0.939, 'total_flos': 587917702656000.0, 'train_loss': 2.8479749348958334, 'epoch': 3.0})
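The `global_step=375` in the output can be reproduced from the settings used above. The arithmetic below assumes a single device and no gradient accumulation. The value `default_logging_steps = 500` is also an assumption about the `Trainer` default in the installed version; if it holds, it would explain why the training loss column reads "No log", since the run ends before the first logging step.

```python
import math

num_examples = 1000        # features selected for training
batch_size = 8             # per_device_train_batch_size
epochs = 3                 # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

default_logging_steps = 500  # assumed default, not set in this notebook
print(steps_per_epoch, total_steps, total_steps < default_logging_steps)
```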