# **Question Answering over Wikipedia Passages**

The goal of this work is to evaluate the performance of different models on an extractive question answering task over Wikipedia passages. In particular, we focus on analyzing the behavior of various BERT-based models on two widely used benchmark datasets: SQuAD v1.1 and SQuAD v2.0.

## **Libraries installation and import**

The first step is to install the required libraries and import everything needed to set up the environment.

In [None]:
!pip install torch transformers
!pip install evaluate
!pip install scikit-learn
!pip install numpy
!pip install tqdm
!pip uninstall -y datasets
!pip install datasets

In [None]:
import datasets
from datasets import load_dataset, Dataset
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast, AlbertTokenizerFast
import torch
import numpy as np
from collections import defaultdict
from transformers import TrainingArguments, BertForQuestionAnswering, Trainer, AlbertForQuestionAnswering
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers.integrations import WandbCallback
from transformers.data.metrics.squad_metrics import compute_predictions_logits
from tqdm import tqdm
import evaluate
import re
import string
from collections import Counter

## **Preprocessing utilities**

In this section, we define three functions to load and preprocess the datasets for extractive question answering.

We use the load_dataset() function from the Hugging Face datasets library, which provides ready-to-use versions of SQuAD v1.1 and v2.0. Since the official test sets are not publicly released, we adopt a common strategy:

- The original validation set is used as our test set.

- The training set is split into a new training and validation set using a 90/10 ratio.

This ensures that we can evaluate our models on held-out data while keeping the original structure of the datasets.

In [None]:
# Function to retrieve the datasets from the HuggingFace library and to split them
def obtain_dataset(dataset):
  squad = load_dataset(dataset)

  # Obtain the training set
  data = squad["train"]

  df = data.to_pandas()

  # Split the dataset into training and validation (90-10%)
  train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)

  train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
  val_dataset = Dataset.from_pandas(val_df, preserve_index=False)

  # Load the validation set as test set
  testData = squad["validation"]

  return train_dataset, val_dataset, testData

# Obtain squad v1.1 and squad v2.0 datasets
s1_trData, s1_devData, s1_testData = obtain_dataset("squad")
s2_trData, s2_devData, s2_testData = obtain_dataset("squad_v2")

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/130319 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11873 [00:00<?, ? examples/s]

The following two functions are designed to tokenize question-context pairs using a specified tokenizer.
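Both functions rely on the tokenizer's overflow mechanism: a context that does not fit in max_length tokens is split into several chunks that overlap by stride tokens. The following is a minimal pure-Python sketch of that windowing, not the tokenizer's actual implementation; the helper name `context_windows` and the example numbers (a 600-token context, 11 question tokens, 3 special tokens) are illustrative assumptions.

```python
def context_windows(num_context_tokens, capacity, overlap):
    """Return (start, end) token ranges of overlapping windows over a context.

    capacity: context tokens that fit in one chunk
              (max_length minus question and special tokens).
    overlap:  tokens shared by two consecutive chunks (the stride).
    """
    if overlap >= capacity:
        raise ValueError("overlap must be smaller than capacity")
    windows = []
    start = 0
    while True:
        end = min(start + capacity, num_context_tokens)
        windows.append((start, end))
        if end == num_context_tokens:
            break
        # The next window starts so that it shares `overlap` tokens with this one
        start += capacity - overlap
    return windows


# Hypothetical example: max_length=384, 11 question tokens, 3 special tokens
print(context_windows(600, capacity=384 - 11 - 3, overlap=128))
# [(0, 370), (242, 600)]
```

Because consecutive chunks overlap, an answer that is cut at the end of one chunk is likely to appear whole in the next one.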

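The **training_mapping** function is applied to the training split of the dataset. For every chunk produced by the tokenizer, it returns:

- input_ids

- token_type_ids

- attention_mask

- The start and end positions of the answer (in token indices)

- ids, the index of the original sample within the batch

The offset_mapping is used only internally, to convert character positions into token positions, and is removed from the output.

When no answer is present (as in SQuAD v2.0), or when the answer does not fall entirely inside the current chunk, the start and end positions are set to the position of the [CLS] token, following standard practice for unanswerable questions.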
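The conversion from a character-level answer span to token indices can be sketched on a toy example. This is a simplified, standalone illustration; the helper name `char_span_to_token_span` and the hand-written offsets and sequence ids are assumptions made for the example, not the output of a real tokenizer.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char, cls_index=0):
    """Map a character-level answer span to token indices within one chunk."""
    # First and last token that belong to the context (sequence id 1)
    ctx_start = sequence_ids.index(1)
    ctx_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # Answer not fully inside this chunk: label the chunk with [CLS]
    if offsets[ctx_start][0] > start_char or offsets[ctx_end][1] < end_char:
        return cls_index, cls_index

    # Walk forward past every token that starts at or before the answer start
    start_tok = ctx_start
    while start_tok <= ctx_end and offsets[start_tok][0] <= start_char:
        start_tok += 1
    # Walk backward past every token that ends at or after the answer end
    end_tok = ctx_end
    while end_tok >= ctx_start and offsets[end_tok][1] >= end_char:
        end_tok -= 1
    return start_tok - 1, end_tok + 1


# Hypothetical chunk: [CLS] who ? [SEP] paris is in france [SEP]
# Context string: "Paris is in France"
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 5), (6, 8), (9, 11), (12, 18), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, None]

print(char_span_to_token_span(offsets, sequence_ids, 12, 18))  # "France"   -> (7, 7)
print(char_span_to_token_span(offsets, sequence_ids, 0, 8))    # "Paris is" -> (4, 5)
print(char_span_to_token_span(offsets, sequence_ids, 30, 35))  # outside    -> (0, 0)
```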

In [None]:
# Function to map the training set into the one needed for training
def training_mapping(examples, tokenizer, max_length):

  # Tokenize the question-context pairs, splitting long contexts into overlapping chunks
  tokenized_examples = tokenizer(
      examples["question"],
      examples["context"],
      truncation="only_second",
      add_special_tokens=True,
      max_length=max_length,
      stride=128,
      return_attention_mask=True,
      return_token_type_ids=True,
      return_overflowing_tokens=True,
      return_offsets_mapping=True,
      padding="max_length",
  )

  # Obtain the overflow_to_sample mapping and the offset_mapping
  sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
  offset_mapping = tokenized_examples.pop("offset_mapping")

  # Create the start and the end positions in the dataset to return
  tokenized_examples["start_positions"] = []
  tokenized_examples["end_positions"] = []
  tokenized_examples["ids"] = []

  for i, offsets in enumerate(offset_mapping):
      input_ids = tokenized_examples["input_ids"][i]

      # Obtain the index of the [CLS] token
      cls_index = input_ids.index(tokenizer.cls_token_id)

      sequence_ids = tokenized_examples.sequence_ids(i)

      # Selection of the correct chunk and the corresponding answer
      sample_index = sample_mapping[i]
      answers = examples["answers"][sample_index]
      tokenized_examples["ids"].append(sample_index)

      # If there is no answer, set the start and end positions to [CLS]
      if len(answers["answer_start"]) == 0:
          tokenized_examples["start_positions"].append(cls_index)
          tokenized_examples["end_positions"].append(cls_index)
      else:

          # Otherwise, take the character span of the first answer (training questions have a single answer)
          start_char = answers["answer_start"][0]
          end_char = start_char + len(answers["text"][0])

          # Set the start token index to the first token of the context
          token_start_index = 0
          while sequence_ids[token_start_index] != 1:
              token_start_index += 1

          # Set the end token index to the last token of the context
          token_end_index = len(input_ids) - 1
          while sequence_ids[token_end_index] != 1:
              token_end_index -= 1

          # If the answer is not in this chunk, set the start and the end positions to [CLS]
          if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
              tokenized_examples["start_positions"].append(cls_index)
              tokenized_examples["end_positions"].append(cls_index)
          else:
              # Else, move the start and the end token index to the correct positions
              while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                  token_start_index += 1
              tokenized_examples["start_positions"].append(token_start_index - 1)
              while offsets[token_end_index][1] >= end_char:
                  token_end_index -= 1
              tokenized_examples["end_positions"].append(token_end_index + 1)

  # Return the mapped dataset
  return tokenized_examples

The **test_mapping** function is similar in structure to the training_mapping function but is applied to the validation and test splits.

While the core tokenization logic remains the same, this function prepares the data for inference and evaluation. Specifically, it returns:

- input_ids

- token_type_ids

- attention_mask

- offset_mapping, where the entries of tokens outside the context (question, special, and padding tokens) are replaced with None

- The ID of the original sample, which is used later to group the chunks of the same question, reconstruct predictions, and evaluate the results

Unlike the training function, this mapping does not compute start and end positions, as these will be predicted by the model.
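The role of the masked offset_mapping at inference time can be sketched as follows: a predicted pair of token indices is turned back into text only when both indices point inside the context. The helper names and the hand-written toy chunk are assumptions made for this illustration.

```python
def mask_non_context_offsets(offsets, sequence_ids):
    """Keep the offsets of context tokens and replace all the others with None."""
    return [off if sid == 1 else None for off, sid in zip(offsets, sequence_ids)]


def span_to_text(context, masked_offsets, start_idx, end_idx):
    """Return the context substring for a token span, or None if the span is invalid."""
    if masked_offsets[start_idx] is None or masked_offsets[end_idx] is None:
        return None
    start_char = masked_offsets[start_idx][0]
    end_char = masked_offsets[end_idx][1]
    return context[start_char:end_char]


# Hypothetical chunk: [CLS] who ? [SEP] paris is in france [SEP]
context = "Paris is in France"
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 5), (6, 8), (9, 11), (12, 18), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, 1, None]

masked = mask_non_context_offsets(offsets, sequence_ids)
print(span_to_text(context, masked, 4, 5))  # Paris is
print(span_to_text(context, masked, 1, 4))  # None (the span starts in the question)
```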

In [None]:
# Function to map the test set into the one needed for testing
def test_mapping(examples, tokenizer, max_length):
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=128,
        return_attention_mask=True,
        return_token_type_ids=True,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Obtain the overflow_to_sample mapping and the offset_mapping
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples["offset_mapping"]

    # Create the offset mapping and the ids in the dataset to return
    tokenized_examples["offset_mapping"] = []
    tokenized_examples["id"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        sequence_ids = tokenized_examples.sequence_ids(i)

        # Obtain the id of the original context from the chunk
        sample_index = sample_mapping[i]
        example_id = examples["id"][sample_index]

        # Remove the question offsets and maintain only the context ones
        cleaned_offsets = []
        for k, offset in enumerate(offsets):
            if sequence_ids[k] == 1:
                cleaned_offsets.append(offset)
            else:
                cleaned_offsets.append(None)

        # Construct the dataset to return
        tokenized_examples["offset_mapping"].append(cleaned_offsets)
        tokenized_examples["id"].append(example_id)
    return tokenized_examples

In [None]:
# Check if the GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU:", torch.cuda.get_device_name(0))
# Else we use the CPU
else:
    device = torch.device("cpu")
    print("Using CPU")

Using GPU: Tesla T4


## **Evaluation utilities**

The following functions are used to compute the Exact Match (EM) and F1 score on the SQuAD v2.0 dataset.
They are specifically designed to correctly handle both answerable and unanswerable questions.

In the case of unanswerable questions, the ground truth answer is represented as an empty string (""), and the prediction is expected to match this when no answer is found.

These metrics provide a clear and interpretable measure of model performance:

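- Exact Match (EM): measures whether the normalized predicted span exactly matches at least one of the normalized ground truth answers.

- F1 Score: measures the token-level overlap between the predicted and true answer spans as the harmonic mean of precision and recall; when several ground truth answers exist, the best score is kept.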
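A small worked example may help to see how the two metrics differ. The sketch below scores a single prediction against a single reference, assuming lowercasing, punctuation removal, and whitespace collapsing as the normalization; the function names are chosen for this illustration only.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(pred, gold):
    return int(normalize(pred) == normalize(gold))


def token_f1(pred, gold):
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    # If either side is empty, the score is 1 only when both are empty
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Prediction has one extra token: precision = 2/3, recall = 1, F1 = 0.8
print(exact_match("The Eiffel Tower.", "Eiffel Tower"))         # 0
print(round(token_f1("The Eiffel Tower.", "Eiffel Tower"), 2))  # 0.8
# Unanswerable question answered with an empty string
print(token_f1("", ""))                                         # 1.0
```

EM gives no credit to the partially correct answer, while F1 rewards the two shared tokens.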

In [None]:
# Functions to normalize the argument string
def normalize_text(s):
    def white_space_fix(text):
        return " ".join(text.split())
    def remove_punc(text):
        return "".join(ch for ch in text if ch not in string.punctuation)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_punc(lower(s)))

# Function to calculate the exact match from predictions and references
def compute_exact_match(predictions, references):
    id_to_pred = {p["id"]: p["prediction_text"] for p in predictions}
    total = 0
    exact_matches = 0

    # Retrieve information from references
    for ref in references:
        qid = ref["id"]
        gold_answers = ref["answers"]["text"]
        pred_answer = id_to_pred.get(qid, "")

        normalized_pred = normalize_text(pred_answer)
        normalized_golds = [normalize_text(ans) for ans in gold_answers]

        # Calculate the match between the predicted answer and the gold one
        match = any(normalized_pred == gold for gold in normalized_golds)
        exact_matches += int(match)
        total += 1
    # Compute the final exact match
    return 100.0 * exact_matches / total if total > 0 else 0.0

# Function to calculate the f1-score from the predictions and the references
def compute_f1_score(predictions, references):
    id_to_pred = {p["id"]: p["prediction_text"] for p in predictions}
    total = 0
    total_f1 = 0.0

    # Retrieve the information from the references
    for ref in references:
        qid = ref["id"]
        gold_answers = ref["answers"]["text"]
        pred_answer = id_to_pred.get(qid, "")
        if all(normalize_text(g).strip() == "" for g in gold_answers) and normalize_text(pred_answer).strip() == "":
            total_f1 += 1.0
            total += 1
            continue
        normalized_pred = normalize_text(pred_answer).split()
        max_f1 = 0.0

        # Verify the match with the gold answers
        for gold in gold_answers:
            normalized_gold = normalize_text(gold).split()
            common = Counter(normalized_pred) & Counter(normalized_gold)
            num_same = sum(common.values())

            if num_same == 0:
                f1 = 0.0
            else:
                # Calculate the precision, the recall and the final f1-score
                precision = num_same / len(normalized_pred) if normalized_pred else 0.0
                recall = num_same / len(normalized_gold) if normalized_gold else 0.0
                f1 = (2 * precision * recall) / (precision + recall)

            max_f1 = max(max_f1, f1)

        total_f1 += max_f1
        total += 1

    # Calculate the final f1-score
    return 100.0 * total_f1 / total if total > 0 else 0.0


In [None]:
# Function to calculate the evaluation metrics for a given model over SQuAD v1.1
def compute_metrics_v1(originalDataset, dataset, trainer):
  id_to_context_test = {example["id"]: example["context"] for example in originalDataset}

  # Retrieve the predicted start logits and end logits
  predictions = trainer.predict(dataset)
  start_logits = predictions.predictions[0]
  end_logits = predictions.predictions[1]

  # Retrieve the metric from the HuggingFace library
  squad_metric = evaluate.load("squad")

  max_answer_len = 30
  n_best = 1

  # Create a dictionary and retrieve all the needed information from the dataset
  examples_by_id = defaultdict(list)
  all_ids = dataset["id"]
  all_input_ids = dataset["input_ids"]
  all_offset_mappings = dataset["offset_mapping"]

  # Create a list with all the information
  for i in tqdm(range(len(dataset))):
    example_id = all_ids[i]
    inputs = all_input_ids[i]
    offset_mapping = all_offset_mappings[i]
    context = id_to_context_test[example_id]
    start_logit = start_logits[i]
    end_logit = end_logits[i]
    examples_by_id[example_id].append({
        "start_logit": start_logit,
        "end_logit": end_logit,
        "offset_mapping": offset_mapping,
        "context": context,
        "input_ids": inputs
    })

  final_predictions = {}

  # For every chunk retrieve the logits
  for example_id, chunks in examples_by_id.items():
      best_score = -float("inf")
      best_text = ""
      for chunk in chunks:
          start_logits = chunk["start_logit"]
          end_logits = chunk["end_logit"]
          offset_mapping = chunk["offset_mapping"]
          context = chunk["context"]

          # For every pair of start/end indices, search for the best-scoring span
          for start_idx in range(len(start_logits)):
              for end_idx in range(start_idx, min(start_idx + max_answer_len, len(end_logits))):
                  if offset_mapping[start_idx] is None or offset_mapping[end_idx] is None:
                      continue
                  score = start_logits[start_idx] + end_logits[end_idx]
                  if score > best_score:
                      start_char = offset_mapping[start_idx][0]
                      end_char = offset_mapping[end_idx][1]
                      best_text = context[start_char:end_char]
                      best_score = score
      final_predictions[example_id] = best_text

  predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]

  id_to_answers_dev = {example["id"]: example["answers"] for example in originalDataset}

  answer_map = {example["id"]: example["answers"]["text"] for example in originalDataset}

  references = []
  for k in final_predictions.keys():
    answers_start = id_to_answers_dev[k]["answer_start"]
    reference = {"id": k, "answers": {"text": answer_map[k], "answer_start": answers_start}}

    references.append(reference)

  # Calculate the metrics over the predictions and the references
  metrics = squad_metric.compute(predictions=predictions, references=references)
  return metrics

In [None]:
# Function to calculate the evaluation metrics for a given model over SQuAD v2.0
def compute_metrics_v2(originalDataset, dataset, trainer):

  # Retrieve the predicted start logits and end logits
  predictions_raw = trainer.predict(dataset)
  start_logits_all = predictions_raw.predictions[0]
  end_logits_all = predictions_raw.predictions[1]
  id_to_context_dev = {example["id"]: example["context"] for example in originalDataset}

  max_answer_len = 30
  n_best_size = 20

  # Create a dictionary and retrieve all the needed information from the dataset
  examples_by_id = defaultdict(list)
  all_ids = dataset["id"]
  all_input_ids = dataset["input_ids"]
  all_offset_mappings = dataset["offset_mapping"]

  # Create a list with all the information
  for i in tqdm(range(len(dataset))):
      example_id = all_ids[i]
      examples_by_id[example_id].append({
          "start_logit": start_logits_all[i],
          "end_logit": end_logits_all[i],
          "offset_mapping": all_offset_mappings[i],
          "context": id_to_context_dev[example_id],
          "input_ids": all_input_ids[i],
      })

  predictions = []

  # For every chunk retrieve the logits
  for example_id, chunks in examples_by_id.items():
      min_null_score = float("inf")
      best_score = -float("inf")
      best_text = ""
      prelim_predictions = []

      for chunk in chunks:
          start_logits = chunk["start_logit"]
          end_logits = chunk["end_logit"]
          offset_mapping = chunk["offset_mapping"]
          context = chunk["context"]

          null_score = start_logits[0] + end_logits[0]
          min_null_score = min(min_null_score, null_score)

          # For every pair of start/end indices, collect the candidate spans and their scores
          for start_idx in range(len(start_logits)):
              for end_idx in range(start_idx, min(start_idx + max_answer_len, len(end_logits))):
                  if offset_mapping[start_idx] is None or offset_mapping[end_idx] is None:
                      continue
                  if offset_mapping[start_idx] == (0, 0) or offset_mapping[end_idx] == (0, 0):
                      continue
                  start_char = offset_mapping[start_idx][0]
                  end_char = offset_mapping[end_idx][1]
                  if end_char < start_char:
                      continue
                  score = start_logits[start_idx] + end_logits[end_idx]
                  text = context[start_char:end_char]
                  prelim_predictions.append({
                      "text": text,
                      "score": score
                  })

      if prelim_predictions:
          sorted_preds = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)
          n_best_predictions = sorted_preds[:n_best_size]
          best_text = n_best_predictions[0]["text"]
          best_score = n_best_predictions[0]["score"]
      else:
          best_text = ""
          best_score = -float("inf")

      score_diff = min_null_score - best_score
      predictions.append({
          "id": example_id,
          "prediction_text": best_text,
          "no_answer_probability": float(score_diff)
      })

  answer_map = {example["id"]: example["answers"]["text"] for example in originalDataset}
  id_to_answers_dev = {example["id"]: example["answers"] for example in originalDataset}

  best_f1 = 0
  best_em = 0
  best_threshold = 0

  # Sweep the null-score threshold and keep the value that maximizes F1
  for thresh in np.arange(-2, 2.1, 0.1):
      temp_preds = []
      references = []
      for pred in predictions:
          temp_pred = pred.copy()
          temp_pred["prediction_text"] = "" if pred["no_answer_probability"] > thresh else pred["prediction_text"]
          temp_preds.append(temp_pred)

          k = pred["id"]
          texts = answer_map.get(k, [])
          if texts:
              answers_start = id_to_answers_dev[k]["answer_start"]
              reference = {
                  "id": k,
                  "answers": {
                      "text": texts,
                      "answer_start": answers_start
                  }
              }
          else:
              reference = {
                  "id": k,
                  "answers": {
                      "text": [""],
                      "answer_start": [0]
                  }
              }
          references.append(reference)

      # Compute the EM and F1 with the previous functions
      em_score = compute_exact_match(temp_preds, references)
      f1_score = compute_f1_score(temp_preds, references)
      if f1_score > best_f1:
          best_f1 = f1_score
          best_em = em_score
          best_threshold = thresh
  # Return the information needed
  return {
      "best_threshold": best_threshold,
      "best_f1": best_f1,
      "best_em": best_em
  }
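The threshold search at the end of compute_metrics_v2 can be illustrated on toy data. In this minimal sketch each question has a best candidate span and a score difference (null score minus best span score); the answer is replaced by an empty string when the difference exceeds the threshold. For brevity the sweep optimizes Exact Match instead of F1, and all names and numbers are hypothetical.

```python
def apply_threshold(candidates, threshold):
    """Predict "" when the null score beats the best span by more than threshold."""
    return {
        qid: ("" if cand["score_diff"] > threshold else cand["text"])
        for qid, cand in candidates.items()
    }


def exact_match(preds, golds):
    return 100.0 * sum(preds[qid] == golds[qid] for qid in golds) / len(golds)


def sweep(candidates, golds, thresholds):
    """Return the (threshold, score) pair with the highest score; ties keep the first."""
    best_threshold, best_score = None, -1.0
    for threshold in thresholds:
        score = exact_match(apply_threshold(candidates, threshold), golds)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical candidates: q1 is answerable, q2 and q3 are not
candidates = {
    "q1": {"text": "Paris", "score_diff": -1.5},
    "q2": {"text": "1999", "score_diff": 1.5},
    "q3": {"text": "blue", "score_diff": -0.5},
}
golds = {"q1": "Paris", "q2": "", "q3": ""}

print(sweep(candidates, golds, [-2, -1, 0, 1, 2]))  # (-1, 100.0)
```

A threshold that is too low discards the correct answer to q1, while one that is too high keeps the spurious answer to q3. One possible usage, which is an assumption and not what the function above does, is to select the threshold on the validation split and reuse it unchanged on the test split.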

# **BERT**

The following code manages the training, validation, and evaluation of the BERT model (bert-base-uncased) on both SQuAD v1.1 and SQuAD v2.0.

We define the training pipeline by:

- Setting the training arguments (batch size, number of epochs, learning rate, etc.) used by the Hugging Face Trainer

- Initializing the Trainer class with the model, the training arguments, and the tokenized training set

- Running training, then evaluating on the validation and test splits with the compute_metrics_v1 and compute_metrics_v2 functions, separately for the two datasets:

  - SQuAD v1.1: all questions are answerable, so no special handling is needed

  - SQuAD v2.0: includes unanswerable questions, which are handled in both the preprocessing and evaluation phases

In [None]:
# Retrieve the tokenizer for BERT-base
tokenizerFast = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Calls for mapping each dataset needed
s1_tr_dataset = s1_trData.map(training_mapping, fn_kwargs={"tokenizer": tokenizerFast, "max_length": 384}, batched=True, remove_columns=s1_trData.column_names)
s2_tr_dataset = s2_trData.map(training_mapping, fn_kwargs={"tokenizer": tokenizerFast, "max_length": 384}, batched=True, remove_columns=s2_trData.column_names)

s1_dev_dataset = s1_devData.map(test_mapping, fn_kwargs={"tokenizer": tokenizerFast, "max_length": 384}, batched=True, remove_columns=s1_devData.column_names)
s2_dev_dataset = s2_devData.map(test_mapping, fn_kwargs={"tokenizer": tokenizerFast, "max_length": 384}, batched=True, remove_columns=s2_devData.column_names)

s1_test_dataset = s1_testData.map(test_mapping, fn_kwargs={"tokenizer": tokenizerFast, "max_length": 384}, batched=True, remove_columns=s1_testData.column_names)
s2_test_dataset = s2_testData.map(test_mapping, fn_kwargs={"tokenizer": tokenizerFast, "max_length": 384}, batched=True, remove_columns=s2_testData.column_names)

Map:   0%|          | 0/78839 [00:00<?, ? examples/s]

Map:   0%|          | 0/117287 [00:00<?, ? examples/s]

Map:   0%|          | 0/8760 [00:00<?, ? examples/s]

Map:   0%|          | 0/13032 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

Map:   0%|          | 0/11873 [00:00<?, ? examples/s]

## **SQuAD v1.1**

In [None]:
# Configuration of the training arguments
training_args = TrainingArguments(
    output_dir="./s1_models/checkpoints/bert",
    num_train_epochs=3,
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,
    logging_dir='./s1_models/logs/bert',
    logging_steps=100,
    fp16=True,
    save_strategy = "epoch",
    logging_first_step=True,
    logging_strategy="epoch",
    report_to="none"
)

# Retrieve the BERT model
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Configuration of the trainer
s1_trainer = Trainer(
    model=model.to(device),
    args=training_args,
    train_dataset=s1_tr_dataset.with_format("torch"),
)
# Training of the model and subsequent saving
s1_trainer.train()
s1_trainer.save_model("./s1_models/checkpoints/bert")


Step,Training Loss
1,5.9314


KeyboardInterrupt: 

### **Validation and Evaluation**

In [None]:
# Compute the metrics for the validation part
metrics = compute_metrics_v1(s1_devData, s1_dev_dataset, s1_trainer)
print(metrics)

Downloading builder script:   0%|          | 0.00/4.53k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.32k [00:00<?, ?B/s]

100%|██████████| 1/1 [00:00<00:00, 10618.49it/s]

{'exact_match': 0.0, 'f1': 0.0}





In [None]:
# Compute the metrics for the test part
metrics = compute_metrics_v1(s1_testData, s1_test_dataset, s1_trainer)
print(metrics)

100%|██████████| 10/10 [00:00<00:00, 13906.84it/s]

{'exact_match': 0.0, 'f1': 0.0}





## **SQuAD v2.0**




In [None]:
# Configuration of the training arguments
training_args = TrainingArguments(
    output_dir="./s2_models/checkpoints/bert",
    num_train_epochs=3,
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,
    logging_dir='./s2_models/logs/bert',
    logging_steps=100,
    fp16=True,
    save_strategy = "epoch",
    logging_first_step=True,
    logging_strategy="epoch",
    report_to="none"
)

# Loading of the BERT model
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Configuration of the trainer
s2_trainer = Trainer(
    model=model.to(device),
    args=training_args,
    train_dataset=s2_tr_dataset.with_format("torch")
)

# Training of the model and subsequent saving
s2_trainer.train()
s2_trainer.save_model("./s2_models/checkpoints/bert")


Step,Training Loss
1,5.9107


KeyboardInterrupt: 

### **Validation and Evaluation**

In [None]:
# Compute the metrics for the validation part
metrics = compute_metrics_v2(s2_devData, s2_dev_dataset, s2_trainer)
best_f1 = metrics["best_f1"]
best_em = metrics["best_em"]
best_threshold = metrics["best_threshold"]
print(f"Best threshold: {best_threshold:.2f}, Best F1: {best_f1:.2f}, Best EM: {best_em:.2f}")

100%|██████████| 1/1 [00:00<00:00, 5866.16it/s]

Best threshold: -1.80, Best F1: 15.38, Best EM: 0.00





In [None]:
# Compute the metrics for the test part
metrics = compute_metrics_v2(s2_testData, s2_test_dataset, s2_trainer)
best_f1 = metrics["best_f1"]
best_em = metrics["best_em"]
best_threshold = metrics["best_threshold"]
print(f"Best threshold: {best_threshold:.2f}, Best F1: {best_f1:.2f}, Best EM: {best_em:.2f}")

100%|██████████| 10/10 [00:00<00:00, 43509.38it/s]

Best threshold: -2.00, Best F1: 40.00, Best EM: 40.00





# **ALBERT**

The following code manages the training, validation, and evaluation of the ALBERT model (albert-base-v2) on both SQuAD v1.1 and SQuAD v2.0.

As with BERT, we configure the Trainer with appropriate hyperparameters and datasets. ALBERT is a lighter variant of BERT: it shares parameters across layers and factorizes the embedding matrix, so the base model has roughly 12M parameters against roughly 110M for BERT-base, while often remaining competitive in accuracy.

We follow the same procedure:

- Define training arguments and metrics computation

- Fine-tune ALBERT on each dataset

- Evaluate its performance using Exact Match and F1, handling unanswerable questions correctly in SQuAD v2.0

In [None]:
# Retrieve the tokenizer for the ALBERT model
albert_tokenizerFast = AlbertTokenizerFast.from_pretrained("albert-base-v2")

# Map the needed datasets
s1_tr_dataset = s1_trData.map(training_mapping, fn_kwargs={"tokenizer": albert_tokenizerFast, "max_length": 384}, batched=True, remove_columns=s1_trData.column_names)
s2_tr_dataset = s2_trData.map(training_mapping, fn_kwargs={"tokenizer": albert_tokenizerFast, "max_length": 384}, batched=True, remove_columns=s2_trData.column_names)

s1_dev_dataset = s1_devData.map(test_mapping, fn_kwargs={"tokenizer": albert_tokenizerFast, "max_length": 384}, batched=True, remove_columns=s1_devData.column_names)
s2_dev_dataset = s2_devData.map(test_mapping, fn_kwargs={"tokenizer": albert_tokenizerFast, "max_length": 384}, batched=True, remove_columns=s2_devData.column_names)

s1_test_dataset = s1_testData.map(test_mapping, fn_kwargs={"tokenizer": albert_tokenizerFast, "max_length": 384}, batched=True, remove_columns=s1_testData.column_names)
s2_test_dataset = s2_testData.map(test_mapping, fn_kwargs={"tokenizer": albert_tokenizerFast, "max_length": 384}, batched=True, remove_columns=s2_testData.column_names)

Map:   0%|          | 0/78839 [00:00<?, ? examples/s]

Map:   0%|          | 0/117287 [00:00<?, ? examples/s]

Map:   0%|          | 0/8760 [00:00<?, ? examples/s]

Map:   0%|          | 0/13032 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

Map:   0%|          | 0/11873 [00:00<?, ? examples/s]

## **SQuAD v1.1**

In [None]:
# Configuration of the training arguments
training_args = TrainingArguments(
    output_dir="./s1_models/checkpoints/albert",
    num_train_epochs=2,
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    weight_decay=0.05,
    save_total_limit=3,
    logging_dir='./s1_models/logs/albert',
    logging_steps=100,
    fp16=True,
    save_strategy = "epoch",
    logging_first_step=True,
    logging_strategy="epoch",
    report_to="none"
)

# Retrieve the ALBERT v2 model
model = AlbertForQuestionAnswering.from_pretrained("albert-base-v2")

Some weights of AlbertForQuestionAnswering were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Configuration of the trainer
s1_trainer = Trainer(
    model=model.to(device),
    args=training_args,
    train_dataset=s1_tr_dataset.with_format("torch")
)

# Training of the model and subsequent saving
s1_trainer.train()
s1_trainer.save_model("./s1_models/checkpoints/albert")


Step,Training Loss
1,5.7185
2,5.7185
3,5.7185
4,4.2263
5,3.2446
6,2.5666
7,2.1004
8,1.6967
9,1.3118
10,0.8499


KeyboardInterrupt: 
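
Since the run above was interrupted, it may be cheaper to resume from the last saved checkpoint than to restart. The sketch below uses a hypothetical helper, `find_latest_checkpoint` (not part of this notebook), which looks for the `checkpoint-<step>` folders that `Trainer` writes inside `output_dir`. With `save_strategy="epoch"`, a checkpoint exists only once at least one epoch has completed, so the helper may return `None`.

```python
import os
import re
from typing import Optional

# Hypothetical helper (an assumption, not defined elsewhere in the notebook):
# find the most recent "checkpoint-<step>" folder inside output_dir.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint folder with the highest step number, or None."""
    if not os.path.isdir(output_dir):
        return None
    candidates = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    # Compare step numbers numerically so checkpoint-2000 beats checkpoint-900
    _, latest = max(candidates)
    return os.path.join(output_dir, latest)


# Assumed usage with the trainer defined above:
# last = find_latest_checkpoint("./s1_models/checkpoints/albert")
# s1_trainer.train(resume_from_checkpoint=last)  # None starts from scratch
```

`transformers.trainer_utils.get_last_checkpoint` offers similar behavior out of the box; the explicit version is shown only to make the folder convention visible.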

### **Validation and evaluation**

In [None]:
# Validation part
metrics = compute_metrics_v1(s1_devData, s1_dev_dataset, s1_trainer)
print(metrics)

In [None]:
# Test part
metrics = compute_metrics_v1(s1_testData, s1_test_dataset, s1_trainer)
print(metrics)

## **SQuAD v2.0**

In [None]:
# Configuration of the training arguments
training_args = TrainingArguments(
    output_dir="./s2_models/checkpoints/albert",
    num_train_epochs=3,
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    logging_dir='./s2_models/logs/albert',
    logging_steps=100,
    fp16=True,
    save_strategy="epoch",
    logging_first_step=True,
    logging_strategy="epoch",
    report_to="none"
)

# Load the ALBERT base v2 model
model = AlbertForQuestionAnswering.from_pretrained("albert-base-v2")

You are using a model of type albert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['bert.embeddings.LayerNorm.bias', 'bert.embeddings.LayerNorm.weight', 'bert.embeddings.position_embeddings.weight', 'bert.embeddings.token_type_embeddings.weight', 'bert.embeddings.word_embeddings.weight', 'bert.encoder.layer.0.attention.output.LayerNorm.bias', 'bert.encoder.layer.0.attention.output.LayerNorm.weight', 'bert.encoder.layer.0.attention.output.dense.bias', 'bert.encoder.layer.0.attention.output.dense.weight', 'bert.encoder.layer.0.attention.self.key.bias', 'bert.encoder.layer.0.attention.self.key.weight', 'bert.encoder.layer.0.attention.self.query.bias', 'bert.encoder.layer.0.attention.self.query.weight', 'bert.encoder.layer.0.attention.self.value.bias', 'bert.encoder.layer.0.attention.self.va

In [None]:
# Configuration of the trainer
s2_trainer = Trainer(
    model=model.to(device),
    args=training_args,
    train_dataset=s2_tr_dataset.with_format("torch")
)

# Training of the model and subsequent saving
s2_trainer.train()
s2_trainer.save_model("./s2_models/checkpoints/albert")

OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 38.12 MiB is free. Process 4542 has 14.70 GiB memory in use. Of the allocated memory 14.55 GiB is allocated by PyTorch, and 22.41 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
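
One possible way around the out-of-memory error above is to lower `per_device_train_batch_size` and compensate with the `gradient_accumulation_steps` argument of `TrainingArguments`, so that the effective batch size stays the same. The helper below, `accumulation_plan`, is a hypothetical name used only for illustration, and the numbers are examples rather than tuned values.

```python
def accumulation_plan(target_batch_size: int,
                      per_device_batch_size: int,
                      num_devices: int = 1) -> int:
    """Return the gradient_accumulation_steps needed to reach target_batch_size."""
    per_step = per_device_batch_size * num_devices
    if per_step <= 0 or target_batch_size % per_step != 0:
        raise ValueError(
            "target_batch_size must be a positive multiple of "
            "per_device_batch_size * num_devices"
        )
    return target_batch_size // per_step


# The cell above uses per_device_train_batch_size=16 on one GPU.
# Dropping to 4 samples per step needs 4 accumulation steps
# to keep the effective batch size at 16.
steps = accumulation_plan(target_batch_size=16, per_device_batch_size=4)
print(steps)  # 4

# Assumed usage (illustrative values):
# TrainingArguments(..., per_device_train_batch_size=4,
#                   gradient_accumulation_steps=steps)
```

Accumulation trades memory for time: each optimizer update now takes several forward and backward passes, but the gradient statistics are comparable to those of the larger batch.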

### **Validation and evaluation**

In [None]:
# Compute the metrics for the validation part
metrics = compute_metrics_v2(s2_devData, s2_dev_dataset, s2_trainer)
best_f1 = metrics["best_f1"]
best_em = metrics["best_em"]
best_threshold = metrics["best_threshold"]
print(f"Best threshold: {best_threshold:.2f}, Best F1: {best_f1:.2f}, Best EM: {best_em:.2f}")

100%|██████████| 1/1 [00:00<00:00, 11554.56it/s]

{'best_threshold': np.float64(-2.0), 'best_f1': 17.391304347826082, 'best_em': 0.0}





In [None]:
# Compute the metrics for the test part
metrics = compute_metrics_v2(s2_testData, s2_test_dataset, s2_trainer)
best_f1 = metrics["best_f1"]
best_em = metrics["best_em"]
best_threshold = metrics["best_threshold"]
print(f"Best threshold: {best_threshold:.2f}, Best F1: {best_f1:.2f}, Best EM: {best_em:.2f}")

100%|██████████| 10/10 [00:00<00:00, 109511.85it/s]


{'best_threshold': np.float64(-2.0), 'best_f1': 2.5, 'best_em': 0.0}
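
The `best_threshold` value reported above comes from `compute_metrics_v2`, which is defined elsewhere in the notebook. As a rough illustration of what such a threshold usually means in SQuAD v2.0 evaluation, the sketch below assumes the common convention: a question is marked unanswerable when the null score exceeds the best span score by more than the threshold. The function names, the sign convention, and the toy data are all assumptions, and the sweep scores plain exact match for brevity.

```python
import numpy as np


def apply_null_threshold(best_span_text, best_span_score, null_score, threshold):
    """Return "" (no answer) when the null score beats the best span by more than threshold."""
    return "" if (null_score - best_span_score) > threshold else best_span_text


def sweep_thresholds(examples, thresholds):
    """Pick the threshold with the highest exact match.

    examples: list of (best_span_text, best_span_score, null_score, gold_text),
    where gold_text == "" marks an unanswerable question.
    """
    best = (None, -1.0)
    for t in thresholds:
        preds = [apply_null_threshold(text, s, n, t) for text, s, n, _ in examples]
        em = 100.0 * np.mean([p == gold for p, (_, _, _, gold) in zip(preds, examples)])
        if em > best[1]:  # strict comparison keeps the first (lowest) best threshold
            best = (float(t), float(em))
    return best


# Toy data, invented for illustration only
examples = [
    ("Paris", 7.0, 1.0, "Paris"),  # confident span, answerable
    ("1066", 2.0, 5.0, ""),        # null score dominates, unanswerable
    ("blue", 3.0, 3.5, "blue"),    # borderline case, answerable
]
best = sweep_thresholds(examples, np.arange(-2.0, 2.01, 0.5))
print(best)  # (0.5, 100.0)
```

A very low threshold pushes the model to answer "no answer" for almost every question, so a best threshold sitting at the edge of the search range, as in the outputs above, can be a hint that the scores are not well calibrated or that the range should be widened.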


# **SpanBERT**

This section focuses on the training and evaluation of the SpanBERT model on the SQuAD v1.1 and SQuAD v2.0 datasets.

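SpanBERT is pre-trained by masking contiguous spans of tokens rather than individual tokens and by learning to predict each masked span from the representations at its boundaries. This makes it particularly well-suited for extractive question answering, where the answer is itself a span of the passage.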
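
To make the span-level view concrete: a question answering head such as the one fine-tuned here outputs one start logit and one end logit per token, and the predicted answer is the span with the best combined score. The sketch below is a simplified illustration with made-up logits; `best_span`, `max_answer_len`, and `n_best` are hypothetical names and not necessarily what the evaluation functions in this notebook use.

```python
import numpy as np


def best_span(start_logits, end_logits, max_answer_len=30, n_best=20):
    """Return (start_index, end_index, score) of the highest-scoring valid span."""
    start_logits = np.asarray(start_logits, dtype=float)
    end_logits = np.asarray(end_logits, dtype=float)
    # Only consider the n_best most likely start and end positions
    top_starts = np.argsort(start_logits)[::-1][:n_best]
    top_ends = np.argsort(end_logits)[::-1][:n_best]
    best = (None, None, -np.inf)
    for s in top_starts:
        for e in top_ends:
            # A valid span ends at or after its start and is not too long
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best[2]:
                best = (int(s), int(e), float(score))
    return best


# Made-up logits for a 5-token passage
start = [0.1, 0.2, 5.0, 0.3, 0.1]
end = [0.0, 4.0, 0.2, 3.5, 0.1]
print(best_span(start, end))  # (2, 3, 8.5)
```

Note that the highest end logit (index 1) is ignored because it lies before the best start position; only spans with `end >= start` are scored.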

We:

- Initialize the model and tokenizer

- Define training configurations with Trainer

- Use the same preprocessing and evaluation logic as for BERT and ALBERT

- Compare its performance with other models using EM and F1 scores
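
For reference, the EM and F1 scores mentioned above can be sketched as follows. This mirrors the standard SQuAD definition (lowercasing, stripping punctuation and articles, then comparing strings or token overlap); it is an illustrative reimplementation, not the code inside `compute_metrics_v1` or `compute_metrics_v2`.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Empty answers (SQuAD v2.0 "no answer") match only each other
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))  # 1.0
print(round(f1_score("the Eiffel Tower in Paris", "Eiffel Tower"), 4))  # 0.6667
```

In the official metric, each prediction is compared against every reference answer and the maximum score is kept; the dataset-level EM and F1 are the averages over all questions, expressed as percentages.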

In [None]:
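# Auto classes used for SpanBERT (harmless if already imported earlier)
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load the tokenizer for the SpanBERT model
model_name = "SpanBERT/spanbert-base-cased"
tokenizerSpanBert = AutoTokenizer.from_pretrained(model_name)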

# Map into the needed datasets
s1_tr_dataset = s1_trData.map(training_mapping, fn_kwargs={"tokenizer": tokenizerSpanBert, "max_length": 512}, batched=True, remove_columns=s1_trData.column_names)
s2_tr_dataset = s2_trData.map(training_mapping, fn_kwargs={"tokenizer": tokenizerSpanBert, "max_length": 512}, batched=True, remove_columns=s2_trData.column_names)

s1_dev_dataset = s1_devData.map(test_mapping, fn_kwargs={"tokenizer": tokenizerSpanBert, "max_length": 512}, batched=True, remove_columns=s1_devData.column_names)
s2_dev_dataset = s2_devData.map(test_mapping, fn_kwargs={"tokenizer": tokenizerSpanBert, "max_length": 512}, batched=True, remove_columns=s2_devData.column_names)

s1_test_dataset = s1_testData.map(test_mapping, fn_kwargs={"tokenizer": tokenizerSpanBert, "max_length": 512}, batched=True, remove_columns=s1_testData.column_names)
s2_test_dataset = s2_testData.map(test_mapping, fn_kwargs={"tokenizer": tokenizerSpanBert, "max_length": 512}, batched=True, remove_columns=s2_testData.column_names)

## **SQuAD v1.1**

In [None]:
# Configuration of the training arguments
training_args = TrainingArguments(
    output_dir="./s1_models/checkpoints/spanbert",
    num_train_epochs=4,
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    weight_decay=0.01,
    save_total_limit=4,
    logging_dir='./s1_models/logs/spanbert',
    logging_steps=100,
    fp16=True,
    save_strategy="epoch",
    logging_first_step=True,
    logging_strategy="epoch",
    report_to="none"
)

# Retrieve the SpanBERT model
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at SpanBERT/spanbert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Configuration of the trainer
s1_trainer = Trainer(
    model=model.to(device),
    args=training_args,
    train_dataset=s1_tr_dataset.with_format("torch")
)

# Training of the model and subsequent saving
s1_trainer.train()
s1_trainer.save_model("./s1_models/checkpoints/spanbert")

Step,Training Loss
1,6.2144


KeyboardInterrupt: 

### **Validation and evaluation**

In [None]:
# Compute the metrics for the validation part
metrics = compute_metrics_v1(s1_devData, s1_dev_dataset, s1_trainer)
print(metrics)

100%|██████████| 8785/8785 [00:00<00:00, 413522.93it/s]


{'exact_match': 70.81050228310502, 'f1': 84.29649837307102}


In [None]:
# Compute the metrics for the test part
metrics = compute_metrics_v1(s1_testData, s1_test_dataset, s1_trainer)
print(metrics)

100%|██████████| 10648/10648 [00:00<00:00, 419213.86it/s]


{'exact_match': 83.67076631977294, 'f1': 91.04972865448869}


## **SQuAD v2.0**

In [None]:
# Configuration of the training arguments
training_args = TrainingArguments(
    output_dir="./s2_models/checkpoints/spanbert/model2",
    num_train_epochs=4,
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    weight_decay=0.01,
    save_total_limit=3,
    logging_dir='./s2_models/logs/spanbert/model2',
    logging_steps=100,
    fp16=True,
    save_strategy="epoch",
    logging_first_step=True,
    logging_strategy="epoch",
    report_to="none"
)

# Retrieve the SpanBERT model
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at SpanBERT/spanbert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Configuration of the trainer
s2_trainer = Trainer(
    model=model.to(device),
    args=training_args,
    train_dataset=s2_tr_dataset.with_format("torch")
)

# Training of the model and subsequent saving
s2_trainer.train()
s2_trainer.save_model("./s2_models/checkpoints/spanbert/model2")


### **Validation and evaluation**

In [None]:
# Compute the metrics for the validation part
metrics = compute_metrics_v2(s2_devData, s2_dev_dataset, s2_trainer)
best_f1 = metrics["best_f1"]
best_em = metrics["best_em"]
best_threshold = metrics["best_threshold"]
print(f"Best threshold: {best_threshold:.2f}, Best F1: {best_f1:.2f}, Best EM: {best_em:.2f}")

100%|██████████| 13073/13073 [00:00<00:00, 443414.04it/s]


{'best_threshold': np.float64(-0.2999999999999985), 'best_f1': 80.83484012623705, 'best_em': 70.5340699815838}


In [None]:
# Compute the metrics for the test part
metrics = compute_metrics_v2(s2_testData, s2_test_dataset, s2_trainer)
best_f1 = metrics["best_f1"]
best_em = metrics["best_em"]
best_threshold = metrics["best_threshold"]
print(f"Best threshold: {best_threshold:.2f}, Best F1: {best_f1:.2f}, Best EM: {best_em:.2f}")

100%|██████████| 11985/11985 [00:00<00:00, 453055.14it/s]


{'best_threshold': np.float64(-2.0), 'best_f1': 82.2296015690096, 'best_em': 77.90785816558578}
