# Install Dependencies & Import Core Libraries

This part:

A. Installs required Python packages:

1. *evaluate* for metric computation.

2. *datasets* 2.14.6 or newer, together with *fsspec* pinned to 2023.9.2 to avoid filesystem bugs.

B. Imports key libraries for the QA fine-tuning workflow:

1. *datasets* to load and manipulate JSON SQuAD-style data.

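2. Hugging Face Transformers classes (fast tokenizer, XLM-RoBERTa QA model, Trainer and TrainingArguments, the TrainerCallback base class for a custom callback, the default data collator, and a cosine-schedule helper).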

3. *KFold* from scikit-learn for 5-fold cross-validation.

4. Core Python/NumPy/Torch modules for randomness control and tensor operations.

5. *OrderedDict* for deterministic mapping, and *evaluate.load* for SQuAD-v2 metrics.

These steps ensure all external dependencies are installed and all necessary modules are in scope before data processing and model training begin.

In [None]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.5


In [None]:
!pip install -U -q "datasets>=2.14.6" fsspec==2023.9.2

In [None]:
from datasets import load_dataset, DatasetDict
from transformers import (XLMRobertaTokenizerFast,
                          XLMRobertaForQuestionAnswering,
                          TrainingArguments, Trainer, TrainerCallback,
                          default_data_collator, get_cosine_schedule_with_warmup)
from sklearn.model_selection import KFold
import torch, numpy as np, random, evaluate, os
from collections import OrderedDict
from evaluate import load

# Load Model & Dataset

This part performs three preparatory tasks:

1. **Model and Tokenizer Loading**

   Downloads the tokenizer files of the Persian XLM-RoBERTa large checkpoint and instantiates its fast tokenizer for subsequent encoding. The QA model weights themselves are loaded later, once per fold.

2. **Reproducibility Setup**

   Sets a fixed random seed (42) for Python’s random, NumPy, and PyTorch to ensure consistent results across runs.

3. **Dataset Ingestion**

   Reads the custom SQuAD-style JSON files (pqa_train.json, pqa_test.json) into a Hugging Face DatasetDict called raw, using the top-level "data" field as the dataset root.
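
For reference, the nested layout that this loading step and the later flatten_squad function appear to expect can be sketched as follows. The field names are taken from the code in this notebook; the title, texts, and ids are invented placeholders, not records from pqa_train.json, and treating answer_end as an exclusive offset is an assumption.

```python
import json

# Hypothetical miniature file in the SQuAD-v2-style layout read by flatten_squad.
# Every value below is a placeholder for illustration only.
sample = {
    "data": [
        {
            "title": "Example article",
            "paragraphs": [
                {
                    "context": "Tehran is the capital of Iran.",
                    "qas": [
                        {
                            "id": 1,
                            "question": "What is the capital of Iran?",
                            "is_impossible": False,
                            "answers": [
                                {"text": "Tehran", "answer_start": 0, "answer_end": 6}
                            ],
                        },
                        {
                            "id": 2,
                            "question": "Who founded Tehran?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}

# With field="data", each element of this list becomes one dataset row.
rows = sample["data"]
context = rows[0]["paragraphs"][0]["context"]
first_answer = rows[0]["paragraphs"][0]["qas"][0]["answers"][0]

# Assumption: answer_end is exclusive, so a plain slice recovers the answer text.
extracted = context[first_answer["answer_start"]:first_answer["answer_end"]]
serialized = json.dumps(sample, ensure_ascii=False)
```

Each row therefore holds one article (title plus paragraphs), which is why a flattening step is needed before tokenization.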


In [None]:
model_name = "pedramyazdipoor/persian_xlm_roberta_large"
tokenizer   = XLMRobertaTokenizerFast.from_pretrained(model_name)
seed = 42
random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



<torch._C.Generator at 0x782a16c35e10>

In [None]:
data_files = {"train": "/content/pqa_train.json",
              "test" : "/content/pqa_test.json"}
raw = load_dataset("json", data_files=data_files, field="data")


# Flatten Data & Tokenize with Sliding Window

This part converts the raw nested JSON into a model-ready, token-level dataset:

1. **flatten_squad → article-to-row conversion**

   Iterates through each paragraph and QA pair, producing a flat table with id, question, context, and answers.

   Properly handles unanswerable questions by returning empty answer fields.

   Runs on both train and test splits, yielding raw_flat.

2. **prepare_features → token-level encoding**

   Uses the XLM-RoBERTa tokenizer with a 384-token window and 128-token stride (sliding window) to create model inputs.

   For every overflow “feature” it computes start/end token indices, marks no-answer samples (map to CLS), and stores cleaned offset_mapping for post-processing.

   Adds helper columns (example_id, context) for later metric calculation.

Finally, `tokenized_ds` holds the tokenized features for both splits (a long context yields several windows per example); its train split is what the cross-validation loop consumes.
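
The sliding-window idea can be illustrated without the tokenizer. The sketch below is a simplified model, not the tokenizer's internal algorithm: `window_spans` is a hypothetical helper, and it assumes that the stride is the number of context tokens shared by consecutive windows and that the question plus special tokens take a fixed share of each 384-token window.

```python
def window_spans(n_context_tokens, capacity, overlap):
    """Return (start, end) context-token spans, end exclusive, for overlapping windows.

    capacity: context tokens that fit in one window
    overlap:  context tokens shared by two consecutive windows (the stride)
    """
    if not 0 <= overlap < capacity:
        raise ValueError("overlap must satisfy 0 <= overlap < capacity")
    step = capacity - overlap          # how far each new window advances
    spans, start = [], 0
    while True:
        end = min(start + capacity, n_context_tokens)
        spans.append((start, end))
        if end >= n_context_tokens:    # the last window reached the end of the context
            break
        start += step
    return spans


# Assumed numbers: max_len = 384 and doc_stride = 128 come from the notebook;
# a 20-token question and 4 special tokens are hypothetical, leaving 360 context slots.
capacity = 384 - 20 - 4
spans = window_spans(1000, capacity, 128)
```

Under these assumptions a 1000-token context is covered by four windows, each starting 232 tokens after the previous one, so an answer near a window boundary is fully visible in at least one window.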


In [None]:
def flatten_squad(batch):
    out = {"id": [], "question": [], "context": [], "answers": []}
    for titles, paragraphs in zip(batch["title"], batch["paragraphs"]):
        for para in paragraphs:
            context = para["context"]
            for qa in para["qas"]:
                out["id"].append(str(qa["id"]))
                out["question"].append(qa["question"])
                out["context"].append(context)

                if qa["is_impossible"]:
                    out["answers"].append(
                        {"text": [], "answer_start": [], "answer_end": []}
                    )
                else:
                    out["answers"].append({
                        "text": [a["text"] for a in qa["answers"]],
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                        "answer_end":   [a["answer_end"]   for a in qa["answers"]],
                    })
    return out

In [None]:
train_flat = raw["train"].map(flatten_squad,
                              batched=True,
                              remove_columns=raw["train"].column_names,
                              desc="Flatten train")

test_flat  = raw["test"].map(flatten_squad,
                             batched=True,
                             remove_columns=raw["test"].column_names,
                             desc="Flatten test")

raw_flat = DatasetDict({"train": train_flat, "test": test_flat})

In [None]:
max_len, doc_stride = 384, 128

def prepare_features(examples):
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_len,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    raw_offset_mapping = tokenized.pop("offset_mapping")

    start_positions, end_positions = [], []
    example_ids, contexts, new_offset_mapping = [], [], []

    for i, offsets in enumerate(raw_offset_mapping):
        input_ids  = tokenized["input_ids"][i]
        cls_index  = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = tokenized.sequence_ids(i)
        sample_idx = sample_mapping[i]
        answers    = examples["answers"][sample_idx]

        if len(answers["text"]) == 0:
            start_positions.append(cls_index)
            end_positions.append(cls_index)
        else:
            start_char = answers["answer_start"][0]
            end_char   = answers["answer_end"][0]
            token_start_index = sequence_ids.index(1)
            token_end_index   = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

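            # The answer must lie entirely inside this window; otherwise label the feature with CLS.
            if not (offsets[token_start_index][0] <= start_char
                    and offsets[token_end_index][1] >= end_char):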
                start_positions.append(cls_index)
                end_positions.append(cls_index)
            else:
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                start_positions.append(token_start_index - 1)
                end_positions.append(token_end_index + 1)

        example_ids.append(examples["id"][sample_idx])
        contexts.append(examples["context"][sample_idx])

        cleaned_offsets = [
            (o if s == 1 else (0, 0)) for o, s in zip(offsets, sequence_ids)
        ]
        new_offset_mapping.append(cleaned_offsets)

    tokenized["start_positions"] = start_positions
    tokenized["end_positions"]   = end_positions
    tokenized["example_id"]      = example_ids
    tokenized["context"]         = contexts
    tokenized["offset_mapping"]  = new_offset_mapping

    return tokenized
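
The character-to-token alignment at the heart of prepare_features can be shown in isolation. This is a minimal sketch rather than the function above: `char_span_to_token_span` is a hypothetical helper, the offsets are hand-written instead of produced by the tokenizer, and answer_end is assumed to be an exclusive character offset.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (first_token, last_token), both inclusive.

    offsets: per-token (start, end) character offsets with an exclusive end.
             Tokens outside the context carry (0, 0), like the cleaned offset_mapping.
    Returns None when the window does not fully cover the span.
    """
    first = last = None
    for i, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == tok_end:                 # special, question, or padding token
            continue
        if first is None and tok_start <= start_char < tok_end:
            first = i
        if tok_start < end_char <= tok_end:
            last = i
    if first is None or last is None or last < first:
        return None
    return first, last


# Hand-written offsets for the context "hello world again" with one special token on each side.
offsets = [(0, 0), (0, 5), (6, 11), (12, 17), (0, 0)]
single = char_span_to_token_span(offsets, 6, 11)    # "world"
double = char_span_to_token_span(offsets, 6, 17)    # "world again"
missing = char_span_to_token_span(offsets, 20, 25)  # outside this window
```

When the helper returns None, the corresponding feature would be labeled with the CLS index, which is how the notebook marks windows that do not contain the answer.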

In [None]:
tokenized_ds = raw_flat.map(
    prepare_features,
    batched=True,
    remove_columns=raw_flat["train"].column_names,
    desc="Tokenizing",
)

Tokenizing:   0%|          | 0/9008 [00:00<?, ? examples/s]

Tokenizing:   0%|          | 0/930 [00:00<?, ? examples/s]

# Utility Functions

This group of helper functions underpins training and evaluation:

1. **postprocess_qa_predictions** – full SQuAD-style answer extractor (n-best search, CLS-based no-answer scoring).

2. **build_fast_metrics_fn / compute_fast** – lightweight EM/F1 metric (one arg-max span per feature, then the highest-scoring span per example) that skips the n-best search, keeping evaluation during training cheap.

3. **freeze_backbone** – freezes all layers of XLM-RoBERTa except the QA head + last *n* transformer layers (default = 2), enabling lightweight fine-tuning.

4. **LossPrinter (TrainerCallback)** – prints live train_loss and eval_loss every logging step for quick monitoring.
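
To make the EM and F1 numbers reported later easier to interpret, here is a simplified sketch of the two scores for a single prediction. It is not the `squad_v2` implementation: that metric also normalizes text (lowercasing, punctuation and English-article removal) and takes the best score over all reference answers, whereas this version only splits on whitespace.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the whitespace-tokenized strings are identical, else 0.0."""
    return float(prediction.split() == reference.split())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    if not pred_tokens or not ref_tokens:
        # For unanswerable questions both sides are empty, which counts as a match.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


partial = token_f1("the cat sat", "cat sat down")   # 2 shared tokens out of 3 on each side
no_answer = token_f1("", "")                        # correct "no answer" prediction
```

Because an empty prediction against an empty reference scores 1.0 on both metrics, EM and F1 coincide on the unanswerable subset, which matches the identical NoAns values in the logs.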


In [None]:
def postprocess_qa_predictions(examples, features, raw_preds,
                               n_best_size=20, max_answer_len=30):
    all_start_logits, all_end_logits = raw_preds
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = OrderedDict()
    for i, f in enumerate(features):
        features_per_example.setdefault(f["example_id"], []).append(i)

    predictions = OrderedDict()
    for example_id, feature_indices in features_per_example.items():
        min_null_score = None
        valid_answers = []
        for idx in feature_indices:
            start_logits = all_start_logits[idx]
            end_logits   = all_end_logits[idx]
            offset_mapping = features[idx]["offset_mapping"]
            cls_index = features[idx]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or feature_null_score < min_null_score:
                min_null_score = feature_null_score

            start_indexes = np.argsort(start_logits)[-1:-n_best_size-1:-1]
            end_indexes   = np.argsort(end_logits)[-1:-n_best_size-1:-1]
            for s in start_indexes:
                for e in end_indexes:
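                    if e < s or (e - s + 1) > max_answer_len:
                        continue
                    # Skip tokens outside the context (their offsets were zeroed to (0, 0)).
                    if offset_mapping[s][1] == 0 or offset_mapping[e][1] == 0:
                        continue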
                    start_char, _ = offset_mapping[s]
                    _, end_char   = offset_mapping[e]
                    valid_answers.append({
                        "score": start_logits[s] + end_logits[e],
                        "text": features[idx]["context"][start_char:end_char]
                    })

        if valid_answers:
            best_answer = max(valid_answers, key=lambda x: x["score"])
        else:
            best_answer = {"text": "", "score": 0.0}

        # Predict "no answer" only when the null (CLS) score beats the best span.
        if min_null_score > best_answer["score"]:
            predictions[example_id] = ""
        else:
            predictions[example_id] = best_answer["text"]

    return predictions
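
The n-best search inside postprocess_qa_predictions can be reduced to a few lines of plain Python. The sketch below uses a hypothetical helper, `best_span`, on made-up logits; it leaves out the offset lookup and the CLS-based null score and only shows how start and end candidates are paired and filtered.

```python
def best_span(start_logits, end_logits, n_best_size=20, max_answer_len=30):
    """Return (start, end, score) of the best valid span, or None if there is none."""
    # Indices of the n highest start and end logits, best first.
    top_starts = sorted(range(len(start_logits)),
                        key=lambda i: start_logits[i], reverse=True)[:n_best_size]
    top_ends = sorted(range(len(end_logits)),
                      key=lambda i: end_logits[i], reverse=True)[:n_best_size]
    best = None
    for s in top_starts:
        for e in top_ends:
            # Discard spans that end before they start or that are too long.
            if e < s or (e - s + 1) > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Made-up logits for a four-token feature.
start_logits = [0.1, 2.0, 0.3, 0.0]
end_logits = [0.0, 0.5, 3.0, 0.2]
unrestricted = best_span(start_logits, end_logits, n_best_size=2)
one_token_only = best_span(start_logits, end_logits, n_best_size=2, max_answer_len=1)
```

Restricting the search to the top candidates keeps the pairing quadratic in n_best_size rather than in the sequence length.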

In [None]:
fast_metric = load("squad_v2")

In [None]:
def build_fast_metrics_fn(examples, features):
    ctx_list   = features["context"]
    offsets    = features["offset_mapping"]
    ex_ids     = features["example_id"]
    cls_indices = [ids.index(tokenizer.cls_token_id) for ids in features["input_ids"]]

    def compute_fast(p):
        start_logits, end_logits = p.predictions

        best_per_example = {}

        for i in range(len(start_logits)):
            s = int(np.argmax(start_logits[i]))
            e = int(np.argmax(end_logits[i]))
            cls = cls_indices[i]

            if s == cls or e == cls or s > e:
                score = start_logits[i][cls] + end_logits[i][cls]
                text  = ""
            else:
                score = start_logits[i][s] + end_logits[i][e]
                start_char = offsets[i][s][0]
                end_char   = offsets[i][e][1]
                text = ctx_list[i][start_char:end_char]

            eid = ex_ids[i]
            if (eid not in best_per_example) or (score > best_per_example[eid][0]):
                best_per_example[eid] = (score, text)

        preds = [{"id": k, "prediction_text": v[1],
                  "no_answer_probability": 0.0}
                 for k, v in best_per_example.items()]

        refs = [{"id": ex["id"],
                 "answers": {"text": ex["answers"]["text"],
                             "answer_start": ex["answers"]["answer_start"]}}
                for ex in examples]

        return fast_metric.compute(predictions=preds, references=refs)

    return compute_fast
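
The best-per-example step in compute_fast is a simple reduction over windows. The following sketch uses a hypothetical helper, `aggregate_best`, and made-up scores to show that reduction on its own.

```python
def aggregate_best(feature_predictions):
    """Keep the highest-scoring prediction for each example.

    feature_predictions: iterable of (example_id, score, text), one entry per window.
    """
    best = {}
    for example_id, score, text in feature_predictions:
        if example_id not in best or score > best[example_id][0]:
            best[example_id] = (score, text)
    return {example_id: text for example_id, (score, text) in best.items()}


# Made-up window-level predictions: example "q1" spans two windows.
window_predictions = [
    ("q1", 1.0, ""),         # window without the answer, so the empty text wins locally
    ("q1", 3.5, "Tehran"),   # window containing the answer
    ("q2", 0.2, ""),         # unanswerable question
]
final_predictions = aggregate_best(window_predictions)
```

One consequence of this design is that an empty prediction from one window can displace a real span from another window whenever its CLS score is higher, which a full n-best search with a null-score threshold handles more carefully.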

In [None]:
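def freeze_backbone(model, unfreeze_last_n=2):
    # Encoder layers are 0-indexed, so the last n are num_layers-n ... num_layers-1.
    num_layers = model.config.num_hidden_layers  # 24 for this checkpoint
    trainable = [f"layer.{num_layers - i}." for i in range(1, unfreeze_last_n + 1)]
    for name, param in model.named_parameters():
        # Train only the QA head and the selected encoder layers; freeze the rest.
        param.requires_grad = "qa_outputs" in name or any(t in name for t in trainable)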
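
The name-matching rule used for freezing can be checked without loading the model. In this sketch `trainable_names` is a hypothetical helper and the parameter names are illustrative strings patterned after the usual Hugging Face naming, not a dump of the real checkpoint.

```python
def trainable_names(param_names, num_layers, unfreeze_last_n=2):
    """Return the parameter names that would stay trainable under the freezing rule."""
    # The trailing dot prevents "layer.2." from matching "layer.22." or "layer.23.".
    keep = [f"layer.{num_layers - i}." for i in range(1, unfreeze_last_n + 1)]
    return [name for name in param_names
            if "qa_outputs" in name or any(k in name for k in keep)]


# Illustrative names only; the real model has many more parameters.
names = [
    "roberta.embeddings.word_embeddings.weight",
    "roberta.encoder.layer.0.attention.self.query.weight",
    "roberta.encoder.layer.2.output.dense.weight",
    "roberta.encoder.layer.22.output.dense.weight",
    "roberta.encoder.layer.23.output.dense.weight",
    "qa_outputs.weight",
]
kept = trainable_names(names, num_layers=24, unfreeze_last_n=2)
```

With 24 layers and unfreeze_last_n=2, only layers 22 and 23 and the QA head remain trainable, so the optimizer updates a small fraction of the network.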

In [None]:
class LossPrinter(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kw):
        if not logs: return
        if "loss" in logs:
            print(f"[{state.global_step}] train_loss={logs['loss']:.4f}")
        if "eval_loss" in logs:
            print(f"[{state.global_step}] val_loss  ={logs['eval_loss']:.4f}")

# 5-Fold Cross-Validation Training Loop

This part orchestrates end-to-end fine-tuning with cross-validation:

1. **Split definition** – obtains 5 shuffled folds (KFold) over the 9008 flat train examples.

2. **Per-fold data selection** – maps example indices to matching tokenized “feature” slices, generating train_dataset and eval_dataset.

3. **Model init & layer freezing** – reloads a fresh Persian XLM-RoBERTa QA model each fold, unfreezing only the QA head + last two transformer layers for efficient fine-tuning.

4. **TrainingArguments setup** – 10 epochs, cosine LR scheduler, FP16, epoch-level evaluation/saving, live loss logging every 100 steps.

5. **Trainer construction** – injects:

   * fast EM/F1 metric specific to this fold (build_fast_metrics_fn)
   * LossPrinter callback for real-time loss display.

6. **Train → Evaluate → Record** – runs trainer.train(), evaluates on the fold’s validation split, prints metrics, and appends them to all_fold_metrics for later averaging.
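
Steps 1 and 2 split at the example level and only then collect the matching windows, so that windows from the same question never land on both sides of a fold. The sketch below illustrates that idea in plain Python: `kfold_indices` is a hypothetical stand-in for scikit-learn's KFold (its fold membership will differ), and the ids and window counts are made up.

```python
import random


def kfold_indices(n_examples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs over shuffled example indices."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val


def features_for(example_ids, feature_example_ids):
    """Indices of all windows (features) that belong to the given examples."""
    wanted = set(example_ids)
    return [i for i, eid in enumerate(feature_example_ids) if eid in wanted]


# Made-up data: 10 examples, every second one split into two windows (15 features).
example_ids = [f"q{k}" for k in range(10)]
feature_example_ids = [eid for k, eid in enumerate(example_ids) for _ in range(1 + k % 2)]

fold_sizes = []
for train_idx, val_idx in kfold_indices(len(example_ids)):
    train_feats = features_for([example_ids[i] for i in train_idx], feature_example_ids)
    val_feats = features_for([example_ids[i] for i in val_idx], feature_example_ids)
    assert not set(train_feats) & set(val_feats)   # no window is shared between the splits
    fold_sizes.append((len(train_feats), len(val_feats)))
```

Splitting the tokenized features directly would instead risk placing overlapping windows of one context in both training and validation data, inflating the validation scores.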


In [None]:
n_examples = len(raw_flat["train"])
kf = KFold(n_splits=5, shuffle=True, random_state=42)
example_indices = np.arange(n_examples)

**Note:**

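I interrupted training because of GPU limitations: the run stopped during the first fold, while saving the epoch-6 checkpoint, well short of the 10 epochs configured in TrainingArguments. The best checkpoint saved in that partial run still reaches about 67.4% EM and 82.1 F1 on the test set.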

In [None]:
all_fold_metrics = []

for fold, (train_idx, val_idx) in enumerate(kf.split(example_indices), 1):
    train_examples = raw_flat["train"].select(train_idx)
    val_examples   = raw_flat["train"].select(val_idx)

    train_id_set = set(train_examples["id"])
    val_id_set   = set(val_examples["id"])

    all_eids = tokenized_ds["train"]["example_id"]
    train_feat_idx = [i for i, eid in enumerate(all_eids) if eid in train_id_set]
    val_feat_idx   = [i for i, eid in enumerate(all_eids) if eid in val_id_set]

    train_dataset = tokenized_ds["train"].select(train_feat_idx)
    eval_dataset  = tokenized_ds["train"].select(val_feat_idx)

    eval_examples = val_examples
    eval_features = eval_dataset

    model = XLMRobertaForQuestionAnswering.from_pretrained(model_name)
    freeze_backbone(model, unfreeze_last_n=2)

    steps_per_epoch = len(train_dataset) // 8
    total_steps = steps_per_epoch * 10
    warmup_steps = int(total_steps * 0.1)

    args = TrainingArguments(
        output_dir=f"/content/qa_ckpt/fold{fold}",
        num_train_epochs=10,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        learning_rate=2e-5,
        weight_decay=0.01,
        gradient_checkpointing=True,
        eval_strategy="epoch",
        save_strategy="epoch",
        fp16=True,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        logging_strategy="steps",
        logging_steps=100,
        disable_tqdm=False,
        report_to="none",
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,
        seed=42,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
        compute_metrics=build_fast_metrics_fn(eval_examples, eval_features),
        callbacks=[LossPrinter()],
    )

    trainer.train()
    metrics = trainer.evaluate()
    print(metrics)
    all_fold_metrics.append(metrics)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--pedramyazdipoor--persian_xlm_roberta_large/snapshots/aec1d0acc3730fd0edf6e84c4b382e8583b47ee3/config.json
Model config XLMRobertaConfig {
  "architectures": [
    "XLMRobertaForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "language": "english",
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "name": "XLMRoberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.53.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}

loading wei

| Epoch | Training Loss | Validation Loss | Exact | F1 | Total | HasAns Exact | HasAns F1 | HasAns Total | NoAns Exact | NoAns F1 | NoAns Total | Best Exact | Best Exact Thresh | Best F1 | Best F1 Thresh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.6623 | 1.543832 | 50.721421 | 68.611451 | 1802 | 35.186656 | 60.792561 | 1259 | 86.740331 | 86.740331 | 543 | 50.721421 | 0.0 | 68.611451 | 0.0 |
| 2 | 1.6889 | 1.493759 | 51.220866 | 69.007064 | 1802 | 35.980937 | 61.438228 | 1259 | 86.556169 | 86.556169 | 543 | 51.220866 | 0.0 | 69.007064 | 0.0 |
| 3 | 1.549 | 1.478227 | 51.498335 | 69.111371 | 1802 | 36.219222 | 61.428667 | 1259 | 86.924494 | 86.924494 | 543 | 51.498335 | 0.0 | 69.111371 | 0.0 |
| 4 | 1.5592 | 1.467171 | 51.442841 | 69.226679 | 1802 | 36.29865 | 61.752562 | 1259 | 86.556169 | 86.556169 | 543 | 51.442841 | 0.0 | 69.226679 | 0.0 |
| 5 | 1.5682 | 1.460859 | 51.442841 | 69.191647 | 1802 | 36.536934 | 61.940705 | 1259 | 86.003683 | 86.003683 | 543 | 51.442841 | 0.0 | 69.191647 | 0.0 |
| 6 | 1.6098 | 1.45861 | 51.664817 | 69.253686 | 1802 | 36.854647 | 62.029501 | 1259 | 86.003683 | 86.003683 | 543 | 51.664817 | 0.0 | 69.253686 | 0.0 |


[100] train_loss=2.0320
[200] train_loss=2.0173
[300] train_loss=1.8904
[400] train_loss=1.8117
[500] train_loss=1.8016
[600] train_loss=1.8592
[700] train_loss=1.7538
[800] train_loss=1.7496
[900] train_loss=1.6623


The following columns in the Evaluation set don't have a corresponding argument in `XLMRobertaForQuestionAnswering.forward` and have been ignored: context, offset_mapping, example_id. If context, offset_mapping, example_id are not expected by `XLMRobertaForQuestionAnswering.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1877
  Batch size = 8


[942] val_loss  =1.5438


Saving model checkpoint to /content/qa_ckpt/fold1/checkpoint-942
Configuration saved in /content/qa_ckpt/fold1/checkpoint-942/config.json
Model weights saved in /content/qa_ckpt/fold1/checkpoint-942/model.safetensors
tokenizer config file saved in /content/qa_ckpt/fold1/checkpoint-942/tokenizer_config.json
Special tokens file saved in /content/qa_ckpt/fold1/checkpoint-942/special_tokens_map.json


[1000] train_loss=1.6476
[1100] train_loss=1.6948
[1200] train_loss=1.6867
[1300] train_loss=1.6587
[1400] train_loss=1.6347
[1500] train_loss=1.6450
[1600] train_loss=1.6717
[1700] train_loss=1.5805
[1800] train_loss=1.6889



***** Running Evaluation *****
  Num examples = 1877
  Batch size = 8


[1884] val_loss  =1.4938


Saving model checkpoint to /content/qa_ckpt/fold1/checkpoint-1884
Configuration saved in /content/qa_ckpt/fold1/checkpoint-1884/config.json
Model weights saved in /content/qa_ckpt/fold1/checkpoint-1884/model.safetensors
tokenizer config file saved in /content/qa_ckpt/fold1/checkpoint-1884/tokenizer_config.json
Special tokens file saved in /content/qa_ckpt/fold1/checkpoint-1884/special_tokens_map.json


[1900] train_loss=1.6428
[2000] train_loss=1.5724
[2100] train_loss=1.5819
[2200] train_loss=1.6103
[2300] train_loss=1.6453
[2400] train_loss=1.6477
[2500] train_loss=1.7760
[2600] train_loss=1.6445
[2700] train_loss=1.6036
[2800] train_loss=1.5490



***** Running Evaluation *****
  Num examples = 1877
  Batch size = 8


[2826] val_loss  =1.4782


Saving model checkpoint to /content/qa_ckpt/fold1/checkpoint-2826
Configuration saved in /content/qa_ckpt/fold1/checkpoint-2826/config.json
Model weights saved in /content/qa_ckpt/fold1/checkpoint-2826/model.safetensors
tokenizer config file saved in /content/qa_ckpt/fold1/checkpoint-2826/tokenizer_config.json
Special tokens file saved in /content/qa_ckpt/fold1/checkpoint-2826/special_tokens_map.json


[2900] train_loss=1.5973
[3000] train_loss=1.6647
[3100] train_loss=1.7120
[3200] train_loss=1.6272
[3300] train_loss=1.6099
[3400] train_loss=1.6661
[3500] train_loss=1.5602
[3600] train_loss=1.5711
[3700] train_loss=1.5592



***** Running Evaluation *****
  Num examples = 1877
  Batch size = 8


[3768] val_loss  =1.4672


Saving model checkpoint to /content/qa_ckpt/fold1/checkpoint-3768
Configuration saved in /content/qa_ckpt/fold1/checkpoint-3768/config.json
Model weights saved in /content/qa_ckpt/fold1/checkpoint-3768/model.safetensors
tokenizer config file saved in /content/qa_ckpt/fold1/checkpoint-3768/tokenizer_config.json
Special tokens file saved in /content/qa_ckpt/fold1/checkpoint-3768/special_tokens_map.json


[3800] train_loss=1.5723
[3900] train_loss=1.6815
[4000] train_loss=1.5495
[4100] train_loss=1.5143
[4200] train_loss=1.6329
[4300] train_loss=1.6383
[4400] train_loss=1.5908
[4500] train_loss=1.5682
[4600] train_loss=1.6741
[4700] train_loss=1.5682



***** Running Evaluation *****
  Num examples = 1877
  Batch size = 8


[4710] val_loss  =1.4609


Saving model checkpoint to /content/qa_ckpt/fold1/checkpoint-4710
Configuration saved in /content/qa_ckpt/fold1/checkpoint-4710/config.json
Model weights saved in /content/qa_ckpt/fold1/checkpoint-4710/model.safetensors
tokenizer config file saved in /content/qa_ckpt/fold1/checkpoint-4710/tokenizer_config.json
Special tokens file saved in /content/qa_ckpt/fold1/checkpoint-4710/special_tokens_map.json


[4800] train_loss=1.5559
[4900] train_loss=1.6070
[5000] train_loss=1.5798
[5100] train_loss=1.7150
[5200] train_loss=1.5651
[5300] train_loss=1.5100
[5400] train_loss=1.6577
[5500] train_loss=1.6036
[5600] train_loss=1.6098



***** Running Evaluation *****
  Num examples = 1877
  Batch size = 8


[5652] val_loss  =1.4586


Saving model checkpoint to /content/qa_ckpt/fold1/checkpoint-5652
Configuration saved in /content/qa_ckpt/fold1/checkpoint-5652/config.json


KeyboardInterrupt: 
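
The learning-rate schedule requested in the training arguments (linear warmup over the first 10% of steps, then cosine decay) can be sketched as a multiplier on the base rate of 2e-5. `lr_multiplier` is a hypothetical helper that follows the commonly used half-cosine formula; the step counts assume the 942 optimizer steps per epoch visible in the checkpoint names and the configured 10 epochs, and the Trainer's exact rounding of the warmup length may differ.

```python
import math


def lr_multiplier(step, total_steps, warmup_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed so far, from 0 to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half a cosine period: 1 at the end of warmup, 0 at the final step.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


steps_per_epoch = 942                  # from checkpoint-942 at the end of epoch 1
total_steps = steps_per_epoch * 10     # num_train_epochs=10
warmup_steps = int(total_steps * 0.1)  # warmup_ratio=0.1

base_lr = 2e-5
lr_at_start = base_lr * lr_multiplier(0, total_steps, warmup_steps)
lr_after_warmup = base_lr * lr_multiplier(warmup_steps, total_steps, warmup_steps)
lr_at_end = base_lr * lr_multiplier(total_steps, total_steps, warmup_steps)
```

Under these assumptions warmup ends exactly at the first epoch boundary, and because the run was interrupted in epoch 6 the learning rate never decayed to the low values of the final epochs.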

# Evaluate Best Fold Checkpoint on *pqa_test.json*

This part runs the full evaluation pipeline for the test split:

1. **Flatten & Tokenize Test Data**

   Converts pqa_test.json to flat SQuAD format (test_flat) and encodes it (tokenized_test) with the same sliding-window strategy as training.

2. **Locate Best Checkpoint**

   Retrieves the path saved in trainer.state.best_model_checkpoint (printed for verification).

3. **Load Fine-Tuned Model**

   Initialises best_model from that checkpoint.

4. **Create a Test-Only Trainer**

   No training—just evaluation—with the fast EM/F1 metric built specifically for the test features.

5. **Run Evaluation & Report Metrics**

   Prints Exact Match and F1, followed by every key in the results dictionary (e.g. eval_loss, eval_exact, eval_f1, eval_total).

This reports the performance of the best saved checkpoint on the held-out pqa_test set. Because training was interrupted, that checkpoint comes from fold 1, the only fold that ran.


In [None]:
test_flat = raw["test"].map(
    flatten_squad,
    batched=True,
    remove_columns=raw["test"].column_names,
    desc="Flatten test",
)

In [None]:
tokenized_test = test_flat.map(
    prepare_features,
    batched=True,
    remove_columns=test_flat.column_names,
    desc="Tokenizing test",
)

In [None]:
best_ckpt = "/content/qa_ckpt/fold1"

In [None]:
best_ckpt = trainer.state.best_model_checkpoint
print("Best checkpoint for this fold:", best_ckpt)

Best checkpoint for this fold: /content/qa_ckpt/fold1/checkpoint-3768


In [None]:
best_model = XLMRobertaForQuestionAnswering.from_pretrained(best_ckpt)

test_trainer = Trainer(
    model=best_model,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    compute_metrics=build_fast_metrics_fn(test_flat, tokenized_test),
)

results = test_trainer.evaluate(tokenized_test)

loading configuration file /content/qa_ckpt/fold1/checkpoint-3768/config.json

loading weights file /content/qa_ckpt/fold1/checkpoint-3768/model.safetensors
All model checkpoint weights were used when initial

Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


In [None]:
print(f"Exact Match: {results['eval_exact']}\nF1 Score: {results['eval_f1']}")

Exact Match: 67.41935483870968
F1 Score: 82.14093160157432


In [None]:
for field in results.keys():
    print(f"{field}: {results[field]}")

eval_loss: 1.2159231901168823
eval_model_preparation_time: 0.0063
eval_exact: 67.41935483870968
eval_f1: 82.14093160157432
eval_total: 930
eval_HasAns_exact: 58.525345622119815
eval_HasAns_f1: 79.55616956906937
eval_HasAns_total: 651
eval_NoAns_exact: 88.17204301075269
eval_NoAns_f1: 88.17204301075269
eval_NoAns_total: 279
eval_best_exact: 67.41935483870968
eval_best_exact_thresh: 0.0
eval_best_f1: 82.14093160157437
eval_best_f1_thresh: 0.0
eval_runtime: 18.8013
eval_samples_per_second: 49.996
eval_steps_per_second: 6.276
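
As a consistency check on the numbers above, the overall scores should equal the size-weighted average of the answerable (HasAns) and unanswerable (NoAns) subsets. The short calculation below uses only values printed in this output; the `weighted` helper is introduced here for illustration.

```python
# Values copied from the evaluation output above.
has_ans = {"exact": 58.525345622119815, "f1": 79.55616956906937, "total": 651}
no_ans = {"exact": 88.17204301075269, "f1": 88.17204301075269, "total": 279}


def weighted(metric):
    """Size-weighted average of a metric over the two subsets."""
    total = has_ans["total"] + no_ans["total"]
    return (has_ans[metric] * has_ans["total"]
            + no_ans[metric] * no_ans["total"]) / total


overall_exact = weighted("exact")   # compare with eval_exact
overall_f1 = weighted("f1")         # compare with eval_f1
```

Both values agree with eval_exact and eval_f1, and the breakdown shows that the unanswerable questions (279 of 930) are handled considerably better than the answerable ones.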
