#Importing Required Libraries

In [1]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.3


In [2]:
import os
# # Disabling Wandb (a tool for experiment tracking and model logging, which we are not going to use)
# os.environ["WANDB_DISABLED"] = "false"

import json
import pandas as pd
import evaluate
import torch
import torch.nn as nn

from tqdm import tqdm
from torch.utils.data import Dataset
from transformers import AutoTokenizer
from transformers import AutoModel
from torch.utils.data import DataLoader
from transformers import get_scheduler
from transformers import TrainingArguments, Trainer
from transformers import DefaultDataCollator

#Loading the Dataset

The following function uses the parameter 'path' to load the json file from:

In [3]:
def load_json(path):
    with open(path, encoding="utf-8") as f:
        raw_json = json.load(f)

    return raw_json

Downloading dataset files from pquad github to colab ./content path:

In [4]:
!git clone https://github.com/AUT-NLP/PQuAD.git

# make folder if not exists '/content/Data'
!mkdir -p /content/
!cp PQuAD/Dataset/* /content/
!rm -rf PQuAD

Cloning into 'PQuAD'...
remote: Enumerating objects: 27, done.[K
remote: Counting objects: 100% (27/27), done.[K
remote: Compressing objects: 100% (22/22), done.[K
remote: Total 27 (delta 9), reused 15 (delta 3), pack-reused 0 (from 0)[K
Receiving objects: 100% (27/27), 5.71 MiB | 15.86 MiB/s, done.
Resolving deltas: 100% (9/9), done.


Using `load_json()` function to load train, validation and test sets. You can change the path if needed.

In [5]:
raw_train = load_json('Train.json')
raw_val = load_json('Validation.json')
raw_test = load_json('Test.json')

Taking a look at a sample from the dataset:

In [6]:
print(json.dumps(raw_train['data'][0], indent=2, ensure_ascii=False))

{
  "title": "آرسنال",
  "paragraphs": [
    {
      "qas": [
        {
          "question": "موقعیت جغرافی باشگاه فوتبال آرسنال را بگویید؟",
          "id": "101001",
          "answers": [
            {
              "text": "شمال شهر لندن",
              "answer_start": 86
            }
          ],
          "is_impossible": false
        },
        {
          "question": "لیگ برتر انگلستان موفق به کسب چند عنوان قهرمانی در جام حذفی فوتبال انگلستان شده است؟",
          "id": "101002",
          "answers": [
            {
              "text": "۱۴",
              "answer_start": 173
            }
          ],
          "is_impossible": false
        },
        {
          "question": "بیشترین بازی بدون باخت پیاپی متعلق به کدام باشگاه است؟",
          "id": "101003",
          "answers": [
            {
              "text": "باشگاه فوتبال انگلیسی",
              "answer_start": 61
            }
          ],
          "is_impossible": false
        },
        {
          "question":

The following function flattens the json datasets to DataFrame with keys as columns:

In [7]:
def json_to_dataframe(raw_json):
    flattened_data = []

    for data in raw_train['data']:
        for paragraph in data['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                # Create a dictionary for each question-answer pair
                row = {
                    'title': data['title'],
                    'context': context,
                    'question': qa['question'],
                    'id': qa['id'],
                    'is_impossible': qa['is_impossible']
                }
                if qa['answers']:
                    row['answer_text'] = qa['answers'][0]['text']
                    row['answer_start'] = qa['answers'][0]['answer_start']
                else:
                    row['answer_text'] = None
                    row['answer_start'] = None
                flattened_data.append(row)

    # Convert to DataFrame
    df = pd.DataFrame(flattened_data)

    # Reorder columns for clarity
    columns = ['title', 'context', 'question', 'id', 'answer_text', 'answer_start', 'is_impossible']
    df = df[columns]

    return df

To easily work with the dataset, we need to flatten them into DataFrames:

In [8]:
df_train = json_to_dataframe(raw_train)
df_val = json_to_dataframe(raw_val)
df_test = json_to_dataframe(raw_test)

Checking the result:

In [9]:
df_train.head()

Unnamed: 0,title,context,question,id,answer_text,answer_start,is_impossible
0,آرسنال,باشگاه فوتبال آرسنال (به انگلیسی: Arsenal Foo...,موقعیت جغرافی باشگاه فوتبال آرسنال را بگویید؟,101001,شمال شهر لندن,86.0,False
1,آرسنال,باشگاه فوتبال آرسنال (به انگلیسی: Arsenal Foo...,لیگ برتر انگلستان موفق به کسب چند عنوان قهرمان...,101002,۱۴,173.0,False
2,آرسنال,باشگاه فوتبال آرسنال (به انگلیسی: Arsenal Foo...,بیشترین بازی بدون باخت پیاپی متعلق به کدام باش...,101003,باشگاه فوتبال انگلیسی,61.0,False
3,آرسنال,باشگاه فوتبال آرسنال (به انگلیسی: Arsenal Foo...,باشگاه فوتبال آرسنال موفق به کسب چند عنوان قهر...,101004,۱۳,119.0,False
4,آرسنال,باشگاه فوتبال آرسنال (به انگلیسی: Arsenal Foo...,باشگاه فوتبال آرسنال چند عنوان قهرمانی در جام ...,101005,۱۶,214.0,False


#Preprocess

The preprocess part includes tokenizing the input sentences to words and embedding each word to vectors which semantically relate.

Tokenizer is responsible for converting text to input tokens for the model. Loading tokenizer compatible with Persian BERT model:

In [10]:
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


config.json:   0%|          | 0.00/434 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/1.22M [00:00<?, ?B/s]

The following function does tokenizing and embedding parts and also calculates answer start and answer ending indices for each question. Then it returns a dictionary of tensors compatible with BERT based models:

In [None]:
def preprocess(dataset):
    input_ids = []
    attention_masks = []
    token_type_ids = []
    start_positions = []
    end_positions = []

    for i in tqdm(range(len(dataset))):
        question = dataset.iloc[i]['question']
        context = dataset.iloc[i]['context']
        answer_text = dataset.iloc[i]['answer_text']
        answer_start = dataset.iloc[i]['answer_start']
        is_impossible = dataset.iloc[i]['is_impossible']

        # Tokenize with offset mapping to align character and token positions
        encoding = tokenizer(
            question,
            context,
            truncation=True,
            max_length=512,
            padding='max_length',
            return_offsets_mapping=True,
            return_tensors="pt"
        )

        offset_mapping = encoding['offset_mapping'][0]
        input_id = encoding['input_ids'][0]
        attention_mask = encoding['attention_mask'][0]
        token_type_id = encoding['token_type_ids'][0]

        # Default start and end positions
        start_token = 0
        end_token = 0

        if not is_impossible:
            start_char = answer_start
            end_char = answer_start + len(answer_text)

            # Loop through the offsets to find start and end token positions
            for idx, (start, end) in enumerate(offset_mapping):
                if start <= start_char < end:
                    start_token = idx
                if start < end_char <= end:
                    end_token = idx

        # Append each field to the batch lists
        input_ids.append(input_id)
        attention_masks.append(attention_mask)
        token_type_ids.append(token_type_id)
        start_positions.append(torch.tensor(start_token))
        end_positions.append(torch.tensor(end_token))

    # Return everything as a dict of lists of tensors
    return {
        'input_ids': torch.stack(input_ids),
        'attention_mask': torch.stack(attention_masks),
        'token_type_ids': torch.stack(token_type_ids),
        'start_positions': torch.stack(start_positions),
        'end_positions': torch.stack(end_positions)
    }


Calling `preprocess` function on DataFrames to apply the operations on each of them. Also, each output is saved as a .pt file to load them in case of missing the following 3 cells (to speed up execution):

In [12]:
tokenized_train = preprocess(df_train)
torch.save(tokenized_train, "tokenized_train.pt")

100%|██████████| 63994/63994 [23:51<00:00, 44.70it/s]


In [13]:
tokenized_val = preprocess(df_val)
torch.save(tokenized_val, "tokenized_val.pt")

100%|██████████| 63994/63994 [23:55<00:00, 44.58it/s]


In [14]:
tokenized_test = preprocess(df_test)
torch.save(tokenized_test, "tokenized_test.pt")

100%|██████████| 63994/63994 [24:23<00:00, 43.72it/s]


Defining a custom Dataset class to be used with PyTorch DataLoader (This class prepares each data sample (question, context, answer) for model input). It also provides methods to show dataset size and return a specific sample:

In [15]:
class QADataset(Dataset):
    def __init__(self, tokenized_data):
        self.input_ids = tokenized_data['input_ids']
        self.attention_mask = tokenized_data['attention_mask']
        self.token_type_ids = tokenized_data['token_type_ids']
        self.start_positions = tokenized_data['start_positions']
        self.end_positions = tokenized_data['end_positions']

    def __len__(self):
        return self.input_ids.size(0)

    def __getitem__(self, index):
        return {
            'input_ids': self.input_ids[index],
            'attention_mask': self.attention_mask[index],
            'token_type_ids': self.token_type_ids[index],
            'start_positions': self.start_positions[index],
            'end_positions': self.end_positions[index]
        }

#Model

Checking device type (either cuda or cpu) before training:

In [16]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

Defining a custom question-answering model based on BERT architecture. This model is designed to predict the start and end positions of an answer span within a given context:

In [17]:
class BertForQA(nn.Module):
    def __init__(self, model_name):
        super(BertForQA, self).__init__()

        # Load the pre-trained BERT model from Hugging Face Transformers
        self.bert = AutoModel.from_pretrained(model_name)

        # Add a linear layer to predict start and end logits (2 outputs per token)
        self.qa_outputs = nn.Linear(self.bert.config.hidden_size, 2)

        # Define loss function: CrossEntropyLoss is standard for classification tasks
        self.loss_fct = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask=None, token_type_ids=None, start_positions=None, end_positions=None, return_loss=True):
        # Pass inputs through BERT to get contextualized token embeddings
        outputs = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

        # Extract the last hidden states (sequence of token embeddings)
        sequence_output = outputs.last_hidden_state

        # Apply linear layer to get start and end logits for each token
        logits = self.qa_outputs(sequence_output)

        # Split logits into start and end logits along the last dimension
        start_logits, end_logits = logits.split(1, dim=-1)

        # Remove last singleton dimension to get shape (batch_size, sequence_length)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        output = {"start_logits": start_logits, "end_logits": end_logits}

        if start_positions is not None and end_positions is not None:
            # Compute loss for start and end positions separately
            start_loss = self.loss_fct(start_logits, start_positions)
            end_loss = self.loss_fct(end_logits, end_positions)

            loss = (start_loss + end_loss) / 2
            output["loss"] = loss

        return output

Creating a DataLoader for batching and shuffling training and validation data:

In [18]:
train_dataset = QADataset(tokenized_train)
train_loader = DataLoader(train_dataset)

val_dataset = QADataset(tokenized_val)
val_loader = DataLoader(val_dataset)

Initializing the model and transferring it to the GPU (or back to the CPU) with PyTorch `.to_device()` method:

In [19]:
model = BertForQA("HooshvareLab/bert-base-parsbert-uncased").to(device)

pytorch_model.bin:   0%|          | 0.00/654M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/654M [00:00<?, ?B/s]

Defining a function to calculate squad specified metrics:

In [20]:
squad_metric = evaluate.load("squad")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    formatted_predictions = [{"id": str(i), "prediction_text": p} for i, p in enumerate(predictions)]
    formatted_references = [{"id": str(i), "answers": {"text": [r], "answer_start": [0]}} for i, r in enumerate(labels)]

    metrics = squad_metric.compute(predictions=formatted_predictions, references=formatted_references)

    return metrics

Downloading builder script:   0%|          | 0.00/4.53k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.32k [00:00<?, ?B/s]

Setting training arguments to be used during training:

In [21]:
training_args = TrainingArguments(
    output_dir="parsbert_new",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=3e-5,
    num_train_epochs=2,
    warmup_ratio=0.1,
    group_by_length=True,
    weight_decay=0.01,
    logging_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=1,
    eval_strategy="steps",
    eval_steps=500,
    fp16=True if torch.cuda.is_available() else False,
    # load_best_model_at_end=True,
    # metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="tensorboard",
    logging_dir="./logs"
)

We are going to use `DefaultDataCollator()`, which does not do any padding or truncation. It assumes all input samples are of the same length. If the input samples are not of the same length, this would throw errors.

In [22]:
data_collator = DefaultDataCollator()

Defining `Trainer()` with parsbert model and our dataset and training it using the arguments:

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    # compute_metrics=compute_metrics
)

trainer.train()

  trainer = Trainer(


Step,Training Loss,Validation Loss
500,2.2987,0.931787
1000,0.9638,0.713591
1500,0.8476,0.602333
2000,0.7773,0.492316
2500,0.5333,0.426707
3000,0.524,0.378403
3500,0.522,0.345991
4000,0.5081,0.326427


TrainOutput(global_step=4000, training_loss=0.8718599624633789, metrics={'train_runtime': 6947.6795, 'train_samples_per_second': 18.422, 'train_steps_per_second': 0.576, 'total_flos': 0.0, 'train_loss': 0.8718599624633789, 'epoch': 2.0})

#Evaluation

applying the preprocessing steps on test set:

In [29]:
test_dataset = QADataset(tokenized_test)

Setting model to evaluation mode:

In [30]:
model.eval()

BertForQA(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(100000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_af

Defining DataLoader for the test set:

In [31]:
test_loader = DataLoader(test_dataset, batch_size=16)

Prediction on test set using the fine-tuned model:

In [32]:
predictions = []
references = []

# Disable gradient calculation for evaluation
with torch.no_grad():
    for batch in tqdm(test_loader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        token_type_ids = batch["token_type_ids"].to(device)
        start_positions = batch["start_positions"].to(device)
        end_positions = batch["end_positions"].to(device)

        # Get model predictions
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )

        start_logits = outputs["start_logits"]
        end_logits = outputs["end_logits"]

        # For each example in the batch, find the best start and end token
        for i in range(input_ids.size(0)):
            start_idx = torch.argmax(start_logits[i]).item()
            end_idx = torch.argmax(end_logits[i]).item()

            # Ensure start <= end, otherwise adjust
            if start_idx > end_idx:
                start_idx, end_idx = end_idx, start_idx

            # Decode the predicted answer span back into text
            answer_ids = input_ids[i][start_idx:end_idx+1]
            pred_answer = tokenizer.decode(answer_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
            predictions.append(pred_answer)

            # Decode the ground truth answer span for reference (same way it was constructed in preprocess)
            true_start = start_positions[i].item()
            true_end = end_positions[i].item()
            true_answer_ids = input_ids[i][true_start:true_end+1]
            true_answer = tokenizer.decode(true_answer_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
            references.append(true_answer)

100%|██████████| 4000/4000 [07:38<00:00,  8.72it/s]


Format predictions and references for squad metric:

In [33]:
formatted_predictions = [{"id": str(i), "prediction_text": p} for i, p in enumerate(predictions)]
formatted_references = [{"id": str(i), "answers": {"text": [r], "answer_start": [0]}} for i, r in enumerate(references)]

Computing metrics using evaluate library:

In [34]:
metrics = squad_metric.compute(predictions=formatted_predictions, references=formatted_references)

print(f"Exact Match (EM): {metrics['exact_match']:.2f}")
print(f"F1 Score: {metrics['f1']:.2f}")

Exact Match (EM): 82.92
F1 Score: 69.60


#Exceptions Handling

Since transformer models like ParsBERT have a 512-token input limit, some context-question pairs exceed this maximum length. To handle this, we used `truncation=True` and `max_length=512` during tokenization. If the answer span was truncated and could not be mapped to token positions, we defaulted the start_position and end_position to 0. This ensured that training would not crash due to misaligned or missing answers. These samples were treated as unanswerable, and their number was logged. This approach helped maintain stability during training, though it may slightly impact performance due to lost valid samples.

#Observations

The model achieved an Exact Match (EM) of 82.92 and an F1 score of 69.60 on the test set, which is quite impressive considering the context of the original BERT paper. In that paper, the base BERT model (BERT-Base, uncased) achieved around 80.8 EM and 88.5 F1 on the SQuAD v1.1 dataset. While our EM score slightly exceeds BERT-Base's reported EM, the F1 score is notably lower. This discrepancy may stem from differences in dataset language (our model uses ParsBERT on Persian data, PQuAD), preprocessing nuances, or fine-tuning dynamics such as learning rate, max sequence length, or training epochs. Overall, our EM result indicates strong localization of answers, but the lower F1 suggests the model may struggle to predict answers with token-level completeness or precision compared to the original BERT on English datasets.