**Installing requirements**

In [None]:
!pip install evaluate





**Disable Weights & Biases (W&B) reporting, which would otherwise require an API key**

In [None]:
import os

os.environ["WANDB_DISABLED"] = "true"
os.environ["WANDB_MODE"] = "disabled"

**Importing libraries**

In [None]:
import csv
import json

import numpy as np
import pandas as pd
import torch
from datasets import load_dataset, Dataset
from sklearn.model_selection import KFold
import evaluate
from transformers import (
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer,
    default_data_collator,
    AutoTokenizer,
    TrainerCallback
)
from tqdm.notebook import tqdm

**Model loading**

In [None]:
model_ckpt = "pedramyazdipoor/parsbert_question_answering_PQuAD"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt, use_fast=True)

In [None]:
model = AutoModelForQuestionAnswering.from_pretrained(model_ckpt)

Some weights of the model checkpoint at pedramyazdipoor/parsbert_question_answering_PQuAD were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at pedramyazdipoor/parsbert_question_answering_PQuAD and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be 

**Data loading and dataset construction**

In [None]:
def load_pqa(path):
    raw = load_dataset("json", data_files=path, field="data")["train"]
    flat_rows = []
    for art in raw:
        for para in art["paragraphs"]:
            ctx = para["context"]
            for qa in para["qas"]:
                flat_rows.append({
                    "id":           str(qa["id"]),
                    "question":     qa["question"],
                    "context":      ctx,
                    "is_impossible": qa["is_impossible"],
                    "answer_text":   qa["answers"][0]["text"]        if not qa["is_impossible"] else "",
                    "answer_start":  qa["answers"][0]["answer_start"] if not qa["is_impossible"] else 0,
                })

    return Dataset.from_list(flat_rows)
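To make the flattening step concrete, here is a minimal sketch that applies the same logic to a tiny in-memory record, using plain Python lists instead of `datasets`. The sample record and the helper name `flatten` are hypothetical; only the field names are taken from `load_pqa`.

```python
# Hypothetical SQuAD-v2-style record; field names mirror those read by load_pqa.
sample = {
    "data": [{
        "paragraphs": [{
            "context": "Tehran is the capital of Iran.",
            "qas": [
                {"id": 1, "question": "What is the capital of Iran?",
                 "is_impossible": False,
                 "answers": [{"text": "Tehran", "answer_start": 0}]},
                {"id": 2, "question": "Who founded Tehran?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}


def flatten(articles):
    """Turn nested article -> paragraph -> qa records into one row per question."""
    rows = []
    for art in articles:
        for para in art["paragraphs"]:
            for qa in para["qas"]:
                answerable = not qa["is_impossible"]
                # Unanswerable questions get an empty answer at offset 0.
                first = qa["answers"][0] if answerable else {"text": "", "answer_start": 0}
                rows.append({
                    "id": str(qa["id"]),
                    "question": qa["question"],
                    "context": para["context"],
                    "is_impossible": qa["is_impossible"],
                    "answer_text": first["text"],
                    "answer_start": first["answer_start"],
                })
    return rows


rows = flatten(sample["data"])
for row in rows:
    print(row["id"], repr(row["answer_text"]), row["answer_start"])
```

Each row carries its own copy of the paragraph context, so `answer_start` remains a valid character offset into `row["context"]`.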

In [None]:
train_ds = load_pqa("/kaggle/input/pqa-dataset/pqa_train.json")
test_ds  = load_pqa("/kaggle/input/pqa-dataset/pqa_test.json")

In [None]:
print(train_ds)
print(train_ds[0])

Dataset({
    features: ['id', 'question', 'context', 'is_impossible', 'answer_text', 'answer_start'],
    num_rows: 9008
})
{'id': '1', 'question': 'شرکت فولاد مبارکه در کجا واقع شده است', 'context': 'شرکت فولاد مبارکۀ اصفهان، بزرگ\u200cترین واحد صنعتی خصوصی در ایران و بزرگ\u200cترین مجتمع تولید فولاد در کشور ایران است، که در شرق شهر مبارکه قرار دارد. فولاد مبارکه هم\u200cاکنون محرک بسیاری از صنایع بالادستی و پایین\u200cدستی است. فولاد مبارکه در ۱۱ دوره جایزۀ ملی تعالی سازمانی و ۶ دوره جایزۀ شرکت دانشی در کشور رتبۀ نخست را بدست آورده\u200cاست و همچنین این شرکت در سال ۱۳۹۱ برای نخستین\u200cبار به عنوان تنها شرکت ایرانی با کسب امتیاز ۶۵۴ تندیس زرین جایزۀ ملی تعالی سازمانی را از آن خود کند. شرکت فولاد مبارکۀ اصفهان در ۲۳ دی ماه ۱۳۷۱ احداث شد و اکنون بزرگ\u200cترین واحدهای صنعتی و بزرگترین مجتمع تولید فولاد در ایران است. این شرکت در زمینی به مساحت ۳۵ کیلومتر مربع در نزدیکی شهر مبارکه و در ۷۵ کیلومتری جنوب غربی شهر اصفهان واقع شده\u200cاست. مصرف آب این کارخانه در کمترین میزان خود، ۱٫۵٪ از 

**Tokenization and conversion of the dataset to model features**

In [None]:
max_len = 384
stride = 128
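These two values control how a long context is split into overlapping windows. The sketch below is a rough estimate of the window count, assuming a BERT-style pair layout with three special tokens (`[CLS]`, `[SEP]`, `[SEP]`) and assuming `stride` is the number of context tokens shared by consecutive windows. The function name `estimate_windows` and the token counts in the example are hypothetical, and the real tokenizer may differ in edge cases.

```python
import math


def estimate_windows(n_context_tokens, n_question_tokens,
                     max_len=384, stride=128, n_special=3):
    """Approximate number of overlapping windows for one question/context pair."""
    # Context tokens that fit into one window next to the question.
    budget = max_len - n_question_tokens - n_special
    if n_context_tokens <= budget:
        return 1
    # Each further window advances by (budget - stride) new context tokens.
    step = budget - stride
    if step <= 0:
        raise ValueError("stride leaves no room for new context tokens")
    return 1 + math.ceil((n_context_tokens - budget) / step)


print(estimate_windows(300, 12))  # short context: fits in one window
print(estimate_windows(600, 12))  # longer context: needs a second window
```

With these settings most contexts fit in a single window, which is consistent with the fold log further down: 7,230 + 1,808 = 9,038 features for 9,008 examples.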

In [None]:
def encode(examples):
    enc = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_len,
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map   = enc.pop("overflow_to_sample_mapping")
    offset_map   = enc["offset_mapping"]

    ids, starts, ends, ctxs, ex_ids = [], [], [], [], []

    for i, offsets in enumerate(offset_map):
        orig_idx   = sample_map[i]
        ex_ids.append(examples["id"][orig_idx])
        ctxs.append(examples["context"][orig_idx])

        token_start = token_end = 0

        if not examples["is_impossible"][orig_idx]:
            seq_ids   = enc.sequence_ids(i)
            ctx_start = seq_ids.index(1)
            ctx_end   = len(seq_ids) - 1 - seq_ids[::-1].index(1)

            ans_char_start = examples["answer_start"][orig_idx]
            ans_char_end   = ans_char_start + len(examples["answer_text"][orig_idx])

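            # Label the window only if the whole answer lies inside it; otherwise keep (0, 0).
            if offsets[ctx_start][0] <= ans_char_start and ans_char_end <= offsets[ctx_end][1]: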
                for idx in range(ctx_start, ctx_end + 1):
                    if offsets[idx][0] <= ans_char_start < offsets[idx][1]:
                        token_start = idx
                    if offsets[idx][0] < ans_char_end <= offsets[idx][1]:
                        token_end = idx
                        break

        ids.append(examples["id"][orig_idx])
        starts.append(token_start)
        ends.append(token_end)

    enc["id"] = ids
    enc["start_positions"] = starts
    enc["end_positions"] = ends
    enc["context"] = ctxs
    enc["example_id"] = ex_ids

    return enc
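The core of `encode` is the mapping from a character span to token indices through the offset mapping. The following is a minimal standalone sketch of that step, using hand-written offsets instead of a real tokenizer. The helper name `char_span_to_token_span` and the toy offsets are hypothetical; the sketch labels a window `(0, 0)` whenever the answer is not fully contained in it.

```python
def char_span_to_token_span(offsets, ctx_start, ctx_end, ans_start, ans_end):
    """Map a character span [ans_start, ans_end) to (start_token, end_token).

    Returns (0, 0) when the answer is not fully inside the context window.
    """
    if not (offsets[ctx_start][0] <= ans_start and ans_end <= offsets[ctx_end][1]):
        return 0, 0
    start_tok = end_tok = None
    for idx in range(ctx_start, ctx_end + 1):
        lo, hi = offsets[idx]
        if lo <= ans_start < hi:
            start_tok = idx          # token that contains the first answer character
        if lo < ans_end <= hi:
            end_tok = idx            # token that contains the last answer character
            break
    if start_tok is None or end_tok is None:
        return 0, 0
    return start_tok, end_tok


# Toy layout: 0 = [CLS], 1-2 = question, 3 = [SEP], 4-7 = context, 8 = [SEP].
context = "Tehran is the capital"
offsets = [(0, 0), (0, 4), (5, 8), (0, 0),
           (0, 6), (7, 9), (10, 13), (14, 21), (0, 0)]

print(char_span_to_token_span(offsets, 4, 7, 14, 21))  # "capital"
print(char_span_to_token_span(offsets, 4, 7, 7, 13))   # "is the"
print(char_span_to_token_span(offsets, 4, 7, 14, 30))  # runs past the window
```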

In [None]:
train_feat = train_ds.map(
    encode,
    batched=True,
    remove_columns=train_ds.column_names
)
test_feat = test_ds.map(
    encode,
    batched=True,
    remove_columns=test_ds.column_names
)


In [None]:
print(train_feat[0].keys())

**Convert model logits to answer text**

In [None]:
def postprocess(predictions, features):
    start_logits, end_logits = predictions
    answers = []
    for i, (s_log, e_log) in enumerate(zip(start_logits, end_logits)):
        s = int(np.argmax(s_log)); e = int(np.argmax(e_log))
        if s == 0 and e == 0:
            answers.append({"id": str(features["id"][i]), "prediction_text": ""})
        else:
            text = tokenizer.decode(
                features["input_ids"][i][s:e+1],
                skip_special_tokens=True,
                clean_up_tokenization_spaces=True,
            )
            answers.append({"id": str(features["id"][i]), "prediction_text": text.strip()})
    return answers

**Compact EM & F1 computation**

In [None]:
fast_metric = evaluate.load("squad_v2")

In [None]:
def build_fast_metrics_fn(examples, features):
    ctxs        = features["context"]
    offsets     = features["offset_mapping"]
    ex_ids      = features["example_id"]
    cls_indices = [ids.index(tokenizer.cls_token_id) for ids in features["input_ids"]]

    def compute_fast(pred_pack):
        start_logits, end_logits = pred_pack.predictions
        best = {}

        for i in range(len(start_logits)):
            s = int(np.argmax(start_logits[i]))
            e = int(np.argmax(end_logits[i]))
            cls = cls_indices[i]

            if s == cls or e == cls or s > e:
                score = start_logits[i][cls] + end_logits[i][cls]
                text  = ""
            else:
                score = start_logits[i][s] + end_logits[i][e]
                start_char = offsets[i][s][0]
                end_char   = offsets[i][e][1]
                text = ctxs[i][start_char:end_char]

            eid = ex_ids[i]
            if (eid not in best) or (score > best[eid][0]):
                best[eid] = (score, text)

        preds = [
            {"id": k, "prediction_text": v[1], "no_answer_probability": 0.0}
            for k, v in best.items()
        ]
        refs = [
            {"id": ex["id"],
             "answers": {
                 "text": [ex["answer_text"]] if ex["answer_text"] else [],
                 "answer_start": [ex["answer_start"]] if ex["answer_text"] else [],
             },
            }
            for ex in examples
        ]
        return fast_metric.compute(predictions=preds, references=refs)

    return compute_fast
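For intuition about what the metric reports, here is a simplified sketch of exact match and token-level F1 for a single prediction. It splits on whitespace only, so it is an approximation: the `squad_v2` metric applies additional text normalization before comparing. The function names `exact_match` and `token_f1` are hypothetical.

```python
from collections import Counter


def exact_match(pred, ref):
    """1.0 if prediction and reference have identical whitespace tokens, else 0.0."""
    return float(pred.split() == ref.split())


def token_f1(pred, ref):
    """Harmonic mean of token precision and recall between prediction and reference."""
    p, r = pred.split(), ref.split()
    if not p or not r:
        # For unanswerable questions, only an empty prediction counts as correct.
        return float(p == r)
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)


print(exact_match("steel plant", "steel plant"))
print(round(token_f1("the steel plant", "steel plant"), 3))
print(token_f1("", ""))
```

This also shows why NoAns exact and NoAns F1 are always equal in the logs: for an empty reference both scores are either 0 or 1.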

**Log train/val loss to CSV during training**

In [None]:
class LossTrackerCallback(TrainerCallback):
    def __init__(self, csv_path: str):
        self.csv_path = csv_path
        os.makedirs(os.path.dirname(self.csv_path), exist_ok=True)

        with open(self.csv_path, "w", newline="") as f:
            csv.writer(f).writerow(["step", "train_loss", "eval_loss"])

        self.history = {"step": [], "train_loss": [], "eval_loss": []}

    def on_log(self, args, state, control, logs=None, **kw):
        if logs is None:
            return
        step = state.global_step
        train_loss = logs.get("loss")
        eval_loss  = logs.get("eval_loss")
        if train_loss is None and eval_loss is None:
            return

        self.history["step"].append(step)
        self.history["train_loss"].append(train_loss)
        self.history["eval_loss"].append(eval_loss)

        with open(self.csv_path, "a", newline="") as f:
            csv.writer(f).writerow([step, train_loss, eval_loss])

    def on_train_end(self, *a, **kw):
        df = pd.DataFrame(self.history)
        print("\nSample:\n", df.head())
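Because the Trainer logs training and evaluation losses in separate events, each CSV row written by the callback has one of the two loss columns empty. The sketch below shows one way to read the file back into separate series for plotting or inspection. The helper name `split_loss_log` is hypothetical, and the demo rows reuse a few values from the fold 1 log.

```python
import csv
import os
import tempfile


def split_loss_log(csv_path):
    """Read a loss log and return (train_points, eval_points) as (step, loss) lists."""
    train, evals = [], []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            step = int(row["step"])
            if row["train_loss"]:          # empty string means "not logged"
                train.append((step, float(row["train_loss"])))
            if row["eval_loss"]:
                evals.append((step, float(row["eval_loss"])))
    return train, evals


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "loss_log.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["step", "train_loss", "eval_loss"])
        writer.writerow([1, 5.843, None])        # csv writes None as an empty field
        writer.writerow([500, 2.4143, None])
        writer.writerow([500, None, 2.318261])
    train_points, eval_points = split_loss_log(path)

print(train_points)
print(eval_points)
```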

**GPU check**

In [None]:
print("CUDA available:", torch.cuda.is_available())
print("GPU name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "—")

**5-Fold Training Loop**

In [None]:
kfold   = KFold(n_splits=5, shuffle=True, random_state=42)
results = []

for fold, (tr_idx, vl_idx) in enumerate(kfold.split(train_feat)):
    print(f"Fold {fold+1}/5 — {len(tr_idx)} train • {len(vl_idx)} val")

    vl_set = train_feat.select(vl_idx.tolist())

    val_ids = set(vl_set["id"])
    raw_val = train_ds.filter(lambda ex: ex["id"] in val_ids)

    tr_set = train_feat.select(tr_idx.tolist())

    model = AutoModelForQuestionAnswering.from_pretrained(model_ckpt)

    model.to("cuda")

    args = TrainingArguments(
        output_dir=f"/kaggle/working/fold{fold}",
        learning_rate=2e-5,
        num_train_epochs=5,
        gradient_checkpointing=False,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        weight_decay=0.01,
        warmup_ratio=0.1,
        lr_scheduler_type="cosine",
        fp16=True,
        logging_strategy="steps",
        logging_steps=100,
        logging_first_step=True,
        eval_strategy="steps",
        eval_steps=500,
        disable_tqdm=False,
        save_strategy="steps",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        logging_dir=f"/kaggle/working/fold{fold}/tb_logs",
        logging_nan_inf_filter=True,
        seed=fold,
        report_to="none",  # the string "none" disables all logging integrations
    )

    loss_tracker = LossTrackerCallback(
        csv_path=f"/kaggle/working/fold{fold}/loss_log.csv"
    )

    eval_features = vl_set

    metrics_fn = build_fast_metrics_fn(
                 examples = raw_val,
                 features = vl_set
             )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tr_set,
        eval_dataset=vl_set,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
        compute_metrics = metrics_fn,
        callbacks=[loss_tracker],
    )

    trainer.train()
    fold_metrics = trainer.evaluate()
    results.append(fold_metrics)
    print(f"Fold {fold+1} metrics:", fold_metrics)

    del model, trainer
    torch.cuda.empty_cache()

Fold 1/5 — 7230 train • 1808 val



Step,Training Loss,Validation Loss,Exact,F1,Total,Hasans Exact,Hasans F1,Hasans Total,Noans Exact,Noans F1,Noans Total,Best Exact,Best Exact Thresh,Best F1,Best F1 Thresh
500,2.4143,2.318261,32.190265,47.34336,1808,19.415943,41.039301,1267,62.107209,62.107209,541,32.245575,0.0,47.39867,0.0
1000,1.5408,2.16722,38.938053,53.85958,1808,23.756906,45.049819,1267,74.491682,74.491682,541,38.938053,0.0,53.85958,0.0
1500,1.4039,2.110079,39.823009,54.523607,1808,22.178374,43.156023,1267,81.146026,81.146026,541,39.823009,0.0,54.523607,0.0
2000,0.7838,2.508455,40.431416,55.461659,1808,24.704025,46.152075,1267,77.264325,77.264325,541,40.431416,0.0,55.461659,0.0
2500,0.8312,2.472323,40.099558,55.960686,1808,24.151539,46.785257,1267,77.449168,77.449168,541,40.099558,0.0,55.960686,0.0
3000,0.4008,3.197503,38.440265,55.736728,1808,25.414365,50.096293,1267,68.946396,68.946396,541,38.440265,0.0,55.736728,0.0
3500,0.4056,3.137213,39.988938,56.324661,1808,24.861878,48.172839,1267,75.415896,75.415896,541,39.988938,0.0,56.324661,0.0
4000,0.2759,3.426843,39.988938,56.262986,1808,24.940805,48.163756,1267,75.231054,75.231054,541,39.988938,0.0,56.262986,0.0
4500,0.2734,3.439315,40.044248,56.105845,1808,24.704025,47.62381,1267,75.970425,75.970425,541,40.044248,0.0,56.105845,0.0



Sample:
    step  train_loss  eval_loss
0     1      5.8430        NaN
1   100      5.1491        NaN
2   200      3.3799        NaN
3   300      2.6790        NaN
4   400      2.5737        NaN




Fold 1 metrics: {'eval_loss': 2.110079288482666, 'eval_exact': 39.823008849557525, 'eval_f1': 54.52360665771219, 'eval_total': 1808, 'eval_HasAns_exact': 22.17837411207577, 'eval_HasAns_f1': 43.15602276017662, 'eval_HasAns_total': 1267, 'eval_NoAns_exact': 81.1460258780037, 'eval_NoAns_f1': 81.1460258780037, 'eval_NoAns_total': 541, 'eval_best_exact': 39.823008849557525, 'eval_best_exact_thresh': 0.0, 'eval_best_f1': 54.52360665771212, 'eval_best_f1_thresh': 0.0, 'eval_runtime': 23.3167, 'eval_samples_per_second': 77.541, 'eval_steps_per_second': 9.693, 'epoch': 5.0}
Fold 2/5 — 7230 train • 1808 val



Step,Training Loss,Validation Loss,Exact,F1,Total,Hasans Exact,Hasans F1,Hasans Total,Noans Exact,Noans F1,Noans Total,Best Exact,Best Exact Thresh,Best F1,Best F1 Thresh
500,2.3598,2.159909,36.579967,49.034226,1807,20.527157,38.502274,1252,72.792793,72.792793,555,36.635307,0.0,49.034226,0.0


RuntimeError: [enforce fail at inline_container.cc:626] . unexpected pos 623411392 vs 623411280

**Note:**

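Training stopped during the second fold because checkpoint saving exhausted Kaggle's disk quota. On the first fold, validation F1 peaked at about 56 and EM at about 40 over 5 epochs; the checkpoint restored at the end of training (lowest validation loss, step 1500) scored F1 = 54.5 and EM = 39.8. Kaggle then cleared the disk, so we could not evaluate the fine-tuned model on pqa_test.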
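The disk pressure can be estimated from numbers already in the log. The sketch below counts how many checkpoints one fold writes, assuming a single GPU, no gradient accumulation, and the `TrainingArguments` default of `save_steps=500`. `save_total_limit` is an existing `TrainingArguments` parameter; the value 1 is an illustrative choice, not what this notebook used, and with `load_best_model_at_end=True` the Trainer may keep the best checkpoint in addition to the latest one.

```python
import math

train_features = 7230   # fold 1 training windows, from the log above
batch_size = 8          # per_device_train_batch_size
epochs = 5              # num_train_epochs
save_steps = 500        # default when save_strategy="steps" and save_steps is not set

steps_per_epoch = math.ceil(train_features / batch_size)
total_steps = steps_per_epoch * epochs
checkpoints_per_fold = total_steps // save_steps

print("optimizer steps per fold:", total_steps)
print("checkpoints written per fold without a limit:", checkpoints_per_fold)

# Hypothetical keyword arguments that could be merged into TrainingArguments(...)
# to cap the number of checkpoints kept on disk.
checkpoint_kwargs = {
    "save_strategy": "steps",
    "save_steps": save_steps,
    "save_total_limit": 1,
}
print(checkpoint_kwargs)
```

Deleting each fold's output directory after `trainer.evaluate()` would be another way to free space between folds.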

**Results**

In [None]:
print("5-fold CV Results:")
print(results)
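Printing the raw list is hard to read once several folds have finished. Here is a minimal sketch of a per-metric summary across folds. The helper name `summarize` is hypothetical, and the demo list contains only the fold 1 values from the log, since that is the only fold that completed.

```python
from statistics import mean, stdev


def summarize(results, keys=("eval_exact", "eval_f1")):
    """Return the fold count, mean and standard deviation for each metric key."""
    summary = {}
    for key in keys:
        values = [r[key] for r in results if key in r]
        if not values:
            continue
        summary[key] = {
            "folds": len(values),
            "mean": mean(values),
            # stdev needs at least two folds; report 0.0 for a single fold.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Only fold 1 finished in this run, so the demo list has a single entry.
results_demo = [{"eval_exact": 39.823008849557525, "eval_f1": 54.52360665771219}]
summary = summarize(results_demo)
for key, stats in summary.items():
    print(f"{key}: {stats['mean']:.2f} +/- {stats['std']:.2f} over {stats['folds']} fold(s)")
```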