# Notebook Summary

This notebook builds a unified telecom QA dataset by merging three sources into SQuAD format, then fine-tunes a RoBERTa question answering model (deepset/roberta-base-squad2, already fine-tuned on SQuAD 2.0) on the combined data.

## Workflow

### Load Input Datasets

- TeleQuAD-v1 (tabular format), TeleQuAD-v4 (SQuAD-like format), and TeleQnA (MCQ-style).

### Convert to SQuAD Format

- TeleQuAD-v1: extract question, context, and answer; locate the answer in the context to obtain its character offset, and skip entries where the answer is not found.
- TeleQuAD-v4: reuse the SQuAD-style qas structure directly.
- TeleQnA: convert each multiple-choice question to SQuAD style by concatenating the question, options, and explanation into a context and locating the correct answer's span in it.

### Merge & Save Combined Dataset

- Merge all QA pairs into a single SQuAD-compliant JSON file.
- Group QA pairs by context so that each passage carries all of its questions.
- Save as combined_telecom_qa.json.

### Dataset Preparation

- Flatten into a HuggingFace Dataset object.
- Split into train (90%) and validation (10%).
- Tokenize with the RoBERTa tokenizer (max_length 384, stride 128) so that long contexts are split into overlapping windows.
- Align character-based answer spans to token positions for supervised training.

### Model Fine-Tuning

- Load deepset/roberta-base-squad2 as the base QA model.
- Fine-tune for 3 epochs with the Trainer's default AdamW optimizer, learning rate 2e-5, weight decay 0.01, batch size 8, and evaluation every epoch.
- Use HuggingFace Trainer with DefaultDataCollator.

### Evaluation

- Use SQuAD metrics (Exact Match, F1) for validation.
- Post-process overlapping windows to keep the longest prediction per example.
- Track training and validation performance each epoch and reload the best checkpoint at the end of training. Because no metric_for_best_model is set, the Trainer selects the best checkpoint by validation loss.

### Model Export

- Save the fine-tuned model and tokenizer to ./qa_roberta_telecom.

In [1]:
import json
from pathlib import Path
from collections import defaultdict
import uuid

In [2]:
# STEP 1: Load input files
file_v1 = "/mnt/data/Datasets/TeleQuAD-v1-full-Tabular.json"
file_v4 = "/mnt/data/Datasets/TeleQuAD-v4-full.json"
file_qna = "/mnt/data/Datasets/TeleQnA.json"

with open(file_v1, "r", encoding="utf-8") as f:
    data_v1 = json.load(f)

with open(file_v4, "r", encoding="utf-8") as f:
    data_v4 = json.load(f)

with open(file_qna, "r", encoding="utf-8") as f:
    data_qna = json.load(f)

In [3]:
# STEP 2: Convert TeleQuAD-v1 to SQuAD format
v1_qas = []
for entry in data_v1["data"]:
    for qa in entry.get("questions", []):
        if all(k in qa for k in ("question", "answer", "context")):
            context = qa["context"]
            answer_text = qa["answer"]
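            # Case-insensitive search; skip entries whose answer is not in the context
            start = context.lower().find(answer_text.lower())
            if start == -1:
                continue
            # Store the span exactly as it appears in the context, so that
            # context[start:start + len(text)] == text as SQuAD expects
            answer_text = context[start:start + len(answer_text)]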
            v1_qas.append({
                "id": f'v1_{qa.get("id", str(uuid.uuid4()))}',
                "context": context,
                "question": qa["question"],
                "answers": {
                    "text": [answer_text],
                    "answer_start": [start]
                }
            })
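
One way to confirm that converted records carry valid character offsets is to slice each context at the recorded offset and compare the result with the stored answer text. The helper below is a minimal sketch; `find_span_mismatches` and the sample records are hypothetical and only mirror the shape of the entries in `v1_qas`.

```python
def find_span_mismatches(records):
    """Return ids of records whose answer text is not found at its recorded offset."""
    bad_ids = []
    for rec in records:
        context = rec["context"]
        answers = rec["answers"]
        for text, start in zip(answers["text"], answers["answer_start"]):
            if context[start:start + len(text)] != text:
                bad_ids.append(rec["id"])
                break  # one mismatch is enough to flag the record
    return bad_ids


# Hypothetical records in the same shape as the entries of v1_qas
sample_context = "LTE uses OFDMA in the downlink."
sample_records = [
    {"id": "ok", "context": sample_context,
     "answers": {"text": ["OFDMA"], "answer_start": [sample_context.index("OFDMA")]}},
    # Same offset, but the stored text differs from the context in casing
    {"id": "case", "context": sample_context,
     "answers": {"text": ["ofdma"], "answer_start": [sample_context.index("OFDMA")]}},
]
print(find_span_mismatches(sample_records))
```

A check like this flags records whose stored text differs from the context, for instance in casing, which matters because the SQuAD metric compares predicted text against the stored answer text.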

In [4]:
# STEP 3: Convert TeleQuAD-v4 to SQuAD format
v4_qas = []
for doc in data_v4["data"]:
    for para in doc["paragraphs"]:
        context = para["context"]
        for qa in para["qas"]:
            v4_qas.append({
                "id": f'v4_{qa["id"]}',
                "context": context,
                "question": qa["question"],
                "answers": {
                    "text": [a["text"] for a in qa["answers"]],
                    "answer_start": [a["answer_start"] for a in qa["answers"]],
                }
            })

In [5]:
# STEP 4: Convert TeleQnA (MCQ-style) to SQuAD format
qna_qas = []
for key, entry in data_qna.items():
    if "question" not in entry or "answer" not in entry:
        continue
    question = entry["question"]
    answer_text = entry["answer"].split(":", 1)[-1].strip()
    explanation = entry.get("explanation", "")
    options = " ".join([entry[k] for k in entry if k.startswith("option")])
    context = f"{question} {options} {explanation}"
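    # Case-insensitive search; skip entries whose answer is not in the context
    start = context.lower().find(answer_text.lower())
    if start == -1:
        continue
    # Store the span exactly as it appears in the context
    answer_text = context[start:start + len(answer_text)]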
    qna_qas.append({
        "id": f'qna_{key}',
        "context": context,
        "question": question,
        "answers": {
            "text": [answer_text],
            "answer_start": [start]
        }
    })
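
To make the MCQ conversion concrete, the sketch below runs the same steps on a single made-up entry. The entry and its field values are hypothetical; only the key pattern (`question`, keys starting with `option`, `answer` in the form `label: text`, `explanation`) follows what the cell above expects. `mcq_to_squad` is an illustrative name, not a function from the notebook.

```python
def mcq_to_squad(key, entry):
    """Convert one MCQ entry to a SQuAD-style record, or None if the answer is not found."""
    question = entry["question"]
    # "option 2: OFDMA" -> "OFDMA"
    answer_text = entry["answer"].split(":", 1)[-1].strip()
    options = " ".join(entry[k] for k in entry if k.startswith("option"))
    context = f"{question} {options} {entry.get('explanation', '')}"
    start = context.lower().find(answer_text.lower())
    if start == -1:
        return None
    return {
        "id": f"qna_{key}",
        "context": context,
        "question": question,
        "answers": {
            "text": [context[start:start + len(answer_text)]],
            "answer_start": [start],
        },
    }


# Hypothetical entry, not taken from TeleQnA
example_entry = {
    "question": "Which multiple access scheme does LTE use in the downlink?",
    "option 1": "CDMA",
    "option 2": "OFDMA",
    "option 3": "TDMA",
    "answer": "option 2: OFDMA",
    "explanation": "LTE downlink transmission is based on OFDMA.",
}
record = mcq_to_squad("example", example_entry)
print(record["context"])
print(record["answers"])
```

Note that `find` returns the first occurrence, so in this example the labelled span falls inside the option list rather than the explanation, and the question text itself is part of the context the model reads.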

In [6]:
# STEP 5: Merge and format into SQuAD-compliant structure
all_qas = v1_qas + v4_qas + qna_qas
grouped = defaultdict(list)
for qa in all_qas:
    grouped[qa["context"]].append({
        "id": qa["id"],
        "question": qa["question"],
        "answers": qa["answers"],
        "is_impossible": False
    })

squad_data = {
    "version": "telecom-combined-v1",
    "data": [
        {
            "title": "telecom-doc",
            "paragraphs": [{"context": ctx, "qas": qas}]
        }
        for ctx, qas in grouped.items()
    ]
}
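
The grouping step can be illustrated on toy data: questions that share a context end up under one paragraph, and each distinct context becomes its own single-paragraph article. This is a small sketch with made-up records; `group_by_context` and `make_qa` are hypothetical helper names.

```python
from collections import defaultdict


def make_qa(qa_id, context, question, answer):
    """Build a flat QA record, computing the answer offset from the context."""
    return {
        "id": qa_id,
        "context": context,
        "question": question,
        "answers": {"text": [answer], "answer_start": [context.index(answer)]},
    }


def group_by_context(qas, version="toy"):
    """Group flat QA records into a SQuAD-style dict, one article per distinct context."""
    grouped = defaultdict(list)
    for qa in qas:
        grouped[qa["context"]].append({
            "id": qa["id"],
            "question": qa["question"],
            "answers": qa["answers"],
            "is_impossible": False,
        })
    return {
        "version": version,
        "data": [
            {"title": "telecom-doc", "paragraphs": [{"context": ctx, "qas": items}]}
            for ctx, items in grouped.items()
        ],
    }


# Made-up records: the first two share a context
ctx_a = "5G NR supports bandwidths up to 400 MHz in FR2."
ctx_b = "The gNB is the 5G base station."
toy = group_by_context([
    make_qa("a1", ctx_a, "What is the maximum bandwidth in FR2?", "400 MHz"),
    make_qa("a2", ctx_a, "Which frequency range is mentioned?", "FR2"),
    make_qa("b1", ctx_b, "What is the 5G base station called?", "gNB"),
])
print(len(toy["data"]), [len(d["paragraphs"][0]["qas"]) for d in toy["data"]])
```

Because Python dictionaries preserve insertion order, articles appear in the order their contexts were first seen.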

In [7]:
# STEP 6: Save to JSON file
output_path = "/mnt/data/Datasets/combined_telecom_qa.json"
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(squad_data, f, ensure_ascii=False, indent=2)

print(f"✅ Saved {len(all_qas)} QA pairs to {output_path}")

✅ Saved 14623 QA pairs to /mnt/data/Datasets/combined_telecom_qa.json


In [8]:
# STEP 7: Imports for dataset preparation and training
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer,
    DefaultDataCollator,
)
import evaluate

In [9]:
import json
from datasets import Dataset

# Load combined QA JSON file
with open("/mnt/data/Datasets/combined_telecom_qa.json", "r", encoding="utf-8") as f:
    raw_data = json.load(f)

# Flatten from SQuAD format
flat_data = []
for article in raw_data["data"]:
    for para in article["paragraphs"]:
        context = para["context"]
        for qa in para["qas"]:
            flat_data.append({
                "id": qa["id"],
                "question": qa["question"],
                "context": context,
                "answers": qa["answers"]
            })

# Convert to HuggingFace Dataset and split
dataset = Dataset.from_list(flat_data).train_test_split(test_size=0.1, seed=42)
train_dataset = dataset["train"]
val_dataset = dataset["test"]
print(dataset.column_names)

{'train': ['id', 'question', 'context', 'answers'], 'test': ['id', 'question', 'context', 'answers']}
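
`train_test_split` shuffles individual QA pairs, so two questions about the same passage can land on opposite sides of the split. If that overlap matters for the evaluation, one alternative (not what this notebook does) is to split on contexts instead. The sketch below assumes records shaped like `flat_data`; `split_by_context` is a hypothetical helper.

```python
import random


def split_by_context(records, test_fraction=0.1, seed=42):
    """Split records so that no context appears in both train and test."""
    contexts = sorted({rec["context"] for rec in records})
    random.Random(seed).shuffle(contexts)
    # The fraction applies to distinct contexts, not to QA pairs
    n_test = max(1, round(len(contexts) * test_fraction))
    test_contexts = set(contexts[:n_test])
    train = [rec for rec in records if rec["context"] not in test_contexts]
    test = [rec for rec in records if rec["context"] in test_contexts]
    return train, test


# Toy data: 10 contexts with 2 questions each
toy_records = [
    {"id": f"{c}-{q}", "context": f"context {c}", "question": f"question {q}"}
    for c in range(10)
    for q in range(2)
]
train_part, test_part = split_by_context(toy_records)
print(len(train_part), len(test_part))
```

The resulting lists could then be passed to `Dataset.from_list` separately. The split sizes will only approximate 90/10 in terms of QA pairs, since contexts carry different numbers of questions.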


In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

def preprocess(example):
    # Tokenize with stride to handle long contexts
    tokenized = tokenizer(
        example["question"],
        example["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    )

    sample_mapping = tokenized.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized.pop("offset_mapping")

    tokenized["start_positions"] = []
    tokenized["end_positions"] = []
    tokenized["example_id"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        sequence_ids = tokenized.sequence_ids(i)
        sample_index = sample_mapping[i]
        answers = example["answers"][sample_index]
        tokenized["example_id"].append(example["id"][sample_index])

        if len(answers["answer_start"]) == 0:
            tokenized["start_positions"].append(cls_index)
            tokenized["end_positions"].append(cls_index)
        else:
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

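            # Label the window as "no answer" (CLS) unless it fully contains the answer span;
            # a partially contained answer would otherwise get misaligned token positions
            if offsets[token_start_index][0] > start_char or offsets[token_end_index][1] < end_char: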
                tokenized["start_positions"].append(cls_index)
                tokenized["end_positions"].append(cls_index)
            else:
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                start_pos = token_start_index - 1

                while token_end_index >= 0 and offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                end_pos = token_end_index + 1

                tokenized["start_positions"].append(start_pos)
                tokenized["end_positions"].append(end_pos)

    return tokenized
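
The effect of `stride` can be pictured without a tokenizer. In the simplified model below, a context of `n_tokens` tokens is cut into windows of at most `window_size` tokens, and consecutive windows share `overlap` tokens. This is only a sketch: the real tokenizer also reserves room for the question and special tokens inside `max_length`, and the function names here are hypothetical.

```python
def sliding_windows(n_tokens, window_size, overlap):
    """Return (start, end) token ranges of overlapping windows covering n_tokens."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    step = window_size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window_size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break
        start += step
    return windows


def windows_containing(windows, span_start, span_end):
    """Indices of windows that fully contain the token span [span_start, span_end)."""
    return [i for i, (s, e) in enumerate(windows) if s <= span_start and span_end <= e]


toy_windows = sliding_windows(n_tokens=10, window_size=4, overlap=2)
print(toy_windows)
print(windows_containing(toy_windows, span_start=3, span_end=5))
```

In this toy case an answer covering tokens 3 and 4 straddles the end of the first window and is fully contained only in the second one, which is why windows without the complete answer are labelled with the CLS position during preprocessing.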


In [11]:
tokenized_train = train_dataset.map(preprocess, batched=True, remove_columns=train_dataset.column_names)
tokenized_val = val_dataset.map(preprocess, batched=True, remove_columns=val_dataset.column_names)

Map:   0%|          | 0/13160 [00:00<?, ? examples/s]

Map:   0%|          | 0/1463 [00:00<?, ? examples/s]

In [12]:
# Load the base QA model (already fine-tuned on SQuAD 2.0)
model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

In [13]:
import evaluate
from transformers import Trainer, TrainingArguments, AutoModelForQuestionAnswering, DefaultDataCollator

# Load SQuAD-style metric
squad_metric = evaluate.load("squad")

def compute_metrics(pred):
    start_logits, end_logits = pred.predictions
    start_preds = start_logits.argmax(-1)
    end_preds = end_logits.argmax(-1)

    # Group by example_id
    formatted_preds = {}
    for i, (start, end) in enumerate(zip(start_preds, end_preds)):
        input_ids = tokenized_val[i]["input_ids"]
        example_id = tokenized_val[i]["example_id"]
        pred_text = tokenizer.decode(input_ids[start:end+1], skip_special_tokens=True)

        # Use the longest prediction for overlapping windows
        if example_id not in formatted_preds or len(pred_text) > len(formatted_preds[example_id]["prediction_text"]):
            formatted_preds[example_id] = {
                "id": str(example_id),
                "prediction_text": pred_text
            }

    # Format references
    formatted_references = [{
        "id": str(example["id"]),
        "answers": example["answers"]
    } for example in val_dataset]

    predictions = list(formatted_preds.values())

    return squad_metric.compute(predictions=predictions, references=formatted_references)

In [14]:
training_args = TrainingArguments(
    output_dir="./qa_telecom_roberta",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    logging_steps=50,
    save_total_limit=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=DefaultDataCollator(),
    compute_metrics=compute_metrics,
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [15]:
trainer.train()

Epoch,Training Loss,Validation Loss,Exact Match,F1
1,0.4813,0.533057,82.570062,89.812978
2,0.3378,0.542357,85.509228,91.775944
3,0.2724,0.651309,85.714286,92.123177


TrainOutput(global_step=8886, training_loss=0.4161456815568064, metrics={'train_runtime': 1531.5211, 'train_samples_per_second': 46.403, 'train_steps_per_second': 5.802, 'total_flos': 1.3927182458217984e+16, 'train_loss': 0.4161456815568064, 'epoch': 3.0})
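
The logged numbers allow a rough estimate of how many tokenized windows the 13,160 training examples produced. The arithmetic below is a sketch that assumes a single device and no gradient accumulation (the `TrainingArguments` defaults), so that one optimizer step consumes one batch of 8 windows.

```python
global_step = 8886          # from TrainOutput
epochs = 3
batch_size = 8              # per_device_train_batch_size
train_examples = 13160      # from the Map progress bar for the training split

steps_per_epoch = global_step // epochs                   # 2962
max_features = steps_per_epoch * batch_size               # 23696 if the last batch is full
min_features = (steps_per_epoch - 1) * batch_size + 1     # 23689 if the last batch has one item

# Cross-check with the logged throughput: samples/second * runtime / epochs
approx_features = 46.403 * 1531.5211 / epochs
windows_per_example = approx_features / train_examples

print(steps_per_epoch, min_features, max_features)
print(round(approx_features), round(windows_per_example, 2))
```

Both estimates point to roughly 23.7k training windows, or about 1.8 windows per example, which gives a sense of how often contexts exceed a single 384-token window.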

In [16]:
model.save_pretrained("./qa_roberta_telecom")
tokenizer.save_pretrained("./qa_roberta_telecom")

('./qa_roberta_telecom/tokenizer_config.json',
 './qa_roberta_telecom/special_tokens_map.json',
 './qa_roberta_telecom/vocab.json',
 './qa_roberta_telecom/merges.txt',
 './qa_roberta_telecom/added_tokens.json',
 './qa_roberta_telecom/tokenizer.json')
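
Since `TrainingArguments` sets `load_best_model_at_end=True` without `metric_for_best_model`, the Trainer ranks checkpoints by validation loss. Applying that rule to the logged table suggests which epoch the exported weights correspond to. The snippet is a sketch over the numbers printed by `trainer.train()`; the `eval_` key names follow the Trainer's metric prefix.

```python
# Per-epoch validation metrics copied from the training log above
epoch_logs = [
    {"epoch": 1, "eval_loss": 0.533057, "eval_exact_match": 82.570062, "eval_f1": 89.812978},
    {"epoch": 2, "eval_loss": 0.542357, "eval_exact_match": 85.509228, "eval_f1": 91.775944},
    {"epoch": 3, "eval_loss": 0.651309, "eval_exact_match": 85.714286, "eval_f1": 92.123177},
]

best_by_loss = min(epoch_logs, key=lambda row: row["eval_loss"])
best_by_f1 = max(epoch_logs, key=lambda row: row["eval_f1"])

print("lowest validation loss: epoch", best_by_loss["epoch"])
print("highest F1: epoch", best_by_f1["epoch"])
```

The two criteria disagree here: validation loss is lowest after epoch 1, while Exact Match and F1 peak at epoch 3. If the goal is to export the checkpoint with the best F1, one option is to pass `metric_for_best_model="f1"` to `TrainingArguments`.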

In [17]:
!rm -rf /home/ec2-user/qa_telecom_roberta

