# Notebook Overview

This notebook loads a LoRA-fine-tuned LLaMA-2 model and evaluates it on an extractive QA test set using SQuAD-style metrics (Exact Match and F1). The workflow is:

Model loading (4-bit quantized):
The fine-tuned causal LM is loaded from "/mnt/data/llama2_qa_lora_output/final" with 4-bit NF4 quantization (BitsAndBytesConfig, bfloat16 compute) and device_map="auto" for efficient GPU memory use. A text-generation pipeline is created for inference.

Dataset loading and split:
A JSONL dataset at "/mnt/data/testing_85.jsonl" is loaded via datasets. It’s shuffled and split into train/val/test (10% held out, then split evenly → ~5% test). Only the test portion is used for evaluation.

Structured field extraction (regex):
Each example’s text field is parsed to extract Context, Question, and Answer using regular expressions that target the notebook’s templated markers:

### Context: ...

### Question: question: ...

### Answer: [/INST] ... </s>

Prompt construction (LLaMA chat format):
For each test example, a chat-style prompt is built:

System block sets role: “You are a helpful assistant specialized in telecommunications.”

Task header ### Task: extractive_qa, followed by the Context and Question.

The model is asked to produce the Answer after [/INST].

Batched, deterministic inference:
Prompts are run in batches (size = 96) through the generation pipeline with max_new_tokens=64 and do_sample=False (greedy/argmax decoding). The generated answer is taken as the text after [/INST].

Evaluation (SQuAD metrics):
Using evaluate’s squad metric:

Aggregate EM and F1 are computed over all predictions.

Per-example EM/F1 are also computed and assembled with the original context/question/reference/prediction.

Outputs:

Console printout of Exact Match and F1 (aggregate).

A CSV file with per-sample results saved to "/mnt/data/qa_eval_results.csv" for detailed error analysis and inspection.

Key parameters: 4-bit NF4 quantization; max_new_tokens=64; batch_size=96; deterministic decoding; regex-based field extraction; SQuAD-style scoring for both overall and per-example evaluation.

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline

model_path = "/mnt/data/llama2_qa_lora_output/final"

tokenizer = AutoTokenizer.from_pretrained(model_path)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    device_map="auto"
)

qa_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0


In [2]:
from datasets import load_dataset

data_path = "/mnt/data/testing_85.jsonl"
dataset = load_dataset("json", data_files=data_path, split="train")
dataset = dataset.shuffle(seed=42)

# Same 3-way split
split = dataset.train_test_split(test_size=0.10, seed=42)
val_test = split["test"].train_test_split(test_size=0.5, seed=42)

test_dataset = val_test["test"]

In [3]:
import re

def extract_qa(example):
    text = example["text"]

    # Extract context
    c_match = re.search(r"### Context:\s*(.*?)\n\n### Question:", text, re.DOTALL)
    context = c_match.group(1).strip() if c_match else ""

    # Extract question
    q_match = re.search(r"### Question:\s*question:\s*(.*?)\n", text, re.DOTALL)
    question = q_match.group(1).strip() if q_match else ""

    # Extract answer
    a_match = re.search(r"### Answer: \[/INST\]\s*(.*?)</s>", text, re.DOTALL)
    answer = a_match.group(1).strip() if a_match else ""

    return {"context": context, "question": question, "answer": answer}

test_dataset = test_dataset.map(extract_qa)

Map:   0%|          | 0/2839 [00:00<?, ? examples/s]

In [4]:
batch_size = 96
batched_prompts = []

for ex in test_dataset:
    if ex["question"].strip() == "" or ex["answer"].strip() == "":
        continue

    
    prompt = (
        "<s>[INST] <<SYS>>\n"
        "You are a helpful assistant specialized in telecommunications.\n"
        "[/SYS]\n\n"
        "### Task: extractive_qa\n"
        f"### Context:\n{ex['context']}\n\n"
        f"### Question:\nquestion: {ex['question']}\n\n"
        "### Answer: [/INST]"
    )

    batched_prompts.append(prompt)

In [5]:
#Sanity check
print(" Prompt example:\n", batched_prompts[0])
print(" Reference answer:", test_dataset[0]["answer"])

🔎 Prompt example:
 <s>[INST] <<SYS>>
You are a helpful assistant specialized in telecommunications.
[/SYS]

### Task: extractive_qa
### Context:
the Up Link (UL) Non Access Stratum (NAS) TRANSPORT message or Data Link (DL) Non Access Stratum (NAS) TRANSPORT message is not included when the lcs-SLMOLR invoke component, lcs-SLMOLR return result component, related return error component or related reject component is transported in the Payload container. The User Equipment (UE) invokes an Signal Level (SL)-Management Object (MO)-Location Registration (LR) by sending a REGISTER message to the network containing an lcs-SLMOLR invoke component. Supplementary Services (SS) Version Indicator value 1 or above shall be used. The receiving network entity shall initiate the handling of location request in the network. The network shall pass the result of the location procedure to the User Equipment (UE) by sending a FACILITY message to the User Equipment (UE) containing an lcs-SLMOLR return result

In [6]:
predictions = []
references = []

from tqdm import tqdm

for i in tqdm(range(0, len(batched_prompts), batch_size)):
    batch_prompts = batched_prompts[i:i + batch_size]
    batch_refs = [test_dataset[i + j]["answer"] for j in range(len(batch_prompts))]

    outputs = qa_pipeline(batch_prompts, max_new_tokens=64, do_sample=False)

    for out, ref in zip(outputs, batch_refs):
        try:
            generated = out[0]["generated_text"].split("[/INST]")[-1].strip()
        except IndexError:
            generated = ""
        predictions.append(generated)
        references.append(ref)

  0%|                                                    | 0/29 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The following generation flags are not valid and may be ignore

In [7]:
!pip install evaluate

Defaulting to user installation because normal site-packages is not writeable


In [8]:
!pip install evaluate pandas
import evaluate

squad_metric = evaluate.load("squad")

formatted_references = [
    {"id": str(i), "answers": {"text": [ref], "answer_start": [0]}}
    for i, ref in enumerate(references)
]

results = squad_metric.compute(
    predictions=[{"id": str(i), "prediction_text": pred} for i, pred in enumerate(predictions)],
    references=formatted_references
)

import pandas as pd
from evaluate import load
from tqdm import tqdm

# Load squad metric for per-example scoring
squad = load("squad")

sample_results = []

for i in tqdm(range(len(predictions))):
    pred = predictions[i]
    ref = references[i]
    
    pred_obj = {"id": str(i), "prediction_text": pred}
    ref_obj = {"id": str(i), "answers": {"text": [ref], "answer_start": [0]}}

    metrics = squad.compute(predictions=[pred_obj], references=[ref_obj])
    
    sample_results.append({
        "id": i,
        "context": test_dataset[i]["context"],
        "question": test_dataset[i]["question"],
        "reference": ref,
        "prediction": pred,
        "exact_match": metrics["exact_match"],
        "f1": metrics["f1"]
    })

# Convert to DataFrame
df = pd.DataFrame(sample_results)

# Save to CSV
csv_path = "/mnt/data/qa_eval_results.csv"
df.to_csv(csv_path, index=False)

print(f" Evaluation results saved to: {csv_path}")

print(f" Exact Match (EM): {results['exact_match']:.2f}")
print(f" F1 Score: {results['f1']:.2f}")

Defaulting to user installation because normal site-packages is not writeable


100%|██████████████████████████████████████| 2748/2748 [00:13<00:00, 207.76it/s]


✅ Evaluation results saved to: /mnt/data/qa_eval_results.csv
📊 Exact Match (EM): 0.00
📈 F1 Score: 1.68
