# Notebook Summary

This notebook is the fourth implementation of the telecom QA fine-tuning pipeline, which fine-tunes LLaMA-2 with LoRA adapters. It covers the process end to end: dataset preparation, model training, and evaluation.

## Dataset Preprocessing and Filtering

- Parse raw JSONL data into context–question–answer triples.
- When a prompt exceeds the 2048-token limit, apply sliding-window chunking (150-word windows, stride 100) and keep the first chunk that still contains the answer span.
- Build LLaMA-2 style instruction prompts with strict formatting.
- Save a filtered dataset, removing examples whose answer is lost during chunking.

## Golden Dataset Creation

- Clean answers by removing artefacts (e.g., </s>).
- Produce a high-quality “golden” dataset for training.

## Model Fine-Tuning with LoRA

- Load the golden dataset into a HuggingFace Dataset object and split it into train/validation/test sets (90/5/5).
- Tokenize with the LLaMA-2 tokenizer (max length 2048, EOS token used for padding).
- Configure LoRA adapters on a 4-bit quantized base model, with gradient checkpointing and cosine LR scheduling.
- Train for 6 epochs with evaluation per epoch, logging, and speed monitoring.
- Save training curves (loss vs. steps) and the final fine-tuned model.

## Evaluation and Inference

- Reload the fine-tuned model for testing.
- Generate predictions with custom post-processing (answer cleaning; a newline stopping criterion is defined but not passed to the generation call).
- Evaluate performance using SQuAD metrics (Exact Match, F1), with ROUGE-L and BLEU as complementary overlap metrics.
- Save detailed per-example results (prompts, references, predictions, scores) to CSV.

In [None]:
import json
from pathlib import Path
from transformers import AutoTokenizer
from tqdm import tqdm

# Configuration
MAX_TOKEN_LENGTH = 2048

# Paths
input_path = Path("/mnt/data/Second Implementation/telequad_v4_reformatted.jsonl")
output_path = Path(f"/mnt/data/Fourth Implementation/telequad_v4_filtered_semantic_{MAX_TOKEN_LENGTH}.jsonl")

# Load tokenizer (used only to measure prompt length in tokens)
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("/mnt/data/llama2-model")
print("✅ Tokenizer loaded.")

# Prompt format
SYSTEM_PROMPT = (
    "You are a precise assistant. Extract the exact answer span from the context. "
    "Do not paraphrase, summarize, or add extra information. "
    "The answer must appear exactly in the context."
)

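# Sliding-window chunking logic (verbatim match on the answer, not embedding-based)
from typing import Optional

def select_relevant_chunks(context: str, answer: str, window_size=150, stride=100) -> Optional[str]:
    """
    Return the first window of `window_size` words (advancing by `stride` words)
    that contains `answer` verbatim, or None if no window does.

    Note: chunks are rebuilt with single spaces, so an answer that spans a
    newline or repeated whitespace in the original context will not match.
    """
    words = context.split()
    for start in range(0, len(words), stride):
        end = start + window_size
        chunk = " ".join(words[start:end])
        if answer in chunk:
            return chunk
        if end >= len(words):
            break
    return None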

# Rebuild prompt from parts
def build_prompt(context: str, question: str, answer: str) -> str:
    user_prompt = (
        f"Context: {context}\n\n"
        f"Question: {question}\n"
        f"Answer:"
    )
    return f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{user_prompt} [/INST] {answer}</s>"

# Process entries
print("Processing and filtering entries...")
reformatted_entries = []
total_count = 0
filtered_out_count = 0

with open(input_path, "r", encoding="utf-8") as f:
    num_lines = sum(1 for line in f)

with open(input_path, "r", encoding="utf-8") as f:
    for line in tqdm(f, total=num_lines, desc="Processing file"):
        total_count += 1
        try:
            entry = json.loads(line)
            original_text = entry["text"]

            # Parse the original text to extract context, question, and answer
            prompt_part, answer = original_text.split("[/INST]", 1)
            # Remove the </s> marker first so whitespace before it is stripped too
            answer = answer.replace("</s>", "").strip()
            
            # Skip if answer is empty
            if not answer:
                continue

            lines = prompt_part.splitlines()
            context_lines, question = [], ""
            inside_context, inside_question = False, False

            for l in lines:
                stripped_line = l.strip()
                if stripped_line.startswith("Context:"):
                    inside_context = True
                    inside_question = False
                    # Capture text on the same line as "Context:"
                    context_lines.append(stripped_line.replace("Context:", "").strip())
                    continue
                elif stripped_line.startswith("Question:"):
                    inside_question = True
                    inside_context = False
                    # Capture text on the same line as "Question:"
                    question = stripped_line.replace("Question:", "").strip()
                    continue
                
                if inside_context:
                    context_lines.append(l) # Append original line to preserve formatting
                elif inside_question and not question:
                    question = stripped_line
            
            full_context = "\n".join(context_lines).strip()

            if not full_context or not question:
                continue

            # Tokenize and decide whether to shorten the context
            temp_prompt = build_prompt(full_context, question, answer)
            input_ids = tokenizer(temp_prompt)["input_ids"]

            final_context = full_context
            
            # If the full prompt is too long, try to shorten it
            if len(input_ids) > MAX_TOKEN_LENGTH:
                short_context = select_relevant_chunks(full_context, answer, window_size=150, stride=100)
                
                if short_context is not None and answer in short_context:
                    final_context = short_context
                else:
                    filtered_out_count += 1
                    continue

            # Build the final, validated prompt and add it to our list.
            final_prompt = build_prompt(final_context, question, answer)
            reformatted_entries.append({"text": final_prompt})

        except (ValueError, KeyError) as e:
            # Catch potential errors from malformed JSON lines or text splitting.
            print(f"Skipping malformed line {total_count}: {e}")
            continue

# Save output
print("💾 Saving the new dataset...")
with open(output_path, "w", encoding="utf-8") as f:
    for e in reformatted_entries:
        f.write(json.dumps(e) + "\n")

print("\n--- 📊 Processing Complete ---")
print(f"Total examples processed: {total_count}")
print(f"Examples kept for training: {len(reformatted_entries)}")
print(f"Examples filtered out (answer lost during chunking): {filtered_out_count}")
print(f"✅ Filtered and reformatted file saved to: {output_path}")
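
The chunking step above can be illustrated on a toy input. The following is a minimal, self-contained sketch of the same sliding-window search; the helper name `first_window_with_answer` and the toy context are illustrative assumptions, and the window and stride are shrunk so the behaviour is easy to follow by hand.

```python
from typing import Optional


def first_window_with_answer(context: str, answer: str,
                             window_size: int = 150, stride: int = 100) -> Optional[str]:
    """Return the first word window that contains `answer` verbatim, else None."""
    words = context.split()
    for start in range(0, len(words), stride):
        chunk = " ".join(words[start:start + window_size])
        if answer in chunk:
            return chunk
        if start + window_size >= len(words):
            break
    return None


# Toy context (hypothetical): 12 filler words followed by the answer phrase.
toy_context = " ".join(f"w{i}" for i in range(12)) + " carrier aggregation is supported"
chunk = first_window_with_answer(toy_context, "carrier aggregation", window_size=8, stride=4)
print(chunk)

# An answer that straddles a line break is not found, because each chunk
# is rebuilt from whitespace-split words joined by single spaces.
missed = first_window_with_answer("carrier\naggregation", "carrier\naggregation")
print(missed)
```

Because windows overlap by `window_size - stride` words, an answer shorter than the overlap cannot fall between two consecutive windows. The second call shows the main way an example can still be filtered out: whitespace in the answer that does not survive the split-and-join.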

In [None]:
import json
from pathlib import Path
from tqdm import tqdm
import re

# Paths
input_path = Path("/mnt/data/Fourth Implementation/telequad_v4_filtered_semantic_2048.jsonl")
# The output is our new, clean "golden" dataset
output_path = Path("/mnt/data/Fourth Implementation/telequad_v4_golden.jsonl")

print(f"Input dataset: {input_path}")
print(f"Output dataset: {output_path}")

cleaned_entries = []

print("⏳ Reading and cleaning the dataset...")
with open(input_path, "r", encoding="utf-8") as f:
    for line in tqdm(f.readlines()):
        try:
            entry = json.loads(line)
            text = entry.get("text", "")
            
            # Split the entry into the prompt and the answer
            prompt_part, answer_part = text.split("[/INST]", 1)
            
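            # Strip the end-of-sequence marker and surrounding whitespace
            original_answer = answer_part.replace("</s>", "").strip()
            
            # Cleaning logic: currently a pass-through; the only cleaning applied
            # is the </s> and whitespace stripping above
            clean_answer = original_answer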
            
            # Rebuild the text entry with the clean answer
            new_text = f"{prompt_part}[/INST] {clean_answer}</s>"
            cleaned_entries.append({"text": new_text})

        except (ValueError, KeyError) as e:
            print(f"Skipping malformed line: {e}")
            continue

print(f"💾 Saving {len(cleaned_entries)} cleaned entries to the golden dataset...")
with open(output_path, "w", encoding="utf-8") as f:
    for entry in cleaned_entries:
        f.write(json.dumps(entry) + "\n")

print("✅ Golden dataset created successfully!")
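
As a rough illustration of what an answer-cleaning step might do, here is a minimal sketch. The functions `clean_answer_text` and `rebuild_entry` and the sample entry are hypothetical and not part of the notebook. Collapsing internal whitespace is an assumption that may not suit an extractive task, since the answer must still appear verbatim in the context.

```python
import re


def clean_answer_text(answer: str) -> str:
    """Hypothetical cleaner: drop </s> markers and collapse runs of whitespace."""
    answer = answer.replace("</s>", "")
    answer = re.sub(r"\s+", " ", answer)
    return answer.strip()


def rebuild_entry(text: str) -> str:
    """Split on the first [/INST], clean the answer, and re-append a single </s>."""
    prompt_part, answer_part = text.split("[/INST]", 1)
    return f"{prompt_part}[/INST] {clean_answer_text(answer_part)}</s>"


# Made-up entry with a duplicated end marker and irregular spacing
example = "<s>[INST] Question: What is NR? [/INST]  New   Radio </s></s>"
rebuilt = rebuild_entry(example)
print(rebuilt)
```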

In [1]:
import torch
from pathlib import Path
import json
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    TrainingArguments, Trainer, BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load and Combine Datasets
def load_jsonl(path):
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line.strip()) for line in f if line.strip()]

v4_path = "/mnt/data/Fourth Implementation/telequad_v4_golden.jsonl"

v4_data = load_jsonl(v4_path)

combined_data = v4_data  
dataset = Dataset.from_list(combined_data).shuffle(seed=42)

# 90/5/5 Split
split = dataset.train_test_split(test_size=0.10, seed=42)
val_test = split["test"].train_test_split(test_size=0.5, seed=42)
train_dataset = split["train"]
val_dataset = val_test["train"]
test_dataset = val_test["test"]

In [None]:
# Load Tokenizer
model_path = "/mnt/data/llama2-model"  
print("🔤 Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token  # Padding with eos token

In [None]:
# Tokenize Data
def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=2048 
    )

train_dataset = train_dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
val_dataset = val_dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])

In [None]:
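# Data Collator
# With mlm=False the collator copies input_ids to labels and sets every label
# equal to pad_token_id to -100. Because pad_token is the EOS token here,
# EOS positions are excluded from the loss as well.
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=64  # Pad sequence lengths to a multiple of 64 for efficiency
)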
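
One side effect of reusing EOS as the pad token is worth seeing in isolation. The sketch below mimics, in plain Python, the collator's label rule (copy the inputs, then mask positions equal to the pad id). The token ids other than BOS = 1 and EOS = 2, the dedicated pad id 32000, and the helper `make_labels` are illustrative assumptions.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def make_labels(input_ids, pad_token_id):
    """Mimic the causal-LM collator rule: copy inputs, mask pad positions."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


EOS_ID = 2  # EOS id of the LLaMA-2 tokenizer

# Hypothetical sequence: BOS, two content tokens, a genuine EOS, then two
# padding slots that also hold the EOS id because pad_token = eos_token.
shared_ids = [1, 500, 600, EOS_ID, EOS_ID, EOS_ID]
labels_shared = make_labels(shared_ids, pad_token_id=EOS_ID)
print(labels_shared)  # the genuine EOS at index 3 is masked too

# With a dedicated pad id (hypothetical value) the genuine EOS keeps its label.
PAD_ID = 32000
dedicated_ids = [1, 500, 600, EOS_ID, PAD_ID, PAD_ID]
labels_dedicated = make_labels(dedicated_ids, pad_token_id=PAD_ID)
print(labels_dedicated)
```

If the closing `</s>` never contributes to the loss, the model gets no training signal for ending its answer. That may be one reason the inference step relies on a token cap and post-processing to trim outputs.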

In [None]:
# Load Model with LoRA
print("Loading LLaMA-2 with LoRA...")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16
)

base_model = prepare_model_for_kbit_training(base_model)
base_model.gradient_checkpointing_enable()
base_model.config.use_cache = False

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
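
For a sense of scale, the number of trainable LoRA parameters can be estimated by hand. This sketch assumes the base model has LLaMA-2 7B dimensions (hidden size 4096, 32 decoder layers, square attention projections); the notebook does not state the model size, so treat the figures as illustrative. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(hidden_size: int, num_layers: int, rank: int, num_target_modules: int) -> int:
    """Count LoRA parameters for square hidden_size x hidden_size projections.

    Each adapted projection gains two matrices: A (rank x d_in) and B (d_out x rank).
    """
    per_module = rank * (hidden_size + hidden_size)
    return per_module * num_target_modules * num_layers


# Assumed LLaMA-2 7B dimensions; r=32 and four target modules come from lora_config
trainable = lora_param_count(hidden_size=4096, num_layers=32, rank=32, num_target_modules=4)
print(f"{trainable:,} trainable LoRA parameters")

# Standard LoRA scales the adapter update by lora_alpha / r
scaling = 64 / 32
print(f"scaling factor: {scaling}")
```

Under these assumptions the estimate should agree with the count reported by `model.print_trainable_parameters()`; a mismatch would suggest a different base model size or set of target modules.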

In [None]:
output_dir = "/mnt/data/llama2_qa_lora_output4"

In [None]:
# Training Arguments
print("Setting up training...")
args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=6,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    logging_dir=f"{output_dir}/logs",
    logging_steps=50,
    bf16=True,
    report_to="tensorboard",
    remove_unused_columns=False,
    dataloader_num_workers=4,
    group_by_length=True,
    optim="paged_adamw_8bit",
    max_grad_norm=1,
    warmup_ratio=0.03
)
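
The batch and schedule settings above interact, so a quick back-of-the-envelope calculation can help. The sketch below assumes a single GPU and a made-up training set of 4000 examples; the exact step count reported by `Trainer` can differ slightly depending on the library version and how the last partial batch is handled. The helper `schedule_summary` is hypothetical.

```python
import math


def schedule_summary(num_examples, per_device_batch, grad_accum, epochs,
                     warmup_ratio, num_devices=1):
    """Approximate the optimizer-step schedule implied by the training arguments."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps


# 4000 examples is an illustrative figure, not the size of the golden dataset
summary = schedule_summary(num_examples=4000, per_device_batch=2, grad_accum=8,
                           epochs=6, warmup_ratio=0.03)
print(summary)
```

With `per_device_train_batch_size=2` and `gradient_accumulation_steps=8`, each optimizer step sees 16 examples per device, and the 3% warmup covers only a small number of steps before the cosine decay begins.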

In [None]:
# Trainer Setup
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

In [None]:
# Optional Speed Logging
import time
from transformers import TrainerCallback

class SpeedCallback(TrainerCallback):
    def __init__(self):
        self.last_time = time.time()
    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % 20 == 0:
            now = time.time()
            duration = now - self.last_time
            print(f"⚡ Step {state.global_step} — {20/duration:.3f} it/s")
            self.last_time = now

trainer.add_callback(SpeedCallback())

In [None]:
# Start Training
print("🚀 Starting fine-tuning...")
torch.cuda.empty_cache()
trainer.train()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(trainer.state.log_history)

# Optional: Save to CSV
df.to_csv("/mnt/data/Fourth Implementation/loss_history1.csv", index=False)

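# Plotting: log_history stores training and evaluation entries in separate rows,
# so drop the NaN rows of each series before plotting
train_log = df.dropna(subset=["loss"])
eval_log = df.dropna(subset=["eval_loss"])
plt.plot(train_log["step"], train_log["loss"], label="Training Loss", marker='o')
plt.plot(eval_log["step"], eval_log["eval_loss"], label="Validation Loss", marker='x')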
plt.xlabel("Step")
plt.ylabel("Loss")
plt.title("Training vs Validation Loss")
plt.grid(True)
plt.legend()
plt.show()

In [None]:
# Save Final Model
print("💾 Saving model...")
trainer.save_model(f"{output_dir}/final")
tokenizer.save_pretrained(f"{output_dir}/final")

In [2]:
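import re

def clean_prediction(raw_text):
    # Keep only the text generated after the instruction block
    answer = raw_text.split("[/INST]")[-1].strip()
    # Drop every character except word characters, whitespace, and - . , : / ( )
    answer = re.sub(r"[^\w\s\-.,:/()]", "", answer)

    # Clip at the first period. '?' and '!' were already removed by the
    # substitution above, so only '.' can match here. Caveat: this also clips
    # answers containing decimals or dotted identifiers (e.g. "3.5 GHz").
    sentence_end = re.search(r'[.?!]', answer)
    if sentence_end:
        answer = answer[:sentence_end.end()]

    return answer.strip()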
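
The clipping rule matters for telecom answers, which often contain decimals and dotted identifiers. The following sketch contrasts the first-period rule with a hypothetical variant that clips only at punctuation followed by whitespace or the end of the text. Both function names and the sample string are illustrative, and the variant is not what produced the results reported in this notebook.

```python
import re


def clip_first_period(answer: str) -> str:
    """Clip at the first sentence punctuation mark, wherever it occurs."""
    match = re.search(r"[.?!]", answer)
    return answer[:match.end()].strip() if match else answer.strip()


def clip_sentence_end(answer: str) -> str:
    """Hypothetical variant: clip only where punctuation ends a sentence."""
    match = re.search(r"[.?!](?=\s|$)", answer)
    return answer[:match.end()].strip() if match else answer.strip()


sample = "3.5 GHz band. It is used for 5G"
clipped_early = clip_first_period(sample)
clipped_late = clip_sentence_end(sample)
print(repr(clipped_early))
print(repr(clipped_late))
```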

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, StoppingCriteria, StoppingCriteriaList
from datasets import Dataset
import evaluate
import pandas as pd
from tqdm import tqdm
import re

# Reload model and tokenizer
model_path = "/mnt/data/llama2_qa_lora_output4/final" 
print("🧠 Loading fine-tuned model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16
).to("cuda")

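# Define a custom stopping criterion.
# Note: this class is never passed to the generation call below, so it has no
# effect on the outputs; it also checks only the first sequence in a batch.
class StopOnNewline(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        return input_ids[0, -1] == 13  # Token ID for newline in the LLaMA-2 tokenizer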

# Create the pipeline
qa_pipeline = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer
)

# Split each test example into its prompt and reference answer
# (test_dataset comes from the 90/5/5 split above)
def extract_prompt_and_answer(entry):
    try:
        text = entry["text"]
        parts = text.split("[/INST]", 1)
        prompt = parts[0] + "[/INST]"
        reference = parts[1].replace("</s>", "").strip()
        return {"prompt": prompt, "reference": reference}
    except Exception:
        return {"prompt": "", "reference": ""}

print("📝 Processing test set...")
processed = [extract_prompt_and_answer(ex) for ex in test_dataset]
processed = [ex for ex in processed if ex["prompt"].strip() and ex["reference"].strip()]

# --- Inference ---
print("🔮 Generating predictions with aggressive post-processing...")
predictions = []
batch_size = 4

for i in tqdm(range(0, len(processed), batch_size)):
    batch_prompts = [ex["prompt"] for ex in processed[i:i + batch_size]]
    
    batch_outputs = qa_pipeline(
        batch_prompts, 
        max_new_tokens=40,
        do_sample=False,
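        # Use the tokenizer's EOS id directly; encode("</s>")[0] would return
        # the BOS id, because encode() prepends BOS by default
        eos_token_id=tokenizer.eos_token_id,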
        pad_token_id=tokenizer.eos_token_id
    )

    for out in batch_outputs:
        gen_text = out[0]["generated_text"]
        cleaned_answer = clean_prediction(gen_text)
        predictions.append(cleaned_answer)
# Evaluation using SQuAD metric
print("📊 Calculating metrics...")
references = [ex["reference"] for ex in processed]
squad_metric = evaluate.load("squad")

formatted_preds = [{"id": str(i), "prediction_text": p} for i, p in enumerate(predictions)]
formatted_refs = [{"id": str(i), "answers": {"text": [r], "answer_start": [0]}} for i, r in enumerate(references)]

results = squad_metric.compute(predictions=formatted_preds, references=formatted_refs)
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

rouge_result = rouge.compute(predictions=predictions, references=references)
bleu_result = bleu.compute(predictions=predictions, references=references)

print(f"🟩 ROUGE-L: {rouge_result['rougeL']:.2f}")
print(f"🟦 BLEU: {bleu_result['bleu']:.2f}")
print(f"\n✅ Exact Match (EM): {results['exact_match']:.2f}")
print(f"📈 F1 Score: {results['f1']:.2f}")

# Save Detailed CSV
df = pd.DataFrame({
    "id": list(range(len(predictions))),
    "prompt": [ex["prompt"] for ex in processed],
    "reference": references,
    "prediction": predictions
})
# Per-example scores: run the SQuAD metric once per example and reuse the result
per_example = [
    squad_metric.compute(predictions=[p], references=[r])
    for p, r in zip(formatted_preds, formatted_refs)
]
df["exact_match"] = [s["exact_match"] for s in per_example]
df["f1"] = [s["f1"] for s in per_example]
csv_path = "/mnt/data/Fourth Implementation/test_dataset_eval_results_FINAL.csv"
df.to_csv(csv_path, index=False)
print(f"✅ Detailed test set results saved to: {csv_path}")

🧠 Loading fine-tuned model and tokenizer...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0


📝 Processing test set...
🔮 Generating predictions with aggressive post-processing...


  0%|                                                    | 0/54 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

📊 Calculating metrics...


🟩 ROUGE-L: 0.59
🟦 BLEU: 0.37

✅ Exact Match (EM): 25.82
📈 F1 Score: 58.16
✅ Detailed test set results saved to: /mnt/data/Fourth Implementation/test_dataset_eval_results_FINAL.csv
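
The gap between Exact Match and F1 is easier to interpret with the scoring rules in view. Below is a minimal sketch of SQuAD-style scoring for a single prediction, following the usual normalization order (lowercase, strip punctuation, drop articles, collapse whitespace). It is a simplified stand-in for the `evaluate` metric used above, and the sample strings are made up.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up pair: the prediction contains the reference plus one extra token
em = exact_match("the 3.5 GHz band", "3.5 GHz")
f1 = f1_score("the 3.5 GHz band", "3.5 GHz")
print(em, round(f1, 2))
```

A prediction that contains the reference plus a single extra word scores zero on Exact Match but still earns most of the F1 credit. That pattern is consistent with a low EM alongside a much higher F1, although the per-example CSV is the place to confirm it.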
