# **Project: COVID-19 Vaccine QA Fine-Tuning & Deployment Pipeline**

**Overview**

This notebook implements a modular, end-to-end pipeline to collect, process, and fine-tune a Small Language Model (SLM) on domain-specific medical data (COVID-19 Vaccine Safety).

**Key Features**:

1. **Automated Data Extraction**  
   - Processes PDFs and extracts structured text.  
   - Attaches metadata for source attribution and traceability.  

2. **Synthetic QA Dataset Generation**  
   - Generates high-quality question-answer pairs using LLMs.  
   - Ensures each answer is strictly grounded in the source text.  
   - Validates output against structured Pydantic schemas for consistency.

3. **Model Fine-Tuning**  
   - Adapts a base causal LLM to domain-specific knowledge using LoRA/PEFT.  
   - Supports CPU/GPU setups and reproducible workflows.

4. **Evaluation & Benchmarking**  
   - Measures model performance using automated metrics (ROUGE, BLEU).  
   - Tracks inference latency and throughput for production readiness.

**Goal**:  
Enable accurate, domain-specific question answering on COVID-19 vaccine safety literature with a reproducible and production-ready pipeline.


# **1. Environment Setup & Configuration**

This block prepares the notebook for the entire pipeline:

- Mounts Google Drive for data access
- Installs required libraries
- Loads the project configuration (`config.yaml`)
- Sets global variables (model ID, device, paths)
- Prepares utility functions for the pipeline

> This ensures the pipeline is modular, reproducible, and ready for any domain.
## 1.1 Dependency Installation

In [6]:
# Run once
!pip install -qU transformers==4.45.2 peft==0.12.0 accelerate==0.34.2
!pip install -qU datasets==3.1.0 optimum==1.22.0
!pip install -qU openai==1.61.0 wandb json-repair==0.30.0
!pip install -qU pdfplumber PyYAML instructor
!pip install -qU google-generativeai
!pip install evaluate rouge_score absl-py

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.4/44.4 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.9/9.9 MB[0m [31m84.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m296.4/296.4 kB[0m [31m28.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m324.4/324.4 kB[0m [31m30.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.0/3.0 MB[0m [31m97.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.7/43.7 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m19.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

## 1.2 import all libraries

In [1]:
# Standard libraries
import os
import random
from pathlib import Path
from datetime import datetime
import hashlib
from pprint import pprint
import json
import csv
import yaml
import math
from typing import List, Dict, Optional

# PDF parsing
import pdfplumber

# Data manipulation
import pandas as pd
import numpy as np

# QA / LLM processing
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Dataset & evaluation utilities
from datasets import Dataset, DatasetDict, load_dataset
import evaluate

# Schema validation
from pydantic import BaseModel, Field

# Model tracking / MLOps
import wandb

# Optional: for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: for saving Excel reports
import openpyxl


## 1.3 Drive Mounting and Working Directory & Dependency Installation

In [2]:
# Google Drive Mount
from google.colab import drive
drive.mount('/content/drive')
# Project Paths & Config
import os
PROJECT_PATH = "/content/drive/MyDrive/COVID-19 Vaccine Side Effects and Safety/"
os.chdir(PROJECT_PATH)
CONFIG_PATH = os.path.join(PROJECT_PATH, "config.yaml")
# Load Configuration
import yaml
with open(CONFIG_PATH, "r") as f:
    config = yaml.safe_load(f)

print(f"Target Domain: {config.get('domain')}")
print(f"Data Directory: {config.get('data_directory')}")
print(f"PDF Files: {config.get('pdfs')}")

import torch
BASE_MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {DEVICE}")

import json_repair

def parse_json(text):
    """
    Safely parse LLM output using json_repair.
    Returns a dictionary or None if invalid JSON.
    """
    try:
        return json_repair.loads(text)
    except:
        return None

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Target Domain: COVID-19 vaccine side effects
Data Directory: data/raw/
PDF Files: ['Clinical Analysis of Long Post-COVID Vaccination Syndrome (LPCVS).pdf', 'covid_side_effects.pdf', 'COVID_Vaccine_Neuro_Effects.pdf', 'COVID_Vaccine_Safety_Facts.pdf', 'jj_vaccine_risks.pdf', 'Long_Term_Side_Effects_Medics.pdf', 'Pfizer_Platelet_LongTerm_Study.pdf', 'Rare_Vaccine_Side_Effects.pdf', 'Vaccine_Efficacy_Omicron_vs_Delta.pdf', 'WHO_XEC_Variant_Risk.pdf']
Device: cuda


In [3]:
PROJECT_PATH = "/content/drive/MyDrive/COVID-19 Vaccine Side Effects and Safety/"
config_path=PROJECT_PATH+"config.yaml"
config_path = os.path.join(PROJECT_PATH, "config.yaml")
with open(config_path, "r") as f:
    config = yaml.safe_load(f)

print(f"Target Domain: {config['domain']}")

Target Domain: COVID-19 vaccine side effects


# **2. Data Extraction / Collection**
This is responsible for turning raw PDFs into structured, machine-readable text.
## 2.1 PDF Parsing & Metadata extraction and source attribution

In [4]:
raw_dir = Path(PROJECT_PATH) / config["data_directory"]
pdf_files = [raw_dir / pdf for pdf in config["pdfs"]]

documents=[]
for pdf_path in pdf_files:
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
          text = page.extract_text()
          if text and text.strip():
              doc = {
                  "id": f"{pdf_path}_page_{page_num}",
                  "text": text.strip(),
                  "metadata": {
                      "source": pdf_path.name,
                      "page": page_num,
                      "domain": config["domain"],
                      "created_at": datetime.now().isoformat()
                  }
              }
              documents.append(doc)






In [5]:
documents[0]

{'id': '/content/drive/MyDrive/COVID-19 Vaccine Side Effects and Safety/data/raw/Clinical Analysis of Long Post-COVID Vaccination Syndrome (LPCVS).pdf_page_1',
 'text': 'Journal of Clinical and Translational Research 2022; 8(6): 506-508\nJournal of Clinical and Translational Research\nJournal homepage: http://www.jctres.com/en/home\nORIGINAL ARTICLE\nA retrospective analysis of clinically confirmed long post-COVID vaccination\nsyndrome\nJosef Finterer1*, Fulvio A. Scorza2\n1Neurology and Neurophysiology Center, Vienna, Austria, 2Department of Neurological, Federal University of Sao Paolo Rua Pedro de Toledo,\nSão Paulo, Brasil\nARTICLE INFO ABSTRACT\nArticle history: Background and Aim: Long post-COVID vaccination syndrome (LPCVS) is an increasingly\nReceived: August 17, 2022 recognized disease that occurs after SARS-CoV-2 vaccinations and lasts >4 weeks. However, little\nRevised: September 13, 2022 is known about the clinical presentation, underlying pathophysiology, treatment, and ou

In [9]:
# Shuffle the documents in-place
random.seed(42)
random.shuffle(documents)

# **3. Structured Output Schema (Pydantic)**
We enforce a strict JSON schema for the LLM responses using Pydantic.
This guarantees:
- Consistent output format
- Easy parsing
- Validation of model responses
- Production-readiness

In [10]:
class GenerateAnswer(BaseModel):
    answer: str = Field(..., description="Final answer to the user question")
    # source: str = Field(..., description="Source document name")
    # page: int = Field(..., description="Page number in the document")
    confidence: float = Field(..., description="Model confidence between 0 and 1")


In [11]:
user_question ="Based on the longitudinal retrospective analysis of platelet activity and the 2022 clinical analysis of LPCVS, what specific biomarker concentrations are suggested for evaluating long-term vaccine activity, and what was the recorded recovery timeline for the three patients who received only symptomatic therapy?"

qa_extraction_messages = [
    {
        "role": "system",
        "content": "\n".join([
            "You are a professional medical data analyst.",
            "You will be provided with a user question.",
            "Your goal is to answer the question accurately based ONLY on the provided text.",
            "Follow the provided Pydantic Scheme to generate a valid JSON output.",
            "If the answer is not in the text, state that you don't know and set confidence to 0.",
            "Do not generate any introduction or conclusion."
        ])
    },
    {
        "role": "user",
        "content": "\n".join([
            "## User Question:",
            user_question,
            "",
            "## Pydantic Details (Output Schema):",
            json.dumps(GenerateAnswer.model_json_schema(), ensure_ascii=False),
            "",
            "## Structured Answer:",
            "```json"
        ])
    }
]

# **4. Evaluation on Base Model**
This block tests the pre-trained LLM on your structured pipeline, using the previously defined Pydantic schema to validate outputs.
## 4.1 Load Base Model

In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = config["base_model"]  # read from config
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto" )
tokenizer = AutoTokenizer.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/660 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

## **4.2 Prepare Input**
Using qa_extraction_messages from Block 3 (Pydantic schema)

In [13]:
text = tokenizer.apply_chat_template(qa_extraction_messages, tokenize=False, aydd_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

## 4.3 Generate Response


In [14]:
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=1024,
    do_sample=False,
    top_k=None,
    temperature=None,
    top_p=None
)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [15]:
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]


In [16]:
print(response)

system
{
  "answer": "The specific biomarker concentrations suggested for evaluating long-term vaccine activity include IL-6, TNF-alpha, and IFN-gamma.",
  "confidence": 1.0
}


Base model hallucinates generic medical patterns due to over-generalization, failing to capture the specific evidence-based details present in our niche corpus.
# **5. QA Dataset Generation**
This block is responsible for generating high-quality question-answer pairs from your documents, forming the core dataset for fine-tuning your LLM.

.

In [None]:
!pip install -q -U google-generativeai

In [None]:
## Remmber to put it in config
import google.generativeai as genai
from google.colab import userdata

genai.configure(api_key=userdata.get('covid_key'))
model = genai.GenerativeModel("models/gemini-2.5-flash-lite")

## 5.1 Single Page QA Generation

Generate QA pairs from one page at a time using the Gemini/LLM model.


In [None]:
# Install and configure Gemini API
!pip install -q -U google-generativeai

import google.generativeai as genai
from google.colab import userdata
import json
from typing import List
from pydantic import BaseModel, Field

# Load your API key from Colab userdata
genai.configure(api_key=userdata.get('covid_key'))
model = genai.GenerativeModel("models/gemini-2.5-flash-lite")

# Pydantic schemas
class QAPair(BaseModel):
    question: str = Field(..., description="A clear question based on the context")
    answer: str = Field(..., description="An accurate answer supported by the context")
    confidence: float = Field(..., description="Model confidence between 0 and 1")
    source: str = Field(..., description="The filename of the PDF")
    page: int = Field(..., description="The page number where the answer was found")

class PageExtraction(BaseModel):
    qa_pairs: List[QAPair] = Field(..., description="List of 5 QA pairs from the text")

# Example for a single document
test_doc = documents[0]

qa_generation_messages = [
    {
        "role": "system",
        "content": "\n".join([
            "You are a dataset generator for a Covid19 Question Answering system.",
            "Generate FIVE high-quality questions and answers strictly based on the provided context.",
            "Each answer MUST be strictly grounded in the text.",
            "Include the source filename and page number for each QA pair.",
            "Return ONLY valid JSON following the provided Pydantic schema.",
            "Do not add explanations or extra text."
        ])
    },
    {
        "role": "user",
        "content": f"""
## Context:
{test_doc['text'].strip()}

## Source:
{test_doc['metadata']['source']}

## Page:
{test_doc['metadata']['page']}

## Pydantic Schema:
{json.dumps(PageExtraction.model_json_schema(), ensure_ascii=False)}

## QA Pair:
```json
"""
    }
]

prompt = "\n".join([
    qa_generation_messages[0]["content"],
    "",
    qa_generation_messages[1]["content"],
    "",
    "```json"
])


response = model.generate_content(prompt)
print(response.text)


ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (::1) 1968.68ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (::1) 2448.41ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (::1) 706.81ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-2.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (::1) 809.83ms


```json
{
  "qa_pairs": [
    {
      "question": "What is the overall risk and antigenic advantage associated with the XEC variant?",
      "answer": "Overall risk XEC is growing rapidly but possesses minimal antigenic advantage in evading previous immunity.",
      "confidence": 1.0,
      "source": "WHO_XEC_Variant_Risk.pdf",
      "page": 3
    },
    {
      "question": "Does the XEC variant cause higher disease severity compared to other circulating variants?",
      "answer": "There are no reports to suggest that the associated disease severity is higher as compared to other circulating variants.",
      "confidence": 1.0,
      "source": "WHO_XEC_Variant_Risk.pdf",
      "page": 3
    },
    {
      "question": "How many XEC sequences were available globally in epidemiological week 47 (11 to 17 November 2024), and what percentage did they represent?",
      "answer": "There are currently 13,331 XEC sequences available from 50 countries, representing 36.8% of the globally availa

## 5.2 Batch QA Generation

Generate QA pairs across multiple pages with batching.

In [None]:
import json
import time
from tqdm.auto import tqdm
from google.api_core.exceptions import TooManyRequests
class QAPair(BaseModel):
    question: str = Field(..., description="A clear question based on the context")
    answer: str = Field(..., description="An accurate answer supported by the context")
    confidence: float = Field(..., description="Model confidence between 0 and 1")
    source: str = Field(..., description="The filename of the PDF")
    page: int = Field(..., description="The page number where the answer was found")

class PageExtraction(BaseModel):
# I wanna to extract 20 different questions from each page (5 for each page)
    qa_pairs: List[QAPair] = Field(...,  description="List of 20 QA pairs from the text")

# Output file
output_path = Path("covid_qa_dataset.jsonl")
output_path.write_text("", encoding="utf-8")  # empty file before starting

RETRY_DELAY = 40  # seconds to wait on rate limit
MAX_RETRIES = 3
BATCH_SIZE = 4   # number of pages per request

# Build prompt from a batch of documents
def build_prompt(batch):
    context_parts = []
    for doc in batch:
        context_parts.append(f"## Page {doc['metadata']['page']} from {doc['metadata']['source']}:")
        context_parts.append(doc['text'].strip())
    context_text = "\n\n".join(context_parts)

    qa_generation_messages = [
        {
            "role": "system",
            "content": "\n".join([
                "You are a dataset generator for a Covid19 Question Answering system.",
                "Generate Twenty high-quality questions and answers strictly based on the provided context.",
                "Each answer MUST be strictly grounded in the text.",
                "Include the source filename and page number for each QA pair.",
                "Return ONLY valid JSON following the provided Pydantic schema.",
                "Do not add explanations or extra text."
            ])
        },
        {
            "role": "user",
            "content": f"""
## Context:
{context_text}

## Pydantic Schema:
{json.dumps(PageExtraction.model_json_schema(), ensure_ascii=False)}

## QA Pair:
```json
"""
        }
    ]
    prompt = "\n".join([qa_generation_messages[0]["content"], "", qa_generation_messages[1]["content"], "", "```json"])
    return prompt

# Process in batches
batched_documents = [documents[i:i+BATCH_SIZE] for i in range(0, len(documents), BATCH_SIZE)]

#Save QA pairs into JSONL format as a frozen artifact for reuse in fine-tuning.
for batch in tqdm(batched_documents, desc="Processing batches"):
    prompt = build_prompt(batch)

    for attempt in range(MAX_RETRIES):
        try:
            response = model.generate_content(prompt)
            qa_batch = parse_json(response.text)

            if qa_batch and "qa_pairs" in qa_batch:
                # Save QA pairs to JSONL
                with open(output_path, "a", encoding="utf-8") as f:
                    for pair in qa_batch["qa_pairs"]:
                        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
                break  # success, move to next batch
            else:
                print(f"Bad JSON in batch starting with page {batch[0]['metadata']['page']}")
                print(response.text)
                break  # skip if JSON invalid

        except TooManyRequests:
            print(f"Rate limit hit. Retrying in {RETRY_DELAY}s...")
            time.sleep(RETRY_DELAY)
    else:
        print(f"Skipping batch starting with page {batch[0]['metadata']['page']} after {MAX_RETRIES} retries.")

print(f"All QA pairs saved to {output_path} successfully!")


## 5.3 Display Generated QA Dataset

In [None]:
with open(output_path, "r", encoding="utf-8") as f:
    for line_num, line in enumerate(f, 1):
        if line.strip():
            data = json.loads(line)
            print(f"--- QA Pair {line_num} ---")
            print(json.dumps(data, indent=2, ensure_ascii=False))
            print("\n")

--- QA Pair 1 ---
{
  "question": "What is the current assessment of the overall risk posed by the XEC variant?",
  "answer": "The overall risk of XEC is growing rapidly, but it possesses minimal antigenic advantage in evading previous immunity. There is a significant increase in cases attributable to XEC infections, but the associated disease severity is not reported to be higher compared to other circulating variants.",
  "confidence": 1,
  "source": "WHO_XEC_Variant_Risk.pdf",
  "page": 3
}


--- QA Pair 2 ---
{
  "question": "Are there any indications that XEC infections are more severe than those caused by other Omicron descendent lineages?",
  "answer": "No, the available evidence on XEC does not suggest additional public health risks relative to the other currently circulating Omicron descendent lineages, nor are there reports to suggest that the associated disease severity is higher.",
  "confidence": 1,
  "source": "WHO_XEC_Variant_Risk.pdf",
  "page": 3
}


--- QA Pair 3 ---


# **6. Fine-tuning Preparation**
## 6.1 Install & Setup LLaMA-Factory

Clone and install LLaMA-Factory for fine-tuning.

In [None]:
!git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
!cd LLaMA-Factory && pip install -e .

In [None]:
## Configure API Keys
# Load WandB and HuggingFace tokens from Colab userdata for experiment logging and model authentication.

from google.colab import userdata
import wandb

wandb.login(key=userdata.get('wandb'))
hf_token = userdata.get('huggingface')
!huggingface-cli login --token {hf_token}

## 6.2 Define Output Schema

Use GenerateAnswer Pydantic schema to enforce structured output for each QA pair.
Comment: Guarantees JSON consistency during fine-tuning.

In [None]:
class GenerateAnswer(BaseModel):
    answer: str = Field(..., description="Final answer to the user question")
    confidence: float = Field(..., description="Model confidence between 0 and 1")
    source: str = Field(..., description="Source document name")
    page: int = Field(..., description="Page number in the document")
system_message = "\n".join([
    "You are a professional NLP data parser.",
    "Answer the question provided by the user following the `Output Scheme` to generate the `Output JSON`.",
    "Do not generate any introduction or conclusion."
])

## 6.3 Load QA Dataset

Convert JSONL QA pairs into LLaMA-Factory fine-tuning format.

In [None]:
sft_train_path = os.path.join(PROJECT_PATH, "covid_qa_dataset.jsonl")
llm_finetuning_data = []



for line in open(sft_train_path, "r", encoding="utf-8"):
  if line.strip() == "":
        continue
  qa_pair = json.loads(line.strip())
  llm_finetuning_data.append({
        "system": system_message,
        "instruction": "\n".join([
            "## User Question:",
            qa_pair["question"],
            "",
            "## Pydantic Details (Output Schema):",
            json.dumps(GenerateAnswer.model_json_schema(), ensure_ascii=False),
            "",
            "## Structured Answer:",
        ]),
        "input": "",
        "output": "\n".join([
            "```json",
            json.dumps(qa_pair, ensure_ascii=False),
            "```"
        ]),
        "history": []
    })


random.Random(42).shuffle(llm_finetuning_data)
len(llm_finetuning_data)

340

## 6.4 Train/Validation Split

In [None]:
train_sample_sz = 300
train_ds = llm_finetuning_data[:train_sample_sz]
eval_ds = llm_finetuning_data[train_sample_sz:]
import os
from os.path import join


# 5.5 Save LLaMA-Factory JSON Files
# train.json and val.json inside the fine-tuning folder.
save_path = join(PROJECT_PATH, "data", "llamafactory-finetune-data")
os.makedirs(save_path, exist_ok=True)
with open(join(save_path, "train.json"), "w", encoding="utf-8") as f:
    json.dump(train_ds, f, ensure_ascii=False, default=str)

with open(join(save_path, "val.json"), "w", encoding="utf-8") as f:
    json.dump(eval_ds, f, ensure_ascii=False, default=str)

In [None]:
join(PROJECT_PATH, "data", "llamafactory-finetune-data", "val.json")

'/content/drive/MyDrive/COVID-19 Vaccine Side Effects and Safety/data/llamafactory-finetune-data/val.json'

# **7. LoRA Fine-tuning**

Fine-tune the small LLM with LoRA using the dataset prepared in Section 6

## 6.1 Configure YAML for Training
Define model, dataset, and training hyperparameters in YAML.

In [None]:
%%writefile "/content/drive/MyDrive/COVID-19 Vaccine Side Effects and Safety/LLaMA-Factory/examples/train_lora/qa_finetune.yaml"

### model
model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 32
lora_target: all

### dataset
dataset: qa_finetune_train
eval_dataset: qa_finetune_val
template: qwen
cutoff_len: 3500
# max_samples: 50
overwrite_cache: true
preprocessing_num_workers: 16

### output
#resume_from_checkpoint: /content/drive/MyDrive/COVID-19 Vaccine Side Effects and Safety/models/checkpoint-1500
output_dir: /content/drive/MyDrive/COVID-19 Vaccine Side Effects and Safety/models/
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 20
learning_rate: 1.0e-4
num_train_epochs: 4.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
per_device_eval_batch_size: 2
eval_strategy: steps
eval_steps: 20


Overwriting /content/drive/MyDrive/COVID-19 Vaccine Side Effects and Safety/LLaMA-Factory/examples/train_lora/qa_finetune.yaml


# 7.2 Run LoRA Fine-tuning

Execute LLaMA-Factory training CLI with the YAML config

In [None]:
!cd "/content/drive/.shortcut-targets-by-id/1_GzXpxDDuF7W4kQw4wsC-4QCgbmBihNW/COVID-19 Vaccine Side Effects and Safety/LLaMA-Factory/" && \
llamafactory-cli train examples/train_lora/qa_finetune.yaml

2026-01-20 18:35:26.741119: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1768934126.773461   33804 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1768934126.783231   33804 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1768934126.806265   33804 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1768934126.806298   33804 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1768934126.806307   33804 computation_placer.cc:177] computation placer alr

# **8. Evaluation**
## 8.1 Load Fine-tuned Model


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

## Configure API Keys
# Load WandB and HuggingFace tokens from Colab userdata for experiment logging and model authentication.

from google.colab import userdata
import wandb

wandb.login(key=userdata.get('wandb'))
hf_token = userdata.get('huggingface')
!huggingface-cli login --token {hf_token}

BASE_MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"
FINETUNED_MODEL_PATH = "/content/drive/MyDrive/COVID-19 Vaccine Side Effects and Safety/models/"

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)

# Load LoRA-adapter
model.load_adapter(FINETUNED_MODEL_PATH)

device = "cuda" if torch.cuda.is_available() else "cpu"


## 8.2 Define Response Generation Function

Prepare a helper to convert LLaMA-Factory style messages into model prediction

In [None]:
def generate_resp(messages):
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
        do_sample=False
    )

    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response


## 8.3 Excute Benchmark
Prepare a helper to convert LLaMA-Factory style messages into model prediction

In [None]:
!pip install evaluate
!pip install rouge_score absl-py

Collecting rouge_score
  Using cached rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=a4830232f145e3e5de8cc7a6e77a537183377f4288ba98ba79b1be89f1538487
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [None]:
import json
import time
import pandas as pd
import torch
import evaluate
from tqdm import tqdm

# Initialize automated evaluation metrics
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

def run_benchmark(eval_path, output_excel="/content/drive/MyDrive/COVID-19 Vaccine Side Effects and Safety/evaluation_results.xlsx"):

    # Load your validation dataset
    with open(eval_path, "r", encoding="utf-8") as f:
        eval_data = json.load(f)

    results = []
    all_predictions = []
    all_references = []

    print(f Starting Evaluation on {len(eval_data)} samples...")

    for item in tqdm(eval_data):
        # Prepare the input prompt following your trained template
        # 'instruction' usually contains the User Question in LLaMA-Factory format
        user_query = item.get('instruction', "")
        ground_truth = item.get('output', "")

        messages = [
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"## User Question:\n{user_query}\n\n## Structured Answer:\n```json"}
        ]

        # Measure Inference Latency (Time taken for a single request)
        start_time = time.time()
        try:
            prediction = generate_resp(messages)
        except Exception as e:
            prediction = f"Error: {str(e)}"
        end_time = time.time()

        latency = end_time - start_time

        # Store individual sample results
        results.append({
            "Question": user_query,
            "Ground_Truth": ground_truth,
            "Model_Prediction": prediction,
            "Latency_Seconds": latency
        })

        all_predictions.append(prediction)
        all_references.append(ground_truth)


    rouge_results = rouge.compute(predictions=all_predictions, references=all_references)
    bleu_results = bleu.compute(predictions=all_predictions, references=all_references)

    # Convert results to a DataFrame for analysis and reporting
    df = pd.DataFrame(results)

    # Calculate Latency and Throughput (Inference speed metrics)
    avg_latency = df["Latency_Seconds"].mean()
    total_time = df["Latency_Seconds"].sum()
    throughput = len(eval_data) / total_time # Samples processed per second

    print("\n" + "="*30)
    print("BENCHMARK REPORT")
    print("="*30)
    print(f"Average Latency:  {avg_latency:.2f} sec/sample")
    print(f"Throughput:       {throughput:.2f} samples/sec")
    print(f"ROUGE-L Score:    {rouge_results['rougeL']:.4f}")
    print(f"BLEU Score:       {bleu_results['bleu']:.4f}")
    print("="*30)

    # Save to Excel for project documentation/presentation
    df.to_excel(output_excel, index=False)
    print(f"Results saved to {output_excel}")

# Execute the benchmark
VAL_DATA_PATH = "/content/drive/MyDrive/COVID-19 Vaccine Side Effects and Safety/data/llamafactory-finetune-data/val.json"
run_benchmark(VAL_DATA_PATH)

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading extra modules: 0.00B [00:00, ?B/s]

🚀 Starting Evaluation on 40 samples...


100%|██████████| 40/40 [05:05<00:00,  7.65s/it]



📊 BENCHMARK REPORT
Average Latency:  7.65 sec/sample
Throughput:       0.13 samples/sec
ROUGE-L Score:    0.6489
BLEU Score:       0.6121
✅ Results saved to evaluation_results.xlsx
