<a href="https://colab.research.google.com/github/Innocente0/LLMs_Fine-Tuning_Summative/blob/main/LLMs_Fine_Tuning_Summative_Aline_Innocente.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Medical Assistant via LLM Fine-Tuning
This cell checks if a GPU is available in Colab and shows its type and memory. It helps confirm that fine-tuning can run efficiently on available hardware.

In [None]:
!nvidia-smi
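The same check can be done programmatically, for example to fail early when no GPU is attached. The sketch below is a minimal, standard-library version; the helper name `gpu_summary` is hypothetical, and it assumes the `nvidia-smi` binary is on the `PATH` whenever a GPU runtime is active.

```python
import shutil
import subprocess

def gpu_summary():
    """Return one 'name, total memory' string per visible GPU, or [] if none is found."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA driver tools on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # tool present but no usable GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

gpus = gpu_summary()
print(gpus if gpus else "No GPU detected; switch the Colab runtime to a GPU before training.")
```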

**Install Dependencies**

Installs all required libraries (Transformers, PEFT, BitsAndBytes, Datasets, Gradio, etc.). This ensures the Colab environment has everything needed for loading, fine-tuning, evaluating, and deploying the LLM.

In [None]:
!pip -q install -U transformers datasets accelerate peft trl bitsandbytes evaluate rouge_score nltk sentencepiece gradio

**Imports and NLTK Download**

In [None]:
import os, time, math, random, zipfile
import pandas as pd
import numpy as np
import torch

from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    BitsAndBytesConfig, TrainingArguments
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

import evaluate
import nltk
nltk.download("punkt")

## Loading Data

In [None]:
ZIP_PATH = "/content/medquad.csv.zip"  # your uploaded file
EXTRACT_DIR = "/mnt/data/medquad_extracted"

os.makedirs(EXTRACT_DIR, exist_ok=True)

with zipfile.ZipFile(ZIP_PATH, "r") as z:
    z.extractall(EXTRACT_DIR)

print("Extracted files:", os.listdir(EXTRACT_DIR))

Loads the extracted MedQuAD CSV into a pandas DataFrame and prints its shape and first rows, so the columns can be inspected before cleaning.

In [None]:
CSV_PATH = os.path.join(EXTRACT_DIR, "medquad.csv")
df = pd.read_csv(CSV_PATH)

print(df.shape)
df.head()

**Clean and Filter Dataset**
Removes rows with missing values, strips surrounding whitespace, drops very short question–answer pairs, and removes exact duplicates. This improves data quality so the model trains on meaningful, consistent medical examples.


In [None]:
# Basic cleanup
df = df.dropna(subset=["question", "answer"]).copy()
df["question"] = df["question"].astype(str).str.strip()
df["answer"] = df["answer"].astype(str).str.strip()

# Remove empties
df = df[(df["question"].str.len() > 5) & (df["answer"].str.len() > 10)]

# Drop duplicates
df = df.drop_duplicates(subset=["question", "answer"])

print("After cleaning:", df.shape)
df.head(3)

Subsamples the cleaned dataset from roughly 16k rows down to 3,000 question–answer pairs, using a fixed random seed so the selection is reproducible.

In [None]:
# Keep 3,000 question-answer pairs in total (this is the size before the train/validation split)
TRAIN_SIZE = 3000

if TRAIN_SIZE < len(df):
    df = df.sample(TRAIN_SIZE, random_state=42).reset_index(drop=True)

print("Using dataset size:", len(df))

**Format the Dataset into an Instruction-Response Template**

Converts each row into a single formatted string with **Instruction** and **Response** sections. When a row has a non-empty `focus_area`, it is prepended to the question as a `[Topic: ...]` tag.

In [None]:
def format_example(row):
    topic = row.get("focus_area", "")
    topic_str = f"[Topic: {topic}] " if isinstance(topic, str) and len(topic.strip()) > 0 else ""
    return (
        "### Instruction:\n"
        f"{topic_str}{row['question']}\n\n"
        "### Response:\n"
        f"{row['answer']}"
    )

df["text"] = df.apply(format_example, axis=1)
df[["question", "answer", "focus_area", "text"]].head(2)
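To make the template concrete, the sketch below applies the same formatting logic to two hand-written rows. Plain dictionaries stand in for DataFrame rows, the helper is renamed `format_row` to avoid clashing with the cell above, and the question, answer, and topic values are invented for illustration rather than taken from MedQuAD.

```python
def format_row(row):
    """Same template as format_example, applied to a plain dict."""
    topic = row.get("focus_area", "")
    # Only string topics with visible characters produce a prefix; NaN (a float) is skipped
    topic_str = f"[Topic: {topic}] " if isinstance(topic, str) and topic.strip() else ""
    return (
        "### Instruction:\n"
        f"{topic_str}{row['question']}\n\n"
        "### Response:\n"
        f"{row['answer']}"
    )

# Hypothetical rows, made up for this example
with_topic = {"question": "What is anemia?", "answer": "A low red blood cell count.", "focus_area": "Anemia"}
without_topic = {"question": "What is anemia?", "answer": "A low red blood cell count.", "focus_area": float("nan")}

formatted_with_topic = format_row(with_topic)
formatted_without_topic = format_row(without_topic)

print(formatted_with_topic)
print("---")
print(formatted_without_topic)
```

A missing `focus_area` typically arrives from pandas as `NaN`, which is a float, so the `isinstance` check silently drops the topic tag for those rows.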

**Train and Validation Split**
Converts the DataFrame to a Hugging Face Dataset and splits it into training and validation subsets. This enables proper evaluation of fine-tuning performance on unseen examples.

In [None]:
dataset = Dataset.from_pandas(df[["text"]])

split = dataset.train_test_split(test_size=0.1, seed=42)
train_ds = split["train"]
eval_ds  = split["test"]

print(train_ds, eval_ds)

**Load Base Model and Tokenizer**

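Loads TinyLlama 1.1B Chat with 4-bit NF4 quantization (double quantization, float16 compute) together with its tokenizer, falling back to the EOS token for padding if no pad token is defined. Quantization lets the model fit into Colab GPU memory for LoRA fine-tuning on domain-specific medical text.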

In [None]:
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto"
)

model.config.use_cache = False  # important for training
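A rough back-of-the-envelope calculation shows why 4-bit loading helps. The sketch below only estimates storage for the weights themselves; it assumes a nominal 1.1 billion parameters and deliberately ignores quantization constants, layers that stay in higher precision, activations, and optimizer state, so real usage reported by `nvidia-smi` will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate storage for n_params weights at the given precision, in GiB."""
    return n_params * bits_per_param / 8 / (1024 ** 3)

N_PARAMS = 1.1e9  # nominal TinyLlama size; an approximation, not read from the model

fp16_gib = weight_memory_gib(N_PARAMS, 16)
nf4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gib:.2f} GiB")
print(f"4-bit weights: ~{nf4_gib:.2f} GiB")
```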

**Adding LoRA Adapters**

Prepares the quantized model for k-bit training and inserts LoRA adapters into the attention projections. This enables parameter-efficient fine-tuning: only the adapter weights are trained while the original model weights stay frozen.


In [None]:
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # TinyLlama/Llama-like blocks usually use these:
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
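The number printed by `print_trainable_parameters()` can be sanity-checked by hand: a LoRA adapter of rank `r` on a linear layer adds `r * (in_features + out_features)` weights. The sketch below does that arithmetic. The architecture values (hidden size 2048, 22 layers, 32 attention heads, 4 key/value heads) are assumptions based on the published TinyLlama configuration and should be checked against `model.config` rather than taken as given.

```python
# Assumed TinyLlama-1.1B architecture values; verify against model.config
HIDDEN = 2048
N_LAYERS = 22
N_HEADS = 32
N_KV_HEADS = 4
R = 16  # LoRA rank used above

head_dim = HIDDEN // N_HEADS      # size of one attention head
kv_dim = N_KV_HEADS * head_dim    # grouped-query attention shrinks k_proj / v_proj outputs

# (in_features, out_features) for each targeted projection
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, kv_dim),
    "v_proj": (HIDDEN, kv_dim),
    "o_proj": (HIDDEN, HIDDEN),
}

def lora_params(in_features, out_features, r):
    """Weights in the two low-rank matrices A (r x in) and B (out x r)."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in projections.values())
total = per_layer * N_LAYERS

print("LoRA parameters per layer:", per_layer)
print("LoRA parameters in total: ", total)
```

If the count printed by PEFT differs noticeably from this estimate, the likely causes are different `target_modules` or architecture values that do not match the assumptions above.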

**Training Setup and Experiments Tracking Table**

Defines a function to append hyperparameters and metric results into a CSV file. This creates an experiment log for tracking different runs and comparing performance systematically.

In [None]:
EXPERIMENTS_CSV = "/mnt/data/experiments_log.csv"

def log_experiment(row_dict):
    df_log = pd.DataFrame([row_dict])
    if os.path.exists(EXPERIMENTS_CSV):
        old = pd.read_csv(EXPERIMENTS_CSV)
        df_log = pd.concat([old, df_log], ignore_index=True)
    df_log.to_csv(EXPERIMENTS_CSV, index=False)
    return df_log

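Tokenizes the training and validation text without padding, then pads each batch in a custom collator that copies `input_ids` into `labels` and sets padded positions to -100 so they are ignored by the loss. The cell also defines the learning rate, batch size, gradient accumulation, epoch count, and logging/evaluation intervals, and builds the `Trainer`.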

In [None]:
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding

LR = 5e-5
BATCH_SIZE = 2
GRAD_ACCUM = 8
EPOCHS = 2
MAX_SEQ_LEN = 512

# Tokenize WITHOUT labels (important)
def tokenize_fn(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=MAX_SEQ_LEN,
        padding=False,  # we pad in the collator
    )

train_tok = train_ds.map(tokenize_fn, batched=True, remove_columns=train_ds.column_names)
eval_tok  = eval_ds.map(tokenize_fn, batched=True, remove_columns=eval_ds.column_names)

# Custom collator: pads then creates labels = input_ids, and masks pad tokens
base_pad_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)

def causal_lm_collator(features):
    batch = base_pad_collator(features)  # pads input_ids + attention_mask
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in loss
    batch["labels"] = labels
    return batch

# Training arguments (recent transformers releases use `eval_strategy`; older ones call it `evaluation_strategy`)
training_args = TrainingArguments(
    output_dir="/mnt/data/medquad_tinyllama_lora",
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=GRAD_ACCUM,
    num_train_epochs=EPOCHS,
    learning_rate=LR,
    warmup_steps=10,
    logging_steps=20,
    eval_strategy="steps",
    eval_steps=200,
    save_steps=200,
    save_total_limit=2,
    fp16=True,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=eval_tok,
    data_collator=causal_lm_collator,
)

# trainer.train() is called in the next cell, where it is timed; calling it here as well would run training twice.
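It helps to know how many optimizer steps these settings produce, because `eval_steps` and `save_steps` are expressed in steps. The sketch below works through the arithmetic for a single GPU, assuming the cleaned dataset really yields 3,000 rows before the 90/10 split. The exact step count can differ by one or two depending on how the installed `Trainer` version handles the last partial accumulation group, so treat the result as an estimate.

```python
import math

TOTAL_EXAMPLES = 3000   # rows kept after subsampling (assumed)
TEST_FRACTION = 0.1     # test_size passed to train_test_split
BATCH_SIZE = 2
GRAD_ACCUM = 8
EPOCHS = 2
EVAL_STEPS = 200

eval_examples = round(TOTAL_EXAMPLES * TEST_FRACTION)
train_examples = TOTAL_EXAMPLES - eval_examples

# One optimizer step consumes BATCH_SIZE * GRAD_ACCUM examples on a single device
effective_batch = BATCH_SIZE * GRAD_ACCUM
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * EPOCHS
evals_during_training = total_steps // EVAL_STEPS

print("train / eval examples:", train_examples, "/", eval_examples)
print("effective batch size: ", effective_batch)
print("approx. total steps:  ", total_steps)
print("evaluations triggered:", evals_during_training)
```

With these values only one evaluation and one checkpoint fall inside the run, so a smaller `eval_steps` may be worth considering if a validation-loss curve is needed.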

In [None]:
start_time = time.time()
train_result = trainer.train()
train_time_sec = time.time() - start_time

train_time_sec, train_result.metrics

**Save Fine-Tuned Adapter and Tokenizer**

Saves the fine-tuned LoRA adapter weights and tokenizer configuration to disk. This allows reloading the specialized medical assistant later without retraining.

In [None]:
SAVE_DIR = "/mnt/data/medquad_tinyllama_lora_adapter"
trainer.model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

print("Saved to:", SAVE_DIR)

**Helper: generate_response Function**

Defines a generation helper that tokenizes prompts, calls model.generate, and decodes outputs. It optionally extracts only the part after **Response** for clean responses.

In [None]:
def generate_response(model_obj, prompt, max_new_tokens=160, temperature=0.7, top_p=0.9):
    model_obj.eval()  # switch off dropout (including LoRA dropout) before generating
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_SEQ_LEN).to(model_obj.device)
    with torch.no_grad():
        out = model_obj.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            pad_token_id=tokenizer.eos_token_id
        )
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    # return only the part after "### Response:" if present
    if "### Response:" in text:
        return text.split("### Response:", 1)[-1].strip()
    return text.strip()

**Build a Small Eval for Metrics**

Constructs prompts and reference answers from the validation dataset by splitting **Instruction** and **Response**. These are used to fairly compare base and fine-tuned models.

In [None]:
# 100 samples for evaluation metrics
EVAL_N = min(100, len(eval_ds))
eval_samples = eval_ds.select(range(EVAL_N))

# Extract references (true answers) from formatted text
def extract_ref(formatted_text):
    if "### Response:" in formatted_text:
        return formatted_text.split("### Response:", 1)[-1].strip()
    return formatted_text

def extract_prompt(formatted_text):
    # The prompt is everything up to the response marker, followed by the marker itself
    if "### Response:" in formatted_text:
        return formatted_text.split("### Response:", 1)[0].strip() + "\n\n### Response:\n"
    return formatted_text.strip() + "\n\n### Response:\n"

prompts = [extract_prompt(x["text"]) for x in eval_samples]
refs = [extract_ref(x["text"]) for x in eval_samples]
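The split can be checked on a single formatted string. The sketch below uses a hypothetical helper, `split_example`, that mirrors the marker logic of `extract_prompt` and `extract_ref` in one function; the sample text is invented for illustration.

```python
MARKER = "### Response:"

def split_example(formatted_text):
    """Return (prompt, reference) for one formatted example."""
    head, sep, tail = formatted_text.partition(MARKER)
    # The prompt ends with the marker so the model continues with the answer
    prompt = head.strip() + "\n\n" + MARKER + "\n"
    # Without a marker, fall back to the whole text as the reference
    reference = tail.strip() if sep else formatted_text
    return prompt, reference

# Hypothetical formatted example
sample = "### Instruction:\nWhat causes asthma?\n\n### Response:\nAirway inflammation."
prompt, reference = split_example(sample)

print(repr(prompt))
print(repr(reference))
```

Because the prompt stops right after the marker, the reference answer is never shown to the model during generation.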

**Load Base Model for Comparison**

Reloads the original TinyLlama base model in 4-bit quantized form, without any adapter. This enables a direct side-by-side comparison between the pretrained base model and the fine-tuned medical assistant.

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto"
)
base_model.config.use_cache = False

**Generate Predictions**

Uses both the base and fine-tuned models to generate answers for the same validation prompts. These predictions are later scored with ROUGE and BLEU; perplexity is computed separately on the reference texts.

In [None]:
ft_preds = [generate_response(trainer.model, p) for p in prompts]
base_preds = [generate_response(base_model, p) for p in prompts]

print("Sample prompt:\n", prompts[0])
print("\nBASE:\n", base_preds[0][:400])
print("\nFINE-TUNED:\n", ft_preds[0][:400])
print("\nREF:\n", refs[0][:400])

**Compute ROUGE and BLEU Scores**

Loads the ROUGE and BLEU metrics, computes them for base and fine-tuned predictions against the reference answers, and returns the metric dictionaries. Both metrics measure word-level overlap with the references, not medical correctness.

In [None]:
rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

def compute_metrics(preds, refs):
    # ROUGE expects raw strings for predictions and references
    rouge_scores = rouge.compute(predictions=preds, references=refs)
    # BLEU expects raw strings for predictions and a list of raw strings for references (per prediction)
    bleu_scores = bleu.compute(predictions=preds, references=[[r] for r in refs])
    return rouge_scores, bleu_scores

ft_rouge, ft_bleu = compute_metrics(ft_preds, refs)
base_rouge, base_bleu = compute_metrics(base_preds, refs)

ft_rouge, ft_bleu, base_rouge, base_bleu
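To show what the `rougeL` number measures, here is a simplified ROUGE-L F1 based on the longest common subsequence (LCS) of tokens. It is an illustration only: it splits on whitespace and lowercases, whereas the `rouge_score` package used by `evaluate` has its own tokenization and options, so the values will not match the library exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat sat", "the cat sat on the mat")
print(round(score, 4))
```

In this toy case every predicted token appears in order in the reference (precision 1.0), but only half of the reference is covered (recall 0.5), which is why short answers can still score moderately well.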

**Compute Perplexity on Validation Texts**

Calculates perplexity for the base and fine-tuned models on the formatted validation examples (instruction and response together) as the exponential of the mean language-model loss. A lower perplexity for the fine-tuned model would indicate that it fits the medical text distribution better.

In [None]:

def perplexity_on_texts(model_obj, texts, max_len=512):
    model_obj.eval()
    losses = []
    for t in texts:
        enc = tokenizer(t, return_tensors="pt", truncation=True, max_length=max_len).to(model_obj.device)
        with torch.no_grad():
            out = model_obj(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    return float(math.exp(np.mean(losses)))

# Evaluate perplexity on the formatted eval examples
eval_texts = [x["text"] for x in eval_samples]
ft_ppl = perplexity_on_texts(trainer.model, eval_texts, max_len=MAX_SEQ_LEN)
base_ppl = perplexity_on_texts(base_model, eval_texts, max_len=MAX_SEQ_LEN)

ft_ppl, base_ppl
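The function above averages one loss per example and then exponentiates, which gives short and long examples equal weight. The sketch below contrasts that with a token-weighted average on made-up numbers. The loss values and token counts are invented for illustration, and the token-weighted variant is shown as an alternative, not as what the notebook computes.

```python
import math

def perplexity_macro(losses):
    """exp of the plain mean of per-example losses (the approach used above)."""
    return math.exp(sum(losses) / len(losses))

def perplexity_token_weighted(losses, token_counts):
    """exp of the mean loss per token, weighting each example by its length."""
    total_nll = sum(loss * n for loss, n in zip(losses, token_counts))
    return math.exp(total_nll / sum(token_counts))

# Hypothetical per-example mean losses and numbers of predicted tokens
losses = [2.0, 1.0]
token_counts = [10, 30]

macro_ppl = perplexity_macro(losses)
weighted_ppl = perplexity_token_weighted(losses, token_counts)

print(f"macro-averaged perplexity: {macro_ppl:.4f}")
print(f"token-weighted perplexity: {weighted_ppl:.4f}")
```

Either definition works for comparing the base and fine-tuned models, as long as the same one is applied to both.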

**Log Experiments Row**

Collects hyperparameters, metrics, training time, and GPU memory usage into a dictionary and appends it to the experiment log CSV. This documents the final best run clearly.

In [None]:
gpu_mem = torch.cuda.max_memory_allocated() / (1024**3) if torch.cuda.is_available() else None

exp_row = {
    "model": MODEL_ID,
    "dataset": "MedQuAD",
    "train_size": len(train_ds),
    "eval_size": len(eval_ds),
    "max_seq_len": MAX_SEQ_LEN,
    "lr": LR,
    "batch_size": BATCH_SIZE,
    "grad_accum": GRAD_ACCUM,
    "epochs": EPOCHS,
    "lora_r": lora_config.r,
    "lora_alpha": lora_config.lora_alpha,
    "lora_dropout": lora_config.lora_dropout,
    "train_time_sec": train_time_sec,
    "gpu_max_mem_gb": gpu_mem,
    "ft_rougeL": ft_rouge.get("rougeL", None),
    "base_rougeL": base_rouge.get("rougeL", None),
    "ft_bleu": ft_bleu.get("bleu", None),
    "base_bleu": base_bleu.get("bleu", None),
    "ft_ppl": ft_ppl,
    "base_ppl": base_ppl,
    "notes": "LoRA 4-bit TinyLlama, MedQuAD formatted template"
}

log_experiment(exp_row).tail(5)

**Qualitative Comparison Table**

Runs both base and fine-tuned models on a small set of custom medical and non-medical questions, then builds a comparison DataFrame. This showcases behavioral differences beyond numeric metrics.

In [None]:
def compare_on_questions(questions):
    rows = []
    for q in questions:
        prompt = f"### Instruction:\n{q}\n\n### Response:\n"
        base_ans = generate_response(base_model, prompt)
        ft_ans = generate_response(trainer.model, prompt)
        rows.append({"question": q, "base_answer": base_ans, "fine_tuned_answer": ft_ans})
    return pd.DataFrame(rows)

test_questions = [
    "What are common symptoms of anemia?",
    "How is hypertension typically treated?",
    "What causes asthma?",
    "What is the recommended action for a high fever in a child?",
    # out-of-domain checks
    "Write a Python function to sort a list.",
    "Who won the 2022 World Cup?"
]

compare_df = compare_on_questions(test_questions)
compare_df

**Gradio Chatbot Interface and Deployment**

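Defines a chat function that wraps the user's question in the instruction template and returns the generated answer, then builds and launches a Gradio Interface. The interface description labels the assistant as for education/demo use only; the answers themselves are returned as generated. This provides an interactive web UI for testing the medical assistant.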

In [None]:
import gradio as gr

def chat(user_question):
    prompt = f"### Instruction:\n{user_question}\n\n### Response:\n"
    answer = generate_response(trainer.model, prompt, max_new_tokens=220, temperature=0.7, top_p=0.9)
    return answer

demo = gr.Interface(
    fn=chat,
    inputs=gr.Textbox(lines=3, placeholder="Ask a medical question..."),
    outputs="text",
    title="Medical Q&A Assistant (MedQuAD fine-tuned with LoRA)",
    description="Domain-specific assistant fine-tuned from TinyLlama using MedQuAD. For education/demo only."
)

demo.launch(share=True)
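Since `chat` returns the raw generation, one way to attach a disclaimer to every answer is a small post-processing helper. The sketch below is a hypothetical addition, not part of the notebook's code: the name `with_disclaimer` and the disclaimer wording are placeholders.

```python
# Placeholder wording; adjust to the deployment's requirements
DISCLAIMER = "This answer is for educational purposes only and is not medical advice."

def with_disclaimer(answer, disclaimer=DISCLAIMER):
    """Append a fixed disclaimer to a generated answer."""
    answer = answer.strip()
    if not answer:
        return disclaimer  # nothing was generated, show only the disclaimer
    return f"{answer}\n\n{disclaimer}"

example = with_disclaimer("  Anemia can cause fatigue and pale skin.  ")
print(example)
```

In the interface, this could be applied by returning `with_disclaimer(answer)` from `chat` instead of `answer`.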

**Save to Google Drive**

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
import os

DRIVE_DIR = "/content/drive/MyDrive/medquad_lora_run"
os.makedirs(DRIVE_DIR, exist_ok=True)

!cp -r "{SAVE_DIR}" "{DRIVE_DIR}/adapter"
!cp "{EXPERIMENTS_CSV}" "{DRIVE_DIR}/experiments_log.csv"
print("Saved adapters + logs to:", DRIVE_DIR)