# Colab QLoRA Fine-Tuning Tool

This notebook walks you through training instruction-following adapters (LoRA) on top of open-source chat models such as `meta-llama/Llama-2-7b-chat-hf` using the QLoRA technique. Each section explains **what** to run and **why it matters** so you can confidently adapt the workflow to new datasets.


## Runtime checklist

1. Go to **Runtime ▸ Change runtime type** and select `T4 GPU` as the hardware accelerator (L4 or A100 if you have Colab Pro/Pro+).
2. Leave the remaining settings at their defaults and save.
3. Connect the runtime before running the cells below.

> **Why:** QLoRA loads the base model in 4-bit precision, so a T4's 16 GB VRAM is sufficient for 7B chat models when you keep batch sizes modest.
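
As a back-of-the-envelope check of that claim, the sketch below estimates the size of the raw weights at several precisions. The parameter count (about 6.74 billion for Llama-2-7B) and the 0.5 bytes per parameter used for NF4 are approximations, and the estimate ignores quantization constants, activations, adapter weights, and optimizer state.

```python
# Rough weight-memory estimate for a ~7B-parameter model (illustrative numbers only).
N_PARAMS = 6.74e9  # approximate parameter count, an assumption for this sketch

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "nf4": 0.5}


def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Return the size of the raw weights in GiB."""
    return n_params * bytes_per_param / 1024**3


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_gib(N_PARAMS, nbytes):.1f} GiB")
```

Only the 4-bit figure leaves meaningful headroom on a 16 GB card for activations and adapter gradients, which is why batch size and sequence length still matter.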


In [None]:
!nvidia-smi


In [None]:
%%capture
%pip install -U accelerate==0.27.2 bitsandbytes==0.43.0 datasets==2.17.0 evaluate==0.4.1 huggingface_hub==0.21.4 peft==0.8.2 sentencepiece==0.1.99 transformers==4.38.2 trl==0.7.10 wandb


In [None]:
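from getpass import getpass

from huggingface_hub import login

# getpass hides the token as you type, so it is not stored in the cell output
login(token=getpass("Paste your Hugging Face access token: ").strip(), add_to_git_credential=True)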


## Data ingestion options

- **Upload text files** (left sidebar ▸ Files ▸ Upload) into the folder named by `text_folder` (default `/content/data`) and set `dataset_source="text_folder"` in `Config`.
- **Mount Google Drive** for larger corpora:
  ```python
  from google.colab import drive
  drive.mount('/content/drive')
  ```
- **Use a public Hugging Face dataset** by setting `dataset_source="hf_dataset"` and supplying its repo id in `hf_dataset` (e.g., `tatsu-lab/alpaca`).

The helper below consolidates these flows and creates a `datasets.Dataset` with `instruction`, `response`, and optional `system` fields.
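
To make the target row shape concrete, here is a minimal sketch that maps one Alpaca-style record onto those three fields. The function name `to_row` is hypothetical, and folding the optional `input` field into the instruction is an assumption of this sketch; the notebook's own loader does not do this.

```python
def to_row(record: dict, default_system: str) -> dict:
    """Map an Alpaca-style record onto instruction/response/system fields."""
    instruction = str(record.get("instruction", "")).strip()
    extra = str(record.get("input", "") or "").strip()
    if extra:
        # Assumption: append the optional context to the instruction text.
        instruction = f"{instruction}\n\n{extra}"
    return {
        "instruction": instruction,
        "response": str(record.get("output", "")).strip(),
        "system": record.get("system") or default_system,
    }


example = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
print(to_row(example, default_system="You are a helpful assistant."))
```

Rows that end up with an empty `instruction` or `response` are the ones the cleaning step is meant to drop.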


In [None]:
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Config:
    project_name: str = "qlora-custom-data"
    base_model: str = "meta-llama/Llama-2-7b-chat-hf"
    dataset_source: str = "text_folder"  # text_folder | hf_dataset
    text_folder: str = "/content/data"   # ignored when using hf_dataset
    hf_dataset: str | None = None         # e.g. "tatsu-lab/alpaca"
    max_samples: int | None = None
    chunk_tokens: int = 1024
    chunk_overlap: int = 128
    system_prompt: str = "You are a helpful assistant that responds with concise, domain-specific answers."
    output_dir: str = "/content/qlora-output"
    wandb_project: str | None = None

    # Training hyperparams
    micro_batch_size: int = 4
    gradient_accumulation_steps: int = 4
    epochs: float = 3.0
    learning_rate: float = 2e-4
    warmup_ratio: float = 0.03
    weight_decay: float = 0.0
    cutoff_len: int = 2048
    lora_r: int = 64
    lora_alpha: int = 16
    lora_dropout: float = 0.1
    seed: int = 42


cfg = Config()
print(asdict(cfg))


In [None]:
import json
import random
import re

import pandas as pd
from datasets import Dataset, load_dataset


def _normalize(text: str) -> str:
    text = text.replace("\r", " ").strip()
    text = re.sub(r"\s+", " ", text)
    return text


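def _chunk_words(text: str, chunk_tokens: int, overlap: int) -> list[str]:
    """Split text into overlapping windows.

    Sizes are counted in whitespace-separated words, not model tokens, so the
    tokenized length of a chunk will usually be larger than `chunk_tokens`.
    """
    words = text.split()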
    if not words:
        return []
    step = max(chunk_tokens - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        segment = words[start:start + chunk_tokens]
        if len(segment) < 32:
            continue
        chunks.append(" ".join(segment))
    return chunks or [" ".join(words[:chunk_tokens])]


def _load_local_texts(folder: str, cfg: Config) -> list[dict]:
    folder_path = Path(folder)
    rows = []
    for path in folder_path.rglob("*.txt"):
        text = path.read_text(encoding="utf-8")
        for chunk in _chunk_words(_normalize(text), cfg.chunk_tokens, cfg.chunk_overlap):
            rows.append({
                "instruction": f"Answer the following based on {path.stem}:",
                "response": chunk,
                "system": cfg.system_prompt,
            })
    return rows


def _load_hf_dataset(repo_id: str, cfg: Config) -> list[dict]:
    ds = load_dataset(repo_id, split="train")
    fields = set(ds.column_names)
    rows = []
    for row in ds:
        if {"instruction", "output"}.issubset(fields):
            response = row["output"]
            instruction = row["instruction"]
        else:
            # Fallback to single text field
            response = row.get("text", row.get("response", ""))
            instruction = row.get("instruction", "Summarize the passage:")
        rows.append({
            "instruction": _normalize(str(instruction)),
            "response": _normalize(str(response)),
            "system": row.get("system", cfg.system_prompt),
        })
    return rows


def build_dataset(cfg: Config) -> Dataset:
    if cfg.dataset_source == "hf_dataset" and cfg.hf_dataset:
        rows = _load_hf_dataset(cfg.hf_dataset, cfg)
    else:
        rows = _load_local_texts(cfg.text_folder, cfg)

    if cfg.max_samples:
        random.seed(cfg.seed)
        rows = random.sample(rows, min(cfg.max_samples, len(rows)))

    clean_rows = [r for r in rows if r["instruction"].strip() and r["response"].strip()]
    dataset = Dataset.from_pandas(pd.DataFrame(clean_rows))
    print(f"Dataset has {len(dataset)} rows after cleaning")
    return dataset


dataset = build_dataset(cfg)
dataset[:2]
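
The overlap logic in `_chunk_words` can be easier to follow with concrete offsets. The sketch below uses a hypothetical helper, `sliding_windows`, that reproduces only the stride arithmetic (it does not apply the 32-word minimum) and prints where each window would start and end for a 2,500-word file.

```python
def sliding_windows(n_words: int, chunk: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) word offsets for each window, end exclusive."""
    step = max(chunk - overlap, 1)  # same stride rule as _chunk_words
    return [(start, min(start + chunk, n_words)) for start in range(0, n_words, step)]


# With 1024-word windows and a 128-word overlap, the stride is 896 words.
for start, end in sliding_windows(n_words=2500, chunk=1024, overlap=128):
    print(f"words {start}..{end} ({end - start} words)")
```

Consecutive windows share 128 words, so text that straddles a boundary still appears intact in at least one chunk.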


## Prompt templating

We rely on the tokenizer's built-in chat template (when available) so that formatting always matches the base model's expectations. If the checkpoint lacks a template, we fall back to a simple system/instruction/response triple.


In [None]:
from transformers import AutoTokenizer


tokenizer = AutoTokenizer.from_pretrained(cfg.base_model, use_fast=False)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


def build_prompt(example: dict) -> dict:
    messages = [
        {"role": "system", "content": example.get("system") or cfg.system_prompt},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    if tokenizer.chat_template:
        prompt = tokenizer.apply_chat_template(messages, tokenize=False)
    else:
        prompt = (
            f"[SYSTEM]\n{messages[0]['content']}\n\n"
            f"[USER]\n{messages[1]['content']}\n\n"
            f"[ASSISTANT]\n{messages[2]['content']}"
        )
    return {"text": prompt}


processed_dataset = dataset.map(build_prompt, remove_columns=dataset.column_names)
processed_dataset = processed_dataset.shuffle(seed=cfg.seed)
splits = processed_dataset.train_test_split(test_size=0.05, seed=cfg.seed)
splits
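
For orientation, the sketch below approximates what a single-turn prompt looks like in the Llama-2 chat layout. The function `render_llama2_style` is hypothetical and hand-written for illustration; the tokenizer's own chat template is authoritative and may differ in whitespace or special-token handling.

```python
def render_llama2_style(system: str, user: str, assistant: str) -> str:
    """Approximate the single-turn Llama-2 chat layout (illustrative only)."""
    return (
        f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
        f"{user} [/INST] {assistant} </s>"
    )


print(render_llama2_style("Be concise.", "What is QLoRA?", "A 4-bit base model plus LoRA adapters."))
```

Printing `splits["train"][0]["text"]` and comparing it against this shape is a quick way to confirm that the template was applied as expected.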


## QLoRA training

We load the base model in 4-bit NF4 using `BitsAndBytesConfig`, attach LoRA adapters to attention + MLP modules, and fine-tune with `trl.SFTTrainer` so padding/truncation are handled automatically.
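
To get a feel for how many parameters the adapters add, the sketch below counts LoRA weights for the seven targeted projections. It assumes the Llama-2-7B dimensions (hidden size 4096, MLP size 11008, 32 decoder layers) and rank 64; other checkpoints will give different totals.

```python
# Assumed Llama-2-7B dimensions; adjust for other base models.
HIDDEN, INTERMEDIATE, LAYERS, RANK = 4096, 11008, 32, 64

# (in_features, out_features) of each targeted projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA pair is A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in PROJECTIONS.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the adapters come to roughly 160 million parameters, a small fraction of the frozen 7B base model.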


In [None]:
import math
import os

import torch
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

os.makedirs(cfg.output_dir, exist_ok=True)
if cfg.wandb_project:
    os.environ["WANDB_PROJECT"] = cfg.wandb_project
else:
    os.environ["WANDB_DISABLED"] = "true"

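# bfloat16 needs an Ampere-or-newer GPU (compute capability >= 8.0, e.g. A100 or L4).
# A T4 is compute capability 7.5, so fall back to float16 there.
compute_dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)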

model = AutoModelForCausalLM.from_pretrained(
    cfg.base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    r=cfg.lora_r,
    lora_alpha=cfg.lora_alpha,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=cfg.lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir=cfg.output_dir,
    per_device_train_batch_size=cfg.micro_batch_size,
    gradient_accumulation_steps=cfg.gradient_accumulation_steps,
    num_train_epochs=cfg.epochs,
    learning_rate=cfg.learning_rate,
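    bf16=torch.cuda.get_device_capability()[0] >= 8,  # Ampere or newer (A100, L4)
    fp16=torch.cuda.get_device_capability()[0] < 8,   # older GPUs such as the T4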
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    evaluation_strategy="steps",
    eval_steps=100,
    warmup_ratio=cfg.warmup_ratio,
    weight_decay=cfg.weight_decay,
    max_grad_norm=0.3,
    report_to=([] if os.environ.get("WANDB_DISABLED") == "true" else ["wandb"]),
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    peft_config=peft_config,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    dataset_text_field="text",
    max_seq_length=cfg.cutoff_len,
    packing=True,
    args=training_args,
)

trainer.train()
trainer.save_model(cfg.output_dir)
tokenizer.save_pretrained(cfg.output_dir)
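
It can help to sanity-check the batch arithmetic before launching a long run. The sketch below uses the default values from `Config`; the helper `approx_optimizer_steps` is hypothetical and gives only a rough single-GPU estimate, since the trainer's exact rounding and the number of packed sequences may differ.

```python
import math

# Defaults copied from Config above
micro_batch_size, grad_accum, cutoff_len, epochs = 4, 4, 2048, 3.0

effective_batch = micro_batch_size * grad_accum  # sequences per optimizer step
tokens_per_step = effective_batch * cutoff_len   # upper bound, assumes fully packed sequences


def approx_optimizer_steps(n_sequences: int) -> int:
    """Rough estimate of total optimizer steps for a given number of packed sequences."""
    return math.ceil(n_sequences / effective_batch * epochs)


print(effective_batch, tokens_per_step, approx_optimizer_steps(1000))
```

If the estimate falls below `eval_steps` and `save_steps` (both 100 above), the run will finish without any intermediate evaluation or checkpoint, so consider lowering those values for small datasets.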


In [None]:
from torch.utils.data import DataLoader


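def compute_perplexity(eval_dataset, max_batches: int = 32) -> float:
    """Token-weighted perplexity over the first `max_batches` eval examples."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    loader = DataLoader(eval_dataset["text"], batch_size=1)
    for idx, batch in enumerate(loader):
        if idx >= max_batches:
            break
        encoded = tokenizer(
            batch[0], return_tensors="pt", truncation=True, max_length=cfg.cutoff_len
        ).to(model.device)
        # The loss is averaged over the shifted targets, i.e. sequence length minus one
        n_predicted = encoded["input_ids"].shape[-1] - 1
        if n_predicted < 1:
            continue
        with torch.no_grad():
            outputs = model(**encoded, labels=encoded["input_ids"])
        total_nll += outputs.loss.item() * n_predicted
        total_tokens += n_predicted
    if total_tokens == 0:
        raise ValueError("No evaluation tokens were scored")
    return math.exp(total_nll / total_tokens)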


perplexity = compute_perplexity(splits["test"], max_batches=32)
print(f"Approximate perplexity: {perplexity:.2f}")
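
Perplexity is conventionally the exponential of the average per-token negative log-likelihood. The sketch below uses made-up loss values to show how that corpus-level figure can differ from a plain average of per-example perplexities, which gives short, hard examples disproportionate weight.

```python
import math

# (mean loss per token, number of scored tokens) for three hypothetical examples
examples = [(1.2, 400), (2.5, 50), (0.9, 900)]

# Unweighted average of per-example perplexities
mean_of_ppl = sum(math.exp(loss) for loss, _ in examples) / len(examples)

# Corpus-level perplexity: weight each example's loss by its token count
total_nll = sum(loss * n for loss, n in examples)
total_tokens = sum(n for _, n in examples)
corpus_ppl = math.exp(total_nll / total_tokens)

print(f"mean of per-example perplexities: {mean_of_ppl:.2f}")
print(f"corpus-level perplexity:          {corpus_ppl:.2f}")
```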


In [None]:
def chat(prompt: str, system: str | None = None, max_new_tokens: int = 512) -> str:
    messages = [
        {"role": "system", "content": system or cfg.system_prompt},
        {"role": "user", "content": prompt},
    ]
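    if tokenizer.chat_template:
        template = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    else:
        # Mirror the fallback format used in build_prompt, ending at the assistant turn
        template = (
            f"[SYSTEM]\n{messages[0]['content']}\n\n"
            f"[USER]\n{messages[1]['content']}\n\n"
            "[ASSISTANT]\n"
        )
    inputs = tokenizer(template, return_tensors="pt").to(model.device)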
    with torch.no_grad():
        generated = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
        )
    output = generated[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(output, skip_special_tokens=True)


chat("Summarize the three most important facts from our dataset.")


In [None]:
from peft import AutoPeftModelForCausalLM


MERGED_DIR = Path(cfg.output_dir) / "merged"
MERGED_DIR.mkdir(parents=True, exist_ok=True)

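# Merging needs the base weights unquantized. Load them in fp16: the default fp32
# load would need roughly 27 GB for a 7B model. If memory is tight, restart the
# runtime (or free the training model) before running this cell.
merged_model = AutoPeftModelForCausalLM.from_pretrained(
    cfg.output_dir, torch_dtype=torch.float16, device_map="auto"
)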
merged_model = merged_model.merge_and_unload()
merged_model.save_pretrained(MERGED_DIR, safe_serialization=True)
tokenizer.save_pretrained(MERGED_DIR)
print("Merged checkpoint saved to", MERGED_DIR)

upload_to_hub = False  # flip to True to push
if upload_to_hub:
    repo_id = input("Target HF repo (e.g. username/project-name): ")
    merged_model.push_to_hub(repo_id, private=True)
    tokenizer.push_to_hub(repo_id, private=True)


## Next steps

- Duplicate this notebook per project so you can version-control configs.
- Replace the evaluation prompts with task-specific checklists (safety, compliance, tone).
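- To serve the adapters without merging, load the base model with Transformers and attach them with `PeftModel.from_pretrained`. Serving frameworks such as Text Generation Inference (TGI) or vLLM have their own adapter-loading options, so check their documentation for the supported workflow. For serverless workflows, merge the adapters (cell above) and upload the full checkpoint.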
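
The last bullet mentions `PeftModel.from_pretrained`; a minimal sketch of that loading path follows. The helper name `load_with_adapter` is hypothetical, and the fp16 dtype and `device_map="auto"` are assumptions that suit a single-GPU Colab runtime rather than a production server.

```python
def load_with_adapter(base_model_id: str, adapter_dir: str):
    """Load a base model in fp16 and attach a trained LoRA adapter (sketch)."""
    # Imports are deferred so that defining this helper does not require a GPU runtime.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(
        base_model_id, torch_dtype=torch.float16, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    return model.eval()


# Example (downloads the base weights, so it is left commented out):
# model = load_with_adapter("meta-llama/Llama-2-7b-chat-hf", "/content/qlora-output")
```

Keeping the adapter separate means several fine-tunes can share one copy of the base weights, at the cost of a small overhead per forward pass compared with a merged checkpoint.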
