# QLoRA Training Notebook (Judge Model)

This trains a lightweight **scoring/judge** model using **QLoRA** on your `sft.jsonl` labels.

### What you need to do before running
1. **Colab Runtime → GPU** (T4 is OK; A100 is faster).
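2. **Accept the model license** on Hugging Face if the base model is gated (e.g. `meta-llama/Llama-3.2-1B-Instruct`). The run recorded below loads `TinyLlama/TinyLlama-1.1B-Chat-v1.0`, which is public and needs no license step.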
3. Obtain a **Hugging Face access token** with read access and be ready to paste it when prompted.
4. Upload your data files to Colab:
   - `/content/data/sft.jsonl` (train)
   - `/content/data/val.jsonl` (optional val)

### Outputs
- Trained LoRA adapter at `/content/qlora-judge-ckpt/` (zipped for download)
- (Optional) quick validation generation on a small subset



In [1]:
%%bash
set -e

# Be explicit & quiet
python -V
pip -q install --upgrade pip

# Remove conflicting preinstalls (ok if some aren't present)
pip -q uninstall -y numpy scipy numba opencv-python opencv-contrib-python opencv-python-headless || true

# Pin versions compatible with Colab's preinstalled packages
# (opencv needs numpy>=2,<2.3; numba needs numpy<2.1; gcsfs needs a newer fsspec)
pip -q install "numpy==2.0.2" "scipy==1.14.1" "fsspec==2025.3.0"

# Torch + CUDA 12.1 wheels
pip -q install --upgrade --index-url https://download.pytorch.org/whl/cu121 \
  torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1

# bitsandbytes GPU build + triton providing triton.ops
pip -q install bitsandbytes==0.43.3 triton==2.3.0

# HF training stack
pip -q install transformers==4.45.0 accelerate==0.34.2 peft==0.11.1 trl==0.9.6 datasets==2.20.0

# Sanity print
python - << 'PY'
import numpy, scipy, torch, bitsandbytes as bnb, triton, glob
print("NumPy", numpy.__version__, "SciPy", scipy.__version__)
print("Torch", torch.__version__, "CUDA", torch.version.cuda)
print("bnb file:", bnb.__file__)
print("bnb libs:", glob.glob("/usr/local/lib/python*/dist-packages/bitsandbytes/libbitsandbytes_cuda*.so"))
PY

Python 3.12.12
NumPy 1.26.4 SciPy 1.14.1
Torch 2.3.1+cu121 CUDA 12.1
bnb file: /usr/local/lib/python3.12/dist-packages/bitsandbytes/__init__.py
bnb libs: ['/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda125_nocublaslt.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda124_nocublaslt.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda120.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda125.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda120_nocublaslt.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda121.so', '/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbit

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albucore 0.0.24 requires opencv-python-headless>=4.9.0.80, which is not installed.
cuml-cu12 25.6.0 requires numba<0.62.0a0,>=0.59.1, which is not installed.
cudf-cu12 25.6.0 requires numba<0.62.0a0,>=0.59.1, which is not installed.
albumentations 2.0.8 requires opencv-python-headless>=4.9.0.80, which is not installed.
dask-cuda 25.6.0 requires numba<0.62.0a0,>=0.59.1, which is not installed.
librosa 0.11.0 requires numba>=0.51.0, which is not installed.
stumpy 1.13.0 requires numba>=0.57.1, which is not installed.
umap-learn 0.5.9.post2 requires numba>=0.51.2, which is not installed.
pynndescent 0.5.13 requires numba>=0.51.2, which is not installed.
distributed-ucxx-cu12 0.44.0 requires numba<0.62.0a0,>=0.59.1, which is not installed.
shap 0.48.0 requires numba>=0.54, which is not installed.
dopamine-rl 4.1.2 req

In [2]:
!pip -q install --upgrade "transformers>=4.45.0" "accelerate>=0.33.0"

In [None]:
import os, json, random, math, gc
import torch
from datasets import Dataset
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
major_cc = torch.cuda.get_device_capability(0)[0] if torch.cuda.is_available() else 0
DTYPE = torch.bfloat16 if (DEVICE=='cuda' and major_cc>=8) else torch.float16  # bf16 on A100/L4; fp16 on T4
print('Device:', DEVICE, ' dtype:', DTYPE)

# Choose a small instruction-tuned base. Accept license on HF first.
MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # you can switch to Qwen/Qwen2.5-1.5B-Instruct if preferred
DATA_TRAIN = "/content/data/sft.jsonl"
DATA_VAL   = "/content/data/val.jsonl"  # optional
OUTPUT_DIR = "/content/qlora-judge-ckpt"
os.makedirs('/content/data', exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

print('If you have not accepted the model license on HF, please do that now before logging in.')
login()  # prompts for the token; avoid hard-coding it in the notebook

Device: cuda  dtype: torch.float16
If you have not accepted the model license on HF, please do that now before logging in.


## Load dataset (SFT JSONL)
Each line of the JSONL file is expected to be one JSON object:
```json
{"instruction": "...", "input": "...", "output": "..."}
```
The model learns to map `(instruction + input) → output`.
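
Before training, it can help to confirm that every line matches this schema. The sketch below is one way to do that; the helper name `validate_rows` and the rule that all three fields must be strings are assumptions for illustration, not part of the original notebook.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")  # fields named in the schema above

def validate_rows(lines):
    """Return (line_number, problem) pairs for lines that do not match the schema.

    Hypothetical helper: `lines` is any iterable of raw JSONL strings.
    """
    problems = []
    for n, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # blank lines are ignored
        try:
            row = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((n, "not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            if not isinstance(row.get(key), str):
                problems.append((n, f"missing or non-string field: {key}"))
    return problems

sample = [
    '{"instruction": "Score it", "input": "text", "output": "{}"}',
    '{"instruction": "Score it", "output": "{}"}',
    'not json',
]
for line_no, problem in validate_rows(sample):
    print(line_no, problem)
```

In Colab the same function could be fed `open(DATA_TRAIN, encoding="utf-8")` directly, since a file object iterates line by line.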


In [2]:
def load_jsonl(path):
    rows = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rows.append(json.loads(line))
    return rows

train_rows = load_jsonl(DATA_TRAIN)
val_rows = load_jsonl(DATA_VAL) if os.path.exists(DATA_VAL) else []
print('Train examples:', len(train_rows), ' Val examples:', len(val_rows))

def format_example(o):
    intro = "You are a scoring model. Output STRICT JSON only — no extra text."
    return {
        'text': intro + "\n\n### Instruction:\n" + o['instruction'] + "\n\n### Input:\n" + o['input'] + "\n\n### Response:\n" + o['output']
    }

train_ds = Dataset.from_list([format_example(x) for x in train_rows])
val_ds = Dataset.from_list([format_example(x) for x in val_rows]) if val_rows else None
train_ds[0]['text'][:500]

Train examples: 1407  Val examples: 156


'You are a scoring model. Output STRICT JSON only — no extra text.\n\n### Instruction:\nGiven a business JSON and the executive summary text, produce STRICT JSON with fields:\n{\n  "sections": { <section>: { "answer_relevancy":0..1, "hallucination":0..1, "summarization":0..1, "toxicity":0..1, "bias":0..1 }, ... },\n  "overall":  { "answer_relevancy":0..1, "hallucination":0..1, "summarization":0..1, "toxicity":0..1, "bias":0..1 }\n}\nOnly include sections that are present in the business JSON (e.g., no ma'
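
At inference time the prompt has to follow the same template as `format_example`, but stop right after `### Response:` so that the model writes the JSON itself. A minimal sketch, assuming hypothetical helpers `build_prompt` and `split_completion` (neither exists in the notebook); the `intro` value used here is a shortened stand-in, and in practice it should be the exact system line used during training.

```python
def build_prompt(intro, instruction, input_text):
    """Mirror the training template, ending where the model should start generating."""
    return (
        intro
        + "\n\n### Instruction:\n" + instruction
        + "\n\n### Input:\n" + input_text
        + "\n\n### Response:\n"
    )

def split_completion(generated, prompt):
    """Strip the echoed prompt from decoded text, if the text starts with it."""
    return generated[len(prompt):] if generated.startswith(prompt) else generated

intro = "You are a scoring model."  # stand-in; reuse the full training string in practice
prompt = build_prompt(intro, "Score the summary.", '{"business": "..."}')
print(prompt)

# A decoded generation usually echoes the prompt followed by the completion.
decoded = prompt + '{"overall": {}}'
print(split_completion(decoded, prompt))
```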

## Load tokenizer & 4-bit base (BitsAndBytes nf4)

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # overrides the MODEL_ID set earlier; swap in a gated model once access is granted
DTYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=DTYPE,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=bnb_config,
    torch_dtype=DTYPE,
    attn_implementation="sdpa",   # safe default
    trust_remote_code=True,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



`torch_dtype` is deprecated! Use `dtype` instead!
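
As a rough guide to why 4-bit loading matters on a T4, the weight storage can be estimated from the parameter count and the bit width. This is a back-of-the-envelope sketch only: the function name is hypothetical, the parameter count is rounded, and the estimate ignores quantization constants, any layers kept in higher precision, activations, and optimizer state.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate storage for n_params weights at the given bit width, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 1.1e9  # nominal size of a 1.1B-parameter model (rounded)

for label, bits in [("fp16/bf16", 16), ("nf4", 4)]:
    print(f"{label:>9}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
# fp16/bf16: 2.05 GiB
#       nf4: 0.51 GiB
```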



## PEFT LoRA config

In [4]:
# LoRA is configured and attached in the next cell, after
# prepare_model_for_kbit_training() has been applied to the quantized base.

## Prepare the model, attach LoRA, and train with Trainer

In [5]:
from peft import prepare_model_for_kbit_training

# must be False when using gradient checkpointing
model.config.use_cache = False

# enable grad flow with ckpt
model.enable_input_require_grads()

# cast norms, set grad ckpt hooks, etc.
model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=True
)

from peft import LoraConfig, get_peft_model

target_modules = [
    "q_proj","k_proj","v_proj","o_proj",   # attention
    "gate_proj","up_proj","down_proj"      # mlp
]

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=target_modules,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # <-- should show non-zero trainable params

trainable params: 12,615,680 || all params: 1,112,664,064 || trainable%: 1.1338
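
The trainable-parameter count can be reproduced by hand: each adapted linear layer with `in_f` inputs and `out_f` outputs gains two low-rank matrices with `r * (in_f + out_f)` parameters in total. The sketch below assumes the TinyLlama-1.1B dimensions from its published model config (hidden size 2048, MLP size 5632, key/value projection width 256, 22 layers); these values are not stated anywhere in this notebook.

```python
def lora_param_count(shapes, r, n_layers):
    """Each adapted Linear(in_f, out_f) gains A (r x in_f) and B (out_f x r)."""
    per_layer = sum(r * (in_f + out_f) for in_f, out_f in shapes.values())
    return per_layer * n_layers

# Assumed TinyLlama-1.1B dimensions (from the model config, not from this notebook)
HIDDEN, INTERMEDIATE, KV_DIM, N_LAYERS = 2048, 5632, 256, 22

shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

trainable = lora_param_count(shapes, r=16, n_layers=N_LAYERS)
total = 1_112_664_064  # "all params" as printed by print_trainable_parameters()
print(f"{trainable:,} trainable ({100 * trainable / total:.4f}%)")
# -> 12,615,680 trainable (1.1338%)
```

The result matches the printed figure, which suggests the adapter was attached to all seven projection types in every layer.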



In [22]:
import torch
from datasets import Dataset
from transformers import DefaultDataCollator, TrainingArguments, Trainer

MAX_SEQ_LEN = 2048
EPOCHS      = 1
GRAD_ACCUM  = 12
OVERLAP     = 0           # tokens shared between consecutive chunks; try 128 if the run stays under 3 h

BATCH  = 1
LR     = 1e-4
DTYPE  = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
torch.backends.cuda.matmul.allow_tf32 = True

# tokenizer safety
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side    = "right"
tokenizer.truncation_side = "right"

# tokenize WITHOUT truncation; we will split docs into chunks
def tok_rows_no_trunc(ds):
    return ds.map(
        lambda b: tokenizer(
            b["text"],
            truncation=False,
            padding=False,
            return_attention_mask=False
        ),
        batched=True,
        remove_columns=ds.column_names,
        desc="Tokenizing (no truncation)"
    )

train_tok = tok_rows_no_trunc(train_ds)
val_tok   = tok_rows_no_trunc(val_ds) if val_ds is not None else None   # val.jsonl is optional

# slice each doc into fixed-size blocks (optional overlap)
def chunk_per_example(tok_ds, seq_len, eos_id, min_tail=16, overlap=0):
    rows = []
    step = max(1, seq_len - overlap)
    for ex in tok_ds:
        ids = ex.get("input_ids", [])
        if not ids:
            continue
        ids = ids + [eos_id]     # boundary marker
        n = len(ids)
        i = 0
        while i < n:
            block = ids[i:i+seq_len]
            if len(block) < min_tail:   # drop very short tails
                break
            rows.append({
                "input_ids": block,
                "labels": block,        # causal LM: labels mirror inputs (the model shifts them)
                "attention_mask": [1]*len(block)
            })
            i += step
    return Dataset.from_list(rows)

train_packed = chunk_per_example(train_tok, MAX_SEQ_LEN, tokenizer.eos_token_id, overlap=OVERLAP)
val_packed   = (chunk_per_example(val_tok, MAX_SEQ_LEN, tokenizer.eos_token_id, overlap=OVERLAP)
                if val_tok is not None else None)
print("train_packed rows:", len(train_packed),
      "val_packed rows:", len(val_packed) if val_packed is not None else 0)

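# pad variable-length blocks within a batch; labels are padded with -100 so the loss ignores padding
from transformers import DataCollatorForSeq2Seq
collator = DataCollatorForSeq2Seq(tokenizer, padding=True, label_pad_token_id=-100)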

args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH,
    gradient_accumulation_steps=GRAD_ACCUM,
    learning_rate=LR,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=50,
    save_total_limit=1,
    bf16=(DTYPE==torch.bfloat16),
    fp16=(DTYPE==torch.float16),
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,      # ON so 2048 fits on T4
    remove_unused_columns=True,
    report_to="none",
    group_by_length=True,
    dataloader_num_workers=2,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_packed,
    eval_dataset=val_packed,          # used only if eval_strategy is set or trainer.evaluate() is called
)

# quick batch sanity
batch = collator([train_packed[i] for i in range(min(2, len(train_packed)))])
for k,v in batch.items():
    print(k, type(v), getattr(v, "shape", None))

trainer.train()
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print("✅ Saved to:", OUTPUT_DIR)

train_packed rows: 1408 val_packed rows: 156
input_ids <class 'torch.Tensor'> torch.Size([2, 1024])
labels <class 'torch.Tensor'> torch.Size([2, 1024])
attention_mask <class 'torch.Tensor'> torch.Size([2, 1024])


Step,Training Loss
50,0.7973
100,0.6534
150,0.6475


✅ Saved to: /content/drive/MyDrive/qlora-judge
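
The chunking step in the training cell can be illustrated on a toy token list. The sketch below re-implements the same slicing logic for a single document in plain Python; the function name `chunk_ids`, the toy ids, and `min_tail=2` (the notebook uses 16) are chosen only for the demonstration.

```python
def chunk_ids(ids, seq_len, eos_id, min_tail=2, overlap=0):
    """Split one token list into blocks of at most seq_len tokens.

    A trailing block shorter than min_tail is dropped.
    """
    ids = ids + [eos_id]              # boundary marker, as in the training cell
    step = max(1, seq_len - overlap)  # how far the window advances each time
    blocks = []
    i = 0
    while i < len(ids):
        block = ids[i:i + seq_len]
        if len(block) < min_tail:
            break
        blocks.append(block)
        i += step
    return blocks

tokens = list(range(1, 10))  # nine toy token ids
EOS = 0

print(chunk_ids(tokens, seq_len=4, eos_id=EOS))
# -> [[1, 2, 3, 4], [5, 6, 7, 8], [9, 0]]
print(chunk_ids(tokens, seq_len=4, eos_id=EOS, overlap=2))
# -> [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8], [7, 8, 9, 0], [9, 0]]
```

With a non-zero overlap the final short block can be fully contained in the previous one, as `[9, 0]` is here, so overlap adds some duplicated training tokens.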


## Zip for download

In [23]:
!cd /content && zip -r qlora-judge-ckpt.zip qlora-judge-ckpt
!ls -lah /content/*.zip

  adding: qlora-judge-ckpt/ (stored 0%)
-rw-r--r-- 1 root root 184 Oct 16 07:17 /content/qlora-judge-ckpt.zip


### Notes & Precautions
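- Keep the tab active. Save checkpoints to Drive if the run is long.
- If you hit OOM: reduce `MAX_SEQ_LEN` or `BATCH`, and raise `GRAD_ACCUM` to keep the effective batch size (`BATCH × GRAD_ACCUM`) the same.
- Check the archive size after zipping. In the run above the zip is 184 bytes and contains only the directory entry, while the training log reports saving to `/content/drive/MyDrive/qlora-judge`; make sure `OUTPUT_DIR` matches the directory being zipped.
- The `output` field of every training row must be **STRICT** JSON; add a few format-enforcing samples to the training data if the model drifts from the format.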
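
One way to enforce the strict-JSON requirement, both on training labels and on generated text, is to parse the string and check its shape. The sketch below is an assumption-laden illustration: the function name `parse_strict_scores` is hypothetical, the five metric names and the `sections`/`overall` layout are taken from the instruction text shown in the dataset preview, and `"market"` is a made-up section name.

```python
import json

METRICS = ("answer_relevancy", "hallucination", "summarization", "toxicity", "bias")

def parse_strict_scores(text):
    """Return the parsed dict if `text` is one JSON object in the expected shape, else None."""
    try:
        obj = json.loads(text)  # raises on any extra text before or after the JSON
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or set(obj) != {"sections", "overall"}:
        return None
    if not isinstance(obj["sections"], dict):
        return None
    blocks = list(obj["sections"].values()) + [obj["overall"]]
    for block in blocks:
        if not isinstance(block, dict) or set(block) != set(METRICS):
            return None
        for value in block.values():
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return None
            if not 0 <= value <= 1:
                return None
    return obj

scores = {m: 0.5 for m in METRICS}
good = json.dumps({"sections": {"market": scores}, "overall": scores})
bad = "Here is the JSON: " + good

print(parse_strict_scores(good) is not None)  # True
print(parse_strict_scores(bad) is not None)   # False
```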
