# LoRA/QLoRA Fine-Tuning – DistilBERT on GPU (FP16 LoRA, 4-bit QLoRA)

This notebook benchmarks the **fine-tuning performance** of a [DistilBERT](https://huggingface.co/distilbert-base-uncased) model on the SST-2 sentiment classification task using a T4 GPU in Google Colab.

We fine-tune and compare two adapter-based approaches using Hugging Face Transformers, [PEFT](https://github.com/huggingface/peft), and the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) quantization library:

- **FP16 LoRA fine-tuning** (LoRA adapters on an unquantized FP16 base)  
- **4-bit QLoRA fine-tuning** (4-bit base weights + LoRA adapters)

Each version is **trained on the SST-2 train split** and then **evaluated** on a subset of 100 samples from the validation set for apples-to-apples inference comparison.

### Reported metrics:
- Accuracy
- Average latency per sample (in milliseconds)
- GPU memory usage (total VRAM after model load, and increase during inference)
- System RAM usage during inference
- Training cost (wall time in minutes) and peak VRAM during training
- Trainable parameters (in millions)

All experiments are executed in a GPU-only setup without CPU fallback.

**References:**  
[1] SST-2 validation set from the [GLUE benchmark](https://huggingface.co/datasets/glue/viewer/sst2)  
[2] Dettmers, T. et al., *QLoRA: Efficient Finetuning of Quantized LLMs* (2023)  
[3] PEFT: *Parameter-Efficient Fine-Tuning* (Hugging Face)

In [1]:
!pip install -q transformers datasets evaluate bitsandbytes accelerate peft psutil pynvml

import torch
from torch.utils.data import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    BitsAndBytesConfig,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import time
import numpy as np
import pynvml
import psutil
import os

## 1. Load Pretrained DistilBERT and SST-2 Dataset

We initialize from the **base DistilBERT** checkpoint (not already fine-tuned).  
- Use the **SST-2 train split** for fine-tuning.  
- Use the **first 100 validation samples** for benchmarking inference to match prior notebooks.  
- Tokenize with max length 128 for consistency.

In [2]:
base_model_id = "distilbert-base-uncased"  # base model for fine-tuning (not pre-finetuned)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Datasets
raw_train = load_dataset("glue", "sst2", split="train")
raw_val_100 = load_dataset("glue", "sst2", split="validation[:100]")

# Tokenization
def tok(batch):
    return tokenizer(batch["sentence"], padding="max_length", truncation=True, max_length=128)

tok_train = raw_train.map(tok, batched=True).remove_columns(["sentence", "idx"])
tok_val_100 = raw_val_100.map(tok, batched=True).remove_columns(["sentence", "idx"])

# HF Trainer expects 'labels'
tok_train = tok_train.rename_column("label", "labels")
tok_val_100 = tok_val_100.rename_column("label", "labels")

print("✅ Train examples:", len(tok_train))
print("✅ Val (benchmark) examples:", len(tok_val_100))

✅ Train examples: 67349
✅ Val (benchmark) examples: 100


## 2. Convert to PyTorch-compatible dataset (for GPU inference benchmarking)

We define a small dataset wrapper (as in the previous notebooks) that:
- Returns CUDA tensors for `input_ids` and `attention_mask`
- Returns Python `int` for `label` (for accuracy calculation)

In [3]:
class SST2DatasetCUDA(Dataset):
    def __init__(self, hf_dataset):
        self.input_ids = [torch.tensor(x) for x in hf_dataset["input_ids"]]
        self.attention_mask = [torch.tensor(x) for x in hf_dataset["attention_mask"]]
        self.labels = [int(x) for x in hf_dataset["labels"]]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx].to("cuda"),
            "attention_mask": self.attention_mask[idx].to("cuda"),
            "label": self.labels[idx],
        }

bench_dataset = SST2DatasetCUDA(tok_val_100)
print("✅ Benchmark dataset ready with", len(bench_dataset), "samples")

✅ Benchmark dataset ready with 100 samples


## 3. Define evaluation function

This function evaluates a model on the 100-sample SST-2 benchmark using the GPU.

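It measures:
- Accuracy
- Average latency per sample (returned in seconds; the reporting cells convert it to milliseconds)
- GPU VRAM in use after model load (device-wide figure from NVML, in MB; it includes the CUDA context and any other allocations on the GPU)
- GPU VRAM increase during inference (in MB)
- System RAM usage increase during inference (in MB)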
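
The latency figure is a plain mean over all samples, so a slow first call (for example, one-time CUDA initialization) can dominate it. The sketch below shows one way the per-sample timings could be summarized with the warm-up calls excluded. The helper name `summarize_latencies`, the `warmup` parameter, and the synthetic timings are illustrative assumptions, not part of the notebook.

```python
import statistics

def summarize_latencies(latencies_s, warmup=5):
    """Summarize per-sample latencies given in seconds; results are in ms.

    The first `warmup` samples are dropped (hypothetical default of 5).
    If that would leave nothing, all samples are used instead.
    """
    if warmup < 0:
        raise ValueError("warmup must be non-negative")
    steady = latencies_s[warmup:] or latencies_s
    ms = sorted(t * 1000.0 for t in steady)
    return {
        "n": len(ms),
        "mean_ms": statistics.fmean(ms),
        "median_ms": statistics.median(ms),
        "max_ms": ms[-1],
    }

# Synthetic timings: one slow first call followed by steady-state calls.
example = [0.250] + [0.010] * 9
print(summarize_latencies(example, warmup=0))
print(summarize_latencies(example, warmup=1))
```

On this synthetic input the single slow call pulls the mean well above the median; dropping it brings the two back together.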

In [4]:
def evaluate_model(model, dataset):
    model.eval()
    process = psutil.Process(os.getpid())

    # Initialize GPU memory tracker
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Total VRAM used after model is loaded
    model_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).used / (1024 ** 2)  # MB

    # RAM & VRAM before inference
    start_ram = process.memory_info().rss
    start_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).used

    correct = 0
    latencies = []

    with torch.no_grad():
        for sample in dataset:
            inputs = {
                "input_ids": sample["input_ids"].unsqueeze(0),
                "attention_mask": sample["attention_mask"].unsqueeze(0),
            }
            label = sample["label"]

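            # CUDA kernels launch asynchronously: synchronize before reading the
            # clock, otherwise only the launch overhead is timed.
            torch.cuda.synchronize()
            t0 = time.perf_counter()
            outputs = model(**inputs)
            torch.cuda.synchronize()
            t1 = time.perf_counter()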

            pred = torch.argmax(outputs.logits, dim=1).item()
            correct += (pred == label)
            latencies.append(t1 - t0)

    # Memory after inference
    end_ram = process.memory_info().rss
    end_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).used
    pynvml.nvmlShutdown()

    # Metrics
    delta_ram = (end_ram - start_ram) / (1024 ** 2)      # MB
    delta_vram = (end_vram - start_vram) / (1024 ** 2)   # MB
    avg_latency = float(np.mean(latencies))
    accuracy = correct / len(dataset)

    return accuracy, avg_latency, delta_ram, delta_vram, model_vram

## 4. Prepare FP16 LoRA model (adapters on an unquantized FP16 base)

We set up a DistilBERT classifier with **LoRA adapters** on a base model cast to FP16 (no quantization).
- Base: `distilbert-base-uncased`
- LoRA config: `r=8`, `alpha=16`, `dropout=0.05`, `bias="none"`
- Target modules (DistilBERT attention): `["q_lin","k_lin","v_lin","out_lin"]`

This step **builds** the LoRA-enabled model and reports the **number of trainable parameters (M)** before training.
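
For reference, the standard LoRA formulation for each targeted projection is shown below. The notation is ours rather than the notebook's, and it assumes PEFT's default scaling of `lora_alpha / r`:

$$
h = W_0 x + \frac{\alpha}{r}\, B A x, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

Here $W_0$ is the frozen pretrained weight and only $A$ and $B$ are trained. With `r=8` and `alpha=16`, the scale factor is $\alpha / r = 2$.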

In [5]:
# Load base model (FP16 on GPU) for LoRA
model_lora = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
).to("cuda")
model_lora.half()  # FP16 for faster training on T4

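# Configure LoRA (attention projections).
# No task_type is set, so PEFT freezes every base parameter, including the newly
# initialized pre_classifier/classifier head; only the LoRA matrices are trained,
# which matches the 0.295M trainable-parameter count reported below.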
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_lin","k_lin","v_lin","out_lin"],
)

# Wrap with PEFT
model_lora = get_peft_model(model_lora, lora_cfg)

# Count trainable parameters (in millions)
trainable = sum(p.numel() for p in model_lora.parameters() if p.requires_grad) / 1e6
total = sum(p.numel() for p in model_lora.parameters()) / 1e6
print(f"✅ LoRA model prepared. Trainable params: {trainable:.3f}M / {total:.3f}M total")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ LoRA model prepared. Trainable params: 0.295M / 67.250M total
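
The trainable-parameter count can be checked by hand. The sketch below assumes the DistilBERT-base dimensions (6 transformer blocks, hidden size 768) and that each targeted module is a 768→768 linear layer; `lora_param_count` is a hypothetical helper, not a PEFT function.

```python
def lora_param_count(n_layers, modules_per_layer, d_in, d_out, r):
    """Count LoRA parameters: each adapted Linear gains A (r x d_in) and B (d_out x r)."""
    per_module = r * (d_in + d_out)
    return n_layers * modules_per_layer * per_module

# q_lin, k_lin, v_lin, out_lin in each of the 6 blocks, rank 8.
count = lora_param_count(n_layers=6, modules_per_layer=4, d_in=768, d_out=768, r=8)
print(count, f"{count / 1e6:.3f}M")
```

The result, 294,912, rounds to the 0.295M printed above, so the trainable parameters are the LoRA matrices alone.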


## 5. Train FP16 LoRA on SST-2 (T4)

We fine-tune the **LoRA adapters** (base model stays frozen, FP16) on the SST-2 train split.

**Settings:**
- `r=8, alpha=16, dropout=0.05` (from previous step)
- Batch size: 16
- Epochs: 2
- Learning rate: 2e-4
- `fp16=True` on T4
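- The 100-sample benchmark split is passed as `eval_dataset`, but no evaluation strategy is set, so the `Trainer` does not evaluate during training; the split is evaluated once after training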

**We record:**
- **Wall time (minutes)** for training
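- **Peak VRAM (MB)** during training (via `torch.cuda.max_memory_allocated`, which counts only tensors allocated by PyTorch and is not directly comparable with the device-wide NVML figure)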
- Post-training **inference metrics** on the 100-sample benchmark (accuracy, latency, RAM/VRAM deltas, total VRAM)
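
To put the wall-time figure in context, the number of optimizer steps implied by these settings can be worked out directly. This sketch assumes a single GPU, no gradient accumulation, and that the last partial batch is kept; `training_steps` is a hypothetical helper.

```python
import math

def training_steps(n_examples, batch_size, epochs):
    """Return (steps per epoch, total steps), keeping the last partial batch."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

per_epoch, total = training_steps(n_examples=67349, batch_size=16, epochs=2)
log_events = total // 50  # logging_steps=50
print(per_epoch, total, log_events)
```

Under these assumptions the run takes 4,210 steps per epoch and 8,420 steps in total, with a loss value logged 168 times.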

In [None]:
# Data collator for padding
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Accuracy metric
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    acc = (preds == labels).mean()
    return {"accuracy": float(acc)}

# Training arguments (no eval_strategy for compatibility)
train_args = TrainingArguments(
    output_dir="n7_lora_dbert_sst2",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-4,
    num_train_epochs=2,
    logging_steps=50,
    save_strategy="no",
    report_to="none",
    fp16=True,
)

trainer = Trainer(
    model=model_lora,
    args=train_args,
    train_dataset=tok_train,
    eval_dataset=tok_val_100,
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
)

# Measure training wall time + peak VRAM
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
t0 = time.time()
train_out = trainer.train()
t1 = time.time()

peak_vram_mb = (torch.cuda.max_memory_allocated() / (1024**2)) if torch.cuda.is_available() else 0.0
wall_minutes = (t1 - t0) / 60.0

print(f"✅ LoRA training done. Wall time: {wall_minutes:.2f} min | Peak VRAM: {peak_vram_mb:.2f} MB")

# Post-training inference evaluation on the 100-sample benchmark
acc_lora, lat_lora, ram_lora, vram_delta_lora, vram_total_lora = evaluate_model(model_lora, bench_dataset)

print(f"\n🔎 Post-training Inference (FP16 LoRA adapters):")
print(f"Accuracy: {acc_lora:.2%}")
print(f"Latency per sample: {lat_lora*1000:.2f} ms")
print(f"System RAM Δ: {ram_lora:.2f} MB")
print(f"GPU VRAM Δ (inference): {vram_delta_lora:.2f} MB")
print(f"Total VRAM after model load: {vram_total_lora:.2f} MB")


## 5.1 Sanity check – Evaluate FP16 LoRA on the full SST-2 validation set (872)

**Why:** Accuracy on the 100-sample slice is noisy, so we also compute it on the full validation set.  
**What to expect:** After 2 epochs with LoRA (r=8, α=16), we expect accuracy of roughly **90–92%**.  
If accuracy is below 90%, plan to re-train with **3 epochs**, **lr=1e-4**, **warmup_ratio=0.06**, and **weight_decay=0.01** (same batch size).
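
How noisy the 100-sample slice is can be estimated with a normal-approximation confidence interval for a proportion. This is a rough sketch: the 90% accuracy is an illustrative input and `accuracy_margin` is a hypothetical helper.

```python
import math

def accuracy_margin(acc, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for accuracy `acc` on `n` samples."""
    return z * math.sqrt(acc * (1.0 - acc) / n)

for n in (100, 872):
    print(f"n={n}: +/- {accuracy_margin(0.90, n):.1%}")
```

At 90% accuracy the margin is about ±5.9 points on 100 samples and about ±2.0 points on 872, so differences of a few points on the small slice are not meaningful.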

In [8]:
# Load full validation set and evaluate the trained LoRA model
raw_val_full = load_dataset("glue", "sst2", split="validation")

# tok() is reused from the tokenization cell (In [2]).

tok_val_full = raw_val_full.map(tok, batched=True).remove_columns(["sentence", "idx"])
tok_val_full = tok_val_full.rename_column("label", "labels")

# SST2DatasetCUDA is reused from the dataset cell (In [3]).

full_val_ds = SST2DatasetCUDA(tok_val_full)
acc_full, lat_full, ram_full, vram_delta_full, vram_total_full = evaluate_model(model_lora, full_val_ds)

print(f"✅ Full-val Accuracy (FP16 LoRA): {acc_full:.2%}")
print(f"🕒 Latency per sample (full-val): {lat_full*1000:.2f} ms")
print(f"💾 System RAM Δ: {ram_full:.2f} MB | 🟪 GPU VRAM Δ: {vram_delta_full:.2f} MB | 🟪 Total VRAM: {vram_total_full:.2f} MB")

✅ Full-val Accuracy (FP16 LoRA): 90.48%
🕒 Latency per sample (full-val): 10.52 ms
💾 System RAM Δ: 0.00 MB | 🟪 GPU VRAM Δ: 0.00 MB | 🟪 Total VRAM: 1021.88 MB


## 6. Prepare 4-bit QLoRA model (4-bit base + LoRA adapters)

We set up DistilBERT with a **4-bit quantized base** using bitsandbytes and add LoRA adapters on top:

- Quantization: `load_in_4bit=True`, `nf4` quant type, **double quantization** enabled  
- Compute dtype: `float16` (T4-friendly)  
- LoRA: `r=8, alpha=16, dropout=0.05, bias="none"`  
- Target modules: `["q_lin","k_lin","v_lin","out_lin"]`

This step builds the QLoRA-enabled model and reports the **trainable parameters (M)**.

In [9]:
# Clear cache before loading a new model
torch.cuda.empty_cache()

bnb_4bit_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Load 4-bit base classifier on GPU
model_qlora = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    quantization_config=bnb_4bit_cfg,
    device_map={"": 0},
)

# Prepare for k-bit training and add LoRA adapters
model_qlora = prepare_model_for_kbit_training(model_qlora)

lora_cfg_4bit = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_lin","k_lin","v_lin","out_lin"],
)
model_qlora = get_peft_model(model_qlora, lora_cfg_4bit)

# Parameter report
trainable_q = sum(p.numel() for p in model_qlora.parameters() if p.requires_grad) / 1e6
total_q = sum(p.numel() for p in model_qlora.parameters()) / 1e6
print(f"✅ QLoRA model prepared. Trainable params: {trainable_q:.3f}M / {total_q:.3f}M total")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ QLoRA model prepared. Trainable params: 0.295M / 46.016M total
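
The drop in the reported total from 67.250M to 46.016M does not mean parameters were removed. It is consistent with bitsandbytes packing two 4-bit values into each stored element, so `numel()` counts each quantized weight matrix at half its logical size. The sketch below checks that arithmetic. It assumes the DistilBERT-base dimensions and that only the linear layers inside the six transformer blocks are quantized, with the classification head left unquantized.

```python
HIDDEN, FFN, LAYERS = 768, 3072, 6

# Logical weight elements in the Linear layers of one transformer block.
attn_weights = 4 * HIDDEN * HIDDEN   # q_lin, k_lin, v_lin, out_lin
ffn_weights = 2 * HIDDEN * FFN       # lin1, lin2
quantized_weights = LAYERS * (attn_weights + ffn_weights)

# Reported total for the FP16 LoRA model (rounded, taken from the earlier output).
reported_total_m = 67.250

# Packed storage holds two 4-bit values per element, halving the counted size.
packed_total_m = reported_total_m - (quantized_weights / 2) / 1e6
print(f"{quantized_weights / 1e6:.3f}M weights quantized -> {packed_total_m:.3f}M counted")
```

The estimate matches the 46.016M printed above, which supports the assumption that the head and the LoRA adapters are not quantized.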


## 7. Train QLoRA on SST-2 (T4) + Benchmark on 100-sample split

We fine-tune **LoRA adapters** on top of a **4-bit base** (QLoRA).  
**Settings:** epochs=2, batch size=16, lr=2e-4, fp16=True.  
We record **wall time (min)**, **peak VRAM (MB)** during training, then run our 100-sample benchmark (accuracy, latency, RAM/VRAM deltas, total VRAM).

In [13]:
from transformers import DataCollatorWithPadding, TrainingArguments, Trainer

collator = DataCollatorWithPadding(tokenizer=tokenizer)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {"accuracy": float((preds == labels).mean())}

qlora_args = TrainingArguments(
    output_dir="n7_qlora_dbert_sst2",
    per_device_train_batch_size=16,   # lower to 8 if you OOM
    per_device_eval_batch_size=32,
    learning_rate=2e-4,
    num_train_epochs=2,
    logging_steps=50,
    save_strategy="no",
    report_to="none",
    fp16=True,
)

qlora_trainer = Trainer(
    model=model_qlora,
    args=qlora_args,
    train_dataset=tok_train,
    eval_dataset=tok_val_100,
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
)

# Train with timers and VRAM tracking
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
t0 = time.time()
_ = qlora_trainer.train()
t1 = time.time()

peak_vram_mb_qlora = (torch.cuda.max_memory_allocated() / (1024**2)) if torch.cuda.is_available() else 0.0
wall_minutes_qlora = (t1 - t0) / 60.0
print(f"✅ QLoRA training done. Wall time: {wall_minutes_qlora:.2f} min | Peak VRAM: {peak_vram_mb_qlora:.2f} MB")

# Post-training inference evaluation on the 100-sample benchmark
acc_qlora, lat_qlora, ram_qlora, vram_delta_qlora, vram_total_qlora = evaluate_model(model_qlora, bench_dataset)

print(f"\n🔎 Post-training Inference (4-bit QLoRA):")
print(f"Accuracy: {acc_qlora:.2%}")
print(f"Latency per sample: {lat_qlora*1000:.2f} ms")
print(f"System RAM Δ: {ram_qlora:.2f} MB")
print(f"GPU VRAM Δ (inference): {vram_delta_qlora:.2f} MB")
print(f"Total VRAM after model load: {vram_total_qlora:.2f} MB")


AssertionError: 

## Conclusion – DistilBERT + QLoRA

The QLoRA run **did not complete**: training stopped with an `AssertionError`, and the truncated traceback above does not show which assertion failed.  
This notebook attributes the failure to an incompatibility between DistilBERT's reduced architecture and the bitsandbytes 4-bit quantization kernels, possibly mismatched layer shapes during quantization state recovery, but that cause was not verified here.

- ✅ FP16 LoRA training completed and reached 90.48% accuracy on the full validation set.  
- ❌ 4-bit QLoRA raised an `AssertionError`, so no QLoRA metrics are reported for DistilBERT.  

➡️ For working QLoRA results, see **n8_bert_qlora_gpu_t4.ipynb** (BERT-base).