# BERT-base on SST-2: FP32 vs QLoRA (4-bit)

This notebook trains and evaluates the **BERT-base-uncased** model (~110M params) on the **SST-2** sentiment classification task using a **T4 GPU** in Google Colab.

We compare two regimes using Hugging Face Transformers, **bitsandbytes**, and **PEFT/LoRA**:
- **FP32 baseline:** full-precision fine-tuning  
- **QLoRA (4-bit):** 4-bit quantized backbone with trainable low-rank adapters (LoRA)

To stay within Colab's GPU time limits, we use a time-capped setup by default:
- Train subset: first **20,000** training examples
- **2 epochs**
- Max sequence length = 128

### Reported metrics (for both models)
- Trainable vs total parameters  
- Training wall time (min) & peak VRAM (MB)  
- **Full validation** (872 samples): accuracy, latency per sample (ms)  
- VRAM after model load (MB) and peak VRAM during evaluation (MB)

This notebook evaluates whether **QLoRA**, designed for **very large LLMs**, offers practical efficiency benefits at the **BERT** scale.

**References:**  
[1] Devlin, J. et al. (2019). *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.* https://arxiv.org/abs/1810.04805  
[2] Hu, E. et al. (2021). *LoRA: Low-Rank Adaptation of Large Language Models.* https://arxiv.org/abs/2106.09685  
[3] Dettmers, T. et al. (2023). *QLoRA: Efficient Finetuning of Quantized LLMs.* https://arxiv.org/abs/2305.14314  
[4] SST-2 (GLUE) dataset viewer: https://huggingface.co/datasets/glue/viewer/sst2  
[5] bitsandbytes library: https://github.com/TimDettmers/bitsandbytes  
[6] PEFT (Parameter-Efficient Fine-Tuning): https://github.com/huggingface/peft


In [19]:
# Install dependencies
!pip -q install transformers datasets bitsandbytes accelerate peft psutil pynvml

# Imports
import os, time, sys, gc
import numpy as np
import pandas as pd
import psutil
import torch
from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    TaskType,
    PeftModel,
)
import transformers

In [9]:
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
if torch.cuda.is_available():
    print("CUDA:", torch.version.cuda, "| GPU:", torch.cuda.get_device_name(0))

Python: 3.12.11
PyTorch: 2.8.0+cu126
Transformers: 4.55.3
CUDA: 12.6 | GPU: Tesla T4


In [2]:
MODEL_ID   = "bert-base-uncased"
TRAIN_SLICE = "train[:20000]"
EPOCHS      = 2
BATCH_TRAIN = 16
BATCH_EVAL  = 32
LR          = 2e-5
MAX_LEN     = 128

# Reproducibility
SEED = 42
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# Device info
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", DEVICE)
if DEVICE == "cuda":
    print("GPU:", torch.cuda.get_device_name(0))

Device: cuda
GPU: Tesla T4


## 1. Load & Tokenize SST-2

We load:
- **Train split (subset):** `train[:20000]` (configurable via `TRAIN_SLICE`)
- **Validation split (full):** 872 samples

Tokenization uses `bert-base-uncased` with **max length = 128** to match the other notebooks.

In [3]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load SST-2 splits
train_raw = load_dataset("glue", "sst2", split=TRAIN_SLICE)
val_raw   = load_dataset("glue", "sst2", split="validation")

# Tokenization function
def tok_fn(batch):
    return tokenizer(
        batch["sentence"],
        padding="max_length",
        truncation=True,
        max_length=MAX_LEN,
    )

# Apply tokenization
tok_train = train_raw.map(tok_fn, batched=True)
tok_val   = val_raw.map(tok_fn, batched=True)

# Clean columns for Trainer
tok_train = tok_train.remove_columns(["sentence", "idx"]).rename_column("label", "labels")
tok_val   = tok_val.remove_columns(["sentence", "idx"]).rename_column("label", "labels")

print(f"Train examples: {len(tok_train)} | Val examples: {len(tok_val)}")


Train examples: 20000 | Val examples: 872


## 2. FP32 Baseline Fine-Tuning

We fine-tune **BERT-base-uncased** in FP32 on the SST-2 subset.

**Hyperparameters:** batch size **16**, **epochs = 2**, **learning rate = 2e-5**, **max length = 128**.  
We report **trainable vs total parameters**, **training wall time (min)**, and **peak GPU VRAM (MB)**.
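
As a quick sanity check on the schedule these hyperparameters imply, the sketch below counts optimizer steps. It assumes a single GPU and no gradient accumulation (neither is set in this notebook), so one batch corresponds to one optimizer step; the variable names are illustrative.

```python
import math

# Values taken from the configuration cell of this notebook.
TRAIN_EXAMPLES = 20_000
BATCH_TRAIN = 16
EPOCHS = 2

# Assumption: one GPU, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_TRAIN)
total_steps = steps_per_epoch * EPOCHS

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
```

With `logging_steps=100`, this corresponds to roughly 25 logged loss values over the run.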


In [4]:
# Data collator
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Load FP32 model
model_fp32 = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=2
).to(DEVICE)

# Parameter counts
trainable_params = sum(p.numel() for p in model_fp32.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in model_fp32.parameters())
print(f"Trainable params: {trainable_params:,} | All params: {all_params:,} | Trainable%: {100*trainable_params/all_params:.2f}%")

# Training arguments (FP32)
args_fp32 = TrainingArguments(
    output_dir="n8_fp32_baseline",
    per_device_train_batch_size=BATCH_TRAIN,
    per_device_eval_batch_size=BATCH_EVAL,
    learning_rate=LR,
    num_train_epochs=EPOCHS,
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="no",
    report_to="none",
    fp16=False,
    bf16=False,
)

# Simple accuracy metric
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {"accuracy": float((preds == labels).mean())}

# Trainer
trainer_fp32 = Trainer(
    model=model_fp32,
    args=args_fp32,
    train_dataset=tok_train,
    eval_dataset=tok_val,
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument
    data_collator=collator,
    compute_metrics=compute_metrics,
)

# Train and measure time/VRAM
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
t0 = time.time()
_ = trainer_fp32.train()
t1 = time.time()

peak_vram_fp32 = (torch.cuda.max_memory_allocated() / (1024**2)) if torch.cuda.is_available() else 0.0
wall_minutes_fp32 = (t1 - t0) / 60.0

# Save the trained FP32 model (to ensure reproducibility if the runtime resets)
os.makedirs("n8_fp32_baseline_ckpt", exist_ok=True)
model_fp32.save_pretrained("n8_fp32_baseline_ckpt")
tokenizer.save_pretrained("n8_fp32_baseline_ckpt")

print(f" FP32 training complete | Wall time: {wall_minutes_fp32:.2f} min | Peak VRAM: {peak_vram_fp32:.2f} MB")
print(" Saved checkpoint: n8_fp32_baseline_ckpt")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 109,483,778 | All params: 109,483,778 | Trainable%: 100.00%


| Epoch | Training Loss | Validation Loss | Accuracy |
|---:|---:|---:|---:|
| 1 | 0.2336 | 0.238257 | 0.902523 |
| 2 | 0.1387 | 0.358879 | 0.916284 |


 FP32 training complete | Wall time: 14.59 min | Peak VRAM: 2526.11 MB
 Saved checkpoint: n8_fp32_baseline_ckpt
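
A rough back-of-the-envelope check helps interpret the peak VRAM figure above. The sketch assumes the optimizer is AdamW with FP32 state (one gradient and two moment buffers per parameter); this is an assumption about the run, not something the notebook measures, and the split into "static" and "remainder" is only an approximation.

```python
MB = 1024**2
BYTES_FP32 = 4

n_params = 109_483_778       # "All params" printed by the FP32 cell
peak_reported_mb = 2526.11   # peak VRAM printed by the FP32 cell

weights_mb = n_params * BYTES_FP32 / MB
# Assumption: AdamW with FP32 state -> 1 gradient + 2 moment buffers per parameter.
grads_mb = weights_mb
adam_states_mb = 2 * weights_mb
static_mb = weights_mb + grads_mb + adam_states_mb

# Whatever is left is attributed to activations and temporary buffers.
remainder_mb = peak_reported_mb - static_mb

print(f"Weights:            {weights_mb:8.2f} MB")
print(f"Weights+grads+Adam: {static_mb:8.2f} MB")
print(f"Remainder:          {remainder_mb:8.2f} MB")
```

Under these assumptions, weights, gradients, and optimizer state account for about 1,671 MB of the 2,526 MB peak, leaving roughly 855 MB for activations and temporary buffers.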


## 3. QLoRA (4-bit) Fine-Tuning

We fine-tune **BERT-base-uncased** with **QLoRA**: the backbone is loaded in **4-bit (NF4)**, and only the **LoRA adapters** and the classification head are trained.

> **Note:** QLoRA is a *different optimization regime* (only a small fraction of the parameters is trainable), so we use **QLoRA-appropriate hyperparameters**:
- LoRA rank **r = 16**, `target_modules=["query","value"]`, `lora_alpha=32`, `lora_dropout=0.1`
- **learning rate = 2e-4** (higher than the FP32 run, to compensate for the small number of trainable parameters)
- **weight decay = 0.01**, linear schedule with **10% warmup**
- **epochs = 2**, same batch sizes as the FP32 run
- **4-bit compute dtype = fp16** (the T4 has no native bfloat16 support)
- **gradient checkpointing disabled** (for speed; uses a bit more VRAM)
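
The number of trainable parameters implied by these LoRA settings can be worked out by hand. The sketch below assumes the standard BERT-base dimensions (hidden size 768, 12 encoder layers) and that the classification head stays trainable alongside the adapters for a sequence-classification task; the constant names are illustrative.

```python
# Assumed BERT-base dimensions (hidden size 768, 12 encoder layers).
HIDDEN = 768
LAYERS = 12
NUM_LABELS = 2

# LoRA settings from the list above.
R = 16
LORA_ALPHA = 32
TARGETS_PER_LAYER = 2  # query and value projections

# Each adapted Linear gets A (r x in) and B (out x r); in = out = HIDDEN here.
per_module = R * HIDDEN + HIDDEN * R
lora_params = per_module * TARGETS_PER_LAYER * LAYERS

# Assumption: the classification head (weight + bias) remains trainable.
classifier_params = HIDDEN * NUM_LABELS + NUM_LABELS

trainable = lora_params + classifier_params
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"LoRA params: {lora_params:,}")
print(f"Classifier params: {classifier_params:,}")
print(f"Trainable total: {trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

The total of 591,362 matches the trainable count this notebook reports for the QLoRA model.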

In [7]:
# 4-bit quantization config (nf4)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # float16 because the T4 has no native bfloat16 support
)

# Load model in 4-bit
model_qlora = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=2,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training + attach LoRA adapters
model_qlora = prepare_model_for_kbit_training(model_qlora)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,                 # LoRA rank
    lora_alpha=32,        # effective scale = lora_alpha / r = 2
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention query/value projections
    bias="none",
)
model_qlora = get_peft_model(model_qlora, lora_config)


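# Parameter counts
trainable_params_qlora = sum(p.numel() for p in model_qlora.parameters() if p.requires_grad)
all_params_qlora = sum(p.numel() for p in model_qlora.parameters())
print(f"Trainable params: {trainable_params_qlora:,} | All params: {all_params_qlora:,} | Trainable%: {100*trainable_params_qlora/all_params_qlora:.2f}%")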

# Training arguments
args_qlora = TrainingArguments(
    output_dir="n8_qlora_output_r16",
    per_device_train_batch_size=BATCH_TRAIN,
    per_device_eval_batch_size=BATCH_EVAL,
    learning_rate=2e-4,           # <- higher LR for LoRA
    num_train_epochs=EPOCHS,      # 2
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="no",
    report_to="none",
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    fp16=False,
    bf16=False,
    gradient_checkpointing=False,  # <- speed up (uses a bit more VRAM)
)

# Reuse the same collator and compute_metrics as FP32
trainer_qlora = Trainer(
    model=model_qlora,
    args=args_qlora,
    train_dataset=tok_train,
    eval_dataset=tok_val,
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument
    data_collator=collator,
    compute_metrics=compute_metrics,
)

# Train and measure time/VRAM
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
t0 = time.time()
_ = trainer_qlora.train()
t1 = time.time()

peak_vram_qlora = (torch.cuda.max_memory_allocated() / (1024**2)) if torch.cuda.is_available() else 0.0
wall_minutes_qlora = (t1 - t0) / 60.0

# Save QLoRA checkpoint
os.makedirs("n8_qlora_ckpt", exist_ok=True)
model_qlora.save_pretrained("n8_qlora_ckpt")
tokenizer.save_pretrained("n8_qlora_ckpt")

print(f" QLoRA training complete | Wall time: {wall_minutes_qlora:.2f} min | Peak VRAM: {peak_vram_qlora:.2f} MB")
print(" Saved checkpoint: n8_qlora_ckpt")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 591,362 | All params: 67,312,900 | Trainable%: 0.88%



| Epoch | Training Loss | Validation Loss | Accuracy |
|---:|---:|---:|---:|
| 1 | 0.2865 | 0.278059 | 0.893349 |
| 2 | 0.2378 | 0.255994 | 0.905963 |


 QLoRA training complete | Wall time: 7.55 min | Peak VRAM: 1638.87 MB
 Saved checkpoint: n8_qlora_ckpt


In [22]:
# QLoRA trainable % using both denominators
# Assumes model_qlora (trained) and all_params (FP32 total) are in memory.
trainable_params_qlora = sum(p.numel() for p in model_qlora.parameters() if p.requires_grad)
all_params_qlora = sum(p.numel() for p in model_qlora.parameters())  # numel() of stored tensors; packed 4-bit weights hold two values per element

pct_vs_qlora_total = 100 * trainable_params_qlora / all_params_qlora
pct_vs_fp32_total  = 100 * trainable_params_qlora / all_params  # FP32 total from earlier

print(
    f"Trainable params: {trainable_params_qlora:,} | "
    f"All params (QLoRA tensors): {all_params_qlora:,} | "
    f"Trainable% vs QLoRA total: {pct_vs_qlora_total:.2f}% | "
    f"Trainable% vs FP32 arch total: {pct_vs_fp32_total:.2f}%"
)

Trainable params: 591,362 | All params (QLoRA tensors): 67,312,900 | Trainable% vs QLoRA total: 0.88% | Trainable% vs FP32 arch total: 0.54%
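
The smaller "QLoRA tensors" total can be reconstructed by hand. The sketch below is one accounting that reproduces 67,312,900 exactly; it assumes that the encoder's Linear weights and the pooler weight are the quantized tensors (stored packed, two 4-bit values per element), that the classifier stays unquantized, and that the trainable classifier copy is counted in addition to the original. These are assumptions about library internals, not something the notebook inspects.

```python
# Assumed BERT-base dimensions.
HIDDEN, INTERMEDIATE, LAYERS = 768, 3072, 12

fp32_total = 109_483_778          # FP32 "All params" from the baseline cell
lora_params = 589_824             # adapter weights (r=16 on query/value, 12 layers)
classifier = HIDDEN * 2 + 2       # 2-label head: weight + bias

# Assumption: these Linear weights are the ones stored in 4-bit.
per_layer_linear = 4 * HIDDEN * HIDDEN + 2 * HIDDEN * INTERMEDIATE
quantized = LAYERS * per_layer_linear + HIDDEN * HIDDEN  # encoder + pooler

# Two 4-bit values are packed into each stored element.
packed = quantized // 2

# Assumption: the trainable classifier copy is counted next to the original head.
qlora_numel = fp32_total - quantized + packed + lora_params + classifier

print(f"Quantized weights (logical): {quantized:,}")
print(f"Stored elements after packing: {packed:,}")
print(f"Reconstructed numel() total: {qlora_numel:,}")
```

This is why the FP32 architecture total is the more meaningful denominator for the trainable percentage.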


## 4. Evaluation (reload-based, per-model isolation)

We reload each checkpoint and evaluate **one model at a time** to get correct:
- Accuracy
- Latency per sample (ms)
- **VRAM after load (MB)** (only this model is on GPU)
- **Peak VRAM during eval (MB)**

In [20]:
MB = 1024**2
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Accurate per-model eval by reloading checkpoints (no other model on GPU)
def _eval_loop(model, dataset, batch_size):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.eval().to(device)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, collate_fn=collator)
    n, correct = 0, 0
    t0 = time.perf_counter()
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            logits = model(**batch).logits
            preds = logits.argmax(-1)
            correct += (preds == batch["labels"]).sum().item()
            n += batch["labels"].size(0)
    t1 = time.perf_counter()
    return (correct / n), (t1 - t0) * 1000.0 / n  # acc, ms/sample

def eval_fp32_from_ckpt(ckpt_dir="n8_fp32_baseline_ckpt"):
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache(); torch.cuda.synchronize()
        baseline = torch.cuda.memory_allocated()
    model = AutoModelForSequenceClassification.from_pretrained(ckpt_dir, num_labels=2)
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    acc, ms = _eval_loop(model, tok_val, BATCH_EVAL)
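    if torch.cuda.is_available():
        torch.cuda.synchronize()
        # Read after the eval loop: memory still held (mostly weights) relative to the pre-load baseline
        vram_after_load = (torch.cuda.memory_allocated() - baseline) / MB
        # Absolute allocator peak since the reset above (not baseline-subtracted)
        peak_eval = torch.cuda.max_memory_allocated() / MB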
    else:
        vram_after_load = peak_eval = 0.0
    del model
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return {
        "accuracy": round(acc, 4),
        "latency_ms_per_sample": round(ms, 4),
        "vram_after_load_MB": round(vram_after_load, 2),
        "peak_vram_eval_MB": round(peak_eval, 2),
    }

def eval_qlora_from_ckpt(ckpt_dir="n8_qlora_ckpt"):
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache(); torch.cuda.synchronize()
        baseline = torch.cuda.memory_allocated()
    # Reload 4-bit base then attach LoRA adapters
    base = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=2, quantization_config=bnb_config, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, ckpt_dir)
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    acc, ms = _eval_loop(model, tok_val, BATCH_EVAL)
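    if torch.cuda.is_available():
        torch.cuda.synchronize()
        # Read after the eval loop: memory still held (mostly weights) relative to the pre-load baseline
        vram_after_load = (torch.cuda.memory_allocated() - baseline) / MB
        # Absolute allocator peak since the reset above (not baseline-subtracted)
        peak_eval = torch.cuda.max_memory_allocated() / MB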
    else:
        vram_after_load = peak_eval = 0.0
    del model
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return {
        "accuracy": round(acc, 4),
        "latency_ms_per_sample": round(ms, 4),
        "vram_after_load_MB": round(vram_after_load, 2),
        "peak_vram_eval_MB": round(peak_eval, 2),
    }

# Run
e_fp32 = eval_fp32_from_ckpt()
e_qlor = eval_qlora_from_ckpt()
e_fp32, e_qlor

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


({'accuracy': 0.9163,
  'latency_ms_per_sample': 7.5969,
  'vram_after_load_MB': 420.99,
  'peak_vram_eval_MB': 1513.67},
 {'accuracy': 0.906,
  'latency_ms_per_sample': 3.2502,
  'vram_after_load_MB': 99.0,
  'peak_vram_eval_MB': 1124.68})

In [21]:
# Fair totals: use FP32 architecture total for both rows
total_params_arch = all_params  # from FP32
qlora_trainable_pct_arch = 100.0 * (trainable_params_qlora / total_params_arch)

df_results = pd.DataFrame([
    {
        "Model": "BERT FP32",
        "Trainable / Total Params": f"{trainable_params:,} / {total_params_arch:,} (100%)",
        "Training Wall Time (min)": round(wall_minutes_fp32, 2),
        "Peak VRAM (MB, train)": round(peak_vram_fp32, 2),
        "Val Acc": e_fp32["accuracy"],
        "Latency (ms/sample)": e_fp32["latency_ms_per_sample"],
        "VRAM after load (MB)": e_fp32["vram_after_load_MB"],
        "Peak VRAM (MB, eval)": e_fp32["peak_vram_eval_MB"],
    },
    {
        "Model": "BERT QLoRA (4-bit, r=16, q/v, fp16)",
        "Trainable / Total Params": f"{trainable_params_qlora:,} / {total_params_arch:,} ({qlora_trainable_pct_arch:.2f}%)",
        "Training Wall Time (min)": round(wall_minutes_qlora, 2),
        "Peak VRAM (MB, train)": round(peak_vram_qlora, 2),
        "Val Acc": e_qlor["accuracy"],
        "Latency (ms/sample)": e_qlor["latency_ms_per_sample"],
        "VRAM after load (MB)": e_qlor["vram_after_load_MB"],
        "Peak VRAM (MB, eval)": e_qlor["peak_vram_eval_MB"],
    },
])

display(df_results)
os.makedirs("results", exist_ok=True)
df_results.to_csv("results/bert_sst2_fp32_vs_qlora.csv", index=False)
print("Saved: results/bert_sst2_fp32_vs_qlora.csv")

| Model | Trainable / Total Params | Training Wall Time (min) | Peak VRAM (MB, train) | Val Acc | Latency (ms/sample) | VRAM after load (MB) | Peak VRAM (MB, eval) |
|---|---|---:|---:|---:|---:|---:|---:|
| BERT FP32 | 109,483,778 / 109,483,778 (100%) | 14.59 | 2526.11 | 0.9163 | 7.5969 | 420.99 | 1513.67 |
| BERT QLoRA (4-bit, r=16, q/v, fp16) | 591,362 / 109,483,778 (0.54%) | 7.55 | 1638.87 | 0.906 | 3.2502 | 99.0 | 1124.68 |


Saved: results/bert_sst2_fp32_vs_qlora.csv


## Results

**Setup:** SST-2 `train[:20000]`, 2 epochs, max length 128, T4 GPU.

| Model | Trainable / Total Params | Training Wall Time (min) | Peak VRAM (MB, train) | Val Acc | Latency (ms/sample) | VRAM after load (MB) | Peak VRAM (MB, eval) |
|---|---|---:|---:|---:|---:|---:|---:|
| BERT FP32 | 109,483,778 / 109,483,778 (100%) | 14.59 | 2526.11 | 0.9163 | 7.5969 | 420.99 | 1513.67 |
| BERT QLoRA (4-bit, r=16, q/v, fp16) | 591,362 / 109,483,778 (0.54%) | 7.55 | 1638.87 | 0.9060 | 3.2502 | 99.00 | 1124.68 |

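**Notes.**
- We report the **FP32 total parameter count** for both rows. bitsandbytes stores 4-bit weights packed two per element, so `numel()` on the QLoRA model (67,312,900) undercounts the architecture.
- Evaluation memory metrics come from a **per-model reload**, so each model is measured alone on the GPU.
- The “newly initialized classifier” warning appears when reloading the **base** model for QLoRA evaluation; the trained head is then restored from the adapter checkpoint, so it does not affect results here.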

## Conclusion

- **Accuracy:** QLoRA reaches **0.9060**, about **1.0 percentage point** below FP32 (**0.9163**).
- **Training time:** QLoRA is ~**1.9× faster** (7.55 vs 14.59 min).
- **Peak VRAM (train):** QLoRA saves ~**0.87 GB** (1638.87 vs 2526.11 MB).
- **Inference:** QLoRA has ~**2.3× lower latency** (3.25 vs 7.60 ms/sample) and lower eval VRAM (1124.68 vs 1513.67 MB). The QLoRA model computes in fp16 while the baseline runs in FP32, so the two runs differ in compute precision as well as weight storage.

**Takeaway:** In this single-seed run at BERT scale, **QLoRA** trains and evaluates faster with less memory while staying within about one point of **FP32 accuracy**. Full FP32 fine-tuning gives the higher accuracy here; QLoRA is the better fit when VRAM or training time is constrained.
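
The ratios quoted in the conclusion follow directly from the results table. The short sketch below recomputes them from the reported values; the dictionary layout and key names are illustrative, and 1 GB is taken as 1024 MB.

```python
# Values copied from the results table above.
fp32 = {"acc": 0.9163, "train_min": 14.59, "train_peak_mb": 2526.11, "latency_ms": 7.5969}
qlora = {"acc": 0.9060, "train_min": 7.55, "train_peak_mb": 1638.87, "latency_ms": 3.2502}

acc_gap_points = (fp32["acc"] - qlora["acc"]) * 100            # percentage points
train_speedup = fp32["train_min"] / qlora["train_min"]         # x faster
vram_saved_gb = (fp32["train_peak_mb"] - qlora["train_peak_mb"]) / 1024
latency_ratio = fp32["latency_ms"] / qlora["latency_ms"]       # x lower latency

print(f"Accuracy gap: {acc_gap_points:.2f} points")
print(f"Training speedup: {train_speedup:.2f}x")
print(f"Train VRAM saved: {vram_saved_gb:.2f} GB")
print(f"Latency ratio: {latency_ratio:.2f}x")
```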
