# Quantized Evaluation – DistilBERT on GPU (FP32, 8-bit, 4-bit)

This notebook benchmarks the inference performance of a fine-tuned DistilBERT model on the SST-2 sentiment classification task using a T4 GPU in Google Colab.

We compare:

- Full-precision (FP32) inference  
- 8-bit quantized inference using bitsandbytes  
- 4-bit quantized inference using bitsandbytes  

Each version is evaluated on a subset of 100 samples from the SST-2 validation set.

### Reported metrics:
- Accuracy
- Average latency per sample (printed in seconds by the evaluation cells, shown in milliseconds in the summary table)
- GPU memory usage (total VRAM after model load, and increase during inference)
- System RAM usage during inference

All experiments use Hugging Face Transformers and bitsandbytes, executed in a GPU-only setup without CPU fallback.

In [1]:
!pip install -q transformers datasets evaluate bitsandbytes accelerate pynvml psutil

import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, BitsAndBytesConfig
from datasets import load_dataset
import time
import numpy as np
import pynvml
import psutil
import os

## 1. Load Pretrained DistilBERT and SST-2 Dataset

We use the fine-tuned DistilBERT model for SST-2 from Hugging Face.

The model is already trained and ready for inference.  
We use the first 100 validation samples from SST-2 for benchmarking.

In [2]:
model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load SST-2 validation split (first 100 examples)
raw_dataset = load_dataset("glue", "sst2", split="validation[:100]")

# Tokenize the dataset
def tokenize_function(example):
    return tokenizer(example["sentence"], padding="max_length", truncation=True, max_length=128)

tokenized_dataset = raw_dataset.map(tokenize_function)



## 2. Convert to PyTorch-compatible dataset

We define a custom dataset class that wraps the tokenized Hugging Face dataset and returns input tensors for PyTorch inference.

In [3]:
class SST2Dataset(Dataset):
    def __init__(self, hf_dataset):
        self.input_ids = [torch.tensor(x) for x in hf_dataset["input_ids"]]
        self.attention_mask = [torch.tensor(x) for x in hf_dataset["attention_mask"]]
        self.labels = [torch.tensor(x) for x in hf_dataset["label"]]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx].to("cuda"),
            "attention_mask": self.attention_mask[idx].to("cuda"),
            "label": self.labels[idx].item()
        }

dataset = SST2Dataset(tokenized_dataset)

## 3. Define evaluation function

This function evaluates a model on the 100-sample SST-2 validation subset using the GPU.

It measures:

- Accuracy (on 100 SST-2 samples)
- Average latency per sample (in seconds)
- GPU VRAM in use after model load (in MB; NVML reports this for the whole device, not per model)
- GPU VRAM increase during inference (in MB)
- System RAM usage increase during inference (in MB)
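
A mean alone can hide warm-up effects and outliers in per-sample timings. The following is a minimal sketch of how a list of latencies in seconds could be summarized in milliseconds. The helper name `summarize_latencies`, the `warmup` parameter, and the sample timings are assumptions made for illustration and are not part of the notebook.

```python
import statistics


def summarize_latencies(latencies_s, warmup=1):
    """Summarize per-sample latencies (seconds) as millisecond statistics."""
    # Drop the first `warmup` measurements, which often include one-off
    # costs such as CUDA context creation and kernel loading.
    steady = latencies_s[warmup:] if len(latencies_s) > warmup else latencies_s
    ms = sorted(t * 1000.0 for t in steady)
    # Nearest-rank 95th percentile, using integer ceiling division.
    p95_index = (95 * len(ms) + 99) // 100 - 1
    return {
        "mean_ms": statistics.fmean(ms),
        "median_ms": statistics.median(ms),
        "p95_ms": ms[p95_index],
    }


# Made-up timings for illustration only (not measurements from this notebook).
example = [0.250, 0.012, 0.013, 0.011, 0.012]
stats = summarize_latencies(example)
print(stats)
```

In the notebook, the `latencies` list collected inside `evaluate_model` could be passed to such a helper instead of calling `np.mean` directly.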

In [4]:
def evaluate_model(model, dataset):
    model.eval()
    process = psutil.Process(os.getpid())

    # Initialize GPU memory tracker
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Total VRAM used after model is loaded
    model_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).used / (1024 ** 2)  # MB

    # RAM usage before inference
    start_ram = process.memory_info().rss
    start_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).used

    correct = 0
    latencies = []

    with torch.no_grad():
        for sample in dataset:
            inputs = {
                "input_ids": sample["input_ids"].unsqueeze(0),
                "attention_mask": sample["attention_mask"].unsqueeze(0),
            }
            label = sample["label"]

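            # CUDA kernels launch asynchronously, so synchronize before
            # reading the clock on either side of the forward pass
            torch.cuda.synchronize()
            start_time = time.perf_counter()
            outputs = model(**inputs)
            torch.cuda.synchronize()
            end_time = time.perf_counter()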

            pred = torch.argmax(outputs.logits, dim=1).item()
            correct += (pred == label)
            latencies.append(end_time - start_time)

    # Memory after inference
    end_ram = process.memory_info().rss
    end_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).used
    pynvml.nvmlShutdown()

    # Metrics
    delta_ram = (end_ram - start_ram) / (1024 ** 2)      # in MB
    delta_vram = (end_vram - start_vram) / (1024 ** 2)   # in MB
    avg_latency = np.mean(latencies)
    accuracy = correct / len(dataset)

    return accuracy, avg_latency, delta_ram, delta_vram, model_vram

## 4. Run Full-Precision (FP32) Evaluation on GPU

We now evaluate the full-precision (FP32) DistilBERT model using the T4 GPU.

This serves as our performance baseline before applying any quantization.

In [5]:
# Load full-precision model to GPU
model_fp32 = AutoModelForSequenceClassification.from_pretrained(model_id).to("cuda")

# Run evaluation
accuracy_fp32, latency_fp32, ram_fp32, vram_delta_fp32, vram_model_fp32 = evaluate_model(model_fp32, dataset)

print(f"Accuracy (FP32): {accuracy_fp32:.2%}")
print(f"Latency per sample (FP32): {latency_fp32:.4f} seconds")
print(f"System RAM usage increase (FP32): {ram_fp32:.2f} MB")
print(f"GPU VRAM usage increase during inference (FP32): {vram_delta_fp32:.2f} MB")
print(f"Total VRAM after model load (FP32): {vram_model_fp32:.2f} MB")

Accuracy (FP32): 94.00%
Latency per sample (FP32): 0.0124 seconds
System RAM usage increase (FP32): 213.52 MB
GPU VRAM usage increase during inference (FP32): 30.00 MB
Total VRAM after model load (FP32): 659.88 MB


## 5. Quantize the Model (8-bit)

We apply 8-bit quantization to the full-precision DistilBERT model using bitsandbytes.

This reduces the size of the model by compressing the linear layers to 8-bit precision. The quantized model remains compatible with GPU inference through Hugging Face Transformers.
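
To make "compressing to 8-bit precision" concrete, here is a simplified sketch of absmax quantization on a toy list of weights. It is not the bitsandbytes implementation, which applies vector-wise scaling and handles outlier features separately. The function names and the weight values are assumptions chosen for illustration.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 0.31, 1.27, -0.08]  # toy values, not model weights
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as a one-byte integer plus a shared scale, and the rounding error per weight is at most half the scale.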

In [6]:
# Release cached, unused GPU blocks. model_fp32 is still referenced,
# so its weights stay on the GPU while the 8-bit model is loaded.
torch.cuda.empty_cache()

# 8-bit quantization settings
bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# device_map is an argument of from_pretrained, not of BitsAndBytesConfig
model_int8 = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    quantization_config=bnb_config_8bit,
    device_map={"": 0}  # Force all layers onto GPU 0
)

## 6. Evaluate Quantized Model (INT8)

We evaluate the 8-bit quantized DistilBERT model using the same procedure as in the FP32 baseline.


In [7]:
accuracy_int8, latency_int8, ram_int8, vram_delta_int8, vram_model_int8 = evaluate_model(model_int8, dataset)

print(f"Accuracy (INT8): {accuracy_int8:.2%}")
print(f"Latency per sample (INT8): {latency_int8:.4f} seconds")
print(f"System RAM usage increase (INT8): {ram_int8:.2f} MB")
print(f"GPU VRAM usage increase during inference (INT8): {vram_delta_int8:.2f} MB")
print(f"Total VRAM after model load (INT8): {vram_model_int8:.2f} MB")

Accuracy (INT8): 94.00%
Latency per sample (INT8): 0.0772 seconds
System RAM usage increase (INT8): 130.85 MB
GPU VRAM usage increase during inference (INT8): 14.00 MB
Total VRAM after model load (INT8): 843.88 MB


## 7. Quantize the Model (4-bit)

We apply 4-bit quantization using bitsandbytes by enabling `load_in_4bit=True`.

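This uses the bitsandbytes 4-bit kernels popularized by QLoRA and compresses linear layers more aggressively than 8-bit. Because `bnb_4bit_quant_type` is not set, the default FP4 data type is used rather than the NF4 type that QLoRA introduced. We expect a further reduction in the memory footprint of the weights, potentially at the cost of slower inference or reduced accuracy.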
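
As a rough sanity check on the expected footprint, the weight storage at each precision can be estimated from the parameter count. The sketch below assumes DistilBERT-base has roughly 66 million parameters and treats every parameter as quantized. In practice only the linear layers are quantized and quantization constants add some overhead, so real savings are smaller. The helper name `weight_memory_mb` is hypothetical.

```python
def weight_memory_mb(num_params, bits_per_param):
    """Memory needed to store `num_params` weights, in MB (1 MB = 1024**2 bytes)."""
    return num_params * bits_per_param / 8 / (1024 ** 2)


# Assumption: roughly 66 million parameters for DistilBERT-base.
NUM_PARAMS = 66_000_000

estimates = {
    name: weight_memory_mb(NUM_PARAMS, bits)
    for name, bits in [("FP32", 32), ("8-bit", 8), ("4-bit", 4)]
}
for name, mb in estimates.items():
    print(f"{name}: {mb:.1f} MB")
```

These estimates cover weights only. Activations, the CUDA context, and any other models still resident on the GPU come on top of them.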

In [8]:
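# Release cached, unused GPU blocks. model_fp32 and model_int8 are still
# referenced, so their weights stay on the GPU while the 4-bit model is loaded.
torch.cuda.empty_cache()

# 4-bit quantization settings (matrix multiplications run in float16)
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

# device_map is an argument of from_pretrained, not of BitsAndBytesConfig
model_4bit = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    quantization_config=bnb_config_4bit,
    device_map={"": 0}  # Force all layers onto GPU 0
)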

## 8. Evaluate Quantized Model (4-bit)

We evaluate the 4-bit quantized DistilBERT model using the same procedure.

The 4-bit setup is more aggressive than 8-bit and is typically used for large models with QLoRA. On small models like DistilBERT, results may vary.

In [9]:
accuracy_4bit, latency_4bit, ram_4bit, vram_delta_4bit, vram_model_4bit = evaluate_model(model_4bit, dataset)

print(f"Accuracy (4-bit): {accuracy_4bit:.2%}")
print(f"Latency per sample (4-bit): {latency_4bit:.4f} seconds")
print(f"System RAM usage increase (4-bit): {ram_4bit:.2f} MB")
print(f"GPU VRAM usage increase during inference (4-bit): {vram_delta_4bit:.2f} MB")
print(f"Total VRAM after model load (4-bit): {vram_model_4bit:.2f} MB")

Accuracy (4-bit): 93.00%
Latency per sample (4-bit): 0.0165 seconds
System RAM usage increase (4-bit): 5.73 MB
GPU VRAM usage increase during inference (4-bit): 2.00 MB
Total VRAM after model load (4-bit): 953.88 MB


## 9. Summary: FP32 vs Quantized DistilBERT on GPU (T4)

The following table summarizes the results of the experiments above.

In [11]:
import pandas as pd

df = pd.DataFrame({
    "Metric": [
        "Accuracy (%)",
        "Latency (ms)",
        "RAM Increase (MB)",
        "VRAM Increase (MB)",
        "Total VRAM (MB)"
    ],
    "FP32": [94.00, 12.40, 213.52, 30.00, 659.88],
    "8-bit": [94.00, 77.20, 130.85, 14.00, 843.88],
    "4-bit": [93.00, 16.50, 5.73, 2.00, 953.88],
    "Change (vs FP32)": [
        "-1.0% (4-bit)",
        "8-bit slower",
        "Lower with quantized",
        "Lower with quantized",
        "Higher in all cases"
    ]
})

df

| | Metric | FP32 | 8-bit | 4-bit | Change (vs FP32) |
|---|---|---|---|---|---|
| 0 | Accuracy (%) | 94.0 | 94.0 | 93.0 | -1.0% (4-bit) |
| 1 | Latency (ms) | 12.4 | 77.2 | 16.5 | 8-bit slower |
| 2 | RAM Increase (MB) | 213.52 | 130.85 | 5.73 | Lower with quantized |
| 3 | VRAM Increase (MB) | 30.0 | 14.0 | 2.0 | Lower with quantized |
| 4 | Total VRAM (MB) | 659.88 | 843.88 | 953.88 | Higher in all cases |
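
The qualitative "Change (vs FP32)" column can be made quantitative with a few lines of arithmetic. This is a small sketch using the latency and total-VRAM values reported above. The dictionary layout and the helper name `relative_to_baseline` are assumptions for illustration.

```python
# Values copied from the summary table above.
results = {
    "FP32": {"latency_ms": 12.40, "total_vram_mb": 659.88},
    "8-bit": {"latency_ms": 77.20, "total_vram_mb": 843.88},
    "4-bit": {"latency_ms": 16.50, "total_vram_mb": 953.88},
}


def relative_to_baseline(results, baseline="FP32"):
    """Return latency ratio and total-VRAM difference versus the baseline."""
    base = results[baseline]
    changes = {}
    for name, row in results.items():
        if name == baseline:
            continue
        changes[name] = {
            "latency_ratio": row["latency_ms"] / base["latency_ms"],
            "vram_delta_mb": row["total_vram_mb"] - base["total_vram_mb"],
        }
    return changes


changes = relative_to_baseline(results)
for name, change in changes.items():
    print(f"{name}: {change['latency_ratio']:.2f}x latency, "
          f"{change['vram_delta_mb']:+.2f} MB total VRAM")
```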


**Observations:**

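- The memory increase during inference was lower for the quantized models, although the first (FP32) run may also include one-time CUDA initialization costs.
- Total VRAM reported by NVML **rose with each quantized model** (659.88 MB, then 843.88 MB, then 953.88 MB). NVML reports memory in use on the whole device, and the notebook never deletes `model_fp32` or `model_int8` before loading the next model, so the later totals likely include the earlier models rather than each model's own footprint.
- Latency increased roughly sixfold in 8-bit mode (77.2 ms vs. 12.4 ms), likely because the per-layer quantize/dequantize and outlier-handling steps of the 8-bit kernels dominate at batch size 1 on a model this small. 4-bit inference was also somewhat slower than FP32 (16.5 ms).
- Accuracy remained stable: the only change was one additional misclassified sample out of 100 in 4-bit mode.
- These findings suggest that **bitsandbytes quantization does not speed up small models like DistilBERT** on a T4 GPU, and its memory benefit cannot be confirmed from these measurements.
- For meaningful GPU compression gains, 4-bit/8-bit quantization should be evaluated on **larger models** (e.g., BERT, LLaMA), where the savings are more likely to outweigh the overhead.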
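
The "Total VRAM" figures come from NVML, which reports memory in use on the whole device. One way to attribute memory to a single model is to read the counter immediately before and after loading it. The sketch below shows that pattern with a stand-in reader and loader so that it runs without a GPU. The helper `measure_load_delta_mb` and the fake device are hypothetical.

```python
import gc


def measure_load_delta_mb(read_used_bytes, load_fn):
    """Return (loaded_object, MB added to the device while `load_fn` ran)."""
    gc.collect()  # drop unreachable objects before taking the first reading
    before = read_used_bytes()
    obj = load_fn()
    after = read_used_bytes()
    return obj, (after - before) / (1024 ** 2)


# Stand-in reader and loader so the sketch runs without a GPU.
fake_device = {"used": 400 * 1024 ** 2}


def fake_read():
    return fake_device["used"]


def fake_load():
    fake_device["used"] += 250 * 1024 ** 2  # pretend the model takes 250 MB
    return "model"


model, delta_mb = measure_load_delta_mb(fake_read, fake_load)
print(f"{delta_mb:.2f} MB")
```

In the notebook, `read_used_bytes` could be `lambda: pynvml.nvmlDeviceGetMemoryInfo(handle).used` and `load_fn` could wrap the `from_pretrained` call. Running `del model_fp32`, `gc.collect()`, and `torch.cuda.empty_cache()` before each load would also keep the device-wide totals comparable across configurations.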