<a href="https://colab.research.google.com/github/KyriakosTop/distilbert-compression/blob/main/bert_quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT-base Quantization on GPU – FP32, 8-bit, and 4-bit Inference

This notebook benchmarks the inference performance of the BERT-base model fine-tuned on the SST-2 sentiment classification task using a T4 GPU in Google Colab.

We evaluate:

- Full-precision (FP32) inference
- 8-bit quantized inference using bitsandbytes
- 4-bit quantized inference using bitsandbytes

Each configuration is tested on 100 validation samples from SST-2.  
We measure:

- Accuracy
- Average latency per sample (in milliseconds)
- System RAM usage (MB)
- GPU VRAM usage (total and delta in MB)

This notebook is designed to assess whether quantization is more effective for larger models like BERT-base, compared to smaller architectures such as DistilBERT.

In [None]:
# Install required packages
!pip install -q transformers datasets evaluate bitsandbytes accelerate pynvml psutil

# Imports (re-run this cell if the kernel was restarted)
import torch
import numpy as np
import time
import os
import psutil
import pynvml
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, BitsAndBytesConfig

## 1. Load BERT-base and SST-2 Validation Samples

We use the tokenizer of the `bert-base-uncased` model from Hugging Face; the model itself has 12 transformer layers and ~110 million parameters. The SST-2 fine-tuned checkpoint built on it is loaded in the sections below.  
To maintain consistency with earlier experiments, we evaluate on the first 100 samples from the SST-2 validation set.  
Each sample is padded or truncated to a fixed length of 128 tokens and wrapped in a PyTorch-compatible dataset class.

In [None]:
# Define model ID and tokenizer
model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load SST-2 validation set (first 100 samples)
dataset = load_dataset("glue", "sst2", split="validation[:100]")

# Tokenize with max length 128
def tokenize_function(example):
    return tokenizer(example["sentence"], padding="max_length", truncation=True, max_length=128)

tokenized_dataset = dataset.map(tokenize_function)

# Wrap into PyTorch-style dataset
class SST2Dataset(torch.utils.data.Dataset):
    def __init__(self, hf_dataset):
        self.input_ids = [torch.tensor(x) for x in hf_dataset["input_ids"]]
        self.attention_mask = [torch.tensor(x) for x in hf_dataset["attention_mask"]]
        self.labels = [torch.tensor(x) for x in hf_dataset["label"]]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx].to("cuda"),
            "attention_mask": self.attention_mask[idx].to("cuda"),
            "label": self.labels[idx].item()
        }

dataset = SST2Dataset(tokenized_dataset)

## 2. Define Evaluation Function

This function evaluates the model on GPU using the 100 tokenized SST-2 samples.  
It measures:

- Accuracy
- Average latency per sample (returned in seconds; converted to milliseconds when printed)
- System RAM usage increase (MB)
- GPU VRAM usage increase during inference (MB)
- Total VRAM in use on the GPU after model load (MB), as reported by NVML for the whole device

The function assumes the model is already loaded onto the GPU and runs inference sample-by-sample without batching, to allow fine-grained latency measurement.
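
One caveat for fine-grained GPU timing is that CUDA kernels are launched asynchronously, so a wall-clock timer placed around a forward pass can stop before the GPU has finished. The following is a minimal sketch of a timing helper that synchronizes before and after the call. The name `timed_call` and its `sync` and `warmup` parameters are hypothetical and not part of this notebook; on a GPU one would pass `torch.cuda.synchronize` as `sync`.

```python
import time

def timed_call(fn, sync=None, warmup=0):
    """Run fn() and return (result, elapsed_seconds).

    sync   -- optional callable that blocks until queued device work is done
              (assumption: torch.cuda.synchronize when timing GPU inference)
    warmup -- number of untimed calls made first, to absorb one-time setup cost
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure earlier work does not leak into this measurement
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait until the work launched by fn() has actually finished
    elapsed = time.perf_counter() - start
    return result, elapsed

# CPU-only demonstration so the sketch runs anywhere.
result, seconds = timed_call(lambda: sum(range(100_000)), warmup=2)
print(f"result={result}, latency={seconds * 1000:.3f} ms")
```

Numbers measured this way may differ from those reported in this notebook, which were taken without explicit synchronization.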

In [None]:
def evaluate_model(model, dataset):
    model.eval()
    process = psutil.Process(os.getpid())

    # Initialize GPU memory tracker
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    model_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).used / (1024 ** 2)  # MB
    start_ram = process.memory_info().rss
    start_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).used

    correct = 0
    latencies = []

    with torch.no_grad():
        for sample in dataset:
            inputs = {
                "input_ids": sample["input_ids"].unsqueeze(0),
                "attention_mask": sample["attention_mask"].unsqueeze(0),
            }
            label = sample["label"]

            start_time = time.time()
            outputs = model(**inputs)
            end_time = time.time()

            pred = torch.argmax(outputs.logits, dim=1).item()
            correct += (pred == label)
            latencies.append(end_time - start_time)

    end_ram = process.memory_info().rss
    end_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).used
    pynvml.nvmlShutdown()

    delta_ram = (end_ram - start_ram) / (1024 ** 2)
    delta_vram = (end_vram - start_vram) / (1024 ** 2)
    avg_latency = np.mean(latencies)
    accuracy = correct / len(dataset)

    return accuracy, avg_latency, delta_ram, delta_vram, model_vram
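
`evaluate_model` reports only the mean latency, which can be skewed by a slow first call (for example, one-time kernel setup). As a sketch, assuming the per-sample `latencies` list were also returned, a summary like the one below would show the median and a high percentile as well. The helper `summarize_latencies` and the sample values are hypothetical, not measurements from this notebook.

```python
import math
import statistics

def summarize_latencies(latencies_s, skip_first=1):
    """Summarize per-sample latencies given in seconds; results are in ms.

    skip_first -- number of leading samples to drop as warm-up
                  (ignored if it would leave no samples)
    """
    steady = latencies_s[skip_first:] or latencies_s
    ms = sorted(value * 1000 for value in steady)
    # Nearest-rank 95th percentile on the sorted values.
    p95_index = max(0, math.ceil(0.95 * len(ms)) - 1)
    return {
        "mean_ms": statistics.fmean(ms),
        "median_ms": statistics.median(ms),
        "p95_ms": ms[p95_index],
    }

# Hypothetical latencies in seconds; the first one mimics a slow warm-up call.
example = [0.250, 0.012, 0.013, 0.011, 0.014]
summary = summarize_latencies(example)
print(summary)
```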

## 3. Run Full-Precision (FP32) Inference

We load the full-precision BERT-base model fine-tuned on SST-2 and run inference on the 100 validation samples using a T4 GPU.

This serves as our performance baseline for comparison with quantized versions.

In [None]:
# Clear GPU memory before loading
torch.cuda.empty_cache()

# Load BERT-base FP32 model to GPU
model_fp32 = AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-SST-2"
).to("cuda")

# Evaluate
accuracy_fp32, latency_fp32, ram_fp32, vram_delta_fp32, vram_model_fp32 = evaluate_model(model_fp32, dataset)

# Print results
print(f"Accuracy (FP32): {accuracy_fp32:.2%}")
print(f"Latency per sample (FP32): {latency_fp32 * 1000:.2f} ms")
print(f"System RAM usage increase (FP32): {ram_fp32:.2f} MB")
print(f"GPU VRAM usage increase during inference (FP32): {vram_delta_fp32:.2f} MB")
print(f"Total VRAM after model load (FP32): {vram_model_fp32:.2f} MB")

Accuracy (FP32): 92.00%
Latency per sample (FP32): 12.83 ms
System RAM usage increase (FP32): 63.26 MB
GPU VRAM usage increase during inference (FP32): 32.00 MB
Total VRAM after model load (FP32): 1289.88 MB


## 4. Run 8-bit Quantized Inference (bitsandbytes)

We now evaluate the 8-bit quantized version of BERT-base using the `bitsandbytes` backend.  
All linear layers are quantized to 8-bit precision during model loading.  
This test will help us assess whether quantization improves latency or memory usage on a larger model like BERT.
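
To give an intuition for what 8-bit weight quantization does, here is a simplified absmax sketch in plain Python. It is not the bitsandbytes implementation, which uses vector-wise quantization with special handling of outlier features (LLM.int8()); the function names and the example weights below are illustrative assumptions.

```python
def quantize_absmax_int8(row):
    """Map a list of floats to integers in [-127, 127] using one scale per row."""
    absmax = max(abs(value) for value in row)
    if absmax == 0.0:
        return [0] * len(row), 1.0
    scale = absmax / 127.0
    quantized = [max(-127, min(127, round(value / scale))) for value in row]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [code * scale for code in quantized]

# Illustrative weights, not taken from the model.
weights = [0.12, -0.50, 0.33, 0.01]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, f"scale={scale:.6f}", f"max_error={max_error:.6f}")
```

The rounding error per weight is bounded by half the scale, which is why a single large value in a row (a large `absmax`) reduces the precision available to all other values in that row.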

In [None]:
# Clear GPU memory before loading 8-bit model
torch.cuda.empty_cache()

# Configure 8-bit quantization
bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# Load 8-bit quantized model
model_int8 = AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-SST-2",
    quantization_config=bnb_config_8bit,
    device_map={"": 0}  # Force entire model to GPU 0 (device_map belongs to from_pretrained)
)

# Evaluate
accuracy_int8, latency_int8, ram_int8, vram_delta_int8, vram_model_int8 = evaluate_model(model_int8, dataset)

# Print results
print(f"Accuracy (8-bit): {accuracy_int8:.2%}")
print(f"Latency per sample (8-bit): {latency_int8 * 1000:.2f} ms")
print(f"System RAM usage increase (8-bit): {ram_int8:.2f} MB")
print(f"GPU VRAM usage increase during inference (8-bit): {vram_delta_int8:.2f} MB")
print(f"Total VRAM after model load (8-bit): {vram_model_int8:.2f} MB")

Accuracy (8-bit): 92.00%
Latency per sample (8-bit): 94.04 ms
System RAM usage increase (8-bit): 20.96 MB
GPU VRAM usage increase during inference (8-bit): 12.00 MB
Total VRAM after model load (8-bit): 1081.88 MB


## 5. Run 4-bit Quantized Inference (bitsandbytes)

We now evaluate the 4-bit quantized version of BERT-base using bitsandbytes.  
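This uses the 4-bit quantization that bitsandbytes provides for QLoRA, which stores linear-layer weights in 4-bit precision and processes them with custom CUDA kernels. The configuration below does not set `bnb_4bit_quant_type`, so the library default (`fp4`) applies rather than the NF4 type used in QLoRA.  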
This experiment helps assess whether 4-bit quantization can reduce memory further while maintaining accuracy and acceptable latency.
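
As a rough back-of-the-envelope check on how much memory quantization can save, the sketch below estimates the size of the weights alone at each precision, assuming the ~110 million parameter count quoted in section 1 and that every parameter is stored at the given bit width. That assumption overstates the savings, since embeddings, LayerNorm parameters, and quantization constants are not stored in 4 or 8 bits. The helper name `weight_mib` is hypothetical.

```python
PARAM_COUNT = 110_000_000  # approximate parameter count of BERT-base

BITS_PER_WEIGHT = {"FP32": 32, "8-bit": 8, "4-bit": 4}

def weight_mib(param_count, bits):
    """Size of param_count weights stored at `bits` bits each, in MiB."""
    return param_count * bits / 8 / (1024 ** 2)

estimates = {
    name: weight_mib(PARAM_COUNT, bits) for name, bits in BITS_PER_WEIGHT.items()
}
for name, mib in estimates.items():
    print(f"{name}: ~{mib:.0f} MiB of weights")
```

The VRAM totals measured in this notebook are device-wide NVML readings, so they also include memory that does not shrink with quantization, such as the CUDA context. This is one plausible reason the measured totals differ far less than the weight-only estimate.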

In [None]:
# Clear GPU memory before loading 4-bit model
torch.cuda.empty_cache()

# Configure 4-bit quantization
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

# Load 4-bit quantized model
model_4bit = AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-SST-2",
    quantization_config=bnb_config_4bit,
    device_map={"": 0}  # Force entire model to GPU 0 (device_map belongs to from_pretrained)
)

# Evaluate
accuracy_4bit, latency_4bit, ram_4bit, vram_delta_4bit, vram_model_4bit = evaluate_model(model_4bit, dataset)

# Print results
print(f"Accuracy (4-bit): {accuracy_4bit:.2%}")
print(f"Latency per sample (4-bit): {latency_4bit * 1000:.2f} ms")
print(f"System RAM usage increase (4-bit): {ram_4bit:.2f} MB")
print(f"GPU VRAM usage increase during inference (4-bit): {vram_delta_4bit:.2f} MB")
print(f"Total VRAM after model load (4-bit): {vram_model_4bit:.2f} MB")

Accuracy (4-bit): 92.00%
Latency per sample (4-bit): 20.75 ms
System RAM usage increase (4-bit): 0.51 MB
GPU VRAM usage increase during inference (4-bit): 6.00 MB
Total VRAM after model load (4-bit): 1195.88 MB


## 6. Summary of Results

The following table summarizes the performance of BERT-base across FP32, 8-bit, and 4-bit configurations, evaluated on 100 SST-2 validation samples using a T4 GPU.

| Model   | Accuracy | Latency (ms/sample) | RAM increase (MB) | VRAM increase (MB) | Total VRAM after load (MB) |
|---------|----------|---------------------|-------------------|--------------------|----------------------------|
| FP32    | 92.00%   | 12.83         | 63.26       | 32.00        | 1289.88          |
| 8-bit   | 92.00%   | 94.04         | 20.96       | 12.00        | 1081.88          |
| 4-bit   | 92.00%   | 20.75         | 0.51        | 6.00         | 1195.88          |
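
To make the comparison easier to read, the sketch below expresses the latency and total VRAM of each configuration relative to the FP32 baseline, using only the numbers from the table. The function name `relative_to_baseline` is hypothetical.

```python
# Values copied from the summary table above.
results = {
    "FP32": {"latency_ms": 12.83, "total_vram_mb": 1289.88},
    "8-bit": {"latency_ms": 94.04, "total_vram_mb": 1081.88},
    "4-bit": {"latency_ms": 20.75, "total_vram_mb": 1195.88},
}

def relative_to_baseline(results, baseline="FP32"):
    """Return slowdown factor and fractional VRAM saving versus the baseline."""
    base = results[baseline]
    return {
        name: {
            "slowdown": row["latency_ms"] / base["latency_ms"],
            "vram_saving": 1 - row["total_vram_mb"] / base["total_vram_mb"],
        }
        for name, row in results.items()
    }

relative = relative_to_baseline(results)
for name, row in relative.items():
    print(f"{name}: {row['slowdown']:.2f}x latency, {row['vram_saving']:.1%} less total VRAM")
```

In this run the 8-bit model is about 7.3 times slower than FP32 for roughly 16% less total VRAM, while the 4-bit model is about 1.6 times slower for roughly 7% less.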

## 7. Observations

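- All three model variants achieved identical accuracy (92.00%) on the 100-sample SST-2 validation subset. This suggests that post-training quantization did not harm prediction quality on this task, although with 100 samples each prediction is worth a full percentage point, so small differences cannot be ruled out.
- The 8-bit model was about 7.3× slower than FP32 (94.04 ms vs. 12.83 ms per sample), likely due to bitsandbytes kernel overhead, consistent with previous experiments on smaller models.
- The 4-bit model showed the smallest memory increases during inference (0.51 MB RAM, 6.00 MB VRAM) at a latency of 20.75 ms, about 1.6× FP32. Its total VRAM after load (1195.88 MB) was lower than FP32 (1289.88 MB) but higher than 8-bit (1081.88 MB), so its advantage over 8-bit lies in latency rather than in total memory.
- On this T4 run, FP32 remained the fastest configuration, and the total-VRAM savings were modest (roughly 7% for 4-bit and 16% for 8-bit). The results therefore give only limited support to the idea that quantization benefits scale better with larger models like BERT-base; the clearest gain is the lower latency of the 4-bit kernels compared with 8-bit.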
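
Because accuracy was measured on only 100 samples, it helps to quantify how uncertain the 92% figure is. The sketch below computes a Wilson score interval for a proportion; this is a standard statistical technique added here for illustration and was not part of the original experiment. The function name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width

# 92 correct predictions out of 100, as reported for all three configurations.
low, high = wilson_interval(92, 100)
print(f"95% interval for accuracy: {low:.1%} to {high:.1%}")
```

The interval spans roughly 85% to 96%, so identical accuracy across FP32, 8-bit, and 4-bit on this subset does not rule out differences of a few percentage points on the full validation set.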