# Quantized Evaluation – DistilBERT on GTX 1050 Ti (FP32, 8-bit, 4-bit)

This notebook benchmarks the inference performance of a fine-tuned [DistilBERT](https://huggingface.co/distilbert-base-uncased) model on the SST-2 sentiment classification task using a GTX 1050 Ti GPU.

We evaluate the model in three configurations using Hugging Face Transformers and the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) quantization backend:

- Full-precision (FP32) inference  
- 8-bit quantized inference  
- 4-bit quantized inference  

Each model is tested on the first 100 samples from the SST-2 validation set.

### Reported metrics:
- Accuracy
- Average latency per sample (in milliseconds)
- GPU VRAM usage (after model load and during inference)
- System RAM usage increase (in MB)

All tests are run on the GPU. Note that the GTX 1050 Ti lacks support for 4-bit fused kernels and may trigger CPU fallback during quantized inference.

**References:**  
[1] Sanh et al., “DistilBERT: A distilled version of BERT,” https://arxiv.org/abs/1910.01108  
[2] Dettmers et al., bitsandbytes library: https://github.com/TimDettmers/bitsandbytes  
[3] GLUE Benchmark – SST-2: https://huggingface.co/datasets/glue/viewer/sst2  

In [None]:
!pip install -q transformers datasets evaluate bitsandbytes accelerate pynvml psutil

import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, BitsAndBytesConfig
from datasets import load_dataset
import time
import numpy as np
import pynvml
import psutil
import os
import pandas as pd

## 1. Load Pretrained DistilBERT and SST-2 Dataset

We use the fine-tuned DistilBERT model for SST-2 from Hugging Face.

The model is already trained and ready for inference.  
We use the first 100 validation samples from SST-2 for benchmarking.

In [5]:
model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load SST-2 validation split (first 100 examples)
raw_dataset = load_dataset("glue", "sst2", split="validation[:100]")

# Tokenize the dataset
def tokenize_function(example):
    return tokenizer(example["sentence"], padding="max_length", truncation=True, max_length=128)

tokenized_dataset = raw_dataset.map(tokenize_function)

## 2. Convert to PyTorch-compatible dataset

We define a custom dataset class that wraps the tokenized Hugging Face dataset and returns input tensors for PyTorch inference.

Input tensors are moved to the GPU (`cuda:0`) when the dataset is built; labels stay on the CPU and are returned as plain integers.

In [6]:
class SST2Dataset(Dataset):
    def __init__(self, hf_dataset):
        self.input_ids = [torch.tensor(x).to("cuda") for x in hf_dataset["input_ids"]]
        self.attention_mask = [torch.tensor(x).to("cuda") for x in hf_dataset["attention_mask"]]
        self.labels = [torch.tensor(x) for x in hf_dataset["label"]]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
            "label": self.labels[idx].item()
        }

dataset = SST2Dataset(tokenized_dataset)

## 3. Define Evaluation Function

This function evaluates the model on the 100-sample SST-2 validation subset using the GPU.

It measures:
- Accuracy
- Average latency per sample (returned in seconds; the summary table converts it to milliseconds)
- GPU VRAM usage after model load
- GPU VRAM increase during inference
- System RAM usage increase during inference
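Latency on a GPU is easy to under-measure, because CUDA work is queued asynchronously and a forward call can return before the kernels finish. The following is a minimal sketch of a timing helper, not the notebook's own code: the name `time_call` and its `sync`, `warmup`, and `repeats` parameters are hypothetical, and `sync` is assumed to be a callable such as `torch.cuda.synchronize`.

```python
import time


def time_call(fn, *args, sync=None, warmup=1, repeats=10, **kwargs):
    """Return a list of wall-clock latencies (seconds) for fn(*args, **kwargs).

    `sync` is an optional zero-argument callable that blocks until queued
    device work has finished (assumption: e.g. torch.cuda.synchronize).
    """
    # Warm-up calls are executed but not recorded.
    for _ in range(warmup):
        fn(*args, **kwargs)

    latencies = []
    for _ in range(repeats):
        if sync is not None:
            sync()  # drain earlier work before starting the clock
        start = time.perf_counter()
        fn(*args, **kwargs)
        if sync is not None:
            sync()  # wait for this call's work before stopping the clock
        latencies.append(time.perf_counter() - start)
    return latencies
```

With PyTorch this could be called as `time_call(lambda: model(**inputs), sync=torch.cuda.synchronize)`; without a `sync` callback it behaves like a plain CPU timer.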

In [7]:
def evaluate_model(model, dataset):
    model.eval()
    process = psutil.Process(os.getpid())

    # Initialize GPU memory tracker
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Total VRAM used after model is loaded
    model_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).used / (1024 ** 2)  # MB

    # RAM usage before inference
    start_ram = process.memory_info().rss
    start_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).used

    correct = 0
    latencies = []

    with torch.no_grad():
        for sample in dataset:
            inputs = {
                "input_ids": sample["input_ids"].unsqueeze(0),
                "attention_mask": sample["attention_mask"].unsqueeze(0),
            }
            label = sample["label"]

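            # CUDA kernels run asynchronously, so synchronize around the timed call
            torch.cuda.synchronize()
            start_time = time.perf_counter()
            outputs = model(**inputs)
            torch.cuda.synchronize()
            end_time = time.perf_counter()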

            pred = torch.argmax(outputs.logits, dim=1).item()
            correct += (pred == label)
            latencies.append(end_time - start_time)

    # Memory after inference
    end_ram = process.memory_info().rss
    end_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).used
    pynvml.nvmlShutdown()

    # Metrics
    delta_ram = (end_ram - start_ram) / (1024 ** 2)      # in MB
    delta_vram = (end_vram - start_vram) / (1024 ** 2)   # in MB
    avg_latency = np.mean(latencies)
    accuracy = correct / len(dataset)

    return accuracy, avg_latency, delta_ram, delta_vram, model_vram
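The function above reports only the mean latency, which a single slow iteration (often the first one) can skew. As an illustration, here is a small sketch that summarizes a list of per-sample latencies with the mean, median, and a nearest-rank 95th percentile. The helper name `summarize_latencies` is hypothetical and is not part of the notebook.

```python
import math
import statistics


def summarize_latencies(latencies_s):
    """Summarize per-sample latencies (given in seconds) in milliseconds."""
    ms = sorted(x * 1000.0 for x in latencies_s)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(ms)) - 1)
    return {
        "mean_ms": statistics.fmean(ms),
        "median_ms": statistics.median(ms),
        "p95_ms": ms[p95_index],
    }
```

To use it, `evaluate_model` would have to return the `latencies` list as well as its mean; a large gap between mean and median usually points to a few outlier iterations.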

## 4. Run Full-Precision (FP32) Inference on GPU

We now evaluate the full-precision (FP32) DistilBERT model using the GTX 1050 Ti GPU.

This serves as our performance baseline before applying any quantization.

In [None]:
# Load full-precision model to GPU
model_fp32 = AutoModelForSequenceClassification.from_pretrained(model_id).to("cuda")

# Run evaluation
accuracy_fp32, latency_fp32, ram_fp32, vram_delta_fp32, vram_model_fp32 = evaluate_model(model_fp32, dataset)

print(f"Accuracy (FP32): {accuracy_fp32:.2%}")
print(f"Latency per sample (FP32): {latency_fp32:.4f} seconds")
print(f"System RAM usage increase (FP32): {ram_fp32:.2f} MB")
print(f"GPU VRAM usage increase during inference (FP32): {vram_delta_fp32:.2f} MB")
print(f"Total VRAM after model load (FP32): {vram_model_fp32:.2f} MB")
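Note that `model_fp32` stays referenced after this cell, and `torch.cuda.empty_cache()` only returns cached blocks that no live tensor is using. If per-model VRAM readings are wanted, the earlier model has to be dropped first. A minimal sketch, assuming a hypothetical helper named `release`:

```python
import gc


def release(namespace, name):
    """Remove `name` from `namespace` (if present) and run the garbage collector.

    Using pop() instead of `del` keeps the cell safe to re-run.
    """
    namespace.pop(name, None)
    gc.collect()


# Intended use in the notebook (assumption: called before loading the next model):
#   release(globals(), "model_fp32")
#   torch.cuda.empty_cache()
```

The order matters: the reference is dropped and collected first, and only then is the allocator cache emptied.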

## 5. Quantize the Model (8-bit)

We apply 8-bit quantization to the full-precision DistilBERT model using `bitsandbytes`.

This reduces the model's linear layers to 8-bit precision, keeping everything on GPU.
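To make "8-bit precision" concrete, the sketch below shows plain absmax quantization of a weight vector in pure Python. It is a simplified illustration only: `bitsandbytes` uses a more elaborate scheme, and the function names here are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    absmax = max(abs(v) for v in values)
    if absmax == 0:
        return [0] * len(values), 1.0
    scale = 127.0 / absmax
    quantized = [int(round(v * scale)) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the quantized integers."""
    return [q / scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
```

Each restored value differs from the original by at most half a quantization step (`0.5 / scale`), which is why small weights such as `0.003` collapse to zero.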

In [9]:
# Clear GPU cache before loading new model
torch.cuda.empty_cache()

bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# device_map is an argument of from_pretrained, not of BitsAndBytesConfig
model_int8 = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    quantization_config=bnb_config_8bit,
    device_map={"": 0}  # Force all layers onto GPU 0
)

## 6. Evaluate Quantized Model (8-bit)

We now evaluate the 8-bit quantized DistilBERT model using the same procedure as in the FP32 baseline.

If the model fell back to CPU, latency and RAM usage may reflect that.
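Whether the weights actually sit on the GPU can be checked directly rather than inferred from latency. The sketch below counts parameters per device; it assumes only that the model exposes `parameters()` and that each parameter has a `device` attribute, as PyTorch modules do. The helper name `parameter_devices` is hypothetical.

```python
from collections import Counter


def parameter_devices(model):
    """Return a dict mapping device strings (e.g. 'cuda:0') to parameter counts."""
    return dict(Counter(str(p.device) for p in model.parameters()))
```

Calling `parameter_devices(model_int8)` before evaluation shows at a glance whether any parameters were placed on `cpu`. It reports where the weights live, not which kernels run.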

In [10]:
accuracy_int8, latency_int8, ram_int8, vram_delta_int8, vram_model_int8 = evaluate_model(model_int8, dataset)

print(f"Accuracy (INT8): {accuracy_int8:.2%}")
print(f"Latency per sample (INT8): {latency_int8:.4f} seconds")
print(f"System RAM usage increase (INT8): {ram_int8:.2f} MB")
print(f"GPU VRAM usage increase during inference (INT8): {vram_delta_int8:.2f} MB")
print(f"Total VRAM after model load (INT8): {vram_model_int8:.2f} MB")

Accuracy (INT8): 94.00%
Latency per sample (INT8): 0.0984 seconds
System RAM usage increase (INT8): 48.71 MB
GPU VRAM usage increase during inference (INT8): -12.38 MB
Total VRAM after model load (INT8): 1574.64 MB


## 7. Attempt 4-bit Quantized Inference (Expected to Fail or Fallback)

We attempt to load a 4-bit quantized version of DistilBERT using `bitsandbytes`.

This typically fails on GPUs below compute capability 7.5 (the GTX 1050 Ti is 6.1), or falls back to CPU execution.
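The compute-capability threshold mentioned above can be checked before attempting the load. This is a minimal sketch: the 7.5 default is taken from the text, not from the `bitsandbytes` documentation, and both helper names are hypothetical.

```python
def meets_capability(capability, minimum=(7, 5)):
    """Return True if a (major, minor) compute capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


def current_gpu_capability(index=0):
    """Query the (major, minor) compute capability of a CUDA device via PyTorch."""
    import torch  # imported lazily so meets_capability stays dependency-free

    return torch.cuda.get_device_capability(index)
```

On the GTX 1050 Ti, `meets_capability(current_gpu_capability())` would be expected to return `False`, since the card reports capability 6.1.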

In [11]:
# Clear GPU cache before loading 4-bit model
torch.cuda.empty_cache()

try:
    bnb_config_4bit = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16
    )

    # device_map is an argument of from_pretrained, not of BitsAndBytesConfig
    model_4bit = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        quantization_config=bnb_config_4bit,
        device_map={"": 0}  # Force all layers onto GPU 0
    )

    acc_4bit, lat_4bit, ram_4bit, vram_delta_4bit, vram_model_4bit = evaluate_model(model_4bit, dataset)

    print(f"Accuracy (4-bit): {acc_4bit:.2%}")
    print(f"Latency per sample (4-bit): {lat_4bit:.4f} seconds")
    print(f"System RAM usage increase (4-bit): {ram_4bit:.2f} MB")
    print(f"GPU VRAM usage increase during inference (4-bit): {vram_delta_4bit:.2f} MB")
    print(f"Total VRAM after model load (4-bit): {vram_model_4bit:.2f} MB")

except Exception as e:
    print("4-bit quantization failed or was not supported on this GPU.")
    print("Error message:")
    print(e)


Accuracy (4-bit): 93.00%
Latency per sample (4-bit): 0.0113 seconds
System RAM usage increase (4-bit): -4.93 MB
GPU VRAM usage increase during inference (4-bit): -3.00 MB
Total VRAM after model load (4-bit): 1715.14 MB


## 8. Summary: FP32 vs Quantized DistilBERT on GTX 1050 Ti

The following table summarizes the results of all compression modes tested on the GTX 1050 Ti.

Note: Both the 8-bit and 4-bit quantized models loaded successfully but likely fell back to CPU execution, as suggested by higher latency than FP32 (about 10.8× for 8-bit, about 1.2× for 4-bit) and negative VRAM deltas.

In [13]:
df = pd.DataFrame({
    "Metric": [
        "Accuracy (%)",
        "Latency (ms)",
        "RAM Increase (MB)",
        "VRAM Increase (MB)",
        "Total VRAM (MB)"
    ],
    # Use the measured accuracies instead of hard-coded values
    "FP32": [accuracy_fp32 * 100, latency_fp32 * 1000, ram_fp32, vram_delta_fp32, vram_model_fp32],
    "8-bit": [accuracy_int8 * 100, latency_int8 * 1000, ram_int8, vram_delta_int8, vram_model_int8],
    "4-bit": [acc_4bit * 100, lat_4bit * 1000, ram_4bit, vram_delta_4bit, vram_model_4bit],
    "Observation": [
        "Minor drop (4-bit)",
        "INT8/4-bit slower → CPU fallback",
        "INT8 lower than FP32",
        "Negative VRAM deltas → fallback",
        "VRAM usage increased with quantization"
    ]
})

df

|   | Metric | FP32 | 8-bit | 4-bit | Observation |
|---|--------|------|-------|-------|-------------|
| 0 | Accuracy (%) | 94.0 | 94.0 | 93.0 | Minor drop (4-bit) |
| 1 | Latency (ms) | 9.153507 | 98.441241 | 11.274576 | INT8/4-bit slower → CPU fallback |
| 2 | RAM Increase (MB) | 97.398438 | 48.707031 | -4.933594 | INT8 lower than FP32 |
| 3 | VRAM Increase (MB) | 20.125 | -12.375 | -3.0 | Negative VRAM deltas → fallback |
| 4 | Total VRAM (MB) | 1402.535156 | 1574.640625 | 1715.140625 | VRAM usage increased with quantization |
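The latency row is easier to compare as a ratio against the FP32 baseline. The short sketch below does that arithmetic with the values from the table; the function name `slowdown_vs_baseline` is hypothetical.

```python
# Latencies in milliseconds, copied from the summary table.
latency_ms = {"FP32": 9.153507, "8-bit": 98.441241, "4-bit": 11.274576}


def slowdown_vs_baseline(latencies, baseline="FP32"):
    """Return each configuration's latency divided by the baseline latency."""
    base = latencies[baseline]
    return {name: value / base for name, value in latencies.items()}


ratios = slowdown_vs_baseline(latency_ms)
```

On these numbers the 8-bit model is roughly an order of magnitude slower than FP32, while the 4-bit model is only about a quarter slower, so the two quantized runs behave quite differently.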


## 9. Conclusion

Quantization experiments on the GTX 1050 Ti revealed the following:

- **8-bit and 4-bit quantized models successfully loaded**, but both likely fell back to **CPU execution** during inference. This is evidenced by:
  - Increased latency compared to FP32
  - Negative GPU VRAM deltas during inference
  - No actual acceleration over FP32

- **FP32 inference ran cleanly on the 1050 Ti**, with latency around 9.15 ms/sample and modest VRAM/RAM usage.

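- **Total VRAM readings rose with each quantized run** (1402.54 MB → 1574.64 MB → 1715.14 MB). These are device-wide NVML readings, and the notebook never deletes an earlier model before loading the next one, so the totals are cumulative rather than per-model footprints. `bitsandbytes` internal structures may add to them as well.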
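For scale, the weights themselves account for only a small part of the VRAM totals reported above. The sketch below estimates the raw weight footprint at each precision, assuming roughly 66 million parameters (the figure reported for DistilBERT by Sanh et al.; the fine-tuned classifier has slightly more) and assuming every parameter is stored at the stated width, which is not true of `bitsandbytes`, since it quantizes only the linear layers.

```python
APPROX_PARAMS = 66_000_000  # assumption: approximate DistilBERT parameter count


def weight_footprint_mib(num_params, bits_per_param):
    """Raw weight storage in MiB (bytes / 1024**2), ignoring all runtime overhead."""
    return num_params * bits_per_param / 8 / (1024 ** 2)


footprints = {
    "FP32": weight_footprint_mib(APPROX_PARAMS, 32),
    "8-bit": weight_footprint_mib(APPROX_PARAMS, 8),
    "4-bit": weight_footprint_mib(APPROX_PARAMS, 4),
}
```

Under these assumptions the FP32 weights come to roughly 250 MB, far below the 1402.54 MB total measured after model load, so most of the NVML reading is something other than the weights.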

These results highlight that **consumer GPUs like the GTX 1050 Ti are not well-suited for modern quantization workflows**, especially those requiring CUDA compute capability ≥7.5. While models load, true performance benefits are only seen on GPUs like the T4 or newer.

This notebook serves as a realistic benchmark of what to expect from older hardware and reinforces the importance of hardware awareness in LLM deployment planning.