
In [None]:
!pip install datasets



## What is Model Quantization?
Quantization is the process of reducing the numerical precision of model weights and/or activations, typically from 32-bit floating point (FP32) to lower-precision formats:
- 8-bit integer (INT8)
- 16-bit floating point (FP16)
- Even lower precision formats like FP8 or INT4

## Why Quantize?
1. Reduced memory footprint
2. Faster inference, since a smaller model moves less data through memory and low-precision arithmetic can be cheaper on supported hardware
3. Better deployment on edge devices

## What Gets Quantized?
- **Weights**: The parameters learned during training
- **Activations**: The outputs of each layer during inference
- **Operations**: The computations themselves (e.g., matrix multiplications)

In [None]:
# -*- coding: utf-8 -*-
"""PyTorch_Quant_BERT_SST.py"""

import os
import numpy as np
import torch
import time
import psutil
import json
from transformers import BertTokenizer, BertForSequenceClassification
from datasets import load_dataset
import tempfile
from torch.utils.data import DataLoader, TensorDataset

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("PyTorch version:", torch.__version__)

PyTorch version: 2.6.0+cu124


In [None]:
# Load SST-2 dataset from the Hugging Face datasets library
print("Loading SST-2 dataset...")
sst2_dataset = load_dataset("glue", "sst2")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Function to preprocess the dataset
def preprocess_function(examples):
    return tokenizer(examples['sentence'], truncation=True, padding='max_length', max_length=128)

# Preprocess the validation dataset for evaluation
encoded_val_dataset = sst2_dataset['validation'].map(preprocess_function, batched=True)

Loading SST-2 dataset...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
# Convert to PyTorch tensors
val_input_ids = torch.tensor(encoded_val_dataset['input_ids'], dtype=torch.long)
val_attention_mask = torch.tensor(encoded_val_dataset['attention_mask'], dtype=torch.long)
val_token_type_ids = torch.tensor(encoded_val_dataset['token_type_ids'], dtype=torch.long)
val_labels = torch.tensor(encoded_val_dataset['label'], dtype=torch.long)

In [None]:
# Create PyTorch Dataset and DataLoader
val_dataset = TensorDataset(val_input_ids, val_attention_mask, val_token_type_ids, val_labels)
val_dataloader = DataLoader(val_dataset, batch_size=8)  # Using smaller batch size to avoid memory issues

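# Load pre-trained BERT (bert-base-uncased).
# Note: this checkpoint is NOT fine-tuned on SST-2. The classification head is
# randomly initialized, so accuracy on SST-2 is expected to be near chance (~50%).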
print("Loading pre-trained BERT model...")
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.eval()  # Set model to evaluation mode

Loading pre-trained BERT model...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [None]:
# Get model parameter count and size
def count_params(model):
    """Count the total parameters in a model."""
    return sum(p.numel() for p in model.parameters())

def estimate_model_size_mb(model):
    """Estimate the size of a model in MB based on parameter count."""
    total_params = count_params(model)
    # Each parameter is 4 bytes (32 bits) for float32
    size_bytes = total_params * 4
    return size_bytes / (1024 * 1024)
## TODO: A BETTER APPROACH TO FIND THE DATA-TYPE AND THEN COMPUTE THE SIZE. HERE WE ASSUME THE TYPE.

# Measure original model size
original_param_count = count_params(model)
original_model_size_mb = estimate_model_size_mb(model)
print(f"Original BERT model parameter count: {original_param_count:,}")
print(f"Estimated original BERT model size: {original_model_size_mb:.2f} MB")

Original BERT model parameter count: 109,483,778
Estimated original BERT model size: 417.65 MB
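
One way to avoid hard-coding 4 bytes per parameter is to ask each tensor for its own element size. The sketch below is a minimal illustration, assuming a standard `torch.nn.Module`; the helper name `model_size_mb` and the toy `nn.Linear` layer are hypothetical and not part of the original notebook.

```python
import torch
from torch import nn

def model_size_mb(model: nn.Module) -> float:
    """Sum the bytes of every parameter and buffer, using each tensor's own dtype."""
    tensors = list(model.parameters()) + list(model.buffers())
    size_bytes = sum(t.numel() * t.element_size() for t in tensors)
    return size_bytes / (1024 * 1024)

# Toy layer (hypothetical stand-in for BERT): 10*5 weights + 5 biases = 55 parameters
toy = nn.Linear(10, 5)
fp32_mb = model_size_mb(toy)          # 55 values * 4 bytes each
fp16_mb = model_size_mb(toy.half())   # 55 values * 2 bytes each
print(f"FP32: {fp32_mb * 1024 * 1024:.0f} bytes, FP16: {fp16_mb * 1024 * 1024:.0f} bytes")
```

This works for models whose weights are ordinary parameters. Dynamically quantized `Linear` modules keep their weights in packed form rather than as regular parameters, so for those the saved file size is likely the more reliable measure.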


In [None]:
# Create a temporary directory to save models
temp_dir = tempfile.mkdtemp()
original_model_path = os.path.join(temp_dir, "bert_sst2_model.pt")

# Save the original model
print("Saving the model...")
torch.save(model.state_dict(), original_model_path)

# Get saved model size
def get_file_size_mb(file_path):
    """Get the size of a file in MB."""
    return os.path.getsize(file_path) / (1024 * 1024)

original_saved_size_mb = get_file_size_mb(original_model_path)
print(f"Saved model size: {original_saved_size_mb:.2f} MB")

Saving the model...
Saved model size: 417.72 MB


In [None]:
# Measure RAM usage
def get_ram_usage():
    """Get current RAM usage in MB."""
    process = psutil.Process(os.getpid())
    memory_info = process.memory_info()
    return memory_info.rss / (1024 * 1024)

In [None]:
# Function to evaluate model accuracy
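def evaluate_model(model, dataloader, device='cpu', num_samples=50):
    """Return accuracy on roughly the first `num_samples` examples.

    Evaluation stops after the batch in which the running total reaches
    `num_samples`, so slightly more examples may be scored.
    """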
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for batch in dataloader:
            input_ids, attention_mask, token_type_ids, labels = [b.to(device) for b in batch]

            outputs = model(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)

            _, predicted = torch.max(outputs.logits, 1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)

            if total >= num_samples:
                break

    return correct / total

In [None]:
# Function to measure inference time
def measure_inference_time(model, text, tokenizer, device='cpu', num_runs=10):
    inputs = tokenizer(text, return_tensors="pt", padding='max_length', truncation=True, max_length=128)
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)
    token_type_ids = inputs["token_type_ids"].to(device)

    # Warm-up run
    with torch.no_grad():
        _ = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

    # Measure inference time
    inference_times = []
    for _ in range(num_runs):
        start_time = time.time()

        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            _ = torch.argmax(outputs.logits, dim=1)

        inference_time = (time.time() - start_time) * 1000  # Convert to ms
        inference_times.append(inference_time)

    avg_inference_time = sum(inference_times) / len(inference_times)
    return avg_inference_time

In [None]:
# Function to predict sentiment
def predict_sentiment(model, text, tokenizer, device='cpu'):
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", padding='max_length', truncation=True, max_length=128)
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)
    token_type_ids = inputs["token_type_ids"].to(device)

    start_time = time.time()
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    inference_time = (time.time() - start_time) * 1000  # ms

    logits = outputs.logits
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    prediction = torch.argmax(probabilities, dim=1).item()
    confidence = probabilities[0][prediction].item()

    sentiment = "Positive" if prediction == 1 else "Negative"

    return sentiment, confidence, inference_time

In [None]:
# Evaluate baseline model performance
print("\nEvaluating original model...")
device = 'cpu'  # Dynamic quantization in PyTorch runs on CPU, so all models are compared on CPU
print(f"Using device: {device}")

model = model.to(device)
baseline_accuracy = evaluate_model(model, val_dataloader, device)
print(f"Baseline model accuracy: {baseline_accuracy:.4f}")

sample_text = "This movie was fantastic, I really enjoyed it."
baseline_inference_time = measure_inference_time(model, sample_text, tokenizer, device)
print(f"Baseline model average inference time: {baseline_inference_time:.2f} ms")


Evaluating original model...
Using device: cpu
Baseline model accuracy: 0.4821
Baseline model average inference time: 407.98 ms


## Understanding Quantization Formats

### FP32 (standard)
- 32-bit floating point format
- 1 bit: sign
- 8 bits: exponent
- 23 bits: mantissa (fractional part)
- Used as default in most deep learning training

### FP16 (half precision)
- 16-bit floating point format
- 1 bit: sign
- 5 bits: exponent
- 10 bits: mantissa
- Reduces memory usage by 50% compared to FP32
- Works well for models that don't need extreme precision
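
The bit layouts above determine each format's range and precision. As a rough sketch (the function names `max_finite` and `machine_epsilon` are illustrative, and the formulas assume the usual IEEE 754 conventions of an implicit leading 1 and a reserved all-ones exponent), the largest finite value and the spacing near 1.0 can be derived directly from the exponent and mantissa widths:

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-754-style binary format."""
    bias = 2 ** (exponent_bits - 1) - 1   # 127 for FP32, 15 for FP16
    max_exponent = bias                    # the all-ones exponent is reserved for inf/NaN
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent

def machine_epsilon(mantissa_bits: int) -> float:
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits

for name, e_bits, m_bits in [("FP32", 8, 23), ("FP16", 5, 10)]:
    print(f"{name}: max = {max_finite(e_bits, m_bits):.6g}, "
          f"epsilon = {machine_epsilon(m_bits):.3g}")
```

For FP16 this gives a maximum of 65504 and a relative precision of about 0.001, which is why values with a large dynamic range can overflow or lose detail in half precision.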

### INT8 (quantized integer)
- 8-bit integer format
- Reduces memory by 75% compared to FP32
- Requires mapping floating point values to integers
- Needs careful handling of dynamic range
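
To make the percentages concrete, the sketch below estimates the weight storage of this notebook's BERT model in each format. It is an idealized calculation: it assumes every parameter is converted and ignores the small overhead of storing scales and zero points, so real quantized checkpoints are usually somewhat larger. The helper name `footprint_mb` is illustrative.

```python
PARAM_COUNT = 109_483_778  # parameter count reported earlier in this notebook

BITS_PER_VALUE = {"FP32": 32, "FP16": 16, "INT8": 8}

def footprint_mb(num_params: int, bits: int) -> float:
    """Storage in MB if every parameter used `bits` bits."""
    return num_params * bits / 8 / (1024 * 1024)

for fmt, bits in BITS_PER_VALUE.items():
    size = footprint_mb(PARAM_COUNT, bits)
    saving = 1 - bits / 32
    print(f"{fmt}: {size:7.2f} MB ({saving:.0%} smaller than FP32)")
```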

# 1. Dynamic Quantization
------------------------

## What is Dynamic Quantization?

Dynamic quantization is a post-training quantization technique that:
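- Converts weights from FP32 to INT8 (in this example) ahead of time, when `quantize_dynamic` is called
- Stores activations in floating point between layers
- Computes the activation quantization parameters (scale, zero-point) on the fly during inference, from the values actually observed
- Is the simplest form of quantization to apply in PyTorch

### What gets quantized in Dynamic Quantization?
- ✅ WEIGHTS: Converted from FP32 to INT8 (75% smaller for the quantized layers; less for the whole model, because layers that are not quantized, such as embeddings, stay in FP32)
- ❌ ACTIVATIONS: Not quantized ahead of time. They are kept in floating point between layers and quantized on the fly inside each quantized operation
- ✅ OPERATIONS: Linear operations (matrix multiplies) are performed in INT8 arithmetic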

### Key Properties:
- Applied after training (no retraining needed)
- Accuracy impact is usually minimal for NLP models
- Weights are quantized once, ahead of time; the matrix multiplications are done with 8-bit integer arithmetic
- In the PyTorch API used here, supported on CPU only, not on CUDA (GPU)
- Works best for models where weights are the main memory bottleneck


The activations that get quantized on the fly are mainly:
- Intermediate layer outputs
- Input tensors to linear/convolutional layers
- Outputs of activation functions


For the weight quantization:

- Weights = [0.5, -0.9, 1.2]
- Range: min = -0.9, max = 1.2
- Scale = (1.2 - (-0.9)) / 255 ≈ 0.00824
- Zero point = round(-min/scale) = round(0.9/0.00824) ≈ 109

Converting weights to int8:

- For 0.5: round((0.5 - (-0.9))/0.00824) ≈ round(170.0) = 170
- For -0.9: round((-0.9 - (-0.9))/0.00824) ≈ round(0) = 0
- For 1.2: round((1.2 - (-0.9))/0.00824) ≈ round(255) = 255

Quantized weights: [170, 0, 255]

For the activation quantization:

- Input = [2.7, -1.3, 0.8]
- Range: min = -1.3, max = 2.7
- Scale = (2.7 - (-1.3)) / 255 ≈ 0.01569
- Zero point = round(-min/scale) = round(1.3/0.01569) ≈ 83

Converting activations to int8:

- For 2.7: round((2.7 - (-1.3))/0.01569) ≈ round(255) = 255
- For -1.3: round((-1.3 - (-1.3))/0.01569) ≈ round(0) = 0
- For 0.8: round((0.8 - (-1.3))/0.01569) ≈ round(134) = 134

Quantized activations: [255, 0, 134]
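
The arithmetic above can be reproduced with a few lines of plain Python. This is a minimal sketch, not PyTorch's implementation: the function names are illustrative, the target range is assumed to be the unsigned 8-bit range 0 to 255 used in the example, and it assumes the input contains at least two distinct values. It uses the common affine form `q = round(x / scale) + zero_point`, which yields the same integers as the `(x - min) / scale` form for these activation values.

```python
def compute_qparams(values, qmin=0, qmax=255):
    """Scale and zero point that map [min(values), max(values)] onto [qmin, qmax]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)      # assumes hi > lo
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Affine quantization: q = clamp(round(x / scale) + zero_point)."""
    return [max(qmin, min(qmax, round(x / scale) + zero_point)) for x in values]

def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floating point values."""
    return [(q - zero_point) * scale for q in q_values]

activations = [2.7, -1.3, 0.8]
scale, zero_point = compute_qparams(activations)
q_activations = quantize(activations, scale, zero_point)
recovered = dequantize(q_activations, scale, zero_point)

print(f"scale={scale:.5f}, zero_point={zero_point}")
print("quantized:", q_activations)
print("recovered:", [round(v, 3) for v in recovered])
```

The recovered values differ from the originals by less than one quantization step (`scale`), which is the rounding error that quantization introduces.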

The computation would then use these quantized values, applying the scale factors when converting back to floating point for the final result.

This is the essence of dynamic quantization - *the activation scale is calculated per inference based on the actual input values, rather than using a pre-determined fixed scale.*
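
To illustrate the idea, here is a conceptual sketch of a single dot product computed the "dynamic" way: the weight scale is fixed ahead of time, while the input scale is recomputed on every call. It is a simplified pure-Python model under the same assumptions as the worked example (unsigned 8-bit range, per-tensor min/max, at least two distinct values per tensor), and the helper names are hypothetical; real kernels use optimized integer instructions instead.

```python
def qparams(values, qmax=255):
    """Per-tensor scale and zero point for the range [min, max] -> [0, qmax]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax               # assumes hi > lo
    return scale, round(-lo / scale)

def quantize(values, scale, zero_point, qmax=255):
    return [max(0, min(qmax, round(x / scale) + zero_point)) for x in values]

weights = [0.5, -0.9, 1.2]
w_scale, w_zp = qparams(weights)           # computed once, ahead of time
q_weights = quantize(weights, w_scale, w_zp)

def dynamic_dot(inputs):
    """Dot product with the weights, quantizing the inputs on the fly."""
    x_scale, x_zp = qparams(inputs)        # recomputed for every call
    q_inputs = quantize(inputs, x_scale, x_zp)
    # Accumulate in integers, then rescale once to floating point
    int_acc = sum((qw - w_zp) * (qx - x_zp) for qw, qx in zip(q_weights, q_inputs))
    return int_acc * w_scale * x_scale

inputs = [2.7, -1.3, 0.8]
exact = sum(w * x for w, x in zip(weights, inputs))
approx = dynamic_dot(inputs)
print(f"float result: {exact:.4f}, quantized result: {approx:.4f}")
```

The two results agree to within a small rounding error, which is the accuracy cost traded for smaller weights and integer arithmetic.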

In [None]:
# 1. DYNAMIC QUANTIZATION
# ----------------------
print("\n1. Applying dynamic quantization...")
start_time = time.time()

# Apply dynamic quantization to the model
# This is post-training quantization
dynamic_quantized_model = torch.quantization.quantize_dynamic(
    model,  # The model to quantize
    {torch.nn.Linear},  # Specify which modules to quantize (Linear layers in BERT)
    dtype=torch.qint8  # Quantization data type
)

dynamic_quant_time = time.time() - start_time
print(f"Dynamic quantization time: {dynamic_quant_time:.2f} seconds")

# Save the quantized model
dynamic_quantized_path = os.path.join(temp_dir, "bert_sst2_dynamic_quantized.pt")
torch.save(dynamic_quantized_model.state_dict(), dynamic_quantized_path)
dynamic_quantized_size_mb = get_file_size_mb(dynamic_quantized_path)

print(f"Dynamic quantized model size: {dynamic_quantized_size_mb:.2f} MB")
print(f"Size reduction: {(1 - dynamic_quantized_size_mb/original_saved_size_mb) * 100:.2f}%")

# Evaluate dynamic quantized model
dynamic_quantized_model = dynamic_quantized_model.to(device)
dynamic_quantized_accuracy = evaluate_model(dynamic_quantized_model, val_dataloader, device)
print(f"Dynamic quantized model accuracy: {dynamic_quantized_accuracy:.4f}")

dynamic_quantized_inference_time = measure_inference_time(dynamic_quantized_model, sample_text, tokenizer, device)
print(f"Dynamic quantized model average inference time: {dynamic_quantized_inference_time:.2f} ms")
print(f"Inference speedup: {baseline_inference_time / dynamic_quantized_inference_time:.2f}x")


1. Applying dynamic quantization...
Dynamic quantization time: 1.02 seconds
Dynamic quantized model size: 173.09 MB
Size reduction: 58.56%
Dynamic quantized model accuracy: 0.5357
Dynamic quantized model average inference time: 216.08 ms
Inference speedup: 1.89x


## What is Half Precision (FP16)?

Half precision is not strictly quantization but a format conversion that:
- Reduces 32-bit floating point (FP32) to 16-bit floating point (FP16)
- Halves the memory footprint of weights and activations
- Is well-supported on modern GPUs with tensor cores (like NVIDIA's)

### What gets converted in FP16?
- ✅ WEIGHTS: Converted from FP32 to FP16 (50% size reduction)
- ✅ ACTIVATIONS: Also processed in FP16
- ✅ OPERATIONS: Matrix multiplications use FP16 arithmetic

### Key Properties:
- Simple to implement in PyTorch (just one call to .half())
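- Works on both CPU and CUDA, but is usually only faster on GPUs; on CPUs without fast FP16 arithmetic it can be much slower than FP32 (as in the results below)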
- Good compromise between precision and performance
- No retraining required
- Can be combined with mixed precision training for best results
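
The precision cost of FP16 can be seen without any deep learning library. The sketch below uses Python's standard `struct` module, whose `e` format code packs a value as IEEE 754 half precision; the helper name `to_fp16` is illustrative.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest half-precision value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

for value in [1.0, 0.1, 1 / 3, 1234.5678]:
    rounded = to_fp16(value)
    print(f"{value!r:>20} -> {rounded!r:<22} (abs error {abs(value - rounded):.2e})")
```

Values such as 1.0 survive unchanged, while 0.1 becomes 0.0999755859375. Magnitudes beyond the FP16 maximum of 65504 cannot be packed this way and raise `OverflowError`.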

## Mixed Precision

Mixed precision is a technique where:
- Some operations use FP32 (for stability and precision)
- Most operations use FP16 (for speed and memory efficiency)
- The framework (for example, PyTorch's automatic mixed precision via `torch.autocast`) decides which operations run in higher precision


In [None]:
# 2. HALF PRECISION (FP16)
# ----------------------
print("\n2. Converting to half precision (FP16)...")

# Load a fresh model for half precision
half_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
half_model.eval()

# Convert to half precision (time only the conversion, not the model loading)
start_time = time.time()
half_model = half_model.half()  # This converts all floating-point parameters and buffers to float16
fp16_conversion_time = time.time() - start_time

# Save the half precision model
half_precision_path = os.path.join(temp_dir, "bert_sst2_fp16.pt")
torch.save(half_model.state_dict(), half_precision_path)
half_precision_size_mb = get_file_size_mb(half_precision_path)

print(f"Half precision conversion time: {fp16_conversion_time:.2f} seconds")
print(f"Half precision model size: {half_precision_size_mb:.2f} MB")
print(f"Size reduction: {(1 - half_precision_size_mb/original_saved_size_mb) * 100:.2f}%")

# Evaluate half precision model
half_model = half_model.to(device)
half_precision_accuracy = evaluate_model(half_model, val_dataloader, device)
print(f"Half precision model accuracy: {half_precision_accuracy:.4f}")

half_precision_inference_time = measure_inference_time(half_model, sample_text, tokenizer, device)
print(f"Half precision model average inference time: {half_precision_inference_time:.2f} ms")
print(f"Inference speedup: {baseline_inference_time / half_precision_inference_time:.2f}x")


2. Converting to half precision (FP16)...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Half precision conversion time: 0.35 seconds
Half precision model size: 208.90 MB
Size reduction: 49.99%
Half precision model accuracy: 0.5000
Half precision model average inference time: 1975.05 ms
Inference speedup: 0.21x


In [None]:
# Create summary table
print("\n===== SUMMARY =====")
print("Model                   | Size (MB) | Size Reduction | Inference Time (ms) | Accuracy")
print("------------------------|-----------|---------------|---------------------|----------")
print(f"Original BERT           | {original_saved_size_mb:.2f}    | -             | {baseline_inference_time:.2f}              | {baseline_accuracy:.4f}")
print(f"Dynamic Quantized       | {dynamic_quantized_size_mb:.2f}    | {(1 - dynamic_quantized_size_mb/original_saved_size_mb) * 100:.2f}%          | {dynamic_quantized_inference_time:.2f}              | {dynamic_quantized_accuracy:.4f}")
print(f"Half Precision (FP16)   | {half_precision_size_mb:.2f}    | {(1 - half_precision_size_mb/original_saved_size_mb) * 100:.2f}%          | {half_precision_inference_time:.2f}              | {half_precision_accuracy:.4f}")



===== SUMMARY =====
Model                   | Size (MB) | Size Reduction | Inference Time (ms) | Accuracy
------------------------|-----------|---------------|---------------------|----------
Original BERT           | 417.72    | -             | 407.98              | 0.4821
Dynamic Quantized       | 173.09    | 58.56%          | 216.08              | 0.5357
Half Precision (FP16)   | 208.90    | 49.99%          | 1975.05              | 0.5000


In [None]:
# Save comparison results to JSON
comparison_results = {
    "original_model": {
        "parameter_count": int(original_param_count),
        "size_mb": float(original_saved_size_mb),
        "inference_time_ms": float(baseline_inference_time),
        "accuracy": float(baseline_accuracy)
    },
    "dynamic_quantized": {
        "size_mb": float(dynamic_quantized_size_mb),
        "size_reduction_percent": float((1 - dynamic_quantized_size_mb/original_saved_size_mb) * 100),
        "inference_time_ms": float(dynamic_quantized_inference_time),
        "accuracy": float(dynamic_quantized_accuracy)
    },
    "half_precision": {
        "size_mb": float(half_precision_size_mb),
        "size_reduction_percent": float((1 - half_precision_size_mb/original_saved_size_mb) * 100),
        "inference_time_ms": float(half_precision_inference_time),
        "accuracy": float(half_precision_accuracy)
    }
}

In [None]:
comparison_results_path = os.path.join(temp_dir, "pytorch_quantization_comparison_results.json")
with open(comparison_results_path, 'w') as f:
    json.dump(comparison_results, f, indent=2)

print(f"\nComparison results saved to: {comparison_results_path}")
print(f"All models and artifacts saved to: {temp_dir}")

# Test models on example sentences
test_sentences = [
    "I really enjoyed this movie, it was fantastic!",
    "The restaurant was terrible and the service was slow.",
    "The book was neither good nor bad, just average."
]

print("\nTesting quantized models on example sentences:")


Comparison results saved to: /tmp/tmpzqcektnl/pytorch_quantization_comparison_results.json
All models and artifacts saved to: /tmp/tmpzqcektnl

Testing quantized models on example sentences:


In [None]:
print("\n1. Dynamic Quantized Model:")
for sentence in test_sentences:
    sentiment, confidence, inference_time = predict_sentiment(dynamic_quantized_model, sentence, tokenizer, device)
    print(f"Text: '{sentence}'")
    print(f"Prediction: {sentiment} (confidence: {confidence:.4f})")
    print(f"Inference time: {inference_time:.2f} ms")
    print("-" * 50)

print("\n2. Half Precision Model:")
for sentence in test_sentences:
    sentiment, confidence, inference_time = predict_sentiment(half_model, sentence, tokenizer, device)
    print(f"Text: '{sentence}'")
    print(f"Prediction: {sentiment} (confidence: {confidence:.4f})")
    print(f"Inference time: {inference_time:.2f} ms")
    print("-" * 50)

print("\nDone!")


1. Dynamic Quantized Model:
Text: 'I really enjoyed this movie, it was fantastic!'
Prediction: Positive (confidence: 0.5008)
Inference time: 354.82 ms
--------------------------------------------------
Text: 'The restaurant was terrible and the service was slow.'
Prediction: Negative (confidence: 0.5028)
Inference time: 221.48 ms
--------------------------------------------------
Text: 'The book was neither good nor bad, just average.'
Prediction: Negative (confidence: 0.5071)
Inference time: 214.73 ms
--------------------------------------------------

2. Half Precision Model:
Text: 'I really enjoyed this movie, it was fantastic!'
Prediction: Negative (confidence: 0.6641)
Inference time: 1798.91 ms
--------------------------------------------------
Text: 'The restaurant was terrible and the service was slow.'
Prediction: Negative (confidence: 0.6802)
Inference time: 1754.50 ms
--------------------------------------------------
Text: 'The book was neither good nor bad, just average.'


# Other Quantization Techniques (Not Implemented)

## FP8 Quantization
- Emerging format for transformer models
- Used in some NVIDIA GPUs (like H100)
- Reduces precision to 8-bit floating point
- Can achieve up to 4x memory reduction compared to FP32

## INT4 Quantization
- Ultra-low precision for weights only
- Often used for inference-only scenarios
- Requires careful handling of quantization parameters
- Increasingly popular for LLMs in memory-constrained environments

## Mixed-Precision Quantization
- Different components of the model use different precisions
- Example: attention layers in FP16, feed-forward networks in INT8
- Can be optimized based on sensitivity analysis
- Often yields the best accuracy/performance trade-off

## Weight-Only Quantization
- Only quantizes weight matrices, not activations
- Common in transformer models where activation quantization causes accuracy issues
- Typically gives less speedup than quantizing both weights and activations, but usually preserves accuracy better
- Used by techniques like GPTQ and AWQ for LLMs

## References
- [PyTorch Implementation](https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html)