# Exploring Model Compression Techniques with Bitsandbytes

Welcome to this notebook on model compression! In this notebook, we’ll explore state-of-the-art model compression techniques using bitsandbytes and Hugging Face's transformers library. We’ll apply different quantization methods, observe model size reduction, and evaluate each model's effectiveness. Compression techniques like quantization and pruning help reduce memory usage and improve inference speed, especially important for deploying large models on resource-limited devices.


In [2]:
!pip install bitsandbytes
#!pip install huggingface transformers "bitsandbytes>=0.39.0" accelerate datasets torch==2.5.0 --index-url https://download.pytorch.org/whl/cu121
# If on Google Colab, restart the runtime after the install so that the new libraries are loaded
# If on Windows
# !pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl

Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.1


## Model Setup

Let's start by loading a simple model to work with. We'll use the BERT model from Hugging Face's Transformers library, which is commonly used for natural language processing tasks. We’ll focus on compressing this model using various techniques.


In [None]:
from transformers import AutoTokenizer, BitsAndBytesConfig, AutoModelForMaskedLM
import torch

# Load a masked language model (MLM) and its tokenizer
model_name = "bert-base-cased"  # or another suitable MLM model
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Quantization with bitsandbytes

Quantization reduces the model’s weight precision, allowing us to store weights in a lower-bit format (e.g., 8-bit, 4-bit) rather than the standard 32-bit float. This can significantly reduce the model size and improve inference speed without severely impacting model performance.

Using `bitsandbytes`, we can easily apply quantization methods like 8-bit and 4-bit quantization. Here, we’ll explore different formats and evaluate the impact on model size and accuracy.


### Default format - 32-bit floating point (FP32)
FP32, or 32-bit floating point, is a standard format for representing real numbers in deep learning models. It uses 32 bits split into three components:
- **Sign bit**: 1 bit
- **Exponent**: 8 bits
- **Mantissa (fraction)**: 23 bits

This configuration provides a high dynamic range and precision, making it suitable for training large models, though it requires more memory and computational power.

**Example**: Representing the number 5.25 in FP32:
- **Binary representation**: `0100 0000 1010 1000 0000 0000 0000 0000`
- **Explanation**:
  - Sign bit = `0` (positive)
  - Exponent = `10000001` (biased exponent of 129, or 2^2)
  - Mantissa = `01010000000000000000000` (representing 1.3125 in normalized format)
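
The field breakdown above can be checked directly. The following is a minimal sketch using only Python's standard `struct` module; the helper name `fp32_fields` is made up for this illustration.

```python
import struct

def fp32_fields(x):
    """Split a float into its IEEE 754 single-precision sign, exponent, and mantissa bits."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]

sign, exponent, mantissa = fp32_fields(5.25)
print(sign, exponent, mantissa)

# Rebuild the (positive) value: (1 + fraction) * 2^(biased exponent - 127)
value = (1 + int(mantissa, 2) / 2**23) * 2 ** (int(exponent, 2) - 127)
print(value)
```

Rebuilding the number from its fields is a quick way to confirm that the exponent bias (127 for FP32) and the implicit leading 1 of the mantissa have been applied correctly.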


### Applying FP16 Quantization

FP16, or 16-bit floating point, reduces the bit-width to 16 bits:
- **Sign bit**: 1 bit
- **Exponent**: 5 bits
- **Mantissa (fraction)**: 10 bits

This format halves the memory requirements compared to FP32, with enough precision for many deep learning tasks. It’s commonly used in mixed-precision training.

**Example**: Representing the number 5.25 in FP16:
- **Binary representation**: `0100 0101 0100 0000`
- **Explanation**:
  - Sign bit = `0` (positive)
  - Exponent = `10001` (biased exponent of 17; with a bias of 15 this gives 2^2)
  - Mantissa = `0101000000` (representing 1.3125 in normalized format)
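
To see what the shorter mantissa costs, the sketch below rounds values to FP16 and back using the standard `struct` module (format code `e` is IEEE 754 half precision). The helper name `to_fp16_and_back` is hypothetical and chosen only for this example.

```python
import struct

def to_fp16_and_back(x):
    """Round x to the nearest half-precision value and return it with its 16 bits."""
    packed = struct.pack(">e", x)             # "e" is the IEEE 754 half-precision format
    (raw,) = struct.unpack(">H", packed)      # the same 2 bytes viewed as an unsigned integer
    (rounded,) = struct.unpack(">e", packed)  # the value FP16 actually stores
    return rounded, format(raw, "016b")

for x in (5.25, 0.1):
    rounded, bits = to_fp16_and_back(x)
    print(f"{x} -> {rounded} (bits {bits}, error {abs(rounded - x):.2e})")
```

A value like 5.25 needs only a few mantissa bits and survives unchanged, while 0.1 has no short binary expansion and picks up a small rounding error.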
  

In [None]:
# Load a model with FP16 precision
quantized_model_fp16 = AutoModelForMaskedLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
)

### Applying 8-Bit Quantization
INT8 uses 8 bits in an integer format, typically signed:
- **Sign bit**: 1 bit
- **Value bits**: 7 bits

Values range from -128 to 127, or from -127 to 127, depending on the scheme. INT8 significantly reduces model size and inference time, though the precision loss may impact model accuracy.

**Example**: Representing the number 5.25 in INT8 (rounded to the nearest integer; the rescaling step that real quantization schemes apply first is omitted here):
- **Binary representation**: `0000 0101`
- **Explanation**:
  - Sign bit = `0` (positive)
  - Value = `0000101` (represents 5)
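
In practice, weights are rescaled before rounding so that the available integer range is fully used. The following is a simplified sketch of absmax quantization in plain Python; it assumes a single scale for the whole list with at least one nonzero weight, and it is not the exact kernel that bitsandbytes runs. The function names are made up for this illustration.

```python
def absmax_quantize(values):
    """Quantize floats to signed 8-bit integers with a single absmax scale."""
    scale = 127 / max(abs(v) for v in values)  # maps the largest magnitude to +/-127
    quantized = [round(v * scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integers back to approximate floats."""
    return [q / scale for q in quantized]

weights = [5.25, -2.0, 0.5]
q, scale = absmax_quantize(weights)
restored = dequantize(q, scale)
print(q)
print(restored)
```

The largest weight is recovered almost exactly, while the others land on the nearest representable step, which is where the quantization error comes from.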

In [None]:
# Define the quantization configuration for 8-bit
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Load an 8-bit quantized version of the model
quantized_model_8bit = AutoModelForMaskedLM.from_pretrained(
    model_name,
    quantization_config=quantization_config
)

### Applying 4-Bit Quantization
INT4 compresses data further by using only 4 bits:
- **Sign bit**: 1 bit
- **Value bits**: 3 bits

With values ranging from -8 to 7, INT4 provides high memory savings and fast computations. However, it has limited precision and can lead to quantization errors.

**Example**: Representing the number 5.25 in INT4:
- **Binary representation**: `0101`
- **Explanation**:
  - Sign bit = `0` (positive)
  - Value = `101` (represents 5)
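
Most hardware has no native 4-bit storage type, so 4-bit values are usually packed two per byte. The sketch below shows one possible packing for signed 4-bit integers; the layout (first value in the high nibble) and the helper names are assumptions for illustration, not a description of the bitsandbytes storage format.

```python
def pack_int4(high, low):
    """Pack two signed 4-bit integers (each in -8..7) into one byte."""
    for v in (high, low):
        if not -8 <= v <= 7:
            raise ValueError(f"{v} does not fit in a signed 4-bit integer")
    # Keep the low 4 bits of each value (two's complement) and combine them.
    return ((high & 0xF) << 4) | (low & 0xF)

def unpack_int4(byte):
    """Recover the two signed 4-bit integers from a packed byte."""
    def sign_extend(nibble):
        return nibble - 16 if nibble >= 8 else nibble
    return sign_extend(byte >> 4), sign_extend(byte & 0xF)

packed = pack_int4(5, -3)
print(format(packed, "08b"))
print(unpack_int4(packed))
```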

In [None]:
# Define the quantization configuration for 4-bit
# (when bnb_4bit_quant_type is not set, bitsandbytes uses its default 4-bit type, "fp4")
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Load a 4-bit quantized version of the model
quantized_model_4bit = AutoModelForMaskedLM.from_pretrained(
    model_name,
    quantization_config=quantization_config
)

### Applying NF4 Quantization

NF4 (4-bit NormalFloat) is a 4-bit data type optimized for normally distributed data. Unlike FP32 and FP16, it is not split into sign, exponent, and mantissa fields:
- **Code**: 4 bits, used as an index into a fixed table of 16 values in the range [-1, 1]
- **Table values**: derived from quantiles of a standard normal distribution, so they are densest near zero
- **Scale**: one absolute-maximum (absmax) value stored per block of weights

Because trained weights are typically centered around zero, placing more levels there tends to preserve model accuracy better than evenly spaced 4-bit integers.

**Example**: Representing the number 5.25 in NF4 (assuming it is the largest-magnitude weight in its block):
- **Binary representation**: `1111`
- **Explanation**:
  - Block scale (absmax) = 5.25
  - Normalized value = 5.25 / 5.25 = 1.0
  - Code = `1111` (index 15, the table entry equal to 1.0)

A more detailed explanation can be found [here](https://www.youtube.com/watch?v=TPcXVJ1VSRI&t=563s).
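
The idea behind NF4, storing each weight as a 4-bit index into a table of levels that are densest near zero, can be sketched in a few lines. Note the assumptions: the 16 levels below are built from evenly spaced quantiles of a standard normal distribution purely for illustration and are not the exact table shipped with bitsandbytes, and the function names are hypothetical.

```python
from statistics import NormalDist

# Illustrative 16-level codebook: evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].
# Assumption: this is NOT the exact NF4 table used by bitsandbytes.
_quantiles = [NormalDist().inv_cdf((i + 0.5) / 16) for i in range(16)]
LEVELS = [q / _quantiles[-1] for q in _quantiles]

def quantize_block(block):
    """Return 4-bit codes (0..15) and the absmax scale for one block of weights."""
    absmax = max(abs(w) for w in block)
    # For each weight, pick the index of the closest level after normalizing by absmax.
    codes = [min(range(16), key=lambda i: abs(LEVELS[i] - w / absmax)) for w in block]
    return codes, absmax

def dequantize_block(codes, absmax):
    """Look up each code in the table and undo the normalization."""
    return [LEVELS[c] * absmax for c in codes]

block = [5.25, -2.0, 0.5, 0.1]
codes, absmax = quantize_block(block)
print([format(c, "04b") for c in codes])
print(dequantize_block(codes, absmax))
```

Only the 4-bit codes and one scale per block need to be stored; the table itself is shared by every block.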


In [None]:
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/TPcXVJ1VSRI?si=viO6F-ni-_B1SEyH" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

In [None]:
# Load a 4-bit NF4 quantized version of the model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',  # Specify NF4 quantization
)
quantized_model_nf4 = AutoModelForMaskedLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)


### Comparing model sizes and effectiveness

In [None]:
original_size = model.get_memory_footprint() / (1024 * 1024)
quantized_8bit_size = quantized_model_8bit.get_memory_footprint() / (1024 * 1024)
quantized_4bit_size = quantized_model_4bit.get_memory_footprint() / (1024 * 1024)
quantized_fp16_size = quantized_model_fp16.get_memory_footprint() / (1024 * 1024)
quantized_nf4_size = quantized_model_nf4.get_memory_footprint() / (1024 * 1024)

print(f"Original Model Size (32-bit): {original_size:.2f} MB")
print(f"8-Bit Quantized Model Size: {quantized_8bit_size:.2f} MB")
print(f"4-Bit Quantized Model Size: {quantized_4bit_size:.2f} MB")
print(f"FP16 Quantized Model Size: {quantized_fp16_size:.2f} MB")
print(f"NF4 Quantized Model Size: {quantized_nf4_size:.2f} MB")


Theoretically, reducing the bit precision of model parameters should lead to a proportional decrease in model size. For example:

- 32-bit to 8-bit: A 4x reduction in size.
- 32-bit to 4-bit: An 8x reduction in size.
- 32-bit to FP16 (16-bit): A 2x reduction in size.

In practice, however, the size reductions achieved are rarely exact multiples of the theoretical values, as seen in the models above. Part of the discrepancy comes from overhead introduced by the quantization schemes, which store extra metadata alongside the quantized weights: scaling factors (such as the alphas in AbsMax) that map high-precision values to lower precision, plus any other data needed to dequantize. In addition, bitsandbytes quantizes only the linear layers, so other parameters such as the embeddings keep their original precision.
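
A rough back-of-the-envelope estimate shows how per-block scaling factors eat into the theoretical savings. The numbers below are assumptions chosen for illustration: a hypothetical parameter count, one 32-bit scale per block, and a block size of 64 weights.

```python
def estimated_size_mb(num_params, bits_per_weight, block_size=None, scale_bits=32):
    """Estimate storage for the weights plus one scale factor per block (if any)."""
    total_bits = num_params * bits_per_weight
    if block_size is not None:
        num_blocks = -(-num_params // block_size)  # ceiling division
        total_bits += num_blocks * scale_bits
    return total_bits / 8 / (1024 * 1024)

num_params = 100_000_000  # hypothetical parameter count, not BERT's actual size
fp32 = estimated_size_mb(num_params, 32)
int4_ideal = estimated_size_mb(num_params, 4)
int4_blocked = estimated_size_mb(num_params, 4, block_size=64)

print(f"FP32:                  {fp32:.1f} MB")
print(f"4-bit, no overhead:    {int4_ideal:.1f} MB ({fp32 / int4_ideal:.2f}x smaller)")
print(f"4-bit, scale per block: {int4_blocked:.1f} MB ({fp32 / int4_blocked:.2f}x smaller)")
```

Under these assumptions each weight effectively costs 4.5 bits rather than 4, so the reduction falls short of 8x even before counting any layers left unquantized.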

In [None]:
from datasets import load_dataset

# Load a dataset suitable for MLM evaluation
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test[:10%]")  # Load a subset for quicker evaluation

import time

def evaluate_model_efficiency(model, dataset, tokenizer):
    """Evaluates the model on a masked language modeling task and measures inference time."""
    model.eval()
    device = model.device

    total_loss = 0.0
    num_batches = 0
    start_time = time.time()

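    for i, example in enumerate(dataset):
        # Tokenize with special tokens for MLM
        inputs = tokenizer(
            example["text"],
            return_tensors="pt",
            truncation=True,
            padding="max_length",
            max_length=128,
            return_special_tokens_mask=True,
        ).to(device)

        # Mask ~15% of the real tokens (never [CLS], [SEP], or padding)
        special_tokens_mask = inputs.pop("special_tokens_mask").bool()
        labels = inputs["input_ids"].clone()
        mask = (torch.rand(labels.shape, device=device) < 0.15) & ~special_tokens_mask
        if not mask.any():
            continue  # No token to predict (e.g., an empty line); the loss would be NaN
        labels[~mask] = -100  # Only compute loss for masked tokens
        # Hide the selected tokens from the model by replacing them with [MASK]
        inputs["input_ids"] = inputs["input_ids"].masked_fill(mask, tokenizer.mask_token_id)

        with torch.no_grad():
            outputs = model(**inputs, labels=labels)
            total_loss += outputs.loss.item()

        num_batches += 1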

    # Calculate average loss and inference time
    avg_loss = total_loss / num_batches
    perplexity = torch.exp(torch.tensor(avg_loss))
    avg_inference_time = (time.time() - start_time) / num_batches

    return perplexity.item(), avg_inference_time

# Original Model
original_perplexity, original_time = evaluate_model_efficiency(model, dataset, tokenizer)
print(f"Original Model - Perplexity: {original_perplexity:.2f}, Avg Inference Time: {original_time:.4f} s")

# Quantized Models
# The FP16 model is evaluated on only the first 10 examples for a quicker run,
# so its numbers are not directly comparable with the full-dataset results
quantized_16bit_perplexity, quantized_16bit_time = evaluate_model_efficiency(quantized_model_fp16, dataset.select(range(10)), tokenizer)
print(f"FP16 Quantized Model - Perplexity: {quantized_16bit_perplexity:.2f}, Avg Inference Time: {quantized_16bit_time:.4f} s")

quantized_8bit_perplexity, quantized_8bit_time = evaluate_model_efficiency(quantized_model_8bit, dataset, tokenizer)
print(f"8-Bit Quantized Model - Perplexity: {quantized_8bit_perplexity:.2f}, Avg Inference Time: {quantized_8bit_time:.4f} s")

quantized_4bit_perplexity, quantized_4bit_time = evaluate_model_efficiency(quantized_model_4bit, dataset, tokenizer)
print(f"4-Bit Quantized Model - Perplexity: {quantized_4bit_perplexity:.2f}, Avg Inference Time: {quantized_4bit_time:.4f} s")

quantized_nf4_perplexity, quantized_nf4_time = evaluate_model_efficiency(quantized_model_nf4, dataset, tokenizer)
print(f"NF4 Quantized Model - Perplexity: {quantized_nf4_perplexity:.2f}, Avg Inference Time: {quantized_nf4_time:.4f} s")
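
The perplexity reported by the evaluation above is the exponential of the average cross-entropy loss. The following minimal sketch, with made-up loss values, shows the relationship on its own.

```python
import math

def perplexity(losses):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(sum(losses) / len(losses))

# Made-up per-example losses, for illustration only.
print(perplexity([2.0, 2.5, 3.0]))

# A model that guesses uniformly among 4 tokens has loss ln(4), so its perplexity is 4.
print(perplexity([math.log(4)] * 3))
```

Lower is better: a perplexity of k roughly means the model is as uncertain as if it were choosing uniformly among k tokens at each masked position.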


- Why does quantizing a model sometimes impact its accuracy?
>>> Write your answer here

- What is the primary trade-off when using a lower bit-width, such as 4-bit or 8-bit quantization, instead of the standard 32-bit?
>>> Write your answer here

- What is the difference between INT8 quantization and FP16 quantization?
>>> Write your answer here

- Which quantization format would you choose based on your results, and why? Does your result match your expectations?
>>> Write your answer here


## Model Pruning

Pruning is a technique to reduce the size of a neural network model by removing parts of the model that have minimal impact on its performance. Common pruning methods include:
- **Unstructured Pruning**: Removes individual weights based on a specified criterion, such as the smallest absolute weights. This results in a sparse network.
- **Structured Pruning**: Removes entire structures, like neurons or channels, making the model smaller in a way that's compatible with hardware acceleration.

Pruning can reduce both the memory and computational requirements of a model, but it may come at the cost of reduced accuracy. In this section, we’ll explore the effects of unstructured pruning on model size and performance. We'll use magnitude pruning, also called L1-norm pruning.
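
Magnitude pruning itself is simple: rank the weights by absolute value and set the smallest ones to zero. Here is a minimal sketch on a plain Python list; the helper `magnitude_prune` is hypothetical and only illustrates the selection rule.

```python
def magnitude_prune(weights, amount):
    """Zero out the fraction `amount` of weights with the smallest absolute value."""
    num_to_prune = round(amount * len(weights))
    # Indices of the weights, sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:num_to_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]

weights = [0.5, -0.1, 0.8, 0.05, -0.3]
print(magnitude_prune(weights, amount=0.4))
```

With `amount=0.4`, the two smallest-magnitude weights (0.05 and -0.1) are zeroed and the rest are left untouched.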

In [None]:
import torch.nn.utils.prune as prune
import copy

# Define a function for unstructured (magnitude) pruning
def apply_unstructured_pruning(model, amount=0.2):
    """
    Applies unstructured pruning to the model's linear layers.

    Args:
        model: The model to prune.
        amount: The proportion of weights to prune (default is 0.2).

    Returns:
        The unstructured-pruned model.
    """
    pruned_model = copy.deepcopy(model)
    for name, module in pruned_model.named_modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
    return pruned_model
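
It can help to look at what `prune.l1_unstructured` does to a single layer before applying it to the whole model. The sketch below uses a small stand-in `torch.nn.Linear` layer (an assumption for illustration, not a layer taken from BERT) and inspects the tensors that pruning adds.

```python
import torch
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = torch.nn.Linear(10, 10)  # small stand-in layer, not part of BERT

prune.l1_unstructured(layer, name="weight", amount=0.2)

# Pruning reparametrizes the layer: the original values and a 0/1 mask are both kept.
param_names = [n for n, _ in layer.named_parameters()]
buffer_names = [n for n, _ in layer.named_buffers()]
sparsity = (layer.weight == 0).float().mean().item()
print(param_names, buffer_names, sparsity)

# prune.remove bakes the mask into `weight` and drops the extra tensors.
prune.remove(layer, "weight")
names_after = [n for n, _ in layer.named_parameters()]
print(names_after)
```

Keep this reparametrization in mind when comparing the memory footprints in the next section.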

In [None]:
# Set device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the device
model = model.to(device)

# Apply different levels of unstructured pruning
unstructured_pruned_20 = apply_unstructured_pruning(model, amount=0.2)
unstructured_pruned_40 = apply_unstructured_pruning(model, amount=0.4)


### Comparing model sizes and effectiveness

In [None]:
# Calculate model sizes
original_size = model.get_memory_footprint() / (1024 * 1024)
unstructured_20_size = unstructured_pruned_20.get_memory_footprint() / (1024 * 1024)
unstructured_40_size = unstructured_pruned_40.get_memory_footprint() / (1024 * 1024)

# Print model sizes
print(f"Original Model Size: {original_size:.2f} MB")
print(f"Unstructured Pruned Model (20%): {unstructured_20_size:.2f} MB")
print(f"Unstructured Pruned Model (40%): {unstructured_40_size:.2f} MB")


In [None]:
# Original Model
print(f"Original Model - Perplexity: {original_perplexity:.2f}, Avg Inference Time: {original_time:.4f} s")

# Pruned Models
unstructured_20_perplexity, unstructured_20_time = evaluate_model_efficiency(unstructured_pruned_20, dataset, tokenizer)
print(f"Unstructured pruned 20% - Perplexity: {unstructured_20_perplexity:.2f}, Avg Inference Time: {unstructured_20_time:.4f} s")

unstructured_40_perplexity, unstructured_40_time = evaluate_model_efficiency(unstructured_pruned_40, dataset, tokenizer)
print(f"Unstructured pruned 40% - Perplexity: {unstructured_40_perplexity:.2f}, Avg Inference Time: {unstructured_40_time:.4f} s")

- How does pruning impact model size compared to the original model? Do the results meet your expectations?
>>> Write your answer here

- Which technique—pruning or quantization—resulted in a greater reduction in model size? Why might this be the case?
>>> Write your answer here

- In terms of perplexity, did pruning or quantization show a greater impact on model accuracy? Why?
>>> Write your answer here

- If you were looking to optimize both model size and inference time without sacrificing too much accuracy, would you choose pruning or quantization based on your results?
>>> Write your answer here