# 🧪 Lab 2: Hardware–Software Model Co-Design via Post-Training Quantization & Bit-Width Search

### 📚 Introduction

In **Lab 1**, you saw that compressing transformer weights down to **2 bits** shrank the model by 16× with only a modest accuracy drop. But compression alone is a *software-centric* solution; actual deployment only succeeds when the model cooperates with the underlying silicon.

In this lab, you’ll adopt a **hardware–software co-design** perspective, treating quantization as the critical interface between the network and its deployment hardware. Quantization affects both **model accuracy** and **execution efficiency**, making it the ideal lever for co-design.

Specifically, you will:

- **Wrap every `nn.Linear` in a quantized integer-only `QLinear` module**  
- **Post-quantize both weights and activations layerwise**, selecting precision from **8 → 2 bits**  
- **Measure performance of each quantization configuration** using **validation loss (cross-entropy)** and **memory consumption**  
- **Perform automated layerwise bit-width search** to optimize a hardware-aware objective function

> **Why co-design matters:**  
> A model that looks efficient in software may still bottleneck on real hardware due to memory access patterns, compute throughput, or unsupported bit-widths. Hardware–software co-design ensures the model structure aligns with hardware constraints, enabling deployment that is both **accurate** and **efficient** on edge devices.

---

### 🎯 Lab Objectives

1. **Implement `QLinear`**, a simulated integer GEMM layer with scale-based (symmetric, zero-offset) dequantization, compatible with PyTorch CPU kernels  
2. **Post-quantize a pretrained model checkpoint** using per-layer {8, 4, 2}-bit precision, and export metadata for downstream hardware cost modeling  
3. **Profile** model size and **validation loss (cross-entropy)** of each quantized model  
4. **Run a non-linear optimization algorithm** to identify a per-layer quantization configuration that minimizes a joint objective (accuracy vs. efficiency)

---

By the end, you'll produce a deployment-ready language model with a **per-layer optimal quantization configuration**—striking the best trade-off between hardware efficiency and model fidelity.



In [25]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from functools import partial
from typing import Dict, Any, Tuple

import pytorch_lightning as pl
from hyperopt import fmin, tpe, hp, Trials, space_eval, STATUS_OK

from src.lab1.shakespeare_trainer import ShakespeareModule



 ## 1️⃣ Building `QLinear`

 **Symmetric uniform quantization** maps a float tensor to signed integers
 in the range [ −(2ᵇ⁻¹−1), …, + (2ᵇ⁻¹−1) ] with a single scale factor **s**.

 Forward pass outline:

 1. **Quantize** incoming activations to ints.
 2. **Integer GEMM** with pre-quantized weights.
 3. **De-quantize** the accumulator by multiplying with the two scales.
 4. Add bias (still Floating Point).
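
The four steps above can be sketched on a toy example. This is a minimal illustration, not the lab's `QLinear`: the helper name `sym_quant` and the tensor shapes are assumptions made here, and the "integer" values are held in float tensors so that ordinary PyTorch CPU matmul can be used.

```python
import torch

def sym_quant(x: torch.Tensor, bits: int):
    """Symmetric uniform quantization with a single per-tensor scale (illustrative helper)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return q, scale

torch.manual_seed(0)
x = torch.randn(4, 16)   # activations (hypothetical shape)
w = torch.randn(8, 16)   # weights, laid out [out_features, in_features]
b = torch.randn(8)       # bias stays in floating point

qx, sx = sym_quant(x, 8)         # 1. quantize activations
qw, sw = sym_quant(w, 8)         #    weights are quantized once, ahead of time
acc = qx @ qw.t()                # 2. GEMM on integer-valued tensors
y_q = acc * sx * sw + b          # 3. de-quantize with both scales, 4. add bias

y_fp = x @ w.t() + b             # full-precision reference
rel_err = ((y_q - y_fp).norm() / y_fp.norm()).item()
print(f"relative error at 8 bits: {rel_err:.4f}")
```

Because both scales are scalars, they factor out of the matrix product, which is why a single multiplication after the GEMM is enough to recover an approximation of the float result.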

 The class below is written for clarity rather than raw speed.


In [26]:
class QLinear(nn.Module):
    """
    A fully-connected layer with symmetric uniform quantization for weights and activations.
    """
    def __init__(
        self,
        in_features: int,
        out_features: int,
        bias: bool = True,
        weight_bitwidth: int = 8,
        act_bitwidth: int = 8,
    ) -> None:
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight_bitwidth = weight_bitwidth
        self.act_bitwidth = act_bitwidth

        # Buffers to hold quantized weight and quantization scale
        self.register_buffer(
            "qweight",
            torch.zeros(out_features, in_features, dtype=torch.float32),
        )
        self.register_buffer("weight_scale", torch.ones(1))

        # Optional bias stored in float32
        if bias:
            self.register_buffer("bias", torch.zeros(out_features, dtype=torch.float32))
        else:
            self.bias = None

    @staticmethod
    def _quantize_tensor(
        x: torch.Tensor, bitwidth: int
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Quantize a tensor to signed integers in [-(2^(b-1)-1), 2^(b-1)-1].
        Returns (quantized_tensor, scale).
        """
        qmax = 2 ** (bitwidth - 1) - 1
        rmax = x.abs().max()
        scale = rmax / qmax if rmax > 0 else torch.tensor(1.0, device=x.device)
        q = torch.clamp(torch.round(x / scale), -qmax, qmax)
        return q, scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1. Quantize activations
        qx, act_scale = self._quantize_tensor(x, self.act_bitwidth)

        # 2. Integer GEMM
        qx = qx.to(self.qweight.dtype)
        acc = qx.matmul(self.qweight.t())

        # 3. Dequantize
        y = acc * act_scale * self.weight_scale

        # 4. Add bias if present
        if self.bias is not None:
            y = y + self.bias

        return y

    def __repr__(self) -> str:
        return (
            f"{self.__class__.__name__}("
            f"in={self.in_features}, out={self.out_features}, "
            f"w_bits={self.weight_bitwidth}, a_bits={self.act_bitwidth})"
        )

 ### 👉 Quick sanity check

 Run the next cell to quantize a random tensor at 2, 4, and 8 bits and
 print the mean absolute reconstruction error.


In [27]:
# %% 
torch.manual_seed(0)
sample = torch.randn(1000)
for b in (2, 4, 8):
    q, s = QLinear._quantize_tensor(sample, b)
    err = (sample - q * s).abs().mean().item()
    print(f"{b}-bit | mean-abs-error: {err:.6f}")


2-bit | mean-abs-error: 0.761839
4-bit | mean-abs-error: 0.144003
8-bit | mean-abs-error: 0.008115


## 2️⃣ Swapping Layers In-Place

To quantize a model with **per-layer bit-widths**, we construct a dictionary that maps each `nn.Linear` module’s fully qualified name to its desired bit-width. This dictionary, called the **`qconfig`**, is then used to walk the model, locate each linear layer, and replace it with a post-quantized version (`QLinear`). In the utilities below, the specified bit-width is applied to both the layer's weights and its activations.

The `qconfig` dictionary typically looks like this:

```text
{
  "model.transformer_blocks.0.attn.q_proj": 4,
  "model.transformer_blocks.0.attn.k_proj": 2,
  ...
}
```



In [28]:
# %%  — utilities for model patching
def quantize_linear(layer: nn.Linear, weight_bitwidth=8, act_bitwidth=8):
    qlayer = QLinear(layer.in_features, layer.out_features,
                     bias=layer.bias is not None,
                     weight_bitwidth=weight_bitwidth,
                     act_bitwidth=act_bitwidth)
    q_w, w_s = QLinear._quantize_tensor(layer.weight.data, weight_bitwidth)
    qlayer.qweight.copy_(q_w)
    qlayer.weight_scale.copy_(w_s)
    if layer.bias is not None:
        qlayer.bias.copy_(layer.bias.data)
    return qlayer


def quantize_model(root: nn.Module, qconfig: Dict[str, int]):
    for path, bw in qconfig.items():
        parent_path, _, attr = path.rpartition('.')
        parent = root if not parent_path else root.get_submodule(parent_path)
        setattr(parent, attr,
                quantize_linear(getattr(parent, attr),
                                weight_bitwidth=bw, act_bitwidth=bw))
    return root


def default_qconfig(model: ShakespeareModule, bitwidth=8):
    cfg = {}
    for name, mod in model.model.transformer_blocks.named_modules():
        if isinstance(mod, nn.Linear):
            cfg[f"model.transformer_blocks.{name}"] = bitwidth
    return cfg
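
The name-to-module lookup that `quantize_model` relies on can be tried on a small stand-in model. The classes `Block` and `Toy` and the helper `resolve` below are hypothetical and exist only for this sketch; the lab's real layer names come from `ShakespeareModule`.

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(8, 8)
        self.k_proj = nn.Linear(8, 8)

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([Block(), Block()])
        self.head = nn.Linear(8, 4)

toy = Toy()

# Build a qconfig: every nn.Linear starts at 8 bits, then one layer is overridden.
qconfig = {name: 8 for name, mod in toy.named_modules() if isinstance(mod, nn.Linear)}
qconfig["blocks.0.k_proj"] = 2

def resolve(root: nn.Module, path: str):
    """Split a dotted path into (parent module, attribute name)."""
    parent_path, _, attr = path.rpartition(".")
    parent = root if not parent_path else root.get_submodule(parent_path)
    return parent, attr

for path, bits in qconfig.items():
    parent, attr = resolve(toy, path)
    print(f"{path:18s} -> parent={type(parent).__name__}, attr={attr}, bits={bits}")
```

Replacing a layer then amounts to `setattr(parent, attr, new_module)`, which is why the parent, and not the layer itself, has to be looked up.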


## 3️⃣ Accuracy Metric: Validation Loss (Single Batch)

To evaluate the accuracy of the per-layer quantized model, we compute the **cross-entropy loss** on a single held-out batch. For a vocabulary of size $V$, model output logits $\mathbf{z} \in \mathbb{R}^{B \times V}$ (where $B$ is the number of predictions, i.e. batch size × sequence length once the token dimension is flattened), and targets $y \in \{0, \ldots, V-1\}^B$, the loss is defined as:

$$
\mathcal{L}_{\text{val}} = -\frac{1}{B} \sum_{i=1}^B \log \left( \frac{e^{z_{i, y_i}}}{\sum_{j=1}^V e^{z_{i, j}}} \right)
$$

This is the standard cross-entropy between the predicted softmax distribution and the true target class for each sample in the batch.

In PyTorch, this is computed as:

```python
F.cross_entropy(logits, y)
```




In [62]:
def compute_validation_loss(model, loss_fn, dataloader, device):
    """Return the cross-entropy of `model` on the first batch of `dataloader`."""
    model.eval()

    with torch.no_grad():
        x, y = next(iter(dataloader))
        x, y = x.to(device), y.to(device)
        logits = model.model(x)                        # [batch, seq_len, vocab]
        # Flatten so the class dimension comes second, as the loss expects.
        logits = logits.reshape(-1, logits.size(-1))   # [batch * seq_len, vocab]
        y = y.reshape(-1)                              # [batch * seq_len]
        loss = loss_fn(logits, y)

    return loss.item()
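
The flattening step matters because `F.cross_entropy` expects the class dimension in position 1. The sketch below uses small made-up sizes for `B`, `T`, and `V` to show that passing `[B, T, V]` logits directly fails, while flattening (or permuting) gives the same mean loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 16, 10                      # batch, sequence length, vocab (illustrative sizes)
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

# Unflattened: dim 1 (T) is interpreted as the class dimension, so shapes disagree.
try:
    F.cross_entropy(logits, targets)
except (RuntimeError, ValueError) as exc:
    print("unflattened call failed:", type(exc).__name__)

# Option 1: merge batch and time into one axis of B*T predictions.
loss_flat = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))

# Option 2: move the class dimension to position 1 and keep the [B, T] targets.
loss_perm = F.cross_entropy(logits.permute(0, 2, 1), targets)

print(torch.allclose(loss_flat, loss_perm, atol=1e-6))
```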

## 4️⃣ Memory Metric: Static Model Size

To evaluate the hardware efficiency of a given quantization configuration, we compute the **static memory footprint** of the model in **bytes**. This gives a direct measure of how much storage the model will consume on disk or in memory, based on the bit-widths assigned in `qconfig`.

### 🧮 What gets counted?

- **Quantized weights** (e.g., `QLinear.qweight`) are stored using the **bit-width specified in `qconfig`**
- **Biases, embeddings, and normalization layers** remain in full precision: **32 bits per parameter**
- **All other unquantized weights** (e.g., layer norm scales or unwrapped layers) are also assumed to be **32-bit floats**

Given a quantized linear layer weight tensor with shape $[m, n]$ and bit-width $b$, its contribution to memory is:

$$
\text{Size}_{\text{qweight}} = m \times n \times b \text{ bits}
$$

Biases and full-precision parameters contribute:

$$
\text{Size}_{\text{fp32}} = k \times 32 \text{ bits}
$$

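Finally, we sum both contributions over all layers and divide the total by 8 to convert from bits to **bytes**:

$$
\text{Size}_{\text{bytes}} = \frac{1}{8} \left( \sum \text{Size}_{\text{qweight}} + \sum \text{Size}_{\text{fp32}} \right)
$$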



In [63]:
# %% 
def compute_model_size_bytes(root: nn.Module, qconfig: Dict[str, int]):
    total_bits = 0
    # Quantized layers: weights at the configured bit-width, biases in FP32.
    for path, bw in qconfig.items():
        parent_path, _, attr = path.rpartition('.')
        parent = root if not parent_path else root.get_submodule(parent_path)
        lin: QLinear = getattr(parent, attr)
        total_bits += lin.qweight.numel() * bw
        if lin.bias is not None:
            total_bits += lin.bias.numel() * 32
    # Everything else (embeddings, norms, unquantized layers, their biases): FP32.
    for name, param in root.named_parameters():
        module_path = name.rpartition('.')[0]
        # QLinear keeps its tensors as buffers, so this guard only matters
        # if a layer listed in qconfig has not been swapped yet.
        if module_path in qconfig:
            continue
        total_bits += param.numel() * 32
    return total_bits // 8
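
As a rough worked example of the size formula, the sketch below applies it to three hypothetical layers. The shapes, the bit-widths, and the helper `size_bytes` are invented for illustration and do not describe the lab model.

```python
def size_bytes(quantized_layers, num_fp32_params):
    """quantized_layers: iterable of (m, n, bits); num_fp32_params: count of 32-bit values."""
    q_bits = sum(m * n * bits for m, n, bits in quantized_layers)
    fp_bits = num_fp32_params * 32
    return (q_bits + fp_bits) // 8

# Hypothetical layers: (out_features, in_features, bit-width)
layers = [(256, 256, 4), (256, 256, 2), (1024, 256, 8)]
num_biases = 256 + 256 + 1024            # one FP32 bias per output feature

mixed = size_bytes(layers, num_biases)
baseline = size_bytes([(m, n, 32) for m, n, _ in layers], num_biases)

print(mixed)                 # 317440 bytes
print(baseline)              # 1579008 bytes
print(baseline / mixed)      # roughly 5x smaller
```

The FP32 biases are a small share here, but in a model with large unquantized embeddings the full-precision term can dominate the total, which limits how much the bit-width search can save.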

## 5️⃣ HyperOpt Objective: Balancing Accuracy and Efficiency

To search for an optimal per-layer quantization configuration, we define a **scalarized objective function** that balances two competing goals:

- **Accuracy**: quantified by validation loss (cross-entropy)
- **Efficiency**: quantified by static model size in megabytes

These two metrics are combined using a **weighted linear combination**:

$$
\mathcal{L}_{\text{total}} = \alpha \cdot \mathcal{L}_{\text{val}} + (1 - \alpha) \cdot \text{Size}_{\text{MB}}
$$

Where:

- $\mathcal{L}_{\text{val}}$ is the cross-entropy loss of the quantized model on a held-out batch
- $\text{Size}_{\text{MB}}$ is the total model size in megabytes (as computed from `qconfig`)
- $\alpha \in [0, 1]$ controls the trade-off between accuracy and memory

This scalarized loss is used as the **objective function** in a non-linear optimization process (e.g., Bayesian optimization, random search). The goal is to **minimize** $\mathcal{L}_{\text{total}}$, yielding a quantization configuration that provides the best compromise between compactness and predictive quality.

You are encouraged to experiment with different values of $\alpha$:
- $\alpha = 0.2$: emphasize size minimization (more aggressive compression)
- $\alpha = 0.8$: prioritize accuracy (conservative quantization)
- $\alpha = 0.5$: balanced trade-off
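
To see how $\alpha$ shifts the preferred configuration, the sketch below scores two candidates under the weighted sum. The loss and size values are made up for illustration and are not measurements from the lab model.

```python
def scalarized(val_loss: float, size_mb: float, alpha: float) -> float:
    """Weighted sum of validation loss and model size in MB."""
    return alpha * val_loss + (1 - alpha) * size_mb

# Hypothetical (val_loss, size_mb) pairs, chosen only to illustrate the trade-off.
candidates = {
    "all-8bit": (1.50, 4.0),
    "all-2bit": (3.00, 1.0),
}

winners = {}
for alpha in (0.2, 0.5, 0.8):
    scores = {name: scalarized(loss, size, alpha) for name, (loss, size) in candidates.items()}
    winners[alpha] = min(scores, key=scores.get)
    print(alpha, scores, "->", winners[alpha])
```

With these numbers the smaller model wins at $\alpha = 0.2$ and $\alpha = 0.5$, and the more accurate one wins at $\alpha = 0.8$. Because the two terms have different units, the useful range of $\alpha$ depends on the typical magnitudes of the loss and of the size in MB.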

In [64]:
# %%  — objective
def objective(qconfig, batches, batch_size, alpha):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    q = ShakespeareModule.load_from_checkpoint(
        'src/lab1/checkpoints/float-best.ckpt', batch_size=batch_size)
    q.setup('test')
    quantize_model(q, qconfig).to(device)

    dl = q.test_dataloader()
    val_loss = compute_validation_loss(q, nn.CrossEntropyLoss(), dl, device)

    size_mb = compute_model_size_bytes(q, qconfig) / (1024 ** 2)
    loss = alpha * val_loss + (1 - alpha) * size_mb
    # print(f"loss={loss:.4f} | val_loss={val_loss:.4f} | size={size_mb:.2f} MB")
    return {'loss': loss, 'status': STATUS_OK}


## 6️⃣ Search Space & Driver

To explore quantization configurations, we define:

- A **search space**: mapping each `nn.Linear` to a discrete bit-width from `[2, 4, 8]`
- A **driver**: an optimizer that searches this space to minimize the scalarized objective

### 🔍 Search Space

Each quantizable layer gets its own entry in `qconfig`. For a model with $L$ layers, the space has size $3^L$, so exhaustive search is impractical.

Example (with `hyperopt`):

```python
space = {
    name: hp.choice(name, [2, 4, 8])
    for name in linear_layer_names(model)
}
```

### 🚗 Search Driver

We use `hyperopt.fmin()` to minimize the scalarized objective:

$$
\mathcal{L}_{\text{total}} = \alpha \cdot \mathcal{L}_{\text{val}} + (1 - \alpha) \cdot \text{Size}_{\text{MB}}
$$

The driver proposes a `qconfig`, quantizes the model, and evaluates it using:

- A single-batch validation loss for accuracy
- A static size estimate from the quantized weights and full-precision parameters

By repeating this over multiple iterations, the search converges toward a bit-width configuration that balances accuracy and memory footprint.

The default optimizer (`tpe.suggest`) uses a Tree-structured Parzen Estimator, but random search or other strategies can be substituted.

> 🔧 To accelerate search: reduce model size, limit batch size, or coarsen the bit-width choices (e.g., only 4 and 8 bits).
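
As noted above, random search can stand in for TPE. The sketch below is a dependency-free random-search driver over the same kind of space; `random_search`, `toy_objective`, and the layer names `attn` and `mlp` are hypothetical, and the toy objective is a stand-in for the real quantize-and-evaluate step.

```python
import random

def random_search(layer_names, objective, choices=(2, 4, 8), max_evals=50, seed=0):
    """Sample per-layer bit-widths uniformly and keep the best configuration seen."""
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(max_evals):
        cfg = {name: rng.choice(choices) for name in layer_names}
        loss = objective(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

def toy_objective(cfg):
    # Made-up proxy: "attn" is assumed to be more sensitive to low precision than "mlp".
    sensitivity = {"attn": 8.0, "mlp": 0.5}
    accuracy_term = sum(sensitivity[name] / bits for name, bits in cfg.items())
    size_term = sum(cfg.values()) / 8
    return accuracy_term + size_term

best_cfg, best_loss = random_search(["attn", "mlp"], toy_objective)
print(best_cfg, best_loss)
```

With only two layers the space has $3^2 = 9$ points, so uniform sampling covers it quickly. For the real model, the number of trials needed grows with $3^L$, which is where a model-based proposer such as TPE is expected to help.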


In [65]:
# %% 
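def hyperopt_search(init_cfg, max_evals=200, batches=10, batch_size=6, alpha=1e-7):
    # Note: with the default alpha=1e-7 the objective is driven almost entirely
    # by model size; pass a larger alpha to give the validation loss more weight.
    # One categorical choice per quantizable layer.
    space = {k: hp.choice(k, [2, 4, 8]) for k in init_cfg}
    trials = Trials()
    fn = partial(objective, batches=batches, batch_size=batch_size, alpha=alpha)
    best = fmin(fn, space, algo=tpe.suggest, max_evals=max_evals, trials=trials)
    # fmin returns choice indices; space_eval maps them back to bit-widths.
    return space_eval(space, best)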

 ## 7️⃣ Main Entry

In [66]:
# %% 
def optimize_qconfig():
    base = ShakespeareModule.load_from_checkpoint('src/lab1/checkpoints/float-best.ckpt')
    start_cfg = default_qconfig(base, bitwidth=8)
    best_cfg = hyperopt_search(start_cfg)
    return best_cfg

best_qconfig = optimize_qconfig()




 ## 🔄 Try This

 1. **Aggressive compression** – set the initial cfg to 4 bits and limit the
    search to {2,4}.  How low can the validation loss stay?
 2. **Latency vs throughput** – time one forward pass before and after
    quantization on CPU.
 3. **Text generation side-by-side** – sample a Shakespeare sonnet with both
    models; can you spot the quantized one?

 Post your findings on the course forum—screenshots, metrics, or even the
 strangest quantization artefacts you encounter.

 ---

 🏁 **End of Lab 2** — you now have a fully automated post-training
 quantization pipeline and a taste of multi-objective search.  
 Next stop: **quantization-aware training** and custom int kernels!
