In [3]:
%%writefile requirements.txt
torch
datasets==3.6.0
transformers
rouge_score
accelerate
evaluate

Overwriting requirements.txt


In [4]:
!cat requirements.txt && pip install -r requirements.txt

torch
datasets==3.6.0
transformers
rouge_score
accelerate
evaluate


In [5]:
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import evaluate

# Part 1: Exploring summarization using in-context learning

In-context learning (ICL) refers to the ability of large language models to perform new tasks simply by being shown examples or instructions within the prompt, without updating their weights. Instead of traditional training, the model conditions on the pattern provided in the context window—such as a few input–output pairs, a task description, or a chain-of-thought demonstration—and dynamically adapts its behavior for the duration of the interaction. This emerges from the model’s internal representations learned during pretraining, allowing it to infer latent structure, align with new tasks, and generalize on-the-fly. In practice, ICL enables powerful “prompt-based programming,” where users can guide the model toward specific behaviors without fine-tuning.

In part 1 of this exercise, you will explore how in-context learning affects summarization performance on the XSum dataset. You will select a small subset of XSum articles and evaluate a large language model’s summaries under different prompt settings: (1) no example (zero-shot), (2) one example (one-shot), and (3) multiple examples (few-shot). For each case, you can compare the model-generated summary with the expert summary, and compare how the model’s outputs change with increasing contextual examples. 
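
As a rough illustration of the three settings, the sketch below assembles a prompt from zero, one, or two demonstration pairs followed by the query document. The `Document:`/`Summary:` template, the `build_prompt` helper, and the toy demonstrations are assumptions made for illustration; they are not taken from XSum.

```python
def build_prompt(document: str, demonstrations: list[tuple[str, str]]) -> str:
    """Concatenate (document, summary) demonstrations, then the query document."""
    parts = [f"Document: {doc}\nSummary: {summ}\n" for doc, summ in demonstrations]
    # The final block leaves the summary open for the model to complete.
    parts.append(f"Document: {document}\nSummary:")
    return "\n".join(parts)


# Toy demonstrations (hypothetical, not taken from XSum).
demos = [
    ("The council approved a new cycle lane on Main Street after a public vote.",
     "A new cycle lane has been approved."),
    ("Heavy rain closed two schools in the region on Monday morning.",
     "Rain has forced two schools to close."),
]
query = "A local bakery won a national award for its sourdough bread."

zero_shot = build_prompt(query, demos[:0])  # no demonstrations
one_shot = build_prompt(query, demos[:1])   # one demonstration
few_shot = build_prompt(query, demos[:2])   # two demonstrations
print(few_shot)
```

The only difference between the settings is how many solved pairs precede the query; the model weights are untouched in all three.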

In [7]:
# Load the Qwen3-1.7B model
model_name = "Qwen/Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)



In [8]:
from transformers import PreTrainedModel, PreTrainedTokenizerFast
from typing import Tuple

In [9]:
def text_generation_pipeline(prompt: str, model: PreTrainedModel, tokenizer: PreTrainedTokenizerFast,
                             max_new_tokens: int, enable_thinking: bool, system_prompt: str | None = None, **kwargs) -> Tuple[str, str]:
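    """
    Text-generation pipeline for the Qwen3 model and its tokenizer.
    Use the Qwen3-1.7B model card to fill in this function.

    Args:
        prompt (str): The input prompt for text generation.
        model (PreTrainedModel): The pre-trained model for text generation.
        tokenizer (PreTrainedTokenizerFast): The tokenizer associated with the model.
        max_new_tokens (int): The maximum number of new tokens to generate.
        enable_thinking (bool): Whether the chat template should enable the model's thinking mode.
        system_prompt (str | None): Optional system message placed before the user prompt.
        **kwargs: Currently unused.

    Returns:
        Tuple[str, str]: The generated content and the thinking content
            (an empty string when thinking is disabled or no </think> token is found).
    """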
    messages = []
    if system_prompt is not None:
        messages.append(
            {"role": "system", "content": system_prompt}
        )
    messages.append(
            {"role": "user", "content": prompt}
    )
    
    text = tokenizer.apply_chat_template(
        messages, 
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **model_inputs, 
        max_new_tokens=max_new_tokens
    ) # type: ignore
    output_ids = generated_ids[0][len(model_inputs["input_ids"][0]):].tolist()
    # parse thinking content
    thinking_content = ""
    index = 0
    if enable_thinking:
        rindex = 151668 # </think>
        try:
            index = len(output_ids) - output_ids[::-1].index(rindex) 
        except ValueError:
            index =  0
        print(index)
        thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
    content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
    return content, thinking_content
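
The parsing step in the function above hinges on locating the last occurrence of the `</think>` token id in the generated ids. The sketch below isolates that trick on plain Python lists; `split_at_last` and the marker id `99` are placeholders chosen for illustration, not part of the Qwen3 tokenizer.

```python
def split_at_last(ids: list[int], marker: int) -> tuple[list[int], list[int]]:
    """Split ids just after the last occurrence of marker.

    Returns (prefix including the marker, remainder). If the marker is
    absent, the prefix is empty and everything is treated as the remainder.
    """
    try:
        # Searching the reversed list finds the last occurrence;
        # subtracting from len(ids) converts it to a forward cut position.
        cut = len(ids) - ids[::-1].index(marker)
    except ValueError:
        cut = 0
    return ids[:cut], ids[cut:]


END_THINK = 99  # placeholder id for illustration, not the real token id
thinking, answer = split_at_last([1, 2, END_THINK, 3, END_THINK, 4, 5], END_THINK)
print(thinking, answer)

no_thinking, full = split_at_last([4, 5, 6], END_THINK)
print(no_thinking, full)
```

Cutting at the last marker rather than the first keeps any earlier marker inside the thinking part.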

In [10]:
from pprint import pprint

example_prompt = "Give me a short description of in-context learning for LLMs."
generated_content, thinking_content = text_generation_pipeline(
    prompt=example_prompt,
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=5000,
    enable_thinking=True,
    system_prompt="You are a helpful assistant that helps people find information."
)

pprint(f"Generated Content:\n {generated_content}")
pprint(f"\nThinking Content:\n {thinking_content}")

326
('Generated Content:\n'
 ' In-context learning refers to the ability of large language models (LLMs) '
 'to understand and generate responses based on the specific context provided '
 'in the input text. Unlike traditional training methods that rely on '
 'pre-stored knowledge, these models can process the entire conversation or '
 'passage, allowing them to adapt their responses dynamically to the current '
 'context. This enables seamless interaction, such as continuing a dialogue or '
 "answering questions within a longer text, while leveraging the model's "
 'training on vast datasets. However, the amount of context included in a '
 'single input is limited, which can affect the quality of the response.')
('\n'
 'Thinking Content:\n'
 ' <think>\n'
 'Okay, the user is asking for a short description of in-context learning for '
 'LLMs. Let me start by recalling what I know. In-context learning refers to '
 'the ability of large language models to understand and generate responses

### Exploring the XSum dataset

The XSum (Extreme Summarization) dataset, introduced in [1], pairs each news article with a short, one-sentence summary that answers the question “What is the article about?”.


> [1] Narayan, S., Cohen, S. B., & Lapata, M. (2018). Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.

In [11]:
xsum_dataset = load_dataset("EdinburghNLP/xsum")

In [12]:
# explore an example 
from pprint import pprint 
split = "validation"
pprint(xsum_dataset[split][0])

{'document': 'The ex-Reading defender denied fraudulent trading charges '
             'relating to the Sodje Sports Foundation - a charity to raise '
             'money for Nigerian sport.\n'
             'Mr Sodje, 37, is jointly charged with elder brothers Efe, 44, '
             'Bright, 50 and Stephen, 42.\n'
             'Appearing at the Old Bailey earlier, all four denied the '
             'offence.\n'
             'The charge relates to offences which allegedly took place '
             'between 2008 and 2014.\n'
             'Sam, from Kent, Efe and Bright, of Greater Manchester, and '
             'Stephen, from Bexley, are due to stand trial in July.\n'
             'They were all released on bail.',
 'id': '38295789',
 'summary': 'Former Premier League footballer Sam Sodje has appeared in court '
            'alongside three brothers accused of charity fraud.'}


We see that the summary is concise and consistent with the article. But how does one evaluate such summaries?

For this, we work as follows: <br>
A sequence $Z = [z_1, \dots, z_k]$ is a subsequence of $X = [x_1, \dots, x_m]$ if there exists a strictly increasing sequence $[i_1, \dots, i_k]$ of indices of $X$ such that $x_{i_j} = z_j$ for all $j = 1, \dots, k$. Given two sequences $X$ and $Y$, a longest common subsequence $\text{LCS}(X,Y)$ is a common subsequence of maximum length, and its length is $|\text{LCS}(X,Y)| = \max\{|Z| : Z\text{ is a common subsequence of }X\text{ and }Y\}$. Consider a summary sentence as a sequence of words. We can hypothesize that the longer the LCS of two summary sentences, the more similar the two summaries are.
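
To make the definition concrete, here is a minimal sketch that recovers one longest common subsequence of two tokenized sentences with the standard dynamic-programming table and a backward walk. The helper name `lcs_tokens` and the two toy sentences are assumptions for illustration.

```python
def lcs_tokens(x: list[str], y: list[str]) -> list[str]:
    """Return one longest common subsequence of x and y."""
    m, n = len(x), len(y)
    # table[i][j] = LCS length of x[:i] and y[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover the matched tokens.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


reference = "the cat sat on the mat".split()
candidate = "the cat lay on a mat".split()
common = lcs_tokens(reference, candidate)
print(common, len(common))
```

The matched words need not be adjacent in either sentence; only their relative order has to agree.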

We will design a metric around this as follows: Let $X$ be an $m$-word summary written by an expert, and $Y$ be an $n$-word candidate summary generated by a neural network. Define the LCS-based recall $R_{lcs}$, precision $P_{lcs}$, and F-measure $F_{lcs}$ as $$R_{lcs} = \frac{|\text{LCS}(X,Y)|}{m}, \quad P_{lcs} = \frac{|\text{LCS}(X,Y)|}{n}, \quad F_{lcs} = \frac{(1+\beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

> Think: <br>(a) This should remind you of the $F_\beta$ score. How are the recall and precision in $F_\beta$ related to the LCS-based quantities here?<br>(b) What are the maximum and minimum values $F_{lcs}$ can take, and when?
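
To get a feel for the role of $\beta$, the sketch below evaluates the F-measure formula above for a fixed recall and precision. The helper `f_beta` and the values 0.5 and 0.25 are illustrative assumptions, not measured results.

```python
def f_beta(recall: float, precision: float, beta: float) -> float:
    """F-measure with the same form as F_lcs above."""
    denom = recall + beta**2 * precision
    if denom == 0:
        return 0.0
    return (1 + beta**2) * recall * precision / denom


recall, precision = 0.5, 0.25  # illustrative values, not measured results
for beta in (0.01, 1.0, 100.0):
    print(f"beta={beta:>6}: F={f_beta(recall, precision, beta):.4f}")
```

With this form, a small $\beta$ pulls the score toward the precision, a large $\beta$ pulls it toward the recall, and $\beta = 1$ gives their harmonic mean.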

In [13]:
def F_LCS(expert_summary: str, candidate_summary: str, beta: float = 1.0) -> float:
    """
    Compute the F-measure based on the Longest Common Subsequence (LCS) between
    an expert summary and a candidate summary. First, we tokenize both summaries
    into words (removing punctuation and converting to lowercase). Then, we compute
    the LCS length and use it to calculate recall, precision, and the F-measure.

    Args:
        expert_summary (str): The expert-generated summary.
        candidate_summary (str): The candidate summary generated by a model.
        beta (float): The weight of recall in the F-measure calculation.
    Returns:
        float: The F-measure based on LCS.
    """
    import re

    def toks(s):
        return re.findall(r"\w+", s.lower())

    expert_toks   = toks(expert_summary)
    candidate_toks = toks(candidate_summary)

    def LCS(X: list, Y: list) -> int:
        m = len(X)
        n = len(Y)
        L = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0 or j == 0:
                    L[i][j] = 0
                elif X[i - 1] == Y[j - 1]:
                    L[i][j] = L[i - 1][j - 1] + 1
                else:
                    L[i][j] = max(L[i - 1][j], L[i][j - 1])
        return L[m][n]
    lcs_length = LCS(expert_toks, candidate_toks)
    m = len(expert_toks)
    n = len(candidate_toks)
    R_lcs = lcs_length / m if m > 0 else 0.0
    P_lcs = lcs_length / n if n > 0 else 0.0
    if R_lcs + beta**2 * P_lcs == 0:
        return 0.0
    F_lcs = (1 + beta**2) * R_lcs * P_lcs / (R_lcs + beta**2 * P_lcs)
    return F_lcs

In [14]:
import numpy as np
random_expert_sentence = "Artificial intelligence is transforming healthcare by enabling earlier and more accurate diagnoses."
random_candidate_sentence = "AI is revolutionizing medicine by making diagnostic processes faster and more precise."

f_lcs_score = F_LCS(random_expert_sentence, random_candidate_sentence, beta=1.0)
assert np.isclose(f_lcs_score, 1/3)


> The above $F$-measure is defined for single sentences. How could one extend it to summaries with multiple sentences?
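
One possible extension, in the spirit of summary-level ROUGE-L, is to take each reference sentence, collect the reference words matched by an LCS against every candidate sentence, and count the union of those matches. The sketch below is an assumption about how this could look, not the notebook's intended answer; `lcs_positions` and `union_lcs_f` are hypothetical helpers, and when several LCSs exist the backward walk simply picks one of them.

```python
def lcs_positions(ref: list[str], cand: list[str]) -> set[int]:
    """Indices of ref tokens that belong to one LCS of ref and cand."""
    m, n = len(ref), len(cand)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == cand[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    hits, i, j = set(), m, n
    while i > 0 and j > 0:
        if ref[i - 1] == cand[j - 1]:
            hits.add(i - 1)
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return hits


def union_lcs_f(ref_sents: list[list[str]], cand_sents: list[list[str]],
                beta: float = 1.0) -> float:
    """Summary-level F-measure built from per-sentence union-LCS matches."""
    m = sum(len(s) for s in ref_sents)   # total reference words
    n = sum(len(s) for s in cand_sents)  # total candidate words
    matched = 0
    for ref in ref_sents:
        union: set[int] = set()
        for cand in cand_sents:
            union |= lcs_positions(ref, cand)
        matched += len(union)
    recall = matched / m if m else 0.0
    precision = matched / n if n else 0.0
    denom = recall + beta**2 * precision
    return (1 + beta**2) * recall * precision / denom if denom else 0.0


# One reference sentence, two candidate sentences (abstract word labels).
ref_sents = [["w1", "w2", "w3", "w4", "w5"]]
cand_sents = [["w1", "w2", "w6", "w7", "w8"], ["w1", "w3", "w8", "w9", "w5"]]
score = union_lcs_f(ref_sents, cand_sents)
print(round(score, 4))
```

Here the first candidate sentence matches `w1 w2` and the second matches `w1 w3 w5`, so the union covers four of the five reference words.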

Now write a function that takes a document to summarize, along with a parameter `n_ICL` that controls the number of in-context examples. Take these examples from the training split of `xsum_dataset`.

In [15]:
def generate_summary(document: str, n_ICL: int) -> str:
    """
    Generate a summary for the given document using in-context learning with n_ICL examples.

    Args:
        document (str): The document to summarize.
        n_ICL (int): The number of in-context learning examples to use.

    Returns:
        str: The generated summary.
    """
    # Select n_ICL examples from the training subset of xsum_dataset
    train_data = xsum_dataset["train"]
    examples = train_data.select(range(n_ICL))

    # Construct the prompt with in-context examples
    prompt = ""
    for example in examples:
        prompt += f"Document: {example['document']}\nSummary: {example['summary']}\n\n"
    
    prompt += f"Document: {document}\nSummary:"

    # Generate the summary using the text generation pipeline
    generated_summary, _ = text_generation_pipeline(
        prompt=prompt,
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=128,
        enable_thinking=False,
        system_prompt="You are a helpful assistant that summarizes documents. You might be given some examples to help you."\
        " You should generate a concise and relevant summary for the provided document. Just provide the summary for the document without any additional text."
    )

    return generated_summary

In [18]:
# write a function to evaluate ICL
from tqdm import tqdm
def evaluate_ICL(validation_data, n_ICL: int) -> Tuple[float, list[str], list[str]]:
    """
    Evaluate the in-context learning performance on the validation dataset using F-LCS metric.

    Args:
        validation_data: Subset of the validation dataset containing documents and expert summaries.
        n_ICL (int): The number of in-context learning examples to use.

    Returns:
        Tuple[float, list[str], list[str]]: The average F-LCS score over the
            validation data, the generated summaries, and the expert summaries.
    """
    f_lcs_scores = []
    generated_summaries = []
    expert_summaries = []
    for example in tqdm(validation_data, leave=False):
        document, summary = example['document'], example['summary']
        generated_summary = generate_summary(document, n_ICL)
        generated_summaries.append(generated_summary)
        expert_summaries.append(summary)
        f_lcs_score = F_LCS(summary, generated_summary, beta=1.0)
        f_lcs_scores.append(f_lcs_score)
    # auto_rouge_scores = rouge.compute(predictions=generated_summaries, references=expert_summaries, rouge_types=["rougeL"])["rougeL"]
    # return auto_rouge_scores, generated_summaries, expert_summaries
    return np.mean(f_lcs_scores), generated_summaries, expert_summaries

In [19]:
validation_data = xsum_dataset["validation"].select(range(6))

for n_ICL in range(5):
    f_lcs_score, _, _ = evaluate_ICL(validation_data, n_ICL)
    print(f'F-LCS Score with {n_ICL} ICL examples: {f_lcs_score:.4f}\n')

                                             

F-LCS Score with 0 ICL examples: 0.1064

F-LCS Score with 1 ICL examples: 0.1448

F-LCS Score with 2 ICL examples: 0.1559

F-LCS Score with 3 ICL examples: 0.1625

F-LCS Score with 4 ICL examples: 0.2128





# Part 2: Quantization

Neural networks are usually trained and stored in float32 (FP32). Quantization means representing numbers with lower precision, like int8 (8-bit integers). 

This is done to:
- Lower memory footprint
- Decrease inference costs (less bandwidth, higher throughput)
- Allow deployment on constrained hardware (mobile, edge devices, etc.)

In this part, we will look at uniform affine INT8 quantization. After attempting this, you are encouraged to dive deeper into more recent quantization schemes and formats (GGUF, etc.).

Consider a real tensor $x \in \mathbb{R}^{\bullet}$ that we want to map to an 8-bit integer tensor $q$. We can do so elementwise as: $$q = \text{clamp}\left(\text{round}\left(\frac{x}{s}\right) + z, q_{\min}, q_{\max}\right)$$ where $s$ is the scale parameter, $z$ is the integer zero-point, and $q_{\min}$, $q_{\max}$ are the range limits.

> What are the range limits for signed INT8?

In [None]:
q_min = -128
q_max = 127

> How would we get back the de-quantized variable? Fill: $$\hat{x} = s(q-z)$$
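
As a small worked example of the two formulas, the sketch below quantizes a few scalars and maps them back, using plain Python numbers instead of tensors. The helper names, the sample values, and the choice of scale are assumptions for illustration.

```python
def quantize(x: float, scale: float, zero_point: int = 0,
             q_min: int = -128, q_max: int = 127) -> int:
    """q = clamp(round(x / s) + z, q_min, q_max)"""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q: int, scale: float, zero_point: int = 0) -> float:
    """x_hat = s * (q - z)"""
    return scale * (q - zero_point)


values = [-1.0, -0.3, 0.0, 0.42, 1.0]
scale = max(abs(v) for v in values) / 127  # choose s so the largest |x| maps to 127
for v in values:
    q = quantize(v, scale)
    v_hat = dequantize(q, scale)
    print(f"x={v:+.3f}  q={q:+4d}  x_hat={v_hat:+.5f}  error={abs(v - v_hat):.5f}")
```

Values inside the representable range come back with an error of at most $s/2$ from rounding; values outside it are clamped to $q_{\min}$ or $q_{\max}$ and can be off by much more.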

In this exercise, we will set the zero-point $z=0$, implementing symmetric quantization. 
> What do you think is the effect of zero-point? Can $z$ be non-zero? 
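
One way to explore this question is to compare symmetric quantization with an affine scheme on a one-sided range. The sketch below assumes a hypothetical value range of $[0, 6]$ and derives $s$ and $z$ so that the range endpoints land on $q_{\min}$ and $q_{\max}$; `affine_params` is an illustrative helper, not part of the exercise code.

```python
def affine_params(x_min: float, x_max: float,
                  q_min: int = -128, q_max: int = 127) -> tuple[float, int]:
    """Scale and zero-point that map [x_min, x_max] onto [q_min, q_max]."""
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int,
             q_min: int = -128, q_max: int = 127) -> int:
    return max(q_min, min(q_max, round(x / scale) + zero_point))


# Hypothetical one-sided range, e.g. non-negative activations capped at 6.
x_min, x_max = 0.0, 6.0

sym_scale = max(abs(x_min), abs(x_max)) / 127      # symmetric, z = 0
asym_scale, asym_zp = affine_params(x_min, x_max)  # affine, z != 0

# Count how many distinct integer codes each scheme actually uses.
grid = [i * 0.01 for i in range(601)]
sym_codes = {quantize(x, sym_scale, 0) for x in grid}
asym_codes = {quantize(x, asym_scale, asym_zp) for x in grid}
print(f"symmetric:  step={sym_scale:.5f}, codes used={len(sym_codes)}")
print(f"asymmetric: step={asym_scale:.5f}, zero_point={asym_zp}, codes used={len(asym_codes)}")
```

With $z = 0$ the negative half of the integer range is never used for this data, so the step size is roughly twice as coarse as in the affine case.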

Depending on the accuracy / latency trade-off you are targeting, the quantization parameters can be chosen at different granularities:
- *Per-tensor quantization*: You store one pair of $(s,z)$ per tensor.
- *Per-channel quantization*: You store one pair of $(s,z)$ per index along one dimension of the tensor (e.g., a tensor with shape $[B,C,H,W]$ can have $C$ pairs of $(s,z)$).
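
The difference between the two granularities shows up when channels have very different magnitudes. The sketch below uses a made-up 2x3 weight matrix and symmetric INT8 quantization on plain Python lists; the helper names and numbers are assumptions for illustration.

```python
def symmetric_scale(values: list[float], q_max: int = 127) -> float:
    """Scale that maps the largest |value| to q_max (guarded against zero)."""
    return max(max(abs(v) for v in values), 1e-8) / q_max


def roundtrip_error(values: list[float], scale: float) -> float:
    """Largest |x - x_hat| after symmetric INT8 quantize/dequantize."""
    worst = 0.0
    for v in values:
        q = max(-128, min(127, round(v / scale)))
        worst = max(worst, abs(v - scale * q))
    return worst


# Toy weight matrix: row 0 has much smaller values than row 1.
weights = [
    [0.010, -0.020, 0.015],
    [1.000, -2.000, 1.500],
]

# Per-tensor: one scale shared by every row.
tensor_scale = symmetric_scale([v for row in weights for v in row])
# Per-channel: one scale per row (channel axis 0).
channel_scales = [symmetric_scale(row) for row in weights]

for idx, row in enumerate(weights):
    err_tensor = roundtrip_error(row, tensor_scale)
    err_channel = roundtrip_error(row, channel_scales[idx])
    print(f"row {idx}: per-tensor error={err_tensor:.6f}, per-channel error={err_channel:.6f}")
```

The shared scale is dictated by the large row, so the small row is squeezed into a handful of integer codes; a per-row scale avoids this at the cost of storing one extra scale per channel.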

In this exercise, we work with Post-training quantization (PTQ) where we take a trained model, keep weights fixed, and:
- Compute scale / zero-point from real weights (and optionally activations).
- Replace FP weights with INT8 + scale/zero-point.
- Run inference with the quantized model.

Benefits: simple, no retraining.

In [21]:
from dataclasses import dataclass
from typing import Tuple, Optional

import torch
import torch.nn as nn

In [22]:
@dataclass 
class QuantizationParams:
    scale: torch.Tensor       # shape: [] or [C] for per-channel
    zero_point: torch.Tensor  # same shape as scale
    qmin: int
    qmax: int

In [27]:
def calc_symmetric_params(
        x: torch.Tensor,
        num_bits: int,
        per_channel: bool = False,
        ch_axis: Optional[int] = None
) -> QuantizationParams:
    """
    Compute symmetric quantization parameters for x.

    Symmetric means:
        - Representable range is approximately [-x_max, x_max]
        - zero_point = 0

    If per_channel is True, compute separate scales per channel along ch_axis.

    TODO:
        1. Compute max absolute value (per tensor or per channel).
        2. Avoid division-by-zero when tensor is all zeros (e.g. set scale=1.0).
        3. Compute scale = max_val / qmax,
        4. Set zero_point = 0.
    """
    # calculate qmin and qmax based on num_bits
    q_min = - (2 ** (num_bits - 1)) # ... 
    q_max = (2 ** (num_bits - 1)) - 1 # ...

    if per_channel:
        # reduce over all dims except ch_axis
        dims = [d for d in range(x.dim()) if d != ch_axis]
        max_val = x.abs().amax(dim=dims)
    else:
        max_val = x.abs().amax()

    
    # TODO: handle the all-zero case to avoid scale = 0
    # Hint: use torch.where or a small epsilon.

    # ====== YOUR CODE HERE ======
    # compute max_val, scale, zero_point
    eps = 1e-8
    max_val = torch.clamp(max_val, min=eps)
    scale = max_val / q_max
    zero_point = torch.zeros_like(scale)

    # ====== END YOUR CODE =======
    return QuantizationParams(
        scale=scale,
        zero_point=zero_point,
        qmin=q_min,
        qmax=q_max
    )


In [28]:
def quantize_tensor(
    x: torch.Tensor,
    params: QuantizationParams,
) -> torch.Tensor:
    """
    Quantize the input tensor x using the provided quantization parameters.
    Formula:
        q = clamp(round(x / scale) + zero_point, qmin, qmax)

    Broadcasting rules:
        - If scale/zero_point are per-channel, make sure they broadcast
          correctly along the tensor dimensions (e.g. [C, 1] for [C, D] weights).

    TODO:
        1. Reshape scale and zero_point for broadcasting if needed.
        2. Apply the affine transform + rounding + clamping.
        3. Cast to torch.int8.
    """ 
    scale = params.scale 
    zp = params.zero_point

    # ====== YOUR CODE HERE ======
    # calculate the quantized tensor q
    while scale.dim() < x.dim():
        scale = scale.view(*scale.shape, *([1] * (x.dim() - scale.dim())))
        zp = zp.view(*zp.shape, *([1] * (x.dim() - zp.dim())))

    q = torch.round(x / scale) + zp
    q = torch.clamp(q, params.qmin, params.qmax)
    q = q.to(torch.int8)
    

    # ====== END YOUR CODE ======= 
    return q



In [29]:
def dequantize_tensor(
        q: torch.Tensor,
        params: QuantizationParams,
) -> torch.Tensor:
    """
    Dequantize int8 tensor back to float using the formula you derived above.

    TODO:
        1. Handle broadcasting same as in quantize_tensor.
        2. Return float32 tensor.
    """
    scale = params.scale 
    zp = params.zero_point

    # ====== YOUR CODE HERE ======
    # calculate the dequantized tensor x_hat
    while scale.dim() < q.dim():
        scale = scale.view(*scale.shape, *([1] * (q.dim() - scale.dim())))
        zp = zp.view(*zp.shape, *([1] * (q.dim() - zp.dim())))

    x_hat = scale * (q.to(torch.float32) - zp)


    # ====== END YOUR CODE =======
    return x_hat

Now we are ready to write the class for an INT8-quantized linear layer.

In [85]:
class QuantLinear(nn.Module):
    """
    Weight-only INT8 quantized Linear layer.

    - Stores weights as int8 + (scale, zero_point).
    - Bias stays in float. 
    - On forward, dequantizes weights to float and uses a regular matmul.

    This is *not* optimized for speed, but shows how weight-only PTQ works.

    Usage:
        qlinear = QuantLinear.from_fp_module(linear, weight_bits=8, per_channel=True)
    """

    def __init__(
        self,
        in_features: int,
        out_features: int,
        bias: bool = True,
        weight_bits: int = 8,
        per_channel: bool = True,
    ):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight_bits = weight_bits
        self.per_channel = per_channel

        # These will be filled by from_fp_module
        self.register_buffer("weight_int8", torch.empty(out_features, in_features, dtype=torch.int8))
        self.register_buffer("scale", torch.ones(out_features))
        self.register_buffer("zero_point", torch.zeros(out_features))

        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            self.register_parameter("bias", None)

    @classmethod
    def from_fp_module(
        cls,
        linear: nn.Linear,
        weight_bits: int = 8,
        per_channel: bool = True,
    ) -> "QuantLinear":
        """
        Create a QuantLinear from a pretrained nn.Linear by
        quantizing its weights.

        TODO:
            1. Instantiate QuantLinear with appropriate sizes.
            2. Compute per-channel quantization parameters for weight:
                   shape [out_features, in_features]
               with channel axis = 0 (each output neuron has its own scale).
            3. Quantize weights and store in weight_int8, scale, zero_point.
            4. Copy bias if it exists.
        """
        qlin = cls(
            in_features=linear.in_features,
            out_features=linear.out_features,
            bias=linear.bias is not None,
            weight_bits=weight_bits,
            per_channel=per_channel,
        )

        with torch.no_grad():
            w = linear.weight.data

            # ====== YOUR CODE HERE ======
            params = calc_symmetric_params(
                w,
                num_bits=weight_bits,
                per_channel=per_channel,
                ch_axis=0,
            )
            q_w = quantize_tensor(w, params)

            qlin.weight_int8.copy_(q_w)
            qlin.scale.copy_(params.scale)
            qlin.zero_point.copy_(params.zero_point)

            if linear.bias is not None:
                qlin.bias.copy_(linear.bias.data)
            # ====== END YOUR CODE =======

        return qlin

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass:

        TODO:
            1. Dequantize weight_int8 using self.scale, self.zero_point.
            2. Compute x @ W^T + bias.
        """
        # ====== YOUR CODE HERE ======
        params = QuantizationParams(
            scale=self.scale,
            zero_point=self.zero_point,
            qmin=-128,
            qmax=127,
        )
        w = dequantize_tensor(self.weight_int8, params).to(x.dtype)
        out = x.matmul(w.t())
        if self.bias is not None:
            # Cast the bias so a float32 bias does not upcast half-precision activations.
            out = out + self.bias.to(out.dtype)
        # ====== END YOUR CODE =======

        return out
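
The forward pass above dequantizes the whole weight matrix before the matmul. With symmetric per-channel scales (zero-point 0), the same output can be obtained by accumulating against the integer codes and rescaling each output channel once, which is roughly the identity that faster integer kernels rely on. The toy numbers below are made up for illustration, and the activations stay in float.

```python
import math

# Toy per-channel quantized layer: 2 output channels, 3 inputs (z = 0).
q_weight = [[12, -7, 127], [-128, 5, 40]]  # int8 codes, one row per output
scales = [0.02, 0.005]                     # one scale per output channel
x = [0.5, -1.0, 2.0]

# (a) Dequantize first, then do a float matmul (the approach used in forward).
w_hat = [[s * q for q in row] for s, row in zip(scales, q_weight)]
y_dequant = [sum(xi * wi for xi, wi in zip(x, row)) for row in w_hat]

# (b) Accumulate against the integer codes, then rescale once per output.
y_rescale = [s * sum(xi * q for xi, q in zip(x, row))
             for s, row in zip(scales, q_weight)]

print(y_dequant)
print(y_rescale)
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(y_dequant, y_rescale)))
```

Variant (b) never materializes a float copy of the weights, which is where the memory and bandwidth savings of weight-only quantization would come from in an optimized implementation.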

# Example: Training on MNIST

In [86]:
from __future__ import annotations
from typing import Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms


class MLP(nn.Module):
    def __init__(self, quantized: bool = False):
        super().__init__()
        self.quantized = quantized

        # Define FP32 linear layers first
        self.fc1 = nn.Linear(28 * 28, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)

        if self.quantized:
            # Replace with quantized versions
            self.fc1 = QuantLinear.from_fp_module(self.fc1, weight_bits=8, per_channel=True)
            self.fc2 = QuantLinear.from_fp_module(self.fc2, weight_bits=8, per_channel=True)
            self.fc3 = QuantLinear.from_fp_module(self.fc3, weight_bits=8, per_channel=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x



In [87]:
def get_mnist_dataloaders(batch_size: int = 128) -> Tuple[DataLoader, DataLoader]:
    transform = transforms.Compose([transforms.ToTensor()])
    dataset = datasets.MNIST(root="./data", train=True, download=True, transform=transform)

    train_len = int(0.8 * len(dataset))
    val_len = len(dataset) - train_len
    train_ds, val_ds = random_split(dataset, [train_len, val_len])

    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader


In [88]:
def train_one_epoch(model, loader, optimizer, device):
    model.train()
    total_loss = 0.0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(x)
        loss = F.cross_entropy(logits, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * x.size(0)
    return total_loss / len(loader.dataset)

In [89]:
def evaluate(model, loader, device):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            preds = logits.argmax(dim=1)
            correct += (preds == y).sum().item()
            total += x.size(0)
    return correct / total


In [90]:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_loader, val_loader = get_mnist_dataloaders()

# ====== FP32 baseline ======
fp32_model = MLP(quantized=False).to(device)
optimizer = torch.optim.Adam(fp32_model.parameters(), lr=1e-3)

print("Training FP32 model...")
for epoch in range(3):
    loss = train_one_epoch(fp32_model, train_loader, optimizer, device)
    acc = evaluate(fp32_model, val_loader, device)
    print(f"[FP32] Epoch {epoch+1}: loss={loss:.4f}, val_acc={acc:.4f}")

# Switch to evaluation mode; the FP32 weights are not updated after this point
fp32_model.eval()

# ====== INT8 model (weight-only PTQ) ======
# As in a typical PTQ pipeline, we train in FP32 first and then quantize.
# Here we copy the trained FP32 weights into a fresh model before quantizing it.
int8_model = MLP(quantized=False)  # start as FP32
int8_model.load_state_dict(fp32_model.state_dict())

# Replace with quantized layers
int8_model.fc1 = QuantLinear.from_fp_module(int8_model.fc1, weight_bits=8, per_channel=True)
int8_model.fc2 = QuantLinear.from_fp_module(int8_model.fc2, weight_bits=8, per_channel=True)
int8_model.fc3 = QuantLinear.from_fp_module(int8_model.fc3, weight_bits=8, per_channel=True)

int8_model.to(device)

int8_acc = evaluate(int8_model, val_loader, device)
print(f"[INT8] Validation accuracy after PTQ: {int8_acc:.4f}")


Training FP32 model...
[FP32] Epoch 1: loss=0.3777, val_acc=0.9392
[FP32] Epoch 2: loss=0.1504, val_acc=0.9583
[FP32] Epoch 3: loss=0.1004, val_acc=0.9684
[INT8] Validation accuracy after PTQ: 0.9682


# Now let's quantize a small FP16 LLM

In [91]:
def count_parameters_in_bits(model: nn.Module, bits_per_param: int = 32) -> int:
    """
    Approximate model size in *bits* assuming all parameters
    are stored with bits_per_param bits.

    This is a simplified view (ignores buffers like running stats).
    """
    total_params = sum(p.numel() for p in model.parameters())
    return total_params * bits_per_param

In [92]:
def print_model_size_estimates(model: nn.Module, name: str):
    bits_fp32 = count_parameters_in_bits(model, bits_per_param=32)
    bits_fp16 = count_parameters_in_bits(model, bits_per_param=16)
    bits_int8 = count_parameters_in_bits(model, bits_per_param=8)

    def to_mb(bits):
        return bits / 8 / (1024 ** 2)

    print(f"=== {name} size estimates ===")
    print(f"FP32: ~{to_mb(bits_fp32):.2f} MB")
    print(f"FP16: ~{to_mb(bits_fp16):.2f} MB")
    print(f"INT8: ~{to_mb(bits_int8):.2f} MB")
    print()
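
The estimates above assume every parameter is stored with the same number of bits. For the `QuantLinear` layer in this notebook, the weights live in int8 buffers while the per-channel `scale`, `zero_point`, and the bias stay in float32, so the real footprint is slightly larger than a pure 8-bit estimate. The back-of-the-envelope sketch below works this out for the three layers of the MNIST MLP; the helper names are assumptions for illustration.

```python
def linear_bytes_fp(in_f: int, out_f: int, bytes_per_value: int) -> int:
    """Bytes for a float Linear layer: weights plus bias."""
    return (in_f * out_f + out_f) * bytes_per_value


def linear_bytes_int8(in_f: int, out_f: int, meta_bytes: int = 4) -> int:
    """Bytes for a weight-only INT8 layer with per-channel metadata.

    Assumes int8 weights (1 byte each) and float32 scale, zero_point,
    and bias, each of length out_f.
    """
    return in_f * out_f * 1 + 3 * out_f * meta_bytes


layers = [(28 * 28, 256), (256, 128), (128, 10)]  # the MLP defined earlier
fp32_total = sum(linear_bytes_fp(i, o, 4) for i, o in layers)
int8_total = sum(linear_bytes_int8(i, o) for i, o in layers)
print(f"FP32: {fp32_total} bytes, INT8 weight-only: {int8_total} bytes, "
      f"ratio: {fp32_total / int8_total:.2f}x")
```

The per-channel metadata costs a few kilobytes here, so the saving stays close to, but slightly below, the ideal 4x over FP32.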


In [93]:
def recursively_quantize_linear_modules(module: nn.Module) -> nn.Module:
    """
    Recursively traverse the module and replace nn.Linear with QuantLinear.

    TODO:
        1. For each child that is an nn.Linear, replace it in-place
           with QuantLinear.from_fp_module(child).
        2. Recurse into submodules.

    Hint:
        - Use module.named_children() and setattr(module, name, new_child)
    """
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            # ====== YOUR CODE HERE ======
            qchild = QuantLinear.from_fp_module(child, weight_bits=8, per_channel=True).to(child.weight.device)
            setattr(module, name, qchild)
            # ====== END YOUR CODE =======
        else:
            recursively_quantize_linear_modules(child)
    return module


In [94]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "Qwen/Qwen3-1.7B"  # same checkpoint as in Part 1
print(f"Loading FP16 model {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
).to(device)

print_model_size_estimates(model_fp16, name="Original model (FP16)")

sum(p.numel() for p in model_fp16.parameters() if p.requires_grad)


Loading FP16 model Qwen/Qwen3-1.7B...



=== Original model (FP16) size estimates ===
FP32: ~6563.47 MB
FP16: ~3281.74 MB
INT8: ~1640.87 MB



1720574976

In [95]:

# Simple sanity check generation
prompt = "Once upon a time in a world of quantized models,"
print("=== FP16 sample ===")
print(text_generation_pipeline(prompt, model_fp16, tokenizer, max_new_tokens=100, enable_thinking=False)[0])


=== FP16 sample ===
Once upon a time in a world of quantized models,  
There lived a being named Alex,  
A curious mind, with a passion for code,  
And a deep love for the art of machine learning.

Alex was no ordinary programmer—  
They were a pioneer in the realm of quantized models,  
A bridge between the abstract and the real,  
Where precision and simplicity met.

In a world where models were vast and complex,  
Alex sought to make them light and true,  


In [96]:
sum(p.numel() for p in model_fp16.parameters() if p.requires_grad)

1720574976

In [101]:

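# ====== Quantize the LLM weights to INT8 (weight-only) ======
# Note: the replacement happens in place, so model_int8 and model_fp16
# refer to the same (now quantized) object.
print("Quantizing nn.Linear layers to INT8...")
model_int8 = recursively_quantize_linear_modules(model_fp16)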


Quantizing nn.Linear layers to INT8...


In [102]:
sum(p.numel() for p in model_int8.parameters() if p.requires_grad)

311288832

In [103]:

print("=== INT8 sample ===")
print(text_generation_pipeline(prompt, model_int8, tokenizer, max_new_tokens=100, enable_thinking=False)[0])


=== INT8 sample ===
Once upon a time in a world of quantized models, the landscape of artificial intelligence was reshaped by a revolutionary idea: **quantization**. Instead of working with continuous data, AI systems began to operate with **discrete, quantized representations**—a form of digital compression that reduced computational complexity and memory usage.

In this world, neural networks were no longer built on floating-point precision but on **integer arithmetic**, where weights and activations were stored as integers. This shift allowed for **f
