# [A-2] Running Calibration

*Make sure to restart the kernel before executing this notebook.*

In this notebook, we will perform **calibration** for quantization. 

Calibration is a crucial step for Post-Training Quantization (PTQ). It involves running a representative dataset through the model to collect statistics (ranges) of activations and weights. These statistics are then used to calculate scale factors for quantization.
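
To make the idea concrete, here is a minimal sketch of how a max-abs statistic could be turned into a scale factor. It is illustrative only and is not INC's implementation. The constant `FP8_E4M3_MAX = 448.0` is an assumption (the OCP E4M3 maximum; the usable range depends on the device), and the activation values are made up.

```python
# Minimal sketch of maxabs calibration (illustrative only, not INC's implementation).
FP8_E4M3_MAX = 448.0  # assumption: OCP E4M3 maximum; some devices use a different range


def update_maxabs(running_max, batch):
    """Return the largest absolute value seen so far."""
    return max(running_max, max(abs(x) for x in batch))


def scale_from_maxabs(maxabs, fp8_max=FP8_E4M3_MAX):
    """Scale that maps the observed range [-maxabs, maxabs] onto the FP8 range."""
    return maxabs / fp8_max


# Hypothetical activation values from three calibration batches.
batches = [[0.5, -1.25, 3.0], [-7.0, 2.0], [4.5, -0.1]]

running_max = 0.0
for batch in batches:
    running_max = update_maxabs(running_max, batch)

scale = scale_from_maxabs(running_max)
print(running_max)  # 7.0
print(scale)        # 0.015625
```

The calibration data only needs to be representative enough that the observed maximum is close to what the model will see at inference time.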

In [None]:
MODEL_PATH = "/home/work/Qwen3-8B"

## Intel Neural Compressor (INC)

[INC](https://github.com/intel/neural-compressor) is an open-source library for model compression developed by Intel. It provides implementations of calibration and quantization methods that can be easily applied to existing models. As it is developed primarily by Intel, it is highly optimized for Intel hardware, including the Gaudi series.

INC is closely integrated into `vllm-gaudi`, allowing calibration and quantization to be applied seamlessly through the vLLM APIs.

Let's start with calibration. The calibration settings are read from a config file, whose path is passed through the `QUANT_CONFIG` environment variable.

In [None]:
import os

os.environ["QUANT_CONFIG"] = f"{os.getcwd()}/configs/measure.json"

Take a look at the config file. The mode must be set to `MEASURE` to perform calibration. We use the simple `maxabs` observer, which is the default. The calibration results are dumped under the `calibration_outputs` directory. You can also control which modules are quantized through `allowlist` and `blocklist`. In this walkthrough we leave both unspecified, so every available module is quantized.
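
For orientation, a measurement config matching the description above might look like the following sketch. The key names follow the layout commonly shown in INC FP8 examples and should be treated as assumptions; the actual `configs/measure.json` may differ, and the `dump_stats_path` value is a hypothetical prefix.

```python
import json

# Sketch of a measurement-mode config; the real configs/measure.json may differ.
measure_config = {
    "method": "HOOKS",
    "mode": "MEASURE",        # calibration (statistics collection) mode
    "observer": "maxabs",     # track the maximum absolute value per tensor
    "allowlist": {"types": [], "names": []},  # empty: no restriction
    "blocklist": {"types": [], "names": []},  # empty: nothing excluded
    "dump_stats_path": "./calibration_outputs/measure",  # assumed output prefix
}

print(json.dumps(measure_config, indent=4))
```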

Initialize the vLLM engine with `quantization="inc"`. This triggers the Intel Neural Compressor (INC) integration to hook into the model execution and collect statistics based on the `QUANT_CONFIG`.

`PT_HPU_WEIGHT_SHARING=0` is required to free the full-precision weights from the device and ensure that only the FP8 weights are stored. We also skip warm-up during calibration by setting `VLLM_SKIP_WARMUP=true`.

In [None]:
from vllm import LLM, SamplingParams

os.environ["PT_HPU_WEIGHT_SHARING"] = "0"
os.environ["VLLM_SKIP_WARMUP"] = "true"

llm = LLM(
    model=MODEL_PATH,
    quantization="inc",
    max_model_len=2048,
    distributed_executor_backend="mp",
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

What happens under the hood? Below is a code snippet from `HPUModelRunner`. When the runner loads the model, the INC APIs are applied according to the current mode. In calibration mode, the `prepare` function configures the model by replacing target modules with their corresponding patched versions defined in INC. These patched modules include forward hooks that collect statistics during the model’s forward pass. The statistics are then stored in a file for later use in the `QUANTIZE` step.

In [None]:
'''
from neural_compressor.torch.quantization import (FP8Config, convert, prepare)

config = FP8Config.from_json_file(os.getenv("QUANT_CONFIG", ""))

if config.measure:
    self.model = prepare(self.model, config)
elif config.quantize:
    self.model = convert(self.model, config)
'''
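
The patch-and-observe pattern described above can be sketched in plain Python. This is a toy illustration, not INC's code: `MaxAbsObserver`, `PatchedModule`, and `prepare_sketch` are hypothetical names, and the "model" is just a dict of callables standing in for real modules.

```python
class MaxAbsObserver:
    """Tracks the largest absolute input value seen across calls."""

    def __init__(self):
        self.maxabs = 0.0

    def observe(self, values):
        self.maxabs = max(self.maxabs, max(abs(v) for v in values))


class PatchedModule:
    """Wraps a callable and records input statistics before delegating to it."""

    def __init__(self, module, observer):
        self.module = module
        self.observer = observer

    def __call__(self, values):
        self.observer.observe(values)  # plays the role of a forward hook
        return self.module(values)


def prepare_sketch(model):
    """Replace every layer in `model` (name -> callable) with a patched wrapper."""
    observers = {}
    for name, module in model.items():
        observers[name] = MaxAbsObserver()
        model[name] = PatchedModule(module, observers[name])
    return model, observers


# Toy two-layer "model" and two calibration batches.
model = {
    "double": lambda xs: [2 * x for x in xs],
    "negate": lambda xs: [-x for x in xs],
}
model, observers = prepare_sketch(model)

for batch in ([1.0, -3.0], [0.5, 2.0]):
    hidden = model["double"](batch)
    model["negate"](hidden)

stats = {name: obs.maxabs for name, obs in observers.items()}
print(stats)  # {'double': 3.0, 'negate': 6.0}
```

The wrapped layers still compute the same outputs; only the side effect of recording statistics is added.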

We need a calibration dataset. We'll use a subset of the **Pile-10k** dataset and keep only samples that are long enough to provide meaningful activation statistics.

In [None]:
from tqdm import tqdm

from datasets import load_dataset
from transformers import AutoTokenizer

def get_dataset_prompts(num_samples, least_tokens):
    """Return token IDs for `num_samples` prompts, each truncated to `least_tokens` tokens."""
    print(f"Loading {num_samples} samples...")
    dataset = load_dataset("NeelNanda/pile-10k", split="train")
    dataset = dataset.shuffle(seed=42)  # fixed seed for a reproducible subset

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    samples = []

    # Keep only texts that contain at least `least_tokens` tokens.
    for data in tqdm(dataset):
        prompt = data["text"]
        tokens = tokenizer(prompt, return_tensors="pt")
        if len(tokens.input_ids[0]) < least_tokens:
            continue
        samples.append(prompt)
        if len(samples) >= num_samples:
            break

    # Truncate every kept text to `least_tokens` tokens so all prompts have the same length.
    prompt_token_ids = []
    for prompt in samples:
        tokens = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=least_tokens)
        prompt_token_ids.append(tokens.input_ids[0].tolist())

    return prompt_token_ids
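
The function above tokenizes each kept sample twice. A possible variant tokenizes once and slices the result, as sketched below. `select_and_truncate` and `toy_tokenize` are hypothetical names; the toy tokenizer (word lengths as IDs) only stands in for a real one so the logic can be run in isolation. Slicing matches tokenizer-side truncation only if the tokenizer does not append special tokens, which is an assumption here.

```python
def select_and_truncate(texts, tokenize, num_samples, least_tokens):
    """Keep texts with at least `least_tokens` tokens, truncated to that length."""
    selected = []
    for text in texts:
        token_ids = tokenize(text)  # tokenize each text only once
        if len(token_ids) < least_tokens:
            continue  # too short to be a useful calibration sample
        selected.append(token_ids[:least_tokens])
        if len(selected) >= num_samples:
            break
    return selected


def toy_tokenize(text):
    """Stand-in tokenizer: one 'token ID' per whitespace-separated word."""
    return [len(word) for word in text.split()]


texts = [
    "too short",
    "this one is long enough",
    "another sufficiently long sample here",
]
print(select_and_truncate(texts, toy_tokenize, num_samples=2, least_tokens=4))
# [[4, 3, 2, 4], [7, 12, 4, 6]]
```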

Now we run inference on the calibration dataset. The `llm.generate` call will pass data through the model, and the hooks (configured via `QUANT_CONFIG`) will record the statistics of tensors.

In [None]:
MAX_DATASET_SAMPLES = 128
SAMPLE_LEN = 1024

prompt_token_ids = get_dataset_prompts(
    MAX_DATASET_SAMPLES, SAMPLE_LEN
)
input_batch = [{"prompt_token_ids": p} for p in prompt_token_ids]

outputs = llm.generate(input_batch, sampling_params, use_tqdm=True)

After calibration is complete, we release the resources.

In [None]:
del llm

This step is required to save the statistics to the specified path, because the API that saves the results, `finalize_calibration`, is currently called from the runner’s destructor, as shown below. You can now find the calibration results under the `calibration_outputs` directory.

In [None]:
'''
def shutdown_inc(self):
    can_finalize_inc = self._is_quant_with_inc() and \
        (self.model.model is not None) and \
        self.inc_initialized_successfully and \
        not self._is_inc_finalized
    if can_finalize_inc:
        from neural_compressor.torch.quantization import (finalize_calibration)
        finalize_calibration(self.model.model)
        self._is_inc_finalized = True

def __del__(self):
    self.shutdown_inc()
'''
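
To confirm that the statistics were written, you can list the files in the output directory. The helper below is a small sketch using only the standard library; `list_calibration_files` is a hypothetical name, and the exact file names and formats produced by INC are not assumed here.

```python
from pathlib import Path


def list_calibration_files(output_dir="calibration_outputs"):
    """Return sorted (relative path, size in bytes) pairs for files under `output_dir`."""
    root = Path(output_dir)
    if not root.is_dir():
        return []  # nothing was dumped, or the path is wrong
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


for name, size in list_calibration_files():
    print(f"{name}: {size} bytes")
```

An empty listing after `del llm` suggests that the destructor has not run yet or that the dump path in the config points elsewhere.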