# [A-2] Running Calibration

*Make sure to restart the kernel before executing this notebook.*

In this notebook, we will perform **calibration** for quantization. 

Calibration is a crucial step for Post-Training Quantization (PTQ). It involves running a representative dataset through the model to collect statistics (ranges) of activations and weights. These statistics are then used to calculate scale factors for quantization.
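To make the role of these statistics concrete, the sketch below shows how a max-abs range could be turned into a scale factor for FP8. It is a simplified illustration, not INC's implementation: the function names are ours, and the default `fp8_max=448.0` (the largest finite value of the OCP E4M3 format) is an assumption, since the usable range on a given accelerator may differ.

```python
def maxabs_scale(values, fp8_max=448.0):
    """Return a per-tensor scale so that max(|values|) maps to fp8_max."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, which would give a zero scale.
    return max_abs / fp8_max if max_abs > 0 else 1.0


def scale_and_clamp(values, scale, fp8_max=448.0):
    """Divide by the scale and clamp to the representable range (rounding omitted)."""
    return [max(-fp8_max, min(fp8_max, v / scale)) for v in values]


# Pretend these are activations observed while running calibration data.
activations = [0.5, -3.2, 7.0, -1.1]
scale = maxabs_scale(activations)
scaled = scale_and_clamp(activations, scale)
print(f"scale={scale:.6f}")
print(f"largest scaled magnitude={max(abs(v) for v in scaled):.1f}")
```

The largest observed magnitude lands exactly on the edge of the representable range, which is why the calibration data should be representative: an outlier that never appears during calibration would be clamped at inference time.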

In [9]:
import os

os.environ["HF_HUB_CACHE"] = "/workspace/models/hub"

## Intel Neural Compressor (INC)

[INC](https://github.com/intel/neural-compressor) is an open-source library for model compression developed by Intel. It provides implementations of calibration and quantization methods that can be easily applied to existing models. As it is developed primarily by Intel, it is highly optimized for Intel hardware, including the Gaudi series.

INC is closely integrated into `vllm-gaudi`, allowing calibration and quantization to be applied seamlessly through the vLLM APIs.

Let's start with calibration. The settings are specified in a JSON config file, whose path is passed through the `QUANT_CONFIG` environment variable.

In [2]:
import os

os.environ["QUANT_CONFIG"] = f"{os.getcwd()}/configs/measure.json"

Take a look at the config file. The mode must be set to `MEASURE` to perform calibration. We use the simple `maxabs` observer, which is the default. The calibration results are dumped under the `calibration_outputs` directory. You can also control which modules are quantized through `allowlist` and `blocklist`. In this walkthrough we leave both unspecified, so every available module is quantized.
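For reference, a measurement config of the kind described above might look like the following. This is a sketch built from the fields named in this walkthrough, not the actual contents of `configs/measure.json`; the `method` value, the `types`/`names` layout, and the exact `dump_stats_path` are assumptions, so check them against the INC documentation for your version.

```python
import json

# Hypothetical measurement config; the field values are illustrative assumptions.
measure_config = {
    "method": "HOOKS",
    "mode": "MEASURE",        # collect statistics instead of quantizing
    "observer": "maxabs",     # track the maximum absolute value per tensor
    "allowlist": {"types": [], "names": []},  # empty: no restriction
    "blocklist": {"types": [], "names": []},  # empty: nothing excluded
    "dump_stats_path": "./calibration_outputs/measure",  # assumed file prefix
}

# Print the JSON rather than writing it, so the real config file is not overwritten.
print(json.dumps(measure_config, indent=2))
```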

Initialize the vLLM engine with `quantization="inc"`. This triggers the Intel Neural Compressor (INC) integration to hook into the model execution and collect statistics based on the `QUANT_CONFIG`.

`PT_HPU_WEIGHT_SHARING=0` is required to free the full-precision weights from the device so that only the FP8 weights remain stored. We also disable warm-up for calibration via `VLLM_SKIP_WARMUP`.

In [3]:
from vllm import LLM, SamplingParams

os.environ["PT_HPU_WEIGHT_SHARING"] = "0"
os.environ["VLLM_SKIP_WARMUP"] = "true"

llm = LLM(
    model="Qwen/Qwen3-8B",
    quantization="inc",
    max_model_len=2048,
    distributed_executor_backend="mp",
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)



INFO 12-15 07:46:07 [__init__.py:40] Available plugins for group vllm.platform_plugins:
INFO 12-15 07:46:07 [__init__.py:42] - hpu -> vllm_gaudi:register
INFO 12-15 07:46:07 [__init__.py:45] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 12-15 07:46:07 [__init__.py:217] Platform plugin hpu is activated


  ret = original_fn(*args, **kwargs)


INFO 12-15 07:46:09 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 12-15 07:46:09 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 12-15 07:46:10 [utils.py:253] non-default args: {'max_model_len': 2048, 'distributed_executor_backend': 'mp', 'disable_log_stats': True, 'quantization': 'inc', 'model': 'Qwen/Qwen3-8B'}
INFO 12-15 07:46:12 [model.py:631] Resolved architecture: Qwen3ForCausalLM


Parse safetensors files: 100%|██████████| 5/5 [00:00<00:00, 10.82it/s]

INFO 12-15 07:46:13 [model.py:1745] Using max model len 2048



2025-12-15 07:46:15,099	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


INFO 12-15 07:46:15 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 12-15 07:46:15 [platform.py:130] [HPU] Forcing CompilationMode.NONE compilation mode




INFO 12-15 07:46:19 [__init__.py:40] Available plugins for group vllm.platform_plugins:
INFO 12-15 07:46:19 [__init__.py:42] - hpu -> vllm_gaudi:register
INFO 12-15 07:46:19 [__init__.py:45] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 12-15 07:46:19 [__init__.py:217] Platform plugin hpu is activated
INFO 12-15 07:46:20 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 12-15 07:46:20 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_DP0 pid=8628) INFO 12-15 07:46:21 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='Qwen/Qwen3-8B', speculative_config=None, tokenizer='Qwen/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=No



INFO 12-15 07:46:24 [__init__.py:40] Available plugins for group vllm.platform_plugins:
INFO 12-15 07:46:24 [__init__.py:42] - hpu -> vllm_gaudi:register
INFO 12-15 07:46:24 [__init__.py:45] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 12-15 07:46:24 [__init__.py:217] Platform plugin hpu is activated
INFO 12-15 07:46:26 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 12-15 07:46:26 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 12-15 07:46:26 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:57017 backend=hccl
INFO 12-15 07:46:26 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0


 PT_HPU_LAZY_MODE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024,false
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
 PT_HPU_ENABLE_LAZY_COLLECTIVES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 2015 GB
------------------------------------------------------------------------------
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).


INFO 12-15 07:46:29 [runtime.py:28] Environment:
INFO 12-15 07:46:29 [runtime.py:32]     hw: gaudi2
INFO 12-15 07:46:29 [runtime.py:32]     build: 1.22.2.32
INFO 12-15 07:46:29 [runtime.py:32]     engine_version: v1
INFO 12-15 07:46:29 [runtime.py:32]     bridge_mode: eager
INFO 12-15 07:46:29 [runtime.py:32]     model_type: qwen3
INFO 12-15 07:46:29 [runtime.py:32]     prefix_caching: True
INFO 12-15 07:46:29 [runtime.py:32]     vllm_gaudi_commit: Error getting commit hash
INFO 12-15 07:46:29 [runtime.py:28] Features:
INFO 12-15 07:46:29 [runtime.py:32]     fp32_alibi_biases: True
INFO 12-15 07:46:29 [runtime.py:32]     fp32_softmax: False
INFO 12-15 07:46:29 [runtime.py:32]     fused_block_softmax_adjustment: False
INFO 12-15 07:46:29 [runtime.py:32]     fused_block_softmax: False
INFO 12-15 07:46:29 [runtime.py:32]     prompt_attn_impl: fsdpa_impl
INFO 12-15 07:46:29 [runtime.py:32]     skip_warmup: True
INFO 12-15 07:46:29 [runtime.py:32]     merged_prefill: False
INFO 12-15 07:46:

Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:07<00:00,  1.48s/it]

(Worker pid=8790) INFO 12-15 07:46:40 [default_loader.py:314] Loading weights took 7.45 seconds
(Worker pid=8790) INFO 12-15 07:46:40 [hpu_model_runner.py:3578] Loading model weights took 0.0000 GB
(Worker pid=8790) INFO 12-15 07:46:40 [hpu_model_runner.py:3581] Preparing model with INC..
(Worker pid=8790) 2025-12-15 07:46:41 [INFO][hpu_model_runner.py:3588] Preparation started.
(Worker pid=8790) 2025-12-15 07:46:41 [INFO][quantize.py:175] Start to prepare model with fp8_quant.
(Worker pid=8790) 2025-12-15 07:46:48 [INFO][hpu_model_runner.py:3588] Preparation end.
(Worker pid=8790)         [TO BE DEPRECATED] Please use hpu_inference_initialize instead
(Worker pid=8790)         If mark_only_scales_as_const=True or mark_scales=True, then only scales are marked as const.
(Worker pid=8790)         If mark_non_scales=True, then non scale tensors are marked as constants.
(Worker pid=8790)         By default mark_only_scales_as_const=False, mark_scales=True, mark_non_scales=True
(Worker pid=8790) INFO 12-15 07:46:48 [hpu_model_runner.py:3598] Preparing model with INC took 15.2662 GB
(Worker pid=8790) INFO 12-15 07:46:48 [hpu_model_runner.py:3617] Wrapping in HPUGraph took 0.0000 GB
(Worker pid=8790) INFO 12-15 07:46:49 [hpu_model_runner.py:3645] Compilation took 0.0000 GB
(Worker pid=8790) INFO 12-15 07:46:49 [hpu_worker.py:200] Model profi

What happens under the hood? Below is a code snippet from `HPUModelRunner`. When the runner loads the model, it applies the INC APIs according to the current mode. In calibration mode, the `prepare` function configures the model by replacing the target modules with the corresponding patched versions defined in INC. These patched modules include forward hooks that collect statistics during the model's forward pass; the statistics are then stored in a file for later use in the `QUANTIZE` step.

In [4]:
'''
from neural_compressor.torch.quantization import (FP8Config, convert, prepare)

config = FP8Config.from_json_file(os.getenv("QUANT_CONFIG", ""))

if config.measure:
    self.model = prepare(self.model, config)
elif config.quantize:
    self.model = convert(self.model, config)
'''

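The patching idea can be illustrated with a small framework-free toy. The names `MaxAbsObserver` and `patch_with_observer` are hypothetical and are not part of INC, whose patched modules operate on PyTorch modules and tensors; the sketch only shows how a wrapper can record a running max-abs of its inputs while leaving the output unchanged.

```python
class MaxAbsObserver:
    """Tracks the running maximum absolute value seen across calls."""

    def __init__(self):
        self.max_abs = 0.0

    def update(self, values):
        # default=0.0 keeps an empty input from raising.
        batch_max = max((abs(v) for v in values), default=0.0)
        self.max_abs = max(self.max_abs, batch_max)


def patch_with_observer(layer_fn):
    """Wrap a layer function so each call records input statistics first."""
    observer = MaxAbsObserver()

    def patched(values):
        observer.update(values)   # measurement side effect
        return layer_fn(values)   # original computation is unchanged

    patched.observer = observer
    return patched


def double(values):
    """Stand-in for a real layer."""
    return [2 * v for v in values]


patched_double = patch_with_observer(double)
out_a = patched_double([0.5, -1.5])
out_b = patched_double([3.0, -0.25])
print(patched_double.observer.max_abs)
```

After the two calls, the observer holds the largest magnitude seen over both batches, which is the kind of per-tensor statistic a `maxabs` observer accumulates over the calibration set.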

We need a calibration dataset. We'll use a subset of the **Pile-10k** dataset, keeping only samples that are long enough to provide meaningful activation statistics.

In [5]:
from tqdm import tqdm

from datasets import load_dataset
from transformers import AutoTokenizer

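def get_dataset_prompts(num_samples, least_tokens):
    """Return token-ID lists for `num_samples` prompts, each truncated to `least_tokens` tokens."""
    print(f"Loading {num_samples} samples...")
    dataset = load_dataset("NeelNanda/pile-10k", split="train")
    dataset = dataset.shuffle(seed=42)  # fixed seed for a reproducible subset

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
    samples = []

    # Keep only texts with at least `least_tokens` tokens, so that every
    # selected sample can be truncated to the same length below.
    for data in tqdm(dataset):
        prompt = data["text"]
        tokens = tokenizer(prompt, return_tensors="pt")
        if len(tokens.input_ids[0]) < least_tokens:
            continue
        samples.append(prompt)
        if len(samples) >= num_samples:
            break

    # Re-tokenize with truncation and convert to plain Python lists,
    # which is the form passed to vLLM as `prompt_token_ids`.
    prompt_token_ids = []
    for prompt in samples:
        tokens = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=least_tokens)
        prompt_token_ids.append(tokens.input_ids[0].tolist())

    return prompt_token_ids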

Now we run inference on the calibration dataset. The `llm.generate` call passes the data through the model, and the hooks (configured via `QUANT_CONFIG`) record the tensor statistics.

In [6]:
MAX_DATASET_SAMPLES = 128
SAMPLE_LEN = 1024

prompt_token_ids = get_dataset_prompts(
    MAX_DATASET_SAMPLES, SAMPLE_LEN
)
input_batch = [{"prompt_token_ids": p} for p in prompt_token_ids]

outputs = llm.generate(input_batch, sampling_params, use_tqdm=True)

Loading 128 samples...


Generating train split: 100%|██████████| 10000/10000 [00:00<00:00, 39021.14 examples/s]
  6%|▌         | 610/10000 [00:01<00:22, 417.86it/s]
Adding requests: 100%|██████████| 128/128 [00:00<00:00, 7779.72it/s]
Processed prompts: 100%|██████████| 128/128 [00:39<00:00,  3.27it/s, est. speed input: 3351.24 toks/s, output: 104.73 toks/s]


After calibration is complete, we release the resources.

In [7]:
del llm

(Worker pid=8790) INFO 12-15 07:47:39 [multiproc_executor.py:702] Parent process exited, terminating worker


This step is required to save the statistics to the specified path, because the API that saves the results, `finalize_calibration`, is currently called from the runner's destructor, as shown below. You can now see the calibration results under the `calibration_outputs` directory.

In [8]:
'''
def shutdown_inc(self):
    can_finalize_inc = self._is_quant_with_inc() and \
        (self.model.model is not None) and \
        self.inc_initialized_successfully and \
        not self._is_inc_finalized
    if can_finalize_inc:
        from neural_compressor.torch.quantization import (finalize_calibration)
        finalize_calibration(self.model.model)
        self._is_inc_finalized = True

def __del__(self):
    self.shutdown_inc()
'''

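To confirm that the statistics were written, the output directory can be listed as sketched here. This assumes the results are under `calibration_outputs` relative to the current working directory; the helper name `list_calibration_files` is ours, and the file names and formats it finds depend on the INC version.

```python
from pathlib import Path


def list_calibration_files(root="calibration_outputs"):
    """Return (relative path, size in bytes) for every file under root."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []  # nothing was dumped, or the path is different
    return sorted(
        (str(p.relative_to(root_path)), p.stat().st_size)
        for p in root_path.rglob("*")
        if p.is_file()
    )


for name, size in list_calibration_files():
    print(f"{name}: {size} bytes")
```

An empty listing right after `del llm` may mean the destructor has not run yet or that the dump path in the config points elsewhere.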