# [A-3] Running Quantization and Inference

*Make sure to restart the kernel before executing this notebook.*

In this notebook, we will quantize the model and run inference with the quantized model. Let's see how much the performance improves!

As in the previous notebook, set the model path and the `QUANT_CONFIG` environment variable.

In [None]:
import os

os.environ["QUANT_CONFIG"] = f"{os.getcwd()}/configs/quantize.json"
MODEL_PATH = "Qwen/Qwen3-8B"

This time, we use the config file for quantization, in which the mode is set to `QUANTIZE`. For this walkthrough we will use the `maxabs_hw` scaling method. For the other available scaling methods, see the details [here](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html?highlight=pt_hpu_weight_sharing#supported-json-config-file-options).
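
To confirm which settings the run will pick up, the file that `QUANT_CONFIG` points to can be inspected before the model is loaded. The sketch below assumes the config is plain JSON with top-level `mode` and `scale_method` keys, as described in the documentation linked above; `summarize_quant_config` is a hypothetical helper, not part of any library.

```python
import json
import os


def summarize_quant_config(path: str) -> dict:
    """Return the quantization mode and scaling method from a JSON config."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    # Key names are assumed from the linked documentation; missing keys map to None.
    return {key: config.get(key) for key in ("mode", "scale_method")}


# Only inspect the file if the environment variable is set and the file exists.
config_path = os.environ.get("QUANT_CONFIG")
if config_path and os.path.isfile(config_path):
    print(summarize_quant_config(config_path))
```

If the printed mode is not `QUANTIZE`, the environment variable is most likely still pointing at the calibration config from the previous notebook.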

Note that the model is not quantized yet. Quantization is performed while initializing the `LLM` instance, and the statistics collected during the calibration phase are loaded to apply it.

As we saw in the previous notebook, the `quantize` API is applied to the model while it loads. In `QUANTIZE` mode, the patched modules quantize their input tensors and use the low-precision kernels. The additional quantize and dequantize (QDQ) operations may seem to introduce type-casting overhead, but SynapseAI's graph compiler eliminates unnecessary consecutive QDQ operations whenever possible, so activation tensors remain in low precision throughout the workflow.
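
Why a back-to-back dequantize and quantize pair is redundant can be seen with a toy quantizer. The sketch below is an illustration only: it uses a clamped integer grid as a stand-in for FP8, and the function names, the scale value, and `qmax` are assumptions chosen for the example, not the kernels or scales used on Gaudi.

```python
def quantize(x: float, scale: float, qmax: int = 127) -> int:
    """Map a real value to the nearest point on a clamped integer grid."""
    return max(-qmax, min(qmax, round(x / scale)))


def dequantize(q: int, scale: float) -> float:
    """Map a grid point back to a real value."""
    return q * scale


scale = 0.05  # assumed per-tensor scale, for illustration only
activations = [0.12, -1.7, 3.3, 0.0]

# Output of one low-precision op, already on the quantized grid.
q_values = [quantize(x, scale) for x in activations]

# A dequantize followed immediately by a quantize with the same scale
# returns the same grid points, so the pair can be removed safely.
round_trip = [quantize(dequantize(q, scale), scale) for q in q_values]
print(q_values == round_trip)
```

Because the round trip leaves the quantized values unchanged, dropping the pair between two low-precision ops does not alter the result and avoids two casts.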

In [None]:
from vllm import LLM, SamplingParams

os.environ["PT_HPU_WEIGHT_SHARING"] = "0"

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.0)
llm = LLM(model=MODEL_PATH, quantization="inc", kv_cache_dtype="fp8_inc")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


You may see that the number of available blocks has increased compared to the baseline:

> [hpu_worker.py:243] Usable num_blocks: 7855 ...
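
Part of this increase comes from the `fp8_inc` KV cache: each cached element takes 1 byte instead of 2, so about twice as many blocks fit in the same amount of cache memory. The quantized weights also take less device memory, which can leave more room for the cache. A back-of-the-envelope sketch, assuming a Qwen3-8B-like shape (36 layers, 8 KV heads, head dimension 128) and a block size of 128 tokens; these values are illustrative assumptions, not read from the running model, and `kv_block_bytes` is a hypothetical helper that ignores any per-tensor scale storage.

```python
def kv_block_bytes(num_layers, num_kv_heads, head_dim, block_size, bytes_per_elem):
    """Bytes needed to store keys and values for one KV cache block."""
    # The factor 2 accounts for storing both the key and the value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * block_size * bytes_per_elem


# Assumed, illustrative shape and block size (not read from the running model).
shape = dict(num_layers=36, num_kv_heads=8, head_dim=128, block_size=128)

bf16_block = kv_block_bytes(**shape, bytes_per_elem=2)
fp8_block = kv_block_bytes(**shape, bytes_per_elem=1)

print(f"BF16 block: {bf16_block / 2**20:.1f} MiB")
print(f"FP8 block:  {fp8_block / 2**20:.1f} MiB")
print(f"Blocks that fit in the same memory: {bf16_block / fp8_block:.1f}x")
```

Comparing the `Usable num_blocks` line in both notebooks' logs shows how close the actual gain is to this estimate on your device.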

Now, define the same benchmarking helper as in the first notebook to ensure a fair comparison.

In [None]:
import time

from vllm import LLM, SamplingParams
from vllm.benchmarks.datasets import SampleRequest
from vllm.inputs import TextPrompt, TokensPrompt
from vllm.lora.request import LoRARequest
from vllm.outputs import RequestOutput
from vllm.sampling_params import BeamSearchParams


def run_vllm(
    llm: LLM,
    requests: list[SampleRequest],
    n: int,
    do_profile: bool,
    disable_detokenize: bool = False,
) -> tuple[float, list[RequestOutput] | None]:
    assert all(
        llm.llm_engine.model_config.max_model_len
        >= (request.prompt_len + request.expected_output_len)
        for request in requests
    ), (
        "Please ensure that max_model_len is greater than the sum of"
        " prompt_len and expected_output_len for all requests."
    )
    # Add the requests to the engine.
    prompts: list[TextPrompt | TokensPrompt] = []
    sampling_params: list[SamplingParams] = []
    for request in requests:
        prompt = (
            TokensPrompt(prompt_token_ids=request.prompt["prompt_token_ids"])
            if "prompt_token_ids" in request.prompt
            else TextPrompt(prompt=request.prompt)
        )
        if request.multi_modal_data:
            assert isinstance(request.multi_modal_data, dict)
            prompt["multi_modal_data"] = request.multi_modal_data
        prompts.append(prompt)

        sampling_params.append(
            SamplingParams(
                n=n,
                temperature=1.0,
                top_p=1.0,
                ignore_eos=True,
                max_tokens=request.expected_output_len,
                detokenize=not disable_detokenize,
            )
        )
    lora_requests: list[LoRARequest] | None = None

    use_beam_search = False

    outputs = None
    if not use_beam_search:
        start = time.perf_counter()
        if do_profile:
            llm.start_profile()
        outputs = llm.generate(
            prompts, sampling_params, lora_request=lora_requests, use_tqdm=True
        )
        if do_profile:
            llm.stop_profile()
        end = time.perf_counter()
    else:
        assert lora_requests is None, "BeamSearch API does not support LoRA"
        prompts = [request.prompt for request in requests]
        # output_len should be the same for all requests.
        output_len = requests[0].expected_output_len
        for request in requests:
            assert request.expected_output_len == output_len
        start = time.perf_counter()
        if do_profile:
            llm.start_profile()
        llm.beam_search(
            prompts,
            BeamSearchParams(
                beam_width=n,
                max_tokens=output_len,
                ignore_eos=True,
            ),
        )
        if do_profile:
            llm.stop_profile()
        end = time.perf_counter()
    return end - start, outputs

Now run the benchmark on the quantized model.

Compare the **Throughput (tokens/s)** with the baseline FP16/BF16 results from the first notebook. You should observe a significant improvement in performance.

In [None]:
from vllm.benchmarks.datasets import RandomDataset

requests = RandomDataset(
    dataset_path=None,
    random_seed=42,
).sample(
    tokenizer=llm.get_tokenizer(),
    input_len=512,
    output_len=512,
    num_requests=512,
)

In [None]:
elapsed_time, request_outputs = run_vllm(
    llm,
    requests,
    n=1,
    do_profile=False,
)

total_prompt_tokens = 0
total_output_tokens = 0
for ro in request_outputs:
    if not isinstance(ro, RequestOutput):
        continue
    total_prompt_tokens += (
        len(ro.prompt_token_ids) if ro.prompt_token_ids else 0
    )
    total_output_tokens += sum(len(o.token_ids) for o in ro.outputs if o)
total_num_tokens = total_prompt_tokens + total_output_tokens
print(f"Total num prompt tokens:  {total_prompt_tokens}")
print(f"Total num output tokens:  {total_output_tokens}")
print(
    f"Throughput: {len(requests) / elapsed_time:.2f} requests/s, "
    f"{total_num_tokens / elapsed_time:.2f} total tokens/s, "
    f"{total_output_tokens / elapsed_time:.2f} output tokens/s"
)
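
To make the comparison with the first notebook concrete, the two throughput figures can be reduced to a single ratio. A minimal sketch; `relative_speedup` is a hypothetical helper, and the numbers in the usage lines are placeholders rather than measured results.

```python
def relative_speedup(baseline_tokens_per_s: float, quantized_tokens_per_s: float) -> float:
    """Return how many times faster the quantized run is than the baseline."""
    if baseline_tokens_per_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return quantized_tokens_per_s / baseline_tokens_per_s


# Placeholder values only; substitute the total tokens/s printed by both notebooks.
baseline_tps = 1000.0
quantized_tps = 1500.0
print(f"Speedup: {relative_speedup(baseline_tps, quantized_tps):.2f}x")
```

Use the same metric from both runs (for example, total tokens/s in both cases) so that the ratio is meaningful.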

In addition to FP8 quantization, `vllm-gaudi` supports a variety of serving features, including MultiLoRA, Guided Decoding, Automatic Prefix Caching, and more. You can check the full list of supported features [here](https://docs.vllm.ai/projects/gaudi/en/latest/features/supported_features.html#supported-features_1).