# [A-1] Running vLLM with Gaudi

## What you will do
In this walkthrough, we will run a simple vLLM inference and apply FP8 quantization on Intel® Gaudi® accelerators (HPUs) with Intel® Neural Compressor (INC).

- **Stage 1**: run basic vLLM inference and benchmark a baseline model
- **Stage 2**: run calibration and see how INC is integrated into `vllm-gaudi`
- **Stage 3**: run quantization and benchmark the quantized model

`vllm-gaudi` is built on the **vLLM Hardware Plugin** mechanism, which lets vLLM use Gaudi devices seamlessly. Because of this integration, the Python interface remains almost the same as when running vLLM on GPUs.

> The vLLM Hardware Plugin for Intel® Gaudi® is a community-driven integration layer that enables efficient, high-performance large language model (LLM) inference on Intel® Gaudi® AI accelerators.
> 
> The vLLM Hardware Plugin for Intel® Gaudi® connects the vLLM serving engine with Intel® Gaudi® hardware, offering optimized inference capabilities for enterprise-scale LLM workloads. It is developed and maintained by the Intel® Gaudi® team and follows the hardware pluggable RFC and vLLM plugin architecture RFC for modular integration.

You may have seen a [forked repository](https://github.com/HabanaAI/vllm-fork) for Gaudi, which was the initial implementation of vLLM for Gaudi. As vLLM moves to v1, backend support is now expected to be implemented as a plugin, which is why the transition to `vllm-gaudi` is underway. Because this is still a transitional phase, some advanced features may currently be available only in the forked repository. However, the fork is expected to be deprecated in the long term.

`vllm-gaudi` is installed in your environment for this hands-on workshop. Check if it's properly installed.

In [None]:
!pip list | grep vllm_gaudi

## Running offline inference

Set the `HF_HUB_CACHE` environment variable to point to the pre-downloaded hub path.

In [None]:
MODEL_PATH = "/home/work/Qwen3-8B"
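
If `HF_HUB_CACHE` is not already set in your shell, one way to set it from the notebook is sketched below. The path is a placeholder (an assumption, not the workshop's real location), and the variable should be set before `vllm` or `huggingface_hub` is imported so that it is picked up.

```python
import os

# Placeholder path (assumption): replace with the directory that holds the
# pre-downloaded hub cache in your environment.
HUB_CACHE_DIR = "/path/to/pre-downloaded/hub"

# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("HF_HUB_CACHE", HUB_CACHE_DIR)
print("HF_HUB_CACHE =", os.environ["HF_HUB_CACHE"])
```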

In [None]:
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.0)
llm = LLM(model=MODEL_PATH)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


### KV Cache and Memory Blocks

vLLM manages memory using **PagedAttention**, which allocates Key-Value (KV) cache in non-contiguous memory blocks. Check the logs for the number of usable blocks:

> `[hpu_worker.py:243] Usable num_blocks: 3655 ...`

This number is calculated from the free HPU memory that remains after the profile run.
*   **More free memory = More blocks = Larger batch size / context length capability.**

Keep a note of this number. Later, when we use a **quantized model** (which is smaller), we expect to see a higher number of available blocks, allowing for even higher throughput.
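
To get a feel for what the block count means, the sketch below converts it into token capacity and approximate cache size. The block size of 128 tokens and the Qwen3-8B shape (36 layers, 8 KV heads, head dimension 128, 2-byte BF16 cache) are assumptions; read the real values from your engine logs and the model's `config.json`. `kv_bytes_per_token` is a hypothetical helper.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int
) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


num_blocks = 3655  # taken from the example log line above
block_size = 128   # assumption: tokens per KV cache block

token_capacity = num_blocks * block_size
# Assumed Qwen3-8B shape with a BF16 (2-byte) KV cache.
per_token = kv_bytes_per_token(
    num_layers=36, num_kv_heads=8, head_dim=128, dtype_bytes=2
)
cache_gib = token_capacity * per_token / 2**30

print(f"Token capacity:        {token_capacity:,}")
print(f"KV bytes per token:    {per_token:,}")
print(f"Approx. KV cache size: {cache_gib:.1f} GiB")
```

Re-running this with the block count from the quantized run gives a quick way to compare the two configurations.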

### Graph Compilation and Warm-up

During the initialization of the `llm` instance, you will observe "warm-up" logs similar to:
> `Prompt warmup processing: 100%|██████████| 121/121`  
> `Decode warmup processing: 100%|██████████| 37/37`

`vllm-gaudi` implements `HPUBucketingManager` to precompile graphs for different input shapes. While this warm-up process introduces some overhead during runner initialization, it enables faster inference by ensuring proper graph handling without on-the-fly recompilation. You can disable the warm-up process by setting the environment variable `VLLM_SKIP_WARMUP=True`.
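
The idea behind bucketing can be sketched in a few lines: each incoming length is rounded up to the nearest precompiled bucket and padded to it, so the number of distinct graphs stays bounded. This is an illustration only. `make_buckets` and `pick_bucket` are hypothetical helpers, and the bucket boundaries are made up; they do not reproduce the actual `HPUBucketingManager` logic.

```python
import bisect


def make_buckets(min_len: int, step: int, max_len: int) -> list[int]:
    """Build an ascending list of bucket sizes: min_len, min_len + step, ..."""
    return list(range(min_len, max_len + 1, step))


def pick_bucket(buckets: list[int], length: int) -> int:
    """Return the smallest bucket that can hold `length` tokens."""
    idx = bisect.bisect_left(buckets, length)
    if idx == len(buckets):
        raise ValueError(f"length {length} exceeds the largest bucket {buckets[-1]}")
    return buckets[idx]


# Illustrative boundaries (assumption), not the plugin's real configuration.
seq_buckets = make_buckets(min_len=128, step=128, max_len=1024)

for seq_len in (5, 128, 300, 1000):
    padded = pick_bucket(seq_buckets, seq_len)
    print(f"seq_len={seq_len:4d} -> bucket={padded:4d} (padding={padded - seq_len})")
```

Warm-up compiles one graph per bucket ahead of time, which is why the progress bars count a fixed number of shapes instead of growing with traffic.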

As you can see from the compilation log, prefill graphs and decode graphs are compiled separately. Although this exercise runs prefill and decode separately, `vllm-gaudi` has recently added support for mixed batches and unified attention. More details are available [here](https://github.com/vllm-project/vllm-gaudi/blob/v0.11.2/docs/features/unified_attn.md#unifiedmixed-batches).

## Running a benchmark with the baseline model

Now that the engine is running, let's establish a performance baseline.

We will define a helper function, `run_vllm`, to execute a benchmark using the loaded engine. This function processes a list of requests and measures the execution time, which we will compare against the quantized model in future steps.

In [None]:
import time

from vllm import LLM, SamplingParams
from vllm.benchmarks.datasets import SampleRequest
from vllm.inputs import TextPrompt, TokensPrompt
from vllm.lora.request import LoRARequest
from vllm.outputs import RequestOutput
from vllm.sampling_params import BeamSearchParams


def run_vllm(
    llm: LLM,
    requests: list[SampleRequest],
    n: int,
    do_profile: bool,
    disable_detokenize: bool = False,
) -> tuple[float, list[RequestOutput] | None]:
    assert all(
        llm.llm_engine.model_config.max_model_len
        >= (request.prompt_len + request.expected_output_len)
        for request in requests
    ), (
        "Please ensure that max_model_len is at least the sum of"
        " prompt_len and expected_output_len for all requests."
    )
    # Add the requests to the engine.
    prompts: list[TextPrompt | TokensPrompt] = []
    sampling_params: list[SamplingParams] = []
    for request in requests:
        prompt = (
            TokensPrompt(prompt_token_ids=request.prompt["prompt_token_ids"])
            if "prompt_token_ids" in request.prompt
            else TextPrompt(prompt=request.prompt)
        )
        if request.multi_modal_data:
            assert isinstance(request.multi_modal_data, dict)
            prompt["multi_modal_data"] = request.multi_modal_data
        prompts.append(prompt)

        sampling_params.append(
            SamplingParams(
                n=n,
                temperature=1.0,
                top_p=1.0,
                ignore_eos=True,
                max_tokens=request.expected_output_len,
                detokenize=not disable_detokenize,
            )
        )
    lora_requests: list[LoRARequest] | None = None

    use_beam_search = False

    outputs = None
    if not use_beam_search:
        start = time.perf_counter()
        if do_profile:
            llm.start_profile()
        outputs = llm.generate(
            prompts, sampling_params, lora_request=lora_requests, use_tqdm=True
        )
        if do_profile:
            llm.stop_profile()
        end = time.perf_counter()
    else:
        assert lora_requests is None, "BeamSearch API does not support LoRA"
        prompts = [request.prompt for request in requests]
        
        output_len = requests[0].expected_output_len
        for request in requests:
            assert request.expected_output_len == output_len
        start = time.perf_counter()
        if do_profile:
            llm.start_profile()
        llm.beam_search(
            prompts,
            BeamSearchParams(
                beam_width=n,
                max_tokens=output_len,
                ignore_eos=True,
            ),
        )
        if do_profile:
            llm.stop_profile()
        end = time.perf_counter()
    return end - start, outputs

We will use `RandomDataset` to create a synthetic dataset.

*   `input_len=512`: The number of tokens in the prompt.
*   `output_len=512`: The number of tokens to generate.
*   `num_requests=512`: Total number of requests to process.

In [None]:
from vllm.benchmarks.datasets import RandomDataset

requests = RandomDataset(
    dataset_path=None,
    random_seed=42,
).sample(
    tokenizer=llm.get_tokenizer(),
    input_len=512,
    output_len=512,
    num_requests=512,
)
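
The dataset parameters above fix the size of the workload, which can be worked out by hand. The sketch below does the arithmetic and relates it to the KV cache. The block size of 128 tokens is an assumption, and `num_blocks` reuses the example value from the sample log line earlier; substitute the numbers from your own run.

```python
import math

input_len, output_len, num_requests = 512, 512, 512

# Each request must fit within max_model_len (the check inside run_vllm).
tokens_per_request = input_len + output_len
total_tokens = num_requests * tokens_per_request

block_size = 128   # assumption: tokens per KV cache block
num_blocks = 3655  # example value from the sample log line

# Blocks needed for one request at full length, and for all requests at once.
blocks_per_request = math.ceil(tokens_per_request / block_size)
blocks_if_all_concurrent = num_requests * blocks_per_request

print(f"Tokens per request:         {tokens_per_request}")
print(f"Total tokens in workload:   {total_tokens:,}")
print(f"Blocks per request:         {blocks_per_request}")
print(f"Blocks if fully concurrent: {blocks_if_all_concurrent} (available: {num_blocks})")
```

Under these assumptions the full workload would need more blocks than are available at once, so some requests wait in the queue instead of all running concurrently. The engine's own concurrency limits may cap the batch even earlier.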

Now we run the benchmark using our `run_vllm` helper. This will report:

*   **Requests/s**: How many requests the engine completes per second.
*   **Total tokens/s**: Prompt and generated tokens processed per second.
*   **Output tokens/s**: The speed of token generation.

In [None]:
elapsed_time, request_outputs = run_vllm(
    llm,
    requests,
    n=1,
    do_profile=False,
)

total_prompt_tokens = 0
total_output_tokens = 0
for ro in request_outputs:
    if not isinstance(ro, RequestOutput):
        continue
    total_prompt_tokens += (
        len(ro.prompt_token_ids) if ro.prompt_token_ids else 0
    )
    total_output_tokens += sum(len(o.token_ids) for o in ro.outputs if o)
total_num_tokens = total_prompt_tokens + total_output_tokens
print(f"Total num prompt tokens:  {total_prompt_tokens}")
print(f"Total num output tokens:  {total_output_tokens}")
print(
    f"Throughput: {len(requests) / elapsed_time:.2f} requests/s, "
    f"{total_num_tokens / elapsed_time:.2f} total tokens/s, "
    f"{total_output_tokens / elapsed_time:.2f} output tokens/s"
)
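
Because the same three ratios will be needed again for the quantized model, it can be convenient to wrap them in a small function. The sketch below is one possible shape for such a helper; `summarize_throughput` is a hypothetical name, and the elapsed time in the usage example is a placeholder, not a measured result.

```python
def summarize_throughput(
    num_requests: int,
    prompt_tokens: int,
    output_tokens: int,
    elapsed_s: float,
) -> dict[str, float]:
    """Compute the three throughput metrics from raw counts and wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return {
        "requests_per_s": num_requests / elapsed_s,
        "total_tokens_per_s": (prompt_tokens + output_tokens) / elapsed_s,
        "output_tokens_per_s": output_tokens / elapsed_s,
    }


# Placeholder inputs for illustration only: 512 requests of 512 + 512 tokens,
# with an arbitrary elapsed time of 100 seconds.
example = summarize_throughput(
    num_requests=512,
    prompt_tokens=512 * 512,
    output_tokens=512 * 512,
    elapsed_s=100.0,
)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

In the notebook, the real inputs would be `len(requests)`, `total_prompt_tokens`, `total_output_tokens`, and `elapsed_time` from the cell above.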