## FP8 Quantization and Inference using Intel® Neural Compressor (INC)

Copyright (c) 2025 Habana Labs, Ltd. an Intel Company.
SPDX-License-Identifier: Apache-2.0


> 📝 **Note:** Before running this tutorial, it is assumed that the reader has already set up the Gaudi machine and Jupyter notebooks as laid out in the [README](https://github.com/HabanaAI/Gaudi-tutorials/blob/main/README.md#important-to-run-these-jupyter-notebooks-you-will-need-to-follow-these-steps).

## Introduction

In this notebook, we run inference via vLLM on HPU with FP8 precision, which is achieved using the [Intel® Neural Compressor](https://github.com/intel/neural-compressor) (INC) package. Using the FP8 data type for large language model inference halves the required memory bandwidth compared to BF16. In addition, FP8 compute is twice as fast as BF16 compute. Together, these two benefits enable efficient deployment of LLMs using FP8 quantization.
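
As a rough illustration of the memory claim, the sketch below estimates weight storage at 2 bytes per parameter (BF16) versus 1 byte per parameter (FP8). The 70-billion parameter count is an assumed round number, and the estimate ignores the KV cache, activations, and any layers that stay unquantized.

```python
# Back-of-the-envelope weight footprint; 70e9 parameters is an assumed round number.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gib(num_params: float, dtype: str) -> float:
    """Return the approximate weight storage in GiB for the given data type."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


bf16_gib = weight_gib(70e9, "bf16")
fp8_gib = weight_gib(70e9, "fp8")
print(f"BF16: {bf16_gib:.1f} GiB, FP8: {fp8_gib:.1f} GiB")
```

Because every weight read moves half as many bytes, the same ratio applies to the memory traffic needed to stream the weights during decoding.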

### Model Calibration Process
To enable inference with the FP8 data type, Intel Neural Compressor (INC) first measures the model and then quantizes it. This procedure is called model calibration.
The [vllm-hpu-extension](https://github.com/HabanaAI/vllm-hpu-extension/blob/main/calibration/README.md) provides automated scripts that utilize INC to perform the model calibration as shown below.
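
To make the measure-then-quantize idea concrete, here is a conceptual sketch of max-abs scaling in plain Python. It is not the INC API and not what the calibration scripts execute; the function names are hypothetical, the FP8 maximum of 448 is the largest finite value of the OCP E4M3 format and is assumed here only for illustration (the range used on a given device may differ), and rounding to actual FP8 values is omitted.

```python
# Conceptual sketch only: this is NOT the INC API.
FP8_E4M3_MAX = 448.0  # assumed target range (largest finite OCP E4M3 value)


def measure_maxabs(batches):
    """Measurement step: track the largest absolute value seen across calibration batches."""
    maxabs = 0.0
    for batch in batches:
        maxabs = max(maxabs, max(abs(x) for x in batch))
    return maxabs


def compute_scale(maxabs, fp8_max=FP8_E4M3_MAX):
    """Quantization step: choose a scale so that maxabs maps onto the FP8 maximum."""
    return maxabs / fp8_max if maxabs > 0 else 1.0


def scale_and_clamp(values, scale, fp8_max=FP8_E4M3_MAX):
    """Scale values into the FP8 range and clamp outliers (rounding to FP8 is omitted)."""
    return [min(max(v / scale, -fp8_max), fp8_max) for v in values]


calibration_batches = [[0.5, -2.0, 1.25], [3.5, -0.75, 0.1]]  # made-up activations
scale = compute_scale(measure_maxabs(calibration_batches))
print(scale, scale_and_clamp([7.0, -1.75], scale))
```

The example also shows why a representative calibration dataset matters: a value larger than anything seen during measurement (7.0 here) is clamped at inference time.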

#### Download Model and Dataset

Since we will be quantizing Hugging Face models, specify your [Hugging Face credentials](https://huggingface.co/docs/hub/en/security-tokens) (HF_TOKEN) below.

In [None]:
HF_TOKEN="<YOUR HF_TOKEN HERE>"

Specify the Hugging Face model you would like to quantize to FP8:

In [None]:
import os

MODEL_NAME="meta-llama/Llama-3.1-70B-Instruct"
os.environ['MODEL_NAME'] = MODEL_NAME

The checkpoint of your selected model is downloaded to the `/root/models` directory by launching the `download_model.sh` script below.

**NOTE**: This process of downloading the model is time consuming and will continue running in the background when you run this cell. After running this cell, please proceed to run the subsequent cells.

In [None]:
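import subprocess

command = f"HF_TOKEN={HF_TOKEN} /bin/bash download_model.sh {MODEL_NAME}"

# Write output to a log file; an unread subprocess.PIPE can fill up and stall the download.
with open('download_model_stdout.log', 'w') as file:
    process = subprocess.Popen(command, shell=True, start_new_session=True, stdout=file, stderr=subprocess.STDOUT)

print("Process ID:", process.pid)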

The calibration script uses the publicly available open_orca dataset from MLCommons to generate a calibration dataset. Run the following cell to download the dataset into the `/root/open_orca` folder:

In [None]:
!/bin/bash download_dataset.sh

Check that `open_orca_gpt4_tokenized_llama.sampled_24576.pkl` has been downloaded by running this command:

In [None]:
!ls /root/open_orca/*.pkl

Check whether the model files have downloaded fully. **Once all the files are downloaded**, the commands below should show a folder size of ~263GB for the Llama-3.1-70B-Instruct model.

In [None]:
%%bash
export CHECKPOINT_PATH=/root/models/$MODEL_NAME
du -sh $CHECKPOINT_PATH
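
If you prefer to check the download from Python, for example to poll it in a loop, the following sketch sums file sizes under the checkpoint folder. The helper name `dir_size_gb` is hypothetical, and it reports decimal gigabytes of file content, so the number can differ slightly from what `du -sh` prints.

```python
import os


def dir_size_gb(path: str) -> float:
    """Sum the sizes of all regular files under path, in decimal gigabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):  # skip symlinks to avoid double counting
                total += os.path.getsize(file_path)
    return total / 1e9


checkpoint_path = os.path.join("/root/models", os.environ.get("MODEL_NAME", ""))
if os.path.isdir(checkpoint_path):
    print(f"{checkpoint_path}: {dir_size_gb(checkpoint_path):.1f} GB")
else:
    print(f"{checkpoint_path} does not exist yet")
```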

#### Run Calibration Script
The following command downloads the calibration script and runs calibration on the model downloaded above, using the downloaded open_orca .pkl file. It generates a quantization config .json (`maxabs_quant_g3.json`) in the `g3` folder.

In [None]:
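%%bash
# Each %%bash cell runs in its own shell, so CHECKPOINT_PATH must be defined again here.
export CHECKPOINT_PATH=/root/models/$MODEL_NAME
cd /root
git clone https://github.com/HabanaAI/vllm-hpu-extension.git -b v1.20.0
cd vllm-hpu-extension
pip install -e .
cd calibration
./calibrate_model.sh -m $CHECKPOINT_PATH -d /root/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl -o g3 -b 128 -t 8 -l 1024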
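
Once calibration finishes, it can help to confirm where the quantization config landed before pointing the server at it. The sketch below searches an output folder recursively for files matching `maxabs_quant_*.json` and prints their top-level keys. The output location and the helper name `find_quant_configs` are assumptions; adjust the path to wherever your `-o` folder was created.

```python
import glob
import json
import os


def find_quant_configs(output_dir: str) -> list:
    """Return all maxabs_quant_*.json files found anywhere under output_dir."""
    pattern = os.path.join(output_dir, "**", "maxabs_quant_*.json")
    return sorted(glob.glob(pattern, recursive=True))


# Assumed location: the -o g3 folder relative to the calibration directory.
output_dir = "/root/vllm-hpu-extension/calibration/g3"
for config_path in find_quant_configs(output_dir):
    with open(config_path) as f:
        config = json.load(f)
    print(config_path, "->", sorted(config.keys()))
```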

### vLLM Installation and Environment Setup

The following cell installs the vLLM server for Gaudi. For more information on installing vLLM for Gaudi, refer to [Build And Install vLLM](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#2-build-and-install-the-latest-from-vllm-fork).

In [None]:
%%bash
cd /root
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout habana_main
pip install -r requirements-hpu.txt
python setup.py develop


### Benchmark online serving throughput

The [benchmark_serving.py](https://github.com/HabanaAI/vllm-fork/blob/habana_main/benchmarks/benchmark_serving.py) script is useful for benchmarking the vLLM serving throughput in online mode. For more details on the online and offline modes of vLLM, refer to the [documentation](https://docs.habana.ai/en/v1.19.1/PyTorch/Inference_on_PyTorch/vLLM_Inference.html#sending-an-inference-request) and this [tutorial](https://github.com/HabanaAI/Gaudi-tutorials/blob/main/PyTorch/Getting_Started_with_vLLM/Getting_Started_with_vLLM.ipynb).

This benchmarking script measures the following metrics:
- Request throughput (requests/second)
- Input token throughput (tokens/second)
- Output token throughput (tokens/second)
- Time to first token (TTFT) (milliseconds)
- Time per output token (TPOT) (milliseconds)
- Inter-token latency (ITL) (milliseconds)

For more details on some of these metrics, refer to this [Intel LLM whitepaper](https://www.intel.com/content/dam/develop/public/us/en/documents/llm-with-model-server-white-paper.pdf).
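
The latency metrics listed above can be illustrated for a single request with a small sketch. It assumes you have the time the request was sent and the arrival time of each output token, in seconds. The function name and the sample timestamps are made up, and the benchmarking script additionally aggregates these values across all requests.

```python
def latency_metrics(request_sent: float, token_times: list) -> dict:
    """Compute TTFT, TPOT, and ITL (all in ms) from per-token arrival timestamps in seconds."""
    ttft = token_times[0] - request_sent
    total_latency = token_times[-1] - request_sent
    # ITL: gaps between consecutive output tokens.
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    # TPOT: decode time averaged over the tokens that follow the first one.
    if len(token_times) > 1:
        tpot = (total_latency - ttft) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return {
        "ttft_ms": ttft * 1000,
        "tpot_ms": tpot * 1000,
        "itl_ms": [gap * 1000 for gap in itl],
    }


# Made-up timestamps: request sent at t=0, four tokens arrive afterwards.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.5])
print(metrics)
```

TPOT is a single average per request, while ITL keeps every individual gap, so ITL exposes stalls that the TPOT average smooths over.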


#### Start the vLLM server


Since this benchmark runs in online mode, there are two sides: a server and a client.
The cells below launch the vLLM OpenAI API server as a **background process** using a command of the following form:
```bash
python -m vllm.entrypoints.openai.api_server --model="meta-llama/Llama-3.1-70B-Instruct" --swap-space 16 --disable-log-requests --port 8000 --block-size 128
```


For a quick check of your model on vLLM and for demonstration purposes, set the `VLLM_SKIP_WARMUP` environment variable to "true".
This brings up the vLLM server faster by skipping the warmup phase, which can take 5-10 minutes or more depending on your model and vLLM configuration. However, skipping warmup results in non-optimal benchmark numbers.
If you prefer not to skip the warmup phase and are prepared to wait for the vLLM server to come up, set this variable to "false".

In [None]:
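import os

# Quantization config generated by the calibration step above
os.environ["QUANT_CONFIG"] = "<path to g3/meta-llama-3.1-70b-instruct/maxabs_quant_g3.json>"
os.environ["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"
os.environ["PT_HPUGRAPH_DISABLE_TENSOR_CACHE"] = "1"
os.environ["VLLM_ENGINE_ITERATION_TIMEOUT_S"] = "3600"
os.environ["VLLM_PROMPT_BS_BUCKET_MAX"] = "16"
os.environ["VLLM_DECODE_BS_BUCKET_MAX"] = "128"
os.environ["VLLM_DECODE_BLOCK_BUCKET_MIN"] = "2048"
os.environ["VLLM_DECODE_BLOCK_BUCKET_MAX"] = "4096"
os.environ["VLLM_PROMPT_SEQ_BUCKET_MAX"] = "2048"
os.environ["VLLM_PROMPT_SEQ_BUCKET_MIN"] = "2048"

# Also kept as a Python variable because the launch command below interpolates it
VLLM_SKIP_WARMUP = "true"
os.environ["VLLM_SKIP_WARMUP"] = VLLM_SKIP_WARMUP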

In [None]:
import subprocess

command = f"VLLM_SKIP_WARMUP={VLLM_SKIP_WARMUP} HF_TOKEN={HF_TOKEN} python -m vllm.entrypoints.openai.api_server --model={MODEL_NAME} --swap-space 16 --disable-log-requests --port 8000 --block-size 128"

with open('vllm_server_stdout.log', 'w') as file:
    process = subprocess.Popen(command, shell=True, start_new_session=True, stdout=file, stderr=subprocess.STDOUT)

print("Process ID:", process.pid)

> 📝 **Note**:
> - For optimal performance, it is recommended to run inference with VLLM_SKIP_WARMUP="false" on Gaudi 2 with ```--block-size``` of 128 for BF16 data type.
> - To troubleshoot Out of Memory issues on your model, refer to this section of the [Gaudi documentation on vLLM](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/vLLM_Inference.html#basic-troubleshooting-for-out-of-memory-errors).

The server process is now running in the background. Note the process ID for reference. The server's output logs can be followed in real time in the ```vllm_server_stdout.log``` file.
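
Rather than checking the log by hand, you could poll it until the startup message appears. The helper below is a minimal sketch: the function name, timeout, and polling interval are arbitrary choices, and the marker string is the "Application startup complete" message that the server writes to its log once it is up.

```python
import time


def wait_for_log_line(log_path: str, marker: str, timeout_s: float = 1800.0, poll_s: float = 5.0) -> bool:
    """Poll log_path until marker appears or timeout_s elapses; return True if found."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with open(log_path, errors="replace") as f:
                if marker in f.read():
                    return True
        except FileNotFoundError:
            pass  # the server may not have created the log yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Example usage in the notebook (uncomment to run):
# ready = wait_for_log_line("vllm_server_stdout.log", "Application startup complete")
# print("Server ready:", ready)
```

Re-reading the whole file on each poll is simple and adequate for a log of this size; a generous timeout is advisable when warmup is enabled.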

#### Run benchmark_serving.py on client side

First, download the ShareGPT dataset file needed for this benchmark. The benchmarking script samples requests from this dataset and sends them to the server.

In [None]:
!wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

On the client side, the script is run in the following cells with these parameters:
```bash
    # --request-rate defaults to inf; --num-prompts defaults to 1000
    python vllm-fork/benchmarks/benchmark_serving.py \
        --backend vllm \
        --model <your_model> \
        --dataset-name sharegpt \
        --dataset-path <path to dataset> \
        --request-rate <request_rate> \
        --num-prompts <num_prompts>
```

> 📝 **Note:** If you set VLLM_SKIP_WARMUP="false" in the server setup phase, ensure that the server has fully come up by running the following cell.


In [None]:
!grep -B3 "INFO:     Uvicorn running on" vllm_server_stdout.log

If the above cell's output says **"Application startup complete"**, run the following cell to launch the benchmarking script. Otherwise, wait a few more minutes for the server to come up.

In [None]:
!HF_TOKEN={HF_TOKEN} python vllm-fork/benchmarks/benchmark_serving.py --backend vllm --model "{MODEL_NAME}" --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf 2>&1 | tee vllm_benchmark_serving_client_stdout.log
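
To compare runs, for example with and without warmup, it can be convenient to pull the reported numbers out of `vllm_benchmark_serving_client_stdout.log`. The sketch below assumes the summary is printed as `label: number` lines. That format is an assumption about the script's output, and the sample text is made up, so check it against your own log before relying on it.

```python
import re


def parse_benchmark_log(text: str) -> dict:
    """Collect 'label: number' lines into a dict; lines in any other shape are ignored."""
    metrics = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?):\s+(-?\d+(?:\.\d+)?)\s*$", line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


# Made-up sample in the assumed format; real labels and values may differ.
sample = """\
Successful requests:        1000
Request throughput (req/s): 12.50
Mean TTFT (ms):             812.40
"""
parsed = parse_benchmark_log(sample)
print(parsed)

# In the notebook, parse the real log instead (uncomment to run):
# with open("vllm_benchmark_serving_client_stdout.log") as f:
#     print(parse_benchmark_log(f.read()))
```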


If you had set VLLM_SKIP_WARMUP="true" in the server setup phase, try restarting this notebook and retry this benchmark with VLLM_SKIP_WARMUP="false" for better results.
Run the following cells to stop any lingering processes before re-launching this notebook.

In [None]:
!kill -9 <PID_OF_SERVER_FROM_CELL_NUMBER_5_ABOVE>
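
Because the server was launched with `shell=True` and `start_new_session=True`, the recorded PID belongs to a shell that leads its own process group, and the Python server runs as a child inside that group. As an alternative to `kill -9` on a single PID, the sketch below signals the whole group, trying SIGTERM first and falling back to SIGKILL. The helper name and grace period are arbitrary choices, and it assumes a POSIX system.

```python
import os
import signal
import time


def stop_process_group(pid: int, grace_s: float = 10.0) -> None:
    """Send SIGTERM to the process group led by pid, then SIGKILL if it is still alive."""
    try:
        os.killpg(pid, signal.SIGTERM)
    except ProcessLookupError:
        return  # the group is already gone
    deadline = time.monotonic() + grace_s
    while time.monotonic() < deadline:
        try:
            # Signal 0 only checks for existence. A terminated process that its
            # parent has not reaped yet still counts as existing.
            os.killpg(pid, 0)
        except ProcessLookupError:
            return
        time.sleep(0.2)
    try:
        os.killpg(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass


# Example usage in the notebook, with `process` from the server launch cell:
# stop_process_group(process.pid)
```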

In [None]:
exit()