## FP8 Quantization and Inference using Intel® Neural Compressor (INC)

Copyright (c) 2025 Habana Labs, Ltd. an Intel Company.
SPDX-License-Identifier: Apache-2.0


> 📝 **Note:** This tutorial assumes that you have already set up the Gaudi machine and Jupyter notebooks as laid out in the [README](https://github.com/HabanaAI/Gaudi-tutorials/blob/main/README.md#important-to-run-these-jupyter-notebooks-you-will-need-to-follow-these-steps).

## Introduction

In this notebook, we run inference via vLLM on HPU with FP8 precision, achieved using the [Intel® Neural Compressor](https://github.com/intel/neural-compressor) (INC) package. Using the FP8 data type for inference on large language models halves the required memory bandwidth compared to BF16. In addition, FP8 compute is twice as fast as BF16 compute. These two benefits enable efficient deployment of LLMs using FP8 quantization.
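
As a rough illustration of the memory claim, the sketch below estimates the weight footprint of a 70-billion-parameter model at each precision. It counts weights only (no KV cache, activations, or quantization scales), and the parameter count is a round figure assumed for illustration.

```python
# Back-of-the-envelope weight footprint; weights only, other buffers ignored.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_footprint_gb(num_params: int, dtype: str) -> float:
    """Return the approximate weight size in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 70_000_000_000  # assumed round figure for a 70B model

bf16_gb = weight_footprint_gb(NUM_PARAMS, "bf16")
fp8_gb = weight_footprint_gb(NUM_PARAMS, "fp8")
print(f"BF16: {bf16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB, ratio: {bf16_gb / fp8_gb:.1f}x")
```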

### Model Calibration Process
To enable inference with the FP8 data type, the Intel Neural Compressor (INC) measures the model and then quantizes it. Together, these two steps are called the model calibration procedure.
The [vllm-hpu-extension](https://github.com/HabanaAI/vllm-hpu-extension/blob/main/calibration/README.md) provides automated scripts that utilize INC to perform the model calibration as shown below.
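
To give an intuition for what "measurement" and "quantization" mean here, the following is a minimal sketch of max-abs scaling, the idea suggested by the `maxabs_quant` file names used later in this notebook. It is not INC's implementation: the function names and sample values are hypothetical, the FP8 range constant is the OCP E4M3 maximum (the range on a given device may differ), and rounding to representable FP8 values is omitted.

```python
# Illustrative only: not INC's implementation.
FP8_E4M3_MAX = 448.0  # largest finite value of the OCP FP8 E4M3 format (assumed target range)


def maxabs_scale(samples, fp8_max=FP8_E4M3_MAX):
    """Measurement step: derive a per-tensor scale from the largest magnitude seen."""
    maxabs = max((abs(x) for x in samples), default=0.0)
    return maxabs / fp8_max if maxabs > 0 else 1.0


def scale_and_clamp(values, scale, fp8_max=FP8_E4M3_MAX):
    """Quantization step (partial): rescale into the FP8 range and clamp.

    Rounding to the nearest representable FP8 value is omitted in this sketch.
    """
    return [max(-fp8_max, min(fp8_max, v / scale)) for v in values]


calibration_activations = [0.5, -2.0, 1.0]  # hypothetical measured values
scale = maxabs_scale(calibration_activations)
# 3.0 lies outside the measured range, so it is clamped to the FP8 maximum.
scaled = scale_and_clamp([0.5, -2.0, 3.0], scale)
```

The clamped third value shows why the calibration dataset should be representative: magnitudes that were never observed during measurement saturate at inference time.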

#### Download Dataset

Since we will be quantizing Hugging Face models, specify your [Hugging Face credentials](https://huggingface.co/docs/hub/en/security-tokens) (HF_TOKEN) below.

In [None]:
HF_TOKEN="<YOUR HF_TOKEN HERE>"

We use the `NousResearch/Meta-Llama-3.1-70B-Instruct` model as an example here. You may specify any other Hugging Face model that you would like to quantize to FP8:

In [None]:
import os

MODEL_NAME="NousResearch/Meta-Llama-3.1-70B-Instruct"
os.environ['MODEL_NAME'] = MODEL_NAME

In [None]:
!echo $MODEL_NAME

The calibration script uses the publicly available open_orca dataset from MLCommons to generate a calibration dataset. Run the following cell to download the dataset into the `/root/open_orca` folder:

In [None]:
!bash download_dataset.sh

Check that the `open_orca_gpt4_tokenized_llama.sampled_24576.pkl` file has been downloaded by running this command:

In [None]:
!ls /root/open_orca/*.pkl
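
The same check can be done from Python, which makes it easy to stop early if the download failed. This is a minimal sketch; the helper name `dataset_ready` and the size threshold are hypothetical.

```python
from pathlib import Path


def dataset_ready(path: str, min_bytes: int = 1) -> bool:
    """Return True if the file exists and holds at least min_bytes bytes."""
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes


DATASET_PATH = "/root/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl"
print("dataset ready:", dataset_ready(DATASET_PATH))
```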

### Install vLLM (Prerequisite, one-time only)

The following cell installs the vLLM server for Gaudi. For more information on installing vLLM for Gaudi, refer to [Build And Install vLLM](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#2-build-and-install-the-latest-from-vllm-fork).

In [None]:
%%bash
cd /root
git clone https://github.com/HabanaAI/vllm-fork.git -b v0.6.6.post1+Gaudi-1.20.0 
cd vllm-fork
pip install --quiet -r requirements-hpu.txt
python3 setup.py develop
pip install datasets

In [None]:
# Symlink python3 as default python (Needed for Calibration Script later on)
!ln -s /usr/bin/python3 /usr/bin/python

#### Run Calibration Script (one-time for each model)
The following cell downloads the calibration scripts and runs calibration on the selected model using the downloaded open_orca `.pkl` file. It writes a quantization configuration file, `maxabs_quant_g3.json` on Gaudi 3 (or `maxabs_quant_g2.json` on Gaudi 2), under the `g3` output folder.

**NOTE**: Estimated time for completion of this step is **10 minutes**.

In [None]:
%%bash
cd /root
git clone https://github.com/HabanaAI/vllm-hpu-extension.git -b v1.20.0
cd vllm-hpu-extension
pip install -e .
cd calibration
./calibrate_model.sh -m $MODEL_NAME -d /root/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl -o g3 -b 128 -t 8 -l 1024
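
If you prefer to drive the calibration from Python, the invocation above can be assembled as an argument list. This is a sketch only: `build_calibration_cmd` is a hypothetical helper, and the flag descriptions marked "assumed" in the comments should be verified against the calibration README linked earlier.

```python
import shlex


def build_calibration_cmd(model, dataset, out_dir="g3", batch_size=128, tp_size=8, limit=1024):
    """Assemble the calibrate_model.sh invocation used above as an argument list."""
    return [
        "./calibrate_model.sh",
        "-m", model,            # model name or path
        "-d", dataset,          # calibration dataset (.pkl)
        "-o", out_dir,          # output folder for the generated files
        "-b", str(batch_size),  # batch size (assumed meaning)
        "-t", str(tp_size),     # tensor-parallel size (assumed meaning)
        "-l", str(limit),       # sample limit (assumed meaning)
    ]


cmd = build_calibration_cmd(
    "NousResearch/Meta-Llama-3.1-70B-Instruct",
    "/root/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl",
)
print(shlex.join(cmd))
```

The list can then be passed to `subprocess.run(cmd, cwd="/root/vllm-hpu-extension/calibration", check=True)`, which raises an exception if the script exits with a non-zero status.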


### Benchmark online serving throughput

Once the calibration is complete, we launch the vLLM server with special directives to load the model and quantize it to FP8. The [benchmark_serving.py](https://github.com/HabanaAI/vllm-fork/blob/habana_main/benchmarks/benchmark_serving.py) script is used for benchmarking the vLLM serving throughput in online mode.

#### Start the vLLM server

Double-check the path of the quantization config generated by the model calibration script in the previous section:

In [None]:
!ls -lhart /root/vllm-hpu-extension/calibration/g3/*/maxabs_quant_*.json

Copy the path of your model's `maxabs_quant` .json config file from the output of the above command and place it in the next cell:

In [None]:
# Enter the full path of your model's maxabs_quant_[g2,g3].json file

QUANT_CONFIG="<FULL_PATH_TO_YOUR_MODELS_MAXABS_QUANT>"
# e.g. QUANT_CONFIG="/root/vllm-hpu-extension/calibration/g3/meta-llama-3.1-70b-instruct/maxabs_quant_g3.json"

In [None]:
print(f"{QUANT_CONFIG}")
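
Before launching the server, it can help to confirm that the config file exists and parses as JSON. The sketch below mirrors the `ls` pattern used above; both helper names are hypothetical, and nothing is assumed about the keys inside the file.

```python
import glob
import json
import os


def find_quant_configs(calibration_dir: str) -> list:
    """Return maxabs_quant_*.json files one level below calibration_dir, sorted by path."""
    pattern = os.path.join(calibration_dir, "*", "maxabs_quant_*.json")
    return sorted(glob.glob(pattern))


def load_quant_config(path: str):
    """Parse the config; raises if the file is missing or is not valid JSON."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


candidates = find_quant_configs("/root/vllm-hpu-extension/calibration/g3")
for path in candidates:
    load_quant_config(path)
    print(path, "-> parsed OK")
```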

View the `run_server.sh` launch script, which takes your model name as input and launches the vLLM server.
Note the `--quantization inc` and `--kv-cache-dtype fp8_inc` parameters, which enable FP8 quantization using INC.

In [None]:
!cat run_server.sh

View the command that will be used to launch the vLLM server. Note the use of the `QUANT_CONFIG` environment variable and the `MODEL_NAME` given as input to the launcher script.

In [None]:
command = f"HF_TOKEN={HF_TOKEN} QUANT_CONFIG={QUANT_CONFIG} /bin/bash run_server.sh {MODEL_NAME}"
print(command)

Launch vLLM server instance as a background process:

In [None]:
import subprocess

process = subprocess.Popen(command, shell=True, start_new_session=True, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)

print("Process ID:", process.pid)
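
The shell-string form above places the token in the command line, where it appears in the printed command and in process listings. As an alternative sketch, the variables can be passed through the `env` argument of `subprocess.Popen` instead. The helper name `launch_in_background` is hypothetical; the other arguments match the cell above.

```python
import os
import subprocess


def launch_in_background(args, extra_env):
    """Start args as a detached background process with extra environment variables."""
    env = {**os.environ, **extra_env}  # inherit the current environment, then override
    return subprocess.Popen(
        args,
        env=env,
        start_new_session=True,       # same detachment as the cell above
        stdout=subprocess.DEVNULL,
        stderr=subprocess.STDOUT,
    )


# Usage in this notebook (not executed here):
# process = launch_in_background(
#     ["/bin/bash", "run_server.sh", MODEL_NAME],
#     {"HF_TOKEN": HF_TOKEN, "QUANT_CONFIG": QUANT_CONFIG},
# )
```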

The server process is now running in the background. Note the process ID for reference. The server's output logs can be followed in real time in the `server_TP8_fp8.log` file.

**NOTE**: Estimated time for the server to start up is **10 minutes**.

#### Run benchmark_serving.py on client side

After allowing enough time for the server to start, check the logs to confirm FP8 quantization:

In [None]:
!grep "Start to convert model with fp8_quant" server_TP8_fp8.log

Finally, confirm that the server is ready by checking its output log:

In [None]:
!grep -B3 "INFO:     Uvicorn running on" server_TP8_fp8.log

If the output of the cell above includes **"Application startup complete"**, run the following cell to launch the benchmarking script. Otherwise, wait a few more minutes for the server to come up and rerun the command above.
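
Instead of re-running the `grep` by hand, the wait can be automated by polling the log for the readiness message. This is a minimal sketch: `wait_for_log_line` is a hypothetical helper, and the timeout and polling interval are arbitrary choices.

```python
import time
from pathlib import Path


def wait_for_log_line(log_path, needle, timeout_s=1200.0, poll_s=10.0):
    """Poll log_path until needle appears; return True on success, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        path = Path(log_path)
        # The log may not exist yet right after launch, so check before reading.
        if path.is_file() and needle in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Usage in this notebook (not executed here):
# ready = wait_for_log_line("server_TP8_fp8.log", "Application startup complete")
```

The helper re-reads the whole file on each poll, which is acceptable for a startup log but would be wasteful for a very large file.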

In [None]:
%%bash
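# %%bash does not expand Python variables such as {HF_TOKEN}. For gated models,
# export the token first (for example, os.environ["HF_TOKEN"] = HF_TOKEN in an earlier cell).
python3 /root/vllm-fork/benchmarks/benchmark_serving.py --backend vllm \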
--model $MODEL_NAME \
--dataset-name sonnet \
--dataset-path /root/vllm-fork/benchmarks/sonnet.txt \
--request-rate 4 \
--num-prompts 1000 \
--port 8080 \
--sonnet-input-len 2048 \
--sonnet-output-len 2048 \
--sonnet-prefix-len 100 2>&1 | tee client_2k_2k_fp8.log
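
Once the run finishes, the summary in `client_2k_2k_fp8.log` can be collected into a dictionary for comparison across runs. The sketch below assumes the script prints its summary as `label: number` lines; that format is an assumption to check against your actual log, and `parse_metrics` is a hypothetical helper.

```python
import re

# Matches lines of the form "<label>:   <number>" with nothing after the number.
_METRIC_RE = re.compile(r"^\s*([^:\n]+?):\s+(-?\d+(?:\.\d+)?)\s*$")


def parse_metrics(log_text):
    """Collect 'label: number' lines into a dict; later duplicates overwrite earlier ones."""
    metrics = {}
    for line in log_text.splitlines():
        match = _METRIC_RE.match(line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


# Usage in this notebook (not executed here):
# with open("client_2k_2k_fp8.log") as f:
#     print(parse_metrics(f.read()))
```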


View the benchmarking results in the `client_2k_2k_fp8.log` output file.
Run the following cells before exiting this notebook.

In [None]:
!kill -9 <PID_OF_SERVER_FROM_CELL_NUMBER_14_ABOVE>
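
Because the server was launched with `start_new_session=True`, its process ID also identifies a new process group. As an alternative sketch, signalling the whole group with `SIGTERM` gives the launcher script and the processes it started a chance to exit cleanly before resorting to `kill -9`. The helper name `stop_process_group` is hypothetical.

```python
import os
import signal


def stop_process_group(pid, sig=signal.SIGTERM):
    """Send sig to the process group of pid; return False if the process is already gone."""
    try:
        os.killpg(os.getpgid(pid), sig)
        return True
    except ProcessLookupError:
        return False


# Usage in this notebook (not executed here):
# stop_process_group(process.pid)
```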

In [None]:
exit()