## FP8 Quantization and Inference using Intel® Neural Compressor (INC)

Copyright (c) 2025 Habana Labs, Ltd. an Intel Company.
SPDX-License-Identifier: Apache-2.0


> 📝 **Note:** This tutorial assumes that you have already set up the Gaudi machine and Jupyter notebooks as laid out in the [README](https://github.com/HabanaAI/Gaudi-tutorials/blob/main/README.md#important-to-run-these-jupyter-notebooks-you-will-need-to-follow-these-steps).

## Introduction

In this notebook, we run inference with vLLM on HPU at FP8 precision, using the [Intel® Neural Compressor](https://github.com/intel/neural-compressor) (INC) package to quantize the model. Compared to BF16, the FP8 data type halves the memory bandwidth required for large language model inference, and FP8 compute can run up to twice as fast as BF16 compute. Together, these two benefits enable efficient deployment of LLMs.
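
As a rough illustration of the memory claim above, the sketch below compares the weight footprint of a model at 2 bytes per parameter (BF16) and 1 byte per parameter (FP8). The 70e9 parameter count is a nominal approximation, and the estimate ignores the KV cache, activations, and quantization scales.

```python
# Back-of-the-envelope weight footprint; not a measurement of any real deployment.
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_footprint_gb(num_params: float, dtype: str) -> float:
    """Return the weight size in GB (10^9 bytes) for the given data type."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


params_70b = 70e9  # nominal parameter count (assumption)
bf16_gb = weight_footprint_gb(params_70b, "bf16")
fp8_gb = weight_footprint_gb(params_70b, "fp8")
print(f"BF16: {bf16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB")  # BF16: 140 GB, FP8: 70 GB
```

Every weight must be read from memory for each decoding step, so halving the bytes per parameter halves that traffic.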

### Model Calibration Process
To enable inference with the FP8 data type, Intel Neural Compressor (INC) first measures the model and then quantizes it. Together, these two steps are called the model calibration procedure.
The [vllm-hpu-extension](https://github.com/HabanaAI/vllm-hpu-extension/blob/main/calibration/README.md) provides automated scripts that utilize INC to perform the model calibration as shown below.
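
To make the measure-then-quantize idea concrete, here is a conceptual sketch of max-abs scaling in plain Python. It is not INC's implementation: the function names are hypothetical, the data is a toy example, and `FP8_E4M3_MAX = 448.0` is the largest finite value of the OCP E4M3 format, which may differ from the range the hardware actually uses.

```python
FP8_E4M3_MAX = 448.0  # OCP E4M3 maximum (assumption: the target device may use another range)


def measure_maxabs(batches):
    """Measurement step: track the largest absolute value seen across calibration batches."""
    maxabs = 0.0
    for batch in batches:
        maxabs = max(maxabs, max(abs(x) for x in batch))
    return maxabs


def maxabs_scale(maxabs, fp8_max=FP8_E4M3_MAX):
    """Quantization step: pick a scale that maps the measured range onto the FP8 range."""
    return maxabs / fp8_max if maxabs > 0 else 1.0


calibration_batches = [[0.5, -3.0, 1.25], [7.0, -0.1, 2.0]]  # toy activations
maxabs = measure_maxabs(calibration_batches)
scale = maxabs_scale(maxabs)

# Dividing by the scale brings every measured value inside the representable range.
scaled = [x / scale for batch in calibration_batches for x in batch]
print(f"maxabs={maxabs}, scale={scale}, largest scaled value={max(abs(v) for v in scaled)}")
```

The calibration dataset matters because the scale is only as good as the range observed during measurement.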

#### Download Model and Dataset

Since we will be quantizing HuggingFace models, specify your [Huggingface credentials](https://huggingface.co/docs/hub/en/security-tokens) (HF_TOKEN) below.

In [None]:
HF_TOKEN="<YOUR HF_TOKEN HERE>"

Specify the Huggingface model you would like to quantize to FP8:

In [None]:
import os

MODEL_NAME="meta-llama/Llama-3.1-70B-Instruct"
os.environ['MODEL_NAME'] = MODEL_NAME

The model checkpoint of your selected model is downloaded to `/root/models` directory by launching the `download_model.sh` script below. 

**NOTE**: Downloading the model is time-consuming. The download keeps running in the background after you run this cell, so proceed to the subsequent cells right away.

In [None]:
import subprocess

command = f"HF_TOKEN={HF_TOKEN} /bin/bash download_model.sh {MODEL_NAME}"
# Discard the output instead of piping it: a PIPE that is never read can fill up and stall the background download.
process = subprocess.Popen(command, shell=True, start_new_session=True, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)

print("Process ID:", process.pid)

The calibration script uses the publicly available open_orca dataset from MLCommons to generate a calibration dataset. Run the following cell to download the dataset into the `/root/open_orca` folder:

In [None]:
!bash download_dataset.sh

Check that the `open_orca_gpt4_tokenized_llama.sampled_24576.pkl` file has been downloaded by running this command:

In [None]:
!ls /root/open_orca/*.pkl

Check if the model files have downloaded fully. Set the `CHECKPOINT_PATH` variable to point to the location where the `download_model.sh` script has downloaded the model weights.

In [None]:
CHECKPOINT_PATH=f'/root/models/{MODEL_NAME}'
%env CHECKPOINT_PATH = {CHECKPOINT_PATH}

**Once all the files have been downloaded by the download_model.sh script**, the command below should report a folder size of ~263GB for the Llama-3.1-70B-Instruct model. If it does not, wait for the model to download fully and re-check periodically before proceeding.

In [None]:
!du -sh $CHECKPOINT_PATH
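
The same check can be sketched in Python, which is handy for re-checking the size in a loop. The helper name `dir_size_gib` is hypothetical. It sums apparent file sizes in GiB, so the result can differ slightly from `du`, which reports disk usage.

```python
import os


def dir_size_gib(path):
    """Sum the sizes of all regular files under path, in GiB (2^30 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so that linked files are not counted twice.
            if os.path.isfile(file_path) and not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total / 1024**3


checkpoint_path = os.environ.get("CHECKPOINT_PATH")
if checkpoint_path and os.path.isdir(checkpoint_path):
    print(f"{checkpoint_path}: {dir_size_gib(checkpoint_path):.1f} GiB")
```

If the `process` object from the download cell is still available, `process.poll()` returns `None` while the download script is running and its exit code once it has finished.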

### Install vLLM (Pre-requisite)

The following cell installs the vLLM server for Gaudi. For more information on installing vLLM for Gaudi, refer to [Build And Install vLLM](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#2-build-and-install-the-latest-from-vllm-fork).

In [None]:
%%bash
cd /root
git clone https://github.com/HabanaAI/vllm-fork.git -b v0.6.6.post1+Gaudi-1.20.0 
cd vllm-fork
pip install -r requirements-hpu.txt
python3 setup.py develop

#### Run Calibration Script
The following command downloads the calibration script and calibrates the model downloaded above, using the downloaded open_orca .pkl file. It writes a quantization configuration file to the `g3` folder: `maxabs_quant_g3.json` on Gaudi 3, or `maxabs_quant_g2.json` on Gaudi 2.

**NOTE**: Estimated time for completion of this step is *10 minutes*.

In [None]:
%%bash
cd /root
git clone https://github.com/HabanaAI/vllm-hpu-extension.git -b v1.20.0
cd vllm-hpu-extension
pip install -e .
cd calibration
./calibrate_model.sh -m $CHECKPOINT_PATH -d /root/open_orca/open_orca_gpt4_tokenized_llama.sampled_24576.pkl -o g3 -b 128 -t 8 -l 1024


### Benchmark online serving throughput

The [benchmark_serving.py](https://github.com/HabanaAI/vllm-fork/blob/habana_main/benchmarks/benchmark_serving.py) script benchmarks vLLM serving throughput in online mode. For more details on the online and offline modes of vLLM, refer to the [documentation](https://docs.habana.ai/en/v1.19.1/PyTorch/Inference_on_PyTorch/vLLM_Inference.html#sending-an-inference-request) and this [tutorial](https://github.com/HabanaAI/Gaudi-tutorials/blob/main/PyTorch/Getting_Started_with_vLLM/Getting_Started_with_vLLM.ipynb).

#### Start the vLLM server

Double-check the path of the quantization config generated by the model calibration script in the previous section.

In [None]:
!ls /root/vllm-hpu-extension/calibration/g3/meta-llama-3.1-70b-instruct/*.json

Set the required environment variables before starting the server. Most importantly, point the `QUANT_CONFIG` variable to the full path of the quantization config .json file listed above. This example assumes a Gaudi 2 machine, so the path is `/root/vllm-hpu-extension/calibration/g3/meta-llama-3.1-70b-instruct/maxabs_quant_g2.json`.

In [None]:
%env QUANT_CONFIG=/root/vllm-hpu-extension/calibration/g3/meta-llama-3.1-70b-instruct/maxabs_quant_g2.json

%env PT_HPU_ENABLE_LAZY_COLLECTIVES=true
%env PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1
%env VLLM_ENGINE_ITERATION_TIMEOUT_S=3600
%env VLLM_PROMPT_BS_BUCKET_MAX=16
%env VLLM_DECODE_BS_BUCKET_MAX=128
%env VLLM_DECODE_BLOCK_BUCKET_MIN=2048
%env VLLM_DECODE_BLOCK_BUCKET_MAX=4096
%env VLLM_PROMPT_SEQ_BUCKET_MAX=2048
%env VLLM_PROMPT_SEQ_BUCKET_MIN=2048
%env VLLM_SKIP_WARMUP=true
!cd /root/vllm-fork
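
Before launching the server, it can help to confirm that `QUANT_CONFIG` points to a readable JSON file, since a wrong path (for example `g2` versus `g3` in the file name) would otherwise only surface during server startup. The sketch below uses a hypothetical helper name and makes no assumption about the keys inside the file.

```python
import json
import os


def load_quant_config(path):
    """Return the parsed quantization config, failing early if it is missing or malformed."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"QUANT_CONFIG does not point to a file: {path}")
    with open(path) as f:
        return json.load(f)


quant_config_path = os.environ.get("QUANT_CONFIG")
if quant_config_path and os.path.isfile(quant_config_path):
    config = load_quant_config(quant_config_path)
    print("Loaded quantization config from", quant_config_path)
    print(json.dumps(config, indent=2))
```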

Use the following start command to launch the vLLM server. Note the `--quantization inc` and `--kv-cache-dtype fp8_inc` parameters, which enable FP8 quantization using INC and the `QUANT_CONFIG` set above.


In [None]:
command=f"python -m vllm.entrypoints.openai.api_server \
    --port 8080 \
    --model {CHECKPOINT_PATH} \
    --tensor-parallel-size 8 \
    --max-num-seqs 128 \
    --disable-log-requests \
    --dtype bfloat16 \
    --block-size 128 \
    --gpu-memory-utilization 0.9 \
    --num-lookahead-slots 1 \
    --use-v2-block-manager \
    --max-num-batched-tokens 32768 \
    --max-model-len 4096 \
    --quantization inc \
    --kv-cache-dtype fp8_inc \
    --weights-load-device cpu \
    2>&1 | tee server_70b_TP8_fp8.log"

Launch vLLM server instance:

In [None]:
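# The log is already captured by `tee`, so discard the pipe output: a PIPE that is never read can fill up and block the server.
process = subprocess.Popen(command, shell=True, start_new_session=True, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)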
print("Process ID for server:", process.pid)

The server process is now running in the background. Note the process ID for reference. The server's output logs can be followed in real time in the `server_70b_TP8_fp8.log` file.

#### Run benchmark_serving.py on client side

Check the server instance readiness by checking the server output log.

In [None]:
!grep -B3 "INFO:     Uvicorn running on" server_70b_TP8_fp8.log
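
As an alternative to re-running the `grep` by hand, the readiness check can be sketched as a small polling loop. The helper name and the timeout values are assumptions. The marker string is the same "Application startup complete" message that the log check looks for.

```python
import os
import time


def wait_for_log_marker(log_path, marker="Application startup complete",
                        timeout_s=1800, poll_s=10):
    """Poll a log file until marker appears; return False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(log_path):
            with open(log_path, errors="replace") as f:
                if marker in f.read():
                    return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Usage (blocks until the server is ready or the timeout expires):
# ready = wait_for_log_marker("server_70b_TP8_fp8.log")
```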

If the output of the cell above says **"Application startup complete"**, run the following cell to launch the benchmarking script. Otherwise, wait a few more minutes for the server to come up.

In [None]:
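%%bash -s "$HF_TOKEN"
# %%bash does not expand {HF_TOKEN} in the cell body, so the token is passed in as the first script argument.
HF_TOKEN="$1" python3 /root/vllm-fork/benchmarks/benchmark_serving.py --backend vllm \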
--model $CHECKPOINT_PATH \
--dataset-name sonnet \
--dataset-path ./sonnet.txt \
--request-rate 4 \
--num-prompts 1000 \
--port 8080 \
--sonnet-input-len 2048 \
--sonnet-output-len 2048 \
--sonnet-prefix-len 100 2>&1 | tee client_70b_2k_2k_fp8.log


In [None]:
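# Optional second run on the ShareGPT dataset; ShareGPT_V3_unfiltered_cleaned_split.json must already be in the working directory.
!HF_TOKEN={HF_TOKEN} python /root/vllm-fork/benchmarks/benchmark_serving.py --backend vllm --model {CHECKPOINT_PATH} --port 8080 --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf 2>&1 | tee vllm_benchmark_serving_client_stdout.log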
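
Once a benchmark run has finished, the summary metrics can be pulled out of the client log programmatically. The sketch below assumes the summary is printed as `name: number` lines. The helper name is hypothetical, and the sample text uses made-up values purely to exercise the parser.

```python
import re

# Matches lines such as "Some metric (unit):   12.5"; timestamps like "12:00:01" do not match.
METRIC_LINE = re.compile(r"^\s*(?P<name>[^:]+?):\s+(?P<value>-?\d+(?:\.\d+)?)\s*$")


def parse_metrics(log_text):
    """Collect every 'name: number' line of a log into a dict of floats."""
    metrics = {}
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line)
        if match:
            metrics[match.group("name").strip()] = float(match.group("value"))
    return metrics


# Illustrative input only; these are not real benchmark results.
sample = """\
============ Serving Benchmark Result ============
Request throughput (req/s):              3.50
Mean TTFT (ms):                          120.25
"""
print(parse_metrics(sample))

# Usage on a real log:
# with open("client_70b_2k_2k_fp8.log") as f:
#     print(parse_metrics(f.read()))
```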


View the benchmarking results in the `client_70b_2k_2k_fp8.log` output file.
Run the following cells before exiting this notebook.

In [None]:
!kill -9 <PID_OF_SERVER_FROM_CELL_NUMBER_5_ABOVE>
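
Because the server was launched with `shell=True` and `start_new_session=True`, the recorded PID belongs to the shell wrapper, and the server and `tee` run in the same process group. A minimal sketch for stopping the whole group, assuming the `process` object from the launch cell is still available, is shown below with a harmless `sleep` pipeline standing in for the server. The helper name is hypothetical.

```python
import os
import signal
import subprocess


def stop_process_group(proc, timeout_s=10):
    """Send SIGTERM to proc's whole process group, escalating to SIGKILL if it does not exit."""
    if proc.poll() is not None:
        return proc.returncode  # already finished
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()


# Demo with a stand-in pipeline; for the server, call stop_process_group(process) instead.
demo = subprocess.Popen("sleep 60 | cat", shell=True, start_new_session=True)
stop_process_group(demo)
print("Stopped:", demo.poll() is not None)
```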

In [None]:
exit()