# Understanding vLLM on Intel® Gaudi® 2 AI Accelerators

Copyright (c) 2024 Habana Labs, Ltd. an Intel Company.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

## Introduction

This notebook is a follow-up to the [Getting Started with vLLM](https://github.com/HabanaAI/Gaudi-tutorials/blob/main/PyTorch/Getting_Started_with_vLLM/Getting_Started_with_vLLM.ipynb) tutorial. It demonstrates how tuning vLLM-specific parameters for Intel® Gaudi® 2 AI Accelerators affects the overall performance of the vLLM engine. It walks you through the server bring-up phase and explains how to read the log entries that report memory consumption. It also uses vLLM's built-in ```benchmark_serving.py``` script, which measures throughput and latency metrics in an online serving setting.

### vLLM Installation and Environment Setup

For Gaudi requirements and installation, refer to [Requirements and Installation](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#requirements-and-installation).

It is highly recommended to use the latest Docker image from the Intel Gaudi vault. Refer to the [Run Docker Image](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#run-docker-image) section and the [Intel Gaudi documentation](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#pull-prebuilt-containers) for more details.

The following cell installs the vLLM server for Gaudi. For more information on installing vLLM for Gaudi, refer to [Build And Install vLLM](https://github.com/HabanaAI/vllm-fork/blob/habana_main/README_GAUDI.md#build-and-install-vllm-fork).

In [None]:
%%bash
git clone https://github.com/HabanaAI/vllm-fork.git 
cd vllm-fork &&
git checkout v0.5.3.post1+Gaudi-1.18.0 &&
pip install -e .

Enter your [Hugging Face Hub token](https://huggingface.co/docs/hub/en/security-tokens) (HF_TOKEN), which is needed to access gated models such as Llama 3.1.

In [2]:
HF_TOKEN="<YOUR HF_TOKEN HERE>"

Specify the model you would like to benchmark:

In [3]:
MODEL="meta-llama/Meta-Llama-3.1-8B"

### Start the vLLM server

The basic command below loads the model specified earlier and launches an OpenAI-compatible server instance.

```bash
python -m vllm.entrypoints.openai.api_server --model=models/Meta-Llama-3.1-8B --port 8000 --block-size 128
```

> **Note**: For optimal performance, it is recommended to run inference on Gaudi 2 with ```--block-size``` of 128 for BF16 data type.

The following cells launch the vLLM server as a background process and write its output to a separate file named ```vllm_server_stdout.log```.

In [None]:
cd vllm-fork

In [33]:
import subprocess

command = f"HF_TOKEN={HF_TOKEN} python -m vllm.entrypoints.openai.api_server --model={MODEL} --port 8000 --block-size 128"

with open('vllm_server_stdout.log', 'w') as file:
    process = subprocess.Popen(command, shell=True, start_new_session=True, stdout=file, stderr=subprocess.STDOUT)

print("Process ID:", process.pid)

Process ID: 10681


The server now begins initialization and will then proceed to the warmup phase.
**To read the server logs, wait a couple of minutes for the server to load the weights, then run the following cell:**

In [23]:
!grep -B5 "Free device memory:" vllm_server_stdout.log


INFO 11-03 21:23:07 habana_model_runner.py:587] Pre-loading model weights on hpu:0 took 15.05 GiB of device memory (15.05 GiB/94.62 GiB used) and 3.147 GiB of host memory (57.84 GiB/1007 GiB used)
INFO 11-03 21:23:07 habana_model_runner.py:639] Wrapping in HPU Graph took 0 B of device memory (15.05 GiB/94.62 GiB used) and -252 KiB of host memory (57.84 GiB/1007 GiB used)
INFO 11-03 21:23:07 habana_model_runner.py:643] Loading model weights took in total 15.05 GiB of device memory (15.05 GiB/94.62 GiB used) and 3.146 GiB of host memory (57.84 GiB/1007 GiB used)
INFO 11-03 21:23:14 habana_worker.py:146] Model profiling run took 5.318 GiB of device memory (20.37 GiB/94.62 GiB used) and 226 MiB of host memory (58.06 GiB/1007 GiB used)
INFO 11-03 21:23:14 habana_worker.py:170] Free device memory: 74.25 GiB, 66.83 GiB usable (gpu_memory_utilization=0.9), 26.73 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.4), 40.1 GiB reserved for KV cache


These log entries tell us:
- The device memory consumption after the model weights are loaded and the server's profiling run completes, but *before* HPU Graphs capture.
- How much of the usable memory pool is reserved for the **KV Cache** and how much for the **HPU Graphs**.
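
The figures in the last log line can be reproduced by hand. The sketch below copies the values from the log output above and assumes that usable memory is free memory scaled by `gpu_memory_utilization`, with `VLLM_GRAPH_RESERVED_MEM` of it going to HPU Graphs and the remainder to the KV cache. The variable names are illustrative only.

```python
# Values copied from the log lines above (all in GiB).
total_device_mem = 94.62
used_after_profiling = 20.37
gpu_memory_utilization = 0.9   # --gpu-memory-utilization
graph_reserved_mem = 0.4       # VLLM_GRAPH_RESERVED_MEM

# Memory still free once weights are loaded and the profiling run is done.
free_mem = total_device_mem - used_after_profiling
# Only a fraction of the free memory is treated as usable.
usable_mem = free_mem * gpu_memory_utilization
# The usable pool is split between HPU Graphs and the KV cache.
hpu_graph_mem = usable_mem * graph_reserved_mem
kv_cache_mem = usable_mem - hpu_graph_mem

print(f"free={free_mem:.2f} usable={usable_mem:.2f} "
      f"graphs={hpu_graph_mem:.2f} kv_cache={kv_cache_mem:.2f}")
```

The results should agree with the logged 74.25, 66.83, 26.73 and 40.1 GiB to within rounding.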


By now, the server has begun the warmup and HPU Graphs capture phase. Let us take a quick look at the [bucketing configuration](https://github.com/HabanaAI/vllm-fork/blob/v0.5.3.post1%2BGaudi-1.18.0/README_GAUDI.md#bucketing-mechanism) in use, which is crucial for getting efficient server responses.

In [27]:
!grep -A14 "VLLM_PROMPT_BS_BUCKET_MIN=" vllm_server_stdout.log

INFO 11-03 21:22:30 habana_model_runner.py:95] VLLM_PROMPT_BS_BUCKET_MIN=1 (default:min)
INFO 11-03 21:22:30 habana_model_runner.py:95] VLLM_PROMPT_BS_BUCKET_STEP=32 (default:step)
INFO 11-03 21:22:30 habana_model_runner.py:95] VLLM_PROMPT_BS_BUCKET_MAX=64 (default:max)
INFO 11-03 21:22:30 habana_model_runner.py:95] VLLM_DECODE_BS_BUCKET_MIN=32 (default:min)
INFO 11-03 21:22:30 habana_model_runner.py:95] VLLM_DECODE_BS_BUCKET_STEP=32 (default:step)
INFO 11-03 21:22:30 habana_model_runner.py:95] VLLM_DECODE_BS_BUCKET_MAX=256 (default:max)
INFO 11-03 21:22:30 habana_model_runner.py:95] VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:min)
INFO 11-03 21:22:30 habana_model_runner.py:95] VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:step)
INFO 11-03 21:22:30 habana_model_runner.py:95] VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:max)
INFO 11-03 21:22:30 habana_model_runner.py:95] VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:min)
INFO 11-03 21:22:30 habana_model_runner.py:95] VLLM_DECODE_BLOCK_BUCKET_STEP=128 (d
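
To get a feel for how many shapes these settings imply, here is a rough sketch that expands a (min, step, max) triple into a list of bucket sizes. It assumes the ramp-up behaviour described in the linked README: powers of two from `min` up to `step`, then multiples of `step` up to `max`. `bucket_range` is a hypothetical helper, not the function vLLM uses, and the server may filter out some combinations.

```python
def bucket_range(bmin: int, bstep: int, bmax: int) -> list[int]:
    """Hypothetical helper: list bucket sizes for a (min, step, max) triple."""
    if bmin < 1 or bstep < 1 or bmin > bmax:
        raise ValueError("expected 1 <= min <= max and step >= 1")
    # Ramp-up phase: keep doubling from `bmin` until `bstep` is reached.
    ramp_up = []
    value = bmin
    while value < bstep and value <= bmax:
        ramp_up.append(value)
        value *= 2
    # Stable phase: multiples of `bstep` up to and including `bmax`.
    stable = list(range(bstep, bmax + 1, bstep))
    return ramp_up + stable


# Triples copied from the log output above.
prompt_bs = bucket_range(1, 32, 64)
decode_bs = bucket_range(32, 32, 256)
prompt_seq = bucket_range(128, 128, 1024)

print("prompt batch sizes:", prompt_bs)
print("decode batch sizes:", decode_bs)
print("prompt sequence lengths:", prompt_seq)
# Upper bound on prompt shapes to warm up, before any filtering.
print("prompt bucket combinations:", len(prompt_bs) * len(prompt_seq))
```

More buckets mean more shapes to warm up and capture, which is one reason warmup time varies so much with the bucketing configuration.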

Next, **wait about ten minutes** for the server to finish warmup and HPU Graphs capture, then run the following cell.
> **Note**: Warmup and HPU Graphs capture time depends on many factors, such as input and output sequence length, batch size, number of buckets (bucketing configuration) and data type. Depending on the configuration, it can take anywhere from minutes to hours.

In [34]:
!grep -B1 "init_cache_engine took" vllm_server_stdout.log

INFO 11-04 05:27:00 habana_model_runner.py:1635] Warmup finished in 225 secs, allocated 1.429 GiB of device memory
INFO 11-04 05:27:00 habana_executor.py:91] init_cache_engine took 41.52 GiB of device memory (61.89 GiB/94.62 GiB used) and 2.675 GiB of host memory (66.15 GiB/1007 GiB used)


**If all looks good with no errors at this point, the server is ready to serve inference requests.**
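
Instead of waiting a fixed amount of time, you could poll the server until it answers. The sketch below is one possible approach using only the standard library. `wait_until_ready` is a hypothetical helper, and the choice of the `/v1/models` route on port 8000 is an assumption based on the OpenAI-compatible API this server exposes.

```python
import time
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 600.0, poll_s: float = 5.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, timed out, or an HTTP error status:
            # treat all of these as "not ready yet" and try again.
            pass
        time.sleep(poll_s)
    return False
```

For example, `wait_until_ready("http://localhost:8000/v1/models", timeout_s=1800)` would return `True` once the server responds, or `False` if it has not come up within 30 minutes. Checking the log for errors is still worthwhile, since a crashed server simply never becomes ready.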

### Debug Out of Memory Errors

If you see ```Out of Memory``` errors in the logs during any of the above stages, try restarting the server with the following command-line parameters:
- Increased ```--gpu-memory-utilization``` (default: 0.9): this addresses insufficient available memory per card.
- Increased ```--tensor-parallel-size``` (default: 1): this shards the model weights across devices and may help load a model that is too big for a single card.

Refer to the [Understanding vLLM logs](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/vLLM_Inference.html#understanding-vllm-logs) article for more details.

### Key Environment Variables
The following environment variables affect the server's efficiency. Set them before running the server command above.

#### VLLM_GRAPH_RESERVED_MEM
Defines the fraction of usable memory reserved for HPU Graphs; the remainder goes to the KV Cache. Default value: 0.4

HPU Graphs and the KV Cache share the same memory pool (*“usable memory”*, determined by ```--gpu-memory-utilization```), so this variable controls the balance between the two:
- A larger KV Cache accommodates bigger batches, resulting in **increased overall throughput**.
- A larger HPU Graphs capture budget reduces host overhead and can be useful for **reducing latency**.

#### VLLM_GRAPH_PROMPT_RATIO
Determines the share of usable graph memory reserved for prefill (prompt) graphs; the rest is left for decode graphs. Default value: 0.5
- Use a higher value to prioritize input token throughput.
- Use a lower value to prioritize output token throughput.
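
As a rough illustration of what this ratio means in absolute terms, the sketch below splits the 26.73 GiB reserved for HPU Graphs in the earlier log output. It assumes the ratio applies directly to that reserved amount; the graph memory actually usable at capture time may differ. `split_graph_memory` is a hypothetical helper.

```python
# HPU Graph memory reserved in the log output shown earlier, in GiB.
graph_mem = 26.73


def split_graph_memory(graph_mem_gib: float, prompt_ratio: float) -> tuple[float, float]:
    """Return the (prefill, decode) graph memory budgets in GiB."""
    if not 0.0 <= prompt_ratio <= 1.0:
        raise ValueError("prompt_ratio must be between 0 and 1")
    prefill = graph_mem_gib * prompt_ratio
    return prefill, graph_mem_gib - prefill


# Compare the default ratio with a prefill-heavy setting.
for ratio in (0.5, 0.8):
    prefill, decode = split_graph_memory(graph_mem, ratio)
    print(f"VLLM_GRAPH_PROMPT_RATIO={ratio}: "
          f"prefill={prefill:.2f} GiB, decode={decode:.2f} GiB")
```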

#### VLLM_GRAPH_PROMPT_STRATEGY
Configures the order in which prompt graphs are captured: `min_tokens` or `max_bs`. Default: `min_tokens`.

#### VLLM_GRAPH_DECODE_STRATEGY
Configures the order in which decode graphs are captured: `min_tokens` or `max_bs`. Default: `max_bs`.
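
One way to set these variables from Python is to pass them through the `env` argument of `subprocess.Popen` rather than prefixing them to a shell string. The sketch below only builds the argument list and environment; `build_launch` is a hypothetical helper, and the values shown are the defaults listed above.

```python
import os


def build_launch(model: str, overrides: dict[str, str],
                 port: int = 8000) -> tuple[list[str], dict[str, str]]:
    """Hypothetical helper: return (argv, env) suitable for subprocess.Popen."""
    argv = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        f"--model={model}", "--port", str(port), "--block-size", "128",
    ]
    # Start from the current environment and layer the tuning knobs on top.
    env = {**os.environ, **overrides}
    return argv, env


argv, env = build_launch(
    "meta-llama/Meta-Llama-3.1-8B",
    {
        "VLLM_GRAPH_RESERVED_MEM": "0.4",
        "VLLM_GRAPH_PROMPT_RATIO": "0.5",
        "VLLM_GRAPH_PROMPT_STRATEGY": "min_tokens",
        "VLLM_GRAPH_DECODE_STRATEGY": "max_bs",
    },
)
print(" ".join(argv))
```

The result can then be passed as `subprocess.Popen(argv, env=env, start_new_session=True, stdout=file, stderr=subprocess.STDOUT)`. Adding `HF_TOKEN` to the overrides keeps the token out of the command string.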


## Executing Benchmarks

Now that the server is running in the background and ready to serve requests, this section analyzes the effects of various **runtime** performance tuning knobs on throughput and latency, using vLLM's built-in `benchmark_serving.py` script for online serving.

In [None]:
%cd benchmarks

### Set Baseline:
Prepare the baseline command to run with the user-specified `MODEL`, the default key environment variables, and the other arguments shown below.
- It uses `ShareGPT_V3_unfiltered_cleaned_split.json`, a dataset containing real conversations from ChatGPT.
- The results from this command are saved in the `benchmark_serving_baseline.log` file and are compared against the results from the other experiments.
- The total input tokens and total output tokens are kept fixed across all the experiments.

In [17]:
!python benchmark_serving.py --backend vllm --model $MODEL --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf --sharegpt-output-len 128 2>&1 | tee benchmark_serving_baseline.log

  return isinstance(object, types.FunctionType)
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='/root/mnt/weka/data/git_lfs/pytorch/llama3.1/Meta-Llama-3.1-8B', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, sharegpt_output_len=128, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False, metadata=None, result_dir=None, result_filename=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
100%|██████████| 1000/1000 [01:32<00:00, 10.84it/s]
Successful requests:                     1000      
Benchmark duration (s):                  92.28     
Total input tokens:       

### Experiment with a higher VLLM_GRAPH_PROMPT_RATIO:
Let us **restart** the server with a higher prefill ratio (VLLM_GRAPH_PROMPT_RATIO=0.8), run the benchmark again, and look at the results.
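
Restarting means stopping the background process first. Because the server was launched with `shell=True` and `start_new_session=True`, the shell and the Python server share a process group, so signalling the group stops both. The sketch below is one possible way to do that on Linux; `stop_server` is a hypothetical helper that takes the `process` handle from the launch cell.

```python
import os
import signal
import subprocess


def stop_server(process: subprocess.Popen, timeout_s: float = 30.0) -> int:
    """Terminate the process group of a Popen started with start_new_session=True."""
    if process.poll() is None:
        try:
            # Ask the whole group (shell plus server) to shut down.
            os.killpg(os.getpgid(process.pid), signal.SIGTERM)
        except ProcessLookupError:
            pass  # The process exited between the poll and the signal.
        try:
            process.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Still running after the grace period: force it.
            os.killpg(os.getpgid(process.pid), signal.SIGKILL)
            process.wait()
    return process.returncode
```

After `stop_server(process)` returns, rerun the launch cell with `VLLM_GRAPH_PROMPT_RATIO=0.8` added to the command, and wait for warmup to finish before benchmarking.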


In [36]:
!python benchmark_serving.py --backend vllm --model $MODEL --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate inf --sharegpt-output-len 128 2>&1 | tee benchmark_serving_pr_08.log

  return isinstance(object, types.FunctionType)
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', model='/root/mnt/weka/data/git_lfs/pytorch/llama3.1/Meta-Llama-3.1-8B', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1000, sharegpt_output_len=128, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=False, metadata=None, result_dir=None, result_filename=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
100%|██████████| 1000/1000 [01:16<00:00, 13.05it/s]
Successful requests:                     1000      
Benchmark duration (s):                  76.60     
Total input tokens:       

#### Summary of Experiment (Increased Prefill Graph Memory) when compared to baseline numbers:
 - 20% increase in request throughput (req/s), with similar gains across the other throughput metrics.
   (This can be explained by the larger share of graph memory reserved for the prefill stage.)
 - 13% drop in Mean TTFT (time to first token)
 - 9% drop in Mean TPOT (time per output token)
 - 12% drop in Mean ITL (inter-token latency)
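
The throughput figure can be cross-checked from the two summary lines visible in the benchmark outputs above. The sketch below parses the request count and duration and computes the relative change. `parse_summary` and `throughput` are hypothetical helpers, and the embedded strings are copied from those outputs; with real runs you would read the two `.log` files instead.

```python
import re


def parse_summary(log_text: str) -> dict[str, float]:
    """Extract the request count and duration from benchmark output text."""
    requests = re.search(r"Successful requests:\s+(\d+)", log_text)
    duration = re.search(r"Benchmark duration \(s\):\s+([\d.]+)", log_text)
    if requests is None or duration is None:
        raise ValueError("summary lines not found")
    return {
        "requests": float(requests.group(1)),
        "duration_s": float(duration.group(1)),
    }


def throughput(summary: dict[str, float]) -> float:
    """Requests per second over the whole benchmark run."""
    return summary["requests"] / summary["duration_s"]


# Summary lines copied from the two benchmark outputs above.
baseline = parse_summary("Successful requests: 1000\nBenchmark duration (s): 92.28\n")
tuned = parse_summary("Successful requests: 1000\nBenchmark duration (s): 76.60\n")

change_pct = (throughput(tuned) / throughput(baseline) - 1.0) * 100.0
print(f"baseline={throughput(baseline):.2f} req/s, "
      f"tuned={throughput(tuned):.2f} req/s, change={change_pct:+.1f}%")
```

The computed rates line up with the progress-bar rates of 10.84 and 13.05 it/s, and the change comes out at roughly 20%.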


### Further experimentation:
You may repeat the above experiment with a lower **VLLM_GRAPH_PROMPT_RATIO**, different graph strategies, longer output lengths, and so on, and draw your own conclusions.