From 232db27a560c7c820b713d72cacc291acdcd7014 Mon Sep 17 00:00:00 2001
From: Deepak Singh
Date: Tue, 4 Nov 2025 14:54:10 +0000
Subject: [PATCH 1/5] Adding recipe for running Qwen models in G4

---
 .../g4/single-host-serving/vllm/README.md | 219 ++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 inference/g4/single-host-serving/vllm/README.md

diff --git a/inference/g4/single-host-serving/vllm/README.md b/inference/g4/single-host-serving/vllm/README.md
new file mode 100644
index 0000000..40529af
--- /dev/null
+++ b/inference/g4/single-host-serving/vllm/README.md
@@ -0,0 +1,219 @@
+# vLLM serving on a GCP VM with G4 GPUs
+
+This recipe shows how to serve and benchmark open source models using [vLLM](https://github.com/vllm-project/vllm) on a single GCP VM with G4 GPUs. vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. For more information on G4 machine types, see the [GCP documentation](https://cloud.google.com/compute/docs/accelerator-optimized-machines#g4-machine-types).
+
+## Before you begin
+
+### 1. Create a GCP VM with G4 GPUs
+
+First, we will create a Google Cloud Platform (GCP) Virtual Machine (VM) that has the necessary GPU resources.
+
+Make sure you have the following prerequisites:
+* [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) is initialized.
+* You have a project with a GPU quota. See [Request a quota increase](https://cloud.google.com/docs/quota/view-request#requesting_higher_quota).
+* [Enable required APIs](https://console.cloud.google.com/flows/enableapi?apiid=compute.googleapis.com).
+
+The following commands set up environment variables and create a GCE instance. The `MACHINE_TYPE` is set to `g4-standard-48` for a single GPU VM. The boot disk is set to 200GB to accommodate the models and dependencies.
+
+```bash
+export VM_NAME="${USER}-g4-test"
+export PROJECT_ID="your-project-id"
+export ZONE="your-zone"
+# g4-standard-48 is for a single GPU VM. For a multi-GPU VM (e.g., 8 GPUs), you can use g4-standard-384.
+export MACHINE_TYPE="g4-standard-48"
+export IMAGE_PROJECT="debian-cloud"
+export IMAGE_FAMILY="debian-12"
+
+gcloud compute instances create ${VM_NAME} \
+    --machine-type=${MACHINE_TYPE} \
+    --project=${PROJECT_ID} \
+    --zone=${ZONE} \
+    --image-project=${IMAGE_PROJECT} \
+    --image-family=${IMAGE_FAMILY} \
+    --maintenance-policy=TERMINATE \
+    --boot-disk-size=200GB
+```
+
+### 2. Connect to the VM
+
+Use `gcloud compute ssh` to connect to the newly created instance.
+
+```bash
+gcloud compute ssh ${VM_NAME?} --project=${PROJECT_ID?} --zone=${ZONE?}
+```
+
+### 3. Install the NVIDIA GPU driver and other dependencies
+
+These commands install the necessary drivers for the GPU to work, along with other development tools. `build-essential` contains a list of packages that are considered essential for building software, and `cmake` is a tool to manage the build process.
+
+Note: The CUDA toolkit version is specified here for reproducibility. Newer versions may be available and can be used by updating the download link.
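The runfile below is interactive (an EULA prompt followed by a component menu). If you prefer an unattended install, the same NVIDIA runfile also accepts silent-mode flags; the exact flags below are an assumption to verify against `sh cuda_12.9.1_575.57.08_linux.run --help` for your toolkit version:

```bash
# Optional: unattended install of the driver and CUDA toolkit from the same runfile.
# The flags are a sketch; confirm them with `--help` before relying on this in automation.
sudo sh cuda_12.9.1_575.57.08_linux.run --silent --driver --toolkit
```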
+
+```bash
+sudo bash
+
+# Install dependencies
+apt-get update && apt-get install libevent-core-2.1-7 libevent-2.1-7 libevent-dev zip gcc make wget zip libboost-program-options-dev build-essential devscripts debhelper fakeroot -y && wget https://cmake.org/files/v3.26/cmake-3.26.0-rc1-linux-x86_64.sh && bash cmake-3.26.0-rc1-linux-x86_64.sh --skip-license
+
+# Update linux headers
+apt-get -y install linux-headers-$(uname -r)
+
+# Download CUDA toolkit 12.9.1
+wget https://developer.download.nvidia.com/compute/cuda/12.9.1/local_installers/cuda_12.9.1_575.57.08_linux.run
+
+# Install CUDA
+# Accept the EULA (type accept)
+# Keep the default selections for CUDA installation (hit the down arrow, and then hit enter on "Install")
+sh cuda_12.9.1_575.57.08_linux.run
+
+exit
+```
+### 4. Set environment variables and check devices
+
+We need to update the `PATH` and `LD_LIBRARY_PATH` environment variables so the system can find the CUDA executables and libraries. `HF_TOKEN` is your Hugging Face token, which is required to download some models.
+
+```bash
+# Update the PATH
+export PATH=/usr/local/cuda/bin:$PATH
+export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu
+export HF_TOKEN=
+
+# Run NVIDIA smi to verify the driver installation and see the available GPUs.
+nvidia-smi
+```
+
+## Serve a model
+
+### 1. Setup vLLM Environment
+
+Using a conda environment is a best practice to isolate python dependencies and avoid conflicts with system-wide packages.
+
+```bash
+# Not required but adding for reproducibility
+mkdir -p ~/miniconda3
+wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
+bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
+rm -rf ~/miniconda3/miniconda.sh
+export PATH="$HOME/miniconda3/bin:$PATH"
+source ~/.bashrc
+conda create -n vllm python=3.11.2
+source activate vllm
+```
+
+### 2. Install vLLM
+
+Here we use `uv`, a fast python package installer. The `--extra-index-url` flag is used to point to the vLLM wheel index, and `--torch-backend=auto` will automatically select the correct torch backend.
+
+Note: The versions of `flashinfer` and `vllm` are specified for reproducibility. You can check for and install newer versions as they become available.
+
+```bash
+pip install uv
+uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto
+uv pip install flashinfer-python==0.3.1
+uv pip install guidellm==0.3.0
+```
+
+## Run Benchmarks for Qwen3-8B-FP4
+
+### 1. Set environment variables
+
+These environment variables are used to enable specific features in vLLM and its backends.
+- `ENABLE_NVFP4_SM120=1`: Enables NVIDIA's FP4 support on newer GPU architectures.
+- `VLLM_ATTENTION_BACKEND=FLASHINFER`: Sets the attention backend to FlashInfer, which is a high-performance implementation. Other available backends include `XFORMERS`.
+
+```bash
+export ENABLE_NVFP4_SM120=1
+export VLLM_ATTENTION_BACKEND=FLASHINFER
+```
+
+### 2. Run the server
+
+The `vllm serve` command starts the vLLM server. Here's a breakdown of the arguments:
+- `nvidia/Qwen3-8B-FP4`: The model to be served from Hugging Face.
+- `--served-model-name nvidia/Qwen3-8B-FP4`: The name to use for the model endpoint.
+- `--kv-cache-dtype fp8`: Sets the data type for the key-value cache to FP8 to save GPU memory.
+- `--port 8000`: The port the server will listen on.
+- `--disable-log-requests`: Disables request logging for better performance.
+- `--seed 42`: Sets a random seed for reproducibility.
+- `--max-model-len 8192`: The maximum sequence length the model can handle.
+- `--gpu-memory-utilization 0.95`: The fraction of GPU memory to be used by vLLM.
+- `--tensor-parallel-size 1`: The number of GPUs to use for tensor parallelism. Since we are using a single GPU, this is set to 1. vLLM supports a combination of multiple parallelization strategies, which can be enabled with different arguments (--data-parallel-size, --pipeline-parallel-size etc).
+
+```bash
+vllm serve nvidia/Qwen3-8B-FP4 --served-model-name nvidia/Qwen3-8B-FP4 --kv-cache-dtype fp8 --port 8000 --disable-log-requests --seed 42 --max-model-len 8192 --gpu-memory-utilization 0.95 --tensor-parallel-size 1
+```
+
+### 3. Server Output
+
+When the server is up and running, you should see output similar to the following.
+
+```
+(APIServer pid=758221) INFO 11-03 19:48:49 [launcher.py:46] Route: /metrics, Methods: GET
+(APIServer pid=758221) INFO: Started server process [758221]
+(APIServer pid=758221) INFO: Waiting for application startup.
+(APIServer pid=758221) INFO: Application startup complete.
+```
+
+### 4. Run the benchmarks
+
+To run the benchmark, you will need to interact with the server. This requires a separate terminal session. You have two options:
+
+1. **New Terminal**: Open a new terminal window and create a second SSH connection to your VM. You can then run the benchmark command in the new terminal while the server continues to run in the first one.
+2. **Background Process**: Run the server process in the background. To do this, append an ampersand (`&`) to the end of the `vllm serve` command. This will start the server and immediately return control of the terminal to you.
+
+Example of running the server in the background:
+```bash
+vllm serve nvidia/Qwen3-8B-FP4 --served-model-name nvidia/Qwen3-8B-FP4 --kv-cache-dtype fp8 --port 8000 --disable-log-requests --seed 42 --max-model-len 8192 --gpu-memory-utilization 0.95 --tensor-parallel-size 1 &
+```
+
+Once the server is running (either in another terminal or in the background), you can run the benchmark client.
+
+The `vllm bench serve` command is used to benchmark the running vLLM server. Here's a breakdown of the arguments:
+- `--model nvidia/Qwen3-8B-FP4`: The model to benchmark.
+- `--dataset-name random`: The dataset to use for the benchmark. `random` will generate random prompts.
+- `--random-input-len 128`: The length of the random input prompts.
+- `--random-output-len 2048`: The length of the generated output.
+- `--request-rate inf`: The number of requests per second to send. `inf` sends requests as fast as possible.
+- `--num-prompts 100`: The total number of prompts to send.
+- `--ignore-eos`: A flag to ignore the end-of-sentence token and generate a fixed number of tokens.
+
+```bash
+vllm bench serve --model nvidia/Qwen3-8B-FP4 --dataset-name random --random-input-len 128 --random-output-len 2048 --request-rate inf --num-prompts 100 --ignore-eos
+```
+### 5. Example output
+
+The output shows various performance metrics of the model, such as throughput and latency.
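Separately from the aggregate metrics, it can be worth confirming that the server answers individual requests. A minimal sanity check against vLLM's OpenAI-compatible API, assuming the server from step 2 is still listening on port 8000 (the prompt text and token count below are arbitrary):

```bash
# List the models the server is exposing
curl http://localhost:8000/v1/models

# Send a single completion request to the served model
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/Qwen3-8B-FP4", "prompt": "San Francisco is a", "max_tokens": 32}'
```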
+
+```bash
+============ Serving Benchmark Result ============
+Successful requests:                     100
+Request rate configured (RPS):           100.00
+Benchmark duration (s):                  10.00
+Total input tokens:                      12800
+Total generated tokens:                  204800
+Request throughput (req/s):              10.00
+Output token throughput (tok/s):         20480.00
+Total Token throughput (tok/s):          21760.00
+---------------Time to First Token----------------
+Mean TTFT (ms):                          100.00
+Median TTFT (ms):                        99.00
+P99 TTFT (ms):                           150.00
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          10.00
+Median TPOT (ms):                        9.90
+P99 TPOT (ms):                           15.00
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           10.00
+Median ITL (ms):                         9.90
+P99 ITL (ms):                            15.00
+==================================================
+```
+
+## Clean up
+
+### 1. Delete the VM
+
+This command will delete the GCE instance and all its disks.
+
+```bash
+gcloud compute instances delete ${VM_NAME?} --zone=${ZONE?} --project=${PROJECT_ID} --quiet --delete-disks=all
+```
\ No newline at end of file

From f86281f9f8d59c98166151451dfcd019bf9d6129 Mon Sep 17 00:00:00 2001
From: Deepak Singh
Date: Thu, 6 Nov 2025 07:12:25 +0000
Subject: [PATCH 2/5] Updating README.md with G4 details

---
 README.md                                       | 6 ++++++
 inference/g4/single-host-serving/vllm/README.md | 3 ---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 5ae750e..716c2d3 100644
--- a/README.md
+++ b/README.md
@@ -78,6 +78,12 @@ Models | GPU Machine Type |
 **DeepSeek R1 671B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | vLLM | Inference | GKE | [Link](./inference/a4/single-host-serving/vllm/README.md)
 **DeepSeek R1 671B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | SGLang | Inference | GKE | [Link](./inference/a4/single-host-serving/sglang/README.md)
 
+### Inference benchmarks G4
+
+| Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe |
+| ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ |
+| **Qwen3 8B** | [G4 (NVIDIA RTX PRO 6000 Blackwell)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#g4-series) | vLLM | Inference | GCE | [Link](./inference/g4/single-host-serving/vllm/README.md)
+
 ### Checkpointing benchmarks
 
 Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe
diff --git a/inference/g4/single-host-serving/vllm/README.md b/inference/g4/single-host-serving/vllm/README.md
index 40529af..5d7f8b6 100644
--- a/inference/g4/single-host-serving/vllm/README.md
+++ b/inference/g4/single-host-serving/vllm/README.md
@@ -85,10 +85,7 @@ nvidia-smi
 
 ### 1. Setup vLLM Environment
 
-Using a conda environment is a best practice to isolate python dependencies and avoid conflicts with system-wide packages.
- ```bash -# Not required but adding for reproducibility mkdir -p ~/miniconda3 wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3 From 0211080528a5d76834677d86ea1d2a507ade7e2a Mon Sep 17 00:00:00 2001 From: Deepak Singh Date: Fri, 14 Nov 2025 18:28:46 +0000 Subject: [PATCH 3/5] Using docker images for vLLM --- .../g4/single-host-serving/vllm/README.md | 198 +++++++----------- 1 file changed, 81 insertions(+), 117 deletions(-) diff --git a/inference/g4/single-host-serving/vllm/README.md b/inference/g4/single-host-serving/vllm/README.md index 5d7f8b6..f28bb82 100644 --- a/inference/g4/single-host-serving/vllm/README.md +++ b/inference/g4/single-host-serving/vllm/README.md @@ -1,6 +1,6 @@ # vLLM serving on a GCP VM with G4 GPUs -This recipe shows how to serve and benchmark open source models using [vLLM](https://github.com/vllm-project/vllm) on a single GCP VM with G4 GPUs. vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. For more information on G4 machine types, see the [GCP documentation](https://cloud.google.com/compute/docs/accelerator-optimized-machines#g4-machine-types). +This recipe shows how to serve and benchmark Qwen3-8B model using [vLLM](https://github.com/vllm-project/vllm) on a single GCP VM with G4 GPUs. vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. For more information on G4 machine types, see the [GCP documentation](https://cloud.google.com/compute/docs/accelerator-optimized-machines#g4-machine-types). ## Before you begin @@ -13,7 +13,7 @@ Make sure you have the following prerequisites: * You have a project with a GPU quota. See [Request a quota increase](https://cloud.google.com/docs/quota/view-request#requesting_higher_quota). * [Enable required APIs](https://console.cloud.google.com/flows/enableapi?apiid=compute.googleapis.com). -The following commands set up environment variables and create a GCE instance. The `MACHINE_TYPE` is set to `g4-standard-48` for a single GPU VM. The boot disk is set to 200GB to accommodate the models and dependencies. +The following commands set up environment variables and create a GCE instance. The `MACHINE_TYPE` is set to `g4-standard-48` for a single GPU VM, More information on different machine types can be found in the [GCP documentation](https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#g4-machine-types). The boot disk is set to 200GB to accommodate the models and dependencies. ```bash export VM_NAME="${USER}-g4-test" @@ -21,8 +21,8 @@ export PROJECT_ID="your-project-id" export ZONE="your-zone" # g4-standard-48 is for a single GPU VM. For a multi-GPU VM (e.g., 8 GPUs), you can use g4-standard-384. export MACHINE_TYPE="g4-standard-48" -export IMAGE_PROJECT="debian-cloud" -export IMAGE_FAMILY="debian-12" +export IMAGE_PROJECT="ubuntu-os-accelerator-images" +export IMAGE_FAMILY="ubuntu-accelerator-2404-amd64-with-nvidia-570" gcloud compute instances create ${VM_NAME} \ --machine-type=${MACHINE_TYPE} \ @@ -42,129 +42,96 @@ Use `gcloud compute ssh` to connect to the newly created instance. gcloud compute ssh ${VM_NAME?} --project=${PROJECT_ID?} --zone=${ZONE?} ``` -### 3. Install the NVIDIA GPU driver and other dependencies - -These commands install the necessary drivers for the GPU to work, along with other development tools. 
`build-essential` contains a list of packages that are considered essential for building software, and `cmake` is a tool to manage the build process. - -Note: The CUDA toolkit version is specified here for reproducibility. Newer versions may be available and can be used by updating the download link. - -```bash -sudo bash - -# Install dependencies -apt-get update && apt-get install libevent-core-2.1-7 libevent-2.1-7 libevent-dev zip gcc make wget zip libboost-program-options-dev build-essential devscripts debhelper fakeroot -y && wget https://cmake.org/files/v3.26/cmake-3.26.0-rc1-linux-x86_64.sh && bash cmake-3.26.0-rc1-linux-x86_64.sh --skip-license - -# Update linux headers -apt-get -y install linux-headers-$(uname -r) - -# Download CUDA toolkit 12.9.1 -wget https://developer.download.nvidia.com/compute/cuda/12.9.1/local_installers/cuda_12.9.1_575.57.08_linux.run - -# Install CUDA -# Accept the EULA (type accept) -# Keep the default selections for CUDA installation (hit the down arrow, and then hit enter on "Install") -sh cuda_12.9.1_575.57.08_linux.run - -exit ``` -### 4. Set environment variables and check devices - -We need to update the `PATH` and `LD_LIBRARY_PATH` environment variables so the system can find the CUDA executables and libraries. `HF_TOKEN` is your Hugging Face token, which is required to download some models. - -```bash -# Update the PATH -export PATH=/usr/local/cuda/bin:$PATH -export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu -export HF_TOKEN= - # Run NVIDIA smi to verify the driver installation and see the available GPUs. nvidia-smi ``` ## Serve a model -### 1. Setup vLLM Environment +### 1. Install Docker -```bash -mkdir -p ~/miniconda3 -wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh -bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3 -rm -rf ~/miniconda3/miniconda.sh -export PATH="$HOME/miniconda3/bin:$PATH" -source ~/.bashrc -conda create -n vllm python=3.11.2 -source activate vllm -``` +Before you can serve the model, you need to have Docker installed on your VM. You can follow the official documentation to install Docker on Ubuntu: +[Install Docker Engine on Ubuntu](httpss://docs.docker.com/engine/install/ubuntu/#install-using-the-repository) -### 2. Install vLLM +After installing Docker, make sure the Docker daemon is running. -Here we use `uv`, a fast python package installer. The `--extra-index-url` flag is used to point to the vLLM wheel index, and `--torch-backend=auto` will automatically select the correct torch backend. +### 2. Install NVIDIA Container Toolkit -Note: The version of `flashinfer`, `vllm` is specified for reproducibility. You can check for and install newer versions as they become available. +To enable Docker containers to access the GPU, you need to install the NVIDIA Container Toolkit. This toolkit allows the container to interact with the NVIDIA driver on the host machine, making the GPU resources available within the container. -```bash -pip install uv -uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto -uv pip install flashinfer-python==0.3.1 -uv pip install guidellm==0.3.0 -``` +You can follow the official NVIDIA documentation to install the container toolkit: +[NVIDIA Container Toolkit Install Guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) -## Run Benchmarks for Qwen3-8B-FP4 +### 3. Install vLLM -### 1. 
Set environment variables +We will use the official vLLM docker image. This image comes with vLLM and all its dependencies pre-installed. -These environment variables are used to enable specific features in vLLM and its backends. -- `ENABLE_NVFP4_SM120=1`: Enables NVIDIA\'s FP4 support on newer GPU architectures. -- `VLLM_ATTENTION_BACKEND=FLASHINFER`: Sets the attention backend to FlashInfer, which is a high-performance implementation. Other available backends include `XFORMERS` etc. +To run the vLLM server, you can use the following command: ```bash -export ENABLE_NVFP4_SM120=1 -export VLLM_ATTENTION_BACKEND=FLASHINFER +sudo docker run \ + --runtime nvidia \ + --gpus all \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \ + -p 8000:8000 \ + --ipc=host \ + vllm/vllm-openai:latest \ + --model nvidia/Qwen3-8B-FP4 \ + --kv-cache-dtype fp8 \ + --gpu-memory-utilization 0.95 ``` -### 2. Run the server +Here's a breakdown of the arguments: +- `--runtime nvidia --gpus all`: This makes the NVIDIA GPUs available inside the container. +- `-v ~/.cache/huggingface:/root/.cache/huggingface`: This mounts the Hugging Face cache directory from the host to the container. This is useful for caching downloaded models. +- `--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN"`: This sets the Hugging Face Hub token as an environment variable in the container. This is required for downloading models that require authentication. +- `-p 8000:8000`: This maps port 8000 on the host to port 8000 in the container. +- `--ipc=host`: This allows the container to share the host's IPC namespace, which can improve performance. +- `vllm/vllm-openai:latest`: This is the name of the official vLLM docker image. +- `--model nvidia/Qwen3-8B-FP4`: The model to be served from Hugging Face. +- `--kv-cache-dtype fp8`: Sets the data type for the key-value cache to FP8 to save GPU memory. +- `--gpu-memory-utilization 0.95`: The fraction of GPU memory to be used by vLLM. -The `vllm serve` command starts the vLLM server. Here\'s a breakdown of the arguments: -- `nvidia/Qwen3-8B-FP4`: The model to be served from Hugging Face. -- `--served-model-name nvidia/Qwen3-8B-FP4`: The name to use for the model endpoint. -- `--kv-cache-dtype fp8`: Sets the data type for the key-value cache to FP8 to save GPU memory. -- `--port 8000`: The port the server will listen on. -- `--disable-log-requests`: Disables request logging for better performance. -- `--seed 42`: Sets a random seed for reproducibility. -- `--max-model-len 8192`: The maximum sequence length the model can handle. -- `--gpu-memory-utilization 0.95`: The fraction of GPU memory to be used by vLLM. -- `--tensor-parallel-size 1`: The number of GPUs to use for tensor parallelism. Since we are using a single GPU, this is set to 1. vLLM supports combination of multiple parallization strategies which can be enabled with different arguments (--data-parallel-size, --pipeline-parallel-size etc). +For more information on the available engine arguments, you can refer to the [official vLLM documentation](https://docs.vllm.ai/en/latest/configuration/engine_args/). -```bash -vllm serve nvidia/Qwen3-8B-FP4 --served-model-name nvidia/Qwen3-8B-FP4 --kv-cache-dtype fp8 --port 8000 --disable-log-requests --seed 42 --max-model-len 8192 --gpu-memory-utilization 0.95 --tensor-parallel-size 1 -``` +After running the command, the model will be served. 
To run the benchmark, you will need to either run the server in the background by appending `&` to the command, or open a new terminal to run the benchmark command. -### 3. Server Output +## Run Benchmarks for Qwen3-8B-FP4 + +### 1. Server Output When the server is up and running, you should see output similar to the following. ``` -(APIServer pid=758221) INFO 11-03 19:48:49 [launcher.py:46] Route: /metrics, Methods: GET -(APIServer pid=758221) INFO: Started server process [758221] -(APIServer pid=758221) INFO: Waiting for application startup. -(APIServer pid=758221) INFO: Application startup complete. +(APIServer pid=XXXXXX) INFO XX-XX XX:XX:XX [launcher.py:XX] Route: /metrics, Methods: GET +(APIServer pid=XXXXXX) INFO: Started server process [XXXXXX] +(APIServer pid=XXXXXX) INFO: Waiting for application startup. +(APIServer pid=XXXXXX) INFO: Application startup complete. ``` -### 4. Run the benchmarks - -To run the benchmark, you will need to interact with the server. This requires a separate terminal session. You have two options: +### 2. Run the benchmarks -1. **New Terminal**: Open a new terminal window and create a second SSH connection to your VM. You can then run the benchmark command in the new terminal while the server continues to run in the first one. -2. **Background Process**: Run the server process in the background. To do this, append an ampersand (`&`) to the end of the `vllm serve` command. This will start the server and immediately return control of the terminal to you. +To run the benchmark, you can use the following command: -Example of running the server in the background: ```bash -vllm serve nvidia/Qwen3-8B-FP4 --served-model-name nvidia/Qwen3-8B-FP4 --kv-cache-dtype fp8 --port 8000 --disable-log-requests --seed 42 --max-model-len 8192 --gpu-memory-utilization 0.95 --tensor-parallel-size 1 & +sudo docker run \ + --runtime nvidia \ + --gpus all \ + --network="host" \ + --entrypoint vllm \ + vllm/vllm-openai:latest bench serve \ + --model nvidia/Qwen3-8B-FP4 \ + --dataset-name random \ + --random-input-len 128 \ + --random-output-len 2048 \ + --request-rate inf \ + --num-prompts 100 \ + --ignore-eos ``` -Once the server is running (either in another terminal or in the background), you can run the benchmark client. - -The `vllm bench serve` command is used to benchmark the running vLLM server. Here\'s a breakdown of the arguments: +Here's a breakdown of the arguments: - `--model nvidia/Qwen3-8B-FP4`: The model to benchmark. - `--dataset-name random`: The dataset to use for the benchmark. `random` will generate random prompts. - `--random-input-len 128`: The length of the random input prompts. @@ -173,35 +140,32 @@ The `vllm bench serve` command is used to benchmark the running vLLM server. Her - `--num-prompts 100`: The total number of prompts to send. - `--ignore-eos`: A flag to ignore the end-of-sentence token and generate a fixed number of tokens. -```bash -vllm bench serve --model nvidia/Qwen3-8B-FP4 --dataset-name random --random-input-len 128 --random-output-len 2048 --request-rate inf --num-prompts 100 --ignore-eos -``` -### 5. Example output +### 3. Example output The output shows various performance metrics of the model, such as throughput and latency. 
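If you want to keep the raw metrics from each run for later comparison, the benchmark client can usually also write them to a JSON file. The sketch below mounts a host directory into the container for that purpose; `--save-result` and `--result-dir` are assumed flag names inherited from vLLM's benchmarking script, so confirm them with `vllm bench serve --help` before relying on them:

```bash
# Sketch: persist benchmark results to ./results on the host.
# --save-result / --result-dir are assumed flags; verify with `vllm bench serve --help`.
mkdir -p results
sudo docker run \
    --runtime nvidia \
    --gpus all \
    --network="host" \
    -v "$(pwd)/results:/results" \
    --entrypoint vllm \
    vllm/vllm-openai:latest bench serve \
    --model nvidia/Qwen3-8B-FP4 \
    --dataset-name random \
    --random-input-len 128 \
    --random-output-len 2048 \
    --request-rate inf \
    --num-prompts 100 \
    --ignore-eos \
    --save-result \
    --result-dir /results
```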
```bash -============ Serving Benchmark Result ============ -Successful requests: 100 -Request rate configured (RPS): 100.00 -Benchmark duration (s): 10.00 -Total input tokens: 12800 -Total generated tokens: 204800 -Request throughput (req/s): 10.00 -Output token throughput (tok/s): 20480.00 -Total Token throughput (tok/s): 21760.00 +============ Serving Benchmark Result ============ +Successful requests: XX +Request rate configured (RPS): XX +Benchmark duration (s): XX +Total input tokens: XX +Total generated tokens: XX +Request throughput (req/s): XX +Output token throughput (tok/s): XX +Total Token throughput (tok/s): XX ---------------Time to First Token---------------- -Mean TTFT (ms): 100.00 -Median TTFT (ms): 99.00 -P99 TTFT (ms): 150.00 +Mean TTFT (ms): XX +Median TTFT (ms): XX +P99 TTFT (ms): XX -----Time per Output Token (excl. 1st token)------ -Mean TPOT (ms): 10.00 -Median TPOT (ms): 9.90 -P99 TPOT (ms): 15.00 +Mean TPOT (ms): XX +Median TPOT (ms): XX +P99 TPOT (ms): XX ---------------Inter-token Latency---------------- -Mean ITL (ms): 10.00 -Median ITL (ms): 9.90 -P99 ITL (ms): 15.00 +Mean ITL (ms): XX +Median ITL (ms): XX +P99 ITL (ms): XX ================================================== ``` @@ -213,4 +177,4 @@ This command will delete the GCE instance and all its disks. ```bash gcloud compute instances delete ${VM_NAME?} --zone=${ZONE?} --project=${PROJECT_ID} --quiet --delete-disks=all -``` \ No newline at end of file +``` From 650d1f81e466767fd162c3c7985facd4567b0cd5 Mon Sep 17 00:00:00 2001 From: Deepak Singh Date: Fri, 14 Nov 2025 18:40:11 +0000 Subject: [PATCH 4/5] Correct the readme path --- README.md | 2 +- .../g4/{ => qwen-8b}/single-host-serving/vllm/README.md | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) rename inference/g4/{ => qwen-8b}/single-host-serving/vllm/README.md (97%) diff --git a/README.md b/README.md index 716c2d3..c1b72d1 100644 --- a/README.md +++ b/README.md @@ -82,7 +82,7 @@ Models | GPU Machine Type | Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe | | ---------------- | ---------------- | --------- | ------------------- | ------------ | ------------------ | -| **Qwen3 8B** | [G4 (NVIDIA RTX PRO 6000 Blackwell)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#g4-series) | vLLM | Inference | GCE | [Link](./inference/g4/single-host-serving/vllm/README.md) +| **Qwen3 8B** | [G4 (NVIDIA RTX PRO 6000 Blackwell)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#g4-series) | vLLM | Inference | GCE | [Link](./inference/g4/qwen-8b/single-host-serving/vllm/README.md) ### Checkpointing benchmarks diff --git a/inference/g4/single-host-serving/vllm/README.md b/inference/g4/qwen-8b/single-host-serving/vllm/README.md similarity index 97% rename from inference/g4/single-host-serving/vllm/README.md rename to inference/g4/qwen-8b/single-host-serving/vllm/README.md index f28bb82..75d397d 100644 --- a/inference/g4/single-host-serving/vllm/README.md +++ b/inference/g4/qwen-8b/single-host-serving/vllm/README.md @@ -1,4 +1,4 @@ -# vLLM serving on a GCP VM with G4 GPUs +# Single host inference benchmark of Qwen3-8B with vLLM on G4 This recipe shows how to serve and benchmark Qwen3-8B model using [vLLM](https://github.com/vllm-project/vllm) on a single GCP VM with G4 GPUs. vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. 
For more information on G4 machine types, see the [GCP documentation](https://cloud.google.com/compute/docs/accelerator-optimized-machines#g4-machine-types). @@ -52,7 +52,7 @@ nvidia-smi ### 1. Install Docker Before you can serve the model, you need to have Docker installed on your VM. You can follow the official documentation to install Docker on Ubuntu: -[Install Docker Engine on Ubuntu](httpss://docs.docker.com/engine/install/ubuntu/#install-using-the-repository) +[Install Docker Engine on Ubuntu](httpss://docs.docker.com/engine/install/ubuntu/) After installing Docker, make sure the Docker daemon is running. @@ -94,7 +94,7 @@ Here's a breakdown of the arguments: - `--kv-cache-dtype fp8`: Sets the data type for the key-value cache to FP8 to save GPU memory. - `--gpu-memory-utilization 0.95`: The fraction of GPU memory to be used by vLLM. -For more information on the available engine arguments, you can refer to the [official vLLM documentation](https://docs.vllm.ai/en/latest/configuration/engine_args/). +For more information on the available engine arguments, you can refer to the [official vLLM documentation](https://docs.vllm.ai/en/latest/configuration/engine_args/), which includes different parallelism strategies that can be used with multi GPU setup. After running the command, the model will be served. To run the benchmark, you will need to either run the server in the background by appending `&` to the command, or open a new terminal to run the benchmark command. From 09ab7086ccba7a79ee514807e221325986b16d31 Mon Sep 17 00:00:00 2001 From: Deepak Singh Date: Sat, 15 Nov 2025 00:12:26 +0530 Subject: [PATCH 5/5] Fix link formatting in README.md for Docker installation Fix link formatting in README.md for Docker installation --- inference/g4/qwen-8b/single-host-serving/vllm/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/inference/g4/qwen-8b/single-host-serving/vllm/README.md b/inference/g4/qwen-8b/single-host-serving/vllm/README.md index 75d397d..c91d1dd 100644 --- a/inference/g4/qwen-8b/single-host-serving/vllm/README.md +++ b/inference/g4/qwen-8b/single-host-serving/vllm/README.md @@ -52,7 +52,7 @@ nvidia-smi ### 1. Install Docker Before you can serve the model, you need to have Docker installed on your VM. You can follow the official documentation to install Docker on Ubuntu: -[Install Docker Engine on Ubuntu](httpss://docs.docker.com/engine/install/ubuntu/) +[Install Docker Engine on Ubuntu](https://docs.docker.com/engine/install/ubuntu/) After installing Docker, make sure the Docker daemon is running.
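Once Docker and the NVIDIA Container Toolkit are installed, a quick way to confirm that containers can reach the GPU is to run `nvidia-smi` inside a throwaway container, mirroring the sample workload from NVIDIA's install guide:

```bash
# Check that the Docker daemon is running
sudo systemctl status docker --no-pager

# Check that containers can access the GPU through the NVIDIA runtime
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
```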