Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
---
title: Overview and Optimized Build
title: Build and validate vLLM for Arm64 inference on Azure Cobalt 100
weight: 2

### FIXED, DO NOT MODIFY
Expand All @@ -8,50 +8,57 @@ layout: learningpathall

## What is vLLM?

vLLM is an open-source, high-throughput inference and serving engine for large language models (LLMs).
It’s designed to make LLM inference faster, more memory-efficient, and scalable, particularly during the prefill (context processing) and decode (token generation) phases of inference.
vLLM is an open-source, high-throughput inference and serving engine for large language models (LLMs). It’s designed to make LLM inference faster, more memory-efficient, and scalable, particularly during the prefill (context processing) and decode (token generation) phases of inference.

### Key Features
* Continuous Batching – Dynamically combines incoming inference requests into a single large batch, maximizing CPU/GPU utilization and throughput.
* KV Cache Management – Efficiently stores and reuses key-value attention states, sustaining concurrency across multiple active sessions while minimizing memory overhead.
* Token Streaming – Streams generated tokens as they are produced, enabling real-time responses for chat or API scenarios.
### Interaction Modes
## Key features
* Continuous batching: dynamically merges incoming inference requests into larger batches, maximizing Arm CPU utilization and overall throughput
* KV cache management: efficiently stores and reuses key-value attention states, sustaining concurrency across multiple active sessions while minimizing memory overhead
* Token streaming: streams generated tokens as they are produced, enabling real-time responses for chat or API scenarios
## Interaction modes
You can use vLLM in two main ways:
* OpenAI-Compatible REST Server:
vLLM provides a /v1/chat/completions endpoint compatible with the OpenAI API schema, making it drop-in ready for tools like LangChain, LlamaIndex, and the official OpenAI Python SDK.
* Python API:
Load and serve models programmatically within your own Python scripts for flexible local inference and evaluation.
- Using an OpenAI-Compatible REST Server: vLLM provides a /v1/chat/completions endpoint compatible with the OpenAI API schema, making it drop-in ready for tools like LangChain, LlamaIndex, and the official OpenAI Python SDK
- Using a Python API: load and serve models programmatically within your own Python scripts for flexible local inference and evaluation

vLLM supports Hugging Face Transformer models out-of-the-box and scales seamlessly from single-prompt testing to production batch inference.

## What you build
## What'll you build

In this learning path, you will build a CPU-optimized version of vLLM targeting the Arm64 architecture, integrated with oneDNN and the Arm Compute Library (ACL).
In this Learning Path, you'll build a CPU-optimized version of vLLM targeting the Arm64 architecture, integrated with oneDNN and the Arm Compute Library (ACL).
This build enables high-performance LLM inference on Arm servers, leveraging specialized Arm math libraries and kernel optimizations.
After compiling, you’ll validate your build by running a local chat example to confirm functionality and measure baseline inference speed.

## Why this is fast on Arm

vLLM achieves high performance on Arm servers by combining software and hardware optimizations. Here’s why your build runs fast:

- Arm-optimized kernels: vLLM uses oneDNN and the Arm Compute Library to accelerate matrix multiplications, normalization, and activation functions. These libraries are tuned for Arm’s aarch64 architecture.
- Efficient quantization: INT4 quantized models run faster on Arm because KleidiAI microkernels use DOT-product instructions (SDOT/UDOT) available on Arm CPUs.
- Paged attention tuning: the paged attention mechanism is optimized for Arm’s NEON and SVE pipelines, improving token reuse and throughput during long-sequence generation.
- MoE fusion: for Mixture-of-Experts models, vLLM fuses INT4 expert layers to reduce memory transfers and bandwidth bottlenecks.
- Thread affinity and memory allocation: setting thread affinity ensures balanced CPU core usage, while tcmalloc reduces memory fragmentation and allocator contention.

These optimizations work together to deliver higher throughput and lower latency for LLM inference on Arm servers.

vLLM’s performance on Arm servers is driven by both software optimization and hardware-level acceleration.
Each component of this optimized build contributes to higher throughput and lower latency during inference:

- Optimized kernels: The aarch64 vLLM build uses direct oneDNN with the Arm Compute Library for key operations.
- Optimized kernels: the aarch64 vLLM build uses direct oneDNN with the Arm Compute Library for key operations.
- 4‑bit weight quantization: vLLM supports INT4 quantized models, and Arm accelerates this using KleidiAI microkernels, which take advantage of DOT-product (SDOT/UDOT) instructions.
- Efficient MoE execution: For Mixture-of-Experts (MoE) models, vLLM fuses INT4 quantized expert layers to reduce intermediate memory transfers, which minimizes bandwidth bottlenecks
- Optimized Paged attention: The paged attention mechanism, which handles token reuse during long-sequence generation, is SIMD-tuned for Arm’s NEON and SVE (Scalable Vector Extension) pipelines.
- System tuning: Using thread affinity ensures efficient CPU core pinning and balanced thread scheduling across Arm clusters.
- Efficient MoE execution: for Mixture-of-Experts (MoE) models, vLLM fuses INT4 quantized expert layers to reduce intermediate memory transfers, which minimizes bandwidth bottlenecks
- Optimized Paged attention: the paged attention mechanism, which handles token reuse during long-sequence generation, is SIMD-tuned for Arm’s NEON and SVE (Scalable Vector Extension) pipelines.
- System tuning: using thread affinity ensures efficient CPU core pinning and balanced thread scheduling across Arm clusters.
Additionally, enabling tcmalloc (Thread-Caching Malloc) reduces allocator contention and memory fragmentation under high-throughput serving loads.

## Before you begin
## Set up your environment

Verify that your environment meets the following requirements:
Before you begin, make sure your environment meets these requirements:

Python version: Use Python 3.12 on Ubuntu 22.04 LTS or later.
Hardware requirements: At least 32 vCPUs, 64 GB RAM, and 64 GB of free disk space.
- Python 3.12 on Ubuntu 22.04 LTS or newer
- At least 32 vCPUs, 64 GB RAM, and 64 GB of free disk space

This Learning Path was validated on an AWS Graviton4 c8g.12xlarge instance with 64 GB of attached storage.
This Learning Path was tested on an AWS Graviton4 c8g.12xlarge instance with 64 GB of attached storage.

### Install Build Dependencies
## Install build dependencies

Install the following packages required for compiling vLLM and its dependencies on Arm64:
```bash
Expand All @@ -74,7 +81,7 @@ This ensures optimized Arm kernels are used for matrix multiplications, layer no
## Build vLLM for Arm64 CPU
You’ll now build vLLM optimized for Arm (aarch64) servers with oneDNN and the Arm Compute Library (ACL) automatically enabled in the CPU backend.

1. Create and Activate a Python Virtual Environment
## Create and activate a Python virtual environment
It’s best practice to build vLLM inside an isolated environment to prevent conflicts between system and project dependencies:

```bash
Expand All @@ -83,7 +90,7 @@ source vllm_env/bin/activate
python3 -m pip install --upgrade pip
```

2. Clone vLLM and Install Build Requirements
## Clone vLLM and install build requirements
Download the official vLLM source code and install its CPU-specific build dependencies:

```bash
Expand All @@ -94,15 +101,15 @@ pip install -r requirements/cpu.txt -r requirements/cpu-build.txt
```
The specific commit (5fb4137) pins a verified version of vLLM that officially adds Arm CPUs to the list of supported build targets, ensuring full compatibility and optimized performance for Arm-based systems.

3. Build the vLLM Wheel for CPU
## Build the vLLM wheel for CPU
Run the following command to compile and package vLLM as a Python wheel optimized for CPU inference:

```bash
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
```
The output wheel will appear under dist/ and include all compiled C++/PyBind modules.

4. Install the Wheel
## Install the wheel
Install the freshly built wheel into your active environment:

```bash
Expand All @@ -115,7 +122,31 @@ Do not delete the local vLLM source directory.
The repository contains C++ extensions and runtime libraries required for correct CPU inference on aarch64 after wheel installation.
{{% /notice %}}

## Quick validation via Offline Inferencing
## Validate your build with offline inference

Run a quick test to confirm your Arm-optimized vLLM build works as expected. Use the built-in chat example to perform offline inference and verify that oneDNN and Arm Compute Library optimizations are active.

```bash
python examples/offline_inference/basic/chat.py \
--dtype=bfloat16 \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

This command runs a small Hugging Face model in bfloat16 precision, streaming generated tokens to the console. You should see output similar to:

```output
Generated Outputs:
--------------------------------------------------------------------------------
Prompt: None

Generated text: 'The Importance of Higher Education\n\nHigher education is a fundamental right'
--------------------------------------------------------------------------------
Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9552.05it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 6.78it/s, est. speed input: 474.32 toks/s, output: 108.42 toks/s]
...
```

If you see token streaming and generated text, your vLLM build is correctly configured for Arm64 inference.

Once your Arm-optimized vLLM build completes, you can validate it by running a small offline inference example. This ensures that the CPU-specific backend and oneDNN and ACL optimizations were correctly compiled into your build.
Run the built-in chat example included in the vLLM repository:
Expand Down Expand Up @@ -144,7 +175,7 @@ Processed prompts: 100%|██████████████████
```

{{% notice Note %}}
As CPU support in vLLM continues to mature, these manual build steps will eventually be replaced by a streamlined pip install workflow for aarch64, simplifying future deployments on Arm servers.
As CPU support in vLLM continues to mature, these manual build steps will eventually be replaced by a streamlined `pip` install workflow for aarch64, simplifying future deployments on Arm servers.
{{% /notice %}}

You have now verified that your vLLM Arm64 build runs correctly and performs inference using Arm-optimized kernels.
Expand Down
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
---
title: Quantize an LLM to INT4 for Arm Platform
title: Quantize an LLM to INT4
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---
## Accelerating LLMs with 4-bit Quantization
## Accelerate LLMs with 4-bit quantization

You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. In this section, you’ll quantize the deepseek-ai/DeepSeek-V2-Lite model to 4-bit integer (INT4) weights.
The quantized model runs efficiently through vLLM’s INT4 inference path, which is accelerated by Arm KleidiAI microkernels.
Expand Down Expand Up @@ -35,7 +35,7 @@ If the model you plan to quantize is gated on Hugging Face (e.g., DeepSeek or pr
huggingface-cli login
```

## INT4 Quantization Recipe
## Apply the INT4 quantization recipe

Using a file editor of your choice, save the following code into a file named `quantize_vllm_models.py`:

Expand Down Expand Up @@ -134,12 +134,16 @@ This script creates a Arm KleidiAI INT4 quantized copy of the vLLM model and sav

## Quantize DeepSeek‑V2‑Lite model

### Quantization parameter tuning
Quantization parameters determine how the model’s floating-point weights and activations are converted into lower-precision integer formats. Choosing the right combination is essential for balancing model accuracy, memory footprint, and runtime throughput on Arm CPUs.
Quantizing your model to INT4 format significantly reduces memory usage and improves inference speed on Arm CPUs. In this section, you'll apply the quantization script to the DeepSeek‑V2‑Lite model, tuning key parameters for optimal performance and accuracy. This process prepares your model for efficient deployment with vLLM on Arm-based servers.

1. You can choose `minmax` (faster model quantization) or `mse` (more accurate but slower model quantization) method.
2. `channelwise` is a good default for most models.
3. `groupwise` can improve accuracy further; `--groupsize 32` is common.
## Tune quantization parameters
Quantization parameters control how the model’s floating-point weights and activations are converted to lower-precision integer formats. The right settings help you balance accuracy, memory usage, and performance on Arm CPUs.

- Use `minmax` for faster quantization, or `mse` for higher accuracy (but slower)
- Choose `channelwise` for most models; it’s a reliable default
- Try `groupwise` for potentially better accuracy; `--groupsize 32` is a common choice

Pick the combination that fits your accuracy and speed needs.

Execute the following command to quantize the DeepSeek-V2-Lite model:

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,17 +9,17 @@ layout: learningpathall
## Batch Sizing in vLLM

vLLM uses dynamic continuous batching to maximize hardware utilization. Two key parameters govern this process:
* `max_model_len` — The maximum sequence length (number of tokens per request).
* `max_model_len`, which is the maximum sequence length (number of tokens per request).
No single prompt or generated sequence can exceed this limit.
* `max_num_batched_tokens` — The total number of tokens processed in one batch across all requests.
* `max_num_batched_tokens`, which is the total number of tokens processed in one batch across all requests.
The sum of input and output tokens from all concurrent requests must stay within this limit.

Together, these parameters determine how much memory the model can use and how effectively CPU threads are saturated.
On Arm-based servers, tuning them helps achieve stable throughput while avoiding excessive paging or cache thrashing.

## Serve an OpenAI‑compatible API

Start vLLM’s OpenAI-compatible API server using the quantized INT4 model and environment variables optimized for performance.
Start vLLM’s OpenAI-compatible API server using the quantized INT4 model and environment variables optimized for performance:

```bash
export VLLM_TARGET_DEVICE=cpu
Expand Down Expand Up @@ -125,9 +125,9 @@ This validates multi‑request behavior and shows aggregate throughput in the se
(APIServer pid=4474) INFO: 127.0.0.1:44120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=4474) INFO 11-10 01:01:06 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
```
## Optional: Serve a BF16 (Non-Quantized) Model
## Serve a BF16 (non-quantized) model (optional)

For a non-quantized path, vLLM on Arm can run BF16 end-to-end using its oneDNN integration (which routes to Arm-optimized kernels via ACL under aarch64).
For a non-quantized path, vLLM on Arm can run BF16 end-to-end using its oneDNN integration (which routes to Arm-optimized kernels using ACL under aarch64).

```bash
vllm serve deepseek-ai/DeepSeek-V2-Lite \
Expand All @@ -136,17 +136,18 @@ vllm serve deepseek-ai/DeepSeek-V2-Lite \
```
Use this BF16 setup to establish a quality reference baseline, then compare throughput and latency against your INT4 deployment to quantify the performance/accuracy trade-offs on your Arm system.

## Go Beyond: Power Up Your vLLM Workflow
## Go beyond: power up your vLLM workflow
Now that you’ve successfully quantized, served, and benchmarked a model using vLLM on Arm, you can build on what you’ve learned to push performance, scalability, and usability even further.

**Try Different Models**
Extend your workflow to other models on Hugging Face that are compatible with vLLM and can benefit from Arm acceleration:
* Meta Llama 2 / Llama 3 – Strong general-purpose baselines; excellent for comparing BF16 vs INT4 performance.
* Qwen / Qwen-Chat – High-quality multilingual and instruction-tuned models.
* Gemma (Google) – Compact and efficient architecture; ideal for edge or cost-optimized serving.

You can quantize and serve them using the same `quantize_vllm_models.py` recipe, just update the model name.
## Try different models
Explore other Hugging Face models that work well with vLLM and take advantage of Arm acceleration:

**Connect a chat client:** Link your server with OpenAI-compatible UIs like [Open WebUI](https://github.com/open-webui/open-webui)
- Meta Llama 2 and Llama 3: these versatile models work well for general tasks, and you can try them to compare BF16 and INT4 performance
- Qwen and Qwen-Chat: these models support multiple languages and are tuned for instructions, giving you high-quality results
- Gemma (Google): this compact and efficient model is a good choice for edge devices or deployments where cost matters

You can continue exploring how Arm’s efficiency, oneDNN+ACL acceleration, and vLLM’s dynamic batching combine to deliver fast, sustainable, and scalable AI inference on modern Arm architectures.
You can quantize and serve any of these models using the same `quantize_vllm_models.py` script. Just update the model name in the script.

You can also try connecting a chat client by linking your server with OpenAI-compatible user interfaces such as [Open WebUI](https://github.com/open-webui/open-webui).

Continue exploring how Arm efficiency, oneDNN and ACL acceleration, and vLLM dynamic batching work together to provide fast, sustainable, and scalable AI inference on modern Arm architectures.
Loading