diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md index a20648437..c47e03a8f 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/1-overview-and-build.md @@ -1,5 +1,5 @@ --- -title: Overview and Optimized Build +title: Build and validate vLLM for Arm64 inference on Azure Cobalt 100 weight: 2 ### FIXED, DO NOT MODIFY @@ -8,50 +8,57 @@ layout: learningpathall ## What is vLLM? -vLLM is an open-source, high-throughput inference and serving engine for large language models (LLMs). -It’s designed to make LLM inference faster, more memory-efficient, and scalable, particularly during the prefill (context processing) and decode (token generation) phases of inference. +vLLM is an open-source, high-throughput inference and serving engine for large language models (LLMs). It’s designed to make LLM inference faster, more memory-efficient, and scalable, particularly during the prefill (context processing) and decode (token generation) phases of inference. -### Key Features - * Continuous Batching – Dynamically combines incoming inference requests into a single large batch, maximizing CPU/GPU utilization and throughput. - * KV Cache Management – Efficiently stores and reuses key-value attention states, sustaining concurrency across multiple active sessions while minimizing memory overhead. - * Token Streaming – Streams generated tokens as they are produced, enabling real-time responses for chat or API scenarios. 
-### Interaction Modes +## Key features +* Continuous batching: dynamically merges incoming inference requests into larger batches, maximizing Arm CPU utilization and overall throughput +* KV cache management: efficiently stores and reuses key-value attention states, sustaining concurrency across multiple active sessions while minimizing memory overhead +* Token streaming: streams generated tokens as they are produced, enabling real-time responses for chat or API scenarios +## Interaction modes You can use vLLM in two main ways: - * OpenAI-Compatible REST Server: - vLLM provides a /v1/chat/completions endpoint compatible with the OpenAI API schema, making it drop-in ready for tools like LangChain, LlamaIndex, and the official OpenAI Python SDK. - * Python API: - Load and serve models programmatically within your own Python scripts for flexible local inference and evaluation. +- Using an OpenAI-Compatible REST Server: vLLM provides a /v1/chat/completions endpoint compatible with the OpenAI API schema, making it drop-in ready for tools like LangChain, LlamaIndex, and the official OpenAI Python SDK +- Using a Python API: load and serve models programmatically within your own Python scripts for flexible local inference and evaluation vLLM supports Hugging Face Transformer models out-of-the-box and scales seamlessly from single-prompt testing to production batch inference. -## What you build +## What you'll build -In this learning path, you will build a CPU-optimized version of vLLM targeting the Arm64 architecture, integrated with oneDNN and the Arm Compute Library (ACL). +In this Learning Path, you'll build a CPU-optimized version of vLLM targeting the Arm64 architecture, integrated with oneDNN and the Arm Compute Library (ACL). This build enables high-performance LLM inference on Arm servers, leveraging specialized Arm math libraries and kernel optimizations. 
After compiling, you’ll validate your build by running a local chat example to confirm functionality and measure baseline inference speed. ## Why this is fast on Arm +vLLM achieves high performance on Arm servers by combining software and hardware optimizations. Here’s why your build runs fast: + +- Arm-optimized kernels: vLLM uses oneDNN and the Arm Compute Library to accelerate matrix multiplications, normalization, and activation functions. These libraries are tuned for Arm’s aarch64 architecture. +- Efficient quantization: INT4 quantized models run faster on Arm because KleidiAI microkernels use DOT-product instructions (SDOT/UDOT) available on Arm CPUs. +- Paged attention tuning: the paged attention mechanism is optimized for Arm’s NEON and SVE pipelines, improving token reuse and throughput during long-sequence generation. +- MoE fusion: for Mixture-of-Experts models, vLLM fuses INT4 expert layers to reduce memory transfers and bandwidth bottlenecks. +- Thread affinity and memory allocation: setting thread affinity ensures balanced CPU core usage, while tcmalloc reduces memory fragmentation and allocator contention. + +These optimizations work together to deliver higher throughput and lower latency for LLM inference on Arm servers. + vLLM’s performance on Arm servers is driven by both software optimization and hardware-level acceleration. Each component of this optimized build contributes to higher throughput and lower latency during inference: -- Optimized kernels: The aarch64 vLLM build uses direct oneDNN with the Arm Compute Library for key operations. +- Optimized kernels: the aarch64 vLLM build uses direct oneDNN with the Arm Compute Library for key operations. - 4‑bit weight quantization: vLLM supports INT4 quantized models, and Arm accelerates this using KleidiAI microkernels, which take advantage of DOT-product (SDOT/UDOT) instructions. 
-- Efficient MoE execution: For Mixture-of-Experts (MoE) models, vLLM fuses INT4 quantized expert layers to reduce intermediate memory transfers, which minimizes bandwidth bottlenecks -- Optimized Paged attention: The paged attention mechanism, which handles token reuse during long-sequence generation, is SIMD-tuned for Arm’s NEON and SVE (Scalable Vector Extension) pipelines. -- System tuning: Using thread affinity ensures efficient CPU core pinning and balanced thread scheduling across Arm clusters. +- Efficient MoE execution: for Mixture-of-Experts (MoE) models, vLLM fuses INT4 quantized expert layers to reduce intermediate memory transfers, which minimizes bandwidth bottlenecks +- Optimized Paged attention: the paged attention mechanism, which handles token reuse during long-sequence generation, is SIMD-tuned for Arm’s NEON and SVE (Scalable Vector Extension) pipelines. +- System tuning: using thread affinity ensures efficient CPU core pinning and balanced thread scheduling across Arm clusters. Additionally, enabling tcmalloc (Thread-Caching Malloc) reduces allocator contention and memory fragmentation under high-throughput serving loads. -## Before you begin +## Set up your environment -Verify that your environment meets the following requirements: +Before you begin, make sure your environment meets these requirements: -Python version: Use Python 3.12 on Ubuntu 22.04 LTS or later. -Hardware requirements: At least 32 vCPUs, 64 GB RAM, and 64 GB of free disk space. +- Python 3.12 on Ubuntu 22.04 LTS or newer +- At least 32 vCPUs, 64 GB RAM, and 64 GB of free disk space -This Learning Path was validated on an AWS Graviton4 c8g.12xlarge instance with 64 GB of attached storage. +This Learning Path was tested on an AWS Graviton4 c8g.12xlarge instance with 64 GB of attached storage. 
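The requirements above can be checked before starting the build. The following is a minimal sketch, not part of the Learning Path itself: the function names `unmet_requirements` and `detect_system` are hypothetical, the thresholds are copied from the list above, and sizes are compared in decimal gigabytes as a simplifying assumption. The `os.sysconf` names used for memory detection are Linux-specific.

```python
import os
import shutil
import sys

# Thresholds copied from the requirements listed above.
REQUIRED_PYTHON = (3, 12)
MIN_VCPUS = 32
MIN_RAM_GB = 64
MIN_FREE_DISK_GB = 64


def unmet_requirements(python_version, vcpus, ram_gb, free_disk_gb):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # Assumption: the text names Python 3.12 specifically, so match major.minor exactly.
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        problems.append(f"Python 3.12 expected, found {python_version[0]}.{python_version[1]}")
    if vcpus < MIN_VCPUS:
        problems.append(f"at least {MIN_VCPUS} vCPUs expected, found {vcpus}")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"at least {MIN_RAM_GB} GB RAM expected, found {ram_gb:.1f} GB")
    if free_disk_gb < MIN_FREE_DISK_GB:
        problems.append(f"at least {MIN_FREE_DISK_GB} GB free disk expected, found {free_disk_gb:.1f} GB")
    return problems


def detect_system(path="."):
    """Collect the current machine's values (Linux-only sysconf names)."""
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return {
        "python_version": sys.version_info[:2],
        "vcpus": os.cpu_count() or 0,
        "ram_gb": ram_bytes / 1e9,  # decimal GB, an assumption about units
        "free_disk_gb": shutil.disk_usage(path).free / 1e9,
    }
```

On the target machine, `unmet_requirements(**detect_system())` returns an empty list when all four checks pass, or one message per failed check otherwise.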
-### Install Build Dependencies +## Install build dependencies Install the following packages required for compiling vLLM and its dependencies on Arm64: ```bash @@ -74,7 +81,7 @@ This ensures optimized Arm kernels are used for matrix multiplications, layer no ## Build vLLM for Arm64 CPU You’ll now build vLLM optimized for Arm (aarch64) servers with oneDNN and the Arm Compute Library (ACL) automatically enabled in the CPU backend. -1. Create and Activate a Python Virtual Environment +## Create and activate a Python virtual environment It’s best practice to build vLLM inside an isolated environment to prevent conflicts between system and project dependencies: ```bash @@ -83,7 +90,7 @@ source vllm_env/bin/activate python3 -m pip install --upgrade pip ``` -2. Clone vLLM and Install Build Requirements +## Clone vLLM and install build requirements Download the official vLLM source code and install its CPU-specific build dependencies: ```bash @@ -94,7 +101,7 @@ pip install -r requirements/cpu.txt -r requirements/cpu-build.txt ``` The specific commit (5fb4137) pins a verified version of vLLM that officially adds Arm CPUs to the list of supported build targets, ensuring full compatibility and optimized performance for Arm-based systems. -3. Build the vLLM Wheel for CPU +## Build the vLLM wheel for CPU Run the following command to compile and package vLLM as a Python wheel optimized for CPU inference: ```bash @@ -102,7 +109,7 @@ VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel ``` The output wheel will appear under dist/ and include all compiled C++/PyBind modules. -4. Install the Wheel +## Install the wheel Install the freshly built wheel into your active environment: ```bash @@ -115,7 +122,31 @@ Do not delete the local vLLM source directory. The repository contains C++ extensions and runtime libraries required for correct CPU inference on aarch64 after wheel installation. 
{{% /notice %}} -## Quick validation via Offline Inferencing +## Validate your build with offline inference + +Run a quick test to confirm your Arm-optimized vLLM build works as expected. Use the built-in chat example to perform offline inference and verify that oneDNN and Arm Compute Library optimizations are active. + +```bash +python examples/offline_inference/basic/chat.py \ + --dtype=bfloat16 \ + --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 +``` + +This command runs a small Hugging Face model in bfloat16 precision, streaming generated tokens to the console. You should see output similar to: + +```output +Generated Outputs: +-------------------------------------------------------------------------------- +Prompt: None + +Generated text: 'The Importance of Higher Education\n\nHigher education is a fundamental right' +-------------------------------------------------------------------------------- +Adding requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9552.05it/s] +Processed prompts: 100%|████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 6.78it/s, est. speed input: 474.32 toks/s, output: 108.42 toks/s] +... +``` + +If you see token streaming and generated text, your vLLM build is correctly configured for Arm64 inference. Once your Arm-optimized vLLM build completes, you can validate it by running a small offline inference example. This ensures that the CPU-specific backend and oneDNN and ACL optimizations were correctly compiled into your build. Run the built-in chat example included in the vLLM repository: @@ -144,7 +175,7 @@ Processed prompts: 100%|██████████████████ ``` {{% notice Note %}} -As CPU support in vLLM continues to mature, these manual build steps will eventually be replaced by a streamlined pip install workflow for aarch64, simplifying future deployments on Arm servers. 
+As CPU support in vLLM continues to mature, these manual build steps will eventually be replaced by a streamlined `pip` install workflow for aarch64, simplifying future deployments on Arm servers. {{% /notice %}} You have now verified that your vLLM Arm64 build runs correctly and performs inference using Arm-optimized kernels. diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md index 102ea00e0..72860c8d4 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/2-quantize-model.md @@ -1,11 +1,11 @@ --- -title: Quantize an LLM to INT4 for Arm Platform +title: Quantize an LLM to INT4 weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Accelerating LLMs with 4-bit Quantization +## Accelerate LLMs with 4-bit quantization You can accelerate many LLMs on Arm CPUs with 4‑bit quantization. In this section, you’ll quantize the deepseek-ai/DeepSeek-V2-Lite model to 4-bit integer (INT4) weights. The quantized model runs efficiently through vLLM’s INT4 inference path, which is accelerated by Arm KleidiAI microkernels. @@ -35,7 +35,7 @@ If the model you plan to quantize is gated on Hugging Face (e.g., DeepSeek or pr huggingface-cli login ``` -## INT4 Quantization Recipe +## Apply the INT4 quantization recipe Using a file editor of your choice, save the following code into a file named `quantize_vllm_models.py`: @@ -134,12 +134,16 @@ This script creates a Arm KleidiAI INT4 quantized copy of the vLLM model and sav ## Quantize DeepSeek‑V2‑Lite model -### Quantization parameter tuning -Quantization parameters determine how the model’s floating-point weights and activations are converted into lower-precision integer formats. 
Choosing the right combination is essential for balancing model accuracy, memory footprint, and runtime throughput on Arm CPUs. +Quantizing your model to INT4 format significantly reduces memory usage and improves inference speed on Arm CPUs. In this section, you'll apply the quantization script to the DeepSeek‑V2‑Lite model, tuning key parameters for optimal performance and accuracy. This process prepares your model for efficient deployment with vLLM on Arm-based servers. -1. You can choose `minmax` (faster model quantization) or `mse` (more accurate but slower model quantization) method. -2. `channelwise` is a good default for most models. -3. `groupwise` can improve accuracy further; `--groupsize 32` is common. +## Tune quantization parameters +Quantization parameters control how the model’s floating-point weights and activations are converted to lower-precision integer formats. The right settings help you balance accuracy, memory usage, and performance on Arm CPUs. + +- Use `minmax` for faster quantization, or `mse` for higher accuracy (but slower) +- Choose `channelwise` for most models; it’s a reliable default +- Try `groupwise` for potentially better accuracy; `--groupsize 32` is a common choice + +Pick the combination that fits your accuracy and speed needs. Execute the following command to quantize the DeepSeek-V2-Lite model: diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md index 0e208af88..52a23ed1a 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/3-run-inference-and-serve.md @@ -9,9 +9,9 @@ layout: learningpathall ## Batch Sizing in vLLM vLLM uses dynamic continuous batching to maximize hardware utilization. 
Two key parameters govern this process: - * `max_model_len` — The maximum sequence length (number of tokens per request). + * `max_model_len`, which is the maximum sequence length (number of tokens per request). No single prompt or generated sequence can exceed this limit. - * `max_num_batched_tokens` — The total number of tokens processed in one batch across all requests. + * `max_num_batched_tokens`, which is the total number of tokens processed in one batch across all requests. The sum of input and output tokens from all concurrent requests must stay within this limit. Together, these parameters determine how much memory the model can use and how effectively CPU threads are saturated. @@ -19,7 +19,7 @@ On Arm-based servers, tuning them helps achieve stable throughput while avoiding ## Serve an OpenAI‑compatible API -Start vLLM’s OpenAI-compatible API server using the quantized INT4 model and environment variables optimized for performance. +Start vLLM’s OpenAI-compatible API server using the quantized INT4 model and environment variables optimized for performance: ```bash export VLLM_TARGET_DEVICE=cpu @@ -125,9 +125,9 @@ This validates multi‑request behavior and shows aggregate throughput in the se (APIServer pid=4474) INFO: 127.0.0.1:44120 - "POST /v1/chat/completions HTTP/1.1" 200 OK (APIServer pid=4474) INFO 11-10 01:01:06 [loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0% ``` -## Optional: Serve a BF16 (Non-Quantized) Model +## Serve a BF16 (non-quantized) model (optional) -For a non-quantized path, vLLM on Arm can run BF16 end-to-end using its oneDNN integration (which routes to Arm-optimized kernels via ACL under aarch64). +For a non-quantized path, vLLM on Arm can run BF16 end-to-end using its oneDNN integration (which routes to Arm-optimized kernels using ACL under aarch64). 
```bash vllm serve deepseek-ai/DeepSeek-V2-Lite \ @@ -136,17 +136,18 @@ vllm serve deepseek-ai/DeepSeek-V2-Lite \ ``` Use this BF16 setup to establish a quality reference baseline, then compare throughput and latency against your INT4 deployment to quantify the performance/accuracy trade-offs on your Arm system. -## Go Beyond: Power Up Your vLLM Workflow +## Go beyond: power up your vLLM workflow Now that you’ve successfully quantized, served, and benchmarked a model using vLLM on Arm, you can build on what you’ve learned to push performance, scalability, and usability even further. -**Try Different Models** -Extend your workflow to other models on Hugging Face that are compatible with vLLM and can benefit from Arm acceleration: - * Meta Llama 2 / Llama 3 – Strong general-purpose baselines; excellent for comparing BF16 vs INT4 performance. - * Qwen / Qwen-Chat – High-quality multilingual and instruction-tuned models. - * Gemma (Google) – Compact and efficient architecture; ideal for edge or cost-optimized serving. - -You can quantize and serve them using the same `quantize_vllm_models.py` recipe, just update the model name. +## Try different models +Explore other Hugging Face models that work well with vLLM and take advantage of Arm acceleration: -**Connect a chat client:** Link your server with OpenAI-compatible UIs like [Open WebUI](https://github.com/open-webui/open-webui) +- Meta Llama 2 and Llama 3: these versatile models work well for general tasks, and you can try them to compare BF16 and INT4 performance +- Qwen and Qwen-Chat: these models support multiple languages and are tuned for instructions, giving you high-quality results +- Gemma (Google): this compact and efficient model is a good choice for edge devices or deployments where cost matters -You can continue exploring how Arm’s efficiency, oneDNN+ACL acceleration, and vLLM’s dynamic batching combine to deliver fast, sustainable, and scalable AI inference on modern Arm architectures. 
+You can quantize and serve any of these models using the same `quantize_vllm_models.py` script. Just update the model name in the script. + +You can also try connecting a chat client by linking your server with OpenAI-compatible user interfaces such as [Open WebUI](https://github.com/open-webui/open-webui). + +Continue exploring how Arm efficiency, oneDNN and ACL acceleration, and vLLM dynamic batching work together to provide fast, sustainable, and scalable AI inference on modern Arm architectures. diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/4-accuracy-benchmarking.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/4-accuracy-benchmarking.md index 08ab7cd66..db43cdce5 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/4-accuracy-benchmarking.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/4-accuracy-benchmarking.md @@ -1,5 +1,5 @@ --- -title: Evaluate Accuracy with LM Evaluation Harness +title: Evaluate accuracy with LM Evaluation Harness weight: 5 ### FIXED, DO NOT MODIFY @@ -8,33 +8,27 @@ layout: learningpathall ## Why accuracy benchmarking -The LM Evaluation Harness (lm-eval-harness) is a widely used open-source framework for evaluating the accuracy of large language models on standardized academic benchmarks such as MMLU, HellaSwag, and GSM8K. -It provides a consistent interface for evaluating models served through various runtimes—such as Hugging Face Transformers, vLLM, or llama.cpp using the same datasets, few-shot templates, and scoring metrics. -In this module, you will measure how quantization impacts model quality by comparing BF16 (non-quantized) and INT4 (quantized) versions of your model running on Arm-based servers. +The lm-evaluation-harness is the standard way to measure model accuracy across common academic benchmarks (for example, MMLU, HellaSwag, GSM8K) and runtimes (such as Hugging Face, vLLM, and llama.cpp). 
In this Learning Path, you'll run accuracy tests for both BF16 and INT4 deployments of your model served by vLLM on Arm-based servers. You will: - * Install lm-eval-harness with vLLM backend support. - * Run benchmark tasks on both BF16 and INT4 model deployments. - * Analyze and interpret accuracy differences between the two precisions. + * Install lm-eval-harness with vLLM support + * Run benchmarks on a BF16 model and an INT4 (weight-quantized) model + * Interpret key metrics and compare quality across precisions {{% notice Note %}} -Accuracy results can vary depending on CPU, dataset versions, and model choice. Use the same tasks, few-shot settings and evaluation batch size when comparing BF16 and INT4 results to ensure a fair comparison. +Results vary based on your CPU, dataset version, and model selection. For a fair comparison between BF16 and INT4, always use the same tasks and few-shot settings. {{% /notice %}} -## Prerequisites -Before you begin, make sure your environment is ready for evaluation. -You should have: - * Completed the optimized build from the “Overview and Optimized Build” section and successfully validated your vLLM installation. - * (Optional) Quantized a model using the “Quantize an LLM to INT4 for Arm Platform” module. - The quantized model directory (for example, DeepSeek-V2-Lite-w4a8dyn-mse-channelwise) will be used as input for INT4 evaluation. -If you haven’t quantized a model, you can still evaluate your BF16 baseline to establish a reference accuracy. +## Prerequisites -## Install LM Evaluation Harness +Before you start: + * Complete the optimized build in “Build and validate vLLM for Arm64 inference on Azure Cobalt 100” and validate your vLLM installation. + * Optionally quantize a model using the “Quantize an LLM to INT4” section. The INT4 evaluation uses the output directory name from that step. -You will install the LM Evaluation Harness with vLLM backend support, allowing direct evaluation against your running vLLM server. 
+## Install lm-eval-harness -Install it inside your active Python environment: +Install the harness with vLLM extras in your active Python environment: ```bash pip install "lm_eval[vllm]" @@ -42,14 +36,12 @@ pip install ray ``` {{% notice Tip %}} -If your benchmarks include gated models or restricted datasets, run `huggingface-cli login` -This ensures the harness can authenticate with Hugging Face and download any protected resources needed for evaluation. +If your benchmarks include gated models or datasets, run `huggingface-cli login` first so the harness can download what it needs. {{% /notice %}} -## Recommended Runtime Settings for Arm CPU +## Recommended runtime settings for Arm CPU -Before running accuracy benchmarks, export the same performance tuned environment variables you used for serving. -These settings ensure vLLM runs with Arm-optimized kernels (via oneDNN + Arm Compute Library) and consistent thread affinity across all CPU cores during evaluation. +Export the same performance-oriented environment variables used for serving. These enable Arm-optimized kernels through oneDNN+ACL and consistent thread pinning: ```bash export VLLM_TARGET_DEVICE=cpu @@ -61,28 +53,13 @@ export OMP_NUM_THREADS="$(nproc)" export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4 ``` -Explanation of settings - -| Variable | Purpose | -| --------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- | -| **`VLLM_TARGET_DEVICE=cpu`** | Forces vLLM to run entirely on CPU, ensuring evaluation results use Arm-optimized oneDNN kernels. | -| **`VLLM_CPU_KVCACHE_SPACE=32`** | Reserves 32 GB for key/value caches used in attention. Adjust if evaluating with longer contexts or larger batches. 
| -| **`VLLM_CPU_OMP_THREADS_BIND="0-$(($(nproc)-1))"`** | Pins OpenMP worker threads to physical cores (0–N-1) to minimize OS thread migration and improve cache locality. | -| **`VLLM_MLA_DISABLE=1`** | Disables GPU/MLA probing for faster initialization in CPU-only mode. | -| **`ONEDNN_DEFAULT_FPMATH_MODE=BF16`** | Enables **bfloat16** math mode, using reduced precision operations for faster compute while maintaining numerical stability. | -| **`OMP_NUM_THREADS="$(nproc)"`** | Uses all available CPU cores to parallelize matrix multiplications and attention layers. | -| **`LD_PRELOAD`** | Preloads **tcmalloc** (Thread-Caching Malloc) to reduce memory allocator contention under high concurrency. | - {{% notice Note %}} -tcmalloc helps reduce allocator overhead when running multiple evaluation tasks in parallel. -If it’s not installed, add it with `sudo apt-get install -y libtcmalloc-minimal4` +`LD_PRELOAD` uses tcmalloc to reduce allocator contention. Install it via `sudo apt-get install -y libtcmalloc-minimal4` if you haven’t already. {{% /notice %}} -## Accuracy Benchmarking Meta‑Llama‑3.1‑8B‑Instruct (BF16 Model) +## Benchmark Meta‑Llama‑3.1‑8B‑Instruct BF16 model accuracy -To establish a baseline accuracy reference, evaluate a non-quantized BF16 model served through vLLM. -This run measures how the original model performs under Arm-optimized BF16 inference before applying INT4 quantization. -Replace the model ID if you are using a different model variant or checkpoint. +Run the benchmark on a non-quantized BF16 model to establish your baseline. Replace the model ID as needed. ```bash lm_eval \ @@ -93,16 +70,26 @@ lm_eval \ --batch_size auto \ --output_path results ``` -After completing this test, review the results directory for accuracy metrics (e.g., acc_norm, acc) and record them as your BF16 baseline. -Next, you’ll run the same benchmarks on the INT4 quantized model to compare accuracy across precisions. 
+## Benchmark INT4 quantized model accuracy -## Accuracy Benchmarking: INT4 quantized model +Run accuracy tests on your INT4 quantized model using the same tasks and settings as the BF16 baseline. Replace the model path with your quantized output directory. -Now that you’ve quantized your model using the INT4 recipe and script from the previous module, you can benchmark its accuracy using the same evaluation harness and task set. -This test compares quantized (INT4) performance against your BF16 baseline, revealing how much accuracy is preserved after compression. -Use the quantized directory generated earlier, for example: -Meta-Llama-3.1-8B-Instruct-w4a8dyn-mse-channelwise. +```bash +lm_eval \ + --model vllm \ + --model_args \ + pretrained=Meta-Llama-3.1-8B-Instruct-w4a8dyn-mse-channelwise,dtype=float32,max_model_len=4096,enforce_eager=True \ + --tasks mmlu,hellaswag \ + --batch_size auto \ + --output_path results +``` + +The expected output includes per-task accuracy metrics. Compare these results to your BF16 baseline to evaluate the impact of INT4 quantization on model quality. + +Use the INT4 quantization recipe and script from the previous steps to quantize the `meta-llama/Meta-Llama-3.1-8B-Instruct` model. + +Channelwise INT4 (MSE): ```bash lm_eval \ @@ -113,26 +100,18 @@ lm_eval \ --batch_size auto \ --output_path results ``` -After this evaluation, compare the results metrics from both runs: -## Interpreting results +## Interpret the results -After running evaluations, the LM Evaluation Harness prints per-task and aggregate metrics such as acc, acc_norm, and exact_match. -These represent model accuracy across various datasets and question formats—higher values indicate better performance. -Key metrics include: - * acc – Standard accuracy (fraction of correct predictions). - * acc_norm – Normalized accuracy; adjusts for multiple-choice imbalance. - * exact_match – Strict string-level match, typically used for reasoning or QA tasks. 
+The harness prints per-task and aggregate scores (for example, `acc`, `acc_norm`, `exact_match`). Higher is generally better. Compare BF16 vs INT4 on the same tasks to assess quality impact. -Compare BF16 and INT4 results on identical tasks to assess the accuracy–efficiency trade-off introduced by quantization. Practical tips: - * Always use identical tasks, few-shot settings, and seeds across runs to ensure fair comparisons. - * Add --limit 200 for quick validation runs during tuning. This limits each task to 200 samples for faster iteration. + * Use the same tasks and few-shot settings across runs. + * For quick iteration, you can add `--limit 200` to run on a subset. -## Example results for Meta‑Llama‑3.1‑8B‑Instruct model +## Explore example results for Meta‑Llama‑3.1‑8B‑Instruct model -The following results are illustrative and serve as reference points. -Your actual scores may differ based on hardware, dataset version, or lm-eval-harness release. +These illustrative results are representative; actual scores may vary across hardware, dataset versions, and harness releases. Higher values indicate better accuracy. | Variant | MMLU (acc±err) | HellaSwag (acc±err) | |---------------------------------|-------------------|---------------------| @@ -140,21 +119,14 @@ Your actual scores may differ based on hardware, dataset version, or lm-eval-har | INT4 Groupwise minmax (G=32) | 0.5831 ± 0.0049 | 0.7819 ± 0.0041 | | INT4 Channelwise MSE | 0.5712 ± 0.0049 | 0.7633 ± 0.0042 | -How to interpret: - - * BF16 baseline – Represents near-FP32 accuracy; serves as your quality reference. - * INT4 Groupwise minmax – Retains almost all performance while reducing model size ~4× and improving throughput substantially. - * INT4 Channelwise MSE – Slightly lower accuracy, often within 2–3 percentage points of BF16, still competitive for most production use cases. +Use these as ballpark expectations to check whether your runs are in a reasonable range, not as official targets. 
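To make the comparison between table rows concrete, here is a small sketch that computes the accuracy gap between the two INT4 variants shown above and sets it against their reported errors. The `compare` helper and the `RESULTS` dictionary are hypothetical names for illustration. Combining the two errors in quadrature assumes the runs are independent, which is a simplification because both variants are scored on the same benchmark items.

```python
import math

# (accuracy, standard error) pairs copied from the INT4 rows of the table above.
RESULTS = {
    "INT4 Groupwise minmax (G=32)": {"mmlu": (0.5831, 0.0049), "hellaswag": (0.7819, 0.0041)},
    "INT4 Channelwise MSE": {"mmlu": (0.5712, 0.0049), "hellaswag": (0.7633, 0.0042)},
}


def compare(a, b):
    """Return (a - b accuracy difference, combined standard error)."""
    diff = a[0] - b[0]
    # Simplifying assumption: independent runs, so errors add in quadrature.
    combined_err = math.sqrt(a[1] ** 2 + b[1] ** 2)
    return diff, combined_err


groupwise = RESULTS["INT4 Groupwise minmax (G=32)"]
channelwise = RESULTS["INT4 Channelwise MSE"]

for task in ("mmlu", "hellaswag"):
    diff, err = compare(groupwise[task], channelwise[task])
    print(f"{task}: groupwise - channelwise = {diff:+.4f} (combined err {err:.4f})")
```

A gap that is small relative to the combined error is hard to distinguish from noise, so treat differences of that size with caution when choosing between recipes.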
## Next steps - * Broaden accuracy testing to cover reasoning, math, and commonsense tasks that reflect your real-world use cases: -GSM8K – Arithmetic and logical reasoning (sensitive to quantization). -Winogrande – Commonsense and pronoun disambiguation. -ARC-Easy / ARC-Challenge – Science and multi-step reasoning questions. -Running multiple benchmarks gives a more comprehensive picture of model robustness under different workloads. +Now that you've completed accuracy benchmarking for both BF16 and INT4 models on Arm-based servers, you're ready to deepen your evaluation and optimize for your specific use case. Expanding your benchmarks to additional tasks helps you understand model performance across a wider range of scenarios. Experimenting with different quantization recipes lets you balance accuracy and throughput for your workload. - * Experiment with different quantization configurations to find the best accuracy–throughput trade-off for your hardware. - * Record both throughput and accuracy to choose the best configuration for your workload. +- Try additional tasks to match your use case: `gsm8k`, `winogrande`, `arc_easy`, `arc_challenge`. +- Sweep quantization recipes (minmax vs mse; channelwise vs groupwise, group size) to find a better accuracy/performance balance. +- Record both throughput and accuracy to choose the best configuration for your workload. -By iterating on these steps, you will build a custom performance and accuracy profile for your Arm deployment, helping you select the optimal quantization strategy and runtime configuration for your target workload. +You've learned how to set up lm-evaluation-harness, run benchmarks for BF16 and INT4 models, and interpret key accuracy metrics on Arm platforms. Great job reaching this milestone—your results will help you make informed decisions about model deployment and optimization! 
diff --git a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md index 1150eef83..701804d4f 100644 --- a/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/vllm-acceleration/_index.md @@ -1,28 +1,24 @@ --- -title: Optimized LLM Inference with vLLM on Arm-Based Servers - -draft: true -cascade: - draft: true +title: Accelerate vLLM inference on Azure Cobalt 100 virtual machines minutes_to_complete: 60 -who_is_this_for: This learning path is designed for software developers and AI engineers who want to build and optimize vLLM for Arm-based servers, quantize large language models (LLMs) to INT4, serve them efficiently through an OpenAI-compatible API, and benchmark model accuracy using the LM Evaluation Harness. +who_is_this_for: This is an introductory topic for developers interested in building and optimizing vLLM for Arm-based servers. This Learning Path shows you how to quantize large language models (LLMs) to INT4, serve them efficiently using an OpenAI-compatible API, and benchmark model accuracy with the LM Evaluation Harness. learning_objectives: - - Build an optimized vLLM for aarch64 with oneDNN and the Arm Compute Library(ACL). - - Set up all runtime dependencies including PyTorch, llmcompressor, and Arm-optimized libraries. - - Quantize an LLM (DeepSeek‑V2‑Lite) to 4-bit integer (INT4) precision. - - Run and serve both quantized and BF16 (non-quantized) variants using vLLM. - - Use OpenAI‑compatible endpoints and understand sequence and batch limits. - - Evaluate accuracy using the LM Evaluation Harness on BF16 and INT4 models with vLLM. 
+ - Build an optimized vLLM for aarch64 with oneDNN and the Arm Compute Library (ACL) + - Set up all runtime dependencies including PyTorch, llmcompressor, and Arm-optimized libraries + - Quantize an LLM (DeepSeek‑V2‑Lite) to 4-bit integer (INT4) precision + - Run and serve both quantized and BF16 (non-quantized) variants using vLLM + - Use OpenAI‑compatible endpoints and understand sequence and batch limits + - Evaluate accuracy using the LM Evaluation Harness on BF16 and INT4 models with vLLM prerequisites: - - An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 64 GB RAM, and 64 GB free disk space. - - Python 3.12 and basic familiarity with Hugging Face Transformers and quantization. + - An Arm-based Linux server (Ubuntu 22.04+ recommended) with a minimum of 32 vCPUs, 64 GB RAM, and 64 GB free disk space + - Python 3.12 and basic familiarity with Hugging Face Transformers and quantization author: - - Nikhil Gupta, Pareena Verma + - Nikhil Gupta ### Tags skilllevels: Introductory @@ -47,7 +43,7 @@ further_reading: - resource: title: vLLM GitHub Repository link: https://github.com/vllm-project/vllm - type: github + type: website - resource: title: Hugging Face Model Hub link: https://huggingface.co/models @@ -59,7 +55,7 @@ further_reading: - resource: title: LM Evaluation Harness (GitHub) link: https://github.com/EleutherAI/lm-evaluation-harness - type: github + type: website