From fc626322589bd98e935844f507868369aeccd97f Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Mon, 3 Nov 2025 20:46:20 +0000 Subject: [PATCH 01/12] Refine introductory content and resources for DGX Spark learning path --- .../dgx_spark_llamacpp/_index.md | 23 +++++++++---------- 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md index ed548d391..5697c2af8 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md @@ -1,22 +1,21 @@ --- title: Deploy quantized LLMs on DGX Spark using llama.cpp -draft: true -cascade: - draft: true - minutes_to_complete: 60 -who_is_this_for: This Learning Path is for AI practitioners, performance engineers, and system architects who want to understand how the Grace–Blackwell (GB10) platform enables efficient quantized LLM inference through CPU–GPU collaboration. +who_is_this_for: This is an introductory topic for AI practitioners, performance engineers, and system architects who want to learn how to deploy and optimize quantized large language models (LLMs) on NVIDIA DGX Spark systems powered by the Grace-Blackwell (GB10) architecture. learning_objectives: - - Understand the Grace–Blackwell (GB10) architecture and how it supports efficient AI inference - - Build and validate both CUDA-enabled and CPU-only versions of llama.cpp for flexible deployment + - Describe the Grace–Blackwell (GB10) architecture and its support for efficient AI inference + - Build CUDA-enabled and CPU-only versions of llama.cpp for flexible deployment + - Validate the functionality of both builds on the DGX Spark platform - Analyze how Armv9 SIMD instructions accelerate quantized LLM inference on the Grace CPU prerequisites: - - NVIDIA DGX Spark system with at least 15 GB of available disk space + - Access to an NVIDIA DGX Spark system with at least 15 GB of available disk space - Basic understanding of machine learning concepts + - Familiarity with command-line interfaces and basic Linux operations + author: Odin Shen @@ -36,19 +35,19 @@ tools_software_languages: further_reading: - resource: - title: NVIDIA DGX Spark + title: NVIDIA DGX Spark website link: https://www.nvidia.com/en-gb/products/workstations/dgx-spark/ type: website - resource: - title: NVIDIA DGX Spark Playbooks + title: NVIDIA DGX Spark Playbooks GitHub repository link: https://github.com/NVIDIA/dgx-spark-playbooks type: documentation - resource: - title: Explore llama.cpp architecture and the inference workflow + title: Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels Learning Path link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/ type: blog - resource: - title: The Dawn of New Desktop Devices Arm-Powered NVIDIA DGX Spark Workstations to Redefine AI Computing + title: Arm Newsroom Blog: The Dawn of New Desktop Devices Arm-Powered NVIDIA DGX Spark Workstations to Redefine AI Computing link: https://newsroom.arm.com/blog/arm-powered-nvidia-dgx-spark-ai-workstations type: website From 8f357fe8347c487ffb960709c9821c8dcccc8c3d Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Mon, 3 Nov 2025 20:47:50 +0000 Subject: [PATCH 02/12] Update resource title format in DGX Spark 
learning path --- .../laptops-and-desktops/dgx_spark_llamacpp/_index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md index 5697c2af8..ee20a2219 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md @@ -47,7 +47,7 @@ further_reading: link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/ type: blog - resource: - title: Arm Newsroom Blog: The Dawn of New Desktop Devices Arm-Powered NVIDIA DGX Spark Workstations to Redefine AI Computing + title: The Dawn of New Desktop Devices Arm-Powered NVIDIA DGX Spark Workstations to Redefine AI Computing (Arm Newsroom Blog) link: https://newsroom.arm.com/blog/arm-powered-nvidia-dgx-spark-ai-workstations type: website From 65c96e29b7ccc1cb374a37c646ac0c2f1d06d654 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Tue, 4 Nov 2025 06:52:05 +0000 Subject: [PATCH 03/12] Refactor introduction sections and enhance clarity in DGX Spark learning path documentation --- .../dgx_spark_llamacpp/1_gb10_introduction.md | 26 ++++++++++++++----- .../dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md | 26 ++++++++++++++++--- .../dgx_spark_llamacpp/_index.md | 4 +-- 3 files changed, 44 insertions(+), 12 deletions(-) diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md index 63dbeaa6a..4d72658f0 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md @@ -26,7 +26,7 @@ DGX Spark is a compact yet powerful development platform for modern AI workloads DGX Spark represents a major step toward NVIDIA’s vision of AI Everywhere, empowering developers to prototype, fine-tune, and deploy large-scale AI models locally while seamlessly connecting to cloud or data-center environments when needed. -### Why Grace Blackwell for quantized LLMs? +## Why should I use Grace Blackwell for quantized LLMs? Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip. @@ -47,11 +47,11 @@ In a typical quantized LLM workflow: Together, these features make the GB10 a developer-grade AI laboratory for running, profiling, and scaling quantized LLMs efficiently in a desktop form factor. -### Inspecting your GB10 environment +## Verify your GB10 development environment Let's verify that your DGX Spark system is configured and ready for building and running quantized LLMs. -#### Step 1: Check CPU information +## Step 1: Check the CPU information Run the following command to print the CPU information: @@ -119,7 +119,7 @@ Vulnerabilities: Tsx async abort: Not affected ``` -The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations. +Great! You've checked your CPU configuration. Your system is using Armv9 cores, which are ideal for quantized LLM workloads. 
The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations. The following table summarizes the key specifications of the Grace CPU and explains their relevance to quantized LLM inference. @@ -154,7 +154,9 @@ Codename: noble As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendly Linux distribution. It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities—making it an ideal environment for building and deploying quantized LLM workloads. -#### Step 2: Verify Blackwell GPU and driver +Nice work! You've confirmed your operating system is Ubuntu 24.04 LTS, which is well-supported for AI development on Arm. + +## Step 2: Verify Blackwell GPU and driver After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads. @@ -204,8 +206,10 @@ The table below provides more explanation of the `nvidia-smi` output: | Memory Usage | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. | | Persistence Mode | On | Ensures the GPU remains initialized and ready for rapid inference startup. | +Excellent! Your Blackwell GPU is recognized and ready for CUDA workloads. This means your system is set up for GPU-accelerated LLM inference. + -#### Step 3: Check CUDA Toolkit +## Step 3: Check the CUDA Toolkit To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed. @@ -215,6 +219,7 @@ This ensures that CMake can correctly detect and compile the GPU-accelerated com ```bash nvcc --version ``` +You're almost ready! Verifying the CUDA toolkit ensures you can build GPU-enabled versions of llama.cpp for maximum performance. You will see output similar to: @@ -239,3 +244,12 @@ At this point, you have verified that: - The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp. Your DGX Spark environment is now fully prepared for the next section, where you will build and configure both CPU and GPU versions of llama.cpp, laying the foundation for running quantized LLMs efficiently on the Grace Blackwell platform. + +## What have I achieved? + +You have: +- Verified your Arm-based Grace CPU and its capabilities +- Confirmed your Blackwell GPU and CUDA driver are ready +- Checked your operating system and CUDA toolkit + +You're now ready to move on to building and running quantized LLMs on your DGX Spark! diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md index 4114bf838..6fa643885 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md @@ -37,6 +37,8 @@ hf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF --local-dir TinyLlama-1.1B After the download completes, the models will be available in the `~/models` directory. +Great! You’ve installed all the required build tools and downloaded a quantized model for validation. Your environment is ready for source code setup. + ### Step 2: Clone the llama.cpp repository Use the commands below to download the source code for llama.cpp from GitHub. 
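If you are following along on your own system, a minimal sketch of the clone step looks like this, assuming you keep the repository directly under your home directory:

```bash
# Fetch the llama.cpp sources and move into the working tree
cd ~
git clone https://github.com/ggerganov/llama.cpp.git
cd ~/llama.cpp
```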
@@ -47,6 +49,8 @@ git clone https://github.com/ggerganov/llama.cpp.git cd ~/llama.cpp ``` +Nice work! You now have the latest llama.cpp source code on your DGX Spark system. + ### Step 3: Configure and Build the CUDA-Enabled Version (GPU Mode) Run the following `cmake` command to configure the build system for GPU acceleration. @@ -110,6 +114,8 @@ After the build completes, the GPU-accelerated binaries will be located under `~ These binaries provide all necessary tools for quantized model inference (llama-cli) and for serving GPU inference via HTTP API (llama-server). You are now ready to test quantized LLMs with full GPU acceleration in the next step. +Excellent! The CUDA-enabled build is complete. Your binaries are optimized for the Blackwell GPU and ready for validation. + Together, these options ensure that the build targets the Grace Blackwell GPU with full CUDA 13 compatibility. ### Step 4: Validate the CUDA-enabled build @@ -176,9 +182,21 @@ The following screenshot shows GPU utilization during TinyLlama inference on DGX ![image1 nvtop screenshot](nvtop.png "TinyLlama GPU Utilization") The nvtop interface shows: -- GPU Utilization (%) : confirm CUDA kernels are active -- Memory Usage (VRAM) : observe model loading and runtime footprint -- Temperature / Power Draw : monitor thermal stability under sustained workloads + + - GPU Utilization (%): Confirms CUDA kernels are active. + - Memory Usage (VRAM): Shows model loading and runtime footprint. + - Temperature / Power Draw: Monitors thermal stability under sustained workloads. You have now successfully built and validated the CUDA-enabled version of llama.cpp on DGX Spark. -In the next section, you will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference. + +Success! You’ve confirmed that the GPU-accelerated version of llama.cpp is correctly built and can run quantized LLM inference on your DGX Spark. + +## What have I achieved? + +You have: +- Installed all required tools and dependencies +- Downloaded a quantized model for testing +- Built the CUDA-enabled version of llama.cpp +- Verified GPU linkage and successful inference + +You’re ready to move on to building and testing the CPU-only version! You will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference. 
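Beyond the interactive checks above, the llama-server binary can serve the same quantized model over HTTP. The sketch below is illustrative only: the GGUF file name under `~/models/TinyLlama-1.1B` is assumed, and `-ngl 99` simply asks llama.cpp to offload as many layers as possible to the Blackwell GPU.

```bash
# Start an HTTP inference endpoint backed by the CUDA-enabled build
~/llama.cpp/build-gpu/bin/llama-server \
  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -ngl 99 \
  --host 0.0.0.0 \
  --port 8080
```

```bash
# From a second terminal, send a request to the OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Summarize the GB10 Grace Blackwell Superchip in one sentence."}]}'
```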
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md index ee20a2219..698056539 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md @@ -47,9 +47,9 @@ further_reading: link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/ type: blog - resource: - title: The Dawn of New Desktop Devices Arm-Powered NVIDIA DGX Spark Workstations to Redefine AI Computing (Arm Newsroom Blog) + title: Arm-Powered NVIDIA DGX Spark Workstations to Redefine AI link: https://newsroom.arm.com/blog/arm-powered-nvidia-dgx-spark-ai-workstations - type: website + type: blog ### FIXED, DO NOT MODIFY # ================================================================================ From cfbbac2c7e39bbe79b0e0584022897ebb1898dd6 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Tue, 4 Nov 2025 10:50:58 +0000 Subject: [PATCH 04/12] Refactor DGX Spark learning path documentation: update titles, enhance clarity, and break some content out into new file --- .../dgx_spark_llamacpp/1_gb10_introduction.md | 251 ++---------------- .../dgx_spark_llamacpp/1a_gb10_setup.md | 214 +++++++++++++++ .../dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md | 2 +- .../dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md | 2 +- .../dgx_spark_llamacpp/4_gb10_processwatch.md | 2 +- 5 files changed, 237 insertions(+), 234 deletions(-) create mode 100644 content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md index 4d72658f0..5e07a2307 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md @@ -1,255 +1,44 @@ --- -title: Verify Grace Blackwell system readiness for AI inference +title: "Introduction to Grace Blackwell: Unlocking efficient quantized LLMs on Arm-based NVIDIA DGX Spark" weight: 2 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Introduction to Grace Blackwell architecture +## Introduction to the Grace Blackwell architecture -In this session, you will explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU–GPU hybrid for large-scale AI workloads. +In this section, you'll explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU–GPU hybrid for large-scale AI workloads. You'll perform hands-on verification steps to ensure your DGX Spark environment is properly configured for subsequent GPU-accelerated LLM sessions. -You will also perform hands-on verification steps to ensure your DGX Spark environment is properly configured for subsequent GPU-accelerated LLM sessions. - -The NVIDIA DGX Spark is a personal AI supercomputer that brings data center–class AI computing directly to the developer desktop. -The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine. 
+The NVIDIA DGX Spark is a personal AI supercomputer that brings data center–class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine. The NVIDIA Grace Blackwell DGX Spark (GB10) platform combines: -- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency. - -- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads. -- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks. - -This design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. -DGX Spark is a compact yet powerful development platform for modern AI workloads. +- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency +- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads +- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks -DGX Spark represents a major step toward NVIDIA’s vision of AI Everywhere, empowering developers to prototype, fine-tune, and deploy large-scale AI models locally while seamlessly connecting to cloud or data-center environments when needed. +This NVIDIA Grace Blackwell DGX Spark (GB10) platform design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform for modern AI workloads, bringing powerful AI development capabilities to your desktop, letting you build and test AI models locally before scaling them to larger systems. -## Why should I use Grace Blackwell for quantized LLMs? +## What are the benefits of using Grace Blackwell for quantized LLMs? Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip. -| **Feature** | **Impact on Quantized LLMs** | +The Grace Blackwell architecture brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference: + +| **Feature** | **Impact on quantized LLMs** | |--------------|------------------------------| -| Grace CPU (Arm Cortex-X925 / A725) | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle). | -| Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores) | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers. | -| High Bandwidth + Low Latency | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU–GPU workloads. 
| -| Unified 128 GB Memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer. | -| Energy-Efficient Arm Design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads. | +| Grace CPU (Arm Cortex-X925 / A725) | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle) | +| Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores) | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers | +| High bandwidth + low latency | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU–GPU workloads | +| Unified 128 GB memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer | +| Energy-efficient Arm design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads | In a typical quantized LLM workflow: -- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks. -- The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput. -- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space — reducing copy overhead and enabling near-real-time inference. +- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks +- The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput +- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space - reducing copy overhead and enabling near-real-time inference Together, these features make the GB10 a developer-grade AI laboratory for running, profiling, and scaling quantized LLMs efficiently in a desktop form factor. -## Verify your GB10 development environment - -Let's verify that your DGX Spark system is configured and ready for building and running quantized LLMs. 
- -## Step 1: Check the CPU information - -Run the following command to print the CPU information: - -```bash -lscpu -``` - -Expected output: - -```output -Architecture: aarch64 - CPU op-mode(s): 64-bit - Byte Order: Little Endian -CPU(s): 20 - On-line CPU(s) list: 0-19 -Vendor ID: ARM - Model name: Cortex-X925 - Model: 1 - Thread(s) per core: 1 - Core(s) per socket: 10 - Socket(s): 1 - Stepping: r0p1 - CPU(s) scaling MHz: 89% - CPU max MHz: 4004.0000 - CPU min MHz: 1378.0000 - BogoMIPS: 2000.00 - Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 as - imddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 f - lagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt - Model name: Cortex-A725 - Model: 1 - Thread(s) per core: 1 - Core(s) per socket: 10 - Socket(s): 1 - Stepping: r0p1 - CPU(s) scaling MHz: 99% - CPU max MHz: 2860.0000 - CPU min MHz: 338.0000 - BogoMIPS: 2000.00 - Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 as - imddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 f - lagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt -Caches (sum of all): - L1d: 1.3 MiB (20 instances) - L1i: 1.3 MiB (20 instances) - L2: 25 MiB (20 instances) - L3: 24 MiB (2 instances) -NUMA: - NUMA node(s): 1 - NUMA node0 CPU(s): 0-19 -Vulnerabilities: - Gather data sampling: Not affected - Itlb multihit: Not affected - L1tf: Not affected - Mds: Not affected - Meltdown: Not affected - Mmio stale data: Not affected - Reg file data sampling: Not affected - Retbleed: Not affected - Spec rstack overflow: Not affected - Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl - Spectre v1: Mitigation; __user pointer sanitization - Spectre v2: Not affected - Srbds: Not affected - Tsx async abort: Not affected -``` - -Great! You've checked your CPU configuration. Your system is using Armv9 cores, which are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations. - -The following table summarizes the key specifications of the Grace CPU and explains their relevance to quantized LLM inference. - -| **Category** | **Specification** | **Description / Impact for LLM Inference** | -|---------------|-------------------|---------------------------------------------| -| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. | -| Core Configuration | 20 cores total — 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. | -| Threads per Core | 1 | Optimized for deterministic scheduling and predictable latency. | -| Clock Frequency | Up to **4.0 GHz** (Cortex-X925)
Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. | -| Cache Hierarchy | L1: 1.3 MiB × 20
L2: 25 MiB × 20
L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. | -| Instruction Set Features** | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. | -| NUMA Topology | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads. | -| Security & Reliability | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. | - -Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities make it ideal for quantized LLM workloads, providing power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing. - -You can also verify the operating system running on your DGX Spark by using the following command: - -```bash -lsb_release -a -``` - -Expected output: - -```log -No LSB modules are available. -Distributor ID: Ubuntu -Description: Ubuntu 24.04.3 LTS -Release: 24.04 -Codename: noble -``` -As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendly Linux distribution. -It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities—making it an ideal environment for building and deploying quantized LLM workloads. - -Nice work! You've confirmed your operating system is Ubuntu 24.04 LTS, which is well-supported for AI development on Arm. - -## Step 2: Verify Blackwell GPU and driver - -After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads. - -```bash -nvidia-smi -``` - -You will see output similar to: - -```output -Wed Oct 22 09:26:54 2025 -+-----------------------------------------------------------------------------------------+ -| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 | -+-----------------------------------------+------------------------+----------------------+ -| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | -| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | -| | | MIG M. | -|=========================================+========================+======================| -| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A | -| N/A 32C P8 4W / N/A | Not Supported | 0% Default | -| | | N/A | -+-----------------------------------------+------------------------+----------------------+ - -+-----------------------------------------------------------------------------------------+ -| Processes: | -| GPU GI CI PID Type Process name GPU Memory | -| ID ID Usage | -|=========================================================================================| -| 0 N/A N/A 3094 G /usr/lib/xorg/Xorg 43MiB | -| 0 N/A N/A 3172 G /usr/bin/gnome-shell 16MiB | -+-----------------------------------------------------------------------------------------+ -``` - -The `nvidia-smi` tool reports GPU hardware specifications and provides valuable runtime information, including driver status, temperature, power usage, and GPU utilization. This information helps verify that the system is ready for AI workloads. 
- -The table below provides more explanation of the `nvidia-smi` output: - -| **Category** | **Specification (from nvidia-smi)** | **Description / Impact for LLM Inference** | -|---------------|--------------------------------------|---------------------------------------------| -| GPU Name** | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip. | -| Driver Version | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. | -| CUDA Version | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. | -| Architecture / Compute Capability | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. | -| Memory | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space. | -| Power & Thermal Status | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle. | -| GPU-Utilization | 0% (Idle) | Indicates no active compute workloads; GPU is ready for new inference jobs. | -| Memory Usage | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. | -| Persistence Mode | On | Ensures the GPU remains initialized and ready for rapid inference startup. | - -Excellent! Your Blackwell GPU is recognized and ready for CUDA workloads. This means your system is set up for GPU-accelerated LLM inference. - - -## Step 3: Check the CUDA Toolkit - -To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed. - -The `nvcc --version` command confirms that the CUDA compiler is available and compatible with CUDA 13. -This ensures that CMake can correctly detect and compile the GPU-accelerated components. - -```bash -nvcc --version -``` -You're almost ready! Verifying the CUDA toolkit ensures you can build GPU-enabled versions of llama.cpp for maximum performance. - -You will see output similar to: - -```output -nvcc: NVIDIA (R) Cuda compiler driver -Copyright (c) 2005-2025 NVIDIA Corporation -Built on Wed_Aug_20_01:57:39_PM_PDT_2025 -Cuda compilation tools, release 13.0, V13.0.88 -Build cuda_13.0.r13.0/compiler.36424714_0 -``` - -{{% notice Note %}} -The nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference. -{{% /notice %}} - -This confirms that the CUDA 13 toolkit is installed and ready for GPU compilation. -If the command is missing or reports an older version (e.g., 12.x), you should update to CUDA 13.0 or later to ensure compatibility with the Blackwell GPU (sm_121). - -At this point, you have verified that: -- The Grace CPU (Arm Cortex-X925 / A725) is correctly recognized and supports Armv9 extensions. -- The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime. -- The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp. - -Your DGX Spark environment is now fully prepared for the next section, where you will build and configure both CPU and GPU versions of llama.cpp, laying the foundation for running quantized LLMs efficiently on the Grace Blackwell platform. - -## What have I achieved? 
- -You have: -- Verified your Arm-based Grace CPU and its capabilities -- Confirmed your Blackwell GPU and CUDA driver are ready -- Checked your operating system and CUDA toolkit - -You're now ready to move on to building and running quantized LLMs on your DGX Spark! diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md new file mode 100644 index 000000000..e9457f123 --- /dev/null +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md @@ -0,0 +1,214 @@ +--- +title: Verify Grace Blackwell system readiness for AI inference +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Verify your GB10 development environment + +Let's verify that your DGX Spark system is configured and ready for building and running quantized LLMs. + +## Check the CPU information + +To check the CPU information, run the following command to print the CPU information: + +```bash +lscpu +``` + +Expected output: + +```output +Architecture: aarch64 + CPU op-mode(s): 64-bit + Byte Order: Little Endian +CPU(s): 20 + On-line CPU(s) list: 0-19 +Vendor ID: ARM + Model name: Cortex-X925 + Model: 1 + Thread(s) per core: 1 + Core(s) per socket: 10 + Socket(s): 1 + Stepping: r0p1 + CPU(s) scaling MHz: 89% + CPU max MHz: 4004.0000 + CPU min MHz: 1378.0000 + BogoMIPS: 2000.00 + Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 as + imddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 f + lagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt + Model name: Cortex-A725 + Model: 1 + Thread(s) per core: 1 + Core(s) per socket: 10 + Socket(s): 1 + Stepping: r0p1 + CPU(s) scaling MHz: 99% + CPU max MHz: 2860.0000 + CPU min MHz: 338.0000 + BogoMIPS: 2000.00 + Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 as + imddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 f + lagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt +Caches (sum of all): + L1d: 1.3 MiB (20 instances) + L1i: 1.3 MiB (20 instances) + L2: 25 MiB (20 instances) + L3: 24 MiB (2 instances) +NUMA: + NUMA node(s): 1 + NUMA node0 CPU(s): 0-19 +Vulnerabilities: + Gather data sampling: Not affected + Itlb multihit: Not affected + L1tf: Not affected + Mds: Not affected + Meltdown: Not affected + Mmio stale data: Not affected + Reg file data sampling: Not affected + Retbleed: Not affected + Spec rstack overflow: Not affected + Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl + Spectre v1: Mitigation; __user pointer sanitization + Spectre v2: Not affected + Srbds: Not affected + Tsx async abort: Not affected +``` + +Great! You've checked your CPU configuration. Your system is using Armv9 cores, which are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations. + +The following table summarizes the key specifications of the Grace CPU and explains their relevance to quantized LLM inference. 
+ +| **Category** | **Specification** | **Description / Impact for LLM Inference** | +|---------------|-------------------|---------------------------------------------| +| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. | +| Core Configuration | 20 cores total — 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. | +| Threads per Core | 1 | Optimized for deterministic scheduling and predictable latency. | +| Clock Frequency | Up to **4.0 GHz** (Cortex-X925)
Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. | +| Cache Hierarchy | L1: 1.3 MiB × 20
L2: 25 MiB × 20
L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. | +| Instruction Set Features** | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. | +| NUMA Topology | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads. | +| Security & Reliability | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. | + +Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities make it ideal for quantized LLM workloads, providing power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing. + +You can also verify the operating system running on your DGX Spark by using the following command: + +```bash +lsb_release -a +``` + +Expected output: + +```log +No LSB modules are available. +Distributor ID: Ubuntu +Description: Ubuntu 24.04.3 LTS +Release: 24.04 +Codename: noble +``` +As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendly Linux distribution. +It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities—making it an ideal environment for building and deploying quantized LLM workloads. + +Nice work! You've confirmed your operating system is Ubuntu 24.04 LTS, which is well-supported for AI development on Arm. + +## Step 2: Verify Blackwell GPU and driver + +After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads. + +```bash +nvidia-smi +``` + +You will see output similar to: + +```output +Wed Oct 22 09:26:54 2025 ++-----------------------------------------------------------------------------------------+ +| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 | ++-----------------------------------------+------------------------+----------------------+ +| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+========================+======================| +| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A | +| N/A 32C P8 4W / N/A | Not Supported | 0% Default | +| | | N/A | ++-----------------------------------------+------------------------+----------------------+ + ++-----------------------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=========================================================================================| +| 0 N/A N/A 3094 G /usr/lib/xorg/Xorg 43MiB | +| 0 N/A N/A 3172 G /usr/bin/gnome-shell 16MiB | ++-----------------------------------------------------------------------------------------+ +``` + +The `nvidia-smi` tool reports GPU hardware specifications and provides valuable runtime information, including driver status, temperature, power usage, and GPU utilization. This information helps verify that the system is ready for AI workloads. 
+ +The table below provides more explanation of the `nvidia-smi` output: + +| **Category** | **Specification (from nvidia-smi)** | **Description / Impact for LLM Inference** | +|---------------|--------------------------------------|---------------------------------------------| +| GPU Name** | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip. | +| Driver Version | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. | +| CUDA Version | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. | +| Architecture / Compute Capability | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. | +| Memory | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space. | +| Power & Thermal Status | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle. | +| GPU-Utilization | 0% (Idle) | Indicates no active compute workloads; GPU is ready for new inference jobs. | +| Memory Usage | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. | +| Persistence Mode | On | Ensures the GPU remains initialized and ready for rapid inference startup. | + +Excellent! Your Blackwell GPU is recognized and ready for CUDA workloads. This means your system is set up for GPU-accelerated LLM inference. + + +## Step 3: Check the CUDA Toolkit + +To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed. + +The `nvcc --version` command confirms that the CUDA compiler is available and compatible with CUDA 13. +This ensures that CMake can correctly detect and compile the GPU-accelerated components. + +```bash +nvcc --version +``` +You're almost ready! Verifying the CUDA toolkit ensures you can build GPU-enabled versions of llama.cpp for maximum performance. + +You will see output similar to: + +```output +nvcc: NVIDIA (R) Cuda compiler driver +Copyright (c) 2005-2025 NVIDIA Corporation +Built on Wed_Aug_20_01:57:39_PM_PDT_2025 +Cuda compilation tools, release 13.0, V13.0.88 +Build cuda_13.0.r13.0/compiler.36424714_0 +``` + +{{% notice Note %}} +The nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference. +{{% /notice %}} + +This confirms that the CUDA 13 toolkit is installed and ready for GPU compilation. +If the command is missing or reports an older version (e.g., 12.x), you should update to CUDA 13.0 or later to ensure compatibility with the Blackwell GPU (sm_121). + +At this point, you have verified that: +- The Grace CPU (Arm Cortex-X925 / A725) is correctly recognized and supports Armv9 extensions. +- The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime. +- The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp. + +Your DGX Spark environment is now fully prepared for the next section, where you will build and configure both CPU and GPU versions of llama.cpp, laying the foundation for running quantized LLMs efficiently on the Grace Blackwell platform. + +## What have I achieved? 
+ +You have: +- Verified your Arm-based Grace CPU and its capabilities +- Confirmed your Blackwell GPU and CUDA driver are ready +- Checked your operating system and CUDA toolkit + +You're now ready to move on to building and running quantized LLMs on your DGX Spark! diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md index 6fa643885..9e005b1a7 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md @@ -1,6 +1,6 @@ --- title: Build the GPU version of llama.cpp on GB10 -weight: 3 +weight: 4 layout: "learningpathall" --- diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md index 90d03a16d..31e5a1789 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md @@ -1,6 +1,6 @@ --- title: Build the CPU version of llama.cpp on GB10 -weight: 4 +weight: 5 layout: "learningpathall" --- diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md index 91e251471..0c52fbbce 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md @@ -1,6 +1,6 @@ --- title: Analyze CPU instruction mix using Process Watch -weight: 5 +weight: 6 layout: "learningpathall" --- From b1e1453e331d3792ff4b5f7f6ccd5e7496f5f5cf Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Tue, 4 Nov 2025 15:42:46 +0000 Subject: [PATCH 05/12] Refactor DGX Spark learning path documentation: update section titles, enhance clarity, and improve content organization --- .../dgx_spark_llamacpp/1_gb10_introduction.md | 11 ++- .../dgx_spark_llamacpp/1a_gb10_setup.md | 91 ++++++++++--------- .../dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md | 24 ++--- .../dgx_spark_llamacpp/_index.md | 2 +- 4 files changed, 69 insertions(+), 59 deletions(-) diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md index 5e07a2307..9e867cad5 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md @@ -1,14 +1,14 @@ --- -title: "Introduction to Grace Blackwell: Unlocking efficient quantized LLMs on Arm-based NVIDIA DGX Spark" +title: Discover the Grace Blackwell architecture weight: 2 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Introduction to the Grace Blackwell architecture +## Overview -In this section, you'll explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU–GPU hybrid for large-scale AI workloads. 
You'll perform hands-on verification steps to ensure your DGX Spark environment is properly configured for subsequent GPU-accelerated LLM sessions. +In this section, you'll explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU–GPU hybrid for large-scale AI workloads. The NVIDIA DGX Spark is a personal AI supercomputer that brings data center–class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine. @@ -23,7 +23,9 @@ This NVIDIA Grace Blackwell DGX Spark (GB10) platform design delivers up to one Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip. -The Grace Blackwell architecture brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference: +The Grace Blackwell architecture brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference, which are summarized in the table below: + +### Features of the Grace Blackwell architecture and the impact they have on quantized LLMs | **Feature** | **Impact on quantized LLMs** | |--------------|------------------------------| @@ -33,6 +35,7 @@ The Grace Blackwell architecture brings several key advantages to quantized LLM | Unified 128 GB memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer | | Energy-efficient Arm design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads | +### The quantized LLM workflow In a typical quantized LLM workflow: - The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md index e9457f123..16bd63cb1 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md @@ -1,24 +1,28 @@ --- -title: Verify Grace Blackwell system readiness for AI inference +title: Verify your Grace Blackwell system readiness for AI inference weight: 3 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Verify your GB10 development environment +## Overview -Let's verify that your DGX Spark system is configured and ready for building and running quantized LLMs. +Before building and running quantized LLMs on your DGX Spark, you need to verify that your system is fully prepared for AI workloads. This includes checking your Arm-based Grace CPU configuration, confirming your operating system, ensuring the Blackwell GPU and CUDA drivers are active, and validating that the CUDA toolkit is installed. These steps provide a solid foundation for efficient LLM inference and development on Arm. -## Check the CPU information +This section is organized into three main steps: verifying your CPU, checking your operating system, and confirming your GPU and CUDA toolkit setup. 
You'll also find additional context and technical details throughout, should you wish to explore the platform's capabilities more deeply. -To check the CPU information, run the following command to print the CPU information: +## Step 1: check your CPU configuration + +Before running LLM workloads, it's helpful to understand more about the CPU you're working with. The DGX Spark uses Arm-based Grace processors, which bring some unique advantages for AI inference. + +Start by checking your system's CPU configuration: ```bash lscpu ``` -Expected output: +The output is similar to: ```output Architecture: aarch64 @@ -78,22 +82,24 @@ Vulnerabilities: Tsx async abort: Not affected ``` -Great! You've checked your CPU configuration. Your system is using Armv9 cores, which are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations. +Great! If you have seen this message your system is using Armv9 cores, which are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations. + +### Grace CPU specification -The following table summarizes the key specifications of the Grace CPU and explains their relevance to quantized LLM inference. +The following table gives you more information about the key specifications of the Grace CPU and explains their relevance to quantized LLM inference: | **Category** | **Specification** | **Description / Impact for LLM Inference** | |---------------|-------------------|---------------------------------------------| -| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. | -| Core Configuration | 20 cores total — 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. | -| Threads per Core | 1 | Optimized for deterministic scheduling and predictable latency. | +| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions| +| Core Configuration | 20 cores total - 10× Cortex-X925 (performance) + 10× Cortex-A725 (efficiency) | Heterogeneous CPU design balancing high performance and power efficiency | +| Threads per Core | 1 | Optimized for deterministic scheduling and predictable latency | | Clock Frequency | Up to **4.0 GHz** (Cortex-X925)
Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. | -| Cache Hierarchy | L1: 1.3 MiB × 20
L2: 25 MiB × 20
L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. | -| Instruction Set Features** | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. | -| NUMA Topology | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads. | -| Security & Reliability | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. | +| Cache Hierarchy | L1: 1.3 MiB × 20
L2: 25 MiB × 20
L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads | +| Instruction Set Features** | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations | +| NUMA Topology | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads | +| Security and Reliability | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks | -Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities make it ideal for quantized LLM workloads, providing power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing. +Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities are what make it ideal for quantized LLM workloads, as these provide a power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing. You can also verify the operating system running on your DGX Spark by using the following command: @@ -101,7 +107,7 @@ You can also verify the operating system running on your DGX Spark by using the lsb_release -a ``` -Expected output: +The expected output is something similar to: ```log No LSB modules are available. @@ -110,14 +116,13 @@ Description: Ubuntu 24.04.3 LTS Release: 24.04 Codename: noble ``` -As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendly Linux distribution. -It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities—making it an ideal environment for building and deploying quantized LLM workloads. +This shows you that your DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendly Linux distribution that provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities. This makes it an ideal environment for building and deploying quantized LLM workloads. -Nice work! You've confirmed your operating system is Ubuntu 24.04 LTS, which is well-supported for AI development on Arm. +Nice work! You've confirmed your operating system is Ubuntu 24.04 LTS, so you can move on to the next step. -## Step 2: Verify Blackwell GPU and driver +## Step 2: Verify the Blackwell GPU and driver -After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads. +After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads by using the following: ```bash nvidia-smi @@ -151,24 +156,25 @@ Wed Oct 22 09:26:54 2025 The `nvidia-smi` tool reports GPU hardware specifications and provides valuable runtime information, including driver status, temperature, power usage, and GPU utilization. This information helps verify that the system is ready for AI workloads. 
-The table below provides more explanation of the `nvidia-smi` output: +### Further information about the output from the nvidia-smi tool -| **Category** | **Specification (from nvidia-smi)** | **Description / Impact for LLM Inference** | +The table below provides more explanation of the `nvidia-smi` output: +| **Category** | **Specification (from nvidia-smi)** | **Description / impact for LLM inference** | |---------------|--------------------------------------|---------------------------------------------| -| GPU Name** | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip. | -| Driver Version | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. | -| CUDA Version | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. | -| Architecture / Compute Capability | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. | -| Memory | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space. | -| Power & Thermal Status | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle. | -| GPU-Utilization | 0% (Idle) | Indicates no active compute workloads; GPU is ready for new inference jobs. | -| Memory Usage | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. | -| Persistence Mode | On | Ensures the GPU remains initialized and ready for rapid inference startup. | +| GPU name | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip | +| Driver version | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility | +| CUDA version | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads | +| Architecture / Compute capability | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs | +| Memory | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space | +| Power & Thermal status | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle | +| GPU-utilization | 0% (Idle) | Indicates no active compute workloads; GPU is ready for new inference jobs | +| Memory usage | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed | +| Persistence mode | On | Ensures the GPU remains initialized and ready for rapid inference startup | Excellent! Your Blackwell GPU is recognized and ready for CUDA workloads. This means your system is set up for GPU-accelerated LLM inference. -## Step 3: Check the CUDA Toolkit +## Step 3: Check the CUDA toolkit To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed. @@ -198,17 +204,18 @@ This confirms that the CUDA 13 toolkit is installed and ready for GPU compilatio If the command is missing or reports an older version (e.g., 12.x), you should update to CUDA 13.0 or later to ensure compatibility with the Blackwell GPU (sm_121). 
At this point, you have verified that: -- The Grace CPU (Arm Cortex-X925 / A725) is correctly recognized and supports Armv9 extensions. -- The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime. -- The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp. +- The Grace CPU (Arm Cortex-X925 / A725) is correctly recognized and supports Armv9 extensions +- The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime +- The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp Your DGX Spark environment is now fully prepared for the next section, where you will build and configure both CPU and GPU versions of llama.cpp, laying the foundation for running quantized LLMs efficiently on the Grace Blackwell platform. ## What have I achieved? -You have: -- Verified your Arm-based Grace CPU and its capabilities -- Confirmed your Blackwell GPU and CUDA driver are ready -- Checked your operating system and CUDA toolkit +In this entire setup section, you have achieved the following: + +- Verified your Arm-based Grace CPU and its capabilities—you've confirmed that your system is running Armv9 cores with SVE2, BF16, and INT8 matrix multiplication support, which are perfect for quantized LLM inference +- Confirmed your Blackwell GPU and CUDA driver are ready—the GB10 GPU is active, properly recognized, and set up with CUDA 13, so you're all set for GPU-accelerated workloads +- Checked your operating system and CUDA toolkit—Ubuntu 24.04 LTS provides a solid foundation, and the CUDA compiler is installed and ready for building GPU-enabled inference tools -You're now ready to move on to building and running quantized LLMs on your DGX Spark! +You're now ready to move on to building and running quantized LLMs on your DGX Spark. The next section walks you through compiling llama.cpp for both CPU and GPU, so you can start running AI inference on this platform. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md index 9e005b1a7..eebbebbc7 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md @@ -14,7 +14,7 @@ llama.cpp is an open-source project by Georgi Gerganov that provides efficient a ### Step 1: Preparation -In this step, you will install the necessary build tools and download a small quantized model for validation. +In this step, you will install the necessary build tools and download a small quantized model for validation: ```bash sudo apt update @@ -41,7 +41,7 @@ Great! You’ve installed all the required build tools and downloaded a quantize ### Step 2: Clone the llama.cpp repository -Use the commands below to download the source code for llama.cpp from GitHub. +Use the commands below to download the source code for llama.cpp from GitHub: ```bash cd ~ @@ -70,13 +70,15 @@ cmake .. 
\ -DCMAKE_CUDA_COMPILER=nvcc ``` -Explanation of Key Flags: +### Explanation of key flags: + +The following table provides an explanation of the key flags you used in the previous code: | **Feature** | **Description / Impact** | |--------------|------------------------------| -| -DGGML_CUDA=ON | Enables the CUDA backend in llama.cpp, allowing matrix operations and transformer layers to be offloaded to the GPU for acceleration.| -| -DGGML_CUDA_F16=ON | Enables FP16 (half-precision) CUDA kernels, reducing memory usage and increasing throughput — especially effective for quantized models (e.g., Q4, Q5). | -| -DCMAKE_CUDA_ARCHITECTURES=121 | Specifies the compute capability for the NVIDIA Blackwell GPU (GB10 = sm_121), ensuring the CUDA compiler (nvcc) generates optimized GPU kernels. | +| -DGGML_CUDA=ON | Enables the CUDA backend in llama.cpp, allowing matrix operations and transformer layers to be offloaded to the GPU for acceleration| +| -DGGML_CUDA_F16=ON | Enables FP16 (half-precision) CUDA kernels, reducing memory usage and increasing throughput — especially effective for quantized models (for example, Q4, Q5) | +| -DCMAKE_CUDA_ARCHITECTURES=121 | Specifies the compute capability for the NVIDIA Blackwell GPU (GB10 = sm_121), ensuring the CUDA compiler (nvcc) generates optimized GPU kernels| When the configuration process completes successfully, the terminal should display output similar to the following: @@ -87,17 +89,15 @@ When the configuration process completes successfully, the terminal should displ ``` {{% notice Note %}} -1. For systems with multiple CUDA versions installed, explicitly specifying the compilers (`-DCMAKE_C_COMPILER`, `-DCMAKE_CXX_COMPILER`, `-DCMAKE_CUDA_COMPILER`) ensures that CMake uses the correct CUDA 13.0 toolchain. -2. In case of configuration errors, revisit the previous section to verify that your CUDA toolkit and driver versions are properly installed and aligned with Blackwell (sm_121) support. -{{% /notice %}} +- For systems with multiple CUDA versions installed, explicitly specifying the compilers (`-DCMAKE_C_COMPILER`, `-DCMAKE_CXX_COMPILER`, `-DCMAKE_CUDA_COMPILER`) ensures that CMake uses the correct CUDA 13.0 toolchain. +- If you encounter configuration errors, return to the previous section and confirm that your CUDA toolkit and driver versions are correctly installed and compatible with Blackwell (sm_121).{{% /notice Note %}} Once CMake configuration succeeds, start the compilation process: ```bash make -j"$(nproc)" ``` - -This command compiles all CUDA and C++ source files in parallel, utilizing all available CPU cores for optimal build performance. On the Grace CPU in the DGX Spark system, the build process typically completes within 2–4 minutes, demonstrating the efficiency of the Arm-based architecture for software development. +This command compiles all CUDA and C++ source files in parallel, using all available CPU cores. On the Grace CPU, the build typically finishes in 2–4 minutes. The build output is shown below: @@ -110,7 +110,7 @@ The build output is shown below: [100%] Built target llama-server ``` -After the build completes, the GPU-accelerated binaries will be located under `~/llama.cpp/build-gpu/bin/` +After the build completes, the GPU-accelerated binaries are located under `~/llama.cpp/build-gpu/bin/` These binaries provide all necessary tools for quantized model inference (llama-cli) and for serving GPU inference via HTTP API (llama-server). You are now ready to test quantized LLMs with full GPU acceleration in the next step. 
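+As an optional preview of the server workflow mentioned above, you can start `llama-server` with the model you downloaded and confirm that it responds. This is a sketch, not a required step: the GGUF filename below is an assumption (use whichever file is in your `~/models/TinyLlama-1.1B` directory), and you can stop the server with Ctrl+C when finished.
+
+```bash
+cd ~/llama.cpp/build-gpu
+# Start the HTTP inference server (adjust the model filename to the file you downloaded)
+./bin/llama-server -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf --port 8080
+```
+
+```bash
+# From a second terminal, confirm the server is up
+curl http://localhost:8080/health
+```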
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md index 698056539..cdf851714 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md @@ -1,5 +1,5 @@ --- -title: Deploy quantized LLMs on DGX Spark using llama.cpp +title: Unlock efficient quantized LLMs on Arm-based NVIDIA DGX Spark using Armv9 SIMD instructions minutes_to_complete: 60 From 500ffd826f476988f8c801fd616508818f5184e4 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Tue, 4 Nov 2025 22:16:21 +0000 Subject: [PATCH 06/12] Refactor DGX Spark learning path documentation: enhance clarity, improve section organization, and update content for GPU and CPU builds --- .../dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md | 33 ++++++++------ .../dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md | 45 ++++++++++--------- .../dgx_spark_llamacpp/4_gb10_processwatch.md | 32 +++++++++---- 3 files changed, 66 insertions(+), 44 deletions(-) diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md index eebbebbc7..ecec7b2ac 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md @@ -8,7 +8,7 @@ layout: "learningpathall" In the previous section, you verified that your DGX Spark system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment. -Now that your hardware and drivers are ready, this section focuses on building the GPU-enabled version of llama.cpp, a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs. +Now that your hardware and drivers are ready, this section focuses on building the GPU-enabled version of llama.cpp, which is a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs. llama.cpp is an open-source project by Georgi Gerganov that provides efficient and dependency-free large language model inference on both CPUs and GPUs. @@ -23,7 +23,9 @@ sudo apt install -y git cmake build-essential nvtop htop These packages provide the C/C++ compiler toolchain, CMake build system, and GPU monitoring utility (nvtop) required to compile and test llama.cpp. -To verify your GPU build later, you need at least one quantized model for testing. +### Download a test model + +To test your GPU build, you'll need a quantized model. In this section, you'll download a lightweight model that's perfect for validation. First, ensure that you have the latest Hugging Face Hub CLI installed and download models: @@ -35,7 +37,9 @@ pip install -U huggingface_hub hf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF --local-dir TinyLlama-1.1B ``` -After the download completes, the models will be available in the `~/models` directory. +{{% notice Note %}} +After the download completes, you'll find the models in the `~/models` directory. +{{% /notice Note %}} Great! You’ve installed all the required build tools and downloaded a quantized model for validation. Your environment is ready for source code setup. @@ -53,9 +57,8 @@ Nice work! 
You now have the latest llama.cpp source code on your DGX Spark syste ### Step 3: Configure and Build the CUDA-Enabled Version (GPU Mode) -Run the following `cmake` command to configure the build system for GPU acceleration. +Run the following `cmake` command to configure the build system for GPU acceleration: -This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels. ```bash mkdir -p build-gpu @@ -70,9 +73,11 @@ cmake .. \ -DCMAKE_CUDA_COMPILER=nvcc ``` +This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels + ### Explanation of key flags: -The following table provides an explanation of the key flags you used in the previous code: +Here's what each configuration flag does: | **Feature** | **Description / Impact** | |--------------|------------------------------| @@ -110,11 +115,11 @@ The build output is shown below: [100%] Built target llama-server ``` -After the build completes, the GPU-accelerated binaries are located under `~/llama.cpp/build-gpu/bin/` +After the build completes, you'll find the GPU-accelerated binaries located under `~/llama.cpp/build-gpu/bin/`. -These binaries provide all necessary tools for quantized model inference (llama-cli) and for serving GPU inference via HTTP API (llama-server). You are now ready to test quantized LLMs with full GPU acceleration in the next step. +These binaries provide all necessary tools for quantized model inference (llama-cli) and for serving GPU inference using HTTP API (llama-server). -Excellent! The CUDA-enabled build is complete. Your binaries are optimized for the Blackwell GPU and ready for validation. +Excellent! The CUDA-enabled build is complete. Your binaries are optimized for the Blackwell GPU and ready for validation. You are now ready to test quantized LLMs with full GPU acceleration in the next step. Together, these options ensure that the build targets the Grace Blackwell GPU with full CUDA 13 compatibility. @@ -140,7 +145,7 @@ The output is similar to: If the CUDA library is correctly linked, it confirms that the binary can access the GPU through the system driver. -Next, confirm that the binary initializes the GPU correctly by checking device detection and compute capability. +Next, confirm that the binary initializes the GPU correctly by checking device detection and compute capability: ```bash ./bin/llama-server --version @@ -171,15 +176,17 @@ Next, use the downloaded quantized model (for example, TinyLlama-1.1B) to verify If the build is successful, you will see text generation begin within a few seconds. -While `nvidia-smi` can display basic GPU information, `nvtop` provides real-time visualization of utilization, temperature, and power metrics which are useful for verifying CUDA kernel activity during inference. +To monitor GPU utilization during inference, use `nvtop` to view real-time performance metrics: ```bash nvtop ``` +This command displays GPU utilization, memory usage, temperature, and power consumption. You can use this to verify that CUDA kernels are active during model inference. + The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark. -![image1 nvtop screenshot](nvtop.png "TinyLlama GPU Utilization") +![nvtop terminal interface displaying real-time GPU metrics, including GPU utilization, memory usage, temperature, power consumption, and active processes for the NVIDIA GB10 GPU during model inference on DGX Spark. 
alt-text#center](nvtop.png "TinyLlama GPU Utilization") The nvtop interface shows: @@ -189,8 +196,6 @@ The nvtop interface shows: You have now successfully built and validated the CUDA-enabled version of llama.cpp on DGX Spark. -Success! You’ve confirmed that the GPU-accelerated version of llama.cpp is correctly built and can run quantized LLM inference on your DGX Spark. - ## What have I achieved? You have: diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md index 31e5a1789..fedd1264a 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md @@ -4,19 +4,20 @@ weight: 5 layout: "learningpathall" --- -## How do I build the CPU version of llama.cpp on GB10? +## Overview +In this section, you'll build and test the CPU-only version of llama.cpp, optimized specifically for the Grace CPU's advanced Armv9 capabilities. -Use the steps below to build and test the CPU only versoin of llama.cpp. +The Grace CPU features Arm Cortex-X925 and Cortex-A725 cores with advanced vector extensions including SVE2, BFloat16, and I8MM. These extensions make the CPU highly efficient for quantized inference workloads, even without GPU acceleration. + +### Configure and build the CPU-only version -### Step 1: Configure and Build the CPU-Only Version In this session, you will configure and build the CPU-only version of llama.cpp, optimized for the Armv9-based Grace CPU. This build runs entirely on the Grace CPU (Arm Cortex-X925 and Cortex-A725), which supports advanced Armv9 vector extensions including SVE2, BFloat16, and I8MM, making it highly efficient for quantized inference workloads even without GPU acceleration. +To ensure a clean separation from the GPU build artifacts, start from a clean directory. -Start from a clean directory to ensure a clean separation from the GPU build artifacts. - -Run the following commands to configure the build system for the CPU-only version of llama.cpp. +Configure the build system for the CPU-only version of llama.cpp: ```bash cd ~/llama.cpp @@ -34,15 +35,15 @@ cmake .. 
\ -DCMAKE_CXX_FLAGS="-O3 -march=armv9-a+sve2+bf16+i8mm -mtune=native -fopenmp" ``` -Explanation of Key Flags: +Explanation of key flags: | **Feature** | **Description / Impact** | |--------------|------------------------------| -| -march=armv9-a | Targets the Armv9-A architecture used by the Grace CPU and enables advanced vector extensions.| -| +sve2+bf16+i8mm | Activates Scalable Vector Extensions (SVE2), INT8 matrix multiply (I8MM), and BFloat16 operations for quantized inference.| -| -fopenmp | Enables multi-threaded execution via OpenMP, allowing all 20 Grace cores to be utilized.| -| -mtune=native | Optimizes code generation for the local Grace CPU microarchitecture.| -| -DLLAMA_ACCELERATE=ON | Enables llama.cpp’s internal Arm acceleration path (Neon/SVE optimized kernels).| +| -march=armv9-a | Targets the Armv9-A architecture used by the Grace CPU and enables advanced vector extensions | +| +sve2+bf16+i8mm | Activates Scalable Vector Extensions (SVE2), INT8 matrix multiply (I8MM), and BFloat16 operations for quantized inference | +| -fopenmp | Enables multi-threaded execution via OpenMP, allowing all 20 Grace cores to be utilized | +| -mtune=native | Optimizes code generation for the local Grace CPU microarchitecture | +| -DLLAMA_ACCELERATE=ON | Enables llama.cpp's internal Arm acceleration path (Neon/SVE optimized kernels) | When the configuration process completes successfully, the terminal should display output similar to the following: @@ -52,7 +53,7 @@ When the configuration process completes successfully, the terminal should displ -- Build files have been written to: /home/nvidia/llama.cpp/build-cpu ``` -Then, start the compilation process: +Once you see this, you can now move on to start the compilation process: ```bash make -j"$(nproc)" @@ -81,17 +82,17 @@ The build output is shown below: [100%] Built target llama-server ``` -After the build finishes, the CPU-optimized binaries will be available under `~/llama.cpp/build-cpu/bin/` +After the build finishes, you'll find the CPU-optimized binaries at `~/llama.cpp/build-cpu/bin/` ### Step 2: Validate the CPU-Enabled Build (CPU Mode) -In this step, you will validate that the binary was compiled in CPU-only mode and runs correctly on the Grace CPU. +First, validate that the binary was compiled in CPU-only mode and runs correctly on the Grace CPU: ```bash ./bin/llama-server --version ``` -Expected output: +The output confirms the build configuration: ```output version: 6819 (19a5a3ed) @@ -117,7 +118,7 @@ Here is an explanation of the key flags: If the build is successful, you will observe smooth model initialization and token generation, with CPU utilization increasing across all cores. -For live CPU utilization and power metrics, use `htop`: +To monitor live CPU utilization and power metrics during inference, use `htop`: ```bash htop @@ -128,15 +129,17 @@ The following screenshot shows CPU utilization and thread activity during TinyLl The `htop` interface shows: -- CPU Utilization: All 20 cores operate between 75–85%, confirming efficient multi-thread scaling. -- Load Average: Around 5.0, indicating balanced workload distribution. -- Memory Usage: Approximately 4.5 GB total for the TinyLlama Q8_0 model. 
-- Process List: Displays multiple `llama-cli` threads (each 7–9% CPU), confirming OpenMP parallelism +- CPU Utilization: all 20 cores operate between 75–85%, confirming efficient multi-thread scaling +- Load Average: around 5.0, indicating balanced workload distribution +- Memory Usage: approximately 4.5 GB total for the TinyLlama Q8_0 model +- Process List: displays multiple `llama-cli` threads (each 7–9% CPU), confirming OpenMP parallelism {{% notice Note %}} In htop, press F6 to sort by CPU% and verify load distribution, or press `t` to toggle the tree view, which shows the `llama-cli` main process and its worker threads. {{% /notice %}} +## What have I accomplished? + In this section you have: - Built and validated the CPU-only version of llama.cpp. - Optimized the Grace CPU build using Armv9 vector extensions (SVE2, BF16, I8MM). diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md index 0c52fbbce..69d577521 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md @@ -19,7 +19,7 @@ sudo apt update sudo apt install -y git cmake build-essential libncurses-dev libtinfo-dev ``` -Clone and build Process Watch: +Now clone and build Process Watch: ```bash cd ~ @@ -91,7 +91,7 @@ If only one `llama-cli` process is running, you can directly launch Process Watc sudo processwatch --pid $(pgrep llama-cli) ``` -If you have multiple processes running, first identify the correct process ID: +If multiple processes are running, first identify the correct process ID: ```bash pgrep llama-cli @@ -103,6 +103,8 @@ Then attach Process Watch to monitor the instruction mix of this process: sudo processwatch --pid ``` +Replace `` with the actual process ID from the previous command. + {{% notice Note %}} `processwatch --list` does not display all system processes. It is intended for internal use and may not list user-level tasks like llama-cli. @@ -174,7 +176,7 @@ Verify the current setting: cat /proc/sys/abi/sve_default_vector_length ``` -The output is: +The output is similar to: ```output 16 @@ -192,12 +194,24 @@ This behavior is expected because SVE is available but fixed at 128 bits. Future kernel updates may introduce SVE2 instructions. {{% /notice %}} -## Summary +## What you've achieved and what's next + +You have completed the Learning Path for analyzing large language model inference on the DGX Spark platform with Arm-based Grace CPUs and Blackwell GPUs. 
+ +Throughout this Learning Path, you have learned how to: + +- Set up your DGX Spark system with the required Arm software stack and CUDA 13 environment +- Build and validate both GPU-accelerated and CPU-only versions of llama.cpp for quantized LLM inference +- Download and run quantized TinyLlama models for efficient testing and benchmarking +- Monitor GPU utilization and performance using tools like nvtop +- Analyze CPU instruction mix with Process Watch to understand how Armv9 vector instructions are used during inference +- Interpret the impact of NEON, SVE, and SVE2 on AI workloads, and recognize current kernel limitations for vector execution -You have learned how to: -- Use Process Watch to monitor CPU instruction activity -- Interpret Armv9 vector instruction usage during LLM inference -- Prepare for future Armv9 enhancements +By completing these steps, you are now equipped to: -This knowledge helps you profile Arm systems effectively and optimize applications. +- Profile and optimize LLM workloads on Arm-based systems +- Identify performance bottlenecks and opportunities for acceleration on both CPU and GPU +- Prepare for future enhancements in Armv9 vector processing and software support +- Confidently deploy and monitor AI inference on modern Arm server platforms +For additional learning, see the resources in the "Further Reading" section. You can continue experimenting with different models and monitoring tools as new kernel updates become available. From 684f1cd7348b27da1691096b669825e31f7c7bff Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Tue, 4 Nov 2025 22:49:23 +0000 Subject: [PATCH 07/12] Update content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md --- .../dgx_spark_llamacpp/1_gb10_introduction.md | 24 ++++++++++--------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md index 9e867cad5..6b46abb1b 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md @@ -8,39 +8,41 @@ layout: learningpathall ## Overview -In this section, you'll explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU–GPU hybrid for large-scale AI workloads. +Explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU-GPU hybrid for large-scale AI workloads. -The NVIDIA DGX Spark is a personal AI supercomputer that brings data center–class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine. +The NVIDIA DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine. 
-The NVIDIA Grace Blackwell DGX Spark (GB10) platform combines: +The GB10 platform combines: - The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency - The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads - A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks -This NVIDIA Grace Blackwell DGX Spark (GB10) platform design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform for modern AI workloads, bringing powerful AI development capabilities to your desktop, letting you build and test AI models locally before scaling them to larger systems. +This GB10 platform design delivers up to 1 petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform for modern AI workloads, bringing powerful AI development capabilities to your desktop and letting you build and test AI models locally before scaling them to larger systems. -## What are the benefits of using Grace Blackwell for quantized LLMs? +## Benefits of Grace Blackwell for quantized LLMs Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip. -The Grace Blackwell architecture brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference, which are summarized in the table below: +The Grace Blackwell architecture brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference. 
-### Features of the Grace Blackwell architecture and the impact they have on quantized LLMs +### Grace Blackwell features and their impact on quantized LLMs + +The table below shows how specific hardware features enable efficient quantized model inference: | **Feature** | **Impact on quantized LLMs** | |--------------|------------------------------| -| Grace CPU (Arm Cortex-X925 / A725) | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle) | +| Grace CPU (Arm Cortex-X925 / A725) | Handles token orchestration, memory paging, and lightweight inference efficiently with high instructions per cycle (IPC) | | Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores) | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers | -| High bandwidth + low latency | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU–GPU workloads | +| High bandwidth and low latency | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU-GPU workloads | | Unified 128 GB memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer | | Energy-efficient Arm design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads | -### The quantized LLM workflow +### Quantized LLM workflow In a typical quantized LLM workflow: - The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks - The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput -- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space - reducing copy overhead and enabling near-real-time inference +- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space, reducing copy overhead and enabling near-real-time inference Together, these features make the GB10 a developer-grade AI laboratory for running, profiling, and scaling quantized LLMs efficiently in a desktop form factor. From e253cdc77f9be46a66f9a76cb71e8618fad5570b Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Tue, 4 Nov 2025 23:08:59 +0000 Subject: [PATCH 08/12] Refactor documentation for clarity and consistency: update phrasing, section titles, and formatting in setup and build guides for CPU and GPU versions of llama.cpp --- .../dgx_spark_llamacpp/1a_gb10_setup.md | 12 ++++++------ .../dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md | 2 +- .../dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md | 11 +++++------ 3 files changed, 12 insertions(+), 13 deletions(-) diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md index 16bd63cb1..14671484b 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md @@ -86,7 +86,7 @@ Great! 
If you have seen this message your system is using Armv9 cores, which are ### Grace CPU specification -The following table gives you more information about the key specifications of the Grace CPU and explains their relevance to quantized LLM inference: +The following table provides more information about the key specifications of the Grace CPU and explains their relevance to quantized LLM inference: | **Category** | **Specification** | **Description / Impact for LLM Inference** | |---------------|-------------------|---------------------------------------------| @@ -99,7 +99,7 @@ The following table gives you more information about the key specifications of t | NUMA Topology | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads | | Security and Reliability | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks | -Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities are what make it ideal for quantized LLM workloads, as these provide a power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing. +Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities are what make it ideal for quantized LLM workloads, as these provide a power-efficient foundation for both CPU-only inference and CPU-GPU hybrid processing. You can also verify the operating system running on your DGX Spark by using the following command: @@ -120,7 +120,7 @@ This shows you that your DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendl Nice work! You've confirmed your operating system is Ubuntu 24.04 LTS, so you can move on to the next step. -## Step 2: Verify the Blackwell GPU and driver +## Step 2: verify the Blackwell GPU and driver After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads by using the following: @@ -161,7 +161,7 @@ The `nvidia-smi` tool reports GPU hardware specifications and provides valuable The table below provides more explanation of the `nvidia-smi` output: | **Category** | **Specification (from nvidia-smi)** | **Description / impact for LLM inference** | |---------------|--------------------------------------|---------------------------------------------| -| GPU name | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip | +| GPU name | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace Blackwell Superchip | | Driver version | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility | | CUDA version | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads | | Architecture / Compute capability | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs | @@ -174,7 +174,7 @@ The table below provides more explanation of the `nvidia-smi` output: Excellent! Your Blackwell GPU is recognized and ready for CUDA workloads. This means your system is set up for GPU-accelerated LLM inference. -## Step 3: Check the CUDA toolkit +## Step 3: check the CUDA toolkit To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed. 
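+If you are unsure whether the toolkit is already on your `PATH`, a one-line presence check (optional, shown as a sketch) is:
+
+```bash
+# Prints the nvcc location if the CUDA toolkit is installed and on PATH
+command -v nvcc || echo "nvcc not found - install the CUDA toolkit first"
+```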
@@ -210,7 +210,7 @@ At this point, you have verified that: Your DGX Spark environment is now fully prepared for the next section, where you will build and configure both CPU and GPU versions of llama.cpp, laying the foundation for running quantized LLMs efficiently on the Grace Blackwell platform. -## What have I achieved? +## What you have accomplished In this entire setup section, you have achieved the following: diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md index ecec7b2ac..e508b8c18 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md @@ -125,7 +125,7 @@ Together, these options ensure that the build targets the Grace Blackwell GPU wi ### Step 4: Validate the CUDA-enabled build -After the build completes successfully, verify that the GPU-enabled binary of *llama.cpp is correctly linked to the NVIDIA CUDA runtime. +After the build completes successfully, verify that the GPU-enabled binary of llama.cpp is correctly linked to the NVIDIA CUDA runtime. To verify CUDA linkage, run the following command: diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md index fedd1264a..a889e60f0 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md @@ -9,7 +9,7 @@ In this section, you'll build and test the CPU-only version of llama.cpp, optimi The Grace CPU features Arm Cortex-X925 and Cortex-A725 cores with advanced vector extensions including SVE2, BFloat16, and I8MM. These extensions make the CPU highly efficient for quantized inference workloads, even without GPU acceleration. -### Configure and build the CPU-only version +## Configure and build the CPU-only version In this session, you will configure and build the CPU-only version of llama.cpp, optimized for the Armv9-based Grace CPU. @@ -83,8 +83,7 @@ The build output is shown below: ``` After the build finishes, you'll find the CPU-optimized binaries at `~/llama.cpp/build-cpu/bin/` - -### Step 2: Validate the CPU-Enabled Build (CPU Mode) +## Validate the CPU-enabled build (CPU mode) First, validate that the binary was compiled in CPU-only mode and runs correctly on the Grace CPU: @@ -116,7 +115,7 @@ Here is an explanation of the key flags: - `-ngl 0` disables GPU offloading (CPU-only execution) - `-t 20` uses 20 threads (1 per Grace CPU core) -If the build is successful, you will observe smooth model initialization and token generation, with CPU utilization increasing across all cores. +If the build is successful, you will see smooth model initialization and token generation, with CPU utilization increasing across all cores. To monitor live CPU utilization and power metrics during inference, use `htop`: @@ -125,7 +124,7 @@ htop ``` The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement. 
-![image2 htop screenshot](htop.png "TinyLlama CPU Utilization") +![htop display showing 20 Grace CPU cores at 75-85% utilization during TinyLlama inference with OpenMP threading alt-text#center](htop.png)](htop.png "TinyLlama CPU Utilization") The `htop` interface shows: @@ -138,7 +137,7 @@ The `htop` interface shows: In htop, press F6 to sort by CPU% and verify load distribution, or press `t` to toggle the tree view, which shows the `llama-cli` main process and its worker threads. {{% /notice %}} -## What have I accomplished? +## What you have accomplished In this section you have: - Built and validated the CPU-only version of llama.cpp. From 68dca01b590a06103810f77e9605defd07e93994 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Wed, 5 Nov 2025 06:29:17 +0000 Subject: [PATCH 09/12] Fixed typo in DGX Spark LLaMA.cpp GPU guide --- .../dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md index e508b8c18..ccafc38a8 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md @@ -39,7 +39,7 @@ hf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF --local-dir TinyLlama-1.1B {{% notice Note %}} After the download completes, you'll find the models in the `~/models` directory. -{{% /notice Note %}} +{{% /notice %}} Great! You’ve installed all the required build tools and downloaded a quantized model for validation. Your environment is ready for source code setup. From 2f6382d1e11fb64f5e1b1239006269afb2c1b123 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Wed, 5 Nov 2025 06:31:11 +0000 Subject: [PATCH 10/12] Fixed typo# Please enter the commit message for your changes. Lines starting --- .../dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md index ccafc38a8..db11e04aa 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md @@ -95,7 +95,7 @@ When the configuration process completes successfully, the terminal should displ {{% notice Note %}} - For systems with multiple CUDA versions installed, explicitly specifying the compilers (`-DCMAKE_C_COMPILER`, `-DCMAKE_CXX_COMPILER`, `-DCMAKE_CUDA_COMPILER`) ensures that CMake uses the correct CUDA 13.0 toolchain. 
-- If you encounter configuration errors, return to the previous section and confirm that your CUDA toolkit and driver versions are correctly installed and compatible with Blackwell (sm_121).{{% /notice Note %}} +- If you encounter configuration errors, return to the previous section and confirm that your CUDA toolkit and driver versions are correctly installed and compatible with Blackwell (sm_121).{{% /notice %}} Once CMake configuration succeeds, start the compilation process: From c8a2a36698500bf5042d94cecd634d3935b482f1 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Wed, 5 Nov 2025 07:28:57 +0000 Subject: [PATCH 11/12] Update prerequisites in DGX Spark LLaMA.cpp guide to include additional machine learning concepts --- .../laptops-and-desktops/dgx_spark_llamacpp/_index.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md index cdf851714..d734675d5 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md @@ -13,8 +13,10 @@ learning_objectives: prerequisites: - Access to an NVIDIA DGX Spark system with at least 15 GB of available disk space - - Basic understanding of machine learning concepts - Familiarity with command-line interfaces and basic Linux operations + - understanding of CUDA programming basics and GPU/CPU compute concepts + - Basic knowledge of quantized large language models (LLMs) and machine learning inference + - Experience building software from source using CMake and Make author: Odin Shen From c0f4e1a555d0cc6b899d6508412964373df08a22 Mon Sep 17 00:00:00 2001 From: Maddy Underwood <167196745+madeline-underwood@users.noreply.github.com> Date: Wed, 5 Nov 2025 11:17:29 +0000 Subject: [PATCH 12/12] Final tweaks --- .../dgx_spark_llamacpp/1_gb10_introduction.md | 20 +++++------ .../dgx_spark_llamacpp/1a_gb10_setup.md | 20 ++++++----- .../dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md | 33 +++++++++---------- .../dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md | 11 ++----- .../dgx_spark_llamacpp/4_gb10_processwatch.md | 22 +++++-------- .../dgx_spark_llamacpp/_index.md | 6 ++-- 6 files changed, 51 insertions(+), 61 deletions(-) diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md index 6b46abb1b..e3902ca04 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md @@ -1,5 +1,5 @@ --- -title: Discover the Grace Blackwell architecture +title: Explore Grace Blackwell architecture for efficient quantized LLM inference weight: 2 ### FIXED, DO NOT MODIFY @@ -8,24 +8,24 @@ layout: learningpathall ## Overview -Explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU-GPU hybrid for large-scale AI workloads. - -The NVIDIA DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine. 
+In this Learning Path you will explore the architecture and system design of NVIDIA DGX Spark, a next-generation Arm-based CPU-GPU hybrid for large-scale AI workloads. The NVIDIA DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine. The GB10 platform combines: - The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency - The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads - A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks -This GB10 platform design delivers up to 1 petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform for modern AI workloads, bringing powerful AI development capabilities to your desktop and letting you build and test AI models locally before scaling them to larger systems. +This GB10 platform design delivers up to 1 petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform that lets you build and test AI models locally before scaling them to larger systems. + +You can find out more about Nvidia DGX Spark on the [NVIDIA website](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/). -## Benefits of Grace Blackwell for quantized LLMs +## Benefits of Grace Blackwell for quantized LLM inference -Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip. +Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip which brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference. -The Grace Blackwell architecture brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference. +On Arm-based systems, quantized LLM inference is especially efficient because the Grace CPU delivers high single-thread performance and energy efficiency, while the Blackwell GPU accelerates matrix operations using Arm-optimized CUDA libraries. The unified memory architecture means you don't need to manually manage data movement between CPU and GPU, which is a common challenge on traditional x86-based platforms. This is particularly valuable when working with large models or running multiple inference tasks in parallel, as it reduces latency and simplifies development. 
-### Grace Blackwell features and their impact on quantized LLMs +## Grace Blackwell features and their impact on quantized LLMs The table below shows how specific hardware features enable efficient quantized model inference: @@ -37,7 +37,7 @@ The table below shows how specific hardware features enable efficient quantized | Unified 128 GB memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer | | Energy-efficient Arm design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads | -### Quantized LLM workflow +## Overview of a typical quantized LLM workflow In a typical quantized LLM workflow: - The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md index 14671484b..337ada253 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md @@ -6,13 +6,13 @@ weight: 3 layout: learningpathall --- -## Overview +## Set up your Grace Blackwell environment Before building and running quantized LLMs on your DGX Spark, you need to verify that your system is fully prepared for AI workloads. This includes checking your Arm-based Grace CPU configuration, confirming your operating system, ensuring the Blackwell GPU and CUDA drivers are active, and validating that the CUDA toolkit is installed. These steps provide a solid foundation for efficient LLM inference and development on Arm. This section is organized into three main steps: verifying your CPU, checking your operating system, and confirming your GPU and CUDA toolkit setup. You'll also find additional context and technical details throughout, should you wish to explore the platform's capabilities more deeply. -## Step 1: check your CPU configuration +## Step 1: Verify your CPU configuration Before running LLM workloads, it's helpful to understand more about the CPU you're working with. The DGX Spark uses Arm-based Grace processors, which bring some unique advantages for AI inference. @@ -82,13 +82,13 @@ Vulnerabilities: Tsx async abort: Not affected ``` -Great! If you have seen this message your system is using Armv9 cores, which are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations. +If you have seen this message your system is using Armv9 cores, great! These are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations. 
### Grace CPU specification The following table provides more information about the key specifications of the Grace CPU and explains their relevance to quantized LLM inference: -| **Category** | **Specification** | **Description / Impact for LLM Inference** | +| **Category** | **Specification** | **Description/Impact for LLM Inference** | |---------------|-------------------|---------------------------------------------| | Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions| | Core Configuration | 20 cores total - 10× Cortex-X925 (performance) + 10× Cortex-A725 (efficiency) | Heterogeneous CPU design balancing high performance and power efficiency | @@ -101,6 +101,8 @@ The following table provides more information about the key specifications of th Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities are what make it ideal for quantized LLM workloads, as these provide a power-efficient foundation for both CPU-only inference and CPU-GPU hybrid processing. +### Verify OS + You can also verify the operating system running on your DGX Spark by using the following command: ```bash @@ -120,7 +122,7 @@ This shows you that your DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendl Nice work! You've confirmed your operating system is Ubuntu 24.04 LTS, so you can move on to the next step. -## Step 2: verify the Blackwell GPU and driver +## Step 2: Verify the Blackwell GPU and driver After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads by using the following: @@ -174,7 +176,7 @@ The table below provides more explanation of the `nvidia-smi` output: Excellent! Your Blackwell GPU is recognized and ready for CUDA workloads. This means your system is set up for GPU-accelerated LLM inference. -## Step 3: check the CUDA toolkit +## Step 3: Check the CUDA toolkit To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed. @@ -214,8 +216,8 @@ Your DGX Spark environment is now fully prepared for the next section, where yo In this entire setup section, you have achieved the following: -- Verified your Arm-based Grace CPU and its capabilities—you've confirmed that your system is running Armv9 cores with SVE2, BF16, and INT8 matrix multiplication support, which are perfect for quantized LLM inference -- Confirmed your Blackwell GPU and CUDA driver are ready—the GB10 GPU is active, properly recognized, and set up with CUDA 13, so you're all set for GPU-accelerated workloads -- Checked your operating system and CUDA toolkit—Ubuntu 24.04 LTS provides a solid foundation, and the CUDA compiler is installed and ready for building GPU-enabled inference tools +- Verified your Arm-based Grace CPU and its capabilities by confirming that your system is running Armv9 cores with SVE2, BF16, and INT8 matrix multiplication support, which are perfect for quantized LLM inference +- Confirmed your Blackwell GPU and CUDA driver are ready by seeing that the GB10 GPU is active, properly recognized, and set up with CUDA 13, so you're all set for GPU-accelerated workloads +- Checked your operating system and CUDA toolkit - Ubuntu 24.04 LTS provides a solid foundation, and the CUDA compiler is installed and ready for building GPU-enabled inference tools You're now ready to move on to building and running quantized LLMs on your DGX Spark. 
The next section walks you through compiling llama.cpp for both CPU and GPU, so you can start running AI inference on this platform. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md index db11e04aa..41b854a08 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md @@ -6,13 +6,9 @@ layout: "learningpathall" ## How do I build the GPU version of llama.cpp on GB10? -In the previous section, you verified that your DGX Spark system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment. +In the previous section, you verified that your DGX Spark system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment. Now that your hardware and drivers are ready, this section focuses on building the GPU-enabled version of llama.cpp, which is a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs. Llama.cpp is an open-source project by Georgi Gerganov that provides efficient and dependency-free large language model inference on both CPUs and GPUs. -Now that your hardware and drivers are ready, this section focuses on building the GPU-enabled version of llama.cpp, which is a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs. - -llama.cpp is an open-source project by Georgi Gerganov that provides efficient and dependency-free large language model inference on both CPUs and GPUs. - -### Step 1: Preparation +## Step 1: Install dependencies In this step, you will install the necessary build tools and download a small quantized model for validation: @@ -23,7 +19,7 @@ sudo apt install -y git cmake build-essential nvtop htop These packages provide the C/C++ compiler toolchain, CMake build system, and GPU monitoring utility (nvtop) required to compile and test llama.cpp. -### Download a test model +## Download a test model To test your GPU build, you'll need a quantized model. In this section, you'll download a lightweight model that's perfect for validation. @@ -33,17 +29,20 @@ First, ensure that you have the latest Hugging Face Hub CLI installed and downlo mkdir ~/models cd ~/models python3 -m venv venv +source venv/bin/activate pip install -U huggingface_hub hf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF --local-dir TinyLlama-1.1B ``` {{% notice Note %}} After the download completes, you'll find the models in the `~/models` directory. + +**Tip:** Always activate your Python virtual environment with `source venv/bin/activate` before installing packages or running Python-based tools. This ensures dependencies are isolated and prevents conflicts with system-wide packages. {{% /notice %}} Great! You’ve installed all the required build tools and downloaded a quantized model for validation. Your environment is ready for source code setup. -### Step 2: Clone the llama.cpp repository +## Step 2: Clone the llama.cpp repository Use the commands below to download the source code for llama.cpp from GitHub: @@ -55,7 +54,7 @@ cd ~/llama.cpp Nice work! You now have the latest llama.cpp source code on your DGX Spark system. 
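+Before configuring the build, it is worth confirming that both the source tree and the test model are where the later commands expect them. A quick sanity check (the paths assume the default locations used above):
+
+```bash
+# Confirm the llama.cpp checkout and the downloaded GGUF model are in place
+ls ~/llama.cpp
+ls -lh ~/models/TinyLlama-1.1B/*.gguf
+```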
-### Step 3: Configure and Build the CUDA-Enabled Version (GPU Mode) +## Step 3: Configure and build the CUDA-enabled version (GPU Mode) Run the following `cmake` command to configure the build system for GPU acceleration: @@ -73,13 +72,13 @@ cmake .. \ -DCMAKE_CUDA_COMPILER=nvcc ``` -This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels +This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels. ### Explanation of key flags: Here's what each configuration flag does: -| **Feature** | **Description / Impact** | +| **Feature** | **Description/Impact** | |--------------|------------------------------| | -DGGML_CUDA=ON | Enables the CUDA backend in llama.cpp, allowing matrix operations and transformer layers to be offloaded to the GPU for acceleration| | -DGGML_CUDA_F16=ON | Enables FP16 (half-precision) CUDA kernels, reducing memory usage and increasing throughput — especially effective for quantized models (for example, Q4, Q5) | @@ -119,11 +118,9 @@ After the build completes, you'll find the GPU-accelerated binaries located unde These binaries provide all necessary tools for quantized model inference (llama-cli) and for serving GPU inference using HTTP API (llama-server). -Excellent! The CUDA-enabled build is complete. Your binaries are optimized for the Blackwell GPU and ready for validation. You are now ready to test quantized LLMs with full GPU acceleration in the next step. - -Together, these options ensure that the build targets the Grace Blackwell GPU with full CUDA 13 compatibility. +Excellent! The CUDA-enabled build is complete. Your binaries are optimized for the Blackwell GPU and ready for validation. Together, these options ensure that the build targets the Grace Blackwell GPU with full CUDA 13 compatibility. You are now ready to test quantized LLMs with full GPU acceleration in the next step. -### Step 4: Validate the CUDA-enabled build +## Step 4: Validate the CUDA-enabled build After the build completes successfully, verify that the GPU-enabled binary of llama.cpp is correctly linked to the NVIDIA CUDA runtime. @@ -184,7 +181,7 @@ nvtop This command displays GPU utilization, memory usage, temperature, and power consumption. You can use this to verify that CUDA kernels are active during model inference. -The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark. +The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark: ![nvtop terminal interface displaying real-time GPU metrics, including GPU utilization, memory usage, temperature, power consumption, and active processes for the NVIDIA GB10 GPU during model inference on DGX Spark. alt-text#center](nvtop.png "TinyLlama GPU Utilization") @@ -196,7 +193,7 @@ The nvtop interface shows: You have now successfully built and validated the CUDA-enabled version of llama.cpp on DGX Spark. -## What have I achieved? +## What you have accomplished You have: - Installed all required tools and dependencies @@ -204,4 +201,4 @@ You have: - Built the CUDA-enabled version of llama.cpp - Verified GPU linkage and successful inference -You’re ready to move on to building and testing the CPU-only version! You will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference. +You’re ready to move on to building and testing the CPU-only version. 
You will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md index a889e60f0..b7c87b35e 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md @@ -11,9 +11,6 @@ The Grace CPU features Arm Cortex-X925 and Cortex-A725 cores with advanced vecto ## Configure and build the CPU-only version - -In this session, you will configure and build the CPU-only version of llama.cpp, optimized for the Armv9-based Grace CPU. - This build runs entirely on the Grace CPU (Arm Cortex-X925 and Cortex-A725), which supports advanced Armv9 vector extensions including SVE2, BFloat16, and I8MM, making it highly efficient for quantized inference workloads even without GPU acceleration. To ensure a clean separation from the GPU build artifacts, start from a clean directory. @@ -123,8 +120,8 @@ To monitor live CPU utilization and power metrics during inference, use `htop`: htop ``` -The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement. -![htop display showing 20 Grace CPU cores at 75-85% utilization during TinyLlama inference with OpenMP threading alt-text#center](htop.png)](htop.png "TinyLlama CPU Utilization") +The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement: +![htop display showing 20 Grace CPU cores at 75-85% utilization during TinyLlama inference with OpenMP threading alt-text#center](htop.png "TinyLlama CPU utilization") The `htop` interface shows: @@ -145,6 +142,4 @@ In this section you have: - Tested quantized model inference using the TinyLlama Q8_0 model. - Used monitoring tools (htop) to confirm efficient CPU utilization. -You have now successfully built and validated the CPU-only version of llama.cpp on the Grace CPU. - -In the next section, you will learn how to use the Process Watch tool to visualize instruction-level execution and better understand how Armv9 vectorization (SVE2 and NEON) accelerates quantized LLM inference on the Grace CPU. +You have now successfully built and validated the CPU-only version of llama.cpp on the Grace CPU. In the next section, you will learn how to use the Process Watch tool to visualize instruction-level execution and better understand how Armv9 vectorization (SVE2 and NEON) accelerates quantized LLM inference on the Grace CPU. diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md index 69d577521..f27182b42 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md @@ -10,7 +10,7 @@ In this section, you'll explore how the Grace CPU executes Armv9 vector instruct Process Watch helps you observe Neon SIMD instruction execution on the Grace CPU and understand why SVE and SVE2 remain inactive under the current kernel configuration. 
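Before attaching Process Watch, it can be useful to confirm which Armv9 vector capabilities the Grace CPU actually advertises to user space, so you can compare them with the instruction mix observed at runtime. The snippet below is a small sketch; the exact set of flags reported depends on your kernel version.

```bash
# Show the SIMD-related feature flags the kernel exposes for the Grace CPU
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep -E '^(asimd|sve|sve2|i8mm|bf16)$'

# Show the default SVE vector length (in bytes) inherited by new processes
cat /proc/sys/abi/sve_default_vector_length
```

Seeing `sve` and `sve2` listed confirms that the hardware capability is present even when runtime execution is dominated by NEON; Process Watch then shows which of these capabilities quantized inference actually exercises.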
This demonstrates how Armv9 vector execution works in AI workloads and shows the evolution from traditional SIMD pipelines to scalable vector computation. -### Install and configure Process Watch +## Install and configure Process Watch First, install the required packages: @@ -79,11 +79,7 @@ cd ~/llama.cpp/build-cpu/bin -p "Explain the benefits of vector processing in modern Arm CPUs." ``` -Keep this terminal running while the model generates text output. - -You can now attach Process Watch to this active process. - -Once the llama.cpp process is running on the Grace CPU, attach Process Watch to observe its live instruction activity. +Keep this terminal running while the model generates text output. You can now attach Process Watch to this active process. Once the llama.cpp process is running on the Grace CPU, attach Process Watch to observe its live instruction activity. If only one `llama-cli` process is running, you can directly launch Process Watch without manually checking its PID: @@ -160,13 +156,13 @@ ALL ALL 2.52 8.37 0.00 0.00 100.00 26566 ``` Here is an interpretation of the values: -- NEON (≈ 7–15 %) : Active SIMD integer and floating-point operations. -- FPARMv8 : Scalar FP operations such as activation and normalization. -- SVE/SVE2 = 0 : The kernel does not issue SVE instructions. +- NEON (≈ 7–15 %) : Active SIMD integer and floating-point operations +- FPARMv8 : Scalar FP operations such as activation and normalization +- SVE/SVE2 = 0 : The kernel does not issue SVE instructions This confirms that the Grace CPU performs quantized inference primarily using NEON. -### Why are SVE and SVE2 inactive? +## Why are SVE and SVE2 inactive? Although the Grace CPU supports SVE and SVE2, the vector length is 16 bytes (128-bit). @@ -191,10 +187,10 @@ echo 256 | sudo tee /proc/sys/abi/sve_default_vector_length This behavior is expected because SVE is available but fixed at 128 bits. {{% notice Note %}} -Future kernel updates may introduce SVE2 instructions. +Future kernel updates might introduce SVE2 instructions. {{% /notice %}} -## What you've achieved and what's next +## What you've accomplished and what's next You have completed the Learning Path for analyzing large language model inference on the DGX Spark platform with Arm-based Grace CPUs and Blackwell GPUs. @@ -213,5 +209,5 @@ By completing these steps, you are now equipped to: - Identify performance bottlenecks and opportunities for acceleration on both CPU and GPU - Prepare for future enhancements in Armv9 vector processing and software support - Confidently deploy and monitor AI inference on modern Arm server platforms -For additional learning, see the resources in the "Further Reading" section. You can continue experimenting with different models and monitoring tools as new kernel updates become available. +For additional learning, see the resources in the Further Reading section. You can continue experimenting with different models and monitoring tools as new kernel updates become available. 
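If you would like to put numbers behind the CPU and GPU comparison as you continue experimenting, both builds include the `llama-bench` benchmarking tool. The example below is a sketch under assumptions: the build directory names, model filename, and thread count are placeholders drawn from earlier steps, so substitute the values that match your system.

```bash
# Benchmark the quantized model on the CUDA-enabled build (GPU offload)
~/llama.cpp/build/bin/llama-bench \
  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf -ngl 99

# Benchmark the same model on the CPU-only build using the Grace CPU cores
~/llama.cpp/build-cpu/bin/llama-bench \
  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf -ngl 0 -t 20
```

Each run reports prompt-processing and token-generation throughput in tokens per second, which makes it straightforward to compare NEON-based CPU inference against GPU-offloaded inference on the same quantized model.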
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md index d734675d5..e2a8df1cf 100644 --- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md +++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md @@ -1,5 +1,5 @@ --- -title: Unlock efficient quantized LLMs on Arm-based NVIDIA DGX Spark using Armv9 SIMD instructions +title: Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark using Armv9 SIMD instructions minutes_to_complete: 60 @@ -14,9 +14,9 @@ learning_objectives: prerequisites: - Access to an NVIDIA DGX Spark system with at least 15 GB of available disk space - Familiarity with command-line interfaces and basic Linux operations - - understanding of CUDA programming basics and GPU/CPU compute concepts + - Understanding of CUDA programming basics and GPU/CPU compute concepts - Basic knowledge of quantized large language models (LLMs) and machine learning inference - - Experience building software from source using CMake and Make + - Experience building software from source using CMake and make author: Odin Shen