diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md
index 63dbeaa6a..e3902ca04 100644
--- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md
+++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md
@@ -1,241 +1,49 @@
---
-title: Verify Grace Blackwell system readiness for AI inference
+title: Explore Grace Blackwell architecture for efficient quantized LLM inference
weight: 2
### FIXED, DO NOT MODIFY
layout: learningpathall
---
-## Introduction to Grace Blackwell architecture
+## Overview
-In this session, you will explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU–GPU hybrid for large-scale AI workloads.
+In this Learning Path, you will explore the architecture and system design of the NVIDIA DGX Spark, a next-generation Arm-based CPU-GPU hybrid platform for large-scale AI workloads. DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop. At its core, the NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.
-You will also perform hands-on verification steps to ensure your DGX Spark environment is properly configured for subsequent GPU-accelerated LLM sessions.
+The GB10 platform combines:
+- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency
+- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads
+- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks
-The NVIDIA DGX Spark is a personal AI supercomputer that brings data center–class AI computing directly to the developer desktop.
-The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.
+This design delivers up to 1 petaFLOP (1,000 TFLOPS) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform that lets you build and test AI models locally before scaling them to larger systems.
-The NVIDIA Grace Blackwell DGX Spark (GB10) platform combines:
-- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency.
+You can find out more about NVIDIA DGX Spark on the [NVIDIA website](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/).
-- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads.
-- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks.
+## Benefits of Grace Blackwell for quantized LLM inference
-This design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision.
-DGX Spark is a compact yet powerful development platform for modern AI workloads.
+Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip, which brings several key advantages to these workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference.
-DGX Spark represents a major step toward NVIDIA’s vision of AI Everywhere, empowering developers to prototype, fine-tune, and deploy large-scale AI models locally while seamlessly connecting to cloud or data-center environments when needed.
+On Arm-based systems, quantized LLM inference is especially efficient because the Grace CPU delivers high single-thread performance and energy efficiency, while the Blackwell GPU accelerates matrix operations using Arm-optimized CUDA libraries. The unified memory architecture means you don't need to manually manage data movement between CPU and GPU, which is a common challenge on traditional x86-based platforms. This is particularly valuable when working with large models or running multiple inference tasks in parallel, as it reduces latency and simplifies development.
-### Why Grace Blackwell for quantized LLMs?
+## Grace Blackwell features and their impact on quantized LLMs
-Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip.
+The table below shows how specific hardware features enable efficient quantized model inference:
-| **Feature** | **Impact on Quantized LLMs** |
+| **Feature** | **Impact on quantized LLMs** |
|--------------|------------------------------|
-| Grace CPU (Arm Cortex-X925 / A725) | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle). |
-| Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores) | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers. |
-| High Bandwidth + Low Latency | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU–GPU workloads. |
-| Unified 128 GB Memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer. |
-| Energy-Efficient Arm Design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads. |
+| Grace CPU (Arm Cortex-X925 / A725) | Handles token orchestration, memory paging, and lightweight inference efficiently with high instructions per cycle (IPC) |
+| Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores) | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers |
+| High bandwidth and low latency | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU-GPU workloads |
+| Unified 128 GB memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer |
+| Energy-efficient Arm design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads |
+## Overview of a typical quantized LLM workflow
In a typical quantized LLM workflow:
-- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks.
-- The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput.
-- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space — reducing copy overhead and enabling near-real-time inference.
+- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks
+- The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput
+- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space, reducing copy overhead and enabling near-real-time inference
Together, these features make the GB10 a developer-grade AI laboratory for running, profiling, and scaling quantized LLMs efficiently in a desktop form factor.
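+
+To make this division of labor more concrete, the sketch below shows how it maps onto llama.cpp runtime flags. Treat it as a preview only: the binary location, model file, and flag values are assumptions based on the build and download steps covered later in this Learning Path.
+
+```bash
+# Preview (hypothetical paths): offload the transformer layers to the Blackwell GPU (-ngl),
+# while the Grace CPU handles tokenization and orchestration across 10 threads (-t)
+./bin/llama-cli \
+  -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf \
+  -ngl 99 \
+  -t 10 \
+  -p "Summarize the benefits of unified CPU-GPU memory."
+```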
-### Inspecting your GB10 environment
-
-Let's verify that your DGX Spark system is configured and ready for building and running quantized LLMs.
-
-#### Step 1: Check CPU information
-
-Run the following command to print the CPU information:
-
-```bash
-lscpu
-```
-
-Expected output:
-
-```output
-Architecture: aarch64
- CPU op-mode(s): 64-bit
- Byte Order: Little Endian
-CPU(s): 20
- On-line CPU(s) list: 0-19
-Vendor ID: ARM
- Model name: Cortex-X925
- Model: 1
- Thread(s) per core: 1
- Core(s) per socket: 10
- Socket(s): 1
- Stepping: r0p1
- CPU(s) scaling MHz: 89%
- CPU max MHz: 4004.0000
- CPU min MHz: 1378.0000
- BogoMIPS: 2000.00
- Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 as
- imddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 f
- lagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
- Model name: Cortex-A725
- Model: 1
- Thread(s) per core: 1
- Core(s) per socket: 10
- Socket(s): 1
- Stepping: r0p1
- CPU(s) scaling MHz: 99%
- CPU max MHz: 2860.0000
- CPU min MHz: 338.0000
- BogoMIPS: 2000.00
- Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 as
- imddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 f
- lagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
-Caches (sum of all):
- L1d: 1.3 MiB (20 instances)
- L1i: 1.3 MiB (20 instances)
- L2: 25 MiB (20 instances)
- L3: 24 MiB (2 instances)
-NUMA:
- NUMA node(s): 1
- NUMA node0 CPU(s): 0-19
-Vulnerabilities:
- Gather data sampling: Not affected
- Itlb multihit: Not affected
- L1tf: Not affected
- Mds: Not affected
- Meltdown: Not affected
- Mmio stale data: Not affected
- Reg file data sampling: Not affected
- Retbleed: Not affected
- Spec rstack overflow: Not affected
- Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
- Spectre v1: Mitigation; __user pointer sanitization
- Spectre v2: Not affected
- Srbds: Not affected
- Tsx async abort: Not affected
-```
-
-The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations.
-
-The following table summarizes the key specifications of the Grace CPU and explains their relevance to quantized LLM inference.
-
-| **Category** | **Specification** | **Description / Impact for LLM Inference** |
-|---------------|-------------------|---------------------------------------------|
-| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. |
-| Core Configuration | 20 cores total — 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. |
-| Threads per Core | 1 | Optimized for deterministic scheduling and predictable latency. |
-| Clock Frequency | Up to **4.0 GHz** (Cortex-X925)
Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. |
-| Cache Hierarchy | L1: 1.3 MiB × 20
L2: 25 MiB × 20
L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. |
-| Instruction Set Features** | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. |
-| NUMA Topology | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads. |
-| Security & Reliability | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. |
-
-Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities make it ideal for quantized LLM workloads, providing power-efficient foundation for both CPU-only inference and CPU–GPU hybrid processing.
-
-You can also verify the operating system running on your DGX Spark by using the following command:
-
-```bash
-lsb_release -a
-```
-
-Expected output:
-
-```log
-No LSB modules are available.
-Distributor ID: Ubuntu
-Description: Ubuntu 24.04.3 LTS
-Release: 24.04
-Codename: noble
-```
-As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendly Linux distribution.
-It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities—making it an ideal environment for building and deploying quantized LLM workloads.
-
-#### Step 2: Verify Blackwell GPU and driver
-
-After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads.
-
-```bash
-nvidia-smi
-```
-
-You will see output similar to:
-
-```output
-Wed Oct 22 09:26:54 2025
-+-----------------------------------------------------------------------------------------+
-| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
-+-----------------------------------------+------------------------+----------------------+
-| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
-| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
-| | | MIG M. |
-|=========================================+========================+======================|
-| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A |
-| N/A 32C P8 4W / N/A | Not Supported | 0% Default |
-| | | N/A |
-+-----------------------------------------+------------------------+----------------------+
-
-+-----------------------------------------------------------------------------------------+
-| Processes: |
-| GPU GI CI PID Type Process name GPU Memory |
-| ID ID Usage |
-|=========================================================================================|
-| 0 N/A N/A 3094 G /usr/lib/xorg/Xorg 43MiB |
-| 0 N/A N/A 3172 G /usr/bin/gnome-shell 16MiB |
-+-----------------------------------------------------------------------------------------+
-```
-
-The `nvidia-smi` tool reports GPU hardware specifications and provides valuable runtime information, including driver status, temperature, power usage, and GPU utilization. This information helps verify that the system is ready for AI workloads.
-
-The table below provides more explanation of the `nvidia-smi` output:
-
-| **Category** | **Specification (from nvidia-smi)** | **Description / Impact for LLM Inference** |
-|---------------|--------------------------------------|---------------------------------------------|
-| GPU Name** | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace–Blackwell Superchip. |
-| Driver Version | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. |
-| CUDA Version | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. |
-| Architecture / Compute Capability | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. |
-| Memory | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space. |
-| Power & Thermal Status | ~4W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle. |
-| GPU-Utilization | 0% (Idle) | Indicates no active compute workloads; GPU is ready for new inference jobs. |
-| Memory Usage | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. |
-| Persistence Mode | On | Ensures the GPU remains initialized and ready for rapid inference startup. |
-
-
-#### Step 3: Check CUDA Toolkit
-
-To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed.
-
-The `nvcc --version` command confirms that the CUDA compiler is available and compatible with CUDA 13.
-This ensures that CMake can correctly detect and compile the GPU-accelerated components.
-
-```bash
-nvcc --version
-```
-
-You will see output similar to:
-
-```output
-nvcc: NVIDIA (R) Cuda compiler driver
-Copyright (c) 2005-2025 NVIDIA Corporation
-Built on Wed_Aug_20_01:57:39_PM_PDT_2025
-Cuda compilation tools, release 13.0, V13.0.88
-Build cuda_13.0.r13.0/compiler.36424714_0
-```
-
-{{% notice Note %}}
-The nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference.
-{{% /notice %}}
-
-This confirms that the CUDA 13 toolkit is installed and ready for GPU compilation.
-If the command is missing or reports an older version (e.g., 12.x), you should update to CUDA 13.0 or later to ensure compatibility with the Blackwell GPU (sm_121).
-
-At this point, you have verified that:
-- The Grace CPU (Arm Cortex-X925 / A725) is correctly recognized and supports Armv9 extensions.
-- The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime.
-- The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp.
-
-Your DGX Spark environment is now fully prepared for the next section, where you will build and configure both CPU and GPU versions of llama.cpp, laying the foundation for running quantized LLMs efficiently on the Grace Blackwell platform.
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md
new file mode 100644
index 000000000..337ada253
--- /dev/null
+++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md
@@ -0,0 +1,223 @@
+---
+title: Verify your Grace Blackwell system readiness for AI inference
+weight: 3
+
+### FIXED, DO NOT MODIFY
+layout: learningpathall
+---
+
+## Set up your Grace Blackwell environment
+
+Before building and running quantized LLMs on your DGX Spark, you need to verify that your system is fully prepared for AI workloads. This includes checking your Arm-based Grace CPU configuration, confirming your operating system, ensuring the Blackwell GPU and CUDA drivers are active, and validating that the CUDA toolkit is installed. These steps provide a solid foundation for efficient LLM inference and development on Arm.
+
+This section is organized into three main steps: verifying your CPU and operating system, confirming that the Blackwell GPU and driver are active, and checking the CUDA toolkit. You'll also find additional context and technical details throughout, should you wish to explore the platform's capabilities more deeply.
+
+## Step 1: Verify your CPU configuration
+
+Before running LLM workloads, it's helpful to understand more about the CPU you're working with. The DGX Spark uses Arm-based Grace processors, which bring some unique advantages for AI inference.
+
+Start by checking your system's CPU configuration:
+
+```bash
+lscpu
+```
+
+The output is similar to:
+
+```output
+Architecture: aarch64
+ CPU op-mode(s): 64-bit
+ Byte Order: Little Endian
+CPU(s): 20
+ On-line CPU(s) list: 0-19
+Vendor ID: ARM
+ Model name: Cortex-X925
+ Model: 1
+ Thread(s) per core: 1
+ Core(s) per socket: 10
+ Socket(s): 1
+ Stepping: r0p1
+ CPU(s) scaling MHz: 89%
+ CPU max MHz: 4004.0000
+ CPU min MHz: 1378.0000
+ BogoMIPS: 2000.00
+ Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 as
+ imddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 f
+ lagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
+ Model name: Cortex-A725
+ Model: 1
+ Thread(s) per core: 1
+ Core(s) per socket: 10
+ Socket(s): 1
+ Stepping: r0p1
+ CPU(s) scaling MHz: 99%
+ CPU max MHz: 2860.0000
+ CPU min MHz: 338.0000
+ BogoMIPS: 2000.00
+ Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 as
+ imddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 f
+ lagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
+Caches (sum of all):
+ L1d: 1.3 MiB (20 instances)
+ L1i: 1.3 MiB (20 instances)
+ L2: 25 MiB (20 instances)
+ L3: 24 MiB (2 instances)
+NUMA:
+ NUMA node(s): 1
+ NUMA node0 CPU(s): 0-19
+Vulnerabilities:
+ Gather data sampling: Not affected
+ Itlb multihit: Not affected
+ L1tf: Not affected
+ Mds: Not affected
+ Meltdown: Not affected
+ Mmio stale data: Not affected
+ Reg file data sampling: Not affected
+ Retbleed: Not affected
+ Spec rstack overflow: Not affected
+ Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
+ Spectre v1: Mitigation; __user pointer sanitization
+ Spectre v2: Not affected
+ Srbds: Not affected
+ Tsx async abort: Not affected
+```
+
+If your output looks like the example above, your system is running Armv9 cores, which are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it well suited to quantized LLM inference and tensor operations.
+
+### Grace CPU specification
+
+The following table provides more information about the key specifications of the Grace CPU and explains their relevance to quantized LLM inference:
+
+| **Category** | **Specification** | **Description/Impact for LLM Inference** |
+|---------------|-------------------|---------------------------------------------|
+| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions |
+| Core configuration | 20 cores total: 10× Cortex-X925 (performance) + 10× Cortex-A725 (efficiency) | Heterogeneous CPU design balancing high performance and power efficiency |
+| Threads per core | 1 | Optimized for deterministic scheduling and predictable latency |
+| Clock frequency | Up to **4.0 GHz** (Cortex-X925), up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration |
+| Cache hierarchy | L1: 1.3 MiB × 20, L2: 25 MiB × 20, L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads |
+| Instruction set features | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations |
+| NUMA topology | Single NUMA node (node0: 0–19) | Simplifies memory access pattern for unified memory workloads |
+| Security and reliability | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks |
+
+Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities are what make it ideal for quantized LLM workloads, as these provide a power-efficient foundation for both CPU-only inference and CPU-GPU hybrid processing.
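+
+If you want to confirm that these specific features are present without scanning the full `lscpu` output, you can filter the CPU flags with standard Linux tools:
+
+```bash
+# List only the vector and matrix-math related CPU flags relevant to quantized inference
+lscpu | grep -o -w -E 'sve|sve2|i8mm|bf16|asimddp' | sort -u
+```
+
+Each feature name printed by this command corresponds to a flag shown in the earlier output; if one is missing, the build flags used later in this Learning Path might need adjusting.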
+
+### Verify the operating system
+
+You can also verify the operating system running on your DGX Spark by using the following command:
+
+```bash
+lsb_release -a
+```
+
+The expected output is similar to:
+
+```log
+No LSB modules are available.
+Distributor ID: Ubuntu
+Description: Ubuntu 24.04.3 LTS
+Release: 24.04
+Codename: noble
+```
+This shows you that your DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendly Linux distribution that provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities. This makes it an ideal environment for building and deploying quantized LLM workloads.
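+
+As an optional extra check, confirm that the kernel itself is running in 64-bit Arm mode:
+
+```bash
+# Prints the kernel name, release, and machine architecture (expect aarch64)
+uname -srm
+```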
+
+Nice work! You've confirmed your operating system is Ubuntu 24.04 LTS, so you can move on to the next step.
+
+## Step 2: Verify the Blackwell GPU and driver
+
+After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads by using the following:
+
+```bash
+nvidia-smi
+```
+
+You will see output similar to:
+
+```output
+Wed Oct 22 09:26:54 2025
++-----------------------------------------------------------------------------------------+
+| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
++-----------------------------------------+------------------------+----------------------+
+| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
+| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
+| | | MIG M. |
+|=========================================+========================+======================|
+| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A |
+| N/A 32C P8 4W / N/A | Not Supported | 0% Default |
+| | | N/A |
++-----------------------------------------+------------------------+----------------------+
+
++-----------------------------------------------------------------------------------------+
+| Processes: |
+| GPU GI CI PID Type Process name GPU Memory |
+| ID ID Usage |
+|=========================================================================================|
+| 0 N/A N/A 3094 G /usr/lib/xorg/Xorg 43MiB |
+| 0 N/A N/A 3172 G /usr/bin/gnome-shell 16MiB |
++-----------------------------------------------------------------------------------------+
+```
+
+The `nvidia-smi` tool reports GPU hardware specifications and provides valuable runtime information, including driver status, temperature, power usage, and GPU utilization. This information helps verify that the system is ready for AI workloads.
+
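+If you prefer a compact, script-friendly view of the same information, `nvidia-smi` also supports query flags. The fields below are standard, although some values can be reported as `[N/A]` on this unified-memory platform:
+
+```bash
+# Query a few key fields in CSV form; unsupported fields may show [N/A] on GB10
+nvidia-smi --query-gpu=name,driver_version,temperature.gpu,utilization.gpu --format=csv
+```
+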
+### Understand the nvidia-smi output
+
+The table below provides more explanation of the `nvidia-smi` output:
+
+| **Category** | **Specification (from nvidia-smi)** | **Description / impact for LLM inference** |
+|---------------|--------------------------------------|---------------------------------------------|
+| GPU name | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace Blackwell Superchip |
+| Driver version | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility |
+| CUDA version | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads |
+| Architecture / Compute capability | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs |
+| Memory | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for unified inference memory space |
+| Power and thermal status | ~4 W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle |
+| GPU utilization | 0% (idle) | Indicates no active compute workloads; GPU is ready for new inference jobs |
+| Memory usage | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed |
+| Persistence mode | On | Ensures the GPU remains initialized and ready for rapid inference startup |
+
+Excellent! Your Blackwell GPU is recognized and ready for CUDA workloads. This means your system is set up for GPU-accelerated LLM inference.
+
+
+## Step 3: Check the CUDA toolkit
+
+To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed.
+
+The `nvcc --version` command confirms that the CUDA compiler is available and compatible with CUDA 13.
+This ensures that CMake can correctly detect and compile the GPU-accelerated components.
+
+```bash
+nvcc --version
+```
+
+You will see output similar to:
+
+```output
+nvcc: NVIDIA (R) Cuda compiler driver
+Copyright (c) 2005-2025 NVIDIA Corporation
+Built on Wed_Aug_20_01:57:39_PM_PDT_2025
+Cuda compilation tools, release 13.0, V13.0.88
+Build cuda_13.0.r13.0/compiler.36424714_0
+```
+
+{{% notice Note %}}
+The nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference.
+{{% /notice %}}
+
+This confirms that the CUDA 13 toolkit is installed and ready for building the GPU-enabled version of llama.cpp.
+If the command is missing or reports an older version (for example, 12.x), update to CUDA 13.0 or later to ensure compatibility with the Blackwell GPU (sm_121).
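+
+If `nvcc` is not found even though the toolkit is installed, the compiler directory may simply be missing from your shell's `PATH`. The path below is the conventional install location and is an assumption; adjust it to match your system:
+
+```bash
+# Assumed default install location for the CUDA 13.0 toolkit; adjust if yours differs
+export PATH=/usr/local/cuda-13.0/bin:$PATH
+export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH
+nvcc --version
+```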
+
+At this point, you have verified that:
+- The Grace CPU (Arm Cortex-X925 / A725) is correctly recognized and supports Armv9 extensions
+- The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime
+- The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp
+
+Your DGX Spark environment is now fully prepared for the next section, where you will build and configure both CPU and GPU versions of llama.cpp, laying the foundation for running quantized LLMs efficiently on the Grace Blackwell platform.
+
+## What you have accomplished
+
+In this setup section, you have:
+
+- Verified your Arm-based Grace CPU and its capabilities by confirming that your system is running Armv9 cores with SVE2, BF16, and INT8 matrix multiplication support, which are perfect for quantized LLM inference
+- Confirmed your Blackwell GPU and CUDA driver are ready by seeing that the GB10 GPU is active, properly recognized, and set up with CUDA 13, so you're all set for GPU-accelerated workloads
+- Checked your operating system and CUDA toolkit: Ubuntu 24.04 LTS provides a solid foundation, and the CUDA compiler is installed and ready for building GPU-enabled inference tools
+
+You're now ready to move on to building and running quantized LLMs on your DGX Spark. The next section walks you through compiling llama.cpp for both CPU and GPU, so you can start running AI inference on this platform.
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md
index 4114bf838..41b854a08 100644
--- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md
+++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md
@@ -1,20 +1,16 @@
---
title: Build the GPU version of llama.cpp on GB10
-weight: 3
+weight: 4
layout: "learningpathall"
---
## How do I build the GPU version of llama.cpp on GB10?
-In the previous section, you verified that your DGX Spark system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment.
+In the previous section, you verified that your DGX Spark system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment. Now that your hardware and drivers are ready, this section focuses on building the GPU-enabled version of llama.cpp, a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs. The llama.cpp project, created by Georgi Gerganov, is open source and provides efficient, dependency-free large language model inference on both CPUs and GPUs.
-Now that your hardware and drivers are ready, this section focuses on building the GPU-enabled version of llama.cpp, a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs.
+## Step 1: Install dependencies and download a test model
-llama.cpp is an open-source project by Georgi Gerganov that provides efficient and dependency-free large language model inference on both CPUs and GPUs.
-
-### Step 1: Preparation
-
-In this step, you will install the necessary build tools and download a small quantized model for validation.
+In this step, you will install the necessary build tools and download a small quantized model for validation:
```bash
sudo apt update
@@ -23,7 +19,9 @@ sudo apt install -y git cmake build-essential nvtop htop
These packages provide the C/C++ compiler toolchain, CMake build system, and GPU monitoring utility (nvtop) required to compile and test llama.cpp.
-To verify your GPU build later, you need at least one quantized model for testing.
+### Download a test model
+
+To test your GPU build, you'll need a quantized model. In this section, you'll download a lightweight model that's perfect for validation.
First, ensure that you have the latest Hugging Face Hub CLI installed and download models:
@@ -31,15 +29,22 @@ First, ensure that you have the latest Hugging Face Hub CLI installed and downlo
mkdir ~/models
cd ~/models
python3 -m venv venv
+source venv/bin/activate
pip install -U huggingface_hub
hf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF --local-dir TinyLlama-1.1B
```
-After the download completes, the models will be available in the `~/models` directory.
+{{% notice Note %}}
+After the download completes, you'll find the model files in the `~/models/TinyLlama-1.1B` directory.
+
+**Tip:** Always activate your Python virtual environment with `source venv/bin/activate` before installing packages or running Python-based tools. This ensures dependencies are isolated and prevents conflicts with system-wide packages.
+{{% /notice %}}
+
+Great! You’ve installed all the required build tools and downloaded a quantized model for validation. Your environment is ready for source code setup.
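+
+Before moving on, you can confirm that the GGUF files landed where you expect. The exact filenames depend on the repository contents, so treat the pattern below as illustrative:
+
+```bash
+# List the downloaded quantized model files and their sizes
+ls -lh ~/models/TinyLlama-1.1B/*.gguf
+```
+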
-### Step 2: Clone the llama.cpp repository
+## Step 2: Clone the llama.cpp repository
-Use the commands below to download the source code for llama.cpp from GitHub.
+Use the commands below to download the source code for llama.cpp from GitHub:
```bash
cd ~
@@ -47,11 +52,12 @@ git clone https://github.com/ggerganov/llama.cpp.git
cd ~/llama.cpp
```
-### Step 3: Configure and Build the CUDA-Enabled Version (GPU Mode)
+Nice work! You now have the latest llama.cpp source code on your DGX Spark system.
-Run the following `cmake` command to configure the build system for GPU acceleration.
+## Step 3: Configure and build the CUDA-enabled version (GPU mode)
+
+Run the following `cmake` command to configure the build system for GPU acceleration:
-This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels.
```bash
mkdir -p build-gpu
@@ -66,13 +72,17 @@ cmake .. \
-DCMAKE_CUDA_COMPILER=nvcc
```
-Explanation of Key Flags:
+This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels.
-| **Feature** | **Description / Impact** |
+### Explanation of key flags
+
+Here's what each configuration flag does:
+
+| **Feature** | **Description/Impact** |
|--------------|------------------------------|
-| -DGGML_CUDA=ON | Enables the CUDA backend in llama.cpp, allowing matrix operations and transformer layers to be offloaded to the GPU for acceleration.|
-| -DGGML_CUDA_F16=ON | Enables FP16 (half-precision) CUDA kernels, reducing memory usage and increasing throughput — especially effective for quantized models (e.g., Q4, Q5). |
-| -DCMAKE_CUDA_ARCHITECTURES=121 | Specifies the compute capability for the NVIDIA Blackwell GPU (GB10 = sm_121), ensuring the CUDA compiler (nvcc) generates optimized GPU kernels. |
+| -DGGML_CUDA=ON | Enables the CUDA backend in llama.cpp, allowing matrix operations and transformer layers to be offloaded to the GPU for acceleration |
+| -DGGML_CUDA_F16=ON | Enables FP16 (half-precision) CUDA kernels, reducing memory usage and increasing throughput, especially effective for quantized models (for example, Q4, Q5) |
+| -DCMAKE_CUDA_ARCHITECTURES=121 | Specifies the compute capability for the NVIDIA Blackwell GPU (GB10 = sm_121), ensuring the CUDA compiler (nvcc) generates optimized GPU kernels |
When the configuration process completes successfully, the terminal should display output similar to the following:
@@ -83,17 +93,15 @@ When the configuration process completes successfully, the terminal should displ
```
{{% notice Note %}}
-1. For systems with multiple CUDA versions installed, explicitly specifying the compilers (`-DCMAKE_C_COMPILER`, `-DCMAKE_CXX_COMPILER`, `-DCMAKE_CUDA_COMPILER`) ensures that CMake uses the correct CUDA 13.0 toolchain.
-2. In case of configuration errors, revisit the previous section to verify that your CUDA toolkit and driver versions are properly installed and aligned with Blackwell (sm_121) support.
-{{% /notice %}}
+- For systems with multiple CUDA versions installed, explicitly specifying the compilers (`-DCMAKE_C_COMPILER`, `-DCMAKE_CXX_COMPILER`, `-DCMAKE_CUDA_COMPILER`) ensures that CMake uses the correct CUDA 13.0 toolchain.
+- If you encounter configuration errors, return to the previous section and confirm that your CUDA toolkit and driver versions are correctly installed and compatible with Blackwell (sm_121).
+{{% /notice %}}
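+
+For reference, a configuration that pins the compilers explicitly might look like the following sketch. The compiler paths are assumptions for a default Ubuntu installation with CUDA 13.0 under `/usr/local/cuda-13.0`, so adjust them to match your system:
+
+```bash
+# Sketch: pin the host and CUDA compilers explicitly (paths are assumptions)
+cmake .. \
+  -DGGML_CUDA=ON \
+  -DGGML_CUDA_F16=ON \
+  -DCMAKE_CUDA_ARCHITECTURES=121 \
+  -DCMAKE_C_COMPILER=/usr/bin/gcc \
+  -DCMAKE_CXX_COMPILER=/usr/bin/g++ \
+  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc
+```
+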
Once CMake configuration succeeds, start the compilation process:
```bash
make -j"$(nproc)"
```
-
-This command compiles all CUDA and C++ source files in parallel, utilizing all available CPU cores for optimal build performance. On the Grace CPU in the DGX Spark system, the build process typically completes within 2–4 minutes, demonstrating the efficiency of the Arm-based architecture for software development.
+This command compiles all CUDA and C++ source files in parallel, using all available CPU cores. On the Grace CPU, the build typically finishes in 2–4 minutes.
The build output is shown below:
@@ -106,15 +114,15 @@ The build output is shown below:
[100%] Built target llama-server
```
-After the build completes, the GPU-accelerated binaries will be located under `~/llama.cpp/build-gpu/bin/`
+After the build completes, you'll find the GPU-accelerated binaries located under `~/llama.cpp/build-gpu/bin/`.
-These binaries provide all necessary tools for quantized model inference (llama-cli) and for serving GPU inference via HTTP API (llama-server). You are now ready to test quantized LLMs with full GPU acceleration in the next step.
+These binaries provide all necessary tools for quantized model inference (llama-cli) and for serving GPU inference using HTTP API (llama-server).
-Together, these options ensure that the build targets the Grace Blackwell GPU with full CUDA 13 compatibility.
+Excellent! The CUDA-enabled build is complete. The configuration options above ensure that these binaries target the Grace Blackwell GPU with full CUDA 13 compatibility. You are now ready to test quantized LLMs with full GPU acceleration in the next step.
-### Step 4: Validate the CUDA-enabled build
+## Step 4: Validate the CUDA-enabled build
-After the build completes successfully, verify that the GPU-enabled binary of *llama.cpp is correctly linked to the NVIDIA CUDA runtime.
+After the build completes successfully, verify that the GPU-enabled binary of llama.cpp is correctly linked to the NVIDIA CUDA runtime.
To verify CUDA linkage, run the following command:
@@ -134,7 +142,7 @@ The output is similar to:
If the CUDA library is correctly linked, it confirms that the binary can access the GPU through the system driver.
-Next, confirm that the binary initializes the GPU correctly by checking device detection and compute capability.
+Next, confirm that the binary initializes the GPU correctly by checking device detection and compute capability:
```bash
./bin/llama-server --version
@@ -165,20 +173,32 @@ Next, use the downloaded quantized model (for example, TinyLlama-1.1B) to verify
If the build is successful, you will see text generation begin within a few seconds.
-While `nvidia-smi` can display basic GPU information, `nvtop` provides real-time visualization of utilization, temperature, and power metrics which are useful for verifying CUDA kernel activity during inference.
+To monitor GPU utilization during inference, use `nvtop` to view real-time performance metrics:
```bash
nvtop
```
-The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark.
+This command displays GPU utilization, memory usage, temperature, and power consumption. You can use this to verify that CUDA kernels are active during model inference.
+
+The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark:
-
+
The nvtop interface shows:
-- GPU Utilization (%) : confirm CUDA kernels are active
-- Memory Usage (VRAM) : observe model loading and runtime footprint
-- Temperature / Power Draw : monitor thermal stability under sustained workloads
+
+- GPU utilization (%): confirms CUDA kernels are active
+- Memory usage (VRAM): shows model loading and runtime footprint
+- Temperature / power draw: monitors thermal stability under sustained workloads
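+
+If you prefer a plain-text alternative to `nvtop`, polling `nvidia-smi` gives a similar at-a-glance view:
+
+```bash
+# Refresh the standard nvidia-smi report every second
+watch -n 1 nvidia-smi
+```
+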
You have now successfully built and validated the CUDA-enabled version of llama.cpp on DGX Spark.
-In the next section, you will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference.
+
+## What you have accomplished
+
+You have:
+- Installed all required tools and dependencies
+- Downloaded a quantized model for testing
+- Built the CUDA-enabled version of llama.cpp
+- Verified GPU linkage and successful inference
+
+You're now ready to move on to the next section, where you will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference.
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md
index 90d03a16d..b7c87b35e 100644
--- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md
+++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md
@@ -1,22 +1,20 @@
---
title: Build the CPU version of llama.cpp on GB10
-weight: 4
+weight: 5
layout: "learningpathall"
---
-## How do I build the CPU version of llama.cpp on GB10?
+## Overview
+In this section, you'll build and test the CPU-only version of llama.cpp, optimized for the Grace CPU's advanced Armv9 capabilities.
+
-Use the steps below to build and test the CPU only versoin of llama.cpp.
-### Step 1: Configure and Build the CPU-Only Version
-
-In this session, you will configure and build the CPU-only version of llama.cpp, optimized for the Armv9-based Grace CPU.
+## Configure and build the CPU-only version
This build runs entirely on the Grace CPU (Arm Cortex-X925 and Cortex-A725), which supports advanced Armv9 vector extensions including SVE2, BFloat16, and I8MM, making it highly efficient for quantized inference workloads even without GPU acceleration.
+To ensure a clean separation from the GPU build artifacts, start from a clean directory.
-Start from a clean directory to ensure a clean separation from the GPU build artifacts.
-
-Run the following commands to configure the build system for the CPU-only version of llama.cpp.
+Configure the build system for the CPU-only version of llama.cpp:
```bash
cd ~/llama.cpp
@@ -34,15 +32,15 @@ cmake .. \
-DCMAKE_CXX_FLAGS="-O3 -march=armv9-a+sve2+bf16+i8mm -mtune=native -fopenmp"
```
-Explanation of Key Flags:
+Explanation of key flags:
| **Feature** | **Description / Impact** |
|--------------|------------------------------|
-| -march=armv9-a | Targets the Armv9-A architecture used by the Grace CPU and enables advanced vector extensions.|
-| +sve2+bf16+i8mm | Activates Scalable Vector Extensions (SVE2), INT8 matrix multiply (I8MM), and BFloat16 operations for quantized inference.|
-| -fopenmp | Enables multi-threaded execution via OpenMP, allowing all 20 Grace cores to be utilized.|
-| -mtune=native | Optimizes code generation for the local Grace CPU microarchitecture.|
-| -DLLAMA_ACCELERATE=ON | Enables llama.cpp’s internal Arm acceleration path (Neon/SVE optimized kernels).|
+| -march=armv9-a | Targets the Armv9-A architecture used by the Grace CPU and enables advanced vector extensions |
+| +sve2+bf16+i8mm | Activates Scalable Vector Extensions (SVE2), INT8 matrix multiply (I8MM), and BFloat16 operations for quantized inference |
+| -fopenmp | Enables multi-threaded execution via OpenMP, allowing all 20 Grace cores to be utilized |
+| -mtune=native | Optimizes code generation for the local Grace CPU microarchitecture |
+| -DLLAMA_ACCELERATE=ON | Enables llama.cpp's internal Arm acceleration path (Neon/SVE optimized kernels) |
When the configuration process completes successfully, the terminal should display output similar to the following:
@@ -52,7 +50,7 @@ When the configuration process completes successfully, the terminal should displ
-- Build files have been written to: /home/nvidia/llama.cpp/build-cpu
```
-Then, start the compilation process:
+Once you see this output, start the compilation process:
```bash
make -j"$(nproc)"
@@ -81,17 +79,16 @@ The build output is shown below:
[100%] Built target llama-server
```
-After the build finishes, the CPU-optimized binaries will be available under `~/llama.cpp/build-cpu/bin/`
-
-### Step 2: Validate the CPU-Enabled Build (CPU Mode)
+After the build finishes, you'll find the CPU-optimized binaries in `~/llama.cpp/build-cpu/bin/`.
+
+## Validate the CPU-enabled build (CPU mode)
-In this step, you will validate that the binary was compiled in CPU-only mode and runs correctly on the Grace CPU.
+First, validate that the binary was compiled in CPU-only mode and runs correctly on the Grace CPU:
```bash
./bin/llama-server --version
```
-Expected output:
+The output confirms the build configuration:
```output
version: 6819 (19a5a3ed)
@@ -115,34 +112,34 @@ Here is an explanation of the key flags:
- `-ngl 0` disables GPU offloading (CPU-only execution)
- `-t 20` uses 20 threads (1 per Grace CPU core)
-If the build is successful, you will observe smooth model initialization and token generation, with CPU utilization increasing across all cores.
+If the build is successful, you will see smooth model initialization and token generation, with CPU utilization increasing across all cores.
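+
+Optionally, you can quantify how throughput scales with thread count using the `llama-bench` tool, which is built alongside `llama-cli`. The model filename below is an assumption based on the TinyLlama download from the earlier section; adjust it to match the file you actually have:
+
+```bash
+# Compare CPU-only prompt processing and generation speed at two thread counts
+for t in 10 20; do
+  ./bin/llama-bench -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q8_0.gguf -ngl 0 -t $t
+done
+```
+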
-For live CPU utilization and power metrics, use `htop`:
+To monitor live CPU utilization and power metrics during inference, use `htop`:
```bash
htop
```
-The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement.
-
+The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement:
+
The `htop` interface shows:
-- CPU Utilization: All 20 cores operate between 75–85%, confirming efficient multi-thread scaling.
-- Load Average: Around 5.0, indicating balanced workload distribution.
-- Memory Usage: Approximately 4.5 GB total for the TinyLlama Q8_0 model.
-- Process List: Displays multiple `llama-cli` threads (each 7–9% CPU), confirming OpenMP parallelism
+- CPU Utilization: all 20 cores operate between 75–85%, confirming efficient multi-thread scaling
+- Load Average: around 5.0, indicating balanced workload distribution
+- Memory Usage: approximately 4.5 GB total for the TinyLlama Q8_0 model
+- Process List: displays multiple `llama-cli` threads (each 7–9% CPU), confirming OpenMP parallelism
{{% notice Note %}}
In htop, press F6 to sort by CPU% and verify load distribution, or press `t` to toggle the tree view, which shows the `llama-cli` main process and its worker threads.
{{% /notice %}}
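+
+To focus `htop` on just the inference process, you can pass its PID directly; this assumes a single `llama-cli` process is running:
+
+```bash
+# Show only the llama-cli process and its worker threads
+htop -p $(pgrep llama-cli)
+```
+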
+## What you have accomplished
+
In this section you have:
- Built and validated the CPU-only version of llama.cpp.
- Optimized the Grace CPU build using Armv9 vector extensions (SVE2, BF16, I8MM).
- Tested quantized model inference using the TinyLlama Q8_0 model.
- Used monitoring tools (htop) to confirm efficient CPU utilization.
-You have now successfully built and validated the CPU-only version of llama.cpp on the Grace CPU.
-
-In the next section, you will learn how to use the Process Watch tool to visualize instruction-level execution and better understand how Armv9 vectorization (SVE2 and NEON) accelerates quantized LLM inference on the Grace CPU.
+You have now successfully built and validated the CPU-only version of llama.cpp on the Grace CPU. In the next section, you will learn how to use the Process Watch tool to visualize instruction-level execution and better understand how Armv9 vectorization (SVE2 and NEON) accelerates quantized LLM inference on the Grace CPU.
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md
index 91e251471..f27182b42 100644
--- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md
+++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/4_gb10_processwatch.md
@@ -1,6 +1,6 @@
---
title: Analyze CPU instruction mix using Process Watch
-weight: 5
+weight: 6
layout: "learningpathall"
---
@@ -10,7 +10,7 @@ In this section, you'll explore how the Grace CPU executes Armv9 vector instruct
Process Watch helps you observe Neon SIMD instruction execution on the Grace CPU and understand why SVE and SVE2 remain inactive under the current kernel configuration. This demonstrates how Armv9 vector execution works in AI workloads and shows the evolution from traditional SIMD pipelines to scalable vector computation.
-### Install and configure Process Watch
+## Install and configure Process Watch
First, install the required packages:
@@ -19,7 +19,7 @@ sudo apt update
sudo apt install -y git cmake build-essential libncurses-dev libtinfo-dev
```
-Clone and build Process Watch:
+Now clone and build Process Watch:
```bash
cd ~
@@ -79,11 +79,7 @@ cd ~/llama.cpp/build-cpu/bin
-p "Explain the benefits of vector processing in modern Arm CPUs."
```
-Keep this terminal running while the model generates text output.
-
-You can now attach Process Watch to this active process.
-
-Once the llama.cpp process is running on the Grace CPU, attach Process Watch to observe its live instruction activity.
+Keep this terminal running while the model generates text output. Once the llama.cpp process is running on the Grace CPU, you can attach Process Watch to observe its live instruction activity.
If only one `llama-cli` process is running, you can directly launch Process Watch without manually checking its PID:
@@ -91,7 +87,7 @@ If only one `llama-cli` process is running, you can directly launch Process Watc
sudo processwatch --pid $(pgrep llama-cli)
```
-If you have multiple processes running, first identify the correct process ID:
+If multiple processes are running, first identify the correct process ID:
```bash
pgrep llama-cli
@@ -103,6 +99,8 @@ Then attach Process Watch to monitor the instruction mix of this process:
sudo processwatch --pid <PID>
```
+Replace `<PID>` with the actual process ID from the previous command.
+
{{% notice Note %}}
`processwatch --list` does not display all system processes.
It is intended for internal use and may not list user-level tasks like llama-cli.
@@ -158,13 +156,13 @@ ALL ALL 2.52 8.37 0.00 0.00 100.00 26566
```
Here is an interpretation of the values:
-- NEON (≈ 7–15 %) : Active SIMD integer and floating-point operations.
-- FPARMv8 : Scalar FP operations such as activation and normalization.
-- SVE/SVE2 = 0 : The kernel does not issue SVE instructions.
+- NEON (≈7–15%): active SIMD integer and floating-point operations
+- FPARMv8: scalar FP operations such as activation and normalization
+- SVE/SVE2 = 0: the kernel does not issue SVE instructions
This confirms that the Grace CPU performs quantized inference primarily using NEON.
-### Why are SVE and SVE2 inactive?
+## Why are SVE and SVE2 inactive?
Although the Grace CPU supports SVE and SVE2, the vector length is 16 bytes (128-bit).
@@ -174,7 +172,7 @@ Verify the current setting:
cat /proc/sys/abi/sve_default_vector_length
```
-The output is:
+The output is similar to:
```output
16
@@ -189,15 +187,27 @@ echo 256 | sudo tee /proc/sys/abi/sve_default_vector_length
This behavior is expected because SVE is available but fixed at 128 bits.
{{% notice Note %}}
-Future kernel updates may introduce SVE2 instructions.
+Future kernel updates might introduce SVE2 instructions.
{{% /notice %}}
-## Summary
+## What you've accomplished and what's next
+
+You have completed the Learning Path for analyzing large language model inference on the DGX Spark platform with Arm-based Grace CPUs and Blackwell GPUs.
+
+Throughout this Learning Path, you have learned how to:
+
+- Set up your DGX Spark system with the required Arm software stack and CUDA 13 environment
+- Build and validate both GPU-accelerated and CPU-only versions of llama.cpp for quantized LLM inference
+- Download and run quantized TinyLlama models for efficient testing and benchmarking
+- Monitor GPU utilization and performance using tools like nvtop
+- Analyze CPU instruction mix with Process Watch to understand how Armv9 vector instructions are used during inference
+- Interpret the impact of NEON, SVE, and SVE2 on AI workloads, and recognize current kernel limitations for vector execution
-You have learned how to:
-- Use Process Watch to monitor CPU instruction activity
-- Interpret Armv9 vector instruction usage during LLM inference
-- Prepare for future Armv9 enhancements
+
+By completing these steps, you are now equipped to:
-This knowledge helps you profile Arm systems effectively and optimize applications.
+- Profile and optimize LLM workloads on Arm-based systems
+- Identify performance bottlenecks and opportunities for acceleration on both CPU and GPU
+- Prepare for future enhancements in Armv9 vector processing and software support
+- Confidently deploy and monitor AI inference on modern Arm server platforms
+
+For additional learning, see the resources in the Further Reading section. You can continue experimenting with different models and monitoring tools as new kernel updates become available.
diff --git a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md
index ed548d391..e2a8df1cf 100644
--- a/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md
+++ b/content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/_index.md
@@ -1,22 +1,23 @@
---
-title: Deploy quantized LLMs on DGX Spark using llama.cpp
-
-draft: true
-cascade:
- draft: true
+title: Unlock quantized LLM performance on Arm-based NVIDIA DGX Spark using Armv9 SIMD instructions
minutes_to_complete: 60
-who_is_this_for: This Learning Path is for AI practitioners, performance engineers, and system architects who want to understand how the Grace–Blackwell (GB10) platform enables efficient quantized LLM inference through CPU–GPU collaboration.
+who_is_this_for: This is an introductory topic for AI practitioners, performance engineers, and system architects who want to learn how to deploy and optimize quantized large language models (LLMs) on NVIDIA DGX Spark systems powered by the Grace-Blackwell (GB10) architecture.
learning_objectives:
- - Understand the Grace–Blackwell (GB10) architecture and how it supports efficient AI inference
- - Build and validate both CUDA-enabled and CPU-only versions of llama.cpp for flexible deployment
+ - Describe the Grace–Blackwell (GB10) architecture and its support for efficient AI inference
+ - Build CUDA-enabled and CPU-only versions of llama.cpp for flexible deployment
+ - Validate the functionality of both builds on the DGX Spark platform
- Analyze how Armv9 SIMD instructions accelerate quantized LLM inference on the Grace CPU
prerequisites:
- - NVIDIA DGX Spark system with at least 15 GB of available disk space
- - Basic understanding of machine learning concepts
+ - Access to an NVIDIA DGX Spark system with at least 15 GB of available disk space
+ - Familiarity with command-line interfaces and basic Linux operations
+ - Understanding of CUDA programming basics and GPU/CPU compute concepts
+ - Basic knowledge of quantized large language models (LLMs) and machine learning inference
+ - Experience building software from source using CMake and make
+
author: Odin Shen
@@ -36,21 +37,21 @@ tools_software_languages:
further_reading:
- resource:
- title: NVIDIA DGX Spark
+ title: NVIDIA DGX Spark website
link: https://www.nvidia.com/en-gb/products/workstations/dgx-spark/
type: website
- resource:
- title: NVIDIA DGX Spark Playbooks
+ title: NVIDIA DGX Spark Playbooks GitHub repository
link: https://github.com/NVIDIA/dgx-spark-playbooks
type: documentation
- resource:
- title: Explore llama.cpp architecture and the inference workflow
+ title: Profile llama.cpp performance with Arm Streamline and KleidiAI LLM kernels Learning Path
link: https://learn.arm.com/learning-paths/servers-and-cloud-computing/llama_cpp_streamline/
type: blog
- resource:
- title: The Dawn of New Desktop Devices Arm-Powered NVIDIA DGX Spark Workstations to Redefine AI Computing
+ title: Arm-Powered NVIDIA DGX Spark Workstations to Redefine AI
link: https://newsroom.arm.com/blog/arm-powered-nvidia-dgx-spark-ai-workstations
- type: website
+ type: blog
### FIXED, DO NOT MODIFY
# ================================================================================