#45 - Add DGX Spark (GB10) support to docs and examples #49
Wraps the existing IGX Orin content under a top-of-page mkdocs-material content tab and adds a parallel "DGX Spark" tab with a full walkthrough for the Grace Blackwell GB10 superchip platform. The Spark tab calls out the platform deltas:
- CX-7 hotplug warning (cable required; PFs disappear without one)
- 4 PFs / 2 chips / tied chassis sockets, with the cable-yank diagnostic
- Skip nvidia_peermem (EINVAL on Grace) and the daqiri DMA-BUF path (CUDA reports DMA_BUF_SUPPORTED=0); use kind: host_pinned in YAML
- Skip GPU BAR1 sizing and the PIX/PXB topology requirement (N/A on unified-memory single-SoC integrated GPUs)
- grub drop-in pattern under /etc/default/grub.d/ for hugepages and isolcpus=16-19 (Spark composes the cmdline from drop-ins)
- systemd unit pattern for MRRS via raw setpci CAP_EXP+8.w (Secure Boot must be off)
- nmcli profile pattern that folds IP + MTU=9000 into one step

Anchor stability: existing cross-page links from benchmarking_examples.md and configuration-walkthrough.md (#enable-gpudirect, #step-4-enable-huge-pages, #step-5-isolate-cpu-cores) keep landing on the IGX heading because mkdocs gives the first heading occurrence the bare slug; Spark duplicates get _1 suffixes. mkdocs --strict build passes.

Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
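The anchor-stability behaviour described above can be illustrated with a small sketch. This is a simplified stand-in for the slug de-duplication that the commit message attributes to mkdocs, not the actual mkdocs or Python-Markdown code; the heading texts are assumptions inferred from the anchors quoted in the message, and `slugify` / `assign_anchors` are hypothetical helper names.

```python
import re


def slugify(heading: str) -> str:
    """Lowercase the heading, drop punctuation, and join words with hyphens."""
    text = re.sub(r"[^\w\s-]", "", heading).strip().lower()
    return re.sub(r"[\s-]+", "-", text)


def assign_anchors(headings):
    """Give the first occurrence the bare slug; later duplicates get _1, _2, ..."""
    seen = set()
    anchors = []
    for heading in headings:
        base = slugify(heading)
        anchor, n = base, 0
        while anchor in seen:
            n += 1
            anchor = f"{base}_{n}"
        seen.add(anchor)
        anchors.append(anchor)
    return anchors


# Assumed heading order: the IGX Orin tab comes first, the DGX Spark tab second.
headings = [
    "Enable GPUDirect",           # IGX tab
    "Step 4: Enable Huge Pages",  # IGX tab
    "Enable GPUDirect",           # Spark tab duplicate
]
anchors = assign_anchors(headings)
print(anchors)
# ['enable-gpudirect', 'step-4-enable-huge-pages', 'enable-gpudirect_1']
```

Because the IGX headings appear first on the page, existing links to the bare slugs keep resolving to them, while the Spark duplicates are only reachable through the suffixed anchors.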
Drop in two ready-to-run YAML configs for the DGX Spark workstation so users do not have to read three docs to know which substitutions to make:
- daqiri_bench_raw_tx_rx_spark.yaml: kind: host_pinned (mandatory on GB10 since PR #41), PCIe 0000:01:00.0 / 0002:01:00.0 (mlx5_0 ↔ mlx5_2 across the chassis QSFP loop), cpu_core 17/18 from the isolated big-cluster cores 16-19, ip_src/ip_dst matching the daqiri-tx / daqiri-rx nmcli profiles documented in the system configuration tutorial.
- daqiri_bench_rdma_tx_rx_spark.yaml: same IP/core conventions for the RDMA bench. host_pinned was already correct upstream, IPs swap from 10.100.x.x to 1.1.1.1 / 2.2.2.2 to match Spark's nmcli profiles.

The original templated *_tx_rx.yaml files are preserved unchanged. eth_dst_addr remains a placeholder in the raw bench since the RX MAC is per-system; the comment block tells the user how to populate it.

Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
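Since eth_dst_addr is left as a per-system placeholder, one way to look up the value is to read the RX interface's MAC from sysfs. The sketch below is only an illustration, not the procedure from the config's comment block: `read_mac` is a hypothetical helper, and the interface name `enP2p1s0f0np0` is an assumption (it is the name the tune_system.py log in this PR reports for 0002:01:00.0, taken here to be the RX port).

```python
import re
from pathlib import Path

# Six colon-separated hex octets, e.g. aa:bb:cc:dd:ee:ff
MAC_RE = re.compile(r"^(?:[0-9a-f]{2}:){5}[0-9a-f]{2}$")


def read_mac(iface: str, sysfs_root: str = "/sys/class/net") -> str:
    """Return the MAC address Linux exposes at <sysfs_root>/<iface>/address."""
    mac = (Path(sysfs_root) / iface / "address").read_text().strip().lower()
    if not MAC_RE.match(mac):
        raise ValueError(f"unexpected MAC format for {iface}: {mac!r}")
    return mac


if __name__ == "__main__":
    rx_iface = "enP2p1s0f0np0"  # assumed RX interface name; adjust per system
    try:
        print(f"eth_dst_addr: {read_mac(rx_iface)}")
    except FileNotFoundError:
        print(f"interface {rx_iface} not found on this machine")
```

The printed value is what would replace the placeholder in daqiri_bench_raw_tx_rx_spark.yaml; the RDMA variant needs no such edit.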
…ence The current memory_regions kind table flags host_pinned as "not recommended for high-throughput RX/TX", which is true on discrete GPUs but actively misleading on integrated GPUs (e.g. NVIDIA GB10 / DGX Spark) where it is the only working path: nvidia_peermem returns EINVAL by design and CUDA reports DMA_BUF_SUPPORTED=0 on the unified-memory platform, so device is unreachable. Update both the configuration reference and the configuration-walkthrough annotation #6 to call out the integrated-GPU recommendation while preserving the discrete-GPU guidance for users on RTX 6000 Ada / A100 / H100 setups. Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
Extend CMAKE_CUDA_ARCHITECTURES from "80;90" to "80;90;121" so the default build targets the Grace Blackwell GB10 superchip (DGX Spark) in addition to A100 / H100. Purely additive: A100/H100 builds keep sm_80 and sm_90, GB10 builds gain sm_121, and downstream users can still override via -DCMAKE_CUDA_ARCHITECTURES. Sync the corresponding CLAUDE.md note in the Build & run section per the docs-sync rules. Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
Several --check assertions in tune_system.py were written for discrete GPUs and emit WARN-level false positives on platforms like NVIDIA GB10 (DGX Spark) where the GPU is integrated and shares memory with the CPU. Add an is_any_integrated_gpu() helper that queries CU_DEVICE_ATTRIBUTE_INTEGRATED via the CUDA driver API. Thread it through:
- check_peermem_kernel: when integrated, downgrade "module not loaded" to INFO with the host_pinned guidance, since peermem is not expected to load on Grace platforms.
- check_gpudirect_support: per-GPU; for an integrated GPU reporting GPUDirect-RDMA-supported=0 (expected on GB10), log INFO instead of WARNING.
- check_topology_connections: skip the PIX/PXB requirement and log INFO when any visible GPU is integrated; there is no separable PCIe path GPU↔NIC to optimize.
- check_bar1_size: skip and log INFO when integrated; no resizable BAR1 on unified memory.

Also harden update_mrrs_for_nvidia_devices: read back after the write and report an error if the high nibble didn't land. The most common cause is kernel lockdown from Secure Boot, which makes setpci appear to succeed.

Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
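A minimal sketch of how such a helper could query CU_DEVICE_ATTRIBUTE_INTEGRATED through the CUDA driver API with ctypes. This is not the code in tune_system.py; it only assumes that libcuda.so.1 is the driver library name on Linux and that the attribute's enum value is 18, as in cuda.h. The `missing_peermem_level` function is a hypothetical illustration of the WARNING-to-INFO demotion.

```python
import ctypes
import logging

CUDA_SUCCESS = 0
CU_DEVICE_ATTRIBUTE_INTEGRATED = 18  # enum value from cuda.h


def is_any_integrated_gpu() -> bool:
    """Return True if any visible CUDA device reports itself as integrated."""
    try:
        cuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return False  # no driver library: assume no integrated GPU
    if cuda.cuInit(0) != CUDA_SUCCESS:
        return False
    count = ctypes.c_int(0)
    if cuda.cuDeviceGetCount(ctypes.byref(count)) != CUDA_SUCCESS:
        return False
    for ordinal in range(count.value):
        device = ctypes.c_int(0)
        if cuda.cuDeviceGet(ctypes.byref(device), ordinal) != CUDA_SUCCESS:
            continue
        value = ctypes.c_int(0)
        rc = cuda.cuDeviceGetAttribute(
            ctypes.byref(value), CU_DEVICE_ATTRIBUTE_INTEGRATED, device
        )
        if rc == CUDA_SUCCESS and value.value == 1:
            return True
    return False


def missing_peermem_level(integrated: bool) -> int:
    """Hypothetical: pick the log level for 'nvidia-peermem not loaded'."""
    return logging.INFO if integrated else logging.WARNING


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    integrated = is_any_integrated_gpu()
    logging.log(missing_peermem_level(integrated),
                "integrated GPU present: %s", integrated)
```

Returning False when the driver cannot be loaded keeps the discrete-GPU behaviour as the default, which matches the commit's intent that dGPU output stays unchanged.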
docs/getting-started.md's build-section table still listed CUDA architectures as "80;90" (A100, H100). The CMakeLists.txt default and CLAUDE.md note were updated to "80;90;121" in 0308c76 — bring getting-started.md into line with the same wording so the docs-sync rule (.claude/rules/docs-sync.md mapping for src/CMakeLists.txt) is fully covered. Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
Reviewer Request 1
Result: PASS
The container build succeeded. To isolate the toolchain question (since the Dockerfile pins …)
$ docker run --rm daqiri:local bash -c \
'echo "__global__ void k(){}" > /tmp/k.cu && \
nvcc -arch=sm_121 -c /tmp/k.cu -o /tmp/k.o && echo "sm_121 OK" || echo "sm_121 FAILED"'
CUDA Version 13.1.0
sm_121 OK
Reviewer Request 2
Result: PASS (with caveats)
Bench ran end-to-end on real GB10 hardware over the chassis QSFP loop. Zero drops in either direction.
Caveats
1. Throughput ~95% of expected. This box is missing several tuning items the Spark walkthrough recommends ( …
# Reclaim daqiri/DPDK orphan hugepage files (run between bench iterations)
sudo rm -f /dev/hugepages/rtemap_* /dev/hugepages/*map_*
# Or, if no passwordless sudo, via any image already on disk:
docker run --rm -v /dev/hugepages:/dev/hugepages daqiri:local \
sh -c 'rm -f /dev/hugepages/rtemap_* /dev/hugepages/*map_*'
Reviewer Request 3
Result: PASS
All four PR-targeted checks correctly flipped to INFO, each with a Spark-specific rationale.
Full output
$ sudo python3 python/tune_system.py --check all
2026-05-05 15:16:47 - INFO - CPU 0: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 1: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 2: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 3: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 4: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 5: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 6: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 7: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 8: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 9: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 10: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 11: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 12: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 13: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 14: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 15: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 16: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 17: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 18: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 19: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - WARNING - enp1s0f0np0/0000:01:00.0: MRRS is set to 512, not 4096.
2026-05-05 15:16:47 - WARNING - enp1s0f1np1/0000:01:00.1: MRRS is set to 512, not 4096.
2026-05-05 15:16:47 - WARNING - enP2p1s0f0np0/0002:01:00.0: MRRS is set to 512, not 4096.
2026-05-05 15:16:47 - WARNING - enP2p1s0f1np1/0002:01:00.1: MRRS is set to 512, not 4096.
2026-05-05 15:16:47 - INFO - enp1s0f0np0/0000:01:00.0: PCIe Max Payload Size is correctly set to 512 bytes (hardware maximum).
2026-05-05 15:16:47 - INFO - enp1s0f1np1/0000:01:00.1: PCIe Max Payload Size is correctly set to 512 bytes (hardware maximum).
2026-05-05 15:16:47 - INFO - enP2p1s0f0np0/0002:01:00.0: PCIe Max Payload Size is correctly set to 512 bytes (hardware maximum).
2026-05-05 15:16:47 - INFO - enP2p1s0f1np1/0002:01:00.1: PCIe Max Payload Size is correctly set to 512 bytes (hardware maximum).
2026-05-05 15:16:47 - INFO - HugePages_Total: 2048
2026-05-05 15:16:47 - INFO - HugePage Size: 2.00 MB
2026-05-05 15:16:47 - INFO - Total Allocated HugePage Memory: 4096.00 MB
2026-05-05 15:16:47 - INFO - Hugepages are sufficiently allocated with at least 500 MB.
2026-05-05 15:16:47 - WARNING - GPU 0: SM Clock is set to 208 MHz, but should be within 500 MHz of the 3003 MHz theoretical Max.
2026-05-05 15:16:47 - INFO - GPU 0: Memory clock reported as N/A (not applicable for this GPU).
2026-05-05 15:16:47 - INFO - BAR1 size check skipped: integrated GPU detected (unified memory). There is no resizable BAR1 to enlarge on platforms like GB10 / DGX Spark.
2026-05-05 15:16:47 - INFO - Skipping PIX/PXB topology requirement: integrated GPU detected. On single-SoC unified-memory platforms (e.g. GB10 / DGX Spark) there is no separable PCIe path GPU↔NIC to optimize.
2026-05-05 15:16:47 - WARNING - The kernel command line is missing 'isolcpus'. Please ensure it is configured.
2026-05-05 15:16:47 - WARNING - The kernel command line is missing 'rcu_nocbs'. Please ensure it is configured.
2026-05-05 15:16:47 - WARNING - The kernel command line is missing 'irqaffinity'. Please ensure it is configured.
2026-05-05 15:16:48 - INFO - Interface enp1s0f0np0 has an acceptable MTU of 9082 bytes.
2026-05-05 15:16:48 - WARNING - Interface enp1s0f1np1 has an MTU of 1500 bytes. If possible use larger frame sizes ( > 1518B) for better performance
2026-05-05 15:16:48 - INFO - Interface enP2p1s0f0np0 has an acceptable MTU of 8046 bytes.
2026-05-05 15:16:48 - WARNING - Interface enP2p1s0f1np1 has an MTU of 1500 bytes. If possible use larger frame sizes ( > 1518B) for better performance
2026-05-05 15:16:48 - INFO - GPU 0: NVIDIA GB10 is integrated (unified memory). GPUDirect-RDMA-supported reported as 0 is expected on this platform; use kind: host_pinned in the daqiri YAML.
2026-05-05 15:16:48 - INFO - nvidia-peermem module is not loaded, but the platform has an integrated GPU (e.g. GB10 / DGX Spark) where peermem does not apply. Use kind: host_pinned in the daqiri YAML for GPUDirect on this platform.
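The "MRRS is set to 512, not 4096" warnings above refer to the Max_Read_Request_Size field of the PCIe Device Control register, which is the 16-bit word that setpci addresses as CAP_EXP+8.w. The sketch below shows how that field maps to bytes and what a post-write readback check might look like; the function names and the sample register value 0x2936 are hypothetical, and this is not the code in update_mrrs_for_nvidia_devices.

```python
MRRS_SHIFT = 12                # field occupies bits 14:12 of Device Control
MRRS_MASK = 0x7 << MRRS_SHIFT


def decode_mrrs(devctl: int) -> int:
    """Return the max read request size in bytes encoded in a devctl word."""
    return 128 << ((devctl & MRRS_MASK) >> MRRS_SHIFT)


def encode_mrrs(devctl: int, mrrs_bytes: int) -> int:
    """Return devctl with only the MRRS field replaced."""
    field = {128 << i: i for i in range(6)}[mrrs_bytes]  # 128..4096 bytes
    return (devctl & 0xFFFF & ~MRRS_MASK) | (field << MRRS_SHIFT)


def write_landed(readback: int, wanted_bytes: int) -> bool:
    """True if the value read back after the write carries the wanted MRRS."""
    return decode_mrrs(readback) == wanted_bytes


before = 0x2936                      # hypothetical register content, MRRS = 512
wanted = encode_mrrs(before, 4096)   # 0x5936
print(decode_mrrs(before), hex(wanted))
print(write_landed(wanted, 4096))    # write took effect
print(write_landed(before, 4096))    # write was silently dropped
```

The last case is the situation the hardened readback is meant to catch: the register still holds the old value even though the write command itself reported no error.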
IGX Orin reviewer verification — Ran the PR branch on an IGX Orin DevKit with an NVIDIA RTX 4000 Ada Generation (discrete GPU).
No "integrated GPU detected" or "host_pinned" messages in the output. Remaining checks (CPU governors, MRRS, hugepages, GPU clocks, isolcpus, MTU) are unaffected by the PR and produced expected output for a partially-tuned IGX Orin system.
chloecrozier
left a comment
All checks passed when Ramya and I tested it, looks good.
I'm adding in a couple small commits to update docs and CLAUDE.md based on these changes. And then I'll merge it in!
Populate the previously empty client_address field in rdma_bench_client with 1.1.1.1, matching the client interface IP used elsewhere in the Spark config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Ramya Gurunathan <rgurunathan@nvidia.com>
Add a tip admonition under "Update the loopback configuration" pointing
DGX Spark users at the two ready-to-run Spark YAMLs
(daqiri_bench_raw_tx_rx_spark.yaml and daqiri_bench_rdma_tx_rx_spark.yaml)
so they can skip the manual placeholder edits below.
The admonition leads with the prerequisite ("for systems configured per
the DGX Spark profile") and calls out the asymmetry between the two
configs: the raw variant still needs eth_dst_addr filled in from the RX
interface MAC, while the rdma variant needs no further edits. Cross-
references the DGX Spark profile section in the system configuration
tutorial.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramya Gurunathan <rgurunathan@nvidia.com>
Surface the new Spark variants (daqiri_bench_raw_tx_rx_spark.yaml and daqiri_bench_rdma_tx_rx_spark.yaml) alongside their non-Spark counterparts in the benchmark table so they are discoverable from developer onboarding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Ramya Gurunathan <rgurunathan@nvidia.com>
Three tiny docs-related commits addressing follow-ups from review:
CI checks run locally before push ( …
Known gap not addressed here
The "Run the loopback test" section further down in …
Closes #45.
Summary
- Wraps docs/tutorials/system_configuration.md with top-of-page === "IGX Orin" / === "DGX Spark" content tabs (mkdocs-material content.tabs.link is already enabled). The IGX content is unchanged; the new Spark walkthrough covers CX-7 hotplug, the 4-PFs/2-chips/tied-ports topology with cable-yank diagnostic, why nvidia_peermem and DMA-BUF are unreachable on GB10, the grub drop-in / systemd-unit / nmcli profile patterns, and the steps that are physically N/A on a single-SoC integrated GPU.
- Adds two ready-to-run configs (daqiri_bench_raw_tx_rx_spark.yaml, daqiri_bench_rdma_tx_rx_spark.yaml) so Spark users don't have to read three docs to know which substitutions to make: PCIe 0000:01:00.0 / 0002:01:00.0, kind: host_pinned, cores from the isolated 16-19 set, IPs 1.1.1.1 / 2.2.2.2 to match the documented nmcli profiles.
- Updates docs/configuration.md and the walkthrough annotation #6 to recommend host_pinned on integrated GPUs (where device is not reachable), preserving the discrete-GPU guidance for everyone else.
- Extends CMAKE_CUDA_ARCHITECTURES from "80;90" to "80;90;121" (purely additive: A100/H100 builds keep sm_80;sm_90, GB10 builds gain sm_121). Syncs CLAUDE.md and docs/getting-started.md per the docs-sync rules.
- Updates python/tune_system.py for integrated GPUs: a new is_any_integrated_gpu() helper demotes the false-positive WARNs on the peermem / GPUDirect-RDMA / PIX-PXB-topology / BAR1 checks to INFO with a Spark-specific rationale; --set mrrs now reads the value back so a Secure-Boot-blocked write no longer reports success.

Anchor stability
Cross-page links from benchmarking_examples.md and configuration-walkthrough.md (#enable-gpudirect, #step-4-enable-huge-pages, #step-5-isolate-cpu-cores) keep landing on the IGX heading because mkdocs gives the first heading occurrence the bare slug; Spark duplicates get _1 suffixes, and Spark's own self-references (#step-4-enable-huge-pages-grub-drop-in-pattern, #configure-the-ip-addresses-of-the-nic-ports_1, etc.) point at those.

Out of scope
Container/Dockerfile changes, backend code, README updates (no backend or default-DAQIRI_MGR change).

Test plan
- mkdocs build --strict passes (anchors + nav clean).
- python -m py_compile python/tune_system.py clean.
- All commits carry Signed-off-by: (DCO).
- Run BASE_TARGET=dpdk DAQIRI_MGR="dpdk rdma socket" scripts/build-container.sh to confirm sm_121 doesn't trip the host nvcc.
- daqiri_bench_raw_gpudirect examples/daqiri_bench_raw_tx_rx_spark.yaml --seconds 5 should reach ~94 Gbps unicast over the QSFP loop after filling eth_dst_addr with the RX MAC.
- sudo python3 python/tune_system.py --check all should now report INFO (not WARNING) on the peermem / GPUDirect / topo / BAR1 lines.
- On a discrete-GPU system, the output of the same --check all command should be unchanged: is_any_integrated_gpu() returns False on dGPUs.

🤖 Generated with Claude Code