
#45 - Add DGX Spark (GB10) support to docs and examples #49

Merged
RamyaGuru merged 9 commits into main from docs/dgx-spark-system-tuning
May 6, 2026

Conversation

@dleshchev
Collaborator

@dleshchev dleshchev commented May 5, 2026

Closes #45.

Summary

  • Restructures docs/tutorials/system_configuration.md with top-of-page === "IGX Orin" / === "DGX Spark" content tabs (mkdocs-material content.tabs.link is already enabled). The IGX content is unchanged; the new Spark walkthrough covers CX-7 hotplug, the 4-PFs/2-chips/tied-ports topology with cable-yank diagnostic, why nvidia_peermem and DMA-BUF are unreachable on GB10, the grub drop-in / systemd-unit / nmcli profile patterns, and the steps that are physically N/A on a single-SoC integrated GPU.
  • Drops in two ready-to-run YAMLs (daqiri_bench_raw_tx_rx_spark.yaml, daqiri_bench_rdma_tx_rx_spark.yaml) so Spark users don't have to read three docs to know which substitutions to make: PCIe 0000:01:00.0 / 0002:01:00.0, kind: host_pinned, cores from the isolated 16-19 set, IPs 1.1.1.1/2.2.2.2 to match the documented nmcli profiles.
  • Updates docs/configuration.md and the configuration-walkthrough annotation #6 to recommend host_pinned on integrated GPUs (where device is not reachable), preserving the discrete-GPU guidance for everyone else.
  • Extends CMAKE_CUDA_ARCHITECTURES from "80;90" to "80;90;121" (purely additive — A100/H100 builds keep sm_80;sm_90, GB10 builds gain sm_121). Syncs CLAUDE.md and docs/getting-started.md per the docs-sync rules.
  • Hardens python/tune_system.py for integrated GPUs: a new is_any_integrated_gpu() helper demotes the false-positive WARNs on the peermem / GPUDirect-RDMA / PIX-PXB-topology / BAR1 checks to INFO with a Spark-specific rationale; --set mrrs now reads the value back so a Secure-Boot-blocked write no longer reports success.

Anchor stability

Cross-page links from benchmarking_examples.md and configuration-walkthrough.md (#enable-gpudirect, #step-4-enable-huge-pages, #step-5-isolate-cpu-cores) keep landing on the IGX heading because mkdocs gives the first heading occurrence the bare slug; Spark duplicates get _1 suffixes, and Spark's own self-references (#step-4-enable-huge-pages-grub-drop-in-pattern, #configure-the-ip-addresses-of-the-nic-ports_1, etc.) point at those.
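
The suffixing behavior described above can be illustrated with a small re-implementation. This is a simplified sketch of how Python-Markdown's toc extension (the mkdocs default) slugifies headings and de-duplicates ids. The heading titles are assumed for illustration and are not quoted from system_configuration.md, and a site that configures a custom slugify function may produce different ids.

```python
import re
import unicodedata


def slugify(title, separator="-"):
    """Simplified ASCII slug: drop punctuation, lowercase, join words with separator."""
    value = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[-\s]+", separator, value)


def assign_anchors(titles):
    """Give the first occurrence the bare slug; later duplicates get _1, _2, ..."""
    used = set()
    anchors = []
    for title in titles:
        base = slugify(title)
        candidate, n = base, 0
        while candidate in used:
            n += 1
            candidate = f"{base}_{n}"
        used.add(candidate)
        anchors.append(candidate)
    return anchors


# Assumed heading titles: the IGX tab comes first on the page, the Spark tab second.
headings = [
    "Enable GPUDirect",
    "Step 4: Enable Huge Pages",
    "Step 4: Enable Huge Pages",
]
print(assign_anchors(headings))
# ['enable-gpudirect', 'step-4-enable-huge-pages', 'step-4-enable-huge-pages_1']
```

Because ids are assigned in document order, existing links keep resolving to the IGX headings only as long as the IGX tab stays first on the page.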

Out of scope

Container/Dockerfile changes, backend code, README updates (no backend or default-DAQIRI_MGR change).

Test plan

  • mkdocs build --strict passes (anchors + nav clean).
  • Both new YAMLs parse with PyYAML.
  • python -m py_compile python/tune_system.py clean.
  • All 6 commits carry Signed-off-by: (DCO).
  • Reviewer: build inside the daqiri container with BASE_TARGET=dpdk DAQIRI_MGR="dpdk rdma socket" scripts/build-container.sh to confirm sm_121 doesn't trip the host nvcc.
  • Reviewer with Spark hardware: daqiri_bench_raw_gpudirect examples/daqiri_bench_raw_tx_rx_spark.yaml --seconds 5 should reach ~94 Gbps unicast over the QSFP loop after filling eth_dst_addr with the RX MAC.
  • Reviewer with Spark hardware: sudo python3 python/tune_system.py --check all should now report INFO (not WARNING) on the peermem / GPUDirect / topo / BAR1 lines.
  • Reviewer with IGX hardware: same --check all command should be unchanged — is_any_integrated_gpu() returns False on dGPUs.

🤖 Generated with Claude Code

dleshchev added 6 commits May 5, 2026 17:39
Wraps the existing IGX Orin content under a top-of-page mkdocs-material
content tab and adds a parallel "DGX Spark" tab with a full walkthrough
for the Grace Blackwell GB10 superchip platform.

The Spark tab calls out the platform deltas:
- CX-7 hotplug warning (cable required; PFs disappear without one)
- 4 PFs / 2 chips / tied chassis sockets, with the cable-yank diagnostic
- Skip nvidia_peermem (EINVAL on Grace) and the daqiri DMA-BUF path
  (CUDA reports DMA_BUF_SUPPORTED=0); use kind: host_pinned in YAML
- Skip GPU BAR1 sizing and the PIX/PXB topology requirement (N/A on
  unified-memory single-SoC integrated GPUs)
- grub drop-in pattern under /etc/default/grub.d/ for hugepages and
  isolcpus=16-19 (Spark composes the cmdline from drop-ins)
- systemd unit pattern for MRRS via raw setpci CAP_EXP+8.w
  (Secure Boot must be off)
- nmcli profile pattern that folds IP + MTU=9000 into one step

Anchor stability: existing cross-page links from
benchmarking_examples.md and configuration-walkthrough.md
(#enable-gpudirect, #step-4-enable-huge-pages, #step-5-isolate-cpu-cores)
keep landing on the IGX heading because mkdocs gives the first
heading occurrence the bare slug; Spark duplicates get _1 suffixes.
mkdocs --strict build passes.

Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
Drop in two ready-to-run YAML configs for the DGX Spark workstation
so users do not have to read three docs to know which substitutions
to make:

- daqiri_bench_raw_tx_rx_spark.yaml: kind: host_pinned (mandatory on
  GB10 since PR #41), PCIe 0000:01:00.0 / 0002:01:00.0 (mlx5_0 ↔
  mlx5_2 across the chassis QSFP loop), cpu_core 17/18 from the
  isolated big-cluster cores 16-19, ip_src/ip_dst matching the
  daqiri-tx / daqiri-rx nmcli profiles documented in the system
  configuration tutorial.

- daqiri_bench_rdma_tx_rx_spark.yaml: same IP/core conventions for
  the RDMA bench. host_pinned was already correct upstream, IPs swap
  from 10.100.x.x to 1.1.1.1 / 2.2.2.2 to match Spark's nmcli
  profiles.

The original templated *_tx_rx.yaml files are preserved unchanged.
eth_dst_addr remains a placeholder in the raw bench since the RX
MAC is per-system; the comment block tells the user how to populate it.
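
To make the core convention concrete, here is a minimal sketch that parses an isolcpus= value from a kernel command line and checks that the cores chosen in a config fall inside the isolated set. The function names are hypothetical and are not part of tune_system.py or daqiri; the sample command line is an assumption modeled on the isolcpus=16-19 setting documented for Spark.

```python
def parse_cpu_list(spec):
    """Expand a kernel-style CPU list such as '0,2-3,16-19' into a set of ints."""
    cores = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            lo, hi = token.split("-", 1)
            if lo.isdigit() and hi.isdigit():
                cores.update(range(int(lo), int(hi) + 1))
        elif token.isdigit():
            cores.add(int(token))
        # Non-numeric tokens (isolcpus flags such as 'managed_irq') are skipped.
    return cores


def isolated_cores(cmdline):
    """Return the cores named by isolcpus= on a kernel command line (empty set if absent)."""
    for arg in cmdline.split():
        if arg.startswith("isolcpus="):
            return parse_cpu_list(arg[len("isolcpus="):])
    return set()


def cores_not_isolated(wanted, cmdline):
    """Return the requested cores that are NOT covered by isolcpus=, sorted."""
    return sorted(set(wanted) - isolated_cores(cmdline))


# Assumed example command line; on a real system read /proc/cmdline instead.
sample = "quiet splash isolcpus=16-19 rcu_nocbs=16-19"
print(cores_not_isolated([17, 18], sample))  # [] : both cores are isolated
print(cores_not_isolated([3, 17], sample))   # [3] : core 3 is not isolated
```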

Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
…ence

The current memory_regions kind table flags host_pinned as "not
recommended for high-throughput RX/TX", which is true on discrete GPUs
but actively misleading on integrated GPUs (e.g. NVIDIA GB10 /
DGX Spark) where it is the only working path: nvidia_peermem returns
EINVAL by design and CUDA reports DMA_BUF_SUPPORTED=0 on the
unified-memory platform, so device is unreachable.

Update both the configuration reference and the
configuration-walkthrough annotation #6 to call out the integrated-GPU
recommendation while preserving the discrete-GPU guidance for users on
RTX 6000 Ada / A100 / H100 setups.

Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
Extend CMAKE_CUDA_ARCHITECTURES from "80;90" to "80;90;121" so the
default build targets the Grace Blackwell GB10 superchip (DGX Spark)
in addition to A100 / H100. Purely additive: A100/H100 builds keep
sm_80 and sm_90, GB10 builds gain sm_121, and downstream users can
still override via -DCMAKE_CUDA_ARCHITECTURES.

Sync the corresponding CLAUDE.md note in the Build & run section per
the docs-sync rules.

Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
Several --check assertions in tune_system.py were written for discrete
GPUs and emit WARN-level false positives on platforms like NVIDIA GB10
(DGX Spark) where the GPU is integrated and shares memory with the CPU.

Add an is_any_integrated_gpu() helper that queries
CU_DEVICE_ATTRIBUTE_INTEGRATED via the CUDA driver API. Thread it
through:

- check_peermem_kernel: when integrated, downgrade "module not loaded"
  to INFO with the host_pinned guidance, since peermem is not expected
  to load on Grace platforms.
- check_gpudirect_support: per-GPU; for an integrated GPU reporting
  GPUDirect-RDMA-supported=0 (expected on GB10), log INFO instead of
  WARNING.
- check_topology_connections: skip the PIX/PXB requirement and log INFO
  when any visible GPU is integrated; there is no separable PCIe path
  GPU↔NIC to optimize.
- check_bar1_size: skip and log INFO when integrated; no resizable
  BAR1 on unified memory.
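
The helper itself is not shown in this thread, so the following is only a sketch of how such a query could look using ctypes against the CUDA driver API. The attribute value 18 is CU_DEVICE_ATTRIBUTE_INTEGRATED from cuda.h; query_integrated_flags and missing_peermem_level are hypothetical names, and the optional flags argument exists only so the logic can be exercised on a machine without a GPU.

```python
import ctypes
import logging

CU_DEVICE_ATTRIBUTE_INTEGRATED = 18  # from the CUdevice_attribute enum in cuda.h


def query_integrated_flags():
    """Return one bool per visible GPU; an empty list if the driver is unavailable."""
    try:
        cuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return []
    if cuda.cuInit(0) != 0:
        return []
    count = ctypes.c_int(0)
    if cuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return []
    flags = []
    for ordinal in range(count.value):
        dev = ctypes.c_int(0)
        value = ctypes.c_int(0)
        if cuda.cuDeviceGet(ctypes.byref(dev), ordinal) != 0:
            continue
        if cuda.cuDeviceGetAttribute(
            ctypes.byref(value), CU_DEVICE_ATTRIBUTE_INTEGRATED, dev
        ) != 0:
            continue
        flags.append(value.value == 1)
    return flags


def is_any_integrated_gpu(flags=None):
    """True if at least one visible GPU reports the integrated attribute."""
    if flags is None:
        flags = query_integrated_flags()
    return any(flags)


def missing_peermem_level(integrated):
    """Pick the log level for 'peermem not loaded': INFO on integrated GPUs, else WARNING."""
    return logging.INFO if integrated else logging.WARNING


# Simulated flags so the example runs without CUDA hardware.
print(is_any_integrated_gpu([True]))   # GB10-like system
print(is_any_integrated_gpu([False]))  # discrete-GPU system
```

Returning an empty list when the driver cannot be loaded means the checks fall back to the discrete-GPU behavior, which matches the "unchanged on dGPU" expectation in the test plan.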

Also harden update_mrrs_for_nvidia_devices: read back after the write
and report an error if the high nibble didn't land. The most common
cause is kernel lockdown from Secure Boot, which makes setpci appear
to succeed.
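
The read-back idea can be sketched with pure bit arithmetic on the 16-bit PCIe Device Control register (CAP_EXP+8), where MRRS lives in bits 14:12 and encodes 128 << n bytes. This is an illustration under those assumptions, not the code in update_mrrs_for_nvidia_devices; the function names and the sample register value 0x2930 are hypothetical.

```python
MRRS_SHIFT = 12               # MRRS occupies bits 14:12 of Device Control
MRRS_MASK = 0x7 << MRRS_SHIFT


def mrrs_bytes(devctl):
    """Decode the Max Read Request Size in bytes from a 16-bit Device Control value."""
    return 128 << ((devctl & MRRS_MASK) >> MRRS_SHIFT)


def with_mrrs(devctl, size):
    """Return devctl with the MRRS field set to `size` bytes, other bits untouched."""
    codes = {128 << code: code for code in range(6)}  # 128, 256, ..., 4096
    if size not in codes:
        raise ValueError(f"unsupported MRRS: {size}")
    return (devctl & ~MRRS_MASK & 0xFFFF) | (codes[size] << MRRS_SHIFT)


def write_landed(read_back, wanted_size):
    """True only if the value read back after the write carries the requested MRRS."""
    return mrrs_bytes(read_back) == wanted_size


before = 0x2930                      # assumed register value with MRRS = 512
target = with_mrrs(before, 4096)     # value a tool would write
print(hex(target), mrrs_bytes(target))

# Simulated lockdown: the write is silently dropped, so the read-back equals `before`.
print(write_landed(before, 4096))    # False -> report an error instead of success
print(write_landed(target, 4096))    # True  -> the write really took effect
```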

Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
docs/getting-started.md's build-section table still listed CUDA
architectures as "80;90" (A100, H100). The CMakeLists.txt default and
CLAUDE.md note were updated to "80;90;121" in 0308c76 — bring
getting-started.md into line with the same wording so the docs-sync
rule (.claude/rules/docs-sync.md mapping for src/CMakeLists.txt) is
fully covered.

Signed-off-by: Denis Leshchev <dleshchev@nvidia.com>
@chloecrozier
Member

chloecrozier commented May 5, 2026

Reviewer Request 1

Build inside the daqiri container with BASE_TARGET=dpdk DAQIRI_MGR="dpdk rdma socket" scripts/build-container.sh to confirm sm_121 doesn't trip the host nvcc.

Result: PASS

The container build succeeded. To isolate the toolchain question (since the Dockerfile pins -DCMAKE_CUDA_ARCHITECTURES=all-major and bypasses the in-tree default), I also asked the in-container nvcc directly:

$ docker run --rm daqiri:local bash -c \
    'echo "__global__ void k(){}" > /tmp/k.cu && \
     nvcc -arch=sm_121 -c /tmp/k.cu -o /tmp/k.o && echo "sm_121 OK" || echo "sm_121 FAILED"'

CUDA Version 13.1.0
sm_121 OK

Reviewer Request 2

Reviewer with Spark hardware: daqiri_bench_raw_gpudirect examples/daqiri_bench_raw_tx_rx_spark.yaml --seconds 5 should reach ~94 Gbps unicast over the QSFP loop after filling eth_dst_addr with the RX MAC.

Result: PASS (with caveats)

Bench ran end-to-end on real GB10 hardware over the chassis QSFP loop. Zero drops on either direction.

| Direction  | Bytes          | Throughput |
|------------|----------------|------------|
| TX unicast | 55,622,173,824 | 88.99 Gbps |
| RX unicast | 55,705,853,952 | 89.13 Gbps |
| App RX     | 55,490,641,920 | 88.79 Gbps |

[INFO] Port 0:
  - Transmit packets:    6899712
  - Transmit bytes:      55639277568
  - Missed packets:      0
  - Errored packets:     0
    ** Extended Stats **
       tx_good_packets:        6907136
       tx_good_bytes:          55700692992
       tx_unicast_bytes:       55622173824
       tx_unicast_packets:     6897591
[INFO] Port 1:
  - Received packets:    6899407
  - Received bytes:      55636818048
  - Missed packets:      0
  - Errored packets:     0
    ** Extended Stats **
       rx_good_packets:        6911040
       rx_good_bytes:          55730626560
       rx_unicast_bytes:       55705853952
       rx_unicast_packets:     6907968
RX complete: packets=6881280 bytes=55490641920 bursts=672
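
As a cross-check, the reported throughput can be recomputed from the byte counters. The sketch below assumes the counters cover exactly the 5-second window requested with --seconds 5, so results can differ from the reported figures in the last digit. It also derives the per-packet size and the packets per burst from the counters above.

```python
def gbps(num_bytes, seconds):
    """Convert a byte count over a time window into gigabits per second."""
    return num_bytes * 8 / seconds / 1e9


RUN_SECONDS = 5  # assumption: counters span exactly the --seconds 5 window

counters = {
    "TX unicast": 55_622_173_824,
    "RX unicast": 55_705_853_952,
    "App RX": 55_490_641_920,
}

for name, num_bytes in counters.items():
    rate = gbps(num_bytes, RUN_SECONDS)
    # 94 Gbps is the target quoted in the reviewer request.
    print(f"{name}: {rate:.2f} Gbps ({rate / 94:.0%} of the ~94 Gbps target)")

# Packet size and burst size implied by the port and application counters.
print(55_639_277_568 // 6_899_712)  # bytes per transmitted packet -> 8064
print(6_881_280 // 672)             # packets per RX burst -> 10240
```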

Caveats

1. Throughput is ~95% of expected. Based on the tuning script output, this box is missing several tuning items the Spark walkthrough recommends (isolcpus, rcu_nocbs, and irqaffinity are not in the kernel cmdline; MRRS is 512, not 4096; idle SM clock is 208 MHz), so that can be fixed later.

# Reclaim daqiri/DPDK orphan hugepage files (run between bench iterations)
sudo rm -f /dev/hugepages/rtemap_* /dev/hugepages/*map_*

# Or, if no passwordless sudo, via any image already on disk:
docker run --rm -v /dev/hugepages:/dev/hugepages daqiri:local \
  sh -c 'rm -f /dev/hugepages/rtemap_* /dev/hugepages/*map_*'
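
For scripted bench loops, the same cleanup can be expressed in Python. This is a hedged sketch, not part of the repository: reclaim_hugepage_files is a hypothetical helper that uses the same two glob patterns as the rm command and defaults to a dry run so the matches can be inspected first. Deleting under /dev/hugepages still needs root, and, as with the shell version, it is meant to run between bench iterations.

```python
import glob
import os

PATTERNS = ("rtemap_*", "*map_*")  # same globs as the rm command above


def reclaim_hugepage_files(root="/dev/hugepages", dry_run=True):
    """Return the matching files under root; delete them only when dry_run is False."""
    matches = set()
    for pattern in PATTERNS:
        matches.update(glob.glob(os.path.join(root, pattern)))
    files = sorted(path for path in matches if os.path.isfile(path))
    if not dry_run:
        for path in files:
            os.remove(path)
    return files


if __name__ == "__main__":
    # Dry run by default: only list what would be removed.
    for path in reclaim_hugepage_files():
        print("would remove", path)
```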

Reviewer Request 3

Reviewer with Spark hardware: sudo python3 python/tune_system.py --check all should now report INFO (not WARNING) on the peermem / GPUDirect / topo / BAR1 lines.

Result: PASS

All four PR-targeted checks correctly flipped to INFO, each with a Spark-specific rationale. is_any_integrated_gpu() correctly identified the GB10 via CU_DEVICE_ATTRIBUTE_INTEGRATED == 1.

| # | Check | New output |
|---|-------|------------|
| 1 | check_bar1_size | INFO - BAR1 size check skipped: integrated GPU detected (unified memory). … |
| 2 | check_topology_connections | INFO - Skipping PIX/PXB topology requirement: integrated GPU detected. … |
| 3 | check_gpudirect_support | INFO - GPU 0: NVIDIA GB10 is integrated (unified memory). GPUDirect-RDMA-supported reported as 0 is expected … |
| 4 | check_peermem_kernel | INFO - nvidia-peermem module is not loaded, but the platform has an integrated GPU … |

Full output

$ sudo python3 python/tune_system.py --check all
2026-05-05 15:16:47 - INFO - CPU 0: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 1: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 2: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 3: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 4: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 5: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 6: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 7: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 8: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 9: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 10: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 11: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 12: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 13: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 14: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 15: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 16: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 17: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 18: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - INFO - CPU 19: Governor is correctly set to 'performance'.
2026-05-05 15:16:47 - WARNING - enp1s0f0np0/0000:01:00.0: MRRS is set to 512, not 4096.
2026-05-05 15:16:47 - WARNING - enp1s0f1np1/0000:01:00.1: MRRS is set to 512, not 4096.
2026-05-05 15:16:47 - WARNING - enP2p1s0f0np0/0002:01:00.0: MRRS is set to 512, not 4096.
2026-05-05 15:16:47 - WARNING - enP2p1s0f1np1/0002:01:00.1: MRRS is set to 512, not 4096.
2026-05-05 15:16:47 - INFO - enp1s0f0np0/0000:01:00.0: PCIe Max Payload Size is correctly set to 512 bytes (hardware maximum).
2026-05-05 15:16:47 - INFO - enp1s0f1np1/0000:01:00.1: PCIe Max Payload Size is correctly set to 512 bytes (hardware maximum).
2026-05-05 15:16:47 - INFO - enP2p1s0f0np0/0002:01:00.0: PCIe Max Payload Size is correctly set to 512 bytes (hardware maximum).
2026-05-05 15:16:47 - INFO - enP2p1s0f1np1/0002:01:00.1: PCIe Max Payload Size is correctly set to 512 bytes (hardware maximum).
2026-05-05 15:16:47 - INFO - HugePages_Total: 2048
2026-05-05 15:16:47 - INFO - HugePage Size: 2.00 MB
2026-05-05 15:16:47 - INFO - Total Allocated HugePage Memory: 4096.00 MB
2026-05-05 15:16:47 - INFO - Hugepages are sufficiently allocated with at least 500 MB.
2026-05-05 15:16:47 - WARNING - GPU 0: SM Clock is set to 208 MHz, but should be within 500 MHz of the 3003 MHz theoretical Max.
2026-05-05 15:16:47 - INFO - GPU 0: Memory clock reported as N/A (not applicable for this GPU).
2026-05-05 15:16:47 - INFO - BAR1 size check skipped: integrated GPU detected (unified memory). There is no resizable BAR1 to enlarge on platforms like GB10 / DGX Spark.
2026-05-05 15:16:47 - INFO - Skipping PIX/PXB topology requirement: integrated GPU detected. On single-SoC unified-memory platforms (e.g. GB10 / DGX Spark) there is no separable PCIe path GPU↔NIC to optimize.
2026-05-05 15:16:47 - WARNING - The kernel command line is missing 'isolcpus'. Please ensure it is configured.
2026-05-05 15:16:47 - WARNING - The kernel command line is missing 'rcu_nocbs'. Please ensure it is configured.
2026-05-05 15:16:47 - WARNING - The kernel command line is missing 'irqaffinity'. Please ensure it is configured.
2026-05-05 15:16:48 - INFO - Interface enp1s0f0np0 has an acceptable MTU of 9082 bytes.
2026-05-05 15:16:48 - WARNING - Interface enp1s0f1np1 has an MTU of 1500 bytes. If possible use larger frame sizes ( > 1518B) for better performance
2026-05-05 15:16:48 - INFO - Interface enP2p1s0f0np0 has an acceptable MTU of 8046 bytes.
2026-05-05 15:16:48 - WARNING - Interface enP2p1s0f1np1 has an MTU of 1500 bytes. If possible use larger frame sizes ( > 1518B) for better performance
2026-05-05 15:16:48 - INFO - GPU 0: NVIDIA GB10 is integrated (unified memory). GPUDirect-RDMA-supported reported as 0 is expected on this platform; use kind: host_pinned in the daqiri YAML.
2026-05-05 15:16:48 - INFO - nvidia-peermem module is not loaded, but the platform has an integrated GPU (e.g. GB10 / DGX Spark) where peermem does not apply. Use kind: host_pinned in the daqiri YAML for GPUDirect on this platform.
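
A reviewer who wants to confirm the levels mechanically rather than by eye could parse the output. The sketch below assumes the "timestamp - LEVEL - message" layout seen in the log above; parse_lines and level_of are hypothetical helpers, and the sample text reuses three lines from this run.

```python
import re
from collections import Counter

LINE_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - (\w+) - (.*)$")


def parse_lines(text):
    """Return (level, message) tuples for every line that matches the log layout."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            records.append((match.group(1), match.group(2)))
    return records


def level_of(records, needle):
    """Return the level of the first message containing `needle`, or None."""
    for level, message in records:
        if needle in message:
            return level
    return None


sample = """\
2026-05-05 15:16:47 - INFO - HugePages_Total: 2048
2026-05-05 15:16:47 - INFO - BAR1 size check skipped: integrated GPU detected (unified memory). There is no resizable BAR1 to enlarge on platforms like GB10 / DGX Spark.
2026-05-05 15:16:47 - WARNING - The kernel command line is missing 'isolcpus'. Please ensure it is configured.
"""

records = parse_lines(sample)
print(Counter(level for level, _ in records))
print(level_of(records, "BAR1 size check skipped"))  # INFO
print(level_of(records, "isolcpus"))                 # WARNING
```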

@RamyaGuru
Collaborator

IGX Orin reviewer verification — tune_system.py --check all unchanged on dGPU

Ran the PR branch on an IGX Orin DevKit with an NVIDIA RTX 4000 Ada Generation (discrete GPU).

sudo python3 python/tune_system.py --check all

is_any_integrated_gpu() returns False as expected. All four guarded checks ran their full discrete-GPU logic with no change from main:

| Check | Result | Notes |
|-------|--------|-------|
| check_peermem_kernel | INFO: peermem module is loaded | Original success path (loaded peermem via modprobe), not the new host_pinned info |
| check_gpudirect_support | INFO: RTX 4000 Ada has GPUDirect support | Normal success path |
| check_bar1_size | WARNING: BAR1 is 256 MiB | Full check ran, no early-return skip |
| check_topology_connections | INFO: GPU0 has PXB connection to NIC0 | Full topology check ran, no skip |

No "integrated GPU detected" or "host_pinned" messages in the output. Remaining checks (CPU governors, MRRS, hugepages, GPU clocks, isolcpus, MTU) are unaffected by the PR and produced expected output for a partially-tuned IGX Orin system.

Member

@chloecrozier chloecrozier left a comment

All checks passed when Ramya and I tested it, looks good.

@chloecrozier chloecrozier marked this pull request as ready for review May 6, 2026 14:50
@RamyaGuru
Collaborator

I'm adding in a couple small commits to update docs and CLAUDE.md based on these changes. And then I'll merge it in!

RamyaGuru and others added 3 commits May 6, 2026 10:57
Populate the previously empty client_address field in rdma_bench_client
with 1.1.1.1, matching the client interface IP used elsewhere in the
Spark config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramya Gurunathan <rgurunathan@nvidia.com>
Add a tip admonition under "Update the loopback configuration" pointing
DGX Spark users at the two ready-to-run Spark YAMLs
(daqiri_bench_raw_tx_rx_spark.yaml and daqiri_bench_rdma_tx_rx_spark.yaml)
so they can skip the manual placeholder edits below.

The admonition leads with the prerequisite ("for systems configured per
the DGX Spark profile") and calls out the asymmetry between the two
configs: the raw variant still needs eth_dst_addr filled in from the RX
interface MAC, while the rdma variant needs no further edits. Cross-
references the DGX Spark profile section in the system configuration
tutorial.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramya Gurunathan <rgurunathan@nvidia.com>
Surface the new Spark variants (daqiri_bench_raw_tx_rx_spark.yaml and
daqiri_bench_rdma_tx_rx_spark.yaml) alongside their non-Spark
counterparts in the benchmark table so they are discoverable from
developer onboarding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ramya Gurunathan <rgurunathan@nvidia.com>
@RamyaGuru
Collaborator

Three tiny docs-related commits addressing follow-ups from review:

  • 1ceb710 — Set client_address in DGX Spark RDMA bench config. The rdma_bench_client.client_address field in examples/daqiri_bench_rdma_tx_rx_spark.yaml was
    empty; populated with 1.1.1.1 to match the client interface IP used elsewhere in the same config. Without this, daqiri_bench_rdma falls back to whatever default the
    binary picks, which won't match the documented Spark profile.

  • 684c084 — Surface DGX Spark configs in the benchmarking tutorial. Adds a !!! tip "DGX Spark" admonition under "Update the loopback configuration" pointing readers
    at both Spark YAMLs. The admonition leads with the prerequisite ("for systems configured per the DGX Spark profile") and calls out the asymmetry between the two configs —
    the raw variant still needs eth_dst_addr filled in from the RX MAC; the rdma variant needs no further edits. Cross-links to the DGX Spark profile section in the system
    configuration tutorial.

  • 1950c55 — Add the new Spark YAMLs to the CLAUDE.md benchmark table. Two-line change so the Spark variants sit alongside their non-Spark counterparts
    (daqiri_bench_raw_tx_rx_spark.yaml next to daqiri_bench_raw_tx_rx.yaml, etc.), making them discoverable from developer onboarding.

CI checks run locally before push (mkdocs build --strict, scripts/check_html_links.py, scripts/check_doc_refs.py) — all green.

Known gap not addressed here

The "Run the loopback test" section further down in docs/tutorials/benchmarking_examples.md still hardcodes daqiri_bench_raw_tx_rx.yaml in its run commands (lines 144 /
152). A reader following the new Spark shortcut from the top of the section won't get a corresponding Spark-aware run command. Happy to add a follow-up commit converting
those into === "Standard" / === "DGX Spark" tabs if you'd like; left out of this batch since the gap is independent of the admonition itself.

@RamyaGuru RamyaGuru merged commit 2532bb6 into main May 6, 2026
2 checks passed
@dleshchev dleshchev deleted the docs/dgx-spark-system-tuning branch May 12, 2026 17:49

Development

Successfully merging this pull request may close these issues.

[DOC] Update system tuning tutorial with DGX Spark specific details

3 participants