#15 - Add C++ bench infrastructure for DGX Spark performance report by RamyaGuru · Pull Request #72 · NVIDIA/daqiri

RamyaGuru · 2026-05-11T21:20:43Z

Summary

Lays the C++ groundwork for the DGX Spark performance report (#15). Code and doc-skeleton only; per-cell numbers fill in via a follow-on commit on this same PR before it's marked ready-for-review.

Adds TokenBucketPacer (--target-gbps software pacer) and seconds= field on bench output across DPDK / RoCE / Socket benches. Zero src/ changes.
Adds examples/bench_capture_environment.sh (slow-moving system state snapshot) and examples/run_spark_bench.sh (sweep wrapper that runs the bench under mpstat + nvidia-smi dmon and emits one CSV row per cell).
Adds docs/performance-dgx-spark.md skeleton with full final structure: native-shape and matched 8 KB headline tables (so cross-backend Gbps comparisons are honest), per-backend sweep dimensions (DPDK packets / RoCE messages / TCP stream), and explicit "HDS deferred on GB10" note.
Wires the doc into mkdocs.yml, docs/index.html News card, README.md, and AGENTS.md drift-hotspot list.

Drop accounting routes through the wrapper (no manager changes): DPDK via DAQIRI_LOG_INFO parsing, RDMA via "CQ error" lines, Socket UDP via /proc/net/udp diff, Socket TCP via nstat.
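
For readers unfamiliar with the technique behind the --target-gbps pacer, the following is a minimal Python sketch of a byte-rate token bucket. It is not the PR's C++ TokenBucketPacer: the class name `TokenBucketSketch`, the `burst_bytes` and `slice_s` parameters, and the injectable `clock`/`sleep` hooks are all assumptions made for illustration. It does mirror two behaviors the review describes, namely that a zero target disables pacing and that the wait is sliced so a stop flag can interrupt it.

```python
import time


class TokenBucketSketch:
    """Hypothetical byte-rate token bucket; an illustration, not the PR's C++ class."""

    def __init__(self, target_gbps, burst_bytes=64 * 1024,
                 clock=time.monotonic, sleep=time.sleep):
        self.rate_bps = target_gbps * 1e9 / 8.0  # bytes per second; 0 means unpaced
        self.burst = float(burst_bytes)          # maximum credit that may accumulate
        self.tokens = float(burst_bytes)
        self.clock = clock
        self.sleep = sleep
        self.last = clock()

    def wait_for_bytes(self, nbytes, should_stop=lambda: False, slice_s=0.001):
        """Block until nbytes may be sent; return False if should_stop() fires first."""
        if self.rate_bps <= 0.0:
            return True  # pacing disabled
        cap = max(self.burst, float(nbytes))  # a request larger than the burst must still fit
        while True:
            now = self.clock()
            self.tokens = min(cap, self.tokens + (now - self.last) * self.rate_bps)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return True
            if should_stop():
                return False
            deficit_s = (nbytes - self.tokens) / self.rate_bps
            # Sleep in short slices so the stop check runs often; the 1 us floor
            # guarantees the clock advances even when the deficit is tiny.
            self.sleep(min(slice_s, max(deficit_s, 1e-6)))
```

A TX loop would call `wait_for_bytes(payload_size)` before each send and exit when it returns False. Capping the accumulated credit at the burst size means time lost to oversleeping is not repaid later, so a sketch like this tends to land slightly under the target rather than over it.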

Test plan

DPDK smoke (daqiri_bench_raw_gpudirect): TX/RX complete: lines have the new seconds= format; ~84 Gbps cross-chip on Spark, 0 drops.
Pacer accuracy: --target-gbps 1 holds 1.055 Gbps (within the +/-5% software pacer accuracy documented in the report).
Socket UDP smoke (daqiri_bench_socket): new sent_packets/recv_packets/sent_bytes/recv_bytes/seconds= format; iteration cap behavior preserved.
bench_capture_environment.sh captures kernel cmdline, CPU, NUMA, hugepages, mlx5 PCIe, OFED, NVIDIA GPU state, isolcpus, governor, IRQ.
run_spark_bench.sh dpdk smoke produces a valid CSV row with gbps~=96, drops=0, drops_kind tagged correctly.
scripts/check_doc_refs.py clean (every YAML covered by decision tree; every bench cross-referenced).
scripts/check_html_links.py clean after mkdocs build --strict.
RDMA loopback smoke is deferred — kernel routes local-to-local IPs through lo, requires a network namespace for single-host testing. Lands with the data-fill commit on this PR.
Data-fill commit (this PR, second commit): run run_spark_bench.sh {dpdk,rdma,socket-udp,socket-tcp} {sweep,drop-curve} and populate the C++ loopback cells in docs/performance-dgx-spark.md.

Follow-up PRs

#15 ([FEA] Benchmark DAQIRI on DGX Spark (C++)) workloads: introduces examples/bench_post_process.{h,cu} (cuFFT + cuBLAS post-process layer, examples-only, no library dependency change), refactors rx_count_worker to defer burst free until the post-process CUDA event signals, and fills FFT/GEMM cells.
#16 ([FEA] Benchmark DAQIRI on DGX Spark (Python)) Python loopback: daqiri_bench_rdma.py + daqiri_bench_socket.py mirroring this PR's CLI/output. Uses the existing pybind11 surface, with no new bindings.
#16 ([FEA] Benchmark DAQIRI on DGX Spark (Python)) Python workloads: pybind11 wrappers for the post-process layer introduced by the #15 workloads PR; completes Spark v1.

🤖 Generated with Claude Code

Lays the groundwork for the DGX Spark performance report (issue #15). Code and doc-skeleton only; per-cell numbers fill in via a follow-on commit on the same PR.

Bench changes (examples/ only, no src/ touched):
- raw_bench_common: add TokenBucketPacer (--target-gbps software pacer) and parse_target_gbps helper; emit seconds= in the shared rx_count_worker print; build the print in a stringstream so concurrent RX/TX worker output doesn't interleave on stdout.
- raw_gpudirect_bench: wire pacer into the TX worker; track packet/byte counts; print TX complete: with seconds=.
- rdma_bench: pacer for SEND path only; split send/recv completion counts and bytes per role; seconds= in server/client complete lines.
- socket_bench: pacer; make iteration cap opt-in (iterations <= 0 means time-bounded by --seconds); sent_bytes/recv_bytes tracking.
- CMakeLists: link raw_bench_common.cpp + CUDA::cudart into rdma/socket bench targets now that they consume the pacer helper.

Drop accounting (without modifying managers, per zero-src/ constraint):
- DPDK: parsed from DAQIRI_LOG_INFO output of PrintDpdkStats by wrapper.
- RDMA: parsed from "CQ error" lines in DAQIRI_LOG_ERROR by wrapper. Bench-side CQE counting isn't possible -- the manager filters error completions before they reach get_rx_burst.
- Socket UDP: kernel drops via /proc/net/udp diff by wrapper.
- Socket TCP: nstat retrans/inerrs diff by wrapper (TCP has no clean "drops" semantic).

New tooling:
- examples/bench_capture_environment.sh: snapshots uname, kernel cmdline, CPU/NUMA, hugepages, NIC/PCIe state, OFED, GPU, governor, isolcpus, IRQ affinity, git rev once per result set.
- examples/run_spark_bench.sh: sweep wrapper (smoke|sweep|drop-curve modes) per backend; runs bench under mpstat + nvidia-smi dmon; computes pps/Gbps in one place from packets=/bytes=/seconds= stdout fields; writes one CSV row per cell.

Documentation:
- docs/performance-dgx-spark.md: complete skeleton with TBD cells. Two headline tables (native-shape peak + matched 8 KB op size) so cross-backend comparison is honest. Per-backend sweep dimensions table addresses the DPDK packets / RoCE messages / TCP stream semantic mismatch. Documents that HDS is deferred on GB10 (host_pinned collapses the HDS vs. plain GPUDirect distinction).
- mkdocs.yml, docs/index.html, README.md, AGENTS.md: wire the new doc into nav, landing-page News card, README Documentation table, AGENTS Documentation section + drift-hotspot list.
- .gitignore: bench-results/ (wrapper output).

Verified on a DGX Spark inside the project container: DPDK loopback smoke test produces correctly-formatted TX/RX complete: lines at ~84 Gbps cross-chip; pacer holds 1.055 Gbps when --target-gbps 1 is set (within the +/-5% software pacer accuracy documented in the report); socket UDP smoke produces the new sent_packets/recv_packets/seconds output format; wrapper produces a clean CSV row with non-zero gbps and drops=0. RDMA single-host smoke is deferred -- kernel routes local-to-local IPs through lo (not the RoCE NICs), which requires a network namespace to test on one host; that setup lands with the data-fill commit.

Planned follow-up PRs that build on this one:
- This PR (#15), second commit: run the full sweep via run_spark_bench.sh and populate the C++ loopback cells in docs/performance-dgx-spark.md. Lands before this PR is marked ready-for-review.
- PR for #15 workloads: introduce examples/bench_post_process.{h,cu} (cuFFT + cuBLAS post-process layer, examples-only -- no library dependency change), refactor rx_count_worker to defer burst free until the post-process CUDA event signals, and fill FFT/GEMM cells in the performance doc.
- PR for #16 (Python loopback): add daqiri_bench_rdma.py and daqiri_bench_socket.py mirroring the C++ benches' --target-gbps and stdout format; reuses the existing pybind11 surface, no new bindings. Fills Python loopback cells.
- PR for #16 workloads: pybind11 wrappers for the PR 2 post-process layer in python/bench_post_process_pybind.cpp; fills Python FFT/GEMM cells and completes the Spark v1 report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
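
The commit message says the wrapper derives pps and Gbps from the packets=/bytes=/seconds= fields on the bench's completion lines. A minimal Python sketch of that arithmetic follows. The wrapper itself is a shell script, so the helper names `parse_complete_line` and `rates` are hypothetical, and the sample line carries made-up numbers rather than measured results.

```python
import re

# Matches key=value tokens whose value is a plain decimal number.
FIELD_RE = re.compile(r"(\w+)=(\d+(?:\.\d+)?)")


def parse_complete_line(line):
    """Return {field_name: float} for every key=number token on one line."""
    return {key: float(value) for key, value in FIELD_RE.findall(line)}


def rates(packets, nbytes, seconds):
    """Return (packets per second, gigabits per second); zeros if no time elapsed."""
    if seconds <= 0:
        return 0.0, 0.0
    return packets / seconds, nbytes * 8 / seconds / 1e9


# Illustrative line only; the counts are invented for the example.
line = "TX complete: packets=45000000 bytes=360000000000 seconds=30.0"
fields = parse_complete_line(line)
pps, gbps = rates(fields["packets"], fields["bytes"], fields["seconds"])
print(f"pps={pps:.0f} gbps={gbps:.1f}")
```

Computing both rates from the same three fields in one place keeps every backend's CSV row on the same definition of Gbps (payload bytes times eight, divided by elapsed seconds).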

Fills the DPDK GPUDirect numbers (sweep, drop curve, headline tables) into docs/performance-dgx-spark.md from the 2026-05-12 bench runs, and adds the supporting infra for ongoing data fills:
- New examples/run_spark_bench.sh mode `drop-curve-matrix` sweeps payload x target_gbps at the headline batch (feeds an upcoming payload x target_gbps heatmap, distinct from the existing 1D drop curve).
- New scripts/spark_data_fill.sh one-shot driver runs the DPDK + socket sweep and drop-curve modes back-to-back, with pre-flight checks and orphan-hugepage cleanup between runs. RDMA is deferred from PR 1 (single-host loopback over the cable needs a netns + two-process refactor; tracked separately).
- docs/stylesheets/extra.css adds a .perf-matrix heatmap (green / yellow / red cells) used by the payload x batch matrix and upcoming target_gbps matrix.
- mkdocs.yml enables the footnotes extension so the deferred-row footnotes ([^1]/[^2]/[^3]) render.
- Container snippet in the perf doc now auto-injects ETH_DST_ADDR (and RX_IFACE) from the host so the DPDK benches just work after `docker run`.

RoCE and socket rows in the headline tables are marked deferred with footnotes; the corresponding follow-up issues (drafts at claude_plans/daqiri-pr15-spark-bench-followups.md) are:
[^1] RoCE single-host loopback shortcuts through `lo` because both endpoints live in the root netns, so the QSFP cable carries no traffic; fix is an examples-only two-netns + two-process orchestration in run_spark_bench.sh.
[^2] Socket UDP --mode both deadlocks on peer learning: both ends spin send-then-receive in one process, the server never learns a peer, and only ~1000 packets / 30 s trickle through under a flood of "no learned peer" ERROR spam.
[^3] Socket TCP --mode both aborts immediately after the second accept with a glibc malloc.c:2599 (sysmalloc) heap-integrity assertion, likely a double-free or OOB write in the TCP socket-mgr init path.

Issue numbers will be back-filled into the footnotes once filed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>

`python/tune_system.py` writes a PCIe topology schematic to `pcie_schematic.png` in the working directory by default (introduced in #61). Anyone who runs the tuner from the repo root then has it sitting untracked in `git status` forever. Ignore the default path so it stops showing up in working-tree status. Custom output paths passed via `--output` are unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>

Adds the payload x target_gbps heatmap to the DPDK GPUDirect results (5 payloads x 8 target rates, batch=10240) from the 2026-05-13 drop-curve-matrix run. Coloring is relative to the effective target min(target_gbps, 96 Gbps link cap) so the heatmap reads as "does the configuration sustain the requested rate?" rather than "what's the absolute peak?". The 8000 B / 4096 B rows are all green; 1024 B is green-except-unpaced (where the master core peaks at 93%); 256 B and 64 B turn red once target crosses the PPS ceiling.

Restructures the report around six top-level sections to keep backend results scannable as more platforms and workloads land:
- Summary (headline tables; unchanged content, just stays at top)
- Introduction (System under test + Methodology, demoted to H3)
- C++ Results (DPDK / RoCE / Socket / Workload variants, demoted to H3; their per-cell subsections demoted to H4; workload variants renamed to "DPDK GPUDirect - FFT/GEMM" and "RoCE - FFT/GEMM" so anchors don't collide with the C++ Results backend headings)
- Python Results (renamed from "Python results"; unchanged otherwise)
- Reproduce these results (renamed from "Reproducibility appendix"; same content)
- TODO: Not Yet Implemented / Known Limitations (moved from below Summary to the bottom; relabeled to make clear these are pending items, not platform constraints)

Adds a compact in-page Contents list under the intro paragraphs linking to each top-level section. The Material sidebar TOC still works for fine-grained navigation; the inline list is for orienting the reader on first arrival.

Plain-text references to "Known limitations" updated to "TODO / Known Limitations" with anchor links where appropriate (RoCE / Socket deferred-results subsections, one-shot driver note).

Also tunes the .perf-matrix CSS so the new 9-column matrix fits within the content area without text overflow: width 100%, table-layout: auto, white-space: nowrap on cells, slightly tighter padding. The narrower 5-column payload x batch matrix continues to render fine under the same rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
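
The heatmap rule described above, coloring each cell against min(target_gbps, 96 Gbps link cap), can be sketched in Python as follows. Only the 96 Gbps cap and the min() rule come from the commit message. The function names, the green/yellow thresholds (0.95 and 0.80), and the convention that a target of 0 means "unpaced" are assumptions chosen for illustration, not values taken from the report.

```python
LINK_CAP_GBPS = 96.0  # link cap quoted in the commit message


def effective_target(target_gbps):
    """Rate a cell is judged against.

    Assumption: target_gbps <= 0 denotes an unpaced run, which is judged
    against the link cap itself.
    """
    if target_gbps <= 0:
        return LINK_CAP_GBPS
    return min(target_gbps, LINK_CAP_GBPS)


def cell_color(achieved_gbps, target_gbps, green_at=0.95, yellow_at=0.80):
    """Classify a cell by achieved / effective target.

    The two thresholds are hypothetical; the report may use different cut-offs.
    """
    ratio = achieved_gbps / effective_target(target_gbps)
    if ratio >= green_at:
        return "green"
    if ratio >= yellow_at:
        return "yellow"
    return "red"
```

Judging against the effective target rather than the raw request means a 100 Gbps request that achieves the link cap still reads as green, which matches the stated intent of answering "does the configuration sustain the requested rate?".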

greptile-apps · 2026-05-13T21:54:19Z

Greptile Summary

Adds the C++ benchmarking infrastructure for the DGX Spark performance report: a TokenBucketPacer class shared across all three bench executables (DPDK, RoCE, Socket), seconds= timing on all completion lines, and two new shell scripts (run_spark_bench.sh sweep wrapper, bench_capture_environment.sh environment snapshot). The doc skeleton for docs/performance-dgx-spark.md is wired into mkdocs.yml, the index page, and README.md; DPDK GPUDirect data is fully populated.

TokenBucketPacer and seconds= output are added to raw_bench_common.{h,cpp} and consumed uniformly by all three bench binaries; raw_bench_common.cpp is linked into the rdma and socket targets via CMakeLists.txt.
run_spark_bench.sh drives the sweep and drop-curve matrices, captures per-run CPU/GPU counters, and emits one CSV row per cell; two parsing bugs affect socket backends (hex mis-decode of a decimal /proc/net/udp column and RDMA field names used in the socket fallback).
Socket and RoCE results are explicitly deferred with documented follow-up issues; the DPDK GPUDirect section is complete with drop-curve, payload×target, and payload×batch matrices.

Confidence Score: 3/5

Safe to merge for the DPDK path; two shell-script parsing bugs will silently produce all-zero CSV rows for socket backends once those bench bugs are fixed.

The C++ changes are clean and zero src/ changes are made. The DPDK data in the performance doc is fully populated and the sweep/drop-curve infrastructure works correctly for DPDK. The two issues are both in run_spark_bench.sh's output-parsing logic for socket backends: the /proc/net/udp drops column is a decimal integer but is fed through strtonum("0x" …), inflating any non-zero drop count; and the socket bench's completion field names (sent_packets, sent_bytes) don't match the RDMA-named fallback the script uses (send_completions, send_bytes), so every socket CSV row would show 0 Gbps. Neither bug affects the DPDK results already in the doc, but both would corrupt socket data when the deferred socket fixes land on this same PR.

examples/run_spark_bench.sh — the snapshot_proc_net_udp hex-decode and the socket/RDMA field-name mismatch in the third fallback block

Important Files Changed

Filename	Overview
examples/run_spark_bench.sh	New sweep wrapper; two parsing bugs: /proc/net/udp drops column decoded as hex instead of decimal, and socket bench field names (sent_packets/sent_bytes) don't match the RDMA-named fallback (send_completions/send_bytes), causing all-zero socket CSV rows.
examples/raw_bench_common.cpp	Adds TokenBucketPacer implementation and seconds= timing to rx_count_worker. Sliced-sleep approach correctly handles stop-flag interruptibility; ostringstream trick for atomic stdout write is sound.
examples/raw_bench_common.h	Adds TokenBucketPacer class declaration with correct in-class initializers; default constructor is safe because wait_for_bytes() short-circuits when target_bps_==0.0.
examples/raw_gpudirect_bench.cpp	Wires TokenBucketPacer into tx_worker, adds per-worker timing, and emits structured TX complete line; changes are minimal and correct.
examples/rdma_bench.cpp	Adds pacer, byte counters, and structured completion lines. Uses wall-clock secs from main() (post-join) consistently for both server and client.
examples/socket_bench.cpp	Adds pacer, byte tracking, and structured Client/Server completion lines. Field names (sent_packets/sent_bytes) are inconsistent with the run_spark_bench.sh parsing fallback, but the C++ output itself is correct.
examples/bench_capture_environment.sh	New environment snapshot script; run_section handles missing commands gracefully; cat_section uses deliberate unquoted glob expansion for multi-path sections.
scripts/spark_data_fill.sh	Driver loop with pre-flight, hugepage cleanup, and PIPESTATUS-aware exit capture. RDMA explicitly rejected; socket backends will inherit the run_spark_bench.sh parsing bugs.
docs/performance-dgx-spark.md	Performance report skeleton with full data for DPDK GPUDirect; deferred sections clearly documented with footnotes and known-limitations section.
examples/CMakeLists.txt	Links raw_bench_common.cpp and CUDA::cudart into rdma and socket bench targets to support the new TokenBucketPacer and chrono usage.

Sequence Diagram

sequenceDiagram
    participant Fill as spark_data_fill.sh
    participant Wrapper as run_spark_bench.sh
    participant Env as bench_capture_environment.sh
    participant Bench as daqiri_bench_*
    participant CSV as runs.csv

    Fill->>Fill: preflight checks (hugepages, binary, carrier)
    Fill->>Wrapper: run_backend_mode(backend, sweep)
    Wrapper->>Env: capture environment.txt once per result set
    loop each (payload x batch x target_gbps) cell
        Wrapper->>Wrapper: generate_yaml(payload, batch)
        Wrapper->>Wrapper: snapshot_cpu_stat before
        Wrapper->>Bench: bench_bin yaml --seconds 30 [--target-gbps G] [--mode both]
        Note over Bench: TokenBucketPacer paces TX worker
        Bench-->>Wrapper: "stdout (complete lines) + stderr (DAQIRI_LOG_*)"
        Wrapper->>Wrapper: snapshot_cpu_stat after
        Wrapper->>Wrapper: extract pkts/bytes/secs from stdout
        Wrapper->>Wrapper: parse drops per backend
        Wrapper->>CSV: append CSV row
    end
    Fill->>Fill: clean_orphan_hugepages between modes

Reviews (1): Last reviewed commit: "#15 - Add payload x target_gbps matrix a..."

greptile-apps · 2026-05-13T21:54:22Z

+snapshot_proc_net_udp() {
+  awk 'NR>1 { sum += strtonum("0x" $13) } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0
+}


The drops column at field 13 in /proc/net/udp is printed by the kernel as a plain decimal integer (%d/%u in udp4_format_sock). Wrapping it in strtonum("0x" …) interprets it as hexadecimal, so any drop count ≥ 10 is inflated — e.g., 10 drops becomes 16, 100 drops becomes 256. The before/after delta subtraction preserves the error, so the socket-udp CSV rows will carry wrong drop counts the moment UDP drops start occurring during the drop-curve sweep.

Suggested change

 snapshot_proc_net_udp() {
-  awk 'NR>1 { sum += strtonum("0x" $13) } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0
+  awk 'NR>1 { sum += $13+0 } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0
 }
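
To make the inflation concrete, here is a small Python sketch that sums the 13th whitespace-separated field of a /proc/net/udp-style snapshot under both interpretations. The sample text is synthetic (two invented sockets with 10 and 100 drops) and `sum_drops` is a hypothetical helper, not part of the wrapper.

```python
# Synthetic snapshot laid out like /proc/net/udp; the values are invented.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
  100: 00000000:14E9 00000000:0000 07 00000000:00000000 00:00000000 00000000   0        0 12345 2 0000000000000000 10
  101: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000   0        0 12346 2 0000000000000000 100
"""


def sum_drops(text, base):
    """Sum field 13 of every data row, parsing it in the given numeric base."""
    total = 0
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 13:
            total += int(fields[12], base)
    return total


print(sum_drops(SAMPLE, 10))  # decimal reading: 10 + 100
print(sum_drops(SAMPLE, 16))  # hex reading: 0x10 + 0x100
```

With this sample the decimal reading gives 110 and the hex reading gives 272, so a before/after difference computed from hex-decoded snapshots would overstate the drops as well.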

greptile-apps · 2026-05-13T21:54:23Z

+  if [[ -z "$pkts" ]]; then
+    # RDMA prints "Client/Server complete: ... send_completions=N send_bytes=N seconds=S"
+    pkts="$(extract_field 'Client complete' send_completions "$stdout")"
+    bytes="$(extract_field 'Client complete' send_bytes "$stdout")"
+    secs="$(extract_field 'Client complete' seconds "$stdout")"
+  fi


Socket field-name mismatch in third fallback

socket_bench.cpp emits sent_packets and sent_bytes, but this block queries the RDMA field names send_completions and send_bytes. Neither will match the socket bench's stdout, so pkts and bytes both fall through to the :-0 default. Every socket-udp and socket-tcp CSV row will report packets=0, bytes=0, pps=0, gbps=0 regardless of how the bench actually performed — this would silently produce a zero-filled data-fill for socket backends once those bugs are resolved.
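
One way to avoid this class of mismatch is to try each backend's field names explicitly and to report "not found" instead of defaulting to zero. The Python sketch below illustrates the idea under assumptions: `extract_field` here only mimics the role of the wrapper's shell helper of the same name, `extract_counts` is hypothetical, and the "Client complete" prefix for the socket bench is assumed from the review's description rather than confirmed from the bench output.

```python
import re

# (line prefix, packet-count field, byte-count field) per backend.
# The socket prefix is an assumption; the field names come from the review.
CANDIDATES = [
    ("TX complete", "packets", "bytes"),                     # DPDK
    ("Client complete", "send_completions", "send_bytes"),   # RDMA
    ("Client complete", "sent_packets", "sent_bytes"),       # Socket
]


def extract_field(prefix, name, stdout):
    """Return the text after name= on the first line starting with prefix, else None."""
    pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"=(\S+)")
    for line in stdout.splitlines():
        if line.startswith(prefix):
            match = pattern.search(line)
            if match:
                return match.group(1)
    return None


def extract_counts(stdout):
    """Return (packets, bytes) from the first candidate that fully matches, else None."""
    for prefix, pkt_name, byte_name in CANDIDATES:
        pkts = extract_field(prefix, pkt_name, stdout)
        nbytes = extract_field(prefix, byte_name, stdout)
        if pkts is not None and nbytes is not None:
            return int(pkts), int(nbytes)
    return None  # caller should treat this as a parse failure, not as zero traffic
```

Returning None rather than 0 lets the caller distinguish "the bench moved no data" from "the parser did not recognize the output", which is the distinction the zero-filled CSV rows would otherwise hide.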

RamyaGuru and others added 3 commits May 13, 2026 09:44

RamyaGuru force-pushed the 15-bench-spark-cpp-loopback branch from cca82c5 to c5fffd5 Compare May 13, 2026 13:44

greptile-apps Bot reviewed May 13, 2026

This was referenced May 15, 2026

#75 - Fix Socket UDP --mode both; error spam impacts startup #85

Closed

[BUG] Socket UDP --mode both deadlocks on peer learning; only ~1000 packets / 30 s trickle through #75

Closed

RamyaGuru wants to merge 4 commits into main from 15-bench-spark-cpp-loopback
