#15 - Add C++ bench infrastructure for DGX Spark performance report #72
RamyaGuru wants to merge 4 commits
Lays the groundwork for the DGX Spark performance report (issue #15).
Code and doc-skeleton only; per-cell numbers fill in via a follow-on
commit on the same PR.

Bench changes (examples/ only, no src/ touched):
- raw_bench_common: add TokenBucketPacer (--target-gbps software
  pacer) and parse_target_gbps helper; emit seconds= in the shared
  rx_count_worker print; build the print in a stringstream so
  concurrent RX/TX worker output doesn't interleave on stdout.
- raw_gpudirect_bench: wire pacer into the TX worker; track
  packet/byte counts; print TX complete: with seconds=.
- rdma_bench: pacer for SEND path only; split send/recv completion
  counts and bytes per role; seconds= in server/client complete lines.
- socket_bench: pacer; make iteration cap opt-in (iterations <= 0
  means time-bounded by --seconds); sent_bytes/recv_bytes tracking.
- CMakeLists: link raw_bench_common.cpp + CUDA::cudart into
  rdma/socket bench targets now that they consume the pacer helper.

Drop accounting (without modifying managers, per zero-src/ constraint):
- DPDK: parsed from DAQIRI_LOG_INFO output of PrintDpdkStats by
  wrapper.
- RDMA: parsed from "CQ error" lines in DAQIRI_LOG_ERROR by wrapper.
  Bench-side CQE counting isn't possible -- the manager filters error
  completions before they reach get_rx_burst.
- Socket UDP: kernel drops via /proc/net/udp diff by wrapper.
- Socket TCP: nstat retrans/inerrs diff by wrapper (TCP has no clean
  "drops" semantic).

New tooling:
- examples/bench_capture_environment.sh: snapshots uname, kernel
  cmdline, CPU/NUMA, hugepages, NIC/PCIe state, OFED, GPU, governor,
  isolcpus, IRQ affinity, git rev once per result set.
- examples/run_spark_bench.sh: sweep wrapper (smoke|sweep|drop-curve
  modes) per backend; runs bench under mpstat + nvidia-smi dmon;
  computes pps/Gbps in one place from packets=/bytes=/seconds= stdout
  fields; writes one CSV row per cell.

Documentation:
- docs/performance-dgx-spark.md: complete skeleton with TBD cells.
  Two headline tables (native-shape peak + matched 8 KB op size) so
  cross-backend comparison is honest. Per-backend sweep dimensions
  table addresses the DPDK packets / RoCE messages / TCP stream
  semantic mismatch. Documents that HDS is deferred on GB10
  (host_pinned collapses the HDS vs. plain GPUDirect distinction).
- mkdocs.yml, docs/index.html, README.md, AGENTS.md: wire the new doc
  into nav, landing-page News card, README Documentation table, AGENTS
  Documentation section + drift-hotspot list.
- .gitignore: bench-results/ (wrapper output).

Verified on a DGX Spark inside the project container: DPDK loopback
smoke test produces correctly-formatted TX/RX complete: lines at
~84 Gbps cross-chip; pacer holds 1.055 Gbps when --target-gbps 1 is
set (within the +/-5% software pacer accuracy documented in the
report); socket UDP smoke produces the new
sent_packets/recv_packets/seconds output format; wrapper produces a
clean CSV row with non-zero gbps and drops=0. RDMA single-host smoke
is deferred -- the kernel routes local IPs through lo (not the RoCE
NICs), so testing on one host requires a network namespace; that
setup lands with the data-fill commit.

Planned follow-up PRs that build on this one:
- This PR (#15), second commit: run the full sweep via
  run_spark_bench.sh and populate the C++ loopback cells in
  docs/performance-dgx-spark.md. Lands before this PR is marked
  ready-for-review.
- PR for #15 workloads: introduce examples/bench_post_process.{h,cu}
  (cuFFT + cuBLAS post-process layer, examples-only -- no library
  dependency change), refactor rx_count_worker to defer burst free
  until the post-process CUDA event signals, and fill FFT/GEMM cells
  in the performance doc.
- PR for #16 (Python loopback): add daqiri_bench_rdma.py and
  daqiri_bench_socket.py mirroring the C++ benches' --target-gbps and
  stdout format; reuses the existing pybind11 surface, no new
  bindings. Fills Python loopback cells.
- PR for #16 workloads: pybind11 wrappers for the PR 2 post-process
  layer in python/bench_post_process_pybind.cpp; fills Python FFT/GEMM
  cells and completes the Spark v1 report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
Fills the DPDK GPUDirect numbers (sweep, drop curve, headline tables)
into docs/performance-dgx-spark.md from the 2026-05-12 bench runs, and
adds the supporting infra for ongoing data fills:
- New examples/run_spark_bench.sh mode `drop-curve-matrix` sweeps
payload x target_gbps at the headline batch (feeds an upcoming
payload x target_gbps heatmap, distinct from the existing 1D drop
curve).
- New scripts/spark_data_fill.sh one-shot driver runs the DPDK +
socket sweep and drop-curve modes back-to-back, with pre-flight
checks and orphan-hugepage cleanup between runs. RDMA is deferred
from PR 1 (single-host loopback over the cable needs a netns +
two-process refactor; tracked separately).
- docs/stylesheets/extra.css adds a .perf-matrix heatmap (green /
yellow / red cells) used by the payload x batch matrix and
upcoming target_gbps matrix.
- mkdocs.yml enables the footnotes extension so the deferred-row
footnotes ([^1]/[^2]/[^3]) render.
- Container snippet in the perf doc now auto-injects ETH_DST_ADDR
(and RX_IFACE) from the host so the DPDK benches just work after
`docker run`.
RoCE and socket rows in the headline tables are marked deferred with
footnotes; the corresponding follow-up issues (drafts at
claude_plans/daqiri-pr15-spark-bench-followups.md) are:
[^1] RoCE single-host loopback shortcuts through `lo` because both
endpoints live in the root netns, so the QSFP cable carries no
traffic; fix is an examples-only two-netns + two-process
orchestration in run_spark_bench.sh.
[^2] Socket UDP --mode both deadlocks on peer learning — both ends
spin send-then-receive in one process, the server never learns
a peer, and only ~1000 packets / 30 s trickle through under a
flood of "no learned peer" ERROR spam.
[^3] Socket TCP --mode both aborts immediately after the second
accept with a glibc malloc.c:2599 (sysmalloc) heap-integrity
assertion — likely a double-free or OOB write in the TCP
socket-mgr init path.
Issue numbers will be back-filled into the footnotes once filed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
`python/tune_system.py` writes a PCIe topology schematic to
`pcie_schematic.png` in the working directory by default (introduced
in #61). Anyone who runs the tuner from the repo root then has it
sitting untracked in `git status` forever. Ignore the default path so
it stops showing up in working-tree status. Custom output paths
passed via `--output` are unaffected.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
cca82c5 to c5fffd5
Adds the payload x target_gbps heatmap to the DPDK GPUDirect results
(5 payloads x 8 target rates, batch=10240) from the 2026-05-13
drop-curve-matrix run. Coloring is relative to the effective target
min(target_gbps, 96 Gbps link cap) so the heatmap reads as "does the
configuration sustain the requested rate?" rather than "what's the
absolute peak?". The 8000 B / 4096 B rows are all green; 1024 B is
green-except-unpaced (where the master core peaks at 93%); 256 B and
64 B turn red once target crosses the PPS ceiling.
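The "effective target" rule above can be sketched as a small shell helper. This is a hypothetical illustration, not code from the PR: the function names are invented here, the 96 Gbps cap is the figure quoted in this commit message, and the actual green/yellow/red thresholds are not stated in the source, so the sketch stops at the sustained fraction. It handles numeric targets only; how the unpaced column maps to a target is not covered.

```shell
# Hypothetical helpers (names are assumptions, not part of the PR).
LINK_CAP_GBPS=96   # link cap quoted in the commit message

# effective_target TARGET_GBPS -> min(TARGET_GBPS, LINK_CAP_GBPS)
effective_target() {
  awk -v t="$1" -v cap="$LINK_CAP_GBPS" \
    'BEGIN { e = (t + 0 < cap + 0) ? t + 0 : cap + 0; print e }'
}

# sustained_fraction ACHIEVED_GBPS TARGET_GBPS -> achieved / effective target
sustained_fraction() {
  local eff
  eff="$(effective_target "$2")"
  awk -v a="$1" -v e="$eff" \
    'BEGIN { if (e + 0 <= 0) { print "0.000"; exit } printf "%.3f\n", a / e }'
}
```

For example, `sustained_fraction 48 100` compares 48 Gbps against the capped target of 96 Gbps rather than the requested 100 Gbps.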
Restructures the report around six top-level sections to keep
backend results scannable as more platforms and workloads land:
- Summary (headline tables; unchanged content, just stays at top)
- Introduction (System under test + Methodology, demoted to H3)
- C++ Results (DPDK / RoCE / Socket / Workload variants, demoted
to H3; their per-cell subsections demoted to H4; workload
variants renamed to "DPDK GPUDirect - FFT/GEMM" and
"RoCE - FFT/GEMM" so anchors don't collide with the C++ Results
backend headings)
- Python Results (renamed from "Python results"; unchanged otherwise)
- Reproduce these results (renamed from "Reproducibility
appendix"; same content)
- TODO: Not Yet Implemented / Known Limitations (moved from below
Summary to the bottom; relabeled to make clear these are pending
items, not platform constraints)
Adds a compact in-page Contents list under the intro paragraphs
linking to each top-level section. The Material sidebar TOC still
works for fine-grained navigation; the inline list is for orienting
the reader on first arrival.
Plain-text references to "Known limitations" updated to
"TODO / Known Limitations" with anchor links where appropriate
(RoCE / Socket deferred-results subsections, one-shot driver note).
Also tunes the .perf-matrix CSS so the new 9-column matrix fits
within the content area without text overflow: width 100%,
table-layout: auto, white-space: nowrap on cells, slightly tighter
padding. The narrower 5-column payload x batch matrix continues to
render fine under the same rule.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
| Filename | Overview |
|---|---|
| examples/run_spark_bench.sh | New sweep wrapper; two parsing bugs: /proc/net/udp drops column decoded as hex instead of decimal, and socket bench field names (sent_packets/sent_bytes) don't match the RDMA-named fallback (send_completions/send_bytes), causing all-zero socket CSV rows. |
| examples/raw_bench_common.cpp | Adds TokenBucketPacer implementation and seconds= timing to rx_count_worker. Sliced-sleep approach correctly handles stop-flag interruptibility; ostringstream trick for atomic stdout write is sound. |
| examples/raw_bench_common.h | Adds TokenBucketPacer class declaration with correct in-class initializers; default constructor is safe because wait_for_bytes() short-circuits when target_bps_==0.0. |
| examples/raw_gpudirect_bench.cpp | Wires TokenBucketPacer into tx_worker, adds per-worker timing, and emits structured TX complete line; changes are minimal and correct. |
| examples/rdma_bench.cpp | Adds pacer, byte counters, and structured completion lines. Uses wall-clock secs from main() (post-join) consistently for both server and client. |
| examples/socket_bench.cpp | Adds pacer, byte tracking, and structured Client/Server completion lines. Field names (sent_packets/sent_bytes) are inconsistent with the run_spark_bench.sh parsing fallback, but the C++ output itself is correct. |
| examples/bench_capture_environment.sh | New environment snapshot script; run_section handles missing commands gracefully; cat_section uses deliberate unquoted glob expansion for multi-path sections. |
| scripts/spark_data_fill.sh | Driver loop with pre-flight, hugepage cleanup, and PIPESTATUS-aware exit capture. RDMA explicitly rejected; socket backends will inherit the run_spark_bench.sh parsing bugs. |
| docs/performance-dgx-spark.md | Performance report skeleton with full data for DPDK GPUDirect; deferred sections clearly documented with footnotes and known-limitations section. |
| examples/CMakeLists.txt | Links raw_bench_common.cpp and CUDA::cudart into rdma and socket bench targets to support the new TokenBucketPacer and chrono usage. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant Fill as spark_data_fill.sh
    participant Wrapper as run_spark_bench.sh
    participant Env as bench_capture_environment.sh
    participant Bench as daqiri_bench_*
    participant CSV as runs.csv
    Fill->>Fill: preflight checks (hugepages, binary, carrier)
    Fill->>Wrapper: run_backend_mode(backend, sweep)
    Wrapper->>Env: capture environment.txt once per result set
    loop each (payload x batch x target_gbps) cell
        Wrapper->>Wrapper: generate_yaml(payload, batch)
        Wrapper->>Wrapper: snapshot_cpu_stat before
        Wrapper->>Bench: bench_bin yaml --seconds 30 [--target-gbps G] [--mode both]
        Note over Bench: TokenBucketPacer paces TX worker
        Bench-->>Wrapper: "stdout (complete lines) + stderr (DAQIRI_LOG_*)"
        Wrapper->>Wrapper: snapshot_cpu_stat after
        Wrapper->>Wrapper: extract pkts/bytes/secs from stdout
        Wrapper->>Wrapper: parse drops per backend
        Wrapper->>CSV: append CSV row
    end
    Fill->>Fill: clean_orphan_hugepages between modes
```
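The step between "extract pkts/bytes/secs from stdout" and "append CSV row" is a rate computation. A minimal sketch of that arithmetic follows, assuming the three values have already been extracted; `compute_rates` is a hypothetical name, not the wrapper's actual function, and the output layout is chosen here for illustration. Gbps is taken as bytes x 8 / seconds / 1e9.

```shell
# Hypothetical helper (not the wrapper's real code).
# compute_rates PACKETS BYTES SECONDS -> prints "<pps> <gbps>"
compute_rates() {
  awk -v p="$1" -v b="$2" -v s="$3" 'BEGIN {
    # Guard against a missing or zero duration so a failed run yields zeros
    # instead of a division error.
    if (s + 0 <= 0) { print "0 0.000"; exit }
    printf "%.0f %.3f\n", p / s, (b * 8) / s / 1e9
  }'
}
```

For instance, 3,000,000 packets and 12,000,000,000 bytes over 30 s give `compute_rates 3000000 12000000000 30`, that is 100000 pps and 3.200 Gbps. Keeping the formula in one function means every backend's CSV row uses the same definition of Gbps.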
Reviews (1): Last reviewed commit: "#15 - Add payload x target_gbps matrix a..."
```shell
snapshot_proc_net_udp() {
  awk 'NR>1 { sum += strtonum("0x" $13) } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0
}
```
The `drops` column at field 13 in `/proc/net/udp` is printed by the kernel as a plain decimal integer (`%d`/`%u` in `udp4_format_sock`). Wrapping it in `strtonum("0x" …)` interprets it as hexadecimal, so any drop count ≥ 10 is inflated: 10 drops becomes 16, and 100 drops becomes 256. The before/after delta subtraction preserves the error, so the socket-udp CSV rows will carry wrong drop counts as soon as UDP drops occur during the drop-curve sweep.
Suggested change:

```diff
 snapshot_proc_net_udp() {
-  awk 'NR>1 { sum += strtonum("0x" $13) } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0
+  awk 'NR>1 { sum += $13+0 } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0
 }
```
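To check the decimal reading without waiting for real drops, the same summation can be pointed at a synthetic table. The sketch below is an assumption-laden variant of the function in the review comment: `sum_udp_drops` and its optional file argument are hypothetical additions for testing, and the readability guard is added here so a missing file prints a single `0`.

```shell
# Hypothetical testable variant (the file argument is an addition, not in the PR).
# sum_udp_drops [FILE] -> decimal sum of field 13 over all non-header rows
sum_udp_drops() {
  local f="${1:-/proc/net/udp}"
  # Bail out early so an unreadable file yields exactly one "0".
  [ -r "$f" ] || { echo 0; return 0; }
  # Field 13 is already decimal, so "+0" is enough to coerce it to a number.
  awk 'NR>1 { sum += $13+0 } END { print sum+0 }' "$f"
}
```

With two rows whose drops fields are 10 and 100, this prints 110; the hexadecimal reading would have produced 16 + 256 = 272.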
```shell
if [[ -z "$pkts" ]]; then
  # RDMA prints "Client/Server complete: ... send_completions=N send_bytes=N seconds=S"
  pkts="$(extract_field 'Client complete' send_completions "$stdout")"
  bytes="$(extract_field 'Client complete' send_bytes "$stdout")"
  secs="$(extract_field 'Client complete' seconds "$stdout")"
fi
```
Socket field-name mismatch in third fallback
`socket_bench.cpp` emits `sent_packets` and `sent_bytes`, but this block queries the RDMA field names `send_completions` and `send_bytes`. Neither matches the socket bench's stdout, so `pkts` and `bytes` both fall through to the `:-0` default. Every socket-udp and socket-tcp CSV row will report packets=0, bytes=0, pps=0, gbps=0 regardless of how the bench actually performed. Once the socket bench bugs that currently defer those rows are fixed, this would silently produce a zero-filled data fill for the socket backends.
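One way to make the fallback tolerant of per-backend field names is to try a list of candidate keys in order. The sketch below is hypothetical: `extract_kv` and `first_field` are invented names with a different signature from the wrapper's `extract_field` (whose implementation is not shown here), and the sample line in the usage note is an assumed format built only from the field names mentioned in the review.

```shell
# Hypothetical helpers, not the wrapper's real extract_field.
# extract_kv LINE KEY -> value of the first KEY=value token in LINE, or empty
extract_kv() {
  printf '%s\n' "$1" | tr ' ' '\n' | sed -n "s/^$2=//p" | head -n 1
}

# first_field LINE KEY... -> first non-empty value among the candidate keys, else 0
first_field() {
  local line="$1" key val
  shift
  for key in "$@"; do
    val="$(extract_kv "$line" "$key")"
    if [ -n "$val" ]; then
      printf '%s\n' "$val"
      return 0
    fi
  done
  echo 0
}
```

Given an assumed line such as `Client complete: sent_packets=1000 sent_bytes=8000000 seconds=30.002`, calling `first_field "$line" send_completions sent_packets` returns the socket value instead of falling through to zero. An explicit per-backend branch would be stricter; the candidate list is just the smallest change that covers both naming schemes.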
Summary
Lays the C++ groundwork for the DGX Spark performance report (#15). Code and doc-skeleton only; per-cell numbers fill in via a follow-on commit on this same PR before it's marked ready-for-review.
- `TokenBucketPacer` (`--target-gbps` software pacer) and `seconds=` field on bench output across DPDK / RoCE / Socket benches. Zero `src/` changes.
- `examples/bench_capture_environment.sh` (slow-moving system state snapshot) and `examples/run_spark_bench.sh` (sweep wrapper that runs the bench under `mpstat` + `nvidia-smi dmon` and emits one CSV row per cell).
- `docs/performance-dgx-spark.md` skeleton with full final structure: native-shape and matched 8 KB headline tables (so cross-backend Gbps comparisons are honest), per-backend sweep dimensions (DPDK packets / RoCE messages / TCP stream), and explicit "HDS deferred on GB10" note.
- New doc wired into `mkdocs.yml`, `docs/index.html` News card, `README.md`, and `AGENTS.md` drift-hotspot list.

Drop accounting routes through the wrapper (no manager changes): DPDK via `DAQIRI_LOG_INFO` parsing, RDMA via `"CQ error"` lines, Socket UDP via `/proc/net/udp` diff, Socket TCP via `nstat`.

Test plan
- DPDK loopback smoke (`daqiri_bench_raw_gpudirect`): TX/RX `complete:` lines have the new `seconds=` format; ~84 Gbps cross-chip on Spark, 0 drops.
- `--target-gbps 1` holds 1.055 Gbps (within the +/-5% software pacer accuracy documented in the report).
- Socket UDP smoke (`daqiri_bench_socket`): new `sent_packets`/`recv_packets`/`sent_bytes`/`recv_bytes`/`seconds=` format; iteration cap behavior preserved.
- `bench_capture_environment.sh` captures kernel cmdline, CPU, NUMA, hugepages, mlx5 PCIe, OFED, NVIDIA GPU state, isolcpus, governor, IRQ.
- `run_spark_bench.sh dpdk smoke` produces a valid CSV row with `gbps`~=96, `drops=0`, `drops_kind` tagged correctly.
- `scripts/check_doc_refs.py` clean (every YAML covered by decision tree; every bench cross-referenced).
- `scripts/check_html_links.py` clean after `mkdocs build --strict`.
- RDMA single-host smoke: the kernel routes local IPs through `lo`, so single-host testing requires a network namespace. Lands with the data-fill commit on this PR.
- Run `run_spark_bench.sh {dpdk,rdma,socket-udp,socket-tcp} {sweep,drop-curve}` and populate the C++ loopback cells in `docs/performance-dgx-spark.md`.

Follow-up PRs
- PR for #15 workloads: introduces `examples/bench_post_process.{h,cu}` (cuFFT + cuBLAS post-process layer, examples-only, no library dependency change), refactors `rx_count_worker` to defer burst free until the post-process CUDA event signals, and fills FFT/GEMM cells.
- PR for #16 (Python loopback): adds `daqiri_bench_rdma.py` + `daqiri_bench_socket.py` mirroring this PR's CLI/output. Uses the existing pybind11 surface, no new bindings.

🤖 Generated with Claude Code