
#15 - Add C++ bench infrastructure for DGX Spark performance report #72

Draft
RamyaGuru wants to merge 4 commits into main from 15-bench-spark-cpp-loopback

@RamyaGuru
Collaborator

Summary

Lays the C++ groundwork for the DGX Spark performance report (#15). Code and doc-skeleton only; per-cell numbers fill in via a follow-on commit on this same PR before it's marked ready-for-review.

  • Adds TokenBucketPacer (--target-gbps software pacer) and seconds= field on bench output across DPDK / RoCE / Socket benches. Zero src/ changes.
  • Adds examples/bench_capture_environment.sh (slow-moving system state snapshot) and examples/run_spark_bench.sh (sweep wrapper that runs the bench under mpstat + nvidia-smi dmon and emits one CSV row per
    cell).
  • Adds docs/performance-dgx-spark.md skeleton with full final structure: native-shape and matched 8 KB headline tables (so cross-backend Gbps comparisons are honest), per-backend sweep dimensions (DPDK packets
    / RoCE messages / TCP stream), and explicit "HDS deferred on GB10" note.
  • Wires the doc into mkdocs.yml, docs/index.html News card, README.md, and AGENTS.md drift-hotspot list.

Drop accounting routes through the wrapper (no manager changes): DPDK via DAQIRI_LOG_INFO parsing, RDMA via "CQ error" lines, Socket UDP via /proc/net/udp diff, Socket TCP via nstat.

Test plan

  • DPDK smoke (daqiri_bench_raw_gpudirect): TX/RX complete: lines have the new seconds= format; ~84 Gbps cross-chip on Spark, 0 drops.
  • Pacer accuracy: --target-gbps 1 holds 1.055 Gbps (within the +/-5% software pacer accuracy documented in the report).
  • Socket UDP smoke (daqiri_bench_socket): new sent_packets/recv_packets/sent_bytes/recv_bytes/seconds= format; iteration cap behavior preserved.
  • bench_capture_environment.sh captures kernel cmdline, CPU, NUMA, hugepages, mlx5 PCIe, OFED, NVIDIA GPU state, isolcpus, governor, IRQ.
  • run_spark_bench.sh dpdk smoke produces a valid CSV row with gbps~=96, drops=0, drops_kind tagged correctly.
  • scripts/check_doc_refs.py clean (every YAML covered by decision tree; every bench cross-referenced).
  • scripts/check_html_links.py clean after mkdocs build --strict.
  • RDMA loopback smoke is deferred — kernel routes local-to-local IPs through lo, requires a network namespace for single-host testing. Lands with the data-fill commit on this PR.
  • Data-fill commit (this PR, second commit): run run_spark_bench.sh {dpdk,rdma,socket-udp,socket-tcp} {sweep,drop-curve} and populate the C++ loopback cells in docs/performance-dgx-spark.md.
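The pacer accuracy check in the test plan comes down to a relative-deviation calculation. A minimal sketch follows; `pacer_error_pct` is a hypothetical helper name and is not part of the PR's scripts.

```shell
# Hypothetical helper (not from the PR): signed deviation of the measured
# rate from the --target-gbps value, in percent.
pacer_error_pct() {
  # $1 = target Gbps, $2 = measured Gbps
  awk -v t="$1" -v m="$2" 'BEGIN { printf "%.1f\n", (m - t) / t * 100 }'
}

# Figures quoted in the test plan: target 1 Gbps, measured 1.055 Gbps
pacer_error_pct 1 1.055
```

For the quoted figures this prints 5.5, so the run sits right at the edge of a +/-5% bound; whether that counts as a pass depends on how the report rounds the tolerance.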

Follow-up PRs

🤖 Generated with Claude Code

RamyaGuru and others added 3 commits May 13, 2026 09:44
Lays the groundwork for the DGX Spark performance report (issue #15). Code
and doc-skeleton only; per-cell numbers fill in via a follow-on commit on
the same PR.

Bench changes (examples/ only, no src/ touched):
- raw_bench_common: add TokenBucketPacer (--target-gbps software pacer) and
  parse_target_gbps helper; emit seconds= in the shared rx_count_worker
  print; build the print in a stringstream so concurrent RX/TX worker
  output doesn't interleave on stdout.
- raw_gpudirect_bench: wire pacer into the TX worker; track packet/byte
  counts; print TX complete: with seconds=.
- rdma_bench: pacer for SEND path only; split send/recv completion counts
  and bytes per role; seconds= in server/client complete lines.
- socket_bench: pacer; make iteration cap opt-in (iterations <= 0 means
  time-bounded by --seconds); sent_bytes/recv_bytes tracking.
- CMakeLists: link raw_bench_common.cpp + CUDA::cudart into rdma/socket
  bench targets now that they consume the pacer helper.

Drop accounting (without modifying managers, per zero-src/ constraint):
- DPDK: parsed from DAQIRI_LOG_INFO output of PrintDpdkStats by wrapper.
- RDMA: parsed from "CQ error" lines in DAQIRI_LOG_ERROR by wrapper.
  Bench-side CQE counting isn't possible -- the manager filters error
  completions before they reach get_rx_burst.
- Socket UDP: kernel drops via /proc/net/udp diff by wrapper.
- Socket TCP: nstat retrans/inerrs diff by wrapper (TCP has no clean
  "drops" semantic).

New tooling:
- examples/bench_capture_environment.sh: snapshots uname, kernel cmdline,
  CPU/NUMA, hugepages, NIC/PCIe state, OFED, GPU, governor, isolcpus,
  IRQ affinity, git rev once per result set.
- examples/run_spark_bench.sh: sweep wrapper (smoke|sweep|drop-curve
  modes) per backend; runs bench under mpstat + nvidia-smi dmon;
  computes pps/Gbps in one place from packets=/bytes=/seconds= stdout
  fields; writes one CSV row per cell.
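The "compute pps/Gbps in one place" step can be sketched roughly as below. This is an illustration rather than the wrapper's actual code: `compute_rates` is a hypothetical name, and the sketch assumes Gbps means 10^9 bits per second.

```shell
# Hypothetical helper (not from the PR): derive pps and Gbps from the
# packets=/bytes=/seconds= values a bench prints on its "complete" line.
compute_rates() {
  # $1 = packets, $2 = bytes, $3 = elapsed seconds
  awk -v p="$1" -v b="$2" -v s="$3" 'BEGIN {
    if (s <= 0) { print "0 0.000"; exit }   # guard against divide-by-zero
    printf "%.0f %.3f\n", p / s, (b * 8) / (s * 1e9)
  }'
}

# Example: a 30 s run moving 45,000,000 packets of 8000 bytes each
compute_rates 45000000 360000000000 30   # prints: 1500000 96.000
```

Keeping the arithmetic in the wrapper means every backend's CSV row uses the same definition of a gigabit and the same elapsed-time field.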

Documentation:
- docs/performance-dgx-spark.md: complete skeleton with TBD cells.
  Two headline tables (native-shape peak + matched 8 KB op size) so
  cross-backend comparison is honest. Per-backend sweep dimensions
  table addresses the DPDK packets / RoCE messages / TCP stream
  semantic mismatch. Documents that HDS is deferred on GB10 (host_pinned
  collapses the HDS vs. plain GPUDirect distinction).
- mkdocs.yml, docs/index.html, README.md, AGENTS.md: wire the new doc
  into nav, landing-page News card, README Documentation table,
  AGENTS Documentation section + drift-hotspot list.
- .gitignore: bench-results/ (wrapper output).

Verified on a DGX Spark inside the project container: DPDK loopback
smoke test produces correctly-formatted TX/RX complete: lines at ~84
Gbps cross-chip; pacer holds 1.055 Gbps when --target-gbps 1 is set
(within the +/-5% software pacer accuracy documented in the report);
socket UDP smoke produces the new sent_packets/recv_packets/seconds
output format; wrapper produces a clean CSV row with non-zero gbps and
drops=0. RDMA single-host smoke is deferred -- the kernel routes
local-to-local IPs through lo (not the RoCE NICs), which requires a network
namespace to test on one host; that setup lands with the data-fill commit.

Planned follow-up PRs that build on this one:
- This PR (#15), second commit: run the full sweep via run_spark_bench.sh
  and populate the C++ loopback cells in docs/performance-dgx-spark.md.
  Lands before this PR is marked ready-for-review.
- PR for #15 workloads: introduce examples/bench_post_process.{h,cu}
  (cuFFT + cuBLAS post-process layer, examples-only -- no library
  dependency change), refactor rx_count_worker to defer burst free until
  the post-process CUDA event signals, and fill FFT/GEMM cells in the
  performance doc.
- PR for #16 (Python loopback): add daqiri_bench_rdma.py and
  daqiri_bench_socket.py mirroring the C++ benches' --target-gbps and
  stdout format; reuses the existing pybind11 surface, no new bindings.
  Fills Python loopback cells.
- PR for #16 workloads: pybind11 wrappers for the PR 2 post-process layer
  in python/bench_post_process_pybind.cpp; fills Python FFT/GEMM cells
  and completes the Spark v1 report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
Fills the DPDK GPUDirect numbers (sweep, drop curve, headline tables)
into docs/performance-dgx-spark.md from the 2026-05-12 bench runs, and
adds the supporting infra for ongoing data fills:

- New examples/run_spark_bench.sh mode `drop-curve-matrix` sweeps
  payload x target_gbps at the headline batch (feeds an upcoming
  payload x target_gbps heatmap, distinct from the existing 1D drop
  curve).
- New scripts/spark_data_fill.sh one-shot driver runs the DPDK +
  socket sweep and drop-curve modes back-to-back, with pre-flight
  checks and orphan-hugepage cleanup between runs. RDMA is deferred
  from PR 1 (single-host loopback over the cable needs a netns +
  two-process refactor; tracked separately).
- docs/stylesheets/extra.css adds a .perf-matrix heatmap (green /
  yellow / red cells) used by the payload x batch matrix and
  upcoming target_gbps matrix.
- mkdocs.yml enables the footnotes extension so the deferred-row
  footnotes ([^1]/[^2]/[^3]) render.
- Container snippet in the perf doc now auto-injects ETH_DST_ADDR
  (and RX_IFACE) from the host so the DPDK benches just work after
  `docker run`.

RoCE and socket rows in the headline tables are marked deferred with
footnotes; the corresponding follow-up issues (drafts at
claude_plans/daqiri-pr15-spark-bench-followups.md) are:

  [^1] RoCE single-host loopback shortcuts through `lo` because both
       endpoints live in the root netns, so the QSFP cable carries no
       traffic; fix is an examples-only two-netns + two-process
       orchestration in run_spark_bench.sh.
  [^2] Socket UDP --mode both deadlocks on peer learning — both ends
       spin send-then-receive in one process, the server never learns
       a peer, and only ~1000 packets / 30 s trickle through under a
       flood of "no learned peer" ERROR spam.
  [^3] Socket TCP --mode both aborts immediately after the second
       accept with a glibc malloc.c:2599 (sysmalloc) heap-integrity
       assertion — likely a double-free or OOB write in the TCP
       socket-mgr init path.

Issue numbers will be back-filled into the footnotes once filed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
`python/tune_system.py` writes a PCIe topology schematic to
`pcie_schematic.png` in the working directory by default (introduced
in #61). Anyone who runs the tuner from the repo root then has it
sitting untracked in `git status` forever. Ignore the default path
so it stops showing up in working-tree status. Custom output paths
passed via `--output` are unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
RamyaGuru force-pushed the 15-bench-spark-cpp-loopback branch from cca82c5 to c5fffd5 on May 13, 2026 13:44
Adds the payload x target_gbps heatmap to the DPDK GPUDirect results
(5 payloads x 8 target rates, batch=10240) from the 2026-05-13
drop-curve-matrix run. Coloring is relative to the effective target
min(target_gbps, 96 Gbps link cap) so the heatmap reads as "does the
configuration sustain the requested rate?" rather than "what's the
absolute peak?". The 8000 B / 4096 B rows are all green; 1024 B is
green-except-unpaced (where the master core peaks at 93%); 256 B and
64 B turn red once target crosses the PPS ceiling.
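The "relative to the effective target" normalization can be sketched as follows. `sustain_ratio` is a hypothetical helper, and treating a target of 0 as "unpaced" (effective target equals the link cap) is an assumption of this sketch. The green/yellow/red cut-offs are not stated in the commit message, so none are encoded here.

```shell
# Hypothetical helper (not from the PR): achieved rate divided by the
# effective target, where effective target = min(target_gbps, link cap).
sustain_ratio() {
  # $1 = achieved Gbps, $2 = target Gbps (0 = unpaced, an assumption),
  # $3 = link cap in Gbps (96 in the text above)
  awk -v a="$1" -v t="$2" -v cap="$3" 'BEGIN {
    eff = (t > 0 && t < cap) ? t : cap
    printf "%.2f\n", a / eff
  }'
}

sustain_ratio 50 50 96    # paced below the cap: prints 1.00
sustain_ratio 95 100 96   # target above the cap is clamped to 96: prints 0.99
```

A ratio near 1.00 means the configuration sustains the requested rate, which is the question the heatmap is meant to answer.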

Restructures the report around six top-level sections to keep
backend results scannable as more platforms and workloads land:

  - Summary (headline tables; unchanged content, just stays at top)
  - Introduction (System under test + Methodology, demoted to H3)
  - C++ Results (DPDK / RoCE / Socket / Workload variants, demoted
    to H3; their per-cell subsections demoted to H4; workload
    variants renamed to "DPDK GPUDirect - FFT/GEMM" and
    "RoCE - FFT/GEMM" so anchors don't collide with the C++ Results
    backend headings)
  - Python Results (renamed from "Python results"; unchanged otherwise)
  - Reproduce these results (renamed from "Reproducibility
    appendix"; same content)
  - TODO: Not Yet Implemented / Known Limitations (moved from below
    Summary to the bottom; relabeled to make clear these are pending
    items, not platform constraints)

Adds a compact in-page Contents list under the intro paragraphs
linking to each top-level section. The Material sidebar TOC still
works for fine-grained navigation; the inline list is for orienting
the reader on first arrival.

Plain-text references to "Known limitations" updated to
"TODO / Known Limitations" with anchor links where appropriate
(RoCE / Socket deferred-results subsections, one-shot driver note).

Also tunes the .perf-matrix CSS so the new 9-column matrix fits
within the content area without text overflow: width 100%,
table-layout: auto, white-space: nowrap on cells, slightly tighter
padding. The narrower 5-column payload x batch matrix continues to
render fine under the same rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
@greptile-apps

greptile-apps Bot commented May 13, 2026

Greptile Summary

Adds the C++ benchmarking infrastructure for the DGX Spark performance report: a TokenBucketPacer class shared across all three bench executables (DPDK, RoCE, Socket), seconds= timing on all completion lines, and two new shell scripts (run_spark_bench.sh sweep wrapper, bench_capture_environment.sh environment snapshot). The doc skeleton for docs/performance-dgx-spark.md is wired into mkdocs.yml, the index page, and README.md; DPDK GPUDirect data is fully populated.

  • TokenBucketPacer and seconds= output are added to raw_bench_common.{h,cpp} and consumed uniformly by all three bench binaries; raw_bench_common.cpp is linked into the rdma and socket targets via CMakeLists.txt.
  • run_spark_bench.sh drives the sweep and drop-curve matrices, captures per-run CPU/GPU counters, and emits one CSV row per cell; two parsing bugs affect socket backends (hex mis-decode of a decimal /proc/net/udp column and RDMA field names used in the socket fallback).
  • Socket and RoCE results are explicitly deferred with documented follow-up issues; the DPDK GPUDirect section is complete with drop-curve, payload×target, and payload×batch matrices.

Confidence Score: 3/5

Safe to merge for the DPDK path; two shell-script parsing bugs will silently produce all-zero CSV rows for socket backends once those bench bugs are fixed.

The C++ changes are clean and zero src/ changes are made. The DPDK data in the performance doc is fully populated and the sweep/drop-curve infrastructure works correctly for DPDK. The two issues are both in run_spark_bench.sh's output-parsing logic for socket backends: the /proc/net/udp drops column is a decimal integer but is fed through strtonum("0x" …), inflating any non-zero drop count; and the socket bench's completion field names (sent_packets, sent_bytes) don't match the RDMA-named fallback the script uses (send_completions, send_bytes), so every socket CSV row would show 0 Gbps. Neither bug affects the DPDK results already in the doc, but both would corrupt socket data when the deferred socket fixes land on this same PR.

examples/run_spark_bench.sh — the snapshot_proc_net_udp hex-decode and the socket/RDMA field-name mismatch in the third fallback block

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/run_spark_bench.sh | New sweep wrapper; two parsing bugs: /proc/net/udp drops column decoded as hex instead of decimal, and socket bench field names (sent_packets/sent_bytes) don't match the RDMA-named fallback (send_completions/send_bytes), causing all-zero socket CSV rows. |
| examples/raw_bench_common.cpp | Adds TokenBucketPacer implementation and seconds= timing to rx_count_worker. Sliced-sleep approach correctly handles stop-flag interruptibility; ostringstream trick for atomic stdout write is sound. |
| examples/raw_bench_common.h | Adds TokenBucketPacer class declaration with correct in-class initializers; default constructor is safe because wait_for_bytes() short-circuits when target_bps_==0.0. |
| examples/raw_gpudirect_bench.cpp | Wires TokenBucketPacer into tx_worker, adds per-worker timing, and emits structured TX complete line; changes are minimal and correct. |
| examples/rdma_bench.cpp | Adds pacer, byte counters, and structured completion lines. Uses wall-clock secs from main() (post-join) consistently for both server and client. |
| examples/socket_bench.cpp | Adds pacer, byte tracking, and structured Client/Server completion lines. Field names (sent_packets/sent_bytes) are inconsistent with the run_spark_bench.sh parsing fallback, but the C++ output itself is correct. |
| examples/bench_capture_environment.sh | New environment snapshot script; run_section handles missing commands gracefully; cat_section uses deliberate unquoted glob expansion for multi-path sections. |
| scripts/spark_data_fill.sh | Driver loop with pre-flight, hugepage cleanup, and PIPESTATUS-aware exit capture. RDMA explicitly rejected; socket backends will inherit the run_spark_bench.sh parsing bugs. |
| docs/performance-dgx-spark.md | Performance report skeleton with full data for DPDK GPUDirect; deferred sections clearly documented with footnotes and known-limitations section. |
| examples/CMakeLists.txt | Links raw_bench_common.cpp and CUDA::cudart into rdma and socket bench targets to support the new TokenBucketPacer and chrono usage. |

Sequence Diagram

sequenceDiagram
    participant Fill as spark_data_fill.sh
    participant Wrapper as run_spark_bench.sh
    participant Env as bench_capture_environment.sh
    participant Bench as daqiri_bench_*
    participant CSV as runs.csv

    Fill->>Fill: preflight checks (hugepages, binary, carrier)
    Fill->>Wrapper: run_backend_mode(backend, sweep)
    Wrapper->>Env: capture environment.txt once per result set
    loop each (payload x batch x target_gbps) cell
        Wrapper->>Wrapper: generate_yaml(payload, batch)
        Wrapper->>Wrapper: snapshot_cpu_stat before
        Wrapper->>Bench: bench_bin yaml --seconds 30 [--target-gbps G] [--mode both]
        Note over Bench: TokenBucketPacer paces TX worker
        Bench-->>Wrapper: "stdout (complete lines) + stderr (DAQIRI_LOG_*)"
        Wrapper->>Wrapper: snapshot_cpu_stat after
        Wrapper->>Wrapper: extract pkts/bytes/secs from stdout
        Wrapper->>Wrapper: parse drops per backend
        Wrapper->>CSV: append CSV row
    end
    Fill->>Fill: clean_orphan_hugepages between modes

Reviews (1): Last reviewed commit: "#15 - Add payload x target_gbps matrix a..."

Comment on lines +131 to +133
snapshot_proc_net_udp() {
awk 'NR>1 { sum += strtonum("0x" $13) } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0
}

P1 The drops column at field 13 in /proc/net/udp is printed by the kernel as a plain decimal integer (%d/%u in udp4_format_sock). Wrapping it in strtonum("0x" …) interprets it as hexadecimal, so any drop count ≥ 10 is inflated — e.g., 10 drops becomes 16, 100 drops becomes 256. The before/after delta subtraction preserves the error, so the socket-udp CSV rows will carry wrong drop counts the moment UDP drops start occurring during the drop-curve sweep.

Suggested change
- snapshot_proc_net_udp() {
-   awk 'NR>1 { sum += strtonum("0x" $13) } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0
- }
+ snapshot_proc_net_udp() {
+   awk 'NR>1 { sum += $13+0 } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0
+ }
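The inflation described in this comment can be reproduced without touching /proc/net/udp. The sketch below uses the shell's `printf` in place of gawk's `strtonum` (a deliberate substitution so it runs without gawk); `decode_as_hex` and `decode_as_dec` are hypothetical names used only for illustration.

```shell
# Decode a decimal counter the wrong way (as hex) and the right way (as decimal).
decode_as_hex() { printf '%d\n' "0x$1"; }   # mimics the "0x" prefix mis-decode
decode_as_dec() { printf '%d\n' "$1"; }     # what a decimal column needs

for n in 9 10 100; do
  echo "$n -> hex-decoded $(decode_as_hex "$n"), decimal $(decode_as_dec "$n")"
done
```

Single-digit counts come through unchanged, which is why the bug would stay invisible until the drop count reaches 10 or more.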

Comment on lines +233 to +238
if [[ -z "$pkts" ]]; then
# RDMA prints "Client/Server complete: ... send_completions=N send_bytes=N seconds=S"
pkts="$(extract_field 'Client complete' send_completions "$stdout")"
bytes="$(extract_field 'Client complete' send_bytes "$stdout")"
secs="$(extract_field 'Client complete' seconds "$stdout")"
fi

P1 Socket field-name mismatch in third fallback

socket_bench.cpp emits sent_packets and sent_bytes, but this block queries the RDMA field names send_completions and send_bytes. Neither will match the socket bench's stdout, so pkts and bytes both fall through to the :-0 default. Every socket-udp and socket-tcp CSV row will report packets=0, bytes=0, pps=0, gbps=0 regardless of how the bench actually performed — this would silently produce a zero-filled data-fill for socket backends once those bugs are resolved.
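One way to address the mismatch is to try the socket field names before falling back to the RDMA ones. The sketch below is illustrative only: `extract_field_demo` is a hypothetical stand-in for the wrapper's `extract_field` (whose implementation is not shown in this review), `first_match` is likewise hypothetical, and the sample completion line is an assumed shape based on the field names quoted above.

```shell
# Hypothetical stand-in for extract_field: print the value of key=... from
# the first line of the text ($3) that contains the prefix ($1).
extract_field_demo() {
  printf '%s\n' "$3" | awk -v pre="$1" -v key="$2" '
    index($0, pre) {
      for (i = 1; i <= NF; i++)
        if (index($i, key "=") == 1) { print substr($i, length(key) + 2); exit }
    }'
}

# Return the first non-empty value among several candidate keys.
first_match() {
  # $1 = line prefix, $2 = captured stdout text, remaining args = candidate keys
  pre="$1"; text="$2"; shift 2
  for key in "$@"; do
    val="$(extract_field_demo "$pre" "$key" "$text")"
    if [ -n "$val" ]; then printf '%s\n' "$val"; return 0; fi
  done
  return 1
}

# Assumed sample line; the real socket bench output may carry more fields.
line='Client complete: sent_packets=1200 sent_bytes=9600000 seconds=30.0'
first_match 'Client complete' "$line" send_completions sent_packets   # prints 1200
first_match 'Client complete' "$line" send_bytes sent_bytes           # prints 9600000
```

Listing both spellings in one place keeps the RDMA path working while letting socket rows pick up real packet and byte counts.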
