diff --git a/.gitignore b/.gitignore
index e221700..472185d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,5 +1,9 @@
build*/
site/
+bench-results/
+
+# tune_system.py default output
+pcie_schematic.png
# macOS
.DS_Store
diff --git a/AGENTS.md b/AGENTS.md
index c45ef72..79d9568 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -93,14 +93,16 @@ The web docs live in `docs/` and are built with [MkDocs Material](https://squidf
- `docs/index.html` — custom HTML landing page (not generated by MkDocs, hand-maintained)
- `docs/daqiri-api.html` — standalone HTML API reference (hand-maintained)
- `docs/api-guide.md`, `docs/getting-started.md`, `docs/configuration.md` — core markdown docs
+- `docs/performance-dgx-spark.md` — per-platform performance report (DGX Spark; more platforms to follow)
- `docs/tutorials/` — tutorial walkthroughs (background, system config, benchmarking, config files)
- `docs/stylesheets/extra.css` — custom theme overrides
**Keeping docs in sync with code:** before committing changes, scan for the recurring drift hotspots:
- **Backend list** (`src/managers/*/`) — README Backends table, `docs/getting-started.md`, `docs/configuration.md`
- **CMake options / `DAQIRI_MGR` default** (`src/CMakeLists.txt:137`) — README Quick Start, `docs/getting-started.md`, this file's Build & run section
-- **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, and the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage)
+- **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage), and per-platform performance docs (`docs/performance-*.md`)
- **Public API** (`src/common.h`, `src/types.h`, `src/manager.h`) — `docs/api-guide.md`, `docs/daqiri-api.html`
+- **Bench CLI flags or output format** (`examples/raw_bench_common.{h,cpp}`, `*_bench.cpp`) — per-platform performance docs' Methodology section, `examples/run_spark_bench.sh` parsing logic
- **Doc reorganization** (any rename in `docs/`) — `docs/index.html` landing page, `mkdocs.yml` nav, README Documentation table
The full mapping with rationale lives in the docs-sync agent rule. Internal-link, anchor, and nav drift is enforced by CI (`.github/workflows/docs.yml`); content drift (stale binary names, defaults) is still a manual check at commit time.
diff --git a/README.md b/README.md
index 50112c6..30c17e1 100644
--- a/README.md
+++ b/README.md
@@ -81,6 +81,7 @@ Reference material for the DAQIRI codebase:
- [Getting Started](docs/getting-started.md) — System requirements, build/install instructions, and CMake options
- [Configuration Reference](docs/configuration.md) — Full YAML config reference for all backends
- [API Guide](docs/api-guide.md) — BurstParams, RX/TX workflows, buffer lifecycle, status codes
+- [Performance: DGX Spark](docs/performance-dgx-spark.md) — Per-platform throughput, drop, and utilization numbers for all backends
- [Contributing](CONTRIBUTING.md) — Contribution guidelines, coding standards, DCO sign-off
## Tutorials
diff --git a/docs/index.html b/docs/index.html
index 61a9eb2..53f2ffe 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -595,6 +595,12 @@
News
+
+
Performance2026
+
DAQIRI Performance on DGX Spark
+
NVIDIA — Throughput, drops, and resource utilization for DPDK GPUDirect, RoCE, and socket backends measured on a DGX Spark (GB10) workstation. First in a series of per-platform performance reports.
+
+
GitHub2025
DAQIRI Open-Sourced on GitHub
diff --git a/docs/performance-dgx-spark.md b/docs/performance-dgx-spark.md
new file mode 100644
index 0000000..9854243
--- /dev/null
+++ b/docs/performance-dgx-spark.md
@@ -0,0 +1,566 @@
+# Performance: DGX Spark
+
+This page reports DAQIRI throughput, drop, and resource-utilization numbers
+measured on a DGX Spark (GB10) workstation. It is the first in a series of
+per-platform performance reports; the same section layout will be reused for
+IGX, x86-server, and other targets.
+
+The numbers below are reproducible — every cell is generated by
+[`examples/run_spark_bench.sh`](https://github.com/nvidia/daqiri/blob/main/examples/run_spark_bench.sh)
+against the YAML configs in `examples/`, with the system state captured by
+[`examples/bench_capture_environment.sh`](https://github.com/nvidia/daqiri/blob/main/examples/bench_capture_environment.sh)
+alongside each result set.
+
+**Contents**
+
+- [Summary](#summary)
+- [Introduction](#introduction) — system under test, methodology
+- [C++ Results](#c-results) — DPDK, RoCE, Socket, workload variants
+- [Python Results](#python-results)
+- [Reproduce these results](#reproduce-these-results)
+- [TODO: Not Yet Implemented / Known Limitations](#todo-not-yet-implemented-known-limitations)
+
+## Summary
+
+Two headline tables. **Native-shape peak** reports each backend at its
+best-case operation size (the configuration the backend was designed for).
+**Matched 8 KB** drives DPDK / RoCE / Socket UDP at a common ~8 KB unit of
+work so the cross-backend comparison is apples-to-apples; TCP is omitted from
+that table since it has no operation boundary.
+
+### Native-shape peak — max no-drop throughput (Gbps)
+
+| Backend / Stack | C++ loopback | C++ + FFT | C++ + GEMM | Python loopback | Python + FFT | Python + GEMM |
+| --------------------- | -------------- | -------------- | -------------- | --------------- | -------------- | -------------- |
+| DPDK GPUDirect (8 KB) | **96.4** | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ |
+| RoCE (8 MB SEND) | _deferred_[^1] | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ |
+| Socket UDP (MTU) | _deferred_[^2] | n/a | n/a | _TBD (PR 3)_ | n/a | n/a |
+| Socket TCP (stream) | _deferred_[^3] | n/a | n/a | _TBD (PR 3)_ | n/a | n/a |
+
+### Matched 8 KB operation — cross-backend Gbps
+
+| Backend / Stack | C++ loopback | C++ + FFT | C++ + GEMM | Python loopback | Python + FFT | Python + GEMM |
+| ------------------------ | -------------- | -------------- | -------------- | --------------- | -------------- | -------------- |
+| DPDK GPUDirect | **96.4** | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ |
+| RoCE | _deferred_[^1] | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ |
+| Socket UDP (1472 B, MTU) | _deferred_[^2] | n/a | n/a | _TBD (PR 3)_ | n/a | n/a |
+
+[^1]: RoCE single-host loopback is deferred from PR 1 — the bench's `--mode both` runs both ends in one process; with both IPs locally bound, the kernel shortcuts RC connection setup through `lo` rather than the cable. Real loopback measurement requires two netns (one per port), `rdma system set netns exclusive`, RDMA-device netns transfer, a YAML split, and two-process orchestration. Tracked in a follow-up issue.
+[^2]: Socket UDP `--mode both` deadlocks on peer learning: both server and client try to transmit before either has received, so neither side learns its peer. Only ~1000 packets / 30 s trickle through. Tracked in a follow-up issue.
+[^3]: Socket TCP `--mode both` aborts with a glibc heap-corruption assertion (`malloc.c:2599`) immediately after the second TCP connection accept. Bench is unrunnable. Tracked in a follow-up issue.
+
+!!! note "Why two tables"
+ A single Gbps number isn't enough to compare backends fairly. A DPDK
+ backend at peak with 8 KB packets is doing very different work than a RoCE
+ backend at peak with 8 MB messages, even when both report the same Gbps.
+ The matched 8 KB table makes the cross-backend comparison honest; the
+ native-shape table shows each backend's design ceiling.
+
+## Introduction
+
+### System under test
+
+The reproducibility appendix has the full capture. Key fields:
+
+| Field | Value |
+| ----------------- | ------------------------------------ |
+| Platform | NVIDIA DGX Spark (GB10) |
+| GPU | Blackwell (compute capability 12.1) |
+| NIC | ConnectX-7 (two ports, tied / QSFP loopback) |
+| Topology | `0000:01:00.0` (mlx5_0) → `0002:01:00.0` (mlx5_2), physical loopback cable |
+| OS | _captured in `environment.txt`_ |
+| Kernel | _captured in `environment.txt`_ |
+| CUDA driver | _captured in `environment.txt`_ |
+| DPDK | patched per `dpdk_patches/` (container build) |
+| DAQIRI commit | _captured in `environment.txt`_ |
+
+### Methodology
+
+#### Bench commands
+
+Each backend has a dedicated bench executable in `examples/`. The DPDK
+numbers in this report come from the first command; the RoCE and Socket
+commands are listed here for documentation and will be the basis for the
+future fill of those rows.
+
+```bash
+# DPDK GPUDirect — physical loopback (used in this report)
+./build/examples/daqiri_bench_raw_gpudirect \
+ examples/daqiri_bench_raw_tx_rx_spark.yaml \
+ --seconds 30 [--target-gbps G]
+
+# RoCE — same NIC, two ports (deferred; see TODO / Known Limitations)
+./build/examples/daqiri_bench_rdma \
+ examples/daqiri_bench_rdma_tx_rx_spark.yaml \
+ --seconds 30 --mode both [--target-gbps G]
+
+# Socket UDP / TCP — localhost (deferred; see TODO / Known Limitations)
+./build/examples/daqiri_bench_socket \
+ examples/daqiri_bench_socket_udp_tx_rx.yaml \
+ --seconds 30 --mode both [--target-gbps G]
+```
+
+The DPDK YAML expects `eth_dst_addr` filled from the RX iface MAC:
+
+```bash
+ETH_DST_ADDR="$(cat /sys/class/net/
/address)"
+```
+
+#### Per-backend sweep dimensions
+
+**Payload** is the user-data bytes in one packet (DPDK / Socket UDP),
+one RDMA message, or one TCP send. **Batch** is how many packets DAQIRI
+hands to (or pulls from) the NIC in one `rte_eth_tx_burst` /
+`rte_eth_rx_burst` call — the burst size knob, not a packet-size knob.
+Larger batches amortize doorbell and API overhead per packet; smaller
+batches keep per-call latency lower. Batch only matters when the bench
+is not yet at the link ceiling — at saturation, the NIC is the
+bottleneck and batch size has near-zero effect.
+
+The "payload × batch" sweep doesn't map uniformly across backends. Each
+backend has its own sweep:
+
+| Backend | Sweep dim 1 | Sweep dim 2 | Native-shape cell | Matched-size cell |
+| ------------------ | ------------------------------------------------------ | -------------------------------------- | -------------------- | ---------------------- |
+| DPDK GPUDirect | payload_size ∈ {64, 256, 1024, 4096, 8000} B | batch_size ∈ {256, 1024, 4096, 10240} | (8000, 10240) | (8000, 10240) — same |
+| RoCE | message_size ∈ {4 K, 64 K, 1 M, 8 M} B | batch_size ∈ {1, 4, 16} | (8 M, 1) | (8 K, 16) |
+| Socket UDP | payload_size ∈ {64, 256, 1024, 1472} B (MTU-bound) | batch_size ∈ {1, 32, 256} | (1472, 256) | (1472, 256) — closest to 8 K under MTU cap |
+| Socket TCP | message_size ∈ {1 K, 64 K, 1 M} B | n/a (single stream) | (64 K) | n/a |
+
+#### "No-drop" threshold
+
+A run is **drop-free** when reported `drops == 0` over a `--seconds 30` run.
+The headline tables report the highest target rate at which a run was still
+drop-free under this threshold. The methodology does not use a percentile cap
+— either there are drops or there are not.
+
+#### Drop-curve sweep
+
+The drop curve sweeps `--target-gbps` while holding the native-shape cell
+constant. The token-bucket pacer in the bench TX worker (`raw_bench_common`)
+adds a software-paced sleep after each burst; accuracy is ~±5 % at high rates
+due to OS sleep granularity and scheduler jitter. Hardware TX pacing (DPDK
+`accurate_send`) is unused but would tighten DPDK-only precision; deferred to
+a follow-up.
+
+#### Drop sources per backend
+
+- **DPDK** — `imissed + ierrors + rx_nombuf` parsed from `DAQIRI_LOG_INFO`
+ output (the bench's stderr).
+- **RoCE** — count of `CQ error` lines in `DAQIRI_LOG_ERROR` output (RDMA
+ manager filters error completions but logs each one).
+- **Socket UDP** — diff of the `drops` column in `/proc/net/udp` over the run.
+- **Socket TCP** — `nstat -a` diff of `TcpExtTCPLostRetransmit` /
+ `TcpRetransSegs` / `TcpInErrs`. TCP has no clean "drops" semantic; this is
+ the closest proxy.
+
+#### External captures per run
+
+Each run records, alongside the bench:
+
+- `/proc/stat` snapshots before and after the bench process. The wrapper
+ computes per-core busy% for the master / TX / RX cores (cores 8 / 17 / 18
+ on Spark) by delta over the run window. `mpstat` is not used — it is
+ often missing from minimal containers, and the per-core CPUs we care
+ about are pinned by the YAML so a targeted delta is more meaningful
+ than a system-wide average.
+- `nvidia-smi dmon -s pucvmet -c ` — GPU SM%, mem-controller%, and
+ PCIe rxpci / txpci columns (the latter are reported as `-` on the
+ current Spark driver; SM% and mem% are near zero for plain GPUDirect
+ because the GPU is a DMA target, not a compute engine).
+
+Slow-moving state (kernel, OFED, NIC firmware, PCIe link, NUMA,
+hugepages, GPU state, DAQIRI commit) is captured once per result set by
+`bench_capture_environment.sh`.
+
+## C++ Results
+
+### DPDK GPUDirect
+
+Native shape on Spark is 8 KB payload, batch 10240 — the configuration the
+DPDK backend was designed around. All cells below ran for 30 s with
+`drops == 0`.
+
+#### Drop curve at native shape
+
+Hold (payload=8000, batch=10240) constant; sweep `--target-gbps`. The
+token-bucket pacer tracks target within ±0.02 Gbps until the link saturates
+near 96 Gbps. Beyond that, target=100 and unpaced both report the
+achievable ceiling. TX and RX cores spin in poll-mode regardless of target
+rate (visible in the CPU table below).
+
+| target Gbps | achieved Gbps | RX pps | drops | TX core % | RX core % |
+| ----------- | ------------- | --------- | ----- | --------- | --------- |
+| 1 | 1.011 | 15,678 | 0 | 92.0 | 92.0 |
+| 5 | 5.012 | 77,697 | 0 | 91.8 | 91.8 |
+| 10 | 10.001 | 155,032 | 0 | 92.5 | 92.5 |
+| 25 | 25.008 | 387,650 | 0 | 91.9 | 91.9 |
+| 50 | 50.001 | 775,062 | 0 | 92.8 | 92.8 |
+| 75 | 74.999 | 1,162,551 | 0 | 91.7 | 91.7 |
+| 100 | 96.370 | 1,493,823 | 0 | 91.6 | 91.6 |
+| unpaced | 95.897 | 1,486,498 | 0 | 91.6 | 90.5 |
+
+#### Payload × target_gbps matrix
+
+Holds batch=10240 constant; sweeps payload and `--target-gbps`
+together. Each cell shows the achieved Gbps and drops over a 30 s
+run. Coloring is relative to the **effective target**
+`min(target_gbps, 96 Gbps link cap)`; the "unpaced" column uses the
+link cap as its effective target:
+
+
+ green — no drops, achieved ≥ 95% of effective target
+ yellow — no drops, achieved ≥ 70% of effective target
+ red — drops, or achieved < 70% of effective target
+
+
+
+
+
+ | payload |
+ target Gbps |
+
+
+ | 1 | 5 | 10 | 25 | 50 | 75 | 100 | unpaced |
+
+
+
+
+ | 8000 B |
+ 1.0 Gbps0 drops |
+ 5.0 Gbps0 drops |
+ 10.0 Gbps0 drops |
+ 25.0 Gbps0 drops |
+ 50.0 Gbps0 drops |
+ 75.0 Gbps0 drops |
+ 95.9 Gbps0 drops |
+ 96.4 Gbps0 drops |
+
+
+ | 4096 B |
+ 1.0 Gbps0 drops |
+ 5.0 Gbps0 drops |
+ 10.0 Gbps0 drops |
+ 25.0 Gbps0 drops |
+ 50.0 Gbps0 drops |
+ 75.0 Gbps0 drops |
+ 100.0 Gbps0 drops |
+ 102.7 Gbps0 drops |
+
+
+ | 1024 B |
+ 1.0 Gbps0 drops |
+ 5.0 Gbps0 drops |
+ 10.0 Gbps0 drops |
+ 24.9 Gbps0 drops |
+ 49.8 Gbps0 drops |
+ 74.2 Gbps0 drops |
+ 94.0 Gbps0 drops |
+ 68.2 Gbps0 drops |
+
+
+ | 256 B |
+ 1.0 Gbps0 drops |
+ 5.0 Gbps0 drops |
+ 10.0 Gbps0 drops |
+ 24.7 Gbps0 drops |
+ 21.1 Gbps0 drops |
+ 21.2 Gbps0 drops |
+ 21.2 Gbps0 drops |
+ 21.6 Gbps0 drops |
+
+
+ | 64 B |
+ 1.0 Gbps0 drops |
+ 5.0 Gbps0 drops |
+ 9.9 Gbps0 drops |
+ 8.9 Gbps0 drops |
+ 8.9 Gbps0 drops |
+ 13.7 Gbps0 drops |
+ 8.9 Gbps0 drops |
+ 8.7 Gbps0 drops |
+
+
+
+
+**Reading the matrix.** The token-bucket pacer tracks target within
+sub-Gbps at payloads ≥ 1 KB right up to the link cap. The PPS
+ceiling for each payload (~21 Gbps at 256 B, ~9 Gbps at 64 B) shows
+up as a horizontal saturation band: once `target_gbps` crosses that
+ceiling, increasing it further produces no additional throughput.
+The 64 B / target=75 cell (13.7 Gbps) is a pacer transient when the
+requested rate is well above achievable — the next cell at
+target=100 falls back to the PPS ceiling. The 1024 B / unpaced cell
+(68.2 Gbps, yellow) sits below its own paced cells; 1024 B is the
+size where the master loop transitions from idle to fully busy
+(`cpu_master_pct` jumps from ~3% to ~93% in the corresponding
+unpaced run), and per-cell numbers in this regime carry roughly
+±5 Gbps of run-to-run variance.
+
+#### Payload × batch matrix
+
+Each cell shows the achieved Gbps and drops over a 30 s unpaced run.
+Coloring is relative to the global max across the matrix (here
+**104.989 Gbps** at payload 4096 B, batch 4096):
+
+
+ green — no drops, Gbps ≥ 90% of max
+ yellow — no drops, Gbps ≥ 70% of max
+ red — drops, or Gbps < 70% of max
+
+
+
+
+
+ | payload |
+ batch (packets per burst) |
+
+
+ | 256 | 1024 | 4096 | 10240 |
+
+
+
+
+ | 8000 B |
+ 96.4 Gbps0 drops |
+ 96.4 Gbps0 drops |
+ 96.4 Gbps0 drops |
+ 95.9 Gbps0 drops |
+
+
+ | 4096 B |
+ 104.9 Gbps0 drops |
+ 104.9 Gbps0 drops |
+ 105.0 Gbps0 drops |
+ 103.0 Gbps0 drops |
+
+
+ | 1024 B |
+ 64.9 Gbps0 drops |
+ 86.2 Gbps0 drops |
+ 71.0 Gbps0 drops |
+ 72.7 Gbps0 drops |
+
+
+ | 256 B |
+ 24.5 Gbps0 drops |
+ 24.8 Gbps0 drops |
+ 29.3 Gbps0 drops |
+ 21.1 Gbps0 drops |
+
+
+ | 64 B |
+ 10.1 Gbps0 drops |
+ 10.3 Gbps0 drops |
+ 9.8 Gbps0 drops |
+ 10.3 Gbps0 drops |
+
+
+
+
+**Reading the matrix.** At payload ≥ 4 KB the link saturates (~96–105
+Gbps) and batch size barely moves the number, so every cell is green.
+The 1 KB row is the transition: pps and Gbps both matter, batch size
+starts to influence which side dominates, and most cells fall under 70%
+of the global max. At ≤ 256 B the bottleneck is packets-per-second
+(~10 M pps ceiling at 64 B), so effective Gbps stays well below the
+link ceiling regardless of batch. Run-to-run variance on the unpaced
+cells is ~0.5 Gbps; row-internal Gbps differences smaller than that
+should be treated as noise.
+
+#### CPU and GPU utilization (headline cell, payload 8000 B / batch 10240, unpaced)
+
+| Resource | Value | Note |
+| -------------------- | ----- | ------------------------------------------ |
+| Master core (CPU 8) | 3.3% | Mostly idle; orchestration only |
+| TX core (CPU 17) | 91.4% | Poll-mode spin; rate-independent |
+| RX core (CPU 18) | 91.4% | Poll-mode spin; rate-independent |
+| GPU SM % | 0.0% | GPU is a DMA target, no compute |
+| GPU mem-ctrl % | 0.0% | Payload writes traverse PCIe, not the GPU memory controller |
+
+The TX/RX cores stay at ~92% across every drop-curve step from 1 Gbps to
+line rate — characteristic of DPDK's poll-mode driver, which spins waiting
+for descriptors regardless of offered load. The master core handles
+configuration only and idles below 5 % at the headline shape; at smaller
+payload sizes (1 KB and below) it occasionally hits 90%+ as more bursts
+flow through the orchestration path. That asymmetry is data, not a bug,
+and is captured in the per-cell artifacts under `bench-results/`.
+
+### RoCE
+
+**Deferred from this report.** See [headline-table footnote 1](#fn:1) and
+the [TODO / Known Limitations](#todo-not-yet-implemented-known-limitations)
+section. Single-host RoCE loopback on Spark requires a two-netns +
+two-process orchestration that the wrapper does not yet implement. The RoCE
+rows will be filled when the follow-up issue lands.
+
+### Socket
+
+**Deferred from this report.** See [headline-table footnotes 2 and 3](#fn:2)
+and the [TODO / Known Limitations](#todo-not-yet-implemented-known-limitations)
+section. Both backends produced unusable data on Spark during PR 1
+verification:
+
+- **Socket UDP** in `--mode both` deadlocks on peer learning — both ends try
+ to transmit before either has received, only ~1000 packets per 30 s get
+ through (≈ 390 kbps). Visible as repeated
+ `[ERROR] UDP server has no learned peer yet; cannot transmit` lines in
+ the bench's stderr.
+- **Socket TCP** in `--mode both` aborts with a glibc heap-corruption
+ assertion (`Fatal glibc error: malloc.c:2599 (sysmalloc)`) immediately
+ after the second TCP connection accept. The bench crashes before any
+ send / recv completes, so no completion line is printed.
+
+Both are tracked as separate follow-up issues; the Socket rows here will be
+filled once the underlying bench bugs are fixed.
+
+### Workload variants (FFT, GEMM)
+
+The post-process layer ([PR 2](https://github.com/NVIDIA/daqiri/issues/15))
+adds a `--post-process {fft,gemm}` flag to the bench, runs `cuFFT` /
+`cuBLAS` on the received GPU-resident payload, and reports the resulting
+throughput delta and GPU utilization.
+
+**Sizes (representative):**
+
+- FFT: 1D complex-to-complex, length 1024.
+- GEMM: fp32 square, N = 44 (largest tile that fits in an 8 KB payload).
+
+#### DPDK GPUDirect — FFT/GEMM
+
+_TBD (PR 2)._
+
+#### RoCE — FFT/GEMM
+
+_TBD (PR 2)._ Note the unit-of-work mismatch when comparing across backends:
+RoCE applies the post-process kernel once per ~8 MB SEND; DPDK applies it
+once per packet. The throughput numbers are comparable; "operations per
+burst" is not.
+
+## Python Results
+
+The Python benches ([PR 3](https://github.com/NVIDIA/daqiri/issues/16)) mirror
+the C++ benches' CLI and stdout format, using the existing pybind11 bindings.
+
+### Loopback
+
+_TBD (PR 3)._
+
+### FFT / GEMM via pybind of the C++ post-process layer
+
+_TBD (PR 4)._
+
+## Reproduce these results
+
+### Container
+
+All commands below assume execution inside the project container, as
+required by [`AGENTS.md`](https://github.com/nvidia/daqiri/blob/main/AGENTS.md).
+On the host, launch the container in privileged mode with all GPUs,
+hugepage mounts, and `/sys` passed through, and bind-mount the repo at
+`/workspace`:
+
+```bash
+# RX-side NIC; auto-injects ETH_DST_ADDR for the DPDK bench wrappers.
+RX_IFACE="${RX_IFACE:-enP2p1s0f0np0}"
+sudo docker run --rm -it \
+ --net host --ipc=host \
+ --runtime=nvidia --gpus all \
+ --privileged \
+ --ulimit memlock=-1 --ulimit stack=67108864 \
+ -v "$(pwd):/workspace" \
+ -v /dev/hugepages:/dev/hugepages \
+ -v /mnt/huge:/mnt/huge \
+ -v /sys:/sys \
+ -w /workspace \
+ -e ETH_DST_ADDR="$(cat /sys/class/net/$RX_IFACE/address)" \
+ -e RX_IFACE="$RX_IFACE" \
+ daqiri:local \
+ bash
+```
+
+### Build
+
+Inside the container:
+
+```bash
+cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=ON \
+ -DDAQIRI_MGR="dpdk socket rdma"
+cmake --build build -j
+```
+
+### One-shot driver
+
+The whole DPDK matrix (sweep + drop-curve) is driven by a single script
+which handles pre-flight checks (hugepage availability, RX iface MAC,
+link state), orphan-hugepage cleanup between cells, and a final summary of
+per-backend result directories:
+
+```bash
+./scripts/spark_data_fill.sh dpdk
+```
+
+The script defaults to `dpdk socket-udp socket-tcp` if invoked with no
+arguments; on Spark, the socket backends will fail their own pre-flight
+once the follow-up issues land. RDMA is currently rejected by the
+pre-flight (see [TODO / Known Limitations](#todo-not-yet-implemented-known-limitations)).
+
+### Per-backend wrapper invocations
+
+For ad-hoc runs of a single cell or a single mode:
+
+```bash
+export DAQIRI_BUILD_DIR="$PWD/build"
+export ETH_DST_ADDR="$(cat /sys/class/net//address)" # DPDK only
+
+./examples/run_spark_bench.sh dpdk smoke # native-shape headline cell
+./examples/run_spark_bench.sh dpdk sweep # full payload × batch matrix
+./examples/run_spark_bench.sh dpdk drop-curve # sweep --target-gbps
+```
+
+Each invocation emits `bench-results/-dpdk-/` containing
+one subdirectory per cell (stdout / stderr / config / dmon / cpu_stat
+captures), an `environment.txt` snapshot, and a `runs.csv` aggregating
+the cell-level metrics.
+
+### Environment-only capture
+
+Useful for filing a bug report or comparing two Spark units without
+running the bench:
+
+```bash
+./examples/bench_capture_environment.sh /tmp/spark-env
+```
+
+### Tuning prerequisites
+
+System tuning is required before the numbers in this report are
+reproducible. See
+[`docs/tutorials/system_configuration.md`](tutorials/system_configuration.md)
+for the DGX Spark tab — isolated cores, hugepages, governor, IRQ affinity.
+
+## TODO: Not Yet Implemented / Known Limitations
+
+- **HDS (Header–Data Split) is deferred.** The generic HDS configuration uses
+ `kind: device` for GPU memory regions. Spark / GB10 cannot use device memory
+ for GPUDirect — `nvidia_peermem` does not load and DMA-BUF is unreachable —
+ so DAQIRI uses `host_pinned` instead. Under `host_pinned`, the HDS layout no
+ longer changes the memory path; it only changes the segment partition,
+ which makes "HDS vs. plain GPUDirect" a non-distinction on this platform.
+ HDS is characterized when this report extends to IGX and x86-server
+ platforms where device memory works. See
+ [issue #15](https://github.com/NVIDIA/daqiri/issues/15) for tracking.
+- **RoCE single-host loopback is deferred from this report.** See footnote [^1]
+ on the headline tables. The wrapper currently runs `daqiri_bench_rdma` in
+ `--mode both` from a single process; on Spark, with both 1.1.1.1 (mlx5_0)
+ and 2.2.2.2 (mlx5_2) bound in the root namespace, the kernel shortcuts the
+ RC connection through `lo` and the QSFP cable carries no traffic. A
+ follow-up will land the two-netns + two-process orchestration and re-fill
+ the RoCE rows.
+- **Socket UDP / TCP results are deferred.** See footnotes [^2] and [^3].
+ Both are bench-side bugs uncovered during PR 1 verification on Spark:
+ UDP `--mode both` deadlocks on peer learning; TCP `--mode both` aborts with
+ a glibc heap-corruption assertion on init. Follow-up issues track each.
+- **p99/p999 latency is not in v1.** The bench output captures throughput,
+ drops, and resource utilization. Per-burst RX timestamping and percentile
+ aggregation are deferred to a follow-up issue.
diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css
index b5ce45f..95af265 100644
--- a/docs/stylesheets/extra.css
+++ b/docs/stylesheets/extra.css
@@ -50,3 +50,64 @@
[data-md-color-scheme="slate"] .md-footer {
background: #111;
}
+
+/* ── Performance-report heatmap cells ───────────────────────────────── */
+/* Used by the payload×batch and payload×target_gbps matrices in
+ docs/performance-*.md. Threshold logic (vs. matrix-global max):
+ green = no drops AND Gbps ≥ 90% of max
+ yellow = no drops AND Gbps ≥ 70% of max
+ red = any drops OR Gbps < 70% of max */
+.md-typeset table.perf-matrix {
+ width: 100%;
+ table-layout: auto;
+ border-collapse: separate;
+ border-spacing: 5px;
+ font-size: 0.64rem;
+}
+.md-typeset table.perf-matrix th,
+.md-typeset table.perf-matrix td {
+ text-align: center;
+ vertical-align: middle;
+ font-variant-numeric: tabular-nums;
+ padding: 0.55em 0.5em;
+ border-radius: 4px;
+ white-space: nowrap;
+}
+.md-typeset table.perf-matrix td small {
+ display: block;
+ opacity: 0.75;
+ font-size: 0.85em;
+ margin-top: 0.25em;
+}
+.md-typeset table.perf-matrix td.cell-green {
+ background-color: rgba(118, 185, 0, 0.28);
+ color: inherit;
+}
+.md-typeset table.perf-matrix td.cell-yellow {
+ background-color: rgba(255, 196, 0, 0.32);
+ color: inherit;
+}
+.md-typeset table.perf-matrix td.cell-red {
+ background-color: rgba(220, 60, 60, 0.32);
+ color: inherit;
+}
+.md-typeset table.perf-matrix th {
+ background-color: rgba(255, 255, 255, 0.05);
+ font-weight: 600;
+}
+/* Compact legend chips that pair with the matrix. */
+.md-typeset .perf-legend {
+ display: flex;
+ gap: 0.75em;
+ margin: 0.5em 0 1em 0;
+ font-size: 0.85em;
+ flex-wrap: wrap;
+}
+.md-typeset .perf-legend span {
+ padding: 0.1em 0.55em;
+ border-radius: 0.25em;
+ white-space: nowrap;
+}
+.md-typeset .perf-legend .cell-green { background-color: rgba(118, 185, 0, 0.28); }
+.md-typeset .perf-legend .cell-yellow { background-color: rgba(255, 196, 0, 0.32); }
+.md-typeset .perf-legend .cell-red { background-color: rgba(220, 60, 60, 0.32); }
diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt
index 4bc11aa..4807b85 100644
--- a/examples/CMakeLists.txt
+++ b/examples/CMakeLists.txt
@@ -71,14 +71,16 @@ add_daqiri_raw_bench(daqiri_bench_raw_reorder_quantize raw_reorder_quantize_benc
add_daqiri_raw_bench(daqiri_example_gds_write gds_write_example.cpp)
add_daqiri_raw_bench(daqiri_example_pcap_writer pcap_writer_example.cpp)
-add_executable(daqiri_bench_rdma rdma_bench.cpp)
+add_executable(daqiri_bench_rdma rdma_bench.cpp raw_bench_common.cpp)
link_daqiri_bench(daqiri_bench_rdma)
+target_link_libraries(daqiri_bench_rdma PRIVATE CUDA::cudart)
set_target_properties(daqiri_bench_rdma PROPERTIES
BUILD_RPATH "$ORIGIN/../src;$ORIGIN/../src/third_party/yaml-cpp"
)
-add_executable(daqiri_bench_socket socket_bench.cpp)
+add_executable(daqiri_bench_socket socket_bench.cpp raw_bench_common.cpp)
link_daqiri_bench(daqiri_bench_socket)
+target_link_libraries(daqiri_bench_socket PRIVATE CUDA::cudart)
foreach(cfg IN LISTS DAQIRI_BENCH_CONFIGS)
configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${cfg} ${CMAKE_CURRENT_BINARY_DIR}/${cfg} COPYONLY)
diff --git a/examples/bench_capture_environment.sh b/examples/bench_capture_environment.sh
new file mode 100755
index 0000000..3314b08
--- /dev/null
+++ b/examples/bench_capture_environment.sh
@@ -0,0 +1,116 @@
+#!/usr/bin/env bash
+#
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Capture host/NIC/GPU/build state for a benchmark run, so numbers are
+# reproducible across machines and over time. Writes one structured text file
+# with named sections.
+#
+# Usage: ./bench_capture_environment.sh
+# Default output dir: bench-results//
+
+set -u
+
+OUT_DIR="${1:-bench-results/$(date -u +%Y%m%dT%H%M%SZ)}"
+mkdir -p "$OUT_DIR"
+OUT="$OUT_DIR/environment.txt"
+
+# Run a command, capturing exit status. Always write a header so the section is
+# present even when the command is missing or fails — silent absence is harder
+# to debug than an explicit "command not found".
+run_section() {
+ local label="$1"; shift
+ {
+ echo "=========================================================="
+ echo "[$label]"
+ echo " cmd: $*"
+ echo "=========================================================="
+ if command -v "$1" >/dev/null 2>&1 || [[ "$1" == /* || "$1" == ./* ]]; then
+ "$@" 2>&1
+ echo " (exit: $?)"
+ else
+ echo " (command not found in PATH: $1)"
+ fi
+ echo
+ } >> "$OUT"
+}
+
+# Cat a file/glob; write a header either way.
+cat_section() {
+ local label="$1"; shift
+ {
+ echo "=========================================================="
+ echo "[$label]"
+ echo " paths: $*"
+ echo "=========================================================="
+ for p in "$@"; do
+ if compgen -G "$p" >/dev/null; then
+ for f in $p; do
+ echo "----- $f -----"
+ cat "$f" 2>&1
+ done
+ else
+ echo " (no match: $p)"
+ fi
+ done
+ echo
+ } >> "$OUT"
+}
+
+: > "$OUT"
+
+echo "DAQIRI benchmark environment capture" >> "$OUT"
+echo "Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$OUT"
+echo "Host: $(hostname)" >> "$OUT"
+echo "Output: $OUT" >> "$OUT"
+echo >> "$OUT"
+
+# --- Kernel / OS ---
+run_section "uname" uname -a
+cat_section "kernel-cmdline" /proc/cmdline
+cat_section "os-release" /etc/os-release
+run_section "lsb-release" lsb_release -a
+run_section "clocksource" cat /sys/devices/system/clocksource/clocksource0/current_clocksource
+
+# --- CPU / NUMA / IRQ ---
+run_section "numactl" numactl --show
+run_section "lscpu" lscpu
+cat_section "cpu-isolated" /sys/devices/system/cpu/isolated
+run_section "cpufreq-info" cpupower frequency-info
+cat_section "cpu-governor" /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+run_section "irq-mlx5" bash -c "grep mlx5 /proc/interrupts || true"
+
+# --- Hugepages ---
+cat_section "hugepages" /sys/kernel/mm/hugepages/*/nr_hugepages
+run_section "free-h" free -h
+
+# --- PCIe topology ---
+run_section "lspci-mellanox" bash -c "lspci -vvv -d 15b3: 2>/dev/null"
+run_section "lspci-nvidia" bash -c "lspci -vvv -d 10de: 2>/dev/null"
+
+# --- NIC: OFED / firmware / DPDK binding ---
+run_section "ofed-info" ofed_info -s
+run_section "mlxfwmanager" mlxfwmanager --query
+run_section "dpdk-devbind" dpdk-devbind.py --status
+# Per-iface ethtool — iterate over the daqiri-tx/rx names if present, else all mlx5.
+for iface in daqiri-tx daqiri-rx $(ls /sys/class/net 2>/dev/null | grep -E '^(enP|enp|eth)' || true); do
+ [[ -d "/sys/class/net/$iface" ]] || continue
+ run_section "ethtool-i:$iface" ethtool -i "$iface"
+ run_section "ethtool-g:$iface" ethtool -g "$iface"
+ run_section "ethtool-l:$iface" ethtool -l "$iface"
+ cat_section "iface-mtu:$iface" "/sys/class/net/$iface/mtu"
+ cat_section "iface-mac:$iface" "/sys/class/net/$iface/address"
+done
+
+# --- GPU ---
+run_section "nvidia-smi-q" nvidia-smi -q
+run_section "nvidia-smi-tempclk" nvidia-smi --query-gpu=name,driver_version,temperature.gpu,clocks.current.sm,clocks.current.memory --format=csv
+
+# --- Build state ---
+DAQIRI_DIR="$(git -C "$(dirname "$0")/.." rev-parse --show-toplevel 2>/dev/null || pwd)"
+run_section "git-rev-parse" git -C "$DAQIRI_DIR" rev-parse HEAD
+run_section "git-status" git -C "$DAQIRI_DIR" status --short
+run_section "git-describe" git -C "$DAQIRI_DIR" describe --always --dirty
+
+echo "Capture complete: $OUT"
diff --git a/examples/raw_bench_common.cpp b/examples/raw_bench_common.cpp
index 405c500..bb1e42c 100644
--- a/examples/raw_bench_common.cpp
+++ b/examples/raw_bench_common.cpp
@@ -19,10 +19,12 @@
#include
+#include
#include
#include
#include
#include
+#include
#include
#include
@@ -97,6 +99,44 @@ int parse_run_seconds(int argc, char **argv) {
return run_seconds;
}
+double parse_target_gbps(int argc, char **argv) {
+ double target_gbps = 0.0;
+ for (int i = 2; i + 1 < argc; i += 2) {
+ if (std::string(argv[i]) == "--target-gbps") {
+ target_gbps = std::stod(argv[i + 1]);
+ }
+ }
+ return target_gbps;
+}
+
+TokenBucketPacer::TokenBucketPacer(double target_gbps)
+ : target_bps_(target_gbps > 0.0 ? target_gbps * 1e9 : 0.0),
+ t0_(std::chrono::steady_clock::now()) {}
+
+void TokenBucketPacer::wait_for_bytes(size_t bytes, std::atomic &stop) {
+ if (target_bps_ <= 0.0) {
+ return;
+ }
+ total_bytes_ += bytes;
+ const double scheduled_secs = (total_bytes_ * 8.0) / target_bps_;
+ const auto scheduled = t0_ + std::chrono::duration_cast<
+ std::chrono::steady_clock::duration>(
+ std::chrono::duration(scheduled_secs));
+ // Slice the wait into 10 ms chunks so a stop flag (--seconds expiry or
+ // Ctrl-C) can break us out promptly. The total slept across the slices
+ // accumulates to the scheduled deadline, so pacing remains accurate.
+ constexpr auto kSlice = std::chrono::milliseconds(10);
+ while (!stop.load()) {
+ const auto now = std::chrono::steady_clock::now();
+ if (scheduled <= now) {
+ return;
+ }
+ const auto remaining = scheduled - now;
+ std::this_thread::sleep_for(
+ std::min(remaining, kSlice));
+ }
+}
+
bool has_bench_rx(const YAML::Node &root) {
return root["bench_rx"] && root["bench_rx"]["interface_name"];
}
@@ -287,6 +327,7 @@ void rx_count_worker(const RawBenchRxConfig &cfg, std::atomic &stop) {
uint64_t pkts = 0;
uint64_t bytes = 0;
uint64_t bursts = 0;
+ const auto t0 = std::chrono::steady_clock::now();
while (!stop.load()) {
const auto num_rx_queues =
static_cast(daqiri::get_num_rx_queues(port_id));
@@ -307,9 +348,17 @@ void rx_count_worker(const RawBenchRxConfig &cfg, std::atomic &stop) {
std::this_thread::sleep_for(std::chrono::microseconds(100));
}
}
-
- std::cout << "RX complete: packets=" << pkts << " bytes=" << bytes
- << " bursts=" << bursts << "\n";
+ const double secs =
+ std::chrono::duration(std::chrono::steady_clock::now() - t0)
+ .count();
+
+ // Build the line in a stringstream so the print to stdout is a single
+ // write(). RX and TX workers race at end-of-run and naive `cout <<` can
+ // interleave their output (corrupting downstream parsers).
+ std::ostringstream oss;
+ oss << "RX complete: packets=" << pkts << " bytes=" << bytes
+ << " bursts=" << bursts << " seconds=" << secs << "\n";
+ std::cout << oss.str() << std::flush;
}
} // namespace daqiri::bench
diff --git a/examples/raw_bench_common.h b/examples/raw_bench_common.h
index d53a4f9..cf0ee0c 100644
--- a/examples/raw_bench_common.h
+++ b/examples/raw_bench_common.h
@@ -21,6 +21,7 @@
#include
#include
+#include
#include
#include
#include
@@ -28,6 +29,34 @@
namespace daqiri::bench {
+// Software token-bucket pacer used by the bench TX workers. When
+// target_gbps == 0 the wait_for_bytes() call is a no-op early return, so the
+// pacer adds no overhead when --target-gbps is unset.
+//
+// Accuracy: ~5% at high rates due to Linux nanosleep granularity and scheduler
+// jitter. Acceptable for drop-curve sweeps; tighter pacing would require
+// hardware TX timestamping (DAQIRI's accurate_send YAML flag), deferred.
+class TokenBucketPacer {
+public:
+ TokenBucketPacer() = default;
+ explicit TokenBucketPacer(double target_gbps);
+
+ // Call after each TX burst. Sleeps in short slices until the pacer's notion
+ // of "time the configured target rate would have taken to send the
+ // accumulated bytes" catches up, OR `stop` flips true. Slicing keeps the
+ // bench responsive to --seconds expiry / Ctrl-C without truncating the total
+ // sleep (which would silently break pacing for low target rates).
+ void wait_for_bytes(size_t bytes, std::atomic &stop);
+
+ bool enabled() const { return target_bps_ > 0.0; }
+ double target_gbps() const { return target_bps_ / 1e9; }
+
+private:
+ double target_bps_ = 0.0; // 0 means disabled
+ uint64_t total_bytes_ = 0;
+ std::chrono::steady_clock::time_point t0_;
+};
+
struct RawBenchTxConfig {
std::string interface_name = "tx_port";
uint32_t batch_size = 1024;
@@ -68,6 +97,7 @@ class PinnedHostBuffer {
};
int parse_run_seconds(int argc, char **argv);
+double parse_target_gbps(int argc, char **argv);
bool has_bench_rx(const YAML::Node &root);
bool has_bench_tx(const YAML::Node &root);
RawBenchRxConfig parse_rx(const YAML::Node &root);
diff --git a/examples/raw_gpudirect_bench.cpp b/examples/raw_gpudirect_bench.cpp
index e93d0c8..cbfa9f9 100644
--- a/examples/raw_gpudirect_bench.cpp
+++ b/examples/raw_gpudirect_bench.cpp
@@ -24,6 +24,7 @@
#include
#include
#include
+#include
#include
#include
#include
@@ -35,6 +36,7 @@
namespace {
void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg,
+ daqiri::bench::TokenBucketPacer &pacer,
std::atomic &stop) {
const int port_id = daqiri::get_port_id(cfg.interface_name);
if (port_id < 0) {
@@ -67,6 +69,12 @@ void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg,
std::unordered_set initialized_tx_buffers;
+ uint64_t tx_packets = 0;
+ uint64_t tx_bytes = 0;
+ uint64_t tx_bursts = 0;
+ const auto t0 = std::chrono::steady_clock::now();
+ const uint32_t pkt_bytes = cfg.header_size + cfg.payload_size;
+
while (!stop.load()) {
auto *msg = daqiri::create_tx_burst_params();
daqiri::set_header(msg, static_cast(port_id), 0, cfg.batch_size,
@@ -109,7 +117,7 @@ void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg,
}
if (daqiri::set_packet_lengths(
- msg, i, {static_cast(cfg.header_size + cfg.payload_size)}) !=
+ msg, i, {static_cast(pkt_bytes)}) !=
daqiri::Status::SUCCESS) {
failed = true;
break;
@@ -121,18 +129,34 @@ void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg,
continue;
}
daqiri::send_tx_burst(msg);
+ const uint64_t burst_bytes =
+ static_cast(num_pkts) * pkt_bytes;
+ tx_packets += static_cast(num_pkts);
+ tx_bytes += burst_bytes;
+ ++tx_bursts;
+ pacer.wait_for_bytes(burst_bytes, stop);
}
+ const double secs =
+ std::chrono::duration(std::chrono::steady_clock::now() - t0)
+ .count();
+ // Single-write print so concurrent RX worker output doesn't interleave.
+ std::ostringstream oss;
+ oss << "TX complete: packets=" << tx_packets << " bytes=" << tx_bytes
+ << " bursts=" << tx_bursts << " seconds=" << secs << "\n";
+ std::cout << oss.str() << std::flush;
}
} // namespace
int main(int argc, char **argv) {
if (argc < 2) {
- std::cerr << "Usage: " << argv[0] << " [--seconds N]\n";
+ std::cerr << "Usage: " << argv[0]
+ << " [--seconds N] [--target-gbps G]\n";
return 1;
}
const int run_seconds = daqiri::bench::parse_run_seconds(argc, argv);
+ const double target_gbps = daqiri::bench::parse_target_gbps(argc, argv);
const auto root = YAML::LoadFile(argv[1]);
if (daqiri::daqiri_init(argv[1]) != daqiri::Status::SUCCESS) {
std::cerr << "daqiri_init failed\n";
@@ -150,14 +174,15 @@ int main(int argc, char **argv) {
std::atomic stop{false};
std::thread tx_thread;
std::thread rx_thread;
+ daqiri::bench::TokenBucketPacer tx_pacer(target_gbps);
if (has_rx) {
rx_thread = std::thread(daqiri::bench::rx_count_worker,
daqiri::bench::parse_rx(root), std::ref(stop));
}
if (has_tx) {
- tx_thread =
- std::thread(tx_worker, daqiri::bench::parse_tx(root), std::ref(stop));
+ tx_thread = std::thread(tx_worker, daqiri::bench::parse_tx(root),
+ std::ref(tx_pacer), std::ref(stop));
}
daqiri::bench::wait_for_stop(run_seconds, stop);
diff --git a/examples/rdma_bench.cpp b/examples/rdma_bench.cpp
index dbe8186..5b9bcfd 100644
--- a/examples/rdma_bench.cpp
+++ b/examples/rdma_bench.cpp
@@ -25,6 +25,7 @@
#include
#include
+#include "raw_bench_common.h"
#include "src/common.h"
namespace {
@@ -48,6 +49,8 @@ struct RdmaBenchConfig {
struct RdmaWorkerStats {
uint64_t send_completions = 0;
uint64_t recv_completions = 0;
+ uint64_t send_bytes = 0;
+ uint64_t recv_bytes = 0;
};
RdmaBenchConfig parse_rdma_cfg(const YAML::Node& node) {
@@ -62,7 +65,8 @@ RdmaBenchConfig parse_rdma_cfg(const YAML::Node& node) {
return cfg;
}
-void rdma_worker(const RdmaBenchConfig& cfg, std::atomic& stop, RdmaWorkerStats& stats) {
+void rdma_worker(const RdmaBenchConfig& cfg, daqiri::bench::TokenBucketPacer& pacer,
+ std::atomic& stop, RdmaWorkerStats& stats) {
static constexpr int kMaxOutstanding = 5;
int outstanding_send = 0;
int outstanding_recv = 0;
@@ -116,6 +120,10 @@ void rdma_worker(const RdmaBenchConfig& cfg, std::atomic& stop, RdmaWorker
}
outstanding++;
wr_id++;
+ // Only meter actual byte transmissions (SENDs), not RECEIVE-side posts.
+ if (op == daqiri::RDMAOpCode::SEND) {
+ pacer.wait_for_bytes(static_cast(cfg.message_size), stop);
+ }
};
if (cfg.send) { post_req(outstanding_send, send_wr_id, daqiri::RDMAOpCode::SEND, send_mr); }
@@ -127,10 +135,12 @@ void rdma_worker(const RdmaBenchConfig& cfg, std::atomic& stop, RdmaWorker
if (daqiri::rdma_get_opcode(completion) == daqiri::RDMAOpCode::SEND && outstanding_send > 0) {
outstanding_send--;
stats.send_completions++;
+ stats.send_bytes += static_cast(cfg.message_size);
} else if (daqiri::rdma_get_opcode(completion) == daqiri::RDMAOpCode::RECEIVE &&
outstanding_recv > 0) {
outstanding_recv--;
stats.recv_completions++;
+ stats.recv_bytes += static_cast(cfg.message_size);
}
daqiri::free_tx_burst(completion);
} else {
@@ -143,17 +153,21 @@ void rdma_worker(const RdmaBenchConfig& cfg, std::atomic& stop, RdmaWorker
int main(int argc, char** argv) {
if (argc < 2) {
- std::cerr << "Usage: " << argv[0] << " [--seconds N] [--mode server|client|both]\n";
+ std::cerr << "Usage: " << argv[0]
+ << " [--seconds N] [--mode server|client|both] [--target-gbps G]\n";
return 1;
}
int run_seconds = 10;
+ double target_gbps = 0.0;
std::string mode = "both";
for (int i = 2; i + 1 < argc; i += 2) {
if (std::string(argv[i]) == "--seconds") {
run_seconds = std::stoi(argv[i + 1]);
} else if (std::string(argv[i]) == "--mode") {
mode = argv[i + 1];
+ } else if (std::string(argv[i]) == "--target-gbps") {
+ target_gbps = std::stod(argv[i + 1]);
}
}
@@ -168,18 +182,22 @@ int main(int argc, char** argv) {
std::thread client_thread;
RdmaWorkerStats server_stats;
RdmaWorkerStats client_stats;
+ daqiri::bench::TokenBucketPacer server_pacer(target_gbps);
+ daqiri::bench::TokenBucketPacer client_pacer(target_gbps);
bool run_server = false;
bool run_client = false;
if ((mode == "server" || mode == "both") && root["rdma_bench_server"]) {
run_server = true;
server_thread = std::thread(
- rdma_worker, parse_rdma_cfg(root["rdma_bench_server"]), std::ref(stop), std::ref(server_stats));
+ rdma_worker, parse_rdma_cfg(root["rdma_bench_server"]),
+ std::ref(server_pacer), std::ref(stop), std::ref(server_stats));
}
if ((mode == "client" || mode == "both") && root["rdma_bench_client"]) {
run_client = true;
client_thread = std::thread(
- rdma_worker, parse_rdma_cfg(root["rdma_bench_client"]), std::ref(stop), std::ref(client_stats));
+ rdma_worker, parse_rdma_cfg(root["rdma_bench_client"]),
+ std::ref(client_pacer), std::ref(stop), std::ref(client_stats));
}
if (!server_thread.joinable() && !client_thread.joinable()) {
@@ -203,11 +221,23 @@ int main(int argc, char** argv) {
if (server_thread.joinable()) { server_thread.join(); }
if (client_thread.joinable()) { client_thread.join(); }
+ const double secs =
+ std::chrono::duration(std::chrono::steady_clock::now() - start)
+ .count();
+
if (run_server) {
- std::cout << "Server received messages: " << server_stats.recv_completions << '\n';
+ std::cout << "Server complete: send_completions=" << server_stats.send_completions
+ << " recv_completions=" << server_stats.recv_completions
+ << " send_bytes=" << server_stats.send_bytes
+ << " recv_bytes=" << server_stats.recv_bytes
+ << " seconds=" << secs << '\n';
}
if (run_client) {
- std::cout << "Client received messages: " << client_stats.recv_completions << '\n';
+ std::cout << "Client complete: send_completions=" << client_stats.send_completions
+ << " recv_completions=" << client_stats.recv_completions
+ << " send_bytes=" << client_stats.send_bytes
+ << " recv_bytes=" << client_stats.recv_bytes
+ << " seconds=" << secs << '\n';
}
daqiri::print_stats();
diff --git a/examples/run_spark_bench.sh b/examples/run_spark_bench.sh
new file mode 100755
index 0000000..c2aa75a
--- /dev/null
+++ b/examples/run_spark_bench.sh
@@ -0,0 +1,334 @@
+#!/usr/bin/env bash
+#
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Sweep wrapper for DAQIRI benchmarks on DGX Spark. Runs the bench across a
+# matrix of (payload/message size, batch size, target-gbps), captures per-run
+# CPU/GPU/NIC counters, and emits one CSV row per cell into bench-results/.
+#
+# Drop sources per backend (per the report methodology):
+# DPDK : grep imissed/ierrors/rx_nombuf from bench log (DAQIRI_LOG_INFO).
+# RDMA : grep "CQ error" lines from bench log (DAQIRI_LOG_ERROR).
+# socket : diff /proc/net/udp drops column (UDP); nstat -a (TCP retransmits).
+#
+# Usage:
+# ./run_spark_bench.sh [mode]
+# backend ∈ {dpdk, rdma, socket-udp, socket-tcp}
+# mode ∈ {smoke, sweep, drop-curve, drop-curve-matrix} (default: smoke)
+#
+# Required environment in current shell:
+# DAQIRI_BUILD_DIR — path to the cmake build dir (defaults to ../build).
+# ETH_DST_ADDR — required for dpdk backend (the RX iface MAC).
+# RX_IFACE — kernel name of the RX interface for /proc/net/udp diff
+# (e.g. enP2p1s0f0np0); required for socket-udp.
+#
+# Run inside the project container as root (per AGENTS.md).
+
+set -u
+set -o pipefail
+
+# --------------------------------------------------------------------------
+# Configuration
+# --------------------------------------------------------------------------
+
+BACKEND="${1:-}"
+MODE="${2:-smoke}"
+if [[ -z "$BACKEND" ]]; then
+ echo "Usage: $0 [smoke|sweep|drop-curve|drop-curve-matrix]" >&2
+ exit 1
+fi
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+BUILD_DIR="${DAQIRI_BUILD_DIR:-$SCRIPT_DIR/../build}"
+TS="$(date -u +%Y%m%dT%H%M%SZ)"
+OUT_DIR="$SCRIPT_DIR/../bench-results/$TS-$BACKEND-$MODE"
+mkdir -p "$OUT_DIR"
+
+CSV="$OUT_DIR/runs.csv"
+echo "lang,backend,post_process,payload,batch,target_gbps,seconds,packets,bytes,pps,gbps,drops,drops_kind,cpu_master_pct,cpu_tx_pct,cpu_rx_pct,gpu_sm_pct,gpu_mem_pct" > "$CSV"
+
+# Capture slow-moving environment state once per result set.
+"$SCRIPT_DIR/bench_capture_environment.sh" "$OUT_DIR"
+
+RUN_SECONDS=30
+DRIVER_LOG="$OUT_DIR/last_run.stderr"
+
+# Per-backend sweep matrices (see docs/performance-dgx-spark.md methodology).
+# Native-shape sizes are the leftmost entry; "matched 8K" cell is also included.
+case "$BACKEND" in
+ dpdk)
+ PAYLOADS_SWEEP=(8000 4096 1024 256 64)
+ BATCHES_SWEEP=(10240 4096 1024 256)
+ PAYLOADS_HEADLINE=(8000)
+ BATCHES_HEADLINE=(10240)
+ BASE_YAML="$SCRIPT_DIR/daqiri_bench_raw_tx_rx_spark.yaml"
+ BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_raw_gpudirect"
+ CPU_MASTER=8; CPU_TX=17; CPU_RX=18
+ : "${ETH_DST_ADDR:?ETH_DST_ADDR must be set for dpdk backend (cat /sys/class/net//address)}"
+ ;;
+ rdma)
+ PAYLOADS_SWEEP=(8000000 1048576 65536 8192 4096)
+ BATCHES_SWEEP=(1)
+ PAYLOADS_HEADLINE=(8000000)
+ BATCHES_HEADLINE=(1)
+ BASE_YAML="$SCRIPT_DIR/daqiri_bench_rdma_tx_rx_spark.yaml"
+ BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_rdma"
+ CPU_MASTER=8; CPU_TX=17; CPU_RX=18
+ ;;
+ socket-udp)
+ PAYLOADS_SWEEP=(1472 1024 256 64)
+ BATCHES_SWEEP=(256 32 1)
+ PAYLOADS_HEADLINE=(1472)
+ BATCHES_HEADLINE=(256)
+ BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_udp_tx_rx.yaml"
+ BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket"
+ CPU_MASTER=8; CPU_TX=17; CPU_RX=18
+ ;;
+ socket-tcp)
+ PAYLOADS_SWEEP=(1048576 65536 1024)
+ BATCHES_SWEEP=(1)
+ PAYLOADS_HEADLINE=(65536)
+ BATCHES_HEADLINE=(1)
+ BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_tcp_tx_rx.yaml"
+ BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket"
+ CPU_MASTER=8; CPU_TX=17; CPU_RX=18
+ ;;
+ *) echo "Unknown backend: $BACKEND" >&2; exit 1 ;;
+esac
+
+DROP_CURVE_TARGETS=(1 5 10 25 50 75 100 0) # 0 means unpaced (line rate)
+
+# --------------------------------------------------------------------------
+# Helpers
+# --------------------------------------------------------------------------
+
+# Read a scalar field from a `key=value` style stdout line.
+# usage: extract_field
+extract_field() {
+ local prefix="$1" field="$2" file="$3"
+ grep -E "^$prefix" "$file" | tail -n1 | grep -oE " $field=[^ ]+" | head -n1 | sed -E "s/.*$field=//"
+}
+
+# Sum DPDK drop counters from the manager log emitted via DAQIRI_LOG_INFO.
+parse_dpdk_drops() {
+ local log="$1"
+ local sum=0 v
+ for key in imissed ierrors rx_nombuf; do
+ v="$(grep -oE "$key=[0-9]+" "$log" 2>/dev/null | tail -n1 | sed -E "s/.*=//" || true)"
+ [[ -n "${v:-}" ]] && sum=$((sum + v))
+ done
+ echo "$sum"
+}
+
+# Count RDMA CQ errors in the manager log.
+parse_rdma_drops() {
+ local log="$1"
+ grep -c 'CQ error' "$log" 2>/dev/null || echo 0
+}
+
+# Snapshot socket drops on the kernel side.
+snapshot_proc_net_udp() {
+ awk 'NR>1 { sum += strtonum("0x" $13) } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0
+}
+snapshot_nstat() {
+ nstat -a 2>/dev/null | awk '/TcpExtTCPLostRetransmit|TcpRetransSegs|TcpInErrs/ { s += $2 } END { print s+0 }' || echo 0
+}
+
+# Snapshot /proc/stat per-cpu counters to a file. Mpstat is often not installed
+# in the bench container; /proc/stat is always available.
+snapshot_cpu_stat() {
+ awk '/^cpu[0-9]+/ {
+ total = $2+$3+$4+$5+$6+$7+$8
+ busy = total - $5 - $6
+ print $1, total, busy
+ }' /proc/stat > "$1"
+}
+
+# Compute busy% for a single cpu index between two /proc/stat snapshots.
+cpu_busy_pct() {
+ local before="$1" after="$2" cpu_idx="$3"
+ awk -v cpu="cpu$cpu_idx" '
+ NR == FNR { b_total[$1] = $2; b_busy[$1] = $3; next }
+ { a_total[$1] = $2; a_busy[$1] = $3 }
+ END {
+ dt = a_total[cpu] - b_total[cpu]
+ db = a_busy[cpu] - b_busy[cpu]
+ if (dt > 0) printf "%.1f", (db * 100.0) / dt
+ else printf "0.0"
+ }
+ ' "$before" "$after"
+}
+
+# Substitute payload / batch into the base YAML and write a temp config.
+generate_yaml() {
+ local out="$1" payload="$2" batch="$3"
+ case "$BACKEND" in
+ dpdk)
+ sed -E \
+ -e "s|^( *payload_size: ).*|\1$payload|" \
+ -e "s|^( *batch_size: ).*|\1$batch|" \
+ -e "s|<00:00:00:00:00:00>|$ETH_DST_ADDR|g" \
+ "$BASE_YAML" > "$out"
+ ;;
+ rdma)
+ sed -E "s|^( *message_size: ).*|\1$payload|g" "$BASE_YAML" > "$out"
+ ;;
+ socket-udp|socket-tcp)
+ sed -E "s|^( *message_size: ).*|\1$payload|g" "$BASE_YAML" > "$out"
+ ;;
+ esac
+}
+
+# Run one cell. Echoes the CSV row to stdout.
+run_cell() {
+ local lang="$1" payload="$2" batch="$3" target_gbps="$4"
+ local cell="$lang-$BACKEND-p$payload-b$batch-g$target_gbps"
+ local cell_dir="$OUT_DIR/$cell"
+ mkdir -p "$cell_dir"
+
+ local yaml="$cell_dir/config.yaml"
+ generate_yaml "$yaml" "$payload" "$batch"
+
+ # Snapshot kernel-side drop counters.
+ local udp_before tcp_before
+ udp_before="$(snapshot_proc_net_udp)"
+ tcp_before="$(snapshot_nstat)"
+
+ # Snapshot per-cpu stats just before the bench starts.
+ snapshot_cpu_stat "$cell_dir/cpu_stat.before"
+
+ # Background GPU dmon (1-sec sample, RUN_SECONDS samples).
+ ( nvidia-smi dmon -s pucvmet -c "$RUN_SECONDS" > "$cell_dir/nvidia_smi_dmon.txt" 2>&1 ) &
+ local dmon_pid=$!
+
+ # Run the bench. Stderr captures DAQIRI_LOG_* output (DPDK/RDMA drop sources).
+ local stdout="$cell_dir/stdout.txt"
+ local stderr="$cell_dir/stderr.txt"
+ local args=("$yaml" --seconds "$RUN_SECONDS")
+ [[ "$target_gbps" != "0" ]] && args+=(--target-gbps "$target_gbps")
+ [[ "$BACKEND" == "rdma" || "$BACKEND" =~ ^socket- ]] && args+=(--mode both)
+
+ "$BENCH_BIN" "${args[@]}" > "$stdout" 2> "$stderr" || true
+ cp "$stderr" "$DRIVER_LOG"
+
+ # Snapshot per-cpu stats right after the bench exits (before background
+ # captures finish reaping, to bound the window).
+ snapshot_cpu_stat "$cell_dir/cpu_stat.after"
+
+ # Stop background captures (they self-terminate at -c , but reap if needed).
+ wait "$dmon_pid" 2>/dev/null || true
+
+ # Parse bench stdout. For RX-bearing benches "RX complete" is authoritative;
+ # for TX-only configs fall back to "TX complete".
+ local pkts bytes secs
+ pkts="$(extract_field 'RX complete' packets "$stdout")"
+ bytes="$(extract_field 'RX complete' bytes "$stdout")"
+ secs="$(extract_field 'RX complete' seconds "$stdout")"
+ if [[ -z "$pkts" ]]; then
+ pkts="$(extract_field 'TX complete' packets "$stdout")"
+ bytes="$(extract_field 'TX complete' bytes "$stdout")"
+ secs="$(extract_field 'TX complete' seconds "$stdout")"
+ fi
+ if [[ -z "$pkts" ]]; then
+ # RDMA prints "Client/Server complete: ... send_completions=N send_bytes=N seconds=S"
+ pkts="$(extract_field 'Client complete' send_completions "$stdout")"
+ bytes="$(extract_field 'Client complete' send_bytes "$stdout")"
+ secs="$(extract_field 'Client complete' seconds "$stdout")"
+ fi
+ pkts="${pkts:-0}"; bytes="${bytes:-0}"; secs="${secs:-0}"
+
+ local pps gbps
+ pps="$(awk -v p="$pkts" -v s="$secs" 'BEGIN { if (s+0>0) printf "%.0f", p/s; else print 0 }')"
+ gbps="$(awk -v b="$bytes" -v s="$secs" 'BEGIN { if (s+0>0) printf "%.3f", (b*8.0)/s/1e9; else print 0 }')"
+
+ # Drops per backend.
+ local drops drops_kind
+ case "$BACKEND" in
+ dpdk)
+ drops="$(parse_dpdk_drops "$stderr")"
+ drops_kind="dpdk-imissed+ierrors+nombuf"
+ ;;
+ rdma)
+ drops="$(parse_rdma_drops "$stderr")"
+ drops_kind="rdma-cqe-error"
+ ;;
+ socket-udp)
+ local udp_after; udp_after="$(snapshot_proc_net_udp)"
+ drops="$((udp_after - udp_before))"
+ drops_kind="udp-proc-net-udp-drops"
+ ;;
+ socket-tcp)
+ local tcp_after; tcp_after="$(snapshot_nstat)"
+ drops="$((tcp_after - tcp_before))"
+ drops_kind="tcp-nstat-retrans+inerrs"
+ ;;
+ esac
+
+ # Per-core CPU busy% over the bench window. Cores defined per-backend
+ # (master/TX/RX) match the YAML so we measure the threads we actually pin.
+ local cpu_master_pct cpu_tx_pct cpu_rx_pct
+ cpu_master_pct="$(cpu_busy_pct "$cell_dir/cpu_stat.before" "$cell_dir/cpu_stat.after" "$CPU_MASTER")"
+ cpu_tx_pct="$(cpu_busy_pct "$cell_dir/cpu_stat.before" "$cell_dir/cpu_stat.after" "$CPU_TX")"
+ cpu_rx_pct="$(cpu_busy_pct "$cell_dir/cpu_stat.before" "$cell_dir/cpu_stat.after" "$CPU_RX")"
+
+ # GPU SM% (column 5) and memory-controller % (column 6) from nvidia-smi
+ # dmon -s pucvmet. These are near zero for GPUDirect workloads (GPU is a
+ # DMA target, not a compute engine).
+ local gpu_sm gpu_mem
+ gpu_sm="$(awk '/^ *[0-9]/ { count++; sum += $5 } END { if (count) printf "%.1f", sum/count; else print 0 }' \
+ "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)"
+ gpu_mem="$(awk '/^ *[0-9]/ { count++; sum += $6 } END { if (count) printf "%.1f", sum/count; else print 0 }' \
+ "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)"
+
+ echo "$lang,$BACKEND,none,$payload,$batch,$target_gbps,$secs,$pkts,$bytes,$pps,$gbps,$drops,$drops_kind,$cpu_master_pct,$cpu_tx_pct,$cpu_rx_pct,$gpu_sm,$gpu_mem" \
+ | tee -a "$CSV"
+}
+
+# --------------------------------------------------------------------------
+# Driver
+# --------------------------------------------------------------------------
+
+case "$MODE" in
+ smoke)
+ # One cell, native-shape, unpaced.
+ for p in "${PAYLOADS_HEADLINE[@]}"; do
+ for b in "${BATCHES_HEADLINE[@]}"; do
+ run_cell cpp "$p" "$b" 0
+ done
+ done
+ ;;
+ sweep)
+ # Full payload × batch matrix at line rate.
+ for p in "${PAYLOADS_SWEEP[@]}"; do
+ for b in "${BATCHES_SWEEP[@]}"; do
+ run_cell cpp "$p" "$b" 0
+ done
+ done
+ ;;
+ drop-curve)
+ # Hold native-shape constant, sweep target_gbps.
+ for p in "${PAYLOADS_HEADLINE[@]}"; do
+ for b in "${BATCHES_HEADLINE[@]}"; do
+ for g in "${DROP_CURVE_TARGETS[@]}"; do
+ run_cell cpp "$p" "$b" "$g"
+ done
+ done
+ done
+ ;;
+ drop-curve-matrix)
+ # 2D drop curve: sweep payload × target_gbps at the headline batch.
+ for p in "${PAYLOADS_SWEEP[@]}"; do
+ for b in "${BATCHES_HEADLINE[@]}"; do
+ for g in "${DROP_CURVE_TARGETS[@]}"; do
+ run_cell cpp "$p" "$b" "$g"
+ done
+ done
+ done
+ ;;
+ *) echo "Unknown mode: $MODE" >&2; exit 1 ;;
+esac
+
+echo
+echo "Results in: $OUT_DIR"
+echo "CSV: $CSV"
diff --git a/examples/socket_bench.cpp b/examples/socket_bench.cpp
index 5fda5f5..8ebb27f 100644
--- a/examples/socket_bench.cpp
+++ b/examples/socket_bench.cpp
@@ -26,6 +26,7 @@
#include
#include
+#include "raw_bench_common.h"
#include "src/common.h"
namespace {
@@ -50,6 +51,8 @@ struct SocketBenchConfig {
struct SocketWorkerStats {
uint64_t sent_packets = 0;
uint64_t received_packets = 0;
+ uint64_t sent_bytes = 0;
+ uint64_t received_bytes = 0;
};
SocketBenchConfig parse_socket_cfg(const YAML::Node& node) {
@@ -65,7 +68,8 @@ SocketBenchConfig parse_socket_cfg(const YAML::Node& node) {
return cfg;
}
-void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, SocketWorkerStats& stats) {
+void socket_worker(const SocketBenchConfig& cfg, daqiri::bench::TokenBucketPacer& pacer,
+ std::atomic& stop, SocketWorkerStats& stats) {
uintptr_t conn_id = 0;
uint16_t port = 0;
uint16_t queue = 0;
@@ -92,8 +96,14 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, Socket
}
}
- const bool send_done = !cfg.send || stats.sent_packets >= static_cast(cfg.iterations);
- const bool recv_done = !cfg.receive || stats.received_packets >= static_cast(cfg.iterations);
+ // When cfg.iterations <= 0, the loop is time-bounded (driven by stop.load()
+ // set by --seconds). Otherwise the iteration cap applies as before.
+ const bool send_done = !cfg.send ||
+ (cfg.iterations > 0 &&
+ stats.sent_packets >= static_cast(cfg.iterations));
+ const bool recv_done = !cfg.receive ||
+ (cfg.iterations > 0 &&
+ stats.received_packets >= static_cast(cfg.iterations));
if (send_done && recv_done) { break; }
if (cfg.send && !send_done) {
@@ -110,6 +120,8 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, Socket
if (daqiri::send_tx_burst(msg) == daqiri::Status::SUCCESS) {
stats.sent_packets++;
+ stats.sent_bytes += static_cast(cfg.message_size);
+ pacer.wait_for_bytes(static_cast(cfg.message_size), stop);
}
} else {
daqiri::free_tx_metadata(msg);
@@ -120,7 +132,9 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, Socket
daqiri::BurstParams* burst = nullptr;
if (daqiri::get_rx_burst(&burst, conn_id, cfg.server) == daqiri::Status::SUCCESS &&
burst != nullptr) {
- stats.received_packets += static_cast(daqiri::get_num_packets(burst));
+ const uint64_t rx_pkts = static_cast(daqiri::get_num_packets(burst));
+ stats.received_packets += rx_pkts;
+ stats.received_bytes += daqiri::get_burst_tot_byte(burst);
daqiri::free_all_packets_and_burst_rx(burst);
} else {
std::this_thread::sleep_for(std::chrono::microseconds(100));
@@ -134,17 +148,20 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, Socket
int main(int argc, char** argv) {
if (argc < 2) {
std::cerr << "Usage: " << argv[0]
- << " [--seconds N] [--mode server|client|both]\n";
+ << " [--seconds N] [--mode server|client|both] [--target-gbps G]\n";
return 1;
}
int run_seconds = 10;
+ double target_gbps = 0.0;
std::string mode = "both";
for (int i = 2; i + 1 < argc; i += 2) {
if (std::string(argv[i]) == "--seconds") {
run_seconds = std::stoi(argv[i + 1]);
} else if (std::string(argv[i]) == "--mode") {
mode = argv[i + 1];
+ } else if (std::string(argv[i]) == "--target-gbps") {
+ target_gbps = std::stod(argv[i + 1]);
}
}
@@ -159,6 +176,8 @@ int main(int argc, char** argv) {
std::thread client_thread;
SocketWorkerStats server_stats;
SocketWorkerStats client_stats;
+ daqiri::bench::TokenBucketPacer server_pacer(target_gbps);
+ daqiri::bench::TokenBucketPacer client_pacer(target_gbps);
bool run_server = false;
bool run_client = false;
@@ -166,6 +185,7 @@ int main(int argc, char** argv) {
run_server = true;
server_thread = std::thread(socket_worker,
parse_socket_cfg(root["socket_bench_server"]),
+ std::ref(server_pacer),
std::ref(stop),
std::ref(server_stats));
}
@@ -173,6 +193,7 @@ int main(int argc, char** argv) {
run_client = true;
client_thread = std::thread(socket_worker,
parse_socket_cfg(root["socket_bench_client"]),
+ std::ref(client_pacer),
std::ref(stop),
std::ref(client_stats));
}
@@ -199,13 +220,23 @@ int main(int argc, char** argv) {
if (server_thread.joinable()) { server_thread.join(); }
if (client_thread.joinable()) { client_thread.join(); }
+ const double secs =
+ std::chrono::duration(std::chrono::steady_clock::now() - start)
+ .count();
+
if (run_server) {
- std::cout << "Server sent packets: " << server_stats.sent_packets
- << ", received packets: " << server_stats.received_packets << '\n';
+ std::cout << "Server complete: sent_packets=" << server_stats.sent_packets
+ << " recv_packets=" << server_stats.received_packets
+ << " sent_bytes=" << server_stats.sent_bytes
+ << " recv_bytes=" << server_stats.received_bytes
+ << " seconds=" << secs << '\n';
}
if (run_client) {
- std::cout << "Client sent packets: " << client_stats.sent_packets
- << ", received packets: " << client_stats.received_packets << '\n';
+ std::cout << "Client complete: sent_packets=" << client_stats.sent_packets
+ << " recv_packets=" << client_stats.received_packets
+ << " sent_bytes=" << client_stats.sent_bytes
+ << " recv_bytes=" << client_stats.received_bytes
+ << " seconds=" << secs << '\n';
}
daqiri::print_stats();
diff --git a/mkdocs.yml b/mkdocs.yml
index 4b80201..0fd6b1a 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -35,6 +35,8 @@ nav:
- Getting Started: getting-started.md
- Configuration: configuration.md
- API Reference: api-guide.md
+ - Performance:
+ - DGX Spark: performance-dgx-spark.md
- Tutorials:
- Background: tutorials/background.md
- System Configuration: tutorials/system_configuration.md
@@ -45,6 +47,7 @@ markdown_extensions:
- admonition
- attr_list
- def_list
+ - footnotes
- md_in_html
- tables
- toc:
diff --git a/scripts/spark_data_fill.sh b/scripts/spark_data_fill.sh
new file mode 100755
index 0000000..0e4f81b
--- /dev/null
+++ b/scripts/spark_data_fill.sh
@@ -0,0 +1,143 @@
+#!/usr/bin/env bash
+# Drives the PR 1 data-fill bench runs for the DGX Spark performance report.
+#
+# Runs DPDK GPUDirect, socket-UDP, and socket-TCP through their sweep and
+# drop-curve modes via examples/run_spark_bench.sh, with pre-flight checks
+# and orphan-hugepage cleanup. RDMA is deferred from PR 1 (single-host
+# loopback over the cable needs a netns+two-process refactor; tracked
+# separately).
+#
+# Run inside the project container (privileged, --gpus all, /dev/hugepages
+# and /mnt/huge mounted, repo at /workspace).
+#
+# Usage:
+# ./scripts/spark_data_fill.sh # all three backends
+# ./scripts/spark_data_fill.sh dpdk # just DPDK
+# ./scripts/spark_data_fill.sh socket-udp socket-tcp
+#
+# Env overrides:
+# ETH_DST_ADDR — RX-side MAC. Auto-detected from
+# /sys/class/net/enP2p1s0f0np0/address if unset.
+# RX_IFACE — RX netdev name (default enP2p1s0f0np0).
+# DAQIRI_BUILD_DIR — defaults to ./build.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
+WRAPPER="$REPO_ROOT/examples/run_spark_bench.sh"
+BUILD_DIR="${DAQIRI_BUILD_DIR:-$REPO_ROOT/build}"
+RX_IFACE="${RX_IFACE:-enP2p1s0f0np0}"
+
+BACKENDS=("$@")
+[[ ${#BACKENDS[@]} -eq 0 ]] && BACKENDS=(dpdk socket-udp socket-tcp)
+
+# --- pre-flight ------------------------------------------------------------
+
+preflight_fail() { echo "PREFLIGHT FAIL: $*" >&2; exit 1; }
+note() { echo "[$(date -u +%H:%M:%SZ)] $*"; }
+
+[[ -x "$WRAPPER" ]] || preflight_fail "wrapper missing or not executable: $WRAPPER"
+
+for be in "${BACKENDS[@]}"; do
+ case "$be" in
+ dpdk) bin="$BUILD_DIR/examples/daqiri_bench_raw_gpudirect" ;;
+ socket-udp|socket-tcp) bin="$BUILD_DIR/examples/daqiri_bench_socket" ;;
+ rdma) preflight_fail "RDMA is deferred from PR 1; see follow-up issue" ;;
+ *) preflight_fail "unknown backend: $be" ;;
+ esac
+ [[ -x "$bin" ]] || preflight_fail "missing bench binary: $bin (run cmake --build first)"
+done
+
+# DPDK-only checks.
+if [[ " ${BACKENDS[*]} " == *" dpdk "* ]]; then
+ free_hp="$(awk '/^HugePages_Free:/ { print $2 }' /proc/meminfo)"
+ [[ "${free_hp:-0}" -ge 4 ]] || preflight_fail "HugePages_Free=$free_hp (need >=4); clean /mnt/huge and /dev/hugepages from prior runs"
+
+ if [[ -z "${ETH_DST_ADDR:-}" ]]; then
+ mac_path="/sys/class/net/$RX_IFACE/address"
+ [[ -r "$mac_path" ]] || preflight_fail "cannot read $mac_path; set ETH_DST_ADDR explicitly"
+ ETH_DST_ADDR="$(cat "$mac_path")"
+ export ETH_DST_ADDR
+ note "ETH_DST_ADDR auto-detected from $RX_IFACE: $ETH_DST_ADDR"
+ fi
+
+ carrier="$(cat "/sys/class/net/$RX_IFACE/carrier" 2>/dev/null || echo 0)"
+ [[ "$carrier" == "1" ]] || preflight_fail "RX iface $RX_IFACE has no carrier (cable unplugged or link down)"
+fi
+
+note "Pre-flight OK. Backends: ${BACKENDS[*]}"
+note "Build dir: $BUILD_DIR"
+note "Repo root: $REPO_ROOT"
+
+# --- hugepage cleanup helper ----------------------------------------------
+
+# DPDK leaves orphan rtemap_* files when a bench aborts. Clean between runs so
+# we don't run out of hugepages mid-sweep.
+clean_orphan_hugepages() {
+ local pre post freed
+ pre="$(awk '/^HugePages_Free:/ { print $2 }' /proc/meminfo)"
+ : "${pre:=0}"
+ shopt -s nullglob
+ # DPDK uses a random per-process file prefix (override with --file-prefix);
+ # match anything ending in `map_` to catch the common shape without
+ # nuking unrelated files. Skip any that are still held by a live process.
+ for f in /dev/hugepages/*map_[0-9]* /mnt/huge/*map_[0-9]*; do
+ if ! fuser -- "$f" >/dev/null 2>&1; then
+ rm -f -- "$f" 2>/dev/null || true
+ fi
+ done
+ shopt -u nullglob
+ post="$(awk '/^HugePages_Free:/ { print $2 }' /proc/meminfo)"
+ : "${post:=0}"
+ freed=$((post - pre))
+ if [[ "$freed" -gt 0 ]]; then
+ note "Freed $freed orphan hugepages (now ${post} free)"
+ fi
+ return 0
+}
+
+# --- driver loop -----------------------------------------------------------
+
+declare -a RESULT_DIRS
+
+run_backend_mode() {
+ local backend="$1" mode="$2"
+ note "=== Running: $backend $mode ==="
+ clean_orphan_hugepages
+
+ # Stream wrapper output live (per-cell CSV rows appear as they finish) while
+ # also keeping a log for post-run parsing of the "Results in:" line.
+ local log="/tmp/spark_data_fill.$backend.$mode.log"
+ local rc=0
+ "$WRAPPER" "$backend" "$mode" 2>&1 | tee "$log" || rc=$?
+ rc="${PIPESTATUS[0]:-$rc}"
+
+ if [[ "$rc" -eq 0 ]]; then
+ local result_dir
+ result_dir="$(awk '/^Results in:/ { print $3 }' "$log" | tail -n1)"
+ [[ -n "$result_dir" ]] && RESULT_DIRS+=("$backend/$mode -> $result_dir")
+ note "$backend $mode complete"
+ else
+ note "$backend $mode FAILED (exit $rc); continuing"
+ tail -n 40 "$log" >&2
+ fi
+ clean_orphan_hugepages
+}
+
+for be in "${BACKENDS[@]}"; do
+ run_backend_mode "$be" sweep
+ run_backend_mode "$be" drop-curve
+done
+
+# --- summary ---------------------------------------------------------------
+
+echo
+echo "=========================================="
+echo "Data-fill complete. Result directories:"
+echo "=========================================="
+for r in "${RESULT_DIRS[@]}"; do
+ echo " $r"
+done
+echo
+echo "Next: aggregate CSVs and fill docs/performance-dgx-spark.md."