diff --git a/.gitignore b/.gitignore index e221700..472185d 100644 --- a/.gitignore +++ b/.gitignore @@ -1,5 +1,9 @@ build*/ site/ +bench-results/ + +# tune_system.py default output +pcie_schematic.png # macOS .DS_Store diff --git a/AGENTS.md b/AGENTS.md index c45ef72..79d9568 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -93,14 +93,16 @@ The web docs live in `docs/` and are built with [MkDocs Material](https://squidf - `docs/index.html` — custom HTML landing page (not generated by MkDocs, hand-maintained) - `docs/daqiri-api.html` — standalone HTML API reference (hand-maintained) - `docs/api-guide.md`, `docs/getting-started.md`, `docs/configuration.md` — core markdown docs +- `docs/performance-dgx-spark.md` — per-platform performance report (DGX Spark; more platforms to follow) - `docs/tutorials/` — tutorial walkthroughs (background, system config, benchmarking, config files) - `docs/stylesheets/extra.css` — custom theme overrides **Keeping docs in sync with code:** before committing changes, scan for the recurring drift hotspots: - **Backend list** (`src/managers/*/`) — README Backends table, `docs/getting-started.md`, `docs/configuration.md` - **CMake options / `DAQIRI_MGR` default** (`src/CMakeLists.txt:137`) — README Quick Start, `docs/getting-started.md`, this file's Build & run section -- **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, and the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage) +- **Benchmark binary or YAML names** (`examples/`) — the benchmark table above, `docs/tutorials/benchmarking_examples.md`, the "Choosing an example config" decision tree in `docs/tutorials/configuration-walkthrough.md` (every YAML must have a leaf; CI's `scripts/check_doc_refs.py` enforces coverage), and per-platform performance docs (`docs/performance-*.md`) - **Public API** (`src/common.h`, `src/types.h`, `src/manager.h`) — `docs/api-guide.md`, `docs/daqiri-api.html` +- **Bench CLI flags or output format** (`examples/raw_bench_common.{h,cpp}`, `*_bench.cpp`) — per-platform performance docs' Methodology section, `examples/run_spark_bench.sh` parsing logic - **Doc reorganization** (any rename in `docs/`) — `docs/index.html` landing page, `mkdocs.yml` nav, README Documentation table The full mapping with rationale lives in the docs-sync agent rule. Internal-link, anchor, and nav drift is enforced by CI (`.github/workflows/docs.yml`); content drift (stale binary names, defaults) is still a manual check at commit time. diff --git a/README.md b/README.md index 50112c6..30c17e1 100644 --- a/README.md +++ b/README.md @@ -81,6 +81,7 @@ Reference material for the DAQIRI codebase: - [Getting Started](docs/getting-started.md) — System requirements, build/install instructions, and CMake options - [Configuration Reference](docs/configuration.md) — Full YAML config reference for all backends - [API Guide](docs/api-guide.md) — BurstParams, RX/TX workflows, buffer lifecycle, status codes +- [Performance: DGX Spark](docs/performance-dgx-spark.md) — Per-platform throughput, drop, and utilization numbers for all backends - [Contributing](CONTRIBUTING.md) — Contribution guidelines, coding standards, DCO sign-off ## Tutorials diff --git a/docs/index.html b/docs/index.html index 61a9eb2..53f2ffe 100644 --- a/docs/index.html +++ b/docs/index.html @@ -595,6 +595,12 @@

News

+
+
Performance2026
+
DAQIRI Performance on DGX Spark
+
NVIDIA — Throughput, drops, and resource utilization for DPDK GPUDirect, RoCE, and socket backends measured on a DGX Spark (GB10) workstation. First in a series of per-platform performance reports.
+ +
GitHub2025
DAQIRI Open-Sourced on GitHub
diff --git a/docs/performance-dgx-spark.md b/docs/performance-dgx-spark.md new file mode 100644 index 0000000..9854243 --- /dev/null +++ b/docs/performance-dgx-spark.md @@ -0,0 +1,566 @@ +# Performance: DGX Spark + +This page reports DAQIRI throughput, drop, and resource-utilization numbers +measured on a DGX Spark (GB10) workstation. It is the first in a series of +per-platform performance reports; the same section layout will be reused for +IGX, x86-server, and other targets. + +The numbers below are reproducible — every cell is generated by +[`examples/run_spark_bench.sh`](https://github.com/nvidia/daqiri/blob/main/examples/run_spark_bench.sh) +against the YAML configs in `examples/`, with the system state captured by +[`examples/bench_capture_environment.sh`](https://github.com/nvidia/daqiri/blob/main/examples/bench_capture_environment.sh) +alongside each result set. + +**Contents** + +- [Summary](#summary) +- [Introduction](#introduction) — system under test, methodology +- [C++ Results](#c-results) — DPDK, RoCE, Socket, workload variants +- [Python Results](#python-results) +- [Reproduce these results](#reproduce-these-results) +- [TODO: Not Yet Implemented / Known Limitations](#todo-not-yet-implemented-known-limitations) + +## Summary + +Two headline tables. **Native-shape peak** reports each backend at its +best-case operation size (the configuration the backend was designed for). +**Matched 8 KB** drives DPDK / RoCE / Socket UDP at a common ~8 KB unit of +work so the cross-backend comparison is apples-to-apples; TCP is omitted from +that table since it has no operation boundary. + +### Native-shape peak — max no-drop throughput (Gbps) + +| Backend / Stack | C++ loopback | C++ + FFT | C++ + GEMM | Python loopback | Python + FFT | Python + GEMM | +| --------------------- | -------------- | -------------- | -------------- | --------------- | -------------- | -------------- | +| DPDK GPUDirect (8 KB) | **96.4** | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | +| RoCE (8 MB SEND) | _deferred_[^1] | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | +| Socket UDP (MTU) | _deferred_[^2] | n/a | n/a | _TBD (PR 3)_ | n/a | n/a | +| Socket TCP (stream) | _deferred_[^3] | n/a | n/a | _TBD (PR 3)_ | n/a | n/a | + +### Matched 8 KB operation — cross-backend Gbps + +| Backend / Stack | C++ loopback | C++ + FFT | C++ + GEMM | Python loopback | Python + FFT | Python + GEMM | +| ------------------------ | -------------- | -------------- | -------------- | --------------- | -------------- | -------------- | +| DPDK GPUDirect | **96.4** | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | +| RoCE | _deferred_[^1] | _TBD (PR 2)_ | _TBD (PR 2)_ | _TBD (PR 3)_ | _TBD (PR 4)_ | _TBD (PR 4)_ | +| Socket UDP (1472 B, MTU) | _deferred_[^2] | n/a | n/a | _TBD (PR 3)_ | n/a | n/a | + +[^1]: RoCE single-host loopback is deferred from PR 1 — the bench's `--mode both` runs both ends in one process; with both IPs locally bound, the kernel shortcuts RC connection setup through `lo` rather than the cable. Real loopback measurement requires two netns (one per port), `rdma system set netns exclusive`, RDMA-device netns transfer, a YAML split, and two-process orchestration. Tracked in a follow-up issue. +[^2]: Socket UDP `--mode both` deadlocks on peer learning: both server and client try to transmit before either has received, so neither side learns its peer. Only ~1000 packets / 30 s trickle through. Tracked in a follow-up issue. +[^3]: Socket TCP `--mode both` aborts with a glibc heap-corruption assertion (`malloc.c:2599`) immediately after the second TCP connection accept. Bench is unrunnable. Tracked in a follow-up issue. + +!!! note "Why two tables" + A single Gbps number isn't enough to compare backends fairly. A DPDK + backend at peak with 8 KB packets is doing very different work than a RoCE + backend at peak with 8 MB messages, even when both report the same Gbps. + The matched 8 KB table makes the cross-backend comparison honest; the + native-shape table shows each backend's design ceiling. + +## Introduction + +### System under test + +The reproducibility appendix has the full capture. Key fields: + +| Field | Value | +| ----------------- | ------------------------------------ | +| Platform | NVIDIA DGX Spark (GB10) | +| GPU | Blackwell (compute capability 12.1) | +| NIC | ConnectX-7 (two ports, tied / QSFP loopback) | +| Topology | `0000:01:00.0` (mlx5_0) → `0002:01:00.0` (mlx5_2), physical loopback cable | +| OS | _captured in `environment.txt`_ | +| Kernel | _captured in `environment.txt`_ | +| CUDA driver | _captured in `environment.txt`_ | +| DPDK | patched per `dpdk_patches/` (container build) | +| DAQIRI commit | _captured in `environment.txt`_ | + +### Methodology + +#### Bench commands + +Each backend has a dedicated bench executable in `examples/`. The DPDK +numbers in this report come from the first command; the RoCE and Socket +commands are listed here for documentation and will be the basis for the +future fill of those rows. + +```bash +# DPDK GPUDirect — physical loopback (used in this report) +./build/examples/daqiri_bench_raw_gpudirect \ + examples/daqiri_bench_raw_tx_rx_spark.yaml \ + --seconds 30 [--target-gbps G] + +# RoCE — same NIC, two ports (deferred; see TODO / Known Limitations) +./build/examples/daqiri_bench_rdma \ + examples/daqiri_bench_rdma_tx_rx_spark.yaml \ + --seconds 30 --mode both [--target-gbps G] + +# Socket UDP / TCP — localhost (deferred; see TODO / Known Limitations) +./build/examples/daqiri_bench_socket \ + examples/daqiri_bench_socket_udp_tx_rx.yaml \ + --seconds 30 --mode both [--target-gbps G] +``` + +The DPDK YAML expects `eth_dst_addr` filled from the RX iface MAC: + +```bash +ETH_DST_ADDR="$(cat /sys/class/net//address)" +``` + +#### Per-backend sweep dimensions + +**Payload** is the user-data bytes in one packet (DPDK / Socket UDP), +one RDMA message, or one TCP send. **Batch** is how many packets DAQIRI +hands to (or pulls from) the NIC in one `rte_eth_tx_burst` / +`rte_eth_rx_burst` call — the burst size knob, not a packet-size knob. +Larger batches amortize doorbell and API overhead per packet; smaller +batches keep per-call latency lower. Batch only matters when the bench +is not yet at the link ceiling — at saturation, the NIC is the +bottleneck and batch size has near-zero effect. + +The "payload × batch" sweep doesn't map uniformly across backends. Each +backend has its own sweep: + +| Backend | Sweep dim 1 | Sweep dim 2 | Native-shape cell | Matched-size cell | +| ------------------ | ------------------------------------------------------ | -------------------------------------- | -------------------- | ---------------------- | +| DPDK GPUDirect | payload_size ∈ {64, 256, 1024, 4096, 8000} B | batch_size ∈ {256, 1024, 4096, 10240} | (8000, 10240) | (8000, 10240) — same | +| RoCE | message_size ∈ {4 K, 64 K, 1 M, 8 M} B | batch_size ∈ {1, 4, 16} | (8 M, 1) | (8 K, 16) | +| Socket UDP | payload_size ∈ {64, 256, 1024, 1472} B (MTU-bound) | batch_size ∈ {1, 32, 256} | (1472, 256) | (1472, 256) — closest to 8 K under MTU cap | +| Socket TCP | message_size ∈ {1 K, 64 K, 1 M} B | n/a (single stream) | (64 K) | n/a | + +#### "No-drop" threshold + +A run is **drop-free** when reported `drops == 0` over a `--seconds 30` run. +The headline tables report the highest target rate at which a run was still +drop-free under this threshold. The methodology does not use a percentile cap +— either there are drops or there are not. + +#### Drop-curve sweep + +The drop curve sweeps `--target-gbps` while holding the native-shape cell +constant. The token-bucket pacer in the bench TX worker (`raw_bench_common`) +adds a software-paced sleep after each burst; accuracy is ~±5 % at high rates +due to OS sleep granularity and scheduler jitter. Hardware TX pacing (DPDK +`accurate_send`) is unused but would tighten DPDK-only precision; deferred to +a follow-up. + +#### Drop sources per backend + +- **DPDK** — `imissed + ierrors + rx_nombuf` parsed from `DAQIRI_LOG_INFO` + output (the bench's stderr). +- **RoCE** — count of `CQ error` lines in `DAQIRI_LOG_ERROR` output (RDMA + manager filters error completions but logs each one). +- **Socket UDP** — diff of the `drops` column in `/proc/net/udp` over the run. +- **Socket TCP** — `nstat -a` diff of `TcpExtTCPLostRetransmit` / + `TcpRetransSegs` / `TcpInErrs`. TCP has no clean "drops" semantic; this is + the closest proxy. + +#### External captures per run + +Each run records, alongside the bench: + +- `/proc/stat` snapshots before and after the bench process. The wrapper + computes per-core busy% for the master / TX / RX cores (cores 8 / 17 / 18 + on Spark) by delta over the run window. `mpstat` is not used — it is + often missing from minimal containers, and the per-core CPUs we care + about are pinned by the YAML so a targeted delta is more meaningful + than a system-wide average. +- `nvidia-smi dmon -s pucvmet -c ` — GPU SM%, mem-controller%, and + PCIe rxpci / txpci columns (the latter are reported as `-` on the + current Spark driver; SM% and mem% are near zero for plain GPUDirect + because the GPU is a DMA target, not a compute engine). + +Slow-moving state (kernel, OFED, NIC firmware, PCIe link, NUMA, +hugepages, GPU state, DAQIRI commit) is captured once per result set by +`bench_capture_environment.sh`. + +## C++ Results + +### DPDK GPUDirect + +Native shape on Spark is 8 KB payload, batch 10240 — the configuration the +DPDK backend was designed around. All cells below ran for 30 s with +`drops == 0`. + +#### Drop curve at native shape + +Hold (payload=8000, batch=10240) constant; sweep `--target-gbps`. The +token-bucket pacer tracks target within ±0.02 Gbps until the link saturates +near 96 Gbps. Beyond that, target=100 and unpaced both report the +achievable ceiling. TX and RX cores spin in poll-mode regardless of target +rate (visible in the CPU table below). + +| target Gbps | achieved Gbps | RX pps | drops | TX core % | RX core % | +| ----------- | ------------- | --------- | ----- | --------- | --------- | +| 1 | 1.011 | 15,678 | 0 | 92.0 | 92.0 | +| 5 | 5.012 | 77,697 | 0 | 91.8 | 91.8 | +| 10 | 10.001 | 155,032 | 0 | 92.5 | 92.5 | +| 25 | 25.008 | 387,650 | 0 | 91.9 | 91.9 | +| 50 | 50.001 | 775,062 | 0 | 92.8 | 92.8 | +| 75 | 74.999 | 1,162,551 | 0 | 91.7 | 91.7 | +| 100 | 96.370 | 1,493,823 | 0 | 91.6 | 91.6 | +| unpaced | 95.897 | 1,486,498 | 0 | 91.6 | 90.5 | + +#### Payload × target_gbps matrix + +Holds batch=10240 constant; sweeps payload and `--target-gbps` +together. Each cell shows the achieved Gbps and drops over a 30 s +run. Coloring is relative to the **effective target** +`min(target_gbps, 96 Gbps link cap)`; the "unpaced" column uses the +link cap as its effective target: + +
+ green — no drops, achieved ≥ 95% of effective target + yellow — no drops, achieved ≥ 70% of effective target + red — drops, or achieved < 70% of effective target +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
payloadtarget Gbps
1510255075100unpaced
8000 B1.0 Gbps0 drops5.0 Gbps0 drops10.0 Gbps0 drops25.0 Gbps0 drops50.0 Gbps0 drops75.0 Gbps0 drops95.9 Gbps0 drops96.4 Gbps0 drops
4096 B1.0 Gbps0 drops5.0 Gbps0 drops10.0 Gbps0 drops25.0 Gbps0 drops50.0 Gbps0 drops75.0 Gbps0 drops100.0 Gbps0 drops102.7 Gbps0 drops
1024 B1.0 Gbps0 drops5.0 Gbps0 drops10.0 Gbps0 drops24.9 Gbps0 drops49.8 Gbps0 drops74.2 Gbps0 drops94.0 Gbps0 drops68.2 Gbps0 drops
256 B1.0 Gbps0 drops5.0 Gbps0 drops10.0 Gbps0 drops24.7 Gbps0 drops21.1 Gbps0 drops21.2 Gbps0 drops21.2 Gbps0 drops21.6 Gbps0 drops
64 B1.0 Gbps0 drops5.0 Gbps0 drops9.9 Gbps0 drops8.9 Gbps0 drops8.9 Gbps0 drops13.7 Gbps0 drops8.9 Gbps0 drops8.7 Gbps0 drops
+ +**Reading the matrix.** The token-bucket pacer tracks target within +sub-Gbps at payloads ≥ 1 KB right up to the link cap. The PPS +ceiling for each payload (~21 Gbps at 256 B, ~9 Gbps at 64 B) shows +up as a horizontal saturation band: once `target_gbps` crosses that +ceiling, increasing it further produces no additional throughput. +The 64 B / target=75 cell (13.7 Gbps) is a pacer transient when the +requested rate is well above achievable — the next cell at +target=100 falls back to the PPS ceiling. The 1024 B / unpaced cell +(68.2 Gbps, yellow) sits below its own paced cells; 1024 B is the +size where the master loop transitions from idle to fully busy +(`cpu_master_pct` jumps from ~3% to ~93% in the corresponding +unpaced run), and per-cell numbers in this regime carry roughly +±5 Gbps of run-to-run variance. + +#### Payload × batch matrix + +Each cell shows the achieved Gbps and drops over a 30 s unpaced run. +Coloring is relative to the global max across the matrix (here +**104.989 Gbps** at payload 4096 B, batch 4096): + +
+ green — no drops, Gbps ≥ 90% of max + yellow — no drops, Gbps ≥ 70% of max + red — drops, or Gbps < 70% of max +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
payloadbatch (packets per burst)
2561024409610240
8000 B96.4 Gbps0 drops96.4 Gbps0 drops96.4 Gbps0 drops95.9 Gbps0 drops
4096 B104.9 Gbps0 drops104.9 Gbps0 drops105.0 Gbps0 drops103.0 Gbps0 drops
1024 B64.9 Gbps0 drops86.2 Gbps0 drops71.0 Gbps0 drops72.7 Gbps0 drops
256 B24.5 Gbps0 drops24.8 Gbps0 drops29.3 Gbps0 drops21.1 Gbps0 drops
64 B10.1 Gbps0 drops10.3 Gbps0 drops9.8 Gbps0 drops10.3 Gbps0 drops
+ +**Reading the matrix.** At payload ≥ 4 KB the link saturates (~96–105 +Gbps) and batch size barely moves the number, so every cell is green. +The 1 KB row is the transition: pps and Gbps both matter, batch size +starts to influence which side dominates, and most cells fall under 70% +of the global max. At ≤ 256 B the bottleneck is packets-per-second +(~10 M pps ceiling at 64 B), so effective Gbps stays well below the +link ceiling regardless of batch. Run-to-run variance on the unpaced +cells is ~0.5 Gbps; row-internal Gbps differences smaller than that +should be treated as noise. + +#### CPU and GPU utilization (headline cell, payload 8000 B / batch 10240, unpaced) + +| Resource | Value | Note | +| -------------------- | ----- | ------------------------------------------ | +| Master core (CPU 8) | 3.3% | Mostly idle; orchestration only | +| TX core (CPU 17) | 91.4% | Poll-mode spin; rate-independent | +| RX core (CPU 18) | 91.4% | Poll-mode spin; rate-independent | +| GPU SM % | 0.0% | GPU is a DMA target, no compute | +| GPU mem-ctrl % | 0.0% | Payload writes traverse PCIe, not the GPU memory controller | + +The TX/RX cores stay at ~92% across every drop-curve step from 1 Gbps to +line rate — characteristic of DPDK's poll-mode driver, which spins waiting +for descriptors regardless of offered load. The master core handles +configuration only and idles below 5 % at the headline shape; at smaller +payload sizes (1 KB and below) it occasionally hits 90%+ as more bursts +flow through the orchestration path. That asymmetry is data, not a bug, +and is captured in the per-cell artifacts under `bench-results/`. + +### RoCE + +**Deferred from this report.** See [headline-table footnote 1](#fn:1) and +the [TODO / Known Limitations](#todo-not-yet-implemented-known-limitations) +section. Single-host RoCE loopback on Spark requires a two-netns + +two-process orchestration that the wrapper does not yet implement. The RoCE +rows will be filled when the follow-up issue lands. + +### Socket + +**Deferred from this report.** See [headline-table footnotes 2 and 3](#fn:2) +and the [TODO / Known Limitations](#todo-not-yet-implemented-known-limitations) +section. Both backends produced unusable data on Spark during PR 1 +verification: + +- **Socket UDP** in `--mode both` deadlocks on peer learning — both ends try + to transmit before either has received, only ~1000 packets per 30 s get + through (≈ 390 kbps). Visible as repeated + `[ERROR] UDP server has no learned peer yet; cannot transmit` lines in + the bench's stderr. +- **Socket TCP** in `--mode both` aborts with a glibc heap-corruption + assertion (`Fatal glibc error: malloc.c:2599 (sysmalloc)`) immediately + after the second TCP connection accept. The bench crashes before any + send / recv completes, so no completion line is printed. + +Both are tracked as separate follow-up issues; the Socket rows here will be +filled once the underlying bench bugs are fixed. + +### Workload variants (FFT, GEMM) + +The post-process layer ([PR 2](https://github.com/NVIDIA/daqiri/issues/15)) +adds a `--post-process {fft,gemm}` flag to the bench, runs `cuFFT` / +`cuBLAS` on the received GPU-resident payload, and reports the resulting +throughput delta and GPU utilization. + +**Sizes (representative):** + +- FFT: 1D complex-to-complex, length 1024. +- GEMM: fp32 square, N = 44 (largest tile that fits in an 8 KB payload). + +#### DPDK GPUDirect — FFT/GEMM + +_TBD (PR 2)._ + +#### RoCE — FFT/GEMM + +_TBD (PR 2)._ Note the unit-of-work mismatch when comparing across backends: +RoCE applies the post-process kernel once per ~8 MB SEND; DPDK applies it +once per packet. The throughput numbers are comparable; "operations per +burst" is not. + +## Python Results + +The Python benches ([PR 3](https://github.com/NVIDIA/daqiri/issues/16)) mirror +the C++ benches' CLI and stdout format, using the existing pybind11 bindings. + +### Loopback + +_TBD (PR 3)._ + +### FFT / GEMM via pybind of the C++ post-process layer + +_TBD (PR 4)._ + +## Reproduce these results + +### Container + +All commands below assume execution inside the project container, as +required by [`AGENTS.md`](https://github.com/nvidia/daqiri/blob/main/AGENTS.md). +On the host, launch the container in privileged mode with all GPUs, +hugepage mounts, and `/sys` passed through, and bind-mount the repo at +`/workspace`: + +```bash +# RX-side NIC; auto-injects ETH_DST_ADDR for the DPDK bench wrappers. +RX_IFACE="${RX_IFACE:-enP2p1s0f0np0}" +sudo docker run --rm -it \ + --net host --ipc=host \ + --runtime=nvidia --gpus all \ + --privileged \ + --ulimit memlock=-1 --ulimit stack=67108864 \ + -v "$(pwd):/workspace" \ + -v /dev/hugepages:/dev/hugepages \ + -v /mnt/huge:/mnt/huge \ + -v /sys:/sys \ + -w /workspace \ + -e ETH_DST_ADDR="$(cat /sys/class/net/$RX_IFACE/address)" \ + -e RX_IFACE="$RX_IFACE" \ + daqiri:local \ + bash +``` + +### Build + +Inside the container: + +```bash +cmake -S . -B build -DBUILD_SHARED_LIBS=ON -DDAQIRI_BUILD_PYTHON=ON \ + -DDAQIRI_MGR="dpdk socket rdma" +cmake --build build -j +``` + +### One-shot driver + +The whole DPDK matrix (sweep + drop-curve) is driven by a single script +which handles pre-flight checks (hugepage availability, RX iface MAC, +link state), orphan-hugepage cleanup between cells, and a final summary of +per-backend result directories: + +```bash +./scripts/spark_data_fill.sh dpdk +``` + +The script defaults to `dpdk socket-udp socket-tcp` if invoked with no +arguments; on Spark, the socket backends will fail their own pre-flight +once the follow-up issues land. RDMA is currently rejected by the +pre-flight (see [TODO / Known Limitations](#todo-not-yet-implemented-known-limitations)). + +### Per-backend wrapper invocations + +For ad-hoc runs of a single cell or a single mode: + +```bash +export DAQIRI_BUILD_DIR="$PWD/build" +export ETH_DST_ADDR="$(cat /sys/class/net//address)" # DPDK only + +./examples/run_spark_bench.sh dpdk smoke # native-shape headline cell +./examples/run_spark_bench.sh dpdk sweep # full payload × batch matrix +./examples/run_spark_bench.sh dpdk drop-curve # sweep --target-gbps +``` + +Each invocation emits `bench-results/-dpdk-/` containing +one subdirectory per cell (stdout / stderr / config / dmon / cpu_stat +captures), an `environment.txt` snapshot, and a `runs.csv` aggregating +the cell-level metrics. + +### Environment-only capture + +Useful for filing a bug report or comparing two Spark units without +running the bench: + +```bash +./examples/bench_capture_environment.sh /tmp/spark-env +``` + +### Tuning prerequisites + +System tuning is required before the numbers in this report are +reproducible. See +[`docs/tutorials/system_configuration.md`](tutorials/system_configuration.md) +for the DGX Spark tab — isolated cores, hugepages, governor, IRQ affinity. + +## TODO: Not Yet Implemented / Known Limitations + +- **HDS (Header–Data Split) is deferred.** The generic HDS configuration uses + `kind: device` for GPU memory regions. Spark / GB10 cannot use device memory + for GPUDirect — `nvidia_peermem` does not load and DMA-BUF is unreachable — + so DAQIRI uses `host_pinned` instead. Under `host_pinned`, the HDS layout no + longer changes the memory path; it only changes the segment partition, + which makes "HDS vs. plain GPUDirect" a non-distinction on this platform. + HDS is characterized when this report extends to IGX and x86-server + platforms where device memory works. See + [issue #15](https://github.com/NVIDIA/daqiri/issues/15) for tracking. +- **RoCE single-host loopback is deferred from this report.** See footnote [^1] + on the headline tables. The wrapper currently runs `daqiri_bench_rdma` in + `--mode both` from a single process; on Spark, with both 1.1.1.1 (mlx5_0) + and 2.2.2.2 (mlx5_2) bound in the root namespace, the kernel shortcuts the + RC connection through `lo` and the QSFP cable carries no traffic. A + follow-up will land the two-netns + two-process orchestration and re-fill + the RoCE rows. +- **Socket UDP / TCP results are deferred.** See footnotes [^2] and [^3]. + Both are bench-side bugs uncovered during PR 1 verification on Spark: + UDP `--mode both` deadlocks on peer learning; TCP `--mode both` aborts with + a glibc heap-corruption assertion on init. Follow-up issues track each. +- **p99/p999 latency is not in v1.** The bench output captures throughput, + drops, and resource utilization. Per-burst RX timestamping and percentile + aggregation are deferred to a follow-up issue. diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css index b5ce45f..95af265 100644 --- a/docs/stylesheets/extra.css +++ b/docs/stylesheets/extra.css @@ -50,3 +50,64 @@ [data-md-color-scheme="slate"] .md-footer { background: #111; } + +/* ── Performance-report heatmap cells ───────────────────────────────── */ +/* Used by the payload×batch and payload×target_gbps matrices in + docs/performance-*.md. Threshold logic (vs. matrix-global max): + green = no drops AND Gbps ≥ 90% of max + yellow = no drops AND Gbps ≥ 70% of max + red = any drops OR Gbps < 70% of max */ +.md-typeset table.perf-matrix { + width: 100%; + table-layout: auto; + border-collapse: separate; + border-spacing: 5px; + font-size: 0.64rem; +} +.md-typeset table.perf-matrix th, +.md-typeset table.perf-matrix td { + text-align: center; + vertical-align: middle; + font-variant-numeric: tabular-nums; + padding: 0.55em 0.5em; + border-radius: 4px; + white-space: nowrap; +} +.md-typeset table.perf-matrix td small { + display: block; + opacity: 0.75; + font-size: 0.85em; + margin-top: 0.25em; +} +.md-typeset table.perf-matrix td.cell-green { + background-color: rgba(118, 185, 0, 0.28); + color: inherit; +} +.md-typeset table.perf-matrix td.cell-yellow { + background-color: rgba(255, 196, 0, 0.32); + color: inherit; +} +.md-typeset table.perf-matrix td.cell-red { + background-color: rgba(220, 60, 60, 0.32); + color: inherit; +} +.md-typeset table.perf-matrix th { + background-color: rgba(255, 255, 255, 0.05); + font-weight: 600; +} +/* Compact legend chips that pair with the matrix. */ +.md-typeset .perf-legend { + display: flex; + gap: 0.75em; + margin: 0.5em 0 1em 0; + font-size: 0.85em; + flex-wrap: wrap; +} +.md-typeset .perf-legend span { + padding: 0.1em 0.55em; + border-radius: 0.25em; + white-space: nowrap; +} +.md-typeset .perf-legend .cell-green { background-color: rgba(118, 185, 0, 0.28); } +.md-typeset .perf-legend .cell-yellow { background-color: rgba(255, 196, 0, 0.32); } +.md-typeset .perf-legend .cell-red { background-color: rgba(220, 60, 60, 0.32); } diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt index 4bc11aa..4807b85 100644 --- a/examples/CMakeLists.txt +++ b/examples/CMakeLists.txt @@ -71,14 +71,16 @@ add_daqiri_raw_bench(daqiri_bench_raw_reorder_quantize raw_reorder_quantize_benc add_daqiri_raw_bench(daqiri_example_gds_write gds_write_example.cpp) add_daqiri_raw_bench(daqiri_example_pcap_writer pcap_writer_example.cpp) -add_executable(daqiri_bench_rdma rdma_bench.cpp) +add_executable(daqiri_bench_rdma rdma_bench.cpp raw_bench_common.cpp) link_daqiri_bench(daqiri_bench_rdma) +target_link_libraries(daqiri_bench_rdma PRIVATE CUDA::cudart) set_target_properties(daqiri_bench_rdma PROPERTIES BUILD_RPATH "$ORIGIN/../src;$ORIGIN/../src/third_party/yaml-cpp" ) -add_executable(daqiri_bench_socket socket_bench.cpp) +add_executable(daqiri_bench_socket socket_bench.cpp raw_bench_common.cpp) link_daqiri_bench(daqiri_bench_socket) +target_link_libraries(daqiri_bench_socket PRIVATE CUDA::cudart) foreach(cfg IN LISTS DAQIRI_BENCH_CONFIGS) configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${cfg} ${CMAKE_CURRENT_BINARY_DIR}/${cfg} COPYONLY) diff --git a/examples/bench_capture_environment.sh b/examples/bench_capture_environment.sh new file mode 100755 index 0000000..3314b08 --- /dev/null +++ b/examples/bench_capture_environment.sh @@ -0,0 +1,116 @@ +#!/usr/bin/env bash +# +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Capture host/NIC/GPU/build state for a benchmark run, so numbers are +# reproducible across machines and over time. Writes one structured text file +# with named sections. +# +# Usage: ./bench_capture_environment.sh +# Default output dir: bench-results// + +set -u + +OUT_DIR="${1:-bench-results/$(date -u +%Y%m%dT%H%M%SZ)}" +mkdir -p "$OUT_DIR" +OUT="$OUT_DIR/environment.txt" + +# Run a command, capturing exit status. Always write a header so the section is +# present even when the command is missing or fails — silent absence is harder +# to debug than an explicit "command not found". +run_section() { + local label="$1"; shift + { + echo "==========================================================" + echo "[$label]" + echo " cmd: $*" + echo "==========================================================" + if command -v "$1" >/dev/null 2>&1 || [[ "$1" == /* || "$1" == ./* ]]; then + "$@" 2>&1 + echo " (exit: $?)" + else + echo " (command not found in PATH: $1)" + fi + echo + } >> "$OUT" +} + +# Cat a file/glob; write a header either way. +cat_section() { + local label="$1"; shift + { + echo "==========================================================" + echo "[$label]" + echo " paths: $*" + echo "==========================================================" + for p in "$@"; do + if compgen -G "$p" >/dev/null; then + for f in $p; do + echo "----- $f -----" + cat "$f" 2>&1 + done + else + echo " (no match: $p)" + fi + done + echo + } >> "$OUT" +} + +: > "$OUT" + +echo "DAQIRI benchmark environment capture" >> "$OUT" +echo "Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$OUT" +echo "Host: $(hostname)" >> "$OUT" +echo "Output: $OUT" >> "$OUT" +echo >> "$OUT" + +# --- Kernel / OS --- +run_section "uname" uname -a +cat_section "kernel-cmdline" /proc/cmdline +cat_section "os-release" /etc/os-release +run_section "lsb-release" lsb_release -a +run_section "clocksource" cat /sys/devices/system/clocksource/clocksource0/current_clocksource + +# --- CPU / NUMA / IRQ --- +run_section "numactl" numactl --show +run_section "lscpu" lscpu +cat_section "cpu-isolated" /sys/devices/system/cpu/isolated +run_section "cpufreq-info" cpupower frequency-info +cat_section "cpu-governor" /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +run_section "irq-mlx5" bash -c "grep mlx5 /proc/interrupts || true" + +# --- Hugepages --- +cat_section "hugepages" /sys/kernel/mm/hugepages/*/nr_hugepages +run_section "free-h" free -h + +# --- PCIe topology --- +run_section "lspci-mellanox" bash -c "lspci -vvv -d 15b3: 2>/dev/null" +run_section "lspci-nvidia" bash -c "lspci -vvv -d 10de: 2>/dev/null" + +# --- NIC: OFED / firmware / DPDK binding --- +run_section "ofed-info" ofed_info -s +run_section "mlxfwmanager" mlxfwmanager --query +run_section "dpdk-devbind" dpdk-devbind.py --status +# Per-iface ethtool — iterate over the daqiri-tx/rx names if present, else all mlx5. +for iface in daqiri-tx daqiri-rx $(ls /sys/class/net 2>/dev/null | grep -E '^(enP|enp|eth)' || true); do + [[ -d "/sys/class/net/$iface" ]] || continue + run_section "ethtool-i:$iface" ethtool -i "$iface" + run_section "ethtool-g:$iface" ethtool -g "$iface" + run_section "ethtool-l:$iface" ethtool -l "$iface" + cat_section "iface-mtu:$iface" "/sys/class/net/$iface/mtu" + cat_section "iface-mac:$iface" "/sys/class/net/$iface/address" +done + +# --- GPU --- +run_section "nvidia-smi-q" nvidia-smi -q +run_section "nvidia-smi-tempclk" nvidia-smi --query-gpu=name,driver_version,temperature.gpu,clocks.current.sm,clocks.current.memory --format=csv + +# --- Build state --- +DAQIRI_DIR="$(git -C "$(dirname "$0")/.." rev-parse --show-toplevel 2>/dev/null || pwd)" +run_section "git-rev-parse" git -C "$DAQIRI_DIR" rev-parse HEAD +run_section "git-status" git -C "$DAQIRI_DIR" status --short +run_section "git-describe" git -C "$DAQIRI_DIR" describe --always --dirty + +echo "Capture complete: $OUT" diff --git a/examples/raw_bench_common.cpp b/examples/raw_bench_common.cpp index 405c500..bb1e42c 100644 --- a/examples/raw_bench_common.cpp +++ b/examples/raw_bench_common.cpp @@ -19,10 +19,12 @@ #include +#include #include #include #include #include +#include #include #include @@ -97,6 +99,44 @@ int parse_run_seconds(int argc, char **argv) { return run_seconds; } +double parse_target_gbps(int argc, char **argv) { + double target_gbps = 0.0; + for (int i = 2; i + 1 < argc; i += 2) { + if (std::string(argv[i]) == "--target-gbps") { + target_gbps = std::stod(argv[i + 1]); + } + } + return target_gbps; +} + +TokenBucketPacer::TokenBucketPacer(double target_gbps) + : target_bps_(target_gbps > 0.0 ? target_gbps * 1e9 : 0.0), + t0_(std::chrono::steady_clock::now()) {} + +void TokenBucketPacer::wait_for_bytes(size_t bytes, std::atomic &stop) { + if (target_bps_ <= 0.0) { + return; + } + total_bytes_ += bytes; + const double scheduled_secs = (total_bytes_ * 8.0) / target_bps_; + const auto scheduled = t0_ + std::chrono::duration_cast< + std::chrono::steady_clock::duration>( + std::chrono::duration(scheduled_secs)); + // Slice the wait into 10 ms chunks so a stop flag (--seconds expiry or + // Ctrl-C) can break us out promptly. The total slept across the slices + // accumulates to the scheduled deadline, so pacing remains accurate. + constexpr auto kSlice = std::chrono::milliseconds(10); + while (!stop.load()) { + const auto now = std::chrono::steady_clock::now(); + if (scheduled <= now) { + return; + } + const auto remaining = scheduled - now; + std::this_thread::sleep_for( + std::min(remaining, kSlice)); + } +} + bool has_bench_rx(const YAML::Node &root) { return root["bench_rx"] && root["bench_rx"]["interface_name"]; } @@ -287,6 +327,7 @@ void rx_count_worker(const RawBenchRxConfig &cfg, std::atomic &stop) { uint64_t pkts = 0; uint64_t bytes = 0; uint64_t bursts = 0; + const auto t0 = std::chrono::steady_clock::now(); while (!stop.load()) { const auto num_rx_queues = static_cast(daqiri::get_num_rx_queues(port_id)); @@ -307,9 +348,17 @@ void rx_count_worker(const RawBenchRxConfig &cfg, std::atomic &stop) { std::this_thread::sleep_for(std::chrono::microseconds(100)); } } - - std::cout << "RX complete: packets=" << pkts << " bytes=" << bytes - << " bursts=" << bursts << "\n"; + const double secs = + std::chrono::duration(std::chrono::steady_clock::now() - t0) + .count(); + + // Build the line in a stringstream so the print to stdout is a single + // write(). RX and TX workers race at end-of-run and naive `cout <<` can + // interleave their output (corrupting downstream parsers). + std::ostringstream oss; + oss << "RX complete: packets=" << pkts << " bytes=" << bytes + << " bursts=" << bursts << " seconds=" << secs << "\n"; + std::cout << oss.str() << std::flush; } } // namespace daqiri::bench diff --git a/examples/raw_bench_common.h b/examples/raw_bench_common.h index d53a4f9..cf0ee0c 100644 --- a/examples/raw_bench_common.h +++ b/examples/raw_bench_common.h @@ -21,6 +21,7 @@ #include #include +#include #include #include #include @@ -28,6 +29,34 @@ namespace daqiri::bench { +// Software token-bucket pacer used by the bench TX workers. When +// target_gbps == 0 the wait_for_bytes() call is a no-op early return, so the +// pacer adds no overhead when --target-gbps is unset. +// +// Accuracy: ~5% at high rates due to Linux nanosleep granularity and scheduler +// jitter. Acceptable for drop-curve sweeps; tighter pacing would require +// hardware TX timestamping (DAQIRI's accurate_send YAML flag), deferred. +class TokenBucketPacer { +public: + TokenBucketPacer() = default; + explicit TokenBucketPacer(double target_gbps); + + // Call after each TX burst. Sleeps in short slices until the pacer's notion + // of "time the configured target rate would have taken to send the + // accumulated bytes" catches up, OR `stop` flips true. Slicing keeps the + // bench responsive to --seconds expiry / Ctrl-C without truncating the total + // sleep (which would silently break pacing for low target rates). + void wait_for_bytes(size_t bytes, std::atomic &stop); + + bool enabled() const { return target_bps_ > 0.0; } + double target_gbps() const { return target_bps_ / 1e9; } + +private: + double target_bps_ = 0.0; // 0 means disabled + uint64_t total_bytes_ = 0; + std::chrono::steady_clock::time_point t0_; +}; + struct RawBenchTxConfig { std::string interface_name = "tx_port"; uint32_t batch_size = 1024; @@ -68,6 +97,7 @@ class PinnedHostBuffer { }; int parse_run_seconds(int argc, char **argv); +double parse_target_gbps(int argc, char **argv); bool has_bench_rx(const YAML::Node &root); bool has_bench_tx(const YAML::Node &root); RawBenchRxConfig parse_rx(const YAML::Node &root); diff --git a/examples/raw_gpudirect_bench.cpp b/examples/raw_gpudirect_bench.cpp index e93d0c8..cbfa9f9 100644 --- a/examples/raw_gpudirect_bench.cpp +++ b/examples/raw_gpudirect_bench.cpp @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -35,6 +36,7 @@ namespace { void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg, + daqiri::bench::TokenBucketPacer &pacer, std::atomic &stop) { const int port_id = daqiri::get_port_id(cfg.interface_name); if (port_id < 0) { @@ -67,6 +69,12 @@ void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg, std::unordered_set initialized_tx_buffers; + uint64_t tx_packets = 0; + uint64_t tx_bytes = 0; + uint64_t tx_bursts = 0; + const auto t0 = std::chrono::steady_clock::now(); + const uint32_t pkt_bytes = cfg.header_size + cfg.payload_size; + while (!stop.load()) { auto *msg = daqiri::create_tx_burst_params(); daqiri::set_header(msg, static_cast(port_id), 0, cfg.batch_size, @@ -109,7 +117,7 @@ void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg, } if (daqiri::set_packet_lengths( - msg, i, {static_cast(cfg.header_size + cfg.payload_size)}) != + msg, i, {static_cast(pkt_bytes)}) != daqiri::Status::SUCCESS) { failed = true; break; @@ -121,18 +129,34 @@ void tx_worker(const daqiri::bench::RawBenchTxConfig &cfg, continue; } daqiri::send_tx_burst(msg); + const uint64_t burst_bytes = + static_cast(num_pkts) * pkt_bytes; + tx_packets += static_cast(num_pkts); + tx_bytes += burst_bytes; + ++tx_bursts; + pacer.wait_for_bytes(burst_bytes, stop); } + const double secs = + std::chrono::duration(std::chrono::steady_clock::now() - t0) + .count(); + // Single-write print so concurrent RX worker output doesn't interleave. + std::ostringstream oss; + oss << "TX complete: packets=" << tx_packets << " bytes=" << tx_bytes + << " bursts=" << tx_bursts << " seconds=" << secs << "\n"; + std::cout << oss.str() << std::flush; } } // namespace int main(int argc, char **argv) { if (argc < 2) { - std::cerr << "Usage: " << argv[0] << " [--seconds N]\n"; + std::cerr << "Usage: " << argv[0] + << " [--seconds N] [--target-gbps G]\n"; return 1; } const int run_seconds = daqiri::bench::parse_run_seconds(argc, argv); + const double target_gbps = daqiri::bench::parse_target_gbps(argc, argv); const auto root = YAML::LoadFile(argv[1]); if (daqiri::daqiri_init(argv[1]) != daqiri::Status::SUCCESS) { std::cerr << "daqiri_init failed\n"; @@ -150,14 +174,15 @@ int main(int argc, char **argv) { std::atomic stop{false}; std::thread tx_thread; std::thread rx_thread; + daqiri::bench::TokenBucketPacer tx_pacer(target_gbps); if (has_rx) { rx_thread = std::thread(daqiri::bench::rx_count_worker, daqiri::bench::parse_rx(root), std::ref(stop)); } if (has_tx) { - tx_thread = - std::thread(tx_worker, daqiri::bench::parse_tx(root), std::ref(stop)); + tx_thread = std::thread(tx_worker, daqiri::bench::parse_tx(root), + std::ref(tx_pacer), std::ref(stop)); } daqiri::bench::wait_for_stop(run_seconds, stop); diff --git a/examples/rdma_bench.cpp b/examples/rdma_bench.cpp index dbe8186..5b9bcfd 100644 --- a/examples/rdma_bench.cpp +++ b/examples/rdma_bench.cpp @@ -25,6 +25,7 @@ #include #include +#include "raw_bench_common.h" #include "src/common.h" namespace { @@ -48,6 +49,8 @@ struct RdmaBenchConfig { struct RdmaWorkerStats { uint64_t send_completions = 0; uint64_t recv_completions = 0; + uint64_t send_bytes = 0; + uint64_t recv_bytes = 0; }; RdmaBenchConfig parse_rdma_cfg(const YAML::Node& node) { @@ -62,7 +65,8 @@ RdmaBenchConfig parse_rdma_cfg(const YAML::Node& node) { return cfg; } -void rdma_worker(const RdmaBenchConfig& cfg, std::atomic& stop, RdmaWorkerStats& stats) { +void rdma_worker(const RdmaBenchConfig& cfg, daqiri::bench::TokenBucketPacer& pacer, + std::atomic& stop, RdmaWorkerStats& stats) { static constexpr int kMaxOutstanding = 5; int outstanding_send = 0; int outstanding_recv = 0; @@ -116,6 +120,10 @@ void rdma_worker(const RdmaBenchConfig& cfg, std::atomic& stop, RdmaWorker } outstanding++; wr_id++; + // Only meter actual byte transmissions (SENDs), not RECEIVE-side posts. + if (op == daqiri::RDMAOpCode::SEND) { + pacer.wait_for_bytes(static_cast(cfg.message_size), stop); + } }; if (cfg.send) { post_req(outstanding_send, send_wr_id, daqiri::RDMAOpCode::SEND, send_mr); } @@ -127,10 +135,12 @@ void rdma_worker(const RdmaBenchConfig& cfg, std::atomic& stop, RdmaWorker if (daqiri::rdma_get_opcode(completion) == daqiri::RDMAOpCode::SEND && outstanding_send > 0) { outstanding_send--; stats.send_completions++; + stats.send_bytes += static_cast(cfg.message_size); } else if (daqiri::rdma_get_opcode(completion) == daqiri::RDMAOpCode::RECEIVE && outstanding_recv > 0) { outstanding_recv--; stats.recv_completions++; + stats.recv_bytes += static_cast(cfg.message_size); } daqiri::free_tx_burst(completion); } else { @@ -143,17 +153,21 @@ void rdma_worker(const RdmaBenchConfig& cfg, std::atomic& stop, RdmaWorker int main(int argc, char** argv) { if (argc < 2) { - std::cerr << "Usage: " << argv[0] << " [--seconds N] [--mode server|client|both]\n"; + std::cerr << "Usage: " << argv[0] + << " [--seconds N] [--mode server|client|both] [--target-gbps G]\n"; return 1; } int run_seconds = 10; + double target_gbps = 0.0; std::string mode = "both"; for (int i = 2; i + 1 < argc; i += 2) { if (std::string(argv[i]) == "--seconds") { run_seconds = std::stoi(argv[i + 1]); } else if (std::string(argv[i]) == "--mode") { mode = argv[i + 1]; + } else if (std::string(argv[i]) == "--target-gbps") { + target_gbps = std::stod(argv[i + 1]); } } @@ -168,18 +182,22 @@ int main(int argc, char** argv) { std::thread client_thread; RdmaWorkerStats server_stats; RdmaWorkerStats client_stats; + daqiri::bench::TokenBucketPacer server_pacer(target_gbps); + daqiri::bench::TokenBucketPacer client_pacer(target_gbps); bool run_server = false; bool run_client = false; if ((mode == "server" || mode == "both") && root["rdma_bench_server"]) { run_server = true; server_thread = std::thread( - rdma_worker, parse_rdma_cfg(root["rdma_bench_server"]), std::ref(stop), std::ref(server_stats)); + rdma_worker, parse_rdma_cfg(root["rdma_bench_server"]), + std::ref(server_pacer), std::ref(stop), std::ref(server_stats)); } if ((mode == "client" || mode == "both") && root["rdma_bench_client"]) { run_client = true; client_thread = std::thread( - rdma_worker, parse_rdma_cfg(root["rdma_bench_client"]), std::ref(stop), std::ref(client_stats)); + rdma_worker, parse_rdma_cfg(root["rdma_bench_client"]), + std::ref(client_pacer), std::ref(stop), std::ref(client_stats)); } if (!server_thread.joinable() && !client_thread.joinable()) { @@ -203,11 +221,23 @@ int main(int argc, char** argv) { if (server_thread.joinable()) { server_thread.join(); } if (client_thread.joinable()) { client_thread.join(); } + const double secs = + std::chrono::duration(std::chrono::steady_clock::now() - start) + .count(); + if (run_server) { - std::cout << "Server received messages: " << server_stats.recv_completions << '\n'; + std::cout << "Server complete: send_completions=" << server_stats.send_completions + << " recv_completions=" << server_stats.recv_completions + << " send_bytes=" << server_stats.send_bytes + << " recv_bytes=" << server_stats.recv_bytes + << " seconds=" << secs << '\n'; } if (run_client) { - std::cout << "Client received messages: " << client_stats.recv_completions << '\n'; + std::cout << "Client complete: send_completions=" << client_stats.send_completions + << " recv_completions=" << client_stats.recv_completions + << " send_bytes=" << client_stats.send_bytes + << " recv_bytes=" << client_stats.recv_bytes + << " seconds=" << secs << '\n'; } daqiri::print_stats(); diff --git a/examples/run_spark_bench.sh b/examples/run_spark_bench.sh new file mode 100755 index 0000000..c2aa75a --- /dev/null +++ b/examples/run_spark_bench.sh @@ -0,0 +1,334 @@ +#!/usr/bin/env bash +# +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Sweep wrapper for DAQIRI benchmarks on DGX Spark. Runs the bench across a +# matrix of (payload/message size, batch size, target-gbps), captures per-run +# CPU/GPU/NIC counters, and emits one CSV row per cell into bench-results/. +# +# Drop sources per backend (per the report methodology): +# DPDK : grep imissed/ierrors/rx_nombuf from bench log (DAQIRI_LOG_INFO). +# RDMA : grep "CQ error" lines from bench log (DAQIRI_LOG_ERROR). +# socket : diff /proc/net/udp drops column (UDP); nstat -a (TCP retransmits). +# +# Usage: +# ./run_spark_bench.sh [mode] +# backend ∈ {dpdk, rdma, socket-udp, socket-tcp} +# mode ∈ {smoke, sweep, drop-curve, drop-curve-matrix} (default: smoke) +# +# Required environment in current shell: +# DAQIRI_BUILD_DIR — path to the cmake build dir (defaults to ../build). +# ETH_DST_ADDR — required for dpdk backend (the RX iface MAC). +# RX_IFACE — kernel name of the RX interface for /proc/net/udp diff +# (e.g. enP2p1s0f0np0); required for socket-udp. +# +# Run inside the project container as root (per AGENTS.md). + +set -u +set -o pipefail + +# -------------------------------------------------------------------------- +# Configuration +# -------------------------------------------------------------------------- + +BACKEND="${1:-}" +MODE="${2:-smoke}" +if [[ -z "$BACKEND" ]]; then + echo "Usage: $0 [smoke|sweep|drop-curve|drop-curve-matrix]" >&2 + exit 1 +fi + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +BUILD_DIR="${DAQIRI_BUILD_DIR:-$SCRIPT_DIR/../build}" +TS="$(date -u +%Y%m%dT%H%M%SZ)" +OUT_DIR="$SCRIPT_DIR/../bench-results/$TS-$BACKEND-$MODE" +mkdir -p "$OUT_DIR" + +CSV="$OUT_DIR/runs.csv" +echo "lang,backend,post_process,payload,batch,target_gbps,seconds,packets,bytes,pps,gbps,drops,drops_kind,cpu_master_pct,cpu_tx_pct,cpu_rx_pct,gpu_sm_pct,gpu_mem_pct" > "$CSV" + +# Capture slow-moving environment state once per result set. +"$SCRIPT_DIR/bench_capture_environment.sh" "$OUT_DIR" + +RUN_SECONDS=30 +DRIVER_LOG="$OUT_DIR/last_run.stderr" + +# Per-backend sweep matrices (see docs/performance-dgx-spark.md methodology). +# Native-shape sizes are the leftmost entry; "matched 8K" cell is also included. +case "$BACKEND" in + dpdk) + PAYLOADS_SWEEP=(8000 4096 1024 256 64) + BATCHES_SWEEP=(10240 4096 1024 256) + PAYLOADS_HEADLINE=(8000) + BATCHES_HEADLINE=(10240) + BASE_YAML="$SCRIPT_DIR/daqiri_bench_raw_tx_rx_spark.yaml" + BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_raw_gpudirect" + CPU_MASTER=8; CPU_TX=17; CPU_RX=18 + : "${ETH_DST_ADDR:?ETH_DST_ADDR must be set for dpdk backend (cat /sys/class/net//address)}" + ;; + rdma) + PAYLOADS_SWEEP=(8000000 1048576 65536 8192 4096) + BATCHES_SWEEP=(1) + PAYLOADS_HEADLINE=(8000000) + BATCHES_HEADLINE=(1) + BASE_YAML="$SCRIPT_DIR/daqiri_bench_rdma_tx_rx_spark.yaml" + BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_rdma" + CPU_MASTER=8; CPU_TX=17; CPU_RX=18 + ;; + socket-udp) + PAYLOADS_SWEEP=(1472 1024 256 64) + BATCHES_SWEEP=(256 32 1) + PAYLOADS_HEADLINE=(1472) + BATCHES_HEADLINE=(256) + BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_udp_tx_rx.yaml" + BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket" + CPU_MASTER=8; CPU_TX=17; CPU_RX=18 + ;; + socket-tcp) + PAYLOADS_SWEEP=(1048576 65536 1024) + BATCHES_SWEEP=(1) + PAYLOADS_HEADLINE=(65536) + BATCHES_HEADLINE=(1) + BASE_YAML="$SCRIPT_DIR/daqiri_bench_socket_tcp_tx_rx.yaml" + BENCH_BIN="$BUILD_DIR/examples/daqiri_bench_socket" + CPU_MASTER=8; CPU_TX=17; CPU_RX=18 + ;; + *) echo "Unknown backend: $BACKEND" >&2; exit 1 ;; +esac + +DROP_CURVE_TARGETS=(1 5 10 25 50 75 100 0) # 0 means unpaced (line rate) + +# -------------------------------------------------------------------------- +# Helpers +# -------------------------------------------------------------------------- + +# Read a scalar field from a `key=value` style stdout line. +# usage: extract_field +extract_field() { + local prefix="$1" field="$2" file="$3" + grep -E "^$prefix" "$file" | tail -n1 | grep -oE " $field=[^ ]+" | head -n1 | sed -E "s/.*$field=//" +} + +# Sum DPDK drop counters from the manager log emitted via DAQIRI_LOG_INFO. +parse_dpdk_drops() { + local log="$1" + local sum=0 v + for key in imissed ierrors rx_nombuf; do + v="$(grep -oE "$key=[0-9]+" "$log" 2>/dev/null | tail -n1 | sed -E "s/.*=//" || true)" + [[ -n "${v:-}" ]] && sum=$((sum + v)) + done + echo "$sum" +} + +# Count RDMA CQ errors in the manager log. +parse_rdma_drops() { + local log="$1" + grep -c 'CQ error' "$log" 2>/dev/null || echo 0 +} + +# Snapshot socket drops on the kernel side. +snapshot_proc_net_udp() { + awk 'NR>1 { sum += strtonum("0x" $13) } END { print sum+0 }' /proc/net/udp 2>/dev/null || echo 0 +} +snapshot_nstat() { + nstat -a 2>/dev/null | awk '/TcpExtTCPLostRetransmit|TcpRetransSegs|TcpInErrs/ { s += $2 } END { print s+0 }' || echo 0 +} + +# Snapshot /proc/stat per-cpu counters to a file. Mpstat is often not installed +# in the bench container; /proc/stat is always available. +snapshot_cpu_stat() { + awk '/^cpu[0-9]+/ { + total = $2+$3+$4+$5+$6+$7+$8 + busy = total - $5 - $6 + print $1, total, busy + }' /proc/stat > "$1" +} + +# Compute busy% for a single cpu index between two /proc/stat snapshots. +cpu_busy_pct() { + local before="$1" after="$2" cpu_idx="$3" + awk -v cpu="cpu$cpu_idx" ' + NR == FNR { b_total[$1] = $2; b_busy[$1] = $3; next } + { a_total[$1] = $2; a_busy[$1] = $3 } + END { + dt = a_total[cpu] - b_total[cpu] + db = a_busy[cpu] - b_busy[cpu] + if (dt > 0) printf "%.1f", (db * 100.0) / dt + else printf "0.0" + } + ' "$before" "$after" +} + +# Substitute payload / batch into the base YAML and write a temp config. +generate_yaml() { + local out="$1" payload="$2" batch="$3" + case "$BACKEND" in + dpdk) + sed -E \ + -e "s|^( *payload_size: ).*|\1$payload|" \ + -e "s|^( *batch_size: ).*|\1$batch|" \ + -e "s|<00:00:00:00:00:00>|$ETH_DST_ADDR|g" \ + "$BASE_YAML" > "$out" + ;; + rdma) + sed -E "s|^( *message_size: ).*|\1$payload|g" "$BASE_YAML" > "$out" + ;; + socket-udp|socket-tcp) + sed -E "s|^( *message_size: ).*|\1$payload|g" "$BASE_YAML" > "$out" + ;; + esac +} + +# Run one cell. Echoes the CSV row to stdout. +run_cell() { + local lang="$1" payload="$2" batch="$3" target_gbps="$4" + local cell="$lang-$BACKEND-p$payload-b$batch-g$target_gbps" + local cell_dir="$OUT_DIR/$cell" + mkdir -p "$cell_dir" + + local yaml="$cell_dir/config.yaml" + generate_yaml "$yaml" "$payload" "$batch" + + # Snapshot kernel-side drop counters. + local udp_before tcp_before + udp_before="$(snapshot_proc_net_udp)" + tcp_before="$(snapshot_nstat)" + + # Snapshot per-cpu stats just before the bench starts. + snapshot_cpu_stat "$cell_dir/cpu_stat.before" + + # Background GPU dmon (1-sec sample, RUN_SECONDS samples). + ( nvidia-smi dmon -s pucvmet -c "$RUN_SECONDS" > "$cell_dir/nvidia_smi_dmon.txt" 2>&1 ) & + local dmon_pid=$! + + # Run the bench. Stderr captures DAQIRI_LOG_* output (DPDK/RDMA drop sources). + local stdout="$cell_dir/stdout.txt" + local stderr="$cell_dir/stderr.txt" + local args=("$yaml" --seconds "$RUN_SECONDS") + [[ "$target_gbps" != "0" ]] && args+=(--target-gbps "$target_gbps") + [[ "$BACKEND" == "rdma" || "$BACKEND" =~ ^socket- ]] && args+=(--mode both) + + "$BENCH_BIN" "${args[@]}" > "$stdout" 2> "$stderr" || true + cp "$stderr" "$DRIVER_LOG" + + # Snapshot per-cpu stats right after the bench exits (before background + # captures finish reaping, to bound the window). + snapshot_cpu_stat "$cell_dir/cpu_stat.after" + + # Stop background captures (they self-terminate at -c , but reap if needed). + wait "$dmon_pid" 2>/dev/null || true + + # Parse bench stdout. For RX-bearing benches "RX complete" is authoritative; + # for TX-only configs fall back to "TX complete". + local pkts bytes secs + pkts="$(extract_field 'RX complete' packets "$stdout")" + bytes="$(extract_field 'RX complete' bytes "$stdout")" + secs="$(extract_field 'RX complete' seconds "$stdout")" + if [[ -z "$pkts" ]]; then + pkts="$(extract_field 'TX complete' packets "$stdout")" + bytes="$(extract_field 'TX complete' bytes "$stdout")" + secs="$(extract_field 'TX complete' seconds "$stdout")" + fi + if [[ -z "$pkts" ]]; then + # RDMA prints "Client/Server complete: ... send_completions=N send_bytes=N seconds=S" + pkts="$(extract_field 'Client complete' send_completions "$stdout")" + bytes="$(extract_field 'Client complete' send_bytes "$stdout")" + secs="$(extract_field 'Client complete' seconds "$stdout")" + fi + pkts="${pkts:-0}"; bytes="${bytes:-0}"; secs="${secs:-0}" + + local pps gbps + pps="$(awk -v p="$pkts" -v s="$secs" 'BEGIN { if (s+0>0) printf "%.0f", p/s; else print 0 }')" + gbps="$(awk -v b="$bytes" -v s="$secs" 'BEGIN { if (s+0>0) printf "%.3f", (b*8.0)/s/1e9; else print 0 }')" + + # Drops per backend. + local drops drops_kind + case "$BACKEND" in + dpdk) + drops="$(parse_dpdk_drops "$stderr")" + drops_kind="dpdk-imissed+ierrors+nombuf" + ;; + rdma) + drops="$(parse_rdma_drops "$stderr")" + drops_kind="rdma-cqe-error" + ;; + socket-udp) + local udp_after; udp_after="$(snapshot_proc_net_udp)" + drops="$((udp_after - udp_before))" + drops_kind="udp-proc-net-udp-drops" + ;; + socket-tcp) + local tcp_after; tcp_after="$(snapshot_nstat)" + drops="$((tcp_after - tcp_before))" + drops_kind="tcp-nstat-retrans+inerrs" + ;; + esac + + # Per-core CPU busy% over the bench window. Cores defined per-backend + # (master/TX/RX) match the YAML so we measure the threads we actually pin. + local cpu_master_pct cpu_tx_pct cpu_rx_pct + cpu_master_pct="$(cpu_busy_pct "$cell_dir/cpu_stat.before" "$cell_dir/cpu_stat.after" "$CPU_MASTER")" + cpu_tx_pct="$(cpu_busy_pct "$cell_dir/cpu_stat.before" "$cell_dir/cpu_stat.after" "$CPU_TX")" + cpu_rx_pct="$(cpu_busy_pct "$cell_dir/cpu_stat.before" "$cell_dir/cpu_stat.after" "$CPU_RX")" + + # GPU SM% (column 5) and memory-controller % (column 6) from nvidia-smi + # dmon -s pucvmet. These are near zero for GPUDirect workloads (GPU is a + # DMA target, not a compute engine). + local gpu_sm gpu_mem + gpu_sm="$(awk '/^ *[0-9]/ { count++; sum += $5 } END { if (count) printf "%.1f", sum/count; else print 0 }' \ + "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)" + gpu_mem="$(awk '/^ *[0-9]/ { count++; sum += $6 } END { if (count) printf "%.1f", sum/count; else print 0 }' \ + "$cell_dir/nvidia_smi_dmon.txt" 2>/dev/null || echo 0)" + + echo "$lang,$BACKEND,none,$payload,$batch,$target_gbps,$secs,$pkts,$bytes,$pps,$gbps,$drops,$drops_kind,$cpu_master_pct,$cpu_tx_pct,$cpu_rx_pct,$gpu_sm,$gpu_mem" \ + | tee -a "$CSV" +} + +# -------------------------------------------------------------------------- +# Driver +# -------------------------------------------------------------------------- + +case "$MODE" in + smoke) + # One cell, native-shape, unpaced. + for p in "${PAYLOADS_HEADLINE[@]}"; do + for b in "${BATCHES_HEADLINE[@]}"; do + run_cell cpp "$p" "$b" 0 + done + done + ;; + sweep) + # Full payload × batch matrix at line rate. + for p in "${PAYLOADS_SWEEP[@]}"; do + for b in "${BATCHES_SWEEP[@]}"; do + run_cell cpp "$p" "$b" 0 + done + done + ;; + drop-curve) + # Hold native-shape constant, sweep target_gbps. + for p in "${PAYLOADS_HEADLINE[@]}"; do + for b in "${BATCHES_HEADLINE[@]}"; do + for g in "${DROP_CURVE_TARGETS[@]}"; do + run_cell cpp "$p" "$b" "$g" + done + done + done + ;; + drop-curve-matrix) + # 2D drop curve: sweep payload × target_gbps at the headline batch. + for p in "${PAYLOADS_SWEEP[@]}"; do + for b in "${BATCHES_HEADLINE[@]}"; do + for g in "${DROP_CURVE_TARGETS[@]}"; do + run_cell cpp "$p" "$b" "$g" + done + done + done + ;; + *) echo "Unknown mode: $MODE" >&2; exit 1 ;; +esac + +echo +echo "Results in: $OUT_DIR" +echo "CSV: $CSV" diff --git a/examples/socket_bench.cpp b/examples/socket_bench.cpp index 5fda5f5..8ebb27f 100644 --- a/examples/socket_bench.cpp +++ b/examples/socket_bench.cpp @@ -26,6 +26,7 @@ #include #include +#include "raw_bench_common.h" #include "src/common.h" namespace { @@ -50,6 +51,8 @@ struct SocketBenchConfig { struct SocketWorkerStats { uint64_t sent_packets = 0; uint64_t received_packets = 0; + uint64_t sent_bytes = 0; + uint64_t received_bytes = 0; }; SocketBenchConfig parse_socket_cfg(const YAML::Node& node) { @@ -65,7 +68,8 @@ SocketBenchConfig parse_socket_cfg(const YAML::Node& node) { return cfg; } -void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, SocketWorkerStats& stats) { +void socket_worker(const SocketBenchConfig& cfg, daqiri::bench::TokenBucketPacer& pacer, + std::atomic& stop, SocketWorkerStats& stats) { uintptr_t conn_id = 0; uint16_t port = 0; uint16_t queue = 0; @@ -92,8 +96,14 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, Socket } } - const bool send_done = !cfg.send || stats.sent_packets >= static_cast(cfg.iterations); - const bool recv_done = !cfg.receive || stats.received_packets >= static_cast(cfg.iterations); + // When cfg.iterations <= 0, the loop is time-bounded (driven by stop.load() + // set by --seconds). Otherwise the iteration cap applies as before. + const bool send_done = !cfg.send || + (cfg.iterations > 0 && + stats.sent_packets >= static_cast(cfg.iterations)); + const bool recv_done = !cfg.receive || + (cfg.iterations > 0 && + stats.received_packets >= static_cast(cfg.iterations)); if (send_done && recv_done) { break; } if (cfg.send && !send_done) { @@ -110,6 +120,8 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, Socket if (daqiri::send_tx_burst(msg) == daqiri::Status::SUCCESS) { stats.sent_packets++; + stats.sent_bytes += static_cast(cfg.message_size); + pacer.wait_for_bytes(static_cast(cfg.message_size), stop); } } else { daqiri::free_tx_metadata(msg); @@ -120,7 +132,9 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, Socket daqiri::BurstParams* burst = nullptr; if (daqiri::get_rx_burst(&burst, conn_id, cfg.server) == daqiri::Status::SUCCESS && burst != nullptr) { - stats.received_packets += static_cast(daqiri::get_num_packets(burst)); + const uint64_t rx_pkts = static_cast(daqiri::get_num_packets(burst)); + stats.received_packets += rx_pkts; + stats.received_bytes += daqiri::get_burst_tot_byte(burst); daqiri::free_all_packets_and_burst_rx(burst); } else { std::this_thread::sleep_for(std::chrono::microseconds(100)); @@ -134,17 +148,20 @@ void socket_worker(const SocketBenchConfig& cfg, std::atomic& stop, Socket int main(int argc, char** argv) { if (argc < 2) { std::cerr << "Usage: " << argv[0] - << " [--seconds N] [--mode server|client|both]\n"; + << " [--seconds N] [--mode server|client|both] [--target-gbps G]\n"; return 1; } int run_seconds = 10; + double target_gbps = 0.0; std::string mode = "both"; for (int i = 2; i + 1 < argc; i += 2) { if (std::string(argv[i]) == "--seconds") { run_seconds = std::stoi(argv[i + 1]); } else if (std::string(argv[i]) == "--mode") { mode = argv[i + 1]; + } else if (std::string(argv[i]) == "--target-gbps") { + target_gbps = std::stod(argv[i + 1]); } } @@ -159,6 +176,8 @@ int main(int argc, char** argv) { std::thread client_thread; SocketWorkerStats server_stats; SocketWorkerStats client_stats; + daqiri::bench::TokenBucketPacer server_pacer(target_gbps); + daqiri::bench::TokenBucketPacer client_pacer(target_gbps); bool run_server = false; bool run_client = false; @@ -166,6 +185,7 @@ int main(int argc, char** argv) { run_server = true; server_thread = std::thread(socket_worker, parse_socket_cfg(root["socket_bench_server"]), + std::ref(server_pacer), std::ref(stop), std::ref(server_stats)); } @@ -173,6 +193,7 @@ int main(int argc, char** argv) { run_client = true; client_thread = std::thread(socket_worker, parse_socket_cfg(root["socket_bench_client"]), + std::ref(client_pacer), std::ref(stop), std::ref(client_stats)); } @@ -199,13 +220,23 @@ int main(int argc, char** argv) { if (server_thread.joinable()) { server_thread.join(); } if (client_thread.joinable()) { client_thread.join(); } + const double secs = + std::chrono::duration(std::chrono::steady_clock::now() - start) + .count(); + if (run_server) { - std::cout << "Server sent packets: " << server_stats.sent_packets - << ", received packets: " << server_stats.received_packets << '\n'; + std::cout << "Server complete: sent_packets=" << server_stats.sent_packets + << " recv_packets=" << server_stats.received_packets + << " sent_bytes=" << server_stats.sent_bytes + << " recv_bytes=" << server_stats.received_bytes + << " seconds=" << secs << '\n'; } if (run_client) { - std::cout << "Client sent packets: " << client_stats.sent_packets - << ", received packets: " << client_stats.received_packets << '\n'; + std::cout << "Client complete: sent_packets=" << client_stats.sent_packets + << " recv_packets=" << client_stats.received_packets + << " sent_bytes=" << client_stats.sent_bytes + << " recv_bytes=" << client_stats.received_bytes + << " seconds=" << secs << '\n'; } daqiri::print_stats(); diff --git a/mkdocs.yml b/mkdocs.yml index 4b80201..0fd6b1a 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -35,6 +35,8 @@ nav: - Getting Started: getting-started.md - Configuration: configuration.md - API Reference: api-guide.md + - Performance: + - DGX Spark: performance-dgx-spark.md - Tutorials: - Background: tutorials/background.md - System Configuration: tutorials/system_configuration.md @@ -45,6 +47,7 @@ markdown_extensions: - admonition - attr_list - def_list + - footnotes - md_in_html - tables - toc: diff --git a/scripts/spark_data_fill.sh b/scripts/spark_data_fill.sh new file mode 100755 index 0000000..0e4f81b --- /dev/null +++ b/scripts/spark_data_fill.sh @@ -0,0 +1,143 @@ +#!/usr/bin/env bash +# Drives the PR 1 data-fill bench runs for the DGX Spark performance report. +# +# Runs DPDK GPUDirect, socket-UDP, and socket-TCP through their sweep and +# drop-curve modes via examples/run_spark_bench.sh, with pre-flight checks +# and orphan-hugepage cleanup. RDMA is deferred from PR 1 (single-host +# loopback over the cable needs a netns+two-process refactor; tracked +# separately). +# +# Run inside the project container (privileged, --gpus all, /dev/hugepages +# and /mnt/huge mounted, repo at /workspace). +# +# Usage: +# ./scripts/spark_data_fill.sh # all three backends +# ./scripts/spark_data_fill.sh dpdk # just DPDK +# ./scripts/spark_data_fill.sh socket-udp socket-tcp +# +# Env overrides: +# ETH_DST_ADDR — RX-side MAC. Auto-detected from +# /sys/class/net/enP2p1s0f0np0/address if unset. +# RX_IFACE — RX netdev name (default enP2p1s0f0np0). +# DAQIRI_BUILD_DIR — defaults to ./build. + +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)" +WRAPPER="$REPO_ROOT/examples/run_spark_bench.sh" +BUILD_DIR="${DAQIRI_BUILD_DIR:-$REPO_ROOT/build}" +RX_IFACE="${RX_IFACE:-enP2p1s0f0np0}" + +BACKENDS=("$@") +[[ ${#BACKENDS[@]} -eq 0 ]] && BACKENDS=(dpdk socket-udp socket-tcp) + +# --- pre-flight ------------------------------------------------------------ + +preflight_fail() { echo "PREFLIGHT FAIL: $*" >&2; exit 1; } +note() { echo "[$(date -u +%H:%M:%SZ)] $*"; } + +[[ -x "$WRAPPER" ]] || preflight_fail "wrapper missing or not executable: $WRAPPER" + +for be in "${BACKENDS[@]}"; do + case "$be" in + dpdk) bin="$BUILD_DIR/examples/daqiri_bench_raw_gpudirect" ;; + socket-udp|socket-tcp) bin="$BUILD_DIR/examples/daqiri_bench_socket" ;; + rdma) preflight_fail "RDMA is deferred from PR 1; see follow-up issue" ;; + *) preflight_fail "unknown backend: $be" ;; + esac + [[ -x "$bin" ]] || preflight_fail "missing bench binary: $bin (run cmake --build first)" +done + +# DPDK-only checks. +if [[ " ${BACKENDS[*]} " == *" dpdk "* ]]; then + free_hp="$(awk '/^HugePages_Free:/ { print $2 }' /proc/meminfo)" + [[ "${free_hp:-0}" -ge 4 ]] || preflight_fail "HugePages_Free=$free_hp (need >=4); clean /mnt/huge and /dev/hugepages from prior runs" + + if [[ -z "${ETH_DST_ADDR:-}" ]]; then + mac_path="/sys/class/net/$RX_IFACE/address" + [[ -r "$mac_path" ]] || preflight_fail "cannot read $mac_path; set ETH_DST_ADDR explicitly" + ETH_DST_ADDR="$(cat "$mac_path")" + export ETH_DST_ADDR + note "ETH_DST_ADDR auto-detected from $RX_IFACE: $ETH_DST_ADDR" + fi + + carrier="$(cat "/sys/class/net/$RX_IFACE/carrier" 2>/dev/null || echo 0)" + [[ "$carrier" == "1" ]] || preflight_fail "RX iface $RX_IFACE has no carrier (cable unplugged or link down)" +fi + +note "Pre-flight OK. Backends: ${BACKENDS[*]}" +note "Build dir: $BUILD_DIR" +note "Repo root: $REPO_ROOT" + +# --- hugepage cleanup helper ---------------------------------------------- + +# DPDK leaves orphan rtemap_* files when a bench aborts. Clean between runs so +# we don't run out of hugepages mid-sweep. +clean_orphan_hugepages() { + local pre post freed + pre="$(awk '/^HugePages_Free:/ { print $2 }' /proc/meminfo)" + : "${pre:=0}" + shopt -s nullglob + # DPDK uses a random per-process file prefix (override with --file-prefix); + # match anything ending in `map_` to catch the common shape without + # nuking unrelated files. Skip any that are still held by a live process. + for f in /dev/hugepages/*map_[0-9]* /mnt/huge/*map_[0-9]*; do + if ! fuser -- "$f" >/dev/null 2>&1; then + rm -f -- "$f" 2>/dev/null || true + fi + done + shopt -u nullglob + post="$(awk '/^HugePages_Free:/ { print $2 }' /proc/meminfo)" + : "${post:=0}" + freed=$((post - pre)) + if [[ "$freed" -gt 0 ]]; then + note "Freed $freed orphan hugepages (now ${post} free)" + fi + return 0 +} + +# --- driver loop ----------------------------------------------------------- + +declare -a RESULT_DIRS + +run_backend_mode() { + local backend="$1" mode="$2" + note "=== Running: $backend $mode ===" + clean_orphan_hugepages + + # Stream wrapper output live (per-cell CSV rows appear as they finish) while + # also keeping a log for post-run parsing of the "Results in:" line. + local log="/tmp/spark_data_fill.$backend.$mode.log" + local rc=0 + "$WRAPPER" "$backend" "$mode" 2>&1 | tee "$log" || rc=$? + rc="${PIPESTATUS[0]:-$rc}" + + if [[ "$rc" -eq 0 ]]; then + local result_dir + result_dir="$(awk '/^Results in:/ { print $3 }' "$log" | tail -n1)" + [[ -n "$result_dir" ]] && RESULT_DIRS+=("$backend/$mode -> $result_dir") + note "$backend $mode complete" + else + note "$backend $mode FAILED (exit $rc); continuing" + tail -n 40 "$log" >&2 + fi + clean_orphan_hugepages +} + +for be in "${BACKENDS[@]}"; do + run_backend_mode "$be" sweep + run_backend_mode "$be" drop-curve +done + +# --- summary --------------------------------------------------------------- + +echo +echo "==========================================" +echo "Data-fill complete. Result directories:" +echo "==========================================" +for r in "${RESULT_DIRS[@]}"; do + echo " $r" +done +echo +echo "Next: aggregate CSVs and fill docs/performance-dgx-spark.md."