DPDK init failures (e.g. memory regions exceed HugePages_Free) leave <file-prefix>map_* files in /dev/hugepages/ that pin every page. HugePages_Free stays at 0 and every subsequent run fails the same way until the files are removed by hand. Even clean exit-0 runs leak a few pages.
Hit on examples/daqiri_bench_raw_tx_rx_spark.yaml: num_bufs: 51200 × buf_size: 8064 overruns the kernel-default 1024 × 2 MiB hugepages. 2048 × 2 MiB works.
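As a rough check of the sizing above, the payload footprint of one region can be converted to hugepages with integer arithmetic. This is a minimal sketch: `pages_needed` is a hypothetical helper name, and it counts payload bytes only. Per-object and alignment overhead, DPDK's own mempools and rings, and any additional regions defined in the YAML all add to the figure, so treat the result as a lower bound.

```shell
# pages_needed NUM_BUFS BUF_SIZE_BYTES PAGE_KIB   (hypothetical helper)
# Prints how many hugepages of PAGE_KIB KiB are needed to hold
# NUM_BUFS * BUF_SIZE_BYTES bytes of payload, rounded up.
# Payload only: no per-object, alignment, or DPDK-internal overhead.
pages_needed() {
    echo $(( ($1 * $2 + $3 * 1024 - 1) / ($3 * 1024) ))
}
```

For the values in this report, `pages_needed 51200 8064 2048` prints 197, i.e. about 394 MiB of 2 MiB pages for a single region before any overhead.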
Steps/Code to reproduce bug
- Boot with kernel-default nr_hugepages=1024.
- Run daqiri_bench_raw_gpudirect examples/daqiri_bench_raw_tx_rx_spark.yaml --seconds 5 → fails with "Failed to allocate TX message pool!".
- ls /dev/hugepages/ shows leftover *map_* files.
- Re-run fails identically until sudo rm -f /dev/hugepages/*map_*.
Expected behavior
- Preflight check on memory-region footprint vs HugePages_Free with an actionable error before EAL allocates.
- rte_eal_cleanup() on every error path out of daqiri_init.
- Surface the manual cleanup recipe from the run instructions (today it's only in a collapsed FAQ entry).
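The preflight comparison requested above belongs inside daqiri_init, but the same idea can be approximated from a wrapper script in the meantime. The sketch below is an illustration only: `hugepages_free` and `preflight_hugepages` are hypothetical names, the required page count must be supplied by the caller, and the `HugePages_Free` field in /proc/meminfo covers only the default hugepage size pool.

```shell
# hugepages_free [MEMINFO_FILE]   (hypothetical helper)
# Prints HugePages_Free for the default hugepage size pool.
hugepages_free() {
    awk '/^HugePages_Free:/ { print $2 }' "${1:-/proc/meminfo}"
}

# preflight_hugepages REQUIRED_PAGES [MEMINFO_FILE]   (hypothetical helper)
# Returns 1 with an actionable message when fewer than REQUIRED_PAGES are free.
preflight_hugepages() {
    required=$1
    free=$(hugepages_free "${2:-/proc/meminfo}")
    free=${free:-0}   # field missing: treat as no hugepages available
    if [ "$free" -lt "$required" ]; then
        echo "preflight: need $required hugepages, only $free free." >&2
        echo "Raise nr_hugepages, or check /dev/hugepages/ for leftover *map_* files." >&2
        return 1
    fi
}
```

A wrapper could call `preflight_hugepages 2048 || exit 1` before launching the bench, using the page count that this report found sufficient for the Spark YAML.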
Environment overview
- Bare-metal host, bench inside Docker.
- Source build via scripts/build-container.sh (BASE_TARGET=dpdk DAQIRI_MGR="dpdk socket rdma").
Environment details
- main @ 2532bb6 (post PR #45, "[DOC] Update system tuning tutorial with DGX Spark specific details").
- kind: (huge, host_pinned, device); DPDK uses hugepages for its own mempools/rings regardless.
Proposed fix (docs-only follow-up to PR #49; in-code work tracked separately)
- docs/tutorials/benchmarking_examples.md, before the first docker run: add a "Configure hugepages first" callout covering grep Huge /proc/meminfo, sizing against the YAML's num_bufs × buf_size, runtime knobs for both 2 MiB and 1 GiB pools, and a link to the persistent grub recipe in system_configuration.md.
    echo 2048 | sudo tee /proc/sys/vm/nr_hugepages # 2 MiB pool, 4 GiB
    echo 4 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages # 1 GiB pool, 4 GiB
- docs/tutorials/system_configuration.md, Spark tab: state that the shipped Spark YAML needs ~2048 × 2 MiB; the default 1024 OOMs at "Failed to allocate TX message pool!".
- Same file, new "Troubleshooting" sub-section: orphan-hugepage symptom, <file-prefix>map_* root cause, cleanup in sudo + Docker forms, pgrep -af daqiri_bench safety check, leak-on-clean-exit caveat.
    sudo rm -f /dev/hugepages/*map_*
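The cleanup recipe and the pgrep safety check can be combined so that map files are never removed while a bench process is still using them. This is a minimal sketch, not the wording proposed for the docs: `cleanup_hugepage_maps` is a hypothetical name, the default process pattern simply mirrors the pgrep check above, and if pgrep is not installed the check is silently skipped.

```shell
# cleanup_hugepage_maps [DIR] [PROC_PATTERN]   (hypothetical helper)
# Removes leftover *map_* files from DIR unless a process whose command
# line matches PROC_PATTERN is still running.
cleanup_hugepage_maps() {
    dir=${1:-/dev/hugepages}
    pattern=${2:-daqiri_bench}
    # A live match means the files may still be in use, not orphaned.
    # If pgrep is unavailable this test fails and the check is skipped.
    if pgrep -f "$pattern" >/dev/null 2>&1; then
        echo "refusing to clean $dir: a process matching '$pattern' is running" >&2
        return 1
    fi
    rm -f "$dir"/*map_*
}
```

Removing files under /dev/hugepages/ typically needs root, and sudo cannot invoke a shell function directly, so the function would have to be defined in a root shell or placed in a script that is run with sudo.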