[BUG] Failed init leaves hugepage files behind in /dev/hugepages, blocking subsequent runs #56

@chloecrozier

Description

DPDK init failures (e.g. memory regions that exceed HugePages_Free) leave <file-prefix>map_* files in /dev/hugepages/, and those files keep every page pinned. HugePages_Free stays at 0, so every subsequent run fails the same way until the files are removed by hand. Even clean exit-0 runs leak a few pages.

Hit on examples/daqiri_bench_raw_tx_rx_spark.yaml: with num_bufs: 51200 × buf_size: 8064, the memory regions overrun the kernel-default 1024 × 2 MiB hugepages. 2048 × 2 MiB works.
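
For sizing, the per-region arithmetic can be sketched as below. This is a rough estimate only: `hugepages_needed` is a hypothetical helper, not part of the repo, and it counts the pages for a single region of num_bufs × buf_size bytes. One region of 51200 × 8064 bytes comes to about 394 MiB, so the total that overruns the 1024-page pool presumably includes the other regions declared in the YAML plus DPDK's own allocations. Sum the result over every region and leave headroom.

```shell
# Hypothetical helper: hugepages needed for ONE memory region.
# usage: hugepages_needed NUM_BUFS BUF_SIZE_BYTES [PAGE_BYTES]
hugepages_needed() {
    num_bufs=$1
    buf_size=$2
    page_bytes=${3:-2097152}            # default: 2 MiB hugepages
    bytes=$((num_bufs * buf_size))
    # Ceiling division: a partially used page still costs a whole page.
    echo $(( (bytes + page_bytes - 1) / page_bytes ))
}

hugepages_needed 51200 8064             # prints 197
```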

Steps/Code to reproduce bug

  1. Boot with kernel-default nr_hugepages=1024.
  2. daqiri_bench_raw_gpudirect examples/daqiri_bench_raw_tx_rx_spark.yaml --seconds 5 → fails with Failed to allocate TX message pool!.
  3. ls /dev/hugepages/ shows leftover *map_* files.
  4. Re-run fails identically until sudo rm -f /dev/hugepages/*map_*.

Expected behavior

  • Preflight check on memory-region footprint vs HugePages_Free with an actionable error before EAL allocates.
  • rte_eal_cleanup() on every error path out of daqiri_init.
  • Surface the manual cleanup recipe in the run instructions (today it's only in a collapsed FAQ entry).
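
Until an in-code preflight exists, the first bullet can be approximated from the shell before launching the bench. This is a minimal sketch, not the proposed daqiri_init implementation: `preflight_hugepages` is a hypothetical name, the required page count is supplied by the caller, and HugePages_Free in /proc/meminfo only covers the default hugepage size (a 1 GiB pool would need the per-size counters under /sys/kernel/mm/hugepages/ instead).

```shell
# Hypothetical preflight: compare pages needed against HugePages_Free.
# usage: preflight_hugepages NEEDED_PAGES [MEMINFO_PATH]
preflight_hugepages() {
    needed=$1
    meminfo=${2:-/proc/meminfo}
    # Second field of the "HugePages_Free:" line is the free page count.
    free_pages=$(awk '/^HugePages_Free:/ {print $2}' "$meminfo")
    if [ -z "$free_pages" ]; then
        echo "preflight: no HugePages_Free line in $meminfo" >&2
        return 2
    fi
    if [ "$free_pages" -lt "$needed" ]; then
        echo "preflight: need $needed hugepages, only $free_pages free." >&2
        echo "preflight: raise nr_hugepages or remove stale /dev/hugepages/*map_* files." >&2
        return 1
    fi
    echo "preflight: ok ($free_pages free, $needed needed)"
}
```

Running `preflight_hugepages 2048` before the bench turns the late "Failed to allocate TX message pool!" into an early, actionable message. The second argument exists only so the check can be exercised against a saved copy of /proc/meminfo.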

Environment overview

  • Bare-metal host, bench inside Docker.
  • Source build via scripts/build-container.sh (BASE_TARGET=dpdk DAQIRI_MGR="dpdk socket rdma").

Environment details

Proposed fix (docs-only follow-up to PR #49; in-code work tracked separately)

  • docs/tutorials/benchmarking_examples.md, before the first docker run: "Configure hugepages first" callout — grep Huge /proc/meminfo, size against the YAML's num_bufs × buf_size, runtime knobs for both 2 MiB and 1 GiB pools, and a link to the persistent grub recipe in system_configuration.md.

    echo 2048 | sudo tee /proc/sys/vm/nr_hugepages                                  # 2 MiB pool, 4 GiB
    echo 4    | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages  # 1 GiB pool, 4 GiB

  • docs/tutorials/system_configuration.md, Spark tab: state that the shipped Spark YAML needs ~2048 × 2 MiB hugepages; with the default 1024 it runs out of memory at Failed to allocate TX message pool!.

  • Same file, new "Troubleshooting" sub-section: orphan-hugepage symptom, <file-prefix>map_* root cause, cleanup in sudo + Docker forms, pgrep -af daqiri_bench safety check, leak-on-clean-exit caveat.

    sudo rm -f /dev/hugepages/*map_*
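
The cleanup and the pgrep safety check can be combined so the files are never removed under a live run. The function below is a sketch under assumptions: `cleanup_hugepage_maps` is a hypothetical name, the default process pattern `daqiri_bench` is taken from the pgrep check above, and removing files from /dev/hugepages needs root, so it would be run from a root shell.

```shell
# Hypothetical helper: remove leftover *map_* files unless a bench is running.
# usage: cleanup_hugepage_maps [DIR] [PROCESS_PATTERN]
cleanup_hugepage_maps() {
    dir=${1:-/dev/hugepages}
    pattern=${2:-daqiri_bench}
    # Safety check: refuse while a matching process is alive, and show it.
    if pgrep -af "$pattern" >&2; then
        echo "cleanup: processes above still running, not touching $dir" >&2
        return 1
    fi
    removed=0
    for f in "$dir"/*map_*; do
        [ -e "$f" ] || continue         # an unmatched glob stays literal
        rm -f -- "$f" && removed=$((removed + 1))
    done
    echo "cleanup: removed $removed file(s) from $dir"
}
```

Because `pgrep -f` matches against full command lines, it can also match a wrapper shell whose own command line contains the pattern. In that case the function errs on the side of refusing.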

Labels: bug
