Conversation
The hdf5_v1 printer aborted with "H5Fflush returned error value (-1)... not a file or file object" whenever a scanner explicitly called flush() on an auxiliary print stream. MultiNest hits this on its very first dumper callback, where it flushes the "txt" (Posterior) and "live" (LastLive) RA streams.

Root cause: only the primary HDF5Printer opens the HDF5 file and sets its file_id member; auxiliary printers inherit location_id, RA_location_id, and metadata_location_id from the primary but never initialise their own file_id. HDF5Printer::empty_sync_buffers() called H5Fflush(file_id, ...) unconditionally, so on an aux printer it was flushing an uninitialised hid_t.

Fix: route the flush through primary_printer->file_id. primary_printer defaults to "this" for the primary and is set to the actual primary in the aux constructor, so the same call is correct in both cases. This also makes the dumper's flush() calls actually useful (they now push HDF5's internal buffers to disk) instead of erroring out. Closes #495.
…t final_size on workers

Issue
-----
In HDF5Printer2::finalise() (Printers/src/printers/hdf5printer_v2/hdf5printer_v2.cpp), the local `std::size_t final_size` was declared unconditionally but only assigned inside an `if(myRank==0)` block (via `buffermaster.get_next_free_position()`). Immediately after, the loop that flushes unsynchronised (random-access) aux buffers and calls `(*it)->extend_all_datasets_to(final_size)` ran on **every** rank. Worker ranks therefore passed an uninitialised value into `extend_all_datasets_to`, which is undefined behaviour.

In practice this hit any MPI scan using the v2 hdf5 printer with auxiliary RA streams (i.e. anything beyond a single-rank run): worker ranks with non-empty `aux_buffers` would attempt local writes / dataset extensions that they have no business performing — only rank 0 owns the output file. The matching synchronised-buffer flush block earlier in finalise() is already correctly gated by `if(myRank==0)`, so the fix is to apply the same guard symmetrically.

Fix
---
- Move the `final_size` query and the entire RA-flush / extend_all_datasets_to loop inside `if(myRank==0)`. Worker ranks have already shipped their RA buffer contents to rank 0 via the preceding `gather_and_print(...)` call, which remains unguarded (it is a collective and must be called by all ranks).
- Tighten the scope of the `final_size` declaration so it lives only inside the rank-0 block where it is meaningful.
- Move the `add_aux_buffer(RAbuffer)` call inside the rank-0 block too, since the RAbuffer only contains gathered data on rank 0 and the loop that consumes `aux_buffers` is now rank-0-only. (Note: RAbuffer being a stack local appended to the member `aux_buffers` vector is a separate latent issue, tracked in `hdf5_printer_issues.md`.)
- Added a comment explaining why the block is rank-0-only.

Verification
------------
- Built GAMBIT with `-DWITH_MPI=True`.
- Ran the spartan example single-rank (`OMP_NUM_THREADS=1 ./gambit -rf yaml_files/spartan.yaml`): successful, "Final dataset size is 9003".
- Ran spartan with 2 MPI ranks (`mpiexec -np 2 ./gambit -rf yaml_files/spartan.yaml`): successful, "Final dataset size is 9001". Inspected the output HDF5: all 34 datasets under /data have identical length (9001), confirming sync and RA datasets were extended consistently and no output was corrupted.

https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
Add an "Optional: MPI" subsection to step 1 covering the `libopenmpi-dev openmpi-bin` install and the corresponding `-DWITH_MPI=True` cmake flag, plus an MPI run example in step 6 (including the `--allow-run-as-root --oversubscribe` flags that are typically needed in containerised dev environments). Needed for testing any printer / scanner code paths gated on `#ifdef WITH_MPI` (e.g. the hdf5_v2 finalise rank-0 guard fixed in the preceding commit). https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
CMake writes downloaded_tarball_paths.txt at the repo root listing the paths of scanner / backend tarballs it downloaded into ScannerBit/downloaded/ and Backends/downloaded/ during a build. It is purely a build-time bookkeeping file and should not be tracked. https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
…F5 errors

Issue
-----
In Printers/src/printers/hdf5printer/hdf5tools.cpp, errorsOff() saved the current HDF5 error handler into module-globals old_func / old_client_data and then installed a NULL handler; errorsOn() restored from those globals. This is fine for an isolated pair, but it breaks when errorsOff() is called while errors are already off:

    errorsOff(); // saves real handler -> globals; installs NULL
    ...
    errorsOff(); // saves NULL (current handler) -> globals; installs NULL
    ...
    errorsOn();  // restores globals = NULL -> errors stay off forever

There is exactly one such nested case in the v1 printer code today (hdf5tools.cpp lines 369/377/386 in the dataset corruption-recovery helper), and it permanently disables HDF5 error reporting for the rest of the process whenever it is hit. Any subsequent HDF5 problem then fails silently. There are also ~25 paired call sites across hdf5tools.cpp and hdf5_combine_tools.cpp that work fine in isolation but would corrupt the global state if any of them were ever wrapped in another errorsOff() / errorsOn() block.

Fix
---
Make errorsOff() / errorsOn() idempotent with a bool guard:
- errorsOff(): if errors are already off, do nothing (in particular, do not call H5Eget_auto2, which would overwrite the saved handler with the NULL we installed ourselves).
- errorsOn(): if errors are already on (i.e. unbalanced call), do nothing rather than blindly restoring whatever is in the saved slots.

The bool / handler state is moved into an unnamed namespace at file scope, giving it internal linkage. Functionally equivalent to the existing namespace-scope globals for this codebase (no other TU references them), but it (a) makes intent explicit -- these are strictly the implementation detail of the two functions just below them -- and (b) prevents any future accidental ODR collision on the short generic names.

A counter-based nesting scheme was considered but rejected: no real caller pairs an inner errorsOn() with an inner errorsOff(), so all the counter would do is silently absorb the one buggy nested call we already have. The bool keeps the logic minimal and matches what every existing caller actually wants.

Note: the underlying state is still a module global, so errorsOff / errorsOn remain not thread-safe. That is a separate (lower-priority) issue tracked in hdf5_printer_issues.md.

Verification
------------
- Rebuilt GAMBIT (-DWITH_MPI=True) cleanly.
- Single-rank v1 spartan run: success.
- 2-rank MPI v1 spartan run: success.
- Resume run via combine-routines path through hdf5_combine_tools.cpp (which has the bulk of the errorsOff / errorsOn call sites): success.

No HDF5 warnings or errors in any of the runs.

https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
…r paths

The HDF5Printer class has seven hid_t members for HDF5 file / group / location handles:

    file_id, group_id, RA_group_id, metadata_id,
    location_id, RA_location_id, metadata_location_id

The four file/group handles are only assigned by the *primary* constructor; the aux constructor at hdf5printer.cpp:683-699 only sets the three *_location_id members (inheriting them from primary_printer). With no in-class initialisers, the four file/group handles held indeterminate values on every auxiliary printer instance.

This was the root cause of issue #495 (H5Fflush(file_id, ...) firing on aux printers when MultiNest's dumper called flush() on its 'txt' / 'live' aux streams). That specific call site is now correctly routed through primary_printer->file_id (merge of upstream fix_hdf5v1_and_multinest_issue), but the underlying class invariant remained fragile: any future aux call site that forgets to go through primary_printer would silently read uninitialised memory.

Fix: give all seven hid_t members an in-class default initialiser of -1. This matches the sentinel value the class already treats as "unset HDF5 handle" -- the existing get_location() / get_RA_location() / get_metadata_location() checks at hdf5printer.cpp:1144,1155,1166 already raise a printer_error when they see -1. With this change the class invariant becomes uniform: an unset HDF5 handle in this class is -1, period. Any accidental HDF5 call on an aux's file/group handle now passes H5I_INVALID_HID, which HDF5 reliably rejects -- failures will surface deterministically as printer_errors instead of as indeterminate behaviour or random "lucky" successes.

The bottom three *_location_id members are assigned by both constructors today, so the = -1 there is purely defensive. Including them keeps the rule uniform across all seven handles.
Verification
------------
Rebuilt and ran four scenarios cleanly (no HDF5 errors, all exit 0):
- v1 + Diver, single rank
- v1 + Diver, 2-rank MPI
- v1 + MultiNest, single rank (was the original #495 reproducer)
- v2 + Diver, single rank (regression check)

https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
… leak)
In HDF5DataSet<T>::write_buffer (the v2 hdf5 printer's per-flush write
path for synchronised datasets), the on-disk dtype is queried with
hid_t dtype = H5Dget_type(get_dset_id());
and used only for an H5Tequal sanity check against the in-memory
expected_dtype. The handle was never closed before the function
returned, leaking one HDF5 type handle per call. write_buffer is
invoked once per flush per synchronised dataset, so on long scans with
many datasets and frequent flushes the leak accumulates -- eventually
HDF5's internal table or the OS file-handle limit gives out, depending
on the build/runtime.
The companion write_RA_buffer (lines 438/467 in the same header)
already closes its corresponding H5Dget_type handle, so this is purely
a missing close in the synchronised path.
Fix: H5Tclose(dtype) immediately after the H5Tequal check, where
dtype stops being needed. Mirrors the close site in write_RA_buffer.
A short comment notes why it's there so the next reader doesn't
"clean it up" again.
Verification
------------
- Built cleanly.
- v2 + Diver, single rank: exit 0, no HDF5 errors, output file has
34 datasets all sized 9002.
- v2 + Diver, 2 MPI ranks: exit 0, no HDF5 errors.
(The leak itself is slow-burn -- not visible in a short spartan run --
but the fix is mechanical and the smoke tests confirm it doesn't
disturb the write path.)
https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
…empty
The destructors of DataSetInterfaceBase and VertexBufferNumeric1D_HDF5
each carried a TODO suggesting that one day they ought to close their
HDF5 dataset / flush their pending writes, with the actual cleanup
calls left commented out. The TODOs implied the empty body was a
hack waiting to be fixed, but in fact the empty body is required by
how these objects are stored:
- DataSetInterfaceBase (and DataSetInterfaceScalar) are populated via
the assign-from-temporary pattern, e.g.
_dsetdata = DataSetInterfaceScalar<T,CL>(loc, name, ...);
local_buffers[key] = BuffType(loc, label, vertexID, ...);
With the default copy/move semantics, the temporary and the
surviving slot share the same hid_t dset_id; closing in the
destructor would invalidate the handle that the surviving copy
still relies on.
- The same shape applies to VertexBufferNumeric1D_HDF5 itself, whose
destructor would similarly run on a temporary that shares all the
buffer state (HDF5 handles, sync counters, write queues) with the
container slot. Doing write_to_disk() / RA_write_to_disk() there
would either double-write or write through stale state.
The actual close / flush is driven deterministically from
HDF5Printer::flush() and HDF5Printer::finalise(), iterating over the
surviving buffers in the printer's all_buffers map. If finalise() is
bypassed (exception during scan setup or shutdown), HDF5's own
H5Fclose() at process exit auto-closes any still-open dataset handles
and flushes their pending I/O, so the file on disk remains
well-formed.
This commit replaces the misleading TODOs with comments explaining
the actual ownership invariant, so the next reader doesn't try to
"fix" the empty body and reintroduce the original handle-aliasing
bug. Properly closing in the destructor would require giving these
classes ownership semantics they currently don't have (e.g. wrapping
the hid_t in a shared_ptr with an H5Dclose deleter, or making the
chain non-copyable / move-only) -- a larger change that is out of
scope here.
No behaviour change. Verified by rebuilding and running the spartan
example with both the v1 and v2 hdf5 printer (both exit 0, no HDF5
errors).
https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
…aise()
In HDF5Printer::common_constructor (Printers/src/printers/hdf5printer/
hdf5printer.cpp), the rank-0 setup block at lines 414-472 handles the
case where the requested output file already exists. Inside the
"restart, not resume, not overwrite" branch (lines 448-469), the code
opens the file, checks whether the requested group is already present,
and either:
- group exists -> raise a printer_error with a clear, multi-option
message explaining how to fix the situation (line 465); or
- group missing -> closeFile + bare exit(1) (line 468), with no
message at all.
The exit(1) path was buggy in two ways:
1. exit(1) bypasses MPI shutdown -- no MPI_Finalize, no MPI_Abort.
Worker ranks (which never enter this rank-0-only block) sail on
to the upcoming myComm.Barrier() at line 610 and hang forever.
The user sees a job that never terminates, with no log message
explaining why.
2. The user has no idea what went wrong: the path triggers when
they pointed the printer at a file that already exists from
some unrelated previous run, and the only signal is a non-zero
exit code from rank 0.
Fix: mirror the sibling group-exists case at line 465. Build a
clear printer_error message ("file exists, requested group not in it,
not in resume mode, delete_file_on_restart not set; here are the
three things you can do") and raise it. printer_error().raise()
throws an exception that goes through GAMBIT's normal error-reporting
machinery, which at least tells the user what happened on rank 0
before tearing down -- a strict improvement over silent exit(1).
Note that this commit does *not* address the broader rank-0-error-
deadlocks-workers problem: if any of the four printer_error sites in
this constructor block (lines 438, 445, 465, and now this one) fire
on rank 0, worker ranks still hang at the Barrier downstream. That
deadlock-class issue is the same one tracked under "v1
masterWaitForAll(FINAL_SYNC) deadlocks" and "Same risk on the
resume-mode myComm.Barrier() at line 610" in
hdf5_printer_issues.md, and needs a separate, larger fix using
BarrierWithTimeout. The point of the present commit is just to bring
this one site up to the same error-reporting standard as its three
siblings.
Verification
------------
- Built cleanly.
- Regular v1 spartan run: exit 0, no change to the happy path.
- Reproduced the failing scenario by seeding an output file with
group "/data" via a normal run, then re-running with the same
output_file but group "/data2", restart mode (-rf), no
delete_file_on_restart. Before the fix this would silent-exit(1);
after the fix the job terminates with a FATAL ERROR carrying the
full guidance message and "Calling MPI_Finalize..." in the trailer.
https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
…I_request_buffer_data path
The v2 hdf5 printer carried two parallel sets of MPI machinery for
shipping buffer data from worker ranks to the rank-0 master:
1. The currently-used "gather and Gatherv" path in finalise() and
flush(), built around HDF5MasterBuffer::add_to_buffers and the
HDF5bufferchunk message format. This is what actually runs in
every MPI scan today.
2. A second, unwired Send/Recv path consisting of:
- HDF5MasterBuffer::MPI_flush_to_rank
- HDF5MasterBuffer::MPI_request_buffer_data
- HDF5MasterBuffer::MPI_recv_all_buffers
- HDF5MasterBuffer::MPI_recv_buffer<T>
- HDF5Buffer<T>::MPI_flush_to_rank
- HDF5Buffer<T>::MPI_recv_from_rank
- HDF5BufferBase::MPI_flush_to_rank (pure-virtual decl)
plus eight dedicated MPI tag constants
(h5v2_bufname / h5v2_bufdata_points / h5v2_bufdata_ranks /
h5v2_bufdata_valid / h5v2_bufdata_type / h5v2_bufdata_values /
h5v2_BLOCK / h5v2_BEGIN) and some commented-out
recv_counter / send_counter debug scaffolding.
The path in (2) is internally consistent (the functions only call each
other) but has no external callers anywhere in the codebase, and the
MPI_request_buffer_data side sends an h5v2_BEGIN tag for which there
is no matching Recv -- a strong sign the path was abandoned mid-wiring
rather than being a finished alternative.
This commit deletes (2) outright. Two reasons for choosing delete over
finishing the wiring:
- The currently-used gather_and_print path is correct and tested
(passes our spartan smoke tests in single-rank and 2-rank MPI
modes, including with auxiliary RA streams).
- The send-side comment in MPI_recv_all_buffers explicitly warns of a
deadlock if MPI_request_buffer_data is not called first; revisiting
that ordering and the request/recv handshake to make it deadlock-free
would be a real design effort, not a hygiene fix. The audit
recommended either deleting or wiring up; given there is no concrete
reason to believe the unfinished path would outperform the working
one, deletion is the lower-risk choice.
Net change: -266 lines / +4 lines across hdf5printer_v2.hpp and
hdf5printer_v2.cpp. add_to_buffers, MPI_add_int_block_to_buffer,
MPI_add_float_block_to_buffer (which ARE used by the live path) are
untouched.
Verification
------------
- Built cleanly.
- v2 + Diver, single rank: exit 0, output has 34 datasets all sized
9001 (uniform, as expected).
- v2 + Diver, 2 MPI ranks: exit 0, output has 34 datasets all sized
9001 (uniform).
https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
hdf5_combine_tools.cpp called H5Dget_type at 8 sites to obtain the
HDF5 datatype of an existing dataset (in order to recreate the dataset
in the combined output via setup_hdf5_points), but never released the
returned datatype id. H5Dget_type returns a fresh id that the caller
owns and must close with H5Tclose; otherwise the id (and its underlying
in-memory state) lingers until the file/library is shut down.
The leaks were at three call sites within hdf5_stuff::Combine_HDF5:
- Primary-dataset combine loop (~L949-956): type/type2 obtained per
batch from either the previous combined dataset or a temp-file
dataset, used by setup_hdf5_points only on batch 0, but allocated
on every batch and never freed. Fixed by H5Tclose(type/type2) after
the dataset_out/dataset2_out closes inside the same per-batch
if(valid_dset>=0 or old_dataset>=0) block, so each iteration
balances its allocations.
- Auxiliary-dataset path (~L1144-1145): type/type2 obtained inside
the "param not also a primary" branch, used immediately by
setup_hdf5_points. Fixed by H5Tclose right after that call, guarded
by >=0 since the variables are pre-initialised to -1 above.
- Metadata-dataset combine loop (~L1273, L1278): single 'type' id,
same pattern as the primary loop. Fixed by H5Tclose(type) after
the dataset_out close inside the per-batch valid branch.
These were straightforward resource leaks: setup_hdf5_points only
reads 'type' to pass it to H5Dcreate2 and does not take ownership,
so adding the closes at the end of the using-scope is correct and
safe.
Verification
------------
- Built cleanly (only Printers translation unit recompiled).
- Ran spartan with the v1 (hdf5_v1) printer, 2 MPI ranks, with
disable_combine_routines: false so the combine path actually
exercises all three call sites. Run completed cleanly (exit 0):
* 17 primary parameters combined, no errors
* 51 metadata parameters combined, no errors
* 0 auxiliary parameters in this scan (path not exercised, but
that branch is structurally identical to the primary path)
* combined output gambit_output.hdf5 contains uniform 9501-row
1-D datasets across all primary parameters.
Out of scope (not fixed here)
-----------------------------
- hdf5_combine_tools.hpp Enter_HDF5() at line 377 calls
H5Tget_native_type(dtype, ...) and stores the result in 'type', then
closes only 'dtype' (line 448) before returning. The native-type id
is also a fresh id that needs H5Tclose. This is a separate leak (a
different HDF5 API) and the issue title is scoped specifically to
H5Dget_type; flagging here for follow-up.
https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
…size-trim loops
Two near-identical loops in hdf5_stuff (one in the primary-dataset
scanning pass, one in the auxiliary-dataset Enter_Aux_Parameters pass)
walked a std::vector<bool> backwards to count the number of trailing
invalid entries to trim from a per-file size:
for (auto it = valids.end()-1; size > 0; --it)
{
if (*it) break;
else --size;
}
There were two problems with this idiom:
1. If valids.size() == 0 (which happens cleanly when the temp file's
sync group is empty), valids.end()-1 is undefined behavior. In
practice the loop condition `size > 0` would skip the body, but
the iterator is still constructed.
2. More importantly: when ALL entries in valids are invalid (so the
"break" branch is never taken), the final iteration decrements
size from 1 to 0 and dereferences valids.front() correctly, but
then the for-loop's post-step runs `--it` once more, producing an
iterator one before begin(). That decrement is undefined behavior
on std::vector::iterator -- even though it's typically a pointer
under the hood and "happens to work", a sanitizer or a debug
iterator implementation can flag/abort. The condition `size > 0`
is then evaluated on the already-invalid iterator's surrounding
state, but the UB has already occurred at the --it.
Replaced both loops with a straightforward index-based trim:
while (size > 0 && !valids[size-1])
--size;
Equivalent semantics (trim trailing invalids, keep first valid),
but no iterator arithmetic, no past-the-begin decrement, and no
empty-vector trap. Both call sites already guarantee
size <= valids.size() (size and valids both come from the
pointID_isvalid / RA_pointID_isvalid dataset, with the prior
size != size2 check raising on mismatch), so valids[size-1] is
safely in range whenever size > 0.
Verification
------------
- Built cleanly (Printers TU only).
- Spartan + hdf5_v1 + 2 MPI ranks + disable_combine_routines: false:
exit 0, combined output has uniform 9000-row primary datasets and
1-row metadata datasets. The primary-pass trim loop runs once per
temp file (2x in this scan); the aux-pass trim loop is structurally
identical and exercised by the same Enter_HDF5/read_hdf5 path.
https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
Before this change, HDF5Printer2::flush() only invoked buffermaster.flush() on the calling printer instance. That meant a public flush() on the primary printer left every auxiliary printer's buffermaster -- metadata streams, RA streams, and any other aux streams the scan has registered -- with their points still in RAM until either the per-stream sync auto-flush triggered (when a sync buffer fills) or finalise() ran. The two known external callers of the public API (twalk's out_stream->flush() at the end of a temp-file replay batch and the Python scanner binding in py_module_scanbit.hpp) both expect "persist what I've written so far", so the partial flush was a behavioural mismatch with caller intent and a divergence from the v1 printer (HDF5Printer::flush() iterates all buffers).

The fix: after flushing this->buffermaster, if this is the primary printer, iterate the aux_buffers list and flush each. Aux printers register their own buffermaster into the primary's aux_buffers in their constructor (~hdf5printer_v2.cpp:1342 add_aux_buffer) and have an empty aux_buffers themselves, so guarding the loop on not is_auxilliary_printer() preserves aux-printer semantics (unchanged: still just flushes its own buffermaster). Two passes -- sync streams first, RA streams second -- mirror the ordering already used in finalise() (~:1824-1833 then :1895-1909). Reason: RA writes anchor to positions established by sync writes, so flushing RA before its sync targets are on disk can leave RA points stuck in the postpone queue.

What is intentionally NOT done here:

* No MPI gather. The architecture defers cross-rank gathering to finalise() as a performance optimisation for HPC networked filesystems (see comment block at hdf5printer_v2.cpp:1786-1793). Mid-scan, every rank just writes its own buffered points to the shared file under the existing cross-rank Utils::FileLock taken inside HDF5MasterBuffer::flush() (lock_and_open_file -> hdf5out.get_lock() at :741). Adding an MPI collective inside the public flush() would deadlock for non-collective callers (twalk's per-rank flush, Python scanners) and is unnecessary for "persist my data" semantics.

* No explicit H5Fflush. v1 keeps the file open between buffer flushes and ends flush() with H5Fflush(file_id, H5F_SCOPE_GLOBAL). v2's HDF5MasterBuffer::flush() always closes the file at the end via close_and_unlock_file -> H5Fclose, which already drains the HDF5 internal cache to disk, making H5Fflush redundant.

* Auto-flush path unchanged. add_buffered_point's "buffer full" auto-flush at hdf5printer_v2.hpp:1128 / 1137 calls HDF5MasterBuffer::flush() (same class), not HDF5Printer2::flush(). That path is untouched.

* finalise() unchanged. It still does its own MPI gather and per-buffer flush logic without going through the public flush().

Cost: the new loop incurs N additional lock_and_open_file -> close_and_unlock_file cycles per public flush() call (N = number of registered aux buffers), but HDF5MasterBuffer::flush() short-circuits on get_Npoints() == 0 (:479) without taking the lock, so empty aux buffers cost essentially zero. flush() is rare on the hot path; this is fine.

Verification
------------
- Built cleanly (Printers TU only).
- Spartan + v2 + single rank: exit 0, 34 primary datasets all sized 9502 (uniform).
- Spartan + v2 + 2 MPI ranks: exit 0, 34 primary datasets all sized 9003 (uniform).

Both runs produce well-formed output; the change is additive and preserves existing on-disk layout because finalise() already flushes everything at end-of-scan -- this commit just makes the same flushing happen on demand when flush() is called.

https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
…buffer fast_forward
HDF5Printer's sync_pos member is unsigned long, initialised to 0
(hdf5printer.hpp:416), and only ever incremented (in
increment_sync_pos at hdf5printer.cpp:1650). Two call sites compute
get_sync_pos()-1 without first checking that sync_pos>0, which
would underflow to ULONG_MAX when sync_pos==0:
1. HDF5Printer::synchronise_buffers (hdf5printer.cpp:1320):
const unsigned long sync_pos = get_sync_pos()-1;
...
it->second->synchronise_output_to_position(sync_pos);
The underflowed ULONG_MAX would be stored into each buffer's
target_sync_pos. For RA buffers this later feeds extend_dset()
in RA_write_to_disk, which would attempt to extend the dataset
to ULONG_MAX rows. For sync buffers, the movediff calculation
in synchronise_output_to_position relies on sync_pos+1
(ulong), which then wraps to 0 -- harmless only when
dset_head_pos() also happens to be 0.
2. H5P_LocalBufferManager::get_buffer (hdf5printer.hpp:558):
if(synchronised) it->second.fast_forward(
printer->get_sync_pos()-1);
fast_forward takes a signed long target_pos. With sync_pos==0,
get_sync_pos()-1 == ULONG_MAX (ulong), implicitly converted to
long it becomes -1 on two's-complement platforms. fast_forward
then computes needed_steps = -1 - 0 = -1 (negative) and raises
the "would need to move backwards" error
(VertexBufferBase.hpp:218).
In normal scans both bugs are dormant because of call ordering:
- synchronise_buffers' first invocation (from the first
check_for_new_point at :1632) happens before any buffer has been
added to all_buffers, so the for-loop iterates over nothing and
the underflowed value is never delivered.
- fast_forward via get_buffer is reached only after
check_for_new_point has run increment_sync_pos at :1650, taking
sync_pos from 0 to 1, so by the time fast_forward executes
sync_pos>=1 and target_pos>=0.
The underflow is therefore real but masked by today's lifecycle.
Adding a buffer earlier than the first check_for_new_point, or
calling synchronise_buffers from a new code path before any
pointID arrives, would surface either bug.
Fix: guard each call site with sync_pos>0. Semantically, sync_pos==0
means "no point has yet been registered as the current sync
position", so there is nothing to synchronise to or catch up to;
returning early / skipping the call is the correct no-op behaviour
and is consistent with buffers being at dset_head_pos==0 from
construction.
This is a surgical fix that preserves all existing behaviour: the
two guarded paths were never executed with sync_pos==0 in practice,
so guarding them doesn't change any observable behaviour today --
it only removes the latent underflow.
Verification
------------
- Built cleanly.
- Spartan + hdf5_v1 + 1 rank: exit 0, 34 primary datasets all
sized 9000 (uniform).
- Spartan + hdf5_v1 + 2 MPI ranks (with combine routines enabled):
exit 0, 34 primary datasets all sized 9501 (uniform).
https://claude.ai/code/session_01QZuy7gyVUMZa3FCXShjj4X
Collaborator (Author):
Sorry for requesting one more review from you, @ChrisJChang, feel free to forward it to someone else. Or I can just take care of it myself – I've already gone over the changes once to check that the Claude-suggested changes make sense to me, and to tweak some comments and code snippets.
Collaborator (Author):
@carlmfe volunteered to review this one (thanks!), so you can forget about it @ChrisJChang.
carlmfe reviewed May 7, 2026
carlmfe approved these changes May 7, 2026
Contributor (carlmfe) left a comment:
Code is good. Comments are generally on the verbose side, but do convey clearly why the changes are as they are. Other than a specific comment referencing old code, I approve of the rest.
Collaborator (Author):
Thanks for reviewing – and I agree, I'll revise some of the comments before I merge.
A series of targeted fixes to the v1 (`HDF5Printer`) and v2 (`HDF5Printer2`) HDF5 printers, surfaced by a code audit of the two printer source trees plus their combine tools. Each commit is independently small, locally verified against the spartan example (1 and 2 MPI processes, where applicable), and behaviourally additive — no on-disk format changes, no API changes for callers.

- [v1] Default-initialise `hid_t` members to `-1` to harden aux-printer paths. `HDF5Printer`'s seven `hid_t` file/group/location handles had no in-class initialisers; the aux constructor only assigned the three `*_location_id` members, leaving the four file/group handles indeterminate on every aux instance. This was the root cause of #495 ("Multi-nest fails with hdf5 version 1"), resolved at the call site by using the file handle through `primary_printer`; this commit fixes the underlying class invariant so any future aux call site that forgets to route through `primary_printer` fails deterministically (`-1` is the sentinel the existing checks already treat as "unset").
- [v2] Close HDF5 type handle leak in `HDF5DataSet::write_buffer`. The synchronised-write path called `H5Dget_type` for an `H5Tequal` sanity check but never closed the returned id, leaking one type handle per flush per dataset. Mirrors the close that already exists in the sibling `write_RA_buffer`. Slow-burn leak, mechanical fix.
- [v1] Clarify why buffer / dataset destructors are intentionally empty. Replaces misleading `TODO: close on destruct` comments in `DataSetInterfaceBase` and `VertexBufferNumeric1D_HDF5` with comments documenting the real reason: these objects are populated via assign-from-temporary, so a closing destructor would invalidate the surviving copy's `hid_t`. Cleanup is correctly driven from `HDF5Printer::flush()` / `finalise()`. Pure documentation change to prevent a future "fix" from reintroducing handle aliasing.
- [v1] Replace bare `exit(1)` in printer constructor with `printer_error().raise()`. The "file exists, group missing, not in resume mode" branch in `HDF5Printer::common_constructor` called `exit(1)` with no message — bypassing MPI shutdown (worker ranks then hung at the downstream `Barrier`) and giving the user no diagnostic. Replaced with the same `printer_error` pattern its three sibling branches already use. (Does not address the broader rank-0-error-deadlocks-workers class of issues, which need `BarrierWithTimeout`.)
- [v2] Delete unused `MPI_flush_to_rank` / `MPI_recv_all_buffers` / `MPI_request_buffer_data` path. The v2 printer carried a complete second MPI shipping mechanism (seven functions, eight dedicated MPI tag constants, plus `recv_counter` / `send_counter` debug scaffolding) with no external callers and a missing `Recv` for the `h5v2_BEGIN` `Send` — abandoned mid-wiring. The currently-used `gather_and_print` path is correct and tested. Net -262 lines.
- [v1] Close `H5Dget_type` ids leaked in combine_tools. Eight call sites in `hdf5_combine_tools.cpp` (4 in the primary loop, 2 in the auxiliary branch, 2 in the metadata loop) obtained datatype ids via `H5Dget_type` for `setup_hdf5_points` and never released them. Fixed by adding `H5Tclose` after each datatype's last use.
- [v1] Avoid undefined iterator decrement past `begin()` in combine size-trim loops. Two near-identical `for (auto it = valids.end()-1; size > 0; --it)` loops in `hdf5_stuff` incurred UB on empty input (`end()-1` is undefined) and again on the final post-step decrement when all entries are invalid (`--it` to `begin()-1`). Replaced with the equivalent `while (size > 0 && !valids[size-1]) --size;`, which carries no iterator arithmetic.
- [v2] `HDF5Printer2::flush()` now also flushes registered aux buffermasters. Previously `flush()` invoked only `buffermaster.flush()`, leaving every aux printer's metadata / RA / sync stream still in RAM until either an auto-flush or `finalise()` ran. Now flushes the primary's own `buffermaster` plus each entry in `aux_buffers` (sync streams first, then RA, matching `finalise()`'s ordering). No MPI gather added — each rank still writes its own data under the existing cross-rank `Utils::FileLock`, consistent with how the auto-flush path works.
- [v1] Guard `sync_pos == 0` underflow in `synchronise_buffers` and the `get_buffer` fast-forward. Both sites computed `get_sync_pos() - 1` on an unsigned member that's zero-initialised, underflowing to `ULONG_MAX`. The bug was dormant in practice (call ordering ensured the underflow value was never delivered to a buffer with non-zero state), but fragile. Adds explicit `if (sync_pos > 0)` guards at both sites; `sync_pos == 0` is now a no-op, which matches the semantics ("no previous slot to catch up to").

Tests: ran the `exit(1)` deadlock scenario for the printer-ctor commit and confirmed it now terminates with a `FATAL ERROR` and `MPI_Finalize`.