Turn on collective HDF5 metadata #620

Merged

Conversation

roblatham00
Contributor

Description
Since version 1.10.0 (March 2016), HDF5 has had the ability to request "collective metadata". When requested, HDF5 internally uses collective operations to update the data structures stored in the HDF5 file. If not requested, every single process makes these data-structure updates itself, resulting in lots of small reads and writes, particularly at scale.
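
For context, here is a minimal sketch (not part of this PR) of what requesting collective metadata looks like at the plain HDF5 C API level, assuming an MPI-enabled HDF5 1.10.0 or newer; the HighFive change presumably boils down to calls like these on the file access property list:

    #include <hdf5.h>
    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        // File access property list that uses MPI-IO.
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

        // Request collective metadata reads and writes (available since 1.10.0).
        H5Pset_all_coll_metadata_ops(fapl, 1);
        H5Pset_coll_metadata_write(fapl, 1);

        hid_t file = H5Fcreate("collective_md.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
        H5Fclose(file);
        H5Pclose(fapl);

        MPI_Finalize();
        return 0;
    }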

This does not fix #112, but users who are interested in collective I/O optimizations and are OK with the restrictions collective I/O imposes (every process has to create datasets, every process has to call H5Dwrite, etc.) are likely to be interested in this optimization as well.

A "darshan" log of a four process parallel_hdf5_write_dataset example from HighFive tells me there were 0 MPI-IO collective writes, 11 MPI-IO independent writes, and 11 POSIX writes before this change, and

How to test this?

The Darshan (https://www.mcs.anl.gov/research/projects/darshan/) I/O characterization tool collects I/O counters (rather than full traces) from applications. I used it to observe the effect of this change on the parallel_hdf5_write_dataset example. I'm also using the Spack package management tool, but that's not relevant to this pull request.

    $ mpicxx -I${HOME}/work/soft/highfive/include -I$(spack location -i hdf5)/include ../src/examples/parallel_hdf5_write_dataset.cpp -o parallel_hdf5_write_dataset -L$(spack location -i hdf5)/lib -lhdf5
    $ LD_PRELOAD=$(spack location -i darshan-runtime)/lib/libdarshan.so mpiexec -np 4 ./parallel_hdf5_write_dataset
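
To read the counters back, darshan-parser can dump the resulting log. The log path below is a placeholder (Darshan writes logs to a site-configured directory), and the grep simply picks out the MPI-IO and POSIX write counters mentioned above:

    $ darshan-parser /path/to/darshan-logs/<run>.darshan \
        | grep -E "MPIIO_COLL_WRITES|MPIIO_INDEP_WRITES|POSIX_WRITES"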

Test System

  • OS: Ubuntu 22.04
  • Compiler: g++-11.2
  • Dependency versions: hdf5-1.10 or newer

@pramodk pramodk requested a review from 1uc October 26, 2022 19:16
@1uc
Collaborator

1uc commented Oct 27, 2022

Thank you for this useful contribution. There are a few mechanical changes I'd like to make, i.e. split the two settings (to allow users to pick whichever they need) and use the "private with friend" construction to avoid adding this to any plist other than file access plists. (I'll take care of this in the coming few days.)
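
For illustration only, a hypothetical sketch of the "private with friend" construction (the class and method names here are made up and are not HighFive's actual internals): the property's apply() is private and only the file access property list is befriended, so the setting cannot be attached to any other kind of property list.

    #include <hdf5.h>

    class FileAccessProps;  // the only plist type allowed to apply this property

    class MPIOCollectiveMetadataWrite {
      public:
        explicit MPIOCollectiveMetadataWrite(bool collective = true)
            : collective_(collective) {}

      private:
        friend class FileAccessProps;

        // Private: only the befriended FileAccessProps can call this, so the
        // setting can never end up on, e.g., a dataset creation plist.
        void apply(hid_t fapl) const {
            H5Pset_coll_metadata_write(fapl, collective_);
        }

        bool collective_;
    };

    class FileAccessProps {
      public:
        FileAccessProps() : fapl_(H5Pcreate(H5P_FILE_ACCESS)) {}
        ~FileAccessProps() { H5Pclose(fapl_); }

        template <class Property>
        void add(const Property& property) { property.apply(fapl_); }

        hid_t getId() const { return fapl_; }

      private:
        hid_t fapl_;
    };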

Finally, I'm undecided about adding this to MPIOFileDriver. The concern is that it would change the requirements on which ranks need to participate in some of our function calls, which implies it might break user code. For writing, to my understanding, all ranks in the communicator always need to participate, since this is how HDF5 keeps the file structure consistent across ranks. Hence collective metadata writes are (likely) an optimization without a behavioural change.

Let's look at reading metadata. The list of collective operations is here:
https://support.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html

For reading, the list of affected HDF5 function calls is:
https://docs.hdfgroup.org/hdf5/v1_12/group___g_a_c_p_l.html

Therefore we have functions that would become collective. The interesting ones are:

  • H5Aread
  • H5Dclose
  • H5Gclose

The two H5?close functions are interesting because it's acceptable to open datasets or groups independently, as long as one doesn't modify the dataset/group. Therefore, one should be allowed to close them independently as well. (I need this feature myself.)
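
To make that concern concrete, here is a hedged sketch (not code from this PR) of the pattern that would break: rank 0 alone inspects an already-written dataset. With independent metadata operations this is fine; with collective metadata reads enabled, calls such as H5Aread and H5Dclose become collective, and rank 0 hangs because the other ranks never reach them.

    #include <hdf5.h>
    #include <mpi.h>

    void inspect_on_rank0(hid_t file) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank != 0)
            return;  // the other ranks skip these metadata operations entirely

        hid_t dset = H5Dopen2(file, "/dset", H5P_DEFAULT);
        hid_t attr = H5Aopen(dset, "version", H5P_DEFAULT);

        int version = 0;
        H5Aread(attr, H5T_NATIVE_INT, &version);  // collective if collective metadata reads are on

        H5Aclose(attr);
        H5Dclose(dset);  // likewise in the collective list
    }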

Therefore, I'd suggest either adding nothing, or adding an MPIOCollectiveFileDriver which enables the collective settings.

Comment on lines 17 to +18
add(MPIOFileAccess(comm, info));
add(MPIOCollectiveMD());
Member

As an alternative to MPIOFileDriver (which I personally would like to see deprecated), one can use:

    FileDriver adam;
    adam.add(MPIOFileAccess(MPI_COMM_WORLD, MPI_INFO_NULL));
    adam.add(MPIOCollectiveMD());
    File file(filename.str(), File::ReadWrite | File::Create | File::Truncate, adam);

Contributor Author

That makes a lot more sense to me, especially now that I have a better feel for "the HighFive way" after taking a crack at the collective data work.

@1uc 1uc changed the base branch from master to collective-metadata October 27, 2022 13:20
Collaborator

@1uc 1uc left a comment

We'll merge this into a feature branch and perform some mechanical changes there. Thanks again for the contribution.

@1uc 1uc merged commit e1052f8 into BlueBrain:collective-metadata Oct 27, 2022
@1uc 1uc mentioned this pull request Oct 28, 2022
@1uc
Collaborator

1uc commented Oct 28, 2022

Work continued here: #624

1uc pushed a commit that referenced this pull request Oct 31, 2022
1uc added a commit that referenced this pull request Nov 2, 2022
This PR enables users to configure collective metadata reads and writes (at file level). The heavy lifting was done in #620; here we simply restructure the proposed solution.

In more detail, this PR adds:

   * the ability to request collective MPI-IO metadata reads
   * the ability to request collective MPI-IO metadata writes
   * an explanation of the caveats of collective metadata reads
   * collective MPI-IO metadata writes enabled in `MPIOFileDriver`

Co-authored-by: Rob Latham <rlatham@gmail.com>
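
A hedged usage sketch of the split settings described above, in the style of the earlier FileDriver example; the MPIOCollectiveMetadataRead / MPIOCollectiveMetadataWrite names are illustrative and may differ from the final API:

    FileDriver fapl;
    fapl.add(MPIOFileAccess(MPI_COMM_WORLD, MPI_INFO_NULL));
    fapl.add(MPIOCollectiveMetadataRead());   // illustrative name: collective metadata reads
    fapl.add(MPIOCollectiveMetadataWrite());  // illustrative name: collective metadata writes
    File file("dump.h5", File::ReadWrite | File::Create | File::Truncate, fapl);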