Turn on collective HDF5 metadata #620

Merged

Conversation

roblatham00
Contributor

Description
Since version 1.10.0 (March 2016), HDF5 has had the ability to request "collective metadata". When requested, HDF5 internally uses collective operations to update the data structures stored in the HDF5 file. If not requested, every single process makes these data-structure updates itself, resulting in lots of small reads and writes, particularly at scale.
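
For context, here is a minimal sketch (not part of this PR) of what requesting collective metadata looks like at the plain HDF5 C API level, assuming an MPI-enabled HDF5 1.10.0 or newer; the HighFive change presumably boils down to calls like these on the file access property list:

    #include <hdf5.h>
    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        // File access property list that uses MPI-IO.
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

        // Request collective metadata reads and writes (available since 1.10.0).
        H5Pset_all_coll_metadata_ops(fapl, 1);
        H5Pset_coll_metadata_write(fapl, 1);

        hid_t file = H5Fcreate("collective_md.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
        H5Fclose(file);
        H5Pclose(fapl);

        MPI_Finalize();
        return 0;
    }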

This does not fix #112, but users who are interested in collective I/O optimizations and are OK with the restrictions collective I/O imposes (every process has to create datasets, every process has to call H5Dwrite, etc.) are likely to be interested in this optimization as well.

A "darshan" log of a four process parallel_hdf5_write_dataset example from HighFive tells me there were 0 MPI-IO collective writes, 11 MPI-IO independent writes, and 11 POSIX writes before this change, and

How to test this?

The Darshan (https://www.mcs.anl.gov/research/projects/darshan/) I/O characterization tool collects I/O counters (rather than full traces) from applications. I used it to observe the effect of this change on the parallel_hdf5_write_dataset example. I'm also using the Spack package management tool, but that's not relevant to this pull request.

    $ mpicxx -I${HOME}/work/soft/highfive/include -I$(spack location -i hdf5)/include ../src/examples/parallel_hdf5_write_dataset.cpp -o parallel_hdf5_write_dataset -L$(spack location -i hdf5)/lib -lhdf5
    $ LD_PRELOAD=$(spack location -i darshan-runtime)/lib/libdarshan.so mpiexec -np 4 ./parallel_hdf5_write_dataset
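
To read the counters back, darshan-parser can dump the resulting log. The log path below is a placeholder (Darshan writes logs to a site-configured directory), and the grep simply picks out the MPI-IO and POSIX write counters mentioned above:

    $ darshan-parser /path/to/darshan-logs/<run>.darshan \
        | grep -E "MPIIO_COLL_WRITES|MPIIO_INDEP_WRITES|POSIX_WRITES"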

Test System

  • OS: Ubuntu 22.04
  • Compiler: g++-11.2
  • Dependency versions: hdf5-1.10 or newer

@pramodk pramodk requested a review from 1uc October 26, 2022 19:16
@1uc
Collaborator

1uc commented Oct 27, 2022

Thank you for this useful contribution. There are a few mechanical changes I'd like to make, i.e. split the two settings (to allow users to pick whichever they need) and use the "private with friend" construction to avoid adding this to any plist other than file access plists. (I'll take care of this in the coming few days.)
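
For illustration only, a hypothetical sketch of the "private with friend" construction (the class and method names here are made up and are not HighFive's actual internals): the property's apply() is private and only the file access property list is befriended, so the setting cannot be attached to any other kind of property list.

    #include <hdf5.h>

    class FileAccessProps;  // the only plist type allowed to apply this property

    class MPIOCollectiveMetadataWrite {
      public:
        explicit MPIOCollectiveMetadataWrite(bool collective = true)
            : collective_(collective) {}

      private:
        friend class FileAccessProps;

        // Private: only the befriended FileAccessProps can call this, so the
        // setting can never end up on, e.g., a dataset creation plist.
        void apply(hid_t fapl) const {
            H5Pset_coll_metadata_write(fapl, collective_);
        }

        bool collective_;
    };

    class FileAccessProps {
      public:
        FileAccessProps() : fapl_(H5Pcreate(H5P_FILE_ACCESS)) {}
        ~FileAccessProps() { H5Pclose(fapl_); }

        template <class Property>
        void add(const Property& property) { property.apply(fapl_); }

        hid_t getId() const { return fapl_; }

      private:
        hid_t fapl_;
    };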

Finally, I'm undecided about adding this to MPIOFileDriver. The concern is that it would change the requirements on which ranks need to participate in some of our function calls, which implies it might break user code. For writing, to my understanding, all ranks in the communicator always need to participate, since this is how HDF5 keeps the file structure consistent across ranks. Hence collective metadata writes are (likely) an optimization without a behavioural change.

Let's look at reading metadata. The list of collective operations is here:
https://support.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html

For reading, the list of affected HDF5 function calls is:
https://docs.hdfgroup.org/hdf5/v1_12/group___g_a_c_p_l.html

Therefore we have functions that would become collective. The interesting ones are:

  • H5Aread
  • H5Dclose
  • H5Gclose

The two H5?close functions are interesting because it's acceptable to open datasets or groups independently, as long as one doesn't modify the dataset/group. Therefore, one should be allowed to close them independently as well. (I need this feature myself.)
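
To make that concern concrete, here is a hedged sketch (not code from this PR) of the pattern that would break: rank 0 alone inspects an already-written dataset. With independent metadata operations this is fine; with collective metadata reads enabled, calls such as H5Aread and H5Dclose become collective, and rank 0 hangs because the other ranks never reach them.

    #include <hdf5.h>
    #include <mpi.h>

    void inspect_on_rank0(hid_t file) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank != 0)
            return;  // the other ranks skip these metadata operations entirely

        hid_t dset = H5Dopen2(file, "/dset", H5P_DEFAULT);
        hid_t attr = H5Aopen(dset, "version", H5P_DEFAULT);

        int version = 0;
        H5Aread(attr, H5T_NATIVE_INT, &version);  // collective if collective metadata reads are on

        H5Aclose(attr);
        H5Dclose(dset);  // likewise in the collective list
    }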

Therefore, I'd suggest either adding nothing, or adding an MPIOCollectiveFileDriver which enables the collective settings.

Comment on lines 17 to +18
add(MPIOFileAccess(comm, info));
add(MPIOCollectiveMD());
Member

As an alternative to MPIOFileDriver (which I personally would like to see deprecated), one can use:

    FileDriver adam;
    adam.add(MPIOFileAccess(MPI_COMM_WORLD, MPI_INFO_NULL));
    adam.add(MPIOCollectiveMD());
    File file(filename.str(), File::ReadWrite | File::Create | File::Truncate, adam);

Contributor Author

That makes a lot more sense to me, especially now that I have a better feel for "the HighFive way" after taking a crack at the collective data work.

@1uc 1uc changed the base branch from master to collective-metadata October 27, 2022 13:20
Collaborator

@1uc 1uc left a comment

We'll merge this into a feature branch and perform some mechanical changes there. Thanks again for the contribution.

@1uc 1uc merged commit e1052f8 into BlueBrain:collective-metadata Oct 27, 2022
@1uc 1uc mentioned this pull request Oct 28, 2022
@1uc
Collaborator

1uc commented Oct 28, 2022

Work continued here: #624

1uc pushed a commit that referenced this pull request Oct 31, 2022
1uc added a commit that referenced this pull request Nov 2, 2022
This PR enables users to configure collective metadata reads and writes (at file level). The heavy lifting was done in #620; here we simply restructure the proposed solution.

In more detail, this PR adds:

   * the ability to request collective MPI-IO metadata reads
   * the ability to request collective MPI-IO metadata writes
   * an explanation of the caveats of collective metadata reads
   * collective MPI-IO metadata writes enabled in `MPIOFileDriver`

Co-authored-by: Rob Latham <rlatham@gmail.com>
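
A hedged usage sketch of the split settings described above, in the style of the earlier FileDriver example; the MPIOCollectiveMetadataRead / MPIOCollectiveMetadataWrite names are illustrative and may differ from the final API:

    FileDriver fapl;
    fapl.add(MPIOFileAccess(MPI_COMM_WORLD, MPI_INFO_NULL));
    fapl.add(MPIOCollectiveMetadataRead());   // illustrative name: collective metadata reads
    fapl.add(MPIOCollectiveMetadataWrite());  // illustrative name: collective metadata writes
    File file("dump.h5", File::ReadWrite | File::Create | File::Truncate, fapl);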