Parallel Compression improvements #1302

Merged
merged 43 commits into HDFGroup:develop from parallel_filters on Feb 24, 2022

Conversation

jhendersonHDF (Collaborator)

No description provided.

char global_no_coll_cause_string[512];

if (H5D__mpio_get_no_coll_cause_strings(local_no_coll_cause_string, 512,
                                        global_no_coll_cause_string, 512) < 0)
jhendersonHDF (Collaborator, Author):

Moved most of this code into a new function that builds strings describing the reasons collective I/O was broken, so the code can be reused elsewhere.
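
The new helper is internal; for illustration only, here is a minimal sketch of the same kind of cause-to-string translation done with the public H5Pget_mpio_no_collective_cause() API (the helper name below is hypothetical, not the PR's function):

```c
#include <hdf5.h>
#include <stdio.h>

/* Hypothetical illustration: translate the no-collective-cause flags reported
 * by H5Pget_mpio_no_collective_cause() into readable strings. The internal
 * H5D__mpio_get_no_coll_cause_strings() added in this PR serves a similar
 * purpose but has a different interface. */
static void
print_no_coll_cause(uint32_t cause)
{
    if (cause == H5D_MPIO_COLLECTIVE)
        printf("collective I/O was performed\n");
    if (cause & H5D_MPIO_SET_INDEPENDENT)
        printf("independent I/O was requested\n");
    if (cause & H5D_MPIO_DATATYPE_CONVERSION)
        printf("datatype conversion was required\n");
    if (cause & H5D_MPIO_DATA_TRANSFORMS)
        printf("data transforms were required\n");
    if (cause & H5D_MPIO_NOT_SIMPLE_OR_SCALAR_DATASPACES)
        printf("dataspaces were not simple or scalar\n");
    if (cause & H5D_MPIO_NOT_CONTIGUOUS_OR_CHUNKED_DATASET)
        printf("dataset was not contiguous or chunked\n");
}

/* Typical usage after a parallel H5Dread()/H5Dwrite() with transfer plist dxpl_id:
 *
 *     uint32_t local_cause = 0, global_cause = 0;
 *     H5Pget_mpio_no_collective_cause(dxpl_id, &local_cause, &global_cause);
 *     print_no_coll_cause(global_cause);
 */
```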

*-------------------------------------------------------------------------
*/
herr_t
H5D_select_io_mem(void *dst_buf, const H5S_t *dst_space, const void *src_buf, const H5S_t *src_space,
jhendersonHDF (Collaborator, Author):

A new routine that is very similar to H5D__select_io(), but rather than copying between application memory and the file, it copies between two memory buffers according to the selections in the destination and source dataspaces.
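
Conceptually, the operation looks like the plain-C sketch below (illustrative only; the real routine iterates H5S_t dataspace selections with selection iterators rather than flat index arrays, and its full signature is truncated above):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only: copy `nelem` elements of size `elem_size` from
 * src_buf to dst_buf, where the i-th selected element lives at linear index
 * src_sel[i] in the source buffer and dst_sel[i] in the destination buffer.
 * H5D_select_io_mem() does the equivalent directly on dataspace selections. */
static void
select_copy_mem(void *dst_buf, const size_t *dst_sel,
                const void *src_buf, const size_t *src_sel,
                size_t nelem, size_t elem_size)
{
    unsigned char       *dst = (unsigned char *)dst_buf;
    const unsigned char *src = (const unsigned char *)src_buf;

    for (size_t i = 0; i < nelem; i++)
        memcpy(dst + dst_sel[i] * elem_size, src + src_sel[i] * elem_size, elem_size);
}
```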

*---------------------------------------------------------------------------
*/
static const char *
H5FD__mem_t_to_str(H5FD_mem_t mem_type)
jhendersonHDF (Collaborator, Author):

The changes in this file make it possible to see what type of I/O the MPI I/O file driver is performing. Previously only the offset and length of each I/O operation were shown; now the debug output also indicates whether the data belongs to the superblock, raw data, an object header, etc.
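
A minimal sketch of what such a mapping can look like, using the public H5FD_mem_t enumerators (the actual helper in the PR may differ in naming and detail):

```c
#include <hdf5.h>

/* Illustrative sketch: map an H5FD_mem_t value to a short label for debug
 * output. The PR's H5FD__mem_t_to_str() may use different strings. */
static const char *
mem_type_to_str(H5FD_mem_t mem_type)
{
    switch (mem_type) {
        case H5FD_MEM_DEFAULT:
            return "default";
        case H5FD_MEM_SUPER:
            return "superblock";
        case H5FD_MEM_BTREE:
            return "b-tree";
        case H5FD_MEM_DRAW:
            return "raw data";
        case H5FD_MEM_GHEAP:
            return "global heap";
        case H5FD_MEM_LHEAP:
            return "local heap";
        case H5FD_MEM_OHDR:
            return "object header";
        default:
            return "unknown";
    }
}
```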

*-------------------------------------------------------------------------
*/
herr_t
H5_mpio_gatherv_alloc(void *send_buf, int send_count, MPI_Datatype send_type, const int recv_counts[],
jhendersonHDF (Collaborator, Author):

The two new functions here are simply wrappers around MPI_(All)gatherv that hide a bit of boilerplate code. Both allocate the receive buffer for the caller. The only difference between the two is that the "simple" function calculates the recv_counts and displacements arrays for the caller before making the MPI_(All)gatherv call.
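
A rough sketch of what the "simple" variant does, based on the description above (the function name, parameters, and omitted error handling are assumptions, not the PR's code):

```c
#include <mpi.h>
#include <stdlib.h>

/* Illustrative "allgatherv with allocation" helper: gather each rank's
 * send_count, build recv_counts/displacements, allocate the receive buffer,
 * and perform MPI_Allgatherv. */
static int
allgatherv_alloc_simple(void *send_buf, int send_count, MPI_Datatype send_type,
                        MPI_Comm comm, void **out_buf, int *out_count)
{
    int   comm_size, type_size;
    int  *recv_counts, *displs;
    void *recv_buf;

    MPI_Comm_size(comm, &comm_size);
    MPI_Type_size(send_type, &type_size);

    recv_counts = malloc((size_t)comm_size * sizeof(int));
    displs      = malloc((size_t)comm_size * sizeof(int));

    /* Every rank learns how much each rank will contribute */
    MPI_Allgather(&send_count, 1, MPI_INT, recv_counts, 1, MPI_INT, comm);

    /* Displacements are a running sum of the counts */
    displs[0] = 0;
    for (int i = 1; i < comm_size; i++)
        displs[i] = displs[i - 1] + recv_counts[i - 1];

    *out_count = displs[comm_size - 1] + recv_counts[comm_size - 1];
    recv_buf   = malloc((size_t)(*out_count) * (size_t)type_size);

    MPI_Allgatherv(send_buf, send_count, send_type, recv_buf, recv_counts, displs,
                   send_type, comm);

    free(recv_counts);
    free(displs);
    *out_buf = recv_buf;
    return 0;
}
```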

@@ -273,6 +373,185 @@ static int H5D__cmp_filtered_collective_io_info_entry_owner(const void *filtered
/* Local Variables */
/*******************/

#ifdef H5Dmpio_DEBUG
jhendersonHDF (Collaborator, Author):

The code below adds debugging output to H5Dmpio, similar to the debugging already present in the MPI I/O file driver.
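
In spirit, the instrumentation boils down to rank-prefixed debug output guarded by the H5Dmpio_DEBUG symbol, roughly like the sketch below (the macro name and behavior here are illustrative; the PR's actual machinery is more elaborate, with per-rank enable flags and a configurable debug stream):

```c
#include <stdio.h>

/* Illustrative sketch of rank-aware debug output guarded by H5Dmpio_DEBUG */
#ifdef H5Dmpio_DEBUG
#define H5D_MPIO_DEBUG(rank, string)                                                 \
    do {                                                                             \
        fprintf(stderr, "Rank %d: %s\n", (rank), (string));                          \
        fflush(stderr);                                                              \
    } while (0)
#else
#define H5D_MPIO_DEBUG(rank, string) do { } while (0)
#endif
```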

*-------------------------------------------------------------------------
*/
static herr_t
H5D__mpio_array_gatherv(void *local_array, size_t local_array_num_entries, size_t array_entry_size,
jhendersonHDF (Collaborator, Author):

This whole routine was rewritten and moved to H5mpi.c.

if ((mpi_rank = H5F_mpi_get_rank(io_info->dset->oloc.file)) < 0)
HGOTO_ERROR(H5E_IO, H5E_MPI, FAIL, "unable to obtain MPI rank")
if ((mpi_size = H5F_mpi_get_size(io_info->dset->oloc.file)) < 0)
HGOTO_ERROR(H5E_IO, H5E_MPI, FAIL, "unable to obtain MPI size")
jhendersonHDF (Collaborator, Author):

Rather than retrieving the MPI rank and size multiple times in this file, retrieve them once in H5D__chunk_collective_io, which tends to be the main entry point into this file, and then hand those values down to other functions as needed.

src/H5Dmpio.c (review comment resolved; outdated)
*/
if (H5D__mpio_array_gatherv(chunk_list, chunk_list_num_entries,
jhendersonHDF (Collaborator, Author):

Rather than gathering everybody's list of chunks into one collective array, the feature has been revised in most places to construct MPI derived types that send only as much data as needed, greatly reducing the feature's memory usage.

*-------------------------------------------------------------------------
*/
static herr_t
H5D__filtered_collective_chunk_entry_io(H5D_filtered_collective_io_info_t *chunk_entry,
jhendersonHDF (Collaborator, Author):

This routine used to handle either reading an individual chunk (for dataset reads) or reading and writing an individual chunk (for dataset writes). However, any chunk reads done here were independent, which was a scalability problem for the feature. The new H5D__mpio_collective_filtered_chunk_read, H5D__mpio_collective_filtered_chunk_update and H5D__mpio_collective_filtered_chunk_common_io routines now perform the duties of this routine, but in a manner that allows chunk reads to be done collectively. This should generally scale much better, while still giving the user the option of requesting independent chunk reads when desired.
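
For reference, the user-facing knob for choosing between collective and low-level independent I/O is the existing public transfer-property-list API (standard HDF5 usage, not code from this PR):

```c
#include <hdf5.h>

/* Request collective parallel I/O, but ask the library to perform the
 * low-level I/O independently (the "independent chunk reads" option
 * mentioned above). dxpl_id is assumed to be a dataset transfer property
 * list created with H5Pcreate(H5P_DATASET_XFER) in a parallel HDF5 build. */
static herr_t
set_independent_low_level_io(hid_t dxpl_id)
{
    if (H5Pset_dxpl_mpio(dxpl_id, H5FD_MPIO_COLLECTIVE) < 0)
        return -1;
    if (H5Pset_dxpl_mpio_collective_opt(dxpl_id, H5FD_MPIO_INDIVIDUAL_IO) < 0)
        return -1;
    return 0;
}
```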

} /* end H5D__mpio_collective_filtered_chunk_reinsert() */

/*-------------------------------------------------------------------------
* Function: H5D__mpio_get_chunk_redistribute_info_types
jhendersonHDF (Collaborator, Author):

The three functions below create different MPI derived datatypes that extract particular pieces of information from the overall per-chunk H5D_filtered_collective_io_info_t structure. A given operation (shared-chunk redistribution, chunk reallocation, chunk reinsertion) usually needs only a few fields out of that structure, and this information is gathered to all ranks, so sending just the necessary fields can drastically reduce memory usage at the cost of a little MPI overhead.
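
The general technique looks roughly like the sketch below; the struct and field names are invented for illustration, since H5D_filtered_collective_io_info_t has different members:

```c
#include <mpi.h>
#include <stddef.h>

/* Invented example struct standing in for a per-chunk info record */
typedef struct chunk_info_t {
    long long offset; /* file offset of the chunk             */
    long long size;   /* size of the chunk in bytes           */
    int       owner;  /* rank that owns the chunk             */
    void     *buf;    /* local-only field, never sent anywhere */
} chunk_info_t;

/* Build a derived datatype that sends only offset and size, so gathering
 * chunk records to all ranks doesn't also ship the fields nobody needs. */
static int
create_chunk_realloc_type(MPI_Datatype *new_type)
{
    int          lengths[2]  = {1, 1};
    MPI_Aint     displs[2]   = {offsetof(chunk_info_t, offset), offsetof(chunk_info_t, size)};
    MPI_Datatype types[2]    = {MPI_LONG_LONG, MPI_LONG_LONG};
    MPI_Datatype struct_type = MPI_DATATYPE_NULL;

    MPI_Type_create_struct(2, lengths, displs, types, &struct_type);

    /* Resize so the extent matches the full struct, allowing arrays of
     * chunk_info_t to be sent directly with this type. */
    MPI_Type_create_resized(struct_type, 0, (MPI_Aint)sizeof(chunk_info_t), new_type);
    MPI_Type_commit(new_type);
    MPI_Type_free(&struct_type);
    return 0;
}
```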

*-------------------------------------------------------------------------
*/
static herr_t
H5D__mpio_collective_filtered_io_type(H5D_filtered_collective_io_info_t *chunk_list, size_t num_entries,
jhendersonHDF (Collaborator, Author):

This routine was just revised a little bit to create slightly more efficient MPI derived types for performing I/O on filtered chunks.
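
In spirit, the I/O type is an hindexed type over the chunks' byte ranges in the file, so a single collective MPI-IO call can cover all of a rank's chunks (sketch with invented names, not the PR's code):

```c
#include <mpi.h>

/* Sketch: build an MPI datatype describing `num_chunks` byte ranges in the
 * file, given each chunk's file offset and (filtered) size. A type like this
 * can serve as the file type for one collective read or write that touches
 * every chunk the rank is responsible for. */
static int
create_file_io_type(int num_chunks, const int chunk_sizes[], const MPI_Aint file_offsets[],
                    MPI_Datatype *file_type)
{
    /* Block lengths are in MPI_BYTE elements; displacements are byte offsets,
     * which map directly onto chunk sizes and file offsets. */
    MPI_Type_create_hindexed(num_chunks, chunk_sizes, file_offsets, MPI_BYTE, file_type);
    MPI_Type_commit(file_type);
    return 0;
}
```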

/* Participate in the collective re-insertion of all chunks modified
* in this iteration into the chunk index
*/
for (j = 0; j < collective_chunk_list_num_entries; j++) {
jhendersonHDF (Collaborator, Author):

Chunk index reinsertion logic here moved into H5D__mpio_collective_filtered_chunk_reinsert, which more efficiently handles memory usage as well as chunk reinsertion itself.

*/
for (i = 0; i < collective_chunk_list_num_entries; i++) {
jhendersonHDF (Collaborator, Author):

Chunk index reinsertion logic here moved into H5D__mpio_collective_filtered_chunk_reinsert, which more efficiently handles memory usage as well as chunk reinsertion itself.

*/
for (j = 0; j < collective_chunk_list_num_entries; j++) {
jhendersonHDF (Collaborator, Author):

Chunk file space reallocation logic moved into H5D__mpio_collective_filtered_chunk_reallocate, which more efficiently handles memory usage.

HGOTO_ERROR(H5E_DATASET, H5E_CANTGATHER, FAIL, "couldn't gather new chunk sizes")

/* Collectively re-allocate the modified chunks (from each process) in the file */
for (i = 0; i < collective_chunk_list_num_entries; i++) {
jhendersonHDF (Collaborator, Author):

Chunk file space reallocation logic moved into H5D__mpio_collective_filtered_chunk_reallocate, which more efficiently handles memory usage.


if (have_chunk_to_process)
if (H5D__filtered_collective_chunk_entry_io(&chunk_list[i], io_info, type_info, fm) < 0)
jhendersonHDF (Collaborator, Author):

Duties now performed by H5D__mpio_collective_filtered_chunk_update instead.

*/
for (i = 0; i < chunk_list_num_entries; i++)
if (mpi_rank == chunk_list[i].owners.new_owner)
if (H5D__filtered_collective_chunk_entry_io(&chunk_list[i], io_info, type_info, fm) < 0)
jhendersonHDF (Collaborator, Author):

Duties now performed by H5D__mpio_collective_filtered_chunk_update instead.

jhendersonHDF and others added 24 commits February 23, 2022 20:36
Add support for chunk fill values to parallel compression feature
Add partial support for incremental file space allocation to parallel compression feature
Refactor chunk reallocation and reinsertion code to use less MPI communication during linked-chunk I/O
H5D__get_num_chunks can be used to correctly determine space allocation status for filtered and unfiltered chunked datasets
Avoid doing I/O when a rank has no selection and the MPI communicator size is 1 or the I/O has been requested as independent at the low level
Avoid 0-byte collective read of incrementally allocated filtered dataset when dataset hasn't been written to yet
lrknox merged commit 758e97c into HDFGroup:develop on Feb 24, 2022
jhendersonHDF added a commit to jhendersonHDF/hdf5 that referenced this pull request Mar 25, 2022
lrknox pushed a commit that referenced this pull request Mar 28, 2022
* Fix the function cast error in H5Dchunk.c and activate (#1170)

`-Werror=cast-function-type`.  Again.

* Parallel Compression improvements (#1302)

* Fix for parallel compression examples on Windows (#1459)

* Parallel compression adjustments for HDF5 1.12

* Committing clang-format changes

Co-authored-by: David Young <dyoung@hdfgroup.org>
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
jhendersonHDF added a commit to jhendersonHDF/hdf5 that referenced this pull request Apr 14, 2022
* Fix the function cast error in H5Dchunk.c and activate (HDFGroup#1170)

`-Werror=cast-function-type`.  Again.

* Parallel Compression improvements (HDFGroup#1302)

* Fix for parallel compression examples on Windows (HDFGroup#1459)

Co-authored-by: David Young <dyoung@hdfgroup.org>
jhendersonHDF deleted the parallel_filters branch on April 30, 2022