Use device arena for the_fa_arena when activating GPU-aware MPI #3362

Merged: 1 commit, Jun 13, 2023

Conversation

@mukul1992 (Contributor) commented on Jun 12, 2023

Summary

This change, suggested by @WeiqunZhang, points `the_fa_arena` to `The_Device_Arena` when GPU-aware MPI is activated. It removes the need to set `the_arena_is_managed=0` to take advantage of GPU-aware MPI, which does not work well with managed memory.
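For context, a minimal sketch of the selection logic this implies is shown below; it is illustrative only (the helper name `select_fa_arena_sketch` and the fallback branch are assumptions, not the actual AMReX implementation):

```cpp
#include <AMReX_Arena.H>
#include <AMReX_ParallelDescriptor.H>

// Illustrative sketch: with GPU-aware MPI, communication buffers can live in
// plain device memory and be handed to MPI directly, so the FabArray arena
// can point to The_Device_Arena() instead of relying on managed memory.
amrex::Arena* select_fa_arena_sketch ()
{
    if (amrex::ParallelDescriptor::UseGpuAwareMpi()) {
        return amrex::The_Device_Arena(); // unmanaged device memory for MPI buffers
    }
    // Fallback shown here is only illustrative; it is not meant to reproduce
    // AMReX's actual choice when GPU-aware MPI is disabled.
    return amrex::The_Pinned_Arena();
}
```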

Additional background

This has been a long-pending change; the immediate trigger was finding that GPU-aware MPI can reduce communication times significantly but currently requires setting `the_arena_is_managed=0`. Leaving that unset while using GPU-aware MPI currently results in degraded performance.
Past discussion on GPU-aware MPI: #2967

Preliminary performance test

Running 100 steps on 8 GPUs across 2 Perlmutter A100 nodes with `Tests/GPU/CNS/Exec/Sod`, using `amr.n_cell` = 128^3 per GPU, `amr.max_grid_size = 128`, `amrex.use_profiler_syncs = 1`, and optimal GPU affinities.
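For reference, the runtime settings above correspond roughly to the following inputs fragment (illustrative only; the grid layout and GPU affinity settings are not reproduced here):

```
amr.max_grid_size        = 128
amrex.use_profiler_syncs = 1
# toggled between the two runs below
amrex.use_gpu_aware_mpi  = 1
```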

Without amrex.use_gpu_aware_mpi=1 (TinyProfiler columns: number of calls, min/avg/max time in seconds, max %):

FabArray::ParallelCopy_nowait()                200      0.133     0.1779     0.2067  17.82%
FabArray::ParallelCopy_finish()                200    0.07822     0.1193     0.1786  15.40%

With amrex.use_gpu_aware_mpi=1:

FabArray::ParallelCopy_nowait()                200    0.05655    0.07633     0.1034  11.20%
FabArray::ParallelCopy_finish()                200    0.03969    0.06087    0.09024   9.77%

@WeiqunZhang enabled auto-merge (squash) on June 13, 2023 at 02:58
@WeiqunZhang merged commit 96b811d into AMReX-Codes:development on Jun 13, 2023
64 checks passed
@WeiqunZhang added a commit that referenced this pull request on Jun 28, 2023:
## Summary
Implement a communications arena for communication buffers to replace
`the_fa_arena`. A separate arena is created when GPU-aware MPI is used
and `the_arena` is not managed.

## Additional background
The motivation for this is a communication performance degradation
observed for GPU-aware MPI with `amrex.the_arena_is_managed=0`.
@WeiqunZhang has a hypothesis that this may be due to frequent
re-registering of comm buffer pointers when they come from the same
device arena as the other compute data. A separate arena for
communication buffers should therefore alleviate the issue.

This PR eliminates `the_fa_arena`; the communication buffers use
`the_comms_arena` directly, which simplifies the code.
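A minimal sketch of what that looks like on the allocation side (illustrative helper names, not the code from this commit):

```cpp
#include <cstddef>
#include <AMReX_Arena.H>

// Illustrative only: take MPI send/recv buffers from the dedicated comms
// arena, so MPI always sees pointers from one unmanaged device arena and
// does not have to keep re-registering memory shared with compute data.
void* alloc_comm_buffer_sketch (std::size_t nbytes)
{
    return amrex::The_Comms_Arena()->alloc(nbytes);
}

void free_comm_buffer_sketch (void* p)
{
    amrex::The_Comms_Arena()->free(p);
}
```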

## Performance tests

The performance degradation described above is observed particularly with
the `GPU/CNS/Exec/Sod` code under `Tests` and is alleviated by using a
separate comms arena, as seen in the performance data below. `original`
refers to the state before the change in #3362 that pointed
`the_fa_arena` to the device arena, which allowed
`amrex.the_arena_is_managed=1` with GPU-aware MPI without a significant
performance hit. It is compared with the current development branch and
the proposed comms arena implementation. The data showing the
performance improvement from this PR is highlighted.

![Screenshot 2023-06-27 at 4 11 54 PM](https://github.com/AMReX-Codes/amrex/assets/18251677/ae16b822-0178-4679-a90f-255cad6c5451)

In other tests, such as the `ABecLaplacian` linear solve or the ERF code,
using `amrex.the_arena_is_managed=0` did not show a significant
performance hit, and this comms arena implementation did not harm
performance either. More comprehensive tests would be required to
determine the effect on other codes and platforms.

---------

Co-authored-by: Mukul Dave <mhdave@lbl.gov>
Co-authored-by: Weiqun Zhang <WeiqunZhang@lbl.gov>