Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semi-Transparent Buffers #1016

Open
wants to merge 112 commits into
base: master
Choose a base branch
from

Conversation

davschneller
Copy link
Contributor

@davschneller davschneller commented Jan 22, 2024

Or: No More Mandatory USM.

That's right, we finally switch to pure device buffers; in a first implementation (that is, we still keep many indexing arrays etc...).

Currently, we allocate a buffer twice, once on the host, and once on the device—any initialization is done on the host first; then the buffer is copied over. For output, we transfer the data in the reverse direction. We do so to allow explicit host-device memory, while keeping open the possibility to use CPU and GPU at the same time (especially on systems like Grace Hopper, or the MI300A) for different clusters. At least, if it doesn't turn out that using two instances of SeisSol is better in that case anyways.

Consequently, the performance should improve when using IO on simulations which do not have a shared memory space like Grace Hopper and MI300A. As a bonus, we can now also support AMD GPUs without the xnack functionality (or HSA_XNACK=0). That particularly includes in consumer cards, especially the RDNA-based series (since those didn't support shared memory yet, to my knowledge). But it also allows e.g. a more modern ROCm to be used on LUMI.

Next, we also remove the synchronization as far as possible and refactor the graph-handling code a bit.

Furthermore, we allow the execution of CPU and GPU at the same time—which is currently only correctly functional with unified memory enabled.

@codecov-commenter
Copy link

codecov-commenter commented Feb 29, 2024

Codecov Report

Attention: Patch coverage is 7.06215% with 329 lines in your changes are missing coverage. Please review.

Project coverage is 13.68%. Comparing base (45e14e4) to head (9d81d86).

Files Patch % Lines
src/Initializer/tree/Layer.hpp 0.00% 69 Missing ⚠️
src/Solver/time_stepping/TimeCluster.cpp 0.00% 36 Missing ⚠️
src/DynamicRupture/FrictionLaws/FrictionSolver.cpp 0.00% 29 Missing ⚠️
src/DynamicRupture/Factory.cpp 0.00% 25 Missing ⚠️
src/Initializer/DynamicRupture.h 0.00% 21 Missing ⚠️
src/Initializer/MemoryManager.cpp 0.00% 21 Missing ⚠️
src/Initializer/tree/LTSTree.hpp 23.80% 16 Missing ⚠️
src/Initializer/tree/Lut.cpp 21.05% 15 Missing ⚠️
src/Initializer/CellLocalMatrices.cpp 0.00% 13 Missing ⚠️
src/Initializer/LTS.h 0.00% 13 Missing ⚠️
... and 25 more

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1016      +/-   ##
==========================================
- Coverage   13.80%   13.68%   -0.13%     
==========================================
  Files         268      269       +1     
  Lines       14998    15110     +112     
==========================================
- Hits         2071     2068       -3     
- Misses      12927    13042     +115     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@davschneller davschneller marked this pull request as ready for review May 6, 2024 15:04
Copy link
Contributor

@sebwolf-de sebwolf-de left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not check everything in detail, but the performance on LUMI increases with this PR and the results are correct 👍

.ci/format.sh Outdated Show resolved Hide resolved
@@ -171,8 +172,11 @@ ProxyOutput runProxy(ProxyConfig config) {
initDataStructuresOnDevice(enableDynamicRupture);
#endif // ACL_DEVICE

if (config.verbose)
runtime = new seissol::parallel::runtime::StreamRuntime();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shared_ptr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree—but it's probably better IMO to do it in a general proxy refactor? We have multiple of such raw-pointer new statements at the moment; all of them could be eventually deleted

@@ -262,6 +270,8 @@ ProxyOutput runProxy(ProxyConfig config) {
delete m_dynRupTree;
delete m_allocator;

delete runtime;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shared_ptr?

@davschneller davschneller force-pushed the davschneller/semi-transparent-buffers branch from fe434fc to 2498c15 Compare May 20, 2024 11:28
@davschneller davschneller force-pushed the davschneller/semi-transparent-buffers branch from 2498c15 to 3c7adc6 Compare May 22, 2024 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants