Introduce OMPTargetAllocator device-only allocator #4973
Conversation
@@ -302,8 +302,7 @@ class MatrixUpdateOMPTarget
Value* Ainv_ptr = Ainv.data();
Value* temp_ptr = temp.data();
Value* rcopy_ptr = rcopy.data();
PRAGMA_OFFLOAD("omp target data map(always, to: phiV_ptr[:norb]) \
The H2D copy has been moved to the caller, and phiV_ptr is now honored as a const pointer.
OK, better to have this explicit instead of hiding it here in the pragma.
Test this please
(force-pushed from ab97e9b to 9e320fb)
Test this please
void copyToDevice(T* device_ptr, T* host_ptr, size_t n)
{
  if(omp_target_memcpy(device_ptr, host_ptr, n, 0, 0, omp_get_default_device(), omp_get_initial_device()))
Why omp_get_initial_device here? It is the default device everywhere else. Please add a comment.
Initial device basically means the host.
Documentation of that here seems appropriate.
The omp_target_memcpy documentation covers it: https://www.openmp.org/spec-html/5.0/openmpsu166.html, and omp_get_initial_device is documented at https://www.openmp.org/spec-html/5.0/openmpsu150.html
It is a standard use of omp_get_initial_device rather than a QMCPACK-unique hack. I prefer not to document it in our source.
I don't think we should require a deep reading of the OpenMP spec; this function name is needlessly obtuse. If you don't want a comment, just pull out the function calls:
const int device = omp_get_default_device();
const int host = omp_get_initial_device();
if (omp_target_memcpy(device_ptr, host_ptr, n, 0, 0, device, host))
I doubt the generated optimized code will be different.
I like your suggestion better than adding comments.
Some sort of unit test would be good. Other than checking pointerAttributes, the CUDA device allocators are also lacking tests.
Checking that the copies behave properly and preserve the expected semantics might be a good idea. These calls could, for instance, become more perilous if they were moved onto the nonblocking API, which I could see being a likely future change. Most accelerator APIs use the default stream for synchronous copies, which is generally non-performant and forces some degree of global synchronization. This might change at runtime or compile time by forcing default-stream-per-thread.
@@ -533,8 +532,7 @@ class MatrixDelayedUpdateCUDA
Value* rcopy_ptr = rcopy.data();
// This must be Ainv must be tofrom due to NonlocalEcpComponent and possibly
// other modules assumptions about the state of psiMinv.
PRAGMA_OFFLOAD("omp target data map(always, to: phiV_ptr[:norb]) \
                map(always, tofrom: Ainv_ptr[:Ainv.size()]) \
PRAGMA_OFFLOAD("omp target data map(always, tofrom: Ainv_ptr[:Ainv.size()]) \
Making a reference to psiMinv_ doesn't accomplish anything other than making this code harder to read. I assume this was done at some point to minimize diffs, but it's making this code less clear now. It's quite important that Ainv is actually state, i.e., part of *this.
I removed the reference and made the comment less confusing.
After a few previously closed PRs exploring how to support CUDA-only cases, I decided not to explore any CUDA-only, no-offload handling of memory. Thus vendor device allocators will be deleted. Regarding async transfers, #4976 enables that in a limited scope, namely only when using dual-space allocators. It seems enough for us.
I'm very much not in favor of that. I wasn't able to follow your no-offload handling PRs, but I was under the impression we were finally going to handle memory clearly instead of further relying on implicit omp_target semantics. 😞
Your impression was not wrong. I also thought we could keep OMP and CUDA separate. After playing with the code over the past few weeks, I asked myself what the benefit of doing so is and how much effort it would need, and I realized that the benefit is negative. When OpenMP offload manages all the memory allocation/deallocation, the host/device pair relationship is encoded both explicitly inside our source code and in the OpenMP runtime. This makes extension to a single memory space, like APUs, very easy, and CUDA async transfers can still be explicitly called. However, if we use CUDA to handle host/device allocation, handling APUs requires source code changes, and the OpenMP side still doesn't recognize the pair well when there are two memory spaces. CUDA/SYCL can still be used for writing kernels and handling async streams, more in an accelerated-library fashion.
We should continue the discussion of device allocators outside of this PR. I would have thought the APU or unified-memory case would be best handled by a dual allocator that just no-ops the transfers and is attached to a single memory space.
(force-pushed from 9e320fb to 0aa0870)
Test this please
Proposed changes
Introduce OMPTargetAllocator, having OpenMP offload consistently manage all the device memory.
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
epyc-server
Checklist