Introduce OMPTargetAllocator device-only allocator #4973
Conversation
@@ -302,8 +302,7 @@ class MatrixUpdateOMPTarget
Value* Ainv_ptr = Ainv.data();
Value* temp_ptr = temp.data();
Value* rcopy_ptr = rcopy.data();
PRAGMA_OFFLOAD("omp target data map(always, to: phiV_ptr[:norb]) \
The H2D copy has been moved to the caller, and phiV_ptr is now honored as a const pointer.
OK, better to have this explicit instead of hiding it here in the pragma.
Test this please
(force-pushed from ab97e9b to 9e320fb)
Test this please
void copyToDevice(T* device_ptr, T* host_ptr, size_t n)
{
  if(omp_target_memcpy(device_ptr, host_ptr, n, 0, 0, omp_get_default_device(), omp_get_initial_device()))
Why omp_get_initial_device here? It is the default device everywhere else. Please add a comment.
Initial device basically means the host.
Documentation of that here seems appropriate.
The omp_target_memcpy documentation covers it: https://www.openmp.org/spec-html/5.0/openmpsu166.html, and omp_get_initial_device is documented at https://www.openmp.org/spec-html/5.0/openmpsu150.html
It is a standard use of omp_get_initial_device rather than a QMCPACK-unique hack. I prefer not to document it in our source.
I don't think we should require a deep reading of the OpenMP spec; this function name is needlessly obtuse. If you don't want a comment, just pull out the function calls:
const int device = omp_get_default_device();
const int host = omp_get_initial_device();
if (omp_target_memcpy(device_ptr, host_ptr, n, 0, 0, device, host))
I doubt the generated optimized code will be different.
I like your suggestion better than adding comments.
Some sort of unit test would be good. Other than checking pointerAttributes, the CUDA device allocators are also lacking tests.
Checking that the copies behave properly and preserve the expected semantics might be a good idea. These calls could, for instance, become more perilous if they were moved onto the nonblocking API, which I could see being a likely future change. Most accelerator APIs use the default stream for synchronous copies, which is generally non-performant and forces some degree of global synchronization. This might change at runtime or compile time by forcing default-stream-per-thread.
@@ -533,8 +532,7 @@ class MatrixDelayedUpdateCUDA
Value* rcopy_ptr = rcopy.data();
// This must be Ainv must be tofrom due to NonlocalEcpComponent and possibly
// other modules assumptions about the state of psiMinv.
PRAGMA_OFFLOAD("omp target data map(always, to: phiV_ptr[:norb]) \
                map(always, tofrom: Ainv_ptr[:Ainv.size()]) \
PRAGMA_OFFLOAD("omp target data map(always, tofrom: Ainv_ptr[:Ainv.size()]) \
Making a reference to psiMinv_ doesn't accomplish anything other than making this code harder to read. I assume this was done at some point to minimize diffs, but it's making this code less clear now. It's quite important that Ainv is actually state, i.e., part of *this.
I removed the reference and made the comment less confusing.
After a few previously closed PRs exploring how to support CUDA-only cases, I decided not to explore any CUDA-only, no-offload handling of memory. Thus vendor device allocators will be deleted. Regarding async transfers, #4976 enables that in a limited scope, namely only when using dual-space allocators. It seems enough for us.
I'm very much not in favor of that. I wasn't able to follow your no-offload handling PRs, but I was under the impression we were finally going to handle memory clearly instead of further relying on implicit omp_target semantics. 😞
Your impression was not wrong. I also thought we could keep OMP and CUDA separate. After playing with the code over the past few weeks, I asked myself what the benefit of doing so is and how much effort it would need, and I realized that the benefit is negative. When OpenMP offload manages all the memory allocation/deallocation, the host/device pair relationship is encoded both explicitly inside our source code and in the OpenMP runtime. This makes extension to a single memory space, like APUs, very easy, and CUDA async transfers can still be explicitly called. However, if we use CUDA to handle host/device allocation, handling APUs requires source code changes, and the OpenMP side still doesn't recognize the pair well when there are two memory spaces. CUDA/SYCL can still be used for writing kernels and handling async streams, more in an accelerated-library fashion.
We should continue the discussion of device allocators outside of this PR. I would have thought the APU or unified-memory case would be best handled by a dual allocator that just no-ops the transfers and is attached to a single memory space.
(force-pushed from 9e320fb to 0aa0870)
Test this please
Proposed changes
Introduce OMPTargetAllocator, having OpenMP offload consistently manage all the device memory.
What type(s) of changes does this code introduce?
Does this introduce a breaking change?
What systems has this change been tested on?
epyc-server
Checklist