Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce OMPTargetAllocator device-only allocator #4973

Merged
merged 4 commits into from
May 8, 2024

Conversation

ye-luo
Copy link
Contributor

@ye-luo ye-luo commented May 7, 2024

Proposed changes

Introduce OMPTargetAllocator. Having OpenMP offload consistently manages all the device memory.

What type(s) of changes does this code introduce?

  • New feature

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

epyc-server

Checklist

  • Yes. This PR is up to date with current the current state of 'develop'

@@ -302,8 +302,7 @@ class MatrixUpdateOMPTarget
Value* Ainv_ptr = Ainv.data();
Value* temp_ptr = temp.data();
Value* rcopy_ptr = rcopy.data();
PRAGMA_OFFLOAD("omp target data map(always, to: phiV_ptr[:norb]) \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The H2D copy has been moved to the caller and honor phiV_ptr as a const ptr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok better to have this explicit instead of hiding here in the pragma.

@ye-luo
Copy link
Contributor Author

ye-luo commented May 7, 2024

Test this please

@ye-luo
Copy link
Contributor Author

ye-luo commented May 7, 2024

Test this please


void copyToDevice(T* device_ptr, T* host_ptr, size_t n)
{
if(omp_target_memcpy(device_ptr, host_ptr, n, 0, 0, omp_get_default_device(), omp_get_initial_device()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why omp_get_initial_device here? It is default device everywhere else. Please add a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial device basically means host

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documentation of that here seems appropriate.

Copy link
Contributor Author

@ye-luo ye-luo May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a standard use of omp_get_initial_device rather a QMCPACK unique hack. I prefer not to document in our source.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be requiring a deep reading of the omp spec. this function name is needlessly obtuse. If you don't want a comment just pull out the function calls

auto& device = omp_get_default_device();
auto& host = omp_get_init_device();
if(omp_target_memcpy(device_ptr, host_ptr, n, 0, 0, device, host))

I doubt the generated optimized code will be different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your suggestion better than adding comments.

Copy link
Contributor

@PDoakORNL PDoakORNL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some sort of unit test would be good. Other than checking pointerAttributes the CUDA device allocators are also lacking.

The copies behaving properly and preserving the expected semantics might be a good idea. These call could for instance become more perilous if they moved to being built on the nonblocking API. Which I could see being a likely future change. Most accelerator API's use the default stream for the synchronous copies which are generally non performant and force some degree of global synchronization. This might be delta runtime or compile time forcing default stream per thread.

@@ -533,8 +532,7 @@ class MatrixDelayedUpdateCUDA
Value* rcopy_ptr = rcopy.data();
// This must be Ainv must be tofrom due to NonlocalEcpComponent and possibly
// other modules assumptions about the state of psiMinv.
PRAGMA_OFFLOAD("omp target data map(always, to: phiV_ptr[:norb]) \
map(always, tofrom: Ainv_ptr[:Ainv.size()]) \
PRAGMA_OFFLOAD("omp target data map(always, tofrom: Ainv_ptr[:Ainv.size()]) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

making a reference to psiMinv_ doesn't accomplish anything other than making this code harder to read. I assume this was done at some point to minimize diffs but its making this code less clear now. Its quite important that Ainv is actually state i.e. of *this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the reference and made the comment less confusing.

@@ -302,8 +302,7 @@ class MatrixUpdateOMPTarget
Value* Ainv_ptr = Ainv.data();
Value* temp_ptr = temp.data();
Value* rcopy_ptr = rcopy.data();
PRAGMA_OFFLOAD("omp target data map(always, to: phiV_ptr[:norb]) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok better to have this explicit instead of hiding here in the pragma.

@ye-luo
Copy link
Contributor Author

ye-luo commented May 8, 2024

Some sort of unit test would be good. Other than checking pointerAttributes the CUDA device allocators are also lacking.

After a few previously close PRs exploring how to support CUDA only cases, I decided not to explore any CUDA-only no-offload handling of memory. Thus vendor device allocators will be deleted.

Regarding aync transfer #4976 enables that in a limited scope namely only when using dual-space allocators. It seems enough for us.

@PDoakORNL
Copy link
Contributor

I'm very much not in favor of that. I wasn't able to follow your no-offload handling PR's but I was under the impression we were finally going to clearly handle memory instead of further rely on implicit omp_target semantics. 😞

@ye-luo
Copy link
Contributor Author

ye-luo commented May 8, 2024

I'm very much not in favor of that. I wasn't able to follow your no-offload handling PR's but I was under the impression we were finally going to clearly handle memory instead of further rely on implicit omp_target semantics. 😞

You impression was not wrong. I also thought we could keep OMP and CUDA separate. After playing with the code in the past few weeks. I questioned myself what is benefit of doing so and how much effort is in need. I realized that the benefit is negative. When OpenMP offload manages all the memory allocation/deallocation, the host/device pair relationship is encoded both explicitly inside our source code an in the OpenMP runtime. This makes extension to single memory space like APUs very easily and CUDA async transfers still can be explicitly called. However, if we use CUDA to handle host/device allocation, handling APU requires source code change and the OpenMP side still doesn't recognize the pair well when having two memory spaces.

CUDA/SYCL still can be used for writing kernels and handles async streams. More in a accelerated library fashion.

@PDoakORNL
Copy link
Contributor

We should continue the discussion of device allocators outside of this PR. I would have thought the APU or unified memory case would be best handled by using a dual allocator that just nop'd the transfers and was attached to just one memory space.

@ye-luo
Copy link
Contributor Author

ye-luo commented May 8, 2024

Test this please

@ye-luo ye-luo merged commit 30c758a into QMCPACK:develop May 8, 2024
40 checks passed
@ye-luo ye-luo deleted the add-omptarget-device2 branch May 8, 2024 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants