Unexpected result of __hip_atomic_fetch_xor with malloc managed #2481
Comments
@hdelan I'm not sure why you're using a builtin rather than the standard atomicXor, or how you chose that scope. My theory is that with your setup, the atomic is trying to cross the PCIe bus, and PCIe does not support atomic XOR. Your test might pass with atomic ADD if your system supports PCIe atomics (not all do).
Hi @b-sumner thanks for the quick response. The behaviour is the same if I use the standard atomic ops:
So the PCIe bus must support atomicXor in order to succeed? Could you possibly link me to some docs where I can read more about this? Thanks
For some extra context, hdelan is following up after we added a workaround for this issue in the DPC++ runtime. My investigation might be useful context: intel/llvm#7252 (comment). As for the builtins: in hdelan's original example they're simply expanded directly from the header for clarity.
This may help: https://rocm.docs.amd.com/en/latest/understand/More-about-how-ROCm-uses-PCIe-Atomics.html . There is also a great deal of discussion about AMD's use of PCIe atomics in other issues. Unfortunately, many platform vendors do not include atomic support on some or all of their PCIe pathways.
Thanks very much @b-sumner. Just out of curiosity, why does HIP not choose to emulate atomicXor with a CAS loop, given that support for native atomic XOR over PCIe is patchy across platform vendors?
Hi @hdelan. It's being discussed. There is a concern about pessimizing existing code that may be running on systems with more capable interconnects. Also, this may not work anyway given my previous comments about PCIe. And we've observed that atomic AND, OR, and XOR are rarely used for communication between agents but frequently within an agent.
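For readers unfamiliar with the CAS-loop emulation asked about above, here is a minimal host-side C++ sketch of the idea using std::atomic. The helper name fetch_xor_via_cas is hypothetical and not part of HIP or ROCm; it only illustrates the technique. As noted above, it would not help on a PCIe path that lacks atomic support altogether, since the loop still depends on a working compare-and-swap.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper (not a HIP API): emulates an atomic fetch-xor using
// only compare-and-swap, for targets where a native atomic XOR is unavailable.
inline std::uint32_t fetch_xor_via_cas(std::atomic<std::uint32_t>& target,
                                       std::uint32_t mask) {
    // Start from the currently observed value.
    std::uint32_t observed = target.load(std::memory_order_relaxed);
    // Try to publish observed ^ mask. If another thread changed the value in
    // the meantime, compare_exchange_weak fails, refreshes `observed` with the
    // current value, and the loop retries with the recomputed result.
    while (!target.compare_exchange_weak(observed, observed ^ mask,
                                         std::memory_order_relaxed,
                                         std::memory_order_relaxed)) {
    }
    // Like a native fetch-xor, return the value seen just before the update.
    return observed;
}
```

The retry loop is also where the pessimization concern comes from: under contention each failed exchange costs another round trip, whereas a native atomic XOR completes in a single operation.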
In CUDA, managed memory allocated with cudaMallocManaged can be accessed on host and device without needing to prefetch. Since the HIP API is meant to match the CUDA API closely, I expect the same behaviour for hipMallocManaged.
However, this simple example fails when __hip_atomic_fetch_xor is used with memory allocated with hipMallocManaged.
Output:
The example will work if __hip_atomic_fetch_add is used instead of __hip_atomic_fetch_xor:
The sample will also work with a prefetch and xor:
Is a prefetch needed in order to use atomic ops on the GPU? If so, then why does it work without the prefetch for __hip_atomic_fetch_add and not for __hip_atomic_fetch_xor? Or is there a problem with this single builtin (__hip_atomic_fetch_xor)? Thanks in advance.
Ping @ldrumm
OS: Ubuntu 22.04
GPU: W6800 gfx1030
ROCm: Tested on rocm/5.4.3 and rocm/5.6.1.