Skip to content
This repository has been archived by the owner on Jan 26, 2024. It is now read-only.

Status of atomicAddNoRet #19

Closed
al42and opened this issue Mar 10, 2022 · 11 comments
Closed

Status of atomicAddNoRet #19

al42and opened this issue Mar 10, 2022 · 11 comments

Comments

@al42and
Copy link

al42and commented Mar 10, 2022

Hello!

I would like to inquire about the state of the atomicAddNoRet function. It gives our code (GROMACS) a 2x speed-up in one of the kernels when running on MI100 (gfx908), compared to a plain atomicAdd (which gets compiled into a CAS-loop). So, I would really like to keep using the noret version, since the return value is anyway ignored.

However, atomicAddNoRet is marked as deprecated, and a plain atomicAdd is suggested instead (with no indications of possible performance degradation, by the way!). Could you please advise on what function should be used? I also considered using the __ockl_atomic_add_noret_f32 intrinsic directly, but it's also not documented.

We are using with ROCm 4.5.2 and hipSYCL for our code. However, the problem is easily demonstrated with the plain HIP (ROCm 4.5.2 and 5.0.0 tested):

#include "hip/hip_runtime.h"

__global__ void atomicAddKernel(float *__restrict__ a) {
  // Return value is clearly unused.
  atomicAdd(a, 1);
}

__global__ void atomicAddNoRetKernel(float *__restrict__ a) {
  // 'atomicAddNoRet' is deprecated: use atomicAdd instead
  atomicAddNoRet(a, 1);
}

int main() { return 0; }
$ hipcc --version
HIP version: 4.4.21432-f9dccde4
AMD clang version 13.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-4.5.2 21432 9bbd96fd1936641cd47defd8022edafd063019d5)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-4.5.2/llvm/bin
$ hipcc --offload-arch=gfx908 -O3 test.cpp -save-temps -Wno-deprecated-declarations

Examining the test-hip-amdgcn-amd-amdhsa-gfx908.s file, we see that _Z15atomicAddKernelPf contains a loop of global_atomic_cmpswap, while _Z20atomicAddNoRetKernelPf only has one nice little global_atomic_add_f32 call.

@b-sumner
Copy link
Contributor

@al42and there are restrictions on when global_atomic_add_f32 can be used, so the compiler can't generate it by default. You can either get it, as you indicate, via a call to atomicAddNoRet, or by adding -munsafe-fp-atomics to the compiler options.

@al42and
Copy link
Author

al42and commented Mar 16, 2022

Thanks for the reply, @b-sumner!

So, atomicAddNoRet can be relied upon and is not going away in the near future despite being deprecated?

@b-sumner
Copy link
Contributor

@al42and correct, it is not going away in the near future.

@al42and
Copy link
Author

al42and commented Mar 17, 2022

@b-sumner, Thank you for confirming!

One more question, if allowed by NDA: shall atomicAddNoRet be used on all hardware, or would you recommend only using it for gfx908, which has limited hardware atomics? Does it offer any benefits on gfx90a over plain atomicAdd? For gfx906, both seem to get compiled to CAS loop, so no difference here.

@b-sumner
Copy link
Contributor

@al42and I'd suggest using atomicAddNoRet() only on gfx908. On gfx90a, only, unsafeAtomicAdd() can be used instead (and supports a return value). There is a double overload in addition to the float overload. But I still encourage -munsafe-fp-atomics so you can use the standard atomicAdd().

@al42and
Copy link
Author

al42and commented Mar 17, 2022

But I still encourage -munsafe-fp-atomics so you can use the standard atomicAdd().

The problem with this solution for me is that I'm working on a pretty large codebase. Currently, this option can be enabled just fine because the return value from atomicAdd is not used anywhere.

But introducing an option that alters a major way (different return value) the behavior of a common function globally but only on certain hardware (MI100) is very dangerous long-term. We might introduce a new kernel or add a library that relies on a standard-compliant behavior of atomicAdd, and then have a hard time finding the bug.

@b-sumner
Copy link
Contributor

Note that If the return value is used, then the MI-100 no-return atomic add instruction won't be generated with -munsafe-fp-atomics.

@al42and
Copy link
Author

al42and commented Mar 17, 2022

Note that If the return value is used, then the MI-100 no-return atomic add instruction won't be generated with -munsafe-fp-atomics.

Oh, that's great news!

My questions are answered, but I think it might be good if the topic of FP atomic support was more elaborated in the docs. I saw some scattered mentions that they are not supported on AMD hardware, but any deeper info (e.g., that the "noret" version exists and that -munsafe-fp-atomics can improve things dramatically) is hard to discover.

@b-sumner
Copy link
Contributor

Thanks, I'll pass this along.

@al42and al42and closed this as completed Mar 17, 2022
@pszi1ard
Copy link

there are restrictions on when global_atomic_add_f32 can be used, so the compiler can't generate it by default.

@b-sumner can you please help with clarifying things a bit further, the atomics support and intrinsics are unfortunately frustratingly undocumented by AMD.
Do I understand correctly that atomicAddNoRet() is not just "noret" but it also has non IEEE 754-compliant behavior? If so, it would be probably best if it is called unsafeAtomicAddNoRet() as it differs from atomicAdd not only in that it does not return.

In addition, you suggest that unsafeAtomicAdd() does not work on gfx908. Would it not make more sense to allow unsafeAtomicAdd() on all uarch and let the compiler infer whether the return value is used or not (and if it is not emit __ockl_atomic_add_noret_f32)?

@b-sumner
Copy link
Contributor

@pszi1ard, the documentation issue is known and steps are being taken to improve it.

Regarding IEEE 754 compilance, note that C++20 states that he floating-point environment for atomic arithmetic operations on
floating-point may be different than the calling thread’s floating-point environment. You should already be aware that GPU hardware floating point atomics frequently flush subnormal values to 0 and may have other differences.

But the main issue here is that for the devices that support them, non shared memory atomic floating point add is implemented in the device L2 cache and if the pointed-to memory is not cacheable, the add may have no effect. The compiler has no control over where the pointer is pointing, so it is up to the developer to assert that they accept this behavior either by using atomicAddNoRet (gfx908) or unsafeAtomicAdd (gfx90a) or use -munsafe_fp_atomics.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants