Status of atomicAddNoRet #19

al42and · 2022-03-10T23:10:31Z

Hello!

I would like to inquire about the state of the atomicAddNoRet function. It gives our code (GROMACS) a 2x speed-up in one of the kernels when running on MI100 (gfx908), compared to a plain atomicAdd (which gets compiled into a CAS-loop). So, I would really like to keep using the noret version, since the return value is anyway ignored.

However, atomicAddNoRet is marked as deprecated, and a plain atomicAdd is suggested instead (with no indications of possible performance degradation, by the way!). Could you please advise on what function should be used? I also considered using the __ockl_atomic_add_noret_f32 intrinsic directly, but it's also not documented.

We are using with ROCm 4.5.2 and hipSYCL for our code. However, the problem is easily demonstrated with the plain HIP (ROCm 4.5.2 and 5.0.0 tested):

#include "hip/hip_runtime.h"

__global__ void atomicAddKernel(float *__restrict__ a) {
  // Return value is clearly unused.
  atomicAdd(a, 1);
}

__global__ void atomicAddNoRetKernel(float *__restrict__ a) {
  // 'atomicAddNoRet' is deprecated: use atomicAdd instead
  atomicAddNoRet(a, 1);
}

int main() { return 0; }

$ hipcc --version
HIP version: 4.4.21432-f9dccde4
AMD clang version 13.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-4.5.2 21432 9bbd96fd1936641cd47defd8022edafd063019d5)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-4.5.2/llvm/bin
$ hipcc --offload-arch=gfx908 -O3 test.cpp -save-temps -Wno-deprecated-declarations

Examining the test-hip-amdgcn-amd-amdhsa-gfx908.s file, we see that _Z15atomicAddKernelPf contains a loop of global_atomic_cmpswap, while _Z20atomicAddNoRetKernelPf only has one nice little global_atomic_add_f32 call.

The text was updated successfully, but these errors were encountered:

b-sumner · 2022-03-16T17:43:28Z

@al42and there are restrictions on when global_atomic_add_f32 can be used, so the compiler can't generate it by default. You can either get it, as you indicate, via a call to atomicAddNoRet, or by adding -munsafe-fp-atomics to the compiler options.

al42and · 2022-03-16T18:43:04Z

Thanks for the reply, @b-sumner!

So, atomicAddNoRet can be relied upon and is not going away in the near future despite being deprecated?

b-sumner · 2022-03-16T19:29:41Z

@al42and correct, it is not going away in the near future.

al42and · 2022-03-17T00:05:58Z

@b-sumner, Thank you for confirming!

One more question, if allowed by NDA: shall atomicAddNoRet be used on all hardware, or would you recommend only using it for gfx908, which has limited hardware atomics? Does it offer any benefits on gfx90a over plain atomicAdd? For gfx906, both seem to get compiled to CAS loop, so no difference here.

b-sumner · 2022-03-17T01:27:24Z

@al42and I'd suggest using atomicAddNoRet() only on gfx908. On gfx90a, only, unsafeAtomicAdd() can be used instead (and supports a return value). There is a double overload in addition to the float overload. But I still encourage -munsafe-fp-atomics so you can use the standard atomicAdd().

al42and · 2022-03-17T14:16:53Z

But I still encourage -munsafe-fp-atomics so you can use the standard atomicAdd().

The problem with this solution for me is that I'm working on a pretty large codebase. Currently, this option can be enabled just fine because the return value from atomicAdd is not used anywhere.

But introducing an option that alters a major way (different return value) the behavior of a common function globally but only on certain hardware (MI100) is very dangerous long-term. We might introduce a new kernel or add a library that relies on a standard-compliant behavior of atomicAdd, and then have a hard time finding the bug.

b-sumner · 2022-03-17T15:46:12Z

Note that If the return value is used, then the MI-100 no-return atomic add instruction won't be generated with -munsafe-fp-atomics.

al42and · 2022-03-17T17:23:48Z

Note that If the return value is used, then the MI-100 no-return atomic add instruction won't be generated with -munsafe-fp-atomics.

Oh, that's great news!

My questions are answered, but I think it might be good if the topic of FP atomic support was more elaborated in the docs. I saw some scattered mentions that they are not supported on AMD hardware, but any deeper info (e.g., that the "noret" version exists and that -munsafe-fp-atomics can improve things dramatically) is hard to discover.

b-sumner · 2022-03-17T17:56:03Z

Thanks, I'll pass this along.

pszi1ard · 2022-03-18T10:31:51Z

there are restrictions on when global_atomic_add_f32 can be used, so the compiler can't generate it by default.

@b-sumner can you please help with clarifying things a bit further, the atomics support and intrinsics are unfortunately frustratingly undocumented by AMD.
Do I understand correctly that atomicAddNoRet() is not just "noret" but it also has non IEEE 754-compliant behavior? If so, it would be probably best if it is called unsafeAtomicAddNoRet() as it differs from atomicAdd not only in that it does not return.

In addition, you suggest that unsafeAtomicAdd() does not work on gfx908. Would it not make more sense to allow unsafeAtomicAdd() on all uarch and let the compiler infer whether the return value is used or not (and if it is not emit __ockl_atomic_add_noret_f32)?

b-sumner · 2022-03-18T16:39:15Z

@pszi1ard, the documentation issue is known and steps are being taken to improve it.

Regarding IEEE 754 compilance, note that C++20 states that he floating-point environment for atomic arithmetic operations on
floating-point may be different than the calling thread’s floating-point environment. You should already be aware that GPU hardware floating point atomics frequently flush subnormal values to 0 and may have other differences.

But the main issue here is that for the devices that support them, non shared memory atomic floating point add is implemented in the device L2 cache and if the pointed-to memory is not cacheable, the add may have no effect. The compiler has no control over where the pointer is pointing, so it is up to the developer to assert that they accept this behavior either by using atomicAddNoRet (gfx908) or unsafeAtomicAdd (gfx90a) or use -munsafe_fp_atomics.

al42and closed this as completed Mar 17, 2022

al42and mentioned this issue Apr 11, 2022

Atomic performance issue when using the buffer-accessor model AdaptiveCpp/AdaptiveCpp#729

Closed

Luigi-Crisci mentioned this issue Dec 23, 2023

[AMDGPU] -munsafe-fp-atomics does nothing on gfx908 intel/llvm#12241

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Status of atomicAddNoRet #19

Status of atomicAddNoRet #19

al42and commented Mar 10, 2022

b-sumner commented Mar 16, 2022

al42and commented Mar 16, 2022

b-sumner commented Mar 16, 2022

al42and commented Mar 17, 2022

b-sumner commented Mar 17, 2022

al42and commented Mar 17, 2022

b-sumner commented Mar 17, 2022

al42and commented Mar 17, 2022

b-sumner commented Mar 17, 2022

pszi1ard commented Mar 18, 2022

b-sumner commented Mar 18, 2022

Status of atomicAddNoRet #19

Status of atomicAddNoRet #19

Comments

al42and commented Mar 10, 2022

b-sumner commented Mar 16, 2022

al42and commented Mar 16, 2022

b-sumner commented Mar 16, 2022

al42and commented Mar 17, 2022

b-sumner commented Mar 17, 2022

al42and commented Mar 17, 2022

b-sumner commented Mar 17, 2022

al42and commented Mar 17, 2022

b-sumner commented Mar 17, 2022

pszi1ard commented Mar 18, 2022

b-sumner commented Mar 18, 2022