Question: Do warp cross-lane functions work in branching code at all? #2474
Comments
They should work in divergent control flow.
All 64 threads in a wavefront are active, even though they execute in one of the two branches. Since the predicate is positive for some threads, the output is all ones. Is my understanding right for your `__any` example?
The threads in a wavefront can be partially inactive, and these functions will still work. For `__any`, the predicate is evaluated for all active lanes and the results are collected as bits; for inactive lanes, the bits are 0. If any bit is 1, the result is 1. Therefore `__any` returns 1 when there is at least one active lane whose predicate evaluates to true.
So the expected result is not '0101...01' for the `__any` example in #952. Is that right?
I would expect the result to be '0101...01', but currently hip-clang gets '1111...111'. This is because the LLVM pass SimplifyCFG merged the two calls of `__any`, which makes the kernel equivalent to a single `__any` call executed unconditionally by all threads.
Obviously, SimplifyCFG does not consider divergent execution of `__any`. If you compile with `-O0`, the test passes because SimplifyCFG is disabled. I am wondering what the result with nvcc is? Thanks.
Thanks for sharing your findings. May I know which version of ROCm and which AMD GPU model you are using? I just tested on an Nvidia Ampere GPU (sm_86), and the result is as expected: c[tid] stores 1 for even thread indices and 0 for odd. The source has to be modified a bit to use the `*_sync` variants of the intrinsics.
But at any rate, the Nvidia compiler generates the right code by not merging the two calls of `__any`.
In fact, the SASS seems to indicate that the Nvidia compiler has completely optimized away the `__any` calls.
I am using ROCm 5.1 and gfx906. The issue also exists with LLVM trunk.
It seems the issue is with SimplifyCFG no matter whether `__any` is inlined or not. If `__any` is not inlined, the IR before and after SimplifyCFG shows:

```
*** IR Dump After CoroElidePass on _Z8test_anyPi ***
if.then:                                          ; preds = %entry
if.else:                                          ; preds = %entry
if.end:                                           ; preds = %if.else, %if.then
```

If `__any` is inlined, the IR before and after SimplifyCFG shows:

```
*** IR Dump After CoroElidePass on _Z8test_anyPi ***
if.then:                                          ; preds = %entry
if.else:                                          ; preds = %entry
if.end:                                           ; preds = %if.else, %if.then
if.then:                                          ; preds = %entry
if.else:                                          ; preds = %entry
if.end:                                           ; preds = %if.else, %if.then
```

In either case, two calls of a function are merged into one call. Basically, SimplifyCFG assumes that identical calls in the two branches can be merged, which is not valid for cross-lane functions.
I posted an RFC to LLVM for fixing this issue: https://discourse.llvm.org/t/rfc-introduce-cross-lane-function-attribute-to-prevent-merging-call-of-cross-lane-functions/62148
The HIP documentation does not mention whether `__any`, `__all`, `__ballot`, `__shfl`, or `__shfl_down/up/xor` work in branching code where some of the threads become inactive. The CUDA API exposes `*_sync` versions of these intrinsics (along with `__activemask`) that ask users to explicitly specify the mask of participating threads. Even prior to the existence of the `*_sync` API, CUDA's warp cross-lane functions worked correctly in branching code, whereas HIP, at least on a Vega 56 GPU that I tested, did not (see this previous bug report: #952).

So my question is as the title indicates: are warp cross-lane functions intended to work within branching code, or must users ensure that all 64 threads in a wavefront are active when invoking these functions? I appreciate any comment from the devs!