
Question: Do warp cross-lane functions work in branching code at all? #2474

Closed
kingcrimsontianyu opened this issue Feb 6, 2022 · 9 comments

Comments

@kingcrimsontianyu

The HIP documentation does not mention whether __any, __all, __ballot, __shfl, or __shfl_down/up/xor work in branching code where some of the threads become inactive.

The CUDA API exposes the *_sync versions of these intrinsics (along with __activemask) that ask users to explicitly specify the mask of participating threads. Even prior to the existence of the *_sync API, CUDA's warp cross-lane functions worked correctly in branching code, whereas HIP, at least on the Vega 56 GPU I tested, did not (see this previous bug report: #952).

So my question is as the title indicates: are warp cross-lane functions intended to work within branching code? Or must users ensure that all 64 threads in a wavefront are active when invoking these functions? I appreciate any comment from the devs!

@yxsamliu
Contributor

They should work in a diverging control flow.

@zjin-lcf

The 64 threads in a wavefront are all active even though they execute in one of the two branches. Since the predicate is positive for some threads, the output is all ones. Is my understanding right for your 'any' example?

@yxsamliu
Contributor

The threads in a wavefront could be partially inactive and these functions will still work.

For 'any', the predicate is evaluated on all active lanes and the results are collected as bits; for inactive lanes, the bits are 0. If any bit is 1, the result is 1. Therefore 'any' means there is at least one active lane whose predicate evaluates to true.

@zjin-lcf

The expected result is not '0101...01' for the 'any' example in #952. Is that right?

@yxsamliu
Contributor

I would expect the result to be '0101...01' but currently hip-clang gets '1111...111'.

This is because the LLVM pass SimplifyCFG merged the two calls of __any, which makes the kernel equivalent to:

__global__ void TestKernel(int* c)
{
    int x = hipThreadIdx_x % 2 == 0;
    c[hipThreadIdx_x] = __any(x);
}

Obviously, SimplifyCFG does not consider divergent execution of __any.

If you compile with -O0, the test will pass because SimplifyCFG is disabled.

I am wondering: what is the result with nvcc? Thanks.

@kingcrimsontianyu
Author

> I would expect the result to be '0101...01' but currently hip-clang gets '1111...111'.
>
> This is because the LLVM pass SimplifyCFG merged the two calls of __any, which makes the kernel equivalent to:
>
> __global__ void TestKernel(int* c) { int x = hipThreadIdx_x % 2 == 0; c[hipThreadIdx_x] = __any(x); }
>
> Obviously, SimplifyCFG does not consider divergent execution of __any.
>
> If you compile with -O0, the test will pass because SimplifyCFG is disabled.
>
> I am wondering: what is the result with nvcc? Thanks.

Thanks for sharing your findings. May I know which version of ROCm and what AMD GPU model you are using?

I just tested on an Nvidia Ampere GPU (sm_86); the result is as expected: c[tid] stores 1 for even thread indices and 0 for odd. The source has to be modified a bit to use the __any_sync syntax due to Ampere's independent thread scheduling:

__global__ void TestKernel(int* c)
{
    int isEven = threadIdx.x % 2 == 0;

    if (isEven)
    {
        int x = 1;
        c[threadIdx.x] = __any_sync(0x55555555U, x);
    }
    else
    {
        int y = 0;
        c[threadIdx.x] = __any_sync(0xaaaaaaaaU, y);
    }
}

But at any rate, the Nvidia compiler generates the right code by not merging the two calls of __any, as can be seen in the PTX (with optimization on):

@%p7 bra 	$L__BB0_2; // %p7 stores isEven
bra.uni 	$L__BB0_1; // odd threads go to L__BB0_1

$L__BB0_2: // for even threads
mov.pred 	%p9, -1;
mov.u32 	%r5, 1431655765; // even mask 0x55555555U
vote.sync.any.pred 	%p10, %p9, %r5; // Predicate %p9 is non-zero, result stored in %p10
bra.uni 	$L__BB0_3; // active threads in the branch go to L__BB0_3

$L__BB0_1: // for odd threads
mov.u32 	%r3, -1431655766; // odd mask 0xaaaaaaaaU
vote.sync.any.pred 	%p10, %p5, %r3; // Predicate %p5 is 0, result stored in %p10

$L__BB0_3:
selp.u32 	%r7, 1, 0, %p10;
cvta.to.global.u64 	%rd2, %rd1;
mul.wide.u32 	%rd3, %r1, 4;
add.s64 	%rd4, %rd2, %rd3;
st.global.u32 	[%rd4], %r7;

In fact, the SASS seems to indicate that the Nvidia compiler has completely optimized away the __any call for odd threads:

0000000b 00bb1160	@P0   BRA 0xb00bb11b0 // even threads go to 0xb00bb11b0
0000000b 00bb1170	      UMOV UR4, 0xaaaaaaaa // for odd threads. No vote!!!
0000000b 00bb1180	      BRA.DIV UR4, 0xb00bb1260 
0000000b 00bb1190	      PLOP3.LUT P0, PT, PT, PT, PT, 0x8, 0x0 
0000000b 00bb11a0	      BRA 0xb00bb1200 // odd threads go to the reconvergence point
0000000b 00bb11b0	      UMOV UR4, 0x55555555 // for even threads
0000000b 00bb11c0	      BRA.DIV UR4, 0xb00bb12c0 
0000000b 00bb11d0	      VOTEU.ANY UR4, UPT, PT // vote among even threads
0000000b 00bb11e0	      ULOP3.LUT UP0, URZ, UR4, 0x55555555, URZ, 0xc0, !UPT 
0000000b 00bb11f0	      PLOP3.LUT P0, PT, PT, PT, UP0, 0x80, 0x0 
0000000b 00bb1200	      BSYNC B0 // reconvergence point for all threads
0000000b 00bb1210	      IMAD.MOV.U32 R3, RZ, RZ, 0x4 
0000000b 00bb1220	      SEL R5, RZ, 0x1, !P0 
0000000b 00bb1230	      IMAD.WIDE.U32 R2, R4, R3, c[0x0][0x160] 
0000000b 00bb1240	      STG.E [R2.64], R5 
0000000b 00bb1250	      EXIT 

@yxsamliu
Contributor

I am using ROCm 5.1 and gfx906. The issue also exists with LLVM trunk.

@yxsamliu
Contributor

It seems the issue is with SimplifyCFG no matter whether __any is inlined or not.

If __any is not inlined, the IR before and after SimplifyCFG is:

*** IR Dump After CoroElidePass on _Z8test_anyPi ***
; Function Attrs: convergent mustprogress norecurse nounwind
define protected amdgpu_kernel void @_Z8test_anyPi(i32 addrspace(1)* nocapture noundef writeonly %a.coerce) local_unnamed_addr #3 {
entry:
%0 = tail call i32 @llvm.amdgcn.workitem.id.x() #5, !range !6
%1 = and i32 %0, 1
%cmp = icmp eq i32 %1, 0
br i1 %cmp, label %if.then, label %if.else

if.then: ; preds = %entry
%call2 = tail call fastcc noundef zeroext i1 @_Z5wfanyi(i32 noundef 1) #6
%conv = zext i1 %call2 to i32
%2 = zext i32 %0 to i64
%arrayidx8 = getelementptr inbounds i32, i32 addrspace(1)* %a.coerce, i64 %2
store i32 %conv, i32 addrspace(1)* %arrayidx8, align 4, !tbaa !7
br label %if.end

if.else: ; preds = %entry
%call3 = tail call fastcc noundef zeroext i1 @_Z5wfanyi(i32 noundef 0) #6
%conv4 = zext i1 %call3 to i32
%3 = zext i32 %0 to i64
%arrayidx67 = getelementptr inbounds i32, i32 addrspace(1)* %a.coerce, i64 %3
store i32 %conv4, i32 addrspace(1)* %arrayidx67, align 4, !tbaa !7
br label %if.end

if.end: ; preds = %if.else, %if.then
ret void
}
*** IR Dump After SimplifyCFGPass on _Z8test_anyPi ***
; Function Attrs: convergent mustprogress norecurse nounwind
define protected amdgpu_kernel void @_Z8test_anyPi(i32 addrspace(1)* nocapture noundef writeonly %a.coerce) local_unnamed_addr #3 {
entry:
%0 = tail call i32 @llvm.amdgcn.workitem.id.x() #5, !range !6
%1 = and i32 %0, 1
%cmp = icmp eq i32 %1, 0
%. = select i1 %cmp, i32 1, i32 0
%call3 = tail call fastcc noundef zeroext i1 @_Z5wfanyi(i32 noundef %.) #6
%conv4 = zext i1 %call3 to i32
%2 = zext i32 %0 to i64
%arrayidx67 = getelementptr inbounds i32, i32 addrspace(1)* %a.coerce, i64 %2
store i32 %conv4, i32 addrspace(1)* %arrayidx67, align 4, !tbaa !7
ret void
}

If __any is inlined, the IR before and after SimplifyCFG is:

*** IR Dump After CoroElidePass on _Z8test_anyPi ***
; Function Attrs: convergent mustprogress norecurse nounwind
define protected amdgpu_kernel void @_Z8test_anyPi(i32 addrspace(1)* nocapture noundef writeonly %a.coerce) local_unnamed_addr #2 {
entry:
%0 = tail call i32 @llvm.amdgcn.workitem.id.x() #4, !range !5
%1 = and i32 %0, 1
%cmp = icmp eq i32 %1, 0
br i1 %cmp, label %if.then, label %if.else

if.then: ; preds = %entry
%2 = tail call i32 asm sideeffect "; ockl ballot hoisting hack $0", "=v,0"(i32 1) #5, !srcloc !6
%call1.i = tail call noundef i64 @llvm.amdgcn.icmp.i64.i32(i32 noundef %2, i32 noundef 0, i32 noundef 33) #5
%cmp.i = icmp ne i64 %call1.i, 0
%conv = zext i1 %cmp.i to i32
%3 = zext i32 %0 to i64
%arrayidx8 = getelementptr inbounds i32, i32 addrspace(1)* %a.coerce, i64 %3
store i32 %conv, i32 addrspace(1)* %arrayidx8, align 4, !tbaa !7
br label %if.end

if.else: ; preds = %entry
%4 = tail call i32 asm sideeffect "; ockl ballot hoisting hack $0", "=v,0"(i32 0) #5, !srcloc !6
%call1.i9 = tail call noundef i64 @llvm.amdgcn.icmp.i64.i32(i32 noundef %4, i32 noundef 0, i32 noundef 33) #5
%cmp.i10 = icmp ne i64 %call1.i9, 0
%conv4 = zext i1 %cmp.i10 to i32
%5 = zext i32 %0 to i64
%arrayidx67 = getelementptr inbounds i32, i32 addrspace(1)* %a.coerce, i64 %5
store i32 %conv4, i32 addrspace(1)* %arrayidx67, align 4, !tbaa !7
br label %if.end

if.end: ; preds = %if.else, %if.then
ret void
}
*** IR Dump After SimplifyCFGPass on _Z8test_anyPi ***
; Function Attrs: convergent mustprogress norecurse nounwind
define protected amdgpu_kernel void @_Z8test_anyPi(i32 addrspace(1)* nocapture noundef writeonly %a.coerce) local_unnamed_addr #2 {
entry:
%0 = tail call i32 @llvm.amdgcn.workitem.id.x() #4, !range !5
%1 = and i32 %0, 1
%cmp = icmp eq i32 %1, 0
br i1 %cmp, label %if.then, label %if.else

if.then: ; preds = %entry
%2 = tail call i32 asm sideeffect "; ockl ballot hoisting hack $0", "=v,0"(i32 1) #5, !srcloc !6
br label %if.end

if.else: ; preds = %entry
%3 = tail call i32 asm sideeffect "; ockl ballot hoisting hack $0", "=v,0"(i32 0) #5, !srcloc !6
br label %if.end

if.end: ; preds = %if.else, %if.then
%.sink = phi i32 [ %3, %if.else ], [ %2, %if.then ]
%call1.i9 = tail call noundef i64 @llvm.amdgcn.icmp.i64.i32(i32 noundef %.sink, i32 noundef 0, i32 noundef 33) #5
%cmp.i10 = icmp ne i64 %call1.i9, 0
%conv4 = zext i1 %cmp.i10 to i32
%4 = zext i32 %0 to i64
%arrayidx67 = getelementptr inbounds i32, i32 addrspace(1)* %a.coerce, i64 %4
store i32 %conv4, i32 addrspace(1)* %arrayidx67, align 4, !tbaa !7
ret void
}

In either case, two calls of a function are merged into one. Basically, SimplifyCFG assumes that if (x) f(a) else f(b) is equivalent to f(x ? a : b). This is true for normal programs, but not on GPUs: the if and else branches are executed sequentially, each with partially active lanes, whereas after the merge the function is executed with all lanes active.

@yxsamliu
Contributor

yxsamliu commented Apr 29, 2022
