Skip to content

[Codegen] Reduce warp-divergence with predicated instruction emitting#114

Merged
yaoyaoding merged 1 commit intomainfrom
optimize-sync
Apr 9, 2026
Merged

[Codegen] Reduce warp-divergence with predicated instruction emitting#114
yaoyaoding merged 1 commit intomainfrom
optimize-sync

Conversation

@yaoyaoding
Copy link
Copy Markdown
Member

@yaoyaoding yaoyaoding commented Apr 9, 2026

Several PTX instructions (tcgen05.mma, tcgen05.commit, tcgen05.cp, TMA copy, clc.try_cancel) are warp-cooperative at the SASS level — all 32 threads participate and hardware issues a single operation. However, the previous codegen wrapped them in if (elect_sync()) { asm(...); } which caused ptxas to emit BSSY/BSYNC divergence pairs around every call.

This change introduces a pre-computed is_leader_lane predicate (via elect.sync at kernel start) and passes it directly into the PTX inline asm as @__pred <instruction>. This lets ptxas emit predicated instructions without divergent branches.

Before (per warp-cooperative instruction):

    ELECT P0, ...
    BSSY.RECONVERGENT B0, skip
    @!P0 BRA skip
    UTMASTG.2D / TCGEN05.MMA / ...
    BSYNC.RECONVERGENT B0

After:

    @!P0 UTMASTG.2D / TCGEN05.MMA / ...

In the matmul_v8 GEMM kernel, this reduces BSSY/BSYNC count from ~50 to 6, with the remaining pairs only in the one-time prologue (barrier init, arrive_and_expect_tx) and the epilogue inter-warp dispatch.

…minate BSSY/BSYNC

Several PTX instructions (tcgen05.mma, tcgen05.commit, tcgen05.cp, TMA
copy, clc.try_cancel) are warp-cooperative at the SASS level — all 32
threads participate and hardware issues a single operation. However, the
previous codegen wrapped them in `if (elect_sync()) { asm(...); }` which
caused ptxas to emit BSSY/BSYNC divergence pairs around every call.

This change introduces a pre-computed `is_leader_lane` predicate (via
elect.sync at kernel start) and passes it directly into the PTX inline
asm as `@__pred <instruction>`. This lets ptxas emit predicated
instructions without divergent branches.

Before (per warp-cooperative instruction):
    ELECT P0, ...
    BSSY.RECONVERGENT B0, skip
    @!P0 BRA skip
    UTMASTG.2D / TCGEN05.MMA / ...
    BSYNC.RECONVERGENT B0

After:
    @!P0 UTMASTG.2D / TCGEN05.MMA / ...

In the matmul_v8 GEMM kernel, this reduces BSSY/BSYNC count from ~50
to 6, with the remaining pairs only in the one-time prologue (barrier
init, arrive_and_expect_tx) and the epilogue inter-warp dispatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
@yaoyaoding yaoyaoding merged commit 3017377 into main Apr 9, 2026
9 checks passed
@yaoyaoding yaoyaoding deleted the optimize-sync branch April 9, 2026 23:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant