[Codegen] Reduce warp-divergence with predicated instruction emitting #114
Merged
yaoyaoding merged 1 commit into main on Apr 9, 2026
…minate BSSY/BSYNC
Several PTX instructions (tcgen05.mma, tcgen05.commit, tcgen05.cp, TMA
copy, clc.try_cancel) are warp-cooperative at the SASS level: all 32
threads participate and the hardware issues a single operation. However, the
previous codegen wrapped them in `if (elect_sync()) { asm(...); }`, which
caused ptxas to emit a BSSY/BSYNC divergence pair around every call.
This change introduces a pre-computed `is_leader_lane` predicate (via
elect.sync at kernel start) and passes it directly into the PTX inline
asm as `@__pred <instruction>`. This lets ptxas emit predicated
instructions without divergent branches.
Before (per warp-cooperative instruction):
ELECT P0, ...
BSSY.RECONVERGENT B0, skip
@!P0 BRA skip
UTMASTG.2D / TCGEN05.MMA / ...
BSYNC.RECONVERGENT B0
After:
@P0 UTMASTG.2D / TCGEN05.MMA / ...
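
To make the source-level difference concrete, here is a minimal Python sketch of a code generator that emits both forms as CUDA inline-asm text. The helper names, the `mbar_addr` operand, and the example instruction string are hypothetical and are not taken from this PR. The sketch also assumes the predicate is materialized inside the asm block with `setp.ne.b32` from a 32-bit register, since CUDA inline asm has no operand constraint for predicate registers; the actual codegen may do this differently.

```python
def emit_branch_form(instruction, operands=()):
    """Old pattern: a C-level branch around the asm statement."""
    inputs = ", ".join(operands)
    return (
        "if (elect_sync()) {\n"
        f'    asm volatile("{instruction};" :: {inputs} : "memory");\n'
        "}"
    )


def emit_predicated_form(instruction, operands=(), pred_var="is_leader_lane"):
    """New pattern: the guard becomes a PTX predicate on the instruction."""
    # The predicate is appended as the last input operand, so the
    # instruction's own operands keep their %0..%N-1 numbering.
    pred_index = len(operands)
    inputs = ", ".join([*operands, f'"r"((unsigned){pred_var})'])
    lines = [
        "asm volatile(",
        '    "{\\n"',
        '    "  .reg .pred __pred;\\n"',
        f'    "  setp.ne.b32 __pred, %{pred_index}, 0;\\n"',
        f'    "  @__pred {instruction};\\n"',
        '    "}\\n"',
        f'    :: {inputs} : "memory");',
    ]
    return "\n".join(lines)


# Illustrative instruction string and operand (hypothetical names).
instr = "tcgen05.commit.cta_group::1.mbarrier::arrive::one.shared::cluster.b64 [%0]"
ops = ['"r"(mbar_addr)']

print(emit_branch_form(instr, ops))
print(emit_predicated_form(instr, ops))
```

The predicated form contains no C-level `if`, so the compiler sees straight-line code and the guard survives only as a predicate on the one instruction.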
In the matmul_v8 GEMM kernel, this reduces BSSY/BSYNC count from ~50
to 6, with the remaining pairs only in the one-time prologue (barrier
init, arrive_and_expect_tx) and the epilogue inter-warp dispatch.
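
The commit message does not say how the BSSY/BSYNC counts were obtained. One plausible way to check such a number, sketched below, is to dump the kernel's SASS to text (for example with `cuobjdump -sass`) and count the opcodes. The function name and the sample text are illustrative assumptions.

```python
import re
from collections import Counter

# Matches the two convergence-barrier opcodes, with or without a
# suffix such as ".RECONVERGENT".
_CONVERGENCE_OP = re.compile(r"\b(BSSY|BSYNC)\b")


def count_convergence_ops(sass_text):
    """Count lines containing BSSY or BSYNC in a SASS listing."""
    counts = Counter()
    for line in sass_text.splitlines():
        match = _CONVERGENCE_OP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The "Before" listing from the description, used as sample input.
before_sass = """\
ELECT P0, ...
BSSY.RECONVERGENT B0, skip
@!P0 BRA skip
UTMASTG.2D ...
BSYNC.RECONVERGENT B0
"""

counts = count_convergence_ops(before_sass)
print(counts["BSSY"], counts["BSYNC"])
```

Running the same function over a listing with no branch around the instruction yields zero for both opcodes.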
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>