[Codegen] Reduce warp-divergence with predicated instruction emitting by yaoyaoding · Pull Request #114 · NVIDIA/tilus

yaoyaoding · 2026-04-09T23:21:16Z

Several PTX instructions (tcgen05.mma, tcgen05.commit, tcgen05.cp, TMA copy, clc.try_cancel) are warp-cooperative at the SASS level — all 32 threads participate and hardware issues a single operation. However, the previous codegen wrapped them in if (elect_sync()) { asm(...); } which caused ptxas to emit BSSY/BSYNC divergence pairs around every call.

This change introduces a pre-computed is_leader_lane predicate (via elect.sync at kernel start) and passes it directly into the PTX inline asm as @__pred <instruction>. This lets ptxas emit predicated instructions without divergent branches.

Before (per warp-cooperative instruction):

    ELECT P0, ...
    BSSY.RECONVERGENT B0, skip
    @!P0 BRA skip
    UTMASTG.2D / TCGEN05.MMA / ...
    BSYNC.RECONVERGENT B0

After:

    @!P0 UTMASTG.2D / TCGEN05.MMA / ...

In the matmul_v8 GEMM kernel, this reduces BSSY/BSYNC count from ~50 to 6, with the remaining pairs only in the one-time prologue (barrier init, arrive_and_expect_tx) and the epilogue inter-warp dispatch.

…minate BSSY/BSYNC Several PTX instructions (tcgen05.mma, tcgen05.commit, tcgen05.cp, TMA copy, clc.try_cancel) are warp-cooperative at the SASS level — all 32 threads participate and hardware issues a single operation. However, the previous codegen wrapped them in `if (elect_sync()) { asm(...); }` which caused ptxas to emit BSSY/BSYNC divergence pairs around every call. This change introduces a pre-computed `is_leader_lane` predicate (via elect.sync at kernel start) and passes it directly into the PTX inline asm as `@__pred <instruction>`. This lets ptxas emit predicated instructions without divergent branches. Before (per warp-cooperative instruction): ELECT P0, ... BSSY.RECONVERGENT B0, skip @!P0 BRA skip UTMASTG.2D / TCGEN05.MMA / ... BSYNC.RECONVERGENT B0 After: @!P0 UTMASTG.2D / TCGEN05.MMA / ... In the matmul_v8 GEMM kernel, this reduces BSSY/BSYNC count from ~50 to 6, with the remaining pairs only in the one-time prologue (barrier init, arrive_and_expect_tx) and the epilogue inter-warp dispatch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>

yaoyaoding merged commit 3017377 into main Apr 9, 2026
9 checks passed

yaoyaoding deleted the optimize-sync branch April 9, 2026 23:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Codegen] Reduce warp-divergence with predicated instruction emitting#114

[Codegen] Reduce warp-divergence with predicated instruction emitting#114
yaoyaoding merged 1 commit intomainfrom
optimize-sync

yaoyaoding commented Apr 9, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yaoyaoding commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yaoyaoding commented Apr 9, 2026 •

edited

Loading