[Instruction] Add .read modifier to cp.async.bulk.wait_group#106

Merged
yaoyaoding merged 2 commits into main from b200-gemm-opt on Apr 2, 2026

Conversation

@yaoyaoding
Member

In the epilogue of shared-to-global TMA stores, the default cp.async.bulk.wait_group inserts an unnecessary L1 cache invalidation. The .read variant only waits for source reads to complete, which is sufficient when reusing shared memory buffers without subsequent global loads of the TMA-written data.

Changes across the full stack:

  • IR: add read field to CopyAsyncTensorWaitGroupInst
  • Hidet primitive: register _read variants emitting .read PTX
  • Emitter: pass read through to primitive
  • Builder/Lang: expose read param on wait_group()
  • Example: use read=True in matmul_v9 epilogue
  • mbarrier: reduce try_wait ticks from 10M to 50K to match nvjet
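The epilogue pattern this enables can be sketched as follows. This is an illustrative device-side fragment, not the PR's actual generated code: the function and variable names (`epilogue_store`, `smem_buf`, `tma_desc`) are hypothetical, and the tensor-map setup is assumed to happen on the host.

```cuda
#include <cuda.h>

// Hypothetical sketch of a shared-to-global TMA store epilogue using the
// .read variant of the wait. Assumes `tma_desc` is a valid CUtensorMap and
// `smem_buf` points to a shared-memory tile.
__device__ void epilogue_store(void* smem_buf, const CUtensorMap* tma_desc,
                               int tile_x, int tile_y) {
    unsigned smem_addr = (unsigned)__cvta_generic_to_shared(smem_buf);

    // Issue the bulk tensor copy from shared to global memory.
    asm volatile(
        "cp.async.bulk.tensor.2d.global.shared::cta.bulk_group "
        "[%0, {%1, %2}], [%3];"
        :: "l"(tma_desc), "r"(tile_x), "r"(tile_y), "r"(smem_addr)
        : "memory");

    // Close the current bulk async group.
    asm volatile("cp.async.bulk.commit_group;" ::: "memory");

    // .read: wait only until the source (shared memory) has been fully
    // read by the copy engine, rather than until the global writes are
    // visible. This skips the L1 invalidation that the plain
    // cp.async.bulk.wait_group performs, which is safe here because the
    // buffer is reused without re-reading the TMA-written global data.
    asm volatile("cp.async.bulk.wait_group.read 0;" ::: "memory");

    // smem_buf may now be overwritten by the next pipeline stage.
}
```

With the plain `cp.async.bulk.wait_group 0;` in the last step, the thread would additionally wait for the global writes to become visible, paying the cache-invalidation cost the PR description calls unnecessary for this reuse pattern.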

@copy-pr-bot

copy-pr-bot Bot commented Apr 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyaoding
Member Author

/ok to test 58c9682

yaoyaoding and others added 2 commits April 2, 2026 20:50
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
@yaoyaoding yaoyaoding merged commit d894fa8 into main Apr 2, 2026
8 checks passed
@yaoyaoding yaoyaoding deleted the b200-gemm-opt branch April 2, 2026 21:27