[Instruction] Add .read modifier to cp.async.bulk.wait_group#106

Merged
yaoyaoding merged 2 commits into main from b200-gemm-opt on Apr 2, 2026

Conversation

@yaoyaoding
Member

In the epilogue of shared-to-global TMA stores, the default cp.async.bulk.wait_group inserts an unnecessary L1 cache invalidation. The .read variant only waits for source reads to complete, which is sufficient when reusing shared memory buffers without subsequent global loads of the TMA-written data.

Changes across the full stack:

  • IR: add read field to CopyAsyncTensorWaitGroupInst
  • Hidet primitive: register _read variants emitting .read PTX
  • Emitter: pass read through to primitive
  • Builder/Lang: expose read param on wait_group()
  • Example: use read=True in matmul_v9 epilogue
  • mbarrier: reduce try_wait ticks from 10M to 50K to match nvjet
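The epilogue pattern this enables can be sketched as follows. This is an illustrative device-side fragment, not the PR's actual generated code: the function and variable names (`epilogue_store`, `smem_buf`, `tma_desc`) are hypothetical, and the tensor-map setup is assumed to happen on the host.

```cuda
#include <cuda.h>

// Hypothetical sketch of a shared-to-global TMA store epilogue using the
// .read variant of the wait. Assumes `tma_desc` is a valid CUtensorMap and
// `smem_buf` points to a shared-memory tile.
__device__ void epilogue_store(void* smem_buf, const CUtensorMap* tma_desc,
                               int tile_x, int tile_y) {
    unsigned smem_addr = (unsigned)__cvta_generic_to_shared(smem_buf);

    // Issue the bulk tensor copy from shared to global memory.
    asm volatile(
        "cp.async.bulk.tensor.2d.global.shared::cta.bulk_group "
        "[%0, {%1, %2}], [%3];"
        :: "l"(tma_desc), "r"(tile_x), "r"(tile_y), "r"(smem_addr)
        : "memory");

    // Close the current bulk async group.
    asm volatile("cp.async.bulk.commit_group;" ::: "memory");

    // .read: wait only until the source (shared memory) has been fully
    // read by the copy engine, rather than until the global writes are
    // visible. This skips the L1 invalidation that the plain
    // cp.async.bulk.wait_group performs, which is safe here because the
    // buffer is reused without re-reading the TMA-written global data.
    asm volatile("cp.async.bulk.wait_group.read 0;" ::: "memory");

    // smem_buf may now be overwritten by the next pipeline stage.
}
```

With the plain `cp.async.bulk.wait_group 0;` in the last step, the thread would additionally wait for the global writes to become visible, paying the cache-invalidation cost the PR description calls unnecessary for this reuse pattern.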

@copy-pr-bot

copy-pr-bot Bot commented Apr 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyaoding
Member Author

/ok to test 58c9682

yaoyaoding and others added 2 commits April 2, 2026 20:50
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
@yaoyaoding yaoyaoding merged commit d894fa8 into main Apr 2, 2026
8 checks passed
@yaoyaoding yaoyaoding deleted the b200-gemm-opt branch April 2, 2026 21:27