[Misc] Improve debug tooling and fix semaphore codegen #112

Merged
yaoyaoding merged 3 commits into main from blackwell-matmul-splitk
Apr 9, 2026

@yaoyaoding
Member

  • Add `disable_ptxas_opt` debug option (replaces `launch_blocking`) to compile with `ptxas -O0` for easier PTX/SASS debugging
  • Include `disable_ptxas_opt` and `target` in the program cache key so different debug/target configs don't share cached builds
  • Improve SASS dump to use `nvdisasm -g` for source-annotated disassembly
  • Add env var support for the `TILUS_CACHE_DIR` and `TILUS_DUMP_IR` options
  • Fix `LockSemaphoreEmitter` to generate correct spin-wait code when inside nested thread groups (where warp-level `sync_reduce` is unavailable)
  • Add `current_thread_group_depth` property to `BaseInstEmitter`
  • Use `cluster_sync` instead of `sync` before tmem dealloc in `matmul_v8`
  • Remove unused `matmul_v9` example
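
To illustrate why the cache key change matters, here is a minimal sketch, not the actual Tilus implementation. `BuildConfig`, `cache_key`, `cache_dir`, the target string format, and the default cache path are all hypothetical names and values chosen for illustration. Only `disable_ptxas_opt`, `target`, and `TILUS_CACHE_DIR` come from the list above.

```python
import hashlib
import json
import os
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BuildConfig:
    """Hypothetical stand-in for the options that affect a compiled build."""

    source: str                      # program text (or IR) to compile
    target: str                      # e.g. "sm_100a"; the format is an assumption
    disable_ptxas_opt: bool = False  # True means the build used `ptxas -O0`


def cache_key(config: BuildConfig) -> str:
    """Hash every field that can change the binary so configs never collide."""
    payload = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_dir(default: str = "~/.cache/tilus") -> str:
    """Prefer TILUS_CACHE_DIR when set; the default path is an assumption."""
    return os.path.expanduser(os.environ.get("TILUS_CACHE_DIR", default))
```

If the key left out `disable_ptxas_opt` or `target`, a debug build compiled at `-O0` could be served from the cache for a later optimized run (or a build for one target reused for another), which is the collision the second bullet avoids.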

@copy-pr-bot

copy-pr-bot Bot commented Apr 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyaoding yaoyaoding force-pushed the blackwell-matmul-splitk branch 2 times, most recently from 495b2b1 to 68acf2c on April 8, 2026 17:15
yaoyaoding and others added 3 commits April 8, 2026 18:39
- Add `disable_ptxas_opt` debug option (replaces `launch_blocking`) to compile
  with ptxas -O0 for easier PTX/SASS debugging
- Include `disable_ptxas_opt` and `target` in program cache key so different
  debug/target configs don't share cached builds
- Improve SASS dump to use nvdisasm -g for source-annotated disassembly
- Add env var support for TILUS_CACHE_DIR and TILUS_DUMP_IR options
- Fix LockSemaphoreEmitter to generate correct spin-wait code when inside
  nested thread groups (where warp-level sync_reduce is unavailable)
- Add `current_thread_group_depth` property to BaseInstEmitter
- Use cluster_sync instead of sync before tmem dealloc in matmul_v8
- Remove unused matmul_v9 example

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
@yaoyaoding yaoyaoding force-pushed the blackwell-matmul-splitk branch from 68acf2c to 4f022e7 on April 8, 2026 18:39
@yaoyaoding yaoyaoding merged commit 8682816 into main Apr 9, 2026
8 checks passed
@yaoyaoding yaoyaoding deleted the blackwell-matmul-splitk branch April 12, 2026 23:16