[Codegen] Avoid local memory spill for barrier register tensors #93

Merged: yaoyaoding merged 1 commit into main from const-reg-tensor on Mar 12, 2026
@yaoyaoding

Barrier addresses stored in register arrays (e.g., `uint32_t regs[6]`) were being spilled to local memory by nvcc when indexed with runtime variables (e.g., pipeline stage counters in persistent kernels).

This change introduces a `ConstRegTensorEmitContext` that tracks CTA-invariant register tensors and their element expressions. When `SliceRegisterInst` accesses a tracked tensor, the emitter computes the value via arithmetic (e.g., `barriers + stage * 8`) instead of array indexing, keeping everything in registers.
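
To make the idea concrete, here is a minimal Python sketch of how an emitter might choose between arithmetic and array indexing. It is not the code from this PR: the class body, the `register` and `emit_slice` method names, and the string-based expressions are all assumptions made for illustration. Only the class name and the `barriers + stage * 8` example come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class ConstRegTensorEmitContext:
    """Hypothetical stand-in: maps a tensor name to (base expression, byte stride)."""

    tracked: dict[str, tuple[str, int]] = field(default_factory=dict)

    def register(self, name: str, base_expr: str, stride: int) -> None:
        # Record that element i of `name` equals base_expr + i * stride.
        self.tracked[name] = (base_expr, stride)

    def emit_slice(self, name: str, index_expr: str) -> str:
        """Return the C expression for element `index_expr` of tensor `name`."""
        if name in self.tracked:
            base, stride = self.tracked[name]
            # Tracked tensor: recompute the value, no array access is emitted.
            return f"{base} + {index_expr} * {stride}"
        # Untracked tensor: fall back to ordinary array indexing.
        return f"{name}[{index_expr}]"


ctx = ConstRegTensorEmitContext()
ctx.register("regs", "barriers", 8)

tracked_expr = ctx.emit_slice("regs", "stage")   # tracked barrier tensor
fallback_expr = ctx.emit_slice("acc", "stage")   # hypothetical untracked tensor
print(tracked_expr)
print(fallback_expr)
```

In this sketch the tracked tensor never produces a `regs[stage]` access, so the compiler has no runtime-indexed array to place in local memory.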

Changes:

  • Add ConstRegTensorEmitContext for tracking CTA-invariant tensors
  • Update BarrierAllocContext to return contiguous base address
  • Update AllocBarrierInst emitter to register barrier tensors
  • Update SliceRegisterInst emitter to use arithmetic for tracked tensors
  • Suppress nvcc warning 550 (set but unused variable) for fallback arrays
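
The arithmetic form only works if the barriers of one tensor sit at evenly spaced addresses, which is presumably why the allocation context now returns a contiguous base address. The following Python sketch illustrates that property under stated assumptions: `BarrierAllocator`, `element_address`, and the 8-byte slot size (inferred from the `* 8` in the example expression) are hypothetical and do not mirror the actual `BarrierAllocContext` interface.

```python
BARRIER_BYTES = 8  # assumed size of one barrier slot, inferred from the `* 8` example


class BarrierAllocator:
    """Hypothetical bump allocator over a shared-memory barrier region."""

    def __init__(self, region_base: int = 0) -> None:
        self.next_addr = region_base

    def allocate(self, count: int) -> int:
        """Reserve `count` adjacent barrier slots and return the first address."""
        base = self.next_addr
        self.next_addr += count * BARRIER_BYTES
        return base


def element_address(base: int, index: int) -> int:
    # With a contiguous block, every element address is a function of the base.
    return base + index * BARRIER_BYTES


alloc = BarrierAllocator(region_base=1024)
first = alloc.allocate(6)    # e.g. the six barriers behind `uint32_t regs[6]`
second = alloc.allocate(2)   # a later, separate barrier tensor
addresses = [element_address(first, i) for i in range(6)]
print(first, second, addresses)
```

Because each block is contiguous, a single base value is enough to recover any element, and the per-element array becomes redundant.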

Result: eliminates the 64-byte stack frame in the Blackwell matmul v7/v8 kernels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
@yaoyaoding merged commit 5d62202 into main on Mar 12, 2026
11 of 13 checks passed
@yaoyaoding deleted the const-reg-tensor branch on March 12, 2026 17:55