[WS1] Batch-invariant elementwise / RoPE pass-through audit #149
Open
Labels
component: kernels (Tasks involving the development of CUDA and Triton underlying operators)
feature
platform: cuda (Specific optimizations or bugs in NVIDIA graphics cards, such as FlashInfer, TMA optimizations)
priority: high (Severe congestion issues require the highest priority for resolution.)
sprint-0615
Part of WS1 — Full Batch-Invariant Forward Chain (epic: #)
Why
Pointwise ops (activations, residual adds, scaling) and RoPE are usually assumed "safe" because they perform no cross-element reduction, but that assumption needs to be verified, not trusted. A stray fused reduction, a dtype downcast at the wrong point, or a position-dependent RoPE precision issue can quietly reintroduce drift between the batch=1 and batch=N residual streams.
Scope
Audit every elementwise op and RoPE on the forward path and confirm they are genuine pass-throughs with respect to batch configuration.
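One possible shape for the audit check, as a framework-free sketch rather than the project's actual harness: run an op on the full batch, run it again on each row alone, and require bit-identical results. Everything named here (`rope`, `scale_by_batch_max`, `is_batch_invariant`, the interleaved-pair RoPE layout, the sample data) is hypothetical and only for illustration; a real audit would call the actual CUDA/Triton kernels on device tensors in the production dtype.

```python
import math
import struct


def bits(x):
    """Raw IEEE-754 float64 bit pattern of x, so comparison is bit-exact."""
    return struct.pack("<d", x)


def rope(batch, positions, base=10000.0):
    """Hypothetical reference RoPE: rotate adjacent pairs of each row.

    batch: list of rows (lists of floats, even length).
    positions: one integer token position per row.
    """
    out = []
    for row, pos in zip(batch, positions):
        d = len(row)
        rotated = []
        for i in range(0, d, 2):
            # Pair k = i // 2 uses frequency base ** (-2k / d) == base ** (-i / d).
            theta = pos * base ** (-i / d)
            c, s = math.cos(theta), math.sin(theta)
            x0, x1 = row[i], row[i + 1]
            rotated += [x0 * c - x1 * s, x0 * s + x1 * c]
        out.append(rotated)
    return out


def scale_by_batch_max(batch, positions):
    """Deliberately broken op: hides a reduction over the whole batch."""
    m = max(abs(x) for row in batch for x in row)
    return [[x / m for x in row] for row in batch]


def is_batch_invariant(op, batch, positions):
    """True if each row of op(batch) is bit-identical to op run on that row alone."""
    full = op(batch, positions)
    for i, row in enumerate(batch):
        alone = op([row], [positions[i]])[0]
        if [bits(a) for a in alone] != [bits(b) for b in full[i]]:
            return False
    return True


# Illustrative inputs only; positions include a large index to exercise
# the position-dependent angles.
batch = [
    [0.1, -0.2, 0.3, 0.4],
    [1.5, 2.5, -3.5, 4.5],
    [0.01, 0.02, 0.03, 0.04],
]
positions = [0, 7, 4095]

print("rope invariant:", is_batch_invariant(rope, batch, positions))
print("scale_by_batch_max invariant:",
      is_batch_invariant(scale_by_batch_max, batch, positions))
```

Comparing bit patterns instead of using a tolerance is intentional in this sketch: a tolerance would hide exactly the small drift the audit is looking for. The broken op stands in for the "stray fused reduction" case, where a row's output depends on what else is in the batch.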
Out of scope
Acceptance criteria
Notes
Planned PRs