Add partial-width tile shapes to x.all_gather tests #435
mawad-amd merged 2 commits into muhaawad/hints from
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
@copilot can you investigate this bug a bit more? Is this a Triton bug we should report, or is our code incorrect?
This is our code's bug, not a Triton bug. Here's the full root-cause analysis: How
Why The semantic of For the failing shape Why Within any row of a row-major tensor,

In summary: Triton's hint mechanism is a low-level developer contract, and the compiler trusts you. Passing a hint that asserts cross-row contiguity when the tile doesn't span the full tensor width is a correctness bug in the caller. No Triton fix is needed; the correct call pattern is
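To make the cross-row contiguity point concrete, here is a minimal pure-Python sketch. It is not the project's code and does not call Triton: `tile_offsets` and `contiguous_run` are hypothetical helpers, and the shape tuple is assumed to be ordered `(M, N, BLOCK_M, BLOCK_N)` with a row-major layout where `stride_m == N`.

```python
def tile_offsets(stride_m, block_m, block_n, row0=0, col0=0):
    """Linear element offsets of a (block_m, block_n) tile in a row-major tensor."""
    return [
        (row0 + i) * stride_m + (col0 + j)
        for i in range(block_m)
        for j in range(block_n)
    ]


def contiguous_run(offsets):
    """Length of the leading run of consecutive offsets."""
    run = 1
    for prev, cur in zip(offsets, offsets[1:]):
        if cur != prev + 1:
            break
        run += 1
    return run


# Assumed layout of the failing shape: (M, N, BLOCK_M, BLOCK_N).
M, N, BLOCK_M, BLOCK_N = 128, 64, 64, 32

# Partial-width tile: BLOCK_N < N, so each row of the tile is followed by a gap.
partial_run = contiguous_run(tile_offsets(N, BLOCK_M, BLOCK_N))

# Full-width tile: the tile spans every column, so rows abut in memory.
full_run = contiguous_run(tile_offsets(N, BLOCK_M, N))

print(partial_run, full_run)
```

Under these assumptions the partial-width tile is contiguous only within a single row (`BLOCK_N` elements), while the full-width tile is contiguous across all of its rows. That is why a hint asserting contiguity beyond one row is only safe when the tile spans the full tensor width.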
@copilot so are we losing on some potential vectorization here or no?
No vectorization is lost. Here's why: Vector width is hardware-capped at 128 bits (from
The old hint was never giving wider stores; it was just choosing the wrong axis for vectorization, which is a correctness issue on partial-width tiles. On full-width tiles it happened to work because
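The arithmetic behind "no vectorization is lost" can be sketched as follows. This is a simplified model, not Triton's actual vectorization logic: it assumes the 128-bit cap quoted above, assumes the store width is limited only by that cap and by the number of contiguous elements in a row, and ignores alignment and divisibility. `elems_per_store` is a hypothetical helper.

```python
VECTOR_BITS = 128  # hardware cap on vector width, as quoted in the comment above


def elems_per_store(dtype_bits, contiguous_elems):
    """Elements per vector store under the simplified model described above."""
    cap = VECTOR_BITS // dtype_bits  # most elements that fit in one 128-bit store
    return min(cap, contiguous_elems)


# Row-only contiguity of BLOCK_N elements, for the BLOCK_N values used in the tests.
for block_n in (16, 32):
    for dtype_bits in (8, 16, 32):
        width = elems_per_store(dtype_bits, block_n)
        print(f"BLOCK_N={block_n:2d} dtype={dtype_bits:2d}-bit -> {width} elems/store")
```

In this model a row of `BLOCK_N >= 16` elements already saturates the 128-bit cap for 8-, 16-, and 32-bit dtypes, so asserting contiguity beyond one row could not have produced wider stores.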
PR #434 fixed `x.all_gather` to use `hint=(1, tile.block_n)` instead of `hint=(tile.block_m, tile.block_n)`, preventing incorrect cross-row contiguity assertions when `BLOCK_N < N` (`stride_m > BLOCK_N`), which caused scalar `buffer_store_short` writes to wrong addresses for 16-bit dtypes.

Test coverage additions (`tests/x/test_all_gather.py`)

- `(128, 128, 64, 32)`: BLOCK_N < N/world_size; exercises the multi-block partial-width path (2 tiles per rank in the N direction)
- `(256, 128, 64, 16)`: Minimum BLOCK_N=16; directly stresses the 16-bit vectorization path that emitted incorrect scalar stores under the old hint

Both shapes added to `test_all_gather` (all dtypes × both gather dims) and to `test_all_gather_ctx_api`. The original failing shape `(128, 64, 64, 32)` was already present; the new shapes extend coverage to multi-block and narrow-block configurations.