fix(v4-ci): unblock nightly Docker test + V4-Pro accuracy step#703
Merged
Conversation
Two regressions introduced/exposed by the V4-Pro per-PR CI entry (commits c3ec204 onward):

1. `simple_inference.py`: shrink the new arithmetic stress prompt from 3000 terms (~16k tokens) to 1500 terms (~7k tokens). At 16k tokens the seq's KV needs ~1000 blocks, but the `block_tables` buffer's second dim is `max_model_len / block_size`. The Llama-3-8B-Instruct used by the `Test Docker image` step has `max_model_len=8192` → buffer dim = 512, so prefill schedules but decode crashes in `prepare_block_tables` with `ValueError: could not broadcast input array from shape (1005,) into shape (512,)`. 1500 terms → 7005 tokens → 438 blocks, comfortably under 512.

2. `loader.py`: monkey-patch `safetensors.torch._TYPES` to register `F8_E8M0 -> torch.float8_e8m0fnu`. The Python wrapper in `safetensors==0.7.0` is missing this entry even though both torch and safetensors-rust support it. The mmap path (`safe_open` + `get_tensor`) goes through Rust and works fine, but the `ATOM_DISABLE_MMAP=true` path (used by `atom-test.yaml` in CI) calls `safetensors.torch.load(bytes)`, which does the dict lookup and raises `KeyError: 'F8_E8M0'` on V4-Pro shards containing MX scale tensors. The patch is a no-op when safetensors ships the entry natively.
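The capacity arithmetic behind fix 1 can be sketched as follows. This is a minimal sketch: the block size of 16 and `max_model_len=8192` come from the numbers above, `blocks_needed` is an illustrative helper (not the scheduler's actual code), and the 16070-token count is a hypothetical value chosen to reproduce the `(1005,)` shape from the traceback.

```python
BLOCK_SIZE = 16          # KV cache block size (per the 8192 / 16 = 512 above)
MAX_MODEL_LEN = 8192     # Llama-3-8B-Instruct in the Docker test

def blocks_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Ceil-divide: KV cache blocks required for a sequence of num_tokens."""
    return (num_tokens + block_size - 1) // block_size

# Second dim of the block_tables buffer is fixed at model load time:
buffer_dim = MAX_MODEL_LEN // BLOCK_SIZE   # 512

old_prompt = blocks_needed(16070)   # 1005 blocks > 512 -> broadcast ValueError
new_prompt = blocks_needed(7005)    # 438 blocks <= 512 -> fits with margin
```

Prefill scheduling only checks `max_num_batched_tokens`, not the block-table width, which is why the crash surfaces later in decode rather than at admission.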
valarLip added a commit that referenced this pull request on May 7, 2026:
…ill (#710)

Two V4-Pro CI failures surfaced after PR #703 unblocked the loader path:

1. Triton MoE OOM on AMD CDNA4 with triton 3.6+/3.7+
   - `triton_kernels.matmul_ogs_details.opt_flags_amd` has a CDNA4 special case `if cdna4 and block_m == 128: block_n = 512`, giving BLOCK_M*BLOCK_N = 64K FP32 accumulator entries. triton 3.6+ spills the accumulator to LDS more aggressively than 3.5, exceeding the MI355X 160 KiB LDS budget (269 KiB observed).
   - Fix: wrap matmul_ogs calls with a CDNA4-only context manager that pins block_m=64 / block_n=256 (BLOCK_M*BLOCK_N = 16K, fits in registers). Tunable via `ATOM_TRITON_MOE_BLOCK_{M,N}` env vars.
   - Other GPU families and triton ≤3.5 paths are unaffected.

2. `cp_gather_indexer_k_quant_cache` HIP "invalid configuration argument" when `cu_committed_cpu[-1] == 0` (fresh prefill with a prompt shorter than the CSA `ratio`). The kernel grid is computed as `(num_tokens + BLOCK_Y_SIZE - 1) / BLOCK_Y_SIZE`, so `num_tokens=0` makes grid.x=0 and the HIP launcher rejects it.
   - Fix: clamp `cu_committed_cpu[-1]` to ≥1 in the indexer-meta builder. The dummy +1 row is gathered from the last seq's first cache block but never read downstream, because `fp8_mqa_logits` and `top_k_per_row_prefill` honor per-token `cu_starts`/`cu_ends` derived from `cu_committed_gpu[:-1]` and `n_committed_per_seq`, both of which remain 0. Output stays all -1 sentinels, matching the all-empty semantics. Pure host-side scalar arithmetic on a value already host-synced; no CG/torch.compile graph branch added.

Verified locally with triton 3.7.0+amd.rocm7.1.0:
- DeepSeek-V4-Pro server starts (no OOM)
- 1-token "Hi" curl returns successfully (was crashing pre-fix)
- GSM8K-50 fewshot=5 = 0.94 (matches pre-PR-703 baseline)
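The host-side shape of the two fixes above can be sketched as follows. Both helpers are illustrative stand-ins, not the real module API: the actual opt-flags plumbing lives inside `triton_kernels`, so `opt_flags` here is a plain dict, and `clamp_committed` works on a Python list rather than the real host-synced tensor.

```python
import os
from contextlib import contextmanager

@contextmanager
def cdna4_moe_block_override(opt_flags: dict):
    """Fix 1 (sketch): pin matmul_ogs tile sizes on CDNA4 so the FP32
    accumulator (BLOCK_M * BLOCK_N entries) stays small enough for
    registers. `opt_flags` stands in for the real triton_kernels flags."""
    saved = dict(opt_flags)
    opt_flags["block_m"] = int(os.environ.get("ATOM_TRITON_MOE_BLOCK_M", "64"))
    opt_flags["block_n"] = int(os.environ.get("ATOM_TRITON_MOE_BLOCK_N", "256"))
    try:
        yield opt_flags
    finally:
        # Restore the original flags for other GPU families / call sites.
        opt_flags.clear()
        opt_flags.update(saved)

def clamp_committed(cu_committed_cpu: list[int]) -> list[int]:
    """Fix 2 (sketch): keep the gather kernel's grid non-empty.
    grid.x = ceil(num_tokens / BLOCK_Y_SIZE) is 0 when the last cumulative
    count is 0, and the HIP launcher rejects a zero-sized grid."""
    if cu_committed_cpu and cu_committed_cpu[-1] == 0:
        return cu_committed_cpu[:-1] + [1]
    return cu_committed_cpu
```

The clamp is safe precisely because the downstream kernels bound their reads by `cu_committed_gpu[:-1]` and `n_committed_per_seq`, which stay 0, so the dummy row is gathered but never consumed.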
Summary
Two hotfixes for CI regressions exposed after the V4-Pro PR1 merge (#650):
1. Nightly Docker Release — `simple_inference` long prompt overflows `block_tables`

The new arithmetic stress prompt (3000 terms ≈ 16k tokens) added in c3ec204 crashed the Docker test (`atom-release` runs Llama-3-8B-Instruct): Llama-3-8B's `max_model_len=8192` → `block_tables` buffer dim = 8192 / 16 = 512. The 16k-token seq passes the prefill scheduler (`max_num_batched_tokens=16384` ≥ 16016) but its KV needs ~1000 blocks > 512.

Fix: shrink the prompt to 1500 terms (~7k tokens, 438 blocks → 74-block margin). Locally reproduced with `gpt-oss-120b -tp 1 --max-model-len 8192`.

2. V4-Pro per-PR accuracy step — `safetensors==0.7.0` missing F8_E8M0

The new V4-Pro accuracy entry (also c3ec204) failed at model load: V4-Pro shards contain MX scale tensors with `F8_E8M0` dtype. Both `torch.float8_e8m0fnu` and the safetensors-rust binary support it, but the Python `_TYPES` dict in `safetensors==0.7.0` (latest pinned in CI) is missing the mapping. The mmap path (`safe_open` + `get_tensor`) goes through Rust and works fine, but the `ATOM_DISABLE_MMAP=true` path (set by atom-test.yaml's container env) calls `safetensors.torch.load(bytes)`, which fails the dict lookup.

Fix: monkey-patch the missing mapping at loader import time. No-op when safetensors ships the entry natively.
End-to-end verified locally: previously crashed at shard 2/64; with patch, all 10000 scanned tensors load (4951 are F8_E8M0).
Test plan
- `gpt-oss-120b --max-model-len 8192` → simple_inference crashes on the old 3000-term prompt with the same `(1005,) → (512,)` error
- `safetensors.torch.load(bytes)` on a V4-Pro shard → `KeyError: 'F8_E8M0'`
- `disable_mmap=True` branch