fix(v4-ci): unblock nightly Docker test + V4-Pro accuracy step#703
Merged
Conversation
Two regressions introduced/exposed by the V4-Pro per-PR CI entry (commits c3ec204 onward):

1. `simple_inference.py`: shrink the new arithmetic stress prompt from 3000 terms (~16k tokens) to 1500 terms (~7k tokens). At 16k tokens the seq's KV needs ~1000 blocks, but the `block_tables` buffer's second dim is `max_model_len / block_size`. The Llama-3-8B-Instruct used by the `Test Docker image` step has `max_model_len=8192` → buffer dim = 512, so prefill schedules but decode crashes in `prepare_block_tables` with `ValueError: could not broadcast input array from shape (1005,) into shape (512,)`. 1500 terms → 7005 tokens → 438 blocks, comfortably under 512.

2. `loader.py`: monkey-patch `safetensors.torch._TYPES` to register `F8_E8M0 -> torch.float8_e8m0fnu`. The Python wrapper in `safetensors==0.7.0` is missing this entry even though both torch and safetensors-rust support it. The mmap path (`safe_open` + `get_tensor`) goes through Rust and works fine, but the `ATOM_DISABLE_MMAP=true` path (used by `atom-test.yaml` in CI) calls `safetensors.torch.load(bytes)`, which does the dict lookup and raises `KeyError: 'F8_E8M0'` on V4-Pro shards containing MX scale tensors. The patch is a no-op when safetensors ships the entry natively.
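The capacity arithmetic behind fix 1 can be sketched as follows. This is a minimal sketch: the block size of 16 and `max_model_len=8192` come from the numbers above, `blocks_needed` is an illustrative helper (not the scheduler's actual code), and the 16070-token count is a hypothetical value chosen to reproduce the `(1005,)` shape from the traceback.

```python
BLOCK_SIZE = 16          # KV cache block size (per the 8192 / 16 = 512 above)
MAX_MODEL_LEN = 8192     # Llama-3-8B-Instruct in the Docker test

def blocks_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Ceil-divide: KV cache blocks required for a sequence of num_tokens."""
    return (num_tokens + block_size - 1) // block_size

# Second dim of the block_tables buffer is fixed at model load time:
buffer_dim = MAX_MODEL_LEN // BLOCK_SIZE   # 512

old_prompt = blocks_needed(16070)   # 1005 blocks > 512 -> broadcast ValueError
new_prompt = blocks_needed(7005)    # 438 blocks <= 512 -> fits with margin
```

Prefill scheduling only checks `max_num_batched_tokens`, not the block-table width, which is why the crash surfaces later in decode rather than at admission.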
valarLip added a commit that referenced this pull request on May 7, 2026:
…ill (#710)

Two V4-Pro CI failures surfaced after PR #703 unblocked the loader path:

1. Triton MoE OOM on AMD CDNA4 with triton 3.6+/3.7+
   - `triton_kernels.matmul_ogs_details.opt_flags_amd` has a CDNA4 special case `if cdna4 and block_m == 128: block_n = 512`, giving BLOCK_M*BLOCK_N = 64K FP32 accumulator entries. triton 3.6+ spills the accumulator to LDS more aggressively than 3.5, exceeding the MI355X 160 KiB LDS budget (269 KiB observed).
   - Fix: wrap matmul_ogs calls with a CDNA4-only context manager that pins block_m=64 / block_n=256 (BLOCK_M*BLOCK_N = 16K, fits in registers). Tunable via `ATOM_TRITON_MOE_BLOCK_{M,N}` env vars.
   - Other GPU families and triton ≤3.5 paths are unaffected.

2. `cp_gather_indexer_k_quant_cache` HIP "invalid configuration argument" when `cu_committed_cpu[-1] == 0` (fresh prefill with a prompt shorter than the CSA `ratio`). The kernel grid is computed as `(num_tokens + BLOCK_Y_SIZE - 1) / BLOCK_Y_SIZE`, so `num_tokens=0` makes grid.x=0 and the HIP launcher rejects it.
   - Fix: clamp `cu_committed_cpu[-1]` to ≥1 in the indexer-meta builder. The dummy +1 row is gathered from the last seq's first cache block but never read downstream, because `fp8_mqa_logits` and `top_k_per_row_prefill` honor per-token `cu_starts`/`cu_ends` derived from `cu_committed_gpu[:-1]` and `n_committed_per_seq`, both of which remain 0. Output stays all -1 sentinels, matching the all-empty semantics. Pure host-side scalar arithmetic on a value already host-synced; no CG/torch.compile graph branch added.

Verified locally with triton 3.7.0+amd.rocm7.1.0:
- DeepSeek-V4-Pro server starts (no OOM)
- 1-token "Hi" curl returns successfully (was crashing pre-fix)
- GSM8K-50 fewshot=5 = 0.94 (matches pre-PR-703 baseline)
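The host-side shape of the two fixes above can be sketched as follows. Both helpers are illustrative stand-ins, not the real module API: the actual opt-flags plumbing lives inside `triton_kernels`, so `opt_flags` here is a plain dict, and `clamp_committed` works on a Python list rather than the real host-synced tensor.

```python
import os
from contextlib import contextmanager

@contextmanager
def cdna4_moe_block_override(opt_flags: dict):
    """Fix 1 (sketch): pin matmul_ogs tile sizes on CDNA4 so the FP32
    accumulator (BLOCK_M * BLOCK_N entries) stays small enough for
    registers. `opt_flags` stands in for the real triton_kernels flags."""
    saved = dict(opt_flags)
    opt_flags["block_m"] = int(os.environ.get("ATOM_TRITON_MOE_BLOCK_M", "64"))
    opt_flags["block_n"] = int(os.environ.get("ATOM_TRITON_MOE_BLOCK_N", "256"))
    try:
        yield opt_flags
    finally:
        # Restore the original flags for other GPU families / call sites.
        opt_flags.clear()
        opt_flags.update(saved)

def clamp_committed(cu_committed_cpu: list[int]) -> list[int]:
    """Fix 2 (sketch): keep the gather kernel's grid non-empty.
    grid.x = ceil(num_tokens / BLOCK_Y_SIZE) is 0 when the last cumulative
    count is 0, and the HIP launcher rejects a zero-sized grid."""
    if cu_committed_cpu and cu_committed_cpu[-1] == 0:
        return cu_committed_cpu[:-1] + [1]
    return cu_committed_cpu
```

The clamp is safe precisely because the downstream kernels bound their reads by `cu_committed_gpu[:-1]` and `n_committed_per_seq`, which stay 0, so the dummy row is gathered but never consumed.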
Summary
Two hotfixes for CI regressions exposed after the V4-Pro PR1 merge (#650):
1. Nightly Docker Release — `simple_inference` long prompt overflows `block_tables`

The new arithmetic stress prompt (3000 terms ≈ 16k tokens) added in c3ec204 crashed the Docker test (`atom-release` runs Llama-3-8B-Instruct): Llama-3-8B's `max_model_len=8192` → `block_tables` buffer dim = 8192 / 16 = 512. The 16k-token seq passes the prefill scheduler (`max_num_batched_tokens=16384` ≥ 16016) but its KV needs ~1000 blocks > 512.

Fix: shrink the prompt to 1500 terms (~7k tokens, 438 blocks → 74-block margin). Locally reproduced with `gpt-oss-120b -tp 1 --max-model-len 8192`.

2. V4-Pro per-PR accuracy step — `safetensors==0.7.0` missing F8_E8M0

The new V4-Pro accuracy entry (also c3ec204) failed at model load: V4-Pro shards contain MX scale tensors with `F8_E8M0` dtype. Both `torch.float8_e8m0fnu` and the safetensors-rust binary support it, but the Python `_TYPES` dict in `safetensors==0.7.0` (latest pinned in CI) is missing the mapping. The mmap path (`safe_open` + `get_tensor`) goes through Rust and works fine, but the `ATOM_DISABLE_MMAP=true` path (set by atom-test.yaml's container env) calls `safetensors.torch.load(bytes)`, which fails the dict lookup.

Fix: monkey-patch the missing mapping at loader import time. No-op when safetensors ships the entry natively.
End-to-end verified locally: previously crashed at shard 2/64; with patch, all 10000 scanned tensors load (4951 are F8_E8M0).
Test plan
- `gpt-oss-120b --max-model-len 8192` → simple_inference crashes on the old 3000-term prompt with the same `(1005,) → (512,)` error
- `safetensors.torch.load(bytes)` on a V4-Pro shard → `KeyError: 'F8_E8M0'`
- `disable_mmap=True` branch