mi355x test#1165
Conversation
Adds DeepSeek-V4-Pro FP4 MI355X ATOM Day-0 single-sequence benchmark config, gated to the limitations of ROCm/ATOM#650 (PR1 skeleton): TP=8, conc=1 only, --enforce-eager, ATOM_USE_TRITON_MOE=1. Image: rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post (matches qwen3.5-fp8-mi355x-atom base). The DSv4 PR is overlaid at runtime by the benchmark script via pip install --no-deps -e . from a pinned SHA (cdbff35), so no new image needs to be published. The script enforces PR1 invariants (CONC=1, EP_SIZE=1) and runs a preflight that asserts the editable atom install took effect, that transformers can resolve deepseek_v3 (the type ATOM maps deepseek_v4 to), and that triton_kernels exposes CDNA4MXScaleLayout (renamed from GFX950MXScaleLayout in the PR). Sweep will expand to TP=4/8 conc 4-256 once ROCm/ATOM PR3 (multi-request) and PR4 (CUDAGraph) land.
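For orientation, a minimal sketch of that runtime overlay, assuming the clone URL and flow (the real script also layers the preflight checks and PR1 invariant guards described below on top of this):

```bash
# Hypothetical condensation of the overlay step; ATOM_PR_DIR and the checkout
# flow mirror the script's description, but this is not its literal code.
ATOM_PR_DIR=/tmp/atom-pr650
git clone https://github.com/ROCm/ATOM.git "$ATOM_PR_DIR"
git -C "$ATOM_PR_DIR" checkout cdbff35   # pinned DSv4 PR SHA
cd "$ATOM_PR_DIR"
# --no-deps: overlay atom only, without churning ROCm/torch/triton/aiter
# already baked into the image.
pip install --no-deps -e .
```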
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
1 similar comment
The release rocm/atom:atom0.1.2.post image cleans up the build-stage path /triton-test/python/triton_kernels/, so the conditional editable install was a no-op and the preflight failed with "No module named 'triton_kernels'". Fix: clone triton-lang/triton at a pinned SHA (028e5da5, the latest commit touching python/triton_kernels/triton_kernels/tensor_details/layout.py as of 2026-04-10) and editable-install the subpackage with pip install --no-deps -e. triton_kernels is self-contained (pyproject deps: numpy, pytest), so this does not perturb the image's triton itself.
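A sketch of that fix, assuming a scratch destination path (`/tmp/triton-pin` is illustrative; the SHA and subpackage path are from the message above):

```bash
# Clone upstream triton at the pinned SHA and editable-install only the
# self-contained triton_kernels subpackage (deps: numpy, pytest).
git clone https://github.com/triton-lang/triton.git /tmp/triton-pin
git -C /tmp/triton-pin checkout 028e5da5
pip install --no-deps -e /tmp/triton-pin/python/triton_kernels/
```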
```bash
python3 -m atom.entrypoints.openai_server \
    --model $MODEL \
    --server-port $PORT \
    -tp $TP \
    --kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
    --block-size $BLOCK_SIZE \
    --enforce-eager \
    --max-num-seqs 1 > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/
```
🔴 The new ATOM server launch (line 156) and the run_benchmark_serving call (line 170) both omit --trust-remote-code for deepseek-ai/DeepSeek-V4-Pro. The peer ATOM recipe on the same image (qwen3.5_fp8_mi355x_atom.sh lines 52, 72) passes it on both, and every other DSv4 launch script in this repo passes it at least on the server. The script's own preflight verifies that transformers knows deepseek_v3, but it does NOT verify that ATOM's _CONFIG_REGISTRY actually intercepts an AutoConfig.from_pretrained lookup for a model_type=deepseek_v4 config. Adding --trust-remote-code on the server (mirroring qwen3.5_fp8_mi355x_atom.sh:52) is a one-line defense against the most likely first-run failure mode, which the PR description already flags.
Extended reasoning...
What the bug is
benchmarks/single_node/dsv4_fp4_mi355x_atom.sh launches atom.entrypoints.openai_server for deepseek-ai/DeepSeek-V4-Pro (lines 156–163) without --trust-remote-code. The PR description itself flags compatibility with the public HF checkpoint as a known risk ("first run will tell us"). DeepSeek-V4-Pro's HF config.json declares model_type=deepseek_v4 — that's the exact reason ATOM ships a _CONFIG_REGISTRY remap to deepseek_v3. transformers 5.2.0's stock CONFIG_MAPPING does not contain deepseek_v4 (the script's own preflight implicitly confirms this — it only asserts that deepseek_v3 is present). So if anything in the server's startup path calls AutoConfig.from_pretrained(deepseek-ai/DeepSeek-V4-Pro) before ATOM's registry intercepts, transformers will raise on the unknown model_type unless trust_remote_code=True is set.
Why this is the most likely first-run failure mode
Strong precedent across this repo and the same image base:
- `dsv4_fp4_b300_vllm.sh` lines 62 + 94: server and client get `--trust-remote-code`
- `dsv4_fp8_h200.sh` lines 48 + 83: server and client get `--trust-remote-code`
- `dsv4_fp8_mi355x.sh`: server gets `--trust-remote-code` (line 81) and the script additionally patches the cached `config.json` from `deepseek_v4` → `deepseek_v3` on disk (lines 30–40) — belt + suspenders
- `qwen3.5_fp8_mi355x_atom.sh` (the peer ATOM recipe on the same image base, `atom0.1.2.post`): server and client both get `--trust-remote-code` (lines 52, 72)
The new ATOM script in this PR neither passes --trust-remote-code nor patches the cached config — it relies entirely on ATOM's internal _CONFIG_REGISTRY doing the intercept.
Addressing the refutation
The refutation correctly observes that dsv4_fp4_b300_sglang.sh and dsv4_fp4_b200.sh use the same canonical deepseek-ai/DeepSeek-V4-Pro checkpoint and call run_benchmark_serving without --trust-remote-code, and they have been merged. That's a fair point about the client side: backend_request_func.get_tokenizer (utils/bench_serving/backend_request_func.py:512–542) just calls AutoTokenizer.from_pretrained, which reads tokenizer_config.json's tokenizer_class and does not need to resolve model_type through transformers' CONFIG_MAPPING. So I am less certain the client-side omission breaks anything in practice, and the client-side fix is best framed as defensive consistency with the peer ATOM recipe.
The server side is the harder claim. The refutation argues that PR #650's _CONFIG_REGISTRY is precisely designed to handle deepseek_v4, so the omission is intentional. That is plausible, but the script's own preflight does not verify it. The preflight only asserts that deepseek_v3 is in CONFIG_MAPPING (line 116) — which would already be true in the upstream wheel without PR #650, since transformers 5.2.0 itself ships DeepseekV3Config. There is no test in the preflight that PR #650's _CONFIG_REGISTRY actually intercepts AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V4-Pro"). If the registry is internal to ATOM (i.e. doesn't patch transformers' CONFIG_MAPPING directly), then any code path inside the server that resolves the config via stock transformers before ATOM's interception will fail.
Step-by-step proof
- Image `rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post` ships transformers 5.2.0 (the script even prints this version in the preflight).
- transformers 5.2.0's `CONFIG_MAPPING` contains `deepseek_v3` but not `deepseek_v4` (otherwise PR #650 wouldn't need a remap; the script's own preflight comment confirms this).
- `huggingface.co/deepseek-ai/DeepSeek-V4-Pro/resolve/main/config.json` declares `model_type: deepseek_v4`.
- `atom.entrypoints.openai_server --model deepseek-ai/DeepSeek-V4-Pro …` (line 156) is launched without `--trust-remote-code`.
- If ATOM's startup imports the model class via `AutoConfig.from_pretrained` (or any code that delegates to it) before installing its `_CONFIG_REGISTRY` shim, transformers will raise: `ValueError: The checkpoint you are trying to load has model type 'deepseek_v4' but Transformers does not recognize this architecture. … you can use trust_remote_code=True.`
- With `--trust-remote-code`, transformers falls back to the model's own `configuration_deepseek.py` from the HF repo, which would resolve regardless of the registry shim.
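A minimal repro of that failure mode, runnable inside the image (a sketch, assuming network access to the HF Hub; the exact ValueError wording varies by transformers version):

```bash
python3 - <<'PY'
from transformers import AutoConfig

# Stock resolution: model_type=deepseek_v4 is absent from CONFIG_MAPPING,
# so this should raise ValueError on transformers 5.2.0.
try:
    AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V4-Pro")
    print("stock resolution unexpectedly succeeded (registry shim active?)")
except ValueError as e:
    print(f"stock resolution failed as predicted: {e}")

# With trust_remote_code, transformers loads the repo's own
# configuration_deepseek.py and never consults CONFIG_MAPPING.
cfg = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V4-Pro", trust_remote_code=True)
print(f"remote-code resolution OK: {type(cfg).__name__}")
PY
```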
The qwen3.5 ATOM peer on the same image passes --trust-remote-code on the server even though Qwen3.5 doesn't suffer from this exact model_type mismatch — that itself shows the prevailing convention is defensive.
Fix
Add --trust-remote-code to the server invocation (line 156-163) and ideally also to run_benchmark_serving (line 170-180), mirroring qwen3.5_fp8_mi355x_atom.sh:52,72. The server-side fix is the load-bearing one; the client-side fix is defensive consistency given the precedent set by every other DSv4 launch script in the repo. Alternatively, follow dsv4_fp8_mi355x.sh's pattern and patch the cached config.json model_type from deepseek_v4 to deepseek_v3 in a Python preflight block.
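Concretely, against the server launch quoted above, the change is one added flag (the surrounding flags are unchanged from the diff; flag placement is arbitrary):

```bash
python3 -m atom.entrypoints.openai_server \
    --model $MODEL \
    --server-port $PORT \
    -tp $TP \
    --kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
    --block-size $BLOCK_SIZE \
    --trust-remote-code \
    --enforce-eager \
    --max-num-seqs 1 > $SERVER_LOG 2>&1 &
```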
| - "Model: deepseek-ai/DeepSeek-V4-Pro (same canonical checkpoint used by dsv4-fp4-b300-vllm); compatibility with PR #650's WeightsMapper not yet verified — first run will tell us" | ||
| - "Pinned to PR1 limitations: single-sequence kv_cache hardcode, --enforce-eager required, ATOM_USE_TRITON_MOE=1 (aiter fused_moe broken on gfx950)" | ||
| - "Sweep will expand to TP=4/8 conc 4–256 once ROCm/ATOM PR3 (multi-request) and PR4 (CUDAGraph) land" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TODO |
🔴 perf-changelog.yaml:1845 has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TODO — a literal placeholder that was never replaced with the actual PR number (1165). Every other entry in the file uses a real numeric PR ID, so this should be changed to /pull/1165 before merge to avoid leaving a permanently broken hyperlink in the changelog.
Extended reasoning...
What the bug is

Line 1845 of perf-changelog.yaml — the pr-link field for the new dsv4-fp4-mi355x-atom changelog entry added by this PR — contains the literal string https://github.com/SemiAnalysisAI/InferenceX/pull/TODO. The TODO token is a placeholder the author meant to substitute with the real PR number (1165, per the PR metadata) but forgot to fill in.

How it manifests

When the changelog is rendered (e.g., in the merged file on GitHub or via any tool that consumes pr-link), the URL resolves to https://github.com/SemiAnalysisAI/InferenceX/pull/TODO, which 404s. Every other entry in the 200+-entry file uses a numeric PR ID — for instance, line 1835 (the entry immediately above) reads pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1160. The placeholder in this PR is the lone outlier.

Why existing validation does not catch it

utils/matrix_logic/validation.py:344 declares pr_link as a plain str on the Pydantic model with no URL or numeric-suffix constraint. The string '.../pull/TODO' is a perfectly valid Python str, so it passes schema validation silently. There is no other validator in the repo that asserts pr_link ends in a digit or resolves to a real PR.

Impact

Once merged, the changelog will permanently contain a broken hyperlink for the dsv4-fp4-mi355x-atom entry; anyone clicking through the changelog to find the originating PR for that benchmark will get a 404. Any downstream tooling that parses pr_link to extract the numeric PR id (e.g., for cross-referencing with GitHub PR metadata) will either fail outright or treat the literal 'TODO' as a string id. This is also a poor signal to readers — the changelog is a permanent historical artifact, and a TODO in the merged version looks unfinished.

How to fix

Replace pull/TODO with pull/1165 on line 1845. One-line change:

```yaml
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1165
```

Step-by-step proof

1. Open perf-changelog.yaml and jump to line 1845. The line reads: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TODO.
2. Compare with the previous entry at line 1835: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1160 — note the numeric ID.
3. The PR metadata at the top of this review identifies this PR as #1165.
4. Construct the URL the merged changelog will render: https://github.com/SemiAnalysisAI/InferenceX/pull/TODO. GitHub interprets the path segment after /pull/ as a PR number; TODO is not numeric, so the URL 404s.
5. Confirm validation does not catch this: utils/matrix_logic/validation.py:344 types pr_link as str with no further constraint, so the string .../pull/TODO passes the schema.

Severity is normal because it is a real defect in the merged artifact, the fix is a one-token substitution, and it is much easier to fix now than to land a follow-up PR later.
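If the repo wanted the schema to catch this class of mistake, a tightened field along these lines would do it (a hypothetical sketch — the class and field names stand in for whatever utils/matrix_logic/validation.py actually defines):

```bash
python3 - <<'PY'
import re
from pydantic import BaseModel, field_validator

class ChangelogEntry(BaseModel):  # stand-in for the repo's actual Pydantic model
    pr_link: str

    @field_validator("pr_link")
    @classmethod
    def _numeric_pr_id(cls, v: str) -> str:
        # Require the link to end in a numeric PR id, rejecting '/pull/TODO'.
        if not re.fullmatch(r"https://github\.com/[^/]+/[^/]+/pull/\d+", v):
            raise ValueError(f"pr-link must end in a numeric PR id: {v!r}")
        return v

ChangelogEntry(pr_link="https://github.com/SemiAnalysisAI/InferenceX/pull/1165")  # passes
try:
    ChangelogEntry(pr_link="https://github.com/SemiAnalysisAI/InferenceX/pull/TODO")
except ValueError as e:  # pydantic v2 ValidationError subclasses ValueError
    print(f"rejected as intended: {e}")
PY
```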
```bash
# PR #650's repro explicitly reinstalls triton_kernels editable. Conditional
# in case the path differs in the chosen image; safe no-op if already present.
if [ -d /triton-test/python/triton_kernels/ ]; then
    pip install --no-deps -e /triton-test/python/triton_kernels/
fi
```
🟡 Line 74 runs pip install --no-deps -e /triton-test/python/triton_kernels/ without --force-reinstall, while the immediately preceding atom install on line 68 explicitly uses --force-reinstall with the comment 'replace the wheel-installed atom with the editable copy'. The same logic applies here, since triton_kernels is also wheel-installed in this image: if pip considers the wheel's requirement already satisfied (same name/version), it can short-circuit and skip the editable switch, leaving the wheel build active. The CDNA4MXScaleLayout preflight only catches the GFX950→CDNA4 rename, so any image whose wheel already exports the renamed symbol will pass preflight while every other PR-only change in triton_kernels silently fails to apply. Fix: add --force-reinstall to mirror the atom install three lines above.
Extended reasoning...
What the bug is
benchmarks/single_node/dsv4_fp4_mi355x_atom.sh overlays two editable installs onto the wheel-built image: ATOM (lines 47–69) and triton_kernels (lines 71–75). The atom install passes --force-reinstall with an explicit comment justifying it:
```bash
# --force-reinstall: replace the wheel-installed atom with the editable copy.
pip install --no-deps --force-reinstall -e .
```

Three lines later, the triton_kernels install omits the same flag:

```bash
if [ -d /triton-test/python/triton_kernels/ ]; then
    pip install --no-deps -e /triton-test/python/triton_kernels/
fi
```

The script author already understood that pip will not necessarily replace a wheel-installed package with the editable copy without --force-reinstall — they applied that reasoning correctly to atom but missed it for triton_kernels.
How it manifests / proof walkthrough
- The base image `rocm/atom:…atom0.1.2.post` ships triton_kernels as a built wheel under site-packages. `/triton-test/python/triton_kernels/` is a sibling source checkout that PR #650 expects to be installed editable so its `fused_moe_triton.py` (and other) edits are live.
- The script runs `pip install --no-deps -e /triton-test/python/triton_kernels/`. Because `--no-deps` is set and no `--force-reinstall` is passed, pip's resolver compares the `Name`/`Version` metadata of the source tree against what's already installed. If they match (very likely for a vendored wheel built from the same upstream source), pip prints 'Requirement already satisfied' and does not switch the import target to the editable path. Behavior is pip-version-dependent: modern pip usually still reinstalls when `-e` is given, but older pip and edge cases (matching version, RECORD already present) can short-circuit.
- The downstream preflight on lines 117–126 checks only one specific symbol — `triton_kernels.tensor_details.layout.CDNA4MXScaleLayout`. PR #650 renames `GFX950MXScaleLayout` → `CDNA4MXScaleLayout` in that module, so if the wheel still uses the old name, the preflight catches the no-op. But if the image's wheel happens to already carry the rename (e.g., the rename has already landed in upstream triton_kernels and been baked into the wheel build), the preflight passes while the editable overlay silently failed. Every other PR-only change in triton_kernels (in `fused_moe_triton.py` and elsewhere) then runs against the wheel's code path, not the PR's.
Why the existing comment doesn't prevent it
The # safe no-op if already present comment refers to the [ -d /triton-test/python/triton_kernels/ ] directory guard — i.e., 'safe to skip if the directory doesn't exist in this image' — not pip-level idempotency. The pip command itself is the one that may quietly no-op.
Impact
When the wheel and the PR-only edit diverge in something other than the renamed Layout symbol (very plausible: PR #650 is described as the first of a series with hard limitations like the kv_cache[:1,...] hardcode, and triton_kernels lives alongside the model code), the benchmark would silently run the wheel's code path while the preflight reports success. Failure mode is silently-wrong throughput numbers or a confusing crash deep inside server startup, not a clear 'editable install did not take effect' message.
Fix
Add --force-reinstall to mirror the atom install:
```bash
pip install --no-deps --force-reinstall -e /triton-test/python/triton_kernels/
```

This is consistent with the script's own justification three lines earlier and with the defensive `atom.__file__`-under-`ATOM_PR_DIR` check that already guards the atom case. Severity is on the boundary between nit and normal — narrow failure window (only matters when the wheel name+version matches the source tree and no other preflight catches the divergence), but the fix is a one-flag change that exactly matches the author's stated intent.
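A complementary post-install guard, mirroring the `atom.__file__` check the script already has (a sketch; the expected prefix assumes the `/triton-test/` checkout path):

```bash
python3 - <<'PY'
import triton_kernels

path = triton_kernels.__file__
# If the wheel is still active, this points into site-packages, not /triton-test/.
assert path.startswith("/triton-test/"), f"editable install did not take effect: {path}"
print(f"triton_kernels imported from: {path}")
PY
```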
The previous SHA (028e5da5, 2026-04-10) imports is_hip_gfx1250 from triton.language.target_info, which the image's installed triton (older, matched to atom0.1.2.post) does not export. Result: ImportError at triton_kernels load, before any of our layout-class checks run. Pin to d28db13d (2026-03-05, parent of commit 11aac682 "[AMD] Add is_hip_gfx1250 target check"). At this SHA layout.py still has CDNA4MXScaleLayout (post the 2026-01-10 rename) and target_info.py imports only is_hip / is_hip_cdna3 / is_hip_cdna4. Verified both properties against the GitHub raw blobs at this SHA. Bump only after the image's triton is upgraded to one that exposes is_hip_gfx1250.
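A quick probe for deciding when the pin can be bumped (a sketch; the module path is taken from the message above):

```bash
python3 - <<'PY'
import triton.language.target_info as ti

# The pin can move past d28db13d only once the image's triton exposes is_hip_gfx1250.
for name in ("is_hip", "is_hip_cdna3", "is_hip_cdna4", "is_hip_gfx1250"):
    print(f"{name}: {'present' if hasattr(ti, name) else 'MISSING'}")
PY
```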
Upstream triton-lang/triton refactored matmul_ogs into matmul.py (removing routing.py too), but PR #650's fused_moe_triton.py imports `from triton_kernels.matmul_ogs import matmul_ogs, PrecisionConfig` and `from triton_kernels.routing import routing`. Those only resolve against the ROCm fork's RI3.5.x branch. Pin to ROCm/triton@e491726 (RI3.5.x HEAD, 2025-11-20). Verified at that SHA:

* matmul_ogs.py has class PrecisionConfig and def matmul_ogs
* routing.py exists
* tensor_details/layout.py exports CDNA4MXScaleLayout
* target_info.py imports only is_hip / is_hip_cdna3 / is_hip_cdna4 (no is_hip_gfx1250, which the image's bundled triton rejects)

This matches what AMD's /triton-test/python/triton_kernels/ would have been on the build host before the release image's cleanup stage.
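The same properties can be smoke-tested in one import block after the install (the imports are exactly the ones PR #650's fused_moe_triton.py needs, per the message above):

```bash
python3 - <<'PY'
# Each import fails loudly if the pinned ROCm/triton SHA drifts from what
# PR #650's fused_moe_triton.py expects.
from triton_kernels.matmul_ogs import matmul_ogs, PrecisionConfig
from triton_kernels.routing import routing
from triton_kernels.tensor_details.layout import CDNA4MXScaleLayout
print("ROCm/triton@e491726 pin: all PR-facing symbols resolve")
PY
```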
The HIP device-guard crash stack only implicated mhc_pre_big_fuse from the aiter MHC family. Patching out both mhc_pre and mhc_post forced the torch fallback for the full MHC pipeline (RMSNorm + linear + 20-iter Sinkhorn + reduce, all in fp32), which is roughly 40% slower than the PR's reported aiter-MHC baseline of 213 ms/token. Restore aiter mhc_post and keep only mhc_pre on the torch fallback. If mhc_post turns out to crash with the same HIP guard assertion on a real forward, re-add the second sed line; for now the cheaper experiment is to find out whether it works. Verify-grep updated from `^2$` to `^1$` so the script still FATALs if upstream renames or removes the mhc_pre lookup site.
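The shape of the patch-plus-guard, heavily hypothetical — the real file path and the aiter lookup-site pattern belong to the script and are not reproduced here:

```bash
# Hypothetical names: MODEL_FILE and the 'aiter.*mhc_pre' pattern stand in for
# the script's real sed target.
MODEL_FILE="$ATOM_PR_DIR/atom/models/deepseek.py"   # illustrative path
COUNT=$(grep -c 'aiter.*mhc_pre' "$MODEL_FILE")
# FATAL if upstream renames/removes the lookup site (was ^2$ while mhc_post was
# also patched out; now exactly one site is expected).
echo "$COUNT" | grep -qx '1' || { echo "FATAL: mhc_pre lookup-site count=$COUNT"; exit 1; }
sed -i 's/aiter_mhc_pre/torch_mhc_pre_fallback/' "$MODEL_FILE"   # illustrative names
```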
dsv4-fp4-mi355x-atom (ROCm/ATOM#650 PR1, single-sequence at TP=8 with torch-fallback mhc_pre because aiter mhc_pre crashes on this image) runs at ~5 min per request in steady state. With 1k1k at 12 prompts plus 8k1k at the same shape, the full sweep can exceed the 300-min cap that #1148 set for the SGLang-DSv4 path. Bump both the SLURM allocation in runners/launch_mi355x-amds.sh and the GitHub Actions timeout-minutes in benchmark-tmpl.yml together — either expiring first kills the job, so they need to stay aligned. Note: this is a global bump that affects every MI355X benchmark and every job that uses the shared workflow template, not just the dsv4 ATOM one. Drop back to 300 once the slow paths are gone (PR4 CUDAGraph + a working aiter MHC).
Bundle of changes from this iteration:

- amd-master.yaml: expand search-space from `conc-start/end: 1` to explicit conc=[1, 4, 16, 32] for both 1k1k and 8k1k. conc-list isn't supported in the single-node sweep generator (only multinode), so use four single-conc entries per ISL.
- dsv4_fp4_mi355x_atom.sh: drop --max-num-seqs 4 — every other ATOM benchmark in the repo uses ATOM's default 512, and the explicit value offered no protective benefit. The PR1 kv_cache[:1,...] hardcode corrupts non-slot-0 lanes whenever the scheduler assembles batch>1, regardless of max-num-seqs. The CONC=1 sanity check was already dropped to allow the conc>1 sweep entries; eval (gsm8k) at conc>1 is the canary for kv_cache[:1] silent corruption.

Address @claude review on PR #1165:

- Add --trust-remote-code on both server launch and run_benchmark_serving (mirrors qwen3.5_fp8_mi355x_atom.sh:52,72 — the peer ATOM recipe on the same atom0.1.2.post image). The model has loaded successfully so far without it (ATOM's _CONFIG_REGISTRY does intercept), but the flag is the prevailing convention and a one-line defense if any startup path resolves AutoConfig.from_pretrained before the registry shim.
- Add --force-reinstall to both triton_kernels editable installs (the /triton-test/ path and the ROCm/triton fallback). Mirrors the atom install above, which uses --force-reinstall for the same wheel-shadow reason. Without it, pip can short-circuit the editable switch when the wheel's name/version match the source tree.
- Fix pr-link in perf-changelog.yaml from /pull/TODO to /pull/1165.
Duplicates the #1165 entry to retrigger dsv4-fp4-mi355x-atom against the new commit (max-num-seqs floor of 4) on top of the merged Day-0 marker. Same configuration narrative; only pr-link differs.
Summary
- `dsv4-fp4-mi355x-atom` benchmark config — Day-0 single-sequence marker for DeepSeek-V4-Pro on ATOM, gated to the limitations of ROCm/ATOM#650 (PR1 skeleton): TP=8, conc=1, `--enforce-eager`, `ATOM_USE_TRITON_MOE=1`.
- Image: `rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post` (same base as `qwen3.5-fp8-mi355x-atom`); the DSv4 PR is overlaid at runtime by the benchmark script via `pip install --no-deps -e .` from a pinned SHA (cdbff35).
- Model: `deepseek-ai/DeepSeek-V4-Pro` — same canonical checkpoint used by `dsv4-fp4-b300-vllm`.

How the runtime patch works
ATOM ships as a built wheel in the chosen image. The new script clones `ROCm/ATOM` to `/tmp/atom-pr650`, checks out the pinned PR SHA, and runs `pip install --no-deps --force-reinstall -e .` so the editable PR copy replaces the wheel without churning ROCm/torch/triton/aiter. A preflight Python block then asserts:

- `atom.__file__` is under `$ATOM_PR_DIR` (the editable install actually took effect)
- `transformers.models.auto.configuration_auto.CONFIG_MAPPING` contains `deepseek_v3` (ATOM's `_CONFIG_REGISTRY` maps `deepseek_v4` → `deepseek_v3`)
- `triton_kernels.tensor_details.layout.CDNA4MXScaleLayout` is importable (the PR renamed it from `GFX950MXScaleLayout`)

If any preflight check fails, the script exits with a clear message before launching the server, instead of timing out in `wait_for_server_ready` on a confusing log error.
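A condensed sketch of that preflight block (the assertion set is taken from the list above; the real script's messages and structure may differ):

```bash
python3 - <<'PY'
import os
import atom

# 1) Editable PR copy, not the wheel.
assert atom.__file__.startswith(os.environ["ATOM_PR_DIR"]), \
    f"wheel atom still active: {atom.__file__}"

# 2) The config type ATOM remaps deepseek_v4 to must exist in transformers.
from transformers.models.auto.configuration_auto import CONFIG_MAPPING
assert "deepseek_v3" in CONFIG_MAPPING, "deepseek_v3 missing from CONFIG_MAPPING"

# 3) PR #650's renamed layout class must be importable.
from triton_kernels.tensor_details.layout import CDNA4MXScaleLayout  # noqa: F401

print("preflight OK")
PY
```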
Why not a real sweep yet

PR #650 is explicitly the first of a series with hard limitations:
- single-sequence only (`kv_cache[:1,...]` is hardcoded; multi-request needs a future ATOM PR3)
- `--enforce-eager` required (no CUDAGraph; deferred to ATOM PR4)

So this benchmark only sweeps TP=8, conc=1 for both 1k1k and 8k1k. The script enforces these as fatal env-var checks (`CONC != 1` or `EP_SIZE != 1` exits early) — silent corruption is the worst failure mode and we don't want it. The sweep will expand to TP=4/8 conc 4–256 once ATOM PR3/PR4 land. The guard is sketched below.
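A sketch of that guard (the real script's wording and defaults may differ):

```bash
# PR1 invariants: anything else risks silent kv_cache[:1,...] corruption.
if [ "${CONC:-1}" != "1" ] || [ "${EP_SIZE:-1}" != "1" ]; then
    echo "FATAL: ROCm/ATOM#650 PR1 supports only CONC=1, EP_SIZE=1 (got CONC=${CONC}, EP_SIZE=${EP_SIZE})"
    exit 1
fi
```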
Test plan

- `dsv4-fp4-mi355x-atom` at `tp=8, conc=1` for both 1k1k and 8k1k
- Preflight log shows: `atom imported from: /tmp/atom-pr650/...`, `transformers version: 5.2.0`, `deepseek_v3 config class: DeepseekV3Config`, `triton_kernels.CDNA4MXScaleLayout: present`
- Throughput compared against the peer MI355X DSv4 recipe (`dsv4-fp8-mi355x-sglang`)
- Eval (`RUN_EVAL=true`) — gsm8k coherent output, mirrors the PR's verified single-prompt output

Known risks (first-run failure modes to watch for)
- `deepseek-ai/DeepSeek-V4-Pro` weight layout vs PR1 `WeightsMapper`: PR was tested against AMD's local `/data/DeepSeek-V4-Pro`; whether the public HF checkpoint's layout matches the mapper has not been verified. Failure mode: stack trace from the loader complaining about unmapped or shape-mismatched params.
- `/triton-test/python/triton_kernels/` path: PR's repro reinstalls it editable. The image we chose may or may not have that exact path; if missing, the script skips the reinstall and the preflight check on `CDNA4MXScaleLayout` will tell us whether the image's existing triton_kernels has the renamed class.
- `pip install --no-deps -e .` build step: if ATOM's `setup.py`/`pyproject.toml` triggers a build that needs deps not in the image, the install fails. Fallback would be applying the PR diff in-place via `git diff | patch`; not implemented yet.