mi355x test #1165

Merged
Oseltamivir merged 10 commits into main from mi355x-test
Apr 26, 2026
Conversation

@Oseltamivir
Collaborator

Summary

  • Adds dsv4-fp4-mi355x-atom benchmark config — Day-0 single-sequence marker for DeepSeek-V4-Pro on ATOM, gated to the limitations of ROCm/ATOM#650 (PR1 skeleton): TP=8, conc=1, --enforce-eager, ATOM_USE_TRITON_MOE=1.
  • Image is the existing rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post (same base as qwen3.5-fp8-mi355x-atom); the DSv4 PR is overlaid at runtime by the benchmark script via pip install --no-deps -e . from a pinned SHA (cdbff35).
  • Model is deepseek-ai/DeepSeek-V4-Pro — same canonical checkpoint used by dsv4-fp4-b300-vllm.

How the runtime patch works

ATOM ships as a built wheel in the chosen image. The new script clones ROCm/ATOM to /tmp/atom-pr650, checks out the pinned PR SHA, and runs pip install --no-deps --force-reinstall -e . so the editable PR copy replaces the wheel without churning ROCm/torch/triton/aiter. A preflight Python block then asserts:

  • atom.__file__ is under $ATOM_PR_DIR (the editable install actually took effect)
  • transformers.models.auto.configuration_auto.CONFIG_MAPPING contains deepseek_v3 (ATOM's _CONFIG_REGISTRY maps deepseek_v4 → deepseek_v3)
  • triton_kernels.tensor_details.layout.CDNA4MXScaleLayout is importable (PR renamed it from GFX950MXScaleLayout)

If any preflight check fails, the script exits with a clear message before launching the server, instead of timing out wait_for_server_ready on a confusing log error.
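The check-then-exit pattern can be sketched as follows. This is a minimal illustration of the pattern only — the real script's assertions on atom.__file__, CONFIG_MAPPING, and CDNA4MXScaleLayout are the ones described above; the stdlib checks here are stand-ins:

```python
import importlib
import sys

def preflight(checks):
    """Run named check callables; exit with a clear message on the first
    failure instead of letting the server start and time out later."""
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception as exc:  # an ImportError counts as a failed check
            print(f"preflight FAILED: {name}: {exc!r}", file=sys.stderr)
            raise SystemExit(1)
        if not ok:
            print(f"preflight FAILED: {name}", file=sys.stderr)
            raise SystemExit(1)
        print(f"preflight ok: {name}")

# Stand-in checks; the real ones would assert atom.__file__ is under
# $ATOM_PR_DIR, deepseek_v3 is in CONFIG_MAPPING, and
# CDNA4MXScaleLayout is importable.
preflight([
    ("json is importable", lambda: importlib.import_module("json") is not None),
    ("interpreter is Python 3", lambda: sys.version_info.major == 3),
])
```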

Why not a real sweep yet

PR #650 is explicitly the first of a series with hard limitations:

  • Single-sequence only (kv_cache[:1,...] is hardcoded; multi-request needs a future ATOM PR3)
  • --enforce-eager required (no CUDAGraph; deferred to ATOM PR4)
  • TPOT ~213ms (no kernel optimization yet)

So this benchmark only sweeps TP=8, conc=1 for both 1k1k and 8k1k. The script enforces these as fatal env-var checks (CONC != 1 or EP_SIZE != 1 exits early) — silent corruption is the worst failure mode and we don't want it. Sweep will expand to TP=4/8 conc 4–256 once ATOM PR3/PR4 land.
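A minimal sketch of such a fatal guard — the function and message wording are hypothetical; only the CONC/EP_SIZE invariants come from the description above:

```shell
# Sketch: refuse to run under configurations PR1 cannot handle correctly,
# failing loudly instead of risking kv_cache[:1,...] silent corruption.
check_pr1_invariants() {
  local conc="$1" ep_size="$2"
  if [ "$conc" != "1" ]; then
    echo "FATAL: PR1 is single-sequence only; CONC must be 1 (got $conc)" >&2
    return 1
  fi
  if [ "$ep_size" != "1" ]; then
    echo "FATAL: PR1 does not support EP; EP_SIZE must be 1 (got $ep_size)" >&2
    return 1
  fi
}
```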

Test plan

  • CI run produces a result JSON for dsv4-fp4-mi355x-atom at tp=8, conc=1 for both 1k1k and 8k1k
  • Server log shows the preflight checks all passing (atom imported from: /tmp/atom-pr650/..., transformers version: 5.2.0, deepseek_v3 config class: DeepseekV3Config, triton_kernels.CDNA4MXScaleLayout: present)
  • Server reaches ready state under the 300-min SLURM cap (DSv4 has slow first-run JIT; comparable to dsv4-fp8-mi355x-sglang)
  • Eval pass (RUN_EVAL=true) — gsm8k coherent output, mirrors the PR's verified single-prompt output

Known risks (first-run failure modes to watch for)

  • deepseek-ai/DeepSeek-V4-Pro weight layout vs PR1 WeightsMapper: PR was tested against AMD's local /data/DeepSeek-V4-Pro; whether the public HF checkpoint's layout matches the mapper has not been verified. Failure mode: stack trace from the loader complaining about unmapped or shape-mismatched params.
  • /triton-test/python/triton_kernels/ path: PR's repro reinstalls it editable. The image we chose may or may not have that exact path; if missing, the script skips the reinstall and the preflight check on CDNA4MXScaleLayout will tell us whether the image's existing triton_kernels has the renamed class.
  • pip install --no-deps -e . build step: if ATOM's setup.py/pyproject.toml triggers a build that needs deps not in the image, the install fails. Fallback would be applying the PR diff in-place via git diff | patch; not implemented yet.
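The unimplemented git diff | patch fallback mentioned in the last bullet could look roughly like this sketch (function name, arguments, and repo layout are assumptions):

```shell
# Sketch: export a checkout's diff against a base ref and apply it
# directly onto another tree (e.g. the wheel-installed package),
# avoiding the pip build step entirely.
apply_diff_fallback() {
  local src_repo="$1" base_ref="$2" dest_dir="$3"
  git -C "$src_repo" diff "$base_ref" | patch -p1 -d "$dest_dir"
}
```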

Adds DeepSeek-V4-Pro FP4 MI355X ATOM Day-0 single-sequence benchmark
config, gated to the limitations of ROCm/ATOM#650 (PR1 skeleton):
TP=8, conc=1 only, --enforce-eager, ATOM_USE_TRITON_MOE=1.

Image: rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post
(matches qwen3.5-fp8-mi355x-atom base). The DSv4 PR is overlaid at
runtime by the benchmark script via pip install --no-deps -e . from a
pinned SHA (cdbff35), so no new image needs to be published.

The script enforces PR1 invariants (CONC=1, EP_SIZE=1) and runs a
preflight that asserts the editable atom install took effect, that
transformers can resolve deepseek_v3 (the type ATOM maps deepseek_v4
to), and that triton_kernels exposes CDNA4MXScaleLayout (renamed from
GFX950MXScaleLayout in the PR).

Sweep will expand to TP=4/8 conc 4-256 once ROCm/ATOM PR3
(multi-request) and PR4 (CUDAGraph) land.
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

The release rocm/atom:atom0.1.2.post image cleans up the build-stage
path /triton-test/python/triton_kernels/, so the conditional editable
install was a no-op and the preflight failed with "No module named
'triton_kernels'".

Clone triton-lang/triton at a pinned SHA (028e5da5, latest commit to
python/triton_kernels/triton_kernels/tensor_details/layout.py as of
2026-04-10) and pip install --no-deps -e the subpackage. triton_kernels
is self-contained (pyproject deps: numpy, pytest), so this does not
perturb the image's triton itself.
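The kind of "latest commit touching this file" lookup used to choose that pin can be reproduced with git log restricted to a path. A generic sketch (run inside any checkout; the helper name is hypothetical):

```shell
# Most recent commit that touched a given path in a checkout.
last_commit_touching() {
  git -C "$1" log -n 1 --format=%H -- "$2"
}
```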
Comment on lines +156 to +180
python3 -m atom.entrypoints.openai_server \
--model $MODEL \
--server-port $PORT \
-tp $TP \
--kv_cache_dtype fp8 $CALCULATED_MAX_MODEL_LEN $EP \
--block-size $BLOCK_SIZE \
--enforce-eager \
--max-num-seqs 1 > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/
Contributor


🔴 The new ATOM server launch (line 156) and run_benchmark_serving call (line 170) both omit --trust-remote-code for deepseek-ai/DeepSeek-V4-Pro, while the peer ATOM recipe on the same image (qwen3.5_fp8_mi355x_atom.sh lines 52, 72) passes it on both, and every other DSv4 launch script in this repo passes it at least on the server. The script's own preflight verifies that transformers knows deepseek_v3 but does NOT verify that ATOM's _CONFIG_REGISTRY actually intercepts an AutoConfig.from_pretrained lookup for a model_type=deepseek_v4 config — adding --trust-remote-code on the server (mirroring qwen3.5_fp8_mi355x_atom.sh:52) is a one-line defense against the most likely first-run failure mode the PR description already flags.

Extended reasoning...

What the bug is

benchmarks/single_node/dsv4_fp4_mi355x_atom.sh launches atom.entrypoints.openai_server for deepseek-ai/DeepSeek-V4-Pro (lines 156–163) without --trust-remote-code. The PR description itself flags compatibility with the public HF checkpoint as a known risk ("first run will tell us"). DeepSeek-V4-Pro's HF config.json declares model_type=deepseek_v4 — that's the exact reason ATOM ships a _CONFIG_REGISTRY remap to deepseek_v3. transformers 5.2.0's stock CONFIG_MAPPING does not contain deepseek_v4 (the script's own preflight implicitly confirms this — it only asserts that deepseek_v3 is present). So if anything in the server's startup path calls AutoConfig.from_pretrained(deepseek-ai/DeepSeek-V4-Pro) before ATOM's registry intercepts, transformers will raise on the unknown model_type unless trust_remote_code=True is set.

Why this is the most likely first-run failure mode

Strong precedent across this repo and the same image base:

  • dsv4_fp4_b300_vllm.sh lines 62 + 94: server and client get --trust-remote-code
  • dsv4_fp8_h200.sh lines 48 + 83: server and client get --trust-remote-code
  • dsv4_fp8_mi355x.sh: server gets --trust-remote-code (line 81) and the script additionally patches the cached config.json from deepseek_v4 → deepseek_v3 on disk (lines 30–40) — belt + suspenders
  • qwen3.5_fp8_mi355x_atom.sh (the peer ATOM recipe on the same image base, atom0.1.2.post): server and client both get --trust-remote-code (lines 52, 72)

The new ATOM script in this PR passes neither --trust-remote-code nor patches the cached config — it relies entirely on ATOM's internal _CONFIG_REGISTRY doing the intercept.

Addressing the refutation

The refutation correctly observes that dsv4_fp4_b300_sglang.sh and dsv4_fp4_b200.sh use the same canonical deepseek-ai/DeepSeek-V4-Pro checkpoint and call run_benchmark_serving without --trust-remote-code, and they have been merged. That's a fair point about the client side: backend_request_func.get_tokenizer (utils/bench_serving/backend_request_func.py:512–542) just calls AutoTokenizer.from_pretrained, which reads tokenizer_config.json's tokenizer_class and does not need to resolve model_type through transformers' CONFIG_MAPPING. So I am less certain the client-side omission breaks anything in practice, and the client-side fix is best framed as defensive consistency with the peer ATOM recipe.

The server side is the harder claim. The refutation argues that PR #650's _CONFIG_REGISTRY is precisely designed to handle deepseek_v4, so the omission is intentional. That is plausible, but the script's own preflight does not verify it. The preflight only asserts that deepseek_v3 is in CONFIG_MAPPING (line 116) — which would already be true in the upstream wheel without PR #650, since transformers 5.2.0 itself ships DeepseekV3Config. There is no test in the preflight that PR #650's _CONFIG_REGISTRY actually intercepts AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V4-Pro"). If the registry is internal to ATOM (i.e. doesn't patch transformers' CONFIG_MAPPING directly), then any code path inside the server that resolves the config via stock transformers before ATOM's interception will fail.

Step-by-step proof

  1. Image rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post ships transformers 5.2.0 (the script even prints this version in the preflight).
  2. transformers 5.2.0's CONFIG_MAPPING contains deepseek_v3 but not deepseek_v4 (otherwise PR #650 wouldn't need a remap; the script's own preflight comment confirms this).
  3. huggingface.co/deepseek-ai/DeepSeek-V4-Pro/resolve/main/config.json declares model_type: deepseek_v4.
  4. atom.entrypoints.openai_server --model deepseek-ai/DeepSeek-V4-Pro … (line 156) is launched without --trust-remote-code.
  5. If ATOM's startup imports the model class via AutoConfig.from_pretrained (or any code that delegates to it) before installing its _CONFIG_REGISTRY shim, transformers will raise:
    ValueError: The checkpoint you are trying to load has model type deepseek_v4 but Transformers does not recognize this architecture. … you can use trust_remote_code=True.
  6. With --trust-remote-code, transformers falls back to the model's own configuration_deepseek.py from the HF repo, which would resolve regardless of the registry shim.

The qwen3.5 ATOM peer on the same image passes --trust-remote-code on the server even though Qwen3.5 doesn't suffer from this exact model_type mismatch — that itself shows the prevailing convention is defensive.

Fix

Add --trust-remote-code to the server invocation (line 156-163) and ideally also to run_benchmark_serving (line 170-180), mirroring qwen3.5_fp8_mi355x_atom.sh:52,72. The server-side fix is the load-bearing one; the client-side fix is defensive consistency given the precedent set by every other DSv4 launch script in the repo. Alternatively, follow dsv4_fp8_mi355x.sh's pattern and patch the cached config.json model_type from deepseek_v4 to deepseek_v3 in a Python preflight block.
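The alternative pattern mentioned last (patching the cached config.json) could be sketched like this. The function name is hypothetical and the path handling is simplified — the real dsv4_fp8_mi355x.sh locates the file inside the HF hub cache:

```python
import json
from pathlib import Path

def remap_model_type(config_path, old="deepseek_v4", new="deepseek_v3"):
    """Rewrite a cached HF config.json's model_type in place so stock
    transformers can resolve it without a registry shim. Idempotent."""
    path = Path(config_path)
    cfg = json.loads(path.read_text())
    if cfg.get("model_type") == old:
        cfg["model_type"] = new
        path.write_text(json.dumps(cfg, indent=2))
    return cfg["model_type"]
```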

Comment thread perf-changelog.yaml Outdated
- "Model: deepseek-ai/DeepSeek-V4-Pro (same canonical checkpoint used by dsv4-fp4-b300-vllm); compatibility with PR #650's WeightsMapper not yet verified — first run will tell us"
- "Pinned to PR1 limitations: single-sequence kv_cache hardcode, --enforce-eager required, ATOM_USE_TRITON_MOE=1 (aiter fused_moe broken on gfx950)"
- "Sweep will expand to TP=4/8 conc 4–256 once ROCm/ATOM PR3 (multi-request) and PR4 (CUDAGraph) land"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TODO
Contributor


🔴 perf-changelog.yaml:1845 has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TODO — a literal placeholder that was never replaced with the actual PR number (1165). Every other entry in the file uses a real numeric PR ID, so this should be changed to /pull/1165 before merge to avoid leaving a permanently broken hyperlink in the changelog.

Extended reasoning...

What the bug is. Line 1845 of perf-changelog.yaml — the pr-link field for the new dsv4-fp4-mi355x-atom changelog entry added by this PR — contains the literal string https://github.com/SemiAnalysisAI/InferenceX/pull/TODO. The TODO token is a placeholder the author meant to substitute with the real PR number (1165, per the PR metadata) but forgot to fill in.

How it manifests. When the changelog is rendered (e.g., in the merged file on GitHub or via any tool that consumes pr-link), the URL resolves to https://github.com/SemiAnalysisAI/InferenceX/pull/TODO, which 404s. Every other entry in the 200+-entry file uses a numeric PR ID — for instance, line 1835 (the entry immediately above) reads pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1160. The placeholder in this PR is the lone outlier.

Why existing validation does not catch it. utils/matrix_logic/validation.py:344 declares pr_link as a plain str on the Pydantic model with no URL or numeric-suffix constraint. The string '.../pull/TODO' is a perfectly valid Python str, so it passes schema validation silently. There is no other validator in the repo that asserts pr_link ends in a digit or resolves to a real PR.

Impact. Once merged, the changelog will permanently contain a broken hyperlink for the dsv4-fp4-mi355x-atom entry; anyone clicking through the changelog to find the originating PR for that benchmark will get a 404. Any downstream tooling that parses pr_link to extract the numeric PR id (e.g., for cross-referencing with GitHub PR metadata) will either fail outright or treat the literal 'TODO' as a string id. This is also a poor signal to readers — the changelog is a permanent historical artifact, and a TODO in the merged version looks unfinished.

How to fix. Replace pull/TODO with pull/1165 on line 1845. One-line change:

    pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1165

Step-by-step proof.

  1. Open perf-changelog.yaml and jump to line 1845. The line reads: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TODO.
  2. Compare with the previous entry at line 1835: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1160 — note the numeric ID.
  3. The PR metadata at the top of this review identifies this PR as #1165.
  4. Construct the URL the merged changelog will render: https://github.com/SemiAnalysisAI/InferenceX/pull/TODO. GitHub interprets the path segment after /pull/ as a PR number; TODO is not numeric, so the URL 404s.
  5. Confirm validation does not catch this: utils/matrix_logic/validation.py:344 types pr_link as str with no further constraint, so the string .../pull/TODO passes the schema.

Severity is normal because it is a real defect in the merged artifact, the fix is a one-token substitution, and it is much easier to fix now than to land a follow-up PR later.
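A stricter schema check of the sort the reasoning says is missing could be sketched as follows (the regex and function name are hypothetical, not the repo's actual validator):

```python
import re

# Require a numeric PR id after /pull/ — a plain `str` field accepts
# the placeholder ".../pull/TODO" that slipped through here.
PR_LINK_RE = re.compile(r"^https://github\.com/[\w.-]+/[\w.-]+/pull/\d+$")

def pr_link_is_valid(link: str) -> bool:
    return PR_LINK_RE.match(link) is not None
```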

Comment on lines +71 to +75
# PR #650's repro explicitly reinstalls triton_kernels editable. Conditional
# in case the path differs in the chosen image; safe no-op if already present.
if [ -d /triton-test/python/triton_kernels/ ]; then
pip install --no-deps -e /triton-test/python/triton_kernels/
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Line 74 runs pip install --no-deps -e /triton-test/python/triton_kernels/ without --force-reinstall, while the immediately-preceding atom install on line 68 explicitly uses --force-reinstall with the comment 'replace the wheel-installed atom with the editable copy' — the same logic applies here, since triton_kernels is also wheel-installed in this image. If pip considers the wheel's requirement already satisfied (same name/version), it can short-circuit and skip the editable switch, leaving the wheel build active; the CDNA4MXScaleLayout preflight only catches the GFX950→CDNA4 rename, so any image whose wheel already exports the renamed symbol will pass preflight while every other PR-only change in triton_kernels silently fails to apply. Fix: add --force-reinstall to mirror the atom install three lines above.

Extended reasoning...

What the bug is

benchmarks/single_node/dsv4_fp4_mi355x_atom.sh overlays two editable installs onto the wheel-built image: ATOM (lines 47–69) and triton_kernels (lines 71–75). The atom install passes --force-reinstall with an explicit comment justifying it:

# --force-reinstall: replace the wheel-installed atom with the editable copy.
pip install --no-deps --force-reinstall -e .

Three lines later, the triton_kernels install omits the same flag:

if [ -d /triton-test/python/triton_kernels/ ]; then
    pip install --no-deps -e /triton-test/python/triton_kernels/
fi

The script author already understood that pip will not necessarily replace a wheel-installed package with the editable copy without --force-reinstall — they applied that reasoning correctly to atom but missed it for triton_kernels.

How it manifests / proof walkthrough

  1. The base image rocm/atom:…atom0.1.2.post ships triton_kernels as a built wheel under site-packages. /triton-test/python/triton_kernels/ is a sibling source checkout that PR #650 expects to be installed editable so its fused_moe_triton.py (and other) edits are live.
  2. The script runs pip install --no-deps -e /triton-test/python/triton_kernels/. Because --no-deps is set and no --force-reinstall is passed, pip's resolver compares the Name/Version metadata of the source tree against what's already installed. If they match (very likely for a vendored wheel built from the same upstream source), pip prints 'Requirement already satisfied' and does not switch the import target to the editable path. Behavior is pip-version-dependent: modern pip usually still reinstalls when -e is given, but older pip and edge cases (matching version, RECORD already present) can short-circuit.
  3. The downstream preflight on lines 117–126 checks only one specific symbol — triton_kernels.tensor_details.layout.CDNA4MXScaleLayout. PR #650 renames GFX950MXScaleLayout → CDNA4MXScaleLayout in that module, so if the wheel still uses the old name, the preflight catches the no-op. But if the image's wheel happens to already carry the rename (e.g., the rename has already landed in upstream triton_kernels and been baked into the wheel build), the preflight passes while the editable overlay silently failed. Every other PR-only change in triton_kernels (in fused_moe_triton.py and elsewhere) then runs against the wheel's code path, not the PR's.

Why the existing comment doesn't prevent it

The # safe no-op if already present comment refers to the [ -d /triton-test/python/triton_kernels/ ] directory guard — i.e., 'safe to skip if the directory doesn't exist in this image' — not pip-level idempotency. The pip command itself is the one that may quietly no-op.

Impact

When the wheel and the PR-only edit diverge in something other than the renamed Layout symbol (very plausible: PR #650 is described as the first of a series with hard limitations like the kv_cache[:1,...] hardcode, and triton_kernels lives alongside the model code), the benchmark would silently run the wheel's code path while the preflight reports success. Failure mode is silently-wrong throughput numbers or a confusing crash deep inside server startup, not a clear 'editable install did not take effect' message.

Fix

Add --force-reinstall to mirror the atom install:

pip install --no-deps --force-reinstall -e /triton-test/python/triton_kernels/

This is consistent with the script's own justification three lines earlier and with the defensive atom.__file__-under-ATOM_PR_DIR check that already guards the atom case. Severity is on the boundary between nit and normal — narrow failure window (only matters when the wheel name+version matches the source tree and no other preflight catches the divergence), but the fix is a one-flag change that exactly matches the author's stated intent.

Oseltamivir and others added 8 commits April 25, 2026 20:57
The previous SHA (028e5da5, 2026-04-10) imports is_hip_gfx1250 from
triton.language.target_info, which the image's installed triton (older,
matched to atom0.1.2.post) does not export. Result: ImportError at
triton_kernels load, before any of our layout-class checks run.

Pin to d28db13d (2026-03-05, parent of commit 11aac682 "[AMD] Add
is_hip_gfx1250 target check"). At this SHA layout.py still has
CDNA4MXScaleLayout (post the 2026-01-10 rename) and target_info.py
imports only is_hip / is_hip_cdna3 / is_hip_cdna4. Verified both
properties against the GitHub raw blobs at this SHA.

Bump only after the image's triton is upgraded to one that exposes
is_hip_gfx1250.
Upstream triton-lang/triton refactored matmul_ogs into matmul.py
(removing routing.py too), but PR #650's fused_moe_triton.py imports
`from triton_kernels.matmul_ogs import matmul_ogs, PrecisionConfig`
and `from triton_kernels.routing import routing`. Those only resolve
against the ROCm fork's RI3.5.x branch.

Pin to ROCm/triton@e491726 (RI3.5.x HEAD, 2025-11-20). Verified at
that SHA:
  * matmul_ogs.py has class PrecisionConfig and def matmul_ogs
  * routing.py exists
  * tensor_details/layout.py exports CDNA4MXScaleLayout
  * target_info.py imports only is_hip / is_hip_cdna3 / is_hip_cdna4
    (no is_hip_gfx1250, which the image's bundled triton rejects)

This matches what AMD's /triton-test/python/triton_kernels/ would have
been on the build host before the release image's cleanup stage.
The HIP device-guard crash stack only implicated mhc_pre_big_fuse from
the aiter MHC family. Patching out both mhc_pre and mhc_post forced the
torch fallback for the full MHC pipeline (RMSNorm + linear + 20-iter
Sinkhorn + reduce, all in fp32), which is roughly 40% slower than the
PR's reported aiter-MHC baseline of 213 ms/token.

Restore aiter mhc_post and keep only mhc_pre on the torch fallback. If
mhc_post turns out to crash with the same HIP guard assertion on a real
forward, re-add the second sed line; for now the cheaper experiment is
to find out whether it works.

Verify-grep updated from `^2$` to `^1$` so the script still FATALs if
upstream renames or removes the mhc_pre lookup site.
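The verify-grep guard described here follows a generic "patch, then assert the match count" pattern; a sketch (function name and message wording hypothetical):

```shell
# FATAL if the lookup-site pattern no longer appears the expected number
# of times — catches upstream renames/removals instead of silently
# sed-patching nothing.
verify_count() {
  local file="$1" pattern="$2" expected="$3" count
  count=$(grep -c -- "$pattern" "$file" || true)
  if [ "$count" != "$expected" ]; then
    echo "FATAL: expected $expected match(es) of '$pattern' in $file, got $count" >&2
    return 1
  fi
}
```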
dsv4-fp4-mi355x-atom (ROCm/ATOM#650 PR1, single-sequence at TP=8 with
torch-fallback mhc_pre because aiter mhc_pre crashes on this image)
runs at ~5 min per request in steady state. With 1k1k at 12 prompts
plus 8k1k at the same shape, the full sweep can exceed the 300-min
cap that #1148 set for the SGLang-DSv4 path.

Bump both the SLURM allocation in runners/launch_mi355x-amds.sh and
the GitHub Actions timeout-minutes in benchmark-tmpl.yml together —
either expiring first kills the job, so they need to stay aligned.

Note: this is a global bump that affects every MI355X benchmark and
every job that uses the shared workflow template, not just the dsv4
ATOM one. Drop back to 300 once the slow paths are gone (PR4
CUDAGraph + a working aiter MHC).
Bundle of changes from this iteration:

- amd-master.yaml: expand search-space from `conc-start/end: 1` to
  explicit conc=[1, 4, 16, 32] for both 1k1k and 8k1k. conc-list isn't
  supported in the single-node sweep generator (only multinode), so use
  four single-conc entries per ISL.

- dsv4_fp4_mi355x_atom.sh: drop --max-num-seqs 4 — every other ATOM
  benchmark in the repo uses ATOM's default 512, and the explicit value
  offered no protective benefit. The PR1 kv_cache[:1,...] hardcode
  corrupts non-slot-0 lanes whenever the scheduler assembles batch>1,
  regardless of max-num-seqs. The CONC=1 sanity check was already
  dropped to allow the conc>1 sweep entries; eval (gsm8k) at conc>1
  is the canary for kv_cache[:1] silent corruption.

Address @claude review on PR #1165:

- Add --trust-remote-code on both server launch and run_benchmark_serving
  (mirrors qwen3.5_fp8_mi355x_atom.sh:52,72 — the peer ATOM recipe on
  the same atom0.1.2.post image). The model has loaded successfully so
  far without it (ATOM's _CONFIG_REGISTRY does intercept), but the flag
  is the prevailing convention and a one-line defense if any startup
  path resolves AutoConfig.from_pretrained before the registry shim.

- Add --force-reinstall to both triton_kernels editable installs (the
  /triton-test/ path and the ROCm/triton fallback). Mirrors the atom
  install above, which uses --force-reinstall for the same wheel-shadow
  reason. Without it, pip can short-circuit the editable switch when
  the wheel's name/version match the source tree.

- Fix pr-link in perf-changelog.yaml from /pull/TODO to /pull/1165.
@Oseltamivir Oseltamivir merged commit a1adf12 into main Apr 26, 2026
17 checks passed
@Oseltamivir Oseltamivir deleted the mi355x-test branch April 26, 2026 08:47
Oseltamivir added a commit that referenced this pull request Apr 26, 2026
Duplicates the #1165 entry to retrigger dsv4-fp4-mi355x-atom against
the new commit (max-num-seqs floor of 4) on top of the merged Day-0
marker. Same configuration narrative; only pr-link differs.
@claude claude Bot mentioned this pull request Apr 26, 2026