ci(disagg): fail before writing result file + surface real failure class by arygupt · Pull Request #1591 · SemiAnalysisAI/InferenceX

arygupt · 2026-05-29T22:59:10Z

What

Two independent, verified disagg-CI hardening fixes that make AMD multinode failures fail loudly and legibly instead of silently producing misleading artifacts.

A — `benchmark_serving.py`: gate before write

The request-failure-rate gate ran after the result JSON was written, so a zero-/sub-threshold run left a schema-valid JSON on disk and then exited non-zero. Downstream collectors (launch_mi355x-amds.sh, benchmark-multinode-tmpl.yml) key on file existence, not exit code — so a written-then-failed file looked successful. Moved the gate above if args.save_result: so disk state stays consistent with the exit code. Complements #1590 (that guard is downstream in process_result; this stops the misleading file at the source).

B — `launch_mi355x-amds.sh`: emit `::error::` with the real failure class

cleanup_and_save_logs tailed the slurm .err but emitted no GitHub annotation, so the Actions UI showed only the generic "No benchmark result files found". Now it classifies the .err and emits one ::error:: naming the class — recipe model-not-found (#1581) / readiness-timeout / fp8_blockwise IntraNode combine (#1584) / MoRI transport / segfault — deterministic recipe errors checked first so a config bug is never mislabeled as a flake. Guarded on GITHUB_ACTIONS.

Held for follow-up

Bounded flake-only retry + node-quarantine in benchmark-multinode-tmpl.yml. Needs the submit.sh node-allocation path traced to thread a --exclude, plus offline classifier validation against canned .err fixtures — not shipping a half-verified control-flow change to production CI.

Test

A: python3 -m py_compile passes; gate logic unchanged, only reordered (no re-indent). Unit-testable by forcing completed=0 / completed<95% and asserting no JSON is written + exit≠0.
B: bash -n passes; pure observability inside the existing EXIT trap, runs only under GITHUB_ACTIONS.

🤖 Generated with Claude Code

Note

Low Risk
CI-only ordering and observability changes; benchmark gate logic is unchanged, and launcher edits only add stderr pattern matching for Actions annotations.

Overview
Hardens AMD disaggregated multinode CI so failed runs are easier to diagnose and no longer look successful downstream.

In benchmark_serving.py, the 5% request failure-rate check now runs before any result JSON is written. Collectors treat a result file on disk as success even when the process exits non-zero; moving the gate avoids leaving a valid JSON artifact on sub-threshold or zero-completion runs.

In launch_mi355x-amds.sh, cleanup_and_save_logs still tails the Slurm .err log but now emits a GitHub ::error:: annotation (when GITHUB_ACTIONS is set) that classifies the failure—recipe/model-not-found and MoRI fp8_blockwise config issues are matched before transport-flake patterns so deterministic misconfigurations are not labeled as flakes.

^{Reviewed by Cursor Bugbot for commit 992b90f. Bugbot is set up for automated code reviews on this repo. Configure here.}

Two independent disagg-CI hardening fixes (both verified drop-in): A) benchmark_serving.py: move the request-failure-rate gate ABOVE the `if args.save_result:` block. Previously the result JSON was written (json.dump + pytorch sidecar) and THEN the gate raised SystemExit, so a zero-/sub-threshold run left a schema-valid JSON on disk. Downstream collectors (launch_mi355x-amds.sh, benchmark-multinode-tmpl.yml) key on file EXISTENCE, not exit code, so a written-then-failed file looks successful. Gating first keeps disk state consistent with the exit code. Complements PR #1590's process_result zero-guard (that one is downstream; this stops the misleading file at the source). B) launch_mi355x-amds.sh cleanup_and_save_logs: after tailing the slurm .err, classify it and emit a single GitHub ::error:: annotation naming the failure class (recipe model-not-found / readiness-timeout / fp8_blockwise IntraNode combine / MoRI transport / segfault), with a generic fallback. Without this the Actions UI shows only "No benchmark result files found". Deterministic recipe errors are checked first so a config bug is never mislabeled as a transport flake. Guarded on GITHUB_ACTIONS. Held for a follow-up: bounded flake-only retry + node-quarantine in benchmark-multinode-tmpl.yml (needs the submit.sh node-allocation path traced to thread a --exclude, and offline classifier validation against canned .err fixtures). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 992b90f. Configure here.}

cursor · 2026-05-29T23:03:40Z

+                else
+                    echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::Unclassified failure - see last 100 lines of slurm .err above"
+                fi
+            fi


Error annotation emits on successful runs with stderr

Medium Severity

The cleanup_and_save_logs EXIT trap fires on all exits, including successful ones. The new ::error:: annotation block only checks whether the .err file is non-empty (-s), not whether the job actually failed. Slurm .err files commonly contain content on success (Python deprecation warnings, CUDA init messages, library logs). On a successful run with any stderr output, this will either match a broad pattern like MoRI (line 84 — likely present in normal operational logs) and emit a false transport-flake annotation, or fall through and emit the misleading "Unclassified failure" annotation. This defeats the purpose of surfacing real failure classes by flooding the Actions UI with false positives.

^{Reviewed by Cursor Bugbot for commit 992b90f. Configure here.}

claude · 2026-05-29T23:13:31Z

+            # Surface the real failure class in the Actions UI. Without this, a
+            # launch failure shows only the generic "No benchmark result files
+            # found" from benchmark-multinode-tmpl.yml. Order matters: check the
+            # deterministic recipe error (model-not-found, #1581) before the
+            # transport-flake patterns (#1584 MoRI/readiness) so a config bug is
+            # never mislabeled as a flake.
+            if [[ -n "${GITHUB_ACTIONS:-}" ]]; then
+                local sig=""
+                if   grep -qiE "Model '.*' not found|FATAL: Model|model .* not found" "$err_file"; then
+                    sig="recipe-error: model not found (deterministic - check MODEL/MODEL_PATH, not MoRI)"
+                elif grep -qiE "ReadTimeout|readiness.*timeout|warmup.*time(d)? ?out|health.*timeout" "$err_file"; then
+                    sig="transport-flake: readiness/warmup timeout (MoRI pd-disagg)"
+                elif grep -qiE "Fp8BlockwiseQuant.*IntraNode|dispatch_combine|combine.*IntraNode" "$err_file"; then
+                    sig="config-error: MoRI fp8_blockwise combine needs IntraNode (disable TBO/SDMA on FP4 prefill, #1584)"
+                elif grep -qiE "MoRI|mori_conn|pd[- ]?disagg" "$err_file"; then
+                    sig="transport-flake: MoRI KV-transport error"
+                elif grep -qiE "segfault|Segmentation fault|signal 11|core dumped|gpucore" "$err_file"; then
+                    sig="transport-flake: server segfault / core dump"
+                fi
+                if [[ -n "$sig" ]]; then
+                    echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::${sig} (see slurm .err artifact)"
+                else
+                    echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::Unclassified failure - see last 100 lines of slurm .err above"
+                fi
+            fi


🔴 The new ::error:: annotation block in cleanup_and_save_logs is gated only on stderr being non-empty and GITHUB_ACTIONS being set — neither implies the job actually failed. Since the trap runs on EXIT unconditionally (line 101) and SLURM .err is essentially always non-empty on a successful MoRI run (NCCL/MoRI INFO banners, tqdm to stderr, Python warnings, srun output), every green run will emit a red ::error title=AMD disagg job ... failed:: banner — inverting the PR's stated goal. Fix: capture local rc=$? as the very first line of the function and gate the entire ::error:: emission on rc -ne 0 (leaving the diagnostic .err tail unconditional).

Extended reasoning...

What the bug is

cleanup_and_save_logs is registered as a generic EXIT trap at line 101 (trap cleanup_and_save_logs EXIT), so it runs on every script exit — both success (exit 0) and failure. The new annotation block (lines 76–94) is guarded by only two conditions:

[[ -s "$err_file" ]] — stderr file is non-empty (line 66 enclosing block)

[[ -n "${GITHUB_ACTIONS:-}" ]] — running under Actions (line 76)

Neither indicates that the job actually failed. There is no $? capture and no exit-code gate.

Why a successful run will hit it

SGLang/vLLM multinode jobs with MoRI + NCCL + SLURM essentially always produce non-empty stderr on success:

tqdm writes progress bars to stderr by default

NCCL/RCCL prints INFO banners to stderr

warnings.warn(...) writes to stderr

SGLang/vLLM server INFO logs often route to stderr

srun --unbuffered and docker stop $(docker ps -a -q) || true (job.slurm) leak banners and "no such container" messages to stderr — the trailing || true swallows the exit code but not the output

The submit.sh wrapper sets --error=slurm_job-%j.err, so all of the above lands in the file the trap inspects

Two false-positive paths on green runs

Pattern match (line 84): grep -qiE "MoRI|mori_conn|pd[- ]?disagg" matches case-insensitively against the literal string MoRI anywhere in stderr. For a MoRI-based benchmark this is effectively guaranteed (framework name appears in INFO/log lines and startup banners). Result: ::error title=AMD disagg job <id> failed::transport-flake: MoRI KV-transport error.

Else branch (line 92): Even when no class matches, any non-empty unmatched stderr emits ::error title=AMD disagg job <id> failed::Unclassified failure - see last 100 lines of slurm .err above.

Either path renders a red banner in the Actions UI claiming the job FAILED on a run whose exit code is 0 and whose result JSON was written normally.

Step-by-step proof

Script runs, trap cleanup_and_save_logs EXIT is set (line 101).

Multinode benchmark completes successfully, exit code 0.

cleanup_and_save_logs fires on EXIT.

err_file=$BENCHMARK_LOGS_DIR/slurm_job-${JOB_ID}.err is non-empty (contains NCCL/MoRI INFO banners + tqdm + srun stderr).

[[ -s "$err_file" ]] → true → enter the block.

[[ -n "${GITHUB_ACTIONS:-}" ]] → true (we are in CI) → enter annotation block.

First three patterns (Model not found, ReadTimeout, Fp8BlockwiseQuant) don't match a clean run.

Fourth elif: grep -qiE "MoRI|mori_conn|pd[- ]?disagg" "$err_file" → matches the literal "MoRI" in any startup log line → sig="transport-flake: MoRI KV-transport error".

Line 90 emits ::error title=AMD disagg job <id> failed::transport-flake: MoRI KV-transport error (see slurm .err artifact).

Actions UI shows a red error banner on a successful, exit-0 run.

Existing code already acknowledges this

The pre-existing if [[ -s "$err_file" ]] guard at line 66 plus the unconditional tail -100 block already implicitly acknowledges that .err is routinely non-empty on non-failure runs (otherwise the tail print wouldn't be conditional). What the new block does is escalate that benign non-emptiness into a misleading ::error:: annotation.

Impact

The annotation directly defeats the PR's stated goal ("surface real failure class", "fail loudly and legibly"). After a few green-but-red runs, reviewers learn to ignore the annotation entirely, and the next real failure looks identical to the prior false positives.

Fix

Capture the trap-entry exit code as the very first line of cleanup_and_save_logs and gate only the ::error:: emission on it. The diagnostic .err tail (lines 67–69) is fine to leave unconditional since it's only printed output, not an Actions annotation:

cleanup_and_save_logs() { local rc=$? # MUST be first statement # ... existing cp/tail logic unchanged ... if [[ $rc -ne 0 && -n "${GITHUB_ACTIONS:-}" && -s "$err_file" ]]; then # existing classifier + ::error:: emission fi }

arygupt requested a review from a team May 29, 2026 22:59

github-project-automation Bot added this to InferenceMAX Board May 29, 2026

cursor Bot reviewed May 29, 2026

View reviewed changes

claude Bot reviewed May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci(disagg): fail before writing result file + surface real failure class#1591

ci(disagg): fail before writing result file + surface real failure class#1591
arygupt wants to merge 1 commit into
mainfrom
ci/disagg-hardening-fail-loud

arygupt commented May 29, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot May 29, 2026

Uh oh!

claude Bot May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

arygupt commented May 29, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

A — benchmark_serving.py: gate before write

B — launch_mi355x-amds.sh: emit ::error:: with the real failure class

Held for follow-up

Test

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 29, 2026

Choose a reason for hiding this comment

Error annotation emits on successful runs with stderr

Uh oh!

claude Bot May 29, 2026

Choose a reason for hiding this comment

What the bug is

Why a successful run will hit it

Two false-positive paths on green runs

Step-by-step proof

Existing code already acknowledges this

Impact

Fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

arygupt commented May 29, 2026 •

edited by cursor Bot

Loading

A — `benchmark_serving.py`: gate before write

B — `launch_mi355x-amds.sh`: emit `::error::` with the real failure class