Skip to content

ci(disagg): fail before writing result file + surface real failure class#1591

Open
arygupt wants to merge 1 commit into
mainfrom
ci/disagg-hardening-fail-loud
Open

ci(disagg): fail before writing result file + surface real failure class#1591
arygupt wants to merge 1 commit into
mainfrom
ci/disagg-hardening-fail-loud

Conversation

@arygupt
Copy link
Copy Markdown
Collaborator

@arygupt arygupt commented May 29, 2026

What

Two independent, verified disagg-CI hardening fixes that make AMD multinode failures fail loudly and legibly instead of silently producing misleading artifacts.

A — benchmark_serving.py: gate before write

The request-failure-rate gate ran after the result JSON was written, so a zero-/sub-threshold run left a schema-valid JSON on disk and then exited non-zero. Downstream collectors (launch_mi355x-amds.sh, benchmark-multinode-tmpl.yml) key on file existence, not exit code — so a written-then-failed file looked successful. Moved the gate above if args.save_result: so disk state stays consistent with the exit code. Complements #1590 (that guard is downstream in process_result; this stops the misleading file at the source).

B — launch_mi355x-amds.sh: emit ::error:: with the real failure class

cleanup_and_save_logs tailed the slurm .err but emitted no GitHub annotation, so the Actions UI showed only the generic "No benchmark result files found". Now it classifies the .err and emits one ::error:: naming the class — recipe model-not-found (#1581) / readiness-timeout / fp8_blockwise IntraNode combine (#1584) / MoRI transport / segfault — deterministic recipe errors checked first so a config bug is never mislabeled as a flake. Guarded on GITHUB_ACTIONS.

Held for follow-up

Bounded flake-only retry + node-quarantine in benchmark-multinode-tmpl.yml. Needs the submit.sh node-allocation path traced to thread a --exclude, plus offline classifier validation against canned .err fixtures — not shipping a half-verified control-flow change to production CI.

Test

  • A: python3 -m py_compile passes; gate logic unchanged, only reordered (no re-indent). Unit-testable by forcing completed=0 / completed<95% and asserting no JSON is written + exit≠0.
  • B: bash -n passes; pure observability inside the existing EXIT trap, runs only under GITHUB_ACTIONS.

🤖 Generated with Claude Code


Note

Low Risk
CI-only ordering and observability changes; benchmark gate logic is unchanged, and launcher edits only add stderr pattern matching for Actions annotations.

Overview
Hardens AMD disaggregated multinode CI so failed runs are easier to diagnose and no longer look successful downstream.

In benchmark_serving.py, the 5% request failure-rate check now runs before any result JSON is written. Collectors treat a result file on disk as success even when the process exits non-zero; moving the gate avoids leaving a valid JSON artifact on sub-threshold or zero-completion runs.

In launch_mi355x-amds.sh, cleanup_and_save_logs still tails the Slurm .err log but now emits a GitHub ::error:: annotation (when GITHUB_ACTIONS is set) that classifies the failure—recipe/model-not-found and MoRI fp8_blockwise config issues are matched before transport-flake patterns so deterministic misconfigurations are not labeled as flakes.

Reviewed by Cursor Bugbot for commit 992b90f. Bugbot is set up for automated code reviews on this repo. Configure here.

Two independent disagg-CI hardening fixes (both verified drop-in):

A) benchmark_serving.py: move the request-failure-rate gate ABOVE the
   `if args.save_result:` block. Previously the result JSON was written
   (json.dump + pytorch sidecar) and THEN the gate raised SystemExit, so a
   zero-/sub-threshold run left a schema-valid JSON on disk. Downstream
   collectors (launch_mi355x-amds.sh, benchmark-multinode-tmpl.yml) key on
   file EXISTENCE, not exit code, so a written-then-failed file looks
   successful. Gating first keeps disk state consistent with the exit code.
   Complements PR #1590's process_result zero-guard (that one is downstream;
   this stops the misleading file at the source).

B) launch_mi355x-amds.sh cleanup_and_save_logs: after tailing the slurm .err,
   classify it and emit a single GitHub ::error:: annotation naming the
   failure class (recipe model-not-found / readiness-timeout / fp8_blockwise
   IntraNode combine / MoRI transport / segfault), with a generic fallback.
   Without this the Actions UI shows only "No benchmark result files found".
   Deterministic recipe errors are checked first so a config bug is never
   mislabeled as a transport flake. Guarded on GITHUB_ACTIONS.

Held for a follow-up: bounded flake-only retry + node-quarantine in
benchmark-multinode-tmpl.yml (needs the submit.sh node-allocation path traced
to thread a --exclude, and offline classifier validation against canned
.err fixtures).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@arygupt arygupt requested a review from a team May 29, 2026 22:59
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 992b90f. Configure here.

else
echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::Unclassified failure - see last 100 lines of slurm .err above"
fi
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error annotation emits on successful runs with stderr

Medium Severity

The cleanup_and_save_logs EXIT trap fires on all exits, including successful ones. The new ::error:: annotation block only checks whether the .err file is non-empty (-s), not whether the job actually failed. Slurm .err files commonly contain content on success (Python deprecation warnings, CUDA init messages, library logs). On a successful run with any stderr output, this will either match a broad pattern like MoRI (line 84 — likely present in normal operational logs) and emit a false transport-flake annotation, or fall through and emit the misleading "Unclassified failure" annotation. This defeats the purpose of surfacing real failure classes by flooding the Actions UI with false positives.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 992b90f. Configure here.

Comment on lines +70 to +94
# Surface the real failure class in the Actions UI. Without this, a
# launch failure shows only the generic "No benchmark result files
# found" from benchmark-multinode-tmpl.yml. Order matters: check the
# deterministic recipe error (model-not-found, #1581) before the
# transport-flake patterns (#1584 MoRI/readiness) so a config bug is
# never mislabeled as a flake.
if [[ -n "${GITHUB_ACTIONS:-}" ]]; then
local sig=""
if grep -qiE "Model '.*' not found|FATAL: Model|model .* not found" "$err_file"; then
sig="recipe-error: model not found (deterministic - check MODEL/MODEL_PATH, not MoRI)"
elif grep -qiE "ReadTimeout|readiness.*timeout|warmup.*time(d)? ?out|health.*timeout" "$err_file"; then
sig="transport-flake: readiness/warmup timeout (MoRI pd-disagg)"
elif grep -qiE "Fp8BlockwiseQuant.*IntraNode|dispatch_combine|combine.*IntraNode" "$err_file"; then
sig="config-error: MoRI fp8_blockwise combine needs IntraNode (disable TBO/SDMA on FP4 prefill, #1584)"
elif grep -qiE "MoRI|mori_conn|pd[- ]?disagg" "$err_file"; then
sig="transport-flake: MoRI KV-transport error"
elif grep -qiE "segfault|Segmentation fault|signal 11|core dumped|gpucore" "$err_file"; then
sig="transport-flake: server segfault / core dump"
fi
if [[ -n "$sig" ]]; then
echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::${sig} (see slurm .err artifact)"
else
echo "::error title=AMD disagg job ${JOB_ID:-unknown} failed::Unclassified failure - see last 100 lines of slurm .err above"
fi
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new ::error:: annotation block in cleanup_and_save_logs is gated only on stderr being non-empty and GITHUB_ACTIONS being set — neither implies the job actually failed. Since the trap runs on EXIT unconditionally (line 101) and SLURM .err is essentially always non-empty on a successful MoRI run (NCCL/MoRI INFO banners, tqdm to stderr, Python warnings, srun output), every green run will emit a red ::error title=AMD disagg job ... failed:: banner — inverting the PR's stated goal. Fix: capture local rc=$? as the very first line of the function and gate the entire ::error:: emission on rc -ne 0 (leaving the diagnostic .err tail unconditional).

Extended reasoning...

What the bug is

cleanup_and_save_logs is registered as a generic EXIT trap at line 101 (trap cleanup_and_save_logs EXIT), so it runs on every script exit — both success (exit 0) and failure. The new annotation block (lines 76–94) is guarded by only two conditions:

  1. [[ -s "$err_file" ]] — stderr file is non-empty (line 66 enclosing block)
  2. [[ -n "${GITHUB_ACTIONS:-}" ]] — running under Actions (line 76)

Neither indicates that the job actually failed. There is no $? capture and no exit-code gate.

Why a successful run will hit it

SGLang/vLLM multinode jobs with MoRI + NCCL + SLURM essentially always produce non-empty stderr on success:

  • tqdm writes progress bars to stderr by default
  • NCCL/RCCL prints INFO banners to stderr
  • warnings.warn(...) writes to stderr
  • SGLang/vLLM server INFO logs often route to stderr
  • srun --unbuffered and docker stop $(docker ps -a -q) || true (job.slurm) leak banners and "no such container" messages to stderr — the trailing || true swallows the exit code but not the output
  • The submit.sh wrapper sets --error=slurm_job-%j.err, so all of the above lands in the file the trap inspects

Two false-positive paths on green runs

  1. Pattern match (line 84): grep -qiE "MoRI|mori_conn|pd[- ]?disagg" matches case-insensitively against the literal string MoRI anywhere in stderr. For a MoRI-based benchmark this is effectively guaranteed (framework name appears in INFO/log lines and startup banners). Result: ::error title=AMD disagg job <id> failed::transport-flake: MoRI KV-transport error.
  2. Else branch (line 92): Even when no class matches, any non-empty unmatched stderr emits ::error title=AMD disagg job <id> failed::Unclassified failure - see last 100 lines of slurm .err above.

Either path renders a red banner in the Actions UI claiming the job FAILED on a run whose exit code is 0 and whose result JSON was written normally.

Step-by-step proof

  1. Script runs, trap cleanup_and_save_logs EXIT is set (line 101).
  2. Multinode benchmark completes successfully, exit code 0.
  3. cleanup_and_save_logs fires on EXIT.
  4. err_file=$BENCHMARK_LOGS_DIR/slurm_job-${JOB_ID}.err is non-empty (contains NCCL/MoRI INFO banners + tqdm + srun stderr).
  5. [[ -s "$err_file" ]] → true → enter the block.
  6. [[ -n "${GITHUB_ACTIONS:-}" ]] → true (we are in CI) → enter annotation block.
  7. First three patterns (Model not found, ReadTimeout, Fp8BlockwiseQuant) don't match a clean run.
  8. Fourth elif: grep -qiE "MoRI|mori_conn|pd[- ]?disagg" "$err_file" → matches the literal "MoRI" in any startup log line → sig="transport-flake: MoRI KV-transport error".
  9. Line 90 emits ::error title=AMD disagg job <id> failed::transport-flake: MoRI KV-transport error (see slurm .err artifact).
  10. Actions UI shows a red error banner on a successful, exit-0 run.

Existing code already acknowledges this

The pre-existing if [[ -s "$err_file" ]] guard at line 66 plus the unconditional tail -100 block already implicitly acknowledges that .err is routinely non-empty on non-failure runs (otherwise the tail print wouldn't be conditional). What the new block does is escalate that benign non-emptiness into a misleading ::error:: annotation.

Impact

The annotation directly defeats the PR's stated goal ("surface real failure class", "fail loudly and legibly"). After a few green-but-red runs, reviewers learn to ignore the annotation entirely, and the next real failure looks identical to the prior false positives.

Fix

Capture the trap-entry exit code as the very first line of cleanup_and_save_logs and gate only the ::error:: emission on it. The diagnostic .err tail (lines 67–69) is fine to leave unconditional since it's only printed output, not an Actions annotation:

cleanup_and_save_logs() {
    local rc=$?  # MUST be first statement
    # ... existing cp/tail logic unchanged ...
    if [[ $rc -ne 0 && -n "${GITHUB_ACTIONS:-}" && -s "$err_file" ]]; then
        # existing classifier + ::error:: emission
    fi
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant