
trigger H100 multinode evals#1120

Merged
Oseltamivir merged 3 commits intomainfrom
h100-multinode-eval-v2
Apr 23, 2026
Conversation

@Oseltamivir
Collaborator

No description provided.

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a documentation PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

@Oseltamivir Oseltamivir merged commit ed6006e into main Apr 23, 2026
16 checks passed
@Oseltamivir Oseltamivir deleted the h100-multinode-eval-v2 branch April 23, 2026 04:05

@claude (Bot) left a comment


Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 perf-changelog.yaml:1705-1711 — This changelog entry is missing evals-only: true, which will cause unintended H100 multinode benchmark sweeps to run alongside the intended eval runs for both dsr1-fp8-h100-dynamo-trt and dsr1-fp8-h100-dynamo-sglang. The PR title/description indicate evals-only intent, and every other analogous eval-trigger entry in this file (PRs #558, #892, #911, #1000, and the directly parallel H200 entry from #1094) sets the flag. Also a minor nit: pr-link points to #1119 (the prior timeout-fix PR referenced in the description) instead of this PR — convention across the file is that pr-link identifies the PR that adds the entry.

    Extended reasoning

    The bug

    The new changelog entry at perf-changelog.yaml:1705-1711 adds an entry for dsr1-fp8-h100-dynamo-trt and dsr1-fp8-h100-dynamo-sglang without an evals-only: true field. The intent (per PR title "trigger H100 multinode evals" and description "Trigger H100 multinode evals after dist-timeout and health-check timeout fixes") is clearly to kick off evals-only for these two config-keys — nothing about the entry signals a perf/benchmark change, and there are no recipe or configuration changes in this PR.

    Code path that triggers unintended benchmark runs

    In utils/process_changelog.py (around line 107), benchmark-config generation is gated on if not entry.evals_only:. The field defaults to False (utils/matrix_logic/validation.py:345: evals_only: bool = Field(alias="evals-only", default=False)), so omitting the field takes the default branch that generates both benchmark configs (lines ~108-132) and eval configs (lines ~134-158).
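The silent-default behavior described above can be illustrated with a minimal, self-contained sketch (plain Python standing in for the Pydantic model in utils/matrix_logic/validation.py; the class, function, and field names here are illustrative, not the repository's actual code):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangelogEntry:
    """Simplified stand-in for the validated changelog entry model."""
    config_keys: List[str]
    evals_only: bool = False  # mirrors Field(alias="evals-only", default=False)


def parse_entry(raw: dict) -> ChangelogEntry:
    # The YAML key "evals-only" maps onto evals_only; when the key is
    # omitted, the default of False applies with no warning.
    return ChangelogEntry(
        config_keys=list(raw.get("config-keys", [])),
        evals_only=bool(raw.get("evals-only", False)),
    )


# Entry as merged in this PR: no "evals-only" key at all.
merged = parse_entry({
    "config-keys": ["dsr1-fp8-h100-dynamo-trt", "dsr1-fp8-h100-dynamo-sglang"],
})
print(merged.evals_only)  # → False, so the benchmark branch will run
```

Adding `"evals-only": True` to the raw dict is the only thing that flips the parsed flag; nothing about the entry's description or config-keys influences it.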

    Concretely, when the changelog-processing CI job runs on the post-merge state, it sees the added entry with evals_only=False, so it iterates the config-keys list into benchmark_configs, invokes generate_sweep_configs.py with --no-evals (producing a full benchmark sweep), and separately invokes the evals branch. Both dsr1-fp8-h100-dynamo-trt and dsr1-fp8-h100-dynamo-sglang exist in .github/configs/nvidia-master.yaml, so these are fully realized multinode sweeps — not filtered out as unknown keys.

    Why the existing code doesn't prevent it

    The evals_only default is False; there is no heuristic that infers eval-only intent from the description, nor any lint preventing eval-trigger entries from omitting the flag. PR-review is effectively the only line of defense. The immediate-above entry that this PR is patterned after (PR #1094 "Add H200 multinode evals-only runs" at lines 1673-1680) correctly sets evals-only: true, and every other multinode eval trigger entry in this file (PRs #558, #892, #911, #1000, #1094) also sets it. This entry breaks that unbroken convention.
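No such lint exists in the repository today. Purely as an illustration of what one could look like, here is a hedged sketch (hypothetical function and heuristic, not actual project code) that flags entries whose description suggests an eval trigger but which omit the flag:

```python
from typing import List


def lint_eval_trigger(entry: dict) -> List[str]:
    """Warn when an entry's description suggests evals-only intent but the
    'evals-only' flag is absent. Heuristic and illustrative only."""
    warnings = []
    descriptions = entry.get("description", [])
    looks_like_eval_trigger = any("eval" in d.lower() for d in descriptions)
    if looks_like_eval_trigger and not entry.get("evals-only", False):
        warnings.append(
            "description mentions evals but 'evals-only: true' is not set; "
            "a full benchmark sweep will also be generated"
        )
    return warnings


# The entry from this PR trips the warning; the corrected entry would not.
print(lint_eval_trigger({
    "config-keys": ["dsr1-fp8-h100-dynamo-trt"],
    "description": ["Trigger H100 multinode evals after timeout fixes"],
}))
```

A keyword heuristic like this would produce false positives on genuine benchmark entries that happen to mention evals, which may be why the project relies on review instead.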

    Step-by-step proof

    1. CI runs utils/process_changelog.py on the post-merge tree and reads the newly added block at lines 1705-1711.
    2. Pydantic validation runs on the entry; since evals-only is absent, entry.evals_only defaults to False (validation.py:345).
    3. process_changelog.py:107 evaluates `if not entry.evals_only:` as True → enters the benchmark branch.
    4. config-keys = [dsr1-fp8-h100-dynamo-trt, dsr1-fp8-h100-dynamo-sglang] are added to benchmark_configs (neither has been seen before in this PR's changelog_data, which contains only this one added block).
    5. generate_sweep_configs.py --no-evals is invoked for both config-keys, producing a full multinode benchmark sweep across all sequence lengths and concurrencies defined in nvidia-master.yaml.
    6. The code then falls through to the evals branch and also generates the intended eval configs.
    7. Net effect: both a full H100 multinode benchmark sweep and the intended eval run are launched, costing significant multinode GPU time and polluting the perf-history classification for these config-keys (they'll show perf-changelog entries that imply a perf/benchmark change when the PR made no recipe or config change).

    How to fix

    Add evals-only: true to the entry, matching the immediately analogous H200 predecessor at lines 1673-1680:

    - config-keys:
        - dsr1-fp8-h100-dynamo-trt
        - dsr1-fp8-h100-dynamo-sglang
      description:
        - "Trigger H100 multinode evals after dist-timeout and health-check timeout fixes"
      pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1120
      evals-only: true

    Also (nit): update pr-link from /pull/1119 to /pull/1120. The file-wide convention (documented at .github/workflows/claude.yml:243 and followed by 160+ entries including the analogous H200 entry which links to its own #1094) is that pr-link identifies the PR that adds the row, not a motivating reference PR. Precedent: the dsr1-fp8-h100-dynamo-trt entry at PR #663 references PR #651 inside its description text but still uses #663 as its pr-link. The pr-link has no runtime semantics (it's only type-checked as a string in validation.py:344), so this part is documentation/traceability only — hence noted as a nit, not blocking.

