
trigger H100 multinode evals#1120

Merged
Oseltamivir merged 3 commits intomainfrom
h100-multinode-eval-v2
Apr 23, 2026
Conversation

@Oseltamivir
Collaborator

No description provided.

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a documentation PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment

@Oseltamivir Oseltamivir merged commit ed6006e into main Apr 23, 2026
16 checks passed
@Oseltamivir Oseltamivir deleted the h100-multinode-eval-v2 branch April 23, 2026 04:05

@claude (Bot) left a comment


Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 perf-changelog.yaml:1705-1711 — This changelog entry is missing evals-only: true, which will cause unintended H100 multinode benchmark sweeps to run alongside the intended eval runs for both dsr1-fp8-h100-dynamo-trt and dsr1-fp8-h100-dynamo-sglang. The PR title/description indicate evals-only intent, and every other analogous eval-trigger entry in this file (PRs #558, #892, #911, #1000, and the directly parallel H200 entry from #1094) sets the flag. Also a minor nit: pr-link points to #1119 (the prior timeout-fix PR referenced in the description) instead of this PR — convention across the file is that pr-link identifies the PR that adds the entry.

    Extended reasoning

    The bug

    The new changelog entry at perf-changelog.yaml:1705-1711 adds an entry for dsr1-fp8-h100-dynamo-trt and dsr1-fp8-h100-dynamo-sglang without an evals-only: true field. The intent (per PR title "trigger H100 multinode evals" and description "Trigger H100 multinode evals after dist-timeout and health-check timeout fixes") is clearly to kick off evals-only for these two config-keys — nothing about the entry signals a perf/benchmark change, and there are no recipe or configuration changes in this PR.

    Code path that triggers unintended benchmark runs

    In utils/process_changelog.py (around line 107), benchmark-config generation is gated on if not entry.evals_only:. The field defaults to False (utils/matrix_logic/validation.py:345: evals_only: bool = Field(alias="evals-only", default=False)), so omitting the field takes the default branch that generates both benchmark configs (lines ~108-132) and eval configs (lines ~134-158).
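The silent-default behavior described above can be illustrated with a minimal, self-contained sketch (plain Python standing in for the Pydantic model in utils/matrix_logic/validation.py; the class, function, and field names here are illustrative, not the repository's actual code):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangelogEntry:
    """Simplified stand-in for the validated changelog entry model."""
    config_keys: List[str]
    evals_only: bool = False  # mirrors Field(alias="evals-only", default=False)


def parse_entry(raw: dict) -> ChangelogEntry:
    # The YAML key "evals-only" maps onto evals_only; when the key is
    # omitted, the default of False applies with no warning.
    return ChangelogEntry(
        config_keys=list(raw.get("config-keys", [])),
        evals_only=bool(raw.get("evals-only", False)),
    )


# Entry as merged in this PR: no "evals-only" key at all.
merged = parse_entry({
    "config-keys": ["dsr1-fp8-h100-dynamo-trt", "dsr1-fp8-h100-dynamo-sglang"],
})
print(merged.evals_only)  # → False, so the benchmark branch will run
```

Adding `"evals-only": True` to the raw dict is the only thing that flips the parsed flag; nothing about the entry's description or config-keys influences it.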

    Concretely, when the changelog-processing CI job runs on the post-merge state, it sees the added entry with evals_only=False, so it iterates the config-keys list into benchmark_configs, invokes generate_sweep_configs.py with --no-evals (producing a full benchmark sweep), and separately invokes the evals branch. Both dsr1-fp8-h100-dynamo-trt and dsr1-fp8-h100-dynamo-sglang exist in .github/configs/nvidia-master.yaml, so these are fully realized multinode sweeps — not filtered out as unknown keys.

    Why the existing code doesn't prevent it

    The evals_only default is False; there is no heuristic that infers eval-only intent from the description, nor any lint preventing eval-trigger entries from omitting the flag. PR-review is effectively the only line of defense. The immediate-above entry that this PR is patterned after (PR #1094 "Add H200 multinode evals-only runs" at lines 1673-1680) correctly sets evals-only: true, and every other multinode eval trigger entry in this file (PRs #558, #892, #911, #1000, #1094) also sets it. This entry breaks that unbroken convention.
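No such lint exists in the repository today. Purely as an illustration of what one could look like, here is a hedged sketch (hypothetical function and heuristic, not actual project code) that flags entries whose description suggests an eval trigger but which omit the flag:

```python
from typing import List


def lint_eval_trigger(entry: dict) -> List[str]:
    """Warn when an entry's description suggests evals-only intent but the
    'evals-only' flag is absent. Heuristic and illustrative only."""
    warnings = []
    descriptions = entry.get("description", [])
    looks_like_eval_trigger = any("eval" in d.lower() for d in descriptions)
    if looks_like_eval_trigger and not entry.get("evals-only", False):
        warnings.append(
            "description mentions evals but 'evals-only: true' is not set; "
            "a full benchmark sweep will also be generated"
        )
    return warnings


# The entry from this PR trips the warning; the corrected entry would not.
print(lint_eval_trigger({
    "config-keys": ["dsr1-fp8-h100-dynamo-trt"],
    "description": ["Trigger H100 multinode evals after timeout fixes"],
}))
```

A keyword heuristic like this would produce false positives on genuine benchmark entries that happen to mention evals, which may be why the project relies on review instead.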

    Step-by-step proof

    1. CI runs utils/process_changelog.py on the post-merge tree and reads the newly added block at lines 1705-1711.
    2. Pydantic validation runs on the entry; since evals-only is absent, entry.evals_only defaults to False (validation.py:345).
    3. process_changelog.py:107 evaluates `if not entry.evals_only:` as True → enters the benchmark branch.
    4. config-keys = [dsr1-fp8-h100-dynamo-trt, dsr1-fp8-h100-dynamo-sglang] are added to benchmark_configs (neither has been seen before in this PR's changelog_data, which contains only this one added block).
    5. generate_sweep_configs.py --no-evals is invoked for both config-keys, producing a full multinode benchmark sweep across all sequence lengths and concurrencies defined in nvidia-master.yaml.
    6. The code then falls through to the evals branch and also generates the intended eval configs.
    7. Net effect: both a full H100 multinode benchmark sweep and the intended eval run are launched, costing significant multinode GPU time and polluting the perf-history classification for these config-keys (they'll show perf-changelog entries that imply a perf/benchmark change when the PR made no recipe or config change).

    How to fix

    Add evals-only: true to the entry, matching the immediately analogous H200 predecessor at lines 1673-1680:

    - config-keys:
        - dsr1-fp8-h100-dynamo-trt
        - dsr1-fp8-h100-dynamo-sglang
      description:
        - "Trigger H100 multinode evals after dist-timeout and health-check timeout fixes"
      pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1120
      evals-only: true

    Also (nit): update pr-link from /pull/1119 to /pull/1120. The file-wide convention (documented at .github/workflows/claude.yml:243 and followed by 160+ entries including the analogous H200 entry which links to its own #1094) is that pr-link identifies the PR that adds the row, not a motivating reference PR. Precedent: the dsr1-fp8-h100-dynamo-trt entry at PR #663 references PR #651 inside its description text but still uses #663 as its pr-link. The pr-link has no runtime semantics (it's only type-checked as a string in validation.py:344), so this part is documentation/traceability only — hence noted as a nit, not blocking.

