update b200 runner #1192

Merged
Oseltamivir merged 1 commit into main from update-b200-runners
Apr 27, 2026
Conversation

@Oseltamivir
Collaborator

No description provided.

@Oseltamivir Oseltamivir requested a review from a team April 27, 2026 02:44
@Oseltamivir Oseltamivir merged commit 100a5ec into main Apr 27, 2026
4 checks passed
@Oseltamivir Oseltamivir deleted the update-b200-runners branch April 27, 2026 02:46
Comment on lines 54 to +63
- 'b200-cw_01'
- 'b200-nb_0'
- 'b200-nb_1'
- 'b200-dgxc_0'
- 'b200-dgxc_1'
- 'b200-dgxc_2'
- 'b200-dgxc_3'
- 'b200-dgxc_4'
- 'b200-dgxc_5'
- 'b200-dgxc_6'
- 'b200-dgxc_7'
- 'b200-dgxc_8'
- 'b200-dgxc_9'
- 'b200-dgxc_00'
- 'b200-dgxc_01'
- 'b200-dgxc_02'
- 'b200-dgxc_03'
- 'b200-dgxc_04'
- 'b200-dgxc_05'
- 'b200-dgxc_06'
Contributor

🔴 The b200-multinode runner labels at runners.yaml:74-77 still use the b200-dgxc-slurm_* prefix, but this PR renamed runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh. Since the workflow templates dispatch via bash ./runners/launch_${RUNNER_NAME%%_*}.sh, all six runner: b200-multinode jobs in nvidia-master.yaml will fail at the launch step with 'No such file or directory'. Fix by renaming the b200-multinode entries to use the b200-dgxc prefix (and re-registering the self-hosted runners), or by adding a launch_b200-dgxc-slurm.sh symlink/wrapper.

Extended reasoning...

What the bug is

Commit f5ffe76 performs a 100% rename of runners/launch_b200-dgxc-slurm.sh → runners/launch_b200-dgxc.sh (visible as R100 in git show f5ffe76 --name-status). The single-node b200 group in .github/configs/runners.yaml is also relabeled from b200-dgxc_0..9 to zero-padded b200-dgxc_00..09. However, the b200-multinode group at runners.yaml:74-77 is left untouched and still reads:

b200-multinode:
- 'b200-dgxc-slurm_6'
- 'b200-dgxc-slurm_7'
- 'b200-dgxc-slurm_8'

The code path that triggers it

All three workflow dispatchers compute the launch script from the runner name with bash parameter expansion that strips everything from the first underscore:

  • .github/workflows/benchmark-multinode-tmpl.yml:177 runs bash ./runners/launch_${RUNNER_NAME%%_*}.sh
  • .github/workflows/benchmark-tmpl.yml:154 (same)
  • .github/workflows/profile.yml:167 (same)

RUNNER_NAME is set from ${{ runner.name }}, i.e. the literal label from runners.yaml.

Step-by-step proof

  1. A runs-on: b200-multinode job lands on the runner named b200-dgxc-slurm_6.
  2. The workflow sets RUNNER_NAME=b200-dgxc-slurm_6.
  3. ${RUNNER_NAME%%_*} strips from the first _ to the end, yielding b200-dgxc-slurm (the hyphens are not delimiters).
  4. The workflow runs bash ./runners/launch_b200-dgxc-slurm.sh.
  5. That file no longer exists in the tree (ls runners/ shows only launch_b200-dgxc.sh; grep launch_b200-dgxc-slurm returns no hits).
  6. Bash exits with No such file or directory, and the workflow step fails.
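
The expansion in steps 2 to 4 can be reproduced in isolation. This is a minimal sketch: the helper name launcher_for is hypothetical and does not exist in the repository; only the ${RUNNER_NAME%%_*} expansion and the two runner names come from the discussion above.

```shell
#!/usr/bin/env bash
# Derive the launcher path from a runner name the same way the workflows do.
# "%%_*" removes the longest suffix matching "_*", i.e. everything from the
# first underscore onward. Hyphens are left alone.
# NOTE: launcher_for is a hypothetical helper used only for illustration.
launcher_for() {
  local runner_name="$1"
  printf 'runners/launch_%s.sh\n' "${runner_name%%_*}"
}

launcher_for 'b200-dgxc-slurm_6'   # prints runners/launch_b200-dgxc-slurm.sh
launcher_for 'b200-dgxc_06'        # prints runners/launch_b200-dgxc.sh
```

The first call resolves to the script this PR renamed away; the second resolves to the script that now exists.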

Why existing code does not prevent it

Nothing else in the dispatch chain re-maps the prefix; the workflow uses the raw runner-name prefix to pick the script. The renamed launch_b200-dgxc.sh even contains the IS_MULTINODE=true branch, showing the intent was to cover the multinode case — but the runner labels were not updated to match.
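
One way to catch this class of mismatch before merge would be a small pre-flight check. The sketch below is an assumption, not something the repository contains: check_launchers is a hypothetical function, and the labels are passed as arguments rather than parsed from runners.yaml, since the file's full layout is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical pre-merge check: report runner labels whose derived launcher
# script is missing. Returns non-zero if any label has no matching script.
check_launchers() {
  local runners_dir="$1"; shift
  local label script missing=0
  for label in "$@"; do
    # Same prefix derivation the workflow templates use.
    script="${runners_dir}/launch_${label%%_*}.sh"
    if [[ ! -f "$script" ]]; then
      echo "missing launcher for ${label}: ${script}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Demo against a throwaway directory that mimics the post-rename tree.
demo_dir="$(mktemp -d)"
touch "${demo_dir}/launch_b200-dgxc.sh"
check_launchers "$demo_dir" 'b200-dgxc_00' 'b200-dgxc-slurm_6' \
  || echo "label/launcher mismatch detected"
rm -rf "$demo_dir"
```

In the demo, b200-dgxc_00 resolves to an existing file while b200-dgxc-slurm_6 does not, so the check reports the mismatch.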

Impact

.github/configs/nvidia-master.yaml declares runner: b200-multinode in six places (lines 5, 390, 6623, 6759, 6929, 7128), covering DSR1-FP4 / DSR1-FP8 dynamo-trt and dynamo-sglang multinode benchmarks. Every one of these will fail at the launch step immediately after this PR merges. The h100/h200 stacks are unaffected because their multinode labels (h100-dgxc-slurm_*, h200-dgxc-slurm_*) still match their existing launch_h100-dgxc-slurm.sh / launch_h200-dgxc-slurm.sh scripts — only b200 has the prefix mismatch after this PR.

Fix

Two straightforward options:

  1. Rename the b200-multinode entries to share the new prefix, e.g. b200-dgxc_06/07/08 (and re-register the self-hosted runners under the new labels) so ${RUNNER_NAME%%_*} resolves to b200-dgxc and dispatches to launch_b200-dgxc.sh.
  2. Restore the old name as a wrapper or symlink (runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh) so the existing labels keep resolving.
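
Option 2 can be tried out in a scratch directory before touching the repository. This is a sketch under stated assumptions: the scratch directory stands in for runners/, and the stub launcher body is invented purely so the demo has something to run; the real launch_b200-dgxc.sh is not reproduced here.

```shell
#!/usr/bin/env bash
# Sketch of option 2 in a scratch directory (a stand-in for runners/):
# keep the old launcher name alive as a relative symlink to the new one.
scratch="$(mktemp -d)"

# Stub launcher (hypothetical body) that reports the name it was invoked as.
printf '#!/usr/bin/env bash\necho "launched via $(basename "$0")"\n' \
  > "${scratch}/launch_b200-dgxc.sh"

# A relative target keeps the link valid wherever the repo is checked out.
ln -s launch_b200-dgxc.sh "${scratch}/launch_b200-dgxc-slurm.sh"

# Dispatch exactly as the workflows do, using an unchanged multinode label.
RUNNER_NAME='b200-dgxc-slurm_6'
result="$(bash "${scratch}/launch_${RUNNER_NAME%%_*}.sh")"
echo "$result"
rm -rf "$scratch"
```

With the symlink in place, the old label resolves again without re-registering any runners. A one-line wrapper script that execs the new launcher would serve the same purpose if symlinks are undesirable in the repository.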

cquil11 added a commit that referenced this pull request Apr 27, 2026
Matches the rename done on main in PR #1192 (commit 100a5ec). The
actual GitHub runner.name reports as 'b200-dgxc_NN', so
\${RUNNER_NAME%%_*} produces 'b200-dgxc' — the launcher must be named
launch_b200-dgxc.sh for the workflow's launcher-selection step to find
it. Without this rename, every job scheduled onto a b200-dgxc runner
fails immediately with "No such file or directory".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>