update b200 runner #1192
- 'b200-cw_01'
- 'b200-nb_0'
- 'b200-nb_1'
- 'b200-dgxc_0'
- 'b200-dgxc_1'
- 'b200-dgxc_2'
- 'b200-dgxc_3'
- 'b200-dgxc_4'
- 'b200-dgxc_5'
- 'b200-dgxc_6'
- 'b200-dgxc_7'
- 'b200-dgxc_8'
- 'b200-dgxc_9'
- 'b200-dgxc_00'
- 'b200-dgxc_01'
- 'b200-dgxc_02'
- 'b200-dgxc_03'
- 'b200-dgxc_04'
- 'b200-dgxc_05'
- 'b200-dgxc_06'
🔴 The b200-multinode runner labels at runners.yaml:74-77 still use the b200-dgxc-slurm_* prefix, but this PR renamed runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh. Since the workflow templates dispatch via bash ./runners/launch_${RUNNER_NAME%%_*}.sh, all six runner: b200-multinode jobs in nvidia-master.yaml will fail at the launch step with 'No such file or directory'. Fix by renaming the b200-multinode entries to use the b200-dgxc prefix (and re-registering the self-hosted runners), or by adding a launch_b200-dgxc-slurm.sh symlink/wrapper.
Extended reasoning...
What the bug is
Commit f5ffe76 performs a 100% rename of runners/launch_b200-dgxc-slurm.sh → runners/launch_b200-dgxc.sh (visible as R100 in git show f5ffe76 --name-status). The single-node b200 group in .github/configs/runners.yaml is also relabeled from b200-dgxc_0..9 to zero-padded b200-dgxc_00..09. However the b200-multinode group at runners.yaml:74-77 is left untouched and still reads:
b200-multinode:
- 'b200-dgxc-slurm_6'
- 'b200-dgxc-slurm_7'
- 'b200-dgxc-slurm_8'
The code path that triggers it
All three workflow dispatchers compute the launch script from the runner name with bash parameter expansion that strips everything from the first underscore:
- .github/workflows/benchmark-multinode-tmpl.yml:177: bash ./runners/launch_${RUNNER_NAME%%_*}.sh
- .github/workflows/benchmark-tmpl.yml:154: same
- .github/workflows/profile.yml:167: same
RUNNER_NAME is set from ${{ runner.name }}, i.e. the literal label from runners.yaml.
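The expansion can be tried in isolation. The sketch below assumes nothing beyond bash itself; launcher_for is a hypothetical helper introduced only for illustration and is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): mirrors the expansion used by the
# workflow templates to pick a launch script from a runner name.
launcher_for() {
  local runner_name="$1"
  # "%%_*" removes the longest suffix matching "_*", i.e. everything from
  # the first underscore onward. Hyphens are left untouched.
  printf './runners/launch_%s.sh\n' "${runner_name%%_*}"
}

launcher_for 'b200-dgxc-slurm_6'   # prints ./runners/launch_b200-dgxc-slurm.sh
launcher_for 'b200-dgxc_06'        # prints ./runners/launch_b200-dgxc.sh
```

Because only the underscore acts as a delimiter, the -slurm part of the old label survives into the script name, which is what makes the two labels resolve to different files.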
Step-by-step proof
- A runs-on: b200-multinode job lands on the runner named b200-dgxc-slurm_6.
- The workflow sets RUNNER_NAME=b200-dgxc-slurm_6.
- ${RUNNER_NAME%%_*} strips from the first _ to the end, yielding b200-dgxc-slurm (the hyphens are not delimiters).
- The workflow runs bash ./runners/launch_b200-dgxc-slurm.sh.
- That file no longer exists in the tree (ls runners/ shows only launch_b200-dgxc.sh; grep launch_b200-dgxc-slurm returns no hits).
- Bash exits with No such file or directory, and the workflow step fails.
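The steps above can be reproduced without touching real runners. This is a minimal sketch in a scratch directory: the stub launcher body and the try_launch helper are assumptions made for the demonstration, and only the file and runner names come from the review.

```shell
#!/usr/bin/env bash
# Scratch tree that stands in for the repo after the rename:
# only launch_b200-dgxc.sh exists under runners/.
workdir="$(mktemp -d)"
mkdir -p "$workdir/runners"
printf '#!/usr/bin/env bash\necho launched\n' > "$workdir/runners/launch_b200-dgxc.sh"

# Hypothetical helper: runs the same command shape as the workflow step.
try_launch() {
  local runner_name="$1"
  (cd "$workdir" && bash "./runners/launch_${runner_name%%_*}.sh" 2>&1)
}

# New-style label resolves to the renamed script and runs.
if try_launch 'b200-dgxc_06'; then
  echo 'b200-dgxc_06: ok'
fi

# Old multinode label resolves to a script that no longer exists.
if ! try_launch 'b200-dgxc-slurm_6'; then
  echo 'b200-dgxc-slurm_6: launch failed'
fi
```

The second call fails before any benchmark code runs, which matches the failure at the launch step described above.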
Why existing code does not prevent it
Nothing else in the dispatch chain re-maps the prefix; the workflow uses the raw runner-name prefix to pick the script. The renamed launch_b200-dgxc.sh even contains the IS_MULTINODE=true branch, showing the intent was to cover the multinode case — but the runner labels were not updated to match.
Impact
.github/configs/nvidia-master.yaml declares runner: b200-multinode in six places (lines 5, 390, 6623, 6759, 6929, 7128), covering DSR1-FP4 / DSR1-FP8 dynamo-trt and dynamo-sglang multinode benchmarks. Every one of these will fail at the launch step immediately after this PR merges. The h100/h200 stacks are unaffected because their multinode labels (h100-dgxc-slurm_*, h200-dgxc-slurm_*) still match their existing launch_h100-dgxc-slurm.sh / launch_h200-dgxc-slurm.sh scripts — only b200 has the prefix mismatch after this PR.
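A check along these lines would surface the mismatch before merge. It is a sketch only: missing_launchers is a hypothetical function, the labels are passed as arguments rather than parsed from runners.yaml, and the h100/h200 label indices are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical pre-merge check: print every runner label whose launcher
# script is absent from the given directory.
# Usage: missing_launchers <runners_dir> <label>...
missing_launchers() {
  local runners_dir="$1"
  shift
  local label
  for label in "$@"; do
    if [ ! -f "$runners_dir/launch_${label%%_*}.sh" ]; then
      printf '%s -> launch_%s.sh (missing)\n' "$label" "${label%%_*}"
    fi
  done
}

# Scratch directory standing in for runners/ after this PR.
demo="$(mktemp -d)"
touch "$demo/launch_b200-dgxc.sh" \
      "$demo/launch_h100-dgxc-slurm.sh" \
      "$demo/launch_h200-dgxc-slurm.sh"

# Label indices below are illustrative, not taken from runners.yaml.
missing_launchers "$demo" \
  'b200-dgxc_00' 'b200-dgxc-slurm_6' 'h100-dgxc-slurm_0' 'h200-dgxc-slurm_0'
```

Only the b200-dgxc-slurm label is reported, since the h100 and h200 labels still match their launcher names.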
Fix
Two straightforward options:
- Rename the b200-multinode entries to share the new prefix, e.g. b200-dgxc_06/07/08 (and re-register the self-hosted runners under the new labels) so ${RUNNER_NAME%%_*} resolves to b200-dgxc and dispatches to launch_b200-dgxc.sh.
- Restore the old name as a wrapper or symlink (runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh) so the existing labels keep resolving.
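The symlink option could look like the sketch below. It runs in a scratch tree with a stub launcher, both of which are assumptions for the demonstration; in the real repository the equivalent would be a single ln -s inside runners/.

```shell
#!/usr/bin/env bash
# Scratch tree with a stub for the renamed launcher (stub body is made up).
tree="$(mktemp -d)"
mkdir -p "$tree/runners"
printf '#!/usr/bin/env bash\necho "launcher: b200-dgxc"\n' \
  > "$tree/runners/launch_b200-dgxc.sh"

# Relative link target, so the link resolves wherever the repo is checked out.
ln -s launch_b200-dgxc.sh "$tree/runners/launch_b200-dgxc-slurm.sh"

# The old multinode label now reaches the renamed script through the link.
RUNNER_NAME='b200-dgxc-slurm_6'
(cd "$tree" && bash "./runners/launch_${RUNNER_NAME%%_*}.sh")
```

This keeps the existing runner registrations working, at the cost of carrying two names for one script; renaming the labels avoids that but requires re-registering the runners.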
Matches the rename done on main in PR #1192 (commit 100a5ec). The actual GitHub runner.name reports as 'b200-dgxc_NN', so ${RUNNER_NAME%%_*} produces 'b200-dgxc'; the launcher must therefore be named launch_b200-dgxc.sh for the workflow's launcher-selection step to find it. Without this rename, every job scheduled onto a b200-dgxc runner fails immediately with "No such file or directory".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>