Summary
.github/configs/runners.yaml currently only maps two things per cluster: labels: (scheduling-label → runner-name groups) and hardware: (available-cpu-dram-mib, gpus-per-node). Everything else that is genuinely per-cluster — most importantly filesystem directories — is hardcoded and duplicated across the 16 runners/launch_*.sh scripts. We should extend runners.yaml to be the single source of truth for these per-cluster mappings and have the launchers read from it.
Problem
Per-cluster paths are scattered across every launcher and diverge by site, so adding/moving a cluster or fixing a path means editing many bash files by hand. A grep across runners/ shows the kinds of directories currently baked in:
- HF hub cache: /mnt/vast/gharunner/hf-hub-cache, /mnt/data/gharunners/hf-hub-cache/, ...
- Squashfs / container image dirs: /mnt/vast/gharunner/squash/, /data/squash/, /home/slurm-shared/gharunners/squash, /data/home/sa-shared/gharunners/squash/, /mnt/nfs/lustre/containers/, /data/containers/
- Model weight-staging: /data/models/{dsv,dsr,MiniMax-M...}, /mnt/nfs/lustre/models/...
- AIPerf cache / dataset mmap cache: /mnt/vast/gharunner/ai-perf-cache, AIPERF_DATASET_MMAP_CACHE_DIR
- NFS home mounts: /home/sa-shared/, /mnt/nfs/sa-shared/, /data/home/sa-shared/
- Scratch/lustre roots: /mnt/lustre, /data/
These are exactly the values that differ per cluster and belong next to the existing hardware: metadata. A concrete example of why this matters: the gb300 NFS ELOOP workaround requires using /data/home/sa-shared/... instead of /home/sa-shared/... on that specific cluster — the sort of per-cluster path divergence that should be declared in one place, not remembered and hand-edited into a launcher.
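The grep above can be made repeatable so the migration has a checklist. The following is a minimal sketch, not an existing tool: the function names are hypothetical, and the regex only assumes the /mnt, /data and /home roots seen in the examples listed above.

```python
import re
from pathlib import Path

# Assumption: hardcoded directories live under the roots seen in the
# examples above (/mnt, /data, /home). Extend the alternation as needed.
PATH_RE = re.compile(r"(?<![\w/.-])/(?:mnt|data|home)/[\w./-]*")


def find_hardcoded_paths(script_text: str) -> list[str]:
    """Return the sorted, de-duplicated absolute paths embedded in a script."""
    return sorted({m.group(0).rstrip("/") for m in PATH_RE.finditer(script_text)})


def audit_launchers(runners_dir: str = "runners") -> dict[str, list[str]]:
    """Map each launch_*.sh file name to the path literals it contains."""
    return {
        p.name: find_hardcoded_paths(p.read_text())
        for p in sorted(Path(runners_dir).glob("launch_*.sh"))
    }


# Illustrative launcher fragment (made up for this sketch).
sample = '''
export HF_HUB_CACHE=/mnt/vast/gharunner/hf-hub-cache
SQUASH_DIR="/data/squash/"
srun --container-mounts=/data/home/sa-shared/:/workspace true
'''
print(find_hardcoded_paths(sample))
```

Running audit_launchers() from the repository root would give a per-launcher list of literals to move into runners.yaml.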
Proposal
Add a new per-cluster section to runners.yaml, keyed by the same cluster:<name> keys already used under hardware:, e.g.:
paths:
  cluster:h200-cw:
    hf-cache-dir: /mnt/vast/gharunner/hf-hub-cache
    squash-dir: /mnt/vast/gharunner/squash
    aiperf-cache-dir: /mnt/vast/gharunner/ai-perf-cache
    container-image-dir: /mnt/nfs/lustre/containers
    model-weights-dir: /mnt/nfs/lustre/models
    home-mount: /home/sa-shared
    scratch-dir: /mnt/lustre
  cluster:gb300-nv:
    ...
    home-mount: /data/home/sa-shared  # NFS ELOOP workaround, per-cluster
Then have the launchers resolve these from runners.yaml (via the same loader generate_sweep_configs.py already uses for labels/hardware, or a small shared helper) instead of hardcoding them, keeping the current values as defaults during migration.
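One possible shape for the "small shared helper" is sketched below. It works on the already-parsed runners.yaml mapping (for example the result of PyYAML's yaml.safe_load), overlays the cluster's paths: entry on the launcher's current defaults, and prints export lines a bash launcher could eval. Everything here is an assumption for illustration: the function names, the ENV_VARS mapping, and the SQUASH_DIR and HOME_MOUNT variable names are hypothetical, and the default values are placeholders.

```python
import shlex

# Hypothetical mapping from paths: keys to the environment variables a
# launcher would export. The proposal does not define these names.
ENV_VARS = {
    "hf-cache-dir": "HF_HUB_CACHE",
    "squash-dir": "SQUASH_DIR",
    "home-mount": "HOME_MOUNT",
}


def resolve_paths(config: dict, cluster: str, defaults: dict[str, str]) -> dict[str, str]:
    """Overlay the cluster's paths: entry on the launcher's current defaults."""
    declared = (config.get("paths") or {}).get(f"cluster:{cluster}") or {}
    return {**defaults, **declared}


def to_shell_exports(paths: dict[str, str]) -> str:
    """Render resolved paths as shell-quoted export lines."""
    return "\n".join(
        f"export {ENV_VARS[key]}={shlex.quote(value)}"
        for key, value in sorted(paths.items())
        if key in ENV_VARS
    )


# Stand-in for the parsed runners.yaml; only the gb300 override is declared.
config = {"paths": {"cluster:gb300-nv": {"home-mount": "/data/home/sa-shared"}}}
# Placeholder defaults standing in for a launcher's current hardcoded values.
defaults = {
    "hf-cache-dir": "/mnt/vast/gharunner/hf-hub-cache",
    "home-mount": "/home/sa-shared",
}
print(to_shell_exports(resolve_paths(config, "gb300-nv", defaults)))
```

Because defaults are applied first, a cluster with no paths: entry keeps its current behavior, which matches the "current values as defaults during migration" approach.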
Exact field names/shape are up for discussion — the goal is: per-cluster directories (and other cluster-specific config) declared once in runners.yaml, consumed by launchers.
Acceptance criteria
- runners.yaml gains a per-cluster mapping (keyed by cluster:<name>) for the directories currently hardcoded in runners/launch_*.sh (HF cache, squash/container dirs, model-weights/staging, aiperf + dataset mmap cache, home mount, scratch).
- Launchers resolve these directories from runners.yaml rather than embedding literals (shared loader/helper).
- Per-cluster path divergences (e.g. the gb300 /data/home/sa-shared NFS workaround) are expressed as data, not special-cased in bash.
Notes
Other per-cluster knobs that are currently implicit (e.g. filesystem quirks, default partitions/accounts) could be folded in as follow-ups. Scope this issue to directories first.
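A CI check could keep the new section in step with hardware:. The sketch below is one way to do that, assuming the field names from the proposal's example (which are still up for discussion) and a parsed runners.yaml mapping; the function name and REQUIRED_KEYS set are hypothetical.

```python
# Assumption: the field names from the proposal's example; adjust once the
# final shape is agreed.
REQUIRED_KEYS = {
    "hf-cache-dir",
    "squash-dir",
    "aiperf-cache-dir",
    "container-image-dir",
    "model-weights-dir",
    "home-mount",
    "scratch-dir",
}


def missing_path_entries(config: dict) -> dict[str, list[str]]:
    """For each cluster under hardware:, list the paths: keys it lacks."""
    paths = config.get("paths") or {}
    report = {}
    for cluster in sorted(config.get("hardware") or {}):
        declared = paths.get(cluster) or {}
        missing = sorted(REQUIRED_KEYS - set(declared))
        if missing:
            report[cluster] = missing
    return report


# Stand-in for the parsed runners.yaml, with placeholder directory values.
config = {
    "hardware": {"cluster:h200-cw": {}, "cluster:gb300-nv": {}},
    "paths": {
        "cluster:h200-cw": {key: "/placeholder" for key in REQUIRED_KEYS},
        "cluster:gb300-nv": {"home-mount": "/data/home/sa-shared"},
    },
}
for cluster, keys in missing_path_entries(config).items():
    print(f"{cluster}: missing {', '.join(keys)}")
```

An empty report means every cluster known to hardware: declares every required directory, so a launcher never has to fall back to a literal.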