
runners.yaml: add per-cluster directory mappings (centralize paths hardcoded in launch_*.sh) #1973

Description

@cquil11

Summary

.github/configs/runners.yaml currently maps only two things per cluster: labels: (scheduling-label → runner-name groups) and hardware: (available-cpu-dram-mib, gpus-per-node). Everything else that is genuinely per-cluster, most importantly filesystem directories, is hardcoded and duplicated across the 16 runners/launch_*.sh scripts. We should extend runners.yaml to be the single source of truth for these per-cluster mappings and have the launchers read from it.

Problem

Per-cluster paths are scattered across every launcher and diverge by site, so adding/moving a cluster or fixing a path means editing many bash files by hand. A grep across runners/ shows the kinds of directories currently baked in:

  • HF hub cache: /mnt/vast/gharunner/hf-hub-cache, /mnt/data/gharunners/hf-hub-cache/, ...
  • Squashfs / container image dirs: /mnt/vast/gharunner/squash/, /data/squash/, /home/slurm-shared/gharunners/squash, /data/home/sa-shared/gharunners/squash/, /mnt/nfs/lustre/containers/, /data/containers/
  • Model weight-staging: /data/models/{dsv,dsr,MiniMax-M...}, /mnt/nfs/lustre/models/...
  • AIPerf cache / dataset mmap cache: /mnt/vast/gharunner/ai-perf-cache, AIPERF_DATASET_MMAP_CACHE_DIR
  • NFS home mounts: /home/sa-shared/, /mnt/nfs/sa-shared/, /data/home/sa-shared/
  • Scratch/lustre roots: /mnt/lustre, /data/

These are exactly the values that differ per cluster and belong next to the existing hardware: metadata. A concrete example of why this matters: the gb300 NFS ELOOP workaround requires using /data/home/sa-shared/... instead of /home/sa-shared/... on that specific cluster. That is the sort of per-cluster path divergence that should be declared in one place, not remembered and hand-edited into a launcher.
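To make the audit repeatable, the grep described above could be scripted. The following is a rough sketch only: the function name find_hardcoded_paths, the regex, and the choice of root prefixes (/mnt, /data, /home, taken from the list above) are assumptions, and the sample script contents are made up for illustration. It groups each hardcoded absolute path by the launchers that mention it.

```python
import re
from collections import defaultdict

# Assumed root prefixes, based on the directories listed in this issue.
# The lookbehind skips paths built from a variable, e.g. "$ROOT/data/foo".
PATH_RE = re.compile(r"(?<![\w$}])/(?:mnt|data|home)/[\w./-]*")


def find_hardcoded_paths(scripts: dict[str, str]) -> dict[str, set[str]]:
    """Map each hardcoded absolute path to the set of scripts containing it.

    `scripts` maps a script name to its text; in the repo this could be built
    from pathlib.Path("runners").glob("launch_*.sh").
    """
    found: dict[str, set[str]] = defaultdict(set)
    for name, text in scripts.items():
        for line in text.splitlines():
            if line.lstrip().startswith("#"):
                continue  # ignore full-line comments
            for match in PATH_RE.findall(line):
                # Treat "/data/squash/" and "/data/squash" as the same path.
                found[match.rstrip("/")].add(name)
    return dict(found)


# Made-up launcher snippets, for illustration only.
sample_scripts = {
    "launch_a.sh": "export HF_HUB_CACHE=/mnt/vast/gharunner/hf-hub-cache\n# /data/ignored\n",
    "launch_b.sh": 'SQUASH="/data/squash/"\nX=$ROOT/data/foo\n',
}

for path, users in sorted(find_hardcoded_paths(sample_scripts).items()):
    print(path, sorted(users))
```

The resulting map is a reasonable starting point for deciding which keys each cluster entry needs.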

Proposal

Add a new per-cluster section to runners.yaml, keyed by the same cluster:<name> keys already used under hardware:, e.g.:

paths:
  cluster:h200-cw:
    hf-cache-dir: /mnt/vast/gharunner/hf-hub-cache
    squash-dir: /mnt/vast/gharunner/squash
    aiperf-cache-dir: /mnt/vast/gharunner/ai-perf-cache
    container-image-dir: /mnt/nfs/lustre/containers
    model-weights-dir: /mnt/nfs/lustre/models
    home-mount: /home/sa-shared
    scratch-dir: /mnt/lustre
  cluster:gb300-nv:
    ...
    home-mount: /data/home/sa-shared   # NFS ELOOP workaround, per-cluster

Then have the launchers resolve these paths from runners.yaml instead of hardcoding them, either via the same loader generate_sweep_configs.py already uses for labels/hardware or via a small shared helper. During migration, keep the current hardcoded values as defaults.
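One possible shape for the shared helper, as a minimal sketch under stated assumptions. The names resolve_paths, to_shell_exports, and DEFAULT_PATHS are hypothetical, as is the convention of turning hf-cache-dir into an HF_CACHE_DIR variable; the real loader in generate_sweep_configs.py may look quite different. The helper takes the already-parsed YAML mapping (for example from yaml.safe_load), overlays the cluster's entries on the migration defaults, and prints export lines a bash launcher could eval.

```python
import shlex
from typing import Mapping

# Illustrative migration defaults; the real ones would be the literals
# currently embedded in runners/launch_*.sh.
DEFAULT_PATHS = {
    "hf-cache-dir": "/mnt/vast/gharunner/hf-hub-cache",
    "home-mount": "/home/sa-shared",
}


def resolve_paths(
    config: Mapping,
    cluster: str,
    defaults: Mapping[str, str] = DEFAULT_PATHS,
) -> dict[str, str]:
    """Overlay paths declared for `cluster:<name>` on top of the defaults."""
    declared = (config.get("paths") or {}).get(f"cluster:{cluster}") or {}
    resolved = dict(defaults)  # copy so the defaults are never mutated
    resolved.update(declared)
    return resolved


def to_shell_exports(paths: Mapping[str, str]) -> str:
    """Render paths as `export NAME=value` lines (hypothetical naming scheme)."""
    lines = []
    for key, value in sorted(paths.items()):
        var = key.upper().replace("-", "_")
        lines.append(f"export {var}={shlex.quote(value)}")
    return "\n".join(lines)


# Stand-in for the parsed contents of .github/configs/runners.yaml.
config = {
    "paths": {
        "cluster:gb300-nv": {"home-mount": "/data/home/sa-shared"},
    }
}

print(to_shell_exports(resolve_paths(config, "gb300-nv")))
```

A launcher could then consume the output with something like eval "$(python3 helper.py gb300-nv)", where helper.py is a hypothetical wrapper script. A cluster with no paths entry falls back to the defaults, which keeps unmigrated launchers working.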

Exact field names/shape are up for discussion. The goal is per-cluster directories (and other cluster-specific config) declared once in runners.yaml and consumed by launchers.

Acceptance criteria

  • runners.yaml gains a per-cluster mapping (keyed by cluster:<name>) for the directories currently hardcoded in runners/launch_*.sh (HF cache, squash/container dirs, model-weights/staging, aiperf + dataset mmap cache, home mount, scratch).
  • Launchers read these paths from runners.yaml rather than embedding literals (shared loader/helper).
  • Per-cluster overrides (e.g. gb300 /data/home/sa-shared NFS workaround) are expressed as data, not special-cased in bash.
  • Existing behavior preserved (same resolved paths per cluster) after migration.
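The last criterion (same resolved paths per cluster) could be checked mechanically during migration. This is a sketch, not a prescribed test: diff_paths and normalize are hypothetical names, the sample values come from the examples in this issue, and treating paths that differ only by a trailing slash as equal is an assumption that should be confirmed against how the launchers join paths.

```python
from typing import Mapping

PathMap = Mapping[str, Mapping[str, str]]


def normalize(path: str) -> str:
    """Drop trailing slashes (assumed equivalent), keeping "/" intact."""
    return path.rstrip("/") or "/"


def diff_paths(legacy: PathMap, resolved: PathMap) -> list[str]:
    """List every key whose resolved value differs from the launcher literal."""
    problems = []
    for cluster in sorted(legacy):
        for key, old in sorted(legacy[cluster].items()):
            new = resolved.get(cluster, {}).get(key)
            if new is None:
                problems.append(f"{cluster}: {key} missing (launcher used {old})")
            elif normalize(new) != normalize(old):
                problems.append(f"{cluster}: {key} changed {old} -> {new}")
    return problems


# Values as hardcoded today (illustrative subset).
legacy = {
    "gb300-nv": {"home-mount": "/data/home/sa-shared/"},
    "h200-cw": {"home-mount": "/home/sa-shared", "scratch-dir": "/mnt/lustre"},
}
# Values resolved from the new runners.yaml section (illustrative subset).
resolved = {
    "gb300-nv": {"home-mount": "/data/home/sa-shared"},
    "h200-cw": {"home-mount": "/home/sa-shared"},
}

for problem in diff_paths(legacy, resolved):
    print(problem)
```

An empty result would indicate that the migration preserved every path captured in the legacy snapshot.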

Notes

Could also fold in other per-cluster knobs that are currently implicit (e.g. filesystem quirks, default partitions/accounts) as follow-ups. Scope this issue to directories first.
