runners.yaml: add per-cluster directory mappings (centralize paths hardcoded in launch_*.sh) #1973

## Summary

`.github/configs/runners.yaml` currently maps only two things per cluster: `labels:` (scheduling-label → runner-name groups) and `hardware:` (`available-cpu-dram-mib`, `gpus-per-node`). Everything else that is genuinely **per-cluster**, most importantly filesystem **directories**, is hardcoded and duplicated across the 16 `runners/launch_*.sh` scripts. We should extend `runners.yaml` to be the single source of truth for these per-cluster mappings and have the launchers read from it.

## Problem

Per-cluster paths are scattered across every launcher and diverge by site, so adding/moving a cluster or fixing a path means editing many bash files by hand. A grep across `runners/` shows the kinds of directories currently baked in:

- HF hub cache: `/mnt/vast/gharunner/hf-hub-cache`, `/mnt/data/gharunners/hf-hub-cache/`, ...
- Squashfs / container image dirs: `/mnt/vast/gharunner/squash/`, `/data/squash/`, `/home/slurm-shared/gharunners/squash`, `/data/home/sa-shared/gharunners/squash/`, `/mnt/nfs/lustre/containers/`, `/data/containers/`
- Model weight-staging: `/data/models/{dsv,dsr,MiniMax-M...}`, `/mnt/nfs/lustre/models/...`
- AIPerf cache / dataset mmap cache: `/mnt/vast/gharunner/ai-perf-cache`, `AIPERF_DATASET_MMAP_CACHE_DIR`
- NFS home mounts: `/home/sa-shared/`, `/mnt/nfs/sa-shared/`, `/data/home/sa-shared/`
- Scratch/lustre roots: `/mnt/lustre`, `/data/`

These are exactly the values that differ per cluster and belong next to the existing `hardware:` metadata. A concrete example of why this matters: the gb300 NFS ELOOP workaround requires using `/data/home/sa-shared/...` instead of `/home/sa-shared/...` on that specific cluster — the sort of per-cluster path divergence that should be declared in one place, not remembered and hand-edited into a launcher.

## Proposal

Add a new per-cluster section to `runners.yaml`, keyed by the same `cluster:<name>` keys already used under `hardware:`, e.g.:

```yaml
paths:
  cluster:h200-cw:
    hf-cache-dir: /mnt/vast/gharunner/hf-hub-cache
    squash-dir: /mnt/vast/gharunner/squash
    aiperf-cache-dir: /mnt/vast/gharunner/ai-perf-cache
    container-image-dir: /mnt/nfs/lustre/containers
    model-weights-dir: /mnt/nfs/lustre/models
    home-mount: /home/sa-shared
    scratch-dir: /mnt/lustre
  cluster:gb300-nv:
    # ... (remaining paths elided)
    home-mount: /data/home/sa-shared   # NFS ELOOP workaround, per-cluster
```
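To keep the new section from drifting out of sync with `hardware:`, a small consistency check could run in CI. The following is a minimal sketch, not an existing script: `validate_paths` and `EXPECTED_KEYS` are hypothetical names, the key set is simply copied from the example above, and the function takes the already-parsed YAML as a plain dict.

```python
from pathlib import PurePosixPath

# Hypothetical key set, copied from the example above; the real schema is
# still up for discussion.
EXPECTED_KEYS = {
    "hf-cache-dir",
    "squash-dir",
    "aiperf-cache-dir",
    "container-image-dir",
    "model-weights-dir",
    "home-mount",
    "scratch-dir",
}


def validate_paths(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the section is consistent."""
    problems = []
    hardware = config.get("hardware") or {}
    paths = config.get("paths") or {}

    # Every cluster that has hardware metadata should also declare its paths.
    for cluster in sorted(hardware):
        if cluster not in paths:
            problems.append(f"{cluster}: no entry under paths")

    for cluster in sorted(paths):
        entries = paths[cluster] or {}
        if cluster not in hardware:
            problems.append(f"{cluster}: listed under paths but not under hardware")
        for key in sorted(EXPECTED_KEYS - set(entries)):
            problems.append(f"{cluster}: missing {key}")
        for key in sorted(entries):
            if not PurePosixPath(str(entries[key])).is_absolute():
                problems.append(f"{cluster}: {key} is not an absolute path")
    return problems
```

Returning a list of problems rather than raising on the first one lets a CI job report every inconsistency in a single run.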

Then have the launchers resolve these from `runners.yaml` (via the same loader `generate_sweep_configs.py` already uses for `labels`/`hardware`, or a small shared helper) instead of hardcoding them, keeping the current values as defaults during migration.
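One possible shape for the shared helper is sketched below. Everything in it is an assumption for illustration: the function names, the `RUNNER_` variable prefix, and the contents of `DEFAULT_PATHS` (which stand in for the current hardcoded values during migration). It assumes PyYAML is available for loading the file and emits `export` lines that a bash launcher could evaluate.

```python
import shlex

# Illustrative migration defaults only; the real defaults would be whatever
# each launcher hardcodes today.
DEFAULT_PATHS = {
    "hf-cache-dir": "/mnt/vast/gharunner/hf-hub-cache",
    "home-mount": "/home/sa-shared",
}


def load_runner_config(path: str) -> dict:
    """Parse runners.yaml; PyYAML is imported lazily so the rest works without it."""
    import yaml  # assumption: PyYAML is installed where the launchers run

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle) or {}


def resolve_paths(config: dict, cluster: str, defaults: dict = DEFAULT_PATHS) -> dict:
    """Merge a cluster's declared paths over the migration defaults."""
    overrides = (config.get("paths") or {}).get(cluster) or {}
    return {**defaults, **overrides}


def to_shell_exports(paths: dict) -> str:
    """Render resolved paths as `export` lines, e.g. home-mount -> RUNNER_HOME_MOUNT."""
    lines = []
    for key in sorted(paths):
        variable = "RUNNER_" + key.upper().replace("-", "_")
        lines.append(f"export {variable}={shlex.quote(str(paths[key]))}")
    return "\n".join(lines)
```

A launcher could then evaluate the output of a thin command-line wrapper around these functions (the wrapper is not shown and does not exist yet) instead of embedding the literals. Quoting each value with `shlex.quote` keeps paths containing spaces or shell metacharacters safe to evaluate.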

Exact field names/shape are up for discussion — the goal is: **per-cluster directories (and other cluster-specific config) declared once in `runners.yaml`, consumed by launchers.**

## Acceptance criteria

- [ ] `runners.yaml` gains a per-cluster mapping (keyed by `cluster:<name>`) for the directories currently hardcoded in `runners/launch_*.sh` (HF cache, squash/container dirs, model-weights/staging, aiperf + dataset mmap cache, home mount, scratch).
- [ ] Launchers read these paths from `runners.yaml` rather than embedding literals (shared loader/helper).
- [ ] Per-cluster overrides (e.g. gb300 `/data/home/sa-shared` NFS workaround) are expressed as data, not special-cased in bash.
- [ ] Existing behavior preserved (same resolved paths per cluster) after migration.
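The last criterion could be checked mechanically by comparing a snapshot of the paths each launcher resolves today against what the new mapping resolves. This is a sketch under stated assumptions: `diff_resolved_paths` is a hypothetical helper, both inputs are `{cluster: {key: path}}` dicts produced elsewhere, and trailing slashes are treated as insignificant, which may not hold for every consumer of these paths.

```python
def _normalize(path):
    """Strip trailing slashes so /data/squash/ and /data/squash compare equal."""
    if path is None:
        return None
    return str(path).rstrip("/") or "/"


def diff_resolved_paths(before: dict, after: dict) -> list[str]:
    """List every (cluster, key) whose resolved path changed between two snapshots."""
    differences = []
    for cluster in sorted(set(before) | set(after)):
        old = before.get(cluster) or {}
        new = after.get(cluster) or {}
        for key in sorted(set(old) | set(new)):
            old_path = _normalize(old.get(key))
            new_path = _normalize(new.get(key))
            if old_path != new_path:
                differences.append(f"{cluster} {key}: {old_path!r} -> {new_path!r}")
    return differences
```

An empty result means the migration left every resolved path unchanged; a key present on only one side appears as a change to or from `None`.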

## Notes

Could also fold in other per-cluster knobs that are currently implicit (e.g. filesystem quirks, default partitions/accounts) as follow-ups. Scope this issue to directories first.

