GPU benchmark + consistency CI (issue #48) #53
Open
SandersAaronD wants to merge 8 commits into
Adds an ephemeral-runner workflow that, on ready-for-review PRs, provisions a G4 (RTX PRO 6000) VM in overworld-dev as a one-shot self-hosted runner, runs a 256-frame perf rollout (LFPS) and a deterministic consistency forward pass for each config against both main and the PR HEAD, posts a comparison table + MSE, then deletes the VM.
- examples/ci.py: provider-agnostic perf+consistency harness and aggregator
- .github/runner-startup.sh: registers the VM as an ephemeral runner (no SSH, fits the provisioner SA's create/delete-only permissions)
- .github/workflows/benchmark.yml: start-runner / benchmark / aggregate / stop-runner
Draft until the standard RTX PRO 6000 quota request for overworld-dev clears.
…ming The script is GCP/world_engine-specific CI glue, not a general-purpose example. Move examples/ci.py -> .github/benchmark.py and reword. Switch the workflow to invoke it via uv run --project per ref (drops the copy hack).
Author: Taking this off draft entirely to trigger/test GHA, not at all ready to merge.
RUNNER_NAME is a built-in GitHub Actions environment variable that the runner overrides at runtime with its own name (e.g. "GitHub Actions 1000007239"). The workflow's `env: RUNNER_NAME` was therefore ignored, and start-runner passed the runner's name to `gcloud compute instances create`, which rejected it as an invalid GCE resource name:
    Invalid value for field 'resource.name': 'GitHub Actions 1000007239'.
The same bug made stop-runner's `instances delete "$RUNNER_NAME"` target the wrong name (silently, via `|| true`), so a created VM would leak.
Rename to GH_RUNNER_NAME / GH_RUNNER_LABEL: clear of the reserved RUNNER_ and GITHUB_ prefixes, and consistent with the GH_RUNNER_PAT secret. Updates the create positional arg, delete arg, and metadata. The benchmark job's runs-on label is a literal expression and is unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
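For illustration only, here is a minimal Python sketch of a pre-flight check that would have caught this before `gcloud` did. It is not part of the PR; `is_valid_gce_name` and `resolve_vm_name` are hypothetical helper names, and the pattern encodes the RFC 1035 style rule GCE applies to instance names (1 to 63 characters, lowercase letter first, then lowercase letters, digits, or hyphens, not ending in a hyphen).

```python
import re

# GCE instance names: 1-63 chars, start with a lowercase letter, then
# lowercase letters, digits, or hyphens, and never end with a hyphen.
_GCE_NAME_RE = re.compile(r"[a-z]([-a-z0-9]{0,61}[a-z0-9])?")


def is_valid_gce_name(name: str) -> bool:
    """Return True if `name` is acceptable as a GCE instance name."""
    return _GCE_NAME_RE.fullmatch(name) is not None


def resolve_vm_name(env: dict) -> str:
    """Read the VM name from GH_RUNNER_NAME (hypothetical helper).

    Deliberately does not read RUNNER_NAME, which the Actions runner
    overwrites with its own display name at runtime.
    """
    name = env.get("GH_RUNNER_NAME", "")
    if not is_valid_gce_name(name):
        raise ValueError(f"not a valid GCE resource name: {name!r}")
    return name
```

With this check, the runner's display name "GitHub Actions 1000007239" fails immediately (uppercase letters and spaces), rather than surfacing as an API error from the create call.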
The mint step runs under `bash -e` (no pipefail), so `curl -f | jq` hid a failing PAT call behind jq's exit 0, yielding an empty registration token that silently flowed into the VM metadata and only surfaced later as an opaque create failure.
Capture the HTTP status with -w, print it, and on non-201 emit the API's .message/.documentation_url (non-sensitive error text, never the token) and exit 1. Only parse/emit the token on success; keep masking it. Also switch to `Authorization: Bearer` (the documented scheme for fine-grained PATs) and pass the PAT via env rather than inline interpolation.
This both fixes the silent-empty-token footgun and tells us the exact reason the current GH_RUNNER_PAT mint is failing (the token is configured correctly per the UI, so the failure is something only the HTTP response will reveal).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
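The status-gated parsing described in this commit can be sketched in Python as below. This is an illustration of the logic, not the workflow's actual shell step; `MintError` and `extract_registration_token` are hypothetical names, and the only API facts assumed are that a successful registration-token request returns HTTP 201 with a `token` field and that error bodies carry `message` and `documentation_url`.

```python
import json


class MintError(RuntimeError):
    """Raised when the registration-token request did not succeed."""


def extract_registration_token(status: int, body: str) -> str:
    """Return the token only on HTTP 201; otherwise raise with the API's
    non-sensitive error text. The token itself is never put in a message."""
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}

    if status != 201:
        message = payload.get("message", "<no message>")
        doc_url = payload.get("documentation_url", "<no documentation_url>")
        raise MintError(
            f"registration-token request failed: HTTP {status}: "
            f"{message} ({doc_url})"
        )

    token = payload.get("token")
    if not token:
        # Guards against the silent-empty-token case described above.
        raise MintError("HTTP 201 but the response carried no token")
    return token
```

The point of the design is that an empty or missing token can never flow onward: every path either returns a non-empty token or fails loudly with the status code.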
A fine-grained PAT is capped by its owner's repo role; the token owner
has Maintain (not Admin) on world_engine, and creating a runner
registration token requires Admin, so the PAT 403'd ("Resource not
accessible by personal access token") despite carrying Administration:write.
Switch to a GitHub App installation token (actions/create-github-app-token),
whose permissions come from the App install (Administration:write) and are
not bounded by any human's role. Auto-scoped to this repo, auto-masked,
1h expiry. Replaces GH_RUNNER_PAT with WE_CI_APP_ID / WE_CI_APP_PRIVATE_KEY.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
If the VM's startup script fails to register the runner, the benchmark job (runs-on the ephemeral label) sits queued waiting for a runner that never comes, up to GitHub's ~24h auto-fail, while the VM self-deletes at 3h, leaving no signal and no logs (startup output only reaches the VM serial console).
start-runner now polls the runners API for an online runner with our label (~10 min bound) and, on timeout, dumps the VM serial console into the Actions log and fails, so the real cause is visible, benchmark is skipped (not queued), and stop-runner tears the VM down.
Also harden runner-startup.sh: `set -x` + an ERR trap (so the serial dump pinpoints the failing line), run ./bin/installdependencies.sh (missing runner OS deps are a common silent config.sh failure), and bump RUNNER_VERSION 2.323.0 -> 2.334.0.
Add timeout-minutes: 90 to benchmark as a backstop for a runner that registers then stalls mid-build.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
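As a rough sketch of the bounded poll described in this commit (the real step lives in the workflow and may differ), the loop could look like the following in Python. `find_online_runner` and `wait_for_runner` are hypothetical names, `fetch_runners` stands in for whatever call lists the repo's self-hosted runners, and the 15 s interval is an assumption; only the ~10 min bound comes from the commit message. The runner records are assumed to follow the runners API shape, with a `status` field and a `labels` list of objects carrying `name`.

```python
import time
from typing import Callable, Iterable, Mapping, Optional


def find_online_runner(runners: Iterable[Mapping], label: str) -> Optional[Mapping]:
    """Return the first runner that is online and carries `label`."""
    for runner in runners:
        names = {entry.get("name") for entry in runner.get("labels", [])}
        if runner.get("status") == "online" and label in names:
            return runner
    return None


def wait_for_runner(
    fetch_runners: Callable[[], Iterable[Mapping]],
    label: str,
    timeout_s: float = 600.0,   # ~10 min bound from the commit message
    interval_s: float = 15.0,   # assumed poll interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Mapping]:
    """Poll until a matching runner is online, or return None on timeout."""
    deadline = clock() + timeout_s
    while True:
        runner = find_online_runner(fetch_runners(), label)
        if runner is not None:
            return runner
        if clock() >= deadline:
            return None
        sleep(interval_s)
```

On a `None` result the caller would dump the serial console and fail the job, which is what lets benchmark be skipped instead of sitting queued. Injecting `sleep` and `clock` keeps the loop testable without real waiting.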
us-central1 has no RTX PRO 6000 capacity; the us-south1 quota request is filed. Point the workflow ZONE at us-south1-a.
…uth1 Move benchmark runner zone to us-south1-a
Draft — blocked on the standard RTX PRO 6000 quota in `overworld-dev` (currently 0; request pending). Opening now so the code can be reviewed in the meantime. This is a best-effort first cut: it targets our specific GCP setup and might just work on the first real run, or might need fixing once it can actually run.
What this does
On a ready-for-review PR, provisions an ephemeral G4 / RTX PRO 6000 VM in `overworld-dev`, runs the suite against both `main` and the PR HEAD on that GPU, comments a comparison table, then tears the VM down.
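To make the comparison step concrete, here is a small Python sketch of how an MSE between the two refs' outputs and a per-config table could be computed. It is an assumption-laden illustration, not the harness in this PR: `mse` and `comparison_table` are hypothetical names, the outputs are treated as flat sequences of floats, and the column layout is invented for the example.

```python
from typing import Iterable, Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty outputs."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("outputs must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def comparison_table(rows: Iterable[Tuple[str, float, float, float]]) -> str:
    """Render (config, main LFPS, PR LFPS, MSE) rows as a Markdown table."""
    lines = [
        "| config | main LFPS | PR LFPS | delta | MSE |",
        "|---|---|---|---|---|",
    ]
    for config, main_fps, head_fps, err in rows:
        delta_pct = (head_fps - main_fps) / main_fps * 100.0
        lines.append(
            f"| {config} | {main_fps:.1f} | {head_fps:.1f} "
            f"| {delta_pct:+.1f}% | {err:.3e} |"
        )
    return "\n".join(lines)
```

An MSE of exactly zero on the deterministic forward pass would indicate that the PR HEAD reproduces `main`'s output for that config; any nonzero value flags a numerical change worth a reviewer's attention.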
Required before this can run
Zone / image / disk / quota-bucket details are from a hands-on prototype attempt in `overworld-dev`.