GPU benchmark + consistency CI (issue #48)#53

Open
SandersAaronD wants to merge 8 commits into main from sandersaarond/ci-gpu-benchmark

@SandersAaronD SandersAaronD commented Jun 3, 2026

Draft — blocked on the standard RTX PRO 6000 quota in `overworld-dev` (currently 0; request pending). Opening now so the code can be reviewed in the meantime. This is a best-effort first cut: it targets our specific GCP setup and might just work on the first real run, or might need fixing once it can actually run.

What this does

On a ready-for-review PR, provisions an ephemeral G4 / RTX PRO 6000 VM in `overworld-dev`, runs the suite against both `main` and the PR HEAD on that GPU, comments a comparison table, then tears the VM down.

  • `.github/benchmark.py` — `run` does a 256-frame perf rollout (LFPS) + a deterministic consistency forward pass from a shared, fully-populated KV cache (`get_state`/`load_state`, `use_deterministic_algorithms(True)`); `compare` emits the LFPS-delta + MSE markdown table. Failures are marked, not dropped.
  • `.github/runner-startup.sh` — registers the VM as a one-shot `--ephemeral` self-hosted runner. No SSH, so it fits the provisioner SA's create/delete-only IAM.
  • `.github/workflows/benchmark.yml` — `start-runner` → `benchmark` → `aggregate` → `stop-runner` (teardown on `always()`, plus a `max-run-duration` backstop). Runs main then PR on one VM via `uv run --project`, so only the engine install differs.
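As a rough illustration of the `compare` step described above, here is a minimal sketch of how an LFPS-delta + MSE markdown table might be built with failed configs marked rather than dropped. The function names, the result shape (`lfps`, `output`) and the column layout are assumptions made for this example, not the actual contents of `.github/benchmark.py`.

```python
from typing import Optional


def mse(a: list[float], b: list[float]) -> float:
    """Mean squared error between two equal-length flattened outputs."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def compare_table(main: dict[str, Optional[dict]], pr: dict[str, Optional[dict]]) -> str:
    """Render a markdown comparison table.

    Hypothetical input shape: config name -> {"lfps": float, "output": [float, ...]},
    or None when that run failed.
    """
    rows = [
        "| config | main LFPS | PR LFPS | delta | MSE |",
        "|---|---|---|---|---|",
    ]
    for name in sorted(set(main) | set(pr)):
        m, p = main.get(name), pr.get(name)
        if m is None or p is None:
            # Keep the row and say which side failed instead of dropping it.
            failed = " and ".join(side for side, r in (("main", m), ("PR", p)) if r is None)
            rows.append(f"| {name} | FAILED ({failed}) | | | |")
            continue
        delta = (p["lfps"] - m["lfps"]) / m["lfps"] * 100.0
        if len(m["output"]) == len(p["output"]) and m["output"]:
            err = f"{mse(m['output'], p['output']):.3e}"
        else:
            err = "shape mismatch"
        rows.append(f"| {name} | {m['lfps']:.2f} | {p['lfps']:.2f} | {delta:+.1f}% | {err} |")
    return "\n".join(rows)
```

With deterministic algorithms enabled and identical code on both refs, the MSE column would be expected to read zero, so a nonzero value flags a numerical change introduced by the PR.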

Required before this can run

  • Standard (non-preemptible) RTX PRO 6000 quota in `overworld-dev`/`us-central1` (request in).
  • Repo secrets: `WIF_PROVIDER_DEV`, `WE_CI_PROVISIONER_SA`, `WE_CI_NODE_SA`, `GH_RUNNER_PAT` (repo Administration:write, for runner registration tokens), `HF_TOKEN`.
  • Confirm `Overworld/Waypoint-*` model access works with `HF_TOKEN`.
  • Pin `RUNNER_VERSION` in the startup script to the current actions/runner release.

Zone / image / disk / quota-bucket details are from a hands-on prototype attempt in `overworld-dev`.

Adds an ephemeral-runner workflow that, on ready-for-review PRs, provisions a
G4 (RTX PRO 6000) VM in overworld-dev as a one-shot self-hosted runner, runs a
256-frame perf rollout (LFPS) and a deterministic consistency forward pass for
each config against both main and the PR HEAD, posts a comparison table + MSE,
then deletes the VM.

- examples/ci.py: provider-agnostic perf+consistency harness and aggregator
- .github/runner-startup.sh: registers the VM as an ephemeral runner (no SSH,
  fits the provisioner SA's create/delete-only permissions)
- .github/workflows/benchmark.yml: start-runner / benchmark / aggregate /
  stop-runner

Draft until the standard RTX PRO 6000 quota request for overworld-dev clears.
…ming

The script is GCP/world_engine-specific CI glue, not a general-purpose
example. Move examples/ci.py -> .github/benchmark.py and reword. Switch the
workflow to invoke it via uv run --project per ref (drops the copy hack).
@SandersAaronD SandersAaronD marked this pull request as ready for review June 3, 2026 21:17
@SandersAaronD (Author)

Taking this off draft purely to trigger and test GHA; it is not at all ready to merge.

Clydingus and others added 5 commits June 4, 2026 11:33
RUNNER_NAME is a built-in GitHub Actions environment variable that the
runner overrides at runtime with its own name (e.g. "GitHub Actions
1000007239"). The workflow's `env: RUNNER_NAME` was therefore ignored,
and start-runner passed the runner's name to `gcloud compute instances
create`, which rejected it as an invalid GCE resource name:

  Invalid value for field 'resource.name': 'GitHub Actions 1000007239'.

The same bug made stop-runner's `instances delete "$RUNNER_NAME"` target
the wrong name (silently, via `|| true`), so a created VM would leak.

Rename to GH_RUNNER_NAME / GH_RUNNER_LABEL: clear of the reserved RUNNER_
and GITHUB_ prefixes, and consistent with the GH_RUNNER_PAT secret.
Updates the create positional arg, delete arg, and metadata. The
benchmark job's runs-on label is a literal expression and is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
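
The rejection above follows from GCE's resource-name rule: 1 to 63 characters, starting with a lowercase letter, then lowercase letters, digits or hyphens, and not ending in a hyphen. A minimal sketch of a pre-flight check is below; the helper name and the sample VM name are hypothetical, not part of the workflow.

```python
import re

# GCE resource names: 1-63 chars, matching [a-z]([-a-z0-9]*[a-z0-9])?
GCE_NAME = re.compile(r"[a-z]([-a-z0-9]{0,61}[a-z0-9])?")


def is_valid_gce_name(name: str) -> bool:
    """Return True if `name` is acceptable as a GCE instance name."""
    return GCE_NAME.fullmatch(name) is not None


# The runner-supplied value contains uppercase letters and spaces, so it fails.
print(is_valid_gce_name("GitHub Actions 1000007239"))
# A hypothetical workflow-chosen name passes.
print(is_valid_gce_name("we-bench-53-1"))
```

Checking the name before calling `gcloud compute instances create` would surface this class of mistake in the workflow log rather than as an API error.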
The mint step runs under `bash -e` (no pipefail), so `curl -f | jq` hid a
failing PAT call behind jq's exit 0 — yielding an empty registration
token that silently flowed into the VM metadata and only surfaced later
as an opaque create failure.

Capture the HTTP status with -w, print it, and on non-201 emit the API's
.message/.documentation_url (non-sensitive error text, never the token)
and exit 1. Only parse/emit the token on success; keep masking it. Also
switch to `Authorization: Bearer` (the documented scheme for fine-grained
PATs) and pass the PAT via env rather than inline interpolation.

This both fixes the silent-empty-token footgun and tells us the exact
reason the current GH_RUNNER_PAT mint is failing (the token is configured
correctly per the UI, so the failure is something only the HTTP response
will reveal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
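
The actual mint step is shell (`curl` + `jq`); the sketch below restates the same decision logic in Python for clarity. It only parses the token on HTTP 201, otherwise it surfaces the API's `message` and `documentation_url`. The function and exception names are assumptions made for this example.

```python
import json


class MintError(RuntimeError):
    """Raised when a registration token could not be obtained."""


def parse_registration_response(status: int, body: str) -> str:
    """Return the registration token, or raise with the API's error text."""
    if status != 201:
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            payload = {}
        if not isinstance(payload, dict):
            payload = {}
        # Error responses carry no token, so this text is safe to log.
        msg = payload.get("message", "<no message>")
        doc = payload.get("documentation_url", "<no documentation_url>")
        raise MintError(f"HTTP {status}: {msg} ({doc})")
    token = json.loads(body).get("token")
    if not token:
        # Guards against the silent-empty-token case described above.
        raise MintError("HTTP 201 but the response carried no token")
    return token
```

Failing loudly here keeps an empty token from flowing into the VM metadata and reappearing later as an unrelated-looking create failure.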
A fine-grained PAT is capped by its owner's repo role; the token owner
has Maintain (not Admin) on world_engine, and creating a runner
registration token requires Admin, so the PAT 403'd ("Resource not
accessible by personal access token") despite carrying Administration:write.

Switch to a GitHub App installation token (actions/create-github-app-token),
whose permissions come from the App install (Administration:write) and are
not bounded by any human's role. Auto-scoped to this repo, auto-masked,
1h expiry. Replaces GH_RUNNER_PAT with WE_CI_APP_ID / WE_CI_APP_PRIVATE_KEY.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
If the VM's startup script fails to register the runner, the benchmark
job (runs-on the ephemeral label) sits queued waiting for a runner that
never comes — up to GitHub's ~24h auto-fail — while the VM self-deletes
at 3h, leaving no signal and no logs (startup output only reaches the VM
serial console).

start-runner now polls the runners API for an online runner with our
label (~10 min bound) and, on timeout, dumps the VM serial console into
the Actions log and fails — so the real cause is visible, benchmark is
skipped (not queued), and stop-runner tears the VM down.

Also harden runner-startup.sh: `set -x` + an ERR trap (so the serial dump
pinpoints the failing line), run ./bin/installdependencies.sh (missing
runner OS deps are a common silent config.sh failure), and bump
RUNNER_VERSION 2.323.0 -> 2.334.0. Add timeout-minutes: 90 to benchmark
as a backstop for a runner that registers then stalls mid-build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
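
The bounded wait described above can be sketched as follows. This is an illustration only: `list_runners` is a hypothetical callable standing in for a call to the repository runners API, assumed to return runner objects with a `status` field and a `labels` list of `{"name": ...}` entries; the timeout and interval defaults are also assumptions.

```python
import time
from typing import Callable, Iterable


def wait_for_runner(
    list_runners: Callable[[], Iterable[dict]],
    label: str,
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until an online runner carrying `label` appears, or time out.

    Returns True when found and False on timeout, so the caller can dump
    the VM serial console and fail the job.
    """
    deadline = clock() + timeout_s
    while True:
        for runner in list_runners():
            names = {entry["name"] for entry in runner.get("labels", [])}
            if runner.get("status") == "online" and label in names:
                return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting. On a `False` result the caller would fetch the serial console output and exit non-zero, which is what causes `benchmark` to be skipped rather than left queued.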
us-central1 has no RTX PRO 6000 capacity; the us-south1 quota request
is filed. Point the workflow ZONE at us-south1-a.
…uth1

Move benchmark runner zone to us-south1-a