GPU benchmark + consistency CI (issue #48) by SandersAaronD · Pull Request #53 · Overworldai/world_engine

SandersAaronD · 2026-06-03T00:09:33Z

Draft — blocked on the standard RTX PRO 6000 quota in `overworld-dev` (currently 0; request pending). Opening now so the code can be reviewed in the meantime. This is a best-effort first cut: it targets our specific GCP setup and might just work on the first real run, or might need fixing once it can actually run.

What this does

On a ready-for-review PR, provisions an ephemeral G4 / RTX PRO 6000 VM in `overworld-dev`, runs the suite against both `main` and the PR HEAD on that GPU, comments a comparison table, then tears the VM down.

- `.github/benchmark.py`: `run` does a 256-frame perf rollout (LFPS) plus a deterministic consistency forward pass from a shared, fully-populated KV cache (`get_state`/`load_state`, `use_deterministic_algorithms(True)`); `compare` emits the LFPS-delta + MSE markdown table. Failures are marked, not dropped.
- `.github/runner-startup.sh`: registers the VM as a one-shot `--ephemeral` self-hosted runner. No SSH, so it fits the provisioner SA's create/delete-only IAM.
- `.github/workflows/benchmark.yml`: `start-runner` → `benchmark` → `aggregate` → `stop-runner` (teardown on `always()`, plus a `max-run-duration` backstop). Runs main then PR on one VM via `uv run --project`, so only the engine install differs.
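As a rough illustration of the `compare` step described above, the following is a minimal sketch of turning per-config results from `main` and the PR into an LFPS-delta + MSE markdown table, with failed runs marked rather than dropped. The `Result` shape, the `mse` helper, and the column layout are assumptions made for this example; the actual `.github/benchmark.py` is not shown here and may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Result:
    """One config's measurement on one ref (hypothetical shape)."""
    lfps: Optional[float]          # None marks a failed perf rollout
    output: Optional[List[float]]  # flattened consistency output; None marks a failure


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equally sized outputs."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def _fmt(value: Optional[float]) -> str:
    return "FAILED" if value is None else f"{value:.2f}"


def compare(main: Dict[str, Result], pr: Dict[str, Result]) -> str:
    """Build a markdown table; a config missing on either side is shown as FAILED."""
    rows = [
        "| config | main LFPS | PR LFPS | delta | MSE |",
        "|---|---|---|---|---|",
    ]
    for name in sorted(set(main) | set(pr)):
        m, p = main.get(name), pr.get(name)
        m_lfps = m.lfps if m else None
        p_lfps = p.lfps if p else None
        if m_lfps is not None and p_lfps is not None and m_lfps > 0:
            delta = f"{(p_lfps - m_lfps) / m_lfps * 100:+.1f}%"
        else:
            delta = "n/a"
        m_out = m.output if m else None
        p_out = p.output if p else None
        if m_out is not None and p_out is not None:
            err = f"{mse(m_out, p_out):.3e}"
        else:
            err = "n/a"
        rows.append(f"| {name} | {_fmt(m_lfps)} | {_fmt(p_lfps)} | {delta} | {err} |")
    return "\n".join(rows)
```

Keeping a row for every config seen on either ref is what lets a failure show up in the PR comment instead of silently shrinking the table.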

Required before this can run

- Standard (non-preemptible) RTX PRO 6000 quota in `overworld-dev`/`us-central1` (request in).
- Repo secrets: `WIF_PROVIDER_DEV`, `WE_CI_PROVISIONER_SA`, `WE_CI_NODE_SA`, `GH_RUNNER_PAT` (repo Administration:write, for runner registration tokens), `HF_TOKEN`.
- Confirm `Overworld/Waypoint-*` model access works with `HF_TOKEN`.
- Pin `RUNNER_VERSION` in the startup script to the current actions/runner release.

Zone / image / disk / quota-bucket details are from a hands-on prototype attempt in `overworld-dev`.

Adds an ephemeral-runner workflow that, on ready-for-review PRs, provisions a G4 (RTX PRO 6000) VM in overworld-dev as a one-shot self-hosted runner, runs a 256-frame perf rollout (LFPS) and a deterministic consistency forward pass for each config against both main and the PR HEAD, posts a comparison table + MSE, then deletes the VM.

- examples/ci.py: provider-agnostic perf+consistency harness and aggregator
- .github/runner-startup.sh: registers the VM as an ephemeral runner (no SSH, fits the provisioner SA's create/delete-only permissions)
- .github/workflows/benchmark.yml: start-runner / benchmark / aggregate / stop-runner

Draft until the standard RTX PRO 6000 quota request for overworld-dev clears.

…ming The script is GCP/world_engine-specific CI glue, not a general-purpose example. Move examples/ci.py -> .github/benchmark.py and reword. Switch the workflow to invoke it via uv run --project per ref (drops the copy hack).

SandersAaronD · 2026-06-03T21:17:46Z

Taking this off draft purely to trigger and test GHA; it is not at all ready to merge.

RUNNER_NAME is a built-in GitHub Actions environment variable that the runner overrides at runtime with its own name (e.g. "GitHub Actions 1000007239"). The workflow's `env: RUNNER_NAME` was therefore ignored, and start-runner passed the runner's name to `gcloud compute instances create`, which rejected it as an invalid GCE resource name: Invalid value for field 'resource.name': 'GitHub Actions 1000007239'.

The same bug made stop-runner's `instances delete "$RUNNER_NAME"` target the wrong name (silently, via `|| true`), so a created VM would leak.

Rename to GH_RUNNER_NAME / GH_RUNNER_LABEL: clear of the reserved RUNNER_ and GITHUB_ prefixes, and consistent with the GH_RUNNER_PAT secret. Updates the create positional arg, delete arg, and metadata. The benchmark job's runs-on label is a literal expression and is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
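To make the failure mode above concrete, here is a small sketch of a pre-flight check that would have rejected the overridden name before it reached `gcloud`. It assumes the Compute Engine naming rule (1 to 63 characters, lowercase letter first, then lowercase letters, digits, or hyphens, not ending in a hyphen); the helper names and the example VM name are hypothetical and not part of the workflow.

```python
import re
from typing import Mapping

# Compute Engine resource names: 1-63 chars, start with a lowercase letter,
# then lowercase letters, digits, or hyphens, and must not end with a hyphen.
GCE_NAME_RE = re.compile(r"[a-z]([-a-z0-9]{0,61}[a-z0-9])?")


def is_valid_gce_name(name: str) -> bool:
    """Return True if `name` is acceptable as a GCE instance name."""
    return GCE_NAME_RE.fullmatch(name) is not None


def resolve_vm_name(env: Mapping[str, str]) -> str:
    """Read the VM name from GH_RUNNER_NAME (never RUNNER_NAME) and validate it."""
    name = env.get("GH_RUNNER_NAME", "")
    if not is_valid_gce_name(name):
        raise ValueError(f"not a valid GCE resource name: {name!r}")
    return name


if __name__ == "__main__":
    # Simulated job environment: RUNNER_NAME holds the runner's own name,
    # so only the non-reserved variable carries the intended VM name.
    env = {
        "RUNNER_NAME": "GitHub Actions 1000007239",
        "GH_RUNNER_NAME": "we-bench-pr53",  # hypothetical VM name
    }
    print(is_valid_gce_name(env["RUNNER_NAME"]))
    print(resolve_vm_name(env))
```

Failing fast on an invalid name in the create step also protects the delete step, which would otherwise target a VM that does not exist and hide the error behind `|| true`.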

The mint step runs under `bash -e` (no pipefail), so `curl -f | jq` hid a failing PAT call behind jq's exit 0, yielding an empty registration token that silently flowed into the VM metadata and only surfaced later as an opaque create failure.

Capture the HTTP status with -w, print it, and on non-201 emit the API's .message/.documentation_url (non-sensitive error text, never the token) and exit 1. Only parse/emit the token on success; keep masking it. Also switch to `Authorization: Bearer` (the documented scheme for fine-grained PATs) and pass the PAT via env rather than inline interpolation.

This both fixes the silent-empty-token footgun and tells us the exact reason the current GH_RUNNER_PAT mint is failing (the token is configured correctly per the UI, so the failure is something only the HTTP response will reveal).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
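The status-first handling described above can be sketched in Python as follows. The workflow itself does this in shell with `curl -w` and `jq`; this version only illustrates the logic: treat anything other than HTTP 201 as a failure, surface the API's `message` and `documentation_url`, and never let an empty token through. The function and exception names are hypothetical.

```python
import json


class MintError(RuntimeError):
    """Raised when a runner registration token could not be obtained."""


def extract_registration_token(status: int, body: str) -> str:
    """Return the token from a registration-token response, or raise MintError.

    `status` is the HTTP status code and `body` the raw response text.
    Only non-sensitive error fields are ever placed in the exception message.
    """
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}

    if status != 201:
        message = payload.get("message", "<no message>")
        docs = payload.get("documentation_url", "<no documentation_url>")
        raise MintError(f"token request failed: HTTP {status}: {message} ({docs})")

    token = payload.get("token")
    if not token:
        # A 201 without a token would otherwise flow into VM metadata as "".
        raise MintError("HTTP 201 but the response carried no token")
    return token
```

Checking the status before parsing mirrors the fix in the commit: the error path reports why the call failed, and the success path is the only place the token is read.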

A fine-grained PAT is capped by its owner's repo role; the token owner has Maintain (not Admin) on world_engine, and creating a runner registration token requires Admin, so the PAT 403'd ("Resource not accessible by personal access token") despite carrying Administration:write.

Switch to a GitHub App installation token (actions/create-github-app-token), whose permissions come from the App install (Administration:write) and are not bounded by any human's role. Auto-scoped to this repo, auto-masked, 1h expiry. Replaces GH_RUNNER_PAT with WE_CI_APP_ID / WE_CI_APP_PRIVATE_KEY.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

If the VM's startup script fails to register the runner, the benchmark job (runs-on the ephemeral label) sits queued waiting for a runner that never comes (up to GitHub's ~24h auto-fail) while the VM self-deletes at 3h, leaving no signal and no logs (startup output only reaches the VM serial console).

start-runner now polls the runners API for an online runner with our label (~10 min bound) and, on timeout, dumps the VM serial console into the Actions log and fails, so the real cause is visible, benchmark is skipped (not queued), and stop-runner tears the VM down.

Also harden runner-startup.sh: `set -x` + an ERR trap (so the serial dump pinpoints the failing line), run ./bin/installdependencies.sh (missing runner OS deps are a common silent config.sh failure), and bump RUNNER_VERSION 2.323.0 -> 2.334.0. Add timeout-minutes: 90 to benchmark as a backstop for a runner that registers then stalls mid-build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
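The bounded polling described above might look roughly like this. It is a sketch, not the workflow's code: `fetch_runners` is a hypothetical callable standing in for a call to the runners API, assumed to return runner objects with a `status` field and a `labels` list of `{"name": ...}` entries. The clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Iterable, Mapping


def runner_online(runners: Iterable[Mapping], label: str) -> bool:
    """True if any runner is online and carries the given label."""
    for runner in runners:
        names = {entry.get("name") for entry in runner.get("labels", [])}
        if runner.get("status") == "online" and label in names:
            return True
    return False


def wait_for_runner(
    fetch_runners: Callable[[], Iterable[Mapping]],
    label: str,
    timeout_s: float = 600.0,   # roughly the 10 minute bound from the commit
    interval_s: float = 15.0,   # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until a matching runner is online; return False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if runner_online(fetch_runners(), label):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

On a `False` result the caller would dump the VM serial console and fail the job, which is what turns a silent 24h queue into an immediate, diagnosable error.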

us-central1 has no RTX PRO 6000 capacity; the us-south1 quota request is filed. Point the workflow ZONE at us-south1-a.

…uth1 Move benchmark runner zone to us-south1-a

SandersAaronD added 2 commits June 2, 2026 17:09

SandersAaronD marked this pull request as ready for review June 3, 2026 21:17

Clydingus and others added 5 commits June 4, 2026 11:33

Move benchmark runner zone to us-south1-a

986451a

SandersAaronD mentioned this pull request Jun 5, 2026

Move benchmark runner zone to us-south1-a #54

Merged

Merge pull request #54 from Overworldai/sandersaarond/benchmark-us-so…

cda66c7

…uth1 Move benchmark runner zone to us-south1-a

SandersAaronD wants to merge 8 commits into `main` from `sandersaarond/ci-gpu-benchmark`