spec: API keys + RS256 production migration #18
Production relayauth currently signs `{"alg":"HS256","kid":"production"}`
and serves no public key material from JWKS. The spec at
specs/token-format.md mandates RS256 or EdDSA and explicitly says
verifiers must reject any other value — meaning every spec-compliant
verifier (including @relayauth/sdk's TokenVerifier) cannot verify
production tokens at all. The `/v1/api-keys` endpoint family is
declared in OpenAPI + the contract test but is not implemented in
either the routes or the DB schema.
The user-visible failure that surfaced this: cloud's specialist worker
adopted RS256/JWKS-style auth on /a2a/rpc (cloud #267); sage was
supposed to mint via relayauth and present a bearer; neither side of
the chain works because relayauth is two pieces short of what the
specs assume. Specialist 401s every sage tool call, harness hits
max_iterations_reached, sage falls back to "I could not complete that
request right now."
This spec lays out the four-phase migration:
1. Implement `/v1/api-keys` POST/GET/revoke + `api_keys` table +
x-api-key auth middleware + accept-either bearer/api-key on
identities/tokens routes.
2. Switch token signing to RS256, publish RSA public key from JWKS,
keep old HS256 entry alongside during transition.
3. Deploy with verifier-first rollout (accepts both algs), then
signer cutover, then HS256 sunset after the 1h TTL window.
4. Bootstrap a sage→relayauth API key with the now-existing endpoint;
the existing PRs (sage #97, cloud #280) chain through automatically
once the GitHub Actions secret is set.
Risks, mitigations, and open questions called out inline. The hardest
operational piece is that relayauth has no CD workflow today; a
separate spec issue should track that as a hard prerequisite before
the cutover lands.
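The verifier-first step in phase 3 can be sketched roughly as follows. This is a minimal illustration, not the `TokenVerifier` implementation: `TransitionKeys` and `verifySignature` are hypothetical names, only the signature is checked (no `exp`/`aud`/`iss` validation), and the RSA public key is passed in as PEM rather than fetched from JWKS.

```typescript
import { createHmac, createVerify, timingSafeEqual } from "node:crypto";

// Hypothetical key material for the transition window (illustrative names).
interface TransitionKeys {
  rsaPublicKeyPem: string;   // assumed to come from JWKS in the real system
  legacyHmacSecret?: string; // leave undefined once HS256 is sunset
}

// Checks only the signature of a compact JWS and reports which alg verified it.
// Claim validation is deliberately out of scope for this sketch.
function verifySignature(token: string, keys: TransitionKeys): "RS256" | "HS256" {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("malformed token");
  const [headerB64, payloadB64, sigB64] = parts as [string, string, string];
  const header = JSON.parse(Buffer.from(headerB64, "base64url").toString("utf8"));
  const signingInput = `${headerB64}.${payloadB64}`;
  const signature = Buffer.from(sigB64, "base64url");

  if (header.alg === "RS256") {
    const ok = createVerify("RSA-SHA256")
      .update(signingInput)
      .verify(keys.rsaPublicKeyPem, signature);
    if (!ok) throw new Error("bad RS256 signature");
    return "RS256";
  }

  // HS256 is accepted only while a legacy secret is still configured.
  // The HMAC secret is kept separate from the RSA key, so a token cannot
  // switch alg to reuse the public key as an HMAC secret.
  if (header.alg === "HS256" && keys.legacyHmacSecret !== undefined) {
    const expected = createHmac("sha256", keys.legacyHmacSecret)
      .update(signingInput)
      .digest();
    if (expected.length !== signature.length || !timingSafeEqual(expected, signature)) {
      throw new Error("bad HS256 signature");
    }
    return "HS256";
  }

  throw new Error(`unsupported alg: ${String(header.alg)}`);
}
```

Under these assumptions, sunsetting HS256 amounts to dropping `legacyHmacSecret` from the configuration, after which legacy tokens fall through to the unsupported-alg error.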
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dfbef0d to 740a6c5
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dfbef0d03e
    const scopePath = ts.split(":").slice(3).join(":").replace("/*", "");
    const requestedPath = requestedScope.split(":").slice(3).join(":").replace(/\/[^/]+$/, "");
    return ts.startsWith("relayfile:fs:read:") && requestedScope.startsWith("relayfile:fs:read:/") &&
      (scopePath === "" || requestedPath.startsWith(scopePath));
Enforce directory boundary in file-scope matching
The prefix check in /v1/authorize/file can grant access to sibling paths that merely share a textual prefix. For example, a token with relayfile:fs:read:/frontend/* will pass for a request like relayfile:fs:read:/frontendevil/secrets.txt because requestedPath.startsWith(scopePath) is true for /frontendevil vs /frontend. This turns the new authorization route into an over-permissive matcher and can incorrectly allow reads outside the intended directory tree.
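One way to enforce the boundary is to compare against the scope path plus a trailing `/` instead of the bare prefix. The sketch below is illustrative only: `fileScopeAllows` is a hypothetical helper, the exact-match behaviour for non-wildcard scopes is an assumption about intended semantics, and the `..` rejection is an extra guard that the original route may handle elsewhere.

```typescript
// Hypothetical helper, not the route's actual code.
const READ_PREFIX = "relayfile:fs:read:";

function fileScopeAllows(tokenScope: string, requestedScope: string): boolean {
  if (!tokenScope.startsWith(READ_PREFIX)) return false;
  if (!requestedScope.startsWith(READ_PREFIX + "/")) return false;

  const requestedPath = requestedScope.slice(READ_PREFIX.length);
  const scopePath = tokenScope.slice(READ_PREFIX.length);

  // Assumption: traversal segments are never valid in a requested path.
  if (requestedPath.split("/").includes("..")) return false;

  if (!scopePath.endsWith("/*")) {
    // Assumption: a scope without a wildcard names exactly one file.
    return requestedPath === scopePath;
  }

  // "/frontend/*" -> "/frontend"; "/*" -> "" (root wildcard).
  const dir = scopePath.slice(0, -2);
  // The trailing "/" is the directory boundary: "/frontendevil/x" does not
  // start with "/frontend/", so sibling paths no longer match.
  return requestedPath.startsWith(dir + "/");
}
```

With this check, `relayfile:fs:read:/frontend/*` still covers `/frontend/src/app.ts` but no longer covers `/frontendevil/secrets.txt`.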
    if (fileScopes.length > 0) {
      return `✓ Allowed (can read: ${fileScopes.join(", ")})`;
    }
Preserve denied outcome when summarizing scope checks
The scope.check formatter now always renders an allowed message whenever any relayfile scopes are present, even if payload.result is denied. authenticateAndAuthorize emits denied scope.check events before scope.denied, so this causes the event feed to show contradictory success text for failed authorization attempts, which makes debugging authorization behavior unreliable.
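A formatter that branches on the outcome before summarizing scopes might look like the sketch below. The `ScopeCheckPayload` shape is an assumption: only `payload.result` is named in the comment above, and the `scopes` field and the denied-message wording are hypothetical.

```typescript
// Hypothetical event payload; only `result` is taken from the review comment.
interface ScopeCheckPayload {
  result: "allowed" | "denied";
  scopes: string[];
}

function formatScopeCheck(payload: ScopeCheckPayload): string {
  const fileScopes = payload.scopes.filter((s) => s.startsWith("relayfile:"));

  // Decide on the outcome first, so a denied check can never render as success.
  if (payload.result === "denied") {
    const detail = fileScopes.length > 0 ? ` (token scopes: ${fileScopes.join(", ")})` : "";
    return `✗ Denied${detail}`;
  }

  if (fileScopes.length > 0) {
    return `✓ Allowed (can read: ${fileScopes.join(", ")})`;
  }
  return "✓ Allowed";
}
```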
Two findings from scoping the implementation:
1. `POST /v1/tokens` is unimplemented in the relayauth server. The discovery endpoint advertises it, the SDK calls it, and the OpenAPI spec lists it, but there's no route handler in server.ts. Cloud's e2e tests mock it; production has no such mock. Without a working token endpoint, API keys have nothing useful to authenticate to, so this is a hard precondition for Phase 1.
2. The deployed worker lives in cloud/packages/relayauth/, not this repo directly. This repo (@relayauth/server) provides Hono routes + storage interfaces; cloud provides Cloudflare adapters (D1 + KV + Durable Objects) and the worker entrypoint. Most phases now have two PRs each: one here, one in cloud.
Adds Phase 0 (tokens route) and a repo-split section explaining the two-PR cadence. Existing phases renumber as appropriate (numbered identifiers preserved to avoid renaming churn).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six agent-relay workflows implementing the migration spec end-to-end.
All share the same strict review template — implementer self-review,
two parallel specialist peer reviews (security + spec/compat), architect
synthesis, fix loop, and a final approval gate that hard-fails if any
reviewer is still requesting changes or tests/typecheck don't pass.
Workflows:
  118-tokens-route-phase0        POST /v1/tokens, /refresh, /revoke,
                                 /introspect; precondition for everything
                                 else. Implementer + 2 reviewers (security,
                                 spec) + approval gate.
  119-api-keys-phase1            /v1/api-keys POST/GET/revoke + ApiKeyStorage
                                 interface in @relayauth/server, D1 migration
                                 0002_api_keys.sql + Cloudflare adapter in
                                 cloud/packages/relayauth, x-api-key auth on
                                 identities + tokens routes.
  120-rs256-signing-phase2a      RS256 signing helper + JWKS RSA publication
                                 in @relayauth/server. ADDITIVE only: HS256
                                 stays default. Crypto-reviewer is the gate.
  121-sdk-dual-verify-phase3a    TokenVerifier accepts both RS256 (new) and
                                 HS256 (legacy) during cutover. Crypto +
                                 compat reviewers gate. Must land + propagate
                                 to all consumers BEFORE 122 fires.
  122-cloud-cutover-phase3b      Production cryptographic cutover. Three
                                 flag-controlled steps with HUMAN go/no-go
                                 between each: publish RSA key in JWKS →
                                 flip signer to RS256 → 90-min soak →
                                 sunset HS256. Observability agent reads
                                 worker tails between steps; rollback-
                                 reviewer confirms each step has a sub-5-min
                                 rollback path.
  123-bootstrap-sage-key-phase4  Operational. Produces the runbook +
                                 scripts to provision sage's RelayAuth API
                                 key (admin bearer required), set the GitHub
                                 secret, and chain through sage release +
                                 cloud bump + deploy. Security-reviewer
                                 checks scope minimisation + rotation plan.
Run order: 118 → 119 → 120 → 121 → publish + propagate → 122 → 123.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single CLI entrypoint that runs phases 118-123 in order with proper
branch + commit + PR handling per affected repo. Adds the missing piece
the workflows themselves don't do — they only modify files in place.
Per-phase manifest:
118 → relayauth, branch migration/rs256/118-tokens-route
119 → relayauth + cloud, paired branches migration/rs256/119-api-keys
120 → relayauth, branch migration/rs256/120-rs256-signing
121 → relayauth, branch migration/rs256/121-sdk-dual-verify
122 → cloud, branch migration/rs256/122-cutover-infra (HIGH risk)
123 → cloud, branch migration/rs256/123-bootstrap-runbook
For each phase the runner:
1. Branches off origin/main in every affected repo.
2. Runs the workflow (workflow modifies files; if its approval gate
fails, runner exits 1 with the branch preserved for retry).
3. Commits the diff with a generated message referencing the workflow
and the spec.
4. Pushes the branch and opens a PR via gh, with a structured body
pointing back to the spec + run order.
Hard human pause before HIGH-risk phases (122) — the operator must type
"PROCEED" to confirm preconditions (118-121 deployed, sdk dual-verify
propagated to consumers, operator window scheduled).
State checkpointed in .rs256-migration-state.json so re-runs pick up
where they left off. Flags: --from N, --only N, --dry-run, --no-pause,
--skip-pr.
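The resume behaviour described above can be sketched as a pure function over the checkpoint file's contents. Everything here is illustrative: the runner itself is a shell script, and the `{status, ts}` record shape, the `phasesToRun` name, and the exact semantics of `--from` and `--only` (whether they re-run a phase already marked passed) are assumptions.

```typescript
// Assumed checkpoint shape: one record per phase ID.
interface PhaseRecord {
  status: string; // e.g. "passed"; a missing record is treated as "pending"
  ts: string;
}
type MigrationState = Record<string, PhaseRecord | undefined>;

interface RunFlags {
  from?: number; // --from N
  only?: number; // --only N
}

const PHASES: readonly number[] = [118, 119, 120, 121, 122, 123];

// Read the status field, not the whole record, and default to "pending".
function statusOf(state: MigrationState, id: number): string {
  return state[String(id)]?.status ?? "pending";
}

function phasesToRun(state: MigrationState, flags: RunFlags = {}): number[] {
  return PHASES.filter((id) => {
    // Assumption: --only forces exactly that phase, even if it already passed.
    if (flags.only !== undefined) return id === flags.only;
    // Assumption: --from skips earlier phases and still skips passed ones.
    if (flags.from !== undefined && id < flags.from) return false;
    return statusOf(state, id) !== "passed";
  });
}
```

Keeping the selection logic separate from the git and workflow side effects makes the skip rules easy to check without touching either repo.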
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
    get_state() {
      local id="$1"
      ensure_state
      jq -r --arg id "$id" '.[$id] // "pending"' "$STATE_FILE"
🔴 get_state reads the full nested object instead of .status, so phases are never recognized as already passed
set_state at line 97 writes a nested object {status: $status, ts: $ts} under each phase ID, but get_state at line 88 reads .[$id] which returns the entire object (e.g. {"status":"passed","ts":"2026-..."}), not just the status string. This means the comparison at scripts/run-rs256-migration.sh:217 (if [ "$state" = "passed" ]) will never be true, completely breaking the "pick up where we left off" behavior. Every re-invocation of the script will re-run all phases from the beginning, including re-running agent-relay workflows and re-branching/committing/PRing in git — potentially destructive for already-merged work.
Suggested change:
    - jq -r --arg id "$id" '.[$id] // "pending"' "$STATE_FILE"
    + jq -r --arg id "$id" '.[$id].status // "pending"' "$STATE_FILE"
    curl -fsS -o /dev/null -w 'HTTP %{http_code}\\n' https://api.relayauth.dev/v1/api-keys -X POST -H 'content-type: application/json' -d '{}' ;
    # Without auth this should be 401, not 404. 404 means phase 1 did not deploy.
🔴 curl -f causes precondition check to fail on the expected HTTP 401 response
The precondition check at line 38 uses curl -fsS to verify the /v1/api-keys endpoint is deployed. The comment on line 39 says "Without auth this should be 401, not 404" — the intent is to distinguish between 401 (endpoint exists, requires auth) and 404 (endpoint not deployed). However, the -f flag causes curl to exit with code 22 for any HTTP error including 401. Combined with set -e on line 36, this aborts the entire precondition step on the expected 401 response, making the workflow un-runnable even when all preconditions are actually met.
Prompt for agents
The problem is in the `preconditions` step command in `workflows/123-bootstrap-sage-key-phase4.ts`, around line 36-38. The curl command uses `-f` which makes curl exit non-zero on HTTP 401, but 401 is the expected/successful response (it proves the endpoint is deployed and requires auth). Combined with `set -e`, this kills the entire precondition check.
To fix: Remove the `-f` flag from the curl call and instead capture the HTTP status code, then check it explicitly. For example, replace the curl line with something like:
    status=$(curl -sS -o /dev/null -w '%{http_code}' https://api.relayauth.dev/v1/api-keys -X POST -H 'content-type: application/json' -d '{}') ;
    if [ "$status" = "404" ]; then echo "ENDPOINT_NOT_DEPLOYED (got 404)"; exit 1; fi ;
    echo "HTTP $status (endpoint exists)" ;
This allows 401 (and other non-404 responses) to pass while correctly failing on 404.
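Since the workflow file is TypeScript, the same check could also be expressed without shelling out to curl. This is a rough sketch under that assumption: `classifyApiKeysProbe` and `probeApiKeys` are hypothetical names, and it relies on the global `fetch` (Node 18+), which resolves normally on a 401 rather than failing the way `curl -f` does.

```typescript
interface ProbeResult {
  deployed: boolean;
  note: string;
}

// Only 404 means "route not deployed"; 401 and any other status prove the
// endpoint exists, mirroring the shell snippet above.
function classifyApiKeysProbe(status: number): ProbeResult {
  if (status === 404) {
    return { deployed: false, note: "ENDPOINT_NOT_DEPLOYED (got 404)" };
  }
  return { deployed: true, note: `HTTP ${status} (endpoint exists)` };
}

// Unauthenticated POST with an empty JSON body, as in the curl command.
async function probeApiKeys(baseUrl: string): Promise<ProbeResult> {
  const res = await fetch(`${baseUrl}/v1/api-keys`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: "{}",
  });
  return classifyApiKeysProbe(res.status);
}
```

A caller would fail the precondition step only when `deployed` is false, for example after `await probeApiKeys("https://api.relayauth.dev")`.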
How to run
Single CLI entrypoint executes phases 118-123 in order with branch + commit + PR per repo.

One-shot (full migration)

    cd /path/to/relayauth
    ./scripts/run-rs256-migration.sh

That's it. The runner branches each affected repo off

Common flags

    # Preview what would happen, no changes
    ./scripts/run-rs256-migration.sh --dry-run
    # Resume from a specific phase (state is persisted between runs)
    ./scripts/run-rs256-migration.sh --from 120
    # Run a single phase
    ./scripts/run-rs256-migration.sh --only 119
    # Commit but don't open PRs (useful for local iteration)
    ./scripts/run-rs256-migration.sh --skip-pr
    # Skip the HIGH-risk pre-cutover human pause (only for testing in dev/staging)
    ./scripts/run-rs256-migration.sh --no-pause

Repo paths
The runner expects two repos checked out side-by-side. Override with env vars if your layout differs:

    RELAYAUTH_REPO=~/work/relayauth \
    CLOUD_REPO=~/work/cloud \
    ./scripts/run-rs256-migration.sh

Defaults:

Prereqs

What gets opened
7 PRs total across two repos, one per phase per affected repo:
Each PR body references this spec + the run order. The workflow's strict review template (implementer self-review + 2 specialist peer reviewers + architect synthesis + approval gate) gates the commit: PRs only open if the gate passed.

State + recovery
Progress is checkpointed in

Operator window for phase 122
Phase 122 (cryptographic cutover) has three internal human go/no-go gates that block on touch-files:

    touch /path/to/relayauth/.cutover-step1-approved # after RSA key in JWKS confirmed
    touch /path/to/relayauth/.cutover-step2-approved # after signer flip + healthy tail
    touch /path/to/relayauth/.cutover-step3-approved # after HS256 sunset confirmed

Each gate prints exactly what to verify (curl JWKS, tail workers, etc.) before you mark it approved. There's also a 90-minute soak window between steps 2 and 3; abort with
Why now
Production relayauth is two pieces short of what its own specs already assume:
- `specs/token-format.md` mandates RS256 / EdDSA. Production signs HS256. Every spec-compliant verifier (including `@relayauth/sdk`'s `TokenVerifier` at `verify.js:184-200`) refuses HS256 outright.
- `/v1/api-keys` endpoint family is declared in the OpenAPI spec and the contract test, but the routes don't exist and there's no `api_keys` table.

The user-visible failure that surfaced this: cloud's specialist worker adopted RS256/JWKS-style auth on `/a2a/rpc` (cloud #267); sage was supposed to mint via relayauth and present a bearer; the chain breaks because relayauth can't issue what the verifier requires and there's no API key path for sage to authenticate to relayauth in the first place.

End-result in production: sage's specialist tool calls 401 every time, harness hits `max_iterations_reached`, sage falls back to "I could not complete that request right now." in Slack. Tracked across cloud #267, sage #97, cloud #280.

What's in the spec
Four-phase migration:
1. `POST /v1/api-keys` + `GET` + revoke, `api_keys` table, `x-api-key` auth middleware, accept-either bearer/api-key on identity + token routes.

Risks, mitigations, and open questions inline. The hardest operational piece is that relayauth has no CD workflow today; a separate spec issue should track that as a hard prerequisite before the cutover lands.

Asking for review on
- `POST /v1/api-keys` itself should accept `x-api-key` auth (currently bearer-only to keep the bootstrap chain explicit)

🤖 Generated with Claude Code