spec: API keys + RS256 production migration by khaliqgant · Pull Request #18 · AgentWorkforce/relayauth

khaliqgant · 2026-04-22T18:18:30Z

Why now

Production relayauth is two pieces short of what its own specs already assume:

specs/token-format.md mandates RS256 / EdDSA. Production signs HS256. Every spec-compliant verifier (including @relayauth/sdk's TokenVerifier at verify.js:184-200) refuses HS256 outright.
The /v1/api-keys endpoint family is declared in the OpenAPI spec and the contract test, but the routes don't exist and there's no api_keys table.

The user-visible failure that surfaced this: cloud's specialist worker adopted RS256/JWKS-style auth on /a2a/rpc (cloud #267); sage was supposed to mint via relayauth and present a bearer; the chain breaks because relayauth can't issue what the verifier requires and there's no API key path for sage to authenticate to relayauth in the first place.

End-result in production: sage's specialist tool calls 401 every time, harness hits max_iterations_reached, sage falls back to "I could not complete that request right now." in Slack. Tracked across cloud #267, sage #97, cloud #280.

What's in the spec

Four-phase migration:

API keys — POST /v1/api-keys + GET + revoke, api_keys table, x-api-key auth middleware, accept-either bearer/api-key on identity + token routes.
RS256 — switch signing to RS256, publish RSA public key in JWKS, keep HS256 entry alongside during transition.
Cutover — verifier-first deploy (dual-accept), signer cutover, HS256 sunset after 1h TTL window.
Bootstrap — admin generates a sage→relayauth API key once via the new endpoint, drops it into GitHub Actions secrets, sage and cloud PRs chain through automatically.
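
The accept-either rule from the API keys phase can be sketched as a small credential resolver. This is a minimal sketch under stated assumptions, not the relayauth middleware: `Credential` and `resolveCredential` are hypothetical names, and preferring `x-api-key` over the bearer when both headers are present is an assumption the spec does not state.

```typescript
// Hypothetical types and names; only the header names come from the spec.
type Credential =
  | { kind: "api-key"; value: string }
  | { kind: "bearer"; value: string }
  | { kind: "none" };

// Headers are passed as a plain map with lower-cased keys to keep the sketch framework-free.
function resolveCredential(headers: Record<string, string | undefined>): Credential {
  // Assumption: x-api-key wins when both credentials are sent.
  const apiKey = headers["x-api-key"]?.trim();
  if (apiKey) {
    return { kind: "api-key", value: apiKey };
  }
  const authorization = (headers["authorization"] ?? "").trim();
  const token = /^Bearer\s+(\S+)$/i.exec(authorization)?.[1];
  if (token) {
    return { kind: "bearer", value: token };
  }
  return { kind: "none" };
}
```

A route handler would then validate the returned credential against the matching store (API key table or token verifier) and respond 401 for `none`.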

Risks, mitigations, and open questions are called out inline. The hardest operational piece is that relayauth has no CD workflow today; a separate spec issue should track that as a hard prerequisite before the cutover lands.

Asking for review on

Token signing key storage approach (operator-generated vs. self-bootstrap)
Whether POST /v1/api-keys itself should accept x-api-key auth (currently bearer-only to keep the bootstrap chain explicit)
The phased cutover order — particularly whether 1h TTL is short enough to avoid needing longer dual-accept
The non-blocking but real operational gap: relayauth production has no CD workflow

🤖 Generated with Claude Code

Production relayauth currently signs `{"alg":"HS256","kid":"production"}` and serves no public key material from JWKS. The spec at specs/token-format.md mandates RS256 or EdDSA and explicitly says verifiers must reject any other value, meaning every spec-compliant verifier (including @relayauth/sdk's TokenVerifier) cannot verify production tokens at all. The `/v1/api-keys` endpoint family is declared in OpenAPI + the contract test but is not implemented in either the routes or the DB schema.

The user-visible failure that surfaced this: cloud's specialist worker adopted RS256/JWKS-style auth on /a2a/rpc (cloud #267); sage was supposed to mint via relayauth and present a bearer; neither side of the chain works because relayauth is two pieces short of what the specs assume. Specialist 401s every sage tool call, harness hits max_iterations_reached, sage falls back to "I could not complete that request right now."

This spec lays out the four-phase migration:

1. Implement `/v1/api-keys` POST/GET/revoke + `api_keys` table + x-api-key auth middleware + accept-either bearer/api-key on identities/tokens routes.
2. Switch token signing to RS256, publish RSA public key from JWKS, keep old HS256 entry alongside during transition.
3. Deploy with verifier-first rollout (accepts both algs), then signer cutover, then HS256 sunset after the 1h TTL window.
4. Bootstrap a sage→relayauth API key with the now-existing endpoint; the existing PRs (sage #97, cloud #280) chain through automatically once the GitHub Actions secret is set.

Risks, mitigations, and open questions called out inline. The hardest operational piece is that relayauth has no CD workflow today; a separate spec issue should track that as a hard prerequisite before the cutover lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dfbef0d03e


chatgpt-codex-connector · 2026-04-22T18:22:54Z

+    const scopePath = ts.split(":").slice(3).join(":").replace("/*", "");
+    const requestedPath = requestedScope.split(":").slice(3).join(":").replace(/\/[^/]+$/, "");
+    return ts.startsWith("relayfile:fs:read:") && requestedScope.startsWith("relayfile:fs:read:/") && 
+           (scopePath === "" || requestedPath.startsWith(scopePath));


Enforce directory boundary in file-scope matching

The prefix check in /v1/authorize/file can grant access to sibling paths that merely share a textual prefix. For example, a token with relayfile:fs:read:/frontend/* will pass for a request like relayfile:fs:read:/frontendevil/secrets.txt because requestedPath.startsWith(scopePath) is true for /frontendevil vs /frontend. This turns the new authorization route into an over-permissive matcher and can incorrectly allow reads outside the intended directory tree.
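
One way to enforce the directory boundary is to compare against the scope directory plus a trailing slash instead of the bare prefix. The following is a minimal sketch, not the route's actual code: `fileScopeAllows` is a hypothetical name, and the sketch assumes scopes look like `relayfile:fs:read:/dir/*`. It does not normalise `..` segments, which a real matcher would also need to handle.

```typescript
const READ_PREFIX = "relayfile:fs:read:";

// Hypothetical helper: returns true only when the requested path is the
// scope directory itself or sits inside it.
function fileScopeAllows(tokenScope: string, requestedScope: string): boolean {
  if (!tokenScope.startsWith(READ_PREFIX) || !requestedScope.startsWith(READ_PREFIX + "/")) {
    return false;
  }
  // "/frontend/*" becomes "/frontend"; "/*" becomes "" (the whole tree).
  const scopeDir = tokenScope.slice(READ_PREFIX.length).replace(/\/\*$/, "");
  const requestedPath = requestedScope.slice(READ_PREFIX.length);
  if (scopeDir === "") {
    return true;
  }
  // The appended "/" is the boundary: "/frontendevil" no longer matches "/frontend".
  return requestedPath === scopeDir || requestedPath.startsWith(scopeDir + "/");
}
```

With this check, a token scoped to `relayfile:fs:read:/frontend/*` still reads `/frontend/src/app.ts` but is refused for `/frontendevil/secrets.txt`.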


chatgpt-codex-connector · 2026-04-22T18:22:54Z

+    if (fileScopes.length > 0) {
+      return `✓ Allowed (can read: ${fileScopes.join(", ")})`;
+    }


Preserve denied outcome when summarizing scope checks

The scope.check formatter now always renders an allowed message whenever any relayfile scopes are present, even if payload.result is denied. authenticateAndAuthorize emits denied scope.check events before scope.denied, so this causes the event feed to show contradictory success text for failed authorization attempts, which makes debugging authorization behavior unreliable.
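
A formatter that preserves the denied outcome could branch on the result before summarising scopes. This is a sketch under assumptions: the `ScopeCheckPayload` shape and the `formatScopeCheck` name are hypothetical, and the only detail taken from the review is that `payload.result` can be denied.

```typescript
// Hypothetical payload shape for a scope.check event.
interface ScopeCheckPayload {
  result: "allowed" | "denied";
  scopes: string[];
}

function formatScopeCheck(payload: ScopeCheckPayload): string {
  const fileScopes = payload.scopes.filter((s) => s.startsWith("relayfile:"));
  // Check the outcome first so a denied event can never render as a success.
  if (payload.result === "denied") {
    return fileScopes.length > 0
      ? `✗ Denied (token can read: ${fileScopes.join(", ")})`
      : "✗ Denied";
  }
  if (fileScopes.length > 0) {
    return `✓ Allowed (can read: ${fileScopes.join(", ")})`;
  }
  return "✓ Allowed";
}
```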


Two findings from scoping the implementation:

1. `POST /v1/tokens` is unimplemented in the relayauth server. The discovery endpoint advertises it, the SDK calls it, the OpenAPI spec lists it, but there's no route handler in server.ts. Cloud's e2e tests mock it; production has no such mock. Without a working token endpoint, API keys have nothing useful to authenticate to, so this is a hard precondition for Phase 1.
2. The deployed worker lives in cloud/packages/relayauth/, not this repo directly. This repo (@relayauth/server) provides Hono routes + storage interfaces; cloud provides Cloudflare adapters (D1 + KV + Durable Objects) and the worker entrypoint. Most phases now have two PRs each: one here, one in cloud.

Adds Phase 0 (tokens route) and a repo-split section explaining the two-PR cadence. Existing phases renumber as appropriate (numbered identifiers preserved to avoid renaming churn).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Six agent-relay workflows implementing the migration spec end-to-end. All share the same strict review template: implementer self-review, two parallel specialist peer reviews (security + spec/compat), architect synthesis, fix loop, and a final approval gate that hard-fails if any reviewer is still requesting changes or tests/typecheck don't pass.

Workflows:

118-tokens-route-phase0
  POST /v1/tokens, /refresh, /revoke, /introspect. Precondition for everything else. Implementer + 2 reviewers (security, spec) + approval gate.

119-api-keys-phase1
  /v1/api-keys POST/GET/revoke + ApiKeyStorage interface in @relayauth/server, D1 migration 0002_api_keys.sql + Cloudflare adapter in cloud/packages/relayauth, x-api-key auth on identities + tokens routes.

120-rs256-signing-phase2a
  RS256 signing helper + JWKS RSA publication in @relayauth/server. ADDITIVE only: HS256 stays default. Crypto-reviewer is the gate.

121-sdk-dual-verify-phase3a
  TokenVerifier accepts both RS256 (new) and HS256 (legacy) during cutover. Crypto + compat reviewers gate. Must land + propagate to all consumers BEFORE 122 fires.

122-cloud-cutover-phase3b
  Production cryptographic cutover. Three flag-controlled steps with HUMAN go/no-go between each: publish RSA key in JWKS → flip signer to RS256 → 90-min soak → sunset HS256. Observability agent reads worker tails between steps; rollback-reviewer confirms each step has a sub-5-min rollback path.

123-bootstrap-sage-key-phase4
  Operational. Produces the runbook + scripts to provision sage's RelayAuth API key (admin bearer required), set the GitHub secret, and chain through sage release + cloud bump + deploy. Security-reviewer checks scope minimisation + rotation plan.

Run order: 118 → 119 → 120 → 121 → publish + propagate → 122 → 123.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
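
The dual-accept behaviour that 121 adds and 122 later removes can be sketched as an algorithm allow-list keyed by cutover stage. This is an illustrative sketch only: `CutoverStage`, `acceptedAlgs`, and `isAlgAccepted` are hypothetical names and do not describe the TokenVerifier API.

```typescript
// Hypothetical stage names: dual-accept while HS256 tokens may still be live,
// hs256-sunset once the legacy algorithm is retired.
type CutoverStage = "dual-accept" | "hs256-sunset";

function acceptedAlgs(stage: CutoverStage): ReadonlySet<string> {
  return stage === "dual-accept"
    ? new Set(["RS256", "HS256"])
    : new Set(["RS256"]);
}

// headerAlg is the `alg` value from a decoded JWT header. Anything that is
// not an explicitly listed string, including "none", is rejected.
function isAlgAccepted(headerAlg: unknown, stage: CutoverStage): boolean {
  return typeof headerAlg === "string" && acceptedAlgs(stage).has(headerAlg);
}
```

Keeping the allow-list explicit means the sunset step is a one-value change rather than a new code path.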

Single CLI entrypoint that runs phases 118-123 in order with proper branch + commit + PR handling per affected repo. Adds the missing piece the workflows themselves don't do: they only modify files in place.

Per-phase manifest:

118 → relayauth, branch migration/rs256/118-tokens-route
119 → relayauth + cloud, paired branches migration/rs256/119-api-keys
120 → relayauth, branch migration/rs256/120-rs256-signing
121 → relayauth, branch migration/rs256/121-sdk-dual-verify
122 → cloud, branch migration/rs256/122-cutover-infra (HIGH risk)
123 → cloud, branch migration/rs256/123-bootstrap-runbook

For each phase the runner:

1. Branches off origin/main in every affected repo.
2. Runs the workflow (workflow modifies files; if its approval gate fails, runner exits 1 with the branch preserved for retry).
3. Commits the diff with a generated message referencing the workflow and the spec.
4. Pushes the branch and opens a PR via gh, with a structured body pointing back to the spec + run order.

Hard human pause before HIGH-risk phases (122): the operator must type "PROCEED" to confirm preconditions (118-121 deployed, sdk dual-verify propagated to consumers, operator window scheduled).

State checkpointed in .rs256-migration-state.json so re-runs pick up where they left off.

Flags: --from N, --only N, --dry-run, --no-pause, --skip-pr.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
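
How `--from`, `--only`, and the checkpointed state might combine can be sketched as a pure selection function. The runner itself is a shell script, so this is only an illustration: `phasesToRun` and `RunOptions` are hypothetical names, and treating `--only` as a forced re-run of an already-passed phase is an assumption, not documented behaviour.

```typescript
const PHASES = [118, 119, 120, 121, 122, 123] as const;

// Hypothetical option shape mirroring the --from N and --only N flags.
interface RunOptions {
  from?: number;
  only?: number;
}

function phasesToRun(options: RunOptions, passed: ReadonlySet<number>): number[] {
  return PHASES.filter((phase) => {
    // Assumption: --only runs exactly that phase, even if it already passed.
    if (options.only !== undefined) return phase === options.only;
    if (options.from !== undefined && phase < options.from) return false;
    // Otherwise skip phases the state file already marks as passed.
    return !passed.has(phase);
  });
}
```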

devin-ai-integration

Devin Review found 2 new potential issues.


devin-ai-integration · 2026-04-22T18:44:14Z

+get_state() {
+  local id="$1"
+  ensure_state
+  jq -r --arg id "$id" '.[$id] // "pending"' "$STATE_FILE"


🔴 get_state reads the full nested object instead of .status, so phases are never recognized as already passed

set_state at line 97 writes a nested object {status: $status, ts: $ts} under each phase ID, but get_state at line 88 reads .[$id] which returns the entire object (e.g. {"status":"passed","ts":"2026-..."}), not just the status string. This means the comparison at scripts/run-rs256-migration.sh:217 (if [ "$state" = "passed" ]) will never be true, completely breaking the "pick up where we left off" behavior. Every re-invocation of the script will re-run all phases from the beginning, including re-running agent-relay workflows and re-branching/committing/PRing in git — potentially destructive for already-merged work.

Suggested change

-  jq -r --arg id "$id" '.[$id] // "pending"' "$STATE_FILE"
+  jq -r --arg id "$id" '.[$id].status // "pending"' "$STATE_FILE"
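
The same lookup, written in TypeScript purely to illustrate why the nested read matters. The `{status, ts}` entry shape comes from the review comment above; `MigrationState` and `getPhaseStatus` are hypothetical names, and the runner itself uses jq rather than this code.

```typescript
// Each phase ID maps to an object, not a bare status string.
interface PhaseEntry {
  status: string;
  ts: string;
}
type MigrationState = Record<string, PhaseEntry | undefined>;

function getPhaseStatus(state: MigrationState, id: string): string {
  // Read the nested .status field; fall back to "pending" for unknown phases.
  return state[id]?.status ?? "pending";
}
```

Comparing the whole entry to `"passed"` can never succeed, which is the failure the review describes; comparing `getPhaseStatus(state, id)` can.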


devin-ai-integration · 2026-04-22T18:44:15Z

+        curl -fsS -o /dev/null -w 'HTTP %{http_code}\\n' https://api.relayauth.dev/v1/api-keys -X POST -H 'content-type: application/json' -d '{}' ;
+        # Without auth this should be 401, not 404. 404 means phase 1 did not deploy.


🔴 curl -f causes precondition check to fail on the expected HTTP 401 response

The precondition check at line 38 uses curl -fsS to verify the /v1/api-keys endpoint is deployed. The comment on line 39 says "Without auth this should be 401, not 404" — the intent is to distinguish between 401 (endpoint exists, requires auth) and 404 (endpoint not deployed). However, the -f flag causes curl to exit with code 22 for any HTTP error including 401. Combined with set -e on line 36, this aborts the entire precondition step on the expected 401 response, making the workflow un-runnable even when all preconditions are actually met.

Prompt for agents

The problem is in the `preconditions` step command in `workflows/123-bootstrap-sage-key-phase4.ts`, around line 36-38. The curl command uses `-f` which makes curl exit non-zero on HTTP 401, but 401 is the expected/successful response (it proves the endpoint is deployed and requires auth). Combined with `set -e`, this kills the entire precondition check.

To fix: Remove the `-f` flag from the curl call and instead capture the HTTP status code, then check it explicitly. For example, replace the curl line with something like:

status=$(curl -sS -o /dev/null -w '%{http_code}' https://api.relayauth.dev/v1/api-keys -X POST -H 'content-type: application/json' -d '{}') ;
if [ "$status" = "404" ]; then echo "ENDPOINT_NOT_DEPLOYED (got 404)"; exit 1; fi ;
echo "HTTP $status (endpoint exists)" ;

This allows 401 (and other non-404 responses) to pass while correctly failing on 404.
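
The status-based decision can also be expressed as a pure function, which makes the 401-versus-404 distinction easy to unit test. This is a sketch mirroring the suggested shell logic, not code from the workflow: `classifyApiKeysProbe` and `EndpointCheck` are hypothetical names.

```typescript
// Hypothetical result shape for the precondition probe.
interface EndpointCheck {
  deployed: boolean;
  detail: string;
}

// status is the HTTP status code returned by an unauthenticated POST /v1/api-keys.
function classifyApiKeysProbe(status: number): EndpointCheck {
  if (status === 404) {
    // 404 means the route does not exist, so phase 1 did not deploy.
    return { deployed: false, detail: "ENDPOINT_NOT_DEPLOYED (got 404)" };
  }
  // 401 is the expected answer; like the shell version, any other non-404 also passes.
  return { deployed: true, detail: `HTTP ${status} (endpoint exists)` };
}
```

Like the shell suggestion, this treats a 5xx response as "deployed"; a stricter check might accept only 401.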


khaliqgant · 2026-04-22T18:47:29Z

How to run

Single CLI entrypoint executes phases 118-123 in order with branch + commit + PR per repo.

One-shot (full migration)

cd /path/to/relayauth
./scripts/run-rs256-migration.sh

That's it. The runner branches each affected repo off origin/main, runs the workflow, commits the diff, pushes, and opens a PR. Hard pause before phase 122 (HIGH-risk crypto cutover) requires you to type PROCEED.

Common flags

# Preview what would happen, no changes
./scripts/run-rs256-migration.sh --dry-run

# Resume from a specific phase (state is persisted between runs)
./scripts/run-rs256-migration.sh --from 120

# Run a single phase
./scripts/run-rs256-migration.sh --only 119

# Commit but don't open PRs (useful for local iteration)
./scripts/run-rs256-migration.sh --skip-pr

# Skip the HIGH-risk pre-cutover human pause (only for testing in dev/staging)
./scripts/run-rs256-migration.sh --no-pause

Repo paths

The runner expects two repos checked out side-by-side. Override with env vars if your layout differs:

RELAYAUTH_REPO=~/work/relayauth \
CLOUD_REPO=~/work/cloud \
./scripts/run-rs256-migration.sh

Defaults: /Users/khaliqgant/Projects/AgentWorkforce/{relayauth,cloud}.

Prereqs

agent-relay CLI on PATH
gh (GitHub CLI) authenticated against AgentWorkforce/{relayauth,cloud}
jq for state-file parsing
Both repos checked out on a clean working tree (uncommitted changes will fail the branch checkout)

What gets opened

7 PRs total across two repos, one per phase per affected repo:

Phase	Repo	Branch
118	relayauth	`migration/rs256/118-tokens-route`
119	relayauth + cloud	`migration/rs256/119-api-keys` (paired)
120	relayauth	`migration/rs256/120-rs256-signing`
121	relayauth	`migration/rs256/121-sdk-dual-verify`
122	cloud	`migration/rs256/122-cutover-infra`
123	cloud	`migration/rs256/123-bootstrap-runbook`

Each PR body references this spec + the run order. The workflow's strict review template (implementer self-review + 2 specialist peer reviewers + architect synthesis + approval gate) gates the commit — PRs only open if the gate passed.

State + recovery

Progress is checkpointed in .rs256-migration-state.json (in the relayauth repo root). Re-running the script skips already-passed phases. If a workflow's approval gate fails, the runner exits 1 and the branch is preserved so you can investigate and re-run.

Operator window for phase 122

Phase 122 (cryptographic cutover) has three internal human go/no-go gates that block on touch-files:

touch /path/to/relayauth/.cutover-step1-approved   # after RSA key in JWKS confirmed
touch /path/to/relayauth/.cutover-step2-approved   # after signer flip + healthy tail
touch /path/to/relayauth/.cutover-step3-approved   # after HS256 sunset confirmed

Each gate prints exactly what to verify (curl JWKS, tail workers, etc.) before you mark it approved. There's also a 90-minute soak window between steps 2 and 3 — abort with touch .cutover-soak-aborted if anything looks off.
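
The touch-file gates can be modelled as a small status check. This is a sketch under assumptions, not the workflow's implementation: `cutoverGateStatus` is a hypothetical name, the abort file is assumed to live in the relayauth repo root, and letting the abort file win at every step (not only during the soak) is also an assumption. The `exists` callback stands in for a filesystem check such as Node's `fs.existsSync`.

```typescript
type GateStatus = "approved" | "aborted" | "waiting";

function cutoverGateStatus(
  repoRoot: string,
  step: 1 | 2 | 3,
  exists: (path: string) => boolean,
): GateStatus {
  // Assumption: an abort marker overrides any approval marker.
  if (exists(`${repoRoot}/.cutover-soak-aborted`)) return "aborted";
  if (exists(`${repoRoot}/.cutover-step${step}-approved`)) return "approved";
  return "waiting";
}
```

A runner would poll this until it returns something other than `waiting`, proceeding on `approved` and rolling back on `aborted`.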

khaliqgant force-pushed the spec/api-keys-and-rs256-migration branch from dfbef0d to 740a6c5 on April 22, 2026 18:20

devin-ai-integration Bot reviewed Apr 22, 2026


chatgpt-codex-connector Bot reviewed Apr 22, 2026


khaliqgant and others added 3 commits April 22, 2026 20:26

devin-ai-integration Bot reviewed Apr 22, 2026


khaliqgant merged commit 14551af into main Apr 22, 2026
2 checks passed

khaliqgant deleted the spec/api-keys-and-rs256-migration branch April 22, 2026 18:48

khaliqgant merged 4 commits into main from spec/api-keys-and-rs256-migration
