spec: API keys + RS256 production migration #18

Merged
khaliqgant merged 4 commits into main from spec/api-keys-and-rs256-migration on Apr 22, 2026

@khaliqgant khaliqgant commented Apr 22, 2026

Why now

Production relayauth is two pieces short of what its own specs already assume:

  1. specs/token-format.md mandates RS256 / EdDSA. Production signs HS256. Every spec-compliant verifier (including @relayauth/sdk's TokenVerifier at verify.js:184-200) refuses HS256 outright.
  2. The /v1/api-keys endpoint family is declared in the OpenAPI spec and the contract test, but the routes don't exist and there's no api_keys table.

The user-visible failure that surfaced this: cloud's specialist worker adopted RS256/JWKS-style auth on /a2a/rpc (cloud #267); sage was supposed to mint via relayauth and present a bearer; the chain breaks because relayauth can't issue what the verifier requires and there's no API key path for sage to authenticate to relayauth in the first place.

End-result in production: sage's specialist tool calls 401 every time, harness hits max_iterations_reached, sage falls back to "I could not complete that request right now." in Slack. Tracked across cloud #267, sage #97, cloud #280.

What's in the spec

Four-phase migration:

  1. API keys: POST /v1/api-keys + GET + revoke, api_keys table, x-api-key auth middleware, accept-either bearer/api-key on identity + token routes.
  2. RS256: switch signing to RS256, publish RSA public key in JWKS, keep HS256 entry alongside during transition.
  3. Cutover: verifier-first deploy (dual-accept), signer cutover, HS256 sunset after 1h TTL window.
  4. Bootstrap: admin generates a sage→relayauth API key once via the new endpoint, drops it into GitHub Actions secrets, sage and cloud PRs chain through automatically.
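
The dual-accept window in phase 3 can be pictured as an algorithm allowlist that depends on the rollout stage. This is a minimal sketch under stated assumptions: `RolloutStage`, `JwtHeader`, and `isAlgAccepted` are hypothetical names, not the @relayauth/sdk API. The only facts taken from the spec are that RS256/EdDSA are mandated and that HS256 stays accepted until the sunset.

```typescript
// Illustrative only: these names are hypothetical, not the @relayauth/sdk API.
type RolloutStage = "dual-accept" | "hs256-sunset";
type JwtHeader = { alg: string; kid?: string };

// Algorithms the spec mandates, plus the legacy one tolerated during cutover.
const SPEC_ALGS: readonly string[] = ["RS256", "EdDSA"];
const LEGACY_ALGS: readonly string[] = ["HS256"];

function isAlgAccepted(header: JwtHeader, stage: RolloutStage): boolean {
  if (SPEC_ALGS.includes(header.alg)) return true;
  // HS256 is only honoured while the dual-accept window is open.
  return stage === "dual-accept" && LEGACY_ALGS.includes(header.alg);
}
```

Deploying verifiers in the `dual-accept` stage first means the signer can flip to RS256 without any verifier rejecting tokens that were minted as HS256 before the flip and have not yet expired.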

Risks, mitigations, and open questions are called out inline. The hardest operational piece is that relayauth has no CD workflow today; a separate spec issue should track that as a hard prerequisite before the cutover lands.

Asking for review on

  • Token signing key storage approach (operator-generated vs. self-bootstrap)
  • Whether POST /v1/api-keys itself should accept x-api-key auth (currently bearer-only to keep the bootstrap chain explicit)
  • The phased cutover order — particularly whether 1h TTL is short enough to avoid needing longer dual-accept
  • The non-blocking but real operational gap: relayauth production has no CD workflow

🤖 Generated with Claude Code


Production relayauth currently signs `{"alg":"HS256","kid":"production"}`
and serves no public key material from JWKS. The spec at
specs/token-format.md mandates RS256 or EdDSA and explicitly says
verifiers must reject any other value — meaning every spec-compliant
verifier (including @relayauth/sdk's TokenVerifier) cannot verify
production tokens at all. The `/v1/api-keys` endpoint family is
declared in OpenAPI + the contract test but is not implemented in
either the routes or the DB schema.

The user-visible failure that surfaced this: cloud's specialist worker
adopted RS256/JWKS-style auth on /a2a/rpc (cloud #267); sage was
supposed to mint via relayauth and present a bearer; neither side of
the chain works because relayauth is two pieces short of what the
specs assume. Specialist 401s every sage tool call, harness hits
max_iterations_reached, sage falls back to "I could not complete that
request right now."

This spec lays out the four-phase migration:

  1. Implement `/v1/api-keys` POST/GET/revoke + `api_keys` table +
     x-api-key auth middleware + accept-either bearer/api-key on
     identities/tokens routes.
  2. Switch token signing to RS256, publish RSA public key from JWKS,
     keep old HS256 entry alongside during transition.
  3. Deploy with verifier-first rollout (accepts both algs), then
     signer cutover, then HS256 sunset after the 1h TTL window.
  4. Bootstrap a sage→relayauth API key with the now-existing endpoint;
     the existing PRs (sage #97, cloud #280) chain through automatically
     once the GitHub Actions secret is set.

Risks, mitigations, and open questions called out inline. The hardest
operational piece is that relayauth has no CD workflow today; a
separate spec issue should track that as a hard prerequisite before
the cutover lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@khaliqgant khaliqgant force-pushed the spec/api-keys-and-rs256-migration branch from dfbef0d to 740a6c5 Compare April 22, 2026 18:20

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dfbef0d03e

Comment thread packages/server/src/routes/authorize.ts Outdated
Comment on lines +24 to +27
const scopePath = ts.split(":").slice(3).join(":").replace("/*", "");
const requestedPath = requestedScope.split(":").slice(3).join(":").replace(/\/[^/]+$/, "");
return ts.startsWith("relayfile:fs:read:") && requestedScope.startsWith("relayfile:fs:read:/") &&
(scopePath === "" || requestedPath.startsWith(scopePath));

P1: Enforce directory boundary in file-scope matching

The prefix check in /v1/authorize/file can grant access to sibling paths that merely share a textual prefix. For example, a token with relayfile:fs:read:/frontend/* will pass for a request like relayfile:fs:read:/frontendevil/secrets.txt because requestedPath.startsWith(scopePath) is true for /frontendevil vs /frontend. This turns the new authorization route into an over-permissive matcher and can incorrectly allow reads outside the intended directory tree.
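
One way to enforce the boundary is to keep the trailing slash of the granted directory when comparing prefixes. The sketch below is a hypothetical matcher, not the code that landed in authorize.ts; `scopeAllowsRead` and `READ_PREFIX` are illustrative names, and rejecting `..` segments outright is an added assumption.

```typescript
// Hypothetical boundary-aware matcher; not the code in authorize.ts.
const READ_PREFIX = "relayfile:fs:read:";

function scopeAllowsRead(tokenScope: string, requestedScope: string): boolean {
  if (!tokenScope.startsWith(READ_PREFIX) || !requestedScope.startsWith(READ_PREFIX + "/")) {
    return false;
  }
  const granted = tokenScope.slice(READ_PREFIX.length);
  const requested = requestedScope.slice(READ_PREFIX.length);

  // Assumption: reject traversal segments rather than trying to normalise them.
  if (requested.split("/").includes("..")) return false;

  if (granted.endsWith("/*")) {
    // "/frontend/*" becomes "/frontend/"; the trailing slash is what enforces
    // the directory boundary. "/*" becomes "/", which matches every path.
    const dir = granted.slice(0, -1);
    return requested.startsWith(dir);
  }
  // No wildcard: require an exact path match.
  return requested === granted;
}
```

With this shape, `relayfile:fs:read:/frontend/*` still covers `/frontend/src/app.ts` but no longer covers `/frontendevil/secrets.txt`.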

Comment on lines +124 to +126
if (fileScopes.length > 0) {
return `✓ Allowed (can read: ${fileScopes.join(", ")})`;
}

P2: Preserve denied outcome when summarizing scope checks

The scope.check formatter now always renders an allowed message whenever any relayfile scopes are present, even if payload.result is denied. authenticateAndAuthorize emits denied scope.check events before scope.denied, so this causes the event feed to show contradictory success text for failed authorization attempts, which makes debugging authorization behavior unreliable.
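
A formatter that checks the outcome before it looks at the scopes avoids the contradictory text. This is a sketch under assumptions: the `ScopeCheckPayload` shape and the `formatScopeCheck` name are hypothetical, and only `payload.result` and the relayfile scope list come from the comment above.

```typescript
// Hypothetical event shape; the real scope.check payload may differ.
type ScopeCheckPayload = { result: "allowed" | "denied"; scopes: string[] };

function formatScopeCheck(payload: ScopeCheckPayload): string {
  const fileScopes = payload.scopes.filter((s) => s.startsWith("relayfile:"));
  // Branch on the outcome first so a denied event can never render as success.
  if (payload.result === "denied") {
    return fileScopes.length > 0
      ? `✗ Denied (token can read: ${fileScopes.join(", ")})`
      : "✗ Denied";
  }
  return fileScopes.length > 0
    ? `✓ Allowed (can read: ${fileScopes.join(", ")})`
    : "✓ Allowed";
}
```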

khaliqgant and others added 3 commits April 22, 2026 20:26
Two findings from scoping the implementation:

1. `POST /v1/tokens` is unimplemented in the relayauth server. The
   discovery endpoint advertises it, the SDK calls it, the OpenAPI
   spec lists it — but there's no route handler in server.ts. Cloud's
   e2e tests mock it; production has no such mock. Without a working
   token endpoint, API keys have nothing useful to authenticate to,
   so this is a hard precondition for Phase 1.

2. The deployed worker lives in cloud/packages/relayauth/, not this
   repo directly. This repo (@relayauth/server) provides Hono routes
   + storage interfaces; cloud provides Cloudflare adapters (D1 + KV
   + Durable Objects) and the worker entrypoint. Most phases now
   have two PRs each: one here, one in cloud.

Adds Phase 0 (tokens route) and a repo-split section explaining the
two-PR cadence. Existing phases renumber as appropriate (numbered
identifiers preserved to avoid renaming churn).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Six agent-relay workflows implementing the migration spec end-to-end.
All share the same strict review template — implementer self-review,
two parallel specialist peer reviews (security + spec/compat), architect
synthesis, fix loop, and a final approval gate that hard-fails if any
reviewer is still requesting changes or tests/typecheck don't pass.

Workflows:

  118-tokens-route-phase0    POST /v1/tokens, /refresh, /revoke,
                             /introspect — precondition for everything
                             else. Implementer + 2 reviewers (security,
                             spec) + approval gate.

  119-api-keys-phase1        /v1/api-keys POST/GET/revoke + ApiKeyStorage
                             interface in @relayauth/server, D1 migration
                             0002_api_keys.sql + Cloudflare adapter in
                             cloud/packages/relayauth, x-api-key auth on
                             identities + tokens routes.

  120-rs256-signing-phase2a  RS256 signing helper + JWKS RSA publication
                             in @relayauth/server. ADDITIVE only — HS256
                             stays default. Crypto-reviewer is the gate.

  121-sdk-dual-verify-phase3a TokenVerifier accepts both RS256 (new) and
                             HS256 (legacy) during cutover. Crypto +
                             compat reviewers gate. Must land + propagate
                             to all consumers BEFORE 122 fires.

  122-cloud-cutover-phase3b  Production cryptographic cutover. Three
                             flag-controlled steps with HUMAN go/no-go
                             between each: publish RSA key in JWKS →
                             flip signer to RS256 → 90-min soak →
                             sunset HS256. Observability agent reads
                             worker tails between steps; rollback-
                             reviewer confirms each step has a sub-5-min
                             rollback path.

  123-bootstrap-sage-key-phase4  Operational. Produces the runbook +
                             scripts to provision sage's RelayAuth API
                             key (admin bearer required), set the GitHub
                             secret, and chain through sage release +
                             cloud bump + deploy. Security-reviewer
                             checks scope minimisation + rotation plan.

Run order: 118 → 119 → 120 → 121 → publish + propagate → 122 → 123.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Single CLI entrypoint that runs phases 118-123 in order with proper
branch + commit + PR handling per affected repo. Adds the missing piece
the workflows themselves don't do — they only modify files in place.

Per-phase manifest:

  118 → relayauth, branch migration/rs256/118-tokens-route
  119 → relayauth + cloud, paired branches migration/rs256/119-api-keys
  120 → relayauth, branch migration/rs256/120-rs256-signing
  121 → relayauth, branch migration/rs256/121-sdk-dual-verify
  122 → cloud, branch migration/rs256/122-cutover-infra (HIGH risk)
  123 → cloud, branch migration/rs256/123-bootstrap-runbook

For each phase the runner:
  1. Branches off origin/main in every affected repo.
  2. Runs the workflow (workflow modifies files; if its approval gate
     fails, runner exits 1 with the branch preserved for retry).
  3. Commits the diff with a generated message referencing the workflow
     and the spec.
  4. Pushes the branch and opens a PR via gh, with a structured body
     pointing back to the spec + run order.

Hard human pause before HIGH-risk phases (122) — the operator must type
"PROCEED" to confirm preconditions (118-121 deployed, sdk dual-verify
propagated to consumers, operator window scheduled).

State checkpointed in .rs256-migration-state.json so re-runs pick up
where they left off. Flags: --from N, --only N, --dry-run, --no-pause,
--skip-pr.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 new potential issues.

View 9 additional findings in Devin Review.

get_state() {
local id="$1"
ensure_state
jq -r --arg id "$id" '.[$id] // "pending"' "$STATE_FILE"

🔴 get_state reads the full nested object instead of .status, so phases are never recognized as already passed

set_state at line 97 writes a nested object {status: $status, ts: $ts} under each phase ID, but get_state at line 88 reads .[$id] which returns the entire object (e.g. {"status":"passed","ts":"2026-..."}), not just the status string. This means the comparison at scripts/run-rs256-migration.sh:217 (if [ "$state" = "passed" ]) will never be true, completely breaking the "pick up where we left off" behavior. Every re-invocation of the script will re-run all phases from the beginning, including re-running agent-relay workflows and re-branching/committing/PRing in git — potentially destructive for already-merged work.

Suggested change
- jq -r --arg id "$id" '.[$id] // "pending"' "$STATE_FILE"
+ jq -r --arg id "$id" '.[$id].status // "pending"' "$STATE_FILE"

Comment on lines +38 to +39
curl -fsS -o /dev/null -w 'HTTP %{http_code}\\n' https://api.relayauth.dev/v1/api-keys -X POST -H 'content-type: application/json' -d '{}' ;
# Without auth this should be 401, not 404. 404 means phase 1 did not deploy.

🔴 curl -f causes precondition check to fail on the expected HTTP 401 response

The precondition check at line 38 uses curl -fsS to verify the /v1/api-keys endpoint is deployed. The comment on line 39 says "Without auth this should be 401, not 404" — the intent is to distinguish between 401 (endpoint exists, requires auth) and 404 (endpoint not deployed). However, the -f flag causes curl to exit with code 22 for any HTTP error including 401. Combined with set -e on line 36, this aborts the entire precondition step on the expected 401 response, making the workflow un-runnable even when all preconditions are actually met.

Prompt for agents
The problem is in the `preconditions` step command in `workflows/123-bootstrap-sage-key-phase4.ts`, around line 36-38. The curl command uses `-f` which makes curl exit non-zero on HTTP 401, but 401 is the expected/successful response (it proves the endpoint is deployed and requires auth). Combined with `set -e`, this kills the entire precondition check.

To fix: Remove the `-f` flag from the curl call and instead capture the HTTP status code, then check it explicitly. For example, replace the curl line with something like:

  status=$(curl -sS -o /dev/null -w '%{http_code}' https://api.relayauth.dev/v1/api-keys -X POST -H 'content-type: application/json' -d '{}') ;
  if [ "$status" = "404" ]; then echo "ENDPOINT_NOT_DEPLOYED (got 404)"; exit 1; fi ;
  echo "HTTP $status (endpoint exists)" ;

This allows 401 (and other non-404 responses) to pass while correctly failing on 404.

@khaliqgant (Member, Author) commented

How to run

Single CLI entrypoint executes phases 118-123 in order with branch + commit + PR per repo.

One-shot (full migration)

cd /path/to/relayauth
./scripts/run-rs256-migration.sh

That's it. The runner branches each affected repo off origin/main, runs the workflow, commits the diff, pushes, and opens a PR. Hard pause before phase 122 (HIGH-risk crypto cutover) requires you to type PROCEED.

Common flags

# Preview what would happen, no changes
./scripts/run-rs256-migration.sh --dry-run

# Resume from a specific phase (state is persisted between runs)
./scripts/run-rs256-migration.sh --from 120

# Run a single phase
./scripts/run-rs256-migration.sh --only 119

# Commit but don't open PRs (useful for local iteration)
./scripts/run-rs256-migration.sh --skip-pr

# Skip the HIGH-risk pre-cutover human pause (only for testing in dev/staging)
./scripts/run-rs256-migration.sh --no-pause

Repo paths

The runner expects two repos checked out side-by-side. Override with env vars if your layout differs:

RELAYAUTH_REPO=~/work/relayauth \
CLOUD_REPO=~/work/cloud \
./scripts/run-rs256-migration.sh

Defaults: /Users/khaliqgant/Projects/AgentWorkforce/{relayauth,cloud}.

Prereqs

  • agent-relay CLI on PATH
  • gh (GitHub CLI) authenticated against AgentWorkforce/{relayauth,cloud}
  • jq for state-file parsing
  • Both repos checked out on a clean working tree (uncommitted changes will fail the branch checkout)

What gets opened

7 PRs total across two repos, one per phase per affected repo:

Phase  Repo               Branch
118    relayauth          migration/rs256/118-tokens-route
119    relayauth + cloud  migration/rs256/119-api-keys (paired)
120    relayauth          migration/rs256/120-rs256-signing
121    relayauth          migration/rs256/121-sdk-dual-verify
122    cloud              migration/rs256/122-cutover-infra
123    cloud              migration/rs256/123-bootstrap-runbook

Each PR body references this spec + the run order. The workflow's strict review template (implementer self-review + 2 specialist peer reviewers + architect synthesis + approval gate) gates the commit — PRs only open if the gate passed.

State + recovery

Progress is checkpointed in .rs256-migration-state.json (in the relayauth repo root). Re-running the script skips already-passed phases. If a workflow's approval gate fails, the runner exits 1 and the branch is preserved so you can investigate and re-run.
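
The resume behaviour can be modelled as a filter over the phase list. This is a sketch, not the runner itself (which is a shell script): `MigrationState`, `getStatus`, and `phasesToRun` are hypothetical names. The state shape assumes each phase ID maps to a `{status, ts}` object, as described in the review comment on `get_state`, and treating `--only` as "run that phase regardless of state" is an assumption.

```typescript
// Hypothetical model of .rs256-migration-state.json: phase ID -> { status, ts }.
type MigrationState = Record<string, { status: string; ts: string } | undefined>;

const PHASES: readonly string[] = ["118", "119", "120", "121", "122", "123"];

function getStatus(state: MigrationState, id: string): string {
  // Read the nested `status` field; phases never recorded count as "pending".
  return state[id]?.status ?? "pending";
}

function phasesToRun(
  state: MigrationState,
  opts: { from?: string; only?: string } = {},
): string[] {
  return PHASES.filter((id) => {
    // Assumption: --only runs exactly that phase, even if it already passed.
    if (opts.only !== undefined) return id === opts.only;
    if (opts.from !== undefined && Number(id) < Number(opts.from)) return false;
    return getStatus(state, id) !== "passed";
  });
}
```

Comparing the nested `status` string, rather than the whole state entry, is what lets a re-run skip phases that already passed.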

Operator window for phase 122

Phase 122 (cryptographic cutover) has three internal human go/no-go gates that block on touch-files:

touch /path/to/relayauth/.cutover-step1-approved   # after RSA key in JWKS confirmed
touch /path/to/relayauth/.cutover-step2-approved   # after signer flip + healthy tail
touch /path/to/relayauth/.cutover-step3-approved   # after HS256 sunset confirmed

Each gate prints exactly what to verify (curl JWKS, tail workers, etc.) before you mark it approved. There's also a 90-minute soak window between steps 2 and 3 — abort with touch .cutover-soak-aborted if anything looks off.
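
A single poll of one gate reduces to a small decision over which marker files exist. The sketch below is illustrative: `checkGate` and `GateOutcome` are hypothetical names, the file names are taken from the commands above, and letting the abort marker win at every gate (not only during the soak) is an assumption.

```typescript
// Hypothetical sketch of one go/no-go poll; not the workflow's actual code.
type GateOutcome = "approved" | "aborted" | "waiting";

function checkGate(
  step: 1 | 2 | 3,
  exists: (path: string) => boolean,
  root = ".",
): GateOutcome {
  // Assumption: an abort marker wins over an approval marker,
  // so a late abort is never ignored.
  if (exists(`${root}/.cutover-soak-aborted`)) return "aborted";
  if (exists(`${root}/.cutover-step${step}-approved`)) return "approved";
  return "waiting";
}
```

Passing the existence check in as a function keeps the decision testable; in a Node script it could be `existsSync` from `node:fs`, called in a loop with a sleep between polls.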

@khaliqgant khaliqgant merged commit 14551af into main Apr 22, 2026
2 checks passed
@khaliqgant khaliqgant deleted the spec/api-keys-and-rs256-migration branch April 22, 2026 18:48