Skip to content

fix(kiloclaw): prevent 401s after API key rotation via SecretRef#2765

Merged
pandemicsyn merged 5 commits intomainfrom
florian/fix/401-prevention
Apr 23, 2026
Merged

fix(kiloclaw): prevent 401s after API key rotation via SecretRef#2765
pandemicsyn merged 5 commits intomainfrom
florian/fix/401-prevention

Conversation

@pandemicsyn
Copy link
Copy Markdown
Contributor

@pandemicsyn pandemicsyn commented Apr 23, 2026

Summary

KiloClaw controller rotates the KILOCODE_API_KEY env var and signals the gateway, but rotations silently no-op'd due to two compounding bugs:

  1. Plaintext on disk shadows env. openclaw onboard persisted the literal key to agents/<id>/agent/auth-profiles.json. OpenClaw's auth resolver prefers configured auth-profiles over env vars, so the stale on-disk key kept winning.
  2. The gateway child process can't see controller env changes. The controller spawns the gateway with env: process.env at spawn time, so subsequent process.env.KILOCODE_API_KEY = <new> in the controller never reaches the gateway. openclaw secrets reload re-resolves SecretRefs from the gateway's OWN (frozen) env. SIGUSR1 with OPENCLAW_NO_RESPAWN=1 takes an in-process restart branch that keeps the same env. The rotation's "success" signal was cosmetic.

Net result: users saw 401s against the Kilo AI gateway after every key rotation until the next image redeploy.

Fix has three parts, all in the controller:

  1. New installs — onboard with --secret-input-mode ref. No plaintext key on disk; openclaw stores an env-backed keyRef instead.
  2. Legacy installs — migrate auth-profiles.json on boot. Idempotent rewrite of any plaintext kilocode key to the same keyRef shape, atomic-written at 0o600. Runs at the end of runOnboardOrDoctor (so the gateway's first read already sees keyRef) and defensively on rotation.
  3. Rotation — supervisor.restart(). POST /_kilo/env/patch updates process.env, runs the migration, then calls supervisor.restart(). SIGTERM → gateway exits → the controller supervisor respawns with the controller's current env. The respawned gateway reads the migrated file and resolves the keyRef against the fresh env.

Architectural notes:

  • The response field on /_kilo/env/patch stays named signaled for wire compatibility with the worker (EnvPatchResponseSchema, reconcile.ts). Semantics unchanged — the controller delivered the env change to the running gateway — only the mechanism changed from SIGUSR1 to a full restart.
  • openclaw secrets reload looks attractive but is unusable for env-backed SecretRefs in our architecture: the server re-resolves against the gateway's OWN frozen env and returns stale values with ok: true. Two viable zero-downtime follow-ups exist but are out of scope: switch to { source: "file", ... } SecretRef (controller-owned file the controller updates on rotation), or remove OPENCLAW_NO_RESPAWN=1 so SIGUSR1 triggers a clean exit + respawn.

Verification

Smoke-test scripts have been extended to guard the new behavior end-to-end, and both passed locally on an x86_64 macOS Docker build:

  • scripts/controller-smoke-test.sh (fresh / onboard path) asserts:
    • auth-profiles.json stores a keyRef with no plaintext key after onboard (catches reverted --secret-input-mode ref).
    • POST /_kilo/env/patch response has { ok: true, signaled: true } (catches a rename that would silently break reconcile.ts).
    • The gateway child PID changes after a rotation request (catches a revert to SIGUSR1, which would leave the PID unchanged under OPENCLAW_NO_RESPAWN=1).
  • scripts/controller-entrypoint-smoke-test.sh (volume / doctor path) seeds a legacy plaintext auth-profiles.json and asserts it's rewritten to keyRef form with the plaintext literal removed.

Run locally (use non-default ports if the local dev stack already holds 18789/18790):

cd services/kiloclaw
docker build --progress=plain -t kiloclaw:controller .
PORT=19789 bash scripts/controller-smoke-test.sh
PORT=19790 bash scripts/controller-entrypoint-smoke-test.sh

Recommended manual checks on a staging Fly instance:

  • Fresh install: /root/.openclaw/agents/main/agent/auth-profiles.json contains keyRef and no key.
  • Legacy install: same file, pre-existing plaintext gets rewritten on next controller boot.
  • Force rotation by setting PROACTIVE_REFRESH_THRESHOLD_HOURS=10000 (worker env) so every alarm tick (≤5 min) rotates. Remove the override after testing.
  • Confirm reconcile.ts logs api_key_refreshed once per rotation (not every alarm), i.e., the new expiry persists after a successful push.
  • Additional manual verification: (reviewer/author to fill in)

Visual Changes

N/A

Reviewer Notes

  • All changes scoped to services/kiloclaw/controller/ and services/kiloclaw/scripts/. No worker, DB, or openclaw-fork changes.
  • The wire contract { ok, signaled } is preserved. migratedProfiles is added as an additive field (zod strips unknown keys).
  • Self-healing rollout: every instance migrates its own auth-profiles.json on next controller boot (which also restarts the gateway → picks up the migrated file).
  • routes/env.ts accepts a deps parameter for the migration call so tests can stub the filesystem walk.
  • bootstrap.ts::BootstrapDeps gained a statSync field; all fake harnesses were updated.
  • A previous iteration of this branch attempted openclaw secrets reload + SIGUSR1 fallback; reviewer @kilo-code-bot correctly pointed out that neither delivers a new env to a running gateway in our setup (frozen child-process env). The commit history shows that analysis and the subsequent pivot to supervisor.restart().

Controller rotates KILOCODE_API_KEY in env and signaled the gateway,
but openclaw onboard was persisting the literal key to
agents/<id>/agent/auth-profiles.json. OpenClaw's auth resolver prefers
configured auth-profiles over env vars, so rotations silently no-op'd
and the gateway kept authenticating with the stale on-disk key.

- Onboard with --secret-input-mode ref so new installs store an
  env-backed keyRef instead of the literal key.
- Add an idempotent migration that rewrites legacy plaintext kilocode
  profiles to the same keyRef shape; run it at the end of
  runOnboardOrDoctor and again on rotation as defense in depth.
- Switch /_kilo/env/patch from SIGUSR1 to 'openclaw secrets reload',
  which atomically swaps the SecretRef snapshot without aborting
  in-flight agent work. SIGUSR1 remains the fallback when the gateway
  is not reachable.
- Redact OPENCLAW_GATEWAY_TOKEN from execFileSync error messages
  before they hit controller logs.

Planning doc: .plans/kiloclaw-kilocode-key-secretref.md
Comment thread services/kiloclaw/controller/src/routes/env.ts Outdated
@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented Apr 23, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (3 files)
  • .plans/kiloclaw-kilocode-key-secretref.md
  • services/kiloclaw/controller/src/bootstrap.ts
  • services/kiloclaw/controller/src/config-writer.ts

Reviewed by gpt-5.4-20260305 · 458,905 tokens

The previous attempt (openclaw secrets reload, with SIGUSR1 fallback)
does not actually rotate the key. The gateway runs in a child process
with a frozen copy of the controller's env at spawn time. secrets
reload re-resolves SecretRefs from the gateway's OWN process.env
(frozen) and returns stale values with ok:true, short-circuiting the
fallback. SIGUSR1 with OPENCLAW_NO_RESPAWN=1 (set in bootstrap.ts:188)
takes the in-process restart branch — the process stays alive and
re-initializes against the same frozen env.

The only mechanism that delivers a new env var to the gateway in our
setup is a full process exit so the controller's supervisor respawns
the child with the controller's current env. supervisor.restart()
(SIGTERM → child exit → respawn) does exactly that. The respawned
gateway reads the already-migrated auth-profiles.json and resolves
the env-backed keyRef against the fresh env.

- Drop gateway-rpc.ts wrapper and its test (unused, misleading).
- /_kilo/env/patch now calls supervisor.restart() when the gateway is
  running; fire-and-forget to keep request latency bounded.
- Keep the response field 'signaled' for wire compatibility with the
  worker's EnvPatchResponseSchema and reconcile.ts, which treat it as
  the success bit for live env delivery. Semantics unchanged; only the
  mechanism behind it changed.
- Add a test that pins the response shape so a rename is caught
  locally instead of silently breaking the worker.
- Update planning doc with the full analysis of why secrets reload and
  SIGUSR1 are dead ends, and call out two viable zero-downtime
  follow-ups (file-source SecretRef; removing OPENCLAW_NO_RESPAWN=1).

Addresses review feedback on PR #2765.
The existing smoke scripts only checked HTTP status codes from
/_kilo/env/patch. A regression in the rotation mechanism itself
(reverting --secret-input-mode ref, dropping supervisor.restart(),
or renaming the response field) would have passed silently.

controller-smoke-test.sh (fresh / onboard path):
- Assert auth-profiles.json stores a keyRef and no plaintext 'key'
  after onboard, guarding against a reverted --secret-input-mode ref.
- Assert /_kilo/env/patch response matches the { ok, signaled } wire
  contract that the worker's EnvPatchResponseSchema parses.
- Assert the gateway child PID changes after a rotation request,
  confirming supervisor.restart() actually replaced the process
  (a reverted SIGUSR1 path with OPENCLAW_NO_RESPAWN=1 would not
  change the PID).

controller-entrypoint-smoke-test.sh (volume-mounted / doctor path):
- Seed a legacy plaintext auth-profiles.json in the volume before
  launch, assert the file was rewritten to keyRef form and the
  plaintext literal is gone after bootstrap completes.

No production code changes.
The planning doc's content is fully captured in the PR description
and code comments; keeping it in-repo adds no ongoing value. Also
drops two code comments that pointed at the file and one stale
mention of 'openclaw secrets reload' as the rotation mechanism
(the final implementation uses supervisor.restart()).
@pandemicsyn
Copy link
Copy Markdown
Contributor Author

Smoke test run:

❯ PORT=19789 bash scripts/controller-smoke-test.sh
waiting for controller on port 19789 ...
  [1] waiting...
  [2] state=bootstrapping
  ready after 3s

--- health endpoints ---
PASS: /_kilo/health -> 200 (got 200)
PASS: /health -> 200 (got 200)

--- gateway status ---
PASS: gateway status (no auth) -> 401 (got 401)
PASS: gateway status (bearer auth) -> 200 (got 200)

--- proxy token ---
PASS: root without proxy token (REQUIRE_PROXY_TOKEN=true) -> 401 (got 401)

--- env patch endpoint ---
PASS: env patch (no auth) -> 401 (got 401)
PASS: env patch (non-patchable key) -> 400 (got 400)
PASS: env patch (empty body) -> 400 (got 400)

--- auth-profiles SecretRef mode ---
PASS: auth-profiles.json stores keyRef (got 1)
PASS: auth-profiles.json has no plaintext key (got 1)

--- env patch rotation semantics ---
PASS: env patch response ok=True (got True)
PASS: env patch response signaled=True (got True)
PASS: gateway pid changed after env patch (got 1)

--- version endpoint ---
PASS: version (no auth) -> 401 (got 401)
PASS: version (bearer auth) -> 200 (got 200)

=== Results: 15 passed, 0 failed ===

cloud/services/kiloclaw florian/fix/401-prevention ≡9s
❯ PORT=19790 bash scripts/controller-entrypoint-smoke-test.sh
waiting for /_kilo/health on port 19790 ...
  [1] waiting...
  [2] state=bootstrapping
  ready after 3s

--- health endpoints ---
PASS: /_kilo/health -> 200 (got 200)
PASS: /health -> 200 (got 200)

--- gateway status ---
PASS: gateway status (no auth) -> 401 (got 401)
PASS: gateway status (bearer auth) -> 200 (got 200)

--- proxy token ---
PASS: root without proxy token (REQUIRE_PROXY_TOKEN=true) -> 401 (got 401)

--- auth-profiles migration (doctor path) ---
PASS: legacy plaintext migrated to keyRef (got 1)
PASS: legacy plaintext removed from disk (got 1)

=== Results: 7 passed, 0 failed ===

and what a deploy looks like on disk now:

root@02f254a998c5:~/.openclaw/agents/main/agent# cat auth-profiles.json | jq
{
  "version": 1,
  "profiles": {
    "kilocode:default": {
      "type": "api_key",
      "provider": "kilocode",
      "keyRef": {
        "source": "env",
        "provider": "default",
        "id": "KILOCODE_API_KEY"
      }
    }
  }
}

…ntion

# Conflicts:
#	services/kiloclaw/controller/src/bootstrap.test.ts
@pandemicsyn pandemicsyn merged commit 515ea7d into main Apr 23, 2026
15 checks passed
@pandemicsyn pandemicsyn deleted the florian/fix/401-prevention branch April 23, 2026 20:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants