fix(kiloclaw): prevent 401s after API key rotation via SecretRef #2765
Merged
pandemicsyn merged 5 commits into main on Apr 23, 2026
The controller rotates KILOCODE_API_KEY in its env and signals the gateway, but openclaw onboard was persisting the literal key to agents/<id>/agent/auth-profiles.json. OpenClaw's auth resolver prefers configured auth-profiles over env vars, so rotations silently no-op'd and the gateway kept authenticating with the stale on-disk key.

- Onboard with --secret-input-mode ref so new installs store an env-backed keyRef instead of the literal key.
- Add an idempotent migration that rewrites legacy plaintext kilocode profiles to the same keyRef shape; run it at the end of runOnboardOrDoctor and again on rotation as defense in depth.
- Switch /_kilo/env/patch from SIGUSR1 to 'openclaw secrets reload', which atomically swaps the SecretRef snapshot without aborting in-flight agent work. SIGUSR1 remains the fallback when the gateway is not reachable.
- Redact OPENCLAW_GATEWAY_TOKEN from execFileSync error messages before they hit controller logs.

Planning doc: .plans/kiloclaw-kilocode-key-secretref.md
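The redaction step in the last bullet can be illustrated with a minimal sketch. The helper names `redactSecret` and `redactError` and the `[REDACTED]` placeholder are assumptions made for illustration; the PR does not show the actual implementation.

```typescript
// Hypothetical helper: replaces every occurrence of a secret value in a message.
function redactSecret(message: string, secret: string | undefined): string {
  // A missing or empty secret cannot be matched meaningfully; leave the message alone.
  if (!secret) return message;
  return message.split(secret).join("[REDACTED]");
}

// Hypothetical wrapper: builds a new Error whose message has the secret scrubbed out.
function redactError(err: unknown, secret: string | undefined): Error {
  const raw = err instanceof Error ? err.message : String(err);
  return new Error(redactSecret(raw, secret));
}
```

A caller would wrap the failing command in try/catch and log `redactError(err, process.env.OPENCLAW_GATEWAY_TOKEN)` rather than the raw error.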
Contributor
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Files Reviewed (3 files)
Reviewed by gpt-5.4-20260305 · 458,905 tokens
The previous attempt (openclaw secrets reload, with SIGUSR1 fallback) does not actually rotate the key. The gateway runs in a child process with a frozen copy of the controller's env at spawn time. secrets reload re-resolves SecretRefs from the gateway's OWN process.env (frozen) and returns stale values with ok:true, short-circuiting the fallback. SIGUSR1 with OPENCLAW_NO_RESPAWN=1 (set in bootstrap.ts:188) takes the in-process restart branch: the process stays alive and re-initializes against the same frozen env.

The only mechanism that delivers a new env var to the gateway in our setup is a full process exit so the controller's supervisor respawns the child with the controller's current env. supervisor.restart() (SIGTERM → child exit → respawn) does exactly that. The respawned gateway reads the already-migrated auth-profiles.json and resolves the env-backed keyRef against the fresh env.

- Drop gateway-rpc.ts wrapper and its test (unused, misleading).
- /_kilo/env/patch now calls supervisor.restart() when the gateway is running; fire-and-forget to keep request latency bounded.
- Keep the response field 'signaled' for wire compatibility with the worker's EnvPatchResponseSchema and reconcile.ts, which treat it as the success bit for live env delivery. Semantics unchanged; only the mechanism behind it changed.
- Add a test that pins the response shape so a rename is caught locally instead of silently breaking the worker.
- Update planning doc with the full analysis of why secrets reload and SIGUSR1 are dead ends, and call out two viable zero-downtime follow-ups (file-source SecretRef; removing OPENCLAW_NO_RESPAWN=1).

Addresses review feedback on PR #2765.
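The frozen-env behavior described above can be reproduced in isolation. The sketch below is not the controller's code: `DEMO_API_KEY` and `readKeyInChild` are hypothetical names, and a short-lived Node child stands in for the gateway. It models the env a long-running child holds as a copy taken before the rotation, then compares it with what a freshly spawned child sees.

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical variable name used only for this demonstration.
const VAR = "DEMO_API_KEY";

// Runs a short-lived Node child with the given env and returns the value it sees.
function readKeyInChild(env: NodeJS.ProcessEnv): string {
  return execFileSync(
    process.execPath,
    ["-e", `process.stdout.write(process.env.${VAR} ?? "")`],
    { env, encoding: "utf8" },
  );
}

process.env[VAR] = "old-key";
// A running child holds a copy of the env taken at spawn time; model that copy here.
const envAtSpawn: NodeJS.ProcessEnv = { ...process.env };

// The parent (the controller in the PR) rotates the key in its own process.env.
process.env[VAR] = "new-key";

// Env captured before the rotation: what an in-process restart would keep using.
const seenByOldChild = readKeyInChild(envAtSpawn);
// Env passed at respawn time: what a full exit plus respawn delivers.
const seenAfterRespawn = readKeyInChild(process.env);
console.log({ seenByOldChild, seenAfterRespawn });
```

Only the second call observes the rotated value, which is the reason the commit switches to a full exit and respawn.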
The existing smoke scripts only checked HTTP status codes from
/_kilo/env/patch. A regression in the rotation mechanism itself
(reverting --secret-input-mode ref, dropping supervisor.restart(),
or renaming the response field) would have passed silently.
controller-smoke-test.sh (fresh / onboard path):
- Assert auth-profiles.json stores a keyRef and no plaintext 'key'
after onboard, guarding against a reverted --secret-input-mode ref.
- Assert /_kilo/env/patch response matches the { ok, signaled } wire
contract that the worker's EnvPatchResponseSchema parses.
- Assert the gateway child PID changes after a rotation request,
confirming supervisor.restart() actually replaced the process
(a reverted SIGUSR1 path with OPENCLAW_NO_RESPAWN=1 would not
change the PID).
controller-entrypoint-smoke-test.sh (volume-mounted / doctor path):
- Seed a legacy plaintext auth-profiles.json in the volume before
launch, assert the file was rewritten to keyRef form and the
plaintext literal is gone after bootstrap completes.
No production code changes.
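The real assertions live in shell scripts that the PR page does not show. As a rough TypeScript rendering of the first two checks (keyRef-only storage and the `{ ok, signaled }` wire contract), under the assumption that the file has the `profiles` map shape shown later in this conversation, one might write the following; `checkKeyRefOnly` and `isEnvPatchResponse` are hypothetical names.

```typescript
type AuthProfiles = { profiles?: Record<string, Record<string, unknown>> };

// Returns a list of problems; an empty list means every kilocode profile is keyRef-only.
function checkKeyRefOnly(file: AuthProfiles): string[] {
  const problems: string[] = [];
  for (const [name, profile] of Object.entries(file.profiles ?? {})) {
    if (profile.provider !== "kilocode") continue;
    if ("key" in profile) problems.push(`${name}: plaintext key present`);
    if (typeof profile.keyRef !== "object" || profile.keyRef === null) {
      problems.push(`${name}: keyRef missing`);
    }
  }
  return problems;
}

// Type guard for the { ok, signaled } wire contract; extra fields are tolerated.
function isEnvPatchResponse(body: unknown): body is { ok: boolean; signaled: boolean } {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  return typeof b.ok === "boolean" && typeof b.signaled === "boolean";
}
```

The PID-change assertion is omitted here because it depends on how the smoke script inspects the running container.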
The planning doc's content is fully captured in the PR description and code comments; keeping it in-repo adds no ongoing value. Also drops two code comments that pointed at the file and one stale mention of 'openclaw secrets reload' as the rotation mechanism (the final implementation uses supervisor.restart()).
Smoke test run:

and what a deploy looks like on disk now:

root@02f254a998c5:~/.openclaw/agents/main/agent# cat auth-profiles.json | jq
{
"version": 1,
"profiles": {
"kilocode:default": {
"type": "api_key",
"provider": "kilocode",
"keyRef": {
"source": "env",
"provider": "default",
"id": "KILOCODE_API_KEY"
}
}
}
}
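For context, a migration that produces the file above from a legacy plaintext profile might look like the following sketch. It is an assumption-laden illustration, not the PR's code: `migrateAuthProfiles` is a hypothetical name, the legacy field is taken to be `key` as the commit messages describe, and the atomic write is done as write-temp-then-rename.

```typescript
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type Profile = Record<string, unknown>;
type AuthProfilesFile = { version?: number; profiles?: Record<string, Profile> };

// Target shape, copied from the on-disk example in this conversation.
const KEY_REF = { source: "env", provider: "default", id: "KILOCODE_API_KEY" };

// Returns how many profiles were rewritten; 0 means the file was already migrated.
function migrateAuthProfiles(path: string): number {
  const file = JSON.parse(readFileSync(path, "utf8")) as AuthProfilesFile;
  let migrated = 0;
  for (const profile of Object.values(file.profiles ?? {})) {
    if (profile.provider !== "kilocode" || !("key" in profile)) continue;
    delete profile.key; // drop the plaintext literal
    profile.keyRef = { ...KEY_REF };
    migrated += 1;
  }
  if (migrated === 0) return 0; // idempotent: nothing to rewrite, leave the file alone
  // Write a sibling temp file (created with mode 0o600), then rename it over the original.
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(file, null, 2) + "\n", { mode: 0o600 });
  renameSync(tmp, path);
  return migrated;
}

// Usage: seed a legacy plaintext file in a temp dir and run the migration twice.
const demoDir = mkdtempSync(join(tmpdir(), "auth-profiles-"));
const demoPath = join(demoDir, "auth-profiles.json");
const legacyFile = {
  version: 1,
  profiles: { "kilocode:default": { type: "api_key", provider: "kilocode", key: "sk-legacy" } },
};
writeFileSync(demoPath, JSON.stringify(legacyFile));
const firstRun = migrateAuthProfiles(demoPath);
const secondRun = migrateAuthProfiles(demoPath);
console.log({ firstRun, secondRun });
```

The second run finding nothing to do is what makes it safe to call the migration both at boot and again on every rotation.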
…ntion
# Conflicts:
#	services/kiloclaw/controller/src/bootstrap.test.ts
jeanduplessis approved these changes on Apr 23, 2026
Summary
The KiloClaw controller rotates the `KILOCODE_API_KEY` env var and signals the gateway, but rotations silently no-op'd due to two compounding bugs:

1. `openclaw onboard` persisted the literal key to `agents/<id>/agent/auth-profiles.json`. OpenClaw's auth resolver prefers configured auth-profiles over env vars, so the stale on-disk key kept winning.
2. The gateway child process receives `env: process.env` at spawn time, so a subsequent `process.env.KILOCODE_API_KEY = <new>` in the controller never reaches the gateway. `openclaw secrets reload` re-resolves SecretRefs from the gateway's OWN (frozen) env. SIGUSR1 with `OPENCLAW_NO_RESPAWN=1` takes an in-process restart branch that keeps the same env. The rotation's "success" signal was cosmetic.

Net result: users saw 401s against the Kilo AI gateway after every key rotation until the next image redeploy.
The fix has three parts, all in the controller:

1. Onboard with `--secret-input-mode ref`. No plaintext key on disk; openclaw stores an env-backed `keyRef` instead.
2. Migrate `auth-profiles.json` on boot. Idempotent rewrite of any plaintext kilocode `key` to the same `keyRef` shape, atomic-written at `0o600`. Runs at the end of `runOnboardOrDoctor` (so the gateway's first read already sees `keyRef`) and defensively on rotation.
3. Rotate via `supervisor.restart()`. `POST /_kilo/env/patch` updates `process.env`, runs the migration, then calls `supervisor.restart()`. SIGTERM → gateway exits → the controller supervisor respawns it with the controller's current env. The respawned gateway reads the migrated file and resolves the keyRef against the fresh env.

Architectural notes:
- The response field of `/_kilo/env/patch` stays named `signaled` for wire compatibility with the worker (`EnvPatchResponseSchema`, `reconcile.ts`). Semantics are unchanged (the controller delivered the env change to the running gateway); only the mechanism changed from SIGUSR1 to a full restart.
- `openclaw secrets reload` looks attractive but is unusable for env-backed SecretRefs in our architecture: the server re-resolves against the gateway's OWN frozen env and returns stale values with `ok: true`. Two viable zero-downtime follow-ups exist but are out of scope: switch to a `{ source: "file", ... }` SecretRef (a controller-owned file the controller updates on rotation), or remove `OPENCLAW_NO_RESPAWN=1` so SIGUSR1 triggers a clean exit + respawn.

Verification
Smoke-test scripts have been extended to guard the new behavior end-to-end, and both passed locally on an x86_64 macOS Docker build:
- `scripts/controller-smoke-test.sh` (fresh / onboard path) asserts:
  - `auth-profiles.json` stores a `keyRef` with no plaintext `key` after onboard (catches a reverted `--secret-input-mode ref`).
  - The `POST /_kilo/env/patch` response has `{ ok: true, signaled: true }` (catches a rename that would silently break `reconcile.ts`).
  - The gateway child PID changes after a rotation request (catches a reverted SIGUSR1 path with `OPENCLAW_NO_RESPAWN=1`).
- `scripts/controller-entrypoint-smoke-test.sh` (volume / doctor path) seeds a legacy plaintext `auth-profiles.json` and asserts it's rewritten to `keyRef` form with the plaintext literal removed.

Run locally (use non-default ports if the local dev stack already holds 18789/18790):
Recommended manual checks on a staging Fly instance:
- Confirm `/root/.openclaw/agents/main/agent/auth-profiles.json` contains `keyRef` and no `key`.
- Set `PROACTIVE_REFRESH_THRESHOLD_HOURS=10000` (worker env) so every alarm tick (≤5 min) rotates. Remove the override after testing.
- Confirm `reconcile.ts` logs `api_key_refreshed` once per rotation (not every alarm), i.e., the new expiry persists after a successful push.

Visual Changes
N/A
Reviewer Notes
- Changes are confined to `services/kiloclaw/controller/` and `services/kiloclaw/scripts/`. No worker, DB, or openclaw-fork changes.
- The `{ ok, signaled }` response contract is preserved. `migratedProfiles` is added as an additive field (zod strips unknown keys).
- Legacy plaintext profiles are rewritten in `auth-profiles.json` on next controller boot (which also restarts the gateway → picks up the migrated file).
- `routes/env.ts` accepts a `deps` parameter for the migration call so tests can stub the filesystem walk.
- `bootstrap.ts::BootstrapDeps` gained a `statSync` field; all fake harnesses were updated.
- The first iteration used `openclaw secrets reload` + SIGUSR1 fallback; reviewer @kilo-code-bot correctly pointed out that neither delivers a new env to a running gateway in our setup (frozen child-process env). The commit history shows that analysis and the subsequent pivot to `supervisor.restart()`.
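As a reading aid for reviewers, the rotation flow from the Summary (update env, migrate, restart without awaiting, report `signaled`) can be sketched with injected dependencies. Everything here is an assumption for illustration: `handleEnvPatch`, the `EnvPatchDeps` shape, and treating `migratedProfiles` as a count are not taken from `routes/env.ts`.

```typescript
// Hypothetical dependency bag so the flow can be exercised with fakes.
type EnvPatchDeps = {
  env: Record<string, string | undefined>;
  migrateProfiles: () => number;
  gatewayRunning: () => boolean;
  restartGateway: () => Promise<void>;
};

function handleEnvPatch(patch: Record<string, string>, deps: EnvPatchDeps) {
  // 1. Update the controller's own env so the next spawn inherits the new values.
  Object.assign(deps.env, patch);
  // 2. Defensive migration so the respawned gateway reads a keyRef, not a literal key.
  const migratedProfiles = deps.migrateProfiles();
  // 3. Restart only if a gateway is running; not awaited, to keep request latency bounded.
  const signaled = deps.gatewayRunning();
  if (signaled) {
    deps.restartGateway().catch((err) => console.error("gateway restart failed", err));
  }
  // `signaled` keeps its name for wire compatibility; `migratedProfiles` is additive.
  return { ok: true, signaled, migratedProfiles };
}
```

Passing the collaborators in as a parameter mirrors the note above about `routes/env.ts` accepting `deps`, and lets a unit test pin the response shape without touching the filesystem or a real supervisor.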