From ad7f09996cec6514213c5fccc5dccb6e9bad2698 Mon Sep 17 00:00:00 2001
From: bussyjd
Date: Sat, 30 May 2026 01:57:08 +0400
Subject: [PATCH] docs(security): RFC for per-agent remote-signer keystore isolation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Follow-up to PR #570 (controller Secret RBAC scoping), which scopes the
controller's keystore access by name but cannot isolate one agent's
keystore from another's: all agents share the Secret name
remote-signer-keystore, and a name-scoped ClusterRole matches that name
in any namespace.

Captures the verified custody flow (controller mints keys in-process via
openclaw.GenerateKeystoreInMemory; standing cluster-wide GET on every
agent's keystore + password), the threat model, four options (rejecting
the "per-namespace Role for the controller SA" approach as isolation
theater), and recommends Option B: move keystore minting into the agent
pod so the controller leaves the signer-key custody path entirely.

No controller code yet -- gated on open questions about remote-signer
self-mint capability and the address-reporting channel.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
---
 plans/per-agent-keystore-isolation.md | 137 ++++++++++++++++++++++++++
 1 file changed, 137 insertions(+)
 create mode 100644 plans/per-agent-keystore-isolation.md

diff --git a/plans/per-agent-keystore-isolation.md b/plans/per-agent-keystore-isolation.md
new file mode 100644
index 00000000..6791f331
--- /dev/null
+++ b/plans/per-agent-keystore-isolation.md
@@ -0,0 +1,137 @@
+# Per-agent remote-signer keystore isolation
+
+Status: **design / RFC** -- no controller code yet. Companion to the RBAC scoping
+in `security/controller-secrets-rbac-scope` (PR #570), which is the immediate
+hardening; this doc tracks the deeper isolation follow-up that #570 explicitly
+defers.
+
+## Problem
+
+Every agent's signer keystore is a Secret named `remote-signer-keystore`, one
+per agent namespace (`agent-`, `hermes-obol-agent`, …). After #570 the
+`serviceoffer-controller` ClusterRole reads it with:
+
+```yaml
+resourceNames: ["litellm-secrets", "hermes-api-server", "remote-signer-keystore"]
+verbs: ["get", "delete"]
+```
+
+`resourceNames` on a **ClusterRole** matches a name in *any* namespace. Because
+all agents share the name `remote-signer-keystore`, the controller's
+ServiceAccount can `get` (and `delete`) **every agent's keystore Secret** --
+which contains the V3 keystore JSON **and** the decryption password
+(`internal/serviceoffercontroller/agent_wallet.go::buildSignerKeystoreSecret`).
+
+There are two distinct sub-risks, verified in code:
+
+1. **Standing cross-agent read.** A compromised controller pod can `GET` every
+   agent's keystore + password → derive every agent's signer key → drain every
+   agent wallet. This is the blast radius.
+2. **In-process custody at mint.** `openclaw.GenerateKeystoreInMemory()`
+   (`internal/openclaw/wallet.go:136`) runs *inside the controller process*. The
+   controller generates and holds the private key + password at provisioning
+   time, regardless of RBAC.
+
+Note: in normal operation the controller reads only the **address annotation**
+(`obol.org/wallet-address`) on the reuse path -- never the key data after mint
+(`ensureSignerKeystore`). But `GET` on a Secret returns all keys, so the
+standing capability exposes the key material even though the code does not use
+it.
+
+## Threat model
+
+- **In scope:** a compromised/abused `serviceoffer-controller` (supply-chain on
+  its image, RCE in a reconcile path, a malicious ClusterRole edit) reading or
+  deleting *other* agents' signer keys.
+- **Out of scope:** an attacker who already controls a specific agent's own
+  pod/namespace (they already hold that agent's key by design).
+
+The controller is trusted infra, so this is defense-in-depth / blast-radius
+reduction, not a remotely-exploitable hole. It matters because the asset is
+spendable signer keys for N tenants behind one shared identity.
+
+## Options
+
+### Option 0 -- Accept + document (baseline)
+Keep #570 as the final state; document that the controller is in the keystore
+TCB. Zero additional work. Rejected as the *end* state because the asset
+(multi-tenant signer keys) warrants real isolation, but it is the honest
+fallback if Option B proves too costly for the current milestone.
+
+### Option A -- Per-namespace Role + RoleBinding minted by the controller
+Remove keystore verbs from the ClusterRole; at provisioning, the controller
+creates in `agent-` a Role (get/delete on the keystore in that ns) + a
+RoleBinding to the controller's SA.
+
+**Rejected -- this is isolation theater.** The controller manages *all* agents,
+so it would mint a RoleBinding for itself in *every* agent namespace, retaining
+full reach. It also needs cluster-wide `create` to bootstrap the keystore
+before the Role exists (chicken-and-egg). Net: more RBAC surface, same reach.
+
+### Option B -- Agent self-mints the keystore (controller out of custody) -- RECOMMENDED
+The keypair is generated **inside the agent's own namespace/pod**, never in the
+controller. The controller never gains `get`/`create`/`delete` on
+`remote-signer-keystore`.
+
+Moving parts:
+1. **In-pod keystore generation.** Either (a) the `remote-signer` image
+   self-generates a keystore on first boot when none is mounted, or (b) an init
+   container runs a tiny keystore-gen tool (reuse the logic in
+   `openclaw.GenerateKeystoreInMemory`, shipped as a minimal binary) and writes
+   the Secret. **OPEN QUESTION** -- see below.
+2. **Namespaced write RBAC for the agent SA.** The controller creates, once per
+   agent namespace, a Role granting the *agent's* ServiceAccount
+   `create`/`get` on `remote-signer-keystore` in its own namespace, plus a
+   RoleBinding. The agent SA can only ever touch its own namespace -- true
+   isolation.
+3. **Address reporting via a non-secret channel.** Today the controller learns
+   the address from `mat.Address` at mint. With self-mint it must learn it
+   without reading the Secret. Options: the agent patches
+   `Agent.status.walletAddress` (its SA gets `patch` on `agents/status` in its
+   ns), or writes a non-secret ConfigMap the controller reads. Either avoids a
+   keystore `GET`.
+4. **Controller RBAC shrinks** to: `litellm-secrets` get (fixed ns),
+   `hermes-api-server` get/create/delete, and **no** `remote-signer-keystore`
+   access at all.
+
+### Option C -- Unique per-agent keystore names + per-namespace Roles
+Name the keystore `-remote-signer-keystore`. On its own this does not
+help a ClusterRole (resourceNames is a fixed list, no wildcards) and still needs
+per-namespace Roles → collapses into Option A's theater problem. Only useful as
+a hygiene change layered onto Option B.
+
+## Recommendation
+
+**Option B, phased:**
+
+- **Phase 1 (decision-gated):** confirm the in-pod mint mechanism (B.1). If the
+  remote-signer image can self-generate, this is small; if not, build the
+  init-container mint tool.
+- **Phase 2:** controller mints the namespaced Role/RoleBinding for the agent
+  SA + the address-reporting channel; switch `ensureAgentWallet` to *wait for*
+  the agent-reported address instead of minting it.
+- **Phase 3:** drop `remote-signer-keystore` from the controller ClusterRole;
+  add a guard test asserting the controller has **no** get/create/delete on it.
+
+If Phase 1 reveals B is disproportionately expensive for this milestone, fall
+back to Option 0 and revisit -- but do not ship Option A as a substitute.
+
+## Open questions (resolve before Phase 1 code)
+
+1. Does `ghcr.io/obolnetwork/remote-signer:v0.3.0` support generating a keystore
+   on first boot when the keystore dir is empty? (Check the `ObolNetwork`
+   remote-signer repo / chart.) If yes → no init container needed.
+2. Is `Agent.status.walletAddress` the right address channel, and can the agent
+   SA be granted `patch` on `agents/status` scoped to its own namespace?
+3. Does anything else consume the keystore Secret directly besides the
+   remote-signer pod? (Grep across runtimes before removing controller access.)
+
+## Acceptance criteria
+
+- Controller ClusterRole has **no** verbs on `remote-signer-keystore` (guarded
+  by an extension of `TestServiceOfferControllerSecretRBAC_Scoped`).
+- The agent SA's keystore write access is a namespaced **Role**, never a
+  ClusterRole.
+- `obol agent init` still populates `Agent.status.walletAddress`; teardown still
+  cleans up; release-smoke sell→buy→teardown stays green.
+- Pre-production: no keystore migration needed (greenfield).
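
The namespaced grant described in Option B (moving part 2) and in the second
acceptance criterion could look roughly like the manifest below. This is a
sketch under assumptions, not the controller's actual output: the namespace
`agent-example`, the ServiceAccount name `agent`, and the object name
`agent-keystore-writer` are hypothetical placeholders. Kubernetes RBAC cannot
restrict `create` by `resourceNames`, so only `get` is pinned to the keystore
name here.

```yaml
# Sketch only: namespace, ServiceAccount, and object names are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-keystore-writer        # hypothetical name
  namespace: agent-example           # hypothetical agent namespace
rules:
  # `get` can be pinned to the single keystore Secret by name.
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["remote-signer-keystore"]
    verbs: ["get"]
  # `create` cannot be restricted by resourceNames, so this rule is unnamed:
  # it allows creating any Secret, but only inside this namespace.
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-keystore-writer
  namespace: agent-example
subjects:
  - kind: ServiceAccount
    name: agent                      # hypothetical agent ServiceAccount
    namespace: agent-example
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: agent-keystore-writer
```

Because both objects are namespaced, the grant cannot reach another agent's
namespace even though every agent uses the same Secret name. The unnamed
`create` rule does let the agent SA create other Secrets in its own namespace,
which may be worth weighing when Phase 2 is implemented.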