137 changes: 137 additions & 0 deletions plans/per-agent-keystore-isolation.md
@@ -0,0 +1,137 @@
# Per-agent remote-signer keystore isolation

Status: **design / RFC** — no controller code yet. Companion to the RBAC scoping
in `security/controller-secrets-rbac-scope` (PR #570), which is the immediate
hardening; this doc tracks the deeper isolation follow-up that #570 explicitly
defers.

## Problem

Every agent's signer keystore is a Secret named `remote-signer-keystore`, one
per agent namespace (`agent-<petname>`, `hermes-obol-agent`, …). After #570 the
`serviceoffer-controller` ClusterRole reads it with:

```yaml
resourceNames: ["litellm-secrets", "hermes-api-server", "remote-signer-keystore"]
verbs: ["get", "delete"]
```

`resourceNames` on a **ClusterRole** that is bound cluster-wide (via a
ClusterRoleBinding) matches that name in *any* namespace. Because all agents
share the name `remote-signer-keystore`, the controller's ServiceAccount can
`get` (and `delete`) **every agent's keystore Secret**, which contains the V3
keystore JSON **and** the decryption password
(`internal/serviceoffercontroller/agent_wallet.go::buildSignerKeystoreSecret`).
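
For illustration, the grant described above might look roughly like the following pair of objects. This is a sketch, not the manifest from #570: the ServiceAccount name and namespace are placeholders, and the use of a ClusterRoleBinding is an assumption inferred from the cluster-wide reach described here.

```yaml
# Sketch only. Subject name/namespace are placeholders, and the binding
# kind is an assumption about how the controller's role is wired up.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: serviceoffer-controller
rules:
  - apiGroups: [""]                 # core API group, where Secrets live
    resources: ["secrets"]
    resourceNames: ["litellm-secrets", "hermes-api-server", "remote-signer-keystore"]
    verbs: ["get", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding            # cluster-wide: the rule applies in every namespace
metadata:
  name: serviceoffer-controller
subjects:
  - kind: ServiceAccount
    name: serviceoffer-controller   # placeholder
    namespace: controller-system    # placeholder
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: serviceoffer-controller
```

Nothing in the rule names a namespace, so a Secret called `remote-signer-keystore` matches wherever it lives. Binding the same ClusterRole with a namespaced RoleBinding instead would confine the grant to that one namespace.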

There are two distinct sub-risks, verified in code:

1. **Standing cross-agent read.** A compromised controller pod can `GET` every
agent's keystore + password → derive every agent's signer key → drain every
agent wallet. This is the blast radius.
2. **In-process custody at mint.** `openclaw.GenerateKeystoreInMemory()`
(`internal/openclaw/wallet.go:136`) runs *inside the controller process*. The
controller generates and holds the private key + password at provisioning
time, regardless of RBAC.

Note: in normal operation the controller reads only the **address annotation**
(`obol.org/wallet-address`) on the reuse path and never touches the key data
after mint (`ensureSignerKeystore`). But a `get` on a Secret returns the whole
object, including every entry under `data`, so the standing capability exposes
the key material even though the code does not use it.
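
To make the note concrete, here is an illustrative shape for the object a single `get` returns. It assumes the annotation lives on the Secret itself, as the note implies. The `data` key names, the namespace, and all values are placeholders, not the real layout produced by `buildSignerKeystoreSecret`.

```yaml
# Illustrative shape only: data key names and every value are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: remote-signer-keystore
  namespace: agent-example          # placeholder agent namespace
  annotations:
    # The only field the reuse path needs.
    obol.org/wallet-address: "0x0000000000000000000000000000000000000000"
type: Opaque
data:
  keystore.json: UkVEQUNURUQ=       # placeholder; the V3 keystore JSON
  password: UkVEQUNURUQ=            # placeholder; the decryption password
```

The annotation and the key material arrive in the same response, so permission to read the former is permission to read the latter.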

## Threat model

- **In scope:** a compromised/abused `serviceoffer-controller` (supply-chain on
its image, RCE in a reconcile path, a malicious ClusterRole edit) reading or
deleting *other* agents' signer keys.
- **Out of scope:** an attacker who already controls a specific agent's own
pod/namespace (they already hold that agent's key by design).

The controller is trusted infra, so this is defense-in-depth / blast-radius
reduction, not a remotely-exploitable hole. It matters because the asset is
spendable signer keys for N tenants behind one shared identity.

## Options

### Option 0 — Accept + document (baseline)
Keep #570 as the final state; document that the controller is in the keystore
TCB. Zero additional work. Rejected as the *end* state because the asset
(multi-tenant signer keys) warrants real isolation, but it is the honest
fallback if Option B proves too costly for the current milestone.

### Option A — Per-namespace Role + RoleBinding minted by the controller
Remove keystore verbs from the ClusterRole; at provisioning, the controller
creates in `agent-<x>` a Role (get/delete on the keystore in that ns) + a
RoleBinding to the controller's SA.

**Rejected — this is isolation theater.** The controller manages *all* agents,
so it would mint a RoleBinding for itself in *every* agent namespace, retaining
full reach. It also needs cluster-wide `create` to bootstrap the keystore
before the Role exists (chicken-and-egg). Net: more RBAC surface, same reach.
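
A rough sketch of the objects Option A would stamp into each agent namespace shows why the reach does not change. All names below are hypothetical.

```yaml
# Hypothetical Option A objects for ONE agent namespace; the controller
# would create an identical pair in every other agent namespace too.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: controller-keystore-access  # hypothetical
  namespace: agent-example          # placeholder agent namespace
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["remote-signer-keystore"]
    verbs: ["get", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: controller-keystore-access  # hypothetical
  namespace: agent-example
subjects:
  - kind: ServiceAccount
    name: serviceoffer-controller   # placeholder: the same subject in every namespace
    namespace: controller-system    # placeholder
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: controller-keystore-access
```

The subject never varies, so the union of these bindings grants the controller the same access as the ClusterRole rule it replaced.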

### Option B — Agent self-mints the keystore (controller out of custody) — RECOMMENDED
The keypair is generated **inside the agent's own namespace/pod**, never in the
controller. The controller never gains `get`/`create`/`delete` on
`remote-signer-keystore`.

Moving parts:
1. **In-pod keystore generation.** Either (a) the `remote-signer` image
self-generates a keystore on first boot when none is mounted, or (b) an init
container runs a tiny keystore-gen tool (reuse the logic in
`openclaw.GenerateKeystoreInMemory`, shipped as a minimal binary) and writes
the Secret. **OPEN QUESTION** — see below.
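2. **Namespaced write RBAC for the agent SA.** The controller creates, once per
agent namespace, a Role granting the *agent's* ServiceAccount
`create`/`get` on `remote-signer-keystore` in its own namespace, plus a
RoleBinding. Kubernetes RBAC cannot restrict `create` by `resourceNames`, so
the `create` rule covers any Secret name in that namespace; only `get` can be
pinned to `remote-signer-keystore`. Either way the agent SA can only ever
touch its own namespace, which is true isolation.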
3. **Address reporting via a non-secret channel.** Today the controller learns
the address from `mat.Address` at mint. With self-mint it must learn it
without reading the Secret. Options: the agent patches
`Agent.status.walletAddress` (its SA gets `patch` on `agents/status` in its
ns), or writes a non-secret ConfigMap the controller reads. Either avoids a
keystore `GET`.
4. **Controller RBAC shrinks** to: `litellm-secrets` get (fixed ns),
`hermes-api-server` get/create/delete, and **no** `remote-signer-keystore`
access at all.
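
Moving parts 2 and 3 could be expressed as a single namespaced Role, sketched below under stated assumptions: the Role name, the agent ServiceAccount name, and the namespace are placeholders, and `obol.org` as the API group of the `Agent` resource is a guess based on the annotation prefix, not something this doc confirms.

```yaml
# Sketch only. Names and the Agent API group are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-keystore-self-mint    # hypothetical
  namespace: agent-example          # placeholder agent namespace
rules:
  # create cannot be limited by resourceNames, so this covers any Secret
  # name, but only inside this namespace.
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create"]
  # Reads are pinned to the keystore Secret.
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["remote-signer-keystore"]
    verbs: ["get"]
  # Address reporting through the status subresource (the first variant in item 3).
  - apiGroups: ["obol.org"]         # assumed group for the Agent resource
    resources: ["agents/status"]
    verbs: ["patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-keystore-self-mint    # hypothetical
  namespace: agent-example
subjects:
  - kind: ServiceAccount
    name: agent                     # placeholder: the agent's own ServiceAccount
    namespace: agent-example
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: agent-keystore-self-mint
```

The controller still needs permission to create Roles and RoleBindings in agent namespaces, but that permission never includes reading Secret contents. If the ConfigMap variant of item 3 is chosen instead, the `agents/status` rule would be swapped for a write rule on that ConfigMap.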

### Option C — Unique per-agent keystore names + per-namespace Roles
Name the keystore `<agent>-remote-signer-keystore`. On its own this does not
help a ClusterRole (`resourceNames` is a fixed list with no wildcards, so every
agent's name would have to be enumerated in advance) and still needs
per-namespace Roles, which collapses into Option A's theater problem. Only
useful as a hygiene change layered onto Option B.

## Recommendation

**Option B, phased:**

- **Phase 1 (decision-gated):** confirm the in-pod mint mechanism (B.1). If the
remote-signer image can self-generate, this is small; if not, build the
init-container mint tool.
- **Phase 2:** controller mints the namespaced Role/RoleBinding for the agent
SA + the address-reporting channel; switch `ensureAgentWallet` to *wait for*
the agent-reported address instead of minting it.
- **Phase 3:** drop `remote-signer-keystore` from the controller ClusterRole;
add a guard test asserting the controller has **no** get/create/delete on it.

If Phase 1 reveals B is disproportionately expensive for this milestone, fall
back to Option 0 and revisit — but do not ship Option A as a substitute.

## Open questions (resolve before Phase 1 code)

1. Does `ghcr.io/obolnetwork/remote-signer:v0.3.0` support generating a keystore
on first boot when the keystore dir is empty? (Check the `ObolNetwork`
remote-signer repo / chart.) If yes → no init container needed.
2. Is `Agent.status.walletAddress` the right address channel, and can the agent
SA be granted `patch` on `agents/status` scoped to its own namespace?
3. Does anything else consume the keystore Secret directly besides the
remote-signer pod? (Grep across runtimes before removing controller access.)

## Acceptance criteria

- Controller ClusterRole has **no** verbs on `remote-signer-keystore` (guarded
by an extension of `TestServiceOfferControllerSecretRBAC_Scoped`).
- The agent SA's keystore write access is a namespaced **Role**, never a
ClusterRole.
- `obol agent init` still populates `Agent.status.walletAddress`; teardown still
cleans up; release-smoke sell→buy→teardown stays green.
- Pre-production: no keystore migration needed (greenfield).