security: per-agent remote-signer keystore isolation (controller out of custody path) — follow-up to #570

## Summary

Every agent's signer keystore is a Secret **named identically** — `remote-signer-keystore` — one per agent namespace, and the keypair is **minted inside the `serviceoffer-controller` process**. Because the controller's RBAC is a *ClusterRole* and `resourceNames` on a ClusterRole matches a name in **any** namespace, a single compromised controller can read (and delete) **every** agent's keystore + decryption password, and therefore derive every agent's spendable signer key.

This is **defense-in-depth / blast-radius reduction**, not a remotely-exploitable hole — the controller is trusted infra. It matters because the protected asset is *spendable, multi-tenant signer keys* sitting behind one shared identity.

This issue is the design follow-up that **#570 explicitly defers**. #570 is the immediate, minimal RBAC hardening (scope the verbs, drop dead `update`/`patch`, add the missing `delete`). It does **not** — and on its own *cannot* — isolate agent-A's keystore from agent-B's, because they share a name.

> Supersedes draft PR #571 (which carried this as `plans/per-agent-keystore-isolation.md`). The design now lives here as a tracking issue so it can be discussed and scheduled independently of any code branch.

---

## Current architecture — why the reach exists

```mermaid
flowchart TB
 subgraph x402ns["namespace: x402"]
 SOC["serviceoffer-controller ServiceAccount"]
 GEN["GenerateKeystoreInMemory() mints privkey + password inside the controller process"]
 SOC --- GEN
 end

 subgraph cr["ClusterRole: serviceoffer-controller (after #570)"]
 RULE["resourceNames: remote-signer-keystore<br/>verbs: get, delete"]
 end

 subgraph agentA["namespace: agent-alice"]
 SA["Secret/remote-signer-keystore keystore.json + password"]
 RSA["remote-signer pod"]
 SA --> RSA
 end
 subgraph agentB["namespace: agent-bob"]
 SB["Secret/remote-signer-keystore keystore.json + password"]
 RSB["remote-signer pod"]
 SB --> RSB
 end
 subgraph hermesns["namespace: hermes-obol-agent"]
 SH["Secret/remote-signer-keystore keystore.json + password"]
 RSH["remote-signer pod"]
 SH --> RSH
 end

 GEN -->|create| SA
 GEN -->|create| SB
 GEN -->|create| SH

 RULE -. "name matches in ANY ns" .-> SA
 RULE -. "name matches in ANY ns" .-> SB
 RULE -. "name matches in ANY ns" .-> SH

 classDef danger fill:#ffe0e0,stroke:#d73a4a,color:#000;
 class SA,SB,SH,GEN danger;
```

The dotted edges are the problem: one `resourceNames: ["remote-signer-keystore"]` rule on a ClusterRole fans out to **every** namespace that happens to use that name.
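
The fan-out can be illustrated with a toy authorization check. This is a minimal sketch, not the Kubernetes authorizer: `Rule`, `Grant`, `Allowed`, and `Reach` are hypothetical names, and the model assumes the ClusterRole is bound cluster-wide (modelled here as an empty `Namespace`), which is what lets one name match in every namespace.

```go
package main

import "fmt"

// Rule is a simplified stand-in for one RBAC rule on Secrets.
// Hypothetical type for illustration; not the Kubernetes API type.
type Rule struct {
	ResourceNames []string
	Verbs         []string
}

// Grant pairs a rule with the namespace it is bound in.
// An empty Namespace models a cluster-wide binding (assumption).
type Grant struct {
	Namespace string
	Rule      Rule
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// Allowed reports whether the grant permits verb on the named Secret in ns.
func (g Grant) Allowed(verb, ns, name string) bool {
	if g.Namespace != "" && g.Namespace != ns {
		return false // a namespaced grant never reaches another namespace
	}
	return contains(g.Rule.Verbs, verb) && contains(g.Rule.ResourceNames, name)
}

// Reach lists the namespaces in which the grant allows GET on the named Secret.
func Reach(g Grant, namespaces []string, name string) []string {
	var out []string
	for _, ns := range namespaces {
		if g.Allowed("get", ns, name) {
			out = append(out, ns)
		}
	}
	return out
}

func main() {
	const name = "remote-signer-keystore"
	agents := []string{"agent-alice", "agent-bob", "hermes-obol-agent"}
	rule := Rule{ResourceNames: []string{name}, Verbs: []string{"get", "delete"}}

	clusterWide := Grant{Rule: rule}
	namespaced := Grant{Namespace: "agent-alice", Rule: rule}

	fmt.Println("cluster-wide:", Reach(clusterWide, agents, name))
	fmt.Println("namespaced:  ", Reach(namespaced, agents, name))
}
```

The same rule reaches all three namespaces when granted cluster-wide and only one when the grant is namespaced, which is the property Option B relies on.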

### Two distinct sub-risks (both verified in code)

1. **Standing cross-agent read.** A `GET` on a Secret returns *all* of its keys; RBAC cannot scope access to individual keys within a Secret. The controller SA can `GET remote-signer-keystore` in every agent namespace → keystore JSON **+ password** → derive each agent's private key. Source: `internal/serviceoffercontroller/agent_wallet.go::buildSignerKeystoreSecret` writes both `keystore.json` and `password` into the one Secret.
2. **In-process custody at mint.** `openclaw.GenerateKeystoreInMemory()` runs *inside the controller process* (`internal/openclaw/wallet.go`). The controller generates and holds the private key + password at provisioning time — **regardless of RBAC**. RBAC scoping alone never removes this.

> Note: in steady state the reuse path reads only the **address annotation** (`obol.org/wallet-address`) — never the key data after mint. But the *capability* to read key material is standing, which is exactly the blast radius an attacker uses.

---

## Blast radius (attack path)

```mermaid
sequenceDiagram
 autonumber
 participant ATK as Attacker
 participant SOC as Compromised controller pod
 participant API as kube-apiserver
 participant CH as Base / chain

 ATK->>SOC: RCE in a reconcile path OR malicious controller image
 Note over SOC: holds the controller ServiceAccount token
 loop for every agent namespace
 SOC->>API: GET secret/remote-signer-keystore (ns = agent-N)
 API-->>SOC: keystore.json + password
 Note over SOC: decrypt V3 keystore -> private key
 end
 SOC->>CH: sign + broadcast transfers
 CH-->>ATK: every agent wallet drained
```

---

## Threat model

| | |
|---|---|
| **In scope** | A compromised/abused `serviceoffer-controller` — supply-chain on its image, RCE in a reconcile path, or a malicious ClusterRole edit — reading or deleting **other** agents' signer keys. |
| **Out of scope** | An attacker who already controls a specific agent's own pod/namespace. They already hold that agent's key by design. |

As the Summary notes, the controller is trusted infra, so this is blast-radius reduction. What justifies real isolation rather than "accept and document" is the asset: spendable signer keys for N tenants.

---

## Options considered

```mermaid
flowchart LR
 Q{"Isolate per-agent signer keys?"}
 Q -->|"do nothing"| O0["Option 0<br/>Accept + document<br/>controller stays in keystore TCB"]
 Q -->|"per-ns Role for controller SA"| OA["Option A<br/>REJECTED<br/>isolation theater: controller binds itself in every agent ns"]
 Q -->|"unique names only"| OC["Option C<br/>insufficient alone<br/>collapses into A"]
 Q -->|"agent self-mints"| OB["Option B<br/>RECOMMENDED<br/>controller leaves the custody path"]

 classDef rec fill:#e0ffe0,stroke:#2da44e,color:#000;
 classDef rej fill:#ffe0e0,stroke:#d73a4a,color:#000;
 class OB rec;
 class OA rej;
```

| Option | Idea | Verdict |
|---|---|---|
| **0 — Accept + document** | Keep #570 as final; document the controller as part of the keystore TCB. | Honest fallback, but not the *end* state — the asset warrants real isolation. |
| **A — Per-namespace Role for the controller SA** | Controller mints a Role+RoleBinding for *itself* in each agent ns; drop keystore verbs from the ClusterRole. | **Rejected — isolation theater.** The controller manages all agents, so it binds itself in every namespace → same reach, more RBAC surface, plus a chicken-and-egg `create` bootstrap. |
| **C — Unique per-agent keystore names** | Name it `<agent>-remote-signer-keystore`. | Insufficient alone: `resourceNames` has no wildcards, so a ClusterRole cannot express "this agent's keystore only", and scoping by name per namespace collapses into A. Only useful as hygiene layered onto B. |
| **B — Agent self-mints** | Keypair generated **inside the agent's own namespace/pod**; controller never gains get/create/delete on the keystore. | **Recommended.** Removes both sub-risks: no shared-name reach, no in-process custody. |

---

## Recommended target — Option B

```mermaid
flowchart TB
 subgraph x402ns["namespace: x402"]
 SOC["serviceoffer-controller no keystore verbs"]
 end

 subgraph agentA["namespace: agent-alice"]
 RoleA["Role + RoleBinding<br/>agent SA: create/get remote-signer-keystore<br/>this ns only"]
 INITA["init / first-boot mint in the agent pod"]
 SAk["Secret/remote-signer-keystore"]
 STA["Agent.status.walletAddress (non-secret)"]
 INITA -->|create| SAk
 INITA -->|publish addr| STA
 RoleA -. scopes .-> SAk
 end

 SOC -->|"mint Role/RoleBinding once"| RoleA
 SOC -->|"read address (non-secret)"| STA
 SOC -. "cannot read keystore" .-x SAk

 classDef safe fill:#e0ffe0,stroke:#2da44e,color:#000;
 class SAk,STA,INITA safe;
```

**Moving parts**

1. **In-pod keystore generation** — either (a) the `remote-signer` image self-generates a keystore on first boot when none is mounted, or (b) a tiny init container mints it (reuse `openclaw.GenerateKeystoreInMemory` logic, shipped as a minimal binary). **← open question #1.**
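2. **Namespaced write RBAC for the agent SA** — controller creates, once per agent namespace, a **Role** granting the *agent's* SA `create`/`get` on `remote-signer-keystore` in its **own** namespace + a RoleBinding. (Kubernetes cannot restrict `create` by `resourceNames`, so the `create` rule would apply to Secrets in that namespace generally; only `get` can be pinned to the name.) The agent SA can never reach another namespace → true isolation.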
3. **Address via a non-secret channel** — the agent publishes its address (e.g. patches `Agent.status.walletAddress`, SA scoped to `agents/status` in its ns) so the controller learns it **without** a keystore `GET`.
4. **Controller RBAC shrinks to**: `litellm-secrets` get (fixed ns), `hermes-api-server` get/create/delete, and **zero** `remote-signer-keystore` access.
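
Moving parts 1 to 3 can be sketched as a first-boot routine running inside the agent pod. This is a minimal sketch under loud assumptions: `SecretStore`, `memStore`, `EnsureKeystore`, and `mintPlaceholder` are hypothetical names, the store is in-memory rather than the Kubernetes API, and `mintPlaceholder` only produces random bytes shaped like an address. A real implementation would reuse the `openclaw.GenerateKeystoreInMemory` logic, whose signature is not shown in this issue.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// ErrExists mirrors an "already exists" response from a create call.
var ErrExists = errors.New("already exists")

// Keystore is a placeholder for the encrypted keystore plus its address.
type Keystore struct {
	Address string
	Blob    []byte
}

// SecretStore is a hypothetical view of the keystore Secret as seen by the
// agent SA: create/get in its own namespace only.
type SecretStore interface {
	Get() (*Keystore, bool)
	Create(ks *Keystore) error
}

// memStore is an in-memory SecretStore used only for this example.
type memStore struct{ ks *Keystore }

func (m *memStore) Get() (*Keystore, bool) { return m.ks, m.ks != nil }

func (m *memStore) Create(ks *Keystore) error {
	if m.ks != nil {
		return ErrExists
	}
	m.ks = ks
	return nil
}

// mintPlaceholder stands in for real key generation. It does NOT create a
// usable signer key; it returns 20 random bytes formatted like an address.
func mintPlaceholder() (*Keystore, error) {
	b := make([]byte, 20)
	if _, err := rand.Read(b); err != nil {
		return nil, err
	}
	return &Keystore{Address: "0x" + hex.EncodeToString(b), Blob: []byte("encrypted-keystore-placeholder")}, nil
}

// EnsureKeystore mints only when no keystore exists, then publishes the
// address through a non-secret channel (publish is supplied by the caller).
func EnsureKeystore(s SecretStore, publish func(addr string) error) (string, error) {
	ks, ok := s.Get()
	if !ok {
		fresh, err := mintPlaceholder()
		if err != nil {
			return "", err
		}
		if err := s.Create(fresh); err == nil {
			ks = fresh
		} else if errors.Is(err, ErrExists) {
			// Another boot created it first; reuse the stored keystore.
			if ks, ok = s.Get(); !ok {
				return "", errors.New("keystore reported as existing but not readable")
			}
		} else {
			return "", err
		}
	}
	if err := publish(ks.Address); err != nil {
		return "", fmt.Errorf("publish address: %w", err)
	}
	return ks.Address, nil
}

func main() {
	store := &memStore{}
	var status string // stands in for Agent.status.walletAddress
	publish := func(addr string) error { status = addr; return nil }

	first, _ := EnsureKeystore(store, publish)
	second, _ := EnsureKeystore(store, publish) // restart: must not re-mint
	fmt.Println("stable across restarts:", first == second, "published:", status == first)
}
```

The point of the shape is that the controller never appears in it: key material is created and stored by the agent's own identity, and only the address leaves the namespace.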

### Phased rollout

```mermaid
flowchart TB
 P1["Phase 1: decision gate<br/>Confirm in-pod mint mechanism<br/>remote-signer self-mint? else init-container tool"]
 P2["Phase 2: wiring<br/>controller mints namespaced Role/RoleBinding<br/>+ address-reporting channel<br/>ensureAgentWallet waits for agent-reported addr"]
 P3["Phase 3: drop access<br/>remove remote-signer-keystore from ClusterRole<br/>guard test: no get/create/delete"]
 P1 -->|"self-mint supported"| P2
 P1 -->|"not supported, build init tool"| P2
 P2 --> P3
 P3 --> DONE["Controller holds no agent signer material"]
 classDef done fill:#e0ffe0,stroke:#2da44e,color:#000;
 class DONE done;
```

If Phase 1 shows B is disproportionately expensive for the current milestone, fall back to **Option 0** and revisit — but do **not** ship Option A as a substitute.
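
The Phase 2 step "`ensureAgentWallet` waits for agent-reported addr" could look roughly like the bounded poll below. This is a sketch only: `AddressReader`, `WaitForAddress`, `WaitWithTimeout`, and `reportAfter` are hypothetical names, and the real `ensureAgentWallet` signature is not shown in this issue.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// AddressReader is a hypothetical accessor for the agent-reported address.
// It returns "" while the agent has not published one yet.
type AddressReader func(ctx context.Context) (string, error)

// WaitForAddress polls read until an address appears or ctx is done.
// interval must be positive.
func WaitForAddress(ctx context.Context, read AddressReader, interval time.Duration) (string, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		addr, err := read(ctx)
		if err != nil {
			return "", err
		}
		if addr != "" {
			return addr, nil
		}
		select {
		case <-ctx.Done():
			return "", fmt.Errorf("wallet address not reported: %w", ctx.Err())
		case <-ticker.C:
			// poll again
		}
	}
}

// WaitWithTimeout bounds the wait with a deadline.
func WaitWithTimeout(read AddressReader, timeout, interval time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return WaitForAddress(ctx, read, interval)
}

// reportAfter returns a fake reader that reports addr on the n-th call.
func reportAfter(n int, addr string) AddressReader {
	calls := 0
	return func(context.Context) (string, error) {
		calls++
		if calls < n {
			return "", nil
		}
		return addr, nil
	}
}

func main() {
	// "0xabc" is a placeholder, not a real address.
	addr, err := WaitWithTimeout(reportAfter(3, "0xabc"), time.Second, 10*time.Millisecond)
	fmt.Println(addr, err)
}
```

Inside a reconcile loop, requeueing the request is usually preferable to blocking a worker like this; the sketch only shows the contract, namely that the controller learns the address without ever issuing a keystore `GET`.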

---

## Open questions (resolve before Phase 1 code)

1. Does `ghcr.io/obolnetwork/remote-signer:v0.3.0` generate a keystore on first boot when the keystore dir is empty? (Check the `ObolNetwork` remote-signer repo / chart.) If yes → no init container needed.
2. Is `Agent.status.walletAddress` the right address channel, and can the agent SA be granted `patch` on `agents/status` scoped to its own namespace?
3. Does anything **besides** the remote-signer pod consume the keystore Secret directly? (Grep across runtimes before removing controller access.)

---

## Acceptance criteria

- [ ] Controller ClusterRole has **no** verbs on `remote-signer-keystore` (guarded by extending `TestServiceOfferControllerSecretRBAC_Scoped`).
- [ ] The agent SA's keystore write access is a namespaced **Role**, never a ClusterRole.
- [ ] `obol agent init` still populates `Agent.status.walletAddress`; teardown still cleans up.
- [ ] release-smoke `sell → buy → teardown` stays green.
- [ ] Confirmed pre-production (greenfield), so no keystore migration is needed.
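
The first criterion can be phrased as a mechanical check over the controller's rules. The sketch below assumes a simplified `PolicyRule` stand-in (not imported from the Kubernetes API packages) and a hypothetical helper `KeystoreViolations`; how `TestServiceOfferControllerSecretRBAC_Scoped` is actually structured is not shown here, and API groups are ignored for brevity.

```go
package main

import "fmt"

// PolicyRule is a simplified stand-in for an RBAC policy rule.
type PolicyRule struct {
	Resources     []string
	ResourceNames []string
	Verbs         []string
}

const keystoreName = "remote-signer-keystore"

func has(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// KeystoreViolations describes every rule that would let the holder act on
// the keystore Secret. Any verb counts, matching the "no verbs" criterion.
func KeystoreViolations(rules []PolicyRule) []string {
	var out []string
	for i, r := range rules {
		if !has(r.Resources, "secrets") && !has(r.Resources, "*") {
			continue
		}
		// An empty ResourceNames list matches every name, so it also
		// covers the keystore.
		if len(r.ResourceNames) == 0 || has(r.ResourceNames, keystoreName) {
			out = append(out, fmt.Sprintf("rule %d grants %v on %s", i, r.Verbs, keystoreName))
		}
	}
	return out
}

func main() {
	target := []PolicyRule{
		{Resources: []string{"secrets"}, ResourceNames: []string{"litellm-secrets"}, Verbs: []string{"get"}},
		{Resources: []string{"secrets"}, ResourceNames: []string{"hermes-api-server"}, Verbs: []string{"get", "delete"}},
	}
	fmt.Println("target state violations:", len(KeystoreViolations(target)))

	afterIssue570 := append(target, PolicyRule{
		Resources: []string{"secrets"}, ResourceNames: []string{keystoreName}, Verbs: []string{"get", "delete"},
	})
	for _, v := range KeystoreViolations(afterIssue570) {
		fmt.Println(v)
	}
}
```

Treating an unnamed Secret rule as a violation is deliberate: a rule without `resourceNames` would silently restore the reach this issue removes.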

---

## References

- **#570** — `security(controller): scope serviceoffer-controller Secret RBAC to named secrets` (the immediate hardening, which defers this design work).
- `internal/serviceoffercontroller/agent_wallet.go` — `buildSignerKeystoreSecret`, `ensureSignerKeystore` (mint + custody).
- `internal/openclaw/wallet.go` — `GenerateKeystoreInMemory` (in-process keypair).
- `internal/embed/infrastructure/base/templates/x402.yaml` — `serviceoffer-controller` ClusterRole.
- `internal/embed/embed_crd_test.go` — `TestServiceOfferControllerSecretRBAC_Scoped` (the guard to extend in Phase 3).

