security: per-agent remote-signer keystore isolation (controller out of custody path) — follow-up to #570 #573

@bussyjd

Summary

Every agent's signer keystore is a Secret with the same name, remote-signer-keystore, one per agent namespace, and the keypair is minted inside the serviceoffer-controller process. Because the controller's RBAC is a ClusterRole, and resourceNames on a ClusterRole matches that name in any namespace, a single compromised controller can read (and delete) every agent's keystore + decryption password, and therefore derive every agent's spendable signer key.

This is defense-in-depth / blast-radius reduction, not a remotely-exploitable hole — the controller is trusted infra. It matters because the protected asset is spendable, multi-tenant signer keys sitting behind one shared identity.

This issue is the design follow-up that #570 explicitly defers. #570 is the immediate, minimal RBAC hardening (scope the verbs, drop dead update/patch, add the missing delete). It does not — and on its own cannot — isolate agent-A's keystore from agent-B's, because they share a name.

Supersedes draft PR #571 (which carried this as plans/per-agent-keystore-isolation.md). The design now lives here as a tracking issue so it can be discussed and scheduled independently of any code branch.


Current architecture — why the reach exists

flowchart TB
    subgraph x402ns["namespace: x402"]
        SOC["serviceoffer-controller<br/>ServiceAccount"]
        GEN["GenerateKeystoreInMemory()<br/>mints privkey + password<br/><b>inside the controller process</b>"]
        SOC --- GEN
    end

    subgraph cr["ClusterRole: serviceoffer-controller (after #570)"]
        RULE["resourceNames: remote-signer-keystore<br/>verbs: get, delete"]
    end

    subgraph agentA["namespace: agent-alice"]
        SA["Secret/remote-signer-keystore<br/>keystore.json + password"]
        RSA["remote-signer pod"]
        SA --> RSA
    end
    subgraph agentB["namespace: agent-bob"]
        SB["Secret/remote-signer-keystore<br/>keystore.json + password"]
        RSB["remote-signer pod"]
        SB --> RSB
    end
    subgraph hermesns["namespace: hermes-obol-agent"]
        SH["Secret/remote-signer-keystore<br/>keystore.json + password"]
        RSH["remote-signer pod"]
        SH --> RSH
    end

    GEN -->|create| SA
    GEN -->|create| SB
    GEN -->|create| SH

    RULE -. "name matches in ANY ns" .-> SA
    RULE -. "name matches in ANY ns" .-> SB
    RULE -. "name matches in ANY ns" .-> SH

    classDef danger fill:#ffe0e0,stroke:#d73a4a,color:#000;
    class SA,SB,SH,GEN danger;

The dotted edges are the problem: one resourceNames: ["remote-signer-keystore"] rule on a ClusterRole fans out to every namespace that happens to use that name.
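
The fan-out can be illustrated with a small model of the matching logic. This is a sketch, not the real authorizer: the `Rule` and `Grant` types are simplified stand-ins invented for illustration, and it assumes the ClusterRole is bound cluster-wide (via a ClusterRoleBinding). The point it shows is that the rule carries a name but no namespace, so the same check passes in every agent namespace, while a namespaced grant does not.

```go
package main

import "fmt"

// Rule is a simplified stand-in for an RBAC policy rule on Secrets.
type Rule struct {
	ResourceNames []string // empty means "any name"
	Verbs         []string
}

// Grant pairs a rule with the scope of its binding (hypothetical type).
// Namespace == "" models a ClusterRole bound cluster-wide;
// a non-empty Namespace models a Role living in that namespace.
type Grant struct {
	Namespace string
	Rule      Rule
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// Allowed reports whether the grant permits verb on Secret `name` in `ns`.
func (g Grant) Allowed(verb, ns, name string) bool {
	// A namespaced grant only ever applies inside its own namespace.
	if g.Namespace != "" && g.Namespace != ns {
		return false
	}
	// resourceNames narrows by name only; it says nothing about namespace.
	if len(g.Rule.ResourceNames) > 0 && !contains(g.Rule.ResourceNames, name) {
		return false
	}
	return contains(g.Rule.Verbs, verb)
}

func main() {
	// The controller rule as described after #570.
	cluster := Grant{Rule: Rule{
		ResourceNames: []string{"remote-signer-keystore"},
		Verbs:         []string{"get", "delete"},
	}}
	for _, ns := range []string{"agent-alice", "agent-bob", "hermes-obol-agent"} {
		fmt.Println(ns, cluster.Allowed("get", ns, "remote-signer-keystore"))
	}

	// The same rule held as a namespaced Role stops at the namespace boundary.
	scoped := Grant{Namespace: "agent-alice", Rule: cluster.Rule}
	fmt.Println("agent-bob (scoped)", scoped.Allowed("get", "agent-bob", "remote-signer-keystore"))
}
```

In this model the cluster-wide grant returns true for all three namespaces, and the scoped grant returns false for agent-bob.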

Two distinct sub-risks (both verified in code)

  1. Standing cross-agent read. A GET on a Secret returns all keys. The controller SA can GET remote-signer-keystore in every agent namespace → keystore JSON + password → derive each agent's private key. Source: internal/serviceoffercontroller/agent_wallet.go::buildSignerKeystoreSecret writes both keystore.json and password into the one Secret.
  2. In-process custody at mint. openclaw.GenerateKeystoreInMemory() runs inside the controller process (internal/openclaw/wallet.go). The controller generates and holds the private key + password at provisioning time — regardless of RBAC. RBAC scoping alone never removes this.

Note: in steady state the reuse path reads only the address annotation (obol.org/wallet-address) — never the key data after mint. But the capability to read key material is standing, which is exactly the blast radius an attacker uses.


Blast radius (attack path)

sequenceDiagram
    autonumber
    participant ATK as Attacker
    participant SOC as Compromised controller pod
    participant API as kube-apiserver
    participant CH as Base / chain

    ATK->>SOC: RCE in a reconcile path OR malicious controller image
    Note over SOC: holds the controller ServiceAccount token
    loop for every agent namespace
        SOC->>API: GET secret/remote-signer-keystore (ns = agent-N)
        API-->>SOC: keystore.json + password
        Note over SOC: decrypt V3 keystore -> private key
    end
    SOC->>CH: sign + broadcast transfers
    CH-->>ATK: every agent wallet drained

Threat model

In scope: a compromised/abused serviceoffer-controller (supply-chain compromise of its image, RCE in a reconcile path, or a malicious ClusterRole edit) reading or deleting other agents' signer keys.

Out of scope: an attacker who already controls a specific agent's own pod/namespace. They already hold that agent's key by design.

The controller is trusted infra, so this is blast-radius reduction, not a remotely-exploitable bug. The asset (spendable signer keys for N tenants) is what makes it worth real isolation rather than "accept and document".


Options considered

flowchart LR
    Q{"Isolate per-agent<br/>signer keys?"}
    Q -->|"do nothing"| O0["Option 0<br/>Accept + document<br/>controller stays in keystore TCB"]
    Q -->|"per-ns Role for<br/>controller SA"| OA["Option A<br/><b>REJECTED</b><br/>isolation theater:<br/>controller binds itself<br/>in every agent ns"]
    Q -->|"unique names only"| OC["Option C<br/>insufficient alone<br/>collapses into A"]
    Q -->|"agent self-mints"| OB["Option B<br/><b>RECOMMENDED</b><br/>controller leaves<br/>the custody path"]

    classDef rec fill:#e0ffe0,stroke:#2da44e,color:#000;
    classDef rej fill:#ffe0e0,stroke:#d73a4a,color:#000;
    class OB rec;
    class OA rej;

| Option | Idea | Verdict |
| --- | --- | --- |
| 0: Accept + document | Keep #570 as final; document the controller as part of the keystore TCB. | Honest fallback, but not the end state; the asset warrants real isolation. |
| A: Per-namespace Role for the controller SA | Controller mints a Role+RoleBinding for itself in each agent ns; drop keystore verbs from the ClusterRole. | Rejected: isolation theater. The controller manages all agents, so it binds itself in every namespace → same reach, more RBAC surface, plus a chicken-and-egg create bootstrap. |
| C: Unique per-agent keystore names | Name it <agent>-remote-signer-keystore. | Doesn't help a ClusterRole alone (resourceNames has no wildcards). Only useful as hygiene layered onto B. |
| B: Agent self-mints | Keypair generated inside the agent's own namespace/pod; controller never gains get/create/delete on the keystore. | Recommended. Removes both sub-risks: no shared-name reach, no in-process custody. |

Recommended target — Option B

flowchart TB
    subgraph x402ns["namespace: x402"]
        SOC["serviceoffer-controller<br/><b>no keystore verbs</b>"]
    end

    subgraph agentA["namespace: agent-alice"]
        RoleA["Role + RoleBinding<br/>agent SA: create/get<br/>remote-signer-keystore<br/><b>this ns only</b>"]
        INITA["init / first-boot mint<br/>in the agent pod"]
        SAk["Secret/remote-signer-keystore"]
        STA["Agent.status.walletAddress<br/>(non-secret)"]
        INITA -->|create| SAk
        INITA -->|publish addr| STA
        RoleA -. scopes .-> SAk
    end

    SOC -->|"mint Role/RoleBinding once"| RoleA
    SOC -->|"read address (non-secret)"| STA
    SOC -. "cannot read keystore" .-x SAk

    classDef safe fill:#e0ffe0,stroke:#2da44e,color:#000;
    class SAk,STA,INITA safe;

Moving parts

  1. In-pod keystore generation: either (a) the remote-signer image self-generates a keystore on first boot when none is mounted, or (b) a tiny init container mints it (reusing the openclaw.GenerateKeystoreInMemory logic, shipped as a minimal binary). See open question 1.
  2. Namespaced write RBAC for the agent SA: once per agent namespace, the controller creates a Role granting the agent's SA create/get on remote-signer-keystore in its own namespace, plus a RoleBinding. The agent SA can never reach another namespace → true isolation.
  3. Address via a non-secret channel: the agent publishes its address (e.g. by patching Agent.status.walletAddress, with its SA scoped to agents/status in its ns) so the controller learns it without a keystore GET.
  4. Controller RBAC shrinks to: litellm-secrets get (fixed ns), hermes-api-server get/create/delete, and zero remote-signer-keystore access.
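
Moving part 2 could look roughly like the sketch below. The types are trimmed-down local stand-ins for the rbac.authorization.k8s.io/v1 shapes (kept local so the example has no dependencies), and the object name `remote-signer-keystore-writer`, the helper `agentKeystoreRBAC`, and the ServiceAccount name `remote-signer` are hypothetical. One detail worth flagging: Kubernetes RBAC cannot restrict `create` by resourceNames, because the object name is not known at authorization time, so the sketch splits `get` (name-scoped) from `create` (namespace-scoped only).

```go
package main

import "fmt"

// PolicyRule, Role and RoleBinding are simplified stand-ins for the
// rbac.authorization.k8s.io/v1 types; only the fields used here are kept.
type PolicyRule struct {
	APIGroups     []string
	Resources     []string
	ResourceNames []string
	Verbs         []string
}

type Role struct {
	Name      string
	Namespace string
	Rules     []PolicyRule
}

type RoleBinding struct {
	Name           string
	Namespace      string
	RoleName       string
	ServiceAccount string // ServiceAccount in the same namespace
}

const keystoreSecret = "remote-signer-keystore"

// agentKeystoreRBAC builds the Role and RoleBinding the controller would
// create once per agent namespace. Both objects live in agentNS, so the
// agent SA gains nothing outside its own namespace.
func agentKeystoreRBAC(agentNS, agentSA string) (Role, RoleBinding) {
	const name = "remote-signer-keystore-writer" // hypothetical object name
	role := Role{
		Name:      name,
		Namespace: agentNS,
		Rules: []PolicyRule{
			{
				// get can be pinned to the one Secret name.
				APIGroups:     []string{""},
				Resources:     []string{"secrets"},
				ResourceNames: []string{keystoreSecret},
				Verbs:         []string{"get"},
			},
			{
				// create cannot be restricted by resourceNames in RBAC,
				// so this rule is bounded by the namespace only.
				APIGroups: []string{""},
				Resources: []string{"secrets"},
				Verbs:     []string{"create"},
			},
		},
	}
	binding := RoleBinding{
		Name:           name,
		Namespace:      agentNS,
		RoleName:       name,
		ServiceAccount: agentSA,
	}
	return role, binding
}

func main() {
	role, binding := agentKeystoreRBAC("agent-alice", "remote-signer")
	fmt.Printf("%+v\n%+v\n", role, binding)
}
```

Whether an unnamed `create` on Secrets within the agent's own namespace is acceptable is a design call; the agent already owns that namespace under the stated threat model.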

Phased rollout

flowchart TB
    P1["<b>Phase 1 — decision gate</b><br/>Confirm in-pod mint mechanism<br/>remote-signer self-mint? else init-container tool"]
    P2["<b>Phase 2 — wiring</b><br/>controller mints namespaced Role/RoleBinding<br/>+ address-reporting channel<br/>ensureAgentWallet waits for agent-reported addr"]
    P3["<b>Phase 3 — drop access</b><br/>remove remote-signer-keystore from ClusterRole<br/>guard test: no get/create/delete"]
    P1 -->|"self-mint supported"| P2
    P1 -->|"not supported, build init tool"| P2
    P2 --> P3
    P3 --> DONE["Controller holds <b>no</b><br/>agent signer material"]
    classDef done fill:#e0ffe0,stroke:#2da44e,color:#000;
    class DONE done;

If Phase 1 shows B is disproportionately expensive for the current milestone, fall back to Option 0 and revisit — but do not ship Option A as a substitute.


Open questions (resolve before Phase 1 code)

  1. Does ghcr.io/obolnetwork/remote-signer:v0.3.0 generate a keystore on first boot when the keystore dir is empty? (Check the ObolNetwork remote-signer repo / chart.) If yes → no init container needed.
  2. Is Agent.status.walletAddress the right address channel, and can the agent SA be granted patch on agents/status scoped to its own namespace?
  3. Does anything besides the remote-signer pod consume the keystore Secret directly? (Grep across runtimes before removing controller access.)

Acceptance criteria

  • Controller ClusterRole has no verbs on remote-signer-keystore (guarded by extending TestServiceOfferControllerSecretRBAC_Scoped).
  • The agent SA's keystore write access is a namespaced Role, never a ClusterRole.
  • obol agent init still populates Agent.status.walletAddress; teardown still cleans up.
  • release-smoke sell → buy → teardown stays green.
  • Pre-production: greenfield, no keystore migration needed.
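
The first criterion could be guarded by a helper along these lines. It is a sketch only: the `PolicyRule` type is a local stand-in, `keystoreVerbs` is a hypothetical helper, and how TestServiceOfferControllerSecretRBAC_Scoped actually loads the ClusterRole from x402.yaml is not assumed here. The helper collects every verb a rule set grants on Secret/remote-signer-keystore, treating a rule with no resourceNames as matching all names.

```go
package main

import (
	"fmt"
	"sort"
)

// PolicyRule is a simplified stand-in for an RBAC policy rule.
type PolicyRule struct {
	Resources     []string
	ResourceNames []string
	Verbs         []string
}

const keystoreSecret = "remote-signer-keystore"

// coversSecrets reports whether the rule applies to Secrets, either
// explicitly or through the "*" resource wildcard.
func coversSecrets(resources []string) bool {
	for _, r := range resources {
		if r == "secrets" || r == "*" {
			return true
		}
	}
	return false
}

// namesKeystore reports whether the rule reaches the keystore Secret.
// An empty resourceNames list matches every name; there are no wildcards.
func namesKeystore(names []string) bool {
	if len(names) == 0 {
		return true
	}
	for _, n := range names {
		if n == keystoreSecret {
			return true
		}
	}
	return false
}

// keystoreVerbs returns the sorted set of verbs the rules grant on the
// keystore Secret. The Phase 3 guard would require this to be empty.
func keystoreVerbs(rules []PolicyRule) []string {
	seen := map[string]bool{}
	for _, r := range rules {
		if !coversSecrets(r.Resources) || !namesKeystore(r.ResourceNames) {
			continue
		}
		for _, v := range r.Verbs {
			seen[v] = true
		}
	}
	out := make([]string, 0, len(seen))
	for v := range seen {
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}

func main() {
	after570 := []PolicyRule{{
		Resources:     []string{"secrets"},
		ResourceNames: []string{keystoreSecret},
		Verbs:         []string{"get", "delete"},
	}}
	target := []PolicyRule{
		{Resources: []string{"secrets"}, ResourceNames: []string{"litellm-secrets"}, Verbs: []string{"get"}},
		{Resources: []string{"secrets"}, ResourceNames: []string{"hermes-api-server"}, Verbs: []string{"get", "delete"}},
	}
	fmt.Println(keystoreVerbs(after570)) // [delete get]
	fmt.Println(keystoreVerbs(target))   // []
}
```

Note that a rule granting `create` on secrets without resourceNames is reported too, since such a rule would allow creating a Secret of any name. Because RBAC cannot name-scope `create`, whether the real guard should treat that as a failure for the hermes-api-server create path is a decision for Phase 3.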

References

  • #570, security(controller): scope serviceoffer-controller Secret RBAC to named secrets (the immediate hardening from which this design was deferred).
  • internal/serviceoffercontroller/agent_wallet.go: buildSignerKeystoreSecret, ensureSignerKeystore (mint + custody).
  • internal/openclaw/wallet.go: GenerateKeystoreInMemory (in-process keypair).
  • internal/embed/infrastructure/base/templates/x402.yaml: serviceoffer-controller ClusterRole.
  • internal/embed/embed_crd_test.go: TestServiceOfferControllerSecretRBAC_Scoped (the guard to extend in Phase 3).
