Skip to content

feat(sandbox): supervisor should periodically refresh provider credentials for running sandboxes #1264

@kon-angelo

Description

@kon-angelo

Problem Statement

When a provider's credentials are updated (e.g., a rotated API key), running sandboxes never observe the change. The supervisor calls GetSandboxProviderEnvironment exactly once at startup (crates/openshell-sandbox/src/lib.rs:272-301) and stores the result in an immutable Arc<SecretResolver>. The run_policy_poll_loop refreshes policy and settings every 10 seconds but never re-fetches provider credentials.

This means:

  • Rotating a provider's API key requires restarting every sandbox using that provider.
  • Short-lived tokens (OAuth2, SSO) cannot be consumed by long-running sandboxes without external workarounds.
  • The operator/gateway correctly propagates Secret changes to the gateway DB, but the last mile — gateway DB to running supervisor — is broken.

Issue #896 proposes a comprehensive credential refresh lifecycle as part of Provider Profiles (including oauth2 and external strategies), and #1171 addresses attach/detach semantics for new process launches. However, neither addresses the simpler immediate gap: the supervisor's existing SecretResolver should be refreshable at runtime so that credential rotation works for the current provider model without waiting for the full Provider v2 redesign.

Proposed Design

Add periodic credential refresh to the supervisor's existing polling loop. The change is minimal and incremental:

  1. Change SecretResolver storage from Arc<SecretResolver> to Arc<RwLock<SecretResolver>> (or arc_swap::ArcSwap<SecretResolver> for lock-free reads). The proxy hot-path reads credentials on every request; a RwLock with rare writes is acceptable, but ArcSwap would be zero-cost for the read path.

  2. Add a refresh_provider_environment call to run_policy_poll_loop (or a parallel loop). Every N seconds (e.g., 60s, configurable), call GetSandboxProviderEnvironment, diff the result against the current SecretResolver, and swap if changed.

  3. The proxy reads the current SecretResolver snapshot per-request. With ArcSwap, this is a single atomic load — no contention with the refresh writer.

  4. No changes to child process environment variables. Already-running processes keep their placeholder env vars (openshell:resolve:env:*). The placeholders don't change — only the real values in the SecretResolver mapping change. Since resolution happens at the proxy layer at request time, the agent seamlessly uses the new credentials on the next API call without restart.

  5. No new RPCs needed. The existing GetSandboxProviderEnvironment unary RPC is sufficient. The supervisor already calls it; it just needs to call it more than once.

Startup (existing):
  fetch_provider_environment() → SecretResolver (immutable)

Proposed:
  fetch_provider_environment() → ArcSwap<SecretResolver>
  poll loop (every 60s):
    new_env = fetch_provider_environment()
    if new_env != current_env:
      secret_resolver.swap(SecretResolver::from_provider_env(new_env))

This is fully backward-compatible:

  • The static refresh strategy from Enhanced Provider Management #896 describes exactly this behavior ("takes effect on the next proxied request or next supervisor provider-cache refresh") — this PR would implement that cache refresh.
  • No proto changes. No gateway changes. Supervisor-only.
  • Child processes are unaffected (env vars contain placeholders, not secrets).

Alternatives Considered

  • Push-based streaming RPC (e.g., WatchProviderEnvironment): More complex, requires gateway-side subscription management, and doesn't align with the existing pull-based polling model used for policy. The polling approach is simpler and consistent with existing architecture.
  • Restart sandbox on credential change: Current workaround, but breaks long-running agent sessions and is unacceptable for production use with token rotation.
  • Wait for full Provider v2 (Enhanced Provider Management #896): The roadmap is comprehensive but large. The credential refresh gap exists today with the v1 provider model and can be fixed incrementally without conflicting with v2 work.
  • External sidecar/daemon that restarts sandboxes: Adds operational complexity and still disrupts running sessions unnecessarily.

Agent Investigation

Codebase analysis of crates/openshell-sandbox/:

  • src/lib.rs:272-301: fetch_provider_environment called once at startup, result stored in HashMap<String, String>.
  • src/lib.rs:303-304: SecretResolver::from_provider_env() creates an immutable SecretResolver, wrapped in Arc<SecretResolver>.
  • src/secrets.rs:67-90: SecretResolver is #[derive(Clone)] with a plain HashMap<String, String> field. No RwLock, no mutation methods.
  • src/lib.rs:2148-2307: run_policy_poll_loop polls GetSandboxConfig for policy/settings only — never credentials.
  • proto/openshell.proto:84: Comment explicitly says "called by sandbox supervisor at startup" — this is a documentation issue, not a protocol limitation.
  • The proxy reads SecretResolver on every HTTP request via resolve() method — this is the natural injection point for refreshed credentials.
  • ArcSwap is not currently in Cargo.toml but is a single dependency add. Alternatively, tokio::sync::watch could broadcast new resolvers to proxy tasks.

Related: #896 (Provider v2 roadmap, credential refresh lifecycle), #1171 (attach/detach for running sandboxes)

Metadata

Metadata

Assignees

No one assigned

    Labels

    state:triage-neededOpened without agent diagnostics and needs triage

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions