architecture: seller offers as first-class declarative state #488

@bussyjd

Context

`obol stack up` today bolts three different lifecycles together. Infrastructure and the Hermes agent come up declaratively via helmfile, but seller offers (`obol sell http`, `obol sell inference`) are imperative: the CLI creates them with direct `kubectl apply`/CR writes, and a bespoke `resumeSellOffers` function re-hydrates them on stack-up (see PR #487).

This issue captures the longer-term direction: bring seller offers under the same declarative helmfile pass that already manages infrastructure and Hermes, so that `obol stack up` is one mechanism applied uniformly.

Today (overcomplicated)

                            obol stack up
                                  │
              ┌───────────────────┼───────────────────┐
              │                   │                   │
              ▼                   ▼                   ▼
      Infrastructure         Agent (Hermes)     Seller offers
       (Traefik, eRPC,       via hermes.Sync()  (sell inference,
        LiteLLM,                                  sell http)
        x402-verifier,                                 │
        serviceoffer-                                  ▼
        controller,                            NOT MANAGED BY
        cloudflared, ...)                      STACK UP
              │                   │                    │
              │                   │           ┌────────┴────────┐
              ▼                   ▼           ▼                 ▼
          helmfile            helmfile  sell inference     sell http
          (declarative)       (declarative)  ↓                 ↓
                                       descriptor on       YAML manifest
                                       disk (JSON)         on disk (PR #487)
                                              │                 │
                                              ▼                 ▼
                                       host gateway        in-cluster only
                                       foreground process
                                       (#487 spawns detached;
                                        TODO: helm chart)

          ────  resume gap filled with bespoke code  ────
                  (cmd/obol/sell.go::resumeSellOffers)

Three problems with this shape

  1. Two persistence formats. `inference.Store` (rich JSON descriptor) and the sell-http YAML manifest store added in PR #487. Two walkers, two parsers, two test surfaces, two failure modes.
  2. Foreground host process asymmetry. Only `sell inference` runs a host-side gateway. That's why resume has to fork-and-detach for inference but not for http, and why `startDetachedInferenceGateway` + PID files exist at all.
  3. Resume function exists only because seller offers aren't first-class infra. Every recovery scenario (`stack down`/`stack up`, `stack purge`, host reboot, agent reset) has to re-implement the recovery path. The controller and Hermes don't need any of that — they come back the same way infrastructure comes back.

Proposed end state

                            obol stack up
                                  │
                  ┌───────────────┼───────────────┐
                  │               │               │
                  ▼               ▼               ▼
            Infrastructure   Agent          Seller offers
                  │           │              │
                  └───────────┴──────────────┘
                              │
                              ▼
                       SAME MECHANISM:
                       a helmfile pass over
                       declarative sources of
                       truth on disk
                              │
              ┌───────────────┴────────────────┐
              ▼                                ▼
       infra/*.yaml                     applications/
       (already exists)                 ├── hermes/<id>/         (exists)
                                        ├── sell-http/<name>/    (new)
                                        └── sell-inference/<name>/ (new)
                                                │
                                                ▼
                                        helmfile.yaml +
                                        values-*.yaml per offer

In this shape:

  • `obol sell http` / `obol sell inference` become "edit the descriptor on disk + helmfile sync the slice". No imperative `kubectl apply`, no foreground process spawn.
  • `obol stack down`/`up` is "helmfile destroy/sync the whole tree" — agents and offers come back the same way the controller comes back.
  • The inference gateway becomes an in-cluster Deployment, so the host-side foreground process disappears entirely.
  • `resumeSellOffers`, `startDetachedInferenceGateway`, PID files, gateway logs — none of those need to exist. They are scaffolding around the asymmetry.

Migration path

  1. Build the inference gateway as a Pod image.
    • Replaces `startDetachedInferenceGateway` and the PID-file plumbing.
    • The host-side subprocess is the only blocker to symmetric lifecycle.
  2. Move sell-inference / sell-http to helmfile-managed slices under `applications/sell-{http,inference}//`.
    • Replaces `inference.Store` and the sell-http YAML manifest store with a single declarative format.
    • One walker, one parser, one test surface.
  3. `obol stack up` becomes a single helmfile pass over everything.
    • Replaces `resumeSellOffers` entirely.
    • Recovery becomes "whatever's on disk is what's running".

Where PR #487 fits

PR #487 is the deliberate near-term step:

  • Necessary today because seller offers are still imperative — without it, `stack up` doesn't bring back paid services and the spark2 dev cluster needs manual replay after every restart.
  • Ships now, unblocks spark2 and any other persistent dev cluster.
  • Superseded once the declarative model lands. The `startDetachedInferenceGateway` comment already points at the helm-chart future, and `resumeSellOffers` is explicitly scaffolding.

Acceptance criteria for closing this issue

  • Inference gateway runs as an in-cluster Deployment built from a Pod image, not a host subprocess.
  • Sell-http and sell-inference offers are represented on disk as helmfile-managed slices under `applications/`.
  • `obol stack up` requires no resume-specific code path for seller offers — the same helmfile pass that brings up infra and Hermes also brings up offers.
  • `resumeSellOffers`, `startDetachedInferenceGateway`, the PID file plumbing, and the two persistence stores are deleted.

Out of scope

  • Buyer-side resume. Buyer state is already cluster-resident (`PurchaseRequest` CRs + sidecar config).
  • Migration of existing on-disk descriptors written by older CLIs — a one-shot importer can live alongside the new layout if needed.
