fix(x402-buyer): persist consumed-nonce state to PVC instead of emptyDir by bussyjd · Pull Request #522 · ObolNetwork/obol-stack

bussyjd · 2026-05-23T19:05:41Z

Why

Today /state on the x402-buyer sidecar is emptyDir: {}. Pod restart loses consumed.json → buyer treats every pre-signed auth as fresh → tries to spend already-consumed nonces → facilitator 400 cascade → buyer pool exhausted → 503 until manual buy.py process --all.

Before

   litellm pod restart (rollout, OOM, anything)
        │
        ▼
   emptyDir /state wiped                    auth-pool ConfigMap
        │                                   (controller-managed)
        │                                          │
        │                                          ▼
        └──────────────────► buyer reads pool, sees N unused
                                 │
                                 ▼
                        spends auth #1
                                 │
                                 ▼
                  facilitator: "nonce already used" 400
                                 │
                                 ▼
                        buyer 402 → caller retry → buyer spends auth #2
                                 │
                                 ▼
                  facilitator 400 → ... → pool exhausted → 503
                                 │
                                 ▼
                   manual `buy.py process --all` to reseed

After

   litellm pod restart
        │
        ▼
   PVC remounted (local-path, RWO, 50Mi)        auth-pool ConfigMap
        │                                              │
        ▼                                              ▼
   consumed.json INTACT                       buyer reads pool
        │                                              │
        └──────────────────► buyer skips consumed entries
                                 │
                                 ▼
                        spends next unused auth
                                 │
                                 ▼
                  facilitator OK → settlement proceeds
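
The difference between the two flows comes down to whether the consumed set survives the restart. Below is a minimal Go sketch of that skip logic, not the buyer's actual code: the `Auth` type, the `parseConsumed` and `nextUnused` helpers, and the layout of consumed.json (a JSON array of nonce strings) are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Auth is a hypothetical stand-in for one pre-signed authorization in the pool.
type Auth struct {
	Nonce string `json:"nonce"`
}

// parseConsumed decodes an ASSUMED consumed.json layout: a JSON array of
// nonce strings. The real file format is not shown in this PR.
func parseConsumed(data []byte) (map[string]bool, error) {
	var nonces []string
	if err := json.Unmarshal(data, &nonces); err != nil {
		return nil, err
	}
	consumed := make(map[string]bool, len(nonces))
	for _, n := range nonces {
		consumed[n] = true
	}
	return consumed, nil
}

// nextUnused returns the first pool entry whose nonce is not in the consumed set.
// The second return value is false when the pool is exhausted.
func nextUnused(pool []Auth, consumed map[string]bool) (Auth, bool) {
	for _, a := range pool {
		if !consumed[a.Nonce] {
			return a, true
		}
	}
	return Auth{}, false
}

func main() {
	pool := []Auth{{Nonce: "0x01"}, {Nonce: "0x02"}, {Nonce: "0x03"}}

	// emptyDir case: /state was wiped, so the consumed set is empty after restart
	// and the buyer picks a nonce the facilitator has already seen.
	wiped := map[string]bool{}
	a, _ := nextUnused(pool, wiped)
	fmt.Println("after restart with emptyDir:", a.Nonce)

	// PVC case: consumed.json survived the restart, so spent nonces are skipped.
	kept, err := parseConsumed([]byte(`["0x01","0x02"]`))
	if err != nil {
		panic(err)
	}
	b, _ := nextUnused(pool, kept)
	fmt.Println("after restart with PVC:", b.Nonce)
}
```

With the empty set the sketch selects 0x01 again, which is the nonce the facilitator would reject. With the persisted set it moves on to 0x03.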

What changed

llm.yaml — new PVC x402-buyer-state (50Mi, local-path, RWO); volume entry on litellm Deployment swaps from emptyDir: {} → persistentVolumeClaim
llm.yaml Deployment strategy → Recreate (RWO PVC requires no surge)
internal/embed/embed_buyer_state_test.go — regression test pinning the PVC + Recreate invariants
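
As a rough illustration of what pinning the PVC + Recreate invariants could look like, here is a stdlib-only Go sketch. It is not the contents of embed_buyer_state_test.go: the `sampleManifest` fragment, the volume name `buyer-state`, and the `checkBuyerStateInvariants` helper are hypothetical, and a real test would more likely parse the YAML than match substrings.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleManifest is a HYPOTHETICAL fragment built from the values named in
// this PR (50Mi, local-path, RWO, Recreate). The real llm.yaml is not shown.
const sampleManifest = `
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: x402-buyer-state
spec:
  storageClassName: local-path
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Mi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    spec:
      volumes:
        - name: buyer-state   # volume name is an assumption
          persistentVolumeClaim:
            claimName: x402-buyer-state
`

// checkBuyerStateInvariants does a substring-level check and returns one
// message per missing invariant. An empty result means all were found.
func checkBuyerStateInvariants(manifest string) []string {
	required := []struct{ needle, why string }{
		{"kind: PersistentVolumeClaim", "PVC object missing"},
		{"storageClassName: local-path", "PVC must use the local-path storage class"},
		{"ReadWriteOnce", "PVC must be RWO"},
		{"storage: 50Mi", "PVC request must be 50Mi"},
		{"claimName: x402-buyer-state", "Deployment must mount the PVC, not emptyDir"},
		{"type: Recreate", "Deployment strategy must be Recreate"},
	}
	var violations []string
	for _, r := range required {
		if !strings.Contains(manifest, r.needle) {
			violations = append(violations, r.why)
		}
	}
	return violations
}

func main() {
	for _, v := range checkBuyerStateInvariants(sampleManifest) {
		fmt.Println("violation:", v)
	}
	fmt.Println("checked", len(sampleManifest), "bytes of manifest")
}
```

The point of such a check is that a later edit reverting the volume to emptyDir, or the strategy to RollingUpdate, fails in CI rather than surfacing as a 400 cascade after the next restart.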

What this does NOT solve

Multi-replica litellm (RWO can't be shared; would need StatefulSet OR RWX storage class)
Hard node loss (local-path is node-local; on k3d single-node, full blast radius anyway)

PSS compatibility

PR #12 (Restricted PSS sweep) will need to verify the buyer can read/write the PVC mount under runAsUser: 65532. local-path creates volume directories with mode 0777 by default, which is permissive. Cross-PR coordination noted.

Test plan

go build ./... clean
go test ./internal/embed/... ./internal/x402/buyer/... green
Manual on next stack up: deploy → buy 5 auths → spend 2 → kubectl rollout restart deploy/litellm -n llm → spend 3rd auth → facilitator settles cleanly (no 400)

Today the x402-buyer sidecar's /state directory is an emptyDir. When the litellm pod restarts (rollout, OOM, node drain), consumed.json is gone. The pre-signed auth pool reloads from the ConfigMap the controller manages, and the buyer treats every auth as unconsumed, attempting to spend nonces that the facilitator already marked used.

Cascade: facilitator returns 400 "nonce already used" -> buyer 402 back to LiteLLM -> caller retry -> same 400 -> eventually buyer pool exhausted -> 503 until manual `buy.py process --all` reseeds.

Fix: convert /state to a PVC backed by local-path-provisioner (the storage class already deployed via base/templates/local-path.yaml). 50Mi request; consumed.json is tiny, but room is left for log growth.

Deployment strategy switched to Recreate because a RWO PVC can't be co-mounted during a RollingUpdate surge. Litellm is replicas: 1, so this just means rollouts have a ~5s gap instead of an overlap, which is acceptable.

What this does NOT solve:

- Multi-replica litellm. RWO PVC works only for replicas: 1; would need RWX (which local-path doesn't support; needs NFS/Longhorn) or per-replica state via StatefulSet. Out of scope; litellm has no current scaling need.
- Hard node loss. local-path PVCs are node-local; if the k3d node is destroyed, state is gone (along with the rest of the cluster). For a local-only operator that's the expected blast radius.

PSS compatibility note: the PVC mount works under PSS Restricted as long as the buyer container runs with an appropriate fsGroup. PR #12 (Restricted PSS sweep) handles that separately and will verify mount permissions when it lands.

bussyjd · 2026-05-24T09:13:36Z

Superseded by bundle PR #536 — closing in favor of the consolidated merge target. Original branch and history preserved.

bussyjd mentioned this pull request May 24, 2026

feat: x402 marketplace + architecture review bundle (#513-#535) #536

Merged

6 tasks

bussyjd closed this May 24, 2026

bussyjd wants to merge 1 commit into main from fix/x402-buyer-state-pvc