fix(x402-buyer): persist consumed-nonce state to PVC instead of emptyDir #522

Closed
bussyjd wants to merge 1 commit into main from fix/x402-buyer-state-pvc
@bussyjd bussyjd commented May 23, 2026

Why

Today /state on the x402-buyer sidecar is emptyDir: {}. A pod restart loses consumed.json → the buyer treats every pre-signed auth as fresh → it tries to spend already-consumed nonces → facilitator 400 cascade → buyer pool exhausted → 503 until someone manually runs buy.py process --all.

Before

   litellm pod restart (rollout, OOM, anything)
        │
        ▼
   emptyDir /state wiped                    auth-pool ConfigMap
        │                                   (controller-managed)
        │                                          │
        │                                          ▼
        └──────────────────► buyer reads pool, sees N unused
                                 │
                                 ▼
                        spends auth #1
                                 │
                                 ▼
                  facilitator: "nonce already used" 400
                                 │
                                 ▼
                        buyer 402 → caller retry → buyer spends auth #2
                                 │
                                 ▼
                  facilitator 400 → ... → pool exhausted → 503
                                 │
                                 ▼
                   manual `buy.py process --all` to reseed

After

   litellm pod restart
        │
        ▼
   PVC remounted (local-path, RWO, 50Mi)        auth-pool ConfigMap
        │                                              │
        ▼                                              ▼
   consumed.json INTACT                       buyer reads pool
        │                                              │
        └──────────────────► buyer skips consumed entries
                                 │
                                 ▼
                        spends next unused auth
                                 │
                                 ▼
                  facilitator OK → settlement proceeds

What changed

  • llm.yaml: new PVC x402-buyer-state (50Mi, local-path, RWO); volume entry on litellm Deployment swaps from emptyDir: {} to persistentVolumeClaim
  • llm.yaml Deployment strategy → Recreate (RWO PVC requires no surge)
  • internal/embed/embed_buyer_state_test.go — regression test pinning the PVC + Recreate invariants
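
For readers who want a feel for what pinning these invariants might look like, here is a deliberately simplified sketch. It is not the contents of embed_buyer_state_test.go: the `buyerStateViolations` helper is hypothetical, and it uses plain substring checks on the manifest text, whereas a real test would more likely parse the YAML and inspect the litellm Deployment specifically.

```go
package main

import "strings"

// buyerStateViolations reports which of the two invariants from this PR
// are missing from a manifest, using naive substring matching.
// Limitation: it cannot tell which Deployment or volume a match belongs to.
func buyerStateViolations(manifest string) []string {
	var violations []string
	if !strings.Contains(manifest, "claimName: x402-buyer-state") {
		violations = append(violations, "no volume references PVC x402-buyer-state")
	}
	if !strings.Contains(manifest, "type: Recreate") {
		violations = append(violations, "Deployment strategy is not Recreate")
	}
	return violations
}
```

An empty result means both invariants hold. Reverting the volume to emptyDir or the strategy to RollingUpdate would each produce one violation.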

What this does NOT solve

  • Multi-replica litellm (RWO can't be shared; would need StatefulSet OR RWX storage class)
  • Hard node loss (local-path is node-local; on k3d single-node, full blast radius anyway)

PSS compatibility

PR #12 (Restricted PSS sweep) will need to verify that the buyer can read and write the PVC mount under runAsUser: 65532. local-path creates the volume directory with mode 0777 by default, which is permissive enough for a non-root UID to write. Cross-PR coordination noted.

Test plan

  • go build ./... clean
  • go test ./internal/embed/... ./internal/x402/buyer/... green
  • Manual on next stack up: deploy → buy 5 auths → spend 2 → kubectl rollout restart deploy/litellm -n llm → spend 3rd auth → facilitator settles cleanly (no 400)

Today the x402-buyer sidecar's /state directory is an emptyDir. When
the litellm pod restarts (rollout, OOM, node drain), consumed.json is
gone. The pre-signed auth pool reloads from the ConfigMap the
controller manages, and the buyer treats every auth as unconsumed —
attempting to spend nonces that the facilitator already marked used.

Cascade: facilitator returns 400 "nonce already used" -> buyer 402
back to LiteLLM -> caller retry -> same 400 -> eventually buyer pool
exhausted -> 503 until manual `buy.py process --all` reseeds.

Fix: convert /state to a PVC backed by local-path-provisioner (the
storage class already deployed via base/templates/local-path.yaml).
50Mi request; consumed.json is tiny but room left for log growth.

Deployment strategy switched to Recreate because a RWO PVC can't
be co-mounted during a RollingUpdate surge. Litellm is replicas: 1
so this just means rollouts have a ~5s gap instead of an overlap —
acceptable.

What this does NOT solve:
  - Multi-replica litellm. RWO PVC works only for replicas: 1; would
    need RWX (which local-path doesn't support — needs NFS/Longhorn)
    or per-replica state via StatefulSet. Out of scope; litellm has
    no current scaling need.
  - Hard node loss. local-path PVCs are node-local; if the k3d node
    is destroyed, state is gone (along with the rest of the cluster).
    For a local-only operator that's the expected blast radius.

PSS compatibility note: the PVC mount works under PSS Restricted as
long as the buyer container runs with appropriate fsGroup. PR #12
(Restricted PSS sweep) handles that separately and will verify mount
permissions when it lands.

bussyjd commented May 24, 2026

Superseded by bundle PR #536 — closing in favor of the consolidated merge target. Original branch and history preserved.

@bussyjd bussyjd closed this May 24, 2026