
fix(x402): gate verifier /readyz on informer cache sync #519

Closed
bussyjd wants to merge 1 commit into feat/x402-marketplace-metrics from fix/verifier-readyz-on-informer-sync
bussyjd (Collaborator) commented May 23, 2026

Why

Closes the root cause of CLAUDE.md pitfall #14. Before this PR, the verifier's /readyz returned 200 the moment static config loaded, but routes from the ServiceOffer informer load later. During the gap, kubelet adds the pod to Service Endpoints, Traefik forwards paid-route requests, and matchPaidRoute returns "no rule -> 200" (a free pass on paid traffic).

Before

   T0  pod starts
   T1  static config loads
   T2  kubelet probe /readyz -> 200 (config is loaded!)
   T3  kubelet adds pod to Service Endpoints
   T4  Traefik routes /services/foo/* -> verifier
   T5  verifier matches /services/foo/* -> no rule (informer still warming)
   T6  verifier returns 200 -> PAID ROUTE BYPASSED
   T7  informer WaitForCacheSync completes
   T8  refresh() loads routes
   T9  verifier now gates correctly
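The bypass at T5/T6 comes down to fail-open matching against an empty route table. The sketch below is illustrative only: the `rule` type, the `verifyStatus` helper, the signature given to `matchPaidRoute`, and the 402 response for an unpaid match are all assumptions, not the verifier's actual code.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// rule is a hypothetical stand-in for a paid-route rule derived from a ServiceOffer.
type rule struct {
	prefix string
}

// matchPaidRoute returns the first rule whose prefix matches path.
// With an empty rule set it always reports "no rule".
// (Assumed signature; the real function may differ.)
func matchPaidRoute(rules []rule, path string) (rule, bool) {
	for _, r := range rules {
		if strings.HasPrefix(path, r.prefix) {
			return r, true
		}
	}
	return rule{}, false
}

// verifyStatus models the fail-open behavior: no rule means 200,
// a matching rule without payment means 402 (assumed response code).
func verifyStatus(rules []rule, path string, paid bool) int {
	if _, ok := matchPaidRoute(rules, path); !ok {
		return http.StatusOK
	}
	if !paid {
		return http.StatusPaymentRequired
	}
	return http.StatusOK
}

func main() {
	path := "/services/foo/chat"

	// Informer still warming: no rules yet, so the unpaid request passes.
	fmt.Println(verifyStatus(nil, path, false))

	// After refresh(): the same unpaid request is gated.
	fmt.Println(verifyStatus([]rule{{prefix: "/services/foo/"}}, path, false))
}
```

Failing closed on an empty table would be another option, but it would also reject genuinely free routes, which is presumably why the fix targets readiness instead.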

After

   T0  pod starts
   T1  static config loads
   T2  kubelet probe /readyz -> 503 "routes not loaded"
   T3  pod stays OUT of Service Endpoints
   T4  Traefik has no endpoint to forward to -> no paid-route bypass possible
   ...
   T7  informer WaitForCacheSync completes
   T8  refresh() loads routes + calls v.MarkRoutesLoaded()
   T9  kubelet probe /readyz -> 200
   T10 kubelet adds pod to Service Endpoints
   T11 traffic flows, fully gated

What changed

  • Verifier.routesLoaded atomic.Bool + MarkRoutesLoaded() method
  • HandleReadyz returns 503 with a cause-specific body until both config and routes are loaded
  • WatchServiceOffers gains an optional onFirstApply callback
  • main.go wires v.MarkRoutesLoaded as that callback for the kube source and calls it directly for the file source
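A minimal sketch of the readiness gate described in the list above, assuming a stripped-down `Verifier`. The `configLoaded` field, the `probe` helper, and the exact body strings other than "routes not loaded" are illustrative assumptions; the real type tracks config differently.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// Verifier is a reduced stand-in for the real type.
type Verifier struct {
	configLoaded atomic.Bool // hypothetical; stands in for "config is non-nil"
	routesLoaded atomic.Bool
}

// MarkRoutesLoaded records that the first route refresh has been applied.
func (v *Verifier) MarkRoutesLoaded() { v.routesLoaded.Store(true) }

// HandleReadyz returns 503 with a cause-specific body until both
// config and routes are loaded, then 200.
func (v *Verifier) HandleReadyz(w http.ResponseWriter, _ *http.Request) {
	switch {
	case !v.configLoaded.Load():
		http.Error(w, "config not loaded", http.StatusServiceUnavailable)
	case !v.routesLoaded.Load():
		http.Error(w, "routes not loaded", http.StatusServiceUnavailable)
	default:
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
	}
}

// probe issues one in-process GET /readyz and returns status and trimmed body.
func probe(v *Verifier) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
	v.HandleReadyz(rec, req)
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	v := &Verifier{}
	v.configLoaded.Store(true)

	code, body := probe(v)
	fmt.Println(code, body) // not ready: routes still missing

	v.MarkRoutesLoaded()
	code, body = probe(v)
	fmt.Println(code, body) // ready
}
```

An `atomic.Bool` fits here because the flag is written once by the watcher goroutine and read by the HTTP handler on every probe, with no other state to protect.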

Test plan

  • go build ./... clean
  • go test ./internal/x402/... ./cmd/x402-verifier/... green
  • Unit test asserts /readyz 503->200 transition on MarkRoutesLoaded
  • Manual check on next stack up: kubectl describe pod should show readiness probe failures briefly (~5s on warm cluster), then success

Stacks on

PR #513 + PR #515. Rebase onto main after both merge.

Pairs with

PR #515 (replicas: 1) shrinks the race window but doesn't close it. This PR closes it.

Closes the root cause of CLAUDE.md pitfall #14 ("first-request flake
on freshly-deployed verifier"). Previously /readyz returned 200 the
moment config.Load() became non-nil, but routes from the ServiceOffer
informer load later — between those two events the pod is Ready from
kubelet's view, receives Service traffic, and matchPaidRoute returns
"no rule -> 200" for paid routes. The release-smoke flows hide this
behind 12x5s retry loops; the actual fix is to not be Ready until
routes are loaded.

  - Adds routesLoaded atomic.Bool to Verifier.
  - HandleReadyz returns 503 until BOTH config and routes loaded,
    with a body that distinguishes the two cases for kubectl describe
    debuggability.
  - WatchServiceOffers takes an optional onFirstApply callback,
    invoked after the post-WaitForCacheSync refresh succeeds.
  - main.go wires v.MarkRoutesLoaded as the callback for kube source,
    or invokes it directly after NewVerifier for file source (the
    file source has no informer; routes are loaded synchronously).
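The callback ordering in the bullets above can be sketched without client-go. Everything here is a simplified assumption: `watchServiceOffers` is a lowercase stand-in with an invented signature, the `synced` channel stands in for the informer's cache-sync signal, and `runOnce` is a demo helper.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
)

// watchServiceOffers waits for the cache-sync signal, runs the first
// refresh, and invokes onFirstApply only if that refresh succeeded.
// A nil callback is allowed, matching the "optional" wording.
func watchServiceOffers(
	ctx context.Context,
	synced <-chan struct{},
	refresh func() error,
	onFirstApply func(),
) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-synced:
	}
	if err := refresh(); err != nil {
		return fmt.Errorf("initial refresh: %w", err)
	}
	if onFirstApply != nil {
		onFirstApply()
	}
	return nil
}

// runOnce runs the watcher against an already-synced cache and reports
// whether the readiness flag was set.
func runOnce(failRefresh bool) (bool, error) {
	var routesLoaded atomic.Bool

	synced := make(chan struct{})
	close(synced) // simulate a cache that has already synced

	refresh := func() error {
		if failRefresh {
			return errors.New("refresh failed")
		}
		return nil
	}

	err := watchServiceOffers(context.Background(), synced, refresh, func() {
		routesLoaded.Store(true)
	})
	return routesLoaded.Load(), err
}

func main() {
	ready, err := runOnce(false)
	fmt.Println(ready, err)

	ready, err = runOnce(true)
	fmt.Println(ready, err)
}
```

Calling the callback only after a successful refresh keeps the pod unready if the first route load fails, which is the safe direction for this race.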

Pairs with PR #515 (replicas: 1) — at single replica the rollout
window for this race shrinks from "some scrapes" to "first ~5-10s",
but it's still a bug; this PR closes it.

bussyjd (Collaborator, Author) commented May 24, 2026

Superseded by bundle PR #536 — closing in favor of the consolidated merge target. Original branch and history preserved.

bussyjd closed this May 24, 2026