fix(prometheus-rules): use increase() for the per-offer revenue rule#530
Closed
bussyjd wants to merge 1 commit into
Closed
fix(prometheus-rules): use increase() for the per-offer revenue rule#530bussyjd wants to merge 1 commit into
bussyjd wants to merge 1 commit into
Conversation
Recording rule was sum(counter), which is wrong for any metric where
the counter resets across pod restarts — Prometheus counters are
per-process by design. The TSDB is the canonical persistence layer;
rate() and increase() perform reset detection at query time across
the samples the TSDB holds.
- Renames the rule to x402:revenue:7d_by_offer (name matches what
it returns; the old "lifetime" / "total_by_offer_current" names
were aspirational against a finite retention window).
- Expression: sum by (offer_namespace, offer_name) (
increase(obol_x402_verifier_charged_requests_total[7d])
)
- 7d inside 8d retention gives 1-day headroom so reset detection
has both-side samples at the window's left edge.
Per Robust Perception's "avoiding the counter-reset undercount"
canonical guidance. Zero new components — uses only native Prometheus
+ recording-rule primitives.
Found by the 14-PR integration test (plans/integration-test-results-
final-20260524.md). The OBOL parity smoke surfaced it more visibly
when a verifier restart produced a "0 req·24h" UI display on a row
with real on-chain traffic.
Stacks on PR #527 (Helm-escape fix for the same file).
Collaborator
Author
|
Superseded by bundle PR #536 — closing in favor of the consolidated merge target. Original branch and history preserved. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The recording rule
x402:revenue:total_by_offer_currentused a naivesum(counter)expression, which is wrong for any metric where the counter can reset across pod restarts. Prometheus counters are per-process by design — they reset to zero on every pod restart (rollout, OOM, eviction, node reschedule). Asum by (...) (counter)query therefore drops to zero whenever the verifier restarts, producing a misleading "0 requests" reading on offers with real on-chain traffic.This PR fixes the rule to use
increase(), which performs reset detection at query time across the samples held in the TSDB.Before
After
Why
increase([7d])not[8d]The TSDB is the canonical persistence layer (retention 8d).
increase()needs samples on both sides of the window edge to do reset detection at the left edge of the window. A 7d window inside 8d retention gives a 1-day headroom so the rule keeps working at exactly the moment data ages out, instead of silently producing NaN/undercounts at the boundary.Why the rename
The old names (
total_by_offer_current,lifetime_by_offer) were aspirational against a finite retention window. The new name (7d_by_offer) matches what the expression actually returns and makes the bounded nature of the value explicit at the call site.Canonical reference: Robust Perception "avoiding the counter-reset undercount".
Provenance
Found by the 14-PR integration test (
plans/integration-test-results-final-20260524.md). The OBOL parity smoke surfaced it more visibly when a verifier restart produced a "0 req·24h" UI display on a row with real on-chain traffic.Stacks on #527 (Helm-escape fix for the same file). Frontend consumer PR follows.
Test plan
go build ./...cleango test ./internal/embed/... ./internal/x402/...greenobol stack upand returns a per-offer 7d count for a verifier that's been restarted at least once