Rates Engine v0.5.0-rc.85
Pre-release
Pre-release
·
445 commits
to main
since this release
[v0.5.0-rc.85] — 2026-05-28
Tested against Stellar Protocol 23 (Whisk).
Density-honesty bundle: the previous rc.83/rc.84 shipped with a status-page bug that inflated per-source density to 100% by falling back to genesis on a NULL live cursor. Combined with the cursor-derived density measuring process state rather than data state, this masked the F-0020 cascade-window gap entirely. This release ships the four layers needed to detect, diagnose, and prevent that class of failure end-to-end.
Pre-deploy operator note: this release rolls out the api + aggregator + ops binaries together (rc.84 was ops-only). No migrations needed beyond rc.83's 0046/0047.
Added
ratesengine-ops resume-stallednow gates plans against the data-derivedFindSorobanEventsLedgerGapsresult. Pre-this-PR the subcommand trusted the cursor inventory as ground truth; the live r1 dry-run surfaced 50 "actionable" stalled cursors but the first one probed (sdex[15394495, 30599999]) had 19 M trade rows already in the trades hypertable from an overlapping sibling cursor. False positive. Post-this-PR the gate cuts r1's actionable plan count from 50 to 6 — and those 6 are exactly the cursors whose remaining range overlaps the F-0020 cascade gap (62,642,781 → 62,735,517). Two new flags:--force-classic-cursors(operator opt-in to trust the cursor inventory for SDEX, which doesn't yet have a per-source data-gap detector) and--data-gap-min-size(threshold for the gate query; defaults to 1000). Four new tests pin the gate logic. Closes the F-0020 follow-up that motivatedresume-stalled's ship in the first place: the subcommand is now data-aware.- Periodic data-derived gap detector + 5 new metrics + 2 new alerts. The aggregator binary now runs
internal/storage/timescale.RunGapDetectoras a goroutine that scanssoroban_eventsevery 5 min for contiguous ledger-coverage gaps >= 1000 ledgers. Five new gauges emit per-source:ratesengine_ingest_gap_ledgers(total missing),ratesengine_ingest_gap_count(interval count),ratesengine_ingest_gap_max_size_ledgers(largest single gap),ratesengine_ingest_gap_detector_runs_total+ratesengine_ingest_gap_detector_duration_seconds(meta-metrics for worker health). Two new alerts ship in bothdeploy/monitoring/rules/ingestion.ymland the R1 overlay:ratesengine_ingest_gap_detected(P1 page, fires on max_size > 1000 sustained 15 min) andratesengine_ingest_gap_detector_silent(P2 ticket, fires when the meta-counter goes silent). Two new runbooks document triage + remediation:docs/operations/runbooks/ingest-gap-detected.mdanddocs/operations/runbooks/ingest-gap-detector-silent.md. The pre-this-worker world had no automated signal between "cursor inventory says clean" and "data table actually missing" — the F-0020 cascade-window soroban_events writer halt was invisible for the entire incident. Post-this-worker, the same scenario pages within ~6 min of the first detector cycle after the gap forms. ratesengine-ops find-data-gapssubcommand. Scanssoroban_eventsdirectly for contiguous ledger-coverage gaps and emits a targeted backfill plan. Data-derived alternative to cursor-derived density — cursor coverage measures process state ("did we walk this ledger") and can read 100% while data is missing; this subcommand measures reality. Flags:--min-gap-size(default 1000, filters out legitimate no-Soroban-activity stretches),--from/--to(range scope; default = first ledger in table → live cursor tip),--output text|json(text emits ready-to-pasteratesengine-ops backfillcommands; json emits a plan-shaped document forjqpiping). New store helpertimescale.Store.FindSorobanEventsLedgerGapsuses a LAG() window function overSELECT DISTINCT ledger— cheap on the (ledger_close_time, ledger) btree. Live r1 run found the two F-0020 cascade-window gaps (62,642,781 → 62,735,517 = 92,737 ledgers + 62,746,866 → 62,757,524 = 10,659 ledgers; 103,396 missing total) — exact-match to the manual probe. Future periodic gap-detection metric + alert (ratesengine_ingest_gap_ledgers{source}) is the next layer; this CLI is the immediate operator utility. Four tests pin happy/empty paths + JSON snake_case contract + text-mode operator-facing format.
Fixed
- Density metric no longer falls back to
sourceGenesisLedgerwhen the live cursor'sfirst_ledgeris NULL. The fallback had been inflating per-source density to a dishonest 100% on the status page, hiding genuine ingest gaps (notably the F-0020 cascade-window soroban_events gap, ~103 K ledgers across two contiguous ranges). The pre-fix premise — "the UPDATE branch flips first_ledger on next write" — was wrong: the UPDATE branch left first_ledger untouched, so the NULL persisted forever and the fallback became a permanent lie. After this fix, a NULL live cursor contributes no historical span; the projection credits only the backfill-cursor union until the live cursor's first_ledger is populated. UpsertCursornow COALESCE-populatesfirst_ledgeron the first UPDATE after a NULL row (pre-migration-0046 cluster post-deploy). The transient NULL→no-credit window closes on the live indexer's next tick rather than persisting forever; subsequent advances leave first_ledger pinned to the original value via COALESCE so the coverage anchor only ever moves backward through explicit operator action (DELETE + re-insert).- Cursor + diagnostic godoc rewritten to match the new contract — pre-2026-05-28 wording implied the fallback was transient and harmless; corrected to flag it as the source of the dishonest 100% reading the F-0020 audit surfaced.