Skip to content

Rates Engine v0.5.0-rc.85

Pre-release
Pre-release

Choose a tag to compare

@github-actions github-actions released this 28 May 19:12
· 445 commits to main since this release

[v0.5.0-rc.85] — 2026-05-28

Tested against Stellar Protocol 23 (Whisk).

Density-honesty bundle: the previous rc.83/rc.84 shipped with a status-page bug that inflated per-source density to 100% by falling back to genesis on a NULL live cursor. Combined with the cursor-derived density measuring process state rather than data state, this masked the F-0020 cascade-window gap entirely. This release ships the four layers needed to detect, diagnose, and prevent that class of failure end-to-end.

Pre-deploy operator note: this release rolls out the api + aggregator + ops binaries together (rc.84 was ops-only). No migrations needed beyond rc.83's 0046/0047.

Added

  • ratesengine-ops resume-stalled now gates plans against the data-derived FindSorobanEventsLedgerGaps result. Pre-this-PR the subcommand trusted the cursor inventory as ground truth; the live r1 dry-run surfaced 50 "actionable" stalled cursors but the first one probed (sdex [15394495, 30599999]) had 19 M trade rows already in the trades hypertable from an overlapping sibling cursor. False positive. Post-this-PR the gate cuts r1's actionable plan count from 50 to 6 — and those 6 are exactly the cursors whose remaining range overlaps the F-0020 cascade gap (62,642,781 → 62,735,517). Two new flags: --force-classic-cursors (operator opt-in to trust the cursor inventory for SDEX, which doesn't yet have a per-source data-gap detector) and --data-gap-min-size (threshold for the gate query; defaults to 1000). Four new tests pin the gate logic. Closes the F-0020 follow-up that motivated resume-stalled's ship in the first place: the subcommand is now data-aware.
  • Periodic data-derived gap detector + 5 new metrics + 2 new alerts. The aggregator binary now runs internal/storage/timescale.RunGapDetector as a goroutine that scans soroban_events every 5 min for contiguous ledger-coverage gaps >= 1000 ledgers. Five new gauges emit per-source: ratesengine_ingest_gap_ledgers (total missing), ratesengine_ingest_gap_count (interval count), ratesengine_ingest_gap_max_size_ledgers (largest single gap), ratesengine_ingest_gap_detector_runs_total + ratesengine_ingest_gap_detector_duration_seconds (meta-metrics for worker health). Two new alerts ship in both deploy/monitoring/rules/ingestion.yml and the R1 overlay: ratesengine_ingest_gap_detected (P1 page, fires on max_size > 1000 sustained 15 min) and ratesengine_ingest_gap_detector_silent (P2 ticket, fires when the meta-counter goes silent). Two new runbooks document triage + remediation: docs/operations/runbooks/ingest-gap-detected.md and docs/operations/runbooks/ingest-gap-detector-silent.md. The pre-this-worker world had no automated signal between "cursor inventory says clean" and "data table actually missing" — the F-0020 cascade-window soroban_events writer halt was invisible for the entire incident. Post-this-worker, the same scenario pages within ~6 min of the first detector cycle after the gap forms.
  • ratesengine-ops find-data-gaps subcommand. Scans soroban_events directly for contiguous ledger-coverage gaps and emits a targeted backfill plan. Data-derived alternative to cursor-derived density — cursor coverage measures process state ("did we walk this ledger") and can read 100% while data is missing; this subcommand measures reality. Flags: --min-gap-size (default 1000, filters out legitimate no-Soroban-activity stretches), --from / --to (range scope; default = first ledger in table → live cursor tip), --output text|json (text emits ready-to-paste ratesengine-ops backfill commands; json emits a plan-shaped document for jq piping). New store helper timescale.Store.FindSorobanEventsLedgerGaps uses a LAG() window function over SELECT DISTINCT ledger — cheap on the (ledger_close_time, ledger) btree. Live r1 run found the two F-0020 cascade-window gaps (62,642,781 → 62,735,517 = 92,737 ledgers + 62,746,866 → 62,757,524 = 10,659 ledgers; 103,396 missing total) — exact-match to the manual probe. Future periodic gap-detection metric + alert (ratesengine_ingest_gap_ledgers{source}) is the next layer; this CLI is the immediate operator utility. Four tests pin happy/empty paths + JSON snake_case contract + text-mode operator-facing format.

Fixed

  • Density metric no longer falls back to sourceGenesisLedger when the live cursor's first_ledger is NULL. The fallback had been inflating per-source density to a dishonest 100% on the status page, hiding genuine ingest gaps (notably the F-0020 cascade-window soroban_events gap, ~103 K ledgers across two contiguous ranges). The pre-fix premise — "the UPDATE branch flips first_ledger on next write" — was wrong: the UPDATE branch left first_ledger untouched, so the NULL persisted forever and the fallback became a permanent lie. After this fix, a NULL live cursor contributes no historical span; the projection credits only the backfill-cursor union until the live cursor's first_ledger is populated.
  • UpsertCursor now COALESCE-populates first_ledger on the first UPDATE after a NULL row (pre-migration-0046 cluster post-deploy). The transient NULL→no-credit window closes on the live indexer's next tick rather than persisting forever; subsequent advances leave first_ledger pinned to the original value via COALESCE so the coverage anchor only ever moves backward through explicit operator action (DELETE + re-insert).
  • Cursor + diagnostic godoc rewritten to match the new contract — pre-2026-05-28 wording implied the fallback was transient and harmless; corrected to flag it as the source of the dishonest 100% reading the F-0020 audit surfaced.