Rates Engine v0.5.0-rc.57

Pre-release

@github-actions github-actions released this 19 May 15:34
· 708 commits to main since this release

[v0.5.0-rc.57] — 2026-05-19

Fixed

  • CachedCoinsReader stale-while-revalidate — kills the
    /v1/assets list cold-refresh stampede (#22).
    The cache already single-flighted, but on TTL expiry the leader
    discarded the still-present stale rows and blocked ~3.9 s on the
    upstream listing aggregate inline, so every request landing in a
    refetch window paid it (p50 1 ms warm / p95 3.9 s at each TTL
    boundary).
    fetchRows now serves the stale rows immediately on expiry
    and triggers exactly one background refresh (refreshRows,
    request-ctx-independent so it isn't aborted when the stale
    response is written; keeps serving stale and retries on refresh
    error; single-flighted so concurrent stale reads never stampede
    upstream). Cold miss still blocks (nothing to serve). Acceptable
    staleness for an activity-ranked listing (the cache's own
    rationale). New stale/refresh_error cache-op outcomes.
    Concurrency-correct: go test -race clean across new SWR tests
    (serves-stale-instantly, single-flight under 25 concurrent reads,
    keeps-stale-on-error). Same SWR pattern is a candidate follow-up
    for fetchHistoryMap (sparkline). Bundles into rc.57.
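The serve-stale-then-refresh behaviour described above can be sketched as follows. This is a minimal illustration, not the CachedCoinsReader code: the `swrCache` type, its fields, the `"miss"`/`"hit"` outcome names, and the 30 s refresh timeout are all assumptions made for the example; only the `stale` outcome name comes from the notes.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// swrCache is a hypothetical single-value stale-while-revalidate cache.
type swrCache struct {
	mu                 sync.Mutex
	rows               []string
	fetchedAt          time.Time
	loaded, refreshing bool           // refreshing: one background refresh in flight
	bg                 sync.WaitGroup // lets callers wait for background refreshes
	ttl                time.Duration
	load               func(ctx context.Context) ([]string, error)
}

// Get returns the rows plus a cache-op outcome: "miss", "hit" or "stale".
func (c *swrCache) Get(ctx context.Context) ([]string, string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.loaded {
		// Cold miss: nothing to serve, so block on upstream under the lock.
		rows, err := c.load(ctx)
		if err != nil {
			return nil, "miss", err
		}
		c.rows, c.fetchedAt, c.loaded = rows, time.Now(), true
		return rows, "miss", nil
	}
	if time.Since(c.fetchedAt) < c.ttl {
		return c.rows, "hit", nil
	}
	// Expired: serve stale now and start at most one background refresh.
	if !c.refreshing {
		c.refreshing = true
		c.bg.Add(1)
		go c.refresh()
	}
	return c.rows, "stale", nil
}

// refresh uses its own context so a finished request cannot cancel it.
func (c *swrCache) refresh() {
	defer c.bg.Done()
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	rows, err := c.load(ctx) // the upstream call happens outside the lock
	c.mu.Lock()
	defer c.mu.Unlock()
	c.refreshing = false
	if err == nil {
		c.rows, c.fetchedAt = rows, time.Now()
	} // on error the stale rows stay in place; the next stale read retries
}

// demo runs: cold miss, forced expiry, 25 concurrent stale reads, then a hit.
func demo() (map[string]int, int) {
	var loads int32
	release := make(chan struct{})
	c := &swrCache{ttl: time.Minute}
	c.load = func(ctx context.Context) ([]string, error) {
		n := atomic.AddInt32(&loads, 1)
		if n > 1 {
			<-release // hold the refresh open so every reader overlaps it
		}
		return []string{fmt.Sprintf("generation-%d", n)}, nil
	}
	outcomes := map[string]int{}
	var omu sync.Mutex
	read := func() {
		_, outcome, _ := c.Get(context.Background())
		omu.Lock()
		outcomes[outcome]++
		omu.Unlock()
	}
	read() // cold miss
	c.mu.Lock()
	c.fetchedAt = c.fetchedAt.Add(-2 * c.ttl) // simulate TTL expiry
	c.mu.Unlock()
	var readers sync.WaitGroup
	for i := 0; i < 25; i++ {
		readers.Add(1)
		go func() { defer readers.Done(); read() }()
	}
	readers.Wait()
	close(release)
	c.bg.Wait()
	read() // fresh again after the single refresh
	return outcomes, int(atomic.LoadInt32(&loads))
}

func main() { fmt.Println(demo()) }
```

In this sketch the 25 concurrent readers all receive the stale rows while a single refresh is in flight, so upstream sees two loads in total: the cold miss and one refresh.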

  • coins.go xlm_usd CTE bounded to 24 h — collapses #18#21.
    The native→USD price CTE in the /v1/assets/{id} coins query had
    no time predicate (its xlm_usd_1h/xlm_usd_24h siblings and
    the sources_stats.go copies were already bounded). With no
    bucket floor TimescaleDB can't chunk-prune, so
    ORDER BY bucket DESC LIMIT 1 across 3 quote_assets must consider
    every prices_1m chunk (thousands post-backfill). Warm+idle
    that's ~13 ms, but the all-chunks access pattern degrades badly
    under concurrent load + cold buffers — observed ~40 s in
    pg_stat_activity during /v1/assets/{id} fan-out (caught via
    in-flight sampling), the dominant tax on every native→USD price
    path (assets detail ×3 in the F2 chain, change_24h, market_cap,
    /v1/price?asset=native, network stats). This is the same query
    #18 logged at ~51 s. Adding
    AND bucket >= now() - INTERVAL '24 hours'
    chunk-prunes it to ~1 day (~2–3 ms, resilient under
    load) and is more correct — the unbounded form could surface
    a days-stale vwap as the current price; XLM/USDC trades every
    minute so a 24 h floor never realistically misses the latest.
    Surgical one-clause change; mirrors the already-bounded
    sources_stats.go CTE. Verified on r1; re-measure end-to-end via
    api-latency-sweep.sh post rc.57.

  • /v1/markets no longer 8 s-times-out / 503s — distinctPairsCommon
    reads right-granularity CAGGs (#20).
    The query that powers /v1/markets (+ ?source= / ?asset= variants)
    aggregated prices_1m × 14 days × ~52 k pairs; post all-time backfill
    prices_1m ballooned so it seq-scanned multi-million-row
    materialised chunks (~8 s+), blowing the 8 s handler ceiling and
    leaving the prewarm unable to warm the cache → real users got
    8 s 503/500/empty-200 (live log: a Chrome client, a node client).
    It's a directory query, so it now sources the 14-day
    active-pair set + last_trade_at + last_price from prices_1d
    (cheap — the killer was always the 14 d × 52 k-pair enumeration)
    and the 24 h trade_count/volume_usd from prices_1m
    RESTRICTED to the trailing 24 h (chunk-pruned → ~160 ms
    all-pairs on r1 — fast and exact). Correction (caught
    pre-deploy by r1 measurement): an interim variant sourced the
    24 h figures from prices_1h under a "Σ-associative → identical"
    claim — that is false for a rolling, non-hour-aligned 24 h
    window: prices_1h understated all-pairs 24 h volume ~9 %
    ($3.60 B vs prices_1m $3.97 B; boundary + prices_1h refresh-lag).
    The shipped form keeps the user-facing 24 h figure
    prices_1m-accurate. No data/precision loss anywhere data is
    consumed at resolution: prices_1m and every detail endpoint
    (/history, /ohlc, /chart, /vwap, /twap) are untouched;
    the only change is that the listing's last_trade_at rounds to the
    day (from prices_1d). Verified on r1 real data: plan cost
    330 k → 46 k, raw-scan → prices_1d index scan + a ~160 ms
    prices_1m-24 h aggregate, ~8 s+ & uncompletable → fast, results
    sane & correctly volume-desc ordered (BTC/USDT $1.0 B,
    ETH/USDT $588 M, …). Keyset cursor / order / Market shape
    preserved byte-for-byte; count_24h COALESCE'd to 0 for
    24 h-idle pairs (more robust than the prior
    FILTER-SUM NULL). Takes effect on r1 with the rc.57 deploy;
    end-to-end re-measure via api-latency-sweep.sh post-deploy.
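The boundary effect behind the prices_1h correction can be reproduced with synthetic data. The sketch below is an illustration only: the `bucket` type, the flat one-unit-per-minute volume, and the assumption that the window is selected with a "bucket start >= floor" predicate are all hypothetical, and it models only the boundary loss, not the refresh lag that also contributed to the observed ~9 %.

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a hypothetical aggregate row: bucket start time plus volume.
type bucket struct {
	start  time.Time
	volume int64
}

// minuteBuckets synthesises one unit of volume per minute over [from, to).
func minuteBuckets(from, to time.Time) []bucket {
	var out []bucket
	for t := from; t.Before(to); t = t.Add(time.Minute) {
		out = append(out, bucket{start: t, volume: 1})
	}
	return out
}

// rollupHourly re-aggregates minute buckets into hour-aligned buckets.
func rollupHourly(mins []bucket) []bucket {
	sums := map[int64]int64{} // hour start (unix seconds) -> volume
	for _, b := range mins {
		sums[b.start.Truncate(time.Hour).Unix()] += b.volume
	}
	out := make([]bucket, 0, len(sums))
	for ts, v := range sums {
		out = append(out, bucket{start: time.Unix(ts, 0).UTC(), volume: v})
	}
	return out
}

// sumSince mirrors a "bucket >= floor" predicate: only buckets whose START
// is at or after the floor are counted.
func sumSince(bs []bucket, floor time.Time) int64 {
	var total int64
	for _, b := range bs {
		if !b.start.Before(floor) {
			total += b.volume
		}
	}
	return total
}

// demo compares a rolling 24 h sum taken from 1m buckets and from 1h buckets
// at a "now" that is deliberately not hour-aligned (15:37 UTC).
func demo() (fromMinutes, fromHours int64) {
	now := time.Date(2026, 5, 19, 15, 37, 0, 0, time.UTC)
	mins := minuteBuckets(now.Add(-48*time.Hour), now)
	floor := now.Add(-24 * time.Hour)
	return sumSince(mins, floor), sumSince(rollupHourly(mins), floor)
}

func main() {
	m, h := demo()
	fmt.Printf("24 h volume from 1m buckets: %d, from 1h buckets: %d\n", m, h)
}
```

With the floor at 15:37, the hourly bucket starting at 15:00 on the previous day fails the predicate, so the 23 minutes of volume it holds inside the window are dropped. Summation is associative only when the window edges coincide with bucket edges.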

Added

  • scripts/dev/api-latency-sweep.sh — granular latency profiler
    over the entire anonymous public GET surface (the "kitchen sink"):
    N samples/endpoint → p50/p95/p99/max, ranked slowest-first,
    flagged against the RFP SLO (p95 < 200 ms) and a 1 s concern
    ceiling, exit code = endpoints over the ceiling. Portable
    (API_BASE_URL → run on-host for pure server compute, or from a
    VPS / against r2/r3 for network + cross-region), CACHE_BUST=1
    exposes uncached cost, JSON=1 for machine diffing,
    --spec-check diffs coverage against openapi/…v1.yaml so it
    can't rot. Complements cmd/ratesengine-sla-probe (focused RFP
    pass/fail) with a broad diagnostic ranking. First r1 run
    surfaced /v1/markets (~8 s, failing), /v1/assets/native
    (~5 s), /v1/assets list cold-refresh stampede (1 ms/3.9 s).
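The ranking the sweep produces can be sketched in Go as below. The script itself is a shell script whose internals are not shown here, so everything in this block is an assumption for illustration: the nearest-rank percentile method, the `endpointStats` type, comparing the 1 s ceiling against p95, and the sample values (chosen to echo the 1 ms / 3.9 s and ~8 s figures above).

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

const (
	sloP95  = 200 * time.Millisecond // RFP SLO quoted in the notes
	ceiling = time.Second            // concern ceiling quoted in the notes
)

// endpointStats is a hypothetical per-endpoint summary row.
type endpointStats struct {
	Path               string
	P50, P95, P99, Max time.Duration
	OverSLO, OverCeil  bool
}

// percentile uses the nearest-rank method on an ascending-sorted slice.
func percentile(sorted []time.Duration, p int) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	rank := (p*len(sorted) + 99) / 100 // ceil(p/100 * n) in integer math
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func summarize(path string, samples []time.Duration) endpointStats {
	s := append([]time.Duration(nil), samples...) // sort a copy
	sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
	st := endpointStats{Path: path, P50: percentile(s, 50), P95: percentile(s, 95), P99: percentile(s, 99)}
	if len(s) > 0 {
		st.Max = s[len(s)-1]
	}
	st.OverSLO = st.P95 >= sloP95 // SLO is p95 < 200 ms
	st.OverCeil = st.P95 > ceiling
	return st
}

// rank orders slowest-first by p95 and counts endpoints over the ceiling.
func rank(stats []endpointStats) ([]endpointStats, int) {
	sort.SliceStable(stats, func(i, j int) bool { return stats[i].P95 > stats[j].P95 })
	over := 0
	for _, s := range stats {
		if s.OverCeil {
			over++
		}
	}
	return stats, over
}

// repeat builds n identical samples.
func repeat(d time.Duration, n int) []time.Duration {
	out := make([]time.Duration, n)
	for i := range out {
		out[i] = d
	}
	return out
}

func demo() ([]endpointStats, int) {
	assets := append(repeat(time.Millisecond, 18), repeat(3900*time.Millisecond, 2)...)
	return rank([]endpointStats{
		summarize("/v1/price", repeat(5*time.Millisecond, 20)),
		summarize("/v1/assets", assets),
		summarize("/v1/markets", repeat(8*time.Second, 20)),
	})
}

func main() {
	ranked, over := demo()
	for _, s := range ranked {
		fmt.Printf("%-12s p50=%v p95=%v p99=%v max=%v overSLO=%v\n",
			s.Path, s.P50, s.P95, s.P99, s.Max, s.OverSLO)
	}
	fmt.Println("endpoints over ceiling (would be the exit code):", over)
}
```

The /v1/assets samples show why percentiles matter here: a p50 of 1 ms hides a p95 of 3.9 s when only the requests landing on a TTL boundary are slow.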

Fixed

  • BackfillCoverageStats gutted to a no-op — removes the dead
    per-source trades scan entirely (the honest fix; #12 only
    bounded it).
    Consumer trace confirmed its output is 100 %
    unused: buildBackfillCoverage is cursor-first for every mapped
    source and its cacheRows path continues past every source the
    function scanned (all in sourceGenesisLedger). It nonetheless
    ran ~13 per-source ts-ordered scans + a ~15 s
    approximate_row_count('trades') every refresh interval, and the
    oracle sources' zero-trades scans walked the full ~2700-chunk
    hypertable to the 57014 timeout — the CoverageCache cold-start
    hang + a primary SLO-burn contributor. Now does zero DB work;
    the dead scanScalarBestEffort/coverageStatTimeout from #12 are
    removed too. The (now-inert, zero-cost) CoverageCache scaffolding
    is removed in the #16 snapshot-pregeneration refactor that
    supersedes this whole path.
  • verify-archive-tier-a.service: TimeoutStartSec 4h → 17h —
    fixes a bootstrap deadlock that kept
    ratesengine_verify_archive_unit_failed firing permanently.
    The binary self-bounds at -max-runtime (16h) and only writes the
    -from-last-verified state file on a clean exit; subsequent runs
    are then incremental (minutes). But systemd's TimeoutStartSec
    was 4h while -max-runtime was 16h, so on a fresh deploy / after
    a state-file loss the bootstrap full pass (~10–14h at 12 workers,
    state absent → -from genesis) was SIGTERM'd at 4h before it
    could seed state → every run a full pass → permanent failure
    (true deadlock; the old "4h is plenty for incremental" rationale
    ignored that the first run is always a full pass). Also bumped
    Environment=VERIFY_ARCHIVE_MAX_RUNTIME 4h → 16h to match. r1
    hot-fixed in place via a drop-in (TimeoutStartSec=17h) +
    reset-failed; the next nightly run bootstraps the state file and
    it is self-healing thereafter. Operator-copied unit (not in the
    binary release) — repo is now source-of-truth-correct so R2/R3 /
    fresh deploys don't reintroduce the deadlock.
  • BackfillCoverageStats is now fail-soft + per-query
    time-bounded — fixes the coverage-cache cold-start hang and a
    primary SLO-burn contributor.
    Oracle sources (band / redstone /
    reflector-*) write to oracle_updates, never trades, so their
    per-source … WHERE source=$1 ORDER BY ts LIMIT 1 earliest
    query could not chunk-exclude and scanned all ~2700 trades chunks
    to prove emptiness, hitting the statement-timeout (57014). The
    old code returned (nil, err) on that, so CoverageCache's
    cold-start Refresh never succeeded (snapshot stayed nil
    forever) and the failing query was re-issued every refresh
    interval, feeding the SLO availability/latency burn alerts. Every
    query is now run through scanScalarBestEffort (8 s per-query
    timeout, returns 0 on any error instead of propagating), so one
    slow/empty source degrades that field to 0 instead of blanking
    the whole snapshot. BackfillCoverageStats now always returns
    (rows, nil). These stats are best-effort enrichment only — the
    headline density is cursor-derived and entries come from
    source_entry_counts — so 0-on-timeout is the correct safe
    degradation. Integration-covered (no DB-free unit seam).
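The fail-soft, per-query-bounded pattern described above can be sketched without a database. The real scanScalarBestEffort signature is not shown in these notes, so `bestEffortScalar`, the `scalarQuery` type, and the 20 ms demo timeout are hypothetical stand-ins (the notes state 8 s per query).

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// scalarQuery is a hypothetical stand-in for one single-value DB query.
type scalarQuery func(ctx context.Context) (int64, error)

// bestEffortScalar runs q under its own timeout and returns 0 on any error,
// including a deadline, so one bad query never fails the whole snapshot.
func bestEffortScalar(parent context.Context, timeout time.Duration, q scalarQuery) int64 {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	v, err := q(ctx)
	if err != nil {
		return 0
	}
	return v
}

// demo runs a fast, a slow, and a failing query through the same wrapper.
func demo() map[string]int64 {
	queries := map[string]scalarQuery{
		"fast": func(ctx context.Context) (int64, error) { return 42, nil },
		"slow": func(ctx context.Context) (int64, error) {
			select {
			case <-time.After(5 * time.Second): // simulated full-hypertable scan
				return 7, nil
			case <-ctx.Done(): // the per-query deadline fires first
				return 0, ctx.Err()
			}
		},
		"broken": func(ctx context.Context) (int64, error) {
			return 0, errors.New("connection reset")
		},
	}
	out := map[string]int64{}
	for name, q := range queries {
		out[name] = bestEffortScalar(context.Background(), 20*time.Millisecond, q)
	}
	return out
}

func main() {
	fmt.Println(demo())
}
```

The wrapper only bounds queries that honour context cancellation; a query function that ignores `ctx` would still block for its full duration.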
  • sourceGenesisLedger: corrected comet/blend off-by-one
    (51_499_545 → 51_499_546).
    51_499_545 came from the walk JSON's from_ledger (the
    ContractCode-upload / walker transition
    boundary); the exact ContractInstance instantiation ledger is
    L51_499_546 per comet.md:157 and blend.md:90. comet and
    blend legitimately share this ledger — there is no standalone
    mainnet Comet; the only mainnet Comet deployment is Blend's
    backstop pool, instantiated in the same ledger as Blend's Pool
    Factory V2 during Blend's mainnet rollout. Comment expanded so
    the shared origin reads as intentional, not a copy-paste bug.
    defindex stays a clearly-labelled PROVISIONAL placeholder
    (separate 2025 protocol; real value pending its in-progress
    wasm-history walk) and is now deliberately distinct from the
    comet/blend pair so it is not mistaken for the real coincidence.