Rates Engine v0.5.0-rc.57

Pre-release

github-actions released this 19 May 15:34

ad78dd0

[v0.5.0-rc.57] — 2026-05-19

Fixed

CachedCoinsReader stale-while-revalidate — kills the
/v1/assets list cold-refresh stampede (#22). The cache
already single-flighted, but on TTL expiry the leader discarded
the still-present stale rows and blocked ~3.9 s on the upstream
listing aggregate inline, so every request landing in a refetch
window paid it (p50 1 ms warm / p95 3.9 s at each TTL boundary).
fetchRows now serves the stale rows immediately on expiry
and triggers exactly one background refresh (refreshRows,
request-ctx-independent so it isn't aborted when the stale
response is written; keeps serving stale and retries on refresh
error; single-flighted so concurrent stale reads never stampede
upstream). Cold miss still blocks (nothing to serve). Acceptable
staleness for an activity-ranked listing (the cache's own
rationale). New stale/refresh_error cache-op outcomes.
Concurrency-correct: go test -race clean across new SWR tests
(serves-stale-instantly, single-flight under 25 concurrent reads,
keeps-stale-on-error). Same SWR pattern is a candidate follow-up
for fetchHistoryMap (sparkline). Bundles into rc.57.
coins.go xlm_usd CTE bounded to 24 h — collapses #18 ≡ #21.
The native→USD price CTE in the /v1/assets/{id} coins query had
no time predicate (its xlm_usd_1h/xlm_usd_24h siblings and
the sources_stats.go copies were already bounded). With no
bucket floor TimescaleDB can't chunk-prune, so
ORDER BY bucket DESC LIMIT 1 across 3 quote_assets must consider
every prices_1m chunk (thousands post-backfill). Warm+idle
that's ~13 ms, but the all-chunks access pattern degrades badly
under concurrent load + cold buffers — observed ~40 s in
pg_stat_activity during /v1/assets/{id} fan-out (caught via
in-flight sampling), the dominant tax on every native→USD price
path (assets detail ×3 in the F2 chain, change_24h, market_cap,
/v1/price?asset=native, network stats). This is the same query
#18 logged at ~51 s. Adding the predicate
AND bucket >= now() - INTERVAL '24 hours'
chunk-prunes the scan to ~1 day (~2–3 ms, resilient under
load) and is more correct — the unbounded form could surface
a days-stale vwap as the current price; XLM/USDC trades every
minute so a 24 h floor never realistically misses the latest.
Surgical one-clause change; mirrors the already-bounded
sources_stats.go CTE. Verified on r1; re-measure end-to-end via
api-latency-sweep.sh post rc.57.
/v1/markets no longer 8 s-times-out / 503s — distinctPairsCommon
reads right-granularity CAGGs (#20). The query that powers
/v1/markets (+ ?source= / ?asset= variants) aggregated
prices_1m × 14 days × ~52 k pairs; post all-time backfill
prices_1m ballooned so it seq-scanned multi-million-row
materialised chunks (~8 s+), blowing the 8 s handler ceiling and
leaving the prewarm unable to warm the cache → real users got
8 s 503/500/empty-200 (live log: a Chrome client, a node client).
It's a directory query, so it now sources the 14-day
active-pair set + last_trade_at + last_price from prices_1d
(cheap — the killer was always the 14 d × 52 k-pair enumeration)
and the 24 h trade_count/volume_usd from prices_1m
RESTRICTED to the trailing 24 h (chunk-pruned → ~160 ms
all-pairs on r1 — fast and exact). Correction (caught
pre-deploy by r1 measurement): an interim variant sourced the
24 h figures from prices_1h under a "Σ-associative → identical"
claim — that is false for a rolling, non-hour-aligned 24 h
window: prices_1h understated all-pairs 24 h volume ~9 %
($3.60 B vs prices_1m $3.97 B; boundary + prices_1h refresh-lag).
The shipped form keeps the user-facing 24 h figure
prices_1m-accurate. No data/precision loss anywhere data is
consumed at resolution — prices_1m and every detail endpoint
(/history, /ohlc, /chart, /vwap, /twap) are untouched;
the only change is the listing's last_trade_at rounds to the
day (from prices_1d). Verified on r1 real data: plan cost
330 k → 46 k, raw-scan → prices_1d index scan + a ~160 ms
prices_1m-24 h aggregate, ~8 s+ and uncompletable → fast, results
sane & correctly volume-desc ordered (BTC/USDT $1.0 B,
ETH/USDT $588 M, …). Keyset cursor / order / Market shape
preserved byte-for-byte; count_24h COALESCE'd to 0 for
24 h-idle pairs (more robust than the prior
FILTER-SUM NULL). Takes effect on r1 with the rc.57 deploy;
end-to-end re-measure via api-latency-sweep.sh post-deploy.
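
The boundary half of the correction above (why hourly buckets cannot reproduce a rolling, non-hour-aligned 24 h sum) can be shown with a small synthetic example. The data here is invented for illustration: a constant 1 unit of volume per minute, with "now" placed 37 minutes past the hour. It does not model the prices_1h refresh lag, and the function names are hypothetical.

```go
package main

import "fmt"

const minutesPerHour = 60

// sumMinutes adds per-minute volume over the half-open window [from, to).
func sumMinutes(perMinute []float64, from, to int) float64 {
	total := 0.0
	for m := from; m < to; m++ {
		total += perMinute[m]
	}
	return total
}

// sumHourBuckets mimics filtering an hourly aggregate with
// "bucket >= now - 24h": a bucket counts only if its start lies in [from, to).
func sumHourBuckets(perMinute []float64, from, to int) float64 {
	total := 0.0
	for start := 0; start < to; start += minutesPerHour {
		if start < from {
			continue // bucket began before the window, so it is dropped whole
		}
		end := start + minutesPerHour
		if end > to {
			end = to // the current hour is only partly filled
		}
		total += sumMinutes(perMinute, start, end)
	}
	return total
}

func main() {
	now := 34*minutesPerHour + 37 // minute index of "now", 37 minutes past the hour
	from := now - 24*minutesPerHour
	perMinute := make([]float64, now)
	for i := range perMinute {
		perMinute[i] = 1 // synthetic constant volume
	}
	fmt.Println(sumMinutes(perMinute, from, now), sumHourBuckets(perMinute, from, now))
}
```

With these inputs the minute-level sum is 1440 while the hourly-bucket sum is 1417: the 23 minutes between the window start and the next hour boundary belong to a bucket that the filter discards entirely. The per-bucket sums are associative, but the set of rows selected by the window is not the same at the two granularities.
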

Added

scripts/dev/api-latency-sweep.sh — granular latency profiler
over the entire anonymous public GET surface (the "kitchen sink"):
N samples/endpoint → p50/p95/p99/max, ranked slowest-first,
flagged against the RFP SLO (p95 < 200 ms) and a 1 s concern
ceiling, exit code = endpoints over the ceiling. Portable
(API_BASE_URL → run on-host for pure server compute, or from a
VPS / against r2/r3 for network + cross-region), CACHE_BUST=1
exposes uncached cost, JSON=1 for machine diffing,
--spec-check diffs coverage against openapi/…v1.yaml so it
can't rot. Complements cmd/ratesengine-sla-probe (focused RFP
pass/fail) with a broad diagnostic ranking. First r1 run
surfaced /v1/markets (~8 s, failing), /v1/assets/native
(~5 s), /v1/assets list cold-refresh stampede (1 ms/3.9 s).

Fixed

BackfillCoverageStats gutted to a no-op — removes the dead
per-source trades scan entirely (the honest fix; #12 only
bounded it). Consumer trace confirmed its output is 100 %
unused: buildBackfillCoverage is cursor-first for every mapped
source and its cacheRows path continues past every source the
function scanned (all in sourceGenesisLedger). It nonetheless
ran ~13 per-source ts-ordered scans + a ~15 s
approximate_row_count('trades') every refresh interval, and the
oracle sources' zero-trades scans walked the full ~2700-chunk
hypertable to the 57014 timeout — the CoverageCache cold-start
hang + a primary SLO-burn contributor. Now does zero DB work;
the dead scanScalarBestEffort/coverageStatTimeout from #12 are
removed too. The (now-inert, zero-cost) CoverageCache scaffolding
is removed in the #16 snapshot-pregeneration refactor that
supersedes this whole path.
verify-archive-tier-a.service: TimeoutStartSec 4h → 17h —
fixes a bootstrap deadlock that kept
ratesengine_verify_archive_unit_failed firing permanently. The
binary self-bounds at -max-runtime (16h) and only writes the
-from-last-verified state file on a clean exit; subsequent runs
are then incremental (minutes). But systemd's TimeoutStartSec
was 4h while -max-runtime was 16h, so on a fresh deploy / after
a state-file loss the bootstrap full pass (~10–14h at 12 workers,
state absent → -from genesis) was SIGTERM'd at 4h before it
could seed state → every run a full pass → permanent failure
(true deadlock; the old "4h is plenty for incremental" rationale
ignored that the first run is always a full pass). Also bumped
Environment=VERIFY_ARCHIVE_MAX_RUNTIME 4h → 16h to match. r1
hot-fixed in place via a drop-in (TimeoutStartSec=17h) +
reset-failed; the next nightly run bootstraps the state file and
it is self-healing thereafter. Operator-copied unit (not in the
binary release) — repo is now source-of-truth-correct so R2/R3 /
fresh deploys don't reintroduce the deadlock.
BackfillCoverageStats is now fail-soft + per-query
time-bounded — fixes the coverage-cache cold-start hang and a
primary SLO-burn contributor. Oracle sources (band / redstone /
reflector-*) write to oracle_updates, never trades, so their
per-source … WHERE source=$1 ORDER BY ts LIMIT 1 earliest
query could not chunk-exclude and scanned all ~2700 trades chunks
to prove emptiness, hitting the statement-timeout (57014). The
old code returned (nil, err) on that timeout, so CoverageCache's
cold-start Refresh never succeeded (snapshot stayed nil
forever) and the failing query was re-issued every refresh
interval, feeding the SLO availability/latency burn alerts. Every
query is now run through scanScalarBestEffort (8 s per-query
timeout, returns 0 on any error instead of propagating), so one
slow/empty source degrades that field to 0 instead of blanking
the whole snapshot. BackfillCoverageStats now always returns
(rows, nil). These stats are best-effort enrichment only — the
headline density is cursor-derived and entries come from
source_entry_counts — so 0-on-timeout is the correct safe
degradation. Integration-covered (no DB-free unit seam).
sourceGenesisLedger: corrected comet/blend off-by-one
(51_499_545 → 51_499_546). 51_499_545 came from the walk
JSON's from_ledger (the ContractCode-upload / walker transition
boundary); the exact ContractInstance instantiation ledger is
L51_499_546 per comet.md:157 and blend.md:90. comet and
blend legitimately share this ledger — there is no standalone
mainnet Comet; the only mainnet Comet deployment is Blend's
backstop pool, instantiated in the same ledger as Blend's Pool
Factory V2 during Blend's mainnet rollout. Comment expanded so
the shared origin reads as intentional, not a copy-paste bug.
defindex stays a clearly-labelled PROVISIONAL placeholder
(separate 2025 protocol; real value pending its in-progress
wasm-history walk) and is now deliberately distinct from the
comet/blend pair so it is not mistaken for the real coincidence.
