Rates Engine v0.5.0-rc.51
Pre-release [v0.5.0-rc.51] - 2026-05-14
Changed
-
verify-archive Tier A is now incremental. Pre-fix the nightly
systemd unit re-walked the entire chain from genesis every night,
taking ~13.8h of wall time and ~7h of CPU time per pass (67% of
every day; visible as a sustained load-average drag on r1). Past
LCM files are immutable, so re-hashing them is wasted compute.
New scheme: ratesengine-ops verify-archive accepts
-state-file PATH -from-last-verified [-safety-overlap N].
Reads the prior run's high-water mark from a small JSON file,
computes -from = max(2, last_verified - safety_overlap), and
verifies only the new tail. The resume-from-hash from prior
state is plumbed through so cross-run chain continuity is
preserved (the next incremental run's first chunk must chain to
the previous run's last verified hash).
Default safety overlap: 5000 ledgers (~17h of chain) catches any
anomalies that snuck in just before the last run's tip.
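The resume computation above can be sketched as follows. This is a minimal illustration, not the tool's code: the state-file field names (`last_verified_ledger`, `last_verified_hash`) and the `resumeFrom` helper are assumptions, since these notes do not show the actual JSON layout.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// verifyState is a hypothetical shape for the JSON state file; the real
// field names used by ratesengine-ops are not shown in these notes.
type verifyState struct {
	LastVerifiedLedger int64  `json:"last_verified_ledger"`
	LastVerifiedHash   string `json:"last_verified_hash"`
}

// resumeFrom returns max(2, lastVerified - safetyOverlap), the first
// ledger the incremental pass would verify.
func resumeFrom(lastVerified, safetyOverlap int64) int64 {
	from := lastVerified - safetyOverlap
	if from < 2 {
		return 2
	}
	return from
}

func main() {
	raw := []byte(`{"last_verified_ledger": 62500000, "last_verified_hash": "abc123"}`)
	var st verifyState
	if err := json.Unmarshal(raw, &st); err != nil {
		panic(err)
	}
	fmt.Println(resumeFrom(st.LastVerifiedLedger, 5000)) // 62495000
	fmt.Println(resumeFrom(1200, 5000))                  // 2 (clamped)
}
```

The clamp to 2 matters only on a young state file, where the overlap would otherwise reach back past the start of the chain.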
systemd unit defaults updated:
VERIFY_ARCHIVE_STATE_FILE=/var/lib/ratesengine/verify-archive-state.json,
VERIFY_ARCHIVE_MAX_RUNTIME=4h (down from 16h). Typical
incremental pass covers ~24h of new ledgers in minutes.
A weekly full-archive re-pass (defense-in-depth against silent
corruption in older chunks) remains a TBD sibling unit.
-
backfill_coverage[].density_pct replaces coverage_pct on
/v1/diagnostics/ingestion. Pre-fix the metric was
(latest - earliest) / (tip - genesis): endpoint span, not data density. A
source with one trade at genesis and one trade at tip scored
100% even with the whole interior empty (caught live 2026-05-14
when SDEX backfill was still running but coverage showed 99.8%
and aquarius/comet/phoenix/soroswap all showed 99.99%).
New metric: union of completed portions of all backfill cursor
intervals that include this source in their decoder set, clamped
to [genesis, tip], divided by tip - genesis + 1. Hits 100%
only when backfill ranges actually cover the whole interval.
Sparse sources no longer score 100% just for having endpoint
trades — they score by what fraction of ledgers their backfill
has processed, which is the question operators actually want
answered.
Wire: coverage_pct retained as a transitional field for one
release. New fields: density_pct, covered_ledgers,
expected_ledgers. Status page updated to render the new
density (tooltip exposes the absolute "covered / expected"
numerator + denominator).
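As an illustration of the new metric, the sketch below clamps completed ranges to [genesis, tip], merges overlaps, and divides by tip - genesis + 1. The `ledgerRange` type and function names are hypothetical; the actual implementation and its cursor schema are not shown in these notes.

```go
package main

import (
	"fmt"
	"sort"
)

// ledgerRange is an inclusive [From, To] span of ledgers that a backfill
// cursor has completed. The type and field names are illustrative.
type ledgerRange struct{ From, To int64 }

// coveredLedgers clamps each range to [genesis, tip], merges overlaps,
// and returns the number of distinct ledgers covered.
func coveredLedgers(ranges []ledgerRange, genesis, tip int64) int64 {
	clamped := make([]ledgerRange, 0, len(ranges))
	for _, r := range ranges {
		if r.From < genesis {
			r.From = genesis
		}
		if r.To > tip {
			r.To = tip
		}
		if r.From <= r.To {
			clamped = append(clamped, r)
		}
	}
	sort.Slice(clamped, func(i, j int) bool { return clamped[i].From < clamped[j].From })

	var covered int64
	next := genesis // first ledger not yet counted
	for _, r := range clamped {
		if r.To < next {
			continue // fully inside an earlier range
		}
		if r.From > next {
			next = r.From
		}
		covered += r.To - next + 1
		next = r.To + 1
	}
	return covered
}

// densityPct is covered / (tip - genesis + 1), expressed as a percentage.
func densityPct(ranges []ledgerRange, genesis, tip int64) float64 {
	expected := tip - genesis + 1
	if expected <= 0 {
		return 0
	}
	return 100 * float64(coveredLedgers(ranges, genesis, tip)) / float64(expected)
}

func main() {
	// Two endpoint-only ranges: an endpoint-span metric would report ~100%.
	sparse := []ledgerRange{{1, 10}, {991, 1000}}
	fmt.Printf("%.1f%%\n", densityPct(sparse, 1, 1000)) // 2.0%

	// Overlapping ranges are counted once; the second is clamped to tip.
	overlap := []ledgerRange{{1, 600}, {400, 1200}}
	fmt.Printf("%.1f%%\n", densityPct(overlap, 1, 1000)) // 100.0%
}
```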
Added
-
Chainlink ingest source (internal/sources/external/chainlink/).
Promotes the formerly-divergence-only Chainlink reference into a
full ingest source: writes canonical.OracleUpdate rows to
oracle_updates on its own poller goroutine alongside Reflector /
Redstone / Band. Implements external.Poller; lives parallel to
the existing internal/divergence/chainlink.go cross-check (which
stays in place for synchronous divergence_warning checks).
Wire shape: poll
AggregatorV3.latestRoundData() over JSON-RPC,
dedupe by (feed_address, roundId), project to canonical with
synthetic deterministic tx_hash (sha256(feed || roundId)) for
idempotent restart. Default 30s cadence; per-feed Decimals/Invert
overrides via TOML. Default endpoint is Cloudflare public; operator
drops an Alchemy URL (with embedded API key) into r1's TOML or via
CHAINLINK_RPC_URL env. Bounded concurrency (8) per tick.
Backfill: new ratesengine-ops backfill-chainlink subcommand walks
AnswerUpdated event logs via chunked eth_getLogs (5k blocks /
call, the safe default for Alchemy / Infura / QuickNode response-
size caps). ~33k RPC calls and ~7h wall time for all-time backfill
of the default 6 majors on Alchemy free tier (~19% of monthly
quota); scale linearly with feed count up to all 516 ETH-mainnet
Chainlink feeds within the same free-tier envelope. Idempotent on
the oracle_updates PK; safe to re-run over already-covered ranges.
Surface: registered in external.Registry as
ClassOracle / BackfillSafe=true / IncludeInVWAP=false. Picked up
by /v1/sources?class=oracle automatically; explorer's /oracles
page surfaces it without UI changes.
-
Oracle CAGG ladder (migration 0034). Seven continuous
aggregates on oracle_updates at the standard
1m/15m/1h/4h/1d/1w/1mo tiers, sister to the trade CAGG ladder
in migration 0002. Closes the gap where every /v1/oracle/*
history query was scanning raw oracle_updates; manageable at
~3 oracle sources × ~860 rows/day each, untenable once Chainlink
arrives at scale.
Aggregation semantics differ from trades: oracles are point-in-time
observations, so each bucket carries first / last / min / max /
last_decimals / count (no VWAP / TWAP because there is no
volume dimension). One row per (source, asset, quote, bucket):
per-source identity preserved so cross-oracle comparison stays
meaningful. Refresh policies match the trade ladder; no retention
on sub-1h tiers (matches the operator's "store everything forever"
decision in migration 0031), indefinite for 1h+ per the proposal.
-
DeFindex vault decoder (internal/sources/defindex/).
Event-based decoder (dispatcher.Decoder, NOT
ContractCallDecoder) for paltalabs/defindex's autocompound
vaults. Phase A matches ("DeFindexVault","deposit") and
("DeFindexVault","withdraw")events on the 3 known vaults
(USDC / EURC / XLM autocompound). Decoder pulls
depositor/withdrawer, multi-asset amounts vec
(i128, no truncation per ADR-0003), and the share-token
delta (df_tokens_minted/_burned) by name from the
body Map (decode-by-name per
contract-schema-evolution.md). Phase B will tag matching
same-tx Blend / Soroswap legs as
routed_via=defindex-{vault} and write
aggregator_exposures rows from a separate periodic
ticker. Pre-seed migration 0033_seed_defindex_vaults.up.sql
populates the 3 vaults in the routers registry as
kind='aggregator-vault'. WASM-history audit started at
docs/operations/wasm-audits/defindex.md;
BackfillSafe=false until the per-hash review lands.
-
Soroswap Router decoder (internal/sources/soroswap_router/).
New ContractCallDecoder following the Band oracle pattern —
matches by (contract_id, function_name) and decodes
swap_exact_tokens_for_tokens / swap_tokens_for_exact_tokens
invocations on the canonical pubnet router
(CAG5LRYQ5JVEUI5TEID72EYOVX44TTUJT5BQR2J6J77FH65PCCFAJDDH).
Phase A is log-only — every routed swap surfaces an INFO line
with path, in/out amounts (i128, no truncation per ADR-0003),
recipient, and deadline. Phase B will tag matching same-tx
trades.routed_via rows via the existing migration-0025 column.
New ClassRouter taxonomy in internal/sources/external/
(alongside the existing ClassLending); router class is
attribution-only, never contributes to VWAP. Pre-seed migration
0032_seed_soroswap_router.up.sql populates the routers
registry. WASM-history audit started at
docs/operations/wasm-audits/soroswap-router.md;
BackfillSafe=false until the per-hash review lands.
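Matching by (contract_id, function_name) could look roughly like the sketch below. The `callKey` type, the `routedSwaps` table, and `matches` are illustrative names; the real ContractCallDecoder interface is not shown in these notes, so this covers only the lookup step under that assumption.

```go
package main

import "fmt"

// callKey identifies an invocation by contract and function name.
// Hypothetical type for illustration only.
type callKey struct {
	ContractID string
	Function   string
}

// Canonical pubnet router address, as quoted in the entry above.
const soroswapRouter = "CAG5LRYQ5JVEUI5TEID72EYOVX44TTUJT5BQR2J6J77FH65PCCFAJDDH"

// routedSwaps is the set of invocations the decoder would care about.
var routedSwaps = map[callKey]bool{
	{soroswapRouter, "swap_exact_tokens_for_tokens"}: true,
	{soroswapRouter, "swap_tokens_for_exact_tokens"}: true,
}

// matches reports whether an invocation should be handed to the decoder.
func matches(contractID, function string) bool {
	return routedSwaps[callKey{contractID, function}]
}

func main() {
	fmt.Println(matches(soroswapRouter, "swap_exact_tokens_for_tokens")) // true
	fmt.Println(matches(soroswapRouter, "add_liquidity"))                // false
}
```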
Changed
- Raw trades retention removed (migration 0031). Pre-fix the
trades hypertable aged out at 90 days; we relied on the
hourly+ CAGGs to preserve historical OHLC. Operator wants raw
per-trade fidelity preserved indefinitely (regulatory + proof-
of-pricing queries can't be reconstructed from CAGGs).
Justification: r1's postgres data dir is on a 1.5 TB ZFS volume
with 4% used. Earlier "no room" analysis was wrong — was
measuring the OS root disk (49 GB), not the postgres data
volume. Status page coverage panel relabeled from "Raw-trades
coverage (last 90 days)" → "Raw-trades coverage — genesis →
tip"; coverage_pct grows monotonically as backfills land.
Compression policy on chunks > 7d is unchanged (~5x reduction).
Added
-
CEX pair coverage — cross-fiat majors. All four CEX
connectors (binance/bitstamp/coinbase/kraken) now stream BTC
and ETH against EUR + GBP in addition to USD. Pre-fix, only
Bitstamp published BTC/EUR — every aggregator tick on
crypto:BTC/fiat:EUR was single-source, which falsely tripped
Phase 2 freeze permanently. Bitstamp + Coinbase + Kraken +
Binance all support these pairs natively; we just hadn't
enumerated them in the connector defaults.
Stop-gap pre-Tier-3. The next change in this area will replace
the hand-curated DefaultPairs() maps with auto-discovery from
each exchange's pair-catalogue endpoint
(/api/v3/exchangeInfo / /products / /0/public/AssetPairs /
/api/v2/trading-pairs-info), filtered by an allow-list of
quote assets. That move expands coverage from ~50 hand-curated
pairs/exchange to ~200-1500 active pairs/exchange. Storage
scales with PAIR COUNT (CAGG rows, ~50 MB/year for 1500 pairs)
not raw trade volume (90-day retention), so the cost is
bounded.
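The planned quote-asset allow-list could be as simple as the filter sketched here. This describes work that has not shipped, so every name (`pair`, `filterByQuote`) is an assumption, and the catalogue fetch itself is left out.

```go
package main

import "fmt"

// pair is an illustrative base/quote tuple; the connectors' real types
// are not shown in these notes.
type pair struct{ Base, Quote string }

// filterByQuote keeps only the discovered pairs whose quote asset is on
// the allow-list.
func filterByQuote(discovered []pair, allowedQuotes map[string]bool) []pair {
	kept := make([]pair, 0, len(discovered))
	for _, p := range discovered {
		if allowedQuotes[p.Quote] {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	discovered := []pair{{"BTC", "USD"}, {"BTC", "EUR"}, {"ETH", "GBP"}, {"DOGE", "TRY"}}
	allowed := map[string]bool{"USD": true, "EUR": true, "GBP": true}
	fmt.Println(filterByQuote(discovered, allowed)) // [{BTC USD} {BTC EUR} {ETH GBP}]
}
```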
Fixed
-
Backfill auto-refresh: three bugs caught on first real run.
Yesterday's commit added refresh_continuous_aggregate calls
after each backfill chunk but every CAGG refresh failed. Three
fixes from the live test:
- 42P18: could not determine data type of parameter $1.
lib/pq's CALL syntax doesn't propagate the procedure
signature's parameter types, so untyped placeholders fail.
Fix: explicit ::timestamptz casts in the SQL.
- 22023: refresh window too small for prices_4h /
_1d / _1w / _1mo. Timescale rejects refresh windows
narrower than 2× bucket width. A 10k-ledger chunk's ts
span (~4h) was fine for prices_1h but failed every
coarser CAGG. Fix: per-CAGG MinWindow declared in
CAGGsLiveForever; new PadRefreshWindow helper expands
the chunk's window to that minimum centered on the
chunk's midpoint. Padded area materialises as empty
buckets (cheap).
- 55P03: concurrent refresh. With -parallel N,
multiple chunks race on the same coarse CAGG (prices_1mo
was the worst: chunks finishing close together all want
to refresh the same monthly bucket). Fix: retry-on-55P03
with exponential backoff (200ms → 1.6s × 5 attempts).
End-to-end verified live: 10k-ledger SDEX backfill at
ledgers 50,000,000-50,010,000 inserted 718,873 trades AND
populated 66,513 prices_1h buckets + 22,005 prices_1d
buckets; those CAGGs will now persist past the 90-day raw
retention. Yesterday's claim "auto-refresh now works" was
premature; this commit is what makes it true.
-
Live-site QA pass — F-01/F-03/F-04 resolved, F-02 partial.
Working through docs/review-2026-05-13-live-site-qa.md:
- F-01 (degraded state invisible in explorer): new
DegradedBanner component polls /v1/status every 60s and
renders a fixed band between Navbar and content when
overall ≠ "ok". Tone (amber/red) keys off pageCount > 0.
Includes top alert name + link to status page. Quiet when
everything's fine; noisy enough to set expectations when
it isn't.
- F-02 (pools 503 silently rendered as "No pools matched"):
DexesView now branches on q.isError and surfaces an
explicit error card with retry + link to status. Empty-state
path is gated behind !q.isError. Backend perf
(the underlying 7s cold-cache p99) tracked alongside the
api_cache_miss_rate_high workstream.
- F-03 (CORS credentials mismatch): explorer's useMe()
no longer sends credentials: include against an API that
explicitly refuses credentialed CORS. Cost: signed-in
users see signed-out CTAs in the explorer navbar
(dashboard.ratesengine.net is unaffected, being same-origin).
Inline comment documents the cross-origin cookie work
needed to re-enable session detection
(Domain=.ratesengine.net + ACA-Credentials + SameSite=None).
- F-04 (deep_link API path leaked to next/link):
NetworksPanel no longer feeds API deep_link values
(e.g. /v1/assets/USDC-GA5Z…) into <Link>. Stellar
rows now build the explorer route explicitly
(/assets/{slug}/stellar); the API deep_link stays in
the JSON for programmatic consumers.
-
Incident triage sweep — 9 active alerts → root-cause +
preventatives. Worked through every alert firing on r1 today
and either resolved the root cause, codified prevention in
ansible, or filed it as a known-real signal needing follow-up:
- node_root_disk_warning: disk 81% → 62% by truncating a
7.3GB syslog. Root cause: Loki running at log_level=debug
spamming ~4M caller=mock.go msg=Get key=collectors/...
lines/day into syslog. Fix: set Loki to warn
(configs/ansible/roles/loki/templates/loki-config.yaml.j2)
and add a defense-in-depth rsyslog filter so even an
accidental level regression can't reach /var/log/syslog
(configs/ansible/roles/archival-node/tasks/15-log-discipline.yml).
Also pruned 36 old binary backups + 9 stale toml backups +
vacuumed journal to 7 days.
- verify_archive_unit_failed: the root cause was that the 8h max-runtime
cap was tight for ~62.5M-ledger pubnet. Fresh run completed
34.7M ledgers in 8h (1207 l/s aggregate at 8 workers) then
exited 1/FAILURE on context deadline — the same as the
previously-rotated journal would have shown. Bumped
defaults to 12 workers + 16h cap (sits inside the 24h
timer cadence with headroom). Updated both the in-repo
unit (deploy/systemd/verify-archive-tier-a.service) and
the live r1 drop-in. Started a fresh run on the new
settings; the alert clears when it finishes.
- sla_probe_unit_failed_alert: REAL. /v1/markets and
/v1/assets cold-cache p99 spikes (~5s, ~2.4s) breach
the 500ms target on the probe's first sample after each
30s cache-TTL window. Filed as a perf workstream — needs
/v1/assets + /v1/issuers cache wrappers + prewarm.
- api_cache_miss_rate_high: REAL. Prewarm covers
markets/all_pools for limits {5,25,100,200} but
markets/asset_markets and markets/source_markets
ops aren't prewarmed at all; user-facing requests with
novel param tuples miss cache. Same perf workstream.
- anomaly_freeze_sustained / anomaly_freeze_engaged:
REAL but invisible. 1892 freeze decisions emitted, zero
Redis markers, zero freeze_events rows. Phase 2's
baseline z-score is unstable because we only have 7 days
of prices_1h data (root cause = the SDEX backfill bug
from the previous session). Added an INFO log in
markPhase2Freeze so operators can run
journalctl -u ratesengine-aggregator | grep "phase2 freeze"
to see which pairs are firing. Updated the alert
annotation (both repo + R1 overlay) to call out the
cold-baseline pattern + triage steps.
- aggregator_supply_refresh_never_initialized: gated by
[supply].aggregator_refresh_enabled = false (default).
Enabling it requires the on-chain supply observers to be
backfilled across the watched accounts; same workstream as
the SDEX backfill. Not a quick fix; documented for follow-up.
- supply_snapshot_never_initialized: RESOLVED. The
supply-snapshot.service was running daily and exiting 0,
but /etc/default/supply-snapshot didn't set TEXTFILE_OUTPUT,
so the binary skipped the metric write. Wired the textfile
path; metric now emits. Codified in
configs/ansible/roles/archival-node/tasks/10-observability.yml
so a rebuilt host gets the wiring automatically.
- slo_latency_burn_slow: same family as the SLA-probe
perf finding; will track with that workstream.
-
Backfill status surfaces "stalled" vs "running" separately.
BackfillDecoderState (the per-decoder row on
/v1/diagnostics/ingestion) decomposes the previously-opaque
ranges_active count into ranges_complete (done),
ranges_running (incomplete + updated within 10 min), and
ranges_stalled (incomplete + idle > 10 min; needs
ratesengine-ops backfill -resume). Status page renders three
separate columns with green/blue/red coloring. The old
ranges_active field stays on the wire for back-compat.
-
Backfill auto-refreshes the long-lived CAGGs (prices_1h /
prices_4h / prices_1d / prices_1w / prices_1mo) at the
end of every chunk. Without this, historical inserts get
dropped by the 90-day raw-trades retention policy before the
CAGG policy refresher's natural cadence picks them up — which
is what happened to the May 6-11 2026 SDEX backfill (cursors
hit last_ledger == range_end for every range, ~80M trades
inserted, retention dropped them within 24h, no CAGG
materialisation, ~5d of wall-clock work lost; trades
MIN(ledger) for sdex collapsed back to 61,191,617).
Backfill tool changes:
- New -refresh-caggs flag (default true). After each
chunk's trade-insert loop, derives the actual ts range from
the inserted rows (Store.LedgerRangeToTimeRange) and
force-refreshes every long-lived CAGG over that window
(Store.RefreshContinuousAggregate).
- Per-view soft-fail so one wedged CAGG doesn't block the
others.
- Procedure doc rewritten: manual
CALL refresh_continuous_aggregate step removed (now automatic).
Diagnostics endpoint additions:
cagg_coverage field reports prices_1h MIN/MAX bucket +
row count, the real source-of-truth answer to "do we have
historical OHLC since genesis?" (raw trades only spans
the last 90 days; hourly+ CAGGs are retained forever).
Added
-
Backfill coverage on /v1/diagnostics/ingestion + status page.
New backfill_coverage[] array on the diagnostics endpoint
reports per-source MIN/MAX ledger from the trades hypertable,
joined with an operator-curated map of source genesis ledgers
(1 for SDEX, contract deploy ledger for each Soroban DEX), with
a derived coverage_pct so the answer to "do we have data from
ledger 1 to tip?" is one column. CEX/FX sources surface as
applies=false (their trades have no Stellar-ledger context).
Backed by a process-local cache refreshed every 5 min in a
background goroutine — the underlying SQL is 2-3s on a populated
trades hypertable, too slow for the request path.
Status page renders a new "Coverage — ledger genesis → tip"
table with per-source progress bars (green ≥99%, amber ≥50%,
red <50%). Today's r1 reading: SDEX 2.18% covered (61.2M → 62.5M
out of 1 → 62.5M), Soroban DEXes 15-17%, off-chain sources N/A.
-
Status page — per-region "Ingestion" section. Polls each
region's /v1/diagnostics/ingestion every 30s and renders a
panel with: binary version + commit, live ledger card (latest,
lag, 24h volume, indexed markets/assets), FX backfill coverage
(date range, currencies, total quotes), CoinGecko market-cap
cache state (entries, newest/oldest fetch age), supply observer
counts, per-decoder backfill table (ranges total/active, oldest
lag), and per-source health table joined with trailing-24h
trades/volume/markets. Region list is a single REGIONS const;
r2/r3 join by appending a row, no other code changes needed.
-
GET /v1/diagnostics/ingestion: single-fetch ingestion
snapshot for the region. Composes: region label, binary version,
live ledger tip + lag, per-decoder backfill state (ranges
total/active, oldest lag), Frankfurter / fx_quotes coverage
(earliest/latest dates, total quotes, distinct currencies),
market-cap cache state, supply observer coverage (classic vs
SEP-41 counts, last snapshot age), and the full source registry
joined with trailing-24h trades/volume/markets.
Designed as the only call the status page makes for its
per-region ingestion panel — operators no longer have to scrape
/v1/network/stats + /v1/sources + /v1/diagnostics/cursors +
/v1/version and reconcile by hand. New storage helpers:
FXCoverageStats, SupplyCoverageStats (one query each, ~1ms
on populated tables). Cache: public, max-age=15.
Fixed
- /assets/{slug} for catalogue slugs (usdc, chinese-yuan,
btc, …) now renders the real cross-chain view instead of the
"Asset not found" fallback. The page's fetchGlobalAsset
was firing a per-slug /v1/assets/{slug} request at build
time, just like [network] was before its consolidation:
with ~1000 prerendered routes that storm tripped r1's anon
rate limit and every catalogue page baked in the not-found
fallback. Extracted the catalogue source to
web/explorer/src/app/assets/catalogue.ts (shared module,
single /v1/assets/verified call, memoised promise, 429-aware
retry). Both [slug] and [slug]/[network] now read from the
same map.
- /assets/{slug} and /assets/{slug}/{network} now resolve in
both case variants for catalogue entries. Previously only the
uppercase form (/assets/USDC/) was prerendered because dedup in
generateStaticParams picked first-seen casing, so user-typed
lowercase URLs (/assets/usdc/) and any links pointing at the
catalogue's canonical lowercase slugs returned 404. Now both
cases get a route per catalogue entry; non-catalogue Stellar
assets keep their listing casing as before.
- /assets/{slug} for verified-catalogue currencies now renders
the cross-chain identity view, not the Stellar-issuer view.
The dispatcher used to fall through to AssetDetail (with the
IssuerPanel) whenever /v1/coins returned a row, even when
the slug also matched a catalogue entry. Result: /assets/USDC/
was showing Circle's Stellar issuer detail instead of the
cross-chain page. The [network] route (/assets/USDC/Stellar/)
is now the only place per-issuer detail lives. Title +
description for catalogue slugs now use cross-chain framing
(USDC — Stablecoin) instead of Stellar-only framing.
Changed
- Ansible template now bakes in anon_rate_limit_per_min = 600
/ key_rate_limit_per_min = 6000. Codifies the live r1 bump
applied 2026-05-13. The prior defaults (60 / 1000 per min) were
too tight for any consumer doing a static build or dashboard
refresh from a single IP — the explorer Cloudflare Pages build
was the canary.
Fixed
- Explorer build no longer 429s on /assets/[slug]/[network]
prerender. Next.js opts out of its built-in fetch dedup when
signal is set, so each prerendered slug+network page was
separately re-fetching /v1/assets/{slug} and the build was
firing hundreds of requests in parallel — far above r1's
anonymous-tier rate limit (60 req/min). Result: every
[slug]/[network] route prerendered as a "Not found" page on
prod. Fix consolidates the catalogue source: a single
/v1/assets/verified call (with 429-aware retry) populates a
module-level Map from which both generateStaticParams and
per-page fetchGlobalAsset read. Concurrent r1 config bump
(anon_rate_limit_per_min = 600, key_rate_limit_per_min = 6000)
gives real consumers headroom too; the prior 60/min
was unworkable for any client doing a static build or
dashboard refresh.