
Rates Engine v0.5.0-rc.51

Pre-release


@github-actions github-actions released this 14 May 16:03
· 735 commits to main since this release

[v0.5.0-rc.51] — 2026-05-14

Changed

  • verify-archive Tier A is now incremental. Pre-fix the nightly
    systemd unit re-walked the entire chain from genesis every night,
    taking ~13.8h of wall time and ~7h of CPU time per pass (67% of
    every day; visible as a sustained load-average drag on r1). Past
    LCM files are immutable, so re-hashing them is wasted compute.

    New scheme: ratesengine-ops verify-archive accepts
    -state-file PATH -from-last-verified [-safety-overlap N].
    Reads the prior run's high-water mark from a small JSON file,
    computes -from = max(2, last_verified - safety_overlap), and
    verifies only the new tail. The resume-from-hash from prior
    state is plumbed through so cross-run chain continuity is
    preserved (the next incremental run's first chunk must chain to
    the previous run's last verified hash).
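The resume arithmetic above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the `verifyState` struct, its JSON field names, and the `resumeFrom` helper are assumptions, since the notes only say the state file is "a small JSON file" holding the high-water mark and resume hash.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// verifyState is a hypothetical shape for the state file; the release
// notes do not give the real field names.
type verifyState struct {
	LastVerified uint32 `json:"last_verified"`
	LastHash     string `json:"last_hash"`
}

// resumeFrom returns max(2, lastVerified - safetyOverlap) without
// letting the unsigned subtraction wrap around.
func resumeFrom(lastVerified, safetyOverlap uint32) uint32 {
	const firstLedger = 2
	if lastVerified < safetyOverlap || lastVerified-safetyOverlap < firstLedger {
		return firstLedger
	}
	return lastVerified - safetyOverlap
}

func main() {
	// Illustrative state contents, not real output from r1.
	raw := []byte(`{"last_verified": 62500000, "last_hash": "ab12"}`)
	var st verifyState
	if err := json.Unmarshal(raw, &st); err != nil {
		fmt.Println("no usable state, falling back to a full pass:", err)
		return
	}
	fmt.Println("verify from ledger", resumeFrom(st.LastVerified, 5000))
}
```

The explicit underflow guard matters because ledger sequence numbers are unsigned: on a young state file, `last_verified - safety_overlap` would otherwise wrap to a huge value instead of clamping to 2.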

    Default safety overlap: 5000 ledgers (~17h of chain), enough to
    catch anomalies that slipped in just before the last run's tip.
    systemd unit defaults updated:
    VERIFY_ARCHIVE_STATE_FILE=/var/lib/ratesengine/verify-archive-state.json,
    VERIFY_ARCHIVE_MAX_RUNTIME=4h (down from 16h). Typical
    incremental pass covers ~24h of new ledgers in minutes.

    A weekly full-archive re-pass (defense-in-depth against silent
    corruption in older chunks) remains a TBD sibling unit.

  • backfill_coverage[].density_pct replaces coverage_pct on
    /v1/diagnostics/ingestion.
    Pre-fix the metric was (latest - earliest) / (tip - genesis),
    which measures endpoint span, not data density. A source with
    one trade at genesis and one trade at tip scored 100% even with
    the whole interior empty (caught live 2026-05-14 when SDEX
    backfill was still running but coverage showed 99.8% and
    aquarius/comet/phoenix/soroswap all showed 99.99%).

    New metric: union of completed portions of all backfill cursor
    intervals that include this source in their decoder set, clamped
    to [genesis, tip], divided by tip - genesis + 1. Hits 100%
    only when backfill ranges actually cover the whole interval.
    Sparse sources no longer score 100% just for having endpoint
    trades — they score by what fraction of ledgers their backfill
    has processed, which is the question operators actually want
    answered.
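As a rough illustration of the new metric, the sketch below merges completed backfill intervals, clamps them to [genesis, tip], and divides by tip - genesis + 1. The `span` type and `densityPct` function are hypothetical names for this example; how the real code selects cursor intervals by decoder set is not shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// span is an inclusive ledger range [From, To] that a backfill cursor
// has completed (illustrative type, not the project's).
type span struct{ From, To uint32 }

// densityPct unions the spans, clamps them to [genesis, tip], and
// divides the covered ledger count by tip-genesis+1.
func densityPct(spans []span, genesis, tip uint32) (covered, expected uint64, pct float64) {
	if tip < genesis {
		return 0, 0, 0
	}
	expected = uint64(tip) - uint64(genesis) + 1

	clamped := make([]span, 0, len(spans))
	for _, s := range spans {
		if s.To < s.From || s.To < genesis || s.From > tip {
			continue // empty or entirely outside the window
		}
		if s.From < genesis {
			s.From = genesis
		}
		if s.To > tip {
			s.To = tip
		}
		clamped = append(clamped, s)
	}
	sort.Slice(clamped, func(i, j int) bool { return clamped[i].From < clamped[j].From })

	var cur span
	open := false
	for _, s := range clamped {
		if !open {
			cur, open = s, true
			continue
		}
		if uint64(s.From) <= uint64(cur.To)+1 { // overlapping or adjacent
			if s.To > cur.To {
				cur.To = s.To
			}
			continue
		}
		covered += uint64(cur.To) - uint64(cur.From) + 1
		cur = s
	}
	if open {
		covered += uint64(cur.To) - uint64(cur.From) + 1
	}
	pct = 100 * float64(covered) / float64(expected)
	return covered, expected, pct
}

func main() {
	// One ledger at genesis and one at tip: the old endpoint-span
	// metric would report 100%, the density metric does not.
	c, e, p := densityPct([]span{{1, 1}, {100, 100}}, 1, 100)
	fmt.Printf("covered=%d expected=%d density=%.2f%%\n", c, e, p)
}
```

Taking the union rather than summing interval lengths keeps overlapping or re-run backfill ranges from pushing the figure past 100%.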

    Wire: coverage_pct retained as a transitional field for one
    release. New fields: density_pct, covered_ledgers,
    expected_ledgers. Status page updated to render the new
    density (tooltip exposes the absolute "covered / expected"
    numerator + denominator).

Added

  • Chainlink ingest source (internal/sources/external/chainlink/).
    Promotes the formerly-divergence-only Chainlink reference into a
    full ingest source — writes canonical.OracleUpdate rows to
    oracle_updates on its own poller goroutine alongside Reflector /
    Redstone / Band. Implements external.Poller; lives parallel to
    the existing internal/divergence/chainlink.go cross-check (which
    stays in place for synchronous divergence_warning checks).

    Wire shape: poll AggregatorV3.latestRoundData() over JSON-RPC,
    dedupe by (feed_address, roundId), project to canonical with
    synthetic deterministic tx_hash (sha256(feed || roundId)) for
    idempotent restart. Default 30s cadence; per-feed Decimals/Invert
    overrides via TOML. Default endpoint is Cloudflare public; operator
    drops an Alchemy URL (with embedded API key) into r1's TOML or via
    CHAINLINK_RPC_URL env. Bounded concurrency (8) per tick.
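The synthetic hash can be sketched as below. The notes only state sha256(feed || roundId); the byte layout here (20-byte feed address followed by the 10-byte big-endian uint80 round ID, passed as a 16-bit high part and a 64-bit low part) is an assumption made for illustration, as is the `syntheticTxHash` name.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"strings"
)

// syntheticTxHash derives a deterministic hex digest from a feed
// address and a uint80 round ID. The round ID is supplied as its top
// 16 bits (phaseID) and low 64 bits (aggRoundID); this encoding is an
// assumption, not the project's confirmed layout.
func syntheticTxHash(feedAddr string, phaseID uint16, aggRoundID uint64) (string, error) {
	feed, err := hex.DecodeString(strings.TrimPrefix(strings.ToLower(feedAddr), "0x"))
	if err != nil {
		return "", fmt.Errorf("decode feed address: %w", err)
	}
	if len(feed) != 20 {
		return "", fmt.Errorf("feed address must be 20 bytes, got %d", len(feed))
	}
	buf := make([]byte, 30)
	copy(buf, feed)
	binary.BigEndian.PutUint16(buf[20:22], phaseID)
	binary.BigEndian.PutUint64(buf[22:30], aggRoundID)
	sum := sha256.Sum256(buf)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	// Placeholder address, not a real feed.
	h, err := syntheticTxHash("0x00000000000000000000000000000000000000aa", 1, 42)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("synthetic tx_hash:", h)
}
```

Because the digest depends only on (feed, roundId), a poller that restarts and re-reads the same round produces the same key, so the insert can be treated as a no-op on conflict.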

    Backfill: new ratesengine-ops backfill-chainlink subcommand walks
    AnswerUpdated event logs via chunked eth_getLogs (5k blocks /
    call, the safe default for Alchemy / Infura / QuickNode response-
    size caps). ~33k RPC calls and ~7h wall time for all-time backfill
    of the default 6 majors on Alchemy free tier (~19% of monthly
    quota); scale linearly with feed count up to all 516 ETH-mainnet
    Chainlink feeds within the same free-tier envelope. Idempotent on
    the oracle_updates PK; safe to re-run over already-covered ranges.
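The chunking itself is simple to sketch. The helper below splits an inclusive block range into 5k-block windows suitable for the fromBlock / toBlock parameters of eth_getLogs; `blockRange` and `chunkBlocks` are illustrative names, and the block numbers in `main` are made up.

```go
package main

import "fmt"

// blockRange is an inclusive [From, To] window of block numbers.
type blockRange struct{ From, To uint64 }

// chunkBlocks splits [from, to] into consecutive windows of at most
// size blocks each. It returns nil for an empty range or zero size.
func chunkBlocks(from, to, size uint64) []blockRange {
	if size == 0 || to < from {
		return nil
	}
	var out []blockRange
	start := from
	for {
		end := start + size - 1
		if end > to || end < start { // second clause guards uint64 overflow
			end = to
		}
		out = append(out, blockRange{From: start, To: end})
		if end == to {
			return out
		}
		start = end + 1
	}
}

func main() {
	for _, r := range chunkBlocks(10_000_000, 10_012_499, 5000) {
		// JSON-RPC block parameters are hex-encoded quantities.
		fmt.Printf("eth_getLogs fromBlock=0x%x toBlock=0x%x\n", r.From, r.To)
	}
}
```

Since the windows are inclusive and non-overlapping, re-running a window inserts the same rows again, which the idempotent primary key absorbs.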

    Surface: registered in external.Registry as
    ClassOracle / BackfillSafe=true / IncludeInVWAP=false. Picked up
    by /v1/sources?class=oracle automatically — explorer's /oracles
    page surfaces it without UI changes.

  • Oracle CAGG ladder (migration 0034). Seven continuous
    aggregates on oracle_updates at the standard
    1m/15m/1h/4h/1d/1w/1mo tiers — sister to the trade CAGG ladder
    in migration 0002. Closes the gap where every /v1/oracle/*
    history query was scanning raw oracle_updates; manageable at
    ~3 oracle sources × ~860 rows/day each, untenable once Chainlink
    arrives at scale.

    Aggregation semantics differ from trades: oracles are
    point-in-time observations, so each bucket carries first / last /
    min / max / last_decimals / count (no VWAP / TWAP because there
    is no volume dimension). One row per (source, asset, quote,
    bucket); per-source identity is preserved so cross-oracle
    comparison stays meaningful. Refresh policies match the trade
    ladder; no retention on sub-1h tiers (matches the operator's
    "store everything forever" decision in migration 0031),
    indefinite for 1h+ per the proposal.
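To make the bucket semantics concrete, here is an in-memory sketch of the same first / last / min / max / last_decimals / count roll-up for a single (source, asset, quote) series. It is not the migration's SQL; the types are hypothetical, timestamps are Unix seconds, and raw values are simplified to int64 (the real column may be wider).

```go
package main

import "fmt"

// observation is one oracle update for a single (source, asset, quote).
type observation struct {
	Unix     int64 // observation time, seconds since epoch
	Value    int64 // raw integer price (simplified to int64 here)
	Decimals int
}

// bucket mirrors the per-bucket columns named in the release notes.
type bucket struct {
	Start                 int64
	First, Last, Min, Max int64
	LastDecimals          int
	Count                 int
}

// aggregate assumes obs is sorted by Unix ascending, timestamps are
// non-negative, and width > 0. Min/Max compare raw values, which is
// only meaningful while Decimals stays constant within a bucket.
func aggregate(obs []observation, width int64) []bucket {
	var out []bucket
	for _, o := range obs {
		start := o.Unix - o.Unix%width
		if n := len(out); n == 0 || out[n-1].Start != start {
			out = append(out, bucket{
				Start: start, First: o.Value, Last: o.Value,
				Min: o.Value, Max: o.Value,
				LastDecimals: o.Decimals, Count: 1,
			})
			continue
		}
		b := &out[len(out)-1]
		b.Last, b.LastDecimals = o.Value, o.Decimals
		b.Count++
		if o.Value < b.Min {
			b.Min = o.Value
		}
		if o.Value > b.Max {
			b.Max = o.Value
		}
	}
	return out
}

func main() {
	obs := []observation{{0, 100, 8}, {30, 90, 8}, {59, 110, 8}, {60, 105, 8}}
	for _, b := range aggregate(obs, 60) {
		fmt.Printf("%+v\n", b)
	}
}
```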

  • DeFindex vault decoder (internal/sources/defindex/).
    Event-based decoder (dispatcher.Decoder, NOT
    ContractCallDecoder) for paltalabs/defindex's autocompound
    vaults. Phase A matches ("DeFindexVault","deposit") and
    ("DeFindexVault","withdraw") events on the 3 known vaults
    (USDC / EURC / XLM autocompound). Decoder pulls
    depositor / withdrawer, multi-asset amounts vec
    (i128, no truncation per ADR-0003), and the share-token
    delta (df_tokens_minted / _burned) by name from the
    body Map (decode-by-name per
    contract-schema-evolution.md). Phase B will tag matching
    same-tx Blend / Soroswap legs as
    routed_via=defindex-{vault} and write
    aggregator_exposures rows from a separate periodic
    ticker. Pre-seed migration 0033_seed_defindex_vaults.up.sql
    populates the 3 vaults in the routers registry as
    kind='aggregator-vault'. WASM-history audit started at
    docs/operations/wasm-audits/defindex.md;
    BackfillSafe=false until the per-hash review lands.

  • Soroswap Router decoder (internal/sources/soroswap_router/).
    New ContractCallDecoder following the Band oracle pattern —
    matches by (contract_id, function_name) and decodes
    swap_exact_tokens_for_tokens / swap_tokens_for_exact_tokens
    invocations on the canonical pubnet router
    (CAG5LRYQ5JVEUI5TEID72EYOVX44TTUJT5BQR2J6J77FH65PCCFAJDDH).
    Phase A is log-only — every routed swap surfaces an INFO line
    with path, in/out amounts (i128, no truncation per ADR-0003),
    recipient, and deadline. Phase B will tag matching same-tx
    trades.routed_via rows via the existing migration-0025 column.
    New ClassRouter taxonomy in internal/sources/external/
    (alongside the existing ClassLending); router class is
    attribution-only, never contributes to VWAP. Pre-seed migration
    0032_seed_soroswap_router.up.sql populates the routers
    registry. WASM-history audit started at
    docs/operations/wasm-audits/soroswap-router.md;
    BackfillSafe=false until the per-hash review lands.

Changed

  • Raw trades retention removed (migration 0031). Pre-fix the
    trades hypertable aged out at 90 days; we relied on the
    hourly+ CAGGs to preserve historical OHLC. Operator wants raw
    per-trade fidelity preserved indefinitely (regulatory + proof-
    of-pricing queries can't be reconstructed from CAGGs).
    Justification: r1's postgres data dir is on a 1.5 TB ZFS volume
    with 4% used. Earlier "no room" analysis was wrong — was
    measuring the OS root disk (49 GB), not the postgres data
    volume. Status page coverage panel relabeled from "Raw-trades
    coverage (last 90 days)" → "Raw-trades coverage — genesis →
    tip"; coverage_pct grows monotonically as backfills land.
    Compression policy on chunks > 7d is unchanged (~5x reduction).

Added

  • CEX pair coverage — cross-fiat majors. All four CEX
    connectors (binance/bitstamp/coinbase/kraken) now stream BTC
    and ETH against EUR + GBP in addition to USD. Pre-fix, only
    Bitstamp published BTC/EUR — every aggregator tick on
    crypto:BTC/fiat:EUR was single-source, which falsely tripped
    Phase 2 freeze permanently. Bitstamp + Coinbase + Kraken +
    Binance all support these pairs natively; we just hadn't
    enumerated them in the connector defaults.

    Stop-gap pre-Tier-3. The next change in this area will replace
    the hand-curated DefaultPairs() maps with auto-discovery from
    each exchange's pair-catalogue endpoint
    (/api/v3/exchangeInfo / /products / /0/public/AssetPairs /
    /api/v2/trading-pairs-info), filtered by an allow-list of
    quote assets. That move expands coverage from ~50 hand-curated
    pairs/exchange to ~200-1500 active pairs/exchange. Storage
    scales with PAIR COUNT (CAGG rows, ~50 MB/year for 1500 pairs)
    not raw trade volume (90-day retention), so the cost is
    bounded.

Fixed

  • Backfill auto-refresh: three bugs caught on first real run.
    Yesterday's commit added refresh_continuous_aggregate calls
    after each backfill chunk but every CAGG refresh failed. Three
    fixes from the live test:

    1. 42P18: could not determine data type of parameter $1
      — lib/pq's CALL syntax doesn't propagate the procedure
      signature's parameter types, so untyped placeholders fail.
      Fix: explicit ::timestamptz casts in the SQL.
    2. 22023: refresh window too small for prices_4h /
      _1d / _1w / _1mo — Timescale rejects refresh windows
      narrower than 2× bucket width. A 10k-ledger chunk's ts
      span (~4h) was fine for prices_1h but failed every
      coarser CAGG. Fix: per-CAGG MinWindow declared in
      CAGGsLiveForever; new PadRefreshWindow helper expands
      the chunk's window to that minimum centered on the
      chunk's midpoint. Padded area materialises as empty
      buckets (cheap).
    3. 55P03: concurrent refresh — with -parallel N,
      multiple chunks race on the same coarse CAGG (prices_1mo
      was the worst — chunks finishing close together all want
      to refresh the same monthly bucket). Fix: retry-on-55P03
      with exponential backoff (200ms → 1.6s × 5 attempts).
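The padding in fix 2 can be sketched as below. This is an illustration of the described behaviour (expand to a minimum width, centered on the chunk's midpoint), not the project's PadRefreshWindow itself; the lowercase `padRefreshWindow`, the `utc` helper, and the choice of 2× bucket width as the minimum are taken from the description above or made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

const hour = time.Hour

// utc is a small helper for building example timestamps.
func utc(sec int64) time.Time { return time.Unix(sec, 0).UTC() }

// padRefreshWindow returns [start, end] unchanged when it is already
// at least minWindow wide; otherwise it returns a window of exactly
// minWindow centered on the original midpoint.
func padRefreshWindow(start, end time.Time, minWindow time.Duration) (time.Time, time.Time) {
	span := end.Sub(start)
	if span >= minWindow {
		return start, end
	}
	mid := start.Add(span / 2)
	half := minWindow / 2
	return mid.Add(-half), mid.Add(minWindow - half)
}

func main() {
	// A chunk whose inserted rows span 4h, refreshed into three tiers.
	start, end := utc(1_700_000_000), utc(1_700_000_000+4*3600)
	for _, bucketWidth := range []time.Duration{hour, 4 * hour, 24 * hour} {
		s, e := padRefreshWindow(start, end, 2*bucketWidth)
		fmt.Printf("bucket=%v window=%v (%s .. %s)\n",
			bucketWidth, e.Sub(s), s.Format(time.RFC3339), e.Format(time.RFC3339))
	}
}
```

Centering on the midpoint keeps the padding symmetric, so the extra buckets touched on either side of the chunk are the ones most likely to be adjacent to real data.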

    End-to-end verified live: 10k-ledger SDEX backfill at
    ledgers 50,000,000-50,010,000 inserted 718,873 trades AND
    populated 66,513 prices_1h buckets + 22,005 prices_1d
    buckets — those CAGGs will now persist past the 90-day raw
    retention. Yesterday's claim "auto-refresh now works" was
    premature; this commit is what makes it true.
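For the retry in fix 3, a minimal sketch follows. It uses a stand-in sentinel error so it runs without a database; with lib/pq the check would more plausibly use errors.As against *pq.Error and compare its Code to "55P03". The function names are hypothetical, and a production version would likely take a context so the wait can be cancelled.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const ms = time.Millisecond

// errLockBusy stands in for a Postgres 55P03 (lock_not_available) error.
var errLockBusy = errors.New("55P03: lock not available (stand-in)")

func isLockBusy(err error) bool { return errors.Is(err, errLockBusy) }

// retryOnBusy calls op up to attempts times, sleeping base, 2*base,
// 4*base, ... between tries while retryable(err) holds. Any other
// error, or success, returns immediately.
func retryOnBusy(attempts int, base time.Duration, retryable func(error) bool, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	delay := base
	var err error
	for i := 1; i <= attempts; i++ {
		if err = op(); err == nil || !retryable(err) {
			return err
		}
		if i < attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// 5 attempts from 200ms gives waits of 200ms, 400ms, 800ms, 1.6s.
	err := retryOnBusy(5, 200*ms, isLockBusy, func() error {
		calls++
		if calls < 3 {
			return errLockBusy // simulate two concurrent-refresh collisions
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```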

  • Live-site QA pass — F-01/F-03/F-04 resolved, F-02 partial.
    Working through docs/review-2026-05-13-live-site-qa.md:

    • F-01 (degraded state invisible in explorer): new
      DegradedBanner component polls /v1/status every 60s and
      renders a fixed band between Navbar and content when
      overall ≠ "ok". Tone (amber/red) keys off pageCount > 0.
      Includes top alert name + link to status page. Quiet when
      everything's fine; noisy enough to set expectations when
      it isn't.
    • F-02 (pools 503 silently rendered as "No pools matched"):
      DexesView now branches on q.isError and surfaces an
      explicit error card with retry + link to status. Empty-
      state path is gated behind !q.isError. Backend perf
      (the underlying 7s cold-cache p99) tracked alongside the
      api_cache_miss_rate_high workstream.
    • F-03 (CORS credentials mismatch): explorer's useMe()
      no longer sends credentials: include against an API that
      explicitly refuses credentialed CORS. Cost: signed-in
      users see signed-out CTAs in the explorer navbar
      (dashboard.ratesengine.net is unaffected — same-origin).
      Inline comment documents the cross-origin cookie work
      needed to re-enable session detection (Domain=
      .ratesengine.net + ACA-Credentials + SameSite=None).
    • F-04 (deep_link API path leaked to next/link):
      NetworksPanel no longer feeds API deep_link values
      (e.g. /v1/assets/USDC-GA5Z…) into <Link>. Stellar
      rows now build the explorer route explicitly
      (/assets/{slug}/stellar); the API deep_link stays in
      the JSON for programmatic consumers.
  • Incident triage sweep — 9 active alerts → root-cause +
    preventatives.
    Worked through every alert firing on r1 today
    and either resolved the root cause, codified prevention in
    ansible, or filed it as a known-real signal needing follow-up:

    • node_root_disk_warning — disk 81% → 62% by truncating a
      7.3GB syslog. Root cause: Loki running at log_level=debug
      spamming ~4M caller=mock.go msg=Get key=collectors/...
      lines/day into syslog. Fix: set Loki to warn
      (configs/ansible/roles/loki/templates/loki-config.yaml.j2)
      and add a defense-in-depth rsyslog filter so even an
      accidental level regression can't reach /var/log/syslog
      (configs/ansible/roles/archival-node/tasks/15-log-discipline.yml).
      Also pruned 36 old binary backups + 9 stale toml backups +
      vacuumed journal to 7 days.

    • verify_archive_unit_failed — root cause: 8h max-runtime
      cap was tight for ~62.5M-ledger pubnet. Fresh run completed
      34.7M ledgers in 8h (1207 l/s aggregate at 8 workers) then
      exited 1/FAILURE on context deadline, the same failure the
      previously-rotated journal would have shown. Bumped
      defaults to 12 workers + 16h cap (sits inside the 24h
      timer cadence with headroom). Updated both the in-repo
      unit (deploy/systemd/verify-archive-tier-a.service) and
      the live r1 drop-in. Started a fresh run on the new
      settings; the alert clears when it finishes.

    • sla_probe_unit_failed_alert — REAL: /v1/markets,
      /v1/assets cold-cache p99 spikes (~5s, ~2.4s) breach
      the 500ms target on the probe's first sample after each
      30s cache-TTL window. Filed as a perf workstream — needs
      /v1/assets + /v1/issuers cache wrappers + prewarm.

    • api_cache_miss_rate_high — REAL: prewarm covers
      markets/all_pools for limits {5,25,100,200} but
      markets/asset_markets and markets/source_markets
      ops aren't prewarmed at all; user-facing requests with
      novel param tuples miss cache. Same perf workstream.

    • anomaly_freeze_sustained / anomaly_freeze_engaged: REAL but
      invisible. 1892 freeze decisions emitted, zero
      Redis markers, zero freeze_events rows. Phase 2's
      baseline z-score is unstable because we only have 7 days
      of prices_1h data (root cause = the SDEX backfill bug
      from the previous session). Added an INFO log in
      markPhase2Freeze so operators can grep
      journalctl -u ratesengine-aggregator | grep "phase2 freeze"
      to see which pairs are firing. Updated the alert
      annotation (both repo + R1 overlay) to call out the
      cold-baseline pattern + triage steps.

    • aggregator_supply_refresh_never_initialized — gated by
      [supply].aggregator_refresh_enabled = false (default).
      Enabling it requires the on-chain supply observers to be
      backfilled across the watched accounts; same workstream as
      the SDEX backfill. Not a quick fix; documented for follow-up.

    • supply_snapshot_never_initialized — RESOLVED: the
      supply-snapshot.service was running daily and exiting 0,
      but /etc/default/supply-snapshot didn't set TEXTFILE_OUTPUT,
      so the binary skipped the metric write. Wired the textfile
      path; metric now emits. Codified in
      configs/ansible/roles/archival-node/tasks/10-observability.yml
      so a rebuilt host gets the wiring automatically.

    • slo_latency_burn_slow — same family as the SLA-probe
      perf finding; will track with that workstream.

  • Backfill status surfaces "stalled" vs "running" separately.
    BackfillDecoderState (the per-decoder row on
    /v1/diagnostics/ingestion) decomposes the previously-opaque
    ranges_active count into ranges_complete (done),
    ranges_running (incomplete + updated within 10 min), and
    ranges_stalled (incomplete + idle > 10 min — needs
    ratesengine-ops backfill -resume). Status page renders three
    separate columns with green/blue/red coloring. The old
    ranges_active field stays on the wire for back-compat.
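The three-way split can be sketched as a small classifier. The field names and the completion test (last_ledger reaching range_end) are assumptions based on the wording elsewhere in these notes; only the 10-minute threshold comes directly from the entry above.

```go
package main

import (
	"fmt"
	"time"
)

const minute = time.Minute

// stallAfter is the idle threshold from the release notes.
const stallAfter = 10 * minute

// backfillRange is a hypothetical view of one backfill cursor row.
type backfillRange struct {
	LastLedger, RangeEnd uint32
	Idle                 time.Duration // time since the cursor last advanced
}

type rangeCounts struct{ Complete, Running, Stalled int }

// classify buckets each range as complete, running (incomplete and
// updated within stallAfter), or stalled (incomplete and idle longer).
func classify(ranges []backfillRange) rangeCounts {
	var c rangeCounts
	for _, r := range ranges {
		switch {
		case r.LastLedger >= r.RangeEnd:
			c.Complete++
		case r.Idle > stallAfter:
			c.Stalled++
		default:
			c.Running++
		}
	}
	return c
}

func main() {
	counts := classify([]backfillRange{
		{LastLedger: 100, RangeEnd: 100},
		{LastLedger: 50, RangeEnd: 100, Idle: 2 * minute},
		{LastLedger: 50, RangeEnd: 100, Idle: 45 * minute},
	})
	fmt.Printf("%+v\n", counts)
}
```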

  • Backfill auto-refreshes the long-lived CAGGs (prices_1h /
    prices_4h / prices_1d / prices_1w / prices_1mo) at the
    end of every chunk.
    Without this, historical inserts get
    dropped by the 90-day raw-trades retention policy before the
    CAGG policy refresher's natural cadence picks them up — which
    is what happened to the May 6-11 2026 SDEX backfill (cursors
    hit last_ledger == range_end for every range, ~80M trades
    inserted, retention dropped them within 24h, no CAGG
    materialisation, ~5d of wall-clock work lost; trades
    MIN(ledger) for sdex collapsed back to 61,191,617).

    Backfill tool changes:

    • New -refresh-caggs flag (default true). After each
      chunk's trade-insert loop, derives the actual ts range from
      the inserted rows (Store.LedgerRangeToTimeRange) and
      force-refreshes every long-lived CAGG over that window
      (Store.RefreshContinuousAggregate).
    • Per-view soft-fail so one wedged CAGG doesn't block the
      others.
    • Procedure doc rewritten: the manual
      CALL refresh_continuous_aggregate step is removed (now
      automatic).

    Diagnostics endpoint additions:

    • cagg_coverage field reports prices_1h MIN/MAX bucket +
      row count — the real source-of-truth answer to "do we have
      historical OHLC since genesis?" (raw trades only spans
      the last 90 days; hourly+ CAGGs are retained forever).

Added

  • Backfill coverage on /v1/diagnostics/ingestion + status page.
    New backfill_coverage[] array on the diagnostics endpoint
    reports per-source MIN/MAX ledger from the trades hypertable,
    joined with an operator-curated map of source genesis ledgers
    (1 for SDEX, contract deploy ledger for each Soroban DEX), with
    a derived coverage_pct so the answer to "do we have data from
    ledger 1 to tip?" is one column. CEX/FX sources surface as
    applies=false (their trades have no Stellar-ledger context).
    Backed by a process-local cache refreshed every 5 min in a
    background goroutine — the underlying SQL is 2-3s on a populated
    trades hypertable, too slow for the request path.
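The caching pattern described here, a snapshot refreshed off the request path and read under a lock, might look roughly like the sketch below. All type and method names are hypothetical, the loader is a stub in place of the real SQL, and the ledger numbers in `main` are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// coverageRow is an illustrative per-source result row.
type coverageRow struct {
	Source               string
	MinLedger, MaxLedger uint32
}

// coverageCache holds the most recent snapshot; request handlers only
// ever call get, never the slow loader.
type coverageCache struct {
	mu       sync.RWMutex
	rows     []coverageRow
	loadedAt time.Time
	load     func() ([]coverageRow, error) // the slow query, injected
}

// refresh runs the loader and swaps in the result. On error the
// previous snapshot keeps being served.
func (c *coverageCache) refresh() error {
	rows, err := c.load()
	if err != nil {
		return err
	}
	c.mu.Lock()
	c.rows, c.loadedAt = rows, time.Now()
	c.mu.Unlock()
	return nil
}

// run refreshes on every tick until ctx is cancelled.
func (c *coverageCache) run(ctx context.Context, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = c.refresh() // a real service would log the error
		}
	}
}

// get returns the current snapshot and whether one has been loaded yet.
func (c *coverageCache) get() ([]coverageRow, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.rows, !c.loadedAt.IsZero()
}

func main() {
	c := &coverageCache{load: func() ([]coverageRow, error) {
		return []coverageRow{{Source: "sdex", MinLedger: 61191617, MaxLedger: 62500000}}, nil
	}}
	if err := c.refresh(); err != nil { // warm once before serving
		fmt.Println("initial load failed:", err)
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go c.run(ctx, 5*time.Minute)

	rows, ready := c.get()
	fmt.Println(ready, rows)
}
```

Readers never block on the 2-3s query; the worst case is a snapshot that is one refresh interval stale.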

    Status page renders a new "Coverage — ledger genesis → tip"
    table with per-source progress bars (green ≥99%, amber ≥50%,
    red <50%). Today's r1 reading: SDEX 2.18% covered (61.2M → 62.5M
    out of 1 → 62.5M), Soroban DEXes 15-17%, off-chain sources N/A.

  • Status page — per-region "Ingestion" section. Polls each
    region's /v1/diagnostics/ingestion every 30s and renders a
    panel with: binary version + commit, live ledger card (latest,
    lag, 24h volume, indexed markets/assets), FX backfill coverage
    (date range, currencies, total quotes), CoinGecko market-cap
    cache state (entries, newest/oldest fetch age), supply observer
    counts, per-decoder backfill table (ranges total/active, oldest
    lag), and per-source health table joined with trailing-24h
    trades/volume/markets. Region list is a single REGIONS const —
    r2/r3 join by appending a row, no other code changes needed.

  • GET /v1/diagnostics/ingestion — single-fetch ingestion
    snapshot for the region. Composes: region label, binary version,
    live ledger tip + lag, per-decoder backfill state (ranges
    total/active, oldest lag), Frankfurter / fx_quotes coverage
    (earliest/latest dates, total quotes, distinct currencies),
    market-cap cache state, supply observer coverage (classic vs
    SEP-41 counts, last snapshot age), and the full source registry
    joined with trailing-24h trades/volume/markets.
    Designed as the only call the status page makes for its
    per-region ingestion panel; operators no longer have to scrape
    /v1/network/stats + /v1/sources + /v1/diagnostics/cursors +
    /v1/version and reconcile by hand. New storage helpers:
    FXCoverageStats, SupplyCoverageStats (one query each, ~1ms
    on populated tables). Cache: public, max-age=15.

Fixed

  • /assets/{slug} for catalogue slugs (usdc, chinese-yuan,
    btc, …) now renders the real cross-chain view instead of the
    "Asset not found" fallback.
    The page's fetchGlobalAsset was firing a per-slug
    /v1/assets/{slug} request at build
    time, just like [network] was before its consolidation —
    with ~1000 prerendered routes that storm tripped r1's anon
    rate limit and every catalogue page baked in the not-found
    fallback. Extracted the catalogue source to
    web/explorer/src/app/assets/catalogue.ts (shared module,
    single /v1/assets/verified call, memoised promise, 429-aware
    retry). Both [slug] and [slug]/[network] now read from the
    same map.
  • /assets/{slug} and /assets/{slug}/{network} now resolve in
    both case variants for catalogue entries.
    Previously only the uppercase form (/assets/USDC/) was
    prerendered because dedup in
    generateStaticParams picked first-seen casing, so user-typed
    lowercase URLs (/assets/usdc/) and any links pointing at the
    catalogue's canonical lowercase slugs returned 404. Now both
    cases get a route per catalogue entry; non-catalogue Stellar
    assets keep their listing casing as before.
  • /assets/{slug} for verified-catalogue currencies now renders
    the cross-chain identity view, not the Stellar-issuer view.

    The dispatcher used to fall through to AssetDetail (with the
    IssuerPanel) whenever /v1/coins returned a row, even when
    the slug also matched a catalogue entry. Result: /assets/USDC/
    was showing Circle's Stellar issuer detail instead of the
    cross-chain page. The [network] route (/assets/USDC/Stellar/)
    is now the only place per-issuer detail lives. Title +
    description for catalogue slugs now use cross-chain framing
    (USDC — Stablecoin) instead of Stellar-only framing.

Changed

  • Ansible template now bakes in anon_rate_limit_per_min = 600
    / key_rate_limit_per_min = 6000.
    Codifies the live r1 bump applied 2026-05-13. The prior
    defaults (60 / 1000 per min) were
    too tight for any consumer doing a static build or dashboard
    refresh from a single IP — the explorer Cloudflare Pages build
    was the canary.

Fixed

  • Explorer build no longer 429s on /assets/[slug]/[network]
    prerender.
    Next.js opts out of its built-in fetch dedup when
    signal is set, so each prerendered slug+network page was
    separately re-fetching /v1/assets/{slug} and the build was
    firing hundreds of requests in parallel — far above r1's
    anonymous-tier rate limit (60 req/min). Result: every
    [slug]/[network] route prerendered as a "Not found" page on
    prod. Fix consolidates the catalogue source: a single
    /v1/assets/verified call (with 429-aware retry) populates a
    module-level Map from which both generateStaticParams and
    per-page fetchGlobalAsset read. Concurrent r1 config bump
    (anon_rate_limit_per_min = 600, key_rate_limit_per_min = 6000)
    gives real consumers headroom too; the prior 60/min was
    unworkable for any client doing a static build or dashboard
    refresh.