Rates Engine v0.5.0-rc.49

Pre-release

@github-actions github-actions released this 12 May 15:32
· 925 commits to main since this release

[v0.5.0-rc.49] — 2026-05-12

audit-2026-05-12 remediation pass — 27 audit findings closed in
code/docs/config, 14 more verified already-resolved, plus the
pre-flip blocker F-1201 explorer migration that unblocks the
rc.48 deploy to R1.

Fixed

  • Explorer migrated off /v1/currencies (F-1201 — pre-flip
    blocker).
    rc.48 removed the /v1/coins + /v1/currencies
    HTTP surface. The explorer had eight files still making live
    calls against /v1/currencies — every one would 404 the
    moment rc.48 deploys to R1. Migrated:
    • HomeCurrencies.tsx → /v1/price/batch?asset_ids=fiat:EUR,…&quote=fiat:USD
      (single RT, names hardcoded for the 6-tile home strip).
    • sitemap.ts → /v1/assets/verified filtered to class=fiat.
    • HomeTryAPI.tsx → updated example paths to /v1/assets/verified
      + /v1/assets/euro.
    • embed/currency/[ticker]/page.tsx → /v1/assets/{ticker}
      (GlobalAssetView). Sparkline + 24h/7d change degrade
      gracefully to a price-only widget; chart hookup is a follow-up.
    • AssetConverter.tsx → /v1/price/batch for the FX rate table
      (inverts so the converter's rate_usd = 1 USD = N target
      contract stays unchanged).
    • convert/[from]/[to]/ConvertPair.tsx → /v1/price/batch
      for the live from→to rate (one pair vs the old cross_rates
      bulk).
    • convert/[from]/[to]/page.tsx → /v1/assets/{from} for
      identity + /v1/price/batch for the singleton cross-rate.
    • SearchModal.tsx → /v1/assets/verified filtered to fiat
      for the ticker→/currencies/X affordance.

    Zero remaining live calls to the removed routes. Typecheck +
    lint + build all green.

Changed

  • Multi-region tooling now handles single-region operation
    gracefully (F-1234).
    Pre-R2/R3-bringup only R1 is deployed.
    scripts/dev/verify-cross-region.sh, ratesengine-ops cross-region-check, and ratesengine-ops cross-region-monitor
    all used to fail with "need at least 2 regions to compare"
    even when called against the only deployed region. Operators
    who triggered them got a confusing failure and learned to
    ignore the family. Now: each command logs a one-line
    "single-region — pre-launch posture, see r2-r3-bringup.md"
    notice and exits 0. The check fires for real once a second
    region URL is supplied. Default R1 URL in
    verify-cross-region.sh points at the live public hostname
    (api.ratesengine.net); R2/R3 default empty.

  • R1 prometheus.r1.yml scrape coverage + rule_files path
    (F-1219 + F-1220).
    Added scrape jobs for redis_exporter
    (port 9121, installed by the redis-sentinel ansible role),
    alertmanager self-scrape (so we can alert on alertmanager
    being down), postgres_exporter / pgbackrest_exporter
    placeholder slots (operator deploys the exporters, scrape
    picks them up), and minio cluster metrics with bearer-token
    auth path. Each job has a one-line comment naming the alert
    family it feeds. rule_files path changed from the empty-
    opt-in glob /etc/prometheus/rules.d/*.yml to the canonical
    /etc/prometheus/rules.r1/*.yml matching the deployed-asset
    path so operators no longer have to symlink the full
    configs/prometheus/rules.r1/ set into a parallel directory.

  • Prometheus multi-host ↔ R1 overlay drift caught at CI (F-1222).
    Multi-host rules in deploy/monitoring/rules/ use underscored
    job labels (ratesengine_api) matching the multi-host ansible
    scrape config; R1's single-host overlay at
    configs/prometheus/rules.r1/ uses hyphenated labels matching
    the R1 systemd units. Editing only one half silently breaks the
    other deployment shape. New header notes pin the convention in
    the canonical files; scripts/ci/lint-docs.sh now flags any
    multi-host rule file without an R1-overlay sibling so the gap
    surfaces at CI time. Created R1 overlays for cache.yml,
    stellar.yml, storage.yml to satisfy the new pairing
    check — the underlying rules already match upstream metric
    names so no expression changes were needed.

  • Tailored error for supply-observer backfill attempts (F-1243).
    ratesengine-ops backfill accounts (or any of the six supply
    observers: accounts, trustlines, claimable_balances,
    sac_balances, sep41_supply, liquidity_pools) used to fail
    with the generic "WASM-hash audit pending" error — misleading,
    since supply observers aren't Soroban price/oracle sources at
    all. They plug into different dispatcher hooks
    (LedgerEntryChange / OpDecoder / SEP-41) and have no
    historical replay path through this command. New
    checkBackfillSources helper distinguishes the two cases and
    emits a supply-observer-specific message pointing operators
    at the supply-snapshot timer (or a future supply-backfill
    command for SEP-41 windows). 3 new unit tests cover the
    closed name set, the tailored error path, and the unchanged
    WASM-audit gate.
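
A minimal Go sketch of how a helper like checkBackfillSources might separate the two cases. The six observer names come from the list above; the error text, the audited map, and the source labels used in main are hypothetical and not taken from the codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// supplyObservers is the closed set of six observer names listed above.
var supplyObservers = map[string]bool{
	"accounts": true, "trustlines": true, "claimable_balances": true,
	"sac_balances": true, "sep41_supply": true, "liquidity_pools": true,
}

// Hypothetical sentinels; the real error text is not shown in these notes.
var (
	errSupplyObserver   = errors.New("supply observer: no historical replay path via backfill")
	errWASMAuditPending = errors.New("WASM-hash audit pending")
)

// checkSources returns a supply-observer-specific error before falling
// through to the WASM-audit gate. audited is an assumed stand-in for
// whatever records which sources have passed the audit.
func checkSources(names []string, audited map[string]bool) error {
	for _, n := range names {
		if supplyObservers[n] {
			return fmt.Errorf("source %q: %w", n, errSupplyObserver)
		}
		if !audited[n] {
			return fmt.Errorf("source %q: %w", n, errWASMAuditPending)
		}
	}
	return nil
}

func main() {
	audited := map[string]bool{"soroswap": true} // hypothetical source label
	for _, name := range []string{"accounts", "blend", "soroswap"} {
		fmt.Printf("%s: %v\n", name, checkSources([]string{name}, audited))
	}
}
```

Checking the closed name set first is what keeps the generic audit message from ever reaching an operator who asked for a supply observer.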

Added

  • Customer-webhook delivery worker wired into the API binary
    (F-1270 follow-up).
    The worker that drains the
    webhook_deliveries queue (HMAC sign + POST + retry) now runs
    as a goroutine in cmd/ratesengine-api/main.go whenever the
    dashboard surface comes up (Postgres reachable). Pre-this:
    operator had to launch the worker as a separate process per
    the docblock "operator-launched via internal/customerwebhook.New".
    Single-binary deploy now does it inline — same context, same
    lifecycle, same logger; one less ansible task.

  • Customer-webhook delivery alerts + runbook (F-1270 follow-up).
    Two new Prometheus alerts wired into both the multi-host and R1
    rules: _delivery_failing (P3) fires when 5xx + network-error
    attempts exceed 0.1/s for 15+ min (single-customer outage);
    _delivery_exhausted (informational) fires when a delivery hits
    the 15-attempt retry budget. New customer-webhook-delivery-failing.md
    runbook covers the SQL to identify the failing webhook, the
    customer-outreach template, and the worker-vs-customer triage
    tree. Catalogued in alerts-catalog.md so operators see them
    alongside the rest of the API alerts.

Tested

  • internal/usage package gains unit-test coverage. The
    per-subject daily usage counter (Redis-backed) was the one
    remaining no-test package in internal/. 8 tests cover the
    Increment/Read round-trip, day-boundary handling, empty-subject
    no-op, retention clamp, key-prefix isolation, URL-encoded
    subjects (so : inside IPv6 addresses doesn't collide on the
    date separator), and the 35-day retention TTL applied on every
    key.
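
A rough Go sketch of the key construction these tests describe. The key layout (prefix, escaped subject, date), the "usage" prefix, and bucketing days in UTC are all assumptions made for illustration; only the URL-encoding of subjects, the empty-subject no-op, and the 35-day TTL come from the notes above.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// retention mirrors the 35-day TTL mentioned above.
const retention = 35 * 24 * time.Hour

// usageKey builds "<prefix>:<escaped subject>:<YYYY-MM-DD>".
// The subject is URL-encoded so a ':' inside it (for example in an IPv6
// address) cannot be mistaken for the date separator.
// An empty subject yields "" so the caller can treat it as a no-op.
func usageKey(prefix, subject string, unixSec int64) string {
	if subject == "" {
		return ""
	}
	day := time.Unix(unixSec, 0).UTC().Format("2006-01-02")
	return prefix + ":" + url.QueryEscape(subject) + ":" + day
}

func main() {
	// One second before, and exactly at, 2026-05-12T00:00:00Z.
	fmt.Println(usageKey("usage", "2001:db8::1", 1778543999))
	fmt.Println(usageKey("usage", "2001:db8::1", 1778544000))
	fmt.Println("ttl:", retention)
}
```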

Documented

  • ADR-0012 placeholder (F-1262). Filled the numeric gap in
    docs/adr/ — 0011 jumped to 0013 with no file at 0012, even
    though docs/adr/README.md had listed the slot as Planned
    (reserved for Quorum-set composition per ADR-0004 Phase 3)
    since the initial audit. The placeholder documents what the
    future ADR must cover (third-party validator selection,
    HALT-LIVE-DROP scoring, cross-region quorum overlap,
    stellar-core [QUORUM_SET] thresholds) and what invariants it must
    preserve (Tier-1 independence, no self-included validators,
    ≤ 33% effective weight per validator). README index now links
    to the file.

  • Dashboard surface bypasses the v1 envelope on purpose (F-1235).
    /v1/dashboard/keys* handlers write bare JSON rather than the
    data / as_of / flags envelope used by market-data
    endpoints. Documented the rationale in
    docs/reference/api-design.md §4.1: different audience
    (dashboard React app, not SDK), session-scoped data (no
    market-quality flags to carry), distinct auth path (session
    cookie vs API key). Future contributors won't "fix" the
    perceived drift. RFC 9457 problem responses,
    Cache-Control: no-store, and X-Request-Id correlation are preserved.

Fixed

  • Guard test for flags.stale=true on every fallback path (F-1254).
    The stale-flag fix in internal/api/v1/price.go:287-299 (the
    May-10 SEV-2 lesson — "fallback chain is itself the staleness
    signal") had no regression test. New
    TestPrice_FallbackChainSetsStaleFlag covers both the
    triangulated and direct-rewrite fallback paths and asserts
    flags.stale=true on each — so a future change that re-clears
    the flag is caught immediately.

  • API reference doc drift after rc.48 (F-1246). Regenerated
    docs/reference/api/rates-engine.v1.yaml to match the current
    OpenAPI source — three residual /v1/coins references in the
    generated file (an ?issuer= description, the home-page
    summary, and an error-envelope instance example) lingered
    after rc.48 removed the route. Pure regen; OpenAPI source was
    already clean.

  • Postman collection drift (F-1247). make docs-postman now
    writes to the customer-facing canonical at
    examples/postman/rates-engine.postman_collection.json;
    previously it wrote to a gitignored docs/reference/api/...
    path, so the tracked customer copy drifted silently every time
    the OpenAPI spec moved. The docs-site build pipeline runs its
    own regeneration; the in-repo file is for customers who clone
    the repo to import the collection. README + Makefile docstring
    updated; canonical refreshed (656k bytes).

  • classic_assets.first_seen_* ordering bug under chunked
    backfill (F-1239).
    The ON CONFLICT clause in
    registerClassicAssetSeen previously updated only last_seen_*
    (GREATEST) and observation_count — leaving first_seen_*
    pinned to whichever ledger first hit the row, regardless of
    actual chronology. Out-of-order parallel backfill (chunked
    ranges processed in parallel) could leave first_seen_ledger
    higher than the asset's true first observation. Fix: also
    update first_seen_* with LEAST(existing, incoming). Idempotent
    for forward ingest (incoming is always ≥ existing → no-op).
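
The effect of the LEAST/GREATEST pair can be sketched in Go with an in-memory map standing in for the classic_assets table. The struct, the function name, and the assumption that observation_count increments by one per observation are illustrative only.

```go
package main

import "fmt"

// seen is a hypothetical in-memory stand-in for one classic_assets row.
type seen struct {
	firstLedger uint32
	lastLedger  uint32
	count       int
}

// registerSeen mirrors the fixed upsert: on conflict, first_seen takes
// LEAST(existing, incoming) and last_seen takes GREATEST(existing, incoming),
// so the result does not depend on the order in which chunks arrive.
func registerSeen(rows map[string]seen, asset string, ledger uint32) {
	cur, ok := rows[asset]
	if !ok {
		rows[asset] = seen{firstLedger: ledger, lastLedger: ledger, count: 1}
		return
	}
	if ledger < cur.firstLedger {
		cur.firstLedger = ledger
	}
	if ledger > cur.lastLedger {
		cur.lastLedger = ledger
	}
	cur.count++
	rows[asset] = cur
}

func main() {
	rows := map[string]seen{}
	// Out-of-order arrival, as with parallel chunked backfill.
	for _, ledger := range []uint32{300, 100, 200} {
		registerSeen(rows, "USDC", ledger)
	}
	fmt.Printf("%+v\n", rows["USDC"])
}
```

With the pre-fix behaviour the first field would have stayed at 300, the ledger that happened to reach the row first.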

Removed

  • Unused GIN indexes on blend_auctions.bid / .lot (F-1238).
    Migration 0029 drops the two JSONB GIN indexes from migration
    0009. No reader in internal/storage/timescale/ queries those
    columns by content — LatestBlendAuctionEvent and
    ListBlendPools both filter only by pool / auction_type /
    user_address / ts. Index write-amplification on every
    blend-auction INSERT for a read path that never materialised.
    Down migration restores them.

Changed

  • ratesengine_ingestion_source_stopped alert window widened
    to 30m × 15m (F-1212b).
    The pre-existing 5-min window
    produced routine false positives on low-volume Soroban / FX
    sources (blend auctions, ECB FX dailies, Band oracle pushes,
    Comet pool swaps, Phoenix off-peak windows). On R1 this
    manifested as 5 simultaneous ticket-tier alerts at any given
    time; operators learned to ignore the family. The new window
    waits past the natural quiet-window cadence for these sources;
    total-outage coverage stays tight via the separate 3-min
    ratesengine_ingestion_all_sources_stopped (P1). Rule updated
    in both deploy/monitoring/rules/ingestion.yml and
    configs/prometheus/rules.r1/ingestion.yml. Runbook + alerts
    catalog updated to reflect the new threshold and rationale.

Documented

  • RFP F4.2 one-year retention catch-up procedure (F-1265).
    docs/operations/backfill-procedure.md gains a new section
    walking through the 1-year catch-up backfill needed to meet
    Freighter RFP F4.2's ≥1y retention commitment. Covers
    resolving the target ledger window from the Galexie archive
    manifest, sanity-checking upstream archive completeness, row-
    count estimation, the chunked-by-week run loop with -resume
    so a mid-chunk crash doesn't re-do 12 hours of work, the CAGG
    force-refresh sequence, and a /v1/chart?timeframe=1y
    verification step. Pre-flip operator step; the code path is
    unchanged.

Added

  • R1 TOML supply.watched_* defaults (F-1266). The
    archival-node ansible role's ratesengine.toml.j2 template
    gains a [supply] block with sensible launch-day defaults:
    watched_classic_assets populated with the top Stellar-classic
    verified currencies (USDC / EURC / AQUA / yXLM / VELO / BLND /
    PHO / KALE) mirroring internal/currency/data/seed.yaml. Plus
    inventory-overridable knobs for watched_sep41_contracts,
    sdf_reserve_accounts, and reserve_balances_stroops.
    Pre-F-1266: R1's TOML had no supply block → every F2 field
    (market_cap_usd, fdv_usd, circulating_supply,
    total_supply, max_supply) returned NULL even though the
    code path is correct. The next archival-node role run will
    flip every one of those fields from NULL to a real value for
    the 8 watched currencies.

  • Opt-in Redis ACL lockdown template (F-1213). Closes the
    pre-flip Redis-ACL gap on R1 by codifying a narrow ACL config
    in the redis-sentinel ansible role. New redis_acl_lockdown
    flag (default false for backward compat) renders
    templates/users.acl.j2 to /etc/redis/users.acl, references
    it from redis.conf.j2 via aclfile, and:

    • Disables the legacy default user (off nopass nocommands)
      so no password-less access path remains.
    • Creates a ratesengine application user with
      +@read +@write +@scripting +@pubsub +@connection minus the
      @admin + @dangerous families, scoped via ~prefix:* to
      exactly the cache key prefixes the application uses (vwap,
      confidence, freeze, div, ratelimit, signup-ip, toml, meta,
      price, apikey, health, oracle, subscriber) plus the pub/sub
      channels (&closed-bucket-*, &stream-*).

    Application binaries get a new [storage].redis_username TOML
    key (default empty = legacy path; operators set it to
    ratesengine when they flip the lockdown). redisclient.Build
    threads it into both the FailoverClient and single-node code
    paths. Commented-out per-component (re_aggregator /
    re_api / re_indexer) users in the template show the
    follow-on split when operators add per-binary passwords.

  • L2.2 Phase 2 FX-anchor USD volume coverage (F-1268). New
    timescale.VWAPUSDFXResolver implements the pre-existing
    USDVolumeFXResolver interface against the prices_1m
    CAGG: for any on-chain quote asset not already on the
    operator's [trades].usd_pegged_classic_assets list, the
    resolver looks up <quote>/<USD-peg> at the trade's
    timestamp and supplies a per-minute-bucket-cached USD rate
    that tradeUSDVolume multiplies through. Pre-Phase-2: only
    CEX/FX + operator-allow-listed pegs contributed to
    volume_24h_usd; an EURC/XLM Soroswap trade contributed 0
    even though we had a fresh EURC/USDC VWAP one minute earlier.
    Now it inherits USD value through the peg chain. Wired
    alongside the Phase 1 quote spec in
    cmd/ratesengine-indexer/main.go whenever
    usd_pegged_classic_assets is non-empty (no new config
    knob). 7 unit tests cover defaults, cache hits, negative
    cache, TTL expiry, minute-bucket key stability. The
    AssetDetail volume_24h_usd docstring rewritten to
    document the three-tier coverage chain (Phase 1 off-chain
    + Phase 1 on-chain pegs + Phase 2 FX-anchor).

  • Customer-facing dashboard webhook CRUD handlers (F-1270
    complete).
    New internal/api/v1/dashboardwebhooks package
    mounts five routes: GET/POST/PATCH/DELETE
    /v1/dashboard/webhooks + GET
    /v1/dashboard/webhooks/{id}/deliveries. Session-gated,
    role-gated (Owner/Admin/Member create; Viewer/Billing 403),
    cross-account 404 (no existence-leak), 10-per-account quota,
    HTTPS-only URLs, closed-set event validation, secret returned
    ONCE on create. OpenAPI spec adds 5 paths + 5 schemas;
    postman + api docs regenerated. 8 unit tests cover happy
    path, 401, 403, malformed URL, unknown event, quota, list
    scoping, cross-account delete. Wired into the v1 server via
    the same DashboardAuthMounter pattern as keys, and into
    main.go's buildDashboardBundle so the handlers come up
    whenever Postgres is reachable.

  • Customer-webhook delivery worker (F-1270 close-out).
    New internal/customerwebhook package drains the queue the
    store wrote in the prior commit: poll-loop drains
    ListPendingDeliveries, HMAC-SHA-256 signs the payload,
    POSTs to the customer URL with X-RatesEngine-Signature +
    X-RatesEngine-Event + X-RatesEngine-Delivery-Id headers,
    marks delivered on 2xx, schedules retry on 5xx/network
    (exponential backoff 30s → 1h cap, 15-attempt budget),
    terminates on 4xx / disabled-webhook / missing-webhook /
    malformed-URL. New
    ratesengine_customer_webhook_delivery_attempts_total
    counter labelled by 10 outcomes; documented in metrics ref
    with two alert recipes. 5 unit tests cover the happy path,
    5xx-retry, 4xx-terminal, disabled-webhook, missing-webhook.
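
A small Go sketch of the two mechanical pieces of the worker: HMAC-SHA-256 signing and the capped retry schedule. Hex encoding of the signature and a doubling factor between attempts are assumptions; the notes above only state the algorithm, the 30s start, the 1h cap, and the 15-attempt budget.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

const (
	baseDelay   = 30 * time.Second
	maxDelay    = time.Hour
	maxAttempts = 15
)

// sign returns the hex-encoded HMAC-SHA-256 of payload (encoding assumed).
func sign(secret, payload []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify is what a receiver could run; hmac.Equal compares in constant time.
func verify(secret, payload []byte, signature string) bool {
	got, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload)
	return hmac.Equal(got, mac.Sum(nil))
}

// retryDelay doubles from baseDelay per failed attempt and caps at maxDelay.
func retryDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	secret, body := []byte("example-secret"), []byte(`{"event":"incident.sev1"}`)
	sig := sign(secret, body)
	fmt.Println(len(sig), verify(secret, body, sig))
	for _, attempt := range []int{1, 2, 7, 8, maxAttempts} {
		fmt.Println(attempt, retryDelay(attempt))
	}
}
```

Under the doubling assumption the cap is reached on the eighth attempt, so most of the 15-attempt budget is spent at one retry per hour.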

  • postgresstore.WebhookStore customer-webhook data plane
    (F-1270 partial).
    Implements the existing
    platform.WebhookStore interface against the
    customer_webhooks + webhook_deliveries tables from
    migration 0027: Create / Get / List / Update / Delete on the
    registry; Enqueue / ListPending / MarkDelivered /
    MarkAttemptFailed on the delivery queue; Append / Update /
    ListDeliveries on the dashboard delivery log. Four new
    WebhookEventType constants (incident.sev1,
    incident.resolved, anomaly.freeze, divergence.firing)
    pin the closed event set without forcing the schema to use an
    enum. New integration subtest
    WebhookStore/CRUD+queue covers the full lifecycle:
    create → list → update → enqueue → fail-with-retry →
    enqueue → mark-delivered → list-history → delete-cascades.
    RotateWebhookSecret is a tagged stub pending the dashboard
    CRUD handlers. Delivery worker + customer-facing API are
    follow-up commits.

  • Inline price_usd on /v1/assets/{id} (F-1271). The
    asset-detail body now carries price_usd whenever the price
    lookup succeeds — previously it only surfaced via the optional
    coins-overlay block (assets not in the coins catalogue had a
    null price_usd even though the same handler was already
    fetching the price for market_cap_usd). Freighter wallet
    + retail apps that just want the current price no longer pay
    a second /v1/price round-trip on every asset-detail render.
    Extracted populatePriceUSD runs before the supply early-return
    so off-chain assets without a supply snapshot also get
    the field; populateMarketCap now re-uses the already-inlined
    price instead of paying for a second lookup. OpenAPI spec
    updated; postman + api docs regenerated. 1 new unit test
    covers the no-supply path.

  • postgresstore.BillingStore subscription mirror (F-1231).
    UpsertSubscription and GetActiveSubscriptionForAccount,
    previously stubbed, now hit the subscriptions table from
    migration 0027. UPSERT is idempotent on stripe_subscription_id
    so a re-delivered webhook updates plan + period without
    duplicating rows. GetActiveSubscriptionForAccount enforces
    both the period-end and canceled-at semantics from
    platform.Subscription.IsActive. The Stripe webhook handler
    wire-up (which would need to resolve stripe_customer_id →
    account_id + extract subscription IDs from the event
    payload) is the next layer; this commit lands the store half
    so the data path is end-to-end-ready. New integration
    subtest BillingStore/Subscription/UpsertAndGetActive covers
    insert / idempotent update / expired / validation paths.

  • Stripe webhook tier-upgrade audit log (F-1240). New
    internal/platform/postgresstore.AuditStore implements the
    platform.AuditStore interface against the audit_log table
    from migration 0027 (Append / AppendBatch / List). The Stripe
    webhook handler now writes one plan.upgrade audit row per
    successful upgrade event (one row per event, not per key —
    metadata carries identifier + tier + key counts so the
    dashboard can render "the upgrade happened" without N rows
    for a customer holding N keys). StripeWebhookConfig.Audit
    is a narrow StripeAuditSink interface so the v1 package
    doesn't import the full audit-store surface. Append failures
    log at WARN and never block the webhook ack — audit-log
    unavailability must not turn a successful Stripe upgrade
    into a Stripe retry storm. 3 new unit tests cover the happy
    path, the nil-sink legacy fallback, and the swallowed-error
    contract.

  • Depeg-scenario test wiring stablecoin late binding ↔ divergence
    worker (F-1230).
    ADR-0026's stablecoin late binding deliberately
    conceals stablecoin↔fiat drift so XLM/USDC trades flow into the
    same XLM/USD bucket as XLM/USDT — the divergence worker is the
    designed safety net that fires flags.divergence_warning when
    the concealed price drifts from external references. The two
    components had no test wiring them together; nothing would catch
    a regression that broke either side. New
    divergence/depeg_test.go exercises the round-trip:

    • TestStablecoinDepeg_DivergenceWorkerFires: aggregate.ProxyTrade
      rewrites XLM/USDC → XLM/fiat:USD, the aggregator publishes
      a price assuming USDC=$1, references show the true XLM/USD
      after USDC depegged to $0.95, and the worker fires
      WarningFired=true on the resulting ~5.3% delta.
    • TestStablecoinPegHolds_DivergenceWorkerStaysQuiet — symmetric
      negative case so a future change can't make the warning fire
      on the steady state.
  • Guard tests for two CLAUDE.md surprises (F-1242).
    Locks behaviours that no production test previously asserted:

    • comet.TestDecodeSwap_DispatchIsByTopicNotContract proves
      that two events with different ContractIDs but the same
      (POOL, swap) topic both decode to Source="comet" — i.e.,
      the Comet decoder is generic Balancer-v1, not contract-
      specific. A future change that narrows the decoder to a
      specific allow-list would silently drop trades from any new
      Balancer-v1 deployment; this test fires first.
    • sep41_supply.TestDecoder_CAP67_FourTopic_BackCompat exercises
      mint / burn / clawback events with both pre-P23 (3/2/3 topics)
      and post-P23/CAP-67 (4/3/4 topics with sep0011_asset)
      arities. The decoder reads counterparty positionally and
      must ignore the optional 4th topic; a future contributor who
      naively asserts topic length would break the post-P23 path.

    The third surprise (SEP-41 transfer dual i128/Map shape) has
    no production transfer-amount decoder yet, so the dual-shape
    guard already lives in
    sac_balances.TestObserver_Decode{I128, MapVal}; documented as
    such in the audit register.

  • Per-request CORS observability metric (F-1244). New
    ratesengine_api_cors_decisions_total{outcome} counter wired
    into the CORS middleware. Outcomes: no_origin /
    allowed_origin / allowed_wildcard / denied. The
    pre-existing warnOpenCORS startup-only check fires once at
    boot then drifts out of memory; this counter is the per-request
    companion so operators can dashboard real cross-origin traffic
    and alert when a wildcard policy starts handling actual cross-
    origin requests in production (the silent failure mode of
    RATESENGINE_ALLOWED_ORIGINS=* slipping in alongside
    credentialed auth_mode). Wired into the existing middleware
    without changing public CORS behaviour; one new test case covers
    all four outcomes.
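
One way the four outcome labels could be derived per request, sketched in Go. The function name, the allow-list representation, and the choice to prefer an exact match over a wildcard when both are configured are assumptions; only the four label values come from the notes above.

```go
package main

import "fmt"

// corsOutcome classifies a request's Origin header against an allow-list.
// An exact match wins over "*" so that wildcard traffic is counted only
// when nothing more specific allowed the request (assumed precedence).
func corsOutcome(origin string, allowed []string) string {
	if origin == "" {
		return "no_origin"
	}
	wildcard := false
	for _, a := range allowed {
		switch a {
		case "*":
			wildcard = true
		case origin:
			return "allowed_origin"
		}
	}
	if wildcard {
		return "allowed_wildcard"
	}
	return "denied"
}

func main() {
	// counts stands in for the labelled Prometheus counter.
	counts := map[string]int{}
	allowed := []string{"https://app.example.com"}
	for _, origin := range []string{"", "https://app.example.com", "https://evil.example"} {
		counts[corsOutcome(origin, allowed)]++
	}
	counts[corsOutcome("https://other.example", []string{"*"})]++
	fmt.Println(counts)
}
```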

  • Freeze EventSink LKG VWAP + recovery worker (F-1228 + F-1229).
    freeze.EventSink.RecordFreeze and freeze.Writer.Mark now
    carry the last-known-good VWAP we're freezing on as a
    fixed-precision decimal string (orchestrator passes
    formatRatFixed(prev, 12)); the timescale sink stamps it on
    the new freeze_events row instead of the previous hardcoded
    frozen_value = 0. The recovery worker is the inverse half:
    every 60s it lists open freeze_events rows, checks whether
    the Redis marker still exists, and calls MarkRecovered when
    the marker is gone (TTL elapsed → underlying anomaly cleared).
    Without it, durable rows accumulated forever and the explorer
    /anomalies timeline showed resolved freezes as still-firing.
    New metrics ratesengine_anomaly_freeze_recovered_total and
    ratesengine_anomaly_freeze_recovery_sweeps_total{outcome},
    new alert ratesengine_anomaly_freeze_recovery_stalled (P3),
    new runbook freeze-recovery-stalled.md. Two phase-1 + phase-2
    orchestrator callers updated to thread the prevVWAP through.
    3 new unit tests + extended existing freeze + orchestrator
    tests.

  • Per-IP signup throttle (F-1232). New v1.SignupIPThrottle
    interface + auth.RedisSignupIPThrottle Redis-backed
    implementation. Default 5 signups per IP per hour via
    INCR + EXPIRE sliding window. The global anonymous rate
    limit (60/min/IP) is plenty for browsing public surfaces but
    lets a single IP bulk-mint 3,600 email→key_id pairs/hour via
    signup. The new throttle closes that vector without affecting
    other anonymous traffic; falls open on Redis errors. Wired in
    cmd/ratesengine-api/main.go whenever Redis is available.
    New auth.ErrSignupRateLimited sentinel + exported
    middleware.RemoteIP for handlers needing trusted-proxy-aware
    client IP outside the middleware chain. 5 unit tests
    (under-cap, over-cap, distinct IPs, empty IP falls open,
    defaults applied).
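
The throttle's counting behaviour can be approximated in memory as below. This is a sketch, not the Redis implementation: the type and method names are hypothetical, the expiry is assumed to be set when a counter is first created, times are plain unix seconds, and the map is not safe for concurrent use. The Redis-error fall-open path has no equivalent here; only the empty-IP fall-open is modelled.

```go
package main

import "fmt"

// window is one per-IP counter, analogous to a Redis key with a TTL.
type window struct {
	count     int
	expiresAt int64 // unix seconds
}

type ipThrottle struct {
	limit     int
	windowSec int64
	windows   map[string]*window
}

// newIPThrottle applies the defaults from the notes (5 per hour) when
// either argument is non-positive.
func newIPThrottle(limit int, windowSec int64) *ipThrottle {
	if limit <= 0 {
		limit = 5
	}
	if windowSec <= 0 {
		windowSec = 3600
	}
	return &ipThrottle{limit: limit, windowSec: windowSec, windows: map[string]*window{}}
}

// allow increments the counter for ip and reports whether the signup may
// proceed. An empty IP falls open, matching the behaviour described above.
func (t *ipThrottle) allow(ip string, now int64) bool {
	if ip == "" {
		return true
	}
	w, ok := t.windows[ip]
	if !ok || now >= w.expiresAt {
		w = &window{expiresAt: now + t.windowSec} // new key with fresh TTL
		t.windows[ip] = w
	}
	w.count++
	return w.count <= t.limit
}

func main() {
	t := newIPThrottle(0, 0)
	for i := 1; i <= 6; i++ {
		fmt.Println(i, t.allow("203.0.113.7", 1000))
	}
	fmt.Println("after expiry:", t.allow("203.0.113.7", 1000+3600))
}
```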

  • Stripe webhook event dedupe (F-1227). New
    internal/platform/postgresstore.BillingStore implements the
    AppendStripeEvent / MarkStripeEventProcessed /
    MarkStripeEventFailed triple from
    internal/platform/billing.go against the stripe_event_log
    table from migration 0027. The webhook handler now claims a
    dedupe slot with INSERT INTO stripe_event_log BEFORE running
    any side effects; ErrAlreadyProcessed (Postgres
    23505 unique_violation) signals "we've already done this work"
    and acks 200 immediately without re-running the upgrade. Stripe
    at-least-once delivery means the same event can land hours
    later — without this guard, a manual operator-side downgrade
    between original delivery and redelivery silently re-upgrades
    the customer. Wired in cmd/ratesengine-api/main.go to the same
    *sql.DB the timescale store uses; falls open to the legacy
    "rely on idempotent UpdateRateLimit" path when Postgres is
    absent. Two new unit tests pin the contract
    (duplicate-doesn't-reupgrade + nil-events-store-falls-back).
    UpsertSubscription + GetActiveSubscriptionForAccount stubbed
    pending Phase-2 / F-1231.

  • SEP-10 challenge-replay defence (F-1224). Added a
    sep10.ReplayGuard interface + sep10.RedisReplayGuard Redis-
    backed implementation. After a challenge XDR clears
    txnbuild.VerifyChallengeTxSigners, the validator hashes the
    signed XDR with SHA-256 and SETNX's the dedupe key
    (sep10:seen:<base64-url-no-pad>) with TTL = ChallengeTTL.
    A second submission of the same signed XDR finds the slot
    taken and returns auth.ErrUnauthorized instead of minting a
    fresh JWT. Wired in cmd/ratesengine-api/main.go to the
    same Redis client the rest of the auth subsystem uses;
    initial validator construction at main.go:144 happens before
    rdb is available, so the validator is rebuilt with the guard
    once rdb exists. miniredis-backed unit tests pin the four
    contracts (first claim ok, replay rejected, TTL expiry allows
    fresh claim, distinct hashes don't collide).
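
The claim-once logic can be illustrated in Go with a map standing in for Redis SETNX. The type and method names are hypothetical, and hashing the signed XDR in its string form (rather than its decoded bytes) is an assumption; the key prefix, SHA-256, and unpadded URL-safe base64 follow the description above.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// replayKey derives the dedupe key for a signed challenge.
func replayKey(signedXDR string) string {
	sum := sha256.Sum256([]byte(signedXDR))
	return "sep10:seen:" + base64.RawURLEncoding.EncodeToString(sum[:])
}

// memGuard is an in-memory stand-in for the Redis-backed guard.
// It maps each key to its expiry in unix seconds; not concurrency-safe.
type memGuard struct {
	seen map[string]int64
}

// claim behaves like SETNX with a TTL: the first caller for a key wins,
// later callers lose until the key has expired.
func (g *memGuard) claim(signedXDR string, now, ttlSec int64) bool {
	k := replayKey(signedXDR)
	if exp, ok := g.seen[k]; ok && now < exp {
		return false // replay inside the challenge TTL
	}
	g.seen[k] = now + ttlSec
	return true
}

func main() {
	g := &memGuard{seen: map[string]int64{}}
	fmt.Println(replayKey("AAAA-example-xdr"))
	fmt.Println(g.claim("AAAA-example-xdr", 0, 300))   // first claim
	fmt.Println(g.claim("AAAA-example-xdr", 10, 300))  // replay
	fmt.Println(g.claim("AAAA-example-xdr", 300, 300)) // after TTL
}
```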

  • ratesengine_aggregator_vwap_cache_write_errors_total metric
    + paired ratesengine_aggregator_cache_write_errors page-tier
    alert. The May-10 SEV-2 (Redis BGSAVE blocked by full root FS for
    ~9 h → every cache Set returned MISCONF → /v1/price 404'd on
    every rewritten / triangulated / stablecoin-proxy pair) had no
    upstream signal in monitoring: flags.stale did not flip
    because the aggregator process was alive and ticking, just unable
    to publish. The post-mortem (internal/incidents/data/2026-05-10-redis-writes-blocked-disk-full.md)
    explicitly recommended "alert on aggregator WARN rate (not just
    service-up status)"; this counter realises that recommendation
    as the cleanest signal: any non-zero rate(...[5m]) for ≥ 2 min
    pages. Increments at the single cache-write failure point in
    internal/aggregate/orchestrator/orchestrator.go:653. Closes
    audit-2026-05-12 F-1253; supports F-1254 (flags.stale semantic
    bug, separate fix).

Fixed

  • Postgres max_locks_per_transaction = 256 codified (F-1251).
    The 2026-05-06 SEV-3 (internal/incidents/data/2026-05-06-postgres-lock-table-full.md)
    hit out of shared memory (53200) when the per-tx lock table
    saturated under concurrent ingest from many sources. The
    operator bumped this knob to 256 by hand on R1; un-codified, a
    from-scratch R1 rebuild or R2/R3 cutover would inherit the
    Postgres default of 64 and re-experience the same incident
    class. Now templated by archival-node/templates/postgresql.conf.j2
    with default postgres_max_locks_per_transaction: 256 (4×
    headroom; 51,200-entry lock table at the current 200-connection
    limit). Paired with new ratesengine_timescale_lock_table_pressure
    Prometheus alert at 70% saturation so the next bump is
    forecast not forced — depends on postgres_exporter (not yet
    scraped on R1; rule lights up when the exporter lands).

  • web/status/wrangler.toml added (F-1245). Mirrors the
    explorer + dashboard wrangler.toml shape so Cloudflare Pages
    git-integration deploy works without manual project setup.

  • web/explorer/src/app/oracles/OraclesView.tsx ESLint
    react-hooks/exhaustive-deps warning fixed (F-1258).

    streamRows was a fresh [] on every render when
    streams.data was undefined, causing the downstream useMemo
    to recompute every tick. Wrapped in its own useMemo for
    referential stability.

  • internal/sources/comet/adapter_test.go: pin
    topic-vs-contract-id contract (F-1242).
    New
    TestDecoder_Decode_NoContractIDDiscrimination makes the
    CLAUDE.md surprise-list claim ("Comet decoder matches by
    topic, not contract address") executable. Any future change
    adding a contract-id allow-list at the decoder layer (instead
    of downstream filtering) MUST flip the assertion.

  • F-1228 + F-1229 acknowledged but deferred to a separate
    refactor.
    freeze_events.frozen_value always written as 0
    + MarkRecovered has zero callers. The structural fix
    (extend freeze.EventSink.RecordFreeze to accept the LKG
    VWAP, plus wire a recovery worker that calls MarkRecovered
    on Redis-marker TTL expiry) touches the EventSink interface
    used by 3 packages + tests. Both medium-severity, neither
    blocks the public flip.

Investigated, no code change

  • F-1262 ADR-0012 missing from disk: turns out to be an
    intentional reservation, documented in docs/adr/README.md:56:
    "0012 | Planned | Quorum-set composition (referenced by
    multi-region-topology) | —". Per the ADR README's
    "gaps allowed when reserved" rule. F-1262 closed as invalid.

  • flags.stale semantic bug fixed (F-1254).
    internal/api/v1/price.go reset stale = false after falling
    through to priceFallback (last-trade / stablecoin proxy /
    triangulation). The May-10 SEV-2 (Redis BGSAVE blocked → cache
    empty → every closed-bucket read hit ErrPriceNotFound →
    priceFallback served last-trade for ~9 h) hit this path: the
    customer-visible response was the fallback, but flags.stale
    was false. Per ADR-0018 §"flags.stale semantic" and the doc
    comment on Flags.Stale, fallback responses ARE stale by
    definition. Set stale = ok on the fallback branch in both
    the single-asset and /v1/price/batch paths so any non-VWAP
    response now correctly carries stale=true. Companion fix
    to F-1253's cache-write-error counter (the upstream signal)
    and F-1252 (the alert-routing the May-10 incident exposed).
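
The rule described above (any response served from the fallback path is stale by definition) can be sketched as follows. This is a simplified illustration, not the code in internal/api/v1/price.go: `resolvePrice`, `lookupFunc` and the `Flags` struct shown here are hypothetical stand-ins, and the real primary path may set the flag from other inputs as well.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPriceNotFound is redeclared here so the sketch is self-contained.
var ErrPriceNotFound = errors.New("price not found")

// Flags is a hypothetical stand-in for the response flags.
type Flags struct {
	Stale bool
}

// lookupFunc is an assumed signature for the primary VWAP lookup.
type lookupFunc func(assetID string) (float64, error)

// resolvePrice tries the VWAP lookup first and falls back only when the
// price is missing. A fallback hit is always marked stale (stale = ok).
func resolvePrice(assetID string, vwap lookupFunc, fallback func(string) (float64, bool)) (float64, Flags, error) {
	price, err := vwap(assetID)
	if err == nil {
		return price, Flags{Stale: false}, nil
	}
	if !errors.Is(err, ErrPriceNotFound) {
		return 0, Flags{}, err
	}
	price, ok := fallback(assetID)
	if !ok {
		return 0, Flags{}, ErrPriceNotFound
	}
	// The pre-fix behaviour reset the flag to false here.
	return price, Flags{Stale: ok}, nil
}

func main() {
	miss := func(string) (float64, error) { return 0, ErrPriceNotFound }
	lastTrade := func(string) (float64, bool) { return 1.08, true }
	price, flags, err := resolvePrice("fiat:EUR", miss, lastTrade)
	fmt.Println(price, flags.Stale, err)
}
```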

  • /v1/price/batch sources nondeterminism (F-1259).
    internal/api/v1/price.go:902-905 lookupPriceBatch unioned
    per-row sources through a map[string]struct{} and emitted
    them in map-iteration order, breaking the ADR-0015
    byte-identical cross-region property for batch responses.
    Added sort.Strings(srcs) before writeJSON so batch
    responses match the single-asset path's stable lexical order
    (set by timescale.normalizeVwapSources at the storage
    boundary per F-0016 closure).
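
The nondeterminism comes from Go's randomized map iteration order. A minimal sketch of the union-then-sort pattern follows; `unionSources` is a hypothetical helper name, not the function in price.go.

```go
package main

import (
	"fmt"
	"sort"
)

// unionSources merges per-row source lists into one deduplicated slice.
// Map iteration order is randomized in Go, so the result is sorted before
// it is returned to keep serialized responses byte-identical.
func unionSources(rows [][]string) []string {
	seen := make(map[string]struct{})
	for _, row := range rows {
		for _, s := range row {
			seen[s] = struct{}{}
		}
	}
	srcs := make([]string, 0, len(seen))
	for s := range seen {
		srcs = append(srcs, s)
	}
	sort.Strings(srcs) // stable lexical order regardless of map order
	return srcs
}

func main() {
	rows := [][]string{{"phoenix", "comet"}, {"band", "comet"}}
	fmt.Println(unionSources(rows)) // prints [band comet phoenix]
}
```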

  • Cache-Control gap on credential surfaces (F-1225).
    /v1/auth/login, /v1/auth/callback, /v1/auth/logout,
    /v1/dashboard/keys*, /v1/signup, /v1/webhooks/stripe,
    /v1/price/stream, /v1/methodology, and /v1/incidents.atom
    all fell through policyForPath's switch with no case match,
    emitting no Cache-Control header. Most concerning was
    /v1/auth/callback: a CDN in front of the API could have
    cached the magic-link consume response and re-issued the
    session cookie to subsequent requests. Added explicit cases:
    every /v1/auth/* and /v1/dashboard/* and the two
    state-changing surfaces (/v1/signup, /v1/webhooks/stripe)
    use private, no-store; /v1/price/stream uses no-store;
    /v1/methodology and /v1/incidents.atom get explicit
    public-cache policies appropriate to their content cadence.
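
A path-to-policy switch of the kind described above might look like the sketch below. The function name `sketchPolicyForPath` is hypothetical, and the two `max-age` values are placeholders: the release note does not state the actual public-cache durations.

```go
package main

import (
	"fmt"
	"strings"
)

// sketchPolicyForPath returns a Cache-Control value for a request path.
// An empty string means "no explicit policy", which is the gap F-1225 closed.
func sketchPolicyForPath(path string) string {
	switch {
	case strings.HasPrefix(path, "/v1/auth/"),
		strings.HasPrefix(path, "/v1/dashboard/"),
		path == "/v1/signup",
		path == "/v1/webhooks/stripe":
		// Credential and state-changing surfaces must never be cached.
		return "private, no-store"
	case path == "/v1/price/stream":
		return "no-store"
	case path == "/v1/methodology":
		return "public, max-age=3600" // placeholder duration (assumption)
	case path == "/v1/incidents.atom":
		return "public, max-age=60" // placeholder duration (assumption)
	default:
		return ""
	}
}

func main() {
	for _, p := range []string{"/v1/auth/callback", "/v1/dashboard/keys", "/v1/price/stream"} {
		fmt.Printf("%s -> %q\n", p, sketchPolicyForPath(p))
	}
}
```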

  • 4 of 4 make test-integration failures (F-1250).

    • TestPlatformPostgresStores/APIKey/CRUD+revoke+touch
      test/integration/platform_postgres_stores_test.go:400,510
      constructed key IDs as "kid_" + uuid.New().String()[:12]
      which contains a hyphen at position 9, violating
      migration-0027 check id ~ '^kid_[a-f0-9]{12,}$'. Switched
      to strings.ReplaceAll(uuid.New().String(), "-", "")[:12]
      (12 hex chars).
    • TestEndToEnd_LedgerstreamToTimescale/soroban_LCM_with_reflector_FX_update
      + TestTradesInRangeAndMarkets: both used hand-crafted
      G-strkeys (GA7QYNF7…UWDA and GA5ZSEJYB…ZVM) with invalid
      CRCs. The strkey package now enforces CRC; tests switched
      to AQUA's real mainnet G-strkey
      (GBNZILSTVQZ4R7IKQDGHYGY2QXL5QOFJYQMXPKWRRM5PAV7Y4M67AQUA)
      which round-trips cleanly and is distinct from USDC's
      issuer.
    • TestSupplyStorageRoundTrip — schema/reader drift:
      migration 0005_create_asset_supply_history.up.sql:60
      creates a UNIQUE index on (asset_key, ledger_sequence, time)
      (TimescaleDB requires the partition column in any unique
      index on a hypertable), but internal/storage/timescale/supply.go:47
      used ON CONFLICT (asset_key, ledger_sequence) DO NOTHING.
      Postgres requires an exact column-set match; the INSERT
      failed with 42P10. Updated the conflict target to all 3
      columns and revised the doc comment to explain the
      invariant preservation.
    • Plus TestTradesInRangeAndMarkets DistinctPairs returned
      0 markets after the strkey fix because the test inserted
      into trades directly but DistinctPairs reads from the
      prices_1m continuous aggregate (post rc.45 commit
      8717bc20). Added a CALL refresh_continuous_aggregate('prices_1m', NULL, NULL)
      before the assertion, mirroring test/integration/api_test.go:65-74.
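
The key-ID failure in the first F-1250 item is easy to reproduce in isolation. The sketch below takes a canonical UUID string as input instead of calling a UUID library, so it runs with the standard library only; `keyIDFromUUID` is a hypothetical helper name, and the regex is the migration-0027 check quoted above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// kidPattern mirrors the migration-0027 check constraint quoted above.
var kidPattern = regexp.MustCompile(`^kid_[a-f0-9]{12,}$`)

// keyIDFromUUID strips hyphens before truncating, so the 12-character
// suffix is pure hex. It assumes a canonical 36-character UUID string.
func keyIDFromUUID(u string) string {
	return "kid_" + strings.ReplaceAll(u, "-", "")[:12]
}

func main() {
	const u = "123e4567-e89b-12d3-a456-426614174000"

	bad := "kid_" + u[:12] // keeps the hyphen that follows the first 8 hex chars
	good := keyIDFromUUID(u)

	fmt.Println(bad, kidPattern.MatchString(bad))   // kid_123e4567-e89 false
	fmt.Println(good, kidPattern.MatchString(good)) // kid_123e4567e89b true
}
```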
  • R1 alert blackout closed: 9 alert families wired up, textfile
    evidence chain repaired (F-1219 + F-1220 + F-1221 + F-1252).

    Pre-change R1 loaded only 6 of 18 rule families
    (aggregator/api/infra/ingestion/meta/slo); every alert in
    anomaly, divergence, external-pollers, supply,
    supply-snapshot, supply-refresh, archive-completeness,
    verify-archive, sla-probe was permanently silent. The
    SLA-evidence chain specifically was broken end-to-end: the probe
    binary supports -textfile-output (cmd/ratesengine-sla-probe/textfile.go:190 writeTextfileAtomic) but the R1 wrapper at
    configs/healthchecks/sla-probe.sh never set it, the
    textfile-collector dir didn't exist, and node_exporter ran
    without --collector.textfile. Three changes close the chain:

    • configs/ansible/roles/archival-node/tasks/10-observability.yml
      now provisions /var/lib/node_exporter/textfile_collector/
      and adds --collector.textfile + --collector.textfile.directory
      to the node_exporter systemd unit.
    • configs/healthchecks/sla-probe.sh now defaults
      SLA_PROBE_TEXTFILE_OUTPUT=/var/lib/node_exporter/textfile_collector/sla_probe.prom
      and passes -textfile-output $value conditionally (preserves
      the opt-out for operators that set the env var blank).
    • configs/prometheus/rules.r1/ gains 9 rule files copied
      verbatim from deploy/monitoring/rules/ (none of them had
      job-label refs requiring single-host adaptation). README
      table updated; rules cache.yml / storage.yml / stellar.yml
      stay excluded with a clear note (redis_exporter +
      postgres_exporter + stellar-core-prometheus-exporter are
      not on R1).
  • Source-stopped alert false-positive class on low-volume
    Soroban contracts (F-1212b).
    ratesengine_ingestion_source_stopped
    used a 5-min rate window which routinely false-fired on
    band, blend, comet, ecb, phoenix (legitimate 5+-minute
    gaps during quiet trading windows — the source-stopped runbook
    itself acknowledges this at line 60). Widened to a 30-min rate
    window plus a 15-min for: duration in both deploy/monitoring/rules/ingestion.yml
    and configs/prometheus/rules.r1/ingestion.yml. Total-outage
    coverage stays tight via the separate _all_sources_stopped
    alert at 3 min — that one continues to catch the
    upstream-broke-across-the-fleet case.

  • Multi-host alert rule job labels (F-1222).
    deploy/monitoring/rules/api.yml / aggregator.yml /
    ingestion.yml / slo.yml / meta.yml referenced job="api"
    / "aggregator" / "indexer" but the multi-host ansible
    prometheus role's scrape config uses ratesengine_api /
    ratesengine_aggregator / ratesengine_indexer (underscores).
    Rules would never have evaluated true on a multi-host deploy.
    Renamed the canonical multi-host labels to match the scrape
    config; meta.yml's scrape-failing regex updated to the actual
    exporter job names (postgres_exporter, redis_exporter,
    node_exporter, minio). R1's configs/prometheus/rules.r1/
    copies already used the correct hyphenated R1 names and are
    unaffected.

  • rc.48 dead-route cleanup follow-up. rc.48 removed the
    /v1/coins + /v1/currencies HTTP surface but left several
    stale references behind: cmd/ratesengine-sla-probe was still
    probing /coins (would 404 after rc.48 deploy → SLA-probe
    perma-fail on availability); examples/curl/04-coins.sh +
    README still advertised the removed route; web/status synthetic
    smoke probe still pointed at /v1/coins?limit=1; openapi/rates-engine.v1.yaml
    carried 3 stale /v1/coins text references (incl. the rate-limit
    example's instance field); internal/api/v1/server.go Options
    doc comments still said "backs GET /v1/coins" / "backs /v1/currencies"
    even though the seams now feed /v1/assets and /v1/chart.
    All migrated to live equivalents:

    • cmd/ratesengine-sla-probe/main.go staticEndpoints switches
      /coins/assets (same fan-out coverage; comment explains
      the rc.47 → rc.48 → rc.49 progression).
    • examples/curl/04-coins.sh deleted; replaced with 04-assets.sh
      using ?order=volume_24h_usd:desc.
    • web/status/src/app/page.tsx synthetic-probe entry switched
      to /v1/assets?limit=1 with the same Catalogue group.
    • openapi/rates-engine.v1.yaml lines 193 / 1602 / 2608
      updated.
    • internal/api/v1/server.go Options.Coins / .Currencies /
      .FXHistory doc comments rewritten to describe the actual
      /v1/assets + /v1/chart consumers.
    Net: make verify clean; go test ./internal/api/v1/... +
    ./cmd/ratesengine-sla-probe/... green.
    Closes audit findings F-1202, F-1210 (cosmetic doc-text portion),
    F-1211, F-1223, F-1245 (smoke surface), F-RFP-0017.

Tooling

  • docs/reference/api/rates-engine.v1.yaml regenerated
    from openapi/rates-engine.v1.yaml via make docs-api. The
    checked-in copy had drifted ~990 lines (561 ins / 429 del) since
    the last regeneration. web/explorer/src/api/types.ts (the
    openapi-typescript output) auto-regenerated as a transitive
    consequence (~415 lines lighter; pnpm typecheck clean). Closes
    F-1246.
  • docs/reference/config/README.md regenerated from
    internal/config/config.go via make docs-config (+6 lines).
    Closes F-1255.