Rates Engine v0.5.0-rc.49
Pre-release [v0.5.0-rc.49] - 2026-05-12
audit-2026-05-12 remediation pass — 27 audit findings closed in
code/docs/config, 14 more verified already-resolved, plus the
pre-flip blocker F-1201 explorer migration that unblocks the
rc.48 deploy to R1.
Fixed
- Explorer migrated off /v1/currencies (F-1201, pre-flip
blocker). rc.48 removed the /v1/coins + /v1/currencies
HTTP surface. The explorer had eight files still making live
calls against /v1/currencies; every one would 404 the
moment rc.48 deploys to R1. Migrated:
  - HomeCurrencies.tsx → /v1/price/batch?asset_ids=fiat:EUR,…&quote=fiat:USD
    (single RT, names hardcoded for the 6-tile home strip).
  - sitemap.ts → /v1/assets/verified filtered to class=fiat.
  - HomeTryAPI.tsx → updated example paths to /v1/assets/verified
    and /v1/assets/euro.
  - embed/currency/[ticker]/page.tsx → /v1/assets/{ticker}
    (GlobalAssetView). Sparkline + 24h/7d change degrade
    gracefully to a price-only widget; chart hookup is a follow-up.
  - AssetConverter.tsx → /v1/price/batch for the FX rate table
    (inverts so the converter's rate_usd = 1 USD = N target
    contract stays unchanged).
  - convert/[from]/[to]/ConvertPair.tsx → /v1/price/batch
    for the live from→to rate (one pair vs the old cross_rates
    bulk).
  - convert/[from]/[to]/page.tsx → /v1/assets/{from} for
    identity + /v1/price/batch for the singleton cross-rate.
  - SearchModal.tsx → /v1/assets/verified filtered to fiat
    for the ticker → /currencies/X affordance.
Zero remaining live calls to the removed routes. Typecheck +
lint + build all green.
Changed
- Multi-region tooling now handles single-region operation
gracefully (F-1234). Pre-R2/R3-bringup, only R1 is deployed.
scripts/dev/verify-cross-region.sh, ratesengine-ops
cross-region-check, and ratesengine-ops cross-region-monitor
all used to fail with "need at least 2 regions to compare"
even when called against the only deployed region. Operators
who triggered them got a confusing failure and learned to
ignore the family. Now: each command logs a one-line
"single-region — pre-launch posture, see r2-r3-bringup.md"
notice and exits 0. The check fires for real once a second
region URL is supplied. Default R1 URL in
verify-cross-region.sh points at the live public hostname
(api.ratesengine.net); R2/R3 default empty.
- R1 prometheus.r1.yml scrape coverage + rule_files path
(F-1219 + F-1220). Added scrape jobs for redis_exporter
(port 9121, installed by the redis-sentinel ansible role),
alertmanager self-scrape (so we can alert on alertmanager
being down), postgres_exporter / pgbackrest_exporter
placeholder slots (operator deploys the exporters, scrape
picks them up), and minio cluster metrics with bearer-token
auth path. Each job has a one-line comment naming the alert
family it feeds. rule_files path changed from the empty
opt-in glob /etc/prometheus/rules.d/*.yml to the canonical
/etc/prometheus/rules.r1/*.yml matching the deployed-asset
path so operators no longer have to symlink the full
configs/prometheus/rules.r1/ set into a parallel directory.
- Prometheus multi-host ↔ R1 overlay drift caught at CI (F-1222).
Multi-host rules in deploy/monitoring/rules/ use underscored
job labels (ratesengine_api) matching the multi-host ansible
scrape config; R1's single-host overlay at
configs/prometheus/rules.r1/ uses hyphenated labels matching
the R1 systemd units. Editing only one half silently breaks the
other deployment shape. New header notes pin the convention in
the canonical files; scripts/ci/lint-docs.sh now flags any
multi-host rule file without an R1-overlay sibling so the gap
surfaces at CI time. Created R1 overlays for cache.yml,
stellar.yml, and storage.yml to satisfy the new pairing
check; the underlying rules already match upstream metric
names, so no expression changes were needed.
- Tailored error for supply-observer backfill attempts (F-1243).
ratesengine-ops backfill accounts (or any of the six supply
observers: accounts, trustlines, claimable_balances,
sac_balances, sep41_supply, liquidity_pools) used to fail
with the generic "WASM-hash audit pending" error — misleading,
since supply observers aren't Soroban price/oracle sources at
all. They plug into different dispatcher hooks
(LedgerEntryChange / OpDecoder / SEP-41) and have no
historical replay path through this command. New
checkBackfillSources helper distinguishes the two cases and
emits a supply-observer-specific message pointing operators
at the supply-snapshot timer (or a future supply-backfill
command for SEP-41 windows). 3 new unit tests cover the
closed name set, the tailored error path, and the unchanged
WASM-audit gate.
Added
- Customer-webhook delivery worker wired into the API binary
(F-1270 follow-up). The worker that drains the
webhook_deliveries queue (HMAC sign + POST + retry) now runs
as a goroutine in cmd/ratesengine-api/main.go whenever the
dashboard surface comes up (Postgres reachable). Previously the
operator had to launch the worker as a separate process per
the docblock "operator-launched via internal/customerwebhook.New".
Single-binary deploy now does it inline: same context, same
lifecycle, same logger; one less ansible task.
- Customer-webhook delivery alerts + runbook (F-1270 follow-up).
Two new Prometheus alerts wired into both the multi-host and R1
rules: _delivery_failing (P3) fires when 5xx + network-error
attempts exceed 0.1/s for 15+ min (single-customer outage);
_delivery_exhausted (informational) fires when a delivery hits
the 15-attempt retry budget. New customer-webhook-delivery-failing.md
runbook covers the SQL to identify the failing webhook, the
customer-outreach template, and the worker-vs-customer triage
tree. Catalogued in alerts-catalog.md so operators see them
alongside the rest of the API alerts.
Tested
- internal/usage package gains unit-test coverage. The
per-subject daily usage counter (Redis-backed) was the one
remaining no-test package in internal/. 8 tests cover the
Increment/Read round-trip, day-boundary handling, empty-subject
no-op, retention clamp, key-prefix isolation, URL-encoded
subjects (so a : inside an IPv6 address doesn't collide with the
date separator), and the 35-day retention TTL applied on every
key.
Documented
- ADR-0012 placeholder (F-1262). Filled the numeric gap in
docs/adr/: 0011 jumped to 0013 with no file at 0012, even
though docs/adr/README.md had listed the slot as Planned
(reserved for Quorum-set composition per ADR-0004 Phase 3)
since the initial audit. The placeholder documents what the
future ADR must cover (third-party validator selection,
HALT-LIVE-DROP scoring, cross-region quorum overlap,
stellar-core [QUORUM_SET] thresholds) and what invariants it must
preserve (Tier-1 independence, no self-included validators,
≤ 33% effective weight per validator). README index now links
to the file.
- Dashboard surface bypasses the v1 envelope on purpose (F-1235).
/v1/dashboard/keys* handlers write bare JSON rather than the
data / as_of / flags envelope used by market-data
endpoints. Documented the rationale in
docs/reference/api-design.md §4.1: different audience
(dashboard React app, not SDK), session-scoped data (no
market-quality flags to carry), distinct auth path (session
cookie vs API key). Future contributors won't "fix" the
perceived drift. RFC 9457 problem responses, Cache-Control:
no-store, and X-Request-Id correlation are preserved.
Fixed
- Guard test for flags.stale=true on every fallback path (F-1254).
The stale-flag fix in internal/api/v1/price.go:287-299 (the
May-10 SEV-2 lesson: "fallback chain is itself the staleness
signal") had no regression test. New
TestPrice_FallbackChainSetsStaleFlag covers both the
triangulated and direct-rewrite fallback paths and asserts
flags.stale=true on each, so a future change that re-clears
the flag is caught immediately.
- API reference doc drift after rc.48 (F-1246). Regenerated
docs/reference/api/rates-engine.v1.yaml to match the current
OpenAPI source. Three residual /v1/coins references in the
generated file (an ?issuer= description, the home-page
summary, and an error-envelope instance example) lingered
after rc.48 removed the route. Pure regen; OpenAPI source was
already clean.
- Postman collection drift (F-1247). make docs-postman now
writes to the customer-facing canonical at
examples/postman/rates-engine.postman_collection.json;
previously it wrote to a gitignored docs/reference/api/...
path, so the tracked customer copy drifted silently every time
the OpenAPI spec moved. The docs-site build pipeline runs its
own regeneration; the in-repo file is for customers who clone
the repo to import the collection. README + Makefile docstring
updated; canonical refreshed (656k bytes).
- classic_assets.first_seen_* ordering bug under chunked
backfill (F-1239). The ON CONFLICT clause in
registerClassicAssetSeen previously updated only last_seen_*
(GREATEST) and observation_count, leaving first_seen_*
pinned to whichever ledger first hit the row, regardless of
actual chronology. Out-of-order parallel backfill (chunked
ranges processed in parallel) could leave first_seen_ledger
higher than the asset's true first observation. Fix: also
update first_seen_* with LEAST(existing, incoming). Idempotent
for forward ingest (incoming is always ≥ existing → no-op).
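The LEAST/GREATEST semantics of the first_seen_* fix can be modelled in memory to show why the result no longer depends on arrival order. This is a sketch only: the struct and function names are illustrative, and the real change lives in the SQL ON CONFLICT clause.

```go
package main

// seen mirrors the three column groups the note discusses; names are illustrative.
type seen struct {
	firstLedger uint32
	lastLedger  uint32
	count       uint64
}

// observe applies the fixed upsert semantics:
// first = LEAST(existing, incoming), last = GREATEST(existing, incoming).
func observe(row *seen, ledger uint32) {
	if row.count == 0 {
		// First insert: no conflict, both bounds start at the incoming ledger.
		row.firstLedger, row.lastLedger = ledger, ledger
	} else {
		if ledger < row.firstLedger {
			row.firstLedger = ledger
		}
		if ledger > row.lastLedger {
			row.lastLedger = ledger
		}
	}
	row.count++
}
```

Feeding ledgers 300, 100, 200 (an out-of-order chunked backfill) ends with first = 100 and last = 300, the same as feeding them in ascending order. Before the fix, the equivalent of firstLedger would have stayed at 300.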
Removed
- Unused GIN indexes on blend_auctions.bid / .lot (F-1238).
Migration 0029 drops the two JSONB GIN indexes from migration
0009. No reader in internal/storage/timescale/ queries those
columns by content: LatestBlendAuctionEvent and
ListBlendPools both filter only by pool / auction_type /
user_address / ts. The indexes added write amplification to every
blend-auction INSERT for a read path that never materialised.
Down migration restores them.
Changed
- ratesengine_ingestion_source_stopped alert window widened
to 30m × 15m (F-1212b). The pre-existing 5-min window
produced routine false positives on low-volume Soroban / FX
sources (blend auctions, ECB FX dailies, Band oracle pushes,
Comet pool swaps, Phoenix off-peak windows). On R1 this
manifested as 5 simultaneous ticket-tier alerts at any given
time; operators learned to ignore the family. The new window
waits past the natural quiet-window cadence for these sources;
total-outage coverage stays tight via the separate 3-min
ratesengine_ingestion_all_sources_stopped (P1). Rule updated
in both deploy/monitoring/rules/ingestion.yml and
configs/prometheus/rules.r1/ingestion.yml. Runbook + alerts
catalog updated to reflect the new threshold and rationale.
Documented
- RFP F4.2 one-year retention catch-up procedure (F-1265).
docs/operations/backfill-procedure.md gains a new section
walking through the 1-year catch-up backfill needed to meet
Freighter RFP F4.2's ≥1y retention commitment. Covers
resolving the target ledger window from the Galexie archive
manifest, sanity-checking upstream archive completeness,
row-count estimation, the chunked-by-week run loop with -resume
so a mid-chunk crash doesn't re-do 12 hours of work, the CAGG
force-refresh sequence, and a /v1/chart?timeframe=1y
verification step. Pre-flip operator step; the code path is
unchanged.
Added
- R1 TOML supply.watched_* defaults (F-1266). The
archival-node ansible role's ratesengine.toml.j2 template
gains a [supply] block with sensible launch-day defaults:
watched_classic_assets populated with the top Stellar-classic
verified currencies (USDC / EURC / AQUA / yXLM / VELO / BLND /
PHO / KALE) mirroring internal/currency/data/seed.yaml. Plus
inventory-overridable knobs for watched_sep41_contracts,
sdf_reserve_accounts, and reserve_balances_stroops.
Pre-F-1266: R1's TOML had no supply block → every F2 field
(market_cap_usd, fdv_usd, circulating_supply,
total_supply, max_supply) returned NULL even though the
code path is correct. The next archival-node role run will
flip every one of those fields from NULL to a real value for
the 8 watched currencies.
- Opt-in Redis ACL lockdown template (F-1213). Closes the
pre-flip Redis-ACL gap on R1 by codifying a narrow ACL config
in the redis-sentinel ansible role. New redis_acl_lockdown
flag (default false for backward compat) renders
templates/users.acl.j2 to /etc/redis/users.acl, references
it from redis.conf.j2 via aclfile, and:
  - Disables the legacy default user (off nopass nocommands)
    so no password-less access path remains.
  - Creates a ratesengine application user with
    +@read +@write +@scripting +@pubsub +@connection minus the
    @admin + @dangerous families, scoped via ~prefix:* to
    exactly the cache key prefixes the application uses (vwap,
    confidence, freeze, div, ratelimit, signup-ip, toml, meta,
    price, apikey, health, oracle, subscriber) plus the pub/sub
    channels (&closed-bucket-*, &stream-*).
Application binaries get a new [storage].redis_username TOML
key (default empty = legacy path; operators set it to
ratesengine when they flip the lockdown). redisclient.Build
threads it into both the FailoverClient and single-node code
paths. Commented-out per-component (re_aggregator /
re_api / re_indexer) users in the template show the
follow-on split when operators add per-binary passwords.
- L2.2 Phase 2 FX-anchor USD volume coverage (F-1268). New
timescale.VWAPUSDFXResolver implements the pre-existing
USDVolumeFXResolver interface against the prices_1m
CAGG: for any on-chain quote asset not already on the
operator's [trades].usd_pegged_classic_assets list, the
resolver looks up <quote>/<USD-peg> at the trade's
timestamp and supplies a per-minute-bucket-cached USD rate
that tradeUSDVolume multiplies through. Pre-Phase-2: only
CEX/FX + operator-allow-listed pegs contributed to
volume_24h_usd; an EURC/XLM Soroswap trade contributed 0
even though we had a fresh EURC/USDC VWAP one minute earlier.
Now it inherits USD value through the peg chain. Wired
alongside the Phase 1 quote spec in
cmd/ratesengine-indexer/main.go whenever
usd_pegged_classic_assets is non-empty (no new config
knob). 7 unit tests cover defaults, cache hits, negative
cache, TTL expiry, minute-bucket key stability. The
AssetDetail volume_24h_usd docstring rewritten to
document the three-tier coverage chain (Phase 1 off-chain +
Phase 1 on-chain pegs + Phase 2 FX-anchor).
- Customer-facing dashboard webhook CRUD handlers (F-1270
complete). New internal/api/v1/dashboardwebhooks package
mounts five routes: GET/POST/PATCH/DELETE
/v1/dashboard/webhooks + GET
/v1/dashboard/webhooks/{id}/deliveries. Session-gated,
role-gated (Owner/Admin/Member create; Viewer/Billing 403),
cross-account 404 (no existence-leak), 10-per-account quota,
HTTPS-only URLs, closed-set event validation, secret returned
ONCE on create. OpenAPI spec adds 5 paths + 5 schemas;
postman + api docs regenerated. 8 unit tests cover happy
path, 401, 403, malformed URL, unknown event, quota, list
scoping, cross-account delete. Wired into the v1 server via
the same DashboardAuthMounter pattern as keys, and into
main.go's buildDashboardBundle so the handlers come up
whenever Postgres is reachable.
- Customer-webhook delivery worker (F-1270 close-out).
New internal/customerwebhook package drains the queue the
store wrote in the prior commit: poll-loop drains
ListPendingDeliveries, HMAC-SHA-256 signs the payload,
POSTs to the customer URL with X-RatesEngine-Signature +
X-RatesEngine-Event + X-RatesEngine-Delivery-Id headers,
marks delivered on 2xx, schedules retry on 5xx/network
(exponential backoff 30s → 1h cap, 15-attempt budget),
terminates on 4xx / disabled-webhook / missing-webhook /
malformed-URL. New
ratesengine_customer_webhook_delivery_attempts_total
counter labelled by 10 outcomes; documented in metrics ref
with two alert recipes. 5 unit tests cover the happy path,
5xx-retry, 4xx-terminal, disabled-webhook, missing-webhook.
- postgresstore.WebhookStore customer-webhook data plane
(F-1270 partial). Implements the existing
platform.WebhookStore interface against the
customer_webhooks + webhook_deliveries tables from
migration 0027: Create / Get / List / Update / Delete on the
registry; Enqueue / ListPending / MarkDelivered /
MarkAttemptFailed on the delivery queue; Append / Update /
ListDeliveries on the dashboard delivery log. Four new
WebhookEventType constants (incident.sev1,
incident.resolved, anomaly.freeze, divergence.firing)
pin the closed event set without forcing the schema to use an
enum. New integration subtest
WebhookStore/CRUD+queue covers the full lifecycle:
create → list → update → enqueue → fail-with-retry →
enqueue → mark-delivered → list-history → delete-cascades.
RotateWebhookSecret is a tagged stub pending the dashboard
CRUD handlers. Delivery worker + customer-facing API are
follow-up commits.
- Inline price_usd on /v1/assets/{id} (F-1271). The
asset-detail body now carries price_usd whenever the price
lookup succeeds; previously it only surfaced via the optional
coins-overlay block (assets not in the coins catalogue had a
null price_usd even though the same handler was already
fetching the price for market_cap_usd). Freighter wallet +
retail apps that just want the current price no longer pay
a second /v1/price round-trip on every asset-detail render.
Extracted populatePriceUSD runs before the supply early
return so off-chain assets without a supply snapshot also get
the field; populateMarketCap now re-uses the already-inlined
price instead of paying for a second lookup. OpenAPI spec
updated; postman + api docs regenerated. 1 new unit test
covers the no-supply path.
- postgresstore.BillingStore subscription mirror (F-1231).
UpsertSubscription and GetActiveSubscriptionForAccount,
previously stubbed, now hit the subscriptions table from
migration 0027. UPSERT is idempotent on stripe_subscription_id
so a re-delivered webhook updates plan + period without
duplicating rows. GetActiveSubscriptionForAccount enforces
both the period-end and canceled-at semantics from
platform.Subscription.IsActive. The Stripe webhook handler
wire-up (which would need to resolve stripe_customer_id →
account_id + extract subscription IDs from the event
payload) is the next layer; this commit lands the store half
so the data path is end-to-end-ready. New integration
subtest BillingStore/Subscription/UpsertAndGetActive covers
insert / idempotent update / expired / validation paths.
- Stripe webhook tier-upgrade audit log (F-1240). New
internal/platform/postgresstore.AuditStore implements the
platform.AuditStore interface against the audit_log table
from migration 0027 (Append / AppendBatch / List). The Stripe
webhook handler now writes one plan.upgrade audit row per
successful upgrade event (one row per event, not per key;
metadata carries identifier + tier + key counts so the
dashboard can render "the upgrade happened" without N rows
for a customer holding N keys). StripeWebhookConfig.Audit
is a narrow StripeAuditSink interface so the v1 package
doesn't import the full audit-store surface. Append failures
log at WARN and never block the webhook ack: audit-log
unavailability must not turn a successful Stripe upgrade
into a Stripe retry storm. 3 new unit tests cover the happy
path, the nil-sink legacy fallback, and the swallowed-error
contract.
- Depeg-scenario test wiring stablecoin late binding ↔ divergence
worker (F-1230). ADR-0026's stablecoin late binding deliberately
conceals stablecoin↔fiat drift so XLM/USDC trades flow into the
same XLM/USD bucket as XLM/USDT; the divergence worker is the
designed safety net that fires flags.divergence_warning when
the concealed price drifts from external references. The two
components had no test wiring them together; nothing would catch
a regression that broke either side. New
divergence/depeg_test.go exercises the round-trip:
  - TestStablecoinDepeg_DivergenceWorkerFires: aggregate.ProxyTrade
    rewrites XLM/USDC → XLM/fiat:USD, the aggregator publishes
    a price assuming USDC=$1, references show the true XLM/USD
    after USDC depegged to $0.95, and the worker fires
    WarningFired=true on the resulting ~5.3% delta.
  - TestStablecoinPegHolds_DivergenceWorkerStaysQuiet: symmetric
    negative case so a future change can't make the warning fire
    on the steady state.
- Guard tests for two CLAUDE.md surprises (F-1242).
Locks behaviours that no production test previously asserted:
  - comet.TestDecodeSwap_DispatchIsByTopicNotContract proves
    that two events with different ContractIDs but the same
    (POOL, swap) topic both decode to Source="comet", i.e.
    the Comet decoder is generic Balancer-v1, not
    contract-specific. A future change that narrows the decoder to a
    specific allow-list would silently drop trades from any new
    Balancer-v1 deployment; this test fires first.
  - sep41_supply.TestDecoder_CAP67_FourTopic_BackCompat exercises
    mint / burn / clawback events with both pre-P23 (3/2/3 topics)
    and post-P23/CAP-67 (4/3/4 topics with sep0011_asset)
    arities. The decoder reads counterparty positionally and
    must ignore the optional 4th topic; a future contributor who
    naively asserts topic length would break the post-P23 path.
The third surprise (SEP-41 transfer dual i128/Map shape) has
no production transfer-amount decoder yet, so the dual-shape
guard already lives in sac_balances.TestObserver_Decode{I128, MapVal};
documented as such in the audit register.
- Per-request CORS observability metric (F-1244). New
ratesengine_api_cors_decisions_total{outcome} counter wired
into the CORS middleware. Outcomes: no_origin /
allowed_origin / allowed_wildcard / denied. The
pre-existing warnOpenCORS startup-only check fires once at
boot then drifts out of memory; this counter is the per-request
companion so operators can dashboard real cross-origin traffic
and alert when a wildcard policy starts handling actual
cross-origin requests in production (the silent failure mode of
RATESENGINE_ALLOWED_ORIGINS=* slipping in alongside
credentialed auth_mode). Wired into the existing middleware
without changing public CORS behaviour; one new test case covers
all four outcomes.
- Freeze EventSink LKG VWAP + recovery worker (F-1228 + F-1229).
freeze.EventSink.RecordFreeze and freeze.Writer.Mark now
carry the last-known-good VWAP we're freezing on as a
fixed-precision decimal string (orchestrator passes
formatRatFixed(prev, 12)); the timescale sink stamps it on
the new freeze_events row instead of the previous hardcoded
frozen_value = 0. The recovery worker is the inverse half:
every 60s it lists open freeze_events rows, checks whether
the Redis marker still exists, and calls MarkRecovered when
the marker is gone (TTL elapsed → underlying anomaly cleared).
Without it, durable rows accumulated forever and the explorer
/anomalies timeline showed resolved freezes as still-firing.
New metrics ratesengine_anomaly_freeze_recovered_total and
ratesengine_anomaly_freeze_recovery_sweeps_total{outcome},
new alert ratesengine_anomaly_freeze_recovery_stalled (P3),
new runbook freeze-recovery-stalled.md. Two phase-1 + phase-2
orchestrator callers updated to thread the prevVWAP through.
3 new unit tests + extended existing freeze + orchestrator
tests.
- Per-IP signup throttle (F-1232). New v1.SignupIPThrottle
interface + auth.RedisSignupIPThrottle Redis-backed
implementation. Default 5 signups per IP per hour via
INCR + EXPIRE sliding window. The global anonymous rate
limit (60/min/IP) is plenty for browsing public surfaces but
lets a single IP bulk-mint 3,600 email→key_id pairs/hour via
signup. The new throttle closes that vector without affecting
other anonymous traffic; falls open on Redis errors. Wired in
cmd/ratesengine-api/main.go whenever Redis is available.
New auth.ErrSignupRateLimited sentinel + exported
middleware.RemoteIP for handlers needing trusted-proxy-aware
client IP outside the middleware chain. 5 unit tests
(under-cap, over-cap, distinct IPs, empty IP falls open,
defaults applied).
- Stripe webhook event dedupe (F-1227). New
internal/platform/postgresstore.BillingStore implements the
AppendStripeEvent / MarkStripeEventProcessed /
MarkStripeEventFailed triple from
internal/platform/billing.go against the stripe_event_log
table from migration 0027. The webhook handler now claims a
dedupe slot with INSERT INTO stripe_event_log BEFORE running
any side effects; ErrAlreadyProcessed (Postgres
23505 unique_violation) signals "we've already done this work"
and acks 200 immediately without re-running the upgrade. Stripe
at-least-once delivery means the same event can land hours
later — without this guard, a manual operator-side downgrade
between original delivery and redelivery silently re-upgrades
the customer. Wired in cmd/ratesengine-api/main.go to the same
*sql.DB the timescale store uses; falls open to the legacy
"rely on idempotent UpdateRateLimit" path when Postgres is
absent. Two new unit tests pin the contract
(duplicate-doesn't-reupgrade + nil-events-store-falls-back).
UpsertSubscription + GetActiveSubscriptionForAccount stubbed
pending Phase-2 / F-1231.
- SEP-10 challenge-replay defence (F-1224). Added a
sep10.ReplayGuard interface + sep10.RedisReplayGuard
Redis-backed implementation. After a challenge XDR clears
txnbuild.VerifyChallengeTxSigners, the validator hashes the
signed XDR with SHA-256 and SETNX's the dedupe key
(sep10:seen:<base64-url-no-pad>) with TTL = ChallengeTTL.
A second submission of the same signed XDR finds the slot
taken and returns auth.ErrUnauthorized instead of minting a
fresh JWT. Wired in cmd/ratesengine-api/main.go to the
same Redis client the rest of the auth subsystem uses;
initial validator construction at main.go:144 happens before
rdb is available, so the validator is rebuilt with the guard
once rdb exists. miniredis-backed unit tests pin the
contracts (first claim ok, replay rejected, TTL expiry allows
fresh claim, distinct hashes don't collide).
- ratesengine_aggregator_vwap_cache_write_errors_total metric +
paired ratesengine_aggregator_cache_write_errors page-tier
alert. The May-10 SEV-2 (Redis BGSAVE blocked by full root FS for
~9 h → every cache Set returned MISCONF → /v1/price 404'd on
every rewritten / triangulated / stablecoin-proxy pair) had no
upstream signal in monitoring: flags.stale did not flip
because the aggregator process was alive and ticking, just unable
to publish. The post-mortem (internal/incidents/data/2026-05-10-redis-writes-blocked-disk-full.md)
explicitly recommended "alert on aggregator WARN rate (not just
service-up status)" — this counter realises that recommendation
as the cleanest signal: any non-zero rate(...[5m]) for ≥ 2 min
pages. Increments at the single cache-write failure point in
internal/aggregate/orchestrator/orchestrator.go:653. Closes
audit-2026-05-12 F-1253; supports F-1254 (flags.stale semantic
bug, separate fix).
Fixed
- Postgres max_locks_per_transaction = 256 codified (F-1251).
The 2026-05-06 SEV-3 (internal/incidents/data/2026-05-06-postgres-lock-table-full.md)
hit "out of shared memory" (53200) when the per-tx lock table
saturated under concurrent ingest from many sources. The
operator bumped this knob to 256 by hand on R1; un-codified, a
from-scratch R1 rebuild or R2/R3 cutover would inherit the
Postgres default of 64 and re-experience the same incident
class. Now templated by archival-node/templates/postgresql.conf.j2
with default postgres_max_locks_per_transaction: 256 (4×
headroom; 51,200-entry lock table at the current 200-connection
limit). Paired with new ratesengine_timescale_lock_table_pressure
Prometheus alert at 70% saturation so the next bump is
forecast, not forced. The alert depends on postgres_exporter (not yet
scraped on R1; rule lights up when the exporter lands).
- web/status/wrangler.toml added (F-1245). Mirrors the
explorer + dashboard wrangler.toml shape so Cloudflare Pages
git-integration deploy works without manual project setup.
- web/explorer/src/app/oracles/OraclesView.tsx ESLint
react-hooks/exhaustive-deps warning fixed (F-1258).
streamRows was a fresh [] on every render when
streams.data was undefined, causing the downstream useMemo
to recompute every tick. Wrapped in its own useMemo for
referential stability.
- internal/sources/comet/adapter_test.go: pin
topic-vs-contract-id contract (F-1242). New
TestDecoder_Decode_NoContractIDDiscrimination makes the
CLAUDE.md surprise-list claim ("Comet decoder matches by
topic, not contract address") executable. Any future change
adding a contract-id allow-list at the decoder layer (instead
of downstream filtering) MUST flip the assertion.
- F-1228 + F-1229 acknowledged but deferred to a separate
refactor. freeze_events.frozen_value is always written as 0, and
MarkRecovered has zero callers. The structural fix
(extendfreeze.EventSink.RecordFreezeto accept the LKG
VWAP, plus wire a recovery worker that calls MarkRecovered
on Redis-marker TTL expiry) touches the EventSink interface
used by 3 packages + tests. Both medium-severity, neither
blocks the public flip.
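The lock-table figures quoted in the max_locks_per_transaction item can be checked with a few lines of arithmetic. The formula is the one the PostgreSQL documentation gives for the shared lock table; the function names are illustrative, and max_prepared_transactions = 0 is an assumption implied by the 51,200 figure.

```go
package main

// lockTableSlots returns the shared lock table capacity:
// max_locks_per_transaction * (max_connections + max_prepared_transactions).
func lockTableSlots(maxLocksPerTx, maxConnections, maxPreparedTx int) int {
	return maxLocksPerTx * (maxConnections + maxPreparedTx)
}

// pressureThreshold returns the held-lock count at which a saturation
// alert set at pct percent would start firing.
func pressureThreshold(slots, pct int) int {
	return slots * pct / 100
}
```

With 200 connections, the default of 64 gives 12,800 slots and the codified 256 gives 51,200, which is the 4× headroom mentioned above. A 70% alert on the larger table would trip at 35,840 held locks.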
Investigated, no code change
- N-1262 ADR-0012 missing from disk: turns out to be an
intentional reservation, documented in docs/adr/README.md:56:
"0012 | Planned | Quorum-set composition (referenced by
multi-region-topology) | —". Per the ADR README's
"gaps allowed when reserved" rule. F-1262 closed as invalid.
- flags.stale semantic bug fixed (F-1254).
internal/api/v1/price.go reset stale = false after falling
through to priceFallback (last-trade / stablecoin proxy /
triangulation). The May-10 SEV-2 (Redis BGSAVE blocked → cache
empty → every closed-bucket read hit ErrPriceNotFound →
priceFallback served last-trade for ~9 h) hit this path: the
customer-visible response was the fallback, but flags.stale
was false. Per ADR-0018 §"flags.stale semantic" and the doc
comment on Flags.Stale, fallback responses ARE stale by
definition. Set stale = ok on the fallback branch in both
the single-asset and /v1/price/batch paths so any non-VWAP
response now correctly carries stale=true. Companion fix
to F-1253's cache-write-error counter (the upstream signal)
and F-1252 (the alert-routing the May-10 incident exposed).
- /v1/price/batch sources nondeterminism (F-1259).
internal/api/v1/price.go:902-905 lookupPriceBatch unioned
per-row sources through a map[string]struct{} and emitted
them in map-iteration order, breaking the ADR-0015
byte-identical cross-region property for batch responses.
Added sort.Strings(srcs) before writeJSON so batch
responses match the single-asset path's stable lexical order
(set by timescale.normalizeVwapSources at the storage
boundary per F-0016 closure).
- Cache-Control gap on credential surfaces (F-1225).
/v1/auth/login, /v1/auth/callback, /v1/auth/logout,
/v1/dashboard/keys*, /v1/signup, /v1/webhooks/stripe,
/v1/price/stream, /v1/methodology, and /v1/incidents.atom
all fell through policyForPath's switch with no case match,
emitting no Cache-Control header. Most concerning was
/v1/auth/callback: a CDN in front of the API could have
cached the magic-link consume response and re-issued the
session cookie to subsequent requests. Added explicit cases:
every /v1/auth/* and /v1/dashboard/* and the two
state-changing surfaces (/v1/signup, /v1/webhooks/stripe)
use private, no-store; /v1/price/stream uses no-store;
/v1/methodology and /v1/incidents.atom get explicit
public-cache policies appropriate to their content cadence. -
4 of 4 make test-integration failures (F-1250).
  - TestPlatformPostgresStores/APIKey/CRUD+revoke+touch:
    test/integration/platform_postgres_stores_test.go:400,510
    constructed key IDs as "kid_" + uuid.New().String()[:12],
    which contains a hyphen at position 9, violating the
    migration-0027 check id ~ '^kid_[a-f0-9]{12,}$'. Switched
    to strings.ReplaceAll(uuid.New().String(), "-", "")[:12]
    (12 hex chars).
  - TestEndToEnd_LedgerstreamToTimescale/soroban_LCM_with_reflector_FX_update
    and TestTradesInRangeAndMarkets: both used hand-crafted
    G-strkeys (GA7QYNF7…UWDA and GA5ZSEJYB…ZVM) with invalid
    CRCs. The strkey package now enforces CRC; tests switched
    to AQUA's real mainnet G-strkey
    (GBNZILSTVQZ4R7IKQDGHYGY2QXL5QOFJYQMXPKWRRM5PAV7Y4M67AQUA),
    which round-trips cleanly and is distinct from USDC's
    issuer.
  - TestSupplyStorageRoundTrip: schema/reader drift.
    Migration 0005_create_asset_supply_history.up.sql:60
    creates a UNIQUE index on (asset_key, ledger_sequence, time)
    (TimescaleDB requires the partition column in any unique
    index on a hypertable), but internal/storage/timescale/supply.go:47
    used ON CONFLICT (asset_key, ledger_sequence) DO NOTHING.
    Postgres requires an exact column-set match; the INSERT
    failed with 42P10. Updated the conflict target to all 3
    columns and revised the doc comment to explain the
    invariant preservation.
  - Plus: in TestTradesInRangeAndMarkets, DistinctPairs returned
    0 markets after the strkey fix because the test inserted
    into trades directly but DistinctPairs reads from the
    prices_1m continuous aggregate (post rc.45 commit
    8717bc20). Added a CALL refresh_continuous_aggregate('prices_1m', NULL, NULL)
    before the assertion, mirroring test/integration/api_test.go:65-74.
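Illustrative sketch of the key-ID failure described in the first item above. To stay stdlib-only it uses a fixed UUID string as a stand-in for uuid.New().String(); the helper names oldKeyID and newKeyID are hypothetical. The regular expression is the migration-0027 check as quoted in this entry.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// keyIDPattern mirrors the migration-0027 check quoted in the changelog.
var keyIDPattern = regexp.MustCompile(`^kid_[a-f0-9]{12,}$`)

// sampleUUID is a fixed stand-in for uuid.New().String().
const sampleUUID = "123e4567-e89b-12d3-a456-426614174000"

// oldKeyID slices the raw UUID text, so the hyphen after the first
// eight hex digits ends up inside the ID.
func oldKeyID(u string) string { return "kid_" + u[:12] }

// newKeyID strips hyphens first, leaving 12 hex characters.
func newKeyID(u string) string {
	return "kid_" + strings.ReplaceAll(u, "-", "")[:12]
}

func main() {
	for _, id := range []string{oldKeyID(sampleUUID), newKeyID(sampleUUID)} {
		fmt.Printf("%-18s matches check: %v\n", id, keyIDPattern.MatchString(id))
	}
}
```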
-
R1 alert blackout closed: 9 alert families wired up, textfile
evidence chain repaired (F-1219 + F-1220 + F-1221 + F-1252).
Pre-change R1 loaded only 6 of 18 rule families
(aggregator/api/infra/ingestion/meta/slo); every alert in
anomaly, divergence, external-pollers, supply,
supply-snapshot, supply-refresh, archive-completeness,
verify-archive, sla-probe was permanently silent. The
SLA-evidence chain specifically was broken end-to-end: the probe
binary supports -textfile-output
(cmd/ratesengine-sla-probe/textfile.go:190 writeTextfileAtomic) but the R1 wrapper at
configs/healthchecks/sla-probe.sh never set it, the
textfile-collector dir didn't exist, and node_exporter ran
without --collector.textfile. Three changes close the chain:
  - configs/ansible/roles/archival-node/tasks/10-observability.yml
    now provisions /var/lib/node_exporter/textfile_collector/
    and adds --collector.textfile + --collector.textfile.directory
    to the node_exporter systemd unit.
  - configs/healthchecks/sla-probe.sh now defaults
    SLA_PROBE_TEXTFILE_OUTPUT=/var/lib/node_exporter/textfile_collector/sla_probe.prom
    and passes -textfile-output $value conditionally (preserves
    the opt-out for operators that set the env var blank).
  - configs/prometheus/rules.r1/ gains 9 rule files copied
    verbatim from deploy/monitoring/rules/ (none of them had
    job-label refs requiring single-host adaptation). README
    table updated; rules cache.yml / storage.yml / stellar.yml
    stay excluded with a clear note (redis_exporter +
    postgres_exporter + stellar-core-prometheus-exporter are
    not on R1).
-
Source-stopped alert false-positive class on low-volume
Soroban contracts (F-1212b). ratesengine_ingestion_source_stopped
used a 5-min rate window which routinely false-fired on
band, blend, comet, ecb, phoenix (legitimate 5+-minute
gaps during quiet trading windows; the source-stopped runbook
itself acknowledges this at line 60). Widened to a 30-min rate
window + 15-min for: in both deploy/monitoring/rules/ingestion.yml
and configs/prometheus/rules.r1/ingestion.yml. Total-outage
coverage stays tight via the separate _all_sources_stopped
alert at 3 min, which continues to catch the
upstream-broke-across-the-fleet case.
-
Multi-host alert rule job labels (F-1222).
deploy/monitoring/rules/api.yml / aggregator.yml /
ingestion.yml / slo.yml / meta.yml referenced job="api"
/ "aggregator" / "indexer" but the multi-host ansible
prometheus role's scrape config uses ratesengine_api /
ratesengine_aggregator / ratesengine_indexer (underscores).
Rules would never have evaluated true on a multi-host deploy.
Renamed the canonical multi-host labels to match the scrape
config; meta.yml's scrape-failing regex updated to the actual
exporter job names (postgres_exporter, redis_exporter,
node_exporter, minio). R1's configs/prometheus/rules.r1/
copies already used the correct hyphenated R1 names and are
unaffected.
-
rc.48 dead-route cleanup follow-up. rc.48 removed the
/v1/coins + /v1/currencies HTTP surface but left several
stale references behind: cmd/ratesengine-sla-probe was still
probing /coins (would 404 after rc.48 deploy → SLA-probe
perma-fail on availability); examples/curl/04-coins.sh +
README still advertised the removed route; web/status synthetic
smoke probe still pointed at /v1/coins?limit=1; openapi/rates-engine.v1.yaml
carried 3 stale /v1/coins text references (incl. the rate-limit
example's instance field); internal/api/v1/server.go Options
doc comments still said "backs GET /v1/coins" / "backs /v1/currencies"
even though the seams now feed /v1/assets and /v1/chart.
All migrated to live equivalents:
  - cmd/ratesengine-sla-probe/main.go staticEndpoints switches
    /coins → /assets (same fan-out coverage; comment explains
    the rc.47 → rc.48 → rc.49 progression).
  - examples/curl/04-coins.sh deleted; replaced with 04-assets.sh
    using ?order=volume_24h_usd:desc.
  - web/status/src/app/page.tsx synthetic-probe entry switched
    to /v1/assets?limit=1 with the same Catalogue group.
  - openapi/rates-engine.v1.yaml lines 193 / 1602 / 2608
    updated.
  - internal/api/v1/server.go Options.Coins / .Currencies /
    .FXHistory doc comments rewritten to describe the actual
    /v1/assets + /v1/chart consumers.
Net: make verify clean; go test ./internal/api/v1/... +
./cmd/ratesengine-sla-probe/... green.
Closes audit findings F-1202, F-1210 (cosmetic doc-text portion),
F-1211, F-1223, F-1245 (smoke surface), F-RFP-0017.
Tooling
- docs/reference/api/rates-engine.v1.yaml regenerated
  from openapi/rates-engine.v1.yaml via make docs-api. The
  checked-in copy had drifted ~990 lines (561 ins / 429 del) since
  the last regeneration. web/explorer/src/api/types.ts (the
  openapi-typescript output) auto-regenerated as a transitive
  consequence (~415 lines lighter; pnpm typecheck clean). Closes
  F-1246.
- docs/reference/config/README.md regenerated from
  internal/config/config.go via make docs-config (+6 lines).
  Closes F-1255.