Rates Engine v0.5.0-rc.83
Pre-release
[v0.5.0-rc.83] — 2026-05-28
Tested against Stellar Protocol 23 (Whisk).
Added
- `ratesengine_api_price_stale` alert (both R1 overlay + multi-host) gets an `absent_over_time` OR-branch so the cascade-wedge case fires instead of going no-data silent (F-0104 closure). The staleness gauge is emitted by the aggregator at end-of-tick; when the aggregator wedges, the gauge stops being scraped, the series goes stale, and a bare `> 120` predicate sees no-data — i.e. the alert designed to catch exactly that cascade was itself a victim of it. New expr: `staleness > 120 OR absent_over_time(...[10m]) == 1`. Same pattern as `aggregator_silent` (F-0080) and the exporter-down meta-alerts (F-0085). Annotation updated so operators reading the page know to consult the aggregator-silent runbook when the gauge is absent rather than the price-stale runbook.
- `http_request_success_duration_seconds` histogram (F-0105 closure). The middleware records into this metric only when the response status is < 500 and not 499 (client-aborted). The latency SLO recording rules in `slo.yml` (both R1 overlay + multi-host) now use `http_request_success_duration_seconds_bucket{le="0.2"}` for the fast-success numerator while keeping `http_request_duration_seconds_count` as the all-request denominator. Pre-this-PR a 5 ms 500 landed in the same histogram as a 5 ms 200 and reported as "good fast" against the SLO, even though the customer experience was a hard outage. After: a fast 5xx burns the latency budget (numerator excludes the error, denominator counts it). One new regression test pins `_success_duration_seconds_count` at 0 for a synthetic 500. Availability SLO (`http_requests_total{status_class=5xx}`) is unchanged; this PR only fixes the latency dimension.
- `/v1/diagnostics/cursors` distinguishes transient storage errors (503 `cursors-transient` + `cursors-timeout`) from genuine 500s. Under the F-0039 cascade this operator-diagnostic was the most-needed surface but returned the same opaque 500 for "postgres briefly stalled" and "endpoint permanently broken" — operators couldn't tell whether to retry or escalate.
Now: 5s ctx-timeout on the `ListCursors` call; deadline-exceeded → 503 `cursors-timeout`; `transientStorageErr` (driver-bad-connection / 57014 cancel / broken-pipe / EOF) → 503 `cursors-transient`; client-aborted is filtered; the residual 500 is reserved for genuinely-unknown errors. Two new tests pin transient → 503 and non-transient → 500. `handleCursors` extracts the seven-branch error map into `writeCursorsListError` to stay under the gocognit ceiling. Closes F-0094 (audit 2026-05-26).
- Bounded-cardinality counters now pre-seed their well-known label combos at startup so alert PromQL is well-defined before the first event fires (F-0033 closure).
`ratesengine_aggregator_triangulations_total` seeds `{outcome=ok|missing_leg|parse_error|redis_error}`; `ratesengine_stripe_platform_sync_errors_total` seeds `{operation=get_account|upsert_subscription|account_update|list_keys|key_update}`. Pre-this-PR `rate(...{outcome="ok"}[15m])` resolved to "no data" until the first triangulation landed — the audit found multiple alert rules whose underlying metric was "missing from scrape output" for this reason. The fix is an `obs.init()`-time `.WithLabelValues(...)` call (no-op, `Inc`-less), which is enough to publish the series at zero. Counters with unbounded per-pair labels (`AggregatorFXSnapFallbackTotal`) stay emit-on-error. New `TestZeroSeed_F0033` pins the 9 expected series at `0` in the scrape body. The other two metrics F-0033 flagged (`ratesengine_ledgerstream_tier_read_total`, `ratesengine_stellar_archive_publish_errors_total`) are intentionally inert today — both are documented in their respective files as Phase-3 / cold-tier reservations.
- Param-name aliasing extended across the rest of the
`asset=`-canonical endpoints (F-0068, F-0091, F-0073 closure). `/v1/observations` and `/v1/chart` now both accept `base=` as an alias for `asset=`; passing both is a 400. `/v1/price/batch` accepts `pairs=` as an alias for `asset_ids=` so CG-style callers calling the request "pairs" reach the endpoint without a 400 detour; both-supplied is a 400. New shared resolver `resolveAssetOrBaseParam` in `price.go` factors out the asset/base alias logic so future `asset=`-canonical endpoints inherit the contract for free. Six new tests (chart × 2, observations × 2, price-batch × 2) pin alias-accepted + both-supplied-rejected. Closes F-0068, F-0073, F-0091 — completes the cluster started by F-0061.
- `asset` and `base` query parameters are now interchangeable across `/v1/price` (`asset=` canonical, `base=` accepted) and every endpoint that flows through `parseBaseQuote` — `/v1/history`, `/v1/twap`, `/v1/vwap`, `/v1/ohlc` — (`base=` canonical, `asset=` accepted). Developers copying URLs between endpoints no longer get the F-0061 two-step rejection. Passing BOTH `base` and `asset` returns 400 `invalid-parameter` with a self-explanatory message about which form is canonical for the endpoint they're calling, avoiding silent precedence picks. The pre-existing helpful "this endpoint uses base/quote (not asset/quote)" detail string is replaced with the alias acceptance + the mutually-exclusive 400 — the redirect was a workaround the alias makes unnecessary. New tests pin (a) `asset=` accepted as `base=` alias on `/v1/history`, (b) both-supplied returns 400 with "mutually exclusive" in the body. `handlePrice` extraction into `parsePriceAssetParam` keeps the handler under the gocognit ceiling. Closes F-0061 (audit 2026-05-26).
- `window=` query parameter on every endpoint that uses `parseFromTo` (so `/v1/twap`, `/v1/vwap`, `/v1/ohlc`, `/v1/history` — and any future endpoint that picks up the helper). Convenience shorthand for `from = to - window` so CG-style customers don't have to compute it; pre-this-PR the param was silently ignored and `/v1/twap?window=24h` returned a 1h-default 404 with no explanation.
Accepts Go's [time.ParseDuration] units (`ns`, `us`, `ms`, `s`, `m`, `h`, including compound `1h30m`) plus a trailing-`d` shortcut for days (`7d` = 168h). Combining `window=` with an explicit `from` is now a 400 — they're conflicting controls for the same value, and rejecting it loudly catches the F-0072 surprise. Three new internal-test functions pin happy-path (hours / minutes / days / compound), conflict rejection, and reject-malformed (`garbage`, `1x`, `1d2h`, `-5h`, `0`). Closes F-0072 (audit 2026-05-26).
- `ratesengine-ops backfill` chunk-complete log now reports both `chunk_size_ledgers` (the [from,to] range) and `ledgers_walked` (the LCM-callback count from the bucket). The previous `ledgers=N` field reported the range size — operators ran a backfill against an empty bucket (F-0159: `-bucket galexie-archive` for a range that lived only in `galexie-live`) and got `ledgers=5331` in the chunk-complete log after a 200ms run. With this change, the same scenario logs `chunk_size_ledgers=5331 ledgers_walked=0` and also returns an explicit error: `backfill walked 0 of 5331 ledgers in range [...] from bucket "galexie-archive" — bucket likely has no files in this range; check --bucket and the galexie-archive/-live mirror for the target range`. The chunk fails loudly instead of silently succeeding. Closes F-0159 (audit 2026-05-26).
- TLS cert expiry self-probe (F-0051). API binary now runs a goroutine that
`tls.Dial`s each configured public hostname every 6 h, extracts the leaf cert's `NotAfter`, and emits it as `ratesengine_tls_cert_not_after_unix{host}`. New alert `ratesengine_tls_cert_expiring_soon` (P2, both R1 overlay + multi-host) fires when `(NotAfter - time()) < 14 days` sustained 1 h. Default hosts list covers `api.ratesengine.net` + `status.ratesengine.net` + `ratesengine.net` (apex); operators override via `[api].tls_cert_probe_hosts`. Companion `ratesengine_tls_cert_probe_total{host, outcome}` counter exposes probe health (ok / dial_error / timeout / no_cert). Runbook at `docs/operations/runbooks/tls-cert-expiring-soon.md` documents 5-min triage, five likely root causes (ACME rate limit, DNS-01 failing, HTTP-01 firewall, disk full, Caddy crashed), and manual renewal sequence. Closes the "Caddy auto-renews but if it fails we don't know until expiry" gap. 5 unit tests pin the probe behaviour including a self-signed httptest TLS server for happy-path coverage.
- Operator-config wiring for the new per-asset supply-refresh stale-component overrides:
`[supply].stale_component_ledgers_by_asset` map (asset_key → ledger threshold) is now consumed by all three refresher builders (classic / SEP-41 / XLM). Operators set this in `ratesengine.toml` and the aggregator picks per-asset overrides at startup; empty map preserves the global default for every asset. Concrete deployment example documented in the config doc. F-0040 follow-up to the library knob shipped earlier.
- Per-asset stale-component threshold override for supply refresher (
`supply.WithStaleComponentLedgersFor(assetKey, maxLag)`). F-0040 audit (2026-05-26): PHO governance-token snapshots were being rejected at gap ≈1190 ledgers (~100 min) because of the global 1000-ledger threshold; PHO is low-activity and 1200-ledger lag is normal. Operators can now relax the gate per-asset (e.g. `PHO → 5000`) without loosening the gate for high-activity XLM/USDC. Two new tests pin (a) the relaxed asset accepts what the global default rejects and (b) the override doesn't bleed into other assets. Caller wires per-asset overrides via `supply.NewRefresher(..., WithStaleComponentLedgersFor(...))`.
- `make verify-r1-sync` now checks for pending Postgres migrations on r1 too. Compares the highest `migrations/NNNN_*.up.sql` number locally against `schema_migrations.version` on r1's Postgres and prints the exact scp+migrate-up command if local is ahead. Closes a real gap: rc.83 adds two columns/tables (migration 0046 `ingestion_cursors.first_ledger`, 0047 `sep41_transfers` hypertable) — without operator-applied migrations the new binary crashes on its first DB write. `feedback_migrations_not_auto_deployed` already documents the manual step; this surfaces drift before deploy instead of after.
- `ratesengine_ingestion_source_insert_stale` Prometheus alert (P2, R1 overlay + multi-host). Fires when `ratesengine_source_last_insert_unix` hasn't advanced in >1 h while `source_enabled=1`. Timestamp-shape sibling to `ingestion_duplicate_flood` — catches low-volume sources (phoenix, comet) whose insert rate sits under the rate-shape alert's 0.5/s threshold. Reuses the existing duplicate-flood runbook (same root-cause cluster).
- `ratesengine_source_last_insert_unix{source}` gauge — wall-clock Unix-seconds timestamp of the most recent successfully-inserted trade row per source. Emitted from `Store.InsertTrade` only on `rowsInserted == 1` (not on `ON CONFLICT DO NOTHING`).
Pairs with `ratesengine_source_last_event_unix` (dispatcher-matched) to expose the stuck-cursor / duplicate-flood pattern: when the dispatcher keeps matching events but every insert short-circuits, `last_event_unix` climbs while `last_insert_unix` flat-lines. Direct alert template: `time() - ratesengine_source_last_insert_unix{source=X} > 3600`. Complements the rate-shape `trade_insert_outcome_total` alert with a timestamp-shape signal that fires even without sustained traffic.
- `ratesengine_ingestion_duplicate_flood` Prometheus alert (P2, both R1 overlay + multi-host) and the matching runbook at `docs/operations/runbooks/ingestion-duplicate-flood.md`. Fires when a source has duplicate-insert rate > 0.5/s sustained 10 min with zero new-insert rate — the exact diagnostic signature of the live r1 2026-05-28 stuck-cursor pattern that the new `trade_insert_outcome_total` counter (below) exposes. Runbook documents 5-min triage (curl metrics, psql max-ts check), three likely causes (cursor jumped past data, stale event channel, replay loop), and per-cause remediation (targeted backfill, indexer restart, stop the loop).
- `ratesengine_trade_insert_outcome_total{source, outcome}` counter — distinguishes `outcome=new` (the row actually landed) from `outcome=duplicate` (`ON CONFLICT DO NOTHING` short-circuited). The pre-existing `ratesengine_trade_inserts_total` counter is silent about dedupe, so a stuck-cursor / replay loop is invisible to operators. Live evidence on r1 (2026-05-28): 157 SDEX insert-attempts/min while the trades hypertable's `max(ts)` was 11 h old — all duplicates. Alert template: `rate({outcome="new"}[5m]) == 0 AND rate({outcome="duplicate"}[5m]) > 0`. Integration test pins both branches via existing `startTimescale` testcontainer.
- DeFindex factory-layer topic recognition closes F-0018 (2026-05-28). New
`PrefixFactory = "DeFindexFactory"` + `classifyFactory()` covering `create` / `n_fee`. `Decoder.Matches()` returns true for the factory topic prefix; `Decode()` returns `(nil, nil)` on a factory match — recognised but not decoded into a flow, so the dispatcher's drop-counter stops filing factory events as "unmatched topic". With the earlier strategy `harvest` + vault 9-topic admin/rebalance classifications already in place, every previously-**NO** defindex row in `inventory/every-event-coverage.tsv` is now `classification-only` coverage. Body decode (especially for `create` — the vault-spawn signal needs `events.Event.OpArgs` per Surprising-gotcha #2 in the WASM audit doc, since the body itself doesn't carry the new vault address) is Phase C. Two new tests pin (a) `classifyFactory()` byte-equality for both symbols + every reject path and (b) `Decoder.Decode` returning `(nil, nil)` rather than `ErrUnknownEvent` on a factory match.
- DeFindex decoder enumerates the full upstream event surface (EVERY-event policy).
`classify()` (strategy layer) adds `harvest`; `classifyVault()` adds the nine governance / admin / multiplexed-rebalance topics from the audit doc: `rescue`, `paused`, `unpaused`, `nreceiver`, `nmanager`, `nemanager`, `rbmanager`, `dfees`, `rebalance`. Classification only — no canonical Trade or VaultFlow produced for these yet; the goal is closed-set completeness so future per-event decoders (or the `soroban_events` landing zone, ADR-0029) can route on them. Test fixtures updated: the previous "harvest (not Phase A) → " case flips to a positive classification, and every new vault topic gains a per-name subtest.
- Phoenix decoder's
`classifyAny()` now enumerates the six previously-unclassified governance/lifecycle topics published by `phoenix-contracts/contracts/pool/src/contract.rs`: the four admin variants under `topic[0]="XYK Pool: "` (`admin-replacement-requested`, `replace-with-new-admin`, `undo-admin-change`, `accepted-new-admin`) plus the two `"initialize"` variants (XYK LP token_a / token_b). Same EVERY-event policy rationale as the aquarius change below — these topics were silently dropped at the classification step despite Phoenix being `BackfillSafe=true`. Classification only; only `swap` continues to produce a `canonical.Trade`. `actionAdmin` + `actionInitialize` enum values added so future per-event decoders or the `soroban_events` landing zone (ADR-0029) can route on them.
- Aquarius decoder's
`classify()` now enumerates every topic published by `aquarius-amm/liquidity_pool_events/src/lib.rs` (verified 2026-05-27 against the upstream Rust). Eleven previously-unclassified topics (`reserves_sync`, `set_protocol_fee`, `claim_protocol_fee`, `kill_deposit` / `unkill_deposit`, `kill_swap` / `unkill_swap`, `kill_claim` / `unkill_claim`, `kill_gauges_claim` / `unkill_gauges_claim`) are now recognised. Per the EVERY-event policy (`project_every_event_principle`, 2026-05-25 — `classify()` is the authoritative completeness gate before flipping `BackfillSafe`), this closes a latent invariant violation: aquarius was already `BackfillSafe=true` but eleven event topics were silently dropped at the classification step. Only `trade` produces a `canonical.Trade` today; the new classifications make the topics visible to the `soroban_events` landing zone (ADR-0029) and any future per-event decoder. A `TestClassify_completenessVsUpstream` forcing function fails CI if a future `Event*` constant is added without wiring its `TopicSymbol*` into `classify()`.
- SEP-41 transfer projection: new
`sep41_transfers` hypertable (migration 0047) materialises every `transfer` / `approve` / `set_admin` / `set_authorized` event for a watched SEP-41 contract via a sibling-of-`sep41_supply` decoder at `internal/sources/sep41_transfers/`. New endpoint `GET /v1/contracts/{contract_id}/transfers?from=&to=&limit=` exposes the per-account audit trail with per-(contract, from) and per-(contract, to) indexes backing sub-100ms scans. `ratesengine-ops sep41-transfers-backfill -from -to` subcommand replays the `soroban_events` landing zone (ADR-0029) through the live decoder for historical coverage. Closes F-0021 partial-scope (audit-2026-05-26) and unlocks the per-account net-position Stellar moat that CG/CMC structurally cannot do (their data ingest doesn't observe on-chain transfers). Operator must apply migration 0047 manually (CLAUDE.md migrations-not-auto-deployed).
- `/v1/ohlc` now supports multi-bar series via `interval=1m|5m|15m|30m|1h|4h|1d|1w` + `limit=N` (max 1000, default 100). Closes the CG/CMC parity gap where consumers expected a series response instead of a single bar (F-0071). Single-bar behaviour preserved when `interval` is unset. Multi-bar mode reads the closed-bucket `prices_<N>` CAGGs (with re-bucketing via `time_bucket` for 5m/30m ← `prices_1m` and 4h ← `prices_1h`); the in-progress bucket is excluded per ADR-0015. Empty series returns 200 + `intervals: []` (NOT 404 — series clients expect a stable shape). Wire fields are compact (`t` / `o` / `h` / `l` / `c` / `v_base` / `v_quote` / `n`) matching CoinGecko / CoinMarketCap conventions.
- Density coverage calc (
`/v1/diagnostics/ingestion`) now includes the live ledgerstream cursor's coverage from `first_ledger` (newly persisted via migration 0046). Density_pct can now hit 1.0 on a perfectly-backfilled-plus-live-tail source. Previously the calc was backfill-cursor-only and capped at ~0.98 even at perfect ingestion (per `project_density_100pct_goal` mission). The `ingestion_cursors` table gains a `first_ledger` column populated for existing backfill cursors by parsing `from` out of `sub_source`; the live cursor's `first_ledger` is captured by `UpsertCursor`'s INSERT branch and preserved across every advance by the ON CONFLICT DO UPDATE clause. NULL `first_ledger` (pre-migration rows) falls back to `sourceGenesisLedger` so the live span is credited [genesis, last_ledger] until the indexer re-inserts.
- docs-lint check that fails CI when any /v1/incidents entry has unchecked
`[ ]` follow-up checkboxes AND the incident is older than 30 days (F-0099 forcing function). Closes the meta-failure-mode of post-mortem action items rotting indefinitely between recurrences of the same cascade — the 2026-05-10 SEV-2 shipped with 4 `[ ]` items and the same cascade recurred on 2026-05-26 with all four still unchecked.
Changed
- 2026-05-10-redis-writes-blocked-disk-full post-mortem: checked off the Prometheus root-FS alert follow-up (shipped in #1229 as
`ratesengine_node_root_disk_warning` + `_full`) and the recovery-sequence runbook follow-up (`docs/operations/runbooks/redis-write-blocked-disk-full.md` landed in #1228). Remaining open follow-ups: `postgresql-common` logrotate audit and WASM-audit stderr-capture relocation.
Fixed
- `GET /v1/contracts/{contract_id}/transfers` now validates inputs up-front as Stellar strkeys: `contract_id` must be a 56-char C-strkey (else 400), `?from=` / `?to=` must each be a G-strkey if present (else 400). Previously a garbage value reached the SQL layer and returned 200 with empty transfers — indistinguishable from "no matching transfers" and actively misleading for the operator-debugging use case. Extracted the 30-line validation block into `parseSEP41TransferIdentifiers` to keep the handler under the gocognit ceiling. 3 new tests pin the validation paths (4 invalid-contract-id cases, 4 invalid-address cases, plus a happy-path sanity check that valid inputs reach the reader).
- `/v1/assets/{id}` cold-cache latency for unknown classic assets dropped from ~4–5 s to single-digit ms. F-0157 perf root cause: `Store.HasAsset`'s `WHERE base_asset = $1 OR quote_asset = $1` over the 2.7 B-row trades hypertable had to seek every chunk's index even with EXISTS+LIMIT 1. New `hasClassicAsset` fast path routes `AssetClassic` to a primary-key lookup on the `classic_assets` registry (migration 0023). The registry is populated by `InsertTrade`'s `registerClassicAssetSeen` hook, so its presence is a strict subset of trade-table presence; unknown classic assets short-circuit without touching the hypertable. Other asset types fall through to the original scan unchanged. Integration test extended for the bogus-classic-asset path.
- Smoke
`expect_status` helper now actually supports a per-check timeout (`--timeout N` flag) — the comment claimed this for ages but the code was passing the global `$TIMEOUT` to every curl call. `asset not found` behaviour-pin bumped to 20 s because the cold-cache `/v1/assets/AAAA-G…` resolver takes 4-5 s and was occasionally crossing the global 10 s ceiling, surfacing as `FAIL asset not found — curl error` in live smoke runs. F-0157 reopened during a live smoke check and verified fixed via direct r1 run.
- Aggregator divergence refresh is now gated by a configurable minimum interval (default 300 s =
`cachekeys.DivergenceTTL`). F-0030 follow-up: the daily-batched lookup fix that landed earlier was still ~10× over the CMC free-tier monthly cap (10K calls / month) because every aggregator tick (30 s on r1 × 12 pairs) drove one external lookup. The `div:<asset>` Redis entry has a 5-minute TTL, so a 5-minute refresh interval keeps the API's `flags.divergence_warning` cache continuously populated while burning ~one-tenth the external quota. New `aggregate.divergence_min_interval_seconds` config knob; zero preserves legacy every-tick behaviour. Two new unit tests pin the gate behaviour.
- Production Content-Security-Policy on the explorer (
`ratesengine.net` + `/embed/*`) and status site (`status.ratesengine.net`) no longer permits `http://localhost:3000` in `connect-src`. F-0054 audit (2026-05-26) flagged this as dev/prod config drift — the Next dev server doesn't read `_headers` anyway, so the localhost permit was pure leakage. New section 16 in `scripts/ci/lint-docs.sh` (`Content-Security-Policy:.*localhost` grep across `web/explorer/public/_headers` + `web/status/public/_headers`) fails CI on regression.
- `/v1/oracle/latest` p95 latency: a new in-process `CachedOracleReader` layer (3 s TTL + single-flight) sits between the handler and the existing Redis cache, collapsing concurrent cold-miss stampedes and surviving Redis MISCONF. F-0013 audit (2026-05-26) measured p95 ~271 ms vs the 200 ms SLO. The underlying `DISTINCT ON (source)` query against `oracle_updates` has no covering index (sort is unavoidable post-asset-filter), and oracle data refreshes on a 10–60 s cadence, so a 3 s in-process TTL gives customer-visible freshness identical to a direct read while absorbing burst traffic. Key normalisation sorts the asset-strkey list so `[native, crypto:XLM]` and `[crypto:XLM, native]` share one slot. Mirrors the F-0011 `CachedIssuersReader` shape (delete-on-error, waiter-err-pointer single-flight). 8 unit tests added — the previous Redis-only `cachedOracleReader` shipped without unit-test coverage.
- `ratesengine_price_staleness_seconds` XLM ↔ native mirror is now order-independent (F-0032 follow-up). The aggregator iterates `cfg.Pairs` and emits a staleness gauge per asset; the mirror code in `emitStalenessGauges` used to set the other form's gauge to the current pair's stale value as a side-effect, so whichever of `(crypto:XLM, native)` was iterated last won and the alert `ratesengine_api_price_stale` would fire (or not) based on iteration order. Post-fix both labels carry `MIN(stale_native, stale_crypto_XLM)` — the freshest form drives both.
Added `TestEmitStalenessGauges_xlmNativeMirrorOrderIndependent` to lock in the invariant, plus `TestEmitStalenessGauges_growsAcrossTicks` as a baseline regression test (the metric was previously untested end-to-end).
- `/v1/issuers` p95 latency: ~404ms → sub-millisecond on cache hit via in-process TTL + single-flight cache (F-0011, was over 200ms SLO target). EXPLAIN ANALYZE on r1 showed the listing's HashAggregate-over-58k-issuers + top-N heapsort hits ~196ms in PG alone before JSON marshalling; no index helps because the GROUP BY + sum(observation_count) requires a full hashagg regardless of access path. New `internal/api/v1.CachedIssuersReader` wraps `IssuersReader` with a 5min TTL + single-flight refresh on `ListIssuers` (passes `GetIssuer` + `ListIssuerAssets` through — those are point lookups already on indexed columns). Mirrors the `CachedSourcesStatsReader` / `CachedMarketsReader` shape; same `ratesengine_api_cache_ops_total{cache="issuers"}` instrumentation feeds the existing `api_cache_miss_rate_high` alert.
- Binance + Bitstamp CEX WebSocket connections reconnect 12x faster (5s -> 60s exponential, was 60s blanket) and TCP keepalive is set on the dialer. Combined with verified PING/PONG auto-handling in the underlying coder/websocket v1.8.14 library, this reduces the per-cycle data-loss window from ~60s to ~5s. New metric
`ratesengine_cex_stream_disconnect_total{source,reason}` surfaces disconnect cadence (F-0029).
- Indexer's postgres connection pool now sets explicit pool-tuning constants (
`internal/storage/timescale.PoolConnMaxLifetime` = 30 min, `PoolConnMaxIdleTime` = 5 min, `PoolMaxOpenConns` = 25, `PoolMaxIdleConns` = 5) via a new extracted `configurePool` helper, and the indexer's `watchPostgresPing` goroutine probes the pool every 60 s emitting `ratesengine_postgres_ping_total{outcome=ok|error}` plus the `ratesengine_postgres_ping_failure_streak` gauge. A new `ratesengine_postgres_ping_failing` page alert (in both `configs/prometheus/rules.r1/storage.yml` and `deploy/monitoring/rules/storage.yml`) fires when the error rate stays above 0.5/s for 2 min, with the new `docs/operations/runbooks/postgres-ping-failing.md`. Previously, a postgres outage that lasted past the natural conn lifetime left dead conns in the pool and the indexer would silently fail writes for hours until manually restarted — root cause of the ~14 h cascade-gap on 2026-05-26-27 (F-0151). The new lifetime forces fresh conns regularly; the ping surfaces stuck pools to alerting in minutes instead of hours.
- CoinGecko poller default cadence bumped from 60s to 300s; the connector already uses the
`/simple/price` batch endpoint, so daily call volume drops from ~1,440/day to ~288/day with ample headroom for the market-cap refresher and divergence reference under a shared demo-tier IP cap. Closes the sustained "poller error … http 429 — backing off 59m59s" loop observed live on r1 (F-0030).
- `internal/divergence/coingecko.go` now batches per-tick lookups into a single `/simple/price` call instead of one HTTP call per pair. Daily call volume drops from ~25,920 (9 pairs × every-30s tick) to ~2,880 (one batched call × every-30s tick) — well within the demo-tier 10K limit (F-0030 follow-up).
- `galexie-archive-fill` Phase-1b auto-detection of trailing-edge partial partitions: file-count the latest `PARTIAL_CHECK_WINDOW=4` partitions per hourly fire and re-mirror any local partition that has fewer files than AWS. Closes the F-0158 trailing-partition-stuck failure mode where `comm -23 aws local` treated a partition with 416/64000 files as "present" and never revisited it. Recovered ~150k missing files in `FC42F7FF--62720000-62783999`, `FC43F1FF--62656000-62719999`, `FC44EBFF--62592000-62655999` on r1 same session.