Rates Engine v0.5.0-rc.83

Pre-release

@github-actions github-actions released this 28 May 11:44
· 483 commits to main since this release

[v0.5.0-rc.83] — 2026-05-28

Tested against Stellar Protocol 23 (Whisk).

Added

  • ratesengine_api_price_stale alert (both R1 overlay + multi-host) gets an absent_over_time OR-branch so the cascade-wedge case fires instead of going no-data silent (F-0104 closure). The staleness gauge is emitted by the aggregator at end-of-tick; when the aggregator wedges, the gauge stops being scraped, the series goes stale, and a bare > 120 predicate sees no-data — i.e. the alert designed to catch exactly that cascade was itself a victim of it. New expr: staleness > 120 OR absent_over_time(...[10m]) == 1. Same pattern as aggregator_silent (F-0080) and the exporter-down meta-alerts (F-0085). Annotation updated so operators reading the page know to consult the aggregator-silent runbook when the gauge is absent rather than the price-stale runbook.
  • http_request_success_duration_seconds histogram (F-0105 closure). The middleware records into this metric only when the response status is < 500 and not 499 (client-aborted). The latency SLO recording rules in slo.yml (both R1 overlay + multi-host) now use http_request_success_duration_seconds_bucket{le="0.2"} for the fast-success numerator while keeping http_request_duration_seconds_count as the all-request denominator. Pre-this-PR a 5 ms 500 landed in the same histogram as a 5 ms 200 and reported as "good fast" against the SLO, even though the customer experience was a hard outage. After: a fast 5xx burns the latency budget (numerator excludes the error, denominator counts it). One new regression test pins _success_duration_seconds_count at 0 for a synthetic 500. Availability SLO (http_requests_total{status_class=5xx}) is unchanged; this PR only fixes the latency dimension.
  • /v1/diagnostics/cursors distinguishes transient storage errors (503 cursors-transient + cursors-timeout) from genuine 500s. Under the F-0039 cascade this operator-diagnostic was the most-needed surface but returned the same opaque 500 for "postgres briefly stalled" and "endpoint permanently broken" — operators couldn't tell whether to retry or escalate. Now: 5s ctx-timeout on the ListCursors call; deadline-exceeded → 503 cursors-timeout; transientStorageErr (driver-bad-connection / 57014 cancel / broken-pipe / EOF) → 503 cursors-transient; client-aborted is filtered; the residual 500 is reserved for genuinely-unknown errors. Two new tests pin transient → 503 and non-transient → 500. handleCursors extracts the seven-branch error map into writeCursorsListError to stay under the gocognit ceiling. Closes F-0094 (audit 2026-05-26).
  • Bounded-cardinality counters now pre-seed their well-known label combos at startup so alert PromQL is well-defined before the first event fires (F-0033 closure). ratesengine_aggregator_triangulations_total seeds {outcome=ok|missing_leg|parse_error|redis_error}; ratesengine_stripe_platform_sync_errors_total seeds {operation=get_account|upsert_subscription|account_update|list_keys|key_update}. Pre-this-PR rate(...{outcome="ok"}[15m]) resolved to "no data" until the first triangulation landed — the audit found multiple alert rules whose underlying metric was "missing from scrape output" for this reason. The fix is obs.init()-time .WithLabelValues(...) (no-op Inc-less call) which is enough to publish the series at zero. Counters with unbounded per-pair labels (AggregatorFXSnapFallbackTotal) stay emit-on-error. New TestZeroSeed_F0033 pins the 9 expected series at 0 in the scrape body. The other two metrics F-0033 flagged (ratesengine_ledgerstream_tier_read_total, ratesengine_stellar_archive_publish_errors_total) are intentionally inert today — both are documented in their respective files as Phase-3 / cold-tier reservations.
  • Param-name aliasing extended across the rest of the asset=-canonical endpoints (F-0068, F-0091, F-0073 closure). /v1/observations and /v1/chart now both accept base= as an alias for asset=; passing both is a 400. /v1/price/batch accepts pairs= as an alias for asset_ids= so CG-style callers calling the request "pairs" reach the endpoint without a 400 detour; both-supplied is a 400. New shared resolver resolveAssetOrBaseParam in price.go factors out the asset/base alias logic so future asset=-canonical endpoints inherit the contract for free. Six new tests (chart × 2, observations × 2, price-batch × 2) pin alias-accepted + both-supplied-rejected. Closes F-0068, F-0073, F-0091 — completes the cluster started by F-0061.
  • asset and base query parameters are now interchangeable across /v1/price (asset= canonical, base= accepted) and every endpoint that flows through parseBaseQuote: /v1/history, /v1/twap, /v1/vwap, /v1/ohlc (base= canonical, asset= accepted). Developers copying URLs between endpoints no longer get the F-0061 two-step rejection. Passing BOTH base and asset returns 400 invalid-parameter with a self-explanatory message about which form is canonical for the endpoint they're calling, avoiding silent precedence picks. The pre-existing helpful "this endpoint uses base/quote (not asset/quote)" detail string is replaced with the alias acceptance + the mutually-exclusive 400; the redirect was a workaround the alias makes unnecessary. New tests pin (a) asset= accepted as base= alias on /v1/history, (b) both-supplied returns 400 with "mutually exclusive" in the body. handlePrice extraction into parsePriceAssetParam keeps the handler under the gocognit ceiling. Closes F-0061 (audit 2026-05-26).
  • window= query parameter on every endpoint that uses parseFromTo (so /v1/twap, /v1/vwap, /v1/ohlc, /v1/history — and any future endpoint that picks up the helper). Convenience shorthand for from = to - window so CG-style customers don't have to compute it; pre-this-PR the param was silently ignored and /v1/twap?window=24h returned a 1h-default 404 with no explanation. Accepts Go's [time.ParseDuration] units (ns, us, ms, s, m, h, including compound 1h30m) plus a trailing-d shortcut for days (7d = 168h). Combining window= with an explicit from is now a 400 — they're conflicting controls for the same value, and rejecting it loudly catches the F-0072 surprise. Three new internal-test functions pin happy-path (hours / minutes / days / compound), conflict rejection, and reject-malformed (garbage, 1x, 1d2h, -5h, 0). Closes F-0072 (audit 2026-05-26).
  • ratesengine-ops backfill chunk-complete log now reports both chunk_size_ledgers (the [from,to] range) and ledgers_walked (the LCM-callback count from the bucket). The previous ledgers=N field reported the range size — operators ran a backfill against an empty bucket (F-0159: -bucket galexie-archive for a range that lived only in galexie-live) and got ledgers=5331 in the chunk-complete log after a 200ms run. With this change, the same scenario logs chunk_size_ledgers=5331 ledgers_walked=0 and also returns an explicit error: backfill walked 0 of 5331 ledgers in range [...] from bucket "galexie-archive" — bucket likely has no files in this range; check --bucket and the galexie-archive/-live mirror for the target range. The chunk fails loudly instead of silently succeeding. Closes F-0159 (audit 2026-05-26).
  • TLS cert expiry self-probe (F-0051). API binary now runs a goroutine that tls.Dials each configured public hostname every 6 h, extracts the leaf cert's NotAfter, and emits it as ratesengine_tls_cert_not_after_unix{host}. New alert ratesengine_tls_cert_expiring_soon (P2, both R1 overlay + multi-host) fires when (NotAfter - time()) < 14 days sustained 1 h. Default hosts list covers api.ratesengine.net + status.ratesengine.net + ratesengine.net (apex); operators override via [api].tls_cert_probe_hosts. Companion ratesengine_tls_cert_probe_total{host, outcome} counter exposes probe health (ok / dial_error / timeout / no_cert). Runbook at docs/operations/runbooks/tls-cert-expiring-soon.md documents 5-min triage, five likely root causes (ACME rate limit, DNS-01 failing, HTTP-01 firewall, disk full, Caddy crashed), and manual renewal sequence. Closes the "Caddy auto-renews but if it fails we don't know until expiry" gap. 5 unit tests pin the probe behaviour including a self-signed httptest TLS server for happy-path coverage.
  • Operator-config wiring for the new per-asset supply-refresh stale-component overrides: [supply].stale_component_ledgers_by_asset map (asset_key → ledger threshold) is now consumed by all three refresher builders (classic / SEP-41 / XLM). Operators set this in ratesengine.toml and the aggregator picks up per-asset overrides at startup; an empty map preserves the global default for every asset. A concrete deployment example is documented in the config doc. F-0040 follow-up to the library knob shipped earlier.
  • Per-asset stale-component threshold override for supply refresher (supply.WithStaleComponentLedgersFor(assetKey, maxLag)). F-0040 audit (2026-05-26): PHO governance-token snapshots were being rejected at gap ≈1190 ledgers (~100 min) because of the global 1000-ledger threshold; PHO is low-activity and 1200-ledger lag is normal. Operators can now relax the gate per-asset (e.g. PHO → 5000) without loosening the gate for high-activity XLM/USDC. Two new tests pin (a) the relaxed asset accepts what the global default rejects and (b) the override doesn't bleed into other assets. Caller wires per-asset overrides via supply.NewRefresher(..., WithStaleComponentLedgersFor(...)).
  • make verify-r1-sync now checks for pending Postgres migrations on r1 too. Compares the highest migrations/NNNN_*.up.sql number locally against schema_migrations.version on r1's Postgres and prints the exact scp+migrate-up command if local is ahead. Closes a real gap: rc.83 adds two columns/tables (migration 0046 ingestion_cursors.first_ledger, 0047 sep41_transfers hypertable) — without operator-applied migrations the new binary crashes on its first DB write. feedback_migrations_not_auto_deployed already documents the manual step; this surfaces drift before deploy instead of after.
  • ratesengine_ingestion_source_insert_stale Prometheus alert (P2, R1 overlay + multi-host). Fires when ratesengine_source_last_insert_unix hasn't advanced in >1 h while source_enabled=1. Timestamp-shape sibling to ingestion_duplicate_flood — catches low-volume sources (phoenix, comet) whose insert rate sits under the rate-shape alert's 0.5/s threshold. Reuses the existing duplicate-flood runbook (same root-cause cluster).
  • ratesengine_source_last_insert_unix{source} gauge — wall-clock Unix-seconds timestamp of the most recent successfully-inserted trade row per source. Emitted from Store.InsertTrade only on rowsInserted == 1 (not on ON CONFLICT DO NOTHING). Pairs with ratesengine_source_last_event_unix (dispatcher-matched) to expose the stuck-cursor / duplicate-flood pattern: when the dispatcher keeps matching events but every insert short-circuits, last_event_unix climbs while last_insert_unix flat-lines. Direct alert template: time() - ratesengine_source_last_insert_unix{source=X} > 3600. Complements the rate-shape trade_insert_outcome_total alert with a timestamp-shape signal that fires even without sustained traffic.
  • ratesengine_ingestion_duplicate_flood Prometheus alert (P2, both R1 overlay + multi-host) and the matching runbook at docs/operations/runbooks/ingestion-duplicate-flood.md. Fires when a source has duplicate-insert rate > 0.5/s sustained 10 min with zero new-insert rate — the exact diagnostic signature of the live r1 2026-05-28 stuck-cursor pattern that the new trade_insert_outcome_total counter (below) exposes. Runbook documents 5-min triage (curl metrics, psql max-ts check), three likely causes (cursor jumped past data, stale event channel, replay loop), and per-cause remediation (targeted backfill, indexer restart, stop the loop).
  • ratesengine_trade_insert_outcome_total{source, outcome} counter — distinguishes outcome=new (the row actually landed) from outcome=duplicate (ON CONFLICT DO NOTHING short-circuited). The pre-existing ratesengine_trade_inserts_total counter is silent about dedupe, so a stuck-cursor / replay loop is invisible to operators. Live evidence on r1 (2026-05-28): 157 SDEX insert-attempts/min while the trades hypertable's max(ts) was 11 h old — all duplicates. Alert template: rate({outcome="new"}[5m]) == 0 AND rate({outcome="duplicate"}[5m]) > 0. Integration test pins both branches via existing startTimescale testcontainer.
  • DeFindex factory-layer topic recognition closes F-0018 (2026-05-28). New PrefixFactory = "DeFindexFactory" + classifyFactory() covering create / n_fee. Decoder.Matches() returns true for the factory topic prefix; Decode() returns (nil, nil) on a factory match — recognised but not decoded into a flow, so the dispatcher's drop-counter stops filing factory events as "unmatched topic". With the earlier strategy harvest + vault 9-topic admin/rebalance classifications already in place, every previously-**NO** defindex row in inventory/every-event-coverage.tsv is now classification-only coverage. Body decode (especially for create — the vault-spawn signal needs events.Event.OpArgs per Surprising-gotcha #2 in the WASM audit doc, since the body itself doesn't carry the new vault address) is Phase C. Two new tests pin (a) classifyFactory() byte-equality for both symbols + every reject path and (b) Decoder.Decode returning (nil, nil) rather than ErrUnknownEvent on a factory match.
  • DeFindex decoder enumerates the full upstream event surface (EVERY-event policy). classify() (strategy layer) adds harvest; classifyVault() adds the nine governance / admin / multiplexed-rebalance topics from the audit doc: rescue, paused, unpaused, nreceiver, nmanager, nemanager, rbmanager, dfees, rebalance. Classification only — no canonical Trade or VaultFlow produced for these yet; the goal is closed-set completeness so future per-event decoders (or the soroban_events landing zone, ADR-0029) can route on them. Test fixtures updated: the previous "harvest (not Phase A) → " case flips to a positive classification, and every new vault topic gains a per-name subtest.
  • Phoenix decoder's classifyAny() now enumerates the six previously-unclassified governance/lifecycle topics published by phoenix-contracts/contracts/pool/src/contract.rs: the four admin variants under topic[0]="XYK Pool: " (admin-replacement-requested, replace-with-new-admin, undo-admin-change, accepted-new-admin) plus the two "initialize" variants (XYK LP token_a / token_b). Same EVERY-event policy rationale as the aquarius change below — these topics were silently dropped at the classification step despite Phoenix being BackfillSafe=true. Classification only; only swap continues to produce a canonical.Trade. actionAdmin + actionInitialize enum values added so future per-event decoders or the soroban_events landing zone (ADR-0029) can route on them.
  • Aquarius decoder's classify() now enumerates every topic published by aquarius-amm/liquidity_pool_events/src/lib.rs (verified 2026-05-27 against the upstream Rust). Eleven previously-unclassified topics (reserves_sync, set_protocol_fee, claim_protocol_fee, kill_deposit/unkill_deposit, kill_swap/unkill_swap, kill_claim/unkill_claim, kill_gauges_claim/unkill_gauges_claim) are now recognised. Per the EVERY-event policy (project_every_event_principle, 2026-05-25 — classify() is the authoritative completeness gate before flipping BackfillSafe), this closes a latent invariant violation: aquarius was already BackfillSafe=true but eleven event topics were silently dropped at the classification step. Only trade produces a canonical.Trade today; the new classifications make the topics visible to the soroban_events landing zone (ADR-0029) and any future per-event decoder. A TestClassify_completenessVsUpstream forcing function fails CI if a future Event* constant is added without wiring its TopicSymbol* into classify().
  • SEP-41 transfer projection: new sep41_transfers hypertable (migration 0047) materialises every transfer / approve / set_admin / set_authorized event for a watched SEP-41 contract via a sibling-of-sep41_supply decoder at internal/sources/sep41_transfers/. New endpoint GET /v1/contracts/{contract_id}/transfers?from=&to=&limit= exposes the per-account audit trail with per-(contract, from) and per-(contract, to) indexes backing sub-100ms scans. ratesengine-ops sep41-transfers-backfill -from -to subcommand replays the soroban_events landing zone (ADR-0029) through the live decoder for historical coverage. Closes F-0021 partial-scope (audit-2026-05-26) and unlocks the per-account net-position Stellar moat that CG/CMC structurally cannot do (their data ingest doesn't observe on-chain transfers). Operator must apply migration 0047 manually (CLAUDE.md migrations-not-auto-deployed).
  • /v1/ohlc now supports multi-bar series via interval=1m|5m|15m|30m|1h|4h|1d|1w + limit=N (max 1000, default 100). Closes the CG/CMC parity gap where consumers expected a series response instead of a single bar (F-0071). Single-bar behaviour preserved when interval is unset. Multi-bar mode reads the closed-bucket prices_<N> CAGGs (with re-bucketing via time_bucket for 5m/30m ← prices_1m and 4h ← prices_1h); the in-progress bucket is excluded per ADR-0015. Empty series returns 200 + intervals: [] (NOT 404 — series clients expect a stable shape). Wire fields are compact (t/o/h/l/c/v_base/v_quote/n) matching CoinGecko / CoinMarketCap conventions.
  • Density coverage calc (/v1/diagnostics/ingestion) now includes the live ledgerstream cursor's coverage from first_ledger (newly persisted via migration 0046). Density_pct can now hit 1.0 on a perfectly-backfilled-plus-live-tail source. Previously the calc was backfill-cursor-only and capped at ~0.98 even at perfect ingestion (per project_density_100pct_goal mission). The ingestion_cursors table gains a first_ledger column populated for existing backfill cursors by parsing from out of sub_source; the live cursor's first_ledger is captured by UpsertCursor's INSERT branch and preserved across every advance by the ON CONFLICT DO UPDATE clause. NULL first_ledger (pre-migration rows) falls back to sourceGenesisLedger so the live span is credited [genesis, last_ledger] until the indexer re-inserts.
  • docs-lint check that fails CI when any /v1/incidents entry has unchecked [ ] follow-up checkboxes AND the incident is older than 30 days (F-0099 forcing function). Closes the meta-failure-mode of post-mortem action items rotting indefinitely between recurrences of the same cascade — the 2026-05-10 SEV-2 shipped with 4 [ ] items and the same cascade recurred on 2026-05-26 with all four still unchecked.

Changed

  • 2026-05-10-redis-writes-blocked-disk-full post-mortem: checked off the Prometheus root-FS alert follow-up (shipped in #1229 as ratesengine_node_root_disk_warning + _full) and the recovery-sequence runbook follow-up (docs/operations/runbooks/redis-write-blocked-disk-full.md landed in #1228). Remaining open follow-ups: postgresql-common logrotate audit and WASM-audit stderr-capture relocation.

Fixed

  • GET /v1/contracts/{contract_id}/transfers now validates inputs up-front as Stellar strkeys: contract_id must be a 56-char C-strkey (else 400), ?from= / ?to= must each be a G-strkey if present (else 400). Previously a garbage value reached the SQL layer and returned 200 with empty transfers — indistinguishable from "no matching transfers" and actively misleading for the operator-debugging use case. Extracted the 30-line validation block into parseSEP41TransferIdentifiers to keep the handler under the gocognit ceiling. 3 new tests pin the validation paths (4 invalid-contract-id cases, 4 invalid-address cases, plus a happy-path sanity check that valid inputs reach the reader).
  • /v1/assets/{id} cold-cache latency for unknown classic assets dropped from ~4–5 s to single-digit ms. F-0157 perf root cause: Store.HasAsset's WHERE base_asset = $1 OR quote_asset = $1 over the 2.7 B-row trades hypertable had to seek every chunk's index even with EXISTS+LIMIT 1. New hasClassicAsset fast path routes AssetClassic to a primary-key lookup on the classic_assets registry (migration 0023). The registry is populated by InsertTrade's registerClassicAssetSeen hook, so its presence is a strict subset of trade-table presence; unknown classic assets short-circuit without touching the hypertable. Other asset types fall through to the original scan unchanged. Integration test extended for the bogus-classic-asset path.
  • Smoke expect_status helper now actually supports a per-check timeout (--timeout N flag) — the comment claimed this for ages but the code was passing the global $TIMEOUT to every curl call. asset not found behaviour-pin bumped to 20 s because the cold-cache /v1/assets/AAAA-G… resolver takes 4-5 s and was occasionally crossing the global 10 s ceiling, surfacing as FAIL asset not found — curl error in live smoke runs. F-0157 reopened during a live smoke check and verified fixed via direct r1 run.
  • Aggregator divergence refresh is now gated by a configurable minimum interval (default 300 s = cachekeys.DivergenceTTL). F-0030 follow-up: the daily-batched lookup fix that landed earlier was still ~10× over the CMC free-tier monthly cap (10K calls / month) because every aggregator tick (30 s on r1 × 12 pairs) drove one external lookup. The div:<asset> Redis entry has a 5-minute TTL, so a 5-minute refresh interval keeps the API's flags.divergence_warning cache continuously populated while burning roughly one-tenth of the external quota. New aggregate.divergence_min_interval_seconds config knob; zero preserves legacy every-tick behaviour. Two new unit tests pin the gate behaviour.
  • Production Content-Security-Policy on the explorer (ratesengine.net + /embed/*) and status site (status.ratesengine.net) no longer permits http://localhost:3000 in connect-src. F-0054 audit (2026-05-26) flagged this as dev/prod config drift — the Next dev server doesn't read _headers anyway, so the localhost permit was pure leakage. New section 16 in scripts/ci/lint-docs.sh (Content-Security-Policy:.*localhost grep across web/explorer/public/_headers + web/status/public/_headers) fails CI on regression.
  • /v1/oracle/latest p95 latency: a new in-process CachedOracleReader layer (3 s TTL + single-flight) sits between the handler and the existing Redis cache, collapsing concurrent cold-miss stampedes and surviving Redis MISCONF. F-0013 audit (2026-05-26) measured p95 ~271 ms vs the 200 ms SLO. The underlying DISTINCT ON (source) query against oracle_updates has no covering index (sort is unavoidable post-asset-filter), and oracle data refreshes on a 10–60 s cadence, so a 3 s in-process TTL gives customer-visible freshness identical to a direct read while absorbing burst traffic. Key normalisation sorts the asset-strkey list so [native, crypto:XLM] and [crypto:XLM, native] share one slot. Mirrors the F-0011 CachedIssuersReader shape (delete-on-error, waiter-err-pointer single-flight). 8 unit tests added — the previous Redis-only cachedOracleReader shipped without unit-test coverage.
  • ratesengine_price_staleness_seconds XLM ↔ native mirror is now order-independent (F-0032 follow-up). The aggregator iterates cfg.Pairs and emits a staleness gauge per asset; the mirror code in emitStalenessGauges used to set the other form's gauge to the current pair's stale value as a side-effect, so whichever of (crypto:XLM, native) was iterated last won and the alert ratesengine_api_price_stale would fire (or not) based on iteration order. Post-fix both labels carry MIN(stale_native, stale_crypto_XLM) — the freshest form drives both. Added TestEmitStalenessGauges_xlmNativeMirrorOrderIndependent to lock in the invariant, plus TestEmitStalenessGauges_growsAcrossTicks as a baseline regression test (the metric was previously untested end-to-end).
  • /v1/issuers p95 latency: ~404ms → sub-millisecond on cache hit via in-process TTL + single-flight cache (F-0011, was over 200ms SLO target). EXPLAIN ANALYZE on r1 showed the listing's HashAggregate-over-58k-issuers + top-N heapsort hits ~196ms in PG alone before JSON marshalling; no index helps because the GROUP BY + sum(observation_count) requires a full hashagg regardless of access path. New internal/api/v1.CachedIssuersReader wraps IssuersReader with a 5min TTL + single-flight refresh on ListIssuers (passes GetIssuer + ListIssuerAssets through — those are point lookups already on indexed columns). Mirrors the CachedSourcesStatsReader / CachedMarketsReader shape; same ratesengine_api_cache_ops_total{cache="issuers"} instrumentation feeds the existing api_cache_miss_rate_high alert.
  • Binance + Bitstamp CEX WebSocket connections reconnect 12x faster (5s -> 60s exponential, was 60s blanket) and TCP keepalive is set on the dialer. Combined with verified PING/PONG auto-handling in the underlying coder/websocket v1.8.14 library, this reduces the per-cycle data-loss window from ~60s to ~5s. New metric ratesengine_cex_stream_disconnect_total{source,reason} surfaces disconnect cadence (F-0029).
  • Indexer's postgres connection pool now sets explicit pool-tuning constants (internal/storage/timescale.PoolConnMaxLifetime = 30 min, PoolConnMaxIdleTime = 5 min, PoolMaxOpenConns = 25, PoolMaxIdleConns = 5) via a new extracted configurePool helper, and the indexer's watchPostgresPing goroutine probes the pool every 60 s emitting ratesengine_postgres_ping_total{outcome=ok|error} plus the ratesengine_postgres_ping_failure_streak gauge. A new ratesengine_postgres_ping_failing page alert (in both configs/prometheus/rules.r1/storage.yml and deploy/monitoring/rules/storage.yml) fires when the error rate stays above 0.5/s for 2 min, with the new docs/operations/runbooks/postgres-ping-failing.md. Previously, a postgres outage that lasted past the natural conn lifetime left dead conns in the pool and the indexer would silently fail writes for hours until manually restarted — root cause of the ~14 h cascade-gap on 2026-05-26-27 (F-0151). The new lifetime forces fresh conns regularly; the ping surfaces stuck pools to alerting in minutes instead of hours.
  • CoinGecko poller default cadence bumped from 60s to 300s; the connector already uses the /simple/price batch endpoint, so daily call volume drops from ~1,440/day to ~288/day with ample headroom for the market-cap refresher and divergence reference under a shared demo-tier IP cap. Closes the sustained "poller error … http 429 — backing off 59m59s" loop observed live on r1 (F-0030).
  • internal/divergence/coingecko.go now batches per-tick lookups into a single /simple/price call instead of one HTTP call per pair. Daily call volume drops from ~25,920 (9 pairs × every-30s tick) to ~2,880 (one batched call × every-30s tick) — well within the demo-tier 10K limit (F-0030 follow-up).
  • galexie-archive-fill Phase-1b auto-detection of trailing-edge partial partitions: file-count the latest PARTIAL_CHECK_WINDOW=4 partitions per hourly fire and re-mirror any local partition that has fewer files than AWS. Closes the F-0158 trailing-partition-stuck failure mode where comm -23 aws local treated a partition with 416/64000 files as "present" and never revisited it. Recovered ~150k missing files in FC42F7FF--62720000-62783999, FC43F1FF--62656000-62719999, FC44EBFF--62592000-62655999 on r1 same session.