Release Rates Engine v0.5.0-rc.60 · StellarIndex/stellar-index

[v0.5.0-rc.60] — 2026-05-20

Added

ratesengine-ops trim-galexie-archive operator (#7
implementation step 2b — second half of ADR-0027 §Step 2).
DESTRUCTIVE subcommand that deletes LCM files from the local hot
tier (galexie-archive MinIO) whose entire ledger range is below
the operator-specified --older-than-ledger N, after verifying
upstream presence in the cold tier. Five-layer safety stack:
(1) --dry-run is the default when neither --dry-run nor
--commit is set — actual deletion requires explicit
--commit; (2) --verify-upstream is the default — every
candidate is HEAD'd against cold before being marked for
deletion; --no-verify-upstream is a documented escape hatch
for restore-from-backup workflows; (3) --max-files caps
deletions per run (default 100000) — a typo cannot trim the
full archive in one shot; (4) --older-than-ledger is
required (no implicit cutoff); (5) cold tier MUST be
configured (refuses to run otherwise — trim without a cold
fallback is unrecoverable data loss). Rollback is mechanical:
ratesengine-ops rehydrate-galexie-archive -from N -to N
re-fetches from cold. Per-object DeleteObject (vs bulk
DeleteObjects) so a partial failure leaves a clear position
cursor — operator re-runs --dry-run to see what's left.
Promotes aws-sdk-go-v2/{aws,config,credentials,service/s3}
from transitive to direct dependencies (already in our tree
via go-stellar-sdk) — needed because the SDK's
datastore.DataStore interface lacks a Delete method. Tests
cover the safety primitives (default verify-upstream, default
no-commit, --commit opt-in, uint32 overflow guard) +
splitBucketPath (SDK-compatible bucket/prefix parsing). The
full ADR-0027 §Step 2 is now complete (rehydrate from §2a +
trim from §2b); §Steps 3-5 (flag-flip in r1's TOML, first bulk
trim, monthly cadence) are operator-gated and follow.
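
The safety defaults above can be sketched as a small flag-resolution step. This is a minimal illustration, not the shipped code: `trimFlags`, `trimPlan`, and `resolveTrimPlan` are hypothetical names, and the handling of `--dry-run` combined with `--commit` (dry-run wins) is an assumption the entry does not spell out.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// trimFlags is a hypothetical mirror of the CLI flags described above.
type trimFlags struct {
	DryRun           bool
	Commit           bool
	NoVerifyUpstream bool
	OlderThanLedger  uint64 // 0 means the flag was not supplied
	MaxFiles         int    // 0 means "use the default"
	ColdConfigured   bool
}

// trimPlan is what a run would actually do once defaults are applied.
type trimPlan struct {
	Delete         bool
	VerifyUpstream bool
	Cutoff         uint32
	MaxFiles       int
}

const defaultMaxFiles = 100000

func resolveTrimPlan(f trimFlags) (trimPlan, error) {
	if !f.ColdConfigured {
		return trimPlan{}, errors.New("cold tier not configured: refusing to trim")
	}
	if f.OlderThanLedger == 0 {
		return trimPlan{}, errors.New("--older-than-ledger is required")
	}
	if f.OlderThanLedger > math.MaxUint32 {
		return trimPlan{}, errors.New("--older-than-ledger overflows uint32")
	}
	maxFiles := f.MaxFiles
	if maxFiles <= 0 {
		maxFiles = defaultMaxFiles
	}
	return trimPlan{
		// Assumption: an explicit --dry-run wins if both flags are given.
		Delete:         f.Commit && !f.DryRun,
		VerifyUpstream: !f.NoVerifyUpstream,
		Cutoff:         uint32(f.OlderThanLedger),
		MaxFiles:       maxFiles,
	}, nil
}

func main() {
	// No --commit: the plan is a dry run with upstream verification on.
	plan, err := resolveTrimPlan(trimFlags{OlderThanLedger: 50_000_000, ColdConfigured: true})
	fmt.Println(plan, err)
}
```

Resolving every default in one place keeps the "deletion requires explicit opt-in" rule testable without touching any object store.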

ratesengine-ops rehydrate-galexie-archive operator (#7
implementation step 2a — first half of ADR-0027 §Step 2).
Non-destructive subcommand that copies LCM files for a ledger
range from the configured cold tier (storage.s3_cold_*) back
into the local hot tier (storage.s3_bucket_archive MinIO
bucket). Idempotent via PutFileIfNotExists — files already
present in hot are skipped, not refetched. -dry-run reports
the file list + skipped / would-copy / missing-in-cold counts
without writing. Use cases: recover from accidental trim,
pre-warm hot before a planned backfill, cold-tier integrity
spot check (the missing_in_cold counter surfaces files that
genuinely never landed upstream). Refuses to run when cold tier
isn't configured. The path-enumeration logic uses the SDK's
DataStoreSchema.GetObjectKeyFromSequenceNumber + steps by
LedgersPerFile so each schema-aligned file is visited once;
a defensive fallback handles LedgersPerFile == 0 (a
malformed schema would otherwise infinite-loop). Tests cover:
alignment of -from down to file boundary, no-duplicates,
zero-LPF fallback, single-LPF (the Galexie default), flag
parsing (4 cases). The destructive trim operator (the
second half of §Step 2) follows in a separate commit — it
needs delete capability which the SDK's datastore.DataStore
doesn't expose, so it'll wire AWS SDK v2's s3.Client
directly.
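
The path enumeration described above (align `-from` down to a file boundary, step by `LedgersPerFile`, guard against zero) might look roughly like the following. It is a sketch under assumptions: `fileStarts` is a hypothetical helper, file boundaries are assumed to fall on multiples of `LedgersPerFile`, and the zero-LPF fallback is assumed to treat the schema as one ledger per file. The real code derives object keys through the SDK rather than returning raw sequence numbers.

```go
package main

import "fmt"

// fileStarts returns the first ledger of every schema-aligned file that
// overlaps the inclusive range [from, to]. Each file appears exactly once.
func fileStarts(from, to, ledgersPerFile uint32) []uint32 {
	if ledgersPerFile == 0 {
		// Defensive fallback (assumed behaviour): a zero step would
		// otherwise never advance the loop below.
		ledgersPerFile = 1
	}
	if from > to {
		return nil
	}
	// Align the start down to the file boundary that contains `from`.
	start := from - from%ledgersPerFile

	var starts []uint32
	// Iterate in uint64 so the step cannot wrap around near MaxUint32.
	for s := uint64(start); s <= uint64(to); s += uint64(ledgersPerFile) {
		starts = append(starts, uint32(s))
	}
	return starts
}

func main() {
	fmt.Println(fileStarts(100, 200, 64))
	fmt.Println(fileStarts(5, 7, 0))
}
```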

StorageConfig cold-tier fields + LedgerstreamConfig wires
them (#7 implementation step 1c). New TOML fields in
[storage]: s3_cold_endpoint, s3_cold_region,
s3_cold_bucket_archive, s3_cold_access_key_env,
s3_cold_secret_key_env. All default to empty — every
pre-ADR-0027 deployment continues to use the legacy single-
source path byte-for-byte. StorageConfig.ColdTieringEnabled()
returns true iff s3_cold_bucket_archive is set (the
LCM_TIER_ENABLED=false of ADR-0027 §Step 1 expressed as a
field presence). pipeline.LedgerstreamConfig populates
ledgerstream.Config.ColdDataStore when tiering is enabled
and the caller is reading the archive bucket — the live
bucket (galexie-live) is the rolling near-tip working set
authored locally and is never tiered. Tests cover the
no-cold-tier default, the cold-tier-archive path, the
cold-tier-skipped-for-live-bucket guard, and the
ColdTieringEnabled truth table. ADR-0027 §Step 1 is now
complete in code; §Steps 2-5 (trim + rehydrate operators,
flag-flip on r1, bulk trim, monthly cadence) follow as
separate commits.
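
As a rough illustration of the field-presence flag and the live-bucket guard, the sketch below uses a trimmed-down struct. The Go field names and the `useColdFor` helper are assumptions made for this example; only the TOML keys and `ColdTieringEnabled()` are named in the entry above.

```go
package main

import "fmt"

// StorageConfig here is a reduced, hypothetical version of the real struct.
type StorageConfig struct {
	S3BucketArchive     string // storage.s3_bucket_archive (hot tier)
	S3ColdBucketArchive string // storage.s3_cold_bucket_archive
}

// ColdTieringEnabled reports whether a cold archive bucket is configured.
// An empty value keeps the legacy single-source path.
func (c StorageConfig) ColdTieringEnabled() bool {
	return c.S3ColdBucketArchive != ""
}

// useColdFor is a hypothetical helper: only reads against the archive
// bucket get a cold fallback; the live bucket is never tiered.
func useColdFor(c StorageConfig, bucket string) bool {
	return c.ColdTieringEnabled() && bucket == c.S3BucketArchive
}

func main() {
	cfg := StorageConfig{
		S3BucketArchive:     "galexie-archive",
		S3ColdBucketArchive: "aws-public-blockchain",
	}
	fmt.Println(useColdFor(cfg, "galexie-archive"))
	fmt.Println(useColdFor(cfg, "galexie-live"))
}
```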

ledgerstream.Stream gains an opt-in tiered read path
(#7 implementation step 1b). Config learns a new optional
ColdDataStore datastore.DataStoreConfig field; when set,
Stream constructs a TieredDataStore wrapping
the hot (Config.DataStore) + cold (Config.ColdDataStore)
underlying stores, builds a BufferedStorageBackend directly
on top, and drives the LCM iteration with a loop that mirrors
the SDK's ingest.ApplyLedgerMetadata shape — same bounded /
unbounded validation, same max(2, range.From) clamp, same
GetLedger-per-ledger sequence, same WithMetrics wrap when a
registry is provided. When ColdDataStore is zero-valued
(the default), the legacy single-source path through
ingest.ApplyLedgerMetadata is used unchanged — backward
compatible with every existing caller. This satisfies ADR-0027
§Sequencing step 1 ("Land the dual-source read path behind a
LCM_TIER_ENABLED=false flag"); the flag here is the
presence/absence of ColdDataStore rather than a separate
bool. Operator-facing config wiring (parsing the cold-tier TOML
block + populating cfg.ColdDataStore) is the next step.

ledgerstream.TieredDataStore: two-tier datastore.DataStore
fallback chain (#7 implementation step 1). Satisfies the SDK's
datastore.DataStore interface; composes a hot + cold
underlying store. Reads try hot first, fall through to cold on
IsNotFound errors only — transient errors (network timeouts,
auth failures, throttling) propagate immediately so a
misconfigured hot endpoint surfaces as the operator's problem
rather than being masked by a slow cold path that always
succeeds. Writes (PutFile, PutFileIfNotExists) target hot
exclusively (cold is read-only by design — production cold is
aws-public-blockchain, the AWS Open Data Sponsorship bucket).
ListFilePaths unions hot + cold with hot-wins dedup so a
backfill spanning the tier boundary sees every partition.
Optional Prometheus metrics: ratesengine_ledgerstream_tier_read_total{outcome="hot"|"cold"|"both_missing"}
and ratesengine_ledgerstream_cold_read_duration_seconds. Not
yet wired into ledgerstream.Stream's Config — that integration
is the next step (still behind the planned LCM_TIER_ENABLED
feature flag per ADR-0027 §Sequencing).

Docs

docs/operations/lcm-cache-tiering.md: operator runbook for
ADR-0027 §Steps 3-5 (#7 implementation companion). Step-by-step
playbook for the operator-gated transition: TOML flag-flip
(Step 3), first bulk trim with chunked 1M-ledger invocations +
per-chunk pool monitoring (Step 4), and the monthly cadence
caveat (Step 5: timer not yet shipped, pending an
--older-than-duration mode that resolves tip at run time).
Includes pre-flight checklist, cutoff-ledger computation
formula (TIP - 90 × 17280), rollback playbook via the
rehydrate operator, and a "common failure modes" catalogue
(cold tier check fails, cold.Exists warnings, pool capacity
rise during trim, indexer cold.GetFile errors). Metrics
reference points operators at
ratesengine_ledgerstream_tier_read_total{outcome=...} and
ratesengine_ledgerstream_cold_read_duration_seconds for
real-time visibility.
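
The cutoff formula is plain arithmetic: 17280 is the number of ledgers per day at roughly one ledger every 5 seconds (86400 / 5), so a 90-day window is 1,555,200 ledgers. The sketch below works that through; `cutoffLedger` and `chunkCutoffs` are hypothetical helpers, the tip value is made up, and the idea that each chunk is one invocation with a successively higher `--older-than-ledger` is an assumption about how the runbook chunks the bulk trim.

```go
package main

import "fmt"

// ledgersPerDay follows the runbook's 17280 figure (86400 s / ~5 s per ledger).
const ledgersPerDay = 17280

// cutoffLedger computes TIP - hotDays × 17280, clamped at 0.
func cutoffLedger(tip, hotDays uint32) uint32 {
	window := uint64(hotDays) * ledgersPerDay
	if window >= uint64(tip) {
		return 0 // the whole chain is inside the hot window: nothing to trim
	}
	return tip - uint32(window)
}

// chunkCutoffs lists successive cutoffs in `step`-sized increments,
// ending exactly at the final cutoff.
func chunkCutoffs(cutoff, step uint32) []uint32 {
	if cutoff == 0 || step == 0 {
		return nil
	}
	var out []uint32
	for c := uint64(step); c < uint64(cutoff); c += uint64(step) {
		out = append(out, uint32(c))
	}
	return append(out, cutoff)
}

func main() {
	tip := uint32(62_000_000) // hypothetical tip, for illustration only
	cutoff := cutoffLedger(tip, 90)
	fmt.Println(cutoff) // 62,000,000 - 1,555,200 = 60,444,800
	fmt.Println(len(chunkCutoffs(cutoff, 1_000_000)), "chunked invocations")
}
```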

ADR-0027 (Proposed): LCM cache tiering, with local
galexie-archive as hot and aws-public-blockchain as cold (#7
design pass). R1's ZFS pool is at 93% (12.5 TB used, 1.35 TB
free) with the 2026-05-17 SEV showing what structural-tight
headroom costs. The biggest single tier-able lever is the
4.96 TB data/minio dataset (mostly galexie-archive's
genesis→tip LCM mirror); the AWS Open Data Sponsorship publishes
the same data at sub-15ms for in-region readers and ~80 ms per-
GET (amortised over 64-ledger partitions) for r1. ADR proposes a
90 d hot window in local galexie-archive with cold reads
falling back to AWS, a HEAD-verify-before-delete trim operator
(ratesengine-ops trim-galexie-archive), and a five-step
rollout that lands the dual-source read path under a feature
flag before any deletion happens. Recovers ~3-4 TB at the 90 d
cutoff, unblocking #30 (composite index on the 2.7B-row
trades hypertable) and #35 (the SEV-frozen Soroban-era
backfill resume). History-archive offload + galexie-live
promotion-cadence tuning + PostgreSQL chunk retention beyond
current policy are explicitly out of scope as separate ADRs.

Changed

buildPoolsQuery reads from pools_per_source_1h (#25 phase
2). Replaces three trades-hypertable scans (vol_24h CTE,
last_px DISTINCT-ON CTE, outer FROM trades) with a single CAGG
scan + GROUP BY. The XLM-fallback semantics for unpriced trades
are preserved exactly (priced trades contribute their stored
usd_volume; trades with an XLM leg fall back to
base_amount × XLM/USD or quote_amount × XLM/USD; pure-token-
token unpriced trades contribute 0 — pre-#25 returned NULL, but
the handler scan collapsed NULL and "0" identically, so client-
visible behaviour unchanged). Trade-off: last_trade_at lags by
up to one CAGG refresh interval (5 min); acceptable for a pools
discovery surface. After this commit ships, #23's
CachedMarketsReader SWR layer becomes a latency nicety rather
than load-bearing — refresh fills stop paying the 8-30s trades-
scan cost. Integration test bootstrap force-refreshes the new
CAGG alongside prices_1m. Operator note: the CAGG was
created WITH NO DATA in migration 0036; the 5-minute policy
only refreshes the last 7 days. Run
CALL refresh_continuous_aggregate('pools_per_source_1h', NULL, NULL)
once on r1 after the 0036 migration applies, to backfill the
14d-window's historical buckets so /v1/pools sees the full pool
set immediately rather than ramping up over a week.
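
The XLM-fallback rule preserved by this change can be restated as a per-trade function. This is only a Go paraphrase of the semantics listed above, not the SQL the handler runs: the `trade` struct and `usdContribution` are hypothetical, `float64` stands in for whatever numeric type the database uses, and picking the base leg before the quote leg when both could apply is an assumption.

```go
package main

import "fmt"

// trade is a hypothetical, simplified view of one row.
type trade struct {
	USDVolume   *float64 // nil when the trade is unpriced
	BaseIsXLM   bool
	QuoteIsXLM  bool
	BaseAmount  float64
	QuoteAmount float64
}

// usdContribution returns the USD volume a single trade adds to a pool.
func usdContribution(t trade, xlmUSD float64) float64 {
	switch {
	case t.USDVolume != nil:
		return *t.USDVolume // priced: use the stored usd_volume
	case t.BaseIsXLM:
		return t.BaseAmount * xlmUSD // unpriced, XLM on the base leg
	case t.QuoteIsXLM:
		return t.QuoteAmount * xlmUSD // unpriced, XLM on the quote leg
	default:
		return 0 // unpriced token-to-token trade
	}
}

func main() {
	priced := 12.5
	trades := []trade{
		{USDVolume: &priced},
		{BaseIsXLM: true, BaseAmount: 10},
		{QuoteIsXLM: true, QuoteAmount: 8},
		{BaseAmount: 3, QuoteAmount: 4},
	}
	total := 0.0
	for _, t := range trades {
		total += usdContribution(t, 0.5)
	}
	fmt.Println(total)
}
```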

Added

Per-source pools continuous aggregate — pools_per_source_1h
(#25, migration 0036). The durable backing for /v1/pools.
Pre-#25 the handler's buildPoolsQuery scanned the full trades
hypertable for ts >= NOW() - 24h grouped by (source, base, quote) — measured 8-30s; #23 wrapped it in SWR (sub-ms warm,
~8s cold first hit). This CAGG pre-aggregates per
(source, base_asset, quote_asset, 1h bucket):
sum_usd_priced, sum_base_unpriced / sum_quote_unpriced
(Phase-1 vs needs-XLM-fallback splits), trade_count, and
last(quote_amount/base_amount, ts) for the per-pool latest
price. Refresh policy every 5 minutes covering the last 7 days
(over-refresh tolerates late-arriving backfilled trades — the
#38 router/defindex run is the current example). Storage:
~3-4M rows steady-state (~hundreds of MB) — small enough to keep
no retention so operators can later widen the window past 24h.
Handler refactor to read from the CAGG ships in a follow-up
commit (this commit lands the migration first so the CAGG can
materialize cleanly before the handler depends on it). After
refactor, #23's SWR becomes a latency nicety rather than
load-bearing.

Changed

node-exporter consolidation: legacy → Debian package (#33).
The pre-#33 state was two units fighting :9100 — a hand-rolled
node_exporter.service (custom unit + /usr/local/bin/node_exporter,
2024 binary) running, and the apt-installed
prometheus-node-exporter.service perpetually failing because
the port was taken. Cut over to the Debian package live on r1:
configured /etc/default/prometheus-node-exporter with the
legacy's exact flags (--collector.systemd --collector.processes --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
— preserves every existing textfile metric: archive_completeness.prom,
sla_probe.prom, galexie_archive_tip_lag.prom), stopped + disabled
the legacy unit, restarted the package. Live-verified: 13 textfile
metric lines visible, 3127 node_* metrics serving,
prometheus-node-exporter no longer in the failed-unit list.
Codified in Ansible (10-observability.yml): apt-install the
package, template ARGS=, enable, idempotent stop+disable of any
pre-existing legacy unit. Legacy unit file + binary deliberately
retained for zero-downtime rollback (systemctl stop prometheus-node-exporter && systemctl start node_exporter).

/v1/diagnostics/ingestion pregenerated server-side (#16).
Background goroutine Server.StartIngestionSnapshotRefresh
builds the full ingestion-diagnostics snapshot every 15 s into
an atomic.Pointer[ingestionSnapshotEntry]; the handler reads
the atomic and writes it sub-ms instead of the previous ~417 ms
inline build (7 parallel DB-filler goroutines + post-fillers
coverage projection). Inline build remains as the cold-start
fallback (the atomic is nil until the first refresh fires).
Cadence (15 s) matches the existing Cache-Control: max-age=15
header. Refresh uses a detached context.Background()-derived
ctx (//nolint:gosec,contextcheck — intentional, the parent is
the api process lifetime, not any request). Launched alongside
the existing prewarmCaches goroutine in cmd/ratesengine-api.
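
The pregeneration pattern (background refresh into an atomic pointer, inline build as cold-start fallback) can be sketched as follows. Names here are hypothetical simplifications of the ones in the entry, the builder takes no arguments for brevity, and refreshing once immediately before the first tick is an assumption. `atomic.Pointer` requires Go 1.19 or later.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// snapshotEntry is a hypothetical stand-in for ingestionSnapshotEntry.
type snapshotEntry struct {
	Body    []byte
	BuiltAt time.Time
}

type server struct {
	snap  atomic.Pointer[snapshotEntry]
	build func() []byte // the expensive inline build
}

// refreshOnce builds a fresh snapshot and publishes it atomically.
func (s *server) refreshOnce() {
	s.snap.Store(&snapshotEntry{Body: s.build(), BuiltAt: time.Now()})
}

// startRefresh rebuilds the snapshot on a fixed cadence until ctx is done.
func (s *server) startRefresh(ctx context.Context, every time.Duration) {
	go func() {
		ticker := time.NewTicker(every)
		defer ticker.Stop()
		for {
			s.refreshOnce()
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
			}
		}
	}()
}

// read serves the published snapshot, or builds inline when none exists yet.
func (s *server) read() []byte {
	if e := s.snap.Load(); e != nil {
		return e.Body
	}
	return s.build() // cold-start fallback
}

func main() {
	s := &server{build: func() []byte { return []byte(`{"ok":true}`) }}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	s.startRefresh(ctx, 15*time.Second)
	fmt.Println(string(s.read()))
}
```

Readers never block on the builder once a snapshot exists: they only load a pointer, and each refresh swaps in a complete new entry rather than mutating the old one.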

Added

galexie-archive tip-lag alert (#31) — defense-in-depth for
#26. Adds a Prometheus textfile-collector metric
(galexie_archive_tip_lag_ledgers and friends) computed every
5 min by galexie-archive-tip-lag.{service,timer} running
/usr/local/bin/galexie-archive-tip-lag. The accompanying alert
pages (ratesengine_galexie_archive_tip_lag_severe) within hours
if the hourly galexie-archive-fill.timer silently breaks — the
exact failure class that let #26 go undetected for 23 days.
Rules added to BOTH deploy/monitoring/rules/galexie-archive.yml
and configs/prometheus/rules.r1/galexie-archive.yml (wave-96
dual-dir). Runbook at
docs/operations/runbooks/galexie-archive-tip-lag.md. Codified
in Ansible (07-galexie.yml: copy script + install
.j2-templated unit + enable timer). Live on r1 (current lag
9,388 ledgers — well below the warn threshold of 5,000 sustained
for 30 min). Three alert variants: _high (P3, warn 5 k for
30 m), _severe (P1, page 50 k for 30 m), _metric_stale (P3,
the metric file hasn't refreshed in 30 m — the alert canary).
