
fix: use audius_core_root_dir in rollback entrypoint#226

Open
raymondjacobson wants to merge 10 commits into main from fix/rollback-entrypoint-core-root-dir

Conversation

@raymondjacobson
Contributor

Summary

  • The rollback entrypoint hardcoded /data/core when searching for the CometBFT data directory
  • Nodes can configure a different root via audius_core_root_dir (e.g. /data/bolt on creatornode3)
  • Now uses ${audius_core_root_dir:-/data/core} so the rollback subcommand works on all nodes

Test plan

  • Scale down a node, run kubectl apply with a debug pod using command: ["/bin/entrypoint.sh", "rollback", "-blocks", "1", "-dry-run"], and verify it finds the correct CometBFT data directory

🤖 Generated with Claude Code

raymondjacobson and others added 10 commits May 5, 2026 15:38
The rollback entrypoint hardcoded /data/core as the CometBFT root,
but nodes can configure a different path via audius_core_root_dir.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CometBFT does not emit NewBlock events during fast sync, so the block
event subscription never updates currentHeight. This caused syncHeight
in GetStatus to freeze at the startup value, making it impossible to
monitor block sync progress via the API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tate sync

pg_restore on large tables (core_transactions, 19M+ rows) generates ~4.4GB
of WAL per checkpoint cycle, causing OOM even with synchronous_commit=off
and wal_level=minimal (the latter only skips WAL when the table creation and
COPY run in the same transaction, which pg_restore does not do).

Split the restore into three phases: pre-data (schema), data, post-data
(indexes). During the data phase, all tables are set to UNLOGGED, which
completely suppresses WAL generation for writes. Tables are converted back
to LOGGED after data load, before indexes are built.

Also tunes synchronous_commit, max_wal_size, and checkpoint_timeout before
the restore to reduce background I/O pressure.
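The three phases can be sketched in Go; helper names and pg_restore flags below are illustrative, not the project's exact invocation:

```go
package main

import "fmt"

// restoreArgs builds a pg_restore argument list for one phase of the split
// restore: "pre-data" (schema), "data" (table contents), "post-data"
// (indexes and constraints). Flags are illustrative.
func restoreArgs(section, dbURL, dumpPath string) []string {
	return []string{"pg_restore", "--section=" + section, "--no-owner", "--dbname=" + dbURL, dumpPath}
}

// persistenceSQL returns the ALTER used to toggle WAL for a table:
// UNLOGGED during the data phase suppresses WAL generation for writes;
// LOGGED restores crash-safety before the post-data (index) phase runs.
func persistenceSQL(table string, logged bool) string {
	mode := "UNLOGGED"
	if logged {
		mode = "LOGGED"
	}
	return fmt.Sprintf("ALTER TABLE public.%s SET %s", table, mode)
}
```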

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a message field to StateSyncInfo proto (field 7) and set it during
each phase of RestoreDatabase: schema restore, preparing tables, data
load, finalizing tables, building indexes. Display it on the console
overview page under the phase label so operators can see which step the
restore is on rather than just "Restoring Database".
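A sketch of the proto change, keeping the message name and field number from the description (the existing fields are elided, not reproduced):

```proto
message StateSyncInfo {
  // ...existing fields 1-6 elided...

  // Human-readable description of the current restore step, e.g.
  // "building indexes", shown on the console overview page
  // under the phase label.
  string message = 7;
}
```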

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues during state sync data restore:
1. Migrations pre-populate some tables (core_db_migrations) before the
   pg_restore runs, causing COPY to fail with duplicate key errors. Fix by
   truncating all public tables with CASCADE before the data phase.
2. sla_rollups has a FK to sla_node_reports; if sla_rollups is processed
   first (non-deterministic without ORDER BY), the UNLOGGED alter fails.
   Fix by ordering the table list alphabetically.

Also adds error/success logging for the postgres tuning commands so we
can confirm they are actually applied.
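Both fixes reduce to small, order-sensitive helpers; a Go sketch with hypothetical function names (the real code runs the resulting SQL against the live connection):

```go
package main

import (
	"fmt"
	"sort"
)

// orderedTables returns a deterministic, alphabetical copy of the table
// list, so sla_node_reports is always altered before sla_rollups (its FK
// dependent) rather than in whatever order the catalog query returns.
func orderedTables(tables []string) []string {
	out := append([]string(nil), tables...)
	sort.Strings(out)
	return out
}

// truncateStmt empties a pre-populated table (e.g. core_db_migrations)
// before the data phase; CASCADE clears FK-dependent rows too, so COPY
// cannot hit duplicate key errors.
func truncateStmt(table string) string {
	return fmt.Sprintf("TRUNCATE TABLE public.%s CASCADE", table)
}
```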

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fork PRs run CI with the approval gate but never receive secrets.
Make login conditional on inputs.push and pass push=false for fork PRs
so build+test runs but nothing is pushed to Docker Hub.
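A hypothetical sketch of the conditional login step (the inputs.push condition is from the change; step name, action version, and secret names are illustrative):

```yaml
steps:
  - name: Log in to Docker Hub
    # Fork PRs never receive secrets, so skip login unless we intend to push.
    if: ${{ inputs.push == 'true' }}
    uses: docker/login-action@v3
    with:
      username: ${{ secrets.DOCKERHUB_USERNAME }}
      password: ${{ secrets.DOCKERHUB_TOKEN }}
```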

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During state sync, the TRUNCATE loop was calling db.Exec() while rows
were still open on the same connection, causing all truncates to fail
with "conn busy". Collect names into a slice first (same pattern as
alterPersistence) then close rows before running the TRUNCATEs.
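The collect-then-close pattern, sketched against a minimal cursor interface (the real code uses the pgx rows type; the interface and fake below are illustrative):

```go
package main

// rowScanner is the minimal surface of a database cursor the fix needs.
type rowScanner interface {
	Next() bool
	Scan(dest ...any) error
	Close() error
}

// collectTableNames drains the cursor into a slice and closes it BEFORE the
// caller issues any Exec on the same connection; executing while rows are
// still open is what produced the "conn busy" failures.
func collectTableNames(rows rowScanner) ([]string, error) {
	var names []string
	for rows.Next() {
		var n string
		if err := rows.Scan(&n); err != nil {
			rows.Close()
			return nil, err
		}
		names = append(names, n)
	}
	return names, rows.Close()
}

// fakeRows is a test double standing in for a live cursor.
type fakeRows struct {
	names  []string
	i      int
	closed bool
}

func (f *fakeRows) Next() bool {
	if f.i < len(f.names) {
		f.i++
		return true
	}
	return false
}

func (f *fakeRows) Scan(dest ...any) error {
	*(dest[0].(*string)) = f.names[f.i-1]
	return nil
}

func (f *fakeRows) Close() error {
	f.closed = true
	return nil
}
```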

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…napshots

stateSyncLatestBlock was using the oldest snapshot on the RPC server to
compute the trust height. When that snapshot is older than the 168h trust
period, the light client fails with "old header has expired".

Fix: derive trust height as (latestHeight - 288000) — ~2 days before
current. This is always within the trust period, doesn't require the
RPC servers to have any snapshots, and stays below recent snapshots
(interval ~100k blocks) so CometBFT can still pick one to sync from.
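The derivation is simple enough to show directly; a sketch with a hypothetical helper name (288000 blocks is the ~2-day offset described above):

```go
package main

// trustOffset is ~2 days of blocks, comfortably inside the 168h trust
// period regardless of which snapshot CometBFT ends up picking.
const trustOffset = 288000

// trustHeight derives the light-client trust height from the RPC server's
// latest height, instead of from its oldest snapshot (which may have
// aged out of the trust period).
func trustHeight(latest int64) int64 {
	h := latest - trustOffset
	if h < 1 {
		h = 1 // young chains: fall back to the first block
	}
	return h
}
```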

Also removes the now-unused sdk and connect imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When OfferSnapshot rejected a snapshot due to hash mismatch, it left
acceptedSnapshotHeight set, causing all subsequent snapshot offers to
be rejected with "already syncing to different snapshot". Reset the
accepted snapshot state so CometBFT can fall through to the next
available snapshot height.
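A sketch of the corrected acceptance logic (struct and method names are hypothetical; the real handler is the ABCI OfferSnapshot implementation):

```go
package main

import "bytes"

// snapState tracks which snapshot this node has agreed to restore.
type snapState struct {
	acceptedHeight uint64
	acceptedHash   []byte
}

// offer accepts a snapshot whose hash matches what we expect. On a hash
// mismatch it now RESETS the accepted state, so the next offer at a
// different height is evaluated instead of being rejected with
// "already syncing to different snapshot".
func (s *snapState) offer(height uint64, hash, expected []byte) bool {
	if s.acceptedHeight != 0 && s.acceptedHeight != height {
		return false // mid-restore of another snapshot
	}
	if !bytes.Equal(hash, expected) {
		s.acceptedHeight = 0 // the fix: clear accepted state on rejection
		s.acceptedHash = nil
		return false
	}
	s.acceptedHeight = height
	s.acceptedHash = hash
	return true
}
```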

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…restore path

Two reliability fixes for state sync:

1. Snapshot creation after catch-up: When a node restarts and block-syncs
   to the current head, createSnapshot skips every interval that fires
   while CatchingUp=true. The snapshot creator now waits for sync to
   complete, then immediately creates a snapshot if the last one is more
   than BlockInterval blocks old. Prevents the current 8-16 hour gap
   where no fresh snapshots exist after a node restarts.

2. Remove direct restore path: The startup optimization that skipped
   CometBFT state sync when all chunks were already on disk left
   CometBFT's state.db at height 0 while ABCI returned the snapshot
   height, causing a handshake mismatch on every start. CometBFT's
   native state sync path (ApplySnapshotChunk) reuses chunks already
   on disk and correctly updates state.db to the snapshot height.
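The catch-up half of the change reduces to a small predicate; a sketch with hypothetical names:

```go
package main

// shouldSnapshotNow reports whether the snapshot creator should fire right
// after catch-up: never while block-syncing, and afterwards only when the
// newest snapshot is more than one interval stale. This closes the 8-16
// hour gap after a restart, where every interval tick used to be skipped
// because CatchingUp was still true.
func shouldSnapshotNow(catchingUp bool, currentHeight, lastSnapshotHeight, blockInterval int64) bool {
	if catchingUp {
		return false
	}
	return currentHeight-lastSnapshotHeight > blockInterval
}
```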

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
raymondjacobson added a commit that referenced this pull request May 7, 2026
Each merge to main pushes a new :edge tag, triggering the auto-upgrader
on snapshot-serving nodes. When those nodes restart and catch up, the
snapshot creator skips every 100k-block interval that fires during
CatchingUp=true, leaving no fresh snapshots for up to ~16 hours.

Re-enable once PR #226 (catch-up snapshot fix) is merged and deployed.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>