fix: use audius_core_root_dir in rollback entrypoint #226
Open
raymondjacobson wants to merge 10 commits into main
The rollback entrypoint hardcoded /data/core as the CometBFT root, but nodes can configure a different path via audius_core_root_dir. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
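The entrypoint itself is a shell script, so the fix there is a parameter expansion. As a minimal sketch of the same fallback rule in Go, assuming the setting arrives as an environment variable named audius_core_root_dir (the function names here are hypothetical):

```go
package main

import (
	"fmt"
	"os"
)

// defaultCoreRootDir is the path the rollback entrypoint previously hardcoded.
const defaultCoreRootDir = "/data/core"

// resolveRootDir returns the configured directory when it is non-empty and
// the default otherwise. This matches the shell form
// ${audius_core_root_dir:-/data/core}, which also falls back on an empty value.
func resolveRootDir(configured string) string {
	if configured != "" {
		return configured
	}
	return defaultCoreRootDir
}

// coreRootDir reads the setting from the environment (an assumption of this
// sketch) and applies the fallback.
func coreRootDir() string {
	return resolveRootDir(os.Getenv("audius_core_root_dir"))
}

func main() {
	fmt.Println("CometBFT root:", coreRootDir())
}
```

Keeping the fallback in a pure function makes the rule easy to check without touching the process environment.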
CometBFT does not emit NewBlock events during fast sync, so the block event subscription never updates currentHeight. This caused syncHeight in GetStatus to freeze at the startup value, making it impossible to monitor block sync progress via the API. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tate sync

pg_restore on large tables (core_transactions, 19M+ rows) generates ~4.4 GB of WAL per checkpoint cycle, causing OOM even with synchronous_commit=off and wal_level=minimal. The wal_level=minimal optimization only helps when the table and its COPY are in the same transaction, which pg_restore doesn't do.

Split the restore into three phases: pre-data (schema), data, and post-data (indexes). During the data phase, all tables are set to UNLOGGED, which completely suppresses WAL generation for writes. Tables are converted back to LOGGED after the data load, before indexes are built.

Also tunes synchronous_commit, max_wal_size, and checkpoint_timeout before the restore to reduce background I/O pressure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
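The phase ordering described in this commit can be sketched as a plan builder. This is an illustration, not the code in RestoreDatabase: the function names, the dump path, and the table list are hypothetical, connection flags are omitted, and the tuning settings are left out. The only external facts it relies on are pg_restore's --section option and PostgreSQL's ALTER TABLE ... SET UNLOGGED / SET LOGGED.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent double-quotes a SQL identifier and escapes embedded quotes.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// restorePlan lists, in order, the steps of a three-phase restore.
// Entries starting with "pg_restore" are shell commands; the rest are SQL.
// dumpPath and tables are hypothetical inputs used for illustration.
func restorePlan(dumpPath string, tables []string) []string {
	// Phase 1: schema only.
	plan := []string{"pg_restore --section=pre-data " + dumpPath}
	// Switch tables to UNLOGGED so the bulk load writes no WAL.
	for _, t := range tables {
		plan = append(plan, "ALTER TABLE "+quoteIdent(t)+" SET UNLOGGED")
	}
	// Phase 2: table data.
	plan = append(plan, "pg_restore --section=data "+dumpPath)
	// Convert back to LOGGED before indexes are built.
	for _, t := range tables {
		plan = append(plan, "ALTER TABLE "+quoteIdent(t)+" SET LOGGED")
	}
	// Phase 3: indexes and constraints.
	plan = append(plan, "pg_restore --section=post-data "+dumpPath)
	return plan
}

func main() {
	for i, step := range restorePlan("snapshot.dump", []string{"core_transactions"}) {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```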
Add a message field to StateSyncInfo proto (field 7) and set it during each phase of RestoreDatabase: schema restore, preparing tables, data load, finalizing tables, building indexes. Display it on the console overview page under the phase label so operators can see which step the restore is on rather than just "Restoring Database". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues during state sync data restore:

1. Migrations pre-populate some tables (core_db_migrations) before the pg_restore runs, causing COPY to fail with duplicate key errors. Fix by truncating all public tables with CASCADE before the data phase.
2. sla_rollups has a FK to sla_node_reports; if sla_rollups is processed first (non-deterministic without ORDER BY), the UNLOGGED alter fails. Fix by ordering the table list alphabetically.

Also adds error/success logging for the postgres tuning commands so we can confirm they are actually applied.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
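A minimal sketch of both fixes combined: sort the table names so every run processes them in the same order, then emit one TRUNCATE ... CASCADE per table. The commit applies the ordering in the SQL query (ORDER BY); sorting in Go here is a stand-in for illustration, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// quoteIdent double-quotes a SQL identifier and escapes embedded quotes.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// truncateStatements returns one TRUNCATE ... CASCADE statement per table,
// in alphabetical order so the sequence is identical on every run.
func truncateStatements(tables []string) []string {
	sorted := append([]string(nil), tables...) // copy; leave the caller's slice untouched
	sort.Strings(sorted)
	stmts := make([]string, 0, len(sorted))
	for _, t := range sorted {
		stmts = append(stmts, "TRUNCATE TABLE "+quoteIdent(t)+" CASCADE")
	}
	return stmts
}

func main() {
	tables := []string{"sla_rollups", "core_db_migrations", "sla_node_reports"}
	for _, s := range truncateStatements(tables) {
		fmt.Println(s)
	}
}
```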
Fork PRs run CI with the approval gate but never receive secrets. Make login conditional on inputs.push and pass push=false for fork PRs so build+test runs but nothing is pushed to Docker Hub. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During state sync, the TRUNCATE loop was calling db.Exec() while rows were still open on the same connection, causing all truncates to fail with "conn busy". Collect names into a slice first (same pattern as alterPersistence) then close rows before running the TRUNCATEs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
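The collect-then-execute pattern can be sketched as follows. It assumes a database/sql-style cursor; the rowIter interface and fakeRows type are hypothetical stand-ins so the example runs without a database (*sql.Rows has the same four methods).

```go
package main

import (
	"errors"
	"fmt"
)

// rowIter is the subset of *sql.Rows used here, declared as an interface so
// the sketch can run against a fake cursor.
type rowIter interface {
	Next() bool
	Scan(dest ...any) error
	Err() error
	Close() error
}

// collectNames drains the cursor into a slice and closes it, so the
// connection is free before the caller runs any follow-up statement.
func collectNames(rows rowIter) ([]string, error) {
	defer rows.Close()
	var names []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			return nil, err
		}
		names = append(names, name)
	}
	return names, rows.Err()
}

// fakeRows is an in-memory cursor used only to exercise collectNames.
type fakeRows struct {
	names  []string
	pos    int
	closed bool
}

func (f *fakeRows) Next() bool {
	if f.pos >= len(f.names) {
		return false
	}
	f.pos++
	return true
}

func (f *fakeRows) Scan(dest ...any) error {
	if len(dest) != 1 {
		return errors.New("expected exactly one destination")
	}
	p, ok := dest[0].(*string)
	if !ok {
		return errors.New("expected *string destination")
	}
	*p = f.names[f.pos-1]
	return nil
}

func (f *fakeRows) Err() error   { return nil }
func (f *fakeRows) Close() error { f.closed = true; return nil }

func main() {
	rows := &fakeRows{names: []string{"core_blocks", "core_transactions"}}
	names, err := collectNames(rows)
	fmt.Println(names, err, "cursor closed:", rows.closed)
	// Only now would the caller loop over names and run each TRUNCATE.
}
```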
…napshots

stateSyncLatestBlock was using the oldest snapshot on the RPC server to compute the trust height. When that snapshot is older than the 168h trust period, the light client fails with "old header has expired".

Fix: derive trust height as (latestHeight - 288000), roughly 2 days before current. This is always within the trust period, doesn't require the RPC servers to have any snapshots, and stays below recent snapshots (interval ~100k blocks) so CometBFT can still pick one to sync from.

Also removes the now-unused sdk and connect imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
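The arithmetic behind the offset: 288000 blocks in about 2 days (172,800 s) implies a block time near 0.6 s, a figure inferred from the commit message rather than stated in it. At that rate the 168h trust period (604,800 s) spans about 1,008,000 blocks, so an offset of 288000 sits well inside it, and it exceeds the ~100k-block snapshot interval. A minimal sketch of the calculation follows; the clamp to height 1 is an assumption added here, not something the commit mentions.

```go
package main

import "fmt"

// trustOffsetBlocks is the offset from the commit message: 288000 blocks,
// described there as roughly two days behind the chain head.
const trustOffsetBlocks int64 = 288000

// trustHeight returns latestHeight minus the fixed offset. Clamping to
// height 1 on short chains is an assumption of this sketch.
func trustHeight(latestHeight int64) int64 {
	h := latestHeight - trustOffsetBlocks
	if h < 1 {
		return 1
	}
	return h
}

func main() {
	fmt.Println(trustHeight(1_000_000)) // 712000
	fmt.Println(trustHeight(100))       // 1
}
```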
When OfferSnapshot rejected a snapshot due to hash mismatch, it left acceptedSnapshotHeight set, causing all subsequent snapshot offers to be rejected with "already syncing to different snapshot". Reset the accepted snapshot state so CometBFT can fall through to the next available snapshot height. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
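A simplified model of the fix, using hypothetical types rather than the ABCI request and response structs: the accepted height is cleared whenever an offer is rejected for a hash mismatch, so a later offer at a different height is not blocked.

```go
package main

import (
	"bytes"
	"fmt"
)

// snapshotState tracks which snapshot height, if any, has been accepted.
// A value of 0 means no snapshot is currently accepted.
type snapshotState struct {
	acceptedHeight uint64
}

// offer reports whether the snapshot at height is accepted, with a reason
// when it is not. wantHash and gotHash stand in for the real hash check.
func (s *snapshotState) offer(height uint64, wantHash, gotHash []byte) (bool, string) {
	if s.acceptedHeight != 0 && s.acceptedHeight != height {
		return false, "already syncing to different snapshot"
	}
	s.acceptedHeight = height
	if !bytes.Equal(wantHash, gotHash) {
		// Reset so the next offered height is evaluated on its own merits.
		s.acceptedHeight = 0
		return false, "hash mismatch"
	}
	return true, ""
}

func main() {
	s := &snapshotState{}
	fmt.Println(s.offer(300, []byte{1}, []byte{2})) // rejected: hash mismatch
	fmt.Println(s.offer(200, []byte{1}, []byte{1})) // accepted
}
```

Without the reset, the second call in main would be rejected with "already syncing to different snapshot", which is the failure the commit describes.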
…restore path

Two reliability fixes for state sync:

1. Snapshot creation after catch-up: When a node restarts and block-syncs to the current head, createSnapshot skips every interval that fires while CatchingUp=true. The snapshot creator now waits for sync to complete, then immediately creates a snapshot if the last one is more than BlockInterval blocks old. Prevents the current 8-16 hour gap where no fresh snapshots exist after a node restarts.
2. Remove direct restore path: The startup optimization that skipped CometBFT state sync when all chunks were already on disk left CometBFT's state.db at height 0 while ABCI returned the snapshot height, causing a handshake mismatch on every start. CometBFT's native state sync path (ApplySnapshotChunk) already reuses chunks already on disk and correctly updates state.db to the snapshot height.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
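The catch-up rule in the first fix reduces to a small predicate. This is a sketch under assumed names and signature, not the snapshot creator's actual code: no snapshot while the node is catching up, and an immediate one afterwards if the last snapshot is more than BlockInterval blocks behind.

```go
package main

import "fmt"

// shouldSnapshotAfterCatchUp reports whether a snapshot should be taken
// right away. All parameter names are hypothetical; blockInterval plays the
// role of BlockInterval from the commit message.
func shouldSnapshotAfterCatchUp(catchingUp bool, latestHeight, lastSnapshotHeight, blockInterval int64) bool {
	if catchingUp {
		// Still syncing: wait rather than snapshot a lagging state.
		return false
	}
	// "More than BlockInterval blocks old" is read here as a strict comparison.
	return latestHeight-lastSnapshotHeight > blockInterval
}

func main() {
	fmt.Println(shouldSnapshotAfterCatchUp(true, 500000, 100000, 100000))  // false
	fmt.Println(shouldSnapshotAfterCatchUp(false, 500000, 100000, 100000)) // true
	fmt.Println(shouldSnapshotAfterCatchUp(false, 150000, 100000, 100000)) // false
}
```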
raymondjacobson added a commit that referenced this pull request May 7, 2026
Each merge to main pushes a new :edge tag, triggering the auto-upgrader on snapshot-serving nodes. When those nodes restart and catch up, the snapshot creator skips every 100k-block interval that fires during CatchingUp=true, leaving no fresh snapshots for up to ~16 hours. Re-enable once PR #226 (catch-up snapshot fix) is merged and deployed. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- The rollback entrypoint hardcoded /data/core when searching for the CometBFT data directory
- Nodes can configure a different root via audius_core_root_dir (e.g. /data/bolt on creatornode3)
- Use ${audius_core_root_dir:-/data/core} so the rollback subcommand works on all nodes

Test plan
- kubectl apply with debug pod using command: ["/bin/entrypoint.sh", "rollback", "-blocks", "1", "-dry-run"], verify it finds the correct CometBFT data directory

🤖 Generated with Claude Code