fix: state sync snapshot reliability (catch-up snapshot + UNLOGGED restore) #253
Merged
Conversation
…tate sync pg_restore on large tables (core_transactions, 19M+ rows) generates ~4.4GB of WAL per checkpoint cycle, causing OOM even with synchronous_commit=off and wal_level=minimal (which only helps when the table creation and the COPY happen in the same transaction, which pg_restore doesn't do). Split the restore into three phases: pre-data (schema), data, post-data (indexes). During the data phase, all tables are set to UNLOGGED, which completely suppresses WAL generation for writes. Tables are converted back to LOGGED after the data load, before indexes are built. Also tunes synchronous_commit, max_wal_size, and checkpoint_timeout before the restore to reduce background I/O pressure.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
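A minimal sketch of the phased restore this commit describes, assuming a single pgx connection and pg_restore on PATH. restoreInPhases is a made-up name, and while alterPersistence is referenced by a later commit, its real signature isn't shown here, so the one below is assumed.

```go
package statesync

import (
	"context"
	"fmt"
	"os/exec"

	"github.com/jackc/pgx/v5"
)

// alterPersistence flips every public table to UNLOGGED or back to LOGGED.
// Table names are drained into a slice first so the ALTERs never run while a
// result set is still open on the same connection.
func alterPersistence(ctx context.Context, conn *pgx.Conn, mode string) error {
	rows, err := conn.Query(ctx,
		`SELECT tablename FROM pg_tables WHERE schemaname = 'public' ORDER BY tablename`)
	if err != nil {
		return err
	}
	var tables []string
	for rows.Next() {
		var t string
		if err := rows.Scan(&t); err != nil {
			rows.Close()
			return err
		}
		tables = append(tables, t)
	}
	rows.Close()

	for _, t := range tables {
		if _, err := conn.Exec(ctx, fmt.Sprintf("ALTER TABLE %q SET %s", t, mode)); err != nil {
			return fmt.Errorf("alter %s to %s: %w", t, mode, err)
		}
	}
	return nil
}

// restoreInPhases runs pg_restore one --section at a time so the data phase can
// be bracketed by the UNLOGGED/LOGGED conversion.
func restoreInPhases(ctx context.Context, conn *pgx.Conn, connString, dumpPath string) error {
	section := func(name string) error {
		out, err := exec.CommandContext(ctx, "pg_restore",
			"--dbname", connString, "--section", name, "--no-owner", dumpPath).CombinedOutput()
		if err != nil {
			return fmt.Errorf("pg_restore %s: %w: %s", name, err, out)
		}
		return nil
	}
	if err := section("pre-data"); err != nil { // schema only, no indexes yet
		return err
	}
	if err := alterPersistence(ctx, conn, "UNLOGGED"); err != nil { // no WAL during COPY
		return err
	}
	if err := section("data"); err != nil {
		return err
	}
	if err := alterPersistence(ctx, conn, "LOGGED"); err != nil { // WAL back on before indexes
		return err
	}
	return section("post-data") // indexes, constraints, triggers
}
```

The trade-off is crash safety: PostgreSQL truncates UNLOGGED tables during crash recovery, which is acceptable here because the data phase can be re-run from the snapshot.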
Two issues during state sync data restore:
1. Migrations pre-populate some tables (core_db_migrations) before the pg_restore runs, causing COPY to fail with duplicate key errors. Fix by truncating all public tables with CASCADE before the data phase.
2. sla_rollups has a FK to sla_node_reports; if sla_rollups is processed first (non-deterministic without ORDER BY), the UNLOGGED alter fails. Fix by ordering the table list alphabetically.
Also adds error/success logging for the postgres tuning commands so we can confirm they are actually applied.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
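The previous commit tunes synchronous_commit, max_wal_size, and checkpoint_timeout; this one adds logging around those commands. A hedged sketch of what that could look like (the specific values, the slog logger, and the function name tuneForRestore are placeholders, not taken from the diff, and ALTER SYSTEM assumes superuser access):

```go
package statesync

import (
	"context"
	"log/slog"

	"github.com/jackc/pgx/v5"
)

// tuneForRestore applies the pre-restore Postgres tuning and logs each command's
// outcome so we can confirm the settings actually took effect. The values here
// are illustrative, not the ones used in the PR.
func tuneForRestore(ctx context.Context, conn *pgx.Conn, logger *slog.Logger) {
	stmts := []string{
		"ALTER SYSTEM SET synchronous_commit = 'off'",
		"ALTER SYSTEM SET max_wal_size = '16GB'",
		"ALTER SYSTEM SET checkpoint_timeout = '30min'",
		"SELECT pg_reload_conf()", // pick up the ALTER SYSTEM changes without a restart
	}
	for _, stmt := range stmts {
		if _, err := conn.Exec(ctx, stmt); err != nil {
			logger.Error("postgres tuning command failed", "stmt", stmt, "error", err)
			continue
		}
		logger.Info("postgres tuning command applied", "stmt", stmt)
	}
}
```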
During state sync, the TRUNCATE loop was calling db.Exec() while rows were still open on the same connection, causing all truncates to fail with "conn busy". Collect names into a slice first (same pattern as alterPersistence), then close rows before running the TRUNCATEs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
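A sketch of the corrected loop using pgx, whose Exec returns "conn busy" while a result set is still open on the same connection; the function name truncateAllTables is illustrative and the repo's actual driver wiring may differ:

```go
package statesync

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v5"
)

// truncateAllTables drains the table-name query into a slice and closes the rows
// before issuing any TRUNCATE, so Exec never runs while the connection is still
// busy with an open result set.
func truncateAllTables(ctx context.Context, conn *pgx.Conn) error {
	rows, err := conn.Query(ctx,
		`SELECT tablename FROM pg_tables WHERE schemaname = 'public' ORDER BY tablename`)
	if err != nil {
		return err
	}
	var tables []string
	for rows.Next() {
		var t string
		if err := rows.Scan(&t); err != nil {
			rows.Close()
			return err
		}
		tables = append(tables, t)
	}
	rows.Close()
	if err := rows.Err(); err != nil {
		return err
	}

	for _, t := range tables {
		// CASCADE also clears rows in tables that reference t, covering the
		// migration-populated tables that otherwise break COPY with duplicate keys.
		if _, err := conn.Exec(ctx, fmt.Sprintf("TRUNCATE TABLE %q CASCADE", t)); err != nil {
			return fmt.Errorf("truncate %s: %w", t, err)
		}
	}
	return nil
}
```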
…restore path
Two reliability fixes for state sync:
1. Snapshot creation after catch-up: When a node restarts and block-syncs to the current head, createSnapshot skips every interval that fires while CatchingUp=true. The snapshot creator now waits for sync to complete, then immediately creates a snapshot if the last one is more than BlockInterval blocks old. Prevents the current 8-16 hour gap where no fresh snapshots exist after a node restarts.
2. Remove direct restore path: The startup optimization that skipped CometBFT state sync when all chunks were already on disk left CometBFT's state.db at height 0 while ABCI returned the snapshot height, causing a handshake mismatch on every start. CometBFT's native state sync path (ApplySnapshotChunk) already reuses chunks on disk and correctly updates state.db to the snapshot height.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
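A sketch of the adjusted snapshot loop for fix 1. Every dependency is passed in as a placeholder function because the real field and method names in state_sync.go aren't shown in this PR view.

```go
package statesync

import (
	"context"
	"log/slog"
	"time"
)

// runSnapshotLoop skips intervals while the node is still catching up, then
// snapshots as soon as catch-up finishes if the newest snapshot has fallen more
// than blockInterval blocks behind the current height.
func runSnapshotLoop(
	ctx context.Context,
	logger *slog.Logger,
	tick time.Duration,
	blockInterval int64,
	nodeStatus func(context.Context) (catchingUp bool, height int64, err error),
	latestSnapshotHeight func() int64,
	createSnapshot func(context.Context, int64) error,
) {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			catchingUp, height, err := nodeStatus(ctx)
			if err != nil || catchingUp {
				continue // still block-syncing; skip this interval
			}
			// Don't wait for the next aligned interval after catch-up: snapshot
			// immediately once the newest snapshot is more than blockInterval stale.
			if height-latestSnapshotHeight() < blockInterval {
				continue
			}
			if err := createSnapshot(ctx, height); err != nil {
				logger.Error("snapshot creation failed", "height", height, "error", err)
			}
		}
	}
}
```

The staleness check is the key change: instead of firing only on aligned intervals, the loop snapshots as soon as the node is out of catch-up and the newest snapshot is more than BlockInterval blocks behind.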
Force-pushed from 4ef5799 to e08ebb8
raymondjacobson added a commit that referenced this pull request on May 11, 2026
Problem
Two interrelated state sync reliability issues:
1. No fresh snapshots after node restart — When a node restarts and block-syncs to the current head, createSnapshot returns early for every interval that fires while CatchingUp=true. Both creatornode.audius.co and creatornode2.audius.co hit this: after their recent restarts they missed every snapshot interval between ~23.7M and ~24.2M blocks (~8h gap). Nodes needing to state sync had no usable recent snapshot.
2. OOM during pg_restore — pg_restore on large tables (core_transactions, 19M+ rows) generates ~4.4GB of WAL per checkpoint cycle, causing OOM even with synchronous_commit=off and wal_level=minimal. Migrations also pre-populate some tables before pg_restore runs, causing COPY to fail with duplicate key errors.
Changes
pkg/core/server/state_sync.go
- After catch-up completes (CatchingUp=false), immediately create a snapshot if the latest existing snapshot is more than BlockInterval blocks old. Prevents the 8-16h gap where no fresh snapshots exist after a restart.
- Split RestoreDatabase into three phases: pre-data (schema), data, post-data (indexes). During the data phase, all tables are set to UNLOGGED (suppresses WAL completely), then converted back to LOGGED before index build.
- Collect table names into a slice before truncating to avoid the conn busy error (can't call Exec while rows.Next() is iterating on the same connection).
pkg/core/server/abci.go
- Remove the direct restore path that skipped CometBFT state sync when all chunks were already on disk; it left state.db at height 0 while ABCI returned the snapshot height, causing a handshake mismatch on every subsequent start. CometBFT's native ApplySnapshotChunk already reuses on-disk chunks and correctly updates state.db.
Scope
Intentionally minimal — only the 4 commits that fix snapshot creation and restore reliability. Does not include: trust height fix, console display changes, CI changes, or rollback entrypoint changes (those are in separate PRs).
Deploy plan
🤖 Generated with Claude Code