
fix: state sync snapshot reliability (catch-up snapshot + UNLOGGED restore)#253

Merged
raymondjacobson merged 4 commits into main from fix/state-sync-snapshot-reliability on May 9, 2026
Conversation

@raymondjacobson
Contributor

Problem

Two interrelated state sync reliability issues:

  1. No fresh snapshots after node restart — When a node restarts and block-syncs to the current head, createSnapshot returns early for every interval that fires while CatchingUp=true. Both creatornode.audius.co and creatornode2.audius.co hit this: after their recent restarts they missed every snapshot interval between ~23.7M and ~24.2M blocks (~8h gap). Nodes needing to state sync had no usable recent snapshot (see the sketch after this list).

  2. OOM during pg_restore — pg_restore on large tables (core_transactions, 19M+ rows) generates ~4.4GB of WAL per checkpoint cycle, causing OOM even with synchronous_commit=off and wal_level=minimal. Migrations also pre-populate some tables before pg_restore runs, causing COPY to fail with duplicate key errors.
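
To make the first issue concrete, here is a minimal Go sketch of the pre-fix behavior (illustrative only; the real createSnapshot wiring, interval value, and status plumbing differ):

```go
package main

import "fmt"

// blockInterval stands in for the configured snapshot BlockInterval
// (illustrative value only).
const blockInterval = 10_000

// maybeSnapshot mirrors the pre-fix behavior: any interval that fires while
// the node is still catching up is skipped outright.
func maybeSnapshot(height int64, catchingUp bool, create func(int64)) {
	if catchingUp {
		return // pre-fix: every interval during block sync ends here
	}
	if height%blockInterval == 0 {
		create(height)
	}
}

func main() {
	// Simulate a restarted node block-syncing from ~23.7M to ~24.2M blocks:
	// every interval fires while CatchingUp=true, so nothing is created.
	created := 0
	for h := int64(23_700_000); h <= 24_200_000; h += blockInterval {
		maybeSnapshot(h, true, func(int64) { created++ })
	}
	fmt.Printf("snapshots created during catch-up: %d\n", created) // prints 0
}
```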

Changes

pkg/core/server/state_sync.go

  • After block sync completes (CatchingUp=false), immediately create a snapshot if the latest existing snapshot is more than BlockInterval blocks old. Prevents the 8-16h gap where no fresh snapshots exist after a restart.
  • Split RestoreDatabase into three phases: pre-data (schema), data, post-data (indexes). During the data phase, all tables are set to UNLOGGED (suppresses WAL completely), then converted back to LOGGED before index build (see the sketch below).
  • Truncate all public tables before the data phase so migration-created rows don't cause duplicate key errors.
  • Collect table names into a slice before running TRUNCATEs to avoid pgx conn busy error (can't call Exec while rows.Next() is iterating on same connection).
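
A rough sketch of the new restore flow follows (not the actual RestoreDatabase code): pre-data and post-data are delegated to pg_restore --section, and the data section runs while every table is UNLOGGED. The pgx usage, identifier quoting, and the dbURL/dumpPath parameters are assumptions for illustration.

```go
package restore

import (
	"context"
	"fmt"
	"os/exec"

	"github.com/jackc/pgx/v5"
)

// restoreInPhases sketches the three-phase restore. It assumes the tables
// were already truncated and that dbURL/dumpPath point at the target
// database and the snapshot dump; handling of schema objects that
// migrations may have already created is glossed over.
func restoreInPhases(ctx context.Context, conn *pgx.Conn, tables []string, dbURL, dumpPath string) error {
	runSection := func(section string) error {
		cmd := exec.CommandContext(ctx, "pg_restore", "--section="+section, "--dbname", dbURL, dumpPath)
		if out, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("pg_restore %s: %w: %s", section, err, out)
		}
		return nil
	}

	// Phase 1: schema only (tables, types, functions; no data, no indexes).
	if err := runSection("pre-data"); err != nil {
		return err
	}

	// Phase 2: bulk data load with WAL suppressed for the COPY.
	for _, t := range tables {
		if _, err := conn.Exec(ctx, fmt.Sprintf(`ALTER TABLE %q SET UNLOGGED`, t)); err != nil {
			return err
		}
	}
	if err := runSection("data"); err != nil {
		return err
	}
	for _, t := range tables {
		if _, err := conn.Exec(ctx, fmt.Sprintf(`ALTER TABLE %q SET LOGGED`, t)); err != nil {
			return err
		}
	}

	// Phase 3: rebuild indexes and constraints against the loaded data.
	return runSection("post-data")
}
```

The key point is that UNLOGGED tables skip WAL for the bulk COPY entirely, whereas synchronous_commit=off only skips the commit-time flush and does not reduce WAL volume.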

pkg/core/server/abci.go

  • Remove broken "direct restore path" that skipped CometBFT state sync when all chunks were already on disk. This left state.db at height 0 while ABCI returned the snapshot height, causing a handshake mismatch on every subsequent start. CometBFT's native ApplySnapshotChunk already reuses on-disk chunks and correctly updates state.db.

Scope

Intentionally minimal — only the 4 commits that fix snapshot creation and restore reliability. Does not include: trust height fix, console display changes, CI changes, or rollback entrypoint changes (those are in separate PRs).

Deploy plan

  1. Merge and deploy to creatornode.audius.co and creatornode2.audius.co — on restart, each will immediately create a fresh snapshot at the current block height instead of waiting up to 100k blocks (~37h).
  2. Once a fresh snapshot appears, wipe and restart creatornode3.audius.co to state sync against it.

🤖 Generated with Claude Code

raymondjacobson and others added 4 commits May 7, 2026 19:00
…tate sync

pg_restore on large tables (core_transactions, 19M+ rows) generates ~4.4GB
of WAL per checkpoint cycle, causing OOM even with synchronous_commit=off
and wal_level=minimal (whose WAL-skip optimization only applies when the table
is created or truncated in the same transaction as the COPY, which pg_restore
doesn't do).

Split the restore into three phases: pre-data (schema), data, post-data
(indexes). During the data phase, all tables are set to UNLOGGED, which
completely suppresses WAL generation for writes. Tables are converted back
to LOGGED after data load, before indexes are built.

Also tunes synchronous_commit, max_wal_size, and checkpoint_timeout before
the restore to reduce background I/O pressure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
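
For reference, the tuning step described above might look roughly like this sketch, with the per-statement success/failure logging the next commit mentions. The ALTER SYSTEM / pg_reload_conf approach, the specific values, and the function name are assumptions, not the exact statements in the commit.

```go
package restore

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

// tuneForRestore applies restore-friendly settings before the bulk load and
// logs each outcome so it is visible whether they actually took effect.
// Values here are illustrative, not the ones used in the commit.
func tuneForRestore(ctx context.Context, conn *pgx.Conn) {
	stmts := []string{
		`ALTER SYSTEM SET synchronous_commit = 'off'`,
		`ALTER SYSTEM SET max_wal_size = '8GB'`,
		`ALTER SYSTEM SET checkpoint_timeout = '30min'`,
		`SELECT pg_reload_conf()`,
	}
	for _, stmt := range stmts {
		if _, err := conn.Exec(ctx, stmt); err != nil {
			log.Printf("postgres tuning failed: %s: %v", stmt, err)
			continue
		}
		log.Printf("postgres tuning applied: %s", stmt)
	}
}
```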
Two issues during state sync data restore:
1. Migrations pre-populate some tables (core_db_migrations) before the
   pg_restore runs, causing COPY to fail with duplicate key errors. Fix by
   truncating all public tables with CASCADE before the data phase.
2. sla_rollups has a FK to sla_node_reports; if sla_rollups is processed
   first (non-deterministic without ORDER BY), the UNLOGGED alter fails.
   Fix by ordering the table list alphabetically.

Also adds error/success logging for the postgres tuning commands so we
can confirm they are actually applied.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During state sync, the TRUNCATE loop was calling db.Exec() while rows
were still open on the same connection, causing all truncates to fail
with "conn busy". Collect names into a slice first (same pattern as
alterPersistence) then close rows before running the TRUNCATEs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
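
A minimal sketch of that fix, assuming a single non-pooled pgx connection (the query and names are illustrative): collect the table names first, close the rows, then run the TRUNCATEs.

```go
package restore

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v5"
)

// truncatePublicTables collects the table names into a slice and closes the
// rows before issuing any other statement. Calling conn.Exec while rows are
// still open on the same connection is what produced the "conn busy" errors.
func truncatePublicTables(ctx context.Context, conn *pgx.Conn) error {
	rows, err := conn.Query(ctx,
		`SELECT tablename FROM pg_tables WHERE schemaname = 'public' ORDER BY tablename`)
	if err != nil {
		return err
	}
	var tables []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			rows.Close()
			return err
		}
		tables = append(tables, name)
	}
	rows.Close()
	if err := rows.Err(); err != nil {
		return err
	}

	// The connection is idle again, so Exec is safe to call.
	for _, t := range tables {
		if _, err := conn.Exec(ctx, fmt.Sprintf(`TRUNCATE TABLE %q CASCADE`, t)); err != nil {
			return err
		}
	}
	return nil
}
```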
…restore path

Two reliability fixes for state sync:

1. Snapshot creation after catch-up: When a node restarts and block-syncs
   to the current head, createSnapshot skips every interval that fires
   while CatchingUp=true. The snapshot creator now waits for sync to
   complete, then immediately creates a snapshot if the last one is more
   than BlockInterval blocks old. Prevents the current 8-16 hour gap
   where no fresh snapshots exist after a node restarts.

2. Remove direct restore path: The startup optimization that skipped
   CometBFT state sync when all chunks were already on disk left
   CometBFT's state.db at height 0 while ABCI returned the snapshot
   height, causing a handshake mismatch on every start. CometBFT's
   native state sync path (ApplySnapshotChunk) already reuses on-disk
   chunks and correctly updates state.db to the snapshot height.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
raymondjacobson force-pushed the fix/state-sync-snapshot-reliability branch from 4ef5799 to e08ebb8 on May 8, 2026 23:54
raymondjacobson merged commit 7e900cf into main on May 9, 2026
4 checks passed
raymondjacobson deleted the fix/state-sync-snapshot-reliability branch on May 9, 2026 00:08
raymondjacobson added a commit that referenced this pull request May 11, 2026
State sync reliability fixes are now merged (PR #253, #254, #256).
Restoring the edge tag job so nodes on auto-upgrade resume tracking
the latest main.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>