
fix: state sync snapshot reliability (catch-up snapshot + UNLOGGED restore)#253

Merged
raymondjacobson merged 4 commits into main from fix/state-sync-snapshot-reliability on May 9, 2026
Conversation

@raymondjacobson
Contributor

Problem

Two interrelated state sync reliability issues:

  1. No fresh snapshots after node restart — When a node restarts and block-syncs to the current head, createSnapshot returns early for every interval that fires while CatchingUp=true. Both creatornode.audius.co and creatornode2.audius.co hit this: after their recent restarts they missed every snapshot interval between ~23.7M and ~24.2M blocks (~8h gap). Nodes needing to state sync had no usable recent snapshot (see the sketch after this list).

  2. OOM during pg_restore — pg_restore on large tables (core_transactions, 19M+ rows) generates ~4.4GB of WAL per checkpoint cycle, causing OOM even with synchronous_commit=off and wal_level=minimal. Migrations also pre-populate some tables before pg_restore runs, causing COPY to fail with duplicate key errors.
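
To make the first issue concrete, here is a minimal Go sketch of the pre-fix behavior (illustrative only; the real createSnapshot wiring, interval value, and status plumbing differ):

```go
package main

import "fmt"

// blockInterval stands in for the configured snapshot BlockInterval
// (illustrative value only).
const blockInterval = 10_000

// maybeSnapshot mirrors the pre-fix behavior: any interval that fires while
// the node is still catching up is skipped outright.
func maybeSnapshot(height int64, catchingUp bool, create func(int64)) {
	if catchingUp {
		return // pre-fix: every interval during block sync ends here
	}
	if height%blockInterval == 0 {
		create(height)
	}
}

func main() {
	// Simulate a restarted node block-syncing from ~23.7M to ~24.2M blocks:
	// every interval fires while CatchingUp=true, so nothing is created.
	created := 0
	for h := int64(23_700_000); h <= 24_200_000; h += blockInterval {
		maybeSnapshot(h, true, func(int64) { created++ })
	}
	fmt.Printf("snapshots created during catch-up: %d\n", created) // prints 0
}
```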

Changes

pkg/core/server/state_sync.go

  • After block sync completes (CatchingUp=false), immediately create a snapshot if the latest existing snapshot is more than BlockInterval blocks old. Prevents the 8-16h gap where no fresh snapshots exist after a restart.
  • Split RestoreDatabase into three phases: pre-data (schema), data, post-data (indexes). During the data phase, all tables are set to UNLOGGED (suppresses WAL completely), then converted back to LOGGED before index build (see the sketch below).
  • Truncate all public tables before the data phase so migration-created rows don't cause duplicate key errors.
  • Collect table names into a slice before running TRUNCATEs to avoid pgx conn busy error (can't call Exec while rows.Next() is iterating on same connection).
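
A rough sketch of the new restore flow follows (not the actual RestoreDatabase code): pre-data and post-data are delegated to pg_restore --section, and the data section runs while every table is UNLOGGED. The pgx usage, identifier quoting, and the dbURL/dumpPath parameters are assumptions for illustration.

```go
package restore

import (
	"context"
	"fmt"
	"os/exec"

	"github.com/jackc/pgx/v5"
)

// restoreInPhases sketches the three-phase restore. It assumes the tables
// were already truncated and that dbURL/dumpPath point at the target
// database and the snapshot dump; handling of schema objects that
// migrations may have already created is glossed over.
func restoreInPhases(ctx context.Context, conn *pgx.Conn, tables []string, dbURL, dumpPath string) error {
	runSection := func(section string) error {
		cmd := exec.CommandContext(ctx, "pg_restore", "--section="+section, "--dbname", dbURL, dumpPath)
		if out, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("pg_restore %s: %w: %s", section, err, out)
		}
		return nil
	}

	// Phase 1: schema only (tables, types, functions; no data, no indexes).
	if err := runSection("pre-data"); err != nil {
		return err
	}

	// Phase 2: bulk data load with WAL suppressed for the COPY.
	for _, t := range tables {
		if _, err := conn.Exec(ctx, fmt.Sprintf(`ALTER TABLE %q SET UNLOGGED`, t)); err != nil {
			return err
		}
	}
	if err := runSection("data"); err != nil {
		return err
	}
	for _, t := range tables {
		if _, err := conn.Exec(ctx, fmt.Sprintf(`ALTER TABLE %q SET LOGGED`, t)); err != nil {
			return err
		}
	}

	// Phase 3: rebuild indexes and constraints against the loaded data.
	return runSection("post-data")
}
```

The key point is that UNLOGGED tables skip WAL for the bulk COPY entirely, whereas synchronous_commit=off only skips the commit-time flush and does not reduce WAL volume.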

pkg/core/server/abci.go

  • Remove broken "direct restore path" that skipped CometBFT state sync when all chunks were already on disk. This left state.db at height 0 while ABCI returned the snapshot height, causing a handshake mismatch on every subsequent start. CometBFT's native ApplySnapshotChunk already reuses on-disk chunks and correctly updates state.db.

Scope

Intentionally minimal — only the 4 commits that fix snapshot creation and restore reliability. Does not include: trust height fix, console display changes, CI changes, or rollback entrypoint changes (those are in separate PRs).

Deploy plan

  1. Merge and deploy to creatornode.audius.co and creatornode2.audius.co — on restart, each will immediately create a fresh snapshot at the current block height instead of waiting up to 100k blocks (~37h).
  2. Once a fresh snapshot appears, wipe and restart creatornode3.audius.co to state sync against it.

🤖 Generated with Claude Code

raymondjacobson and others added 4 commits May 7, 2026 19:00
…tate sync

pg_restore on large tables (core_transactions, 19M+ rows) generates ~4.4GB
of WAL per checkpoint cycle, causing OOM even with synchronous_commit=off
and wal_level=minimal (whose WAL-skip optimization only applies when the table
is created or truncated in the same transaction as the COPY, which pg_restore
doesn't do).

Split the restore into three phases: pre-data (schema), data, post-data
(indexes). During the data phase, all tables are set to UNLOGGED, which
completely suppresses WAL generation for writes. Tables are converted back
to LOGGED after data load, before indexes are built.

Also tunes synchronous_commit, max_wal_size, and checkpoint_timeout before
the restore to reduce background I/O pressure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
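
For reference, the tuning step described above might look roughly like this sketch, with the per-statement success/failure logging the next commit mentions. The ALTER SYSTEM / pg_reload_conf approach, the specific values, and the function name are assumptions, not the exact statements in the commit.

```go
package restore

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

// tuneForRestore applies restore-friendly settings before the bulk load and
// logs each outcome so it is visible whether they actually took effect.
// Values here are illustrative, not the ones used in the commit.
func tuneForRestore(ctx context.Context, conn *pgx.Conn) {
	stmts := []string{
		`ALTER SYSTEM SET synchronous_commit = 'off'`,
		`ALTER SYSTEM SET max_wal_size = '8GB'`,
		`ALTER SYSTEM SET checkpoint_timeout = '30min'`,
		`SELECT pg_reload_conf()`,
	}
	for _, stmt := range stmts {
		if _, err := conn.Exec(ctx, stmt); err != nil {
			log.Printf("postgres tuning failed: %s: %v", stmt, err)
			continue
		}
		log.Printf("postgres tuning applied: %s", stmt)
	}
}
```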
Two issues during state sync data restore:
1. Migrations pre-populate some tables (core_db_migrations) before the
   pg_restore runs, causing COPY to fail with duplicate key errors. Fix by
   truncating all public tables with CASCADE before the data phase.
2. sla_rollups has a FK to sla_node_reports; if sla_rollups is processed
   first (non-deterministic without ORDER BY), the UNLOGGED alter fails.
   Fix by ordering the table list alphabetically.

Also adds error/success logging for the postgres tuning commands so we
can confirm they are actually applied.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During state sync, the TRUNCATE loop was calling db.Exec() while rows
were still open on the same connection, causing all truncates to fail
with "conn busy". Collect names into a slice first (same pattern as
alterPersistence) then close rows before running the TRUNCATEs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
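
A minimal sketch of that fix, assuming a single non-pooled pgx connection (the query and names are illustrative): collect the table names first, close the rows, then run the TRUNCATEs.

```go
package restore

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v5"
)

// truncatePublicTables collects the table names into a slice and closes the
// rows before issuing any other statement. Calling conn.Exec while rows are
// still open on the same connection is what produced the "conn busy" errors.
func truncatePublicTables(ctx context.Context, conn *pgx.Conn) error {
	rows, err := conn.Query(ctx,
		`SELECT tablename FROM pg_tables WHERE schemaname = 'public' ORDER BY tablename`)
	if err != nil {
		return err
	}
	var tables []string
	for rows.Next() {
		var name string
		if err := rows.Scan(&name); err != nil {
			rows.Close()
			return err
		}
		tables = append(tables, name)
	}
	rows.Close()
	if err := rows.Err(); err != nil {
		return err
	}

	// The connection is idle again, so Exec is safe to call.
	for _, t := range tables {
		if _, err := conn.Exec(ctx, fmt.Sprintf(`TRUNCATE TABLE %q CASCADE`, t)); err != nil {
			return err
		}
	}
	return nil
}
```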
…restore path

Two reliability fixes for state sync:

1. Snapshot creation after catch-up: When a node restarts and block-syncs
   to the current head, createSnapshot skips every interval that fires
   while CatchingUp=true. The snapshot creator now waits for sync to
   complete, then immediately creates a snapshot if the last one is more
   than BlockInterval blocks old. Prevents the current 8-16 hour gap
   where no fresh snapshots exist after a node restarts.

2. Remove direct restore path: The startup optimization that skipped
   CometBFT state sync when all chunks were already on disk left
   CometBFT's state.db at height 0 while ABCI returned the snapshot
   height, causing a handshake mismatch on every start. CometBFT's
   native state sync path (ApplySnapshotChunk) already reuses on-disk
   chunks and correctly updates state.db to the snapshot height.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
raymondjacobson force-pushed the fix/state-sync-snapshot-reliability branch from 4ef5799 to e08ebb8 on May 8, 2026 23:54
raymondjacobson merged commit 7e900cf into main on May 9, 2026
4 checks passed
raymondjacobson deleted the fix/state-sync-snapshot-reliability branch on May 9, 2026 00:08
raymondjacobson added a commit that referenced this pull request May 11, 2026
State sync reliability fixes are now merged (PR #253, #254, #256).
Restoring the edge tag job so nodes on auto-upgrade resume tracking
the latest main.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>