
fix: use audius_core_root_dir in rollback entrypoint#226

Open
raymondjacobson wants to merge 10 commits into main from fix/rollback-entrypoint-core-root-dir

Conversation

@raymondjacobson
Contributor

Summary

  • The rollback entrypoint hardcoded /data/core when searching for the CometBFT data directory
  • Nodes can configure a different root via audius_core_root_dir (e.g. /data/bolt on creatornode3)
  • Now uses ${audius_core_root_dir:-/data/core} so the rollback subcommand works on all nodes

Test plan

  • Scale down a node, run kubectl apply with a debug pod using command: ["/bin/entrypoint.sh", "rollback", "-blocks", "1", "-dry-run"], and verify it finds the correct CometBFT data directory

🤖 Generated with Claude Code

raymondjacobson and others added 10 commits May 5, 2026 15:38
The rollback entrypoint hardcoded /data/core as the CometBFT root,
but nodes can configure a different path via audius_core_root_dir.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CometBFT does not emit NewBlock events during fast sync, so the block
event subscription never updates currentHeight. This caused syncHeight
in GetStatus to freeze at the startup value, making it impossible to
monitor block sync progress via the API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tate sync

pg_restore on large tables (core_transactions, 19M+ rows) generates ~4.4GB
of WAL per checkpoint cycle, causing OOM even with synchronous_commit=off
and wal_level=minimal (the latter only skips WAL when the table creation and
COPY run in the same transaction, which pg_restore does not do).

Split the restore into three phases: pre-data (schema), data, post-data
(indexes). During the data phase, all tables are set to UNLOGGED, which
completely suppresses WAL generation for writes. Tables are converted back
to LOGGED after data load, before indexes are built.

Also tunes synchronous_commit, max_wal_size, and checkpoint_timeout before
the restore to reduce background I/O pressure.
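The three phases can be sketched in Go; helper names and pg_restore flags below are illustrative, not the project's exact invocation:

```go
package main

import "fmt"

// restoreArgs builds a pg_restore argument list for one phase of the split
// restore: "pre-data" (schema), "data" (table contents), "post-data"
// (indexes and constraints). Flags are illustrative.
func restoreArgs(section, dbURL, dumpPath string) []string {
	return []string{"pg_restore", "--section=" + section, "--no-owner", "--dbname=" + dbURL, dumpPath}
}

// persistenceSQL returns the ALTER used to toggle WAL for a table:
// UNLOGGED during the data phase suppresses WAL generation for writes;
// LOGGED restores crash-safety before the post-data (index) phase runs.
func persistenceSQL(table string, logged bool) string {
	mode := "UNLOGGED"
	if logged {
		mode = "LOGGED"
	}
	return fmt.Sprintf("ALTER TABLE public.%s SET %s", table, mode)
}
```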

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a message field to StateSyncInfo proto (field 7) and set it during
each phase of RestoreDatabase: schema restore, preparing tables, data
load, finalizing tables, building indexes. Display it on the console
overview page under the phase label so operators can see which step the
restore is on rather than just "Restoring Database".
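A sketch of the proto change, keeping the message name and field number from the description (the existing fields are elided, not reproduced):

```proto
message StateSyncInfo {
  // ...existing fields 1-6 elided...

  // Human-readable description of the current restore step, e.g.
  // "building indexes", shown on the console overview page
  // under the phase label.
  string message = 7;
}
```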

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two issues during state sync data restore:
1. Migrations pre-populate some tables (core_db_migrations) before the
   pg_restore runs, causing COPY to fail with duplicate key errors. Fix by
   truncating all public tables with CASCADE before the data phase.
2. sla_rollups has a FK to sla_node_reports; if sla_rollups is processed
   first (non-deterministic without ORDER BY), the UNLOGGED alter fails.
   Fix by ordering the table list alphabetically.

Also adds error/success logging for the postgres tuning commands so we
can confirm they are actually applied.
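Both fixes reduce to small, order-sensitive helpers; a Go sketch with hypothetical function names (the real code runs the resulting SQL against the live connection):

```go
package main

import (
	"fmt"
	"sort"
)

// orderedTables returns a deterministic, alphabetical copy of the table
// list, so sla_node_reports is always altered before sla_rollups (its FK
// dependent) rather than in whatever order the catalog query returns.
func orderedTables(tables []string) []string {
	out := append([]string(nil), tables...)
	sort.Strings(out)
	return out
}

// truncateStmt empties a pre-populated table (e.g. core_db_migrations)
// before the data phase; CASCADE clears FK-dependent rows too, so COPY
// cannot hit duplicate key errors.
func truncateStmt(table string) string {
	return fmt.Sprintf("TRUNCATE TABLE public.%s CASCADE", table)
}
```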

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fork PRs run CI with the approval gate but never receive secrets.
Make login conditional on inputs.push and pass push=false for fork PRs
so build+test runs but nothing is pushed to Docker Hub.
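A hypothetical sketch of the conditional login step (the inputs.push condition is from the change; step name, action version, and secret names are illustrative):

```yaml
steps:
  - name: Log in to Docker Hub
    # Fork PRs never receive secrets, so skip login unless we intend to push.
    if: ${{ inputs.push == 'true' }}
    uses: docker/login-action@v3
    with:
      username: ${{ secrets.DOCKERHUB_USERNAME }}
      password: ${{ secrets.DOCKERHUB_TOKEN }}
```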

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
During state sync, the TRUNCATE loop was calling db.Exec() while rows
were still open on the same connection, causing all truncates to fail
with "conn busy". Collect names into a slice first (same pattern as
alterPersistence) then close rows before running the TRUNCATEs.
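The collect-then-close pattern, sketched against a minimal cursor interface (the real code uses the pgx rows type; the interface and fake below are illustrative):

```go
package main

// rowScanner is the minimal surface of a database cursor the fix needs.
type rowScanner interface {
	Next() bool
	Scan(dest ...any) error
	Close() error
}

// collectTableNames drains the cursor into a slice and closes it BEFORE the
// caller issues any Exec on the same connection; executing while rows are
// still open is what produced the "conn busy" failures.
func collectTableNames(rows rowScanner) ([]string, error) {
	var names []string
	for rows.Next() {
		var n string
		if err := rows.Scan(&n); err != nil {
			rows.Close()
			return nil, err
		}
		names = append(names, n)
	}
	return names, rows.Close()
}

// fakeRows is a test double standing in for a live cursor.
type fakeRows struct {
	names  []string
	i      int
	closed bool
}

func (f *fakeRows) Next() bool {
	if f.i < len(f.names) {
		f.i++
		return true
	}
	return false
}

func (f *fakeRows) Scan(dest ...any) error {
	*(dest[0].(*string)) = f.names[f.i-1]
	return nil
}

func (f *fakeRows) Close() error {
	f.closed = true
	return nil
}
```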

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…napshots

stateSyncLatestBlock was using the oldest snapshot on the RPC server to
compute the trust height. When that snapshot is older than the 168h trust
period, the light client fails with "old header has expired".

Fix: derive trust height as (latestHeight - 288000) — ~2 days before
current. This is always within the trust period, doesn't require the
RPC servers to have any snapshots, and stays below recent snapshots
(interval ~100k blocks) so CometBFT can still pick one to sync from.
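The derivation is simple enough to show directly; a sketch with a hypothetical helper name (288000 blocks is the ~2-day offset described above):

```go
package main

// trustOffset is ~2 days of blocks, comfortably inside the 168h trust
// period regardless of which snapshot CometBFT ends up picking.
const trustOffset = 288000

// trustHeight derives the light-client trust height from the RPC server's
// latest height, instead of from its oldest snapshot (which may have
// aged out of the trust period).
func trustHeight(latest int64) int64 {
	h := latest - trustOffset
	if h < 1 {
		h = 1 // young chains: fall back to the first block
	}
	return h
}
```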

Also removes the now-unused sdk and connect imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When OfferSnapshot rejected a snapshot due to hash mismatch, it left
acceptedSnapshotHeight set, causing all subsequent snapshot offers to
be rejected with "already syncing to different snapshot". Reset the
accepted snapshot state so CometBFT can fall through to the next
available snapshot height.
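A sketch of the corrected acceptance logic (struct and method names are hypothetical; the real handler is the ABCI OfferSnapshot implementation):

```go
package main

import "bytes"

// snapState tracks which snapshot this node has agreed to restore.
type snapState struct {
	acceptedHeight uint64
	acceptedHash   []byte
}

// offer accepts a snapshot whose hash matches what we expect. On a hash
// mismatch it now RESETS the accepted state, so the next offer at a
// different height is evaluated instead of being rejected with
// "already syncing to different snapshot".
func (s *snapState) offer(height uint64, hash, expected []byte) bool {
	if s.acceptedHeight != 0 && s.acceptedHeight != height {
		return false // mid-restore of another snapshot
	}
	if !bytes.Equal(hash, expected) {
		s.acceptedHeight = 0 // the fix: clear accepted state on rejection
		s.acceptedHash = nil
		return false
	}
	s.acceptedHeight = height
	s.acceptedHash = hash
	return true
}
```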

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…restore path

Two reliability fixes for state sync:

1. Snapshot creation after catch-up: When a node restarts and block-syncs
   to the current head, createSnapshot skips every interval that fires
   while CatchingUp=true. The snapshot creator now waits for sync to
   complete, then immediately creates a snapshot if the last one is more
   than BlockInterval blocks old. Prevents the current 8-16 hour gap
   where no fresh snapshots exist after a node restarts.

2. Remove direct restore path: The startup optimization that skipped
   CometBFT state sync when all chunks were already on disk left
   CometBFT's state.db at height 0 while ABCI returned the snapshot
   height, causing a handshake mismatch on every start. CometBFT's
   native state sync path (ApplySnapshotChunk) reuses chunks already
   on disk and correctly updates state.db to the snapshot height.
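The catch-up half of the change reduces to a small predicate; a sketch with hypothetical names:

```go
package main

// shouldSnapshotNow reports whether the snapshot creator should fire right
// after catch-up: never while block-syncing, and afterwards only when the
// newest snapshot is more than one interval stale. This closes the 8-16
// hour gap after a restart, where every interval tick used to be skipped
// because CatchingUp was still true.
func shouldSnapshotNow(catchingUp bool, currentHeight, lastSnapshotHeight, blockInterval int64) bool {
	if catchingUp {
		return false
	}
	return currentHeight-lastSnapshotHeight > blockInterval
}
```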

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
raymondjacobson added a commit that referenced this pull request May 7, 2026
Each merge to main pushes a new :edge tag, triggering the auto-upgrader
on snapshot-serving nodes. When those nodes restart and catch up, the
snapshot creator skips every 100k-block interval that fires during
CatchingUp=true, leaving no fresh snapshots for up to ~16 hours.

Re-enable once PR #226 (catch-up snapshot fix) is merged and deployed.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>