fix: blocks reexecutor panic recovery and shutdown error suppression#4531

Merged
eljobe merged 7 commits into master from fix/blocks-reexecutor-shutdown-panic
Mar 25, 2026
@joshuacolvin0
Member

  • Add handleContextOrFatal to suppress context errors during shutdown
  • Wrap AdvanceStateByBlock with recover() to convert panics from
    concurrent trie access races into errors, preventing database
    corruption from abnormal process termination
  • Add unit tests for handleContextOrFatal behavior

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

@codecov

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 0% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.41%. Comparing base (53908cc) to head (ba8b67e).
⚠️ Report is 140 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4531      +/-   ##
==========================================
+ Coverage   32.68%   34.41%   +1.73%     
==========================================
  Files         497      497              
  Lines       58890    58897       +7     
==========================================
+ Hits        19247    20272    +1025     
+ Misses      36272    35009    -1263     
- Partials     3371     3616     +245     

@KolbyML
Member

KolbyML commented Mar 19, 2026

Is this solving a known issue that we, or someone we know, have faced? If it isn't, would you be fine if I got #4528 in first and then reviewed this? I am asking because it would take me some time to review this properly, and unless I do, I am not sure whether this change makes sense. If this PR was fully driven by Claude (not sure if it was), I think I would need to review these changes a lot more thoroughly.

@joshuacolvin0 joshuacolvin0 assigned gligneul and unassigned KolbyML Mar 19, 2026
@github-actions
Contributor

github-actions bot commented Mar 19, 2026

❌ 8 Tests Failed:

Tests completed | Failed | Passed | Skipped
4554 | 8 | 4546 | 0
View the top 3 failed tests by shortest run time
TestPruningDBSizeReduction
Stack Traces | 0.000s run time
=== RUN   TestPruningDBSizeReduction
--- FAIL: TestPruningDBSizeReduction (0.00s)
TestRedisProduceComplex/one_producer,_all_consumers_are_active
Stack Traces | 1.230s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
DEBUG[03-23|01:20:31.556] consumer: xack                           cid=d1c8406f-8230-4bff-80cf-ea7677165153 messageId=1774228830472-5
WARN [03-23|01:20:31.559] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830448-0
WARN [03-23|01:20:31.559] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830448-5
DEBUG[03-23|01:20:31.559] consumer: xdel                           cid=d1c8406f-8230-4bff-80cf-ea7677165153 messageId=1774228830472-5
WARN [03-23|01:20:31.559] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830448-6
WARN [03-23|01:20:31.561] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830448-7
WARN [03-23|01:20:31.561] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830448-4
WARN [03-23|01:20:31.561] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830448-8
WARN [03-23|01:20:31.562] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830432-0
WARN [03-23|01:20:31.563] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830449-1
WARN [03-23|01:20:31.563] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830449-2
WARN [03-23|01:20:31.564] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830423-5
WARN [03-23|01:20:31.564] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830449-3
WARN [03-23|01:20:31.565] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830448-9
WARN [03-23|01:20:31.567] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830449-5
WARN [03-23|01:20:31.568] XClaimJustID returned empty response when indicating heartbeat msgID=1774228830424-0
DEBUG[03-23|01:20:31.622] checkResponses                           responded=81 errored=0 checked=100
DEBUG[03-23|01:20:31.628] redis producer: check responses starting
DEBUG[03-23|01:20:31.644] checkResponses                           responded=19 errored=0 checked=19
--- FAIL: TestRedisProduceComplex/one_producer,_all_consumers_are_active (1.23s)
TestBroadcastClientConfirmedMessage
Stack Traces | 5.010s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
=== PAUSE TestBroadcastClientConfirmedMessage
=== CONT  TestBroadcastClientConfirmedMessage
    broadcastclient_test.go:348: broadcasting seq 0 message
INFO [03-23|01:19:50.782] arbitrum websocket broadcast server is listening address=[::]:43663
INFO [03-23|01:19:50.784] connecting to arbitrum inbox message broadcaster url=ws://127.0.0.1:43941/
INFO [03-23|01:19:50.783] arbitrum websocket broadcast server is listening address=[::]:33773
INFO [03-23|01:19:50.783] arbitrum websocket broadcast server is listening address=[::]:46465
INFO [03-23|01:19:50.787] connecting to arbitrum inbox message broadcaster url=ws://127.0.0.1:46465/
INFO [03-23|01:19:50.786] connecting to arbitrum inbox message broadcaster url=ws://127.0.0.1:43663/
INFO [03-23|01:19:50.787] connecting to arbitrum inbox message broadcaster url=ws://127.0.0.1:33773/
INFO [03-23|01:19:50.787] connecting to arbitrum inbox message broadcaster url=ws://127.0.0.1:43663/
INFO [03-23|01:19:50.791] Feed connected                           feedServerVersion=2 chainId=8744 requestedSeqNum=0
INFO [03-23|01:19:50.789] Feed connected                           feedServerVersion=2 chainId=8742 requestedSeqNum=0
INFO [03-23|01:19:50.790] Feed connected                           feedServerVersion=2 chainId=8742 requestedSeqNum=0
    broadcastclient_test.go:359: Received Message, Sequence Message: {0 {0xc0002063f0 0} <nil> [188 242 174 238 103 196 172 226 179 167 5 36 112 107 22 24 2 42 152 209 125 104 116 202 4 130 181 146 41 89 140 29 83 47 152 138 241 123 29 60 72 34 167 174 75 124 232 228 62 225 66 171 11 213 229 112 25 134 90 85 49 238 108 94 0] [] 0}
WARN [03-23|01:19:50.794] confirmed sequence number is past the end of stored messages "confirmed sequence number"=42 "last stored sequence number"=0
INFO [03-23|01:19:50.798] Feed connected                           feedServerVersion=2 chainId=9742 requestedSeqNum=0
INFO [03-23|01:19:50.800] Feed connected                           feedServerVersion=2 chainId=8744 requestedSeqNum=0
    broadcastclient_test.go:380: Client did not receive confirm message
--- FAIL: TestBroadcastClientConfirmedMessage (5.01s)

Comment on lines +387 to +394
        if r := recover(); r != nil {
            log.Error("panic during block re-execution", "block", blockToRecreate, "recover", r, "stack", string(debug.Stack()))
            state = nil
            err = fmt.Errorf("panic during block re-execution at block %d: %v", blockToRecreate, r)
        }
    }()
    state, block, receipts, err = arbitrum.AdvanceStateByBlock(ctx, s.blockchain, state, blockToRecreate, prevHash, nil, vmConfig)
}()
Contributor

This is super cool. Can we write a test for this? Maybe make something inside AdvanceStateByBlock nil so that calling it panics, then test that advanceStateUpToBlock recovers and returns the expected error.

Member Author

done

joshuacolvin0 and others added 2 commits March 22, 2026 15:49
# Conflicts:
#	blocks_reexecutor/blocks_reexecutor.go
…races

- Wrap AdvanceStateByBlock in recover() to convert panics from
  concurrent trie-cache eviction races into errors, preventing
  abnormal process termination
- Add unit tests for reportFatalErr (basic, channel-full, multiple
  error types) and panic recovery in advanceStateUpToBlock

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
joshuacolvin0 and others added 3 commits March 22, 2026 16:54
…rReExecution tests

- Restructure error check in LaunchBlocksReExecution goroutine so the
  success log only emits when err is nil, not when ctx is cancelled
- Break Start's block-range loop on ctx cancellation, not just fatalReported
- Use exhaustive struct initialization in tests to satisfy custom linter
- Add unit tests for all three WaitForReExecution select branches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Config.Validate, Impl, and wrapFatalErr unit tests for blocks
reexecutor. Cover all Validate branches (mode, blocks JSON, block
ranges, room), Impl early-exit paths (fatalReported pre-set, no work),
WaitForReExecution with both channels ready, and wrapFatalErr error
wrapping.

Also log suppressed errors in goroutineErrorf so they are visible in
verbose test output when a background goroutine's error is suppressed
because the context is already cancelled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CI linter requires all struct fields to be specified in composite
literals. Add the 5 missing zero-value fields (CommitStateToDisk,
MinBlocksPerThread, TrieCleanLimit, ValidateMultiGas, blocks) to all
Config literals in the test file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@ganeshvanahalli ganeshvanahalli left a comment

LGTM

@eljobe eljobe enabled auto-merge March 25, 2026 12:52
@eljobe eljobe added this pull request to the merge queue Mar 25, 2026
Merged via the queue into master with commit 71a01b3 Mar 25, 2026
27 checks passed
@eljobe eljobe deleted the fix/blocks-reexecutor-shutdown-panic branch March 25, 2026 13:46
5 participants