feat: non-blocking epoch change and fix deprecated try_next()#619
feat: non-blocking epoch change and fix deprecated try_next()#619
Conversation
51dec02 to
f9875a8
Compare
…ation - Implement non-blocking epoch change and fix deprecated try_next() - Populate EpochBlockInfo in sign_commit_vote for epoch change blocks - Populate EpochBlockInfo for suffix blocks via BBM - Scope EpochBlockInfo lifecycle to prevent stale info leakage - Support recovery by new ledger_info Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mpilation errors - Pass round and timestamp directly into EpochBlockInfo at block_buffer_manager level (handle_epoch_change_suffix_blocks), eliminating the need for resolve_epoch_block_info to patch these fields from end_epoch_timestamp - Add round field to BlockState::Ordered and set_ordered_blocks API - Remove epoch_start_timestamp_usecs > 0 guard in block_data.rs genesis construction, consistent with block.rs - Fix try_recv -> try_next for futures_channel::mpsc::UnboundedReceiver across test and production code - Fix clippy errors in block_buffer_manager (map_or -> is_some_and, Option field access) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
f9875a8 to
5a86de1
Compare
keanji-x
left a comment
There was a problem hiding this comment.
Overall the non-blocking epoch change approach is solid. Left a few inline comments on specific concerns.
keanji-x
left a comment
There was a problem hiding this comment.
Overall the non-blocking epoch change approach is solid — returning dummy results for suffix blocks instead of dropping them is the right call. Left a few comments on specific concerns.
| } | ||
|
|
||
| if !suffix_keys.is_empty() { | ||
| info!( |
There was a problem hiding this comment.
Suffix blocks that are already Ordered when the epoch change is detected get StateComputeResult::new_dummy() here, but suffix blocks that arrive later via get_executed_res get epoch_change_block_info.compute_result.clone() — the actual epoch change block's result, which has has_reconfiguration() == true.
Same logical category of blocks, different compute results. Downstream code checking has_reconfiguration() could behave differently depending on timing. Should both paths use new_dummy() consistently?
There was a problem hiding this comment.
Fixed — both paths now use StateComputeResult::new_dummy() consistently. The old-epoch bypass branch in get_executed_res got the same treatment, and the now-unused compute_result field on EpochChangeCache was removed so future consumers can't accidentally depend on it.
| // a previous consume). Use latest_commit_block_number as fallback to | ||
| // prevent the caller from resetting start_ordered_block to 1. | ||
| let effective_block_number = if epoch_change_block_number == 0 { | ||
| block_state_machine.latest_commit_block_number |
There was a problem hiding this comment.
After consume_epoch_change resets latest_epoch_change_block_number to 0:
is_suffix_block()still returnstrue(checksepoch_change_block_infodirectly)get_epoch_change_block_info()returnsNone(requiresepoch_change_bn > 0)
This asymmetry means the reth side and consensus side disagree on what's a suffix block during the window between consume_epoch_change and release_inflight_blocks. If consensus calls resolve_epoch_block_info in this window, it won't find the EpochBlockInfo it needs.
Consider making both methods use the same predicate, or document why the divergence is safe.
There was a problem hiding this comment.
Good catch. Fixed — get_epoch_change_block_info now uses epoch_change_block_info as its source of truth (same predicate as is_suffix_block), so the reth and consensus sides agree across the consume_epoch_change → release_inflight_blocks window. Also rewrote the stale comment on consume_epoch_change — the old note claimed the reset was for get_committed_blocks, but that function never checked this field; the actual reason is release_inflight_blocks retain semantics.
| .then(|| self.get_quorum_cert_for_block(blocks_to_commit[i + 1].id())) | ||
| .flatten(); | ||
| match next_qc { | ||
| Some(qc) if qc.commit_info().id() == block.id() => { |
There was a problem hiding this comment.
The block-by-block commit logic with BATCH_COMMIT_SIZE is removed here but not mentioned in the PR description. Is this intentional? Any production deployments relying on that env var?
There was a problem hiding this comment.
Intentional, sorry for the missing context. BATCH_COMMIT_SIZE is only consumed by gravity_e2e/cluster_test_cases/single_node/test_batch_exec.py — no production deployment uses it. I've re-added the accumulation logic in a simpler form: finalize_order can already commit an N-block batch with just the last block's proof, so we just early-return from send_for_execution while the accumulated path is under the threshold and let path_from_ordered_root grow on the next call. Default behavior (env var unset) is unchanged.
| ); | ||
|
|
||
| // if the block exists between commit root and ordered root | ||
| // Step 1: If the block already exists between commit root and ordered root, |
There was a problem hiding this comment.
The has_missing_randomness guard was removed from this condition. Previously missing randomness would skip the "send commit proof" path and go straight to full sync. Now it always sends the commit proof first, then potentially also syncs. Intentional change or side effect of the refactor?
There was a problem hiding this comment.
Intentional — we need to guarantee the commit vote is sent regardless of randomness state. Will update the PR description to call this out.
| } | ||
| } else { | ||
| debug!( | ||
| round = round, |
There was a problem hiding this comment.
When the cache hits MAX_PENDING_COMMIT_PROOFS, new proofs are rejected. Since older proofs are more likely stale, consider evicting the oldest entry (pop_first() from the BTreeMap) instead of refusing new ones.
There was a problem hiding this comment.
Agreed, done. The cache now evicts the oldest pending proof via pop_first() when full — but only if the incoming proof is strictly newer than the oldest, otherwise we'd drop a newer proof to accept an older one.
- suffix block compute_result consistency: both the pre-computed (handle_epoch_change_suffix_blocks) and the on-demand (get_executed_res) paths now write StateComputeResult::new_dummy(), so downstream consumers can never see has_reconfiguration()==true leaked onto a suffix block. EpochChangeCache.compute_result is removed, and the old-epoch bypass branch gets the same treatment. - is_suffix_block / get_epoch_change_block_info asymmetry: get_epoch_change_block_info now reads epoch_change_block_info as its source of truth, matching is_suffix_block. The reth and consensus sides now agree across the consume_epoch_change → release_inflight_blocks window. Stale comment on consume_epoch_change rewritten to describe the real reason for the reset (retain semantics in release_inflight_blocks). - BATCH_COMMIT_SIZE re-added in simplified form for test_batch_exec.py: early-return from send_for_execution while path_from_ordered_root is under the threshold; finalize_order already commits N-block batches with just the last block's proof, so the previous one-by-one QC lookup is unnecessary. No-op when the env var is unset. - pending_commit_proofs eviction: when the cache is full, evict the oldest entry via pop_first() — but only if the incoming proof is strictly newer than the oldest, otherwise we'd drop a newer proof to accept an older one. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
7fb4cbe to
e91d0f3
Compare
keanji-x
left a comment
There was a problem hiding this comment.
Re-reviewed after fix commit e91d0f3d. 4 out of 5 previous issues are addressed — nice work.
Resolved:
- Inconsistent dummy results —
compute_resultremoved fromEpochChangeCache, all suffix block paths now useStateComputeResult::new_dummy()consistently. LGTM. is_suffix_blockvsget_epoch_change_block_infoasymmetry —get_epoch_change_block_infonow usesepoch_change_block_infodirectly instead oflatest_epoch_change_block_number, matchingis_suffix_block. LGTM.BATCH_COMMIT_SIZEremoval — Re-added with clear comment referencingtest_batch_exec.py. LGTM.pending_commit_proofseviction — Now evicts oldest entry viapop_first()with a guard against inserting older-than-oldest proofs. LGTM.
Still open:
5. has_missing_randomness guard removal in sync_to_highest_commit_cert — not addressed in the fix commit. Left an inline comment.
| ); | ||
|
|
||
| // if the block exists between commit root and ordered root | ||
| // Step 1: If the block already exists between commit root and ordered root, |
There was a problem hiding this comment.
This is the one remaining item from the previous review. The old code skipped the "send commit proof" branch entirely when has_missing_randomness was true, forcing a full sync. Now Step 1 always fires regardless of has_missing_randomness, and execution falls through to Step 2.
Is this intentional? If so, could you add a brief comment explaining why it's safe to send the commit proof even when randomness data is missing?
Summary
Implement non-blocking epoch change in the block buffer manager and fix deprecated
futures_channelAPI usage.Non-blocking Epoch Change
Previously, when an epoch change block was detected, subsequent blocks in the same epoch ("suffix blocks") were dropped via
release_inflight_blocks(), causing consensus to block waiting for execution results that would never arrive.Changes:
block_buffer_manager.rs—set_compute_res(): When an epoch change is detected, immediately find all subsequentOrderedblocks in the same epoch and setStateComputeResult::new_dummy()as their compute result. This allows consensus to consume these results instantly without blocking.block_buffer_manager.rs—get_committed_blocks(): Skip suffix blocks past thelatest_epoch_change_block_numberboundary. These blocks have dummy execution results and were never executed by reth, so they must not enter the reth commit path.Fix deprecated
try_next()→try_recv()Replace all deprecated
futures_channel::mpsc::UnboundedReceiver::try_next()calls withtry_recv()across the consensus crate.Dependency Updates
gaptosto branchfeat/add-epoch-block-info(types: add EpochBlockInfo to BlockInfo gravity-aptos#57)grethto branchfeat/add-epoch-block-info(chore: update gravity-aptos dep for EpochBlockInfo gravity-reth#305)Related PRs