Skip to content

feat: non-blocking epoch change and fix deprecated try_next()#619

Open
keanji-x wants to merge 5 commits intomainfrom
feat/non-blocking-epoch-change
Open

feat: non-blocking epoch change and fix deprecated try_next()#619
keanji-x wants to merge 5 commits intomainfrom
feat/non-blocking-epoch-change

Conversation

@keanji-x
Copy link
Copy Markdown
Contributor

Summary

Implement non-blocking epoch change in the block buffer manager and fix deprecated futures_channel API usage.

Non-blocking Epoch Change

Previously, when an epoch change block was detected, subsequent blocks in the same epoch ("suffix blocks") were dropped via release_inflight_blocks(), causing consensus to block waiting for execution results that would never arrive.

Changes:

  1. block_buffer_manager.rsset_compute_res(): When an epoch change is detected, immediately find all subsequent Ordered blocks in the same epoch and set StateComputeResult::new_dummy() as their compute result. This allows consensus to consume these results instantly without blocking.

  2. block_buffer_manager.rsget_committed_blocks(): Skip suffix blocks past the latest_epoch_change_block_number boundary. These blocks have dummy execution results and were never executed by reth, so they must not enter the reth commit path.

Fix deprecated try_next()try_recv()

Replace all deprecated futures_channel::mpsc::UnboundedReceiver::try_next() calls with try_recv() across the consensus crate.

Dependency Updates

Related PRs

ByteYue
ByteYue previously approved these changes Mar 20, 2026
Lchangliang
Lchangliang previously approved these changes Mar 20, 2026
Lchangliang and others added 2 commits April 10, 2026 10:18
…ation

- Implement non-blocking epoch change and fix deprecated try_next()
- Populate EpochBlockInfo in sign_commit_vote for epoch change blocks
- Populate EpochBlockInfo for suffix blocks via BBM
- Scope EpochBlockInfo lifecycle to prevent stale info leakage
- Support recovery by new ledger_info

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…mpilation errors

- Pass round and timestamp directly into EpochBlockInfo at block_buffer_manager
  level (handle_epoch_change_suffix_blocks), eliminating the need for
  resolve_epoch_block_info to patch these fields from end_epoch_timestamp
- Add round field to BlockState::Ordered and set_ordered_blocks API
- Remove epoch_start_timestamp_usecs > 0 guard in block_data.rs genesis
  construction, consistent with block.rs
- Fix try_recv -> try_next for futures_channel::mpsc::UnboundedReceiver
  across test and production code
- Fix clippy errors in block_buffer_manager (map_or -> is_some_and,
  Option field access)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Lchangliang Lchangliang force-pushed the feat/non-blocking-epoch-change branch from f9875a8 to 5a86de1 Compare April 10, 2026 03:53
Copy link
Copy Markdown
Contributor Author

@keanji-x keanji-x left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the non-blocking epoch change approach is solid. Left a few inline comments on specific concerns.

Copy link
Copy Markdown
Contributor Author

@keanji-x keanji-x left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the non-blocking epoch change approach is solid — returning dummy results for suffix blocks instead of dropping them is the right call. Left a few comments on specific concerns.

}

if !suffix_keys.is_empty() {
info!(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suffix blocks that are already Ordered when the epoch change is detected get StateComputeResult::new_dummy() here, but suffix blocks that arrive later via get_executed_res get epoch_change_block_info.compute_result.clone() — the actual epoch change block's result, which has has_reconfiguration() == true.

Same logical category of blocks, different compute results. Downstream code checking has_reconfiguration() could behave differently depending on timing. Should both paths use new_dummy() consistently?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — both paths now use StateComputeResult::new_dummy() consistently. The old-epoch bypass branch in get_executed_res got the same treatment, and the now-unused compute_result field on EpochChangeCache was removed so future consumers can't accidentally depend on it.

// a previous consume). Use latest_commit_block_number as fallback to
// prevent the caller from resetting start_ordered_block to 1.
let effective_block_number = if epoch_change_block_number == 0 {
block_state_machine.latest_commit_block_number
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After consume_epoch_change resets latest_epoch_change_block_number to 0:

  • is_suffix_block() still returns true (checks epoch_change_block_info directly)
  • get_epoch_change_block_info() returns None (requires epoch_change_bn > 0)

This asymmetry means the reth side and consensus side disagree on what's a suffix block during the window between consume_epoch_change and release_inflight_blocks. If consensus calls resolve_epoch_block_info in this window, it won't find the EpochBlockInfo it needs.

Consider making both methods use the same predicate, or document why the divergence is safe.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed — get_epoch_change_block_info now uses epoch_change_block_info as its source of truth (same predicate as is_suffix_block), so the reth and consensus sides agree across the consume_epoch_changerelease_inflight_blocks window. Also rewrote the stale comment on consume_epoch_change — the old note claimed the reset was for get_committed_blocks, but that function never checked this field; the actual reason is release_inflight_blocks retain semantics.

.then(|| self.get_quorum_cert_for_block(blocks_to_commit[i + 1].id()))
.flatten();
match next_qc {
Some(qc) if qc.commit_info().id() == block.id() => {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The block-by-block commit logic with BATCH_COMMIT_SIZE is removed here but not mentioned in the PR description. Is this intentional? Any production deployments relying on that env var?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional, sorry for the missing context. BATCH_COMMIT_SIZE is only consumed by gravity_e2e/cluster_test_cases/single_node/test_batch_exec.py — no production deployment uses it. I've re-added the accumulation logic in a simpler form: finalize_order can already commit an N-block batch with just the last block's proof, so we just early-return from send_for_execution while the accumulated path is under the threshold and let path_from_ordered_root grow on the next call. Default behavior (env var unset) is unchanged.

);

// if the block exists between commit root and ordered root
// Step 1: If the block already exists between commit root and ordered root,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The has_missing_randomness guard was removed from this condition. Previously missing randomness would skip the "send commit proof" path and go straight to full sync. Now it always sends the commit proof first, then potentially also syncs. Intentional change or side effect of the refactor?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional — we need to guarantee the commit vote is sent regardless of randomness state. Will update the PR description to call this out.

}
} else {
debug!(
round = round,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the cache hits MAX_PENDING_COMMIT_PROOFS, new proofs are rejected. Since older proofs are more likely stale, consider evicting the oldest entry (pop_first() from the BTreeMap) instead of refusing new ones.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, done. The cache now evicts the oldest pending proof via pop_first() when full — but only if the incoming proof is strictly newer than the oldest, otherwise we'd drop a newer proof to accept an older one.

- suffix block compute_result consistency: both the pre-computed
  (handle_epoch_change_suffix_blocks) and the on-demand
  (get_executed_res) paths now write StateComputeResult::new_dummy(),
  so downstream consumers can never see has_reconfiguration()==true
  leaked onto a suffix block. EpochChangeCache.compute_result is
  removed, and the old-epoch bypass branch gets the same treatment.
- is_suffix_block / get_epoch_change_block_info asymmetry:
  get_epoch_change_block_info now reads epoch_change_block_info as its
  source of truth, matching is_suffix_block. The reth and consensus
  sides now agree across the consume_epoch_change →
  release_inflight_blocks window. Stale comment on consume_epoch_change
  rewritten to describe the real reason for the reset (retain semantics
  in release_inflight_blocks).
- BATCH_COMMIT_SIZE re-added in simplified form for
  test_batch_exec.py: early-return from send_for_execution while
  path_from_ordered_root is under the threshold; finalize_order already
  commits N-block batches with just the last block's proof, so the
  previous one-by-one QC lookup is unnecessary. No-op when the env var
  is unset.
- pending_commit_proofs eviction: when the cache is full, evict the
  oldest entry via pop_first() — but only if the incoming proof is
  strictly newer than the oldest, otherwise we'd drop a newer proof to
  accept an older one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Lchangliang Lchangliang force-pushed the feat/non-blocking-epoch-change branch from 7fb4cbe to e91d0f3 Compare April 10, 2026 07:01
Lchangliang
Lchangliang previously approved these changes Apr 10, 2026
Copy link
Copy Markdown
Contributor Author

@keanji-x keanji-x left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed after fix commit e91d0f3d. 4 out of 5 previous issues are addressed — nice work.

Resolved:

  1. Inconsistent dummy resultscompute_result removed from EpochChangeCache, all suffix block paths now use StateComputeResult::new_dummy() consistently. LGTM.
  2. is_suffix_block vs get_epoch_change_block_info asymmetryget_epoch_change_block_info now uses epoch_change_block_info directly instead of latest_epoch_change_block_number, matching is_suffix_block. LGTM.
  3. BATCH_COMMIT_SIZE removal — Re-added with clear comment referencing test_batch_exec.py. LGTM.
  4. pending_commit_proofs eviction — Now evicts oldest entry via pop_first() with a guard against inserting older-than-oldest proofs. LGTM.

Still open:
5. has_missing_randomness guard removal in sync_to_highest_commit_cert — not addressed in the fix commit. Left an inline comment.

);

// if the block exists between commit root and ordered root
// Step 1: If the block already exists between commit root and ordered root,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the one remaining item from the previous review. The old code skipped the "send commit proof" branch entirely when has_missing_randomness was true, forcing a full sync. Now Step 1 always fires regardless of has_missing_randomness, and execution falls through to Step 2.

Is this intentional? If so, could you add a brief comment explaining why it's safe to send the commit proof even when randomness data is missing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants