feat: non-blocking epoch change and fix deprecated try_next() by keanji-x · Pull Request #619 · Galxe/gravity-sdk

keanji-x · 2026-03-20T08:06:52Z

Summary

Implement non-blocking epoch change in the block buffer manager and fix deprecated futures_channel API usage.

Non-blocking Epoch Change

Previously, when an epoch change block was detected, subsequent blocks in the same epoch ("suffix blocks") were dropped via release_inflight_blocks(), causing consensus to block waiting for execution results that would never arrive.

Changes:

block_buffer_manager.rs — set_compute_res(): When an epoch change is detected, immediately find all subsequent Ordered blocks in the same epoch and set StateComputeResult::new_dummy() as their compute result. This allows consensus to consume these results instantly without blocking.
block_buffer_manager.rs — get_committed_blocks(): Skip suffix blocks past the latest_epoch_change_block_number boundary. These blocks have dummy execution results and were never executed by reth, so they must not enter the reth commit path.

Fix deprecated `try_next()` → `try_recv()`

Replace all deprecated futures_channel::mpsc::UnboundedReceiver::try_next() calls with try_recv() across the consensus crate.

Dependency Updates

Update gaptos to branch feat/add-epoch-block-info (types: add EpochBlockInfo to BlockInfo gravity-aptos#57)
Update greth to branch feat/add-epoch-block-info (chore: update gravity-aptos dep for EpochBlockInfo gravity-reth#305)

Related PRs

…ation - Implement non-blocking epoch change and fix deprecated try_next() - Populate EpochBlockInfo in sign_commit_vote for epoch change blocks - Populate EpochBlockInfo for suffix blocks via BBM - Scope EpochBlockInfo lifecycle to prevent stale info leakage - Support recovery by new ledger_info Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…mpilation errors - Pass round and timestamp directly into EpochBlockInfo at block_buffer_manager level (handle_epoch_change_suffix_blocks), eliminating the need for resolve_epoch_block_info to patch these fields from end_epoch_timestamp - Add round field to BlockState::Ordered and set_ordered_blocks API - Remove epoch_start_timestamp_usecs > 0 guard in block_data.rs genesis construction, consistent with block.rs - Fix try_recv -> try_next for futures_channel::mpsc::UnboundedReceiver across test and production code - Fix clippy errors in block_buffer_manager (map_or -> is_some_and, Option field access) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

keanji-x

Overall the non-blocking epoch change approach is solid. Left a few inline comments on specific concerns.

keanji-x

Overall the non-blocking epoch change approach is solid — returning dummy results for suffix blocks instead of dropping them is the right call. Left a few comments on specific concerns.

keanji-x · 2026-04-10T04:35:13Z

crates/block-buffer-manager/src/block_buffer_manager.rs

+        }
+
+        if !suffix_keys.is_empty() {
+            info!(


Suffix blocks that are already Ordered when the epoch change is detected get StateComputeResult::new_dummy() here, but suffix blocks that arrive later via get_executed_res get epoch_change_block_info.compute_result.clone() — the actual epoch change block's result, which has has_reconfiguration() == true.

Same logical category of blocks, different compute results. Downstream code checking has_reconfiguration() could behave differently depending on timing. Should both paths use new_dummy() consistently?

Fixed — both paths now use StateComputeResult::new_dummy() consistently. The old-epoch bypass branch in get_executed_res got the same treatment, and the now-unused compute_result field on EpochChangeCache was removed so future consumers can't accidentally depend on it.

keanji-x · 2026-04-10T04:35:13Z

crates/block-buffer-manager/src/block_buffer_manager.rs

+        // a previous consume). Use latest_commit_block_number as fallback to
+        // prevent the caller from resetting start_ordered_block to 1.
+        let effective_block_number = if epoch_change_block_number == 0 {
+            block_state_machine.latest_commit_block_number


After consume_epoch_change resets latest_epoch_change_block_number to 0:

is_suffix_block() still returns true (checks epoch_change_block_info directly)

get_epoch_change_block_info() returns None (requires epoch_change_bn > 0)

This asymmetry means the reth side and consensus side disagree on what's a suffix block during the window between consume_epoch_change and release_inflight_blocks. If consensus calls resolve_epoch_block_info in this window, it won't find the EpochBlockInfo it needs.

Consider making both methods use the same predicate, or document why the divergence is safe.

Good catch. Fixed — get_epoch_change_block_info now uses epoch_change_block_info as its source of truth (same predicate as is_suffix_block), so the reth and consensus sides agree across the consume_epoch_change → release_inflight_blocks window. Also rewrote the stale comment on consume_epoch_change — the old note claimed the reset was for get_committed_blocks, but that function never checked this field; the actual reason is release_inflight_blocks retain semantics.

keanji-x · 2026-04-10T04:35:13Z

aptos-core/consensus/src/block_storage/block_store.rs

-                        .then(|| self.get_quorum_cert_for_block(blocks_to_commit[i + 1].id()))
-                        .flatten();
-                    match next_qc {
-                        Some(qc) if qc.commit_info().id() == block.id() => {


The block-by-block commit logic with BATCH_COMMIT_SIZE is removed here but not mentioned in the PR description. Is this intentional? Any production deployments relying on that env var?

Intentional, sorry for the missing context. BATCH_COMMIT_SIZE is only consumed by gravity_e2e/cluster_test_cases/single_node/test_batch_exec.py — no production deployment uses it. I've re-added the accumulation logic in a simpler form: finalize_order can already commit an N-block batch with just the last block's proof, so we just early-return from send_for_execution while the accumulated path is under the threshold and let path_from_ordered_root grow on the next call. Default behavior (env var unset) is unchanged.

keanji-x · 2026-04-10T04:35:13Z

aptos-core/consensus/src/block_storage/sync_manager.rs

        );

-        // if the block exists between commit root and ordered root
+        // Step 1: If the block already exists between commit root and ordered root,


The has_missing_randomness guard was removed from this condition. Previously missing randomness would skip the "send commit proof" path and go straight to full sync. Now it always sends the commit proof first, then potentially also syncs. Intentional change or side effect of the refactor?

Intentional — we need to guarantee the commit vote is sent regardless of randomness state. Will update the PR description to call this out.

keanji-x · 2026-04-10T04:35:13Z

aptos-core/consensus/src/pipeline/buffer_manager.rs

+            }
+        } else {
+            debug!(
+                round = round,


When the cache hits MAX_PENDING_COMMIT_PROOFS, new proofs are rejected. Since older proofs are more likely stale, consider evicting the oldest entry (pop_first() from the BTreeMap) instead of refusing new ones.

Agreed, done. The cache now evicts the oldest pending proof via pop_first() when full — but only if the incoming proof is strictly newer than the oldest, otherwise we'd drop a newer proof to accept an older one.

- suffix block compute_result consistency: both the pre-computed (handle_epoch_change_suffix_blocks) and the on-demand (get_executed_res) paths now write StateComputeResult::new_dummy(), so downstream consumers can never see has_reconfiguration()==true leaked onto a suffix block. EpochChangeCache.compute_result is removed, and the old-epoch bypass branch gets the same treatment. - is_suffix_block / get_epoch_change_block_info asymmetry: get_epoch_change_block_info now reads epoch_change_block_info as its source of truth, matching is_suffix_block. The reth and consensus sides now agree across the consume_epoch_change → release_inflight_blocks window. Stale comment on consume_epoch_change rewritten to describe the real reason for the reset (retain semantics in release_inflight_blocks). - BATCH_COMMIT_SIZE re-added in simplified form for test_batch_exec.py: early-return from send_for_execution while path_from_ordered_root is under the threshold; finalize_order already commits N-block batches with just the last block's proof, so the previous one-by-one QC lookup is unnecessary. No-op when the env var is unset. - pending_commit_proofs eviction: when the cache is full, evict the oldest entry via pop_first() — but only if the incoming proof is strictly newer than the oldest, otherwise we'd drop a newer proof to accept an older one. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

keanji-x

Re-reviewed after fix commit e91d0f3d. 4 out of 5 previous issues are addressed — nice work.

Resolved:

Inconsistent dummy results — compute_result removed from EpochChangeCache, all suffix block paths now use StateComputeResult::new_dummy() consistently. LGTM.
is_suffix_block vs get_epoch_change_block_info asymmetry — get_epoch_change_block_info now uses epoch_change_block_info directly instead of latest_epoch_change_block_number, matching is_suffix_block. LGTM.
BATCH_COMMIT_SIZE removal — Re-added with clear comment referencing test_batch_exec.py. LGTM.
pending_commit_proofs eviction — Now evicts oldest entry via pop_first() with a guard against inserting older-than-oldest proofs. LGTM.

Still open:
5. has_missing_randomness guard removal in sync_to_highest_commit_cert — not addressed in the fix commit. Left an inline comment.

keanji-x · 2026-04-10T07:40:48Z

aptos-core/consensus/src/block_storage/sync_manager.rs

        );

-        // if the block exists between commit root and ordered root
+        // Step 1: If the block already exists between commit root and ordered root,


This is the one remaining item from the previous review. The old code skipped the "send commit proof" branch entirely when has_missing_randomness was true, forcing a full sync. Now Step 1 always fires regardless of has_missing_randomness, and execution falls through to Step 2.

Is this intentional? If so, could you add a brief comment explaining why it's safe to send the commit proof even when randomness data is missing?

ByteYue previously approved these changes Mar 20, 2026

View reviewed changes

keanji-x dismissed ByteYue’s stale review via 56b9ee5 March 20, 2026 09:04

Lchangliang previously approved these changes Mar 20, 2026

View reviewed changes

keanji-x dismissed Lchangliang’s stale review via b67f139 March 23, 2026 12:26

Lchangliang force-pushed the feat/non-blocking-epoch-change branch from 51dec02 to f9875a8 Compare March 27, 2026 08:36

Lchangliang and others added 2 commits April 10, 2026 10:18

Lchangliang force-pushed the feat/non-blocking-epoch-change branch from f9875a8 to 5a86de1 Compare April 10, 2026 03:53

keanji-x commented Apr 10, 2026

View reviewed changes

Lchangliang force-pushed the feat/non-blocking-epoch-change branch from 7fb4cbe to e91d0f3 Compare April 10, 2026 07:01

Merge branch 'main' into feat/non-blocking-epoch-change

02ee8f2

Lchangliang previously approved these changes Apr 10, 2026

View reviewed changes

keanji-x commented Apr 10, 2026

View reviewed changes

fmt

ff71bdf

Lchangliang dismissed their stale review via ff71bdf April 10, 2026 10:29

Lchangliang approved these changes Apr 10, 2026

View reviewed changes

Conversation

keanji-x commented Mar 20, 2026

Summary

Non-blocking Epoch Change

Fix deprecated try_next() → try_recv()

Dependency Updates

Related PRs

Uh oh!

keanji-x left a comment

Choose a reason for hiding this comment

Uh oh!

keanji-x left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

keanji-x left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix deprecated `try_next()` → `try_recv()`