Apply blocks mid-production when our branch is locked out by a strong QC #1887
Open
heifner wants to merge 1 commit into AntelopeIO:main from
Conversation
Apply blocks mid-production when our branch is locked out by a strong QC

Previously, while a node was inside its own producing slot, on_incoming_block would unconditionally early-return and defer apply_blocks until the slot ended. That avoids interrupting block production mid-block, but it also keeps the producer signing on a known-doomed branch when the rest of the network has already moved on.

Add a new block_handle::locks_out_branch_of() helper that reports whether a given block carries a strong QC for a block not in another head's ancestry -- a condition that, under Savanna's strong-vote locking rule, means no future QC can ever form on that head's branch and the branch can never win fork-choice. Use it in on_incoming_block to bypass the in_producing_mode early-return only when the fork-database best head locks out the applied head's branch. Normal in-slot operation is unchanged; only the provably-doomed case falls through to apply_blocks immediately.

The check runs on the application thread (where on_incoming_block executes via the read_write executor queue), so it can safely read both chain.fork_db_head() and chain.head(). The helper itself is also thread-safe: block_state_ptr is a shared_ptr and the underlying finality_core is immutable post-construction.
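The ancestry test the helper performs can be sketched on a toy fork database. Everything below (toy_block, the id/parent encoding, the function shapes) is an illustrative model under stated assumptions, not the actual spring types:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Toy model: each block knows its parent (0 == root) and, optionally,
// the block id its carried strong QC claims.
struct toy_block {
   uint32_t id;
   uint32_t parent;                       // parent block id; 0 means root
   std::optional<uint32_t> strong_qc_for; // block id certified by a carried strong QC, if any
};

using toy_fork_db = std::unordered_map<uint32_t, toy_block>;

// True if `ancestor` appears in the ancestry of `head` (inclusive of head itself).
inline bool in_ancestry(const toy_fork_db& db, uint32_t head, uint32_t ancestor) {
   for (uint32_t cur = head; cur != 0; cur = db.at(cur).parent) {
      if (cur == ancestor) return true;
   }
   return false;
}

// Sketch of the locks_out_branch_of() idea: `carrier` locks out `head`'s branch
// iff it holds a strong QC for a block that is NOT in `head`'s ancestry. Under
// Savanna's strong-vote locking rule, no future QC can then form on that branch.
inline bool locks_out_branch_of(const toy_fork_db& db, const toy_block& carrier, uint32_t head) {
   if (!carrier.strong_qc_for) return false;
   return !in_ancestry(db, head, *carrier.strong_qc_for);
}
```

For example, with root 1 and two children 2 and 3, a block on 3's branch carrying a strong QC for 3 locks out head 2's branch, but not head 3's own branch.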
Summary
While a node is producing its own round, on_incoming_block currently early-returns and defers applying any incoming blocks until the slot ends. That avoids interrupting block production mid-block, but it also keeps the producer signing on a branch that may already be doomed by fork-choice. This PR adds a narrow exception: when an incoming block proves (via a strong QC) that our applied head's branch can no longer win fork-choice, fall through the gate and apply blocks immediately.

This was discussed previously, but we didn't get around to implementing it. The drop-late-blocks-during-our-slot behavior was kept implicitly when the in-producing-mode gate was added, but a QC-aware short-circuit was never implemented.
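A minimal sketch of the reworked gate's control flow, with the chain state reduced to two flags; the struct and function names here are hypothetical stand-ins, not the real producer_plugin code:

```cpp
// Illustrative stand-ins for the producer state consulted by the gate.
struct toy_chain_state {
   bool in_producing_mode = false;
   // Does the fork-db best head carry a strong QC locking out the applied head's branch?
   bool fork_db_head_locks_out_applied_head = false;
};

// Sketch of the gate: previously the producing-mode check unconditionally
// deferred apply_blocks; now the provably-doomed case falls through.
inline bool should_apply_blocks_now(const toy_chain_state& s) {
   if (!s.in_producing_mode)
      return true;                                // not our slot: apply as usual
   // in our slot: defer unless our branch can never win fork-choice
   return s.fork_db_head_locks_out_applied_head;
}
```

Normal in-slot operation still defers; only the lockout case applies immediately.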
The matching change has shipped in wire-sysio: Wire-Network/wire-sysio#320
Why now -- observed on Vaulta
A BP operator on Vaulta reported missing exactly 11 of 12 blocks in their round, always 11, twice a day at random. Investigation showed the cause:
- The tracked_votes / log_missing_votes lines showed that every late-arriving block's QC was missing only the affected BP, meaning >=2/3 of finalizer weight had voted promptly.
- The node was in in_producing_mode(), so the existing early-return deferred applying the incoming blocks.
- When apply_blocks finally ran, the fork switched and all 12 of the slot's signed blocks were orphaned. The producer signed one more block on the new head (the boundary block whose timestamp was still in slot), then handed off.

Net: 12 produced, 11 orphaned, 1 final. Always exactly 11 missed -- not a flake, not random, just the geometry of the in-producing-mode gate combined with strong-QC propagation arriving anywhere inside the production round.
The wire-side propagation gap is its own inbound-peering issue on the BP side. But the in-producing-mode gate amplifies the cost: even if the network arrival had been a couple seconds earlier, the producer still wouldn't have switched until slot end. This PR addresses the gate amplification.
Consensus details and why this trigger is correct
The correctness of this trigger rests on a formal proof of the Savanna protocol. The relevant property is established in the Savanna proofs document, Lemma 4 (Strong QC Conflict Impossibility):
Translated: once a strong QC exists for any block whose claim references B, no QC can ever form on a branch conflicting with B. The doc additionally states explicitly that "seeing a strong QC on a competing fork IS mathematically sufficient to know your fork cannot win" and that LIB advancement is not required for this conclusion. The full proof and surrounding context: https://docs.wire.network/docs/introduction/savanna-proofs
Mechanism: a strong QC for B implies >=2/3 of finalizer weight cast a strong vote on B. The safety rule for strong votes locks those finalizers on B, so no future QC can form on a branch that does not extend B. Therefore, if our applied head is on such a branch, no QC will ever form on it, it can never win fork-choice, and every block we sign on it is guaranteed to be orphaned.
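The quorum arithmetic behind that mechanism can be checked with a toy calculation, assuming equal-weight finalizers and a more-than-two-thirds strong-QC threshold (the exact threshold form here is illustrative):

```cpp
#include <cstdint>

// Toy quorum model: strong QC requires strictly more than 2/3 of total weight.
struct toy_quorum {
   uint64_t total_weight;
   uint64_t strong_qc_threshold() const { return total_weight * 2 / 3 + 1; }
};

// Once `locked_weight` finalizer weight has strong-voted on B (and is therefore
// locked on B), can a conflicting branch still assemble a strong QC? Only the
// unlocked remainder is free to vote there, and it is below quorum forever.
inline bool conflicting_qc_possible(const toy_quorum& q, uint64_t locked_weight) {
   const uint64_t free_weight = q.total_weight - locked_weight;
   return free_weight >= q.strong_qc_threshold();
}
```

With 9 units of total weight, the threshold is 7; once 7 units are locked on B, only 2 remain free, so no conflicting QC can ever form.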
So the trigger is: a received block carries a strong QC for a block not in our applied head's ancestry, and our applied head is not on the QC carrier's branch. Both conditions are required to handle edge cases: for example, a node that is merely behind on its own branch may see a strong QC for a block not yet in its ancestry, but since its head is on the carrier's branch, no lockout applies and it should simply catch up.
Why not LIB
LIB advancement past the fork point is also a sufficient signal of lockout, but it requires waiting for the QC-of-QC pattern, one extra block of finality formation, before it can fire. In practice the SQC trigger fires within milliseconds of when LIB-impossibility would also fire, while paying one block of latency rather than two. And the SQC condition is the cause of LIB advancement; using SQC directly is closer to the actual safety property the protocol guarantees, as established by Lemma 4 above.
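The one-block difference can be made concrete with a toy timing model, assuming the usual pipeline where block N+1 carries the strong QC for block N and block N+2 carries the QC-of-QC that advances LIB past N (block numbers are illustrative):

```cpp
#include <cstdint>

// Toy model of when each lockout signal becomes observable, given a strong
// QC certifying block N. Assumes the standard two-step finality pipeline.
struct lockout_timing {
   uint32_t qc_claimed_block; // N: the block the strong QC certifies
   // The SQC trigger fires as soon as the QC-carrying block (N+1) arrives.
   uint32_t sqc_trigger_block() const { return qc_claimed_block + 1; }
   // LIB-based detection waits for the QC-of-QC carrier (N+2).
   uint32_t lib_trigger_block() const { return qc_claimed_block + 2; }
};
```

The SQC trigger always observes the lockout exactly one block sooner than LIB advancement would.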