Apply blocks mid-production when our branch is locked out by a strong QC #1887
Open
heifner wants to merge 1 commit into AntelopeIO:main from
Conversation
Apply blocks mid-production when our branch is locked out by a strong QC

Previously, while a node was inside its own producing slot, on_incoming_block would unconditionally early-return and defer apply_blocks until the slot ended. That avoids interrupting block production mid-block, but it also keeps the producer signing on a known-doomed branch when the rest of the network has already moved on.

Add a new block_handle::locks_out_branch_of() helper that reports whether a given block carries a strong QC for a block not in another head's ancestry -- a condition that, under Savanna's strong-vote locking rule, means no future QC can ever form on that head's branch and the branch can never win fork-choice. Use it in on_incoming_block to bypass the in_producing_mode early-return only when the fork-database best head locks out the applied head's branch. Normal in-slot operation is unchanged; only the provably-doomed case falls through to apply_blocks immediately.

The check runs on the application thread (where on_incoming_block executes via the read_write executor queue), so it can safely read both chain.fork_db_head() and chain.head(). The helper itself is also thread-safe: block_state_ptr is a shared_ptr and the underlying finality_core is immutable post-construction.
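The ancestry test the helper performs can be sketched on a toy fork database. Everything below (toy_block, the id/parent encoding, the function shapes) is an illustrative model under stated assumptions, not the actual spring types:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Toy model: each block knows its parent (0 == root) and, optionally,
// the block id its carried strong QC claims.
struct toy_block {
   uint32_t id;
   uint32_t parent;                       // parent block id; 0 means root
   std::optional<uint32_t> strong_qc_for; // block id certified by a carried strong QC, if any
};

using toy_fork_db = std::unordered_map<uint32_t, toy_block>;

// True if `ancestor` appears in the ancestry of `head` (inclusive of head itself).
inline bool in_ancestry(const toy_fork_db& db, uint32_t head, uint32_t ancestor) {
   for (uint32_t cur = head; cur != 0; cur = db.at(cur).parent) {
      if (cur == ancestor) return true;
   }
   return false;
}

// Sketch of the locks_out_branch_of() idea: `carrier` locks out `head`'s branch
// iff it holds a strong QC for a block that is NOT in `head`'s ancestry. Under
// Savanna's strong-vote locking rule, no future QC can then form on that branch.
inline bool locks_out_branch_of(const toy_fork_db& db, const toy_block& carrier, uint32_t head) {
   if (!carrier.strong_qc_for) return false;
   return !in_ancestry(db, head, *carrier.strong_qc_for);
}
```

For example, with root 1 and two children 2 and 3, a block on 3's branch carrying a strong QC for 3 locks out head 2's branch, but not head 3's own branch.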
Summary
While a node is producing its own round, on_incoming_block currently early-returns and defers applying any incoming blocks until the slot ends. That avoids interrupting block production mid-block, but it also keeps the producer signing on a branch that may already be doomed by fork-choice. This PR adds a narrow exception: when an incoming block proves (via a strong QC) that our applied head's branch can no longer win fork-choice, fall through the gate and apply blocks immediately.

This was discussed previously, but we didn't get around to implementing it. The drop-late-blocks-during-our-slot behavior was kept implicitly when the in-producing-mode gate was added, but a QC-aware short-circuit was never implemented.
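A minimal sketch of the reworked gate's control flow, with the chain state reduced to two flags; the struct and function names here are hypothetical stand-ins, not the real producer_plugin code:

```cpp
// Illustrative stand-ins for the producer state consulted by the gate.
struct toy_chain_state {
   bool in_producing_mode = false;
   // Does the fork-db best head carry a strong QC locking out the applied head's branch?
   bool fork_db_head_locks_out_applied_head = false;
};

// Sketch of the gate: previously the producing-mode check unconditionally
// deferred apply_blocks; now the provably-doomed case falls through.
inline bool should_apply_blocks_now(const toy_chain_state& s) {
   if (!s.in_producing_mode)
      return true;                                // not our slot: apply as usual
   // in our slot: defer unless our branch can never win fork-choice
   return s.fork_db_head_locks_out_applied_head;
}
```

Normal in-slot operation still defers; only the lockout case applies immediately.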
The matching change has shipped in wire-sysio: Wire-Network/wire-sysio#320
Why now -- observed on Vaulta
A BP operator on Vaulta reported missing exactly 11 of 12 blocks in their round, always 11, twice a day at random. Investigation showed the cause:
- The tracked_votes / log_missing_votes lines showed that every late-arriving block's QC was missing only the affected BP, meaning >=2/3 of finalizer weight had voted promptly.
- The node was in in_producing_mode(), so the existing early-return deferred applying the incoming blocks.
- When apply_blocks finally ran, the fork switched and all 12 of the slot's signed blocks were orphaned. The producer signed one more block on the new head (the boundary block whose timestamp was still in slot), then handed off.

Net: 12 produced, 11 orphaned, 1 final. Always exactly 11 missed -- not a flake, not random, just the geometry of the in-producing-mode gate combined with strong-QC propagation arriving anywhere inside the production round.
The wire-side propagation gap is its own inbound-peering issue on the BP side. But the in-producing-mode gate amplifies the cost: even if the network arrival had been a couple seconds earlier, the producer still wouldn't have switched until slot end. This PR addresses the gate amplification.
Consensus details and why this trigger is correct
The correctness of this trigger rests on a formal proof of the Savanna protocol. The relevant property is established in the Savanna proofs document, Lemma 4 (Strong QC Conflict Impossibility):
Translated: once a strong QC exists for any block whose claim references B, no QC can ever form on a branch conflicting with B. The doc additionally states explicitly that "seeing a strong QC on a competing fork IS mathematically sufficient to know your fork cannot win" and that LIB advancement is not required for this conclusion. The full proof and surrounding context: https://docs.wire.network/docs/introduction/savanna-proofs
Mechanism: a strong QC for B implies >=2/3 of finalizer weight cast a strong vote on B. The safety rule for strong votes locks those finalizers on B, so no future QC can form on a branch that does not extend B. Therefore, if our applied head is on such a branch, no QC will ever form on it, it can never win fork-choice, and every block we sign on it is guaranteed to be orphaned.
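The quorum arithmetic behind that mechanism can be checked with a toy calculation, assuming equal-weight finalizers and a more-than-two-thirds strong-QC threshold (the exact threshold form here is illustrative):

```cpp
#include <cstdint>

// Toy quorum model: strong QC requires strictly more than 2/3 of total weight.
struct toy_quorum {
   uint64_t total_weight;
   uint64_t strong_qc_threshold() const { return total_weight * 2 / 3 + 1; }
};

// Once `locked_weight` finalizer weight has strong-voted on B (and is therefore
// locked on B), can a conflicting branch still assemble a strong QC? Only the
// unlocked remainder is free to vote there, and it is below quorum forever.
inline bool conflicting_qc_possible(const toy_quorum& q, uint64_t locked_weight) {
   const uint64_t free_weight = q.total_weight - locked_weight;
   return free_weight >= q.strong_qc_threshold();
}
```

With 9 units of total weight, the threshold is 7; once 7 units are locked on B, only 2 remain free, so no conflicting QC can ever form.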
So the trigger is: a received block carries a strong QC for a block not in our applied head's ancestry, and our applied head is not on the QC carrier's branch. Both conditions are required to handle edge cases: for example, a node that is merely behind on its own branch may see a strong QC for a block not yet in its ancestry, but since its head is on the carrier's branch, no lockout applies and it should simply catch up.
Why not LIB
LIB advancement past the fork point is also a sufficient signal of lockout, but it requires waiting for the QC-of-QC pattern, one extra block of finality formation, before it can fire. In practice the SQC trigger fires within milliseconds of when LIB-impossibility would also fire, while paying one block of latency rather than two. And the SQC condition is the cause of LIB advancement; using SQC directly is closer to the actual safety property the protocol guarantees, as established by Lemma 4 above.
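The one-block difference can be made concrete with a toy timing model, assuming the usual pipeline where block N+1 carries the strong QC for block N and block N+2 carries the QC-of-QC that advances LIB past N (block numbers are illustrative):

```cpp
#include <cstdint>

// Toy model of when each lockout signal becomes observable, given a strong
// QC certifying block N. Assumes the standard two-step finality pipeline.
struct lockout_timing {
   uint32_t qc_claimed_block; // N: the block the strong QC certifies
   // The SQC trigger fires as soon as the QC-carrying block (N+1) arrives.
   uint32_t sqc_trigger_block() const { return qc_claimed_block + 1; }
   // LIB-based detection waits for the QC-of-QC carrier (N+2).
   uint32_t lib_trigger_block() const { return qc_claimed_block + 2; }
};
```

The SQC trigger always observes the lockout exactly one block sooner than LIB advancement would.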