
Apply blocks mid-production when our branch is locked out by a strong QC #1887

Open
heifner wants to merge 1 commit into AntelopeIO:main from heifner:fix-abort-production-on-strong-qc-lockout


@heifner heifner commented Apr 30, 2026

Summary

While a node is producing its own round, on_incoming_block currently early-returns and defers applying any incoming blocks until the slot ends. That avoids interrupting block production mid-block, but it also keeps the producer signing on a branch that may already be doomed by fork-choice. This PR adds a narrow exception: when an incoming block proves (via a strong QC) that our applied head's branch can no longer win fork-choice, fall through the gate and apply blocks immediately.
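As a rough illustration of the gate described above, the decision can be reduced to two booleans. This is a minimal sketch, not the nodeos source: `gate_action`, `gate_inputs`, and `incoming_block_gate` are hypothetical names, and the real `on_incoming_block` does considerably more work around this decision.

```cpp
// Hypothetical names throughout; an illustration of the decision only.
enum class gate_action { defer_until_slot_end, apply_blocks_now };

struct gate_inputs {
   bool in_producing_mode;                // node is inside its own production round
   bool best_head_locks_out_applied_head; // best head carries a strong QC that dooms our branch
};

// Defer only when we are producing AND nothing proves our branch is locked out.
// Every other combination applies blocks immediately.
constexpr gate_action incoming_block_gate(const gate_inputs& in) {
   return (in.in_producing_mode && !in.best_head_locks_out_applied_head)
             ? gate_action::defer_until_slot_end
             : gate_action::apply_blocks_now;
}

// Normal in-slot operation is unchanged: incoming blocks are still deferred.
static_assert(incoming_block_gate({true, false}) == gate_action::defer_until_slot_end,
              "producing with no lockout proof keeps the existing early-return");
```

Only the `{true, true}` input changes behavior relative to the existing gate; when the node is not producing, blocks were already applied immediately.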

This was discussed previously, but we didn't get around to implementing it. The drop-late-blocks-during-our-slot behavior was kept implicitly when the in-producing-mode gate was added, but a QC-aware short-circuit was never implemented.

The matching change has shipped in wire-sysio: Wire-Network/wire-sysio#320

Why now -- observed on Vaulta

A BP operator on Vaulta reported missing exactly 11 of 12 blocks in their round, always 11, twice a day at random. Investigation showed the cause:

  1. The previous producer had a propagation gap to the affected node -- their blocks arrived 5-6 seconds late off the wire, even though the rest of the finalizer set was receiving them on time and forming QCs (verified via tracked_votes log_missing_votes lines: every late-arriving block's QC was missing only the affected BP, meaning >=2/3 of finalizer weight had voted promptly).
  2. The affected node, not seeing those blocks, treated the previous slot as missed and started producing fill-in blocks at those heights with later timestamps and a stale QC claim (the last QC observed before the gap).
  3. When the late chain finally arrived, it carried strong QCs for blocks on the canonical branch -- a comparator-better fork. Forkdb noted the fork switch, but the producer was inside in_producing_mode() and the existing early-return deferred applying.
  4. The producer ran out the rest of its 12-block slot on the stale-QC fork. At slot end, apply_blocks finally ran, the fork switched, and all 11 blocks signed on the stale fork were orphaned. The producer signed one more block on the new head (the boundary block whose timestamp was still in slot), then handed off.

Net: 12 produced, 11 orphaned, 1 final. Always exactly 11 missed -- not a flake, not random, just the geometry of the in-producing-mode gate combined with strong-QC propagation arriving anywhere inside the production round.
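The arithmetic can be written down as a toy model. The old-gate numbers come straight from the observation above; the new-gate function is an assumption about the intended effect of this PR (switch as soon as the strong QC arrives, then sign the remaining blocks on the new head), not a measured result. All names are hypothetical.

```cpp
struct slot_outcome {
   int orphaned;     // blocks signed on the stale branch
   int final_blocks; // blocks signed on the canonical branch
};

constexpr int slot_length = 12;

// Old gate: the fork switch waits for slot end, so the arrival point of the
// strong QC inside the round does not change the outcome.
constexpr slot_outcome old_gate(int /*stale_blocks_before_arrival*/) {
   return slot_outcome{slot_length - 1, 1};
}

// Assumed new behavior: only the blocks signed before the strong QC arrived are lost.
constexpr slot_outcome new_gate(int stale_blocks_before_arrival) {
   return slot_outcome{stale_blocks_before_arrival, slot_length - stale_blocks_before_arrival};
}

static_assert(old_gate(3).orphaned == 11 && old_gate(3).final_blocks == 1,
              "old gate always loses 11 of 12");
```

Under this model, a strong QC that arrives after 3 stale blocks costs 3 blocks instead of 11; a QC that arrives after 11 stale blocks costs the same either way.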

The wire-side propagation gap is its own inbound-peering issue on the BP side. But the in-producing-mode gate amplifies the cost: even if the network arrival had been a couple seconds earlier, the producer still wouldn't have switched until slot end. This PR addresses the gate amplification.

Consensus details and why this trigger is correct

The correctness of this trigger rests on a formal proof of the Savanna protocol. The relevant property is established in the Savanna proofs document, Lemma 4 (Strong QC Conflict Impossibility):

"Suppose SQC(B+) exists with lqc(B+)=B... Then for any block C with C perp B, no QC(C) can exist."

Translated: once a strong QC exists for any block whose claim references B, no QC can ever form on a branch conflicting with B. The doc additionally states explicitly that "seeing a strong QC on a competing fork IS mathematically sufficient to know your fork cannot win" and that LIB advancement is not required for this conclusion. The full proof and surrounding context: https://docs.wire.network/docs/introduction/savanna-proofs

Mechanism: a strong QC for B implies >=2/3 of finalizer weight cast a strong vote on B. The safety rule for strong votes locks those finalizers on B, so no future QC can form on a branch that does not extend B. Therefore, if our applied head is on such a branch:

  • No future QC can ever form on a block that conflicts with B.
  • Every block we sign on that branch necessarily conflicts with B (different ID at heights >= the divergence point).
  • Our branch is mathematically locked out of fork-choice. The only outcome of continuing to produce is more orphaned blocks.
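
The first bullet can be read as a standard quorum-intersection argument. The sketch below is illustrative and not quoted from the proofs document: it assumes total finalizer weight normalized to 1, a QC threshold of at least 2/3, and Byzantine weight $f < 1/3$. Let $S$ be the finalizers whose strong votes formed the QC on $B$, and let $Q$ be any set that would have to vote to form a QC on a conflicting block $C$:

$$
w(S) \ge \tfrac{2}{3},\quad w(Q) \ge \tfrac{2}{3}
\;\Longrightarrow\;
w(S \cap Q) \;=\; w(S) + w(Q) - w(S \cup Q) \;\ge\; \tfrac{2}{3} + \tfrac{2}{3} - 1 \;=\; \tfrac{1}{3} \;>\; f .
$$

So $S \cap Q$ contains honest weight: at least one honest finalizer that is locked on $B$ would also have to vote for $C$, which the strong-vote safety rule forbids. Hence no such $Q$ can exist.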

So the trigger is: a received block carries a strong QC for a block not in our applied head's ancestry, and our applied head is not on the QC carrier's branch. Both conditions are required to handle edge cases:

  • Same-branch ancestry on either side is safe (head extends to the QC target as it grows; head is itself an ancestor of the QC carrier).
  • The strong-QC requirement matters because weak QCs do not lock finalizers under the safety rule; they cannot be used to prove permanent lockout (per the same proofs doc -- weak votes do not contribute to the lock).
  • The non-ancestry check on the QC target rules out the case where the QC is for a shared ancestor (then both branches share the QC and neither is locked out).
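
The two conditions and the edge cases above can be sketched against a toy block tree. This is a minimal sketch assuming C++17; `block_tree`, `qc_claim`, `incoming_block`, and `locks_out_branch` are hypothetical stand-ins and do not reproduce the signature or internals of the `block_handle::locks_out_branch_of()` helper added by this PR.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Toy block tree: each block id maps to its parent id ("" marks the root's parent).
using block_id   = std::string;
using block_tree = std::map<block_id, block_id>;

// True when `ancestor` is `descendant` itself or lies on its parent chain.
bool is_ancestor_or_self(const block_tree& tree, const block_id& ancestor, block_id descendant) {
   while (!descendant.empty()) {
      if (descendant == ancestor) return true;
      auto it = tree.find(descendant);
      if (it == tree.end()) break;   // unknown block: stop walking
      descendant = it->second;       // step to the parent
   }
   return false;
}

struct qc_claim {
   block_id target;    // block the QC is for
   bool     is_strong; // weak QCs do not lock finalizers
};

struct incoming_block {
   block_id                id; // the QC carrier
   std::optional<qc_claim> qc;
};

// Hypothetical predicate: does `carrier` prove that `applied_head`'s branch is locked out?
bool locks_out_branch(const block_tree& tree, const incoming_block& carrier, const block_id& applied_head) {
   assert(tree.count(applied_head) == 1);                  // precondition: head is a known block
   if (!carrier.qc || !carrier.qc->is_strong) return false; // only strong QCs prove lockout
   if (is_ancestor_or_self(tree, carrier.qc->target, applied_head))
      return false;                                         // QC target is in head's ancestry (shared)
   if (is_ancestor_or_self(tree, applied_head, carrier.id))
      return false;                                         // head is behind on the carrier's own branch
   return true;
}
```

For a tree where `B` has children `C1` (canonical) and `S1` (stale), a carrier on the `C1` branch with a strong QC for `C1` locks out a head on the `S1` branch, while a strong QC for the shared ancestor `B` does not.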

Why not LIB

LIB advancement past the fork point is also a sufficient signal of lockout, but it requires waiting for the QC-of-QC pattern (one extra block of finality formation) before it can fire. In practice the SQC trigger fires within milliseconds of the moment LIB-impossibility would also fire, while paying for only one block of latency rather than two. The SQC condition is also the cause of LIB advancement, so using SQC directly is closer to the actual safety property the protocol guarantees, as established by Lemma 4 above.

…ut by a strong QC

Previously, while a node was inside its own producing slot, on_incoming_block
would unconditionally early-return and defer apply_blocks until the slot ended.
That avoids interrupting block production mid-block, but it also keeps the
producer signing on a known-doomed branch when the rest of the network has
already moved on.

Add a new block_handle::locks_out_branch_of() helper that reports whether a
given block carries a strong QC for a block not in another head's ancestry --
a condition that, under Savanna's strong-vote locking rule, means no future
QC can ever form on that head's branch and the branch can never win
fork-choice. Use it in on_incoming_block to bypass the in_producing_mode
early-return only when the fork-database best head locks out the applied
head's branch. Normal in-slot operation is unchanged; only the
provably-doomed case falls through to apply_blocks immediately.

The check runs on the application thread (where on_incoming_block executes
via the read_write executor queue) so it can safely read both
chain.fork_db_head() and chain.head(). The helper itself is also
thread-safe: block_state_ptr is a shared_ptr and the underlying
finality_core is immutable post-construction.