
Network sync fix 1 (real)#480

Merged
czareko merged 9 commits into add-network-sync-crate from network-sync-fix-1
Apr 7, 2026

Conversation

@n13
Collaborator

@n13 n13 commented Apr 7, 2026

OG network sync fix

  • do not drop peers on network timeout
  • configurable network timeout for block requests (not hardcoded 20s)
  • on failure, retry with a different block range, halving the request size each time until we only request a single block
  • use the SingleState transaction pool as per "Switch to SingleState mem pool" (#479); we needed to integrate this for the final test

All this helps during network congestion and/or slow networks.
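The halving behaviour described above can be sketched as follows. This is a hypothetical illustration (RequestBackoff and its methods are made-up names, not the actual types in the sync crate): each failed block request halves the range, floored at a single block, and a success restores the configured maximum.

```rust
/// Sketch of the request-size backoff: every failed block request halves the
/// range until we ask for a single block; a success restores the maximum.
struct RequestBackoff {
    configured_max: u32, // e.g. --max-blocks-per-request 64
    current_max: u32,
}

impl RequestBackoff {
    fn new(configured_max: u32) -> Self {
        Self { configured_max, current_max: configured_max }
    }

    /// Halve the request size on failure, never going below a single block.
    fn on_failure(&mut self) -> u32 {
        self.current_max = (self.current_max / 2).max(1);
        self.current_max
    }

    /// A successful response restores the full request size.
    fn on_success(&mut self) -> u32 {
        self.current_max = self.configured_max;
        self.current_max
    }
}

fn main() {
    let mut b = RequestBackoff::new(64);
    assert_eq!(b.on_failure(), 32);
    assert_eq!(b.on_failure(), 16);
    for _ in 0..10 {
        b.on_failure();
    }
    assert_eq!(b.current_max, 1); // floor: a single block
    assert_eq!(b.on_success(), 64);
}
```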

RUST_LOG=sync=debug,sub-libp2p=debug,import-queue=debug,network=debug ./quantus-node \
--validator --name nsxq14 \
--chain planck \
--node-key-file n2.key \
--rewards-inner-hash 0x15146c68374965405828743ce578d609af8b3c26b8532a4bbc710bb37449da35 \
--max-blocks-per-request 64 \
--sync full \
--base-path tmp14 \
--sync-max-timeouts-before-drop 20 \
--sync-block-request-timeout 40

Handling OutboundFailure::Io more generously

The OutboundFailure::Io(std::io::Error) is produced in exactly four scenarios from the libp2p handler:

  1. codec.write_request() fails (line 193-194) -- error writing the request to the stream. This would be a genuine network I/O failure (broken pipe, connection reset, etc).

  2. stream.close() fails (line 195) -- couldn't cleanly half-close the outbound stream. Similar genuine network issues.

  3. codec.read_response() fails (line 196-197) -- error reading the response. This is where your error comes from. Looking at request_responses.rs, read_response returns Err(io::Error) in these cases:

    • Response size exceeds limit (InvalidInput) -- your case, a misconfiguration
    • Varint decode error on the length prefix (InvalidInput) -- malformed/corrupt data from a peer
    • io.read_exact() failure -- TCP-level read error (connection reset, broken pipe, etc.)

  4. Max sub-streams reached (line 210-213) -- local resource exhaustion, ErrorKind::Other.

rep::IO is -(1 << 10) = -1024. That's a mild penalty (same weight as TIMEOUT). The ban threshold is at 71% of i32::MIN (~-1.5 billion). So the reputation hit itself is harmless -- it would take ~1.5 million consecutive IO errors to hit the ban threshold through reputation alone.
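The arithmetic above can be checked directly (the constant values are from the text; the snippet itself is just a verification sketch):

```rust
fn main() {
    // rep::IO penalty, same weight as TIMEOUT.
    let rep_io: i64 = -(1 << 10);
    assert_eq!(rep_io, -1024);

    // Ban threshold: 71% of i32::MIN, roughly -1.5 billion.
    let ban_threshold: i64 = i32::MIN as i64 * 71 / 100;
    assert_eq!(ban_threshold, -1_524_713_390);

    // Consecutive IO errors needed to reach the ban through reputation alone.
    let errors_needed = ban_threshold / rep_io;
    assert_eq!(errors_needed, 1_488_977); // ~1.5 million
}
```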

The problem was never the rep::IO penalty. It was the unconditional disconnect_peer call which then fed into disconnected_peers.rs, which applies a fatal (i32::MIN) ban after just 3 disconnects-during-request.

So the gating fix is correct: it prevents the disconnect_peer call during major sync, which prevents the cascade into disconnected_peers's fatal ban. The rep::IO penalty is still applied when should_drop_peer is true, and it's appropriately mild.

To summarize: Io(_) covers transient network errors and local resource limits -- none are signals of a malicious peer worthy of a ban. The gating is the right approach, same as we do for Timeout, DialFailure, ConnectionClosed, and Refused.
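As a sketch, the gating amounts to the following. The enum and function here are simplified stand-ins, not the actual sc-network definitions; should_drop_peer is the flag mentioned above, set once a peer exhausts its timeout budget.

```rust
use std::io;

// Simplified stand-in for the libp2p/Substrate failure type.
#[derive(Debug)]
enum OutboundFailure {
    DialFailure,
    Timeout,
    ConnectionClosed,
    Refused,
    Io(io::Error),
}

/// Mild penalty, same weight as TIMEOUT: -(1 << 10) = -1024.
const REP_IO: i32 = -(1 << 10);

/// Decide how to react to an outbound failure. Returns the reputation change
/// and whether to call disconnect_peer. None of these failures indicate a
/// malicious peer, so we no longer disconnect unconditionally: that call is
/// what previously cascaded into disconnected_peers' fatal (i32::MIN) ban
/// after three disconnects-during-request.
fn on_outbound_failure(
    failure: &OutboundFailure,
    should_drop_peer: bool,
) -> (i32, bool) {
    match failure {
        OutboundFailure::Io(_)
        | OutboundFailure::Timeout
        | OutboundFailure::DialFailure
        | OutboundFailure::ConnectionClosed
        | OutboundFailure::Refused => {
            // Penalize mildly and disconnect only once the peer has
            // exhausted its configured timeout budget.
            if should_drop_peer { (REP_IO, true) } else { (0, false) }
        }
    }
}

fn main() {
    let io_err = OutboundFailure::Io(io::Error::new(io::ErrorKind::Other, "reset"));
    // Transient failure, budget not exhausted: no penalty, no disconnect.
    assert_eq!(on_outbound_failure(&io_err, false), (0, false));
    // Budget exhausted: mild penalty, and only now disconnect.
    assert_eq!(on_outbound_failure(&io_err, true), (REP_IO, true));
}
```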

n13 added 3 commits April 7, 2026 11:14
if requests fail, we halve the requested blocks
we don't drop peers quickly
@n13 n13 changed the base branch from main to add-network-sync-crate April 7, 2026 04:28
@n13
Collaborator Author

n13 commented Apr 7, 2026

[screenshot]

@n13
Copy link
Copy Markdown
Collaborator Author

n13 commented Apr 7, 2026

[screenshot]

n13 added 5 commits April 7, 2026 12:30
this covers some other class of network error, like received too much data, and a few others, none of these malicious
@n13
Collaborator Author

n13 commented Apr 7, 2026

[screenshot]

It's a pre-existing pattern in syncing_service; I'm just following how everything else works in this class.

_This is Substrate's standard internal actor pattern. SyncingService is a lightweight, cloneable handle that other parts of the node hold (RPC, block import pipeline, etc.). It sends commands over a TracingUnboundedSender<ToServiceCommand> (a futures::mpsc::unbounded channel) to the SyncingEngine, which is the single async task that owns all the sync state and processes commands in its event loop.

unbounded_send is used (rather than bounded send().await) because these are fire-and-forget configuration commands and the caller shouldn't block. Every other method on SyncingService uses the exact same pattern -- on_block_finalized, announce_block, request_justification, etc. It's all local, in-process message passing within the same node process._
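A minimal sketch of that handle/engine split, using std::sync::mpsc in place of Substrate's TracingUnboundedSender (the command variants here are illustrative, not the real ToServiceCommand definition):

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative commands; the real enum lives next to SyncingEngine.
enum ToServiceCommand {
    SetBlockRequestTimeout(u64),
    AnnounceBlock(u64),
}

// Lightweight, cloneable handle held by other parts of the node.
#[derive(Clone)]
struct SyncingService {
    tx: mpsc::Sender<ToServiceCommand>,
}

impl SyncingService {
    /// Fire-and-forget: the caller never blocks, mirroring unbounded_send.
    fn set_block_request_timeout(&self, secs: u64) {
        let _ = self.tx.send(ToServiceCommand::SetBlockRequestTimeout(secs));
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let service = SyncingService { tx };

    // The engine is the single task owning all sync state; it drains
    // commands in its event loop.
    let engine = thread::spawn(move || {
        let mut timeout: u64 = 20; // hardcoded default before this PR
        while let Ok(cmd) = rx.recv() {
            match cmd {
                ToServiceCommand::SetBlockRequestTimeout(s) => timeout = s,
                ToServiceCommand::AnnounceBlock(_) => {}
            }
        }
        timeout
    });

    service.set_block_request_timeout(40);
    drop(service); // dropping all senders ends the engine loop
    assert_eq!(engine.join().unwrap(), 40);
}
```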

@n13
Collaborator Author

n13 commented Apr 7, 2026

I rechecked the current workspace, and the build blocker is gone now: ./scripts/pre_checkin_checks.sh passes.

Findings

  • Medium: client/network-sync/src/engine.rs keeps peer_failures across disconnects. on_sync_peer_disconnected removes the peer from self.peers, but it never clears peer_failures, so a reconnecting peer can inherit old timeout history and get dropped much sooner than intended.
  • Medium: node/src/service.rs hardcodes TransactionPoolType::SingleState in new_partial, which silently changes the node’s mempool semantics away from the repo’s fork-aware default. That can regress transaction retention and revalidation on chains that fork or reorg, and it bypasses any future pool-type choice unless it is made explicit.
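A minimal sketch of the fix suggested by the first finding, with simplified stand-in types (the real engine keys these maps by PeerId):

```rust
use std::collections::HashMap;

// Simplified stand-in for the sync engine's per-peer bookkeeping.
#[derive(Default)]
struct SyncingEngine {
    peers: HashMap<u64, ()>,          // peer id -> peer state (elided)
    peer_failures: HashMap<u64, u32>, // peer id -> timeout count
}

impl SyncingEngine {
    fn on_sync_peer_disconnected(&mut self, peer: u64) {
        self.peers.remove(&peer);
        // The fix: clear the failure history too, so a reconnecting peer
        // does not inherit old timeouts and get dropped sooner than intended.
        self.peer_failures.remove(&peer);
    }
}

fn main() {
    let mut engine = SyncingEngine::default();
    engine.peers.insert(7, ());
    engine.peer_failures.insert(7, 19); // one timeout from the drop limit
    engine.on_sync_peer_disconnected(7);
    assert!(engine.peer_failures.get(&7).is_none());
}
```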

If you want, I can keep digging for any other behavioral regressions in the sync path.

@czareko czareko merged commit ed3be47 into add-network-sync-crate Apr 7, 2026
4 checks passed
czareko added a commit that referenced this pull request Apr 7, 2026
* pull in network sync package

modify to make it disconnect less often

* bugfix

* reset network sync

* Network sync fix 1 (real) (#480)

* original backoff code from PR 190 added

if requests fail, we halve the requested blocks
we don't drop peers quickly

* configure block request timeout

* format

* use single state tx pool

* taplo format

* apply peer drop limit to OutboundFailure::Io

this covers some other class of network error, like received too much data, and a few others, none of these malicious

* renamed RequestSignature to SyncRequestParams

* format

* feature gate external packages tests

---------

Co-authored-by: Nikolaus Heger <nheger@gmail.com>
czareko added a commit that referenced this pull request Apr 8, 2026
* draft: Planck profile

* feat: Runtime + boot nodes update

* fix: json spec updated

* feat: Chain-Spec: IP node for Planck

* feat: Planck genesis refreshed

* feat: Heisenberg refreshed

* fix printout

* rename rewards_preimage to rewards_inner_hash

add nice error messages and checks for the inner hash format

* fix: FMT

* Add network sync crate (#482)

* pull in network sync package

modify to make it disconnect less often

* bugfix

* reset network sync

* Network sync fix 1 (real) (#480)

* original backoff code from PR 190 added

if requests fail, we halve the requested blocks
we don't drop peers quickly

* configure block request timeout

* format

* use single state tx pool

* taplo format

* apply peer drop limit to OutboundFailure::Io

this covers some other class of network error, like received too much data, and a few others, none of these malicious

* renamed RequestSignature to SyncRequestParams

* format

* feature gate external packages tests

---------

Co-authored-by: Nikolaus Heger <nheger@gmail.com>

---------

Co-authored-by: Nikolaus Heger <nheger@gmail.com>
