
Network sync fix 1 (real)#480

Merged
czareko merged 9 commits into add-network-sync-crate from network-sync-fix-1
Apr 7, 2026

Conversation

@n13
Collaborator

@n13 n13 commented Apr 7, 2026

OG network sync fix

  • do not drop peers on network timeout
  • configurable network timeout for block requests (not hardcoded 20s)
  • on failure, retry with a different block range, halving the request size each time until we only request a single block
  • use the SingleState transaction pool as per "Switch to SingleState mem pool" (#479); we needed to integrate this for the final test

All this helps during network congestion and/or slow networks.
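The halving behaviour described above can be sketched as follows. This is a hypothetical illustration (RequestBackoff and its methods are made-up names, not the actual types in the sync crate): each failed block request halves the range, floored at a single block, and a success restores the configured maximum.

```rust
/// Sketch of the request-size backoff: every failed block request halves the
/// range until we ask for a single block; a success restores the maximum.
struct RequestBackoff {
    configured_max: u32, // e.g. --max-blocks-per-request 64
    current_max: u32,
}

impl RequestBackoff {
    fn new(configured_max: u32) -> Self {
        Self { configured_max, current_max: configured_max }
    }

    /// Halve the request size on failure, never going below a single block.
    fn on_failure(&mut self) -> u32 {
        self.current_max = (self.current_max / 2).max(1);
        self.current_max
    }

    /// A successful response restores the full request size.
    fn on_success(&mut self) -> u32 {
        self.current_max = self.configured_max;
        self.current_max
    }
}

fn main() {
    let mut b = RequestBackoff::new(64);
    assert_eq!(b.on_failure(), 32);
    assert_eq!(b.on_failure(), 16);
    for _ in 0..10 {
        b.on_failure();
    }
    assert_eq!(b.current_max, 1); // floor: a single block
    assert_eq!(b.on_success(), 64);
}
```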

RUST_LOG=sync=debug,sub-libp2p=debug,import-queue=debug,network=debug ./quantus-node \
--validator --name nsxq14 \
--chain planck \
--node-key-file n2.key \
--rewards-inner-hash 0x15146c68374965405828743ce578d609af8b3c26b8532a4bbc710bb37449da35 \
--max-blocks-per-request 64 \
--sync full \
--base-path tmp14 \
--sync-max-timeouts-before-drop 20 \
--sync-block-request-timeout 40

Handling OutboundFailure::Io more generously

The OutboundFailure::Io(std::io::Error) is produced in exactly four scenarios from the libp2p handler:

  1. codec.write_request() fails (line 193-194) -- error writing the request to the stream. This would be a genuine network I/O failure (broken pipe, connection reset, etc).

  2. stream.close() fails (line 195) -- couldn't cleanly half-close the outbound stream. Similar genuine network issues.

  3. codec.read_response() fails (line 196-197) -- error reading the response. This is where your error comes from. Looking at request_responses.rs, read_response returns Err(io::Error) in these cases:

    • Response size exceeds limit (InvalidInput) -- your case, a misconfiguration
    • Varint decode error on the length prefix (InvalidInput) -- malformed/corrupt data from a peer
    • io.read_exact() failure -- TCP-level read error (connection reset, broken pipe, etc.)

  4. Max sub-streams reached (line 210-213) -- local resource exhaustion, ErrorKind::Other.

rep::IO is -(1 << 10) = -1024. That's a mild penalty (same weight as TIMEOUT). The ban threshold is at 71% of i32::MIN (~-1.5 billion). So the reputation hit itself is harmless -- it would take ~1.5 million consecutive IO errors to hit the ban threshold through reputation alone.
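The arithmetic above can be checked directly (the constant values are from the text; the snippet itself is just a verification sketch):

```rust
fn main() {
    // rep::IO penalty, same weight as TIMEOUT.
    let rep_io: i64 = -(1 << 10);
    assert_eq!(rep_io, -1024);

    // Ban threshold: 71% of i32::MIN, roughly -1.5 billion.
    let ban_threshold: i64 = i32::MIN as i64 * 71 / 100;
    assert_eq!(ban_threshold, -1_524_713_390);

    // Consecutive IO errors needed to reach the ban through reputation alone.
    let errors_needed = ban_threshold / rep_io;
    assert_eq!(errors_needed, 1_488_977); // ~1.5 million
}
```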

The problem was never the rep::IO penalty. It was the unconditional disconnect_peer call which then fed into disconnected_peers.rs, which applies a fatal (i32::MIN) ban after just 3 disconnects-during-request.

So the gating fix is correct: it prevents the disconnect_peer call during major sync, which prevents the cascade into disconnected_peers's fatal ban. The rep::IO penalty is still applied when should_drop_peer is true, and it's appropriately mild.

To summarize: Io(_) covers transient network errors and local resource limits -- none are signals of a malicious peer worthy of a ban. The gating is the right approach, same as we do for Timeout, DialFailure, ConnectionClosed, and Refused.
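As a sketch, the gating amounts to the following. The enum and function here are simplified stand-ins, not the actual sc-network definitions; should_drop_peer is the flag mentioned above, set once a peer exhausts its timeout budget.

```rust
use std::io;

// Simplified stand-in for the libp2p/Substrate failure type.
#[derive(Debug)]
enum OutboundFailure {
    DialFailure,
    Timeout,
    ConnectionClosed,
    Refused,
    Io(io::Error),
}

/// Mild penalty, same weight as TIMEOUT: -(1 << 10) = -1024.
const REP_IO: i32 = -(1 << 10);

/// Decide how to react to an outbound failure. Returns the reputation change
/// and whether to call disconnect_peer. None of these failures indicate a
/// malicious peer, so we no longer disconnect unconditionally: that call is
/// what previously cascaded into disconnected_peers' fatal (i32::MIN) ban
/// after three disconnects-during-request.
fn on_outbound_failure(
    failure: &OutboundFailure,
    should_drop_peer: bool,
) -> (i32, bool) {
    match failure {
        OutboundFailure::Io(_)
        | OutboundFailure::Timeout
        | OutboundFailure::DialFailure
        | OutboundFailure::ConnectionClosed
        | OutboundFailure::Refused => {
            // Penalize mildly and disconnect only once the peer has
            // exhausted its configured timeout budget.
            if should_drop_peer { (REP_IO, true) } else { (0, false) }
        }
    }
}

fn main() {
    let io_err = OutboundFailure::Io(io::Error::new(io::ErrorKind::Other, "reset"));
    // Transient failure, budget not exhausted: no penalty, no disconnect.
    assert_eq!(on_outbound_failure(&io_err, false), (0, false));
    // Budget exhausted: mild penalty, and only now disconnect.
    assert_eq!(on_outbound_failure(&io_err, true), (REP_IO, true));
}
```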

n13 added 3 commits April 7, 2026 11:14
if requests fail, we halve the requested blocks
we don't drop peers quickly
@n13 n13 changed the base branch from main to add-network-sync-crate April 7, 2026 04:28
@n13
Collaborator Author

n13 commented Apr 7, 2026

[screenshot]

@n13
Copy link
Copy Markdown
Collaborator Author

n13 commented Apr 7, 2026

[screenshot]

n13 added 5 commits April 7, 2026 12:30
this covers some other class of network error, like received too much data, and a few others, none of these malicious
@n13
Collaborator Author

n13 commented Apr 7, 2026

[screenshot]

It's a pre-existing pattern in syncing_service; I'm just following how everything else works in this class.

_This is Substrate's standard internal actor pattern. SyncingService is a lightweight, cloneable handle that other parts of the node hold (RPC, block import pipeline, etc.). It sends commands over a TracingUnboundedSender<ToServiceCommand> (a futures::mpsc::unbounded channel) to the SyncingEngine, which is the single async task that owns all the sync state and processes commands in its event loop.

unbounded_send is used (rather than bounded send().await) because these are fire-and-forget configuration commands and the caller shouldn't block. Every other method on SyncingService uses the exact same pattern -- on_block_finalized, announce_block, request_justification, etc. It's all local, in-process message passing within the same node process._
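A minimal sketch of that handle/engine split, using std::sync::mpsc in place of Substrate's TracingUnboundedSender (the command variants here are illustrative, not the real ToServiceCommand definition):

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative commands; the real enum lives next to SyncingEngine.
enum ToServiceCommand {
    SetBlockRequestTimeout(u64),
    AnnounceBlock(u64),
}

// Lightweight, cloneable handle held by other parts of the node.
#[derive(Clone)]
struct SyncingService {
    tx: mpsc::Sender<ToServiceCommand>,
}

impl SyncingService {
    /// Fire-and-forget: the caller never blocks, mirroring unbounded_send.
    fn set_block_request_timeout(&self, secs: u64) {
        let _ = self.tx.send(ToServiceCommand::SetBlockRequestTimeout(secs));
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let service = SyncingService { tx };

    // The engine is the single task owning all sync state; it drains
    // commands in its event loop.
    let engine = thread::spawn(move || {
        let mut timeout: u64 = 20; // hardcoded default before this PR
        while let Ok(cmd) = rx.recv() {
            match cmd {
                ToServiceCommand::SetBlockRequestTimeout(s) => timeout = s,
                ToServiceCommand::AnnounceBlock(_) => {}
            }
        }
        timeout
    });

    service.set_block_request_timeout(40);
    drop(service); // dropping all senders ends the engine loop
    assert_eq!(engine.join().unwrap(), 40);
}
```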

@n13
Collaborator Author

n13 commented Apr 7, 2026

I rechecked the current workspace, and the build blocker is gone now: ./scripts/pre_checkin_checks.sh passes.

Findings

  • Medium: client/network-sync/src/engine.rs keeps peer_failures across disconnects. on_sync_peer_disconnected removes the peer from self.peers, but it never clears peer_failures, so a reconnecting peer can inherit old timeout history and get dropped much sooner than intended.
  • Medium: node/src/service.rs hardcodes TransactionPoolType::SingleState in new_partial, which silently changes the node’s mempool semantics away from the repo’s fork-aware default. That can regress transaction retention and revalidation on chains that fork or reorg, and it bypasses any future pool-type choice unless it is made explicit.
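A minimal sketch of the fix suggested by the first finding, with simplified stand-in types (the real engine keys these maps by PeerId):

```rust
use std::collections::HashMap;

// Simplified stand-in for the sync engine's per-peer bookkeeping.
#[derive(Default)]
struct SyncingEngine {
    peers: HashMap<u64, ()>,          // peer id -> peer state (elided)
    peer_failures: HashMap<u64, u32>, // peer id -> timeout count
}

impl SyncingEngine {
    fn on_sync_peer_disconnected(&mut self, peer: u64) {
        self.peers.remove(&peer);
        // The fix: clear the failure history too, so a reconnecting peer
        // does not inherit old timeouts and get dropped sooner than intended.
        self.peer_failures.remove(&peer);
    }
}

fn main() {
    let mut engine = SyncingEngine::default();
    engine.peers.insert(7, ());
    engine.peer_failures.insert(7, 19); // one timeout from the drop limit
    engine.on_sync_peer_disconnected(7);
    assert!(engine.peer_failures.get(&7).is_none());
}
```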

If you want, I can keep digging for any other behavioral regressions in the sync path.

@czareko czareko merged commit ed3be47 into add-network-sync-crate Apr 7, 2026
4 checks passed
czareko added a commit that referenced this pull request Apr 7, 2026
* pull in network sync package

modify to make it disconnect less often

* bugfix

* reset network sync

* Network sync fix 1 (real) (#480)

* original backoff code from PR 190 added

if requests fail, we halve the requested blocks
we don't drop peers quickly

* configure block request timeout

* format

* use single state tx pool

* taplo format

* apply peer drop limit to OutboundFailure::Io

this covers some other class of network error, like received too much data, and a few others, none of these malicious

* renamed RequestSignature to SyncRequestParams

* format

* feature gate external packages tests

---------

Co-authored-by: Nikolaus Heger <nheger@gmail.com>
czareko added a commit that referenced this pull request Apr 8, 2026
* draft: Planck profile

* feat: Runtime + boot nodes update

* fix: json spec updated

* feat: Chain-Spec: IP node for Planck

* feat: Planck genesis refreshed

* feat: Heisenberg refreshed

* fix printout

* rename rewards_preimage to rewards_inner_hash

add nice error messages and checks for the inner hash format

* fix: FMT

* Add network sync crate (#482)

* pull in network sync package

modify to make it disconnect less often

* bugfix

* reset network sync

* Network sync fix 1 (real) (#480)

* original backoff code from PR 190 added

if requests fail, we halve the requested blocks
we don't drop peers quickly

* configure block request timeout

* format

* use single state tx pool

* taplo format

* apply peer drop limit to OutboundFailure::Io

this covers some other class of network error, like received too much data, and a few others, none of these malicious

* renamed RequestSignature to SyncRequestParams

* format

* feature gate external packages tests

---------

Co-authored-by: Nikolaus Heger <nheger@gmail.com>

---------

Co-authored-by: Nikolaus Heger <nheger@gmail.com>
