[Net] Massive network refactoring and speedup #1835

Fuzzbawls · 2020-09-03T03:05:34Z

This is a combination of multiple upstream PRs focused on optimizing the P2P networking flow after the introduction of CConnman encapsulation, and a few older PRs that were previously missed to support the later optimizations. The PRs are as follows:

bitcoin/bitcoin#9037 - net: Add test-before-evict discipline to addrman
bitcoin/bitcoin#5151 - make CMessageHeader a dumb storage class
bitcoin/bitcoin#6589 - log bytes recv/sent per command
bitcoin/bitcoin#8688 - Move static global randomizer seeds into CConnman
bitcoin/bitcoin#9050 - net: make a few values immutable, and use deterministic randomness for the localnonce
bitcoin/bitcoin#8708 - net: have CConnman handle message sending
bitcoin/bitcoin#9128 - net: Decouple CConnman and message serialization
bitcoin/bitcoin#8822 - net: Consistent checksum handling
bitcoin/bitcoin#9441 - Net: Massive speedup. Net locks overhaul
bitcoin/bitcoin#9609 - net: fix remaining net assertions
bitcoin/bitcoin#9626 - Clean up a few CConnman cs_vNodes/CNode things
bitcoin/bitcoin#9698 - net: fix socket close race
bitcoin/bitcoin#9708 - Clean up all known races/platform-specific UB at the time PR was opened
- Excluded bitcoin/bitcoin@512731b and bitcoin/bitcoin@d8f2b8a, to be done in a separate PR

Changes addrman to use the test-before-evict discipline, in which an address that is about to be evicted from the tried table is first tested and, if it is still online, is not evicted. Adds tests to provide test coverage for this change. This change was suggested as Countermeasure 3 in Eclipse Attacks on Bitcoin’s Peer-to-Peer Network, Ethan Heilman, Alison Kendler, Aviv Zohar, Sharon Goldberg. ePrint Archive Report 2015/263. March 2015.
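The discipline described above can be sketched roughly as follows. This is a simplified, synchronous illustration and not the addrman code: `TriedSlot`, `InsertWithTestBeforeEvict` and the `isOnline` probe are hypothetical names, and the real change is more involved.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical, simplified "tried" slot that holds at most one address.
struct TriedSlot {
    std::string addr; // empty string means the slot is free
};

// Try to place `candidate` into `slot`.
// If the slot is occupied, the existing entry is tested first through the
// caller-supplied `isOnline` probe and is only evicted when the probe fails.
// Returns true if the candidate ended up in the slot.
inline bool InsertWithTestBeforeEvict(TriedSlot& slot, const std::string& candidate,
                                      const std::function<bool(const std::string&)>& isOnline)
{
    assert(!candidate.empty());
    if (!slot.addr.empty() && isOnline(slot.addr)) {
        return false; // existing entry is still reachable: keep it
    }
    slot.addr = candidate; // slot was free, or the old entry failed the test
    return true;
}
```

The point of the probe is that an attacker cannot push a reachable, honest address out of the tried table merely by supplying a colliding one.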

Surprisingly this hasn't been causing me any issues while testing, probably because it requires lots of large blocks to be flying around. Send/Recv corks need tests!

This will be needed so that the message processor can cork incoming messages

These conditions are problematic to check without locking, and we shouldn't be relying on the refcount to disconnect.

when vRecvMsg becomes a private buffer, it won't make sense to allow other threads to mess with it anymore.

We'll soon no longer have access to vRecvMsg, and this is more intuitive anyway.

This allows locking to be pushed down to only where it's needed. Also reuse the current time rather than checking multiple times.

This may be used publicly in the future

In order to sleep accurately, the message handler needs to know if _any_ node has more processing that it should do before the entire thread sleeps. Rather than returning a value that represents whether ProcessMessages encountered a message that should trigger a disconnect, interpret the return value as whether or not that node has more work to do. Also, use a global fProcessWake value that can be set by other threads, which takes precedence (for one cycle) over the messagehandler's decision. Note that the previous behavior was to only process one message per loop (except in the case of a bad checksum or invalid header). That was changed in PR dashpay#3180. The only change here in that regard is that the current node now falls to the back of the processing queue for the bad checksum/invalid header cases.
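A rough sketch of that sleep decision, under stated assumptions: `Node`, `ProcessOne` and `PassAndShouldSleep` are made-up names, and the wake flag is passed in as a parameter rather than being the global fProcessWake. Real code would also wait on a condition variable instead of returning a bool.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical peer with a count of queued, unprocessed messages.
struct Node {
    int pending = 0;
};

// Process at most one message; the return value means "this node has more work".
inline bool ProcessOne(Node& n)
{
    assert(n.pending >= 0);
    if (n.pending > 0) --n.pending;
    return n.pending > 0;
}

// One pass of the handler loop. Returns true if the thread may go to sleep.
inline bool PassAndShouldSleep(std::vector<Node>& nodes, std::atomic<bool>& wake)
{
    bool moreWork = false;
    for (Node& n : nodes) {
        if (ProcessOne(n)) moreWork = true;
    }
    // exchange(false) consumes the wake request, so it only overrides
    // the handler's own decision for a single cycle.
    const bool woken = wake.exchange(false);
    return !moreWork && !woken;
}
```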

This separates the storage of messages from the net and queued messages for processing, allowing the locks to be split.

Messages are dumped very quickly from the socket handler to the processor, so it's the depth of the processing queue that's interesting. The socket handler checks the process queue's size during the brief message hand-off and pauses if necessary, and the processor possibly un-pauses each time a message is popped off of its queue.
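The pause/un-pause hand-off could look something like the sketch below. It is single-threaded for brevity (the locks are omitted), and `PeerQueue`, `floodLimit` and `pauseRecv` are illustrative names, not necessarily the fields used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical per-peer processing queue with a receive-flood threshold.
struct PeerQueue {
    std::deque<std::string> processQueue; // messages waiting for the processor
    std::size_t queuedBytes = 0;
    std::size_t floodLimit;               // assumed threshold, in bytes
    bool pauseRecv = false;               // socket handler stops reading while set

    explicit PeerQueue(std::size_t limit) : floodLimit(limit) {}

    // Socket-handler side: hand a complete message over, then check the depth.
    void HandOff(std::string msg)
    {
        queuedBytes += msg.size();
        processQueue.push_back(std::move(msg));
        pauseRecv = queuedBytes > floodLimit;
    }

    // Processor side: pop one message and possibly un-pause receiving.
    bool PopOne(std::string& out)
    {
        if (processQueue.empty()) return false;
        out = std::move(processQueue.front());
        processQueue.pop_front();
        assert(queuedBytes >= out.size());
        queuedBytes -= out.size();
        pauseRecv = queuedBytes > floodLimit;
        return true;
    }
};
```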

Also add a variadic CDataStream ctor for ease-of-use.

The changes here are dense and subtle, but hopefully all is more explicit than before.
- CConnman is now in charge of sending data rather than the nodes themselves. This is necessary because many decisions need to be made with all nodes in mind, and a model that requires the nodes calling up to their manager quickly turns to spaghetti.
- The per-node-serializer (ssSend) has been replaced with a (quasi-)const send-version. Since the send version for serialization can only change once per connection, we now explicitly tag messages with INIT_PROTO_VERSION if they are sent before the handshake. With this done, there's no need to lock for access to nSendVersion. Also, a new stream is used for each message, so there's no need to lock during the serialization process.
- This takes care of accounting for optimistic sends, so the nOptimisticBytesWritten hack can be removed.
- -dropmessagestest and -fuzzmessagestest have not been preserved, as I suspect they haven't been used in years.
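To illustrate the per-message version tagging, here is a minimal sketch. `TaggedMsg` and `MakeTaggedMsg` are hypothetical, and the value given for INIT_PROTO_VERSION is an assumption that should be checked against version.h in the tree.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Assumed value, for illustration only.
constexpr int INIT_PROTO_VERSION = 209;

// Hypothetical stand-in for a serialized message tagged with its version.
struct TaggedMsg {
    std::string command;
    int version;
};

// Choose the serialization version per message instead of per node:
// INIT_PROTO_VERSION before the handshake, the negotiated version afterwards.
inline TaggedMsg MakeTaggedMsg(bool handshakeDone, int negotiatedVersion, std::string command)
{
    assert(!handshakeDone || negotiatedVersion > 0);
    const int version = handshakeDone ? negotiatedVersion : INIT_PROTO_VERSION;
    return TaggedMsg{std::move(command), version};
}
```

Because each message carries its own version and is built on a fresh stream, nothing shared needs to be locked while serializing.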

Drop all of the old stuff.

This is now handled properly in realtime.

This way we're not relying on messages going out after fDisconnect has been set. This should not cause any real behavioral changes, though feelers should arguably disconnect earlier in the process. That can be addressed in a later functional change.

Also, send reject messages earlier in SendMessages(), so that disconnections are processed earlier. These changes combined should ensure that no message is ever sent after fDisconnect is set.

CVectorWriter is useful for overwriting or appending an existing byte vector. CNetMsgMaker is a shortcut for creating messages on-the-fly which are suitable for pushing to CConnman.
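The overwrite-or-append behaviour can be sketched as below. `VectorWriterSketch` is a stripped-down illustration written for this note; it leaves out the serialization type and version arguments and should not be read as the actual CVectorWriter interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Writes bytes into an existing vector starting at a given position:
// bytes inside the current size are overwritten, the rest are appended.
class VectorWriterSketch
{
public:
    VectorWriterSketch(std::vector<unsigned char>& data, std::size_t pos)
        : m_data(data), m_pos(pos)
    {
        // Starting past the end pads the gap with zero bytes.
        if (m_pos > m_data.size()) m_data.resize(m_pos);
    }

    void write(const unsigned char* src, std::size_t n)
    {
        assert(m_pos <= m_data.size());
        const std::size_t overwrite = std::min(n, m_data.size() - m_pos);
        std::copy(src, src + overwrite, m_data.begin() + m_pos);
        m_data.insert(m_data.end(), src + overwrite, src + n);
        m_pos += n;
    }

private:
    std::vector<unsigned char>& m_data;
    std::size_t m_pos;
};
```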

furszy

Here we go :) , big awesome step in the area🚂 🚂

Code initial review up until b79e416.
The PR syncs properly on testnet 👌 , syncing mainnet now.

In 701b578, the commit message wrongly points to a Dash PR when the PR is from Bitcoin.
In 1ce349f, the unused method CNode::GetTotalRecvSize can be removed.

Will continue adding more comments as I move forward with the review.

furszy · 2020-09-03T19:04:04Z

src/masternode-sync.cpp

@@ -311,7 +311,7 @@ bool CMasternodeSync::SyncWithNode(CNode* pnode, bool isRegTestNet)
        if (pnode->HasFulfilledRequest("getspork")) return true;
        pnode->FulfilledRequest("getspork");

-        pnode->PushMessage(NetMsgType::GETSPORKS); //get current network sporks
+        g_connman->PushMessageWithVersion(pnode, INIT_PROTO_VERSION, NetMsgType::GETSPORKS); //get current network sporks


Same as above, this message is sent after the handshake, shouldn't be using INIT_PROTO_VERSION.

Because SyncWithNode is called from a thread other than the messagehandler thread, there actually isn't any guarantee that these two message pushes are done after the handshake is complete. That's the reason I've used INIT_PROTO_VERSION here.

Hmm, a race. Could be, but then all of the messages in the masternode sync process fall under the same race umbrella, not only the getsporks message.

Aside from that, two other points:

- Peers will not answer any message until the handshake is completed (check ProcessMessage(): if there is no version message, the node increases the ban score for the misbehaving peer and does not process the message; see line 5438 of main.cpp).
- Peers shouldn't request data from an uninitialized peer.

INIT_PROTO_VERSION is not a wildcard version that the peer can send at any time; the remote side will not understand what is going on if it's sent after the handshake (different versions).

What we could do to fix the possible race is to skip, on this thread, the peers that haven't set the version yet, and that's it (we will have peers at this point anyway, since the tier two sync only starts when the chain is synced).

So, something like this could fix it:

g_connman->ForEachNodeContinueIf([sync, isRegTestNet](CNode* pnode){
    // skip nodes that haven't finished the handshake yet.
    if (pnode->nVersion == 0) {
        return true;
    }
    return sync->SyncWithNode(pnode, isRegTestNet);
});

(nVersion is atomic, so no problem on accessing here)

src/spork.cpp

furszy

code review ACK dc10100.
Only the initial review comment popped up for me. Other than that, it's looking great 👌.
Once that one is fixed, can ACK the PR :) .

src/addrman.cpp

Move the initial `GetSporks` message further down so as to be after the `VerAck`, and guard the masternodeman thread from sending any messages to peers that haven't completed the connection handshake.

furszy

code ACK 30d5c66 , testnet sync from scratch went well and tested with #1829 on top as well and all good.

Going to let it sync on mainnet now

ambassador000

Functionality tested, working as intended.

Sync from scratch on testnet till fully synchronized took 40 minutes and 48 seconds.
First 1 million blocks took exactly 25 minutes to sync; 39 mins and 19 secs to load all the blocks, and remaining time to synchronize masternodes, budgets, etc.

Every time after testing a newer, better version, it's hard to force myself to test an old version for comparison. Anyways, this is lightning fast sync now, great work Fuzzbawls! 🥇

furszy

mainnet sync went fine, ACK 30d5c66 .

random-zebra

Code looking pretty good. Few nits (that can be addressed later) and a question.
Will run some testing.

random-zebra · 2020-09-06T15:52:51Z

src/net.cpp

@@ -709,7 +709,7 @@ bool CNode::ReceiveMsgBytes(const char* pch, unsigned int nBytes, bool& complete
        // get current incomplete message, or create a new one
        if (vRecvMsg.empty() ||
            vRecvMsg.back().complete())
-            vRecvMsg.push_back(CNetMessage(Params().MessageStart(), SER_NETWORK, nRecvVersion));
+            vRecvMsg.push_back(CNetMessage(Params().MessageStart(), SER_NETWORK, INIT_PROTO_VERSION));


nit: could use

vRecvMsg.emplace_back(Params().MessageStart(), SER_NETWORK, INIT_PROTO_VERSION);

and construct the CNetMessage() in place.

random-zebra · 2020-09-06T17:02:53Z

src/main.cpp

+        // Checksum
+        CDataStream& vRecv = msg.vRecv;
+        uint256 hash = Hash(vRecv.begin(), vRecv.begin() + nMessageSize);
+        if (memcmp(hash.begin(), hdr.pchChecksum, CMessageHeader::CHECKSUM_SIZE) != 0)
+        {
+            LogPrintf("%s(%s, %u bytes): CHECKSUM ERROR expected %s was %s\n", __func__,
+               SanitizeString(strCommand), nMessageSize,
+               HexStr(hash.begin(), hash.begin()+CMessageHeader::CHECKSUM_SIZE),
+               HexStr(hdr.pchChecksum, hdr.pchChecksum+CMessageHeader::CHECKSUM_SIZE));
+            return fMoreWork;
+        }


nit: indentation

random-zebra · 2020-09-06T17:05:26Z

src/net.cpp

                                TRY_LOCK(pnode->cs_inventory, lockInv);
                                if (lockInv)
                                    fDelete = true;


nit: indentation

random-zebra · 2020-09-06T17:41:52Z

src/net.cpp

 }

 bool CConnman::DisconnectNode(const std::string& strNode)
 {
+    LOCK(cs_vNodes);


Since FindNode already locks cs_vNodes, wouldn't it be better, to avoid recursive locking, to lock this inside the if below?

random-zebra

ACK 30d5c66 and merging...

borris345 · 2020-09-07T13:49:15Z

How fast is mainnet sync with this?

random-zebra · 2020-09-07T14:26:01Z

It depends on the connection quality. Here it takes a little less than 4 hours.

Ethan Heilman and others added 30 commits September 2, 2020 00:41

net: fix typo causing the wrong receive buffer size

d2d71ba

Surprisingly this hasn't been causing me any issues while testing, probably because it requires lots of large blocks to be flying around. Send/Recv corks need tests!

net: make vRecvMsg a list so that we can use splice()

229697a
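As a rough illustration of why a list helps here: `std::list::splice` moves nodes between lists without copying the messages. The sketch below uses hypothetical names (`Msg`, `HandOffComplete`) and is not the code from this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>

// Hypothetical received message: payload plus a "fully received" flag.
struct Msg {
    std::string payload;
    bool complete;
};

// Move the leading run of complete messages from the receive list to the
// processing queue. Returns the number of messages moved; nothing is copied.
inline std::size_t HandOffComplete(std::list<Msg>& recv, std::list<Msg>& processQueue)
{
    const std::size_t total = recv.size() + processQueue.size();
    std::size_t moved = 0;
    auto it = recv.begin();
    while (it != recv.end() && it->complete) {
        ++it;
        ++moved;
    }
    processQueue.splice(processQueue.end(), recv, recv.begin(), it);
    assert(recv.size() + processQueue.size() == total);
    return moved;
}
```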

net: make GetReceiveFloodSize public

1b0beb6

This will be needed so that the message processor can cork incoming messages

net: only disconnect if fDisconnect has been set

6e3f71b

These conditions are problematic to check without locking, and we shouldn't be relying on the refcount to disconnect.

net: wait until the node is destroyed to delete its recv buffer

32ab0c0

when vRecvMsg becomes a private buffer, it won't make sense to allow other threads to mess with it anymore.

net: remove redundant max sendbuffer size check

cc24eff

net: make CMessageHeader a dumb storage class

d2b8e0a

net: set message deserialization version when it's time to deserialize

754400e

We'll soon no longer have access to vRecvMsg, and this is more intuitive anyway.

net: log bytes recv/sent per command

8cee696

net: handle message accounting in ReceiveMsgBytes

ffd4859

This allows locking to be pushed down to only where it's needed. Also reuse the current time rather than checking multiple times.

net: record bytes written before notifying the message processor

47ea844

net: Add a simple function for waking the message handler

7e55dbf

This may be used publicly in the future

net: add a new message queue for the message processor

5581b47

This separates the storage of messages from the net and queued messages for processing, allowing the locks to be split.

Move static global randomizer seeds into CConnman

34050a3

net: constify a few CNode vars to indicate that they're threadsafe

de1ad13

net: Use deterministic randomness for CNode's nonce, and make it const

01ea667

serialization: teach serializers variadics

f558bb7

Also add a variadic CDataStream ctor for ease-of-use.

net: switch all callers to connman for pushing messages

9f939f3

Drop all of the old stuff.

drop the optimistic write counter hack

681c62d

This is now handled properly in realtime.

net: remove now-unused ssSend and Fuzz

40a6c5d

net: construct CNodeStates in place

04d39c8

net: handle version push in InitializeNode

f88c06c

net: don't send any messages before handshake or after fdisconnect

07d8c7b

Also, send reject messages earlier in SendMessages(), so that disconnections are processed earlier. These changes combined should ensure that no message is ever sent after fDisconnect is set.

net: No need to check individually for disconnection anymore

63c51d3

net: add CVectorWriter and CNetMsgMaker

b79e416

CVectorWriter is useful for overwriting or appending an existing byte vector. CNetMsgMaker is a shortcut for creating messages on-the-fly which are suitable for pushing to CConnman.

TheBlueMatt and others added 5 commits September 2, 2020 00:42

Make nStartingHeight atomic

8a66add

Make nServices atomic

d816a86

Move [clean|str]SubVer writes/copyStats into a lock

35365e1

Move CNode::addrName accesses behind locked accessors

470482f

Move CNode::addrLocal access behind locked accessors

2ae76aa

Fuzzbawls added P2P Upstream Network labels Sep 3, 2020

Fuzzbawls added this to the 5.0.0 milestone Sep 3, 2020

Fuzzbawls self-assigned this Sep 3, 2020

[bugfix] Making tier two thread interruptable.

dc10100

Fuzzbawls force-pushed the 2020_network-speedup branch from bb4a7d0 to dc10100 Compare September 3, 2020 09:02

furszy reviewed Sep 3, 2020

furszy reviewed Sep 4, 2020

random-zebra reviewed Sep 4, 2020

furszy mentioned this pull request Sep 4, 2020

Tier two network sync new architecture, regtest support + MN activation functional test. #1829

Merged

Fuzzbawls added 2 commits September 4, 2020 17:25

Don't send layer2 messages to peers that haven't completed the handshake

8a2b7fe

Move the initial `GetSporks` message further down so as to be after the `VerAck`, and guard the masternodeman thread from sending any messages to peers that haven't completed the connection handshake.

net: correct addrman logging

30d5c66

furszy reviewed Sep 5, 2020

ambassador000 approved these changes Sep 6, 2020

furszy approved these changes Sep 6, 2020

random-zebra reviewed Sep 6, 2020

random-zebra approved these changes Sep 7, 2020

random-zebra merged commit cbd9271 into PIVX-Project:master Sep 7, 2020

This was referenced Sep 7, 2020

Make CBlock a vector of shared_ptr of CTransactions #1815

Merged

[Refactor] Masternode Budget first refactoring and cleanup #1826

Merged

random-zebra modified the milestones: 5.0.0, 4.3.0 Sep 10, 2020
