-
Notifications
You must be signed in to change notification settings - Fork 716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Net] Massive network refactoring and speedup #1835
[Net] Massive network refactoring and speedup #1835
Conversation
Changes addrman to use the test-before-evict discipline in which an address is to be evicted from the tried table is first tested and if it is still online it is not evicted. Adds tests to provide test coverage for this change. This change was suggested as Countermeasure 3 in Eclipse Attacks on Bitcoin’s Peer-to-Peer Network, Ethan Heilman, Alison Kendler, Aviv Zohar, Sharon Goldberg. ePrint Archive Report 2015/263. March 2015.
Surprisingly this hasn't been causing me any issues while testing, probably because it requires lots of large blocks to be flying around. Send/Recv corks need tests!
This will be needed so that the message processor can cork incoming messages
These conditions are problematic to check without locking, and we shouldn't be relying on the refcount to disconnect.
when vRecvMsg becomes a private buffer, it won't make sense to allow other threads to mess with it anymore.
We'll soon no longer have access to vRecvMsg, and this is more intuitive anyway.
This allows locking to be pushed down to only where it's needed Also reuse the current time rather than checking multiple times.
This may be used publicly in the future
In order to sleep accurately, the message handler needs to know if _any_ node has more processing that it should do before the entire thread sleeps. Rather than returning a value that represents whether ProcessMessages encountered a message that should trigger a disconnnect, interpret the return value as whether or not that node has more work to do. Also, use a global fProcessWake value that can be set by other threads, which takes precedence (for one cycle) over the messagehandler's decision. Note that the previous behavior was to only process one message per loop (except in the case of a bad checksum or invalid header). That was changed in PR dashpay#3180. The only change here in that regard is that the current node now falls to the back of the processing queue for the bad checksum/invalid header cases.
This separates the storage of messages from the net and queued messages for processing, allowing the locks to be split.
Messages are dumped very quickly from the socket handler to the processor, so it's the depth of the processing queue that's interesting. The socket handler checks the process queue's size during the brief message hand-off and pauses if necessary, and the processor possibly un-pauses each time a message is popped off of its queue.
Also add a variadic CDataStream ctor for ease-of-use.
The changes here are dense and subtle, but hopefully all is more explicit than before. - CConnman is now in charge of sending data rather than the nodes themselves. This is necessary because many decisions need to be made with all nodes in mind, and a model that requires the nodes calling up to their manager quickly turns to spaghetti. - The per-node-serializer (ssSend) has been replaced with a (quasi-)const send-version. Since the send version for serialization can only change once per connection, we now explicitly tag messages with INIT_PROTO_VERSION if they are sent before the handshake. With this done, there's no need to lock for access to nSendVersion. Also, a new stream is used for each message, so there's no need to lock during the serialization process. - This takes care of accounting for optimistic sends, so the nOptimisticBytesWritten hack can be removed. - -dropmessagestest and -fuzzmessagestest have not been preserved, as I suspect they haven't been used in years.
Drop all of the old stuff.
This is now handled properly in realtime.
This way we're not relying on messages going out after fDisconnect has been set. This should not cause any real behavioral changes, though feelers should arguably disconnect earlier in the process. That can be addressed in a later functional change.
Also, send reject messages earlier in SendMessages(), so that disconnections are processed earlier. These changes combined should ensure that no message is ever sent after fDisconnect is set.
CVectorWriter is useful for overwriting or appending an existing byte vector. CNetMsgMaker is a shortcut for creating messages on-the-fly which are suitable for pushing to CConnman.
bb4a7d0
to
dc10100
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we go :) , big awesome step in the area🚂 🚂
Code initial review up until b79e416.
The PR syncs properly on testnet 👌 , syncing mainnet now.
- in 701b578 the commit comment is wrongly pointing to a dash PR when the PR is from btc.
- In 1ce349f can remove the unused method
CNode ::GetTotalRecvSize
.
Will continue adding more comments while move forward with the review.
src/masternode-sync.cpp
Outdated
@@ -311,7 +311,7 @@ bool CMasternodeSync::SyncWithNode(CNode* pnode, bool isRegTestNet) | |||
if (pnode->HasFulfilledRequest("getspork")) return true; | |||
pnode->FulfilledRequest("getspork"); | |||
|
|||
pnode->PushMessage(NetMsgType::GETSPORKS); //get current network sporks | |||
g_connman->PushMessageWithVersion(pnode, INIT_PROTO_VERSION, NetMsgType::GETSPORKS); //get current network sporks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, this message is sent after the handshake, shouldn't be using INIT_PROTO_VERSION.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because SyncWithNode
is called from a thread other than the messagehandler thread, there actually isn't any guarantee that these two message pushes are done after the handshake is complete. thats the reason i've used INIT_PROTO_VERSION
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm a race, could be but then all of the messages over the masternode sync process are under the same race umbrella and not only the getsporks message.
Side from that, two other points:
-
Peers will not answer any message until the handshake is completed (check
ProcessMessage()
, if there is no version message then the node will increase the ban score - misbehaving peer - and not process the message, line 5438 of main.cpp.).
Peers shouldn't request data to an uninitialized peer. -
INIT_PROTO_VERSION
is not a wildcard version that the peer can send at any time, the remote side will not understand what is going on if it's send after the handshake (different versions).
What we could do to fix the possible race is skip the peers that haven't set the version yet on this thread and done (we will have peers at this point anyway, the tier two sync only starts when the chain is synced).
So, something like this could fix it:
g_connman->ForEachNodeContinueIf([sync, isRegTestNet](CNode* pnode){
// skip nodes that haven't finished the handshake yet.
if (pfrom->nVersion == 0) { return true; }
return sync->SyncWithNode(pnode, isRegTestNet);
});
(nVersion
is atomic, so no problem on accessing here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code review ACK dc10100.
Only the initial review comment popped up for me. Other than that, it's looking great 👌.
Once that one is fixed, can ACK the PR :) .
Move the initial `GetSporks` message further down so as to be after the `VerAck`, and guard the masternodeman thread from sending any messages to peers that haven't completed the connection handshake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionality tested, working as intended.
Sync from scratch on testnet till fully synchronized took 40 minutes and 48 seconds.
First 1 million blocks took exactly 25 minutes to sync; 39 mins and 19 secs to load all the blocks, and remaining time to synchronize masternodes, budgets, etc.
Every time after testing a newer, better version, it's hard to force myself to test an old version for comparison. Anyways, this is lightning fast sync now, great work Fuzzbawls! 🥇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mainnet sync went fine, ACK 30d5c66 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looking pretty good. Few nits (that can be addressed later) and a question.
Will run some testing.
@@ -709,7 +709,7 @@ bool CNode::ReceiveMsgBytes(const char* pch, unsigned int nBytes, bool& complete | |||
// get current incomplete message, or create a new one | |||
if (vRecvMsg.empty() || | |||
vRecvMsg.back().complete()) | |||
vRecvMsg.push_back(CNetMessage(Params().MessageStart(), SER_NETWORK, nRecvVersion)); | |||
vRecvMsg.push_back(CNetMessage(Params().MessageStart(), SER_NETWORK, INIT_PROTO_VERSION)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could use
vRecvMsg.emplace_back(Params().MessageStart(), SER_NETWORK, INIT_PROTO_VERSION);
and construct the CNetMessage()
in place.
src/main.cpp
Outdated
// Checksum | ||
CDataStream& vRecv = msg.vRecv; | ||
uint256 hash = Hash(vRecv.begin(), vRecv.begin() + nMessageSize); | ||
if (memcmp(hash.begin(), hdr.pchChecksum, CMessageHeader::CHECKSUM_SIZE) != 0) | ||
{ | ||
LogPrintf("%s(%s, %u bytes): CHECKSUM ERROR expected %s was %s\n", __func__, | ||
SanitizeString(strCommand), nMessageSize, | ||
HexStr(hash.begin(), hash.begin()+CMessageHeader::CHECKSUM_SIZE), | ||
HexStr(hdr.pchChecksum, hdr.pchChecksum+CMessageHeader::CHECKSUM_SIZE)); | ||
return fMoreWork; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indentation
TRY_LOCK(pnode->cs_inventory, lockInv); | ||
if (lockInv) | ||
fDelete = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indentation
} | ||
|
||
bool CConnman::DisconnectNode(const std::string& strNode) | ||
{ | ||
LOCK(cs_vNodes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since FindNode
already locks cs_vNodes
, wouldn't it be better, to avoid recursive locking, to lock this inside the if
below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 30d5c66 and merging...
How fast is mainnet sync with this? |
It depends on the connection quality. Here it takes little less than 4 hours. |
This is a combination of multiple upstream PRs focused on optimizing the P2P networking flow after the introduction of CConnman encapsulation, and a few older PRs that were previously missed to support the later optimizations. The PRs are as follows: