Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Net] Massive network refactoring and speedup #1835

Merged
merged 56 commits into from
Sep 7, 2020

Conversation

Fuzzbawls
Copy link
Collaborator

This is a combination of multiple upstream PRs focused on optimizing the P2P networking flow after the introduction of CConnman encapsulation, and a few older PRs that were previously missed to support the later optimizations. The PRs are as follows:

Ethan Heilman and others added 30 commits September 2, 2020 00:41
Changes addrman to use the test-before-evict discipline in which an
address is to be evicted from the tried table is first tested and if
it is still online it is not evicted.

Adds tests to provide test coverage for this change.

This change was suggested as Countermeasure 3 in
Eclipse Attacks on Bitcoin’s Peer-to-Peer Network, Ethan Heilman,
Alison Kendler, Aviv Zohar, Sharon Goldberg. ePrint Archive Report
2015/263. March 2015.
Surprisingly this hasn't been causing me any issues while testing, probably
because it requires lots of large blocks to be flying around.

Send/Recv corks need tests!
This will be needed so that the message processor can cork incoming messages
These conditions are problematic to check without locking, and we shouldn't be
relying on the refcount to disconnect.
when vRecvMsg becomes a private buffer, it won't make sense to allow other
threads to mess with it anymore.
We'll soon no longer have access to vRecvMsg, and this is more intuitive
anyway.
This allows locking to be pushed down to only where it's needed

Also reuse the current time rather than checking multiple times.
This may be used publicly in the future
In order to sleep accurately, the message handler needs to know if _any_
node has more processing that it should do before the entire thread
sleeps.

Rather than returning a value that represents whether ProcessMessages
encountered a message that should trigger a disconnnect, interpret the
return value as whether or not that node has more work to do.

Also, use a global fProcessWake value that can be set by other threads,
which takes precedence (for one cycle) over the messagehandler's
decision.

Note that the previous behavior was to only process one message per loop
(except in the case of a bad checksum or invalid header). That was
changed in PR dashpay#3180.

The only change here in that regard is that the current node now falls
to the back of the processing queue for the bad checksum/invalid header
cases.
This separates the storage of messages from the net and queued messages
for processing, allowing the locks to be split.
Messages are dumped very quickly from the socket handler to the
processor, so it's the depth of the processing queue that's interesting.

The socket handler checks the process queue's size during the brief
message hand-off and pauses if necessary, and the processor possibly
un-pauses each time a message is popped off of its queue.
Also add a variadic CDataStream ctor for ease-of-use.
The changes here are dense and subtle, but hopefully all is more explicit
than before.

- CConnman is now in charge of sending data rather than the nodes themselves.
  This is necessary because many decisions need to be made with all nodes in
  mind, and a model that requires the nodes calling up to their manager quickly
  turns to spaghetti.

- The per-node-serializer (ssSend) has been replaced with a (quasi-)const
  send-version. Since the send version for serialization can only change once
  per connection, we now explicitly tag messages with INIT_PROTO_VERSION if
  they are sent before the handshake. With this done, there's no need to lock
  for access to nSendVersion.

  Also, a new stream is used for each message, so there's no need to lock
  during the serialization process.

- This takes care of accounting for optimistic sends, so the
  nOptimisticBytesWritten hack can be removed.

- -dropmessagestest and -fuzzmessagestest have not been preserved, as I suspect
  they haven't been used in years.
This is now handled properly in realtime.
This way we're not relying on messages going out after fDisconnect has been
set.

This should not cause any real behavioral changes, though feelers should
arguably disconnect earlier in the process. That can be addressed in a later
functional change.
Also, send reject messages earlier in SendMessages(), so that
disconnections are processed earlier.

These changes combined should ensure that no message is ever sent after
fDisconnect is set.
CVectorWriter is useful for overwriting or appending an existing byte vector.

CNetMsgMaker is a shortcut for creating messages on-the-fly which are suitable
for pushing to CConnman.
@Fuzzbawls Fuzzbawls added this to the 5.0.0 milestone Sep 3, 2020
@Fuzzbawls Fuzzbawls self-assigned this Sep 3, 2020
Copy link

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we go :) , big awesome step in the area🚂 🚂

Code initial review up until b79e416.
The PR syncs properly on testnet 👌 , syncing mainnet now.

  • in 701b578 the commit comment is wrongly pointing to a dash PR when the PR is from btc.
  • In 1ce349f can remove the unused method CNode ::GetTotalRecvSize.

Will continue adding more comments while move forward with the review.

src/masternode-sync.cpp Outdated Show resolved Hide resolved
@@ -311,7 +311,7 @@ bool CMasternodeSync::SyncWithNode(CNode* pnode, bool isRegTestNet)
if (pnode->HasFulfilledRequest("getspork")) return true;
pnode->FulfilledRequest("getspork");

pnode->PushMessage(NetMsgType::GETSPORKS); //get current network sporks
g_connman->PushMessageWithVersion(pnode, INIT_PROTO_VERSION, NetMsgType::GETSPORKS); //get current network sporks
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, this message is sent after the handshake, shouldn't be using INIT_PROTO_VERSION.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because SyncWithNode is called from a thread other than the messagehandler thread, there actually isn't any guarantee that these two message pushes are done after the handshake is complete. thats the reason i've used INIT_PROTO_VERSION here

Copy link

@furszy furszy Sep 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm a race, could be but then all of the messages over the masternode sync process are under the same race umbrella and not only the getsporks message.

Side from that, two other points:

  1. Peers will not answer any message until the handshake is completed (check ProcessMessage(), if there is no version message then the node will increase the ban score - misbehaving peer - and not process the message, line 5438 of main.cpp.).
    Peers shouldn't request data to an uninitialized peer.

  2. INIT_PROTO_VERSION is not a wildcard version that the peer can send at any time, the remote side will not understand what is going on if it's send after the handshake (different versions).

What we could do to fix the possible race is skip the peers that haven't set the version yet on this thread and done (we will have peers at this point anyway, the tier two sync only starts when the chain is synced).

So, something like this could fix it:

g_connman->ForEachNodeContinueIf([sync, isRegTestNet](CNode* pnode){
        // skip nodes that haven't finished the handshake yet.
        if (pfrom->nVersion == 0) { return true; }
        return sync->SyncWithNode(pnode, isRegTestNet);
    });

(nVersion is atomic, so no problem on accessing here)

src/spork.cpp Outdated Show resolved Hide resolved
Copy link

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code review ACK dc10100.
Only the initial review comment popped up for me. Other than that, it's looking great 👌.
Once that one is fixed, can ACK the PR :) .

src/addrman.cpp Outdated Show resolved Hide resolved
Move the initial `GetSporks` message further down so as to be after the
`VerAck`, and guard the masternodeman thread from sending any messages
to peers that haven't completed the connection handshake.
Copy link

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code ACK 30d5c66 , testnet sync from scratch went well and tested with #1829 on top as well and all good.

Going to let it sync on mainnet now

Copy link

@ambassador000 ambassador000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionality tested, working as intended.

Sync from scratch on testnet till fully synchronized took 40 minutes and 48 seconds.
First 1 million blocks took exactly 25 minutes to sync; 39 mins and 19 secs to load all the blocks, and remaining time to synchronize masternodes, budgets, etc.

Every time after testing a newer, better version, it's hard to force myself to test an old version for comparison. Anyways, this is lightning fast sync now, great work Fuzzbawls! 🥇

Copy link

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mainnet sync went fine, ACK 30d5c66 .

Copy link

@random-zebra random-zebra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looking pretty good. Few nits (that can be addressed later) and a question.
Will run some testing.

@@ -709,7 +709,7 @@ bool CNode::ReceiveMsgBytes(const char* pch, unsigned int nBytes, bool& complete
// get current incomplete message, or create a new one
if (vRecvMsg.empty() ||
vRecvMsg.back().complete())
vRecvMsg.push_back(CNetMessage(Params().MessageStart(), SER_NETWORK, nRecvVersion));
vRecvMsg.push_back(CNetMessage(Params().MessageStart(), SER_NETWORK, INIT_PROTO_VERSION));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could use

vRecvMsg.emplace_back(Params().MessageStart(), SER_NETWORK, INIT_PROTO_VERSION);

and construct the CNetMessage() in place.

src/main.cpp Outdated
Comment on lines 6162 to 6172
// Checksum
CDataStream& vRecv = msg.vRecv;
uint256 hash = Hash(vRecv.begin(), vRecv.begin() + nMessageSize);
if (memcmp(hash.begin(), hdr.pchChecksum, CMessageHeader::CHECKSUM_SIZE) != 0)
{
LogPrintf("%s(%s, %u bytes): CHECKSUM ERROR expected %s was %s\n", __func__,
SanitizeString(strCommand), nMessageSize,
HexStr(hash.begin(), hash.begin()+CMessageHeader::CHECKSUM_SIZE),
HexStr(hdr.pchChecksum, hdr.pchChecksum+CMessageHeader::CHECKSUM_SIZE));
return fMoreWork;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indentation

Comment on lines 1088 to 1090
TRY_LOCK(pnode->cs_inventory, lockInv);
if (lockInv)
fDelete = true;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indentation

}

bool CConnman::DisconnectNode(const std::string& strNode)
{
LOCK(cs_vNodes);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since FindNode already locks cs_vNodes, wouldn't it be better, to avoid recursive locking, to lock this inside the if below?

Copy link

@random-zebra random-zebra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 30d5c66 and merging...

@borris345
Copy link

How fast is mainnet sync with this?

@random-zebra
Copy link

It depends on the connection quality. Here it takes little less than 4 hours.

@random-zebra random-zebra modified the milestones: 5.0.0, 4.3.0 Sep 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants