New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Beast P2P Network #3261

Merged
merged 31 commits into from May 27, 2018

Conversation

Projects
None yet
8 participants
@bytemaster
Contributor

bytemaster commented May 22, 2018

This pull request introduces an alternative P2P network protocol based upon boost::beast websockets.

Major Features:

  1. multi-threaded - the networking code can use as many threads as necessary, number configured via command line
  2. less than 1000 lines of code
  3. uses a simple yet reliable syncing algorithm
  4. auto reconnect
  5. graceful connection shutdown w/out delay
  6. prioritizes blocks over transactions without creating large queues

Tested at 1000 TPS with 3 nodes on localhost
Tested multiple syncing peers at same time as 1000 TPS

Missing Features:

  1. No RPC interface for dynamically adding/removing peers
  2. no attempts made to minimize bandwidth consumed via pre-acks

This net plugin could "in theory" work side-by-side with the existing net plugin and is entirely independent / optional.

bytemaster added some commits May 21, 2018

Progress on bnet
- auto reconnect
- transactions propagate
- fix bugs preventing replay after crash

@bytemaster bytemaster added this to the Version 1.0 milestone May 22, 2018

@abourget

This comment has been minimized.

Contributor

abourget commented May 22, 2018

Will it negotiate a different network version than the current p2p protocol?

@wanderingbort

We should separate this into two PRs, one which contains the changes for "allow_dirty" which should probably remain in 1.0 and another which implements/fixes the beast based net prototype and should probably not be part of 1.0.

There is a low probability that we can harden and find enough of the edges in the beast based plugin to even list it as an "experimental" part of the release and it would be more appropriate for a later version where we can actually present a decent alternative p2p protocol to users.

}
void run( const string& peer ) {
auto c = peer.find(':');

This comment has been minimized.

@wanderingbort

wanderingbort May 22, 2018

Contributor

This will fail if given raw IPv6 addresses

* the LIB or up to the last block known by the remote peer.
*/
void on_new_lib( block_state_ptr s ) {
if( !_strand.running_in_this_thread() ) { elog( "wrong strand" ); }

This comment has been minimized.

@wanderingbort

wanderingbort May 22, 2018

Contributor

is this an actual error case? If so, is it safe to print to the log and continue?

}
}
if( _transaction_status.size() > 1000 ) {
idump((_transaction_status.size()));

This comment has been minimized.

@wanderingbort

wanderingbort May 22, 2018

Contributor

This seems like debugging spew that is not informative nor easily filterable. It should probably be removed OR include a better preamble

ids.emplace_back(control.get_block_id_for_num(i));
}
}
self->_ios.post( boost::asio::bind_executor(

This comment has been minimized.

@wanderingbort

wanderingbort May 22, 2018

Contributor

In prolonged forks this vector could grow quite large and we are copying it into the lambda. We should consider either std::move or allocating the vector in the heap and capturing a std::shared_ptr to it.

}
));
} catch ( const fc::exception& e ) {

This comment has been minimized.

@wanderingbort

wanderingbort May 22, 2018

Contributor

The patterns doesn't seem to guarantee that the callback will always be called and the usage of the callbacks seems to imply that they will creating a situation where it exceptions are thrown that certain parts of the implied state machine will be broken

//wlog( "received trx ${id}", ("id",id) );
auto itr = _transaction_status.find( id );
if( itr != _transaction_status.end() ) {
_transaction_status.modify( itr, [&]( auto& stat ) {

This comment has been minimized.

@wanderingbort

wanderingbort May 22, 2018

Contributor

Given a well-behaved peer, the only way this entry will be in the _transaction_status seems to be if we sent it to them at some point. If they have sent it to us and we have sent it to them, can we erase instead of re-upping the expiration? Or would that create a failure mode if we are swapping back and forth with forks?

This comment has been minimized.

@wanderingbort

wanderingbort May 22, 2018

Contributor

it looks like the applied transaction channel should only be emitted once even WRT forks so, there seems to be little reason with the code as is to not consider this transaction settled

channels::irreversible_block::channel_type::handle _on_irb_handle;
channels::accepted_block::channel_type::handle _on_accepted_block_handle;
channels::rejected_block::channel_type::handle _on_bad_block_handle;
channels::accepted_transaction::channel_type::handle _on_appled_trx_handle;

This comment has been minimized.

@wanderingbort

wanderingbort May 22, 2018

Contributor

appled transactions sound delicious: "applied" 😄

/// add some random delay so that all my peers don't attempt to reconnect to me
/// at the same time after shutting down.. TODO: verify srand is called
_timer->expires_from_now( boost::posix_time::microseconds( 1000000*(10+rand()%5) ) );
_timer->async_wait([=](const boost::system::error_code& ec) {

This comment has been minimized.

@wanderingbort

wanderingbort May 22, 2018

Contributor

There is a very small edge case where this handler will be called with a stale this pointer. IF the delivery of a "successful" timer callback is in-flight when we shutdown the plugin. We will destroy the plugin and call cancel on the timer and reset on the container which will also destroy the deadline_timer and finally we will destroy the plugin.

The issue is that the timer cannot revoke the "in-flight" success at that point so it will eventually flush out of the io_service when we destroy that. In this invocation, ec indicates success so we will dereference the stale this and call on_reconnect_peers on it.

All this is to say that, we should probably capture a weak_ptr to this and lock it in addition to checking the EC.

}
void do_read() {
_ws->async_read( _in_buffer,

This comment has been minimized.

@wanderingbort

wanderingbort May 22, 2018

Contributor

Can we provide any "guidelines" to the limits of a websocket payload size? Are their any implicit limits that we need to be aware of? Can I craft a bad websocket message that will attempt to buffer 2^64 bytes before calling a handler in our app-space?

This comment has been minimized.

@wanderingbort

wanderingbort May 22, 2018

Contributor

for future reference, it looks like the default max is 16mb and is configurable via a call to ::read_message_max(size_t)

https://www.boost.org/doc/libs/develop/libs/beast/doc/html/beast/ref/boost__beast__websocket__stream/read_message_max/overload1.html

@@ -202,22 +208,38 @@ class producer_plugin_impl {
});
// push the new block
chain.push_block(block);
bool except = false;

This comment has been minimized.

@wanderingbort

wanderingbort May 22, 2018

Contributor

The existing net code relies on exceptions being thrown from this handler so, this would break that contract and make the two paths incompatible as implemented. We can throw as well as publish to a channel.

@bytemaster bytemaster modified the milestones: Version 1.0, Verison 1.1 May 22, 2018

@spoonincode

This comment has been minimized.

Contributor

spoonincode commented May 23, 2018

Boost 1.67 release notes mention

This version fixes significant defects in websocket::stream which can lead to asserts or undefined behavior. Users are encouraged to update to the latest Boost release.

IMO we should move the build scripts to 1.67 before taking this

Merge branch 'master' into bnet
- clean up bnet logic
- validate fork switching works
- validate chain id
- enable option to skip trx as node

@bytemaster bytemaster modified the milestones: Verison 1.1, Version 1.0 May 24, 2018

cfg.add_options()
("bnet_endpoint", bpo::value<string>()->default_value("0.0.0.0:4321"), "the endpoint upon which to listen for incoming connections" )
("bnet_threads", bpo::value<uint32_t>(), "the number of threads to use to process network messages" )
("bnet_connect", bpo::value<vector<string>>()->composing(), "the endpoint upon which to listen for incoming connections" )

This comment has been minimized.

@spoonincode

spoonincode May 24, 2018

Contributor

wrong help text here

void bnet_plugin::set_program_options(options_description& cli, options_description& cfg) {
cfg.add_options()
("bnet_endpoint", bpo::value<string>()->default_value("0.0.0.0:4321"), "the endpoint upon which to listen for incoming connections" )

This comment has been minimized.

@spoonincode

spoonincode May 24, 2018

Contributor

Other options use hyphen, not underscore

@wanderingbort

This comment has been minimized.

Contributor

wanderingbort commented on plugins/producer_plugin/producer_plugin.cpp in 01b0ffa May 24, 2018

This is not an expected return value from this function and we'd need to audit callers of the methods/channels it backs to see if they can handle a null shared pointer return. @heifner has identified that at least the RPC endpoint was not expecting this and handling it poorly.

We can either return a newly created empty trace with the except field set to the appropriate exception for duplicates and avoid the cost of getting into chain controller OR just throw the appropriate exception from here and avoid the cost.

bytemaster and others added some commits May 24, 2018

Beast Network pre-ack blocks
- update cli params
- remove ack enforcement as it interferred with syncing
- send pre-acks when block header confirmed to reduce network traffic
- fixed potential race conditions / clean up send next block algo
- producer plugin returns a trace with an exception rather than throw on
duplicate

bytemaster and others added some commits May 25, 2018

@wanderingbort

This comment has been minimized.

Contributor

wanderingbort commented on plugins/bnet_plugin/bnet_plugin.cpp in 06562cc May 25, 2018

This doesn't pass the smell test. After inspecting the body of on_session_close I can see that this is, in fact, safe in the current code

However, as a future engineer who only sees on_session_close I will have no idea that the pointer is guaranteed to be stale.

We should probably address the sessions by a type or value that implies it cannot be used as a session * OR make sure that this logic path is not called from a destructor

PaulCalabrese and others added some commits May 25, 2018

@gleehokie

This comment has been minimized.

Contributor

gleehokie commented May 26, 2018

Master was merged (89de674) which brought in boost 1.67 for all platforms, however, AWS and CentOS are not passing tests. Initial analysis shows that nod EOS is segfault’ing immediately upon execution.

Don’t merge this PR to master until that has been fixed.

gleehokie and others added some commits May 26, 2018

Merge pull request #3427 from EOSIO/bnet-launch-support
Add bnet support to launcher

@bytemaster bytemaster merged commit f72c09c into master May 27, 2018

14 checks passed

buildkite/eosio Build #1459 passed (41 minutes, 38 seconds)
Details
buildkite/eosio/aws-build Passed (7 minutes, 39 seconds)
Details
buildkite/eosio/aws-tests Passed (14 minutes, 2 seconds)
Details
buildkite/eosio/centos-build Passed (7 minutes, 40 seconds)
Details
buildkite/eosio/centos-tests Passed (14 minutes, 12 seconds)
Details
buildkite/eosio/darwin-build Passed (7 minutes, 10 seconds)
Details
buildkite/eosio/darwin-tests Passed (25 minutes, 40 seconds)
Details
buildkite/eosio/fedora-build Passed (8 minutes, 2 seconds)
Details
buildkite/eosio/fedora-tests Passed (13 minutes, 43 seconds)
Details
buildkite/eosio/pipeline Passed (4 seconds)
Details
buildkite/eosio/ubuntu-18-dot-04-build Passed (6 minutes, 9 seconds)
Details
buildkite/eosio/ubuntu-18-dot-04-tests Passed (13 minutes, 46 seconds)
Details
buildkite/eosio/ubuntu-build Passed (6 minutes, 47 seconds)
Details
buildkite/eosio/ubuntu-tests Passed (15 minutes, 20 seconds)
Details

@arhag arhag deleted the bnet branch Jun 1, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment