Fixed reconnection issue in the dev cluster with AWS cluster #1915

Merged
merged 19 commits into from
Jun 4, 2024

@xgreenx xgreenx commented May 29, 2024

Closes #1592

This change improves and simplifies some aspects of the P2P service. It also fixes a failure to reconnect to a reserved node after that node restarts and gets a new IP.

  • The change moves the reconnection handling into the PeerReport behavior. Reserved peers are no longer bounced back and forth between the primary behavior and the PeerReport behavior; the logic is now encapsulated inside PeerReport. It also replaces the timer with a queue of reconnections, which reduces noise in the logs (before, we had many more spurious errors).
  • Added logs for cases when a dial fails. They are very helpful for debugging connection issues.
  • Simplified initialization of the ConnectionTracker and FuelAuthenticated. This allows reuse of libp2p's built-in connection builder.
  • Removed the usage of Mplex since it doesn't have a backpressure mechanism. Now we use Yamux by default. This is a breaking change since nodes with the old codebase can't connect to new nodes.
  • Propagated max_concurrent_streams for the request-response protocol.

Checklist

  • Breaking changes are clearly marked as such in the PR description and changelog

Before requesting review

  • I have reviewed the code myself

…It also fixes a failure to reconnect to a reserved node after that node restarts and gets a new IP.

- The change moves the reconnection handling into the `PeerReport` behavior. Reserved peers are no longer bounced back and forth between the primary behavior and the `PeerReport` behavior; the logic is now encapsulated inside `PeerReport`. It also replaces the timer with a queue of reconnections, which reduces noise in the logs (before, we had many more spurious errors).
- Added logs for cases when a dial fails. They are very helpful for debugging connection issues.
- Simplified initialization of the `ConnectionTracker` and `FuelAuthenticated`. This allows reuse of libp2p's built-in connection builder.
- Removed the usage of Mplex since it doesn't have a backpressure mechanism. Now we use Yamux by default. This is a breaking change since nodes with the old codebase can't connect to new nodes.
- Propagated `max_concurrent_streams` for yamux.
@xgreenx xgreenx added the breaking A breaking api change label May 29, 2024
@xgreenx xgreenx requested a review from a team May 29, 2024 22:19
@xgreenx xgreenx self-assigned this May 29, 2024
.with_tcp(
tcp_config,
transport_function,
libp2p::yamux::Config::default,
Collaborator Author

Breaking change: We don't support Mplex anymore.
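
For readers unfamiliar with the term, here is a rough illustration of what "backpressure" means, using only the Rust standard library. It is an analogy, not how Yamux is implemented: Yamux applies flow control per stream with receive windows, whereas this sketch uses a bounded channel. The function name `accepted_before_full` is made up for the example.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Tries to queue `frames` frames into a bounded buffer of `capacity` slots
/// that nobody is draining, and returns how many were accepted before the
/// buffer pushed back. (Hypothetical helper, for illustration only.)
fn accepted_before_full(capacity: usize, frames: usize) -> usize {
    // Keep the receiver alive (but idle) so the channel stays connected.
    let (tx, _rx) = sync_channel::<usize>(capacity);
    let mut accepted = 0;
    for frame in 0..frames {
        match tx.try_send(frame) {
            Ok(()) => accepted += 1,
            // The buffer is full: the sender is told to slow down instead
            // of the queue growing without bound.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}
```

With a capacity of 4 and a slow (here: idle) reader, only 4 of 10 frames are accepted. A multiplexer without any such limit has to keep buffering or drop data when one side cannot keep up.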

if instant.elapsed() > Duration::from_secs(HEALTH_CHECK_INTERVAL_IN_SECONDS) {
let peer_id = *peer_id;
self.reserved_nodes_to_connect.pop_front();
let multiaddrs = self
Collaborator Author

This is the actual fix for the main reconnection problem that I faced with the dev cluster. The initial DNS address was replaced with the resolved IP, but when the sentry crashed, it came back with a different IP. Using the initial multiaddrs here allows the node to reconnect and pick up the new IP.
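
A minimal, std-only sketch of the idea, not the code from this PR: reserved peers that disconnect go into a queue, and once the delay has passed they are redialed with the addresses they were originally configured with rather than the last resolved IP. `PeerId` and `Multiaddr` are simplified stand-ins for the libp2p types, and `ReservedPeers`, `RECONNECT_DELAY`, `on_disconnected`, and `next_dial` are hypothetical names chosen for illustration.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

// Simplified stand-ins for libp2p's `PeerId` and `Multiaddr` (assumption).
type PeerId = u64;
type Multiaddr = String;

/// Assumed delay before a queued reserved peer is redialed.
const RECONNECT_DELAY: Duration = Duration::from_secs(10);

struct ReservedPeers {
    /// Addresses exactly as configured (possibly DNS-based); never overwritten
    /// with a resolved IP.
    initial_addrs: HashMap<PeerId, Vec<Multiaddr>>,
    /// Peers waiting to be redialed, oldest first, with their disconnect time.
    to_connect: VecDeque<(PeerId, Instant)>,
}

impl ReservedPeers {
    fn new(initial_addrs: HashMap<PeerId, Vec<Multiaddr>>) -> Self {
        Self { initial_addrs, to_connect: VecDeque::new() }
    }

    /// Queues a reserved peer for reconnection; other peers are ignored.
    fn on_disconnected(&mut self, peer: PeerId, now: Instant) {
        if self.initial_addrs.contains_key(&peer) {
            self.to_connect.push_back((peer, now));
        }
    }

    /// Returns the next peer to dial together with its *initial* addresses,
    /// or `None` if the oldest queued peer has not waited long enough yet.
    fn next_dial(&mut self, now: Instant) -> Option<(PeerId, Vec<Multiaddr>)> {
        let (peer, since) = *self.to_connect.front()?;
        if now.duration_since(since) <= RECONNECT_DELAY {
            return None;
        }
        self.to_connect.pop_front();
        let addrs = self.initial_addrs.get(&peer)?.clone();
        Some((peer, addrs))
    }
}
```

Because the dial uses the configured address (for example a `/dns4/...` multiaddr), the name is resolved again on every attempt, so a restarted node with a new IP can be reached.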

Member

We should probably have a comment somewhere here explaining this.

Base automatically changed from feature/fixed-dead-lock to master May 30, 2024 07:47
@xgreenx xgreenx requested a review from Dentosal June 3, 2024 11:56
@xgreenx xgreenx enabled auto-merge (squash) June 4, 2024 18:49
@xgreenx xgreenx merged commit 0a1d591 into master Jun 4, 2024
28 checks passed
@xgreenx xgreenx deleted the feature/fixed-p2p-reconnection-issue branch June 4, 2024 19:12
@xgreenx xgreenx mentioned this pull request Jun 6, 2024
xgreenx added a commit that referenced this pull request Jun 7, 2024
## Version v0.28.0

### Changed
- [#1934](#1934): Updated
benchmark for the `aloc` opcode to be `DependentCost`. Updated
`vm_initialization` benchmark to exclude memory growth (it is handled
by VM reuse).
- [#1916](#1916): Speed up
synchronisation of the blocks for the `fuel-core-sync` service.
- [#1888](#1888):
optimization: Reuse VM memory across executions.

#### Breaking

- [#1934](#1934): Changed
`GasCosts` endpoint to return `DependentCost` for the `aloc` opcode via
`alocDependentCost`.
- [#1934](#1934): Updated
default gas costs for the local testnet configuration. All opcodes
became cheaper.
- [#1924](#1924):
`dry_run_opt` has new `gas_price: Option<u64>` argument
- [#1888](#1888): Upgraded
`fuel-vm` to `0.51.0`. See
[release](https://github.com/FuelLabs/fuel-vm/releases/tag/v0.51.0) for
more information.

### Added
- [#1939](#1939): Added API
functions to open a RocksDB in different modes.
- [#1929](#1929): Added
support of customization of the state transition version in the
`ChainConfig`.

### Removed
- [#1913](#1913): Removed dead
code from the project.

### Fixed
- [#1921](#1921): Fixed
unstable `gossipsub_broadcast_tx_with_accept` test.
- [#1915](#1915): Fixed
reconnection issue in the dev cluster with AWS cluster.
- [#1914](#1914): Fixed
halting of the node during synchronization in PoA service.

## What's Changed
* Removed dead code by @xgreenx in
#1913
* Added backward and forward compatibility integration tests for
forkless upgrades by @xgreenx in
#1895
* Fixed halting of the node in rare conditions by @xgreenx in
#1914
* Weekly `cargo update` by @github-actions in
#1928
* Fixed logging of the WASM executor by @xgreenx in
#1930
* Added support of customization of the state transition version in the
`ChainConfig` by @xgreenx in
#1929
* Document wasm toolchain installation, add rust-toolchain.toml by
@Dentosal in #1932
* Add optional `gas_price` argument to `dry_run_opt` by @hal3e in
#1924
* Reuse VM memory across executions by @Dentosal in
#1888
* Fixed reconnection issue in the dev cluster with AWS cluster by
@xgreenx in #1915
* Speeds up synchronisation of the blocks for the `fuel-core-sync`
service by @xgreenx in #1916
* Fixed unstable `gossipsub_broadcast_tx_with_accept` test by @xgreenx
in #1921
* Added API functions to open a RocksDB in different modes by @xgreenx
in #1939
* Use `DependentCost` for `aloc` opcode by @xgreenx in
#1934

## New Contributors
* @hal3e made their first contribution in
#1924

**Full Changelog**:
v0.27.0...v0.28.0
Labels
breaking A breaking api change

2 participants