p2p: robust intra-cluster broadcasting #635

Closed
corverroos opened this issue May 31, 2022 · 1 comment

corverroos commented May 31, 2022

Problem to be solved

DutyAttester and DutyProposer don't work in a charon cluster when one or more nodes are unavailable/down.

This is because parsigex (and probably other libp2p components) sends messages to all other peers synchronously and sequentially, one after the other. If sending to one node errors, the error is caught and sending continues with the next node, which is fine. But networking is IO, and IO can block. So sending to a down node can and does result in the send call blocking for multiple seconds. When the duty times out during this period, the context is cancelled and we log "partial send success" without having sent to all nodes.
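For context, a hedged illustration of that pattern (not the actual parsigex code; the function name and shape are assumptions, only the go-libp2p host API is real):

```go
package p2p

import (
	"context"
	"errors"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
)

// broadcastSequential illustrates the failure mode: each NewStream/Write can
// block on IO while a peer is down, eating into the shared duty context, and
// per-peer errors are swallowed so the loop only reports partial success.
func broadcastSequential(ctx context.Context, tcpNode host.Host, peers []peer.ID,
	pID protocol.ID, msg []byte,
) error {
	var success int
	for _, p := range peers {
		stream, err := tcpNode.NewStream(ctx, p, pID) // Blocks for seconds if p is down.
		if err != nil {
			continue // Error caught; carry on with the next peer.
		}

		if _, err := stream.Write(msg); err == nil {
			success++
		}
		_ = stream.Close()
	}

	if success != len(peers) {
		return errors.New("partial send success")
	}

	return nil
}
```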

The other issue is that we calculate the "success count" incorrectly, so it looks like we sent to all available nodes. In reality we only sent to a few nodes, then blocked, then exited.

A related problem is the warning/error log storm generated when a single node is down. In large clusters with multiple nodes down, these warnings spam the logs.

Proposed solution

Introduce a "libp2p-sender" (name TBD) that solves both of these problems; a rough sketch follows the list below:

  • Async sending: components like parsigex use the libp2p-sender to send messages asynchronously: sender.SendAsync(ctx, targetPeer, protocol, msg).
  • This prevents a send to a down node, where IO blocks, from delaying sends to the other nodes.
  • Warn log filtering: it doesn't log warnings when sending to a known-down node fails.
  • It keeps track of higher-level networking metrics and only logs summaries and state changes.
  • Wire this into all components doing p2p sending.
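A minimal sketch of what such a sender could look like, assuming the go-libp2p host API; the asyncSender name, the fixed five-second timeout, and the per-peer state map are illustrative assumptions, not the final p2p.Sender design:

```go
package p2p

import (
	"context"
	"sync"
	"time"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
)

// asyncSender sends to peers without blocking the caller and only logs when a
// peer's reachability state changes.
type asyncSender struct {
	tcpNode host.Host

	mu    sync.Mutex
	state map[peer.ID]bool // true == last send to this peer succeeded.
}

func newAsyncSender(tcpNode host.Host) *asyncSender {
	return &asyncSender{
		tcpNode: tcpNode,
		state:   make(map[peer.ID]bool),
	}
}

// SendAsync sends msg to the peer on the given protocol in its own goroutine,
// so a down peer blocking on IO never delays sends to other peers.
func (s *asyncSender) SendAsync(_ context.Context, p peer.ID, pID protocol.ID, msg []byte) {
	go func() {
		// Use an independent timeout so a duty context that gets cancelled
		// doesn't abort sends already in flight (timeout value assumed).
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()

		err := s.send(ctx, p, pID, msg)
		s.logStateChange(p, err)
	}()
}

func (s *asyncSender) send(ctx context.Context, p peer.ID, pID protocol.ID, msg []byte) error {
	stream, err := s.tcpNode.NewStream(ctx, p, pID)
	if err != nil {
		return err
	}
	defer stream.Close()

	_, err = stream.Write(msg)

	return err
}

// logStateChange warns only when a peer transitions from reachable to
// unreachable (and logs info when it recovers), avoiding a warning storm
// while a peer stays down.
func (s *asyncSender) logStateChange(p peer.ID, err error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	ok := err == nil
	prev, seen := s.state[p]
	s.state[p] = ok

	if seen && prev == ok {
		return // No state change, stay quiet (metrics could be updated here).
	}

	if ok {
		// log.Info: sending to peer succeeded again.
	} else {
		// log.Warn: sending to peer failed, suppressing repeats until it recovers.
	}
}
```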
corverroos added the enhancement (New feature or request) label on May 31, 2022
dB2510 added this to the Devnet 2 milestone on May 31, 2022
corverroos (author) commented

Out of scope (follow-up PR): Introduce a Broadcast method to the p2psender so each component doesn't need to duplicate broadcasting logic.
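For illustration, a hedged sketch of what that follow-up Broadcast could look like, building on the asyncSender sketch above (the method shape is an assumption, not the eventual API):

```go
// Broadcast fans a message out to all peers via SendAsync so individual
// components no longer duplicate the peer loop.
func (s *asyncSender) Broadcast(ctx context.Context, peers []peer.ID, pID protocol.ID, msg []byte) {
	for _, p := range peers {
		if p == s.tcpNode.ID() {
			continue // Don't send to ourselves.
		}

		s.SendAsync(ctx, p, pID, msg)
	}
}
```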

obol-bulldozer bot pushed a commit that referenced this issue Jun 5, 2022
Implement v0 of the p2p sender. It doesn't do log filtering yet, but it does solve the issue of sends blocking when one node is down.

category: feature
ticket: #635
obol-bulldozer bot pushed a commit that referenced this issue Jun 8, 2022
Implement `p2p.Sender` that filters p2p sending logs per peer based on state changes.

Still need to wire and refactor other components to also use `p2p.Sender`.

category: feature
ticket: #635
obol-bulldozer bot pushed a commit that referenced this issue Jun 8, 2022
Make the interface `p2p.SendFunc` and the implementations `p2p.Sender.Send` and `p2p.Sender.SendAsync` more explicit.

category: refactor
ticket: #635
obol-bulldozer bot pushed a commit that referenced this issue Jun 8, 2022
Decrease "not connected yet" errors to debug level.
Add periodic "connected to M of N peers" info logs.

category: refactor
ticket: #635