p2p: robust intra-cluster broadcasting #635
Labels: enhancement (New feature or request)
Out of scope (follow-up PR): Introduce a |
obol-bulldozer bot pushed a commit that referenced this issue on Jun 5, 2022:
Implement v0 of the p2p sender. It doesn't do log filtering yet. It does, however, solve the issue of one node being down. category: feature ticket: #635
obol-bulldozer bot pushed a commit that referenced this issue on Jun 8, 2022:
Implement `p2p.Sender` that filters p2p sending logs per peer based on state changes. Still need to wire and refactor other components to also use `p2p.Sender`. category: feature ticket: #635
obol-bulldozer bot pushed a commit that referenced this issue on Jun 8, 2022:
Make the interface `p2p.SendFunc` and the implementations `p2p.Sender.Send` and `p2p.Sender.SendAsync` more explicit. category: refactor ticket: #635
obol-bulldozer bot pushed a commit that referenced this issue on Jun 8, 2022:
Decrease "not connected yet" errors to debug. Add periodic "connected to M of N peers" info logs. category: refactor ticket: #635
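The periodic summary log mentioned in that commit could be sketched as follows. This is a hedged Go illustration, not charon's actual code; `logConnectedPeers`, `peerSummary`, and the interval are assumptions. It replaces a storm of per-failure warnings with a single recurring info line:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// peerSummary formats the periodic connectivity summary line.
func peerSummary(m, n int) string {
	return fmt.Sprintf("connected to %d of %d peers", m, n)
}

// logConnectedPeers periodically logs a single connectivity summary,
// instead of warning on every individual failed send.
func logConnectedPeers(ctx context.Context, interval time.Duration, connected func() (m, n int)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			fmt.Println(peerSummary(connected()))
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()

	// Hypothetical connectivity probe: 2 of 3 peers reachable.
	// Periodically prints: connected to 2 of 3 peers
	logConnectedPeers(ctx, 100*time.Millisecond, func() (int, int) { return 2, 3 })
}
```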
Problem to be solved
DutyAttester and DutyProposer don't work in a charon cluster when one or more nodes are unavailable/down.
This is because parsigex (and probably other libp2p components) sends messages to all other peers synchronously and sequentially, one after the other. If sending to one node errors, the error is caught and sending continues with the next node. That is fine. But networking is IO, and IO can block, so sending to a down node can and does block the send call for multiple seconds. When the duty times out during this period, the context is cancelled and we log "partial send success" without having sent to all nodes. The other issue is that we calculate the "success count" incorrectly, so it looks like we sent to all available nodes, when we actually only sent to a few, then blocked, then exited.
A related problem is the warning/error log storm generated when a single node is down. In large clusters, when multiple nodes are down, these warnings spam the logs.
Proposed solution
Introduce a "libp2p-sender", a component that solves both of these problems (name TBD):
sender.SendAsync(ctx, targetPeer, protocol, msg)