p2p: robust intra-cluster broadcasting #635
Labels: enhancement (New feature or request)
Out of scope (follow-up PR): Introduce a |
obol-bulldozer bot pushed a commit that referenced this issue on Jun 5, 2022:
Implement v0 of the p2p sender. It doesn't do log filtering yet. It does, however, solve the issue of one node being down. category: feature ticket: #635
obol-bulldozer bot pushed a commit that referenced this issue on Jun 8, 2022:
Implement `p2p.Sender` that filters p2p sending logs per peer based on state changes. Still need to wire and refactor other components to also use `p2p.Sender`. category: feature ticket: #635
obol-bulldozer bot pushed a commit that referenced this issue on Jun 8, 2022:
Make the interface `p2p.SendFunc` and the implementations `p2p.Sender.Send` and `p2p.Sender.SendAsync` more explicit. category: refactor ticket: #635
obol-bulldozer bot pushed a commit that referenced this issue on Jun 8, 2022:
Decrease "not connected yet" errors to debug. Add periodic "connected to M of N peers" info logs. category: refactor ticket: #635
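The periodic summary log mentioned in that commit could be sketched as follows. This is a hedged Go illustration, not charon's actual code; `logConnectedPeers`, `peerSummary`, and the interval are assumptions. It replaces a storm of per-failure warnings with a single recurring info line:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// peerSummary formats the periodic connectivity summary line.
func peerSummary(m, n int) string {
	return fmt.Sprintf("connected to %d of %d peers", m, n)
}

// logConnectedPeers periodically logs a single connectivity summary,
// instead of warning on every individual failed send.
func logConnectedPeers(ctx context.Context, interval time.Duration, connected func() (m, n int)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			fmt.Println(peerSummary(connected()))
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()

	// Hypothetical connectivity probe: 2 of 3 peers reachable.
	// Periodically prints: connected to 2 of 3 peers
	logConnectedPeers(ctx, 100*time.Millisecond, func() (int, int) { return 2, 3 })
}
```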
Problem to be solved
DutyAttester and DutyProposer don't work in a charon cluster when one or more nodes are unavailable/down.
This is because parsigex (and probably other libp2p components) sends messages to all other peers synchronously and sequentially, one after the other. If sending to one node errors, the error is caught and sending continues with the next node. That is fine. But networking is IO, and IO can block, so sending to a down node can and does block the send call for multiple seconds. When the duty times out during this period, the context is cancelled and we log "partial send success" without having sent to all nodes. The other issue is that we calculate the "success count" incorrectly, so it looks like we sent to all available nodes, when we actually only sent to a few, then blocked, then exited.
A related problem is the warning/error log storm generated when a single node is down. In large clusters, when multiple nodes are down, these warnings spam the logs.
Proposed solution
Introduce a "libp2p-sender", a component that solves both of these problems (name TBD):
sender.SendAsync(ctx, targetPeer, protocol, msg)