-
Notifications
You must be signed in to change notification settings - Fork 907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lightningd: avoid thundering herd on restart. #2885
lightningd: avoid thundering herd on restart. #2885
Conversation
@rustyrussell glad we figured it out, thanks for taking time to debug with me :) |
The reason lnd was sending sync error was that we were taking more than 30 seconds to send the channel_reestablish after connect. That's understandable on my test node under valgrind, but shouldn't happen normally. However, it seems it has at least once, (see ElementsProject#2847) : space out startup so it's less likely to happen. Suggested-by: @cfromknecht Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
aec6219
to
25ea743
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me ACK.
(though I might have made it more symmetric. If the first 5 reconnects happened directly the next five should have happened in the next second.)
Regarding the original bug do I understand it correctly that if I have a network error during reestablishment (e. G. Somebody pulls the cable) that the channel will end up force closed as or the peer will fail the channel / connection?
Not particularly, as, we only force-close if we receive an actual |
lightningd/peer_control.c
Outdated
@@ -1308,15 +1308,29 @@ static void activate_peer(struct peer *peer) | |||
u8 *msg; | |||
struct channel *channel; | |||
struct lightningd *ld = peer->ld; | |||
/* Avoid thundering herd: after first five, delay by 1 second. */ | |||
int delay = -5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you intend to make this variable static
, or define it in activate_peers
and pass it in by pointer to this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Err... yes... I was just checking that review was working? 😆
Thanks, you, um, passed the test! Fixed...
This is an approximation. In practice there's a spread in response times, so the first 5 won't actually arrive all at once. An ideal implementation would stop reconnecting until we're idle, and prioritize reconnections based on where HTLCs want to go. But in practice it's just a few seconds and nobody cares, really. But it makes it more reasonable to run 100 channels on an rPi. The real bottleneck is still gossipd: after 0.7.2, I'll revisit our gossip catchup heuristics so we do selective probing rather than asking for a complete gossip dump when we're missing something. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 587b421
The reason lnd was sending sync error was that we were taking more than
30 seconds to send the channel_reestablish after connect. That's
understandable on my test node under valgrind, but shouldn't happen normally.
However, it seems it has at least once,
(see #2847)
: space out startup so it's less likely to happen.
Suggested-by: @cfromknecht
Signed-off-by: Rusty Russell rusty@rustcorp.com.au