lightningd: avoid thundering herd on restart. #2885

Merged
merged 2 commits into from Aug 1, 2019


@rustyrussell
Contributor

commented Jul 31, 2019

The reason lnd was sending sync error was that we were taking more than
30 seconds to send the channel_reestablish after connect. That's
understandable on my test node under valgrind, but shouldn't happen normally.

However, it seems it has at least once (see #2847): space out startup so
it's less likely to happen.

Suggested-by: @cfromknecht
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

@cfromknecht


commented Jul 31, 2019

@rustyrussell glad we figured it out, thanks for taking time to debug with me :)

@rustyrussell rustyrussell force-pushed the rustyrussell:startup-backoff branch from aec6219 to 25ea743 Jul 31, 2019
Collaborator

left a comment

Looks good to me ACK.

(though I might have made it more symmetric. If the first 5 reconnects happened directly the next five should have happened in the next second.)

Regarding the original bug, do I understand correctly that if I have a network error during reestablishment (e.g. somebody pulls the cable), the channel will end up force-closed, or the peer will fail the channel / connection?

@ZmnSCPxj

Collaborator

commented Jul 31, 2019

Regarding the original bug, do I understand correctly that if I have a network error during reestablishment (e.g. somebody pulls the cable), the channel will end up force-closed, or the peer will fail the channel / connection?

Not particularly, as we only force-close if we receive an actual error, which can only occur if we are actually connected. So the original bug seems to be more tied to trying to do too many things at the same time during reconnection.

@@ -1308,15 +1308,29 @@ static void activate_peer(struct peer *peer)
u8 *msg;
struct channel *channel;
struct lightningd *ld = peer->ld;
/* Avoid thundering herd: after first five, delay by 1 second. */
int delay = -5;

@ZmnSCPxj

ZmnSCPxj Jul 31, 2019

Collaborator

Did you intend to make this variable static, or define it in activate_peers and pass it in by pointer to this function?

@rustyrussell

rustyrussell Jul 31, 2019

Author Contributor

Err... yes... I was just checking that review was working? 😆

Thanks, you, um, passed the test! Fixed...
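
To make the review point concrete: a counter declared as a plain local inside the per-peer function is reset on every call, so no peer is ever delayed. Below is a minimal sketch of the two shapes discussed, assuming the counter is bumped once per peer and that non-positive values mean "reconnect now". The names `IMMEDIATE_PEERS`, `next_delay_buggy`, `next_delay` and `plan_delays` are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant: how many peers may reconnect immediately. */
#define IMMEDIATE_PEERS 5

/* Buggy shape: the counter is a plain local, so it is re-initialised on
 * every call and every peer gets the same answer (no staggering). */
int next_delay_buggy(void)
{
	int delay = -IMMEDIATE_PEERS;

	delay++;
	return delay > 0 ? delay : 0;
}

/* Fixed shape: the caller owns the counter and passes it by pointer, so
 * its value persists across the loop over peers. */
int next_delay(int *counter)
{
	assert(counter != NULL);
	(*counter)++;
	return *counter > 0 ? *counter : 0;
}

/* Fill delays[i] with the delay, in seconds, for the i-th peer. */
void plan_delays(int *delays, size_t n_peers)
{
	int counter = -IMMEDIATE_PEERS;

	for (size_t i = 0; i < n_peers; i++)
		delays[i] = next_delay(&counter);
}
```

Making the local `static` would also keep the value between calls, at the cost of hidden state that is never reset; passing a pointer from the caller keeps the counter scoped to one startup pass.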

@rustyrussell

Contributor Author

commented Jul 31, 2019

(though I might have made it more symmetric. If the first 5 reconnects happened directly the next five should have happened in the next second.)

This is an approximation. In practice there's a spread in response times, so the first 5 won't actually arrive all at once. An ideal implementation would stop reconnecting until we're idle, and prioritize reconnections based on where HTLCs want to go. But in practice it's just a few seconds and nobody cares, really. But it makes it more reasonable to run 100 channels on an rPi.
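
For comparison, the two schedules under discussion can be written as functions of a peer's position in the startup loop. This is only a sketch: `BATCH`, `linear_delay` and `batched_delay` are hypothetical names, and the linear variant follows one reading of the patch comment ("after first five, delay by 1 second"), namely that each further peer waits one second longer than the previous one.

```c
#include <stddef.h>

/* Hypothetical constant: size of the group that starts together. */
#define BATCH 5

/* Linear stagger: the first BATCH peers start at once, then each later
 * peer starts one second after the previous one. */
unsigned linear_delay(size_t i)
{
	return i < BATCH ? 0 : (unsigned)(i - BATCH + 1);
}

/* Symmetric alternative from the review: BATCH peers per second. */
unsigned batched_delay(size_t i)
{
	return (unsigned)(i / BATCH);
}
```

With 100 peers, the last one (index 99) is scheduled 95 seconds after startup under the linear stagger and 19 seconds after startup under the batched one. The linear form spreads load more thinly, which is the property that matters on a slow machine.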

The real bottleneck is still gossipd: after 0.7.2, I'll revisit our gossip catchup heuristics so we do selective probing rather than asking for a complete gossip dump when we're missing something.

Collaborator

left a comment

ACK 587b421

@rustyrussell rustyrussell merged commit 2255dd4 into ElementsProject:master Aug 1, 2019
2 checks passed
ackbot PR ack'd by ZmnSCPxj
continuous-integration/travis-ci/pr The Travis CI build passed