
lightningd: avoid thundering herd on restart. #2885

Merged
merged 2 commits into ElementsProject:master on Aug 1, 2019

Conversation


@rustyrussell rustyrussell commented Jul 31, 2019

The reason lnd was sending a sync error was that we were taking more than
30 seconds to send the channel_reestablish after connecting. That's
understandable on my test node under valgrind, but it shouldn't happen normally.

However, it seems it has happened at least once (see #2847): space out
startup so it's less likely to happen.

Suggested-by: @cfromknecht
Signed-off-by: Rusty Russell rusty@rustcorp.com.au


@cfromknecht cfromknecht commented Jul 31, 2019

@rustyrussell glad we figured it out, thanks for taking time to debug with me :)


@renepickhardt renepickhardt left a comment

Looks good to me, ACK.

(Though I might have made it more symmetric: if the first 5 reconnects happen immediately, the next five should happen in the next second.)

Regarding the original bug: do I understand correctly that if I have a network error during reestablishment (e.g. somebody pulls the cable), the channel will end up force-closed, or the peer will fail the channel/connection?


@ZmnSCPxj ZmnSCPxj commented Jul 31, 2019

Regarding the original bug: do I understand correctly that if I have a network error during reestablishment (e.g. somebody pulls the cable), the channel will end up force-closed, or the peer will fail the channel/connection?

Not particularly: we only force-close if we receive an actual error, which can only happen while we are actually connected. So the original bug seems more tied to trying to do too many things at the same time during reconnection.

```diff
@@ -1308,15 +1308,29 @@ static void activate_peer(struct peer *peer)
 	u8 *msg;
 	struct channel *channel;
 	struct lightningd *ld = peer->ld;
+	/* Avoid thundering herd: after first five, delay by 1 second. */
+	int delay = -5;
```

@ZmnSCPxj ZmnSCPxj Jul 31, 2019


Did you intend to make this variable static, or define it in activate_peers and pass it in by pointer to this function?


@rustyrussell rustyrussell Jul 31, 2019


Err... yes... I was just checking that review was working? 😆

Thanks, you, um, passed the test! Fixed...


@rustyrussell rustyrussell commented Jul 31, 2019

(Though I might have made it more symmetric: if the first 5 reconnects happen immediately, the next five should happen in the next second.)

This is an approximation. In practice there's a spread in response times, so the first 5 won't actually arrive all at once. An ideal implementation would stop reconnecting until we're idle, and prioritize reconnections based on where HTLCs want to go. In practice, though, it's only a few seconds and nobody really cares; this change just makes it more reasonable to run 100 channels on a Raspberry Pi.

The real bottleneck is still gossipd: after 0.7.2, I'll revisit our gossip catchup heuristics so we do selective probing rather than asking for a complete gossip dump when we're missing something.


@ZmnSCPxj ZmnSCPxj left a comment

ACK 587b421

@rustyrussell rustyrussell merged commit 2255dd4 into ElementsProject:master Aug 1, 2019
2 checks passed