-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Fix potential race condition in the sync manager - gossip syncer registration #9891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Important Review skippedAuto reviews are limited to specific labels. 🏷️ Labels to auto review (1)
Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Addresses issues found in: #9886 |
// | ||
// NOTE: We need to do this synchronously to make sure a gossip peer | ||
// is registered before we signal that the peer is active because | ||
// otherwise the is a race which can happen that we try to remove the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the race?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what guarantees that the peer does not flake and we shutdown before the gossiper is registered:
Currently we try to register the syncer async here:
Line 2031 in 7e44422
p.initGossipSync() |
However we could already hit the disconnect case here:
because we signal ready here:
Line 4597 in 7e44422
close(ready) |
and could hit the removal of the gossiper here:
Line 4653 in 7e44422
s.authGossiper.PruneSyncState(p.PubKey()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no guarantee that the addition is before the removal or ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea make sense - the only thing is whether this initGossipSync
will ever be blocking or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by blocking you mean blocking for a short period of time ? I mean depends on the busyness of the sync_manager or, I think that is why we need to make it sync in the first place because we cannot be sure that it registers in time before the peer might go offline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking as in whether it will delay the startup or even create a deadlock somewhere - previously it's async so I think it's fine, now it's sync, and I've noticed a few places where the initGossipSync
will wait for done
signal.
and what happens before if the registration is not finished before the removal of the gossiper? i think it will be a noop? will the new registration still register the syncer afterwards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok will double check.
40c6278
to
a20c4b3
Compare
@ziggie1984, remember to re-request review from reviewers when ready |
My first analysis was wrong I made here: We always make sure before we remove the gossiper here: Line 4661 in a5c4a7c
that the peer and therefore the waitGroupCounter has reached zero: Line 4640 in a5c4a7c
especially here: Line 1588 in a5c4a7c
so it would never be the case that we remove a gossiper before it was added. Thanks for your careful analysis @yyforyongyu 🙏 |
Fix a potential race condition when adding/removing the
gossip_syncer
to thesync_manager
described below ...