-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Respond to standup at login correctly. #1877
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code change is simple enough and LGTM 👍 Have some questions.
However, I think that just puts us in a different bad state in the case we are forked from the node that does the late connect. Now it will respond to the LOGIN with a STANDUP_RESPONSE but this will be a DENY which is just as bad.
EDIT: Looking at the code, we won't actually DENY due to a hash mismatch, only if we think some other node is leading. So a node that's forked from us (but hasn't realized that yet) can still approve the standup, and it will abstain if it does realize it's forked.
This is not actually a problem due to the EDIT:
portion, right?
I think we should adjust the logic to not clear the _forkedFrom list until we are actually LEADING or FOLLOWING. I can imagine some scenario where clearing this in STANDINGUP gets us to reset a node that was forked and no longer is, i.e., it was just restored from backup, and it was the final node required to reach quorum. In this case, we would still not form a cluster, we'd need to restart the node that is attempting to stand up.
Can you elaborate a bit and state if you still think this is necessary? I'm not understanding how the scenario above would cause us to not form a cluster
I read everything a few times, but the one thing I don't understand is this scenario. Can you break it down? Regardless, the change looks good to me and it seem like a simplification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM, just a few questions so I can better understand how bedrock works
if (otherPeer->state == SQLiteNodeState::STANDINGUP || otherPeer->state == SQLiteNodeState::LEADING || otherPeer->state == SQLiteNodeState::STANDINGDOWN) { | ||
// We need to contest this standup | ||
response["Response"] = "deny"; | ||
response["Reason"] = "peer '" + otherPeer->name + "' is '" + stateName(otherPeer->state) + "'"; | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is not related to the changes you're applying, but just wanted to understand this part. In the code above, if a peer is trying to standup and we're standing down, we add a warn (line 2762) but don't really set anything in the response. But in this block, if we see another peer standing down, we actually deny it.
Why do we have that different behavior between both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is just an oversight where we've used a definition of leading in this case that includes standingup and standingdown. We could probably exclude it here as well and only use deny
if the peer is standingup or leading.
|
||
// If we're STANDINGUP when a peer connects, send them a STATE message so they know they need to APPROVE or DENY the standup. | ||
// Otherwise we will wait for their response that's not coming,and can eventually time out the standup. | ||
if (_state == SQLiteNodeState::STANDINGUP) { | ||
SData state("STATE"); | ||
state["StateChangeCount"] = to_string(_stateChangeCount); | ||
state["State"] = stateName(_state); | ||
state["Priority"] = SToStr(_priority); | ||
_sendToPeer(peer, state); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're not using any locks here, would it be possible that we have a state while sending login and another when sending state?
And in that case, considering that the _changeState would have been called somewhere else, would everything still work out fine since _changeState also sends a STATE message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would have been a bug if state changed between these two messages, though that couldn't happen any more since this change as we only send one now, right?
But state couldn't change between these two messages as only the sync thread can change state, and only the sync thread calls _onConnect
.
Does that adequately answer your question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not sure if i 100% understand, but the code changes here LGTM. Is there some way to simulate postPoll and peer state changes in our vm?
Yes, that seems to be the case, this isn't actually a problem.
Ok, let me quote myself for quick reference:
I believe I intended the "edit" block to apply to this, meaning that we can ignore the "I think we should... " sentence. But let's clarify the rest. Let's say we are 1.sjc, and we are connected to 1.lax. Neither node is forked from the other, but 1.sjc thinks it is forked from 2.sjc. There is no cluster, say all other nodes are offline. When 2.sjc comes back up (and is no longer forked), does 1.sjc exclude it from approving the standup because it's forked? If so, we will never reach "LEADING" because we will never have our standup approved. However, if we clear the list of forked nodes at STANDINGUP, then we can count 2.sjc's approval because it is no longer forked. I don't think this actually matters though, since we don't send DENY due to being forked. |
So the investigation for that was actually in this comment: https://github.com/Expensify/Expensify/issues/426227 2.sjc tries to stand up. If we did not remove 1.sjc from the list of forked peers, we wouldn't try to choose it as a sync peer and reconnect to it. I think that's complete. |
I've going to merge, feel free to leave more comments if anything was inadequately answered. |
Details
See investigation in comments here: https://github.com/Expensify/Expensify/issues/426227
When we connect to a peer, if we're
STANDINGUP
we intend to send aSTATE
message indicating that the remote node should approve or deny our standup:Bedrock/sqlitecluster/SQLiteNode.cpp
Lines 1810 to 1831 in cedf037
However, because the
LOGIN
andSTATE
messages are sent back to back, they will always have the same state (which isSTANDINGUP
).This means that when processing the
STATE
message we'll always hit this code:Bedrock/sqlitecluster/SQLiteNode.cpp
Lines 1364 to 1368 in cedf037
Rather than this code:
Bedrock/sqlitecluster/SQLiteNode.cpp
Line 1424 in cedf037
We should probably send
STANDUP_RESPONSE
in response toLOGIN
when the node isSTANDINGUP
and remove the secondSTATE
message here.This will fix the race condition where a late login during standup prevents the cluster from forming which happens because of this line:
Bedrock/sqlitecluster/SQLiteNode.cpp
Line 2033 in cedf037
However, I think that just puts us in a different bad state in the case we are forked from the node that does the late connect. Now it will respond to the
LOGIN
with aSTANDUP_RESPONSE
but this will be aDENY
which is just as bad.I think we should adjust the logic to not clear the
_forkedFrom
list until we are actuallyLEADING
orFOLLOWING
. I can imagine some scenario where clearing this inSTANDINGUP
gets us to reset a node that was forked and no longer is, i.e., it was just restored from backup, and it was the final node required to reach quorum. In this case, we would still not form a cluster, we'd need to restart the node that is attempting to stand up.Thoughts on this?
EDIT: Looking at the code, we won't actually
DENY
due to a hash mismatch, only if we think some other node is leading. So a node that's forked from us (but hasn't realized that yet) can still approve the standup, and it will abstain if it does realize it's forked.NOTE: The code in
_sendStandupResponse
is exactly the block deleted from_onMESSAGE
with no changes except indentation.Fixed Issues
Fixes https://github.com/Expensify/Expensify/issues/426227
Tests
Internal Testing Reminder: when changing bedrock, please compile auth against your new changes