New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
paxos fixes #457
paxos fixes #457
Conversation
there is a set of logs referenced in the bug, #5698 |
@@ -370,9 +394,13 @@ void Paxos::handle_last(MMonPaxos *last) | |||
return; | |||
} | |||
|
|||
assert(g_conf->paxos_kill_at != 1); | |||
|
|||
// store any committed values if any are specified in the message | |||
store_state(last); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to guard this function from old values too, not just the changes to our uncommitted values down below. store_state() looks like it tramples over whatever's there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
share/store_state are only dealing with committed values, so if they trample over an uncommitted value it is with good reason. it's also only writing new stuff outside of [first_committed,last_committed]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good enough.
Okay, this all looks good to me modulo a couple commit message typos. |
Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
We may have an uncommitted value from our perspective (it is our lc + 1) when the collector has a much larger lc (because we have been out for the last few rounds). Only share an uncommitted value if it is in fact the next value. Signed-off-by: Sage Weil <sage@inktank.com>
If an older peer sends an uncommitted value, make sure we only take it if it is in the future, and at least as new as any current uncommitted value. (Prior to the previous patch, peers could send values from long-past rounds. The pn values are also bogus.) Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
During the collect/last exchange, peers share any uncommitted values with the leader. They are supposed to also share the pn under which that value was accepted, but were instead using the just-accepted pn value. This effectively meant that we *always* took the uncommitted value; if there were multiples, which one we accepted depended on what order the LAST messages arrived, not which pn the values were generated under. The specific failure sequence I observed: - collect - learned uncommitted value for 262 from myself - send collect with pn 901 - got last with pn 901 (incorrect) for 200 (old) from peer - discard our own value, remember the other - finish collect phase - ignore old uncommitted value Fix this by storing a pending_v and pending_pn value whenever we accept a value. Use this to send an appropriate pn value in the LAST reply so that the leader can make it's decision about which uncommitted value to accept based on accurate information. Also use it when we learn the uncommitted value from ourselves. We could probably be more clever about storing less information here, for example by omitting pending_v and clearing pending_pn at the appropriate point, but that would be more fragile. Similarly, we could store a pn for *every* commit if we wanted to lay some groundwork for having multiple uncommitted proposals in flight, but I don't want to speculate about what is necessary or sufficient for a correct solution there. Fixes: #5698 Backport: cuttlefish, bobtail Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
paxos fixes Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
mgr/prometheus: Fix regression with OSD/host details/overview dashboards Reviewed-by: Michael Fritch <mfritch@suse.com>
remove tests that fail on boto3's parameter validation
No description provided.