Possible lost updates with partitions and membership changes #17
Update: I've made several improvements to the robustness of membership tests, and we can reliably reproduce these issues with Jepsen 39a4b0ff3006237b2e00fab253035f43fcfc6403, running
Here's a particularly egregious case, where keys 3, 4, 5, 6, 7, 8, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 28, 32, 37, 38, 39, 40, 45, 46, 47, 48, 49, 50, ... all show lost updates. It looks like most of the keys in the test hit data loss, really.
In the history from the previous comment, these operations, on key 4, are particularly interesting. I've included the full nemesis history as well. tl;dr: it looks like there's a bug in the joint consensus mechanism for membership changes which allows a leader to declare a membership change complete, even if it hasn't actually communicated that change to anyone else. That allows the cluster to get into a Real Nasty Split-Brain, by allowing two leaders to decide they each have enough votes to be the leader for the same (or arbitrary) term.

We begin by partitioning n1, the current primary, from all other nodes.

2020-03-06 20:09:05,300{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-partition :primaries
2020-03-06 20:09:05,509{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-partition [:isolated {"n2" #{"n1"}, "n5" #{"n1"}, "n4" #{"n1"}, "n3" #{"n1"}, "n1" #{"n2" "n5" "n4" "n3"}}]

n2 becomes a new leader, but n1 hasn't caught on that it's isolated yet. We try to remove node n2 by asking n1 to remove it.

2020-03-06 20:09:15,614{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave "n2"
2020-03-06 20:09:15,733{GMT} INFO [jepsen nemesis] jepsen.redis.nemesis: Current membership
({:id 490819827, :role :leader, :node "n1"}
{:id 647672153, :node "n2", :role :leader}
{:id 77395493, :node "n4", :role :follower}
{:id 1982509157, :node "n5", :role :follower}
{:id 1728106831, :node "n3", :role :follower})
2020-03-06 20:09:15,943{GMT} INFO [jepsen node n1] jepsen.redis.db: n1 :removing n2 (id: 647672153)
2020-03-06 20:09:16,045{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave ["n2" {"n1" "OK"}]

The partition ends.

2020-03-06 20:09:26,046{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition nil
2020-03-06 20:09:26,251{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition :network-healed

n4 becomes the leader. We ask n4 to remove itself, which fails.

2020-03-06 20:09:36,356{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave "n4"
2020-03-06 20:09:36,472{GMT} INFO [jepsen nemesis] jepsen.redis.nemesis: Current membership
({:id 490819827, :role :follower, :node "n1"}
{:id 647672153, :node "n2", :role :follower}
{:id 77395493, :node "n4", :role :leader}
{:id 1982509157, :node "n5", :role :follower}
{:id 1728106831, :node "n3", :role :follower})
2020-03-06 20:09:36,784{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave ["n4" {"n4" "ERR cannot remove leader node"}]

We ask n4 to remove n5. We're still waiting on the removal of n2 to complete.

2020-03-06 20:09:46,889{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave "n5"
2020-03-06 20:09:47,006{GMT} INFO [jepsen nemesis] jepsen.redis.nemesis: Current membership
({:id 490819827, :role :follower, :node "n1"}
{:id 647672153, :node "n2", :role :follower}
{:id 77395493, :node "n4", :role :leader}
{:id 1982509157, :node "n5", :role :follower}
{:id 1728106831, :node "n3", :role :follower})
2020-03-06 20:09:47,216{GMT} INFO [jepsen node n4] jepsen.redis.db: n4 :removing n5 (id: 1982509157)

Still waiting for n1's removal of n2 to complete, we ask n4 to remove n2.

2020-03-06 20:09:57,423{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave "n2"
2020-03-06 20:09:57,542{GMT} INFO [jepsen nemesis] jepsen.redis.nemesis: Current membership
({:id 490819827, :role :follower, :node "n1"}
{:id 647672153, :node "n2", :role :follower}
{:id 77395493, :node "n4", :role :leader}
{:id 1728106831, :node "n3", :role :follower}
{:id 1982509157, :role :candidate, :node "n5"})
2020-03-06 20:09:57,852{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave ["n2" {"n4" "OK"}]
2020-03-06 20:09:57,978{GMT} INFO [clojure-agent-send-off-pool-50] jepsen.redis.db: :removed n2 (id: 647672153

This time, the removal completes. Open question: why did n1's removal of n2 not propagate once the partition was resolved?

We partition n4 away from all other nodes.

2020-03-06 20:10:07,853{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-partition :primaries
2020-03-06 20:10:08,062{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :start-partition [:isolated {"n2" #{"n4"}, "n5" #{"n4"}, "n1" #{"n4"}, "n3" #{"n4"}, "n4" #{"n2" "n5" "n1" "n3"}}]

Operations begin on n4. Most fail due to a lack of leader.

2020-03-06 20:10:13,853{GMT} INFO [jepsen worker 4] jepsen.util: 9 :invoke :txn [[:r 4 nil]]
2020-03-06 20:10:13,854{GMT} INFO [jepsen worker 4] jepsen.util: 9 :fail :txn [[:r 4 nil]] :noleader

Operations begin to succeed, as a new leader has been elected. Our first successful append occurs against node n1, the current leader.

2020-03-06 20:10:16,150{GMT} INFO [jepsen worker 0] jepsen.util: 15 :ok :txn [[:r 4 []] [:r 3 [5]]]
...
2020-03-06 20:10:17,965{GMT} INFO [jepsen worker 0] jepsen.util: 15 :ok :txn [[:append 4 31]]

We ask n4 to remove n1.

2020-03-06 20:10:18,167{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave "n1"
2020-03-06 20:10:18,306{GMT} INFO [jepsen nemesis] jepsen.redis.nemesis: Current membership
({:id 490819827, :role :leader, :node "n1"}
{:id 77395493, :node "n4", :role :leader}
{:id 1728106831, :node "n3", :role :follower}
{:id 647672153, :role :candidate, :node "n2"}
{:id 1982509157, :role :candidate, :node "n5"})
2020-03-06 20:10:18,514{GMT} INFO [jepsen node n4] jepsen.redis.db: n4 :removing n1 (id: 490819827)
2020-03-06 20:10:18,616{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave ["n1" {"n4" "OK"}]

n4, of course, can't talk to anyone, so n1 continues happily along.

2020-03-06 20:10:21,551{GMT} INFO [jepsen worker 0] jepsen.util: 15 :ok :txn [[:r 5 [2 3 4]] [:r 4 [31 42]] [:append 5 15] [:append 3 62]]

Other nodes, like n3, can proxy transactions to n1 as well:

2020-03-06 20:10:24,353{GMT} INFO [jepsen worker 2] jepsen.util: 12 :ok :txn [[:r 3 [5 23 37 53 49 62 79]] [:append 4 78]]

We ask n4 to remove n3.

2020-03-06 20:10:28,721{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave "n3"
2020-03-06 20:10:28,757{GMT} INFO [jepsen worker 2] jepsen.util: 12 :ok :txn [[:r 4 [31 42 64 78 85 86 91 113 106]]]
2020-03-06 20:10:28,836{GMT} INFO [jepsen nemesis] jepsen.redis.nemesis: Current membership
({:id 490819827, :role :leader, :node "n1"}
{:id 77395493, :node "n4", :role :leader}
{:id 1728106831, :node "n3", :role :follower}
{:id 647672153, :role :candidate, :node "n2"}
{:id 1982509157, :role :candidate, :node "n5"})
2020-03-06 20:10:29,145{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave ["n3" {"n4" "OK"}]

n4 decides that it's completed this removal, despite not being able to talk to anyone else. Here's n4's RAFT.INFO:

2020-03-06 20:10:29,263{GMT} INFO [clojure-agent-send-off-pool-50] jepsen.redis.db: :done-waiting-for-node-removal {:raft
{:role :leader,
:num_voting_nodes "1",
:leader_id 77395493,
:is_voting "yes",
:node_id 77395493,
:num_nodes 1,
:state :up,
:nodes nil,
:current_term 3},
...

n4 decides n3 is gone...

2020-03-06 20:10:29,264{GMT} INFO [clojure-agent-send-off-pool-50] jepsen.redis.db: :removed n3 (id: 1728106831

And n1 too. This is a bit awkward, because n1 was (last we checked) the leader of the majority component, but we HAVE been waiting on its removal for some time.

2020-03-06 20:10:29,754{GMT} INFO [clojure-agent-send-off-pool-53] jepsen.redis.db: :removed n1 (id: 490819827

We're... oh THIS is weird. We're executing txns on node n4. n4 is still (to my knowledge!) partitioned away from all the other nodes! This is DEFINITELY weird, but it kind of makes sense when you realize that n4 thinks it's the only node in the cluster. Of COURSE it can execute operations!

2020-03-06 20:10:30,247{GMT} INFO [jepsen worker 3] jepsen.util: 8 :ok :txn [[:append 4 122] [:r 5 []] [:append 4 123] [:append 4 124]]

Simultaneously, node n1 is still executing its own transactions. Note that it doesn't observe node n4's append of 122 here. This is a classic case of split-brain. This suggests to me that there's been a failure in joint consensus--perhaps n4 completed its local membership change before getting confirmation that a majority of the cluster had adopted the change? (See the quorum sketch at the end of this walkthrough.)

2020-03-06 20:10:30,667{GMT} INFO [jepsen worker 0] jepsen.util: 15 :ok :txn [[:r 4 [31 42 64 78 85 86 91 113 106 119 120 121]]]

n4 steams on ahead with its own timeline for key 4:

2020-03-06 20:10:31,791{GMT} INFO [jepsen worker 3] jepsen.util: 8 :ok :txn [[:append 4 141] [:r 4 [122 123 124 141]] [:append 5 105]]

While n3 et al keep writing to (presumably) primary n1:
We ask n5 to leave (again). Some nodes still think it's here--but the request fails, because n4 thinks it already removed n5.

2020-03-06 20:10:39,250{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave "n5"
2020-03-06 20:10:39,369{GMT} INFO [jepsen nemesis] jepsen.redis.nemesis: Current membership
({:id 490819827, :role :leader, :node "n1"}
{:id 77395493, :node "n4", :role :leader}
{:id 1728106831, :node "n3", :role :follower}
{:id 647672153, :role :candidate, :node "n2"}
{:id 1982509157, :role :candidate, :node "n5"})
2020-03-06 20:10:39,578{GMT} INFO [jepsen node n4] jepsen.redis.db: n4 :removing n5 (id: 1982509157)
2020-03-06 20:10:39,680{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :leave ["n5" {"n4" "node id does not exist"}]

n4 is still going!

2020-03-06 20:10:40,349{GMT} INFO [jepsen worker 3] jepsen.util: 8 :ok :txn [[:r 4 [122 123 124 141 195 196]] [:append 5 190] [:r 5 [105 139 142 157 190]] [:r 4 [122 123 124 141 195 196]]]
...
2020-03-06 20:10:43,987{GMT} INFO [jepsen worker 3] jepsen.util: 8 :ok :txn [[:r 4 [122 123 124 141 195 196]] [:append 6 15] [:r 4 [122 123 124 141 195 196]]]

While n1 diverges farther:

2020-03-06 20:10:44,141{GMT} INFO [jepsen worker 0] jepsen.util: 15 :ok :txn [[:r 4 [31 42 64 78 85 86 91 113 106 119 120 121 152 161 162 163 189 180 202 236]] [:append 6 11] [:append 6 12] [:r 4 [31 42 64 78 85 86 91 113 106 119 120 121 152 161 162 163 189 180 202 236]]]

We resolve the partition.

2020-03-06 20:10:49,681{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition nil
2020-03-06 20:10:49,887{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition :network-healed

... and things proceed from there.
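As promised, here's a worked sketch of the quorum arithmetic behind this split-brain. This is illustrative Python of my own, not Jepsen or redisraft code: once n4 unilaterally shrinks its configuration to just itself, its quorum drops to one, while n1 still reaches a majority of the configuration it believes in, so both sides can commit independently.

def quorum(members):
    # Majority size for a set of voting members.
    return len(members) // 2 + 1

full_config = {"n1", "n2", "n3", "n4", "n5"}
print(quorum(full_config))                          # => 3: a leader needs three acks

# n4's local view after it "removes" everyone it can't reach:
n4_view = {"n4"}
print(quorum(n4_view))                              # => 1: n4 can commit writes alone

# Meanwhile n1 still believes the configuration is {n1, n3, n4, n5}
# (n2's removal did commit), and it can reach n3 and n5, so it keeps
# committing its own, divergent history.
n1_config = {"n1", "n3", "n4", "n5"}
print(len({"n1", "n3", "n5"}) >= quorum(n1_config)) # => True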
This isn't just a problem with follower-proxies; it happens even when only leaders serve requests. In addition, it looks like not every case involves complete data loss; sometimes there's a shared prefix of the history. That makes sense--it's consistent with what you'd expect from a split-brain scenario. I'm kind of surprised we haven't seen shared prefixes more often, though.

:incompatible-order ({:key 9,
:values [[9 215 225]
[9 90 105]]}
{:key 10,
:values [[147]
[167 168]]})}},
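For reference, :incompatible-order means two reads of the same key observed value sequences that can't both belong to one linear history: neither is a prefix of the other. A rough illustration of that check in Python (my paraphrase, not Jepsen's actual checker code):

def compatible(a, b):
    # True when one observed order is a prefix of the other.
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    return longer[:len(shorter)] == shorter

# Key 9 from the output above: neither read is a prefix of the other;
# the two orders diverge immediately after the element 9.
print(compatible([9, 215, 225], [9, 90, 105]))   # => False
# Key 10: the reads share no common prefix at all.
print(compatible([147], [167, 168]))             # => False
# A healthy pair of reads, by contrast:
print(compatible([31, 42], [31, 42, 64]))        # => True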
@aphyr I think most, if not all, of this traces back to having inconsistent information about node membership. I've tried to identify the point in time where the first violation happens and possibly cascades into everything else, but it's hard to tell based on this info only. If you can run with

Membership changes are done according to the Raft paper, so basically every change is propagated as a log entry and becomes effective when it is considered committed. This is why, at the beginning of the sequence, removing n2 on the isolated n1 should never converge: when n1 re-joins the cluster as a follower, it should reconcile its log and invalidate this entry.
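To make that intended behavior concrete, here's a minimal sketch (illustrative Python, not redisraft's actual code): a removal is appended to the log like any other entry, and only takes effect once a majority of the current configuration has acknowledged it.

from dataclasses import dataclass, field

@dataclass
class Leader:
    members: set                      # current voting configuration
    log: list = field(default_factory=list)

    def quorum(self):
        return len(self.members) // 2 + 1

    def propose_remove(self, node):
        # The change is appended like any other log entry; the leader
        # counts itself as the first acknowledgement.
        entry = {"type": "remove", "node": node, "acks": {"leader"}}
        self.log.append(entry)
        return entry

    def on_ack(self, entry, follower):
        entry["acks"].add(follower)
        # The removal only becomes effective once a majority of the
        # *current* configuration has replicated the entry.
        if len(entry["acks"]) >= self.quorum():
            self.members.discard(entry["node"])

leader = Leader(members={"n1", "n2", "n3", "n4", "n5"})
e = leader.propose_remove("n2")
print("n2" in leader.members)    # True: not committed yet, still a member
leader.on_ack(e, "n3")
leader.on_ack(e, "n4")           # third ack out of five: committed
print("n2" in leader.members)    # False: the removal is now effective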
I've refined the test case; we can reproduce the issue reliably in 10 seconds now. I've also introduced
As I described above, this does not appear to be the case! Completely isolated nodes are perfectly happy to remove other nodes, will claim those nodes are no longer members of the cluster, and proceed to execute transactions without ever talking to the rest of the cluster. Here, for instance, is node n1's local RAFT.INFO state, as observed after n1 was placed into its own network component and asked to remove n2, n3, n4, and n5:

{:role :leader,
:num_voting_nodes "1",
:leader_id 651090454,
:is_voting "yes",
:node_id 651090454,
:num_nodes 1,
:state :up,
:nodes nil,
:current_term 1},

As you can see, it reports no other nodes, and thinks there's only one voting node in the cluster--itself! Perhaps nodes are applying membership changes before receiving confirmation from a majority of their peers? Or failing to enter joint consensus in the first place?
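One cheap way to surface this state is to compare RAFT.INFO-style views across nodes and flag the obvious symptoms. The helper and the dicts below are hypothetical illustrations of that comparison, not part of the test suite:

def check_views(views):
    # views: node name -> a dict mirroring a few RAFT.INFO fields.
    problems = []
    leaders = [n for n, v in views.items() if v["role"] == "leader"]
    if len(leaders) > 1:
        problems.append("multiple nodes claim leadership: " + ", ".join(sorted(leaders)))
    sizes = {int(v["num_voting_nodes"]) for v in views.values()}
    if len(sizes) > 1:
        problems.append("nodes disagree on voting-set size: " + str(sorted(sizes)))
    return problems

# Roughly the state observed above: the isolated node thinks it is a
# one-node cluster, while the others still see a five-node cluster.
views = {
    "n1": {"role": "leader",   "num_voting_nodes": "1"},
    "n2": {"role": "leader",   "num_voting_nodes": "5"},
    "n3": {"role": "follower", "num_voting_nodes": "5"},
}
print(check_views(views))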
It looks like this problem is independent of whether or not we use MULTI/EXEC transactions--it affects single reads and writes as well.
Traced this to a bug in the Raft library code, which failed to enforce that remove changes are handled one at a time (i.e. require joint consensus per operation).
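For context, that constraint is the standard Raft rule that at most one configuration change may be in flight at a time. A minimal sketch of such a guard, of my own invention rather than the actual library fix:

class ConfigChangeGate:
    # Allows at most one uncommitted configuration change at a time.

    def __init__(self):
        self.pending_index = None     # log index of the in-flight change

    def propose(self, change, log_index):
        if self.pending_index is not None:
            # A real implementation might queue the request instead.
            raise RuntimeError("a configuration change is already in flight")
        self.pending_index = log_index
        return change

    def on_commit(self, commit_index):
        # Once the pending change commits, new changes may proceed.
        if self.pending_index is not None and commit_index >= self.pending_index:
            self.pending_index = None

gate = ConfigChangeGate()
gate.propose({"remove": "n2"}, log_index=10)
try:
    gate.propose({"remove": "n5"}, log_index=11)   # rejected: n2's removal uncommitted
except RuntimeError as err:
    print(err)
gate.on_commit(10)
gate.propose({"remove": "n5"}, log_index=11)       # allowed now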
Fix for multiple concurrent config changes (issue #17)
Nice!
It looks like this is fixed as of 8da0c77--though initially I had some issues with tests failing, I think because the submodule directory wasn't getting rebuilt properly. Right now we run
Do we... also need to run make clean in deps individually, or should that be handled by the Makefile? Or is there something else I'm missing?
Makefile should take care of that, but you'll need to do |
current_leader points to the node object corresponding to the leader node, but we are about to free() it. To prevent an access after free, we have to unset it here.
This code is all REAL new so I'm not confident that this isn't some bug in Jepsen--treat this as tentative for now. It looks like we can lose short windows of updates when a membership change occurs during a network partition. This occurred with redis f88f866 (redisraft d589127), and proxy followers enabled.
For instance, in this run, a partition begins at 10:56:41, isolating n1 from all other nodes.
Process 34 appends 53 to the end of key 20, and happens to read it, observing [...48 50 53]:
Process 34 appends 55
We remove node n4 from the cluster. Both n1 and n4 believe they're currently leaders, so we ask n1 to do it. Note that n1 is presently isolated from the rest of the cluster, but returns "OK" for the part request. We then kill n4 and wipe its data files, which means the only node which could have recent data would be... n5.
Ops fail for a bit, then we join n3 to the cluster. Some nodes still think n4 is a part of the cluster--likely because we told n1 to remove it, and n1 hasn't had a chance to inform anyone else, because it's partitioned. Node n1 still thinks it's a leader.
Things go on in noleader/nocluster for a bit, until we resolve a network partition:
We have more noleader/timeout/conn-refused failures, then process 50, just before the end of the test, manages to execute a read of key 20
It observes key 20 = [... 48 63]--we lost the updates of 50 and 53!
I suspect, based on this behavior, that node removal is asynchronous; that OK response only indicates local leader acknowledgement of the membership change, NOT a transition to joint consensus mode. I'm not sure whether that's actually a problem or not, because even if we hadn't gotten to joint consensus yet, n4 is down, and should have been required for quorum operations on the original cohort. Maybe earlier nemesis changes got us into a messed-up state? I'll keep digging.
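If the OK really only reflects local leader acknowledgement, one defensive option on the test side is to poll until a majority of members no longer report the removed node before treating the removal as complete. This is a rough sketch under that assumption; the membership callback is hypothetical, standing in for however the test reads each node's RAFT.INFO:

import time

def wait_for_removal(nodes, removed_id, membership, timeout=30.0, interval=0.5):
    # Block until a majority of nodes no longer list removed_id.
    # membership(node) -> set of node ids that node currently reports,
    # or None if that node is unreachable.
    deadline = time.monotonic() + timeout
    needed = len(nodes) // 2 + 1
    while time.monotonic() < deadline:
        agree = 0
        for n in nodes:
            view = membership(n)
            if view is not None and removed_id not in view:
                agree += 1
        if agree >= needed:
            return True
        time.sleep(interval)
    return False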
We encountered this problem using jepsen.redis 6f4253e905588f51eb1b1e186266607a39fdf635:
but I haven't been able to reproduce it since. Still exploring.