Possible lost updates with partitions and membership changes #17

Closed
aphyr opened this issue Mar 6, 2020 · 10 comments

aphyr commented Mar 6, 2020

This code is all REAL new so I'm not confident that this isn't some bug in Jepsen--treat this as tentative for now. It looks like we can lose short windows of updates when a membership change occurs during a network partition. This occurred with redis f88f866 (redisraft d589127), with proxy followers enabled.

For instance, in this run, a partition begins at 10:56:41, isolating n1 from all other nodes.

2020-03-06 10:56:41,004{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start-partition	[:isolated {"n2" #{"n1"}, "n5" #{"n1"}, "n4" #{"n1"}, "n3" #{"n1"}, "n1" #{"n2" "n5" "n4" "n3"}}]

Process 34 appends 53 to the end of key 20, and happens to read it, observing [...48 50 53]:

2020-03-06 10:56:46,922{GMT}	INFO	[jepsen worker 4] jepsen.util: 34	:ok	:txn	[[:append 20 53] [:r 20 [12 21 22 25 39 29 30 31 42 43 44 45 46 47 48 50 53]] [:append 19 119]]

Process 34 appends 55 to key 20:

2020-03-06 10:56:49,171{GMT}	INFO	[jepsen worker 4] jepsen.util: 34	:ok	:txn	[[:append 19 120] [:r 19 [7 2 32 43 59 53 64 67 75 81 90 106 107 108 109 112 113 115 116 117 118 119 120]] [:append 20 55]]

We remove node n4 from the cluster. Both n1 and n4 believe they're currently leaders, so we ask n1 to do it. Note that n1 is presently isolated from the rest of the cluster, but returns "OK" for the leave request. We then kill n4 and wipe its data files, which means the only node which could have recent data would be... n5.

2020-03-06 10:56:51,109{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	"n4"
2020-03-06 10:56:51,222{GMT}	INFO	[jepsen nemesis] jepsen.redis.nemesis: Current membership
 ({:id 1583530776, :role :leader, :node "n1"}
 {:id 1211184293, :node "n5", :role :follower}
 {:id 1939788657, :node "n4", :role :leader})
...
2020-03-06 10:56:53,863{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	["n4" {"n1" "OK"}]
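
A quick way to see the conflict in that membership listing (an illustrative Clojure snippet, not the actual nemesis code): collect every node that claims the :leader role.

(defn conflicting-leaders
  "Nodes in a membership view that each claim to be :leader."
  [membership]
  (->> membership
       (filter #(= :leader (:role %)))
       (map :node)
       set))

(conflicting-leaders
  [{:id 1583530776, :role :leader,   :node "n1"}
   {:id 1211184293, :role :follower, :node "n5"}
   {:id 1939788657, :role :leader,   :node "n4"}])
;; => #{"n1" "n4"}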

Ops fail for a bit, then we join n3 to the cluster. Some nodes still think n4 is a part of the cluster--likely because we told n1 to remove it, and n1 hasn't had a chance to inform anyone else, because it's partitioned. Node n1 still thinks it's a leader.

2020-03-06 10:57:03,968{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:join	"n3"
...
2020-03-06 10:57:04,080{GMT}	INFO	[jepsen nemesis] jepsen.redis.nemesis: Current membership
 ({:id 1583530776, :role :leader, :node "n1"}
 {:id 1211184293, :node "n5", :role :candidate}
 {:id 1939788657, :node "n4"})
...
2020-03-06 10:57:05,496{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:join	["n3" {"n3" "OK"}]

Things go on in noleader/nocluster for a bit, until we resolve the network partition:

2020-03-06 10:57:15,497{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:stop-partition	nil
...
2020-03-06 10:57:15,702{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:stop-partition	:network-healed

We have more noleader/timeout/conn-refused failures; then process 50, just before the end of the test, manages to execute a read of key 20:

2020-03-06 10:57:23,497{GMT}	INFO	[jepsen worker 0] jepsen.util: 50	:invoke	:txn	[[:r 22 nil] [:append 21 53] [:r 20 nil] [:append 21 54]]
...
2020-03-06 10:57:25,516{GMT}	INFO	[jepsen worker 0] jepsen.util: 50	:ok	:txn	[[:r 22 []] [:append 21 53] [:r 20 [12 21 22 25 39 29 30 31 42 43 44 45 46 47 48 63]] [:append 21 54]]

It observes key 20 = [... 48 63]--we lost the updates of 50 and 53!
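
Roughly how this anomaly shows up (a hand-rolled Clojure check for illustration, not Jepsen's actual list-append checker): every element observed by an earlier read of a key should still appear in any later read of that key.

(defn lost-elements
  "Elements seen in an earlier read of a key but missing from a later read."
  [earlier later]
  (remove (set later) earlier))

(lost-elements [12 21 22 25 39 29 30 31 42 43 44 45 46 47 48 50 53]
               [12 21 22 25 39 29 30 31 42 43 44 45 46 47 48 63])
;; => (50 53)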

I suspect, based on this behavior, that node removal is asynchronous; that OK response only indicates local leader acknowledgement of the membership change, NOT a transition to joint consensus mode. I'm not sure whether that's actually a problem or not, because even if we hadn't gotten to joint consensus yet, n4 is down, and should have been required for quorum operations on the original cohort. Maybe earlier nemesis changes got us into a messed-up state? I'll keep digging.
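
For reference, here's the rule at stake, sketched in Clojure with made-up node names and match indices (an illustration of the Raft commitment rule, not redisraft's actual code): a configuration-change entry should only take effect once a majority of the voters--and, under joint consensus, a majority of both the old and new configurations--have replicated it. A node partitioned from everyone else can never satisfy that.

(defn majority-replicated?
  [index match-index voters]
  (let [acks (count (filter #(>= (get match-index % 0) index) voters))]
    (> acks (quot (count voters) 2))))

(defn config-change-committed?
  [index match-index old-voters new-voters]
  (and (majority-replicated? index match-index old-voters)
       (majority-replicated? index match-index new-voters)))

;; n1 is partitioned, so only its own match index has advanced:
(config-change-committed? 42
                          {"n1" 42, "n2" 17, "n3" 17, "n5" 17}
                          #{"n1" "n2" "n3" "n4" "n5"}  ; old config
                          #{"n1" "n2" "n3" "n5"})      ; new config, n4 removed
;; => false -- the removal should not be reported as complete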

We encountered this problem using jepsen.redis 6f4253e905588f51eb1b1e186266607a39fdf635:

lein run test-all --raft-version d589127 -w append --time-limit 600 --nemesis-interval 10 --follower-proxy --nemesis member,partition

but I haven't been able to reproduce it since. Still exploring.

aphyr commented Mar 7, 2020

Update: I've made several improvements to the robustness of membership tests, and we can reliably reproduce these issues with Jepsen 39a4b0ff3006237b2e00fab253035f43fcfc6403, running

lein run test-all --raft-version d589127 -w append --time-limit 600 --nemesis-interval 10 --follower-proxy --nemesis member,partition --test-count 20

Here's a particularly egregious case, where keys 3, 4, 5, 6, 7, 8, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25, 26, 28, 32, 37, 38, 39, 40, 45, 46, 47, 48, 49, 50, ... all lost updates--it looks like most of the test hit data loss, really.

20200306T200843.000-0500.zip

aphyr commented Mar 7, 2020

In the history from the previous comment, these operations, on key 4, are particularly interesting. I've included the full nemesis history as well. tl;dr: it looks like there's a bug in the joint consensus mechanism for membership changes which allows a leader to declare a membership change complete, even if it hasn't actually communicated that change to anyone else. That allows the cluster to get into a Real Nasty Split-Brain, by allowing two leaders to decide they each have enough votes to be the leader for the same (or arbitrary) term.

We begin by partitioning n1, the current primary, from all other nodes.

2020-03-06 20:09:05,300{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start-partition	:primaries
2020-03-06 20:09:05,509{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start-partition	[:isolated {"n2" #{"n1"}, "n5" #{"n1"}, "n4" #{"n1"}, "n3" #{"n1"}, "n1" #{"n2" "n5" "n4" "n3"}}]

n2 becomes a new leader, but n1 hasn't caught on that it's isolated yet. We try to remove node n2 by asking n1 to do it.

2020-03-06 20:09:15,614{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	"n2"
2020-03-06 20:09:15,733{GMT}	INFO	[jepsen nemesis] jepsen.redis.nemesis: Current membership
 ({:id 490819827, :role :leader, :node "n1"}
 {:id 647672153, :node "n2", :role :leader}
 {:id 77395493, :node "n4", :role :follower}
 {:id 1982509157, :node "n5", :role :follower}
 {:id 1728106831, :node "n3", :role :follower})
2020-03-06 20:09:15,943{GMT}	INFO	[jepsen node n1] jepsen.redis.db: n1 :removing n2 (id: 647672153)
2020-03-06 20:09:16,045{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	["n2" {"n1" "OK"}]

The partition ends.

2020-03-06 20:09:26,046{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:stop-partition	nil
2020-03-06 20:09:26,251{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:stop-partition	:network-healed

n4 becomes the leader. We ask n4 to remove itself, which fails.

2020-03-06 20:09:36,356{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	"n4"
2020-03-06 20:09:36,472{GMT}	INFO	[jepsen nemesis] jepsen.redis.nemesis: Current membership
 ({:id 490819827, :role :follower, :node "n1"}
 {:id 647672153, :node "n2", :role :follower}
 {:id 77395493, :node "n4", :role :leader}
 {:id 1982509157, :node "n5", :role :follower}
 {:id 1728106831, :node "n3", :role :follower})
2020-03-06 20:09:36,784{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	["n4" {"n4" "ERR cannot remove leader node"}]

We ask n4 to remove n5. We're still waiting on the removal of n2 to complete.

2020-03-06 20:09:46,889{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	"n5"
2020-03-06 20:09:47,006{GMT}	INFO	[jepsen nemesis] jepsen.redis.nemesis: Current membership
 ({:id 490819827, :role :follower, :node "n1"}
 {:id 647672153, :node "n2", :role :follower}
 {:id 77395493, :node "n4", :role :leader}
 {:id 1982509157, :node "n5", :role :follower}
 {:id 1728106831, :node "n3", :role :follower})
2020-03-06 20:09:47,216{GMT}	INFO	[jepsen node n4] jepsen.redis.db: n4 :removing n5 (id: 1982509157)

Still waiting for n1's removal of n2 to complete, we ask n4 to remove n2.

2020-03-06 20:09:57,423{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	"n2"
2020-03-06 20:09:57,542{GMT}	INFO	[jepsen nemesis] jepsen.redis.nemesis: Current membership
 ({:id 490819827, :role :follower, :node "n1"}
 {:id 647672153, :node "n2", :role :follower}
 {:id 77395493, :node "n4", :role :leader}
 {:id 1728106831, :node "n3", :role :follower}
 {:id 1982509157, :role :candidate, :node "n5"})
2020-03-06 20:09:57,852{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	["n2" {"n4" "OK"}]
2020-03-06 20:09:57,978{GMT}	INFO	[clojure-agent-send-off-pool-50] jepsen.redis.db: :removed n2 (id: 647672153

This time, the removal completes. Open question: why did n1's removal of n2 not propagate once the partition was resolved?

We partition n4 away from all other nodes.

2020-03-06 20:10:07,853{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start-partition	:primaries
2020-03-06 20:10:08,062{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start-partition	[:isolated {"n2" #{"n4"}, "n5" #{"n4"}, "n1" #{"n4"}, "n3" #{"n4"}, "n4" #{"n2" "n5" "n1" "n3"}}]

Operations begin on n4. Most fail due to a lack of leader.

2020-03-06 20:10:13,853{GMT}	INFO	[jepsen worker 4] jepsen.util: 9	:invoke	:txn	[[:r 4 nil]]
2020-03-06 20:10:13,854{GMT}	INFO	[jepsen worker 4] jepsen.util: 9	:fail	:txn	[[:r 4 nil]]	:noleader

Operations begin to succeed, as a new leader has been elected. Our first successful append occurs against node n1, the current leader.

2020-03-06 20:10:16,150{GMT}	INFO	[jepsen worker 0] jepsen.util: 15	:ok	:txn	[[:r 4 []] [:r 3 [5]]]
...
2020-03-06 20:10:17,965{GMT}	INFO	[jepsen worker 0] jepsen.util: 15	:ok	:txn	[[:append 4 31]]

We ask n4 to remove n1.

2020-03-06 20:10:18,167{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	"n1"
2020-03-06 20:10:18,306{GMT}	INFO	[jepsen nemesis] jepsen.redis.nemesis: Current membership
 ({:id 490819827, :role :leader, :node "n1"}
 {:id 77395493, :node "n4", :role :leader}
 {:id 1728106831, :node "n3", :role :follower}
 {:id 647672153, :role :candidate, :node "n2"}
 {:id 1982509157, :role :candidate, :node "n5"})
2020-03-06 20:10:18,514{GMT}	INFO	[jepsen node n4] jepsen.redis.db: n4 :removing n1 (id: 490819827)
2020-03-06 20:10:18,616{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	["n1" {"n4" "OK"}]

n4, of course, can't talk to anyone, so n1 continues happily along.

2020-03-06 20:10:21,551{GMT}	INFO	[jepsen worker 0] jepsen.util: 15	:ok	:txn	[[:r 5 [2 3 4]] [:r 4 [31 42]] [:append 5 15] [:append 3 62]]

Other nodes, like n3, can proxy transactions to n1 as well:

2020-03-06 20:10:24,353{GMT}	INFO	[jepsen worker 2] jepsen.util: 12	:ok	:txn	[[:r 3 [5 23 37 53 49 62 79]] [:append 4 78]]

We ask n4 to remove n3.

2020-03-06 20:10:28,721{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	"n3"
2020-03-06 20:10:28,757{GMT}	INFO	[jepsen worker 2] jepsen.util: 12	:ok	:txn	[[:r 4 [31 42 64 78 85 86 91 113 106]]]
2020-03-06 20:10:28,836{GMT}	INFO	[jepsen nemesis] jepsen.redis.nemesis: Current membership
 ({:id 490819827, :role :leader, :node "n1"}
 {:id 77395493, :node "n4", :role :leader}
 {:id 1728106831, :node "n3", :role :follower}
 {:id 647672153, :role :candidate, :node "n2"}
 {:id 1982509157, :role :candidate, :node "n5"})
2020-03-06 20:10:29,145{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	["n3" {"n4" "OK"}]

n4 decides that it's completed this removal, despite not being able to talk to anyone else. Here's n4's RAFT.INFO:

2020-03-06 20:10:29,263{GMT}	INFO	[clojure-agent-send-off-pool-50] jepsen.redis.db: :done-waiting-for-node-removal {:raft
 {:role :leader,
  :num_voting_nodes "1",
  :leader_id 77395493,
  :is_voting "yes",
  :node_id 77395493,
  :num_nodes 1,
  :state :up,
  :nodes nil,
  :current_term 3},
  ...

n4 decides n3 is gone...

2020-03-06 20:10:29,264{GMT}	INFO	[clojure-agent-send-off-pool-50] jepsen.redis.db: :removed n3 (id: 1728106831

And n1 too. This is a bit awkward, because n1 was (last we checked) the leader of the majority component, but we HAVE been waiting on its removal for some time.

2020-03-06 20:10:29,754{GMT}	INFO	[clojure-agent-send-off-pool-53] jepsen.redis.db: :removed n1 (id: 490819827

We're... oh THIS is weird. We're executing txns on node n4. n4 is still (to my knowledge!) partitioned away from all the other nodes! This is DEFINITELY weird, but it kind of makes sense when you realize that n4 thinks it's the only node in the cluster. Of COURSE it can execute operations!

2020-03-06 20:10:30,247{GMT}	INFO	[jepsen worker 3] jepsen.util: 8	:ok	:txn	[[:append 4 122] [:r 5 []] [:append 4 123] [:append 4 124]]
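
The arithmetic here is simple enough to spell out (illustrative Clojure, not anything from the test): commit quorum is a majority of the voting nodes a leader believes exist, and n4 now believes that's just itself.

(defn quorum
  "Smallest majority of a voting set of the given size."
  [num-voting-nodes]
  (inc (quot num-voting-nodes 2)))

(quorum 5) ;; => 3  (the real five-node cluster)
(quorum 1) ;; => 1  (n4's view after "removing" everyone else)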

Simultaneously, node n1 is still executing its own transactions. Note that it doesn't observe node n4's append of 122 here. This is a classic case of split-brain. This suggests to me that there's been a failure in joint consensus--perhaps n4 completed its local membership change before getting confirmation that a majority of the cluster had adopted the change?

2020-03-06 20:10:30,667{GMT}	INFO	[jepsen worker 0] jepsen.util: 15	:ok	:txn	[[:r 4 [31 42 64 78 85 86 91 113 106 119 120 121]]]

n4 steams on ahead with its own timeline for key 4:

2020-03-06 20:10:31,791{GMT}	INFO	[jepsen worker 3] jepsen.util: 8	:ok	:txn	[[:append 4 141] [:r 4 [122 123 124 141]] [:append 5 105]]

While n3 et al keep writing to (presumably) primary n1:

2020-03-06 20:10:33,117{GMT}	INFO	[jepsen worker 2] jepsen.util: 12	:ok	:txn	[[:append 5 114] [:append 3 170] [:append 3 171] [:append 4 152]]
...
2020-03-06 20:10:35,278{GMT}	INFO	[jepsen worker 2] jepsen.util: 12	:ok	:txn	[[:r 5 [2 3 4 15 24 40 56 91 114]] [:r 3 [5 23 37 53 49 62 79 91 92 111 112 124 125 151 170 171]] [:r 5 [2 3 4 15 24 40 56 91 114]] [:r 4 [31 42 64 78 85 86 91 113 106 119 120 121 152 161 162 163]]]

We ask n5 to leave (again). Some nodes still think it's here--but the request fails, because n4 thinks it already removed n5.

2020-03-06 20:10:39,250{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	"n5"
2020-03-06 20:10:39,369{GMT}	INFO	[jepsen nemesis] jepsen.redis.nemesis: Current membership
 ({:id 490819827, :role :leader, :node "n1"}
 {:id 77395493, :node "n4", :role :leader}
 {:id 1728106831, :node "n3", :role :follower}
 {:id 647672153, :role :candidate, :node "n2"}
 {:id 1982509157, :role :candidate, :node "n5"})
2020-03-06 20:10:39,578{GMT}	INFO	[jepsen node n4] jepsen.redis.db: n4 :removing n5 (id: 1982509157)
2020-03-06 20:10:39,680{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	["n5" {"n4" "node id does not exist"}]

n4 is still going!

2020-03-06 20:10:40,349{GMT}	INFO	[jepsen worker 3] jepsen.util: 8	:ok	:txn	[[:r 4 [122 123 124 141 195 196]] [:append 5 190] [:r 5 [105 139 142 157 190]] [:r 4 [122 123 124 141 195 196]]]
...
2020-03-06 20:10:43,987{GMT}	INFO	[jepsen worker 3] jepsen.util: 8	:ok	:txn	[[:r 4 [122 123 124 141 195 196]] [:append 6 15] [:r 4 [122 123 124 141 195 196]]]

While n1 diverges further:

2020-03-06 20:10:44,141{GMT}	INFO	[jepsen worker 0] jepsen.util: 15	:ok	:txn	[[:r 4 [31 42 64 78 85 86 91 113 106 119 120 121 152 161 162 163 189 180 202 236]] [:append 6 11] [:append 6 12] [:r 4 [31 42 64 78 85 86 91 113 106 119 120 121 152 161 162 163 189 180 202 236]]]

We resolve the partition:

2020-03-06 20:10:49,681{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:stop-partition	nil
2020-03-06 20:10:49,887{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:stop-partition	:network-healed

... and things proceed from there.

aphyr commented Mar 7, 2020

This isn't just a problem with follower-proxies; it happens even when only leaders serve requests. In addition, it looks like not every case involves complete data loss; sometimes there's a shared prefix of the history. That makes sense--it's consistent with what you'd expect from a split-brain scenario. I'm kind of surprised we haven't seen shared prefixes more often, though.

20200307T094114.000-0500.zip

                        :incompatible-order ({:key 9,
                                              :values [[9 215 225]
                                                       [9 90 105]]}
                                             {:key 10,
                                              :values [[147]
                                                       [167 168]]})}},
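
For clarity, :incompatible-order here roughly means the checker saw two version orders for the same key where neither is a prefix of the other--i.e. the history forked. A toy Clojure version of that test (not Jepsen's actual implementation):

(defn prefix? [a b]
  (= (vec a) (vec (take (count a) b))))

(defn incompatible-order? [a b]
  (not (or (prefix? a b) (prefix? b a))))

(incompatible-order? [9 215 225] [9 90 105])  ;; => true, a forked history
(incompatible-order? [147] [147 167 168])     ;; => false, one extends the other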

yossigo commented Mar 7, 2020

@aphyr I think most, if not all, of this traces back to having inconsistent information about node membership. I've tried to identify the point in time where the first violation happens and possibly cascades into everything else, but it's hard to tell based on this info alone.

If you can run with loglevel=debug as an additional Raft module argument, we'll have a lot more to work with.

Membership changes are done according to the Raft paper, so basically every change is propagated as a log entry and becomes effective when it is considered committed. This is why, at the beginning of the sequence, removing n2 on the isolated n1 should never converge: when n1 re-joins the cluster as a follower, it should reconcile its log and invalidate this entry.

aphyr commented Mar 9, 2020

I've refined the test case; we can reproduce the issue reliably in 10 seconds now. I've also introduced loglevel=debug; here's a test run with those logs: 20200309T134137.000-0400.zip

Membership changes are done according to the Raft paper, so basically every change is propagated as log entries and becomes effective when it is considered committed.

As I described above, this does not appear to be the case! Completely isolated nodes are perfectly happy to remove other nodes, will claim those nodes are no longer members of the cluster, and proceed to execute transactions without ever talking to the rest of the cluster. Here, for instance, is node n1's local RAFT.INFO state, as observed after n1 was placed into its own network component and asked to remove n2, n3, n4, and n5:

 {:role :leader,
  :num_voting_nodes "1",
  :leader_id 651090454,
  :is_voting "yes",
  :node_id 651090454,
  :num_nodes 1,
  :state :up,
  :nodes nil,
  :current_term 1},

As you can see, it reports no other nodes, and thinks there's only one voting node in the cluster--itself! Perhaps nodes are applying membership changes before receiving confirmation from a majority of their peers? Or failing to enter joint consensus in the first place?

aphyr commented Mar 10, 2020

It looks like this problem is independent of whether or not we use MULTI/EXEC transactions--it affects single reads and writes as well.

yossigo commented Mar 10, 2020

Traced this to a bug in the Raft library code, which failed to enforce that remove changes are handled one at a time (i.e. require joint consensus per operation).
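
Sketched in Clojure rather than the library's C (the entry shapes here are hypothetical, just to show the invariant): a leader should refuse to start a new membership change while an earlier configuration-change entry is still uncommitted.

(defn pending-config-change?
  "True if any entry past the commit index is a configuration change."
  [{:keys [log commit-index]}]
  (boolean (some #(= :config-change (:type %))
                 (drop commit-index log))))

(defn may-start-config-change? [raft-state]
  (not (pending-config-change? raft-state)))

(may-start-config-change?
  {:commit-index 1
   :log [{:type :normal}
         {:type :config-change, :op [:remove "n2"]}]})
;; => false -- the removal of n2 hasn't committed yet, so a second
;;    change must wait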

yossigo added a commit that referenced this issue Mar 10, 2020
Fix for multiple concurrent config changes (issue #17)
aphyr commented Mar 10, 2020

Nice!

aphyr commented Mar 11, 2020

It looks like this is fixed as of 8da0c77--though initially I had some issues with tests failing, I think because the submodule directory wasn't getting rebuilt properly. Right now we run:

git submodule init
git submodule update
make clean
make

Do we... also need to make clean in deps individually, or should that be handled by the makefile? Or is there something else I'm missing?

yossigo commented Mar 11, 2020

The Makefile should take care of that, but you'll need to run make cleanall. I should have indicated that before--my bad.

aphyr closed this as completed Mar 12, 2020
sjpotter added a commit to sjpotter/redisraft that referenced this issue Oct 7, 2021
current_leader points to the node object corresponding to the leader node, but we are about to free() it. To prevent an access after free, we have to unset it here.