Empty reads with membership changes, crashes #30

Closed
aphyr opened this issue Mar 24, 2020 · 4 comments

@aphyr
Collaborator

aphyr commented Mar 24, 2020

Still working on narrowing this down, but I have two cases with version f88f866 where it looks like Redis started returning empty values for LRANGE reads after starting up all nodes. This is fairly infrequent, so these clusters have been through the wringer. I'm trying to narrow down behavior and get a more reproducible test pattern.

20200323T201814.000-0400.zip

aphyr changed the title from "Empty reads with membership changes, partitions, crashes, pauses" to "Empty reads with membership changes, crashes" on Mar 24, 2020
@aphyr
Collaborator Author

aphyr commented Mar 24, 2020

Here's another case: http://jepsen.io.s3.amazonaws.com/analyses/redis-raft-1b3fbf6/20200323T181253.000-0400.zip.

  • 13:06 Kill all nodes
  • 13:13 Start all nodes
  • 13:20 Ask n5 to remove n3; completes OK
  • 13:22 Kill n3, n4, n5
  • 13:26 Start all nodes
  • 13:28 Ask n5 to remove n1
  • 13:35 Kill primary n5
  • 13:39 Start all nodes
  • 13:40 Kill n2, n3, n5
  • 13:46 Start all nodes
  • 13:47 Kill n1, n2, n5
  • 13:51 Ask n4 to remove n5
  • 13:52 Start all nodes
  • 13:53 Ask n4 to remove n2
  • 13:55 Join n3
  • 13:59 Join n1
  • 14:06 Kill all nodes
  • 14:11.073 Start all nodes
  • 14:11.813 Inconsistency!

@aphyr
Collaborator Author

aphyr commented Mar 24, 2020

I've been trying to narrow down the conditions under which this happens, and I've got a smaller failing case. This one's a G-single-realtime: another empty stale read, it looks like.

20200324T162838.000-0400.zip

G-single-realtime #0
Let:
T1 = {:type :ok, :f :txn, :value [[:r 37 []]], :process 88, :time 41914364653, :index 6197}
T2 = {:type :ok, :f :txn, :value [[:append 33 88] [:append 37 5]], :process 78, :time 36407315109, :index 6060}
T3 = {:type :ok, :f :txn, :value [[:r 33 [6 9 11 12 15 17 19 20 25 26 28 29 30 32 33 34 38 39 41 42 45 47 48 54 58 60 61 62 63 64 65 69 70 71 72 73 74 76 78 79 82 85 86 87 88]] [:append 36 56]], :process 93, :time 36626681784, :index 6102}

Then:

  • T1 < T2, because T1 observed the initial (nil) state of 37, which T2 created by appending 5.
  • T2 < T3, because T3 observed T2's append of 88 to key 33.
  • However, T3 < T1, because T3 completed at index 6102, 5.263 seconds before the invocation of T1, at index 6187: a contradiction!
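To make the contradiction concrete, here is a minimal Python sketch that encodes the three orderings above as directed edges and searches for a cycle. It is not how the checker itself works; the edge labels (`rw`, `wr`, `realtime`) are my reading of the three bullets, and `find_cycle` is a hypothetical helper written for this illustration.

```python
# Edges transcribed from the explanation above: (earlier, later, reason).
# The reason labels are an interpretation, not checker output.
edges = [
    ("T1", "T2", "rw"),        # T1 read key 37 before T2's append of 5
    ("T2", "T3", "wr"),        # T3 observed T2's append of 88 to key 33
    ("T3", "T1", "realtime"),  # T3 completed before T1 was invoked
]


def find_cycle(edges):
    """Return one cycle as a list of nodes (first == last), or None."""
    graph = {}
    for src, dst, _reason in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    state = {}  # node -> "visiting" or "done"
    path = []   # current depth-first search path

    def visit(node):
        state[node] = "visiting"
        path.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                # Back edge: the cycle is the path from nxt to here.
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = "done"
        return None

    for node in graph:
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


cycle = find_cycle(edges)
print(cycle)
```

Dropping any one of the three edges (for example `find_cycle(edges[:2])`) leaves the graph acyclic, which is why the realtime edge is what turns this into an anomaly.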

It also looks like this problem was transient, because we went back to reading what looks like correct values afterwards:

2020-03-24 16:29:27,309{GMT}    INFO    [jepsen worker 13] jepsen.util: 63      :ok     :txn    [[:r 37 [5 9]]]
2020-03-24 16:29:27,383{GMT}    INFO    [jepsen worker 13] jepsen.util: 63      :ok     :txn    [[:r 37 [5 9 13 1 19 20]] [:append 36 55] [:append 4 9]]
2020-03-24 16:29:32,686{GMT}    INFO    [jepsen worker 13] jepsen.util: 88      :ok     :txn    [[:r 37 []]]
2020-03-24 16:29:32,802{GMT}    INFO    [jepsen worker 3] jepsen.util: 103      :ok     :txn    [[:r 37 [5 9 13 1 19 20]] [:r 33 [6 9 11 12 15 17 19 20 25 26 28 29 30 32 33 34 38 39 41 42 45 47 48 54 58 60 61 62 63 64 65 69 70 71 72 73 74 76 78 79 82 85 86 87 88]] [:r 33 [6 9 11 12 15 17 19 20 25 26 28 29 30 32 33 34 38 39 41 42 45 47 48 54 58 60 61 62 63 64 65 69 70 71 72 73 74 76 78 79 82 85 86 87 88]] [:append 37 46]]
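One rough way to spot this kind of transient empty read in a log is to check that every read of a key extends the longest list seen so far for that key. The Python sketch below does that for the key 37 reads in the log lines above. The names `reads` and `stale_reads` are hypothetical, the key 33 reads are left out for brevity, and treating log order as real-time order is a simplifying assumption that does not hold in general for concurrent histories.

```python
import re

# Matches a read micro-op such as "[:r 37 [5 9]]" or "[:r 37 []]".
READ = re.compile(r"\[:r (\d+) \[([\d ]*)\]\]")


def reads(txn):
    """Yield (key, values) for each read micro-op in a txn string."""
    for key, body in READ.findall(txn):
        yield int(key), [int(v) for v in body.split()]


def stale_reads(txns):
    """Flag reads that do not extend the longest list seen so far for a key.

    Assumes txns are listed in real-time order, which is only a heuristic.
    """
    longest = {}
    flagged = []
    for i, txn in enumerate(txns):
        for key, values in reads(txn):
            seen = longest.get(key, [])
            if values[:len(seen)] == seen:
                longest[key] = values
            else:
                flagged.append((i, key, values))
    return flagged


# Transactions transcribed from the log above (key 33 reads omitted).
txns = [
    "[[:r 37 [5 9]]]",
    "[[:r 37 [5 9 13 1 19 20]] [:append 36 55] [:append 4 9]]",
    "[[:r 37 []]]",
    "[[:r 37 [5 9 13 1 19 20]] [:append 37 46]]",
]

flagged = stale_reads(txns)
print(flagged)
```

Only the third transaction, the empty read by process 88, is flagged; the read that follows it sees the full list again.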

@aphyr
Collaborator Author

aphyr commented Mar 25, 2020

Another small case. Managed to rule out the kill of n1, n2, n5:

20200325T025726.000-0400.zip

Screenshot from 2020-03-25 09-36-38

@aphyr
Collaborator Author

aphyr commented Mar 26, 2020

This looks resolved as of 73ad833!

aphyr closed this as completed on Mar 26, 2020