Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stale reads in normal operation #19

Closed
aphyr opened this issue Mar 10, 2020 · 7 comments
Closed

Stale reads in normal operation #19

aphyr opened this issue Mar 10, 2020 · 7 comments

Comments

@aphyr
Copy link
Collaborator

aphyr commented Mar 10, 2020

It looks as though redis f88f866 & redisraft d589127 exhibit stale reads when processes are allowed to pause. For instance, take a look at this test run

lein run test-all --raft-version d589127 -w append --time-limit 120 --nemesis-interval 1 --nemesis pause --test-count 5

Screenshot from 2020-03-09 21-45-58

Let:
T1 = {:type :ok, :f :txn, :value [[:r 3 [10]]], :process 3, :time 17433307265, :index 1248}
T2 = {:type :ok, :f :txn, :value [[:r 3 [10]] [:r 3 [10]] [:append 3 17]], :process 4, :time 16707553261, :index 1182}

Then:

  • T1 < T2, because T1 did not observe T2's append of 17 to 3.
  • However, T2 < T1, because T2 completed at index 1182, 0.724 seconds before the invocation of T1, at index 1247: a contradiction!
@aphyr
Copy link
Collaborator Author

aphyr commented Mar 10, 2020

Just to make sure this wasn't related to MULTI/EXEC, I set up a variant with only one key and no transactions at all; it still shows pretty common instances of stale reads.

20200309T215158.000-0400.zip

Screenshot from 2020-03-09 21-58-08

@aphyr
Copy link
Collaborator Author

aphyr commented Mar 10, 2020

Huh, this... happens in normal operation too, without any faults!

20200309T215905.000-0400.zip

With jepsen.redis 9a1a944...

lein run test-all --raft-version d589127 -w append --time-limit 120 --nemesis-interval 1 --nemesis none --test-count 5 --key-count 1 --max-txn-length 1

Let:
T1 = {:type :ok, :f :txn, :value [[:r 1 [5 8 9]]], :process 3, :time 14427256011, :index 1174}
T2 = {:type :ok, :f :txn, :value [[:append 1 11]], :process 0, :time 11170240824, :index 1077}

Then:

  • T1 < T2, because T1 did not observe T2's append of 11 to 1.
  • However, T2 < T1, because T2 completed at index 1077, 3.255 seconds before the invocation of T1, at index 1173: a contradiction!

This one's particularly weird because it's like... 3+ seconds later, and we did get to see the write completed a few hundred millis prior by the same process!

2020-03-09 21:57:11,565{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:invoke	:txn	[[:append 1 9]]
2020-03-09 21:57:11,579{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:ok	:txn	[[:append 1 9]]
...
2020-03-09 21:57:11,696{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:invoke	:txn	[[:append 1 11]]
...
2020-03-09 21:57:11,714{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:ok	:txn	[[:append 1 11]]
...
2020-03-09 21:57:14,969{GMT}	INFO	[jepsen worker 3] jepsen.util: 3	:invoke	:txn	[[:r 1 nil]]
2020-03-09 21:57:14,971{GMT}	INFO	[jepsen worker 3] jepsen.util: 3	:ok	:txn	[[:r 1 [5 8 9]]]

@aphyr aphyr changed the title Stale reads with process pauses? Stale reads in normal operation Mar 10, 2020
@yossigo
Copy link
Collaborator

yossigo commented Mar 10, 2020

The logs tell the same story, apparently an issue with the commit index right after re-election. The same issue may indeed be the cause for #18 as well - yet to be seen.

@aphyr
Copy link
Collaborator Author

aphyr commented Mar 11, 2020

This looks fixed in 8da0c77, at least so far!

@aphyr
Copy link
Collaborator Author

aphyr commented Mar 11, 2020

Well, frequency is diminished, but I'm still seeing stale reads with process pauses: 20200311T132956.000-0400.zip

@yossigo
Copy link
Collaborator

yossigo commented Mar 16, 2020

Did this still resurface after dfd91d4?

@aphyr
Copy link
Collaborator Author

aphyr commented Mar 16, 2020

OK, yeah, we're still seeing what look like stale reads with pauses, partitions, etc, but I think whatever bug caused them to happen under normal operation may have been resolved. I'm gonna flag this one closed as of dfd91d4.

@aphyr aphyr closed this as completed Mar 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants