Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transient empty values on process restart #18

Closed
aphyr opened this issue Mar 9, 2020 · 7 comments
Closed

Transient empty values on process restart #18

aphyr opened this issue Mar 9, 2020 · 7 comments

Comments

@aphyr
Copy link
Collaborator

aphyr commented Mar 9, 2020

It looks as though redis-raft (redis f88f866, redisraft d589127) can return empty reads when processes crash and restart. For instance, in this test, killing all nodes, then restarting all nodes, then killing n1, n3, and n4, allows n2, five seconds later, to return an empty list [] for key 0, rather than [4 5 7 26 30 31 48 49 84 86 87 89]:

2020-03-09 17:50:31,885{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:ok	:txn	[[:append 0 89] [:r 0 [4 5 7 26 30 31 48 49 84 86 87 89]]]
2020-03-09 17:50:32,155{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:ok	:txn	[[:append 0 97]]
2020-03-09 17:50:32,157{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:kill	:all
2020-03-09 17:50:34,368{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:kill	{"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}
2020-03-09 17:50:35,368{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start	:all
2020-03-09 17:50:35,575{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start	{"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}
2020-03-09 17:50:36,575{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:kill	:majority
2020-03-09 17:50:38,784{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:kill	{"n1" "", "n3" "", "n4" ""}
2020-03-09 17:50:39,784{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start	:all
2020-03-09 17:50:39,991{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start	{"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}
2020-03-09 17:50:40,418{GMT}	INFO	[jepsen worker 1] jepsen.util: 6	:ok	:txn	[[:r 0 []]]

The correct values show up in immediately following reads on the same node, so I don't think the data is gone per se. It might just be that Redis is allowed to serve requests before Raft initialization has completed?

2020-03-09 17:50:40,662{GMT}	INFO	[jepsen worker 1] jepsen.util: 6	:ok	:txn	[[:append 2 115] [:r 1 [14 20 29 30 37 38 44 47 50 52 53 54 56 77 82 96 97 129]] [:r 0 [4 5 7 26 30 31 48 49 84 86 87 89 97]] [:r 0 [4 5 7 26 30 31 48 49 84 86 87 89 97]]]
@aphyr
Copy link
Collaborator Author

aphyr commented Mar 9, 2020

OK, I've narrowed this down to process kills. We can reproduce it with jepsen.redis version 17b860bce6bab33cd1eb68e8b647fd94722c284d by running something like...

lein run test-all --raft-version d589127 -w append --time-limit 200 --nemesis-interval 1 --nemesis kill --test-count 5

All of the anomalies we observe hinge on observing an empty, rather than a stale, state, which makes me suspect this is something to do with the initialization process.

20200309T190930.000-0400.zip

@aphyr aphyr changed the title Transient empty values Transient empty values on process restart Mar 9, 2020
@aphyr
Copy link
Collaborator Author

aphyr commented Mar 10, 2020

I'm also seeing what look like stale reads here, so it's possible process crashes and restarts can give you stale states--but I can't distinguish that behavior from #19. I think it's just something redisraft does normally.

@aphyr
Copy link
Collaborator Author

aphyr commented Mar 10, 2020

I can also confirm this behavior occurs in single-operation transactions--e.g. without using MULTI/EXEC.

@aphyr
Copy link
Collaborator Author

aphyr commented Mar 11, 2020

It looks a little less frequent now, but I'm still seeing this behavior in 8da0c77 with process kills: 20200311T132956.000-0400 (1).zip

@yossigo
Copy link
Collaborator

yossigo commented Mar 16, 2020

Did this still resurface after dfd91d4?

@aphyr
Copy link
Collaborator Author

aphyr commented Mar 16, 2020

We're definitely still seeing stale reads, but it's less frequent now, and could be caused by #26--I'm OK closing this, since you did make some changes, and opening a fresh ticket if we can confirm stale reads later.

@aphyr aphyr closed this as completed Mar 16, 2020
@aphyr
Copy link
Collaborator Author

aphyr commented Mar 17, 2020

Yeah, I've had the chance to run a longer test series with process kills, and this bug looks definitively squashed with dfd91d4. Good work!

sjpotter pushed a commit to sjpotter/redisraft that referenced this issue Oct 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants