Server crashes with callRaftPeriodic: Assertion `ret == 0' failed. #42
Comments
yossigo added a commit that referenced this issue on May 8, 2020:
When a snapshot is taken, the log is rewritten to begin right after the last applied entry. During the switch to the new log, its index was truncated, making all existing entries inaccessible: an attempt to access them would assert. Note: truncation *is* generally desired, because the log and the index cannot be atomically updated in the filesystem. For this reason, we generally truncate and re-build the index when opening the log on process start. In this case, however, the index should have been reused (or at least re-built, although there's really no reason to do that). This seems to be the root cause of #42 and possibly #43 as well.
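A minimal, hypothetical sketch of the failure mode this commit message describes, with made-up names (LogIndex, index_get) standing in for the actual redisraft structures: once the log has been rewritten after a snapshot, any entries it still contains must be covered by the index, otherwise a lookup trips an assertion much like the one in this issue.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical index structure with simplified names; not the actual
 * redisraft types. */
typedef struct {
    long first_idx;   /* raft index of the first indexed entry */
    long count;       /* number of entries covered by the index */
} LogIndex;

/* A lookup of the kind that fired the assert: if the index was truncated
 * while the rewritten log still holds entries, the bounds check fails. */
static void index_get(const LogIndex *idx, long raft_idx)
{
    assert(raft_idx >= idx->first_idx &&
           raft_idx < idx->first_idx + idx->count);
    printf("entry %ld found\n", raft_idx);
}

int main(void)
{
    /* Log rewritten after a snapshot: entries 101..150 remain on disk. */
    LogIndex rebuilt   = { .first_idx = 101, .count = 50 };  /* reused/rebuilt */
    LogIndex truncated = { .first_idx = 101, .count = 0  };  /* the bug */

    index_get(&rebuilt, 120);     /* fine */
    index_get(&truncated, 120);   /* asserts, like the crash in this issue */
    return 0;
}
```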
It looks like this is also present in 2d1cf30: 20200509T130823.000-0400.zip
I've narrowed this down to partitions and crashes alone:
I haven't reproduced this in e0123a9; I think it might be fixed!
Nope, spoke too soon. This recurred in later testing with higher concurrency on e0123a9. With jepsen.redis ddb68c8961a8f2b1cb0b37f263bd1eb0930ab5b8, try:
yossigo added a commit to yossigo/redisraft that referenced this issue on Jul 14, 2020:
The API specifies that the index passed on reset is the one that will be used for the *next* entry. As a result, the truncated log was created with an off-by-one header in handleLoadSnapshot(). However, RaftLogCreate() was immediately called with the proper index, masking the problem. If the process was restarted in between, the corrupt header would be read, leaving the log index one entry ahead of the real log. raft_periodic() would then be called, attempt to read the last log entry, fail, and trigger an assert. This fixes the last (known?) case that manifested as RedisLabs#42.
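A small illustration of the off-by-one, using hypothetical variable names rather than the actual redisraft on-disk header format: when a value meant as "index of the next entry" ends up in a field that readers treat as "index of the last entry", the header is one entry ahead of the real log, and reading the supposed last entry fails after a restart.

```c
#include <stdio.h>

int main(void)
{
    long snapshot_last_idx = 1000;           /* last entry covered by the snapshot */
    long next_idx = snapshot_last_idx + 1;   /* value passed on reset, per the
                                                "index of the next entry" contract */

    /* The bug: a header field that readers interpret as the index of the
     * *last* entry is filled with next_idx instead. */
    long header_last_idx = next_idx;

    /* After a restart the reader trusts the header, but the truncated log
     * holds nothing past the snapshot, so its real last index is still 1000. */
    long real_last_idx = snapshot_last_idx;

    printf("header says last index %ld, real last index %ld: reading the "
           "\"last\" entry fails and the periodic callback asserts\n",
           header_last_idx, real_last_idx);
    return 0;
}
```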
yossigo added a commit that referenced this issue on Jul 15, 2020:
The API specifies that the index passed on reset is the one that will be used for the *next* entry. As a result, the truncated log was created with an off-by-one header in handleLoadSnapshot(). However, RaftLogCreate() was immediately called with the proper index, masking the problem. If the process was restarted in between, the corrupt header would be read, leaving the log index one entry ahead of the real log. raft_periodic() would then be called, attempt to read the last log entry, fail, and trigger an assert. This fixes the last (known?) case that manifested as #42.
sjpotter added a commit to sjpotter/redisraft that referenced this issue on Oct 7, 2021:
Before, virtraft purposefully used simplex partitions (i.e. traffic was blocked from flowing in only one direction). However, for read_queue tests and their correctness, we have to ensure partitions are duplex (i.e. neither side can talk to the other). Otherwise we can end up in a situation where the leader heartbeats every node and the majority of nodes receive the ping, but the majority cannot reply back. If there were a bug in libraft's calculation, simply looking at the other nodes would make everything appear fine, since they got the heartbeat.
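A toy sketch of the difference, with made-up arrays standing in for virtraft's network model: under a one-way (simplex) partition every follower still receives the heartbeat, so only counting the acknowledgements that actually make it back to the leader reveals the missing quorum.

```c
#include <stdbool.h>
#include <stdio.h>

#define NODES 5   /* node 0 is the leader */

int main(void)
{
    /* Simplex partition: heartbeats reach every follower, but replies from
     * followers 1-3 are dropped on the way back. */
    bool leader_to_follower[NODES] = { true, true, true, true, true };
    bool follower_to_leader[NODES] = { true, false, false, false, true };

    int got_heartbeat = 0, acks_at_leader = 1;   /* leader counts itself */
    for (int i = 1; i < NODES; i++) {
        if (leader_to_follower[i])
            got_heartbeat++;
        if (leader_to_follower[i] && follower_to_leader[i])
            acks_at_leader++;
    }

    /* Looking only at the followers, all 4 saw the heartbeat and everything
     * seems fine; counting acks that actually reached the leader shows only
     * 2 of 5, i.e. no quorum. */
    printf("followers that got the heartbeat: %d/4, acks back at leader: %d/5\n",
           got_heartbeat, acks_at_leader);
    return 0;
}
```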
aphyr commented May 5, 2020
Under normal operation (e.g. without membership changes, partitions, etc.) and with proxy mode enabled, redis f88f866 (redis-raft b9ee410) servers can spontaneously crash, logging:
It looks like this node was a leader at 05:45:16, took a snapshot, discovered another node won an election and stepped down, just prior to the ret == 0 assertion failing. The node crashes at that point; active clients get an EOF on the socket, and subsequent connections are refused. Here's a complete log from Jepsen 6c063da0503e9430a0354d37da8a4da430d1671e:
20200505T084148.000-0400.zip
This particular line suggests that maybe... the snapshot process exited with some sort of unexpected state. I wonder if maybe the stepdown occurring immediately after/during the snapshot might have been to blame...
redisraft/raft.c, line 784 in b9ee410