Server crashes with callRaftPeriodic: Assertion `ret == 0' failed. #42

Closed
aphyr opened this issue May 5, 2020 · 4 comments · Fixed by #60

aphyr commented May 5, 2020

Under normal operation (e.g. without membership changes, partitions, etc.) and with proxy mode enabled, redis f88f866 (redis-raft b9ee410) servers can spontaneously crash, logging:

22371:05 May 05:46:37.558 node:1060682460: raft.c:342: raft_recv_appendentries_response failed, error -2
22464:C 05 May 2020 05:46:37.560 * DB saved on disk
22371:05 May 05:46:37.649 Snapshot created, 4 log entries rewritten to log.
22371:05 May 05:46:37.649 Snapshot operation completed successfuly.
22371:05 May 05:46:37.654 Log rewrite complete, 0 entries appended (from idx 9174).
22371:05 May 05:46:37.655 <raftlib> end snapshot base:9173 commit-index:9174 current-index:9178
22371:05 May 05:46:38.745 node:319092381: raft.c:479: <raftlib> recvd appendentries t:11 ci:9178 lc:9176 pli:9174 plt:9 #2
22371:05 May 05:46:38.748 <raftlib> becoming follower
22371:05 May 05:46:38.748 State change: Node is now a follower, term 11
22371:05 May 05:46:38.748 <raftlib> randomize election timeout to 1922
22371:05 May 05:46:38.748 node:319092381: node.c:361: NodeAddPendingResponse: id=14638, type=proxy, request_time=1588682798748
22371:05 May 05:46:38.748 node:1598587288: raft.c:479: <raftlib> node requested vote: 1585459200 replying: not granted
22371:05 May 05:46:38.748 node:1060682460: raft.c:479: <raftlib> node requested vote: 1585459296 replying: not granted
22371:05 May 05:46:38.748 node:201926373: raft.c:479: <raftlib> node requested vote: 1640037424 replying: not granted
22371:05 May 05:46:38.755 node:319092381: node.c:377: NodeDismissPendingResponse: id=14638, type=proxy, latency=7
22371:05 May 05:46:38.795 node:319092381: node.c:361: NodeAddPendingResponse: id=14639, type=proxy, request_time=1588682798795
22371:05 May 05:46:38.806 node:319092381: node.c:377: NodeDismissPendingResponse: id=14639, type=proxy, latency=11
22371:05 May 05:46:38.848 node:319092381: raft.c:479: <raftlib> recvd appendentries t:11 ci:9178 lc:9179 pli:9174 plt:9 #5
redis-server: raft.c:784: callRaftPeriodic: Assertion `ret == 0' failed.

It looks like this node was a leader at 05:45:16, took a snapshot, discovered another node had won an election, and stepped down just prior to the ret == 0 assertion failing. The node crashes at that point; active clients get an EOF on the socket, and subsequent connections are refused. Here's a complete log from Jepsen 6c063da0503e9430a0354d37da8a4da430d1671e:

lein run test-all --raft-version b9ee410 --time-limit 600 --nemesis none --follower-proxy --test-count 16

20200505T084148.000-0400.zip

This particular line suggests the snapshot process exited with some sort of unexpected state. I wonder if the stepdown occurring during or immediately after the snapshot might have been to blame:

assert(ret == 0);
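
For context, here is a minimal sketch of the pattern behind that assertion (not redisraft's exact code; only raft_periodic() is the real libraft call): the periodic driver invokes libraft on a timer and treats any nonzero return value as fatal, so an internal error such as a failed read of the last log entry surfaces as the abort seen in the log above.

#include <assert.h>
#include "raft.h"  /* libraft: raft_server_t, raft_periodic() */

/* Sketch only -- the real callRaftPeriodic() at raft.c:784 does more. */
static void call_raft_periodic_sketch(raft_server_t *raft, int msec_since_last_call)
{
    int ret = raft_periodic(raft, msec_since_last_call);
    assert(ret == 0);  /* the assertion that fires in the crash log */
}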

yossigo added a commit that referenced this issue May 8, 2020
When a snapshot is taken, the log is rewritten to begin right after the last applied entry. During the switch to the rewritten log, the log's index was truncated, making all existing entries inaccessible; an attempt to access them would assert.

Note: truncation *is* generally desired, because the log and the index cannot be updated atomically in the filesystem. For this reason, we generally truncate and re-build the index when opening the log on process start. In this case, however, the index should have been reused (or at least re-built, although there's really no reason to do that).

This seems to be the root cause of #42, and possibly #43 as well.
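
To make the failure mode concrete, here is a minimal sketch (hypothetical types and names, not redisraft's actual code) of the kind of index the commit message describes: a map from Raft entry index to file offset. If that index is emptied while the old entries are still referenced, every lookup of an existing entry fails, and the error propagates up to the ret == 0 assert in callRaftPeriodic().

#include <sys/types.h>

/* Hypothetical in-memory index over the on-disk Raft log. */
typedef struct {
    long first_idx;  /* Raft index of the first entry covered */
    long count;      /* number of entries currently indexed */
    off_t *offsets;  /* offsets[i] is the file offset of entry first_idx + i */
} log_index_t;

/* Returns the file offset of entry `idx`, or -1 if the index does not
 * cover it. After the unwanted truncation, count is 0, so every lookup
 * of a pre-existing entry fails here instead of being served. */
static off_t log_index_lookup(const log_index_t *li, long idx)
{
    if (li->count == 0 || idx < li->first_idx || idx >= li->first_idx + li->count)
        return (off_t)-1;
    return li->offsets[idx - li->first_idx];
}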

aphyr commented May 9, 2020

It looks like this is also present in 2d1cf30: 20200509T130823.000-0400.zip


aphyr commented May 10, 2020

I've narrowed this down to partitions and crashes alone:
20200510T082549.000-0400.zip


aphyr commented Jun 15, 2020

I haven't reproduced this in e0123a9; I think it might be fixed!

aphyr closed this as completed Jun 15, 2020

aphyr commented Jun 15, 2020

Nope, spoke too soon. This recurred in later testing with higher concurrency on e0123a9:
20200615T170005.000-0400.zip

With jepsen.redis ddb68c8961a8f2b1cb0b37f263bd1eb0930ab5b8, try:

lein run test-all --time-limit 600 --nemesis kill,partition --follower-proxy --test-count 16 --concurrency 4n

aphyr reopened this Jun 15, 2020
yossigo added a commit to yossigo/redisraft that referenced this issue Jul 14, 2020
The API specifies that the index passed on reset is the one that will be used for the *next* entry.

As a result, the truncated log was created with an off-by-one header in handleLoadSnapshot(). However, RaftLogCreate() was immediately called with the proper index, masking the problem.

If the process was restarted in between, the corrupt header would be read, leaving the log index one entry ahead of the real log. raft_periodic() would then be called, attempt to read the last log entry, fail, and trigger an assert.

This fixes the last (known?) case that manifested as RedisLabs#42.
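
A minimal sketch of the reset semantics at issue (illustrative names only; the real code paths are handleLoadSnapshot() and RaftLogCreate()): the index written into the new log's header must be the index the *next* appended entry will receive, i.e. one past the snapshot's last included index. A header that is off by one is harmless while the process keeps running, but becomes visible once a restart re-reads it.

/* Hypothetical log header and reset helper (not redisraft's actual API). */
typedef struct {
    long next_index;  /* index the next appended entry will receive */
    long term;        /* term of the snapshot's last included entry */
} log_header_t;

/* After loading a snapshot whose last included index is `snap_idx`, the
 * header must point at snap_idx + 1 -- the next entry's index. The bug
 * described above left the header one entry ahead of the real log, which
 * only surfaced after a process restart re-read it. */
static void log_reset_sketch(log_header_t *hdr, long snap_idx, long snap_term)
{
    hdr->next_index = snap_idx + 1;
    hdr->term = snap_term;
}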
yossigo added a commit that referenced this issue Jul 15, 2020
sjpotter added a commit to sjpotter/redisraft that referenced this issue Oct 7, 2021
Previously, virtraft deliberately used simplex partitions (i.e. traffic was blocked from flowing in one direction only).

However, for the read_queue tests to be correct, partitions must be duplex (i.e. neither side can talk to the other).

Otherwise we can end up in a situation where the leader heartbeats every node and a majority of the nodes receive the ping, but that majority cannot reply. If there were a bug in libraft's quorum calculation, simply looking at the other nodes would make everything appear fine, since they did receive the heartbeat.
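
To illustrate the distinction described above, here is a small sketch (illustrative names, neither virtraft nor redisraft code) of the two quantities in play: the heartbeat acknowledgements the leader actually received, versus the followers that merely received the heartbeat. Under a simplex partition the second quantity can reach a majority while the first cannot, so a quorum calculation that looked at the wrong one would appear correct; a duplex partition removes that ambiguity.

#include <stdbool.h>

/* Illustrative cluster view, not an actual virtraft/redisraft structure. */
struct heartbeat_view {
    int num_nodes;           /* total cluster size, including the leader */
    int acks_received;       /* heartbeat responses that reached the leader */
    int nodes_got_heartbeat; /* followers that received the heartbeat */
};

/* What the leader must verify: a majority (counting itself) acknowledged. */
static bool leader_confirmed(const struct heartbeat_view *v)
{
    return v->acks_received + 1 > v->num_nodes / 2;
}

/* What checking only the followers would show. Under a simplex partition
 * (heartbeats delivered, replies blocked) this can be true while
 * leader_confirmed() is false, hiding a bug in the quorum calculation. */
static bool followers_look_fine(const struct heartbeat_view *v)
{
    return v->nodes_got_heartbeat + 1 > v->num_nodes / 2;
}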