Panic on startup: log snapshot index mismatch #43
With Jepsen 6c063da, Redis f88f866, and Redis-Raft b9ee410, testing with network partitions, process crashes, pauses, and membership changes resulted in nodes panicking on restart, due to a mismatch between log initial indices and snapshots.
In this test run, several keys exhibited split-brain anomalies: multiple timelines of a single key were concurrently observed by distinct subsets of nodes, resulting in the apparent loss of some, or all, committed writes. This issue (possibly a separate problem) is described in #44.
Minutes later, multiple nodes observed a gap between their snapshot and initial log index when restarted by Jepsen. Their logs also contained 0 entries. Immediately prior to the panics, Jepsen killed n2, n3, and n5, recovered from a network partition, isolated n2, n3, and n4 away from n1 and n5, then started all nodes.
At 10:39:39, Jepsen started n2, and it immediately panicked with
Subsequent restarts of n2 panicked as well.
n3 was also started at 10:39:39, and panicked too:
Two minutes later, at 10:41:35, Jepsen started n4, and it too panicked:
n1 and n5 did not panic.
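For readers unfamiliar with the invariant involved, here is a minimal sketch of the kind of startup check that fails in this scenario. It assumes the log is expected to begin exactly one entry after the snapshot's last included index. The names `Snapshot`, `Log`, and `check_log_matches_snapshot` are hypothetical, used for illustration only; they are not Redis-Raft's actual types or functions.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # Index of the last log entry folded into the snapshot (hypothetical field).
    last_included_index: int


@dataclass
class Log:
    # Index of the first entry the on-disk log can hold (hypothetical field).
    first_index: int
    # Number of entries currently stored in the log.
    num_entries: int


def check_log_matches_snapshot(snapshot, log):
    """Raise if the log does not begin right after the snapshot.

    Assumed invariant: log.first_index == snapshot.last_included_index + 1.
    """
    expected = snapshot.last_included_index + 1
    if log.first_index != expected:
        raise RuntimeError(
            f"log initial index {log.first_index} does not match snapshot "
            f"(expected {expected})"
        )


# A log that starts right after the snapshot passes, even with 0 entries.
check_log_matches_snapshot(Snapshot(last_included_index=10), Log(11, 0))
```

Under this assumption, an empty log is not a problem by itself; the check only fails when the log's starting index and the snapshot disagree.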
Since the split-brain happened minutes before the restart panics, the panics affected nodes on both sides of the split, and because it looks like we can have nodes panic without exhibiting split-brain, I'm tentatively gonna split this into two separate issues. Might turn out to be the same underlying cause, of course. It also looks like panics are pretty common, whereas split brain is less frequent.
Here's a couple cases with just panics:
When a snapshot is taken, the log is rewritten to begin right after the last applied entry. The process of switching the log truncated its index, making all existing entries inaccessible: any attempt to access them would trigger an assert. Note: truncation *is* generally desired, because the log and the index cannot be updated atomically in the filesystem. For this reason, we generally truncate and re-build the index when opening the log on process start. In this case, however, the index should have been reused (or at least re-built, although there's really no reason to do that). This seems to be the root cause for #42 and possibly #43 as well.
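To make the failure mode concrete, here is a rough sketch of a log with a separate offset index. It assumes a hypothetical length-prefixed entry format, not Redis-Raft's actual on-disk layout or API. It shows why a truncated index makes every entry unreachable even though the log data is intact, and how scanning the log re-builds the index.

```python
import io
import struct

# Hypothetical format: each entry is a 4-byte little-endian length + payload.
HEADER = struct.Struct("<I")


def append_entry(log, payload):
    """Append one length-prefixed entry to the end of the log."""
    log.seek(0, io.SEEK_END)
    log.write(HEADER.pack(len(payload)))
    log.write(payload)


def rebuild_index(log):
    """Scan the log from the start; return the byte offset of each entry."""
    offsets = []
    log.seek(0)
    while True:
        pos = log.tell()
        header = log.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # clean end of file, or a torn header
        (length,) = HEADER.unpack(header)
        payload = log.read(length)
        if len(payload) < length:
            break  # torn trailing entry: leave it out of the index
        offsets.append(pos)
    return offsets


def read_entry(log, index, n):
    """Read the n-th entry via the index; fails if the index lacks it."""
    log.seek(index[n])
    (length,) = HEADER.unpack(log.read(HEADER.size))
    return log.read(length)


log = io.BytesIO()
for payload in (b"set x 1", b"set y 2", b"set z 3"):
    append_entry(log, payload)

truncated = []              # index truncated: entries exist but are unreachable
index = rebuild_index(log)  # re-built index makes them accessible again
print(read_entry(log, index, 1))
```

With the `truncated` index, `read_entry` raises `IndexError`, which is the analogue of the assert described above. Reusing the index that was already valid at switch time would avoid the scan entirely.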