Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Total data loss on failover #14

Closed
aphyr opened this issue Mar 3, 2020 · 2 comments
Closed

Total data loss on failover #14

aphyr opened this issue Mar 3, 2020 · 2 comments

Comments

@aphyr
Copy link

aphyr commented Mar 3, 2020

With Redis f88f866 and Redisraft 1b3fbf6, every failover results in the complete loss of all data in the cluster. This can occur both in healthy clusters due to sporadic leader elections, by pausing or killing nodes, network partitions, etc. redis monitor does show leader-executed commands on followers, but I'm not clear if that's being proxied from the Raft leader, replicated but not actually applied to the local state machine, applied but then destroyed somehow, or what.

For example:

$ ssh n1 /opt/redis/redis-cli set x foo
OK
$ ssh n1 /opt/redis/redis-cli set y bar
OK
$ ssh n1 /opt/redis/redis-cli set z baz
OK
$ ssh n1 /opt/redis/redis-cli set t xyzzy

^C⏎

Here, a leader election occurred during our set of t. We retry on n2...

$ ssh n1 /opt/redis/redis-cli set t xyzzy
MOVED 192.168.122.12:6379

$ ssh n2 /opt/redis/redis-cli set t xyzzy
OK

And lo and behold, all the keys we wrote before are gone.

$ ssh n2 /opt/redis/redis-cli keys "'*'"
t

If we kill n2 as a leader, even that disappears.

$ ssh n2 killall -9 redis-server
$ ssh n3 /opt/redis/redis-cli keys "'*'"
__raft_snapshot_info__
yossigo added a commit that referenced this issue Mar 4, 2020
@yossigo
Copy link
Collaborator

yossigo commented Mar 4, 2020

Your observation is correct, the problem in this (and the previous) case is with applying the commands to Redis. A missing re-entrancy check causes the applied commands to be re-raftized (i.e. sent again to the Raft log instead of to the FSM).

@aphyr
Copy link
Author

aphyr commented Mar 4, 2020

Confirmed--this looks fixed as of d589127

@aphyr aphyr closed this as completed Mar 4, 2020
sjpotter added a commit to sjpotter/redisraft that referenced this issue Oct 7, 2021
raft_get_last_log_term() is used for filling in vote requests, except it didn't match 100% with the logic in should_grant_vote

* update test to prove that change works as needed

previous test tested when we don't snapshot everything, but code was broken when we did.
modified test tested only when we snapshot everything, now we do both in sequence.

* use raft_get_last_log_term in __should_grant_vote()

also modify __should_grant_vote to take a raft_server_t instead of raft_server_privat_t so its not constantly casting it to void.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants