Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"master,fail" state not handled correctly #2249

Closed
lolodi opened this issue Sep 8, 2022 · 7 comments
Closed

"master,fail" state not handled correctly #2249

lolodi opened this issue Sep 8, 2022 · 7 comments

Comments

@lolodi
Copy link
Collaborator

lolodi commented Sep 8, 2022

Hello!
We saw instances where a node was in "master,fail" state but SE.Redis kept trying to connect to it ignoring that, although is marked as master, it is in failed state.

E.g. the cluster info command returned:
ecffa36f58a103c199314291a68a66406195da01 20.212.157.81:15014 master,fail - 1662670083503 1662670080944 78 connected
20c41293cb0d5e1781c532e783b415bd32c2fcf8 20.212.157.81:13015 myself,master - 0 1662670077000 77 connected 2048-2340 7510-7802 10240-10532 12970-13182

but the library kept trying to connect to the "master,fail" node.

@NickCraver
Copy link
Collaborator

What's the scenario here - e.g. is it a configured endpoint, or are we discovering it?

@lolodi
Copy link
Collaborator Author

lolodi commented Sep 14, 2022

This is on a clustered cache with discovery.

@NickCraver
Copy link
Collaborator

Gotcha - what would the expected behavior here be?

It is intentional that we try to connect to the node because we're told it exists and we're monitoring for the moment it comes back online (in the background). There's also the possibility of a cluster going split brained and we wouldn't know to talk to the winning half if we didn't observe this (corner case, we hope).

@lolodi
Copy link
Collaborator Author

lolodi commented Sep 15, 2022

I think my expectation in this situation, where one node was in 'master,fail' and the other (the one that is actually answering) is 'myself,master' would be to try to failover to the one that says it's 'myself, master' and disconnect from the other one, especially since it's status says 'fail'.
If both nodes of a shard are reporting as master, but one is fail and the other is not, I would expect the library to connect to the one not in fail state.

@NickCraver
Copy link
Collaborator

Connections are not per-shard though, they are per-server which has some number of shard responsibilities (which can also change on the fly). There can also me (and usually are) many masters in a cluster. We want to connect to what we're told is there as quickly as possible.

Overall though, this happens in the background and isn't meant to be noisy - what issue is it actually causing?

@lolodi
Copy link
Collaborator Author

lolodi commented Sep 16, 2022

We have had instances where the client kept the connection to the failed master for hours instead of switching over to the healthy one. Our understanding is that SE.Redis checks if the current node is still master, but doesn't verify if it also not in fail state. If a node dies suddenly, it might still be reported as "master, fail" and the client never tries to reconnect to a different one.

NickCraver added a commit that referenced this issue Oct 28, 2022
Right now we don't pay attention to fail state (PFAIL == FAIL) and continue trying to connect in the main loop. I don't believe this was intended looking at the code, we just weren't handling the flag appropriately. Added now.

Docs at: https://redis.io/commands/cluster-nodes/
@NickCraver
Copy link
Collaborator

Got some time to look at it over break this week - agreed this isn't handled correctly and fixing in #2288 :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants