New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"master,fail" state not handled correctly #2249
Comments
What's the scenario here - e.g. is it a configured endpoint, or are we discovering it? |
This is on a clustered cache with discovery. |
Gotcha - what would the expected behavior here be? It is intentional that we try to connect to the node because we're told it exists and we're monitoring for the moment it comes back online (in the background). There's also the possibility of a cluster going split brained and we wouldn't know to talk to the winning half if we didn't observe this (corner case, we hope). |
I think my expectation in this situation, where one node was in 'master,fail' and the other (the one that is actually answering) is 'myself,master' would be to try to failover to the one that says it's 'myself, master' and disconnect from the other one, especially since it's status says 'fail'. |
Connections are not per-shard though, they are per-server which has some number of shard responsibilities (which can also change on the fly). There can also me (and usually are) many masters in a cluster. We want to connect to what we're told is there as quickly as possible. Overall though, this happens in the background and isn't meant to be noisy - what issue is it actually causing? |
We have had instances where the client kept the connection to the failed master for hours instead of switching over to the healthy one. Our understanding is that SE.Redis checks if the current node is still master, but doesn't verify if it also not in fail state. If a node dies suddenly, it might still be reported as "master, fail" and the client never tries to reconnect to a different one. |
Right now we don't pay attention to fail state (PFAIL == FAIL) and continue trying to connect in the main loop. I don't believe this was intended looking at the code, we just weren't handling the flag appropriately. Added now. Docs at: https://redis.io/commands/cluster-nodes/
Got some time to look at it over break this week - agreed this isn't handled correctly and fixing in #2288 :) |
Hello!
We saw instances where a node was in "master,fail" state but SE.Redis kept trying to connect to it ignoring that, although is marked as master, it is in failed state.
E.g. the cluster info command returned:
ecffa36f58a103c199314291a68a66406195da01 20.212.157.81:15014 master,fail - 1662670083503 1662670080944 78 connected
20c41293cb0d5e1781c532e783b415bd32c2fcf8 20.212.157.81:13015 myself,master - 0 1662670077000 77 connected 2048-2340 7510-7802 10240-10532 12970-13182
but the library kept trying to connect to the "master,fail" node.
The text was updated successfully, but these errors were encountered: