"master,fail" state not handled correctly #2249

lolodi · 2022-09-08T23:54:48Z

Hello!
We saw instances where a node was in the "master,fail" state, but SE.Redis kept trying to connect to it, ignoring that, although it is marked as master, it is in a failed state.

E.g. the CLUSTER NODES command returned:
ecffa36f58a103c199314291a68a66406195da01 20.212.157.81:15014 master,fail - 1662670083503 1662670080944 78 connected
20c41293cb0d5e1781c532e783b415bd32c2fcf8 20.212.157.81:13015 myself,master - 0 1662670077000 77 connected 2048-2340 7510-7802 10240-10532 12970-13182

but the library kept trying to connect to the "master,fail" node.
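For readers unfamiliar with the output format, here is a minimal Python sketch (not SE.Redis code) that splits the two lines above into fields and exposes the flags. The field order follows the CLUSTER NODES documentation; the `ClusterNodeLine` type and `parse_cluster_nodes_line` helper are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClusterNodeLine:
    """One parsed line of CLUSTER NODES output (hypothetical helper type)."""
    node_id: str
    endpoint: str
    flags: frozenset
    link_state: str
    slots: list


def parse_cluster_nodes_line(line: str) -> ClusterNodeLine:
    # Documented field order:
    # <id> <ip:port> <flags> <master> <ping-sent> <pong-recv>
    # <config-epoch> <link-state> <slot> <slot> ...
    parts = line.split()
    return ClusterNodeLine(
        node_id=parts[0],
        endpoint=parts[1],
        flags=frozenset(parts[2].split(",")),
        link_state=parts[7],
        slots=parts[8:],
    )


SAMPLE = """\
ecffa36f58a103c199314291a68a66406195da01 20.212.157.81:15014 master,fail - 1662670083503 1662670080944 78 connected
20c41293cb0d5e1781c532e783b415bd32c2fcf8 20.212.157.81:13015 myself,master - 0 1662670077000 77 connected 2048-2340 7510-7802 10240-10532 12970-13182
"""

nodes = [parse_cluster_nodes_line(l) for l in SAMPLE.splitlines() if l.strip()]
for node in nodes:
    # Show which nodes carry the "fail" flag and how many slot ranges they own.
    print(node.endpoint, sorted(node.flags), "slot ranges:", len(node.slots))
```

Note that the failed node still carries the `master` flag and owns no slot ranges, while the answering node owns four.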

NickCraver · 2022-09-09T01:06:19Z

What's the scenario here - e.g. is it a configured endpoint, or are we discovering it?

lolodi · 2022-09-14T17:03:35Z

This is on a clustered cache with discovery.

NickCraver · 2022-09-15T18:03:11Z

Gotcha - what would the expected behavior here be?

It is intentional that we try to connect to the node because we're told it exists and we're monitoring for the moment it comes back online (in the background). There's also the possibility of a cluster going split brained and we wouldn't know to talk to the winning half if we didn't observe this (corner case, we hope).

lolodi · 2022-09-15T18:25:26Z

I think my expectation in this situation, where one node is 'master,fail' and the other (the one that is actually answering) is 'myself,master', would be to fail over to the one that says it's 'myself,master' and disconnect from the other one, especially since its status says 'fail'.
If both nodes of a shard are reporting as master, but one is fail and the other is not, I would expect the library to connect to the one not in fail state.

NickCraver · 2022-09-15T18:32:28Z

Connections are not per-shard though, they are per-server, and each server has some number of shard responsibilities (which can also change on the fly). There can also be (and usually are) many masters in a cluster. We want to connect to what we're told is there as quickly as possible.

Overall though, this happens in the background and isn't meant to be noisy - what issue is it actually causing?
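To illustrate the per-server point, a small Python sketch that counts the hash slots behind the single endpoint from the output posted earlier. The `slot_count` helper is hypothetical, and it assumes plain `start-end` or single-slot entries (no migrating or importing markers).

```python
def slot_count(ranges: list) -> int:
    """Count hash slots covered by CLUSTER NODES slot entries.

    Assumes each entry is either "start-end" or a single slot number.
    """
    total = 0
    for item in ranges:
        start, _, end = item.partition("-")
        # A single-slot entry has no "-", so end is empty and counts as 1.
        total += int(end or start) - int(start) + 1
    return total


# Slot entries reported for 20.212.157.81:13015 in the output above.
owned = "2048-2340 7510-7802 10240-10532 12970-13182".split()
print(len(owned), "ranges,", slot_count(owned), "slots on one server connection")
```

One connection to that server therefore covers several non-contiguous slot ranges at once.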

lolodi · 2022-09-16T00:34:55Z

We have had instances where the client kept the connection to the failed master for hours instead of switching over to the healthy one. Our understanding is that SE.Redis checks whether the current node is still master, but doesn't verify that it is also not in a fail state. If a node dies suddenly, it might still be reported as "master,fail" and the client never tries to reconnect to a different one.

Right now we don't pay attention to fail state (PFAIL == FAIL) and continue trying to connect in the main loop. Looking at the code, I don't believe this was intended; we just weren't handling the flag appropriately. Added now. Docs at: https://redis.io/commands/cluster-nodes/
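As a rough illustration of the kind of check described here, a Python sketch that drops nodes flagged as failed before choosing connection candidates. This is not the SE.Redis implementation. It assumes the parenthetical means PFAIL and FAIL are treated alike, and the helper names `is_failed` and `connectable_masters` are hypothetical. The flag spellings (`fail?` for PFAIL, `fail` for FAIL) come from the CLUSTER NODES docs linked above.

```python
# Flag spellings as documented for CLUSTER NODES:
# "fail?" marks PFAIL (possible failure), "fail" marks FAIL.
FAILED_FLAGS = frozenset({"fail", "fail?"})


def is_failed(flags: str) -> bool:
    """True if the comma-separated flags include PFAIL or FAIL."""
    return not FAILED_FLAGS.isdisjoint(flags.split(","))


def connectable_masters(nodes: dict) -> list:
    """Return endpoints flagged master and not flagged as failed.

    `nodes` maps endpoint -> raw flags string (assumed input shape).
    """
    return [
        endpoint
        for endpoint, flags in nodes.items()
        if "master" in flags.split(",") and not is_failed(flags)
    ]


nodes = {
    "20.212.157.81:15014": "master,fail",
    "20.212.157.81:13015": "myself,master",
}
print(connectable_masters(nodes))
```

With the two nodes from this issue, only the `myself,master` endpoint is left as a candidate.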

NickCraver · 2022-10-28T12:19:03Z

Got some time to look at it over break this week - agreed this isn't handled correctly and fixing in #2288 :)

NickCraver added 🪲 bug ⚙️ area:cluster labels Oct 28, 2022

NickCraver closed this as completed in f3ac74a Nov 17, 2022
