Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recovery in case of discovered slots from redis cluster is partial #191

Closed
rahulKrishnaM opened this issue Sep 26, 2023 · 2 comments · Fixed by #192
Closed

Recovery in case of discovered slots from redis cluster is partial #191

rahulKrishnaM opened this issue Sep 26, 2023 · 2 comments · Fixed by #192

Comments

@rahulKrishnaM
Copy link

During one of resiliency tests at redis server side, recently came across an issue where redisClusterAsyncCommandArgv calls kept failing continuously with REDIS_ERR return code after some of the server pods were restarted.

Looking at the server-side data, it seems like for a brief period of time some of the slots were missing at the server side and the overall slot count was less than what it had to be (10088 instead of 16384). The cluster slots command seems to have been triggered during this period in time and might have fetched this partial data from server.

It is a hypothesis (from code reading) that the failure in redisClusterAsyncCommandArgv could be happening here if the targeted slot for the command is falling in the missing range.

node = node_get_by_table(cc, (uint32_t)slot_num);

the node pointer would be NULL at this point, and the api returns out with REDIS_ERR.

We don't ever recover from this scenario if we hit this, since we don't go again for rediscovering the slots and hold on to the already discovered partial slots. The query I have is how to handle/recover out of this scenario. Should this be handled in the library to maybe schedule a rediscovery if we find that the slot information is partial.

cc: @bjosv

@bjosv
Copy link
Collaborator

bjosv commented Sep 26, 2023

It sounds reasonable that the library should handle a re-discovery when the slotmap is partial.
Currently it just gives the REDIS_ERR as you state, but maybe it should call throttledUpdateSlotMapAsync(acc, NULL); as well.
I believe there is a need for a testcase for this scenario and some fix.

@zuiderkwast
Copy link
Collaborator

I agree, throttled update is a good idea for the async API.

For the sync API we don't have throttling, but I think we can try at every command.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants