Recovery in case of discovered slots from redis cluster is partial #191

rahulKrishnaM · 2023-09-26T12:52:32Z

During one of our resiliency tests on the Redis server side, we recently came across an issue where redisClusterAsyncCommandArgv calls kept failing with a REDIS_ERR return code after some of the server pods were restarted.

Looking at the server-side data, it seems that for a brief period some of the slots were missing on the server side, so the overall slot count was lower than it should have been (10088 instead of 16384). The CLUSTER SLOTS command seems to have been issued during this window and may have fetched this partial data from the server.

My hypothesis (from reading the code) is that redisClusterAsyncCommandArgv fails at the following line when the command's target slot falls in the missing range.

hiredis-cluster/hircluster.c

Line 4136 in 0a4deb6

node = node_get_by_table(cc, (uint32_t)slot_num);

The node pointer would be NULL at this point, and the API returns REDIS_ERR.
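To make the suspected failure mode concrete, here is a minimal, self-contained sketch of a slot table with a gap. It is not the hiredis-cluster implementation: `demo_node`, `demo_slotmap`, `demo_assign`, `demo_node_get_by_table` and `demo_covered_slots` are hypothetical names, and the assumption that the 10088 covered slots form one contiguous range is made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SLOTS 16384 /* number of hash slots in a Redis Cluster */

/* Hypothetical stand-ins for the library's node and slot table types. */
typedef struct { const char *addr; } demo_node;
typedef struct { demo_node *table[NUM_SLOTS]; } demo_slotmap;

/* Assign the inclusive slot range [start, end] to a node. */
static void demo_assign(demo_slotmap *m, uint32_t start, uint32_t end,
                        demo_node *n) {
    assert(start <= end && end < NUM_SLOTS);
    for (uint32_t s = start; s <= end; s++)
        m->table[s] = n;
}

/* Return the node serving the slot, or NULL when no node serves it. */
static demo_node *demo_node_get_by_table(const demo_slotmap *m,
                                         uint32_t slot) {
    assert(slot < NUM_SLOTS);
    return m->table[slot];
}

/* Count the slots that have an owner; fewer than NUM_SLOTS means
 * the discovered slot map is partial. */
static size_t demo_covered_slots(const demo_slotmap *m) {
    size_t covered = 0;
    for (uint32_t s = 0; s < NUM_SLOTS; s++)
        if (m->table[s] != NULL)
            covered++;
    return covered;
}
```

With only slots 0 to 10087 assigned, a lookup for any slot above that range yields NULL, which matches the REDIS_ERR path described above.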

Once we hit this scenario we never recover, since we don't rediscover the slots and instead hold on to the partial slot map that was already discovered. My question is how to handle or recover from this scenario. Should the library handle it, perhaps by scheduling a rediscovery when it finds that the slot information is partial?

cc: @bjosv

bjosv · 2023-09-26T13:08:54Z

It sounds reasonable that the library should handle a re-discovery when the slotmap is partial.
Currently it just gives the REDIS_ERR as you state, but maybe it should call throttledUpdateSlotMapAsync(acc, NULL); as well.
I believe we need a test case for this scenario, plus a fix.
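A rough sketch of that idea, under stated assumptions: the command still fails, but the failure path also requests a slot map update that is rate-limited. The code below does not show the internals of throttledUpdateSlotMapAsync; `demo_throttle`, `demo_throttled_update`, `demo_on_unserved_slot` and the `min_interval_ms` window are hypothetical and only model the throttling behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical throttle state; not a hiredis-cluster structure. */
typedef struct {
    int64_t last_update_ms;  /* time the last update was started */
    int64_t min_interval_ms; /* assumed minimum gap between updates */
    bool has_updated;        /* false until the first update */
    int updates_triggered;   /* counter, for demonstration only */
} demo_throttle;

/* Start an update unless one was started within the throttle window.
 * Returns true if an update was started, false if it was throttled. */
static bool demo_throttled_update(demo_throttle *t, int64_t now_ms) {
    assert(t->min_interval_ms >= 0);
    if (t->has_updated &&
        now_ms - t->last_update_ms < t->min_interval_ms)
        return false;
    t->has_updated = true;
    t->last_update_ms = now_ms;
    t->updates_triggered++;
    return true;
}

/* Failure path for a command whose slot has no node: request a
 * (throttled) slot map update, then still report the error. */
static int demo_on_unserved_slot(demo_throttle *t, int64_t now_ms) {
    (void)demo_throttled_update(t, now_ms);
    return -1; /* same value as REDIS_ERR in hiredis */
}
```

The caller still sees the error for the current command, but later commands can succeed once the refreshed slot map arrives, and a burst of failing commands does not trigger a burst of updates.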

zuiderkwast · 2023-09-26T14:31:29Z

I agree, throttled update is a good idea for the async API.

For the sync API we don't have throttling, but I think we can try at every command.
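For the blocking case, one possible reading is sketched below: every command that finds no node for its slot attempts a slot map refresh, with no throttling. Retrying the lookup right after the refresh is an assumption of this sketch and is not stated in the thread. All names (`demo_ctx`, `demo_refresh_full`, `demo_lookup_with_refresh`) are hypothetical and not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SLOTS 16384 /* number of hash slots in a Redis Cluster */

typedef struct { const char *addr; } demo_node;

/* Hypothetical cluster context holding the slot table. */
typedef struct demo_ctx demo_ctx;
struct demo_ctx {
    demo_node *table[NUM_SLOTS];
    int (*refresh)(demo_ctx *cc); /* reloads the table, 0 on success */
    int refresh_calls;            /* counter, for demonstration only */
};

static demo_node demo_full_owner = {"127.0.0.1:7000"};

/* Demo refresh callback: pretends the cluster is fully covered again. */
static int demo_refresh_full(demo_ctx *cc) {
    for (uint32_t s = 0; s < NUM_SLOTS; s++)
        cc->table[s] = &demo_full_owner;
    return 0;
}

/* Look up the node for a slot. If no node serves it, refresh the slot
 * map once (unthrottled) and look again; NULL means still unserved. */
static demo_node *demo_lookup_with_refresh(demo_ctx *cc, uint32_t slot) {
    assert(slot < NUM_SLOTS);
    demo_node *node = cc->table[slot];
    if (node != NULL)
        return node;
    cc->refresh_calls++;
    if (cc->refresh == NULL || cc->refresh(cc) != 0)
        return NULL;
    return cc->table[slot];
}
```

Because the refresh only happens on the failure path, commands whose slot is already mapped pay no extra cost.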

bjosv mentioned this issue Sep 27, 2023

Update slotmap when slot is not served by any node #192

Merged

bjosv closed this as completed in #192 Sep 29, 2023
