During a resiliency test on the Redis server side, we recently hit an issue where `redisClusterAsyncCommandArgv` calls kept failing with a `REDIS_ERR` return code after some of the server pods were restarted.
The server-side data suggests that, for a brief period, some slots were missing on the server and the total slot count was lower than expected (10088 instead of 16384). The `CLUSTER SLOTS` command appears to have been triggered during this window and may have fetched this partial data from the server.
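As a rough illustration of what "partial" means here, the following minimal sketch counts how many of the 16384 cluster slots a set of slot ranges covers. The `slot_range` type and both function names are hypothetical and exist only for this example; they are not part of hiredis-cluster.

```c
#include <stdbool.h>
#include <stddef.h>

#define CLUSTER_SLOTS 16384 /* total number of hash slots in a Redis cluster */

/* Hypothetical, simplified view of one slot range from a CLUSTER SLOTS reply. */
typedef struct {
    int start; /* first slot in the range, inclusive */
    int end;   /* last slot in the range, inclusive */
} slot_range;

/* Count the distinct slots covered by the given ranges.
 * Malformed ranges are skipped; overlapping slots are counted once. */
int count_covered_slots(const slot_range *ranges, size_t n) {
    bool covered[CLUSTER_SLOTS] = {false};
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        if (ranges[i].start < 0 || ranges[i].end >= CLUSTER_SLOTS ||
            ranges[i].start > ranges[i].end) {
            continue;
        }
        for (int s = ranges[i].start; s <= ranges[i].end; s++) {
            if (!covered[s]) {
                covered[s] = true;
                count++;
            }
        }
    }
    return count;
}

/* A slot map is partial when at least one slot has no owner. */
bool slot_map_is_partial(const slot_range *ranges, size_t n) {
    return count_covered_slots(ranges, n) < CLUSTER_SLOTS;
}
```

A single range of 0 to 10087 covers 10088 slots and would be reported as partial. Which slots were actually missing in the incident is not known, so that range is only an example.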
My hypothesis, from reading the code, is that `redisClusterAsyncCommandArgv` fails when the slot targeted by the command falls in the missing range: the `node` pointer is then `NULL`, and the API returns `REDIS_ERR`.
Once we hit this scenario we never recover, because the client does not rediscover the slots and keeps holding on to the partial slot map it already discovered. My question is how to handle or recover from this. Should the library handle it, for example by scheduling a rediscovery when it finds that the slot information is partial?
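The failure mode described above can be modelled with a small self-contained simulation. This is only a sketch of the hypothesis, not the library's code: `sim_cluster`, `sim_node`, `sim_assign_range`, `sim_send_command` and the `SIM_*` constants are made-up stand-ins for a slot-to-node table and the command path.

```c
#include <stddef.h>

#define CLUSTER_SLOTS 16384
#define SIM_OK 0
#define SIM_ERR (-1)

/* Hypothetical stand-in for a cluster node. */
typedef struct {
    const char *addr;
} sim_node;

/* Hypothetical slot-to-node table; an entry is NULL when the slot is unassigned. */
typedef struct {
    sim_node *table[CLUSTER_SLOTS];
} sim_cluster;

/* Assign the inclusive slot range [start, end] to a node. */
void sim_assign_range(sim_cluster *c, sim_node *node, int start, int end) {
    for (int s = start; s <= end && s < CLUSTER_SLOTS; s++) {
        if (s >= 0) {
            c->table[s] = node;
        }
    }
}

/* Model of the command path: look up the node for the slot and fail when it
 * is missing. Nothing here refreshes the table, so the same slot keeps
 * failing on every later call. */
int sim_send_command(const sim_cluster *c, int slot) {
    if (slot < 0 || slot >= CLUSTER_SLOTS) {
        return SIM_ERR;
    }
    const sim_node *node = c->table[slot];
    if (node == NULL) {
        return SIM_ERR; /* slot falls in the missing range */
    }
    return SIM_OK;
}
```

With only slots 0 to 10087 assigned (an illustrative range), a command hashing to slot 12000 fails on every call, which matches the continuous `REDIS_ERR` results reported here.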
It sounds reasonable that the library should handle a re-discovery when the slotmap is partial.
Currently it just gives the REDIS_ERR as you state, but maybe it should call throttledUpdateSlotMapAsync(acc, NULL); as well.
I believe there is a need for a testcase for this scenario and some fix.
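To make the suggestion concrete, here is a minimal simulation of how a throttled rediscovery on a missing slot could behave, which might also serve as a starting point for a test scenario. It does not use hiredis-cluster: `sim_ctx`, `sim_throttled_update`, `sim_command`, `sim_apply_slot_map` and the 100 ms throttle window are all assumptions made for this sketch, and the real `throttledUpdateSlotMapAsync` may work differently.

```c
#include <stdbool.h>
#include <stddef.h>

#define CLUSTER_SLOTS 16384
#define SIM_OK 0
#define SIM_ERR (-1)
#define SIM_THROTTLE_MS 100 /* arbitrary throttle window chosen for the sketch */

/* Hypothetical client state: slot owners plus throttle bookkeeping. */
typedef struct {
    int owner[CLUSTER_SLOTS];  /* node id per slot, -1 when unassigned */
    long long next_update_ms;  /* earliest time another update may be scheduled */
    int updates_scheduled;     /* number of slot map updates requested so far */
} sim_ctx;

void sim_init(sim_ctx *c) {
    for (int s = 0; s < CLUSTER_SLOTS; s++) {
        c->owner[s] = -1;
    }
    c->next_update_ms = 0;
    c->updates_scheduled = 0;
}

/* Request a slot map update unless one was requested within the window. */
bool sim_throttled_update(sim_ctx *c, long long now_ms) {
    if (now_ms < c->next_update_ms) {
        return false; /* throttled */
    }
    c->next_update_ms = now_ms + SIM_THROTTLE_MS;
    c->updates_scheduled++;
    return true;
}

/* Command path: still fails for an unassigned slot, but also asks for a
 * rediscovery so that a later call can succeed. */
int sim_command(sim_ctx *c, int slot, long long now_ms) {
    if (slot < 0 || slot >= CLUSTER_SLOTS) {
        return SIM_ERR;
    }
    if (c->owner[slot] < 0) {
        sim_throttled_update(c, now_ms);
        return SIM_ERR;
    }
    return SIM_OK;
}

/* Apply one inclusive slot range from a (simulated) fresh slot map. */
void sim_apply_slot_map(sim_ctx *c, int node_id, int start, int end) {
    for (int s = start; s <= end && s < CLUSTER_SLOTS; s++) {
        if (s >= 0) {
            c->owner[s] = node_id;
        }
    }
}
```

In this model the failing call still returns an error, so callers see the same return code as today, but the throttle keeps repeated failures from flooding the cluster with update requests, and commands succeed again once a complete slot map has been applied.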
Code referenced in the hypothesis above: hiredis-cluster/hircluster.c, line 4136 in 0a4deb6.
cc: @bjosv