Client fails during live resharding #89
Comments
@milyenpabo Have you looked at the unstable branch and checked whether the logic in the code there is the same?
@Grokzen As far as I can tell from the source, it has the same issue. I also ran a search in the repo for the `ASKING` command. Am I missing something?
@milyenpabo I think the client can handle ASK errors now, after I rewrote the code to use a better parser (https://github.com/Grokzen/redis-py-cluster/blob/unstable/rediscluster/client.py#L260), but it could be that the ASKING command is not sent when that happens.
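For context, an ASK redirection arrives as an error string of the form `ASK <slot> <host>:<port>` (e.g. `ASK 3999 127.0.0.1:6381`), so the parser mentioned above has to extract the slot and the target address before the client can retry. A minimal sketch of such a parser (hypothetical helper name, not the actual redis-py-cluster code):

```python
def parse_ask_error(message):
    """Parse a Redis ASK redirection error string of the form
    'ASK <slot> <host>:<port>' into (slot, host, port)."""
    _, slot, addr = message.split(" ")
    host, port = addr.rsplit(":", 1)  # rsplit copes with IPv6-style hosts
    return int(slot), host, int(port)

print(parse_ask_error("ASK 3999 127.0.0.1:6381"))  # -> (3999, '127.0.0.1', 6381)
```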
@Grokzen That's right, the exception is caught, but the command still fails:

```
Traceback (most recent call last):
  File "redis-writer.py", line 33, in <module>
    r.set(d, d)
  File "build/bdist.linux-x86_64/egg/redis/client.py", line 1055, in set
  File "build/bdist.linux-x86_64/egg/rediscluster/utils.py", line 87, in inner
  File "build/bdist.linux-x86_64/egg/rediscluster/client.py", line 266, in execute_command
rediscluster.exceptions.ClusterError: TTL exhausted.
```
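To illustrate why the TTL is exhausted: if the client follows ASK and MOVED redirections without ever issuing ASKING, the two nodes bounce the request back and forth until the retry budget runs out. A toy simulation of that ping-pong (all names hypothetical; this is not the library's actual code):

```python
class ClusterError(Exception):
    pass

def query(node, key):
    """Simulated node responses during a slot migration: node A still owns
    the migrating slot and ASK-redirects to B for already-moved keys, while
    node B, not yet the slot owner, MOVED-redirects back to A."""
    return ("ASK", "B") if node == "A" else ("MOVED", "A")

def execute_command(key, ttl=16):
    node = "A"
    while ttl > 0:
        ttl -= 1
        status, target = query(node, key)
        if status in ("ASK", "MOVED"):
            node = target  # follow the redirect, but never send ASKING
            continue
        return status
    raise ClusterError("TTL exhausted.")

try:
    execute_command("12345")
except ClusterError as e:
    print(e)  # -> TTL exhausted.
```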
@milyenpabo I will dig into it some during the weekend, and if all goes well I might have a patch that adds this feature if it needs to be there. The only thing is that I have never seen commands or a client break while running during a live reshard. Both @72squared and I have very recently run a test script (with threads) that performs operations during a reshard, and after our (thread-related) fixes we have not yet seen any problems. Have you seen an actual problem/exception while running a reshard that we can use to reproduce the issue, or is this more of an "it should comply with the docs" issue?
About the change of error, see https://github.com/Grokzen/redis-py-cluster/blob/unstable/docs/Upgrading.md#100----next-release, where it states that the error was changed because the old one was not very descriptive.
Can you share the test script you have been running in a gist that I can access? Also a short description of the redis-cluster you are running: number of nodes, locally/cloud/docker? How much data do you have in the server, and how many slots are you moving around? With that I might be able to reproduce the error.
@Grokzen I think the current implementation does not comply with the Redis Cluster specification. But anyhow, I can provide a test case. I have a test script, `redis-writer.py`:

```python
from rediscluster import RedisCluster

startup_nodes = [{"host": "127.0.0.1", "port": "27002"}]
r = RedisCluster(startup_nodes=startup_nodes, max_connections=32, decode_responses=True)

N = 1000000
p = 0
pdiff = 1

print "Writing %d keys to redis" % N
for i in xrange(N):
    d = str(i)
    r.set(d, d)
    progress = 100.0 * i / N
    if progress >= p:
        print "%.0f%%" % progress,
        p += pdiff
if progress < 100:
    print "100%"
```

First I run the script to populate redis:

```
python -u redis-writer.py
```

Then I re-run the script while resharding redis with the following command:

```
./redis-trib.rb reshard --from 412072cda05b27485348f3b83a5b5b09ba01b1ce --to 2f51725948ca46649048a05f4c4ac0bfa605bad5 --slots 2000 --yes 127.0.0.1:27002
```

If the number of slots to migrate (in my case 2000) and the number of keys per slot (in my case ~60) are large enough, the script will fail with a probability close to 100%. The above-mentioned exception is thrown when the writer script hits a key that has already been migrated while the slot containing the key is still under migration. (For an explanation, see the issue-opening post.) Please note that the error is not triggered deterministically. You probably did not see this happening because in your tests you did not hit an already-moved key in a slot under migration. So depending on your setup, you might need to re-run the test several times or tweak the parameters (the number of slots to reshard and/or the number of keys per slot, i.e. the N variable in the script).
Very nice. I will play with it some during the weekend and see what I can come up with.
@milyenpabo Do you think this alone will be enough? 4bf8e51 I am not really sure how to verify it works as intended, beyond checking that it does not break anything and that your script still works as expected during the reshard operation. I do not, however, see the exception you got after that fix has been applied. It feels stable enough for me to merge.
@milyenpabo I think the fix I pushed to the unstable branch has fixed the problem. I compared it to Jedis so the logic would look the same, and it does. If you think there are still problems with the implementation, you can reopen this issue and I will take another look.
@Grokzen I think the fix is OK. I also ran some tests, and redirection now seems to work fine. Thanks!
@milyenpabo Another fix was just committed because the original implementation was not enough. It was found to still be failing on some ASK errors. See commit 199ee0b
It seems that resharding is handled incorrectly up to v1.0.0. The relevant source code is the `execute_command` function in `client.py`.

The problem is the following: if a redis-py-cluster client is querying a slot while it is under migration from node A to node B, then the client will ping-pong between A and B until `RedisClusterRequestTTL` is exhausted and an exception is thrown. (Node A will repeatedly redirect the client to B with an `ASK` reply, while B will repeatedly redirect to A with a `MOVED` reply.)

The solution for this situation is the `ASKING` cluster command, which is completely missing from redis-py-cluster, if I'm right. Some explanations of this mechanism:

http://redis.io/commands/cluster-setslot
http://grokbase.com/p/gg/redis-db/142wrajgdq/cluster-questions

In case of an `ASK` redirection, the client should not blindly issue the command to the new target node, but should first send an `ASKING` command, notifying node B that it has already been redirected by the authoritative slot owner, A.

This issue is related to issue #67.
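The behavior described above can be sketched roughly as follows: on an ASK redirect, the client sends `ASKING` on the target node's connection and retries the command there exactly once, without updating its slot cache. This is a hypothetical illustration with stand-in connection objects, not the actual redis-py-cluster implementation:

```python
class FakeConnection:
    """Stand-in for a node connection; records every command it receives
    and replies from a canned response table."""
    def __init__(self, responses):
        self.responses = responses
        self.sent = []

    def send(self, *args):
        self.sent.append(args)
        return self.responses.get(args[0], "OK")

def execute_with_ask(conn_a, conn_b, command, key):
    """Issue `command` on node A; on an ASK redirect, send ASKING to
    node B and retry there once (the slot cache stays unchanged)."""
    reply = conn_a.send(command, key)
    if isinstance(reply, str) and reply.startswith("ASK "):
        conn_b.send("ASKING")  # mandatory before the retried command
        return conn_b.send(command, key)
    return reply

# Node A ASK-redirects GETs for a migrating key; node B accepts them
# once ASKING has been sent.
a = FakeConnection({"GET": "ASK 3999 127.0.0.1:6381"})
b = FakeConnection({"GET": "value"})
print(execute_with_ask(a, b, "GET", "somekey"))  # -> value
```

Note that a `MOVED` reply, by contrast, would mean the slot has permanently changed owner and the client should refresh its slot cache before retrying.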