Skip to content
This repository has been archived by the owner on Jan 9, 2024. It is now read-only.

Client fails during live resharding #89

Closed
milyenpabo opened this issue Oct 7, 2015 · 13 comments
Closed

Client fails during live resharding #89

milyenpabo opened this issue Oct 7, 2015 · 13 comments

Comments

@milyenpabo
Copy link

It seems that resharding is handled incorrectly up to v1.0.0. The relevant source code is the execute_command function in client.py.

The problem is the following: if a redis-py-cluster client is querying a slot while it is under migration from node A to node B, then the client will be ping-ponging between A and B until RedisClusterRequestTTL is exhausted and an exception is thrown. (Node A will repeatedly redirect the client to B with an ASK reply, while B will repeatedly redirect to A with a MOVED reply.)

The solution for this situation is the ASKING cluster command, which is completely missing from redis-py-cluster, if I'm right. Some explanations on this mechanism:

http://redis.io/commands/cluster-setslot
http://grokbase.com/p/gg/redis-db/142wrajgdq/cluster-questions

In case of an ASK redirection the client should not blindly issue the command to the new target node, but first send an ASKING command, notifying node B that it was already redirected from the authoritative slot owner, A.

This issue is related to issue #67.

@Grokzen
Copy link
Owner

Grokzen commented Oct 7, 2015

@milyenpabo Have you looked at the unstable branch and the code in there if the logic is the same?

@milyenpabo
Copy link
Author

@Grokzen As far as I can tell from the source, it has the same issue. I also ran a search in the repo for the ASKING command, but did not find anything relevant. Strangely, exceptions.py on the unstable branch has a new AskError class that describes the correct mechanism in a comment, but I found no other clues.

Do I miss something?

@Grokzen
Copy link
Owner

Grokzen commented Oct 8, 2015

@milyenpabo I think the client can handle ASK errors now after i rewrote the code to use a better parser https://github.com/Grokzen/redis-py-cluster/blob/unstable/rediscluster/client.py#L260 but it could be that ASKING command is not sent when that happens.

@milyenpabo
Copy link
Author

@Grokzen That's right, the exception is caught but ASKING is not sent, so in effect the ping-pong issue is still there. The difference in the unstable version is that it throws a different type of exception than in v1.0.0:

Traceback (most recent call last):
  File "redis-writer.py", line 33, in <module>
    r.set(d, d)
  File "build/bdist.linux-x86_64/egg/redis/client.py", line 1055, in set
  File "build/bdist.linux-x86_64/egg/rediscluster/utils.py", line 87, in inner
  File "build/bdist.linux-x86_64/egg/rediscluster/client.py", line 266, in execute_command
rediscluster.exceptions.ClusterError: TTL exhausted.

@Grokzen
Copy link
Owner

Grokzen commented Oct 8, 2015

@milyenpabo I will dig some during the weekend and if all goes well i might have a patch that adds this feature if it needs to be there.

The only thing is that i have never seen commands or client that breaks when it is running during a live reshard. Both me and @72squared have very recently had a test script (with threads) that performs operations during a reshard and after our fixes (thread related) we have not yet seen any problems.

Have you seen a acctual problem/exception somehow when you have been running a reshard that we can use to reproduce the problem, or this more a "it should comply to the docs" issue?

@Grokzen
Copy link
Owner

Grokzen commented Oct 8, 2015

About the change of error, see https://github.com/Grokzen/redis-py-cluster/blob/unstable/docs/Upgrading.md#100----next-release where it states that the error was changed due to the old one was not so descriptive.

@Grokzen
Copy link
Owner

Grokzen commented Oct 8, 2015

Can you share the test script you have been running in a gist that i can access? Also a short description about the redis-cluster you are running, number of nodes, locally/cloud/docker? How much data do you have in the server and how many nodes are you moving around? With that i might be able to reproduce the error.

@milyenpabo
Copy link
Author

@Grokzen I think the current implementation does not comply with the redis cluster specification. But anyhow, I can provide a test case. I have a redis-writer.py script that writes keys to a redis cluster (v3.0.4) with 3 masters and 1 replica per master. (See gist.)

from rediscluster import RedisCluster

startup_nodes = [{"host": "127.0.0.1", "port": "27002"}]
r = RedisCluster(startup_nodes=startup_nodes, max_connections=32, decode_responses=True)

N = 1000000
p = 0
pdiff = 1

print "Writing %d keys to redis" % N

for i in xrange(N):
    d = str(i)
    r.set(d, d)
    progress = 100.0*i/N
    if progress >= p:
        print "%.0f%%" % progress,
        p += pdiff
if progress < 100:
    print "100%"

First I run the script to populate redis:

python -u redis-writer.py

Then I re-run the script while resharding redis with the following command:

./redis-trib.rb reshard --from 412072cda05b27485348f3b83a5b5b09ba01b1ce --to 2f51725948ca46649048a05f4c4ac0bfa605bad5 --slots 2000 --yes 127.0.0.1:27002

If the number of slots to migrate (in my case 2000) and the number of keys/slot (in my case ~60) is large enough, the script will fail with a probability close to 100%. The above mentioned exception is thrown when the writer script hits a key that was already migrated, but the slot containing the key is still under migration. (For explanation see the issue-opening post.)

Please note that the error is not triggered deterministically. You probably did not see this happening because in your tests you did not hit an already moved key in a slot just under migration. So depending on your setup, you might need to re-run the test several times or tweak the parameters (number of slots to reshard and/or the number of keys/slot, i.e. the N variable in the script).

@Grokzen
Copy link
Owner

Grokzen commented Oct 8, 2015

very nice. I will play with it some during the weekend and see what i can come up with.

@Grokzen
Copy link
Owner

Grokzen commented Oct 18, 2015

@milyenpabo Do you think that only this will be enough? 4bf8e51

I am not really sure how to verify it works as intended more then that it do not break anything and your script still works as expected during the reshard operation. I do not however see the exception you got after that fix has been applied.

It feels stable enough for me to merge this.

@Grokzen
Copy link
Owner

Grokzen commented Oct 18, 2015

@milyenpabo I think the fix i pushed to unstable branch have fixed the problem. I compared to jedis so the logic would look the same and it does. If you think there is still problems with the implementation, you can reopen this issue and i will take another look at the problem.

@Grokzen Grokzen closed this as completed Oct 18, 2015
@milyenpabo
Copy link
Author

@Grokzen I think that the fix is OK. I also made some tests, and now redirection seems to work fine. Thanks!

@Grokzen
Copy link
Owner

Grokzen commented Oct 22, 2015

@milyenpabo Another fix was just commited because the original implementation was not enough. It was found that it was still failing on some ASK errors. See commit 199ee0b

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants