First implementation of dog-pile effect avoidance via python-redis-lock additions#134
First implementation of dog-pile effect avoidance via python-redis-lock additions#134Karmak23 wants to merge 1 commit into
Conversation
|
Hi, this looks like very solid patch overall. There are some issues with it though:
|
|
Thinking a bit more about it I came up with this procedure: result_p = cache_get(cache_key)
while result_p is None:
lock = Lock(redis_conn, cache_key, expire=1)
if lock.acquire(blocking=False):
result = calculate()
cache_set(cache_key, pickle.dumps(result))
lock.release()
return result
else:
with lock:
result_p = cache_get(cache_key)
return pickle.loads(result_p) |
|
Re, Thanks for comment 1, I missed it and will update the patch. About comment 2: yes, my bad. The load is not parallel anymore but now sequential… I will provide a new patch updated from your suggestion ASAP. About comment 3 and your implementation, yes, my bad too. I missed some part when getting the python-redis-lock example. However I think the final code will be simpler than yours because what @ionelmc propose seems to give the same result without the I'm currently leaving for a customer meeting, hoping to send a new patch in a few hours. Thanks for your review, |
|
Hi, sorry for the delay, here is another patch to complement the first. Point 1) was tricky, I had to think hard to get to a factorized result, thus the lot of comments to make the implicit cases obviously visible. I think Point 3) is addressed by my patch too, but I perhaps misunderstood something. Regards, |
There was a problem hiding this comment.
Please, mention QuerySet cache and @cached_as() here too.
|
Overall this looks solid, thanks for all your work so far. I added line notes, please look them through. I also have some general considerations:
|
fbac3a9 to
7ebde4f
Compare
|
So here is the rewrapped version with all your remarks. It looks better ;-) |
|
Only one remains unclear for me: about 2), when you say “any process acquiring the lock and then failing will cause all others to hang up”. What kind of failure are you thinking about ? If you mean that all processes except the first will block until the first finishes to compute the result, then I think this is exactly intended. They will be unblocked by a signal at the end of computation, thanks to the original implementation of python-redis-lock. In a high load scenario where a lot of processes wait for the lock while one computes a result, all waiters will all acquire the lock in turn at the end of computation, and will immediately return the cached result from inside the lock. In my mind, this workflow is the implicit tradeoff of the locking feature. It's the first point of dogpile effect avoiding (the second beiing serving stale data, but it's out of scope of my patch for now). This is not perfect, but still far better than all of them computing in parallel. With an eventual timeout to avoid blocking, we need to introduce the If you mean that a failing first-process will let the lock in an unknown or blocked state forever (python-redis-lock refers to this rare case and provides a management command to clear the locks), then yes, it can happen, but it's very rare. In this case, a timeout would effectively workaround this problem. But then, I personnaly strongly need to have a long timeout, because in most places where I use the cache have more than 10 seconds computations (my database is huge and dog-pile is clearly hammering the machine). As this is for a rare bug, I'd prefer to publish the code first as it is, and "see" how much the problem happens (or not) before introducing polling and (relatively) more code complexity. As the lock feature is disabled by default, we can label it as "experimental" and invite users to report/share issues. It is still possible I have missed something or not understood what you told, so forgive me if it's the case. |
|
By fail I mean crash. And yes, I am worried if block is left in acquired state forever. This effectively means that any process trying to acquire it will hang forever, which will trigger website or whatever completely inoperable until admin comes up and fixes that, and he or she will have a hard time too, cause reboot won't help and tracking an issue to cache could be non-trivial, also if clearing all the cache is not an option admin will need to trace it to certain lock key, which is even harder. Relying on integrator to write some hacky code using If 1 second is too small, use 60 or make a setting. This way you won't hang someones site for half a day - a very realistic scenario otherwise. |
|
That seems a fair worry; automatic recover is likely to be an appreciated feature. I will update the patch with a dedicated setting and more documentation ASAP. |
|
Btw, the Travis build failed with a syntax error, I didn't see it. I will create dedicated flake8 config for cacheops, I have too much warnings per file in default config. |
|
You can use pip install -r test_requirements.txt
./runtests.py
CACHEOPS_USE_LOCK=1 ./runtests.py |
|
BTW, there is currently a nasty side-effect when the while the lock is acquired blocking and waiting from release signal, it will fail with I would tend to workaround this problem by creating another client dedicated to locking, with its own timeout value, which should always be greater than the one from in various What's your point of view on this issue ? |
|
Sounds sane. |
|
Hey, this turned to be harder than it looked initially. Reach me if you need help or have any questions. |
|
Hi, sorry for the delay, and thanks for your proposition. I have been very busy last week synching people around working on https://github.com/UnissonCo/dataserver while beiing 1000km away from home ;-) I'll take time to fix the timeout issue and implement the crash-auto-workaround feature this week. This is taking me much time than anticipated, but the handfull of features and reliability is worth it I think. |
|
Hi @Suor I didn't forget you but have been very busy lately on python-ftr and building a REST API for 1flow. I'm trying to find time to finish this lock patch. Surely I will make it because I need it, but currently other things have more priority. Sorry for the delay… |
| 'redis>=2.9.1', | ||
| 'funcy>=1.2,<2.0', | ||
| 'six>=1.4.0', | ||
| 'python-redis-lock==2.0.0', |
There was a problem hiding this comment.
You should not pin the dependency like this. Add major version constraints instead (>=X.0,<Y.0).
|
@Karmak23 what's blocking this PR (besides the conflicts)? about the timeout issue, can you explain a bit more - maybe there's something that redis-lock can handle/fix? |
|
I think now that python-redis-lock is overkill for this, custom small implementation could be much simpler and more efficient:
Possible lock func: def get_or_lock(client, key):
signal_key = key + ':signal'
while True:
data = client.get(key)
if data is None:
if client.set(key, 'LOCK', nx=True, ex=LOCK_TIMEOUT):
client.delete(signal_key)
return None # Should release lock there and lpush to signal_key
elif data != 'LOCK':
return data
# No data and not locked, wait
client.brpoplpush(signal_key, signal_key, timeout=LOCK_TIMEOUT)Then add this to cache_thing script: redis.call('lpush', signal_key, '1')
redis.call('expire', signal_key, 5)And we are done. |
|
BTW, @ionelmc, why did you stop using |
|
@Suor It's still used: https://github.com/ionelmc/python-redis-lock/blob/master/src/redis_lock/__init__.py#L221 Are you referring to something else? |
|
I mean you don't rely on expire command to remove old lists.
|
|
Implemented in 7a89046, tests and docs to follow. Can be used as: @cached_as(qs, lock=True)
def heavy_func(...):
# ...
return res
for item in qs.cache(lock=True):
# ...I found that global setting would be less useful. It also bypasses locking for cache hits and hence will have zero overhead for them. |
|
@Suor regarding the expire removal from redis-lock, the reason was that it wasn't reliable, see: ionelmc/python-redis-lock#32 - the issue became apparent when someone tried to replace EXPIRE with PEXPIRE. I suspect on a busy-enough server EXPIRE would have the same problem too. Feel free to give input if you know more about how redis works internally. |
|
As far as I understand you only experienced issues when tried to set timeout to some very low value, so leaving it sane like 1 or few seconds should be ok. Also seems far cleaner solution than cleanup thread. |
|
Isn't it safe to just make it big enough? A solution with a separate thread
looks much dirtier to me.
|
|
Technically cleanup is not done in thread, only the auto extend is in On Oct 22, 2016 4:51 PM, "Alexander Schepanovski" notifications@github.com
|
Hi @Suor,
here is a patch to enable dog-pile avoidance in
cacheops.Could you please review it and see if I didn't miss anything ?
This patched version has been running for hours on my 1flow development machine without problems.
I will deploy it on http://1flow.io/ at next major release.
I patched
simple&query, and didn't find any other obvious locations to patch.Your feedback is appreciated.
Regards,