First implementation of dog-pile effect avoidance via python-redis-lock additions by Karmak23 · Pull Request #134 · Suor/django-cacheops

Karmak23 · 2015-02-25T21:05:01Z

Hi @Suor,

here is a patch to enable dog-pile avoidance in cacheops.

Could you please review it and check whether I missed anything?

This patched version has been running for hours on my 1flow development machine without problems.

I will deploy it on http://1flow.io/ with the next major release.

I patched simple & query, and didn't find any other obvious locations to patch.

Your feedback is appreciated.

Regards,

Suor · 2015-02-26T02:36:40Z

Hi, this looks like a very solid patch overall. There are some issues with it though:

1. The QuerySet cache is not protected by locks, see QuerySetMixin.iterator().
2. Processes released from the lock should not calculate the value and recache it; they should at least try to get it from the cache first. The way it's implemented now, every process will still make the calculation.
3. All processes enter the lock sequentially, which is not the desired behavior when there is a fast path: getting the value from the cache filled by the previous process.

Suor · 2015-02-26T03:21:16Z

Thinking a bit more about it I came up with this procedure:

result_p = cache_get(cache_key)

while result_p is None:
    lock = Lock(redis_conn, cache_key, expire=1)
    if lock.acquire(blocking=False):
        result = calculate()
        cache_set(cache_key, pickle.dumps(result))
        lock.release()
        return result
    else:
        with lock:
            result_p = cache_get(cache_key)

return pickle.loads(result_p)
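
For readers who want to try the idea without a Redis server, the procedure above can be sketched in-process. This is only an illustration under stated assumptions: a plain dict stands in for the cache, `threading.Lock` stands in for the Redis lock (so there is no `expire`), and the names `get_or_compute` and `_lock_for` are hypothetical. It also re-checks the cache after winning the lock, a step the snippet above does not have.

```python
import pickle
import threading

cache = {}                        # stand-in for the Redis cache
_locks = {}                       # one lock per cache key
_locks_guard = threading.Lock()   # protects the _locks dict itself


def _lock_for(cache_key):
    """Return the lock shared by everyone asking for this key."""
    with _locks_guard:
        return _locks.setdefault(cache_key, threading.Lock())


def get_or_compute(cache_key, calculate):
    result_p = cache.get(cache_key)        # fast path: no locking on a hit
    while result_p is None:
        lock = _lock_for(cache_key)
        if lock.acquire(blocking=False):
            try:
                # Added step (not in the snippet above): someone may have
                # filled the cache between our miss and our acquire.
                result_p = cache.get(cache_key)
                if result_p is None:
                    result = calculate()
                    cache[cache_key] = pickle.dumps(result)
                    return result
            finally:
                lock.release()
        else:
            with lock:                     # wait for the computing thread
                result_p = cache.get(cache_key)
    return pickle.loads(result_p)
```

Waiters still pass through the lock one at a time here, which is the sequential behavior discussed in this thread; the sketch only guarantees that `calculate` is not repeated.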

Karmak23 · 2015-02-26T08:26:55Z

Re,

Thanks for comment 1, I missed it and will update the patch.

About comment 2: yes, my bad. The load is not parallel anymore but now sequential… I will provide a new patch updated from your suggestion ASAP.

About comment 3 and your implementation: yes, my bad too. I missed some part when adapting the python-redis-lock example. However, I think the final code will be simpler than yours, because what @ionelmc proposes seems to give the same result without the while … acquire(blocking=False) overhead.

I'm currently leaving for a customer meeting, hoping to send a new patch in a few hours.

Thanks for your review,

Karmak23 · 2015-02-27T23:31:15Z

Hi,

sorry for the delay, here is another patch to complement the first.

Point 1) was tricky; I had to think hard to get to a factorized result, hence the many comments that make the implicit cases explicit.
Point 2) should be addressed correctly.

I think Point 3) is addressed by my patch too, but I perhaps misunderstood something.

Regards,

Suor · 2015-02-28T01:19:39Z

Please, mention QuerySet cache and @cached_as() here too.

Suor · 2015-02-28T02:02:21Z

Overall this looks solid, thanks for all your work so far. I added line notes, please look them through.

I also have some general considerations:

1. You are not testing the CACHEOPS_USE_LOCK=True code path in any way now.
The way you can achieve that is by adding this line to tests/settings.py:
```
CACHEOPS_USE_LOCK = bool(os.environ.get('CACHEOPS_USE_LOCK'))
```
And this to tox.ini:
```
    env CACHEOPS_USE_LOCK=1 ./run_tests.py []
```
2. Please add a timeout to all locks, otherwise any process that acquires the lock and then fails will cause all the others to hang. Something like 1 second will do; you can also create a setting for that, but that's unnecessary.
3. Point 3 is not really taken care of, since all locked processes will still proceed sequentially. But it appears impossible to do better with redis-lock; you would need a shared/exclusive lock for that. There is no readily available implementation, and if there were, it could actually make things slower because it is more complex. So I think we should stick with the current approach for now, for a "now" of unpredictable length :).

Karmak23 · 2015-03-02T08:43:31Z

So here is the rewrapped version with all your remarks. It looks better ;-)
I will let it run for 2-3 days on my development environment to see whether the "rare bug" mentioned above occurs or not.

Karmak23 · 2015-03-02T08:49:32Z

Only one point remains unclear to me, about 2): when you say “any process acquiring the lock and then failing will cause all others to hang up”.

What kind of failure are you thinking about?

If you mean that all processes except the first will block until the first finishes computing the result, then I think this is exactly what is intended. They will be unblocked by a signal at the end of the computation, thanks to the original implementation of python-redis-lock.

In a high-load scenario where many processes wait for the lock while one computes a result, the waiters will acquire the lock in turn at the end of the computation, and each will immediately return the cached result from inside the lock.

In my mind, this workflow is the implicit tradeoff of the locking feature. It's the first part of dog-pile effect avoidance (the second being serving stale data, but that is out of scope of my patch for now). This is not perfect, but still far better than all of them computing in parallel.

With a possible timeout to avoid blocking, we would need to introduce the while loop you suggested before. This basically introduces polling, which seems to be what python-redis-lock is designed to avoid with its back-signal feature.

If you mean that a failing first process will leave the lock in an unknown or blocked state forever (python-redis-lock refers to this rare case and provides a management command to clear the locks), then yes, it can happen, but it's very rare. In this case, a timeout would effectively work around the problem.

But then, I personally strongly need a long timeout, because most places where I use the cache involve computations longer than 10 seconds (my database is huge and the dog-pile is clearly hammering the machine).

As this is for a rare bug, I'd prefer to publish the code first as it is, and "see" how much the problem happens (or not) before introducing polling and (relatively) more code complexity. As the lock feature is disabled by default, we can label it as "experimental" and invite users to report/share issues.

It is still possible that I have missed something or not understood what you said, so forgive me if that's the case.

Suor · 2015-03-02T09:16:22Z

By fail I mean crash. And yes, I am worried about the lock being left in an acquired state forever. That effectively means any process trying to acquire it will hang forever, which will render the website (or whatever it is) completely inoperable until an admin comes and fixes it. He or she will have a hard time too: a reboot won't help, and tracing the issue to the cache could be non-trivial. And if clearing the whole cache is not an option, the admin will need to trace it to a specific lock key, which is even harder.

Relying on the integrator to write some hacky code using lock.reset() is just naive; nobody will do that. And I don't think it's a good approach anyway; a timeout is far more reliable.

If 1 second is too small, use 60 or make a setting. This way you won't hang someone's site for half a day - a very realistic scenario otherwise.
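
To make the crash scenario concrete, here is a toy in-process sketch of a lock whose ownership lapses after a timeout. It is an illustration only, not python-redis-lock's implementation: `ExpiringLock` is a hypothetical class, it is not thread-safe, and the clock is injectable purely so the expiry can be shown without sleeping.

```python
import time


class ExpiringLock:
    """Toy lock whose ownership lapses `expire` seconds after acquire."""

    def __init__(self, expire, clock=time.monotonic):
        self.expire = expire
        self._clock = clock
        self._deadline = None      # None means "not held"

    def acquire(self):
        now = self._clock()
        if self._deadline is not None and now < self._deadline:
            return False           # still held by someone else
        self._deadline = now + self.expire
        return True

    def release(self):
        self._deadline = None


# A holder that crashes never calls release(); with a 60-second expiry the
# other processes are stuck for at most 60 seconds rather than indefinitely.
fake_now = [0.0]
lock = ExpiringLock(expire=60, clock=lambda: fake_now[0])
first = lock.acquire()             # the process that later crashes
second = lock.acquire()            # refused while the lock is still fresh
fake_now[0] = 61.0
third = lock.acquire()             # succeeds once the expiry has passed
```

The tradeoff raised earlier in the thread still applies: the expiry has to be longer than the slowest legitimate computation, or a second process will start computing while the first is still running.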

Karmak23 · 2015-03-02T10:40:04Z

That seems a fair worry; automatic recovery is likely to be an appreciated feature.

I will update the patch with a dedicated setting and more documentation ASAP.

Karmak23 · 2015-03-02T10:46:04Z

Btw, the Travis build failed with a syntax error that I didn't see. I will create a dedicated flake8 config for cacheops; I get too many warnings per file with the default config.

Suor · 2015-03-02T12:48:30Z

You can use tox -e flakes to run flake8, no need to make your own config.
You can also run tests with:

pip install -r test_requirements.txt
./runtests.py
CACHEOPS_USE_LOCK=1 ./runtests.py


Karmak23 · 2015-03-03T12:04:47Z

BTW, there is currently a nasty side-effect when the redis_client has a timeout configured:

while a blocking acquire is waiting for the release signal, it fails with a timeout reading from the socket.

I would tend to work around this problem by creating another client dedicated to locking, with its own timeout value, which should always be greater than the one used in the various @cache* implementations. This would mean another client in the module.

What's your point of view on this issue?
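
A minimal sketch of that idea, under stated assumptions: `lock_client_kwargs` and its `margin` parameter are hypothetical names, and the connection settings are assumed to be a plain dict of redis-py keyword arguments such as the one in `CACHEOPS_REDIS`. The helper only derives the kwargs; creating the second client is shown as a comment.

```python
def lock_client_kwargs(cache_kwargs, lock_timeout, margin=5):
    """Derive connection kwargs for a client used only for locking.

    The socket timeout is made longer than both the cache client's own
    timeout and the longest time a waiter may block on the lock, so a
    blocking wait is not cut short by a socket read timeout.
    """
    kwargs = dict(cache_kwargs)    # never mutate the cache client's settings
    cache_socket_timeout = kwargs.get('socket_timeout') or 0
    kwargs['socket_timeout'] = max(cache_socket_timeout, lock_timeout) + margin
    return kwargs


# Assumed usage with redis-py:
#   import redis
#   lock_client = redis.StrictRedis(**lock_client_kwargs(CACHEOPS_REDIS, 60))
```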

Suor · 2015-03-03T12:11:19Z

Sounds sane.

Suor · 2015-03-09T09:08:35Z

Hey, this turned out to be harder than it looked initially. Reach out if you need help or have any questions.

Karmak23 · 2015-03-09T09:43:52Z

Hi, sorry for the delay, and thanks for your proposition.

I was very busy last week syncing the people working on https://github.com/UnissonCo/dataserver while being 1000km away from home ;-)

I'll take time to fix the timeout issue and implement the crash-auto-workaround feature this week.

This is taking me much more time than anticipated, but the handful of features and the reliability are worth it, I think.

Karmak23 · 2015-03-16T22:05:46Z

Hi @Suor, I haven't forgotten you, but I have been very busy lately with python-ftr and building a REST API for 1flow. I'm trying to find time to finish this lock patch. I will surely get it done because I need it, but currently other things have higher priority. Sorry for the delay…

ionelmc · 2016-10-06T13:38:49Z

        'redis>=2.9.1',
        'funcy>=1.2,<2.0',
        'six>=1.4.0',
+        'python-redis-lock==2.0.0',


You should not pin the dependency like this. Add major version constraints instead (>=X.0,<Y.0).
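
For illustration, the range form could look like the fragment below. The bounds are an assumption derived from the pinned 2.0.0 (the comment above only says `>=X.0,<Y.0`), and the variable name `install_requires` is assumed from the usual setup.py layout.

```python
# Hypothetical setup.py fragment: a major-version range instead of an exact pin.
install_requires = [
    'redis>=2.9.1',
    'funcy>=1.2,<2.0',
    'six>=1.4.0',
    'python-redis-lock>=2.0,<3.0',   # assumed bounds, not taken from the PR
]
```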

ionelmc · 2016-10-06T13:44:01Z

@Karmak23 what's blocking this PR (besides the conflicts)? about the timeout issue, can you explain a bit more - maybe there's something that redis-lock can handle/fix?

Karmak23 · 2016-10-06T19:30:07Z

@ionelmc I didn't have the time to fix the timeout bug. I think the patch in its current form will trigger nasty bugs in very loaded conditions.

We didn't validate it with @Suor because I needed to review it completely.

Today I don't have time anymore. Sorry.

Suor · 2016-10-07T07:52:22Z

I now think that python-redis-lock is overkill for this; a small custom implementation could be much simpler and more efficient:

reuse cache_key as lock key,
release all locked threads simultaneously,
blend release into cache_thing.lua script.

Possible lock func:

def get_or_lock(client, key):
    signal_key = key + ':signal'

    while True:
        data = client.get(key)
        if data is None:
            # Try to take the lock by writing a marker under the cache key itself
            if client.set(key, 'LOCK', nx=True, ex=LOCK_TIMEOUT):
                client.delete(signal_key)
                # Lock acquired: the caller computes the value, then stores it
                # (which releases the lock) and lpushes to signal_key
                return None
        elif data != 'LOCK':
            return data

        # Locked by another process: wait for its release signal
        client.brpoplpush(signal_key, signal_key, timeout=LOCK_TIMEOUT)

Then add this to cache_thing script:

redis.call('lpush', signal_key, '1')
redis.call('expire', signal_key, 5)

And we are done.
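
The flow can be exercised without a Redis server. The sketch below is an assumption-laden illustration, not the cacheops implementation: `FakeClient` is a hypothetical in-memory stand-in for the few commands involved (it ignores expiry), `cached_call` is a hypothetical caller, and the store-then-signal step that the proposal puts in the cache_thing Lua script is done here as two plain calls. Values are bytes, as a real client would return them.

```python
import threading
import time

LOCK_TIMEOUT = 2          # seconds; assumed value
LOCK = b'LOCK'


class FakeClient:
    """In-memory stand-in for the Redis commands used below (no expiry)."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def get(self, key):
        with self._cond:
            return self._data.get(key)

    def set(self, key, value, nx=False, ex=None):
        with self._cond:
            if nx and key in self._data:
                return None
            self._data[key] = value
            return True

    def delete(self, key):
        with self._cond:
            self._data.pop(key, None)

    def lpush(self, key, value):
        with self._cond:
            self._data.setdefault(key, []).insert(0, value)
            self._cond.notify_all()

    def brpoplpush(self, src, dst, timeout):
        # Wait for an item, then rotate it back onto the list so that every
        # waiter is released, not just the first one.
        with self._cond:
            self._cond.wait_for(lambda: self._data.get(src), timeout=timeout)
            items = self._data.get(src)
            if not items:
                return None
            value = items.pop()
            self._data.setdefault(dst, []).insert(0, value)
            return value


def get_or_lock(client, key):
    signal_key = key + ':signal'
    while True:
        data = client.get(key)
        if data is None:
            if client.set(key, LOCK, nx=True, ex=LOCK_TIMEOUT):
                client.delete(signal_key)
                return None                # we own the lock and must compute
        elif data != LOCK:
            return data
        client.brpoplpush(signal_key, signal_key, timeout=LOCK_TIMEOUT)


def cached_call(client, key, compute):
    data = get_or_lock(client, key)
    if data is None:
        data = compute()
        client.set(key, data)              # overwrites the LOCK marker
        client.lpush(key + ':signal', b'1')  # wakes all waiters
    return data
```

With several threads calling `cached_call` for the same key, only the one that wins the `nx` set runs `compute`; the others block on the signal list and then read the stored value.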

Suor · 2016-10-07T08:15:35Z

BTW, @ionelmc, why did you stop using EXPIRE on signal keys?

ionelmc · 2016-10-07T08:31:53Z

@Suor It's still used: https://github.com/ionelmc/python-redis-lock/blob/master/src/redis_lock/__init__.py#L221

Are you referring to something else?

Suor · 2016-10-07T08:48:26Z

I mean you don't rely on expire command to remove old lists.

Suor · 2016-10-10T14:50:59Z

Implemented in 7a89046, tests and docs to follow.

Can be used as:

@cached_as(qs, lock=True)
def heavy_func(...):
    # ...
    return res

for item in qs.cache(lock=True):
    # ...

I found that a global setting would be less useful. This implementation also bypasses locking on cache hits, and hence has zero overhead for them.

ionelmc · 2016-10-22T06:58:41Z

@Suor regarding the expire removal from redis-lock, the reason was that it wasn't reliable, see: ionelmc/python-redis-lock#32 - the issue became apparent when someone tried to replace EXPIRE with PEXPIRE. I suspect on a busy-enough server EXPIRE would have the same problem too. Feel free to give input if you know more about how redis works internally.

Suor · 2016-10-22T13:51:05Z

As far as I understand, you only experienced issues when you tried to set the timeout to some very low value, so leaving it at a sane value like one or a few seconds should be OK. It also seems a far cleaner solution than a cleanup thread.

Suor · 2016-10-25T18:14:39Z

Isn't it safe to just make it big enough? A solution with a separate thread looks much dirtier to me.

ionelmc · 2016-10-25T18:18:02Z

Technically, cleanup is not done in a thread; only the auto-extend runs in a thread (completely optional, in other words).


Suor reviewed Feb 28, 2015

Comment thread README.rst Outdated

Karmak23 force-pushed the master branch 4 times, most recently from fbac3a9 to 7ebde4f Compare March 2, 2015 08:37

First implementation of dog-pile effect avoidance via python-redis-lock additions.

02fdbd1

Karmak23 force-pushed the master branch from 7ebde4f to 02fdbd1 Compare March 3, 2015 00:13

Suor force-pushed the master branch from fb855c4 to c9a847e Compare November 19, 2015 11:32

Suor force-pushed the master branch from fe30ef1 to 8245222 Compare February 3, 2016 05:21

Suor mentioned this pull request Aug 12, 2016

How to force cache usage even if it's invalidated #204

Closed

ionelmc reviewed Oct 6, 2016


Suor closed this Oct 10, 2016

mrmachine mentioned this pull request Jul 16, 2018

Serve stale data instead of blocking with lock=True #289

Open


3 participants