Skip to content

First implementation of dog-pile effect avoidance via python-redis-lock additions#134

Closed
Karmak23 wants to merge 1 commit into
Suor:masterfrom
Karmak23:master
Closed

First implementation of dog-pile effect avoidance via python-redis-lock additions#134
Karmak23 wants to merge 1 commit into
Suor:masterfrom
Karmak23:master

Conversation

@Karmak23

Copy link
Copy Markdown

Hi @Suor,

here is a patch to enable dog-pile avoidance in cacheops.

Could you please review it and see if I didn't miss anything ?

This patched version has been running for hours on my 1flow development machine without problems.

I will deploy it on http://1flow.io/ at next major release.

I patched simple & query, and didn't find any other obvious locations to patch.

Your feedback is appreciated.

Regards,

@Suor

Suor commented Feb 26, 2015

Copy link
Copy Markdown
Owner

Hi, this looks like very solid patch overall. There are some issues with it though:

  1. QuerySet cache is not protected by locks, see QuerySetMixin.iterator().
  2. Unlocking processes should not try to calculate value and recache it, they should at least first try to get it from cache. The way it's implemented now every process will still make the calculation.
  3. All processes enter lock sequentially which is not desired behavior if there is a fast path - get value from cache filled by previous process.

@Suor

Suor commented Feb 26, 2015

Copy link
Copy Markdown
Owner

Thinking a bit more about it I came up with this procedure:

result_p = cache_get(cache_key)

while result_p is None:
    lock = Lock(redis_conn, cache_key, expire=1)
    if lock.acquire(blocking=False):
        result = calculate()
        cache_set(cache_key, pickle.dumps(result))
        lock.release()
        return result
    else:
        with lock:
            result_p = cache_get(cache_key)

return pickle.loads(result_p)

@Karmak23

Copy link
Copy Markdown
Author

Re,

Thanks for comment 1, I missed it and will update the patch.

About comment 2: yes, my bad. The load is not parallel anymore but now sequential… I will provide a new patch updated from your suggestion ASAP.

About comment 3 and your implementation, yes, my bad too. I missed some part when getting the python-redis-lock example. However I think the final code will be simpler than yours because what @ionelmc propose seems to give the same result without the while … acquire(blocking=False) overhead.

I'm currently leaving for a customer meeting, hoping to send a new patch in a few hours.

Thanks for your review,

@Karmak23

Copy link
Copy Markdown
Author

Hi,

sorry for the delay, here is another patch to complement the first.

Point 1) was tricky, I had to think hard to get to a factorized result, thus the lot of comments to make the implicit cases obviously visible.
Point 2) should be addressed correctly.

I think Point 3) is addressed by my patch too, but I perhaps misunderstood something.

Regards,

Comment thread README.rst Outdated

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, mention QuerySet cache and @cached_as() here too.

@Suor

Suor commented Feb 28, 2015

Copy link
Copy Markdown
Owner

Overall this looks solid, thanks for all your work so far. I added line notes, please look them through.

I also have some general considerations:

  1. You are not testing CACHEOPS_USE_LOCK=True code path in any way now.
    The way you can achieve that is by adding this line to tests/settings.py:

    CACHEOPS_USE_LOCK = bool(os.environ.get('CACHEOPS_USE_LOCK'))

    And this to tox.ini:

        env CACHEOPS_USE_LOCK=1 ./run_tests.py []
  2. Please add timeout to all locks, overwise any process acquiring the lock and then failing will cause all others to hang up. Something like 1 second will do, you can also create setting for that, but that's unnecessary.

  3. Point 3 is not really taken care of since all locked processes will still go sequentially. But it appears to be impossible to do smarter with redis-lock, you'll need shared/exclusive lock for that. There is no readily available implementation for that and if it were it could have actually made things slower, cause it's more complex. So I think we should stick with current approach for now of unpredictable length :).

@Karmak23 Karmak23 force-pushed the master branch 4 times, most recently from fbac3a9 to 7ebde4f Compare March 2, 2015 08:37
@Karmak23

Karmak23 commented Mar 2, 2015

Copy link
Copy Markdown
Author

So here is the rewrapped version with all your remarks. It looks better ;-)
I will let it run for 2-3 days on my development environment to see if the "rare bug" mentionned above occurs or not.

@Karmak23

Karmak23 commented Mar 2, 2015

Copy link
Copy Markdown
Author

Only one remains unclear for me: about 2), when you say “any process acquiring the lock and then failing will cause all others to hang up”.

What kind of failure are you thinking about ?

If you mean that all processes except the first will block until the first finishes to compute the result, then I think this is exactly intended. They will be unblocked by a signal at the end of computation, thanks to the original implementation of python-redis-lock.

In a high load scenario where a lot of processes wait for the lock while one computes a result, all waiters will all acquire the lock in turn at the end of computation, and will immediately return the cached result from inside the lock.

In my mind, this workflow is the implicit tradeoff of the locking feature. It's the first point of dogpile effect avoiding (the second beiing serving stale data, but it's out of scope of my patch for now). This is not perfect, but still far better than all of them computing in parallel.

With an eventual timeout to avoid blocking, we need to introduce the while you suggested before. This basically introduces polling, which seems to be what python-redis-lock allows to avoid with its back-signal feature.

If you mean that a failing first-process will let the lock in an unknown or blocked state forever (python-redis-lock refers to this rare case and provides a management command to clear the locks), then yes, it can happen, but it's very rare. In this case, a timeout would effectively workaround this problem.

But then, I personnaly strongly need to have a long timeout, because in most places where I use the cache have more than 10 seconds computations (my database is huge and dog-pile is clearly hammering the machine).

As this is for a rare bug, I'd prefer to publish the code first as it is, and "see" how much the problem happens (or not) before introducing polling and (relatively) more code complexity. As the lock feature is disabled by default, we can label it as "experimental" and invite users to report/share issues.

It is still possible I have missed something or not understood what you told, so forgive me if it's the case.

@Suor

Suor commented Mar 2, 2015

Copy link
Copy Markdown
Owner

By fail I mean crash. And yes, I am worried if block is left in acquired state forever. This effectively means that any process trying to acquire it will hang forever, which will trigger website or whatever completely inoperable until admin comes up and fixes that, and he or she will have a hard time too, cause reboot won't help and tracking an issue to cache could be non-trivial, also if clearing all the cache is not an option admin will need to trace it to certain lock key, which is even harder.

Relying on integrator to write some hacky code using lock.reset() is just naive, nobody will do that. And I don't think it's a good approach anyway, timeout is way more reliable.

If 1 second is too small, use 60 or make a setting. This way you won't hang someones site for half a day - a very realistic scenario otherwise.

@Karmak23

Karmak23 commented Mar 2, 2015

Copy link
Copy Markdown
Author

That seems a fair worry; automatic recover is likely to be an appreciated feature.

I will update the patch with a dedicated setting and more documentation ASAP.

@Karmak23

Karmak23 commented Mar 2, 2015

Copy link
Copy Markdown
Author

Btw, the Travis build failed with a syntax error, I didn't see it. I will create dedicated flake8 config for cacheops, I have too much warnings per file in default config.

@Suor

Suor commented Mar 2, 2015

Copy link
Copy Markdown
Owner

You can use tox -e flakes to run flake8, no need to make your own config.
You can also run tests with:

pip install -r test_requirements.txt
./runtests.py
CACHEOPS_USE_LOCK=1 ./runtests.py

@Karmak23

Karmak23 commented Mar 3, 2015

Copy link
Copy Markdown
Author

BTW, there is currently a nasty side-effect when the redis_client has a timeout configured:

while the lock is acquired blocking and waiting from release signal, it will fail with timeout reading from socket.

I would tend to workaround this problem by creating another client dedicated to locking, with its own timeout value, which should always be greater than the one from in various @cache* implementation. This would mean another client in the module.

What's your point of view on this issue ?

@Suor

Suor commented Mar 3, 2015

Copy link
Copy Markdown
Owner

Sounds sane.

@Suor

Suor commented Mar 9, 2015

Copy link
Copy Markdown
Owner

Hey, this turned to be harder than it looked initially. Reach me if you need help or have any questions.

@Karmak23

Karmak23 commented Mar 9, 2015

Copy link
Copy Markdown
Author

Hi, sorry for the delay, and thanks for your proposition.

I have been very busy last week synching people around working on https://github.com/UnissonCo/dataserver while beiing 1000km away from home ;-)

I'll take time to fix the timeout issue and implement the crash-auto-workaround feature this week.

This is taking me much time than anticipated, but the handfull of features and reliability is worth it I think.

@Karmak23

Copy link
Copy Markdown
Author

Hi @Suor I didn't forget you but have been very busy lately on python-ftr and building a REST API for 1flow. I'm trying to find time to finish this lock patch. Surely I will make it because I need it, but currently other things have more priority. Sorry for the delay…

Comment thread setup.py
'redis>=2.9.1',
'funcy>=1.2,<2.0',
'six>=1.4.0',
'python-redis-lock==2.0.0',

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not pin the dependency like this. Add major version constraints instead (>=X.0,<Y.0).

@ionelmc

ionelmc commented Oct 6, 2016

Copy link
Copy Markdown

@Karmak23 what's blocking this PR (besides the conflicts)? about the timeout issue, can you explain a bit more - maybe there's something that redis-lock can handle/fix?

@Karmak23

Karmak23 commented Oct 6, 2016

Copy link
Copy Markdown
Author

@ionelmc I didn't have the time to fix the timeout bug. I think the patch in its current form will trigger nasty bugs in very loaded conditions.

We didn't validate it with @Suor because I needed to review it completely

Today I don't have time anymore. Sorry.

@Suor

Suor commented Oct 7, 2016

Copy link
Copy Markdown
Owner

I think now that python-redis-lock is overkill for this, custom small implementation could be much simpler and more efficient:

  • reuse cache_key as lock key,
  • release all locked threads simultaneously,
  • blend release into cache_thing.lua script.

Possible lock func:

def get_or_lock(client, key):
    signal_key = key + ':signal'

    while True:
        data = client.get(key)
        if data is None:
            if client.set(key, 'LOCK', nx=True, ex=LOCK_TIMEOUT):
                client.delete(signal_key)
                return None  # Should release lock there and lpush to signal_key
        elif data != 'LOCK':
            return data

        # No data and not locked, wait
        client.brpoplpush(signal_key, signal_key, timeout=LOCK_TIMEOUT)

Then add this to cache_thing script:

redis.call('lpush', signal_key, '1')
redis.call('expire', signal_key, 5)

And we are done.

@Suor

Suor commented Oct 7, 2016

Copy link
Copy Markdown
Owner

BTW, @ionelmc, why did you stop using EXPIRE on signal keys?

@ionelmc

ionelmc commented Oct 7, 2016

Copy link
Copy Markdown

@Suor It's still used: https://github.com/ionelmc/python-redis-lock/blob/master/src/redis_lock/__init__.py#L221

Are you referring to something else?

@Suor

Suor commented Oct 7, 2016 via email

Copy link
Copy Markdown
Owner

@Suor

Suor commented Oct 10, 2016

Copy link
Copy Markdown
Owner

Implemented in 7a89046, tests and docs to follow.

Can be used as:

@cached_as(qs, lock=True)
def heavy_func(...):
    # ...
    return res

for item in qs.cache(lock=True):
    # ...

I found that global setting would be less useful. It also bypasses locking for cache hits and hence will have zero overhead for them.

@Suor Suor closed this Oct 10, 2016
@ionelmc

ionelmc commented Oct 22, 2016

Copy link
Copy Markdown

@Suor regarding the expire removal from redis-lock, the reason was that it wasn't reliable, see: ionelmc/python-redis-lock#32 - the issue became apparent when someone tried to replace EXPIRE with PEXPIRE. I suspect on a busy-enough server EXPIRE would have the same problem too. Feel free to give input if you know more about how redis works internally.

@Suor

Suor commented Oct 22, 2016

Copy link
Copy Markdown
Owner

As far as I understand you only experienced issues when tried to set timeout to some very low value, so leaving it sane like 1 or few seconds should be ok. Also seems far cleaner solution than cleanup thread.

@Suor

Suor commented Oct 25, 2016 via email

Copy link
Copy Markdown
Owner

@ionelmc

ionelmc commented Oct 25, 2016

Copy link
Copy Markdown

Technically cleanup is not done in thread, only the auto extend is in
thread (completely optional iow)

On Oct 22, 2016 4:51 PM, "Alexander Schepanovski" notifications@github.com
wrote:

As far as I understand you only experienced issues when tried to set
timeout to some very low value, so leaving it sane like 1 or few seconds
should be ok. Also seems far cleaner solution than cleanup thread.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants