Skip to content
This repository has been archived by the owner on Dec 11, 2023. It is now read-only.

Make bcolz threadsafe #286

Closed
wants to merge 9 commits into from
Closed

Make bcolz threadsafe #286

wants to merge 9 commits into from

Conversation

alimanfoo
Copy link
Contributor

This PR builds on #285 to investigate the possibility to make bcolz safe to use within multi-threaded environments.

The primary goal of this PR is to make bcolz threadsafe.

The secondary goal of this PR is to investigate ways to optimise performance of multithreaded applications using bcolz. This includes looking for opportunities to safely release the GIL, and investigating the possibility for concurrent read operations on the same carray.

@alimanfoo
Copy link
Contributor Author

I did a bit of reading around threading in Python and believe that, even if bcolz holds on to the GIL at all times, it is still not safe to use within a multi-threaded environment. Here's a bit of background and reasoning.

There is some very useful explanation of Python multi-threaded behaviour and the GIL in these slides from David Beazley. The key point is that if a CPython process launches multiple threads, even though only one thread can execute Python code at any one time because of the GIL, control will still get switched between threads at various points. Exactly when and how switching occurs depends on which Python version and configuration settings. However, in theory, switches could occur at any point in the code. So, for example, one thread could be half-way through reading a bcolz carray, then a switch could pass control to a different thread which could erase all data in the carray. When control returns to the first thread, the data will obviously be in an inconsistent state.

My main motivation for wanting thread-safety is so that I can use bcolz with dask.array via dask's thread scheduler. I would like to be able to use carrays as the source of data for some parallel computation (i.e., pass a carray into da.from_array()). This implies multiple threads attempting to concurrently read from the same carray via __getitem__. I would also like to be able to use carrays as the sink for data from some parallel computation (i.e., pass a carray into da.store()). This implies multiple threads attempting to concurrently write to the same carray via __setitem__. Note that concurrent write operations will always be attempting to write a different slice of the array. However, the slice boundaries may not coincide with the internal chunk boundaries within the carray, and so two threads could be attempting to write to the same chunks.

First and foremost, I would like this to be safe, i.e., no segfaults, no data corruption. If that can be achieved, then I would also like to maximise performance. I think this would mean two things, (i) supporting concurrent reads on a carray, and (ii) releasing the GIL during compression and decompression operations so other threads can run in parallel.

I think achieving thread safety means implementing some kind of locking behaviour within bcolz, probably at the level of an individual carray. With the correct locking behaviour, it should then also be safe to release the GIL during calls to blosc compression and decompression functions.

Note that my use cases do not involve ever attempting to concurrently read and write to the same carray. Therefore, strictly speaking, my use cases do not ever need to consider this situation. Either multiple threads are attempting to read, or multiple threads are attempting to write. However, I think the solution should also handle the concurrent read/write situation somehow, rather than simply say in the documentation "this situation is not safe".

@FrancescAlted
Copy link
Member

Hmm, AppVeyor on Python2 and 32-bit platforms is complaining about:

c-blosc/blosc\bitshuffle-sse2.c(20) : fatal error C1189: #error :  SSE2 is not supported by the target architecture/platform and/or this compiler.

python-blosc has the same issue, so probably we will need to add the /arch:sse2 option only for Windows 32 platforms. Could you try that and see?

@FrancescAlted
Copy link
Member

Regarding your comments on making bcolz thread-safe during writes, I agree that this is actually unsafe for a single carray, so perhaps we can activate kind of a lock for different threads trying to modify the same carray. But frankly, I would not put too much time on this until we are sure that this actually makes writes faster. So why not do you do some benchmarks for your use case (writing in different carrays simultaneously) and in case you find evidence that is would be actually faster, then try a safe implementation?

Secondly, I think we should not discuss this in this PR (mainly about upgrading c-blosc), but rather in another one.

@alimanfoo
Copy link
Contributor Author

Hi @FrancescAlted, yes I will do some benchmarks for my use cases. Btw sorry for any confusion, in my use cases I want to be able to concurrently read from the same carray, and I want to be able to concurrently write to the same carray, but I never need to concurrently both read and write to the same carray.

I'll try and push a fix for the SSE2 issue in #285 which is the PR for upgrading blosc. This PR is about thread-safety, which I will (re)base on #285.

@FrancescAlted
Copy link
Member

Ah, ok, it is much clear now. Sorry for the misunderstanding.

@alimanfoo
Copy link
Contributor Author

No problem. Basically, I think some locking is required to use a carray safely with da.store(), because dask will attempt to concurrently write to the carray via __setitem__. cc @mrocklin

@mrocklin
Copy link
Contributor

I'd be happy to handle locking on the da.store side if that's required. Ideally the storage library handles this (so that other multi-threaded applications can benefit) but if its easier to do with dask I'm happy to help out.

@alimanfoo
Copy link
Contributor Author

I've done some investigating of the feasibility of implementing a lock for the bcolz carray public methods. In principle I think the appropriate lock in general would be a read-write lock, i.e., single-writer multiple-reader. Commit 911c763 has a RWLock implementation borrowed from django, and uses the lock when entering public methods like __getitem__(), __setitem__(), append(), etc.

I'd really appreciate some thoughts on whether this is worth pursuing. There are a couple of issues that make this less than straightforward, so making the full bcolz public API thread-safe would take some further work. In particular, I could not immediately see how to make thread-safe implementations of any method that returns an iterator, e.g., __iter__() or where(). Partly I think this is because a carray is allowed to be its own iterator, but even if that were not the case it's not obvious to me how to handle iterators.

FWIW the test suite runs with bcolz in thread-safe mode, i.e., there are no situations that result in a deadlock. However a number of tests fail because I am currently raising NotImplementedError for public methods that are not implemented in thread-safe mode.

It certainly would be an easier option to require that multithreaded libraries like dask implement their own write lock. It's tempting to go for that. Bcolz could then just release the GIL around blosc calls, and leave synchronisation up to the caller.

Btw I'm not invested in either approach so very open to discussion. This has been a great learning exercise so not wasted effort either way.

@alimanfoo
Copy link
Contributor Author

On reflection, I think it might be better to handle thread-safety at a higher-level than the bcolz carray and ctable classes. E.g., one could imagine writing a wrapper class which protects a wrapped carray with a RW lock, and exposing only those methods from the wrapped carray that can be protected with the lock.

A wrapper class like this would be straightforward to implement. I might modify this PR to implement that, so at least there is some code available for anyone interested, but then leave it open to discussion as to whether this should get incorporated into bcolz.

@FrancescAlted
Copy link
Member

Yes, I think the wrapper class is a good idea too.

@mrocklin
Copy link
Contributor

A wrapper class would be handy. Other libraries could benefit from this.

I'm also entirely happy to do thread-safe writing on the dask side if this is distracting from more pressing work.

@alimanfoo
Copy link
Contributor Author

I've reverted the original work attempting to integrate a lock within the carray class, and have added a thread-safe wrapper class instead within a new bcolz.swmr module. It was a small amount of work so thought it worth following through. @FrancescAlted I leave it up to you if you want to integrate this within bcolz or not. If you do I can add some documentation and tests, if not I totally understand, happy either way.

@mrocklin thanks, it sounds like a good idea to me to implement thread-safe writing in dask, whatever happens with this PR.

@mrocklin
Copy link
Contributor

Actually, it looks like da.store does use a lock.

See dask.array.core.py:insert_to_ooc()

@FrancescAlted
Copy link
Member

@alimanfoo Yes, the new approach is simple enough; I like it. Feel free to add some tests and docs and will merge.

@alimanfoo alimanfoo mentioned this pull request Mar 8, 2016
@FrancescAlted
Copy link
Member

@alimanfoo , after 1.0.0 using better multithreading locking this would need some adaptation. Would you be interested in continuing the work (possibly in other PR)? Or maybe your prefer to focus on zarr? Anyway, whatever your decision is it would be fine with me.

@alimanfoo
Copy link
Contributor Author

I think the best thing for now is if I focus on zarr and gain some
experience, then come back to bcolz at a later date.

On Friday, April 1, 2016, FrancescAlted notifications@github.com wrote:

@alimanfoo https://github.com/alimanfoo , after 1.0.0 using better
multithreading locking this would need some adaptation. Would you be
interested in continuing the work (possibly in other PR)? Or maybe your
prefer to focus on zarr? Anyway, whatever your decision is it would be fine
with me.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#286 (comment)

Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health http://cggh.org
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Email: alimanfoo@googlemail.com alimanfoo@gmail.com
Web: http://purl.org/net/aliman
Twitter: https://twitter.com/alimanfoo
Tel: +44 (0)1865 287721

@FrancescAlted
Copy link
Member

This was addressed in 5f7652d (basically a backport of the zarr solution), so closing.

@alimanfoo
Copy link
Contributor Author

Cool, glad this was useful for bcolz too.

On Wed, Jun 29, 2016 at 7:47 PM, FrancescAlted notifications@github.com
wrote:

Closed #286 #286.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#286 (comment), or mute the
thread
https://github.com/notifications/unsubscribe/AAq8Qjf3y9QhBstj57Yx8huv6c6C1muYks5qQr3RgaJpZM4G0nIv
.

Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health http://cggh.org
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Email: alimanfoo@googlemail.com alimanfoo@gmail.com
Web: http://purl.org/net/aliman
Twitter: https://twitter.com/alimanfoo
Tel: +44 (0)1865 287721

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants