Conversation
I did a bit of reading around threading in Python and believe that, even if bcolz holds on to the GIL at all times, it is still not safe to use within a multi-threaded environment. Here's a bit of background and reasoning. There is some very useful explanation of Python multi-threaded behaviour and the GIL in these slides from David Beazley. The key point is that if a CPython process launches multiple threads, even though only one thread can execute Python code at any one time because of the GIL, control will still get switched between threads at various points. Exactly when and how switching occurs depends on which Python version and configuration settings. However, in theory, switches could occur at any point in the code. So, for example, one thread could be half-way through reading a bcolz carray, then a switch could pass control to a different thread which could erase all data in the carray. When control returns to the first thread, the data will obviously be in an inconsistent state. My main motivation for wanting thread-safety is so that I can use bcolz with dask.array via dask's thread scheduler. I would like to be able to use carrays as the source of data for some parallel computation (i.e., pass a carray into da.from_array()). This implies multiple threads attempting to concurrently read from the same carray via First and foremost, I would like this to be safe, i.e., no segfaults, no data corruption. If that can be achieved, then I would also like to maximise performance. I think this would mean two things, (i) supporting concurrent reads on a carray, and (ii) releasing the GIL during compression and decompression operations so other threads can run in parallel. I think achieving thread safety means implementing some kind of locking behaviour within bcolz, probably at the level of an individual carray. With the correct locking behaviour, it should then also be safe to release the GIL during calls to blosc compression and decompression functions. Note that my use cases do not involve ever attempting to concurrently read and write to the same carray. Therefore, strictly speaking, my use cases do not ever need to consider this situation. Either multiple threads are attempting to read, or multiple threads are attempting to write. However, I think the solution should also handle the concurrent read/write situation somehow, rather than simply say in the documentation "this situation is not safe". |
Hmm, AppVeyor on Python2 and 32-bit platforms is complaining about:
python-blosc has the same issue, so probably we will need to add the /arch:sse2 option only for Windows 32 platforms. Could you try that and see? |
Regarding your comments on making bcolz thread-safe during writes, I agree that this is actually unsafe for a single carray, so perhaps we can activate kind of a lock for different threads trying to modify the same carray. But frankly, I would not put too much time on this until we are sure that this actually makes writes faster. So why not do you do some benchmarks for your use case (writing in different carrays simultaneously) and in case you find evidence that is would be actually faster, then try a safe implementation? Secondly, I think we should not discuss this in this PR (mainly about upgrading c-blosc), but rather in another one. |
Hi @FrancescAlted, yes I will do some benchmarks for my use cases. Btw sorry for any confusion, in my use cases I want to be able to concurrently read from the same carray, and I want to be able to concurrently write to the same carray, but I never need to concurrently both read and write to the same carray. I'll try and push a fix for the SSE2 issue in #285 which is the PR for upgrading blosc. This PR is about thread-safety, which I will (re)base on #285. |
Ah, ok, it is much clear now. Sorry for the misunderstanding. |
No problem. Basically, I think some locking is required to use a carray safely with da.store(), because dask will attempt to concurrently write to the carray via |
I'd be happy to handle locking on the |
I've done some investigating of the feasibility of implementing a lock for the bcolz carray public methods. In principle I think the appropriate lock in general would be a read-write lock, i.e., single-writer multiple-reader. Commit 911c763 has a RWLock implementation borrowed from django, and uses the lock when entering public methods like I'd really appreciate some thoughts on whether this is worth pursuing. There are a couple of issues that make this less than straightforward, so making the full bcolz public API thread-safe would take some further work. In particular, I could not immediately see how to make thread-safe implementations of any method that returns an iterator, e.g., FWIW the test suite runs with bcolz in thread-safe mode, i.e., there are no situations that result in a deadlock. However a number of tests fail because I am currently raising NotImplementedError for public methods that are not implemented in thread-safe mode. It certainly would be an easier option to require that multithreaded libraries like dask implement their own write lock. It's tempting to go for that. Bcolz could then just release the GIL around blosc calls, and leave synchronisation up to the caller. Btw I'm not invested in either approach so very open to discussion. This has been a great learning exercise so not wasted effort either way. |
On reflection, I think it might be better to handle thread-safety at a higher-level than the bcolz carray and ctable classes. E.g., one could imagine writing a wrapper class which protects a wrapped carray with a RW lock, and exposing only those methods from the wrapped carray that can be protected with the lock. A wrapper class like this would be straightforward to implement. I might modify this PR to implement that, so at least there is some code available for anyone interested, but then leave it open to discussion as to whether this should get incorporated into bcolz. |
Yes, I think the wrapper class is a good idea too. |
…ethods" This reverts commit 911c763.
A wrapper class would be handy. Other libraries could benefit from this. I'm also entirely happy to do thread-safe writing on the dask side if this is distracting from more pressing work. |
I've reverted the original work attempting to integrate a lock within the carray class, and have added a thread-safe wrapper class instead within a new bcolz.swmr module. It was a small amount of work so thought it worth following through. @FrancescAlted I leave it up to you if you want to integrate this within bcolz or not. If you do I can add some documentation and tests, if not I totally understand, happy either way. @mrocklin thanks, it sounds like a good idea to me to implement thread-safe writing in dask, whatever happens with this PR. |
Actually, it looks like See |
@alimanfoo Yes, the new approach is simple enough; I like it. Feel free to add some tests and docs and will merge. |
@alimanfoo , after 1.0.0 using better multithreading locking this would need some adaptation. Would you be interested in continuing the work (possibly in other PR)? Or maybe your prefer to focus on zarr? Anyway, whatever your decision is it would be fine with me. |
I think the best thing for now is if I focus on zarr and gain some On Friday, April 1, 2016, FrancescAlted notifications@github.com wrote:
Alistair Miles |
This was addressed in 5f7652d (basically a backport of the zarr solution), so closing. |
Cool, glad this was useful for bcolz too. On Wed, Jun 29, 2016 at 7:47 PM, FrancescAlted notifications@github.com
Alistair Miles |
This PR builds on #285 to investigate the possibility to make bcolz safe to use within multi-threaded environments.
The primary goal of this PR is to make bcolz threadsafe.
The secondary goal of this PR is to investigate ways to optimise performance of multithreaded applications using bcolz. This includes looking for opportunities to safely release the GIL, and investigating the possibility for concurrent read operations on the same carray.