Compression with TranscodingStreams API #264
Conversation
This latest commit works well locally. Annoyingly I can't get
Failure is due to
I've got a working version now! This allows for compression with Blosc, even though Blosc doesn't actually support the TranscodingStreams interface, which is a shameful abuse of the powers of Julia, but hey, at least it's possible at all.
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
- Coverage   90.32%   90.26%   -0.07%
==========================================
  Files          26       27       +1
  Lines        2647     2721      +74
==========================================
+ Hits         2391     2456      +65
- Misses        256      265       +9
```
I have trialled this branch, and it works as advertised. However, I see a large performance penalty when reading files. In a particular example, @time says:
for reading the same zlib-compressed 250MB JLD2 file, saved with compress=CodecZlib.ZlibCompressor(), using the code in this branch. Is this known / expected?
Hi @takbal! Thanks for trying this out. No, this is not expected; it shouldn't take that long. Could you share your test file with me, or a script to reproduce something like it?
@JonasIsensee the problem appears even with plain random matrices. Test to reproduce: On JLD2 0.4.2, this code gives me: And on the 'compression' branch: Uncompressed files give the same results on both.
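The original reproduction script was not captured in this transcript. Below is a hedged sketch of the described setup; the file name and matrix size are made up, and the `compress` keyword follows the usage quoted earlier in the thread.

```julia
# Sketch of the reported setup: save a plain random matrix with zlib
# compression, then time the read (file name and size are illustrative).
using JLD2, CodecZlib

A = rand(2000, 2000)  # plain random Float64 matrix, roughly 30 MB

# Save with compression, as in the report above
jldopen("repro.jld2", "w"; compress=ZlibCompressor()) do f
    f["A"] = A
end

# Compare this timing on JLD2 0.4.2 vs. the 'compression' branch
@time B = load("repro.jld2", "A")
@assert B == A  # round-trip sanity check
```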
The compression itself is not the problem; my changes appear to have caused type instabilities or optimization failures downstream. I'll look into it.
Hi @takbal, |
@JonasIsensee the change seems to fix it completely, thanks! BTW, how does compression interact with memory mapping? Overall, the docs don't explain when and how JLD2 uses mmap(), but from the code I suspect it tries to use it when that is likely faster. It is even less clear to me what happens with compression.
hm, I'm not super sure I understand your question. |
What I am really interested in is whether it is possible to do sparse access of larger-than-memory arrays with JLD2 via memory-mapped files, perhaps even with compression. HDF5.jl has some of this functionality, see https://juliaio.github.io/HDF5.jl/stable/#Memory-mapping. This gets a bit off-topic here, but it is related to compression: I believe there are two theoretical ways of combining memory mapping with compression: either deny mapping compressed arrays, like HDF5.jl does, or map the compressed array and decompress the required bits on the fly when accessed. Based on what you wrote, I now suspect that this is missing (yet), and data read by JLD2 always gets fully placed in a memory buffer. That would mean the mmap() usage I have seen in the code serves only as a technique for getting the data from disk into that buffer, and decompression happens during this. Is this correct?
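For reference, the HDF5.jl memory-mapping functionality linked above can be sketched roughly like this; `readmmap` works only on uncompressed, contiguous datasets, and the file and dataset names here are made up for illustration.

```julia
# Sketch of HDF5.jl memory-mapped reads (uncompressed datasets only).
using HDF5

# Write a contiguous, uncompressed dataset once
h5open("big.h5", "w") do f
    f["A"] = rand(1000, 1000)
end

# Map it read-only: no full copy into RAM, pages fault in on access
h5open("big.h5", "r") do f
    A = HDF5.readmmap(f["A"])  # an mmap-backed Array
    first_val = A[1, 1]        # only the touched pages are read from disk
end
```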
Hi @takbal,
There are a few things to note here:
Again, I haven't seen this before. (Also, are you sure that HDF5 allows mmap of compressed arrays? That doesn't really make sense to me..)
Yes, that is correct. However, if you are keen, we can discuss how some of these ideas could be added to JLD2.
No, it does not; that's why I wrote "deny mapping compressed arrays like HDF5.jl". I agree that on-the-fly decompression would be hard to implement, but as you said, with chunking it is possible in theory. It is likely not worth it, though: if compression is that important, the user probably did not choose the best representation for the data in the first place.

I believe memory-mapped reads have an important use case in process-level parallelism. A large array may fit into memory once when it is written, but processes likely do read-only work on parts of it, where limiting per-process memory use is important. If no memory-mapped read is possible, threads (and more complicated code) remain the only option. So if JLD2 allowed simple memory mapping of bits-type arrays like HDF5 does, I believe that would be a valuable addition.

Anyway, thank you for your clarifications!
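The process-parallelism use case described above can be illustrated with Julia's Mmap stdlib: several read-only processes can map the same raw file, and the OS shares the pages between them instead of each process holding a private copy. The file name and array size are made up; this is the pattern an mmap-read feature in JLD2 would enable, not JLD2's actual behavior.

```julia
# Illustration: share one large on-disk bits-type array across readers
# via Mmap instead of loading a private copy per process.
using Mmap

A = rand(Float64, 1000, 1000)
open("array.bin", "w") do io
    write(io, A)  # dump the array as raw column-major bytes, once
end

# Each reader maps the file; pages are demand-loaded and OS-shared
io = open("array.bin", "r")
B = mmap(io, Matrix{Float64}, (1000, 1000))  # no copy made here
@assert B == A
close(io)
```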
- add optimizations
- Added (broken) chunking
- wip: transcodingstreams compression
- Pass compressor directly
- add test deps
- need another invokelatest
- revert kw syntax for 1.0
- also added bzip2
- add message to custom exceptions
- working version
- remove isnothing for 1.0 compat
- add bzip test dep
- typo in project
- add a type assertion
- improve error messages
- remove doubly defined stuff
Force-pushed from a0db11a to 513a577
This PR implements compression using the API of TranscodingStreams.
It appears to work now!
Please try it out and report!
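A usage sketch based on this thread: a TranscodingStreams codec is passed via the `compress` keyword quoted earlier, and bzip2 support is mentioned in the commit list. The file names are made up, and the exact keyword behavior on this branch may differ.

```julia
# Hedged usage sketch of the TranscodingStreams-based compression API:
# any compressor codec should be usable, per the PR description.
using JLD2, CodecZlib, CodecBzip2

x = collect(1.0:10_000.0)

jldopen("zlib.jld2", "w"; compress=ZlibCompressor()) do f
    f["x"] = x
end
jldopen("bzip2.jld2", "w"; compress=Bzip2Compressor()) do f
    f["x"] = x
end

# Reading back needs no codec argument; decompression is transparent
@assert load("zlib.jld2", "x") == x
@assert load("bzip2.jld2", "x") == x
```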