
Conversation

@JonasIsensee
Collaborator

@JonasIsensee JonasIsensee commented Dec 5, 2020

This PR implements compression using the API of TranscodingStreams.

It appears to work now!
Please try it out and report!

@JonasIsensee
Collaborator Author

JonasIsensee commented Jan 9, 2021

This latest commit works well locally.
E.g.

using JLD2, CodecLz4, CodecZlib
@save "test1.jld2" {iotype=IOStream, compress=CodecLz4.LZ4FrameCompressor()} a=Union{Missing,Float64}[zeros(10^6);]
@save "test2.jld2" {compress=CodecZlib.ZlibCompressor()} a=Union{Missing,Float64}[zeros(10^6);]
jldopen("test4.jld2", "w") do f
       write(f, "a", zeros(10^6); compress=CodecZlib.ZlibCompressor())
       write(f, "b", zeros(10^7); compress=CodecLz4.LZ4FrameCompressor())
end
julia> using Revise, JLD2

julia> @load "test4.jld2"
[ Info: Attempting to dynamically load CodecZlib
[ Info: Attempting to dynamically load CodecLz4
2-element Array{Symbol,1}:
 :a
 :b

Annoyingly I can't get h5dump to display files with LZ4 compressed data sets. Within julia they work fine.

@JonasIsensee
Collaborator Author

CodecLz4 and CodecBzip2 also work, but it seems that the implementations differ (?) from the normal C libraries?
At least I can't get the files to load in regular HDF5. JLD2 round-trips work fine.

Blosc can also be added as soon as it supports the TranscodingStreams API. ( see JuliaIO/Blosc.jl#79 )
Of course, for proper blosc support JLD2 also needs to implement chunking as I had attempted in #254 .

@JonasIsensee
Collaborator Author

The failure is due to isnothing only being available starting with Julia 1.1.
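
(For reference, the workaround is simply to spell out the comparison that isnothing wraps; x here is just a placeholder:)

# isnothing was added in Julia 1.1; on 1.0 the direct comparison
# behaves identically
x === nothing   # instead of isnothing(x)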

@JonasIsensee
Collaborator Author

I've got a working version now!

This allows for compression with CodecZlib, CodecLz4, CodecBzip2, (and Blosc).

Blosc doesn't actually support the TranscodingStreams API yet, so you have to define this in your script:

using Blosc
# Define empty marker structs inside the Blosc module so they can stand in
# for TranscodingStreams codec types:
Blosc.eval(:(struct BloscCompressor end))
Blosc.eval(:(struct BloscDecompressor end))
import Blosc: BloscCompressor, BloscDecompressor
import JLD2: TranscodingStreams

# Route the TranscodingStreams API straight to Blosc's one-shot functions;
# initialize/finalize are no-ops since there is no stream state to manage:
TranscodingStreams.transcode(::BloscCompressor, buf) = Blosc.compress(buf)
TranscodingStreams.initialize(::BloscCompressor) = nothing
TranscodingStreams.finalize(::BloscCompressor) = nothing

TranscodingStreams.transcode(::BloscDecompressor, buf) = Blosc.decompress(UInt8, buf)
TranscodingStreams.initialize(::BloscDecompressor) = nothing
TranscodingStreams.finalize(::BloscDecompressor) = nothing

which is a shameful abuse of the powers of Julia, but hey, at least it's possible at all.
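
With those definitions evaluated, saving works the same way as with the other codecs. A hypothetical usage sketch (file name and array made up):

using JLD2, Blosc
@save "test_blosc.jld2" {compress=BloscCompressor()} a=zeros(10^6)
@load "test_blosc.jld2" a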

@codecov

codecov bot commented Jan 24, 2021

Codecov Report

Merging #264 (513a577) into master (ee85fa7) will decrease coverage by 0.06%.
The diff coverage is 91.55%.

❗ Current head 513a577 differs from pull request most recent head e179d49. Consider uploading reports for the commit e179d49 to get more accurate results

@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
- Coverage   90.32%   90.26%   -0.07%     
==========================================
  Files          26       27       +1     
  Lines        2647     2721      +74     
==========================================
+ Hits         2391     2456      +65     
- Misses        256      265       +9     
Impacted Files Coverage Δ
src/dataio.jl 98.51% <ø> (-0.20%) ⬇️
src/groups.jl 86.12% <ø> (+<0.01%) ⬆️
src/JLD2.jl 89.71% <11.11%> (-2.15%) ⬇️
src/inlineunion.jl 96.77% <75.00%> (ø)
src/compression.jl 96.66% <96.66%> (ø)
src/datasets.jl 92.24% <100.00%> (-0.41%) ⬇️
src/misc.jl 96.07% <0.00%> (-3.93%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@takbal

takbal commented Mar 20, 2021

I have trialled this branch, and it works as advertised; however, I see a large performance penalty when reading files.

In a particular example, @time says:

  • JLD2 0.4.2: 2 seconds at 700MB allocations
  • compression branch: 17 seconds at 6.6GB allocations

for reading the same zlib-compressed 250MB JLD2 file, saved with compress=CodecZlib.ZlibCompressor(), using the code in this branch.

Is this known / expected?

@JonasIsensee
Collaborator Author

Hi @takbal !

Thanks for trying this out. No, this is not expected. That shouldn't take so long.

Would you share your test file with me, or a script to reproduce something like it?
In my own testing I hadn't seen slowdowns like that.

@takbal

takbal commented Mar 23, 2021

@JonasIsensee the problem appears even with plain random matrices. Test to reproduce:

using JLD2, BenchmarkTools, FileIO

# the compression branch defines COMPRESSOR_TO_ID; use it to pick the
# right `compress` argument for whichever version is installed
if isdefined(JLD2, :COMPRESSOR_TO_ID)
        using CodecZlib
        compressor = CodecZlib.ZlibCompressor()
else
        compressor = true
end

testmat = rand(1000,1000)

save("test.jld2", "testmat", testmat, compress = compressor)

@btime load("test.jld2")

This code gives me on JLD2 0.4.2:

42.475 ms (159 allocations: 22.00 MiB)

And on the 'compression' branch:

137.544 ms (3000142 allocations: 67.78 MiB)

Uncompressed files give the same results on both.

@JonasIsensee
Collaborator Author

The compression itself is not the problem, but my changes appear to have caused type instabilities or optimization failures downstream. I'll look into it.
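
(For anyone curious: the usual cure for this kind of instability is a function barrier. A generic illustration, not the actual change, with made-up names:)

function load_dataset(io)
    buf = read_chunk(io)     # inferred as Any: the compiler cannot see
                             # which decompressor produced the buffer
    return process(buf)      # function barrier: inside process, the
end                          # runtime type of buf is concrete again

process(buf) = sum(buf)      # compiled/specialized per concrete type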

@JonasIsensee
Collaborator Author

Hi @takbal,
I just pushed a single-line change that removes the runtime problem for me locally.
Would you be so kind as to verify?

@takbal

takbal commented Mar 25, 2021

@JonasIsensee the change seems to fix it completely, thanks!

BTW how does compression interact with memory mapping?

Overall, the docs do not explain when and how JLD2 uses mmap(), but from the code, I suspect it tries to use it when it is likely faster. But it is even more unclear to me what happens with compression.

@JonasIsensee
Collaborator Author

> BTW how does compression interact with memory mapping?
>
> Overall, the docs do not explain when and how JLD2 uses mmap(), but from the code, I suspect it tries to use it when it is likely faster. But it is even more unclear to me what happens with compression.

Hm, I'm not sure I understand your question.
JLD2 has two modes of interacting with files: via mmap or via a regular IO stream, where mmap is the default since it is often faster.
Unless there is, e.g., a system-specific problem with mmap, this does not need to concern the user.
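
For completeness: the mode can be selected explicitly with the iotype keyword (the same keyword used in the @save examples above), e.g.

using JLD2
f = jldopen("test.jld2", "r")                    # default: mmap-backed IO
close(f)
f = jldopen("test.jld2", "r"; iotype=IOStream)   # force a plain IO stream
close(f)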

@takbal

takbal commented Mar 26, 2021

What I am really interested in is whether it is possible to do sparse access of larger-than-memory arrays with JLD2 via memory-mapped files, perhaps even with compression. HDF5.jl has some of this functionality; see https://juliaio.github.io/HDF5.jl/stable/#Memory-mapping.

This gets a bit off-topic here, but it is related to compression, as I believe there are two theoretical ways of combining memory mapping and compression: either deny mapping compressed arrays, like HDF5.jl does, or map the compressed array and decompress the required bits on the fly when accessed.

Based on what you wrote, I now suspect that this is missing (yet), and data read by JLD2 always gets fully placed in a memory buffer. That would mean the mmap() usage I have seen in the code serves only as a technique for getting the data from disk into the buffer, and decompression happens during this.

Is this correct?

@JonasIsensee
Collaborator Author

Hi @takbal ,

> What I am really interested in is whether it is possible to do sparse access of larger-than-memory arrays with JLD2 via memory-mapped files, perhaps even with compression. HDF5.jl has some of this functionality; see https://juliaio.github.io/HDF5.jl/stable/#Memory-mapping.

There are a few things to note here:

  • First, yes, sparse access of (larger-than-memory) arrays would in principle be possible with JLD2. It just has not been implemented yet.
  • Currently, arrays can only be written to a file in a single piece (making it hard to create such large arrays in the first place). Chunked writing is possible in HDF5 and as such could also be added to JLD2 (chunking / HDF5 array index).
  • This is hard to do in general: short arrays are typically inlined into the dataset header (HDF5/JLD2 internals), meaning they are covered by a checksum. That makes it difficult to allow modification of the bytes using mmap. This is not a problem for larger arrays with more than 2^16 bytes.
  • Compression is a whole different topic: I'm not aware of any compression algorithms that allow for sparse decoding. Compression is typically a non-local procedure, so you can only decode the whole array at once. One way around this would be to divide the array into chunks (see above) prior to serialization; then the blocks could be loaded independently of one another. (See the sketch after this list.)
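
To make the chunking idea concrete, here is a toy sketch (illustrative only, not JLD2's actual file layout) that compresses fixed-size blocks independently with CodecZlib, so that a single block can later be decoded on its own:

using CodecZlib

data  = rand(Float64, 10^6)
bytes = reinterpret(UInt8, data)
csize = 2^16                       # chunk size in bytes
# compress each chunk independently of the others
chunks = [transcode(ZlibCompressor, bytes[i:min(i + csize - 1, end)])
          for i in 1:csize:length(bytes)]

# random access: decode only the chunk you need
block = reinterpret(Float64, transcode(ZlibDecompressor, chunks[1]))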

> This gets a bit off-topic here, but it is related to compression, as I believe there are two theoretical ways of combining memory mapping and compression: either deny mapping compressed arrays, like HDF5.jl does, or map the compressed array and decompress the required bits on the fly when accessed.

Again, I haven't seen this before. (Also, are you sure that HDF5 allows mmap of compressed arrays? That doesn't really make sense to me...)

> Based on what you wrote, I now suspect that this is missing (yet), and data read by JLD2 always gets fully placed in a memory buffer. That would mean the mmap() usage I have seen in the code serves only as a technique for getting the data from disk into the buffer, and decompression happens during this.
>
> Is this correct?

Yes, that is correct.

However, if you are keen, we can discuss how some of these ideas could be added to JLD2.
In the end, it's all just bits and bytes in the files and we can do with them whatever we like.

@takbal

takbal commented Mar 27, 2021

> are you sure that HDF5 allows mmap of compressed arrays?

No, it does not; that's why I wrote "deny mapping compressed arrays like HDF5.jl". I agree that on-the-fly decompression would be hard to implement, but as you said, with chunking it is possible in theory. It is likely not worth it, though: if compression is that important, the user likely did not choose the best representation of the data in the first place.

I believe memory-mapped reads have an important use case in process-level parallelism. A large array may fit into memory once when it is written, but processes likely do read-only work on parts of it, where limiting per-process memory use is important. If no memory-mapped read is possible, threads (and more complicated code) remain the only option.

So if JLD2 allowed simple memory mapping of bits-type arrays like HDF5 does, I believe that would be a valuable addition.

Anyway, thank you for your clarifications!

add optimizations

Added (broken) chunking

wip: transcodingstreams compression

Pass compressor directly

add test deps

need another invokelatest

revert kw syntax for 1.0

also added bzip2

add message to custom exceptions

working version

remove isnothing for 1.0 compat

add bzip test dep

typo in project

add a type assertion

improve error messages

remove doubly defined stuff
@JonasIsensee JonasIsensee changed the title WIP: Compression with TranscodingStreams API Compression with TranscodingStreams API Apr 24, 2021
@JonasIsensee JonasIsensee merged commit 42bff5b into master Apr 24, 2021
@JonasIsensee JonasIsensee mentioned this pull request Apr 24, 2021
@JonasIsensee JonasIsensee deleted the compression branch May 13, 2021 09:13