Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Zstd compression #560

Merged
merged 7 commits into from
Jul 5, 2024
Merged

Add Zstd compression #560

merged 7 commits into from
Jul 5, 2024

Conversation

milankl
Copy link
Contributor

@milankl milankl commented May 13, 2024

fixes #357

@mkitti I'm not sure what the 4-element tuple in ID_TO_COMPRESSOR is supposed to be? Package name, compressor name, decompressor name, some short name in caps?

Copy link

codecov bot commented May 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.94%. Comparing base (ccd60eb) to head (761217e).
Report is 8 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #560   +/-   ##
=======================================
  Coverage   86.94%   86.94%           
=======================================
  Files          31       31           
  Lines        4311     4321   +10     
=======================================
+ Hits         3748     3757    +9     
- Misses        563      564    +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mkitti
Copy link
Member

mkitti commented May 13, 2024

I'm not an expert on this package, but you seem to be correct.

modname, compressorname, decompressorname, = ID_TO_DECOMPRESSOR[filter_id]

@milankl
Copy link
Contributor Author

milankl commented May 13, 2024

Manual test:

julia> using JLD2, CodecZstd

julia> A = zeros(1000,1000);

julia> A[1] = rand()
0.44335952762563824

julia> sizeof(A)/1000^2   # 8 MB array
8.0

julia> save("test_without_compression.jld2", "A", A)

julia> save("test_with_compression.jld2", "A", A, compress=ZstdCompressor())

julia> A == load("test_with_compression.jld2", "A")
true

julia> A == load("test_without_compression.jld2", "A")
true

and I have two files on disk then

-rw-r--r--@   1 milan  staff   1.1K 13 May 12:59 test_with_compression.jld2
-rw-r--r--@   1 milan  staff   7.6M 13 May 12:58 test_without_compression.jld2

one uncompressed of about 8MB and one compressed to 1KB

@mkitti
Copy link
Member

mkitti commented May 13, 2024

You should test if you can load the file via HDF5.jl.

using HDF5
using H5Zzstd

h5open("test_with_compression.jld2") do h5f
   h5f["A"][]
end

@milankl
Copy link
Contributor Author

milankl commented May 13, 2024

Okay that's not working correctly

julia> h5open("test_with_compression.jld2") do h5f
          h5f["A"][]
       end
1000×1000 Matrix{Float64}:
   0.0           0.0             0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   7.41892e-310  0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   1.95e-321     0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   7.41892e-310  0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   1.42e-321     0.0             0.0  0.0  0.0  0.0  0.0  0.0  0.0
 NaN             0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
 NaN             0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   7.41892e-310  0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   3.6e-322      0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
                                          
   0.0           4.0e-323        0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           5.17085e170      0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           9.56977e-315    0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           0.0              0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           1.11254e-308     0.0  0.0  0.0  0.0  0.0  0.0  0.0
   0.0           1.11255e-308     0.0  0.0  0.0  0.0  0.0  0.0  0.0

(it should be A[1] == 0.4436 otherwise zero), even though the file without zstd compression is loaded correctly. Why is that?

edit: The same works perfectly fine with compress=ZlibCompressor(), but I'm also confused the decompressed matrix is different every time it's decompressed...

@mkitti
Copy link
Member

mkitti commented May 13, 2024

So the problem is that we are using ZSTD_compressStream and ZSTD_decompressStream here. While the upstream reference uses ZSTD_compress and ZSTD_decompress.

https://github.com/HDFGroup/hdf5_plugins/blob/master/ZSTD/src/H5Zzstd.c

https://facebook.github.io/zstd/zstd_manual.html

@mkitti
Copy link
Member

mkitti commented May 13, 2024

OK, I'm going to start working on a ZstdFrameCompressor / ZstdFrameDecompressor . I'm not sure if that is the final name yet, but it's basically the NotStream compressor / decompressor.

I'm not completely sure how to make that work in the TranscodingStream API, but we'll see.

@milankl
Copy link
Contributor Author

milankl commented May 13, 2024

Awesome, let me know if I can help!

@milankl
Copy link
Contributor Author

milankl commented May 14, 2024

@mkitti I've written already the tests for CodecZstd here using ZstdFrameCompressor, however, they are commented until JuliaIO/CodecZstd.jl#46 is merged and released

@milankl
Copy link
Contributor Author

milankl commented May 30, 2024

I've set the compressor here from ZstdFrameCompressor to ZstdCompressor, but we need to release a new version of CodecZstd.jl to actually test this here properly: JuliaIO/CodecZstd.jl#52 (comment)

@mkitti
Copy link
Member

mkitti commented May 30, 2024

I think it should still be ZstdFrameCompressor. Sorry for the confusion. I'll will work on a release tonight.

@mkitti
Copy link
Member

mkitti commented Jun 2, 2024

Note that CodecZstd.jl contains ZstdFrameCompressor, so there is no dependency block here now.

@JonasIsensee
Copy link
Collaborator

Hi,
what is the status on this?

Locally, a round trip of ZstdFrameCompressor and ZstdDecompressor works just fine within JLD2
but not via HDF5.
e.g.

julia> using JLD2, CodecZstd

julia> A = zeros(1000,1000);

julia> A[1] = rand()
0.44335952762563824

julia> sizeof(A)/1000^2   # 8 MB array
8.0

julia> save("test_without_compression.jld2", "A", A)

julia> save("test_with_compression.jld2", "A", A, compress=ZstdFrameCompressor())

julia> A == load("test_with_compression.jld2", "A")
true

julia> A == load("test_without_compression.jld2", "A")
true

with h5dump

HDF5 "test_with_compression.jld2" {
GROUP "/" {
   DATASET "A" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 1000, 1000 ) / ( 1000, 1000 ) }
      DATA {h5dump error: unable to print data

      }
   }
}
}

or

julia> h5open("test_with_compression.jld2") do h5f
                 h5f["A"][]
              end
┌ Error: Exception while generating log record in module H5Zzstd at /home/jisensee/.julia/packages/H5Zzstd/ad081/src/H5Zzstd.jl:86
│   exception =
│    UndefVarError: `err` not defined
│    Stacktrace:
│      [1] logging_error(logger::Any, level::Any, _module::Any, group::Any, id::Any, filepath::Any, line::Any, err::Any, real::Bool)
│        @ Base.CoreLogging ./logging.jl:478
│      [2] #invokelatest#2
│        @ ./essentials.jl:892 [inlined]
│      [3] invokelatest
│        @ ./essentials.jl:889 [inlined]
│      [4] macro expansion
│        @ ./logging.jl:364 [inlined]
│      [5] H5Z_filter_zstd(flags::UInt32, cd_nelmts::UInt64, cd_values::Ptr{UInt32}, nbytes::UInt64, buf_size::Ptr{UInt64}, buf::Ptr{Ptr{Nothing}})
│        @ H5Zzstd ~/.julia/packages/H5Zzstd/ad081/src/H5Zzstd.jl:86
│      [6] h5d_read(dataset_id::HDF5.Dataset, mem_type_id::HDF5.Datatype, mem_space_id::HDF5.Dataspace, file_space_id::HDF5.Dataspace, xfer_plist_id::HDF5.DatasetTransferProperties, buf::Matrix{Float64})
│        @ HDF5.API ~/.julia/packages/HDF5/Z859u/src/api/functions.jl:796
│      [7] _generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}, ::Nothing)
│        @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:180
│      [8] generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64})
│        @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:146
│      [9] getindex(::HDF5.Dataset)
│        @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:59
│     [10] (::var"#5#6")(h5f::HDF5.File)
│        @ Main ./REPL[6]:2
│     [11] (::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, var"#5#6", HDF5.File})()
│        @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:101
│     [12] task_local_storage(body::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, var"#5#6", HDF5.File}, key::Symbol, val::HDF5.HDF5Context)
│        @ Base ./task.jl:297
│     [13] h5open(f::var"#5#6", args::String; context::HDF5.HDF5Context, pv::@Kwargs{})
│        @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:96
│     [14] h5open(f::Function, args::String)
│        @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:94
│     [15] top-level scope
│        @ REPL[6]:1
│     [16] eval
│        @ ./boot.jl:385 [inlined]
│     [17] eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module)
│        @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:150
│     [18] repl_backend_loop(backend::REPL.REPLBackend, get_module::Function)
│        @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:246
│     [19] start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function)
│        @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:231
│     [20] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::Any)
│        @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:389
│     [21] run_repl(repl::REPL.AbstractREPL, consumer::Any)
│        @ REPL ~/.julia/juliaup/julia-1.10.2+0.x64.linux.gnu/share/julia/stdlib/v1.10/REPL/src/REPL.jl:375
│     [22] (::Base.var"#1013#1015"{Bool, Bool, Bool})(REPL::Module)
│        @ Base ./client.jl:432
│     [23] #invokelatest#2
│        @ ./essentials.jl:892 [inlined]
│     [24] invokelatest
│        @ ./essentials.jl:889 [inlined]
│     [25] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
│        @ Base ./client.jl:416
│     [26] exec_options(opts::Base.JLOptions)
│        @ Base ./client.jl:333
│     [27] _start()
│        @ Base ./client.jl:552
└ @ H5Zzstd ~/.julia/packages/H5Zzstd/ad081/src/H5Zzstd.jl:86
24-element Vector{Base.StackTraces.StackFrame}:
 H5Z_filter_zstd(flags::UInt32, cd_nelmts::UInt64, cd_values::Ptr{UInt32}, nbytes::UInt64, buf_size::Ptr{UInt64}, buf::Ptr{Ptr{Nothing}}) at H5Zzstd.jl:40
 h5d_read(dataset_id::HDF5.Dataset, mem_type_id::HDF5.Datatype, mem_space_id::HDF5.Dataspace, file_space_id::HDF5.Dataspace, xfer_plist_id::HDF5.DatasetTransferProperties, buf::Matrix{Float64}) at functions.jl:796
 _generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}, ::Nothing) at readwrite.jl:180
 generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}) at readwrite.jl:146
 getindex(::HDF5.Dataset) at readwrite.jl:59
 (::var"#5#6")(h5f::HDF5.File) at REPL[6]:2
 (::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, var"#5#6", HDF5.File})() at file.jl:101
 task_local_storage(body::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}, var"#5#6", HDF5.File}, key::Symbol, val::HDF5.HDF5Context) at task.jl:297
 h5open(f::var"#5#6", args::String; context::HDF5.HDF5Context, pv::Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}) at file.jl:96
 h5open(f::Function, args::String) at file.jl:94
 top-level scope at REPL[6]:1
 eval at boot.jl:385 [inlined]
 eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module) at REPL.jl:150
 repl_backend_loop(backend::REPL.REPLBackend, get_module::Function) at REPL.jl:246
 start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function) at REPL.jl:231
 kwcall(::NamedTuple, ::typeof(REPL.start_repl_backend), backend::REPL.REPLBackend, consumer::Any) at REPL.jl:228
 run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::Any) at REPL.jl:389
 run_repl(repl::REPL.AbstractREPL, consumer::Any) at REPL.jl:375
 (::Base.var"#1013#1015"{Bool, Bool, Bool})(REPL::Module) at client.jl:432
 #invokelatest#2 at essentials.jl:892 [inlined]
 invokelatest at essentials.jl:889 [inlined]
 run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool) at client.jl:416
 exec_options(opts::Base.JLOptions) at client.jl:333
 _start() at client.jl:552
ERROR: HDF5.API.H5Error: Error reading dataset /A
libhdf5 Stacktrace:
 [1] H5Z_pipeline: Data filters/Read failed
     filter returned failure during read
  ⋮
Stacktrace:
 [1] _generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}, ::Nothing)
   @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:209
 [2] generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64})
   @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:146
 [3] getindex(::HDF5.Dataset)
   @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:59
 [4] (::var"#5#6")(h5f::HDF5.File)
   @ Main ./REPL[6]:2
 [5] (::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, var"#5#6", HDF5.File})()
   @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:101
 [6] task_local_storage(body::HDF5.var"#17#18"{…}, key::Symbol, val::HDF5.HDF5Context)
   @ Base ./task.jl:297
 [7] h5open(f::var"#5#6", args::String; context::HDF5.HDF5Context, pv::@Kwargs{})
   @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:96
 [8] h5open(f::Function, args::String)
   @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:94
 [9] top-level scope
   @ REPL[6]:1

caused by: HDF5.API.H5Error: Error reading dataset /A
libhdf5 Stacktrace:
 [1] H5Z_pipeline: Data filters/Read failed
     filter returned failure during read
  ⋮
Stacktrace:
  [1] macro expansion
    @ ~/.julia/packages/HDF5/Z859u/src/api/error.jl:18 [inlined]
  [2] h5d_read(dataset_id::HDF5.Dataset, mem_type_id::HDF5.Datatype, mem_space_id::HDF5.Dataspace, file_space_id::HDF5.Dataspace, xfer_plist_id::HDF5.DatasetTransferProperties, buf::Matrix{…})
    @ HDF5.API ~/.julia/packages/HDF5/Z859u/src/api/functions.jl:800
  [3] _generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64}, ::Nothing)
    @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:180
  [4] generic_read(::HDF5.Dataset, ::HDF5.Datatype, ::Type{Float64})
    @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:146
  [5] getindex(::HDF5.Dataset)
    @ HDF5 ~/.julia/packages/HDF5/Z859u/src/readwrite.jl:59
  [6] (::var"#5#6")(h5f::HDF5.File)
    @ Main ./REPL[6]:2
  [7] (::HDF5.var"#17#18"{HDF5.HDF5Context, @Kwargs{}, var"#5#6", HDF5.File})()
    @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:101
  [8] task_local_storage(body::HDF5.var"#17#18"{…}, key::Symbol, val::HDF5.HDF5Context)
    @ Base ./task.jl:297
  [9] h5open(f::var"#5#6", args::String; context::HDF5.HDF5Context, pv::@Kwargs{})
    @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:96
 [10] h5open(f::Function, args::String)
    @ HDF5 ~/.julia/packages/HDF5/Z859u/src/file.jl:94
 [11] top-level scope
    @ REPL[6]:1
Some type information was truncated. Use `show(err)` to see complete types.

@mkitti
Copy link
Member

mkitti commented Jun 23, 2024

This should work now:

julia> using H5Zzstd, HDF5

julia> h5open("test_with_compression.jld2") do h5f
           h5f["A"][]
       end

julia> using HDF5_jll

julia> run(`$(HDF5_jll.h5ls()) -va test_with_compression.jld2`);
Opened "test_with_compression.jld2" with sec2 driver.
A                        Dataset {1000/1000, 1000/1000}
    Location:  1:48
    Links:     1
    Chunks:    {1000, 1000} 8000000 bytes
    Storage:   8000000 logical bytes, 283 allocated bytes, 2826855.12% utilization
    Filter-0:  ZSTD-32015  {5}
    Type:      native double
    Address: 189
           Flags    Bytes     Address          Logical Offset
        ========== ======== ========== ==============================
        0x00000000      283        189 [0, 0, 0]

@JonasIsensee
Copy link
Collaborator

This should work now:

julia> using H5Zzstd, HDF5

julia> h5open("test_with_compression.jld2") do h5f
           h5f["A"][]
       end

julia> using HDF5_jll

julia> run(`$(HDF5_jll.h5ls()) -va test_with_compression.jld2`);
Opened "test_with_compression.jld2" with sec2 driver.
A                        Dataset {1000/1000, 1000/1000}
    Location:  1:48
    Links:     1
    Chunks:    {1000, 1000} 8000000 bytes
    Storage:   8000000 logical bytes, 283 allocated bytes, 2826855.12% utilization
    Filter-0:  ZSTD-32015  {5}
    Type:      native double
    Address: 189
           Flags    Bytes     Address          Logical Offset
        ========== ======== ========== ==============================
        0x00000000      283        189 [0, 0, 0]

Thank you! I can reproduce locally.

Thanks for your work :)

@milankl
Copy link
Contributor Author

milankl commented Jun 24, 2024

@mkitti you want to include a compatibility constrain with the CodecZstd v0.8.3? Probably not hence you suggested the extension? Otherwise this seems to be ready to be merged?

@JonasIsensee
Copy link
Collaborator

JonasIsensee commented Jun 24, 2024

Can this even be done, given that CodecZstd isn't even a dependency or weak dep?

@milankl
Copy link
Contributor Author

milankl commented Jun 24, 2024

Can this even be done, given that CodecZstd isn't even a dependency or weal dep?

riiiiiight, forgot that. Any reason then not to merge this?

@mkitti
Copy link
Member

mkitti commented Jun 24, 2024

To be honest, I do not really understand how the compressors are being loaded here.

@mkitti
Copy link
Member

mkitti commented Jun 30, 2024

I think that we should declare CodecZlib, CodecBzip2, and now CodecZstd as a weak dependencies with compat ranges. Eventually, it would be good to consider how to take advantage of package extensions.

Over in JuliaIO/HDF5.jl#1160, I am working on converting all the HDF5 filter packages into package extensions.

@JonasIsensee
Copy link
Collaborator

To be honest, I do not really understand how the compressors are being loaded here.

Compressor loading works the way it was done in FileIO for loading backends a few years ago. I'm not sure this has changed.

The logic finds out which decompressor is needed for loading and checks if the library is loaded. If not, it checks if it is installed - in which case it then tries to load it.

Pkg extensions would not help here, since there are zero lines of code specific to the individual backends.

@mkitti
Copy link
Member

mkitti commented Jul 2, 2024

Pkg extensions would not help here, since there are zero lines of code specific to the individual backends.

That's not very clear to me. Right now the loading mechanism ends up being very dynamic. I think this could be studied further.

@JonasIsensee
Copy link
Collaborator

I think, this is a discussion for another day.
I'll merge this for now.

@JonasIsensee JonasIsensee merged commit 17c3612 into JuliaIO:master Jul 5, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Zstd compression
3 participants