I want to start a discussion about whether implementing the DiskArrays interface would make sense for this package. Making HDF5Dataset a subtype of AbstractDiskArray would have the benefit of outsourcing Base-conformant indexing rules, and it would provide nice features like views, reductions, and lazy broadcasting.
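For illustration, here is a sketch of what this would enable (untested, and assuming ds is an open HDF5 dataset that subtypes AbstractDiskArray):

# assuming ds subtypes AbstractDiskArray
v = view(ds, 1:10, :)   # lazy view, no data is read yet
b = ds .+ 1             # lazy broadcast object, nothing is computed yet
sum(b)                  # the reduction is then evaluated chunk by chunk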
However, it would probably break some old code that relies on the current behavior. As an alternative, users can simply wrap an HDF5Dataset in a DiskArray themselves:
using HDF5, DiskArrays
import DiskArrays: eachchunk, haschunks, readblock!, writeblock!, GridChunks, Chunked, Unchunked

# Wrapper that makes an HDF5Dataset behave as an AbstractDiskArray.
# CS is either a GridChunks (chunked dataset) or Nothing (contiguous dataset).
struct HDF5DiskArray{T,N,CS} <: AbstractDiskArray{T,N}
    ds::HDF5Dataset
    cs::CS
end
Base.size(x::HDF5DiskArray) = size(x.ds)

# Datasets without chunk information are Unchunked, all others are Chunked
haschunks(x::HDF5DiskArray{<:Any,<:Any,Nothing}) = Unchunked()
haschunks(x::HDF5DiskArray) = Chunked()
eachchunk(x::HDF5DiskArray{<:Any,<:Any,<:GridChunks}) = x.cs

# Read and write whole blocks of data at once
readblock!(x::HDF5DiskArray, aout, r::AbstractUnitRange...) = aout .= x.ds[r...]
writeblock!(x::HDF5DiskArray, v, r::AbstractUnitRange...) = x.ds[r...] = v

function HDF5DiskArray(ds::HDF5Dataset)
    # get_chunk throws for contiguous datasets, in which case we store nothing
    cs = try
        GridChunks(ds, get_chunk(ds))
    catch
        nothing
    end
    HDF5DiskArray{eltype(ds),ndims(ds),typeof(cs)}(ds, cs)
end

Now you can wrap an HDF5Dataset into a DiskArray:
f = h5open("chunk_test.h5","w")
A = rand(100,100)
f["A", "chunk", (5,5)] = A
d = HDF5DiskArray(f["A"])

and the following will operate chunk by chunk and be much more efficient than using the AbstractArray interface:
using Statistics
#Reducing over datasets is done chunk by chunk
mean(d)
#Broadcasting respects chunks
d .= 1
#Reductions over dimensions are ok as well
mean(d, dims=2)

So maybe a better option would be to create an HDF5Plus.jl package where we define this wrapper, and users can decide which package to use. What do the maintainers of HDF5.jl think?
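As a side note, the chunk grid returned by eachchunk can also be iterated directly to write custom chunk-wise algorithms. A minimal sketch, assuming each element of eachchunk is a tuple of index ranges:

for c in eachchunk(d)
    # c selects one chunk of the dataset, e.g. (1:5, 1:5) in the example above
    chunkdata = d[c...]
    # process chunkdata here without ever loading the full array
end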