Support for DiskArrays #615

@meggart

I want to start a discussion on whether implementing the DiskArrays interface would make sense for this package. Making HDF5Dataset a subtype of AbstractDiskArray would let the package delegate Base-conformant indexing rules to DiskArrays and would provide features like views, reductions, and lazy broadcasting.

However, it would probably break existing code that relies on the current indexing behavior. As an alternative, users can simply wrap an HDF5Dataset in a DiskArray themselves:

using HDF5, DiskArrays
import DiskArrays: eachchunk, haschunks, readblock!, writeblock!, GridChunks, Chunked, Unchunked

struct HDF5DiskArray{T,N,CS} <: AbstractDiskArray{T,N}
  ds::HDF5Dataset
  cs::CS
end
Base.size(x::HDF5DiskArray) = size(x.ds)
# cs is `nothing` when no chunk layout could be read, so report the array as unchunked
haschunks(x::HDF5DiskArray{<:Any,<:Any,Nothing}) = Unchunked()
haschunks(x::HDF5DiskArray) = Chunked()
eachchunk(x::HDF5DiskArray{<:Any,<:Any,<:GridChunks}) = x.cs
readblock!(x::HDF5DiskArray, aout, r::AbstractUnitRange...) = aout .= x.ds[r...]
writeblock!(x::HDF5DiskArray, v, r::AbstractUnitRange...) = x.ds[r...] = v
function HDF5DiskArray(ds::HDF5Dataset)
    cs = try
        GridChunks(ds, get_chunk(ds))
    catch
        nothing
    end
    HDF5DiskArray{eltype(ds),ndims(ds),typeof(cs)}(ds,cs)
end

Now you can wrap an HDF5Dataset into a DiskArray:

f = h5open("chunk_test.h5","w")

A = rand(100,100)
f["A", "chunk", (5,5)] = A
d = HDF5DiskArray(f["A"])

The following operations then run chunk by chunk, which is much more efficient than going through the generic AbstractArray interface:

using Statistics
#Reducing over datasets is done chunk by chunk
mean(d)

#Broadcasting respects chunks
d .= 1

#Reductions over dimensions are ok as well
mean(d, dims=2)

So maybe a better option would be to create HDF5Plus.jl, define this wrapper there, and let users decide which package to use. What do the maintainers of HDF5 think?
