Sparse binning, Xarray data loading and thread synchronisation for writing to the Zarr cache #118

Merged
benjaminhwilliams merged 71 commits into main from sparse-binning
Sep 13, 2024
Conversation


@benjaminhwilliams benjaminhwilliams commented Sep 13, 2024

This is a fairly major refactor of all the images tools and their underlying utilities. The result is improved performance and stability, especially for large input data.

The main changes are:

  • Issues with using Dask Distributed to read HDF5 input data, which arise because h5py.File objects cannot be serialised for passing between the scheduler and workers, are resolved by using Xarray instead.
  • The binning of chunks of events to images is made far more lightweight by using sparse.COO arrays rather than NumPy bincount calls, and by assigning new values to the Zarr cache by coordinate selection.
  • The resulting contention, in which different threads may attempt to write to the same chunk of the cache, is resolved by a custom subclass of zarr.Array with a locking mechanism on in-place writes, borrowed from Zarr's own locking mechanism for reading and writing.
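The sparse-binning idea above can be sketched in pure Python. This is only a toy stand-in for the sparse.COO approach (the function names, shapes and cache layout are illustrative, not the project's real API): each chunk of events is reduced to the coordinates and counts of its non-zero pixels, and only those pixels are touched when accumulating into the cache.

```python
from collections import Counter

def bin_events_sparse(pixel_indices):
    """Bin a chunk of events into a COO-style sparse image:
    a list of flattened pixel coordinates and a matching list of counts.
    Illustrative stand-in for building a sparse.COO array."""
    counts = Counter(pixel_indices)      # pixel index -> number of events
    coords = sorted(counts)              # flattened indices that saw hits
    data = [counts[c] for c in coords]   # only non-zero pixels are stored
    return coords, data

def accumulate(cache, coords, data):
    """Scatter-add a sparse chunk into a dense cache by coordinate
    selection, touching only the non-zero pixels."""
    for c, d in zip(coords, data):
        cache[c] += d

# Usage: two chunks of events falling on a 2x2 (flattened) image.
cache = [0, 0, 0, 0]
for chunk in ([0, 3, 3], [3, 1]):
    coords, data = bin_events_sparse(chunk)
    accumulate(cache, coords, data)
# cache is now [1, 1, 0, 3]
```

The point of the sparse representation is that a chunk of events typically illuminates only a small fraction of the image, so storing and writing back only the hit pixels is far cheaper than materialising a full dense histogram per chunk.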

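The write-contention fix in the last bullet amounts to serialising in-place writes with a lock. A minimal sketch of the pattern, using a plain list and threading.Lock as stand-ins for the real zarr.Array subclass (class and method names here are illustrative, not the project's API):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class LockedCache:
    """Toy stand-in for a locking zarr.Array subclass: in-place writes
    are serialised with a lock, so concurrent chunks cannot clobber each
    other's read-modify-write cycles on the same region."""

    def __init__(self, size):
        self._data = [0] * size
        self._lock = threading.Lock()

    def add_at(self, index, value):
        # The lock makes the read-modify-write atomic across threads.
        with self._lock:
            self._data[index] += value

    def __getitem__(self, index):
        return self._data[index]

# Usage: many threads accumulate into the same "chunk" safely.
cache = LockedCache(4)
with ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(1000):
        pool.submit(cache.add_at, 2, 1)
# cache[2] == 1000 regardless of thread scheduling
```

In the real implementation, a per-chunk (rather than global) lock would keep writes to different chunks concurrent while still protecting each chunk's read-modify-write cycle.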
These changes have permitted and necessitated quite a sizeable refactor, which includes the following:

  • The logic for the individual images sub-commands has been split into dedicated sub-modules, to improve the readability of this bloated part of the code. This is already where some of the worst code lives, having often been fixed or modified in a hurry before or during an experiment. This refactor mitigates that messiness somewhat.
  • Version pins for Dask, etc., have been lifted.
  • The Zarr store used for caching is now a zarr.TempStore, so that it is removed automatically at exit, rather than requiring explicit deletion.
  • Python versions in the CI are now 3.10, 3.11 & 3.12, with tests against all three.

These support sparse binning and serialisable reads.
Display the progress of the full task graph, not just the top layer.
Constraining ourselves to just the top layer should not be necessary if
we can serialise the file reading tasks appropriately, which leads to
more fluid progress through the task graph.  Also, add option to gather
and return the computed futures.
Replace h5py-based context manager with Xarray for reading raw Tristan
data from HDF5 files.  This removes the need to pin Dask Distributed to
old versions that still permitted read-only h5py.File objects to be
serialised.  It also seems to improve the overall performance, reducing
latency for multi-threaded read operations.
This will avoid circular imports in the next steps of the refactor.
'interleave_partitions' and 'ignore_order' overcomplicate the task graph
and slow things down, especially the former.
Specify dtypes of events and image data in a single place in
tristan.data.  Also specify a key 'pixel_index' for identifying
pixel location data that have been converted from the Tristan event_id
form to the index of the pixel in the flattened image array.
When converting the pixel location datum from Tristan 'event_id' format
to the index in the flattened image array, use the function
tristan.data.pixel_index to overwrite the 'event_id' column in the
dataframe of events, rather than simply calculating and returning the
index value.
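The in-place conversion described in that commit might look roughly like this. Note the bit layout and image width below are assumptions made purely for illustration, and the dict-of-lists "dataframe" stands in for the real Dask dataframe; the real Tristan event_id format and tristan.data.pixel_index signature may differ.

```python
def pixel_index(event_id, x_bits=13, image_width=2069):
    """Hypothetical sketch: unpack a Tristan-style event_id into x and y
    pixel coordinates, then return the index of that pixel in the
    flattened image array.  The bit layout (x in the low 13 bits) and
    the width are illustrative assumptions, not the real format."""
    x = event_id & ((1 << x_bits) - 1)
    y = event_id >> x_bits
    return y * image_width + x

def convert_events(events):
    """Overwrite the 'event_id' column in place, mirroring the approach
    described above, rather than returning a separate index column."""
    events["event_id"] = [pixel_index(e) for e in events["event_id"]]
    return events

# Usage: an event at (x=5, y=1) packed into a single integer.
events = {"event_id": [(1 << 13) | 5], "event_time": [42]}
convert_events(events)
# events["event_id"] is now [1 * 2069 + 5] == [2074]
```

Overwriting the existing column, rather than adding a new one, keeps the per-partition memory footprint flat, which matters when the event list is large.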
Exploit sparse arrays for the binned images and thread-locking for the
Zarr array cache.
Only initialise dask.distributed.Client at __enter__, rather than at
initialisation of the ContextDecorator subclass
WithDistributedLocalCluster.  This prevents duplicate Clients being
created for every possible command line program and only one of them
used.
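The lazy-initialisation pattern in that last commit can be sketched with contextlib.ContextDecorator. The Client class below is a counting stand-in for dask.distributed.Client, and WithLocalCluster an illustrative stand-in for the real WithDistributedLocalCluster; only the deferral of construction to __enter__ is the point.

```python
from contextlib import ContextDecorator

class Client:
    """Stand-in for dask.distributed.Client that counts instantiations."""
    created = 0

    def __init__(self):
        Client.created += 1

    def close(self):
        pass

class WithLocalCluster(ContextDecorator):
    """Sketch of the pattern described above: defer creating the
    (expensive) Client until __enter__, so merely constructing the
    decorator for every command-line entry point costs nothing."""

    def __init__(self):
        self.client = None          # no Client yet

    def __enter__(self):
        self.client = Client()      # created only when actually entered
        return self

    def __exit__(self, *exc):
        self.client.close()
        self.client = None
        return False

# Decorating many commands creates no Clients...
decorators = [WithLocalCluster() for _ in range(5)]
# ...only running one of them does.
with decorators[0]:
    pass
# Client.created == 1
```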
benjaminhwilliams changed the title from "Sparse binning and Xarray data loading" to "Sparse binning, Xarray data loading and thread synchronisation for writing to the Zarr cache" on Sep 13, 2024
benjaminhwilliams merged commit adf2d72 into main on Sep 13, 2024
benjaminhwilliams deleted the sparse-binning branch on September 13, 2024 at 11:07
benjaminhwilliams added a commit that referenced this pull request Sep 24, 2024
Thread synchronisation for in-place addition to the Zarr cache was
introduced in #118 but inadvertently left disabled. This change enables
that functionality.