Sparse binning, Xarray data loading and thread synchronisation for writing to the Zarr cache #118
Merged
benjaminhwilliams merged 71 commits into main on Sep 13, 2024
Conversation
These support sparse binning and serialisable reads.
Display the progress of the full task graph, not just the top layer. Constraining ourselves to just the top layer should not be necessary if we can serialise the file-reading tasks appropriately, which leads to more fluid progress through the task graph. Also, add an option to gather and return the computed futures.
Replace h5py-based context manager with Xarray for reading raw Tristan data from HDF5 files. This removes the need to pin Dask Distributed to old versions that still permitted read-only h5py.File objects to be serialised. It also seems to improve the overall performance, reducing latency for multi-threaded read operations.
This will avoid circular imports in the next steps of the refactor.
'interleave_partitions' and 'ignore_order' overcomplicate the task graph and slow things down, especially the former.
Specify dtypes of events and image data in a single place in tristan.data. Also specify a key 'pixel_index' for identifying pixel location data that have been converted from the Tristan event_id form to the index of the pixel in the flattened image array.
When converting the pixel location datum from Tristan 'event_id' format to the index in the flattened image array, use the function tristan.data.pixel_index to overwrite the 'event_id' column in the dataframe of events, rather than simply calculating and returning the index value.
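The overwrite-in-place idea can be sketched with NumPy alone. The bit layout used here (a 13-bit split between the slow and fast pixel coordinates packed into event_id) is a hypothetical illustration, not the documented Tristan format, and pixel_index below is a stand-in for the real tristan.data.pixel_index:

```python
import numpy as np

def pixel_index(event_id, image_shape):
    # Hypothetical bit layout, for illustration only: slow (row)
    # coordinate in the high bits, fast (column) coordinate in the
    # low 13 bits of the packed event_id.
    slow = event_id >> 13
    fast = event_id & 0x1FFF
    # Row-major index into the flattened image array.
    return slow * image_shape[1] + fast
```

Because the result has the same shape and integer dtype as the input, it can overwrite the 'event_id' column of the events dataframe directly, rather than being carried around as an extra computed value.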
Exploit sparse arrays for the binned images and thread-locking for the Zarr array cache.
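The coordinate-selection idea behind the sparse binning can be sketched with NumPy alone; bin_events below is a hypothetical helper, and a plain NumPy array stands in for the real sparse.COO arrays and Zarr-backed cache:

```python
import numpy as np

def bin_events(pixel_indices, cache):
    # Sparse representation of the binned image: the occupied pixel
    # coordinates and the event count at each, analogous to the
    # coords/data pair of a sparse.COO array.
    coords, counts = np.unique(pixel_indices, return_counts=True)
    # Add into the cache by coordinate selection, touching only the
    # occupied pixels rather than the whole (possibly huge) array.
    cache[coords] += counts
    return cache
```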
Only initialise dask.distributed.Client at __enter__, rather than at initialisation of the ContextDecorator subclass WithDistributedLocalCluster. This prevents duplicate Clients from being created, one for every possible command-line program, with only one of them ever used.
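The pattern can be sketched as follows; WithLazyClient is a hypothetical stand-in for WithDistributedLocalCluster, with a plain object in place of the real dask.distributed.Client:

```python
from contextlib import ContextDecorator

class WithLazyClient(ContextDecorator):
    """Create the expensive resource in __enter__, not __init__, so
    merely constructing the object (e.g. when wiring up every
    command-line entry point) costs nothing."""

    def __init__(self):
        self.client = None  # No client yet: construction is cheap.

    def __enter__(self):
        # Stand-in for dask.distributed.Client(); created only on entry.
        self.client = object()
        return self

    def __exit__(self, *exc_info):
        self.client = None  # Stand-in for client.close().
        return False
```

Because ContextDecorator also lets the class be used as a function decorator, each decorated command still gets a client, but only when it actually runs.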
This is ugly as sin but it works. 🤷
Useful for debugging but a slight drain on resources in production.
This reverts commit e559a38.
Don't make in-place column changes, make copies instead.
benjaminhwilliams added a commit that referenced this pull request on Sep 24, 2024
Thread synchronisation for in-place addition to the Zarr cache was introduced in #118 but inadvertently left disabled. This change enables that functionality.
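The synchronisation being enabled here can be sketched in pure Python; LockedCache is a hypothetical stand-in, with a NumPy array in place of the real zarr.Array:

```python
import threading
import numpy as np

class LockedCache:
    """Minimal sketch of thread-synchronised in-place addition to a
    shared image cache."""

    def __init__(self, shape):
        self._array = np.zeros(shape, dtype=np.int64)
        # One lock serialises every read-modify-write on the cache.
        self._lock = threading.Lock()

    def iadd(self, coords, values):
        # '+=' is a read-modify-write: without the lock, two threads
        # can read the same old value and one update is lost.
        with self._lock:
            self._array[coords] += values

    def result(self):
        with self._lock:
            return self._array.copy()
```

If the lock is skipped (as in the disabled state this commit fixes), concurrent binning threads can silently drop counts from the accumulated images.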
This is a fairly major refactor of all the 'images' tools and underlying utilities. The result is improved performance and stability, especially for large input data.

The main changes are:
- h5py.File objects cannot be serialised for passing between the scheduler and workers; this is resolved by reading the raw data with Xarray instead.
- Images are binned using sparse.COO arrays, rather than NumPy bincount calls, with new values assigned to the Zarr cache by coordinate selection.
- The zarr.Array cache is wrapped with a locking mechanism on in-place writes, borrowed from Zarr's own locking mechanism for reading and writing.

These changes have permitted and necessitated quite a sizeable refactor, which includes the following:
- The 'images' sub-commands have each been split into a dedicated sub-module, to improve the readability of this bloated part of the code. This is already where some of the worst code lives, having often been fixed or modified in a hurry before or during an experiment. This refactor mitigates that messiness somewhat.
- The Zarr cache is now a zarr.TempStore, so that it is removed automatically at exit, rather than requiring explicit deletion.