Sparse binning, Xarray data loading and thread synchronisation for writing to the Zarr cache #118

Merged
benjaminhwilliams merged 71 commits into main from sparse-binning
Sep 13, 2024
Conversation


@benjaminhwilliams benjaminhwilliams commented Sep 13, 2024

This is a fairly major refactor of all the images tools and their underlying utilities. The result is improved performance and stability, especially for large input data.

The main changes are:

  • Issues with using Dask Distributed to read HDF5 input data, which arise because h5py.File objects cannot be serialised for passing between the scheduler and workers, are resolved by using Xarray instead.
  • The binning of chunks of events to images is made far more lightweight by using sparse.COO arrays rather than NumPy bincount calls, and by assigning new values to the Zarr cache by coordinate selection.
  • The resulting contention, in which different threads may attempt to write to the same chunk of the cache, is resolved by a custom subclass of zarr.Array with a locking mechanism on in-place writes, borrowed from Zarr's own locking mechanism for reading and writing.
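The sparse-binning idea above can be sketched in pure Python. This is only a toy stand-in for the sparse.COO approach (the function names, shapes and cache layout are illustrative, not the project's real API): each chunk of events is reduced to the coordinates and counts of its non-zero pixels, and only those pixels are touched when accumulating into the cache.

```python
from collections import Counter

def bin_events_sparse(pixel_indices):
    """Bin a chunk of events into a COO-style sparse image:
    a list of flattened pixel coordinates and a matching list of counts.
    Illustrative stand-in for building a sparse.COO array."""
    counts = Counter(pixel_indices)      # pixel index -> number of events
    coords = sorted(counts)              # flattened indices that saw hits
    data = [counts[c] for c in coords]   # only non-zero pixels are stored
    return coords, data

def accumulate(cache, coords, data):
    """Scatter-add a sparse chunk into a dense cache by coordinate
    selection, touching only the non-zero pixels."""
    for c, d in zip(coords, data):
        cache[c] += d

# Usage: two chunks of events falling on a 2x2 (flattened) image.
cache = [0, 0, 0, 0]
for chunk in ([0, 3, 3], [3, 1]):
    coords, data = bin_events_sparse(chunk)
    accumulate(cache, coords, data)
# cache is now [1, 1, 0, 3]
```

The point of the sparse representation is that a chunk of events typically illuminates only a small fraction of the image, so storing and writing back only the hit pixels is far cheaper than materialising a full dense histogram per chunk.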

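The write-contention fix in the last bullet amounts to serialising in-place writes with a lock. A minimal sketch of the pattern, using a plain list and threading.Lock as stand-ins for the real zarr.Array subclass (class and method names here are illustrative, not the project's API):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class LockedCache:
    """Toy stand-in for a locking zarr.Array subclass: in-place writes
    are serialised with a lock, so concurrent chunks cannot clobber each
    other's read-modify-write cycles on the same region."""

    def __init__(self, size):
        self._data = [0] * size
        self._lock = threading.Lock()

    def add_at(self, index, value):
        # The lock makes the read-modify-write atomic across threads.
        with self._lock:
            self._data[index] += value

    def __getitem__(self, index):
        return self._data[index]

# Usage: many threads accumulate into the same "chunk" safely.
cache = LockedCache(4)
with ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(1000):
        pool.submit(cache.add_at, 2, 1)
# cache[2] == 1000 regardless of thread scheduling
```

In the real implementation, a per-chunk (rather than global) lock would keep writes to different chunks concurrent while still protecting each chunk's read-modify-write cycle.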
These changes have permitted and necessitated quite a sizeable refactor, which includes the following:

  • The logic for the individual images sub-commands has been split into dedicated sub-modules, to improve the readability of this bloated part of the code. This is already where some of the worst code lives, having often been fixed or modified in a hurry before or during an experiment. This refactor mitigates that messiness somewhat.
  • Version pins for Dask, etc., have been lifted.
  • The Zarr store used for caching is now a zarr.TempStore, so that it is removed automatically at exit, rather than requiring explicit deletion.
  • Python versions in the CI are now 3.10, 3.11 & 3.12, with tests against all three.

These support sparse binning and serialisable reads.
Display the progress of the full task graph, not just the top layer.
Constraining ourselves to just the top layer should not be necessary if
we can serialise the file reading tasks appropriately, which leads to
more fluid progress through the task graph.  Also, add option to gather
and return the computed futures.
Replace h5py-based context manager with Xarray for reading raw Tristan
data from HDF5 files.  This removes the need to pin Dask Distributed to
old versions that still permitted read-only h5py.File objects to be
serialised.  It also seems to improve the overall performance, reducing
latency for multi-threaded read operations.
This will avoid circular imports in the next steps of the refactor.
'interleave_partitions' and 'ignore_order' overcomplicate the task graph
and slow things down, especially the former.
Specify dtypes of events and image data in a single place in
tristan.data.  Also specify a key 'pixel_index' for identifying
pixel location data that have been converted from the Tristan event_id
form to the index of the pixel in the flattened image array.
When converting the pixel location datum from Tristan 'event_id' format
to the index in the flattened image array, use the function
tristan.data.pixel_index to overwrite the 'event_id' column in the
dataframe of events, rather than simply calculating and returning the
index value.
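The in-place conversion described in that commit might look roughly like this. Note the bit layout and image width below are assumptions made purely for illustration, and the dict-of-lists "dataframe" stands in for the real Dask dataframe; the real Tristan event_id format and tristan.data.pixel_index signature may differ.

```python
def pixel_index(event_id, x_bits=13, image_width=2069):
    """Hypothetical sketch: unpack a Tristan-style event_id into x and y
    pixel coordinates, then return the index of that pixel in the
    flattened image array.  The bit layout (x in the low 13 bits) and
    the width are illustrative assumptions, not the real format."""
    x = event_id & ((1 << x_bits) - 1)
    y = event_id >> x_bits
    return y * image_width + x

def convert_events(events):
    """Overwrite the 'event_id' column in place, mirroring the approach
    described above, rather than returning a separate index column."""
    events["event_id"] = [pixel_index(e) for e in events["event_id"]]
    return events

# Usage: an event at (x=5, y=1) packed into a single integer.
events = {"event_id": [(1 << 13) | 5], "event_time": [42]}
convert_events(events)
# events["event_id"] is now [1 * 2069 + 5] == [2074]
```

Overwriting the existing column, rather than adding a new one, keeps the per-partition memory footprint flat, which matters when the event list is large.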
Exploit sparse arrays for the binned images and thread-locking for the
Zarr array cache.
Only initialise dask.distributed.Client at __enter__, rather than at
initialisation of the ContextDecorator subclass
WithDistributedLocalCluster.  This prevents duplicate Clients being
created for every possible command line program and only one of them
used.
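The lazy-initialisation pattern in that last commit can be sketched with contextlib.ContextDecorator. The Client class below is a counting stand-in for dask.distributed.Client, and WithLocalCluster an illustrative stand-in for the real WithDistributedLocalCluster; only the deferral of construction to __enter__ is the point.

```python
from contextlib import ContextDecorator

class Client:
    """Stand-in for dask.distributed.Client that counts instantiations."""
    created = 0

    def __init__(self):
        Client.created += 1

    def close(self):
        pass

class WithLocalCluster(ContextDecorator):
    """Sketch of the pattern described above: defer creating the
    (expensive) Client until __enter__, so merely constructing the
    decorator for every command-line entry point costs nothing."""

    def __init__(self):
        self.client = None          # no Client yet

    def __enter__(self):
        self.client = Client()      # created only when actually entered
        return self

    def __exit__(self, *exc):
        self.client.close()
        self.client = None
        return False

# Decorating many commands creates no Clients...
decorators = [WithLocalCluster() for _ in range(5)]
# ...only running one of them does.
with decorators[0]:
    pass
# Client.created == 1
```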
benjaminhwilliams changed the title from "Sparse binning and Xarray data loading" to "Sparse binning, Xarray data loading and thread synchronisation for writing to the Zarr cache" on Sep 13, 2024
benjaminhwilliams merged commit adf2d72 into main on Sep 13, 2024
benjaminhwilliams deleted the sparse-binning branch on September 13, 2024 at 11:07
benjaminhwilliams added a commit that referenced this pull request Sep 24, 2024
Thread synchronisation for in-place addition to the Zarr cache was
introduced in #118 but inadvertently left disabled. This change enables
that functionality.