Use a temporary array to read detector data with pulse selection #220
This aims to speed up reading only selected frames from XTDF detector data, using the LPD/AGIPD/DSSC `.get_array()` method with `pulses=`. This was prompted by @daviddoji's use case reading LPD data.

I believe this is working around a performance issue in HDF5, which appears to have been around for years. It looks like HDF5 doesn't realise that it can copy large blocks of data, and falls back to copying point by point. By using a temporary array, we can persuade it that the source and destination are sufficiently similar that it works more efficiently. Then we copy the selected data to the output array with `numpy.compress()`.

This involves some extra temporary memory use for the intermediate array. It should never be more than the size of a single file, and it should be possible for the operating system to do virtual memory tricks and only allocate memory for the frames we're actually reading, not the gaps between them.
The timings below are for reading LPD parallel gain data with different numbers of pulses selected (e.g. `.get_array('image.data', pulses=np.s_[:1])`). There are 100 pulses per train, but parallel gain mode records all 3 gain stages as separate frames, so we're actually reading 3n of 300 frames in each case. Times are per train; I used 10 or 25 trains to get a better average, and I ran each one a couple of times to ensure data was cached.

Without this change, you can see that reading even a single frame is slow, and the time scales linearly with the number of frames to read (except when we read all of them). With this change, reading a subset of frames per train is much faster.
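For reference, a minimal usage sketch along the lines of the benchmark above, assuming the EXtra-data components API (the proposal and run numbers are placeholders, not the ones used for the timings):

```python
import numpy as np
from extra_data import open_run
from extra_data.components import LPD1M

# Open a run and wrap the LPD detector modules as one virtual detector.
run = open_run(proposal=700000, run=1)
lpd = LPD1M(run)

# Read only the first pulse of each train. With parallel gain data this
# still corresponds to 3 frames (one per gain stage) out of 300 per train.
arr = lpd.get_array('image.data', pulses=np.s_[:1])
print(arr.dims, arr.shape)
```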