
Improve Dask serializers, registration #1948

Open
thewtex opened this issue Aug 6, 2020 · 6 comments
Labels
status:Backlog Postponed without a fixed deadline type:Enhancement Improvement of existing methods or implementation

Comments

@thewtex
Member

thewtex commented Aug 6, 2020

As suggested by @jakirkham in #1942, we can improve the registration of Dask serialization and deserialization functions, including making them available in distributed. This could be done with NDArrayITKBase, but we could also provide optimized serializers for common itk.DataObject types.

As @jakirkham noted:

If one communicates objects that Dask already knows how to work with (like NumPy arrays), then one will get optimized serialization for free. This may involve a little (or maybe no) code in ITK to make sure NumPy arrays are handled reasonably well and handed to Dask when functions return. The benefit here seems useful even outside the Dask context.

If one adds pickle protocol 5 support to objects (possibly extending to older Python versions with pickle5 or leveraging NumPy to do this indirectly ( numpy/numpy#12091 )), then one can leverage Dask's own support for pickle protocol 5 ( dask/distributed#3784 ) ( dask/distributed#3849 ), which should yield a performance boost similar to Dask's own custom serialization (as the underlying principles are the same). This would merely require updating how pickling works in ITK. It would also work with anything else that supports pickle protocol 5.

Alternatively one can use Dask's custom serialization by following this example on how to extend to custom types. The usual strategy here is to put these registration bits in their own module. This will soften the Distributed dependency and provide a single point for registering these functions. The latter is useful as one needs to import this module in distributed for it to work. This is doable with a bit of work in ITK and Distributed, but not too difficult.

Any of these options seems fine. Possibly it's worth doing multiple or even all of them (this is what we did in RAPIDS rapidsai/cudf#5139 ). Just trying to show what's possible and what it would entail to do 🙂

Originally posted by @jakirkham in #1942 (comment)
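The custom-serialization route mentioned in the quote above can be sketched roughly as follows. `ITKImageStub` is a hypothetical stand-in for an ITK image (a raw buffer plus shape metadata); a real implementation would register against the actual itk types:

```python
from distributed.protocol import (
    dask_serialize,
    dask_deserialize,
    serialize,
    deserialize,
)


class ITKImageStub:
    """Hypothetical stand-in for an ITK image: raw buffer plus shape."""

    def __init__(self, data, shape):
        self.data = bytes(data)
        self.shape = tuple(shape)


@dask_serialize.register(ITKImageStub)
def _serialize_itk_image(img):
    # Header: small, msgpack-friendly metadata; frames: large raw buffers
    # that Dask can move without copying or re-encoding.
    header = {"shape": list(img.shape)}
    frames = [img.data]
    return header, frames


@dask_deserialize.register(ITKImageStub)
def _deserialize_itk_image(header, frames):
    return ITKImageStub(frames[0], header["shape"])


# Round-trip through Dask's serialization machinery:
image = ITKImageStub(b"\x00\x01\x02\x03\x04\x05", (2, 3))
header, frames = serialize(image, serializers=("dask",))
restored = deserialize(header, frames)
assert restored.shape == (2, 3) and restored.data == image.data
```

Putting the two registered functions in a dedicated module gives the single registration point the quote describes: importing that module (e.g. via a Distributed preload) is all a worker needs.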

@thewtex thewtex added the type:Enhancement Improvement of existing methods or implementation label Aug 6, 2020
@jakirkham
Member

From that cuDF PR at the end, I know there's a lot going on there, but I think only 2 pieces are of particular note (the rest is cleanup thanks to those two pieces). The first is an ABC called Serializable, which implements various serialization modes and other classes inherit from. The second bit is registering Serializable methods with Dask. By doing this, we can implement Dask's custom serialization (we also implement CUDA serialization for sending GPU data). We can also get a free pickle implementation by just rearranging the same building blocks from Dask serialization.

@thewtex
Member Author

thewtex commented Aug 6, 2020

@jakirkham thanks for helping to break it down and explain! -- the role of the different pieces was not very transparent to me :-)

@thewtex thewtex changed the title Improve Dask serializer registration Improve Dask serializers, registration Aug 6, 2020
@jakirkham
Member

Looking at NDArrayITKBase, something like this might already be a good starting point for pickle protocol 5 since NumPy arrays already implement that.

def __reduce__(self):
    # `self` is the array being pickled; hand pickle a plain NumPy copy
    return NDArrayITKBase, (np.asarray(self),)

Though am not sure how to capture itk_base since that doesn't appear to be pickleable. Am guessing you have a better idea of how to do this than I do 😄

@stale

stale bot commented Dec 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

@jakirkham
Member

The pickling side of the story seems to be largely addressed through PR ( #2829 ) if I'm not mistaken. Dask serialization builds on that a bit, but just supporting pickle should be pretty good there (unless you see a need for additional performance gains).

The remaining question here would be handling GPU data without device-to-host/host-to-device copies (as those can be quite costly). Looking at how Dask serializes/deserializes CuPy arrays may be a good starting point.

@stale stale bot removed the status:Backlog Postponed without a fixed deadline label Nov 11, 2021
@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

@stale stale bot added the status:Backlog Postponed without a fixed deadline label Apr 16, 2022