
Improve Dask serializers, registration #1948

Open
thewtex opened this issue Aug 6, 2020 · 6 comments
Labels
status:Backlog Postponed without a fixed deadline type:Enhancement Improvement of existing methods or implementation

Comments

@thewtex
Member

thewtex commented Aug 6, 2020

As suggested by @jakirkham in #1942, we can improve the registration of Dask serialization and deserialization functions, including making them available in distributed. This could be done with NDArrayITKBase, but we could also provide optimized serializers for common itk.DataObject types.

As @jakirkham noted:

If one communicates objects that Dask already knows how to work with (like NumPy arrays), then one will get optimized serialization for free. This may involve a little (or maybe no) code in ITK to make sure NumPy arrays are handled reasonably well and handed to Dask when functions return. The benefit here seems useful even outside the Dask context.

If one adds pickle protocol 5 support to objects (possibly extending to older Python versions with pickle5 or leveraging NumPy to do this indirectly ( numpy/numpy#12091 )), then one can leverage Dask's own support for pickle protocol 5 ( dask/distributed#3784 ) ( dask/distributed#3849 ), which should yield a performance boost similar to Dask's own custom serialization (as the underlying principles are the same). This would merely require updating how pickling works in ITK. It would also work with anything else that supports pickle protocol 5.

Alternatively one can use Dask's custom serialization by following this example on how to extend to custom types. The usual strategy here is to put these registration bits in their own module. This will soften the Distributed dependency and provide a single point for registering these functions. The latter is useful as one needs to import this module in distributed for it to work. This is doable with a bit of work in ITK and Distributed, but not too difficult.

Any of these options seems fine. Possibly it's worth doing multiple or even all of them (this is what we did in RAPIDS rapidsai/cudf#5139 ). Just trying to show what's possible and what it would entail to do 🙂

Originally posted by @jakirkham in #1942 (comment)
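The custom-serialization route mentioned in the quote above can be sketched roughly as follows. `ITKImageStub` is a hypothetical stand-in for an ITK image (a raw buffer plus shape metadata); a real implementation would register against the actual itk types:

```python
from distributed.protocol import (
    dask_serialize,
    dask_deserialize,
    serialize,
    deserialize,
)


class ITKImageStub:
    """Hypothetical stand-in for an ITK image: raw buffer plus shape."""

    def __init__(self, data, shape):
        self.data = bytes(data)
        self.shape = tuple(shape)


@dask_serialize.register(ITKImageStub)
def _serialize_itk_image(img):
    # Header: small, msgpack-friendly metadata; frames: large raw buffers
    # that Dask can move without copying or re-encoding.
    header = {"shape": list(img.shape)}
    frames = [img.data]
    return header, frames


@dask_deserialize.register(ITKImageStub)
def _deserialize_itk_image(header, frames):
    return ITKImageStub(frames[0], header["shape"])


# Round-trip through Dask's serialization machinery:
image = ITKImageStub(b"\x00\x01\x02\x03\x04\x05", (2, 3))
header, frames = serialize(image, serializers=("dask",))
restored = deserialize(header, frames)
assert restored.shape == (2, 3) and restored.data == image.data
```

Putting the two registered functions in a dedicated module gives the single registration point the quote describes: importing that module (e.g. via a Distributed preload) is all a worker needs.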

@thewtex thewtex added the type:Enhancement Improvement of existing methods or implementation label Aug 6, 2020
@jakirkham
Member

From that cuDF PR at the end, I know there's a lot going on there, but I think only 2 pieces are of particular note (the rest is cleanup thanks to those two pieces). The first is an ABC called Serializable, which implements various serialization modes and other classes inherit from. The second bit is registering Serializable methods with Dask. By doing this, we can implement Dask's custom serialization (we also implement CUDA serialization for sending GPU data). We can also get a free pickle implementation by just rearranging the same building blocks from Dask serialization.

@thewtex
Member Author

thewtex commented Aug 6, 2020

@jakirkham thanks for helping to break it down and explain! -- the role of the different pieces was not very transparent to me :-)

@thewtex thewtex changed the title Improve Dask serializer registration Improve Dask serializers, registration Aug 6, 2020
@jakirkham
Member

Looking at NDArrayITKBase, something like this might already be a good starting point for pickle protocol 5 since NumPy arrays already implement that.

def __reduce__(self):
    # `self` is the array being pickled; hand pickle a plain NumPy copy
    return NDArrayITKBase, (np.asarray(self),)

Though am not sure how to capture itk_base since that doesn't appear to be pickleable. Am guessing you have a better idea of how to do this than I do 😄

@stale

stale bot commented Dec 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

@jakirkham
Member

The pickling side of the story seems to be largely addressed through PR ( #2829 ) if I'm not mistaken. Dask serialization builds on that a bit, but just supporting pickle should be pretty good there (unless you see a need for additional performance gains).

The remaining question here would be handling GPU data without device-to-host/host-to-device copies (as those can be quite costly). Looking at how Dask serializes/deserializes CuPy arrays may be a good starting point.

@stale stale bot removed the status:Backlog Postponed without a fixed deadline label Nov 11, 2021
@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

@stale stale bot added the status:Backlog Postponed without a fixed deadline label Apr 16, 2022