
Want to join this working group? #2

Open
HenrikBengtsson opened this issue Jun 7, 2024 · 7 comments

@HenrikBengtsson
Collaborator

Hi all,

let us know if you'd like to join this working group on 'Marshaling and Serialization in R'. To join, just add a comment below with a very brief introduction of yourself and what your interest in this topic is.

@HenrikBengtsson HenrikBengtsson pinned this issue Jun 7, 2024
@coolbutuseless

Thanks for the invitation to join. Happy to help in any way I can!

R's serialisation underpins some things I've written, e.g. {xxhashlite} and rlang::hash(). I'm also interested in some low-level aspects of R, e.g. {rbytecode}.

@simonpcouch

Hey y'all, more than happy to tag along. Simon from Chicago, IL, USA; I work on packages for predictive modeling at Posit.

We end up thinking about marshaling/serialization a good bit in the context of model deployment and training in parallel. Re: model deployment, we put together a package for marshaling model objects last year that standardizes interfaces to serialization methods from different modeling packages. The issue of serialization also comes up in our support for parallel processing, where various models may be fitted in several R processes but handed back to a parent process for analysis.

cc @juliasilge and @topepo. I won't be able to make it to useR! in person, but Max will be there for the in-person meeting.

@ltierney

Not sure how much time I'll have to participate, but I'll try to follow what goes on and chip in from time to time. I developed the current serialization framework a number of years ago. The main goals of that redesign were to support parallel computation and to allow separate loading of objects in a collection while maintaining the identity of mutable objects, mainly environments (i.e., lazy loading).
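The identity-preserving mechanism in this framework is exposed in base R through the `refhook` arguments of `serialize()` and `unserialize()`. A minimal sketch of how the hooks cooperate to keep two references pointing at the same environment (the `registry` variable and the `"env-1"` tag are purely illustrative, not part of any API):

```r
# A reference object we want to keep "live" across serialization, rather
# than copying its contents into the byte stream.
registry <- new.env()
env1 <- new.env()
env1$x <- 42
registry[["env-1"]] <- env1

# Outgoing hook: replace the known environment with a string tag.
# Returning NULL means "serialize this object the normal way".
out_hook <- function(obj) {
  if (identical(obj, env1)) "env-1" else NULL
}

# Incoming hook: resolve the tag back to the live object.
in_hook <- function(tag) registry[[tag]]

payload  <- serialize(list(a = env1, b = env1), connection = NULL,
                      refhook = out_hook)
restored <- unserialize(payload, refhook = in_hook)

# Both slots resolve to the *same* environment, so identity is preserved.
identical(restored$a, restored$b)
identical(restored$a, env1)
```

Both `identical()` calls return `TRUE`; without the hooks, each slot would come back as an independent copy of the environment.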

@shikokuchuo
Member

mirai provides an implementation of what is currently possible using the serialization framework @ltierney describes above. Specifically, it interfaces at the C level with the ‘refhook’ system for reference objects, supporting their use in parallel and distributed computing.

This feature was originally motivated by parallel computations involving torch tensors, as described in https://shikokuchuo.net/mirai/articles/torch.html, and following helpful discussions with @dfalbel.

Permitted usage was subsequently broadened to a much wider class of serialization functions, as described in https://shikokuchuo.net/mirai/articles/mirai.html#serialization-arrow-polars-and-beyond, which also benefited from input by @eitsupi.

Finally, it also allows hosting of ADBC database connections in parallel processes as described in https://shikokuchuo.net/mirai/articles/databases.html, where @krlmlr was instrumental in proposing and verifying this use case.
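For readers who haven't seen the linked vignettes, a hedged sketch of what this looks like from the user's side. This follows my reading of the mirai documentation for its serialization configuration; the exact argument names (`class`, `sfunc`, `ufunc`) and the `serial` argument to `daemons()` may differ across mirai versions, so treat this as an outline rather than a definitive API reference:

```r
# Sketch: registering custom serialization functions with mirai so that
# torch tensors (reference objects backed by external pointers) survive
# the trip to daemon processes. Requires the mirai and torch packages.
library(mirai)
library(torch)

cfg <- serial_config(
  class = "torch_tensor",          # class handled by the custom hooks
  sfunc = torch::torch_serialize,  # object -> raw vector
  ufunc = torch::torch_load        # raw vector -> object
)

daemons(1, serial = cfg)           # one daemon using this configuration

m <- mirai(x * 2L, x = torch_tensor(c(1, 2, 3)))
m[]                                # tensor computed in the daemon

daemons(0)                         # shut down
```

The key point is that the hooks plug into the C-level 'refhook' system, so the custom functions apply transparently to tensors nested anywhere inside the objects being sent.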

@wlandau

wlandau commented Jun 18, 2024

Thanks for the invite. I develop targets and crew, both of which rely on sending objects to concurrent R processes. targets lets you select or customize a "format", which is a storage type that covers serialization and marshaling. It works, but it is not implicit, and some users have struggled with the extra responsibility.
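To make the "format" idea concrete, here is a small pipeline fragment in the style of a targets `_targets.R` file. The `format` argument of `tar_target()` is documented; the `read_data()` and `fit_model()` helpers are hypothetical placeholders:

```r
# Sketch of per-target storage formats in {targets}. Each target's format
# controls how its return value is serialized to and from storage.
library(targets)

list(
  tar_target(data, read_data(), format = "qs"),       # qs serialization
  tar_target(model, fit_model(data), format = "rds")  # base R serialization
)
```

This is exactly the explicit choice mentioned above: the user must know which format suits each object, which is where some have struggled.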

@Jiefei-Wang

Thanks for the invite. I'm one of the developers of BiocParallel and SharedObject, which provide a parallelization framework for all Bioconductor packages. I have been thinking about serialization for a while. One interesting topic is how we can serialize/unserialize only once on each computer and make the object available to all workers on the same machine. The current approach ignores the fact that multiple workers may share a computer and sends the object to each worker separately. This is clearly an unnecessary waste of the resources we have in distributed computing. I do not know what the best solution is, but I'll be happy to hear any ideas.
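The redundancy described here is easy to demonstrate with base R's parallel package: `clusterExport()` serializes and ships the same object once per worker, even when all workers run on one machine. A minimal sketch:

```r
# Two PSOCK workers on the same machine; 'big' is serialized and sent
# over a socket twice, once per worker, despite the shared hardware.
library(parallel)

cl  <- makeCluster(2)
big <- rnorm(1e6)
clusterExport(cl, "big")   # one full copy transmitted to each worker

res <- parLapply(cl, 1:2, function(i) length(big))
stopCluster(cl)

res   # both workers received the full object
```

A per-machine cache (or a shared-memory mapping, as SharedObject provides) would let the deserialization cost be paid once per machine instead of once per worker.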

Great to meet you, @wlandau. I am using your package targets to manage my data extraction pipeline, and it is incredibly helpful. Frankly speaking, I am one of the users who struggled with the extra responsibility you mentioned. I like it, but I also hate it. I might open a thread in your repository to discuss automating the format selection :)

@t-kalinowski

t-kalinowski commented Jul 9, 2024

Hi all,

Tomasz here from mlverse at Posit. I'm more than happy to help out too. I am particularly interested in making sure S7, reticulate, TensorFlow/Jax/Keras, torch, and things like them (R external pointers, potentially complex environment requirements) work well with whatever the final solution is.
