
Distributed PyTorch on Grid (via IPFS/PubSub) #166

Merged: 8 commits merged into master from torch_hooks on Mar 31, 2018

Conversation

@jvmncs (Contributor) commented Mar 31, 2018

This PR establishes a general framework for distributed tensor computation on Grid with PyTorch. It works by overloading all relevant functions and methods in the torch module, so that when a user calls a torch command, the command is dispatched across the Grid to the remote nodes that hold the actual tensors involved. When we send a tensor over the network, a 0-d tensor pointer remains on the client for future computation. When we execute a command on remote tensors that returns a new tensor, the worker node sends back registration attributes that are used to create a new 0-d pointer for the result. This means the original composability of the torch library is maintained, allowing lower-level operations to make their way into higher-level abstractions like Modules and Containers with minimal future effort.
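
Roughly, the hooking pattern looks like this (an illustrative sketch only -- `hook_torch_function`, `send_command`, and the `is_pointer`/`owner`/`remote_id` attributes are made-up names, not the actual grid/services/torch API):

```python
# Illustrative sketch of the overloading pattern described above; helper names
# here are assumptions, not the actual grid/services/torch implementation.
import torch


def hook_torch_function(name, original_fn, send_command):
    """Wrap a torch function so it dispatches to remote workers when needed."""

    def hooked(*args, **kwargs):
        pointers = [a for a in args if getattr(a, "is_pointer", False)]
        if not pointers:
            # No remote tensors involved, so just run the original torch call.
            return original_fn(*args, **kwargs)

        worker = pointers[0].owner
        # Ship the command (function name + remote tensor ids) to the worker
        # node over IPFS/PubSub and wait for its registration attributes.
        registration = send_command(worker, {
            "command": name,
            "tensor_ids": [p.remote_id for p in pointers],
        })

        # The concrete result stays on the worker; locally we only keep a
        # zero-element pointer tensor carrying the registration attributes.
        result = torch.zeros(0)
        result.is_pointer = True
        result.owner = worker
        result.remote_id = registration["id"]
        return result

    return hooked


# Example: overload torch.add so it transparently runs on the Grid.
# torch.add = hook_torch_function("add", torch.add, send_command)
```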

There are a few other major benefits to this approach -- mainly, it's the lowest level of abstraction we've been able to successfully distribute over IPFS. This has a host of benefits other than its potential to scale up to higher level abstractions, including more fine-grained control of computation, tighter security guarantees, and allowing for larger models by keeping more IPFS blocks under the 1 MiB limit.

Currently, this work is complete for all Tensor types, although it could use a good stress test or two. The implementation for Variable is incomplete -- the only remaining bits are the special methods send_, get_, and ser, plus making sure Variables are handled properly throughout transmission and remote computation. I'll be following up with another PR in the coming days to get autograd working.

Due to the nature of Grid at the moment, it's not entirely resilient. Future work should improve error reporting on the client side (#151), induce garbage collection on the worker nodes when a client signals they're done (or after a timeout), and figure out how to handle workers that drop out mid-computation (likely by notifying the client that a worker with one of their tensors has disconnected from IPFS/stopped listening to openmined channels). There's a whole range of other things we need to do as well (e.g. #134), but let's get this merged first. 🙂

Most of the critical files being created or modified live in grid/services, in particular the torch subdirectory there, although changes relating to compute mode have been made across the repository. A brief demo of computation on multiple nodes over IPFS can be found at notebooks/experimental/torch_integration/Grid_MultiNode_Demo.ipynb; client usage in that demo looks roughly like the sketch below (send_ and get_ are the in-place methods introduced in this PR, but the worker id and setup details shown are placeholders, not the exact notebook code).
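
```python
# Hypothetical client session, in the spirit of the multi-node demo notebook.
# send_/get_ come from this PR; the worker id below is a placeholder.
import torch

worker = "Qm..."  # IPFS peer id of a worker node listening on the om channels

x = torch.FloatTensor([1, 2, 3, 4])
y = torch.FloatTensor([4, 3, 2, 1])
x.send_(worker)   # x is now a zero-element pointer to the remote tensor
y.send_(worker)

z = x + y         # executed on the worker; z is a new zero-element pointer
z.get_()          # pulls the concrete result [5, 5, 5, 5] back to the client
```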

Happy Torching!

🎉

jvmancuso added 8 commits March 17, 2018 23:51
* First round of torch hooks integration (#152)
* Finished HookService, linked it with TorchService (#154)
* WIP for #130 and #132 (#155)
* Worker side command processing and execution (#156)
* Finished implementing IPFS into torch services (#161)
* multinode demo (#162)
* lots o' comments (#164)
* Reorganizing notebooks (#165)
@jvmncs added the torch label Mar 31, 2018
@jvmncs requested a review from iamtrask March 31, 2018 05:50
@iamtrask (Member) left a comment


This marks a new chapter in the OpenMined project - very, very excited to merge this!

@iamtrask merged commit 88404d0 into master Mar 31, 2018
@jvmncs deleted the torch_hooks branch March 31, 2018 20:46
Benardi pushed a commit that referenced this pull request May 12, 2020
* First round of torch hooks integration (#152)

* Finished HookService, linked it with TorchService (#154)

* feat: modify pubsub_peers to handle newer IPFS api. (#153)

* finished minimal transfer of overloading code

* found an untested bug

* WIP for #130 and #132 (#155)

* feat: modify pubsub_peers to handle newer IPFS api. (#153)

* finished minimal transfer of overloading code

* found an untested bug

* adjust comments

* this round of work sponsored by parallel jalebi

* in the middle of fixing #130 and #132

* resolved #132, #130 will take a bit more effort than I'd planned for

* completes #130, prepares #129 and #131; almost took care of #148 in the process

* Worker side command processing and execution (#156)

resolved #129

* Finished implementing IPFS into torch services (#161)

* laptop sync

* finished up ipfs integration, yet to test

* syncing with colab notebooks

* renamed channels.openmined to channels.om

* found a worker node error

* bug in Tensor.send_

* fixed two client side bugs

* keyerror in receive_obj message

* register tensors before sending

* well that was rough

* more bug fixes

* premerge

* fix utils import in hook_worker_service

* fix return_result for worker

* premerge

* premerge2

* BOOM

* multinode demo (#162)

* lots o' comments (#164)

* Reorganizing notebooks (#165)

* lots o' comments

* reorganize notebooks