
Distributed PyTorch on Grid (via IPFS/PubSub) #166

Merged: 8 commits merged into master from torch_hooks on Mar 31, 2018

Conversation

@jvmncs (Contributor) commented Mar 31, 2018

This PR establishes a general framework for distributed tensor computation on Grid with PyTorch. It works by overloading all relevant functions and methods in the torch module, so that when a user calls a torch command, the command is dispatched across the Grid to the remote nodes that hold the actual tensors involved. When we send a tensor over the network, a 0-d tensor pointer remains on the client for future computation. When we execute a command on remote tensors that returns a new tensor, the worker node sends back registration attributes that are used to create a new 0-d pointer for the result. This means the original composability of the torch library is maintained, allowing lower-level operations to make their way into higher-level abstractions like Modules and Containers with minimal future effort.
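
Roughly, the hooking pattern looks like this (an illustrative sketch only -- `hook_torch_function`, `send_command`, and the `is_pointer`/`owner`/`remote_id` attributes are made-up names, not the actual grid/services/torch API):

```python
# Illustrative sketch of the overloading pattern described above; helper names
# here are assumptions, not the actual grid/services/torch implementation.
import torch


def hook_torch_function(name, original_fn, send_command):
    """Wrap a torch function so it dispatches to remote workers when needed."""

    def hooked(*args, **kwargs):
        pointers = [a for a in args if getattr(a, "is_pointer", False)]
        if not pointers:
            # No remote tensors involved, so just run the original torch call.
            return original_fn(*args, **kwargs)

        worker = pointers[0].owner
        # Ship the command (function name + remote tensor ids) to the worker
        # node over IPFS/PubSub and wait for its registration attributes.
        registration = send_command(worker, {
            "command": name,
            "tensor_ids": [p.remote_id for p in pointers],
        })

        # The concrete result stays on the worker; locally we only keep a
        # zero-element pointer tensor carrying the registration attributes.
        result = torch.zeros(0)
        result.is_pointer = True
        result.owner = worker
        result.remote_id = registration["id"]
        return result

    return hooked


# Example: overload torch.add so it transparently runs on the Grid.
# torch.add = hook_torch_function("add", torch.add, send_command)
```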

There are a few other major benefits to this approach -- mainly, it's the lowest level of abstraction we've been able to successfully distribute over IPFS. This has a host of benefits other than its potential to scale up to higher level abstractions, including more fine-grained control of computation, tighter security guarantees, and allowing for larger models by keeping more IPFS blocks under the 1 MiB limit.

Currently, this work is complete for all Tensor types, although it could use a good stress test or two. The implementation for Variable is incomplete -- the only remaining bits are the special methods send_, get_, and ser, plus making sure Variables are handled properly throughout transmission and remote computation. I'll be following up with another PR in the coming days to get autograd working.

Due to the nature of Grid at the moment, it's not entirely resilient. Future work should improve error reporting on the client side (#151), induce garbage collection on the worker nodes when a client signals they're done (or after a timeout), and figure out how to handle workers that drop out mid-computation (likely by notifying the client that a worker with one of their tensors has disconnected from IPFS/stopped listening to openmined channels). There's a whole range of other things we need to do as well (e.g. #134), but let's get this merged first. 🙂

Most of the critical files being created or modified live in grid/services, in particular the torch subdirectory there, although changes relating to compute mode have been made across the repository. A brief demo of computation on multiple nodes over IPFS can be found at notebooks/experimental/torch_integration/Grid_MultiNode_Demo.ipynb; client usage in that demo looks roughly like the sketch below (send_ and get_ are the in-place methods introduced in this PR, but the worker id and setup details shown are placeholders, not the exact notebook code).
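
```python
# Hypothetical client session, in the spirit of the multi-node demo notebook.
# send_/get_ come from this PR; the worker id below is a placeholder.
import torch

worker = "Qm..."  # IPFS peer id of a worker node listening on the om channels

x = torch.FloatTensor([1, 2, 3, 4])
y = torch.FloatTensor([4, 3, 2, 1])
x.send_(worker)   # x is now a zero-element pointer to the remote tensor
y.send_(worker)

z = x + y         # executed on the worker; z is a new zero-element pointer
z.get_()          # pulls the concrete result [5, 5, 5, 5] back to the client
```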

Happy Torching!

🎉

jvmancuso added 8 commits March 17, 2018 23:51
* First round of torch hooks integration (#152)
* Finished HookService, linked it with TorchService (#154)
* WIP for #130 and #132 (#155)
* Worker side command processing and execution (#156)
* Finished implementing IPFS into torch services (#161)
* multinode demo (#162)
* lots o' comments (#164)
* Reorganizing notebooks (#165)
@jvmncs added the torch label Mar 31, 2018
@jvmncs requested a review from iamtrask March 31, 2018 05:50
@iamtrask (Member) left a comment


This marks a new chapter in the OpenMined project - very, very excited to merge this!

@iamtrask merged commit 88404d0 into master Mar 31, 2018
@jvmncs deleted the torch_hooks branch March 31, 2018 20:46
Benardi pushed a commit that referenced this pull request May 12, 2020
* First round of torch hooks integration (#152)

* Finished HookService, linked it with TorchService (#154)

* feat: modify pubsub_peers to handle newer IPFS api. (#153)

* finished minimal transfer of overloading code

* found an untested bug

* WIP for #130 and #132 (#155)

* feat: modify pubsub_peers to handle newer IPFS api. (#153)

* finished minimal transfer of overloading code

* found an untested bug

* adjust comments

* this round of work sponsored by parallel jalebi

* in the middle of fixing #130 and #132

* resolved #132, #130 will take a bit more effort than I'd planned for

* completes #130, prepares #129 and #131; almost took care of #148 in the process

* Worker side command processing and execution (#156)

resolved #129

* Finished implementing IPFS into torch services (#161)

* laptop sync

* finished up ipfs integration, yet to test

* syncing with colab notebooks

* renamed channels.openmined to channels.om

* found a worker node error

* bug in Tensor.send_

* fixed two client side bugs

* keyerror in receive_obj message

* register tensors before sending

* well that was rough

* more bug fixes

* premerge

* fix utils import in hook_worker_service

* fix return_result for worker

* premerge

* premerge2

* BOOM

* multinode demo (#162)

* lots o' comments (#164)

* Reorganizing notebooks (#165)

* lots o' comments

* reorganize notebooks