
Introduce Collaborative Training Strategy! #12647

Closed
SeanNaren opened this issue Apr 6, 2022 · 6 comments · Fixed by #12842
Assignees
Labels
3rd party (Related to a 3rd-party) · distributed (Generic distributed-related topic) · feature (Is an improvement or enhancement) · strategy
Milestone

Comments

@SeanNaren
Contributor

SeanNaren commented Apr 6, 2022

🚀 Feature

Over the past few months, I've been working with a library called hivemind; the hivemind team has done amazing things, such as the Training Transformers Together project. The goal of hivemind is to train collaboratively across the internet, over different machines (like a swarm of machines), rather than relying on the specialized machine setups we traditionally see for distributed training.

I have a working (hacky) PoC here; however, I've iterated on it privately, most notably turning it into a Strategy! The CollaborativeStrategy allows machines to connect to each other by passing a list of peers. The Strategy makes the experience with hivemind far easier (handling small things like peers/the DHT, or making changes to the scaler/module where configurations require it) and reduces boilerplate in code. I've also successfully trained with the CollaborativeStrategy across spot instances, showing that training on unreliable GPUs is possible!

The main goal of this ticket is to come to an agreement on whether the CollaborativeStrategy should live in PyTorch Lightning, or as a separate integration (in its own repo).

Motivation

I believe that a new strategy within PyTorch Lightning will bring in users who would like to run distributed training on spot instances or unreliable GPU machines, and make them aware that it is possible!

Suggested API

The Strategy option makes the most sense, as we have to control some of the behaviour of the precision plugin as well as certain pieces of training. More importantly, the hivemind integration currently will not work with any other strategy; implementing it as a strategy ensures it remains exclusive.

import pytorch_lightning as pl
from pytorch_lightning.strategies import CollaborativeStrategy

trainer = pl.Trainer(
    strategy=CollaborativeStrategy(target_batch_size=8192)
)

When users run the code, they are given a message on how to get clients to join:

python train.py
# Other peers can connect via:
# INITIAL_PEERS=<PEERS> python ...
# or pass the peers to the strategy: 
# CollaborativeStrategy(initial_peers='<PEERS>')

Pros/Cons

Why we should add this to PyTorch Lightning

  • Easier access for users who want to use the CollaborativeStrategy (just have to install hivemind), hopefully drawing more users who are interested!
  • No need to maintain an entire separate repo, with its own CI/docs/maintainers
  • Relatively few lines (~400 lines), relying on HiveMind for the heavy lifting

Why it should exist elsewhere

  • Can exist independently of Lightning. Since internals are not touched, this is just a third-party integration, similar to Bagua/DeepSpeed etc. (naturally, it also means responsibilities are kept separate)
  • Will increase PyTorch Lightning CI time and potentially make things even more complicated (we have to install deepspeed/fairscale/bagua and now hivemind as well?!)

Alternatives

An alternative would be for the strategy to live in Hivemind. I haven't spoken to the engineers about this (they can pitch in below), but it could be viable. My primary concern is that the Hivemind repo is already pretty complicated from supporting this type of distributed training.

Additional Context

  • The Hivemind team have already been assisting in developing the strategy, and I'm sure they'll help us maintain it if needed!

Please leave comments and stuff below, thanks for reading!

cc @Borda @awaelchli @rohitgr7 @akihironitta @justusschock @justheuristic @mryab

@SeanNaren added the distributed, 3rd party, strategy, and needs triage labels Apr 6, 2022
@SeanNaren self-assigned this Apr 6, 2022
@Borda added this to the 1.7 milestone Apr 6, 2022
@tchaton
Contributor

tchaton commented Apr 6, 2022

Hey, @SeanNaren I am in favor of adding the Collaborative Strategy inside of PyTorch Lightning framework.

As stated, I believe this would make the Collaborative Strategy more discoverable and help boost the adoption of such new technology.

@rohitgr7 removed the needs triage label Apr 6, 2022
@SeanNaren mentioned this issue Apr 8, 2022
@awaelchli
Member

I'm also in favor of adding it. Bring it home, @SeanNaren!

Will increase PyTorch Lightning CI time and potentially make things even more complicated (we have to install deepspeed/fairscale/bagua and now hivemind as well?!)

By how much do you estimate? The majority of tests should be unit tests and not add any significant time. What kind of integration/benchmark test did you have in mind?

@SeanNaren
Contributor Author

I've learnt a lot about CI times since the days of DeepSpeed/Sharded :D. The majority should be unit tests, except for one that ensures the endpoint works (two peers successfully connect, and we log correctly). The nearly finished tests can be seen at #12673

@justusschock
Member

@SeanNaren just for my understanding: This does not only allow collaborative training but would also allow elastic training on a cluster, right?

Is there an option to know the initial peers beforehand? So that I can submit all the jobs together or do I have to wait for the first one to run?

Regardless of the answers to these questions, I also think that it should live within PL, not only for the reasons already mentioned by others; I expect quite some interest in this, and having it within PL would ease the compatibility guarantees.

@carmocca
Contributor

I would personally advocate for the approach followed with the Bagua integration, meaning that we only keep the PL components (the strategy) with their specific unit tests.

Pros:

  • Faster development on collaborative specific internals
  • Splits CI time
  • PL components stay up-to-date as they are in the repo
  • Well-defined responsibilities per repo (system-like design)

Cons:

  • Needs more effort in terms of publishing and keeping a separate repository
  • There isn't that much non-PL code at the moment

On that last point, will that change in the future? Would you expect that this could evolve separately? Or that the development freedom will be useful in the future?

@carmocca added the feature label Apr 12, 2022
@SeanNaren
Contributor Author

SeanNaren commented Apr 13, 2022

@SeanNaren just for my understanding: This does not only allow collaborative training but would also allow elastic training on a cluster, right?

Thanks @justusschock!

It does allow elastic training, and in fact we have a long training run going across spot instance machines to prove this! Once the strategy has been merged, the goal is to share the code showing how this works (but all of the techniques are already in the strategy/docs!)

Is there an option to know the initial peers beforehand? So that I can submit all the jobs together or do I have to wait for the first one to run?

The simplest approach is to spawn a hivemind.DHT in your own service/process and request the initial peers from it to pass into the strategy. This lets you create the initial peers up front and then pass them to all your jobs!
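To illustrate, a minimal sketch of that coordination step (this assumes hivemind's `DHT(start=True)` / `get_visible_maddrs()` API, shown commented out since it needs hivemind installed; the helper name and multiaddress below are hypothetical):

```python
# Coordinator sketch: spawn one DHT up front, collect its visible
# multiaddresses, and hand them to every job as `initial_peers`.
#
# import hivemind
# dht = hivemind.DHT(start=True)
# initial_peers = [str(maddr) for maddr in dht.get_visible_maddrs()]

def peers_to_cli_arg(initial_peers):
    """Join peer multiaddresses into one comma-separated string,
    e.g. to export as an INITIAL_PEERS environment variable."""
    return ",".join(initial_peers)

# Placeholder multiaddress; real ones come from the DHT above.
peers = ["/ip4/192.0.2.1/tcp/33333/p2p/<PEER_ID>"]
print(peers_to_cli_arg(peers))
```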

I would personally advocate for the approach followed with the Bagua integration.
Meaning that we only keep the PL components (the strategy) with their specific unit tests.

Thanks @carmocca. I think the only piece that isn't tied to Lightning is the DHTManager (the rest of the code is Lightning specific). Having an entirely separate repo for this seems like overkill for the integration, so this hybrid approach may not be worthwhile IMO.

On that last point, will that change in the future? Would you expect that this could evolve separately? Or that the development freedom will be useful in the future?

As mentioned offline, I think most of the code changes will be UX changes for PyTorch Lightning (exposing variables, auto-enabling configs to simplify the Lightning experience). Any heavy lifting of the internals will be carried out by the Hivemind team, independent of this Strategy!

@carmocca assigned rohitgr7 and SeanNaren and unassigned SeanNaren and rohitgr7 Apr 26, 2022
@SeanNaren linked a pull request Apr 26, 2022 that will close this issue