
Add multi-GPU support with accelerate #76

Closed
wants to merge 11 commits into from

Conversation


@GavEdwards GavEdwards commented Feb 4, 2022

Summary

Enable GPU support (+more) via the Accelerate library.

This is still a work in progress - there are still some bugs to be ironed out around multi-GPU and some models.

TODO:

  • add documentation for accelerate
  • multi-gpu tests
  • test with all models

  • Unit tests provided for these changes
  • Documentation and docstrings added for these changes using the sphinx style

Changes

  • Add Accelerate as a dependency
  • Adjust the pipeline code to use Accelerate (see the sketch after this list)
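
For illustration, adapting a PyTorch training loop to Accelerate usually looks something like the sketch below. This is not the exact pipeline change in this PR: model, optimizer, loader, and loss stand in for the existing chemicalx objects, and model.unpack follows the snippet reviewed further down.

from accelerate import Accelerator

def train(model, optimizer, loader, loss, epochs: int = 1):
    accelerator = Accelerator()
    # prepare() moves the model and data onto the right device(s) and wraps
    # them for single- or multi-GPU execution.
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            prediction = model(*model.unpack(batch))
            loss_value = loss(prediction, batch.labels)
            accelerator.backward(loss_value)  # instead of loss_value.backward()
            optimizer.step()
    return model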

@GavEdwards GavEdwards mentioned this pull request Feb 4, 2022
@benedekrozemberczki
Contributor

PackedGraph objects do not have a .to() method, only a custom .cuda() method. Pretty messed up.

https://github.com/DeepGraphLearning/torchdrug/blob/master/torchdrug/data/graph.py

@GavEdwards
Author

@cthoyt
Contributor

cthoyt commented Feb 5, 2022

So one main thing about this PR is that I don't think it's necessary to use Accelerate to add GPU support - we only have to keep track of devices more cleverly. I'm not against using Accelerate, since it has some nice add-ons for multi-GPU etc., but I wouldn't depend on it to solve the original problem.

Perhaps we can monkey patch a to() method into the packed graph class.
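
A minimal sketch of such a monkey patch, assuming only PackedGraph's existing cuda()/cpu() helpers (untested):

import torch
from torchdrug.data import PackedGraph

def _packed_graph_to(self, device):
    # PackedGraph has no native to(), so delegate to its cuda()/cpu() methods.
    device = torch.device(device)
    return self.cuda() if device.type == "cuda" else self.cpu()

PackedGraph.to = _packed_graph_to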

@GavEdwards
Author

So one main thing about this PR is that I don't think it's necessary to use Accelerate to add GPU support - we only have to keep track of devices more cleverly. I'm not against using Accelerate, since it has some nice add-ons for multi-GPU etc., but I wouldn't depend on it to solve the original problem.

Perhaps we can monkey patch a to() method into the packed graph class.

Hi @cthoyt 😄 My thinking was that this approach avoids us having to reinvent the wheel and quickly solves the current GPU need. The original issue (#65) doesn't describe the requirements to aim for.

Two questions:

  • In your opinion, which parts of the original problem does Accelerate not solve?
  • What requirements need to be met for this approach to be accepted?

I'm not attached to Accelerate as a solution either - just trying to understand the limits and needs.

@codecov-commenter

codecov-commenter commented Feb 7, 2022

Codecov Report

Merging #76 (0542ab9) into main (20f0e85) will decrease coverage by 0.65%.
The diff coverage is 54.16%.


@@            Coverage Diff             @@
##             main      #76      +/-   ##
==========================================
- Coverage   94.65%   94.00%   -0.66%     
==========================================
  Files          34       34              
  Lines        1478     1500      +22     
==========================================
+ Hits         1399     1410      +11     
- Misses         79       90      +11     
Impacted Files            Coverage Δ
chemicalx/pipeline.py     80.19% <54.16%> (-8.41%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20f0e85...0542ab9.

@cthoyt
Contributor

cthoyt commented Feb 8, 2022

I'm saying that implementing GPU usability and implementing Accelerate are two independent things, and I would rather see a PR that first enables GPU usage explicitly, without Accelerate, to make sure we don't make any Accelerate-specific mistakes.

@cthoyt
Contributor

cthoyt commented Feb 8, 2022

Additionally, I've opened a PR on torchdrug to solve the problem upstream, which will be much more elegant than us hacking it in: DeepGraphLearning/torchdrug#70. In the meantime, we could provide a compat module where we subclass PackedGraph with this function built in and refer to that class throughout chemicalx.
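
A compat module along those lines might look roughly like the following (hypothetical module name and code, superseded once the upstream torchdrug fix is released):

# chemicalx/compat.py (hypothetical): a PackedGraph subclass with to() built in,
# to be imported throughout chemicalx instead of torchdrug's class.
import torch
from torchdrug import data

class PackedGraph(data.PackedGraph):
    def to(self, device):
        # Note: the parent's cuda()/cpu() return plain torchdrug PackedGraph
        # instances, which is fine for moving batches around.
        device = torch.device(device)
        return self.cuda() if device.type == "cuda" else self.cpu()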

prediction = model(*model.unpack(batch))
loss_value = loss(prediction, batch.labels)

device_batch = to_device(model.unpack(batch), device)
Contributor

I think the batch generator should know what device to put the batches on, so this doesn't have to be changed in the pipeline.

Contributor
I.e., the BatchGenerator.__init__ should take an optional torch.device (if not given, assume CPU), and the generation steps should take care of moving the tensors over to the appropriate device - along the lines of the sketch below.
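
A rough sketch of that idea, with a hypothetical constructor signature (the real chemicalx BatchGenerator takes more arguments):

from typing import Optional
import torch

class BatchGenerator:
    def __init__(self, batch_size: int, device: Optional[torch.device] = None):
        self.batch_size = batch_size
        # Assume CPU when no device is given.
        self.device = torch.device("cpu") if device is None else torch.device(device)

    def _move(self, tensor: torch.Tensor) -> torch.Tensor:
        # Applied to every tensor produced while generating a batch, so the
        # pipeline never has to move data itself.
        return tensor.to(self.device)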

cthoyt added a commit to cthoyt/chemicalx that referenced this pull request Feb 10, 2022
Closes AstraZeneca#76. This PR first requires AstraZeneca#84 to be tested and merged.

## Blocked by

- [ ] AstraZeneca#84
@cthoyt cthoyt changed the title Add GPU support Add multi-GPU support with accelerate Feb 11, 2022
@cthoyt
Contributor

cthoyt commented Feb 12, 2022

@GavEdwards please note we've already merged a simple solution into the main branch and have now updated your PR with it - please check it out.

@@ -15,6 +15,9 @@
"pystow",
"pytdc",
"more-itertools",
"accelerate",
# FIXME what is packaging for?
"packaging",
Contributor
what's packaging for?

@@ -70,6 +71,45 @@ def save(self, directory: Union[str, Path]) -> None:
)


def to_device(objects, device):
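
For context, a helper with this signature would plausibly walk the unpacked batch members and move anything that supports it onto the target device; a sketch of that shape (not necessarily the body in this PR):

def to_device(objects, device):
    # Move every member that exposes a to() method; leave the rest untouched.
    return tuple(
        obj.to(device) if hasattr(obj, "to") else obj
        for obj in objects
    )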
Contributor

@cthoyt cthoyt Feb 12, 2022

I don't think this is necessary now that #84 and #86 are merged - please read through those PRs.

Author

Agreed, will close the PR

Contributor

@GavEdwards I was only referring to the to_device function. There's still some benefit to considering accelerate for multi-GPU or TPU usage.

@GavEdwards GavEdwards closed this Mar 21, 2022