Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ddp_fix #270

Merged
merged 7 commits into from Feb 14, 2023
Merged

ddp_fix #270

merged 7 commits into from Feb 14, 2023

Conversation

shatz01
Copy link
Collaborator

@shatz01 shatz01 commented Feb 5, 2023

No description provided.

@shatz01 shatz01 changed the title 1 node 2 gpus works ddp_fix Feb 6, 2023
@shatz01
Copy link
Collaborator Author

shatz01 commented Feb 6, 2023

Single node multi gpu works (tested with 2 and 4 gpu)

@shatz01
Copy link
Collaborator Author

shatz01 commented Feb 6, 2023

Confirmed it works multi-node with exactly the same performance as an equivalent multi-gpu single node config :) (e.g. 2 node 1 gpu == 1 node 2 gpu)

@mosheraboh
Copy link
Collaborator

@smartdanny is it ready?

@shatz01
Copy link
Collaborator Author

shatz01 commented Feb 12, 2023

Basically this commit can be considered ready.

I wanted to also make some of the examples like mnist and knight ddp friendly "plug and play" but I am having a lot of trouble with the batch sampler. It seems pytorch lightning has an extremely hacky way to allow a sampler to work in ddp (see Lightning-AI/pytorch-lightning#13640).

I can get it working, but it would probably take a bit of work, and im not sure if its worth it at the moment. The options are:

  1. Leave examples the way they are (User would have to know to remove batch sampler)
  2. Remove batch sampler from examples
  3. Make all the examples use lightning datamodule (which has built in support for additional samplers, albeit sketchy)

Let me know what you think is best @mosheraboh @SagiPolaczek

@shatz01 shatz01 marked this pull request as ready for review February 12, 2023 17:38
@mosheraboh
Copy link
Collaborator

@smartdanny , we want one example with our batch sampler that works - I guess that it means to use datamodule.

@shatz01
Copy link
Collaborator Author

shatz01 commented Feb 13, 2023

Last thing is just to make a Knight ddp ready with sampler my making an datamodule. Feel free to merge or I can also add it to this PR. @mosheraboh

Copy link
Collaborator

@mosheraboh mosheraboh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!
Go ahead and merge it.

@shatz01 shatz01 merged commit 2d47362 into BiomedSciAI:master Feb 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants