Support uneven DDP inputs with pytorch model.join #3325
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
Interested in this issue! Hopefully some progress is made soon 👍
Interested in this also :)
Is there any progress on this issue? Happy to help in any way.
@rohan-varma that would be great!! Want to try and submit a draft PR? We can help from there.
@edenlightning Sounds good, I also pinged the Slack channel for feedback/discussion.
We'd also be very interested in this feature. Let us know if there's anything we can do to help!
The PR #5141 is ready for review, in case anyone wants to take a look.
I discussed this more with @rohan-varma - DDP join docs: https://pytorch.org/docs/stable/_modules/torch/nn/parallel/distributed.html#DistributedDataParallel.join
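For context, the core idea behind DDP's join can be sketched without any torch code: ranks that run out of data keep participating in the gradient all-reduce by contributing zeros, so the collective call counts stay matched across ranks. Below is a minimal pure-Python schematic of that counting scheme (the rank and batch numbers are made up for illustration; this models only the bookkeeping, not the real NCCL/Gloo collectives):

```python
# Schematic of DDP's join: exhausted ("joined") ranks shadow each
# all-reduce with zero contributions so collectives stay matched.

def joined_allreduce_mean(per_rank_grads):
    """Average one gradient across all ranks; joined ranks contribute
    0.0 but still take part, so no rank blocks on a missing collective."""
    return sum(per_rank_grads) / len(per_rank_grads)

def train_with_join(batches_per_rank):
    """batches_per_rank[r] holds the (possibly uneven) local gradients
    of rank r. Returns the averaged gradient seen at every step."""
    max_steps = max(len(b) for b in batches_per_rank)  # slowest rank
    averaged = []
    for step in range(max_steps):
        contribs = [
            b[step] if step < len(b) else 0.0  # joined rank -> shadow zero
            for b in batches_per_rank
        ]
        averaged.append(joined_allreduce_mean(contribs))
    return averaged

# Rank 0 has 3 batches, rank 1 only 2, so rank 1 "joins" for the last step.
print(train_with_join([[1.0, 2.0, 3.0], [5.0, 7.0]]))  # [3.0, 4.5, 1.5]
```

The key property is that every rank executes the same number of collectives regardless of how much data it had, which is exactly what the real `DistributedDataParallel.join` context guarantees for DDP's internal all-reduces.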
As the LightningModule is wrapped in another module which is then wrapped with DDP, the LightningModule's collective calls do not run under DDP's own forward/backward. As a result, any collective call made there (such as metric syncing) can hang once some ranks have joined.
agree, I also don't see how this can be supported at the moment. |
@ananthsub commented:
In PyTorch, the `with Join(...)` construct is used as a simple wrapper around the training steps. It should work in simple cases, even if there are more complex cases where it doesn't. So, why not simply add an option to the trainer that wraps the invocations of the training steps with it?
The join here is specific to PyTorch DDP. If it were implemented, it would have to live inside the DDP plugin/strategy. For simple cases it may work, but no collective calls are allowed except the ones made under the join context. If we did want to do it "correctly", we would probably have to set […]
To recap, the plan would be:
Some sources:
I assume there is, if collectives are being used. For example,
pytorch/pytorch#49180 is great! Hopefully this will clarify the drop_last argument, which has a slightly misleading/incomplete description :) We would indeed need the UnrepeatedDistributedSampler.
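To illustrate the sampler side: the standard `DistributedSampler` pads the dataset by repeating indices so every rank receives the same number of samples, whereas an "unrepeated" sampler hands out only the real indices, leaving ranks uneven (which is what makes join necessary). A pure-Python sketch of the two index-sharding schemes (a simplification of the real samplers, which also handle shuffling, seeds, and `drop_last`):

```python
import math

def padded_shard(num_samples, num_replicas, rank):
    """DistributedSampler-style sharding: wrap around and repeat early
    indices so every rank gets exactly the same number of samples."""
    per_rank = math.ceil(num_samples / num_replicas)
    total = per_rank * num_replicas
    indices = [i % num_samples for i in range(total)]  # wrap-around padding
    return indices[rank:total:num_replicas]            # interleaved shards

def unrepeated_shard(num_samples, num_replicas, rank):
    """UnrepeatedDistributedSampler-style sharding: no padding, so some
    ranks may get one index fewer -> uneven inputs across ranks."""
    return list(range(rank, num_samples, num_replicas))

# 5 samples across 2 ranks:
print(padded_shard(5, 2, 0), padded_shard(5, 2, 1))
# [0, 2, 4] [1, 3, 0]  -> index 0 is duplicated onto rank 1
print(unrepeated_shard(5, 2, 0), unrepeated_shard(5, 2, 1))
# [0, 2, 4] [1, 3]     -> rank 1 gets one fewer sample
```

The padded scheme is safe for training (duplicates only bias the epoch slightly) but wrong for validation/test metrics, where repeated samples distort the result; that is the case the unrepeated sampler plus join would cover.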
Hi, everyone, I'm gathering information on what is needed in order to support this properly.
Is that it? cc @awaelchli, @carmocca.
@otaj Almost. Additionally, all metrics from […]
Oh, those […]
If we can capture user calls with that, it might work similarly with […]
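The idea of capturing user calls could look roughly like this: a thin proxy intercepts collective-style method calls so a join-aware strategy could count, guard, or shadow them while some ranks have already joined. This is only an illustrative pattern, not LightningLite's actual API; `CollectiveGuard` and `FakeModule` are hypothetical names:

```python
class CollectiveGuard:
    """Hypothetical interceptor: wraps an object and records every call
    to a named set of collective methods, so a DDP-join-aware strategy
    could veto or shadow them while some ranks have joined."""

    COLLECTIVES = {"all_gather", "all_reduce", "broadcast"}

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.calls = []  # log of intercepted collective method names

    def __getattr__(self, name):
        # Only triggered for attributes not found on the proxy itself;
        # non-collective attributes pass straight through.
        attr = getattr(self._wrapped, name)
        if name in self.COLLECTIVES and callable(attr):
            def guarded(*args, **kwargs):
                self.calls.append(name)       # capture the user call
                return attr(*args, **kwargs)  # then forward it
            return guarded
        return attr

class FakeModule:
    """Stand-in for a LightningModule with a collective helper."""
    def all_gather(self, value):
        return [value]  # single-process stand-in for the real collective
    def training_step(self):
        return "loss"

proxy = CollectiveGuard(FakeModule())
proxy.training_step()        # ordinary calls pass straight through
proxy.all_gather(42)         # collective calls are recorded first
print(proxy.calls)           # ['all_gather']
```

With something like this in place, the strategy would know exactly which user-initiated collectives fire per step and could keep their counts matched across ranks, mirroring what join already does for DDP's internal all-reduces.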
Let's check the option with LightningLite first 🦦
Here is the corresponding issue, as suggested in planning: #14635
See more details: pytorch/pytorch#38174
cc @Borda @tchaton @rohitgr7 @akihironitta @awaelchli