
**gather_all_tensors_if_available** shares the same underlying storage for all GPUs #3253

Closed
ShomyLiu opened this issue Aug 29, 2020 · 5 comments · Fixed by #3319
Labels: bug (Something isn't working), help wanted (Open to be worked on)
Milestone: 0.9.x

Comments
@ShomyLiu
Contributor

🐛 Bug

Hi, one of the new features from #2528, gather_all_tensors_if_available, has a list-copy bug: the gathered tensors on all GPUs wrongly end up identical to a single GPU's tensor, because the list entries share the same underlying storage:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/metrics/converters.py#L304

gathered_result = world_size * [torch.zeros_like(result)]

should be changed to:

gathered_result = [torch.zeros_like(result) for _ in range(world_size)]
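For context, a minimal standalone sketch (not taken from the issue) showing why the list-multiplication form aliases one tensor while the comprehension allocates an independent buffer per rank:

```python
import torch

result = torch.tensor([1.0, 2.0])
world_size = 3

# Buggy: list multiplication copies the *reference*, so every entry
# points at the same tensor object and therefore the same storage.
buggy = world_size * [torch.zeros_like(result)]
buggy[0].copy_(result)                  # writing through one entry...
assert buggy[1] is buggy[0]             # ...every entry is the same object
assert torch.equal(buggy[2], result)    # ...so the write shows up everywhere

# Fixed: the comprehension calls torch.zeros_like once per rank,
# so each entry owns its own storage.
fixed = [torch.zeros_like(result) for _ in range(world_size)]
fixed[0].copy_(result)
assert not torch.equal(fixed[1], result)            # other entries untouched
assert torch.equal(fixed[1], torch.zeros_like(result))
```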
@ShomyLiu added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Aug 29, 2020
@edenlightning added this to the 0.9.x milestone on Sep 1, 2020
@edenlightning
Contributor

@justusschock @SkafteNicki

@justusschock added this to To do in Metrics package via automation on Sep 1, 2020
@SkafteNicki
Member

@ShomyLiu good catch, would you be up for sending a PR? Please note that the function is not used anywhere yet; it is there for future changes to the metrics package.

@ShomyLiu
Contributor Author

ShomyLiu commented Sep 1, 2020

@SkafteNicki It would be my pleasure to send a PR. I will finish it as soon as possible.

Yeah, it's a new function that wraps torch.distributed.all_gather. But I think it covers a very common use case: in DDP mode, we often need to gather the outputs across all GPUs.
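A simplified sketch of such a wrapper, with the fixed per-rank allocation; the name gather_all_tensors and the exact signature here are illustrative, not the actual Lightning implementation:

```python
import torch
import torch.distributed as dist

def gather_all_tensors(result: torch.Tensor, group=None):
    """Gather a tensor from every process in ``group`` (sketch).

    Each rank receives a list with ``world_size`` entries; entry ``i``
    holds the tensor contributed by rank ``i``.
    """
    if not (dist.is_available() and dist.is_initialized()):
        # Single-process / non-DDP run: nothing to gather.
        return [result]

    world_size = dist.get_world_size(group)
    # One independently allocated buffer per rank (the fix from this issue).
    gathered = [torch.zeros_like(result) for _ in range(world_size)]
    dist.all_gather(gathered, result, group=group)
    return gathered
```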

@SkafteNicki
Member

@ShomyLiu Yes, I agree that it is a common use case.
Please ping me when the PR is ready.

@ShomyLiu
Contributor Author

ShomyLiu commented Sep 2, 2020

@SkafteNicki Hi, I have just sent a PR for your review: #3319

ShomyLiu added a commit to ShomyLiu/pytorch-lightning that referenced this issue Sep 3, 2020
justusschock pushed a commit to ShomyLiu/pytorch-lightning that referenced this issue Sep 3, 2020
Metrics package automation moved this from To do to Done Sep 3, 2020
SkafteNicki pushed a commit that referenced this issue Sep 3, 2020
* Fix: gather_all_tensors cross GPUs in metrics

* add a test case for gather_all_tensors_ddp in #3253
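For illustration only, a sketch of what a DDP regression test for this bug could look like (this is not the test added in #3319); it runs two CPU processes with the gloo backend and checks that each gathered entry keeps its own rank's values rather than aliasing a single buffer:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each rank contributes a distinct tensor.
    tensor = torch.ones(2) * rank

    # Fixed pre-allocation: one independent buffer per rank.
    gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)

    # Entry i must hold rank i's tensor; with the old aliased buffers,
    # every entry would end up holding the same tensor.
    for i, t in enumerate(gathered):
        assert torch.equal(t, torch.ones(2) * i)

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(_worker, args=(2,), nprocs=2)
```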