Fix race condition in Fabric test #17002

Merged: 9 commits merged into master on Mar 8, 2023

Conversation

awaelchli (Member) commented Mar 8, 2023

What does this PR do?

Calling these two functions back to back creates a race condition:

torch.distributed.destroy_process_group()
torch.distributed.init_process_group()

If one process is slightly faster and enters the init call before another process has finished destroying its group, the test can hang.
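
A minimal sketch of one way to guard against this race, not necessarily the change made in this PR. The helper name `reinit_process_group`, its parameters, and the 3-second default are hypothetical and chosen for illustration. The destroy, init, and barrier steps are passed in as callables so the ordering can be exercised without a real distributed setup. The pause is only a heuristic: once the group is destroyed there is no collective left to synchronize on.

```python
import time
from typing import Callable


def reinit_process_group(
    destroy: Callable[[], None],
    init: Callable[[], None],
    barrier: Callable[[], None],
    settle_seconds: float = 3.0,  # assumed value; tune for the environment
) -> None:
    """Destroy and re-create a process group with guards against the race."""
    # Make sure every rank has finished its work on the old group
    # before any rank starts tearing it down.
    barrier()
    destroy()
    # No barrier is possible without a group, so pause briefly to give
    # slower ranks time to finish destruction before anyone re-initializes.
    time.sleep(settle_seconds)
    init()


# With PyTorch, assuming the rendezvous environment variables are already set,
# this could be called as:
#
#   import torch.distributed as dist
#   reinit_process_group(
#       destroy=dist.destroy_process_group,
#       init=lambda: dist.init_process_group(backend="gloo"),
#       barrier=dist.barrier,
#   )
```

A fixed sleep trades a few seconds of test time for robustness; it narrows the race window but does not provably close it.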

cc @carmocca @akihironitta @Borda

awaelchli added the ci (Continuous Integration) label Mar 8, 2023
awaelchli added this to the 2.0 milestone Mar 8, 2023
github-actions bot added the fabric (lightning.fabric.Fabric) label and removed the ci (Continuous Integration) label Mar 8, 2023
awaelchli added the bug (Something isn't working) and ci (Continuous Integration) labels and removed the fabric (lightning.fabric.Fabric) label Mar 8, 2023
awaelchli self-assigned this Mar 8, 2023
awaelchli marked this pull request as ready for review March 8, 2023 19:36
github-actions bot (Contributor) commented Mar 8, 2023

⚡ Required checks status: All passing 🟢

Groups summary

🟢 lightning_fabric: CPU workflow
| Check ID | Status |
| --- | --- |
| fabric-cpu (macOS-11, lightning, 3.8, 1.11) | success |
| fabric-cpu (macOS-11, lightning, 3.9, 1.12) | success |
| fabric-cpu (macOS-11, lightning, 3.10, 1.13) | success |
| fabric-cpu (macOS-11, lightning, 3.8, 1.11, oldest) | success |
| fabric-cpu (macOS-11, lightning, 3.9, 2.0, pre) | success |
| fabric-cpu (ubuntu-20.04, lightning, 3.9, 1.11) | success |
| fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.12) | success |
| fabric-cpu (ubuntu-20.04, lightning, 3.10, 1.13) | success |
| fabric-cpu (ubuntu-20.04, lightning, 3.8, 1.11, oldest) | success |
| fabric-cpu (ubuntu-20.04, lightning, 3.9, 2.0, pre) | success |
| fabric-cpu (windows-2022, lightning, 3.9, 1.11) | success |
| fabric-cpu (windows-2022, lightning, 3.10, 1.12) | success |
| fabric-cpu (windows-2022, lightning, 3.10, 1.13) | success |
| fabric-cpu (windows-2022, lightning, 3.8, 1.11, oldest) | success |
| fabric-cpu (windows-2022, lightning, 3.9, 2.0, pre) | success |
| fabric-cpu (macOS-11, fabric, 3.8, 1.13) | success |
| fabric-cpu (ubuntu-20.04, fabric, 3.8, 1.13) | success |
| fabric-cpu (windows-2022, fabric, 3.8, 1.13) | success |

These checks are required after the changes to tests/tests_fabric/parity/test_parity_ddp.py.

🟢 lightning_fabric: Azure GPU
| Check ID | Status |
| --- | --- |
| lightning-fabric (GPUs) | success |

These checks are required after the changes to tests/tests_fabric/parity/test_parity_ddp.py.


Thank you for your contribution! 💜

Note: This comment is automatically generated and is updated every 180 seconds for 60 minutes. If you have any other questions, contact carmocca for help.

mergify bot added the ready (PRs ready to be merged) label Mar 8, 2023
awaelchli merged commit aa7f252 into master Mar 8, 2023
awaelchli deleted the ci/fix-fabric-hang branch March 8, 2023 21:36
Labels: bug (Something isn't working), ci (Continuous Integration), ready (PRs ready to be merged)
3 participants