
Main process hanging on TPU Pod #16843

Closed · Liyang90 opened this issue Feb 22, 2023 · 1 comment · Fixed by #16844
Labels: accelerator: tpu, bug, fabric, pl
Milestone: v1.9.x

Comments

Liyang90 (Contributor) commented Feb 22, 2023

Bug description

On a TPU pod with more than one host, the main process on non-master hosts hangs at https://github.com/Lightning-AI/lightning/blob/beced489040f76e7eee2f4a82d29823834b77327/src/lightning/pytorch/strategies/launchers/xla.py#L89

The root cause is that the process_id is used as the local_rank, but it is actually a global rank id: https://github.com/Lightning-AI/lightning/blob/beced489040f76e7eee2f4a82d29823834b77327/src/lightning/pytorch/strategies/launchers/xla.py#L106

Because of that, on non-master hosts none of the spawned processes puts anything in the return_queue: https://github.com/Lightning-AI/lightning/blob/beced489040f76e7eee2f4a82d29823834b77327/src/lightning/pytorch/strategies/launchers/xla.py#L112-L113
so the main process hangs when trying to get an item from the queue.

A potential fix would be to use the cluster_environment to provide the local_rank instead.
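To illustrate the shape of the problem and of that suggestion, here is a rough sketch (not the actual launcher code; names are simplified, and it assumes the strategy carries a ClusterEnvironment, whose interface does expose a local_rank() accessor):

```python
# Rough sketch only -- not the actual Lightning launcher code.
# It assumes the strategy holds a ClusterEnvironment with local_rank().

def _wrapping_function(process_idx, trainer, function, args, kwargs, return_queue):
    strategy = trainer.strategy

    # Buggy: on a multi-host pod, the index handed to the spawned function
    # is a *global* rank, so this is wrong on every host but the first.
    # strategy._local_rank = process_idx

    # Sketched fix: derive the per-host local rank from the cluster environment.
    strategy._local_rank = strategy.cluster_environment.local_rank()

    results = function(*args, **kwargs)

    # Each host's local rank 0 must feed that host's return_queue; gating on
    # the global rank leaves the queue empty on non-master hosts, so the main
    # process there blocks forever in return_queue.get().
    if strategy.local_rank == 0:
        return_queue.put(results)
```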

@JackCaoG @steventk-g @will-cromar

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @JackCaoG @steventk-g @Liyang90 @carmocca @justusschock @awaelchli

Liyang90 added the bug and needs triage labels on Feb 22, 2023
carmocca added the accelerator: tpu, fabric, and pl labels and removed the needs triage label on Feb 22, 2023
carmocca added this to the v1.9.x milestone on Feb 22, 2023
awaelchli (Member) commented:
That's a nice find, thanks! I didn't know it is actually the global rank; in torch.multiprocessing it would be the local rank.
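A minimal sketch of that contrast (the multi-host behaviour described in the comments is as reported in this issue, not verified here):

```python
import torch.multiprocessing as mp

def fn(index):
    # torch.multiprocessing.spawn always passes a per-host index in
    # 0..nprocs-1, i.e. a local rank.
    print(index)

if __name__ == "__main__":
    mp.spawn(fn, nprocs=4)

# With torch_xla's xmp.spawn on a multi-host TPU pod, the index handed to fn
# is (per this issue) the global ordinal instead -- e.g. 8..15 on the second
# host of a 4-host v3-32 slice -- so it cannot be used as a local rank.
```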
