Main process hanging on TPU Pod #16843
Labels
accelerator: tpu, bug, fabric, pl
Bug description
On a TPU pod with more than one host, the main process on non-master hosts hangs at https://github.com/Lightning-AI/lightning/blob/beced489040f76e7eee2f4a82d29823834b77327/src/lightning/pytorch/strategies/launchers/xla.py#L89

The root cause is that `process_id` is used as `local_rank`, but it is actually a global rank id: https://github.com/Lightning-AI/lightning/blob/beced489040f76e7eee2f4a82d29823834b77327/src/lightning/pytorch/strategies/launchers/xla.py#L106

Because of that, on non-master hosts, the spawned processes don't put anything in the `return_queue`: https://github.com/Lightning-AI/lightning/blob/beced489040f76e7eee2f4a82d29823834b77327/src/lightning/pytorch/strategies/launchers/xla.py#L112-L113

So the main process hangs when trying to get an item from the queue.

A potential fix would be to use `cluster_environment` to provide `local_rank` instead.

@JackCaoG @steventk-g @will-cromar
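The mismatch can be illustrated with a minimal sketch (names like `wrapped_fn` and `devices_per_host` are illustrative stand-ins, not the actual launcher code), assuming a pod with 2 hosts of 4 devices each, where the spawned function receives a global `process_id` of 0..7:

```python
from queue import Queue

def buggy_wrapped_fn(process_id: int, devices_per_host: int, return_queue: Queue) -> None:
    # Buggy check: treats the global process_id as a local rank, so only
    # global rank 0 (on the master host) ever puts results in the queue.
    if process_id == 0:
        return_queue.put("results")

def fixed_wrapped_fn(process_id: int, devices_per_host: int, return_queue: Queue) -> None:
    # Fixed check: derive a true local rank, so process 0 on *every* host
    # enqueues, and each host's main process can get() without blocking.
    local_rank = process_id % devices_per_host
    if local_rank == 0:
        return_queue.put("results")

# Simulate the four processes spawned on host 1 (global ranks 4..7):
buggy_q, fixed_q = Queue(), Queue()
for pid in range(4, 8):
    buggy_wrapped_fn(pid, 4, buggy_q)
    fixed_wrapped_fn(pid, 4, fixed_q)

print(buggy_q.empty())  # True  -> host 1's main process would hang on get()
print(fixed_q.empty())  # False -> host 1 receives the results
```

In the real launcher the local rank would come from the `cluster_environment` rather than the modulo arithmetic used here; the point is only that the enqueue condition must hold once per host, not once per pod.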
How to reproduce the bug
No response
Error messages and logs
Environment
More info
No response
cc @JackCaoG @steventk-g @Liyang90 @carmocca @justusschock @awaelchli