NCCL Timeout #13562
Closed as not planned
Description
Describe the bug
I am getting an NCCL timeout while training the model. The code usually runs for about 40k epochs and then fails with the error below:
[rank2]:[E513 13:25:57.714781669 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
[rank1]:[E513 13:25:57.714777142 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
[rank0]:[E513 13:25:57.714786872 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532341, OpType=_ALLGATHER_BASE, NumelIn=513, NumelOut=2052, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
[rank2]:[E513 13:25:57.715039178 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank2]:[E513 13:25:57.715055330 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank2]:[E513 13:25:57.715061964 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E513 13:25:57.715068256 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E513 13:25:57.715087695 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 532341, last enqueued NCCL work: 532341, last completed NCCL work: 532340.
[rank1]:[E513 13:25:57.715092955 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank0]:[E513 13:25:57.715112458 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 532341, last enqueued NCCL work: 532341, last completed NCCL work: 532340.
[rank1]:[E513 13:25:57.715120352 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank1]:[E513 13:25:57.715130246 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E513 13:25:57.715136289 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E513 13:25:57.715137895 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E513 13:25:57.715143973 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E513 13:25:57.716347552 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6aa9ca0446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f6aaafa5672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6aaafacab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6aaafae51d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f6af7bd95c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f6afacee609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6afaab9353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E513 13:25:57.716872543 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532341, OpType=_ALLGATHER_BASE, NumelIn=513, NumelOut=2052, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f75e5bd2446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f75e6ed7672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f75e6edeab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f75e6ee051d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f7633b0b5c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f7636c20609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f76369eb353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E513 13:25:57.716878046 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8f35c3c446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f8f36f41672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8f36f48ab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8f36f4a51d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f8f83b755c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f8f86c8a609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f8f86a55353 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
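Note that in the log above rank 0 is blocked in _ALLGATHER_BASE (SeqNum=532341) while ranks 1 and 2 are blocked in BROADCAST (SeqNum=532339), i.e. the ranks have desynchronized. One way to gather more evidence on the next run is to enable the NCCL/c10d debug output before `torch.distributed` initializes. The environment-variable names below are real PyTorch/NCCL variables, but whether they pinpoint this particular hang is an assumption; this is a diagnostic sketch, not a fix:

```python
import os

# Set these before torch.distributed initializes (e.g. at the top of the
# launch script). Values are suggestions, not NeMo defaults.
os.environ["NCCL_DEBUG"] = "INFO"              # verbose per-rank NCCL logging
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"  # restrict logs to init + collectives
os.environ["TORCH_NCCL_DESYNC_DEBUG"] = "1"    # on timeout, report which ranks desynced
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"  # tear down promptly on NCCL errors
```

With `TORCH_NCCL_DESYNC_DEBUG=1`, the watchdog's timeout report should additionally identify the mismatched collective, which helps distinguish a data-loader straggler from a genuine collective-ordering bug.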
Steps/Code to reproduce bug
# NeMo/Lightning imports (logging, OmegaConf, pl, resolve_trainer_cfg,
# exp_manager, EncDecCTCModelBPE) omitted from the original snippet.
def main(cfg):
    logging.info(f'Hydra config: {OmegaConf.to_yaml(cfg)}')
    trainer = pl.Trainer(**resolve_trainer_cfg(cfg.trainer))
    exp_manager(trainer, cfg.get("exp_manager", None))
    asr_model = EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)

    # Initialize the weights of the model from another model, if provided via config
    asr_model.maybe_init_from_pretrained_checkpoint(cfg)

    trainer.fit(asr_model)

    if hasattr(cfg.model, 'test_ds') and cfg.model.test_ds.manifest_filepath is not None:
        if asr_model.prepare_test(trainer):
            trainer.test(asr_model)


if __name__ == '__main__':
    main()  # cfg is supplied by the Hydra entry-point decorator, omitted here
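The `Timeout(ms)=1800000` in the log is `torch.distributed`'s default 30-minute collective timeout. A hedged sketch of where that limit would be raised in this setup (the `DDPStrategy(timeout=...)` line is an assumption about the installed PyTorch Lightning version and is shown commented out; note that raising the timeout only buys time, it does not resolve the underlying desync):

```python
from datetime import timedelta

# The log's Timeout(ms)=1800000 corresponds to the 30-minute default:
default_timeout = timedelta(minutes=30)
assert int(default_timeout.total_seconds() * 1000) == 1_800_000

# Sketch (assumption: Lightning's DDPStrategy forwards `timeout` to
# torch.distributed.init_process_group; adjust to the installed version):
# from pytorch_lightning.strategies import DDPStrategy
# trainer = pl.Trainer(strategy=DDPStrategy(timeout=timedelta(hours=2)), ...)
```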
A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.
Expected behavior
Training should run to completion (including trainer.test) without any NCCL collective timing out.
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of NeMo install: poetry install
Environment details
- OS version: Ubuntu 20.04.6 LTS
- PyTorch version: 2.5.1+cu118
- Python version: 3.10