
NCCL Timeout #13562

@aditya-sanas

Description


Describe the bug
I am getting an NCCL timeout while training the model. The run usually proceeds for about 40k epochs and then fails with the error below:

[rank2]:[E513 13:25:57.714781669 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
[rank1]:[E513 13:25:57.714777142 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
[rank0]:[E513 13:25:57.714786872 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532341, OpType=_ALLGATHER_BASE, NumelIn=513, NumelOut=2052, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
[rank2]:[E513 13:25:57.715039178 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank2]:[E513 13:25:57.715055330 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank2]:[E513 13:25:57.715061964 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E513 13:25:57.715068256 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E513 13:25:57.715087695 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 532341, last enqueued NCCL work: 532341, last completed NCCL work: 532340.
[rank1]:[E513 13:25:57.715092955 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank0]:[E513 13:25:57.715112458 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 532341, last enqueued NCCL work: 532341, last completed NCCL work: 532340.
[rank1]:[E513 13:25:57.715120352 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 532339, last enqueued NCCL work: 532340, last completed NCCL work: 532338.
[rank1]:[E513 13:25:57.715130246 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E513 13:25:57.715136289 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E513 13:25:57.715137895 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E513 13:25:57.715143973 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E513 13:25:57.716347552 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6aa9ca0446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f6aaafa5672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6aaafacab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f6aaafae51d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f6af7bd95c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f6afacee609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6afaab9353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E513 13:25:57.716872543 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532341, OpType=_ALLGATHER_BASE, NumelIn=513, NumelOut=2052, Timeout(ms)=1800000) ran for 1800036 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f75e5bd2446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f75e6ed7672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f75e6edeab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f75e6ee051d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f7633b0b5c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f7636c20609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f76369eb353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E513 13:25:57.716878046 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=532339, OpType=BROADCAST, NumelIn=5164000, NumelOut=5164000, Timeout(ms)=1800000) ran for 1800042 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f8f35c3c446 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7f8f36f41672 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f8f36f48ab3 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f8f36f4a51d in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7f8f83b755c0 in /home/ubuntu/.cache/pypoetry/virtualenvs/nemo-asr-finetuning-tBwAFCAJ-py3.10/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x8609 (0x7f8f86c8a609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f8f86a55353 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
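Reading the per-rank watchdog lines side by side shows the ranks are desynchronized rather than uniformly slow: ranks 1 and 2 time out on BROADCAST SeqNum=532339 with last completed work 532338, while rank 0 is already waiting on _ALLGATHER_BASE SeqNum=532341 with 532340 completed. A minimal sketch of that comparison (the helper and the dict layout are hypothetical, not part of NeMo or PyTorch; the numbers are taken from the log above):

```python
# Per-rank state extracted from the watchdog log:
# timed-out work, last enqueued NCCL work, last completed NCCL work.
ranks = {
    0: {"timed_out": 532341, "enqueued": 532341, "completed": 532340},
    1: {"timed_out": 532339, "enqueued": 532340, "completed": 532338},
    2: {"timed_out": 532339, "enqueued": 532340, "completed": 532338},
}

def laggards(state):
    """Return the ranks whose last completed collective trails the rest.

    A rank stuck behind its peers is the one the others are blocked on --
    the usual signature of a hang/desync rather than a slow collective.
    """
    newest = max(s["completed"] for s in state.values())
    return sorted(r for r, s in state.items() if s["completed"] < newest)

print(laggards(ranks))  # → [1, 2]: ranks 1 and 2 are two collectives behind rank 0
```

This kind of side-by-side read is often the quickest way to tell which rank stalled first when every rank prints a timeout.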

Steps/Code to reproduce bug

import lightning.pytorch as pl  # or `import pytorch_lightning as pl` on older NeMo
from omegaconf import OmegaConf

from nemo.collections.asr.models import EncDecCTCModelBPE
from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import exp_manager
from nemo.utils.trainer_utils import resolve_trainer_cfg


@hydra_runner(config_path="conf", config_name="config")  # placeholder paths
def main(cfg):
    logging.info(f'Hydra config: {OmegaConf.to_yaml(cfg)}')

    trainer = pl.Trainer(**resolve_trainer_cfg(cfg.trainer))
    exp_manager(trainer, cfg.get("exp_manager", None))
    asr_model = EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)

    # Initialize the weights of the model from another model, if provided via config
    asr_model.maybe_init_from_pretrained_checkpoint(cfg)

    trainer.fit(asr_model)

    if hasattr(cfg.model, 'test_ds') and cfg.model.test_ds.manifest_filepath is not None:
        if asr_model.prepare_test(trainer):
            trainer.test(asr_model)


if __name__ == '__main__':
    main()  # cfg is supplied by the hydra_runner decorator
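For context, the Timeout(ms)=1800000 in the log is torch.distributed's default 30-minute collective timeout. A common first step while debugging is to enable NCCL diagnostics before launch; this is a sketch under the assumption that the standard env-var names apply to this setup (the larger-timeout call is shown only in a comment, since where it goes depends on how the process group is created):

```python
import os
from datetime import timedelta

# The watchdog's Timeout(ms)=1800000 is the default 30-minute collective timeout.
assert timedelta(milliseconds=1_800_000) == timedelta(minutes=30)

# Enable verbose NCCL logging and synchronous collective waits in the
# environment of every rank, so the failing collective is reported directly
# instead of via a delayed watchdog abort:
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"

# A larger timeout can be passed where the process group is created, e.g.
# torch.distributed.init_process_group(backend="nccl",
#                                      timeout=timedelta(hours=2))
```

Raising the timeout only hides a genuine desync, so the logging step is worth doing first.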

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Expected behavior

Training runs to completion without any NCCL collective timing out.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of NeMo install: poetry install

Environment details


  • OS version: Ubuntu 20.04.6 LTS
  • PyTorch version: 2.5.1+cu118
  • Python version: 3.10


Metadata

Labels: ASR · bug (Something isn't working) · stale
