
Fastconformer-CTC crashing with Watchdog caught collective operation timeout #9563

Closed · duhtapioca opened this issue on Jun 28, 2024 · 6 comments
Labels: bug (Something isn't working), stale

@duhtapioca

Hi,

We're trying to fine-tune the stt_en_fastconformer_ctc_large model on around 20k hours of data on 2 H100s, with a batch size of 128, num_workers=8, and a tokenizer trained with a vocab_size of 1024. Training is taking very long (more than 30 hours per epoch), and after around 70-80% of the first epoch it crashes with the following error:

Epoch 0:  78%|█████████████████████████████████████████████████████████████▎                 | 64809/83556 [25:06:58<7:15:54,  1.40s/it, v_num=18]
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1363596, OpType=BROADCAST, NumelIn=5158880, NumelOut=5158880, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1316] [PG 0 Rank 1] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=2
[rank1]:[E ProcessGroupNCCL.cpp:1153] [PG 0 Rank 1] ProcessGroupNCCL preparing to dump debug info.
[rank1]:[F ProcessGroupNCCL.cpp:1169] [PG 0 Rank 1] [PG 0 Rank 1] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors. If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0). If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 2
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1363598, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800095 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 0] Timeout at NCCL work: 1363598, last enqueued NCCL work: 1363598, last completed NCCL work: 1363597.
[rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1363598, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800095 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f47fd81e897 in /anaconda/envs/nemo_hindi/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f47feaf7c62 in /anaconda/envs/nemo_hindi/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f47feafca80 in /anaconda/envs/nemo_hindi/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f47feafddcc in /anaconda/envs/nemo_hindi/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f484a5a6bf4 in /anaconda/envs/nemo_hindi/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f484ba9b609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f484b866353 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

How do we avoid this issue? Should we consider reducing the fine-tuning data size? If we save intermediate checkpoints, is there a way to also save the LR scheduler state so we can properly resume training after a crash? Any guidance on this would be of great help.
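
For context, the traceback itself points at two environment variables, and the Timeout(ms)=1800000 in the log is the default 30-minute DDP collective timeout. A rough, untested sketch of what adjusting these could look like (the values are arbitrary placeholders, and we haven't verified that a longer timeout actually avoids the hang rather than just delaying it):

```python
# Rough sketch, not a verified fix: raise the knobs named in the traceback and
# widen the 30-minute DDP collective timeout. All values below are placeholders.
import os
from datetime import timedelta

import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Environment variables quoted from the watchdog message; they must be set
# before the NCCL process group is created.
os.environ["TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC"] = "3600"  # arbitrary larger value
# os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"       # or disable the heartbeat monitor

# Timeout(ms)=1800000 in the log is the default collective timeout; Lightning's
# DDPStrategy forwards a longer one to init_process_group.
trainer = pl.Trainer(
    devices=2,
    accelerator="gpu",
    strategy=DDPStrategy(timeout=timedelta(hours=2)),
)
```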

Also, unrelated to the issue: we noticed we didn't get much of a speedup from using H100s instead of A100s, and bf16-mixed was sometimes slower than fp16 on the H100, whereas on the A100 bf16-mixed is almost always faster than fp16. Is this expected?

Thank you!

duhtapioca added the bug (Something isn't working) label on Jun 28, 2024
@zhang7346

I have the same issue. Have you solved it? How can we avoid it?

@titu1994
Collaborator

First, to triage whether it's the model or the data that's the problem, run with a subset of the data, maybe 50 hours or so. What is the max duration of the data? Reduce it to at most 40 seconds, preferably 30 seconds (a rough filter sketch is below). We have some tools to segment data automatically.
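
As a rough sketch of what such a duration cap could look like on a NeMo-style JSONL manifest (the field names audio_filepath/duration/text are assumed, not taken from this issue; the dedicated segmentation tools are the right way to actually split long recordings):

```python
# Minimal sketch: drop manifest entries longer than max_duration seconds.
# Assumes the usual NeMo manifest fields ("audio_filepath", "duration", "text").
import json

def filter_manifest(in_path: str, out_path: str, max_duration: float = 30.0) -> None:
    kept = dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            entry = json.loads(line)
            if entry.get("duration", 0.0) <= max_duration:
                fout.write(json.dumps(entry) + "\n")
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept} utterances, dropped {dropped} over {max_duration}s")

# filter_manifest("train_manifest.json", "train_manifest_30s.json")
```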

Next, NCCL timeouts are hard to debug because NeMo code mostly uses PyTorch; we don't do much at the NCCL level, so the timeout can have many different causes. See if fine-tuning the model on a single GPU with a small batch size works first, then try two GPUs.

LR and optimizer state are preserved in the checkpoint files saved by Lightning during training. If you use exp_manager, resuming a job is quite easy; see the docs for exp_manager and the tutorials showcasing training with it (just run the same script again with the same output dir, provided you have set the two resume flags in exp_manager).
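
A minimal sketch of how those resume flags can be wired up through exp_manager (the config keys are the ones from the NeMo docs; the experiment name, directory, and trainer settings below are placeholders, and the regular training scripts set all of this from their YAML config instead):

```python
# Minimal sketch, assuming NeMo's exp_manager; names and dirs are placeholders.
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.utils.exp_manager import exp_manager

trainer = pl.Trainer(devices=2, accelerator="gpu", max_epochs=5)

exp_cfg = OmegaConf.create({
    "exp_dir": "./nemo_experiments",        # keep the same output dir across runs
    "name": "fastconformer_ctc_finetune",
    "create_checkpoint_callback": True,     # Lightning checkpoints carry optimizer + LR scheduler state
    "resume_if_exists": True,               # pick up the latest checkpoint on restart
    "resume_ignore_no_checkpoint": True,    # first run has no checkpoint yet; don't error
})
exp_manager(trainer, exp_cfg)

# ...build the model as usual and call trainer.fit(model); rerunning the same
# script after a crash resumes from the last saved checkpoint.
```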

We don't have much information about hardware effects on specific operations in our team; we rely on PyTorch and PyTorch Lightning to provide a stable training engine.

@github-actions (bot)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Aug 11, 2024
@orena1
Contributor

orena1 commented Aug 11, 2024 via email

github-actions bot removed the stale label on Aug 12, 2024
@github-actions (bot)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Sep 11, 2024
@github-actions (bot)

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 18, 2024