Memory is fully eaten and training quit with errors for 40k hours ASR training #8897
Comments
Is it GPU or CPU memory that is exhausted? And how many nodes are you using? What version of NeMo are you using? Without sufficient details it's not possible to debug. What I can say is that we train on nodes with 400 GB of RAM per node and A100 GPUs with 80 GB of GPU memory, on 90-400K hours of speech, without OOM in either CPU or GPU memory. If you can see CPU RAM constantly increasing during training, a pseudo-fix is to set exp_manager.max_time_per_run to a reasonable value, such as a day; the job then stops after a day and you can restart it, avoiding the memory leak. It's not a fix, but a temporary workaround. |
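For reference, a minimal sketch of that workaround, assuming NeMo's exp_manager with field names taken from ExpManagerConfig (verify them against your NeMo version):

```python
# Sketch: stop the run cleanly after ~1 day and resume from the latest
# checkpoint on the next submission, before the CPU-memory leak exhausts RAM.
from omegaconf import OmegaConf

exp_manager_cfg = OmegaConf.create({
    "max_time_per_run": "00:23:30:00",   # DD:HH:MM:SS, just under a day
    "resume_if_exists": True,            # pick up from the latest checkpoint
    "resume_ignore_no_checkpoint": True, # the very first run has none yet
})
```

With NeMo's example training scripts, the same settings can usually be passed as Hydra overrides on the command line, e.g. `exp_manager.max_time_per_run=00:23:30:00`.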
Is it GPU or CPU memory that is exhausted? And how many nodes are you using?
What version of NeMo are you using? git log
Actually, it's very easy to verify: just submit a training job with, say, LibriSpeech data, and you can observe that your CPU memory keeps increasing within an epoch. |
How many nodes have you used? If you use a lot of nodes, you might not trigger the bug; say you have used 8 nodes, then there might be no issues at all.
Regards,
Haihua
|
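One way to check that claim, as a hedged sketch assuming psutil is installed and a recent PyTorch Lightning: a callback that logs the training process's resident set size, so you can watch host RAM grow within a single epoch.

```python
# Hypothetical helper (not part of NeMo): print the trainer process's RSS
# every N batches. Note that DataLoader workers are separate processes, so
# for a fuller picture you may also want to sum memory over child processes.
import os

import psutil
import pytorch_lightning as pl


class RSSLogger(pl.Callback):
    def __init__(self, every_n_batches: int = 100):
        self.every_n_batches = every_n_batches
        self.proc = psutil.Process(os.getpid())

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.every_n_batches == 0:
            rss_gb = self.proc.memory_info().rss / 1e9
            print(f"[rank {trainer.global_rank}] batch {batch_idx}: RSS = {rss_gb:.2f} GB")


# Usage: pl.Trainer(..., callbacks=[RSSLogger(every_n_batches=200)])
```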
That NeMo version is 6 months old; can you use r1.23 and see if it persists? We do not see constantly increasing CPU memory per epoch, but that may be because we use multiple nodes (minimum 4 nodes). |
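A quick way to confirm which NeMo build is installed (a simple check; on an r1.23 install this should print a 1.23.x version string):

```python
# Print the installed NeMo version.
import nemo

print(nemo.__version__)
```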
Hi, is this issue resolved? I've been running into the same issue. (I can confirm that it happens on 1.23 as well) |
Hi there, just checking here and wondering whether this is resolved? I am facing the same issue. Thank you. |
Using multiple nodes to train can avoid the problem. |
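At the Trainer level, the multi-node setup the commenters describe looks roughly like the sketch below (PyTorch Lightning API; the actual launch mechanics depend on your cluster, e.g. SLURM or torchrun):

```python
# Hedged sketch of the multi-node workaround: commenters above report that
# the host-memory growth does not appear once training spans several nodes.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,       # GPUs per node
    num_nodes=4,     # a minimum of 4 nodes is reported above
    strategy="ddp",  # standard distributed data parallel
)
```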
Thanks @haihua. I'm indeed using 5 nodes with 5 GPUs each. Is that what you mean? |
Yes, that's it. |
I see, but the above issue persists even with multiple nodes, and I'd like to get it working. |
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
This issue was closed because it has been inactive for 7 days since being marked as stale. |
During training, memory usage keeps increasing as time goes on, until about 74% of training is done and no memory is left. Training then quits with the errors below.
We are using Conformer-CTC with 1.2 CPU memory and 8 dataloader workers.
We guess this might be related to PyTorch Lightning, with some memory being held until an epoch is done.
There are no errors on smaller datasets, but once the training data gets larger, the bug is triggered.
Please give us tips on how to cure the problem, thanks!
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800190 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800190 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800380 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800345 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800374 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800421 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800489 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800421 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800489 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800380 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800345 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=252347, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800374 milliseconds before timing out.
asr_scripts_78/egs/sg/run.sh: line 547: 181694 Aborted (core dumped) python3 $scripts/asr_trainer.py --conf $cfg --devices $devices --accumulate_grad_batches 8 --accelerator 'cuda' --save_top_k 5 --val_check_interval 200 --checkpoint_path ${ckpt_path} --model.optim.lr 0.25 --model.optim.sched.warmup_steps 10000 --model.train_ds.max_duration 18.2 --model.train_ds.num_workers 8 --model.optim.sched.name NoamAnnealing --model.train_ds.manifest_filepath ${train_data} --resume_from_checkpoint "${pretrained_mdl}" --model.tokenizer.dir ${tokenizer} --model.tokenizer.type 'bpe' --model.train_ds.batch_size 64 --model.validation_ds.batch_size 64 --model.validation_ds.manifest_filepath ${valid_data} --model.interctc.loss_weights "[]" --model.interctc.apply_at_layers "[]" --model.optim.sched.last_epoch 25000