Re-enable cuda graphs in training modes. #9338
Conversation
Force-pushed from 9a79543 to cec4f53
The only change is the capture error mode?
@titu1994 yes, that is the only change, and I am very confident in it. I can elaborate if you want.
Well, I also had to undo Vladimir's previous commit that turned cuda graphs off by default, except in transcribe_speech.py and transcribe_speech_parallel.py. This commit's changes: bb26e9846f
@galv, very cool, thank you!
Please, add these changes also to nemo/collections/asr/parts/submodules/tdt_loop_labels_computer.py
@artbataev good point. I completely missed TDT. Done. I'm not sure how to test that one, but I suspect that the change is low risk anyway.
Force-pushed from deb8d28 to 9c52705
"global" capture mode was sporadically crashing because of pinning host memory in other threads spawned by the data loader when num_workers > 0. Add relevant changs to TDT cuda graphs decoding as well. I didn't test the TDT change because I'm not sure how. But it seems low risk. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Force-pushed from 9c52705 to 45a2981
Signed-off-by: galv <galv@users.noreply.github.com>
* Re-enable cuda graphs in training modes. "global" capture mode was sporadically crashing because of pinning host memory in other threads spawned by the data loader when num_workers > 0. Add relevant changes to TDT cuda graphs decoding as well. I didn't test the TDT change because I'm not sure how. But it seems low risk. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
* Apply isort and black reformatting Signed-off-by: galv <galv@users.noreply.github.com>
---------
Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
Signed-off-by: galv <galv@users.noreply.github.com>
"global" capture mode was sporadically crashing because of pinning host memory in other threads spawned by the data loader when num_workers > 0.
This would cause the ASR_dev_run_Speech_To_Text_HF_Finetuning CI/CD test to fail sporadically (maybe 1 out of 5 times).
What does this PR do ?
Fixes the crash by using "thread_local" stream capture instead of "global" stream capture.
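For context, here is a minimal, hypothetical sketch (not the NeMo decoder code) of why "global" capture mode conflicts with pinned-memory allocations from other threads and why "thread_local" avoids it. It only assumes PyTorch's `torch.cuda.graph` context manager and its `capture_error_mode` argument.

```python
import threading

import torch

# Hypothetical repro of the failure mode described above (not NeMo code).
# Under capture_error_mode="global", CUDA treats potentially unsafe calls made by
# *any* thread in the process (for example, a DataLoader pin_memory thread
# allocating pinned host memory) as an error while the capture is active.
# Under "thread_local", only calls made by the capturing thread are checked.

assert torch.cuda.is_available()


def pin_in_background():
    # Stands in for a DataLoader pin_memory thread allocating pinned host memory.
    torch.empty(1 << 20, pin_memory=True)


graph = torch.cuda.CUDAGraph()
static_x = torch.zeros(16, device="cuda")
capture_stream = torch.cuda.Stream()

# With "global", this capture can be invalidated if the background thread pins
# memory at the wrong moment; with "thread_local" the allocation is tolerated.
with torch.cuda.graph(graph, stream=capture_stream, capture_error_mode="thread_local"):
    t = threading.Thread(target=pin_in_background)
    t.start()
    static_x += 1  # the captured work
    t.join()

graph.replay()
print(static_x[:4])  # the captured add has now actually run once
```

The pinned-memory allocation is unrelated to the captured work, so restricting the capture-safety check to the capturing thread is sufficient here and sidesteps the sporadic failures.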
Collection: ASR
Usage
CUDA graphs will now be used by default for inference in both training and inference scripts, restoring the previous behavior.
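If you want to opt in or out explicitly, something like the following should work. Treat it as a hedged sketch: the `use_cuda_graph_decoder` field is assumed from NeMo's RNNT greedy decoding config and the pretrained model name is only an example; verify both against your NeMo version.

```python
# Hedged sketch: explicitly toggling the CUDA graph decoder from Python.
from omegaconf import open_dict

import nemo.collections.asr as nemo_asr

# Example checkpoint name; substitute the model you actually use.
model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    "stt_en_fastconformer_transducer_large"
)

decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    # Assumed config field; set False to opt out of CUDA graph decoding.
    decoding_cfg.greedy.use_cuda_graph_decoder = True
model.change_decoding_strategy(decoding_cfg)
```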
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
PR Type:
Note that I tested this by applying the following diff:
Then I ran this script:
Basically, I needed to add a way for speech_to_text_finetune.py to modify the decoding algorithm, so I could test both the loop-frames and loop-labels code paths. I do not include this code in the PR, since it is not robust to all model types (e.g., AED). Since I ran 100 times for each algorithm, we can be fairly confident that this fixes the problem.
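The test diff and script themselves are not included above. As a rough illustration only, the stress loop described amounts to something like the following; paths and overrides are placeholders, and the decoding-algorithm switch came from the author's local diff, which is not part of this PR.

```python
# Rough illustration only -- not the author's actual script. Paths, overrides,
# and the algorithm-selection mechanism are placeholders; the real selection came
# from a local diff to speech_to_text_finetune.py that is not included in the PR.
import subprocess

ALGORITHMS = ("loop_frames", "loop_labels")
RUNS_PER_ALGORITHM = 100

for algo in ALGORITHMS:
    failures = 0
    for i in range(RUNS_PER_ALGORITHM):
        result = subprocess.run(
            [
                "python",
                "examples/asr/speech_to_text_finetune.py",
                # dataset/model/trainer overrides elided, as in the original description
            ],
            capture_output=True,
        )
        if result.returncode != 0:
            failures += 1
            print(f"{algo}: run {i} failed\n{result.stderr.decode()[-2000:]}")
    print(f"{algo}: {RUNS_PER_ALGORITHM - failures}/{RUNS_PER_ALGORITHM} runs passed")
```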