fix(tdt): preserve disable_cuda_graphs state across change_decoding_strategy() calls #15457
Closed
CodersAcademy006 wants to merge 5 commits into
…Mo#15432) * Add initial uv lock Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Add build docs Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Fix docs build Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Fix docstring formatting in magpietts.py Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Fix docs build Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Rename broken links files Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Add release docs jobs Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Fix docs Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Dry-run of docs publishing Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Revert "Dry-run of docs publishing" This reverts commit 1c3aa19. Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Revert "Revert "Dry-run of docs publishing"" This reverts commit 43c19ae. Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Fix dry run Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Fix broken links Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Add retries for linke check Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Fix broken link Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Fix broken link Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Revert "Revert "Revert "Dry-run of docs publishing""" This reverts commit a28f306. Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Revert "Revert "Revert "Revert "Dry-run of docs publishing"""" This reverts commit 9353dcb. Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Test nightly publish Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Increase docs broken link retry and timeout Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Revert "Test nightly publish" This reverts commit 9c8e7a4. Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Revert "Revert "Revert "Revert "Revert "Dry-run of docs publishing""""" This reverts commit 850207e. 
Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Fix docs footer Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Test nightly push Signed-off-by: Charlie Truong <chtruong@nvidia.com> * Revert "Test nightly push" This reverts commit 05894f2. Signed-off-by: Charlie Truong <chtruong@nvidia.com> --------- Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Srijan Upadhyay <104912634+CodersAcademy006@users.noreply.github.com>
…trategy() Signed-off-by: Srijan Upadhyay <104912634+CodersAcademy006@users.noreply.github.com>
Signed-off-by: Srijan Upadhyay <104912634+CodersAcademy006@users.noreply.github.com>
f06c118 to 78474bd
Signed-off-by: CodersAcademy006 <CodersAcademy006@users.noreply.github.com>
Author
@rsclafani Please look into this as well, thank you. Please also let me know what else I should change in this PR.
Collaborator
Thanks for the contribution.

```python
from omegaconf import open_dict

# greedy decoding
with open_dict(model.cfg.decoding.greedy):
    model.cfg.decoding.greedy.use_cuda_graph_decoder = False
# beam decoding
with open_dict(model.cfg.decoding.beam):
    model.cfg.decoding.beam.allow_cuda_graphs = False
model.change_decoding_strategy(model.cfg.decoding)
```

We should not introduce such stateful changes -
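To illustrate why a config-driven setting survives repeated strategy changes, here is a minimal sketch. It assumes the decoder reads its CUDA graph setting only from the config it is built from. `GreedyConfig`, `DecodingConfig`, `ToyDecoder`, and `ToyModel` are hypothetical stand-ins, not NeMo classes; only the field name `use_cuda_graph_decoder` is taken from the comment above.

```python
from dataclasses import dataclass, field


@dataclass
class GreedyConfig:
    # Field name borrowed from the comment above; the default is an assumption.
    use_cuda_graph_decoder: bool = True


@dataclass
class DecodingConfig:
    greedy: GreedyConfig = field(default_factory=GreedyConfig)


class ToyDecoder:
    """Hypothetical decoder that derives its graph setting from the config only."""

    def __init__(self, cfg: DecodingConfig):
        self.cuda_graphs_enabled = cfg.greedy.use_cuda_graph_decoder


class ToyModel:
    """Hypothetical model that rebuilds its decoder on every strategy change."""

    def __init__(self):
        self.cfg = DecodingConfig()
        self.decoding = ToyDecoder(self.cfg)

    def change_decoding_strategy(self, cfg: DecodingConfig):
        # The decoder is recreated, but the config remains the source of truth.
        self.cfg = cfg
        self.decoding = ToyDecoder(cfg)


toy = ToyModel()
toy.cfg.greedy.use_cuda_graph_decoder = False
toy.change_decoding_strategy(toy.cfg)
first_rebuild = toy.decoding.cuda_graphs_enabled
toy.change_decoding_strategy(toy.cfg)
second_rebuild = toy.decoding.cuda_graphs_enabled
```

Because the setting lives in the config, no extra flag has to be carried from the old decoder object to the new one.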
Author
@artbataev I sincerely apologize if I moved forward without formal confirmation. Please guide me on the necessary steps to rectify this and ensure the issue is completely resolved.
Problem

When `disable_cuda_graphs()` is called on the decoding computer, the disabled state is lost whenever `change_decoding_strategy()` is invoked, because it creates a brand new `decoding_computer` object with CUDA graphs re-enabled by default (`cuda_graphs_mode='full_graph'`).

This silently re-enables CUDA graphs even when the user explicitly disabled them, re-introducing the `torch.load()` corruption bug reported in #15423.

Concretely, calling `model.transcribe(timestamps=True)` triggers `change_decoding_strategy()` internally, which replaces the entire `decoding_computer` object and resets the CUDA graph state in the process.

Root Cause

- The `disable_cuda_graphs()` state was only stored inside `_decoding_computer`
- `change_decoding_strategy()` creates a completely new `_decoding_computer`

Fix

Two minimal changes across two files:

`tdt_decoding.py`, in `BeamBatchedTDTInfer.disable_cuda_graphs()`:

- Track the disabled state via a `_cuda_graphs_disabled = True` flag on the outer object so it survives `_decoding_computer` replacement

`rnnt_models.py`, in `EncDecRNNTModel.change_decoding_strategy()`:

- Read `_cuda_graphs_disabled` from the current `decoding_computer` BEFORE replacing `self.decoding`
- After the new `self.decoding` is created, restore the disabled state on the new `decoding_computer`
- Use `try/except AttributeError` to remain safe across all model types that call this method
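The carry-over described in the list above can be sketched with toy classes. This is only an illustration under stated assumptions, not the actual diff: `DecodingComputer`, `Decoder`, and `Model` are hypothetical stand-ins, and the sketch assumes the flag lives on the object held in `self.decoding`.

```python
class DecodingComputer:
    """Hypothetical inner object that owns the CUDA graph state."""

    def __init__(self, cuda_graphs_mode="full_graph"):
        # Mirrors the default described above: graphs are enabled on creation.
        self.cuda_graphs_mode = cuda_graphs_mode

    def disable_cuda_graphs(self):
        self.cuda_graphs_mode = None


class Decoder:
    """Hypothetical outer object that remembers whether graphs were disabled."""

    def __init__(self):
        self._cuda_graphs_disabled = False
        self._decoding_computer = DecodingComputer()

    def disable_cuda_graphs(self):
        # Record the choice on the outer object as well as the inner one.
        self._cuda_graphs_disabled = True
        self._decoding_computer.disable_cuda_graphs()


class Model:
    """Hypothetical model whose strategy change rebuilds the decoder."""

    def __init__(self):
        self.decoding = Decoder()

    def change_decoding_strategy(self):
        # Read the flag BEFORE replacing self.decoding.
        try:
            was_disabled = self.decoding._cuda_graphs_disabled
        except AttributeError:
            was_disabled = False  # decoder type without the flag
        self.decoding = Decoder()
        # Restore the disabled state on the freshly created decoder.
        if was_disabled:
            try:
                self.decoding.disable_cuda_graphs()
            except AttributeError:
                pass  # decoder type without CUDA graph support


model = Model()
model.decoding.disable_cuda_graphs()
model.change_decoding_strategy()
mode_after_change = model.decoding._decoding_computer.cuda_graphs_mode
```

Without the read-and-restore steps in `change_decoding_strategy()`, the new `DecodingComputer` would come back with `cuda_graphs_mode='full_graph'`.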
Testing

The reproducer from #15423 no longer corrupts `torch.load()` after `transcribe(timestamps=True)` with this fix applied.

Fixes #15423
Notes

- The `try/except AttributeError` guards make this safe for all RNNT model variants, not just TDT
- The corruption of `torch.load()` dispatch table entries during CUDA graph capture is a separate upstream issue and should be tracked in pytorch/pytorch