Fix #8891 by supporting GPU-side batched CTC Greedy Decoding #9100
Conversation
To verify that latency does indeed fall, I ran the above for the batched_inference=true case and the batched_inference=false case.
You can see that the decoder takes half the time it used to, and that its maximum runtime is 118 ms, down from 171 ms.
The new algorithm part looks great, but it shouldn't be fused into CTC decoding; it should be treated as a strategy, similar to how RNNT does it.
@nithinraok up to you if you want to fuse the two together like this with a flag, or use the RNNT strategy pattern here in CTC.
Fixes NVIDIA#8891 Basically, doing max() on CPU one at a time is very very slow. It is better to do that all on the GPU before we do the copy over to CPU. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
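For readers skimming the thread, here is a minimal illustrative sketch (not the PR's actual code; tensor names and shapes are invented) of the difference the commit message describes: taking max() one utterance at a time on the CPU versus computing the argmax for the whole batch on the GPU and then doing a single copy to the CPU.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumed shape for illustration: [batch, time, vocab] log-probabilities.
log_probs = torch.randn(16, 1000, 1024, device=device)

def greedy_labels_cpu_loop(log_probs: torch.Tensor):
    """Slow pattern: copy everything to the CPU, then take max() per utterance."""
    log_probs_cpu = log_probs.cpu()
    results = []
    for b in range(log_probs_cpu.shape[0]):
        scores, labels = log_probs_cpu[b].max(dim=-1)  # one small CPU op per utterance
        results.append((scores, labels))
    return results

def greedy_labels_gpu_batched(log_probs: torch.Tensor):
    """Faster pattern: one batched argmax on the GPU, then one device-to-host copy."""
    scores, labels = log_probs.max(dim=-1)  # runs on the GPU for the whole batch
    return scores.cpu(), labels.cpu()       # copied tensors are much smaller than log_probs
```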
Force-pushed from e193293 to 3ebd1f1
Looks far cleaner than before; minor comments, but overall good enough to merge.
- Can we make this the default algorithm? If users want inference features that don't work in this mode, they can switch to the slower code path. NeMo 2.0 is the time for switching defaults if we want to.
- Can we name this greedy_batch, similar to RNNT?
@@ -213,7 +213,7 @@ def __init__(self, decoding_cfg, blank_id: int):
         self.batch_dim_index = self.cfg.get('batch_dim_index', 0)
         self.word_seperator = self.cfg.get('word_seperator', ' ')

-        possible_strategies = ['greedy', 'beam', 'pyctcdecode', 'flashlight']
+        possible_strategies = ['greedy', 'greedy_vectorized', 'beam', 'pyctcdecode', 'flashlight']
Hmm, can't we call it greedy_batched, similar to RNNT? I understand both are technically batched and "vectorized" is more appropriate; however, conformity is nice for both us and users.
Good observation. I changed it to "greedy_batched" and, generally speaking, changed "vectorized" to "batched" based on your observation about conformity.
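For orientation, here is a hypothetical sketch of how the strategy string ends up selecting a decoder implementation. GreedyCTCInfer and GreedyBatchedCTCInfer are the classes discussed in this PR, but the import path, constructor arguments, and dispatch below are assumptions, not the exact NeMo code.

```python
# Illustrative dispatch only; the real selection logic lives inside NeMo's CTCDecoding.
# The module path below reflects NeMo's layout at the time and may differ.
from nemo.collections.asr.parts.submodules.ctc_greedy_decoding import (
    GreedyCTCInfer,
    GreedyBatchedCTCInfer,
)

def build_greedy_ctc_infer(strategy: str, blank_id: int):
    if strategy == "greedy":
        return GreedyCTCInfer(blank_id=blank_id)         # original per-sample path
    if strategy == "greedy_batched":
        return GreedyBatchedCTCInfer(blank_id=blank_id)  # new GPU-side batched path
    raise ValueError(f"Unsupported greedy CTC strategy: {strategy}")
```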
Force-pushed from a4e373e to edc2b13
Will need to get back to you about what it takes to make this the default algorithm. The main concern is that many existing ".nemo" files will have the "strategy" field of the relevant config set to "greedy". It's not really clear to me how we can move users of those pretrained models over to the "greedy_batched" strategy. For what it's worth, my tests check that the two interfaces are effectively the same, so I could just use GreedyBatchedCTCInfer whenever the "greedy" strategy is specified, unless that distresses you.
I see the CI/CD pipeline passed: https://github.com/NVIDIA/NeMo/actions/runs/8976148083
Hmm, I see your point. Let's do this:
I'm not averse to replacing "greedy" with "greedy_batch", but then we need a different flag name to allow users to fall back to the older code path. Maybe we can call it "greedy_legacy" then?
Warn when using greedy rather than greedy_batched strategy. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com>
@titu1994 done. To clarify, I went with the warn-once path rather than making "greedy" point to the new implementation.
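For context, here is a minimal sketch of what a warn-once path like this could look like; the message text and where it hooks into NeMo are illustrative, not the PR's exact code.

```python
import logging

logger = logging.getLogger(__name__)
_warned_about_greedy = False  # module-level flag so the hint is emitted only once

def maybe_warn_about_greedy(strategy: str) -> None:
    """Illustrative: nudge users from 'greedy' to the faster 'greedy_batched' strategy."""
    global _warned_about_greedy
    if strategy == "greedy" and not _warned_about_greedy:
        logger.warning(
            "CTC decoding strategy 'greedy' is slower than 'greedy_batched'. "
            "Consider setting strategy='greedy_batched' in the decoding config."
        )
        _warned_about_greedy = True
```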
Looks great! Thanks for the refactor!
…IDIA#9100) * Support batched inference of greedy CTC decoding. Fixes NVIDIA#8891 Basically, doing max() on CPU one at a time is very very slow. It is better to do that all on the GPU before we do the copy over to CPU. This new algorithm has the same interface as the old one and can be accessed by setting strategy to "greedy_batched" rather than "greedy". Warn when using greedy rather than greedy_batched strategy. Signed-off-by: Daniel Galvez <dgalvez@nvidia.com> Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>
What does this PR do?
I add a new flag "batched_inference" to the CTC greedy decoder.
Collection: ASR
Changelog
Usage
Here is an example turning this on for transcribe_speech.py
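The original command is not reproduced here; as a stand-in, the sketch below enables the option through the Python API instead. The model name and config keys are assumptions, and note that in the merged version the option is selected by setting the decoding strategy to "greedy_batched" rather than a separate flag.

```python
# Sketch only: switching a pretrained CTC model to the batched greedy decoding path.
import nemo.collections.asr as nemo_asr
from omegaconf import open_dict

model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")

decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.strategy = "greedy_batched"  # default is still "greedy"
model.change_decoding_strategy(decoding_cfg)

transcripts = model.transcribe(["sample1.wav", "sample2.wav"], batch_size=2)
```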
Additional Information
Here are the speedups in terms of throughput. First, I applied this diff to transcribe_speech.py in order to run it multiple times:
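The diff itself is not shown here; the snippet below is only a rough illustration of the kind of change described (looping the transcription call so throughput can be averaged over several runs), with invented names.

```python
# Rough illustration only, not the actual diff applied to transcribe_speech.py.
import time

def run_repeatedly(transcribe_fn, num_runs: int = 5) -> None:
    """Call the existing transcription entry point several times and time each run."""
    durations = []
    for i in range(num_runs):
        start = time.perf_counter()
        transcribe_fn()  # stand-in for the script's existing transcription call
        durations.append(time.perf_counter() - start)
        print(f"run {i}: {durations[-1]:.2f} s")
    print(f"mean over {num_runs} runs: {sum(durations) / num_runs:.2f} s")
```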
Then I ran with and without this option turned on:
Without batched inference:
With batched inference:
You can see that the RTFx throughput improvements are only modest. However, in terms of inference latency, there is a large speedup at the P99 level, because the occasionally long calls to max() on the CPU no longer happen. We don't have a great way to measure this with NeMo's current tooling, so you will just have to take my word for it for now. See #8891 for details about this problem.
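For anyone who wants to reproduce the tail-latency claim independently of NeMo's tooling, here is a small sketch that times each decode call and reports percentiles; decode_batch is a placeholder, not a NeMo API.

```python
# Sketch: record per-batch decoder latency and report tail percentiles.
import time
import numpy as np

def measure_decode_latency(decode_batch, batches):
    """`decode_batch` is whatever decoding call is being profiled; `batches` is an iterable of inputs."""
    latencies_ms = []
    for batch in batches:
        start = time.perf_counter()
        decode_batch(batch)  # if this launches async GPU work, synchronize before stopping the timer
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms = np.asarray(latencies_ms)
    print(f"median: {np.percentile(latencies_ms, 50):.1f} ms")
    print(f"p99:    {np.percentile(latencies_ms, 99):.1f} ms")
    print(f"max:    {latencies_ms.max():.1f} ms")
    return latencies_ms
```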