Bugfixes for decoding with Flashlight decoder #8856
Conversation
Thank you very much for the fix!
The PR is ready to merge; however, we require all contributors to sign their commits. Could you follow the instructions from the DCO bot to resolve this: https://github.com/NVIDIA/NeMo/pull/8856/checks?check_run_id=23589977417
OK, will do so. But there is one more commit I want to make that fixes lexicon-free decoding. Or should I open a new PR for it? It only changes a few lines of code, and I think it's better to have it in a single PR.
Signed-off-by: Michael Hentschel <hentschel.michael@worksmobile.com>
Force-pushed from 60348e7 to a1dace8
for more information, see https://pre-commit.ci
Signed-off-by: Michael Hentschel <hentschel.michael@line-works.com>
Signed-off-by: Michael Hentschel <hentschel.michael@line-works.com>
@titu1994 I hope everything is fine now.
}
self.word_dict = create_word_dict(d)
# Dictionary contains a mapping of our ASR model's output IDs to the tokens in the vocabulary
self.word_dict = Dictionary(self.tokenizer_wrapper.vocab)
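To make the point of this change concrete, here is a minimal illustrative sketch (not the actual flashlight API; the function name is hypothetical): the decoder's token dictionary must map each ASR output ID to the token at the same index in the tokenizer's vocabulary, one entry per logit, with order preserved.

```python
# Illustrative only: shows why the dictionary must mirror the tokenizer's
# vocabulary order, not the real flashlight Dictionary class.

def build_token_dictionary(vocab):
    """Map output ID -> token, preserving the tokenizer's ordering."""
    return {idx: token for idx, token in enumerate(vocab)}

vocab = ["<unk>", "\u2581the", "\u2581a", "s", "ing"]  # toy vocabulary
token_dict = build_token_dictionary(vocab)

# The dictionary size must equal the size of the model's output layer
# (one entry per logit); a mismatch would misalign IDs and tokens.
assert len(token_dict) == len(vocab)
assert token_dict[0] == "<unk>"
```

If the entries were inserted in a different order (as can happen when building the dictionary from an intermediate structure), ID 0 would no longer point at the token the model's logit 0 actually represents, which is exactly the misalignment the fix addresses.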
The dictionary has to contain the same number of IDs as the size of the ASR model's output layer. I think we can't just add an <unk> token if it isn't already in the tokeniser's vocab.
I would need a review for this one, but I'm heading to a conference. @nithinraok can you ping the author of the file for review
Actually, if you remove the <unk>, it causes a lot of problems with silences. I remember I added this hack because we had a lot of problems with lexicon-free decoding: we were not generating spaces correctly, and adding this fixed it. This is from two years ago, so my memory of the specifics is hazy, but I would strongly urge you to test your new code without the <unk> in lexicon-free mode on both English and Japanese to make sure you're not seeing any strange behaviour (it is also possible that whatever issue I was seeing two years ago when I added this hack has since been fixed in the C++ part of the Flashlight code).
@trias702
Sorry for the late reply. I was at ICASSP, and last week was Golden Week in Japan (the default holiday season).
First things first: I haven't tested the implementation with a model/tokeniser that does not have an OOV token; all the tokenisers I'm using have one. I will have a look at the CTC probs for the OOV token to see whether it is actually output anywhere.
Regarding the Flashlight code, the corresponding decoding part is here
https://github.com/flashlight/text/blob/bbe9e3c201f5c9c3f3c0d553f0ea73af5e0a5209/flashlight/lib/text/decoder/LexiconFreeDecoder.cpp#L30-L125
The decoder iterates over the top token_size probabilities. If the AM has an OOV token, it should show up in the emission probabilities. Equally, if the LM doesn't have an OOV token, we should not be able to get a score for the OOV token, i.e., these lines should fail.
https://github.com/flashlight/text/blob/bbe9e3c201f5c9c3f3c0d553f0ea73af5e0a5209/flashlight/lib/text/decoder/LexiconFreeDecoder.cpp#L72-L73
because we get this error
https://github.com/flashlight/text/blob/bbe9e3c201f5c9c3f3c0d553f0ea73af5e0a5209/flashlight/lib/text/decoder/lm/KenLM.cpp#L66-L69
(I assume that an OOV token would have to be added after the tokeniser's vocab and have an index > usrToLmIdxMap_.size())
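A rough Python sketch of the index-map behaviour described above (all names are illustrative stand-ins, not the real C++ API): the LM wrapper maps user token indices to LM word IDs, and an index past the end of that map, such as an OOV token appended after the vocabulary, triggers the invalid-index error in KenLM.cpp.

```python
# Toy mirror of the usrToLmIdxMap_ bounds check referenced in KenLM.cpp.
# Names and behaviour are illustrative, not the actual flashlight API.

def score_token(usr_to_lm_idx_map, token_idx):
    """Look up a user token index in the user->LM index map."""
    if token_idx >= len(usr_to_lm_idx_map):
        # Mirrors the "Invalid user token index" failure path.
        raise IndexError(f"Invalid user token index: {token_idx}")
    return usr_to_lm_idx_map[token_idx]

idx_map = [10, 11, 12]  # toy mapping: user token ID -> LM word ID
assert score_token(idx_map, 1) == 11

oov_failed = False
try:
    score_token(idx_map, 3)  # index past the map, as in the OOV case above
except IndexError:
    oov_failed = True
assert oov_failed
```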
Force-pushed from 665d9c2 to 46589af
LGTM; however, I am not sure about wrapping self.tokenizer_wrapper.vocab with the flashlight Dictionary class and not adding the <unk> token if it's not there. Have you tested lexicon-free decoding with this change for both English and Japanese and confirmed it works correctly for both? That is my only hesitation, which I would ask to be tested and verified with the Riva team, but everything else LGTM.
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
I had some time to test the fix with an English model. The model is the same as the baseline from here (CTC Conformer), and the LM is also the same (SentencePiece 6-gram). The tokeniser has an <unk> token.
Without bug fix
With bug fix
For reference
Sounds good. I appreciate you running these; LGTM, I'll approve.
LGTM
Signed-off-by: hentschel.michael@line-works.com
Force-pushed from 7f5b5d2 to 1814646
Signed-off-by: michiahe <michiahe@users.noreply.github.com>
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
What does this PR do?
Bugfixes for CTC beam search decoding with Flashlight.
Fix the maximum decoding length for CTC beam search decoding with Flashlight. The maximum sequence length has to be passed to the Flashlight decoder, otherwise junk/masked frames are decoded (resulting in insertion errors etc.). This info is already passed properly to the OpenSeq2Seq and pyctcdecode decoders by restricting the sequence length.
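The length fix can be sketched as follows (an illustrative stand-in, not NeMo's actual code; the function name is hypothetical): padded frames past each utterance's true length must never reach the decoder, or they decode as spurious insertions, so the per-utterance log-probs are restricted to the valid frames first.

```python
import numpy as np

def trim_log_probs(batch_log_probs, seq_lens):
    """Restrict each utterance's log-probs to its valid frames,
    so padded/junk frames are never handed to the beam search decoder."""
    return [lp[:n] for lp, n in zip(batch_log_probs, seq_lens)]

# 2 utterances padded to 8 frames, 5-token vocabulary
batch = np.zeros((2, 8, 5))
trimmed = trim_log_probs(batch, [6, 8])
assert trimmed[0].shape == (6, 5)  # only the 6 valid frames survive
assert trimmed[1].shape == (8, 5)  # full-length utterance is untouched
```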
Fix the dictionary creation for lexicon-free decoding. The dictionary has to contain a mapping from the ASR model's output IDs to the tokens in the vocabulary, i.e., exactly the same as the tokeniser's vocabulary. The order of this mapping matters.
Collection: ASR
Changelog
Usage
# Add a code snippet demonstrating how to use this
Jenkins CI
To run Jenkins, a NeMo user with write access must comment
jenkins
on the PR.
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information