
Bugfixes for decoding with Flashlight decoder #8856

Closed
wants to merge 7 commits

Conversation


@michiahe michiahe commented Apr 8, 2024

What does this PR do?

Bugfixes for CTC beam search decoding with Flashlight.

  1. Fix the maximum decoding length for CTC beam search decoding with Flashlight. The maximum sequence length has to be passed to the Flashlight decoder; otherwise junk/masked frames are decoded, which results in insertion errors etc. This information is already passed properly to the OpenSeq2Seq and pyctcdecode decoders by restricting the sequence length (see the sketch after this list).

  2. Fix the dictionary creation for lexicon-free decoding. The Dictionary has to contain a mapping from the ASR model's output IDs to the tokens in the vocabulary, i.e., exactly the same entries as the tokeniser's vocabulary. The order of this mapping matters.
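For fix 1, here is a minimal sketch of the idea (not the PR's actual code): the decoder should only see the valid frames of each padded utterance. It assumes a padded [batch, time, vocab] log-probability tensor and the flashlight-text Python decoder call `decoder.decode(pointer, T, N)`; all names here are illustrative.

```python
import torch

def decode_batch(decoder, log_probs: torch.Tensor, encoded_lengths: torch.Tensor):
    """Trim padded/masked frames before handing emissions to Flashlight."""
    hypotheses = []
    for probs, length in zip(log_probs, encoded_lengths):
        # Keep only the valid frames; decoding the padding causes insertion errors.
        valid = probs[: int(length)].float().cpu().contiguous()
        # flashlight-text decoders take a raw data pointer plus (time, vocab) dims.
        results = decoder.decode(valid.data_ptr(), valid.size(0), valid.size(1))
        hypotheses.append(results[0])  # best-scoring hypothesis
    return hypotheses
```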

Collection: ASR

Changelog

  • Fix the maximum decoding length for CTC beam search decoding with Flashlight
  • Fix the dictionary creation for lexicon-free decoding

Usage

  • You can potentially add a usage example below
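The usage section was left as a template placeholder; below is a hedged sketch of how Flashlight decoding is typically enabled in NeMo. The exact config schema (strategy, beam_size, beam_alpha, beam_beta, kenlm_path, flashlight_cfg, lexicon_path) varies between NeMo releases, so treat every key as an assumption and check the ASR decoding docs for your version.

```python
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")

# Hypothetical decoding config; field names may differ in your NeMo release.
decoding_cfg = OmegaConf.create({
    "strategy": "flashlight",
    "beam": {
        "beam_size": 10,
        "beam_alpha": 0.7,         # LM weight
        "beam_beta": 0.0,          # word insertion bonus
        "kenlm_path": "/path/to/lm.bin",
        "flashlight_cfg": {
            "lexicon_path": None,  # None -> lexicon-free decoding
        },
    },
})
asr_model.change_decoding_strategy(decoding_cfg)
print(asr_model.transcribe(["sample.wav"]))
```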

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the ASR label Apr 8, 2024
titu1994 previously approved these changes Apr 11, 2024

@titu1994 (Collaborator) left a comment:

Thank you very much for the fix!

@titu1994 (Collaborator) commented:

The PR is ready to merge; however, we require all contributors to sign their commits. Could you follow the instructions from the DCO bot to resolve this? https://github.com/NVIDIA/NeMo/pull/8856/checks?check_run_id=23589977417

@michiahe (Author) replied:

OK, I will do so. But there is one more commit I want to make that fixes the lexicon-free decoding. Or should I open a new PR for it? It changes only a few lines of code, and I think it's better to have everything in a single PR.

Signed-off-by: Michael Hentschel <hentschel.michael@worksmobile.com>
pre-commit-ci bot and others added 2 commits April 11, 2024 07:33
Signed-off-by: Michael Hentschel <hentschel.michael@line-works.com>
@michiahe michiahe changed the title Fix decoding length for Flashlight decoder Bugfixes for decoding with Flashlight decoder Apr 11, 2024
Signed-off-by: Michael Hentschel <hentschel.michael@line-works.com>
@michiahe (Author) commented Apr 11, 2024:

@titu1994 I hope everything is fine now.

```diff
-        }
-        self.word_dict = create_word_dict(d)
+        # Dictionary contains a mapping of our ASR model's output IDs to the tokens in the vocabulary
+        self.word_dict = Dictionary(self.tokenizer_wrapper.vocab)
```
@michiahe (Author) commented on the diff:

The dictionary has to contain the same number of IDs as the size of the ASR model's output layer. I think we can't just add an <unk> token if it isn't already in the tokeniser's vocab.
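To illustrate the constraint, here is a small sketch with a made-up toy vocabulary; real code would pass the tokenizer's vocab list, as the diff above does, and the method names are as I recall the flashlight-text Python bindings.

```python
from flashlight.lib.text.dictionary import Dictionary

# Illustrative tokenizer vocabulary, in the model's output-ID order.
vocab = ["<unk>", "▁the", "▁a", "s", "ing"]
token_dict = Dictionary(vocab)

# One dictionary entry per acoustic-model output unit, in the same order;
# appending an extra <unk> would shift the IDs of everything after it.
assert token_dict.index_size() == len(vocab)
assert token_dict.get_entry(1) == "▁the"
```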

A Collaborator replied:

I would need a review for this one, but I'm heading to a conference. @nithinraok, can you ping the author of the file for review?

@trias702 (Collaborator) commented Apr 29, 2024:

Actually, if you remove the <unk> it causes a lot of problems with silences. I remember adding this hack because we had a lot of problems with lexicon-free decoding: we were not generating spaces correctly, and adding this in fixed it. This was two years ago, though, so I'm hazy on the specifics, but I would strongly urge you to test your new code without the <unk> in lexicon-free mode on both English and Japanese to ensure you're not seeing any strange behaviour. (It is also possible that whatever issue I was seeing two years ago when I added this hack has since been fixed in the C++ part of the Flashlight code.)

@michiahe (Author) replied:

@trias702
Sorry for the late reply. I was at ICASSP, and last week was Golden Week in Japan (the default holiday season).

First things first: I haven't tested the implementation with a model/tokeniser that does not have an OOV token; all the tokenisers I'm using have one. I will have a look at the CTC probs for the OOV token to see whether it is actually output anywhere.

Regarding the Flashlight code, the corresponding decoding part is here:
https://github.com/flashlight/text/blob/bbe9e3c201f5c9c3f3c0d553f0ea73af5e0a5209/flashlight/lib/text/decoder/LexiconFreeDecoder.cpp#L30-L125
The decoder iterates over the top token_size probabilities. If the AM doesn't have an OOV token, it should not show up in the emission probabilities. Equally, if the LM doesn't have an OOV token, we should not be able to get a score for it, i.e., these lines should fail:
https://github.com/flashlight/text/blob/bbe9e3c201f5c9c3f3c0d553f0ea73af5e0a5209/flashlight/lib/text/decoder/LexiconFreeDecoder.cpp#L72-L73
because we would get this error:
https://github.com/flashlight/text/blob/bbe9e3c201f5c9c3f3c0d553f0ea73af5e0a5209/flashlight/lib/text/decoder/lm/KenLM.cpp#L66-L69

(I assume that an OOV token should be added after the tokeniser's vocab and have an index > usrToLmIdxMap_.size())
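As an aside, the check mentioned above ("have a look at the CTC probs for the OOV token") could look something like this hypothetical helper; it is not code from the PR.

```python
import torch

def unk_emission_stats(log_probs: torch.Tensor, unk_id: int):
    """log_probs: [time, vocab] CTC log-probabilities for one utterance."""
    best = log_probs.argmax(dim=-1)
    frames_won = int((best == unk_id).sum())            # frames where <unk> is the argmax
    max_prob = float(log_probs[:, unk_id].exp().max())  # highest probability it ever gets
    return frames_won, max_prob  # both near zero -> <unk> is effectively never emitted
```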

Signed-off-by: Michael Hentschel <hentschel.michael@line-works.com>
@michiahe michiahe force-pushed the fix_flashlight_decoding_length branch from 665d9c2 to 46589af on April 23, 2024 00:29
@trias702 (Collaborator) left a comment:

LGTM; however, I am not sure about wrapping self.tokenizer_wrapper.vocab with the flashlight Dictionary class without adding the <unk> token when it isn't there. Have you tested lexicon-free decoding with this change for both English and Japanese and confirmed that it works correctly for both? That is my only hesitation, and I would ask that it be tested and verified with the Riva team, but everything else LGTM.

github-actions bot commented:

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label May 14, 2024
@michiahe (Author) commented:

@titu1994 @trias702

I had some time to test the fix with an English model. The model is the same as the baseline from here (CTC Conformer) and the LM is also the same (sentencepiece 6gram).

The tokeniser has an <unk> token, but the results clearly improve.

Without bug fix

  • beam width: 10, alpha: 0.7, beta: 0.0
| Task | WER(%) | Ins(%) | Del(%) | Sub(%) |
| --- | --- | --- | --- | --- |
| Librispeech_dev_clean | 8.86 | 4.32 | 0.57 | 3.97 |
| Librispeech_dev_other | 14.98 | 4.78 | 1.94 | 8.26 |
| Librispeech_test_clean | 9.15 | 4.39 | 0.66 | 4.10 |
| Librispeech_test_other | 15.06 | 4.96 | 1.88 | 8.23 |

With bug fix

  • beam width: 10, alpha: 0.7, beta: 0.0
| Task | WER(%) | Ins(%) | Del(%) | Sub(%) |
| --- | --- | --- | --- | --- |
| Librispeech_dev_clean | 2.54 | 0.22 | 0.24 | 2.08 |
| Librispeech_dev_other | 6.64 | 0.52 | 0.87 | 5.25 |
| Librispeech_test_clean | 2.90 | 0.29 | 0.30 | 2.30 |
| Librispeech_test_other | 6.64 | 0.53 | 0.86 | 5.25 |

For reference

  • OS2S beam search decoder results from the paper linked above:
    • dev clean: 2.53
    • test clean: 2.79
    • dev other: 6.50
    • test other: 6.56
  • Note: these results used slightly different settings for alpha and beta

@trias702 (Collaborator) replied:

Sounds good, I appreciate you running these. LGTM, I'll approve.

trias702 previously approved these changes May 20, 2024

@trias702 (Collaborator) left a comment:

LGTM

Signed-off-by: hentschel.michael@line-works.com
@michiahe michiahe force-pushed the fix_flashlight_decoding_length branch from 7f5b5d2 to 1814646 on May 23, 2024 04:30
Signed-off-by: michiahe <michiahe@users.noreply.github.com>
@github-actions github-actions bot removed the stale label Jul 26, 2024
github-actions bot commented Aug 9, 2024:

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Aug 9, 2024
github-actions bot commented:
This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Aug 18, 2024