Mismatch in the dimensions length of log probabilities of Nemo ASR transcribe function and Vocab size #5738
In NeMo, the length of the tokenizer is just the length of the vocab. CTC and RNNT models need at least one additional token (the last token), which corresponds to the "blank" token used by the CTC/RNNT loss. So 1025 is correct: index 1024 (0-based indexing) is the index of the blank token.
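The off-by-one described above can be sketched without NeMo at all. A minimal sketch, using hypothetical stand-ins for the thread's 1024-token vocab and (36, 1025) log-probs (`vocab` here is a placeholder for `asr_model.decoder.vocabulary`):

```python
import numpy as np

# Hypothetical stand-in for asr_model.decoder.vocabulary (1024 BPE tokens).
vocab = [f"tok{i}" for i in range(1024)]
T = 36
# A CTC head emits len(vocab) + 1 classes: the vocab plus the blank token.
logits = np.random.randn(T, len(vocab) + 1)

# The blank token lives at the last index: 1024 with 0-based indexing.
blank_id = len(vocab)
assert logits.shape[1] == len(vocab) + 1

# A decoder that requires len(labels) == logits.shape[1] can be satisfied
# by appending an explicit blank symbol to the label list:
labels = list(vocab) + [""]
assert len(labels) == logits.shape[1]
```

This is only an illustration of the dimension mismatch, not NeMo's or pyctcdecode's actual internals.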
Thanks @titu1994. In that case, can you point me (if you are aware of any) to resources on using pyctcdecode with a NeMo (Conformer-CTC) model? I have tried following the tutorials available on GitHub etc., but when I try those I get an error saying "the logits dimension should be same as vocab dimension", which stops me there. Any info on solving this would be of great help. I am fully aware that NeMo has its own CTC decoder support (eval_beamsearch_ngram.py); in fact, I am using that too. But I just wanted to explore pyctcdecode. Additionally, I wanted to know: does the NeMo CTC decoder support hot-word boosting and beam pruning? Thanks a lot.
Hmm, I think pyctcdecode has built-in NeMo support in its library examples. Could you take a look? It should be as simple as passing the vocab and the log-probs.
Yes, thanks @titu1994. That's precisely what I did. I followed https://pypi.org/project/pyctcdecode/ and a few others. When I try to run on the fine-tuned model, it throws this error: "ValueError: Input logits shape is (36, 513), but vocabulary is size 512. Need logits of shape: (time, vocabulary)". However, it works very well (no errors) for pre-trained models like "stt_en_conformer_ctc_small". But if I try to use my fine-tuned model as below, it throws the above error. Thanks.
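The error quoted above boils down to a simple shape check. A minimal sketch of that kind of check (not pyctcdecode's actual code), using the sizes from the traceback, shows why appending a blank entry to the vocab makes the sizes agree:

```python
import numpy as np

# Sketch of the kind of validation that produces the quoted ValueError:
# a decoder expecting one label per logit column rejects a (time, vocab+1)
# matrix paired with a vocab-sized label list.
def check_logits(logits, vocabulary):
    if logits.shape[1] != len(vocabulary):
        raise ValueError(
            f"Input logits shape is {logits.shape}, but vocabulary is size "
            f"{len(vocabulary)}. Need logits of shape: (time, vocabulary)"
        )

logits = np.zeros((36, 513))          # CTC head: 512 tokens + 1 blank
vocab = [f"tok{i}" for i in range(512)]

mismatch = False
try:
    check_logits(logits, vocab)       # raises: 513 columns vs. 512 labels
except ValueError:
    mismatch = True

# Appending an explicit blank symbol makes the sizes agree:
check_logits(logits, vocab + [""])
```

Whether an empty-string blank is an acceptable label for a given decoder is an assumption here; the real fix may need to happen inside the decoding library, as discussed later in this thread.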
Hmm, I dunno why it would work for a pretrained model but not a fine-tuned one. What was the fine-tuned model trained on? And what is its vocab?
Thanks @titu1994 |
And @titu1994, one difference I could see between these two vocabularies was the '<pad>' token: it was present in the fine-tuned model, while '<pad>' was not in the pre-trained models. Is it causing any issue? Thanks.
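One quick way to confirm which tokens differ between two tokenizers is a set difference over the vocab lists. The lists below are hypothetical placeholders illustrating the observation above:

```python
# Hypothetical vocab lists: the fine-tuned tokenizer carries an extra
# '<pad>' token that the pretrained one lacks.
pretrained_vocab = ["▁the", "▁a", "s", "ing"]
finetuned_vocab = ["<pad>", "▁the", "▁a", "s", "ing"]

extra = set(finetuned_vocab) - set(pretrained_vocab)
print(extra)  # {'<pad>'}
```

In practice the lists would come from each model's `decoder.vocabulary`; any extra token shifts the vocab size and therefore the expected logits width.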
Is there any use of having the <pad> token (through the --spe_pad option) during fine-tuning? Is it required to have that token? What does it do? Thanks
No, you don't need spe_pad or any of the SPE special tokens. However, it should also not affect decoding with pyctcdecode.
Hi @titu1994, |
There's no workaround as such. It will probably need some fixes in the decoding framework in pyctcdecode.
Sure, thanks. Raised a bug with pyctcdecode at kensho-technologies/pyctcdecode#93.
Hi @titu1994, I am getting this error if I pass the "kenlm_model_path" argument as below to build_ctcdecoder() of pyctcdecode: There is absolutely no error if I do not include kenlm_model_path (and pass only the vocabulary) as below: Please suggest. Thanks
Dunno about that. Maybe you're passing in a binary file, not ARPA, or maybe they don't support KenLM with subword models. I'd ask on their GitHub.
Sure, Thanks a lot for your response. |
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
This issue was closed because it has been inactive for 7 days since being marked as stale. |
Hi All,
I am working on pyctcdecode integration with NeMo ASR models. However, there is a mismatch between the dimensions of the log-probability matrix output by the NeMo ASR transcribe function and the length of the vocabulary. Because of this, I am unable to proceed further. Any suggestions will be very helpful.
```python
asr_model1 = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_conformer_ctc_medium")
print(len(asr_model1.decoder.vocabulary))  # outputs "1024"

logits = asr_model1.transcribe(["/data/manju/tamil/data/segWav/tvs_test_9976337304-in-Speaker_1-11.wav"], logprobs=True)[0]
print(logits.shape)  # outputs "(36, 1025)"; I expected "1024" in place of "1025"
```
Please suggest, @titu1994 @jbalam-nv. Thanks.