
Mismatch between the dimensions of the log probabilities from the NeMo ASR transcribe function and the vocab size #5738

Closed
manjuke opened this issue Jan 5, 2023 · 17 comments
Labels: bug, stale

Comments

manjuke commented Jan 5, 2023

Hi All,
I am working on pyctcdecode integration with NeMo ASR models. However, there is a mismatch between the dimensions of the log-probabilities matrix output by the NeMo ASR transcribe function and the vocabulary size. Because of this I am unable to proceed further. Any suggestions will be very helpful.

asr_model1 = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_conformer_ctc_medium")

print(len(asr_model1.decoder.vocabulary))  # outputs 1024

logits = asr_model1.transcribe(["/data/manju/tamil/data/segWav/tvs_test_9976337304-in-Speaker_1-11.wav"], logprobs=True)[0]

print(logits.shape)  # outputs (36, 1025); I expected 1024 in place of 1025

Please suggest, @titu1994 @jbalam-nv. Thanks

manjuke added the bug label Jan 5, 2023
titu1994 (Collaborator) commented Jan 5, 2023

In NeMo, getting the length of the tokenizer gives only the length of the vocab. CTC and RNNT models need at least one additional token (the last token), which corresponds to the "blank" token used by the CTC/RNNT loss.

So 1025 is correct: index 1024 (0-based indexing) is the index of the blank token.
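
A minimal sketch of that invariant, reusing the snippet above ("audio.wav" is a placeholder path):

vocab = asr_model1.decoder.vocabulary
blank_id = len(vocab)  # blank is appended as the final index, here 1024
logits = asr_model1.transcribe(["audio.wav"], logprobs=True)[0]
assert logits.shape[-1] == len(vocab) + 1  # (time, vocab + blank)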

manjuke (Author) commented Jan 6, 2023

Thanks @titu1994. In that case, can you point me (if you are aware) to any resources on using pyctcdecode with a NeMo (Conformer-CTC) model? I have tried following the tutorials available on GitHub etc., but when I try those I get an error saying "the logits dimension should be same as vocab dimension", which stops me there. Any info on solving this would be of great help.

I am completely aware that NeMo has its own CTC decoder support (eval_beamsearch_ngram.py); in fact, I am using that too. But I just wanted to explore pyctcdecode.

Additionally, I just wanted to know: does the NeMo CTC decoder support hot-word boosting and beam pruning? Thanks a lot.
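
(For comparison, pyctcdecode exposes both at decode time; a sketch assuming a decoder already built from the model vocabulary as in the later snippets, with placeholder hot words:)

text = decoder.decode(
    logits,
    beam_width=64,           # beam pruning: keep at most 64 hypotheses
    hotwords=["conformer"],  # placeholder hot words to boost
    hotword_weight=10.0,
)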

titu1994 (Collaborator) commented Jan 6, 2023

Hmm, I think pyctcdecode has inbuilt NeMo support in its library examples. Could you take a look? It should be as simple as passing the vocab and the logprobs.

manjuke (Author) commented Jan 6, 2023

Yes, thanks @titu1994. That's precisely what I did. I followed https://pypi.org/project/pyctcdecode/ and a few others.

It throws this error when I run on the fine-tuned model: "ValueError: Input logits shape is (36, 513), but vocabulary is size 512. Need logits of shape: (time, vocabulary)".

However, it works very well (no errors) for pre-trained models like "stt_en_conformer_ctc_small":
import nemo.collections.asr as nemo_asr
from pyctcdecode import build_ctcdecoder

myFile = ['sample-in-Speaker_1-11.wav']
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name='stt_en_conformer_ctc_small')
logits = asr_model.transcribe(myFile, logprobs=True)[0]
print((logits.shape, len(asr_model.decoder.vocabulary)))
decoder = build_ctcdecoder(asr_model.decoder.vocabulary)
print(decoder.decode(logits))

But if I try to use my finetuned model as below, it throws the above error:
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(restore_path="<path of finetune model>")

Thanks

titu1994 (Collaborator) commented Jan 6, 2023

Hmm, I dunno why it would work for a pretrained model but not a finetuned one. What was the finetuned model trained on? And what is its vocab?

manjuke (Author) commented Jan 6, 2023

Thanks @titu1994.
The finetuned model was initialized from stt_en_conformer_ctc_medium and finetuned on Tamil data.
Its vocab was 512-dimensional, containing both English and Tamil (Unicode) sub-word tokens.
Thanks

manjuke (Author) commented Jan 6, 2023

And @titu1994, one difference I could see between the two vocabularies was the '<pad>' token: it was present in the finetuned model, while '<pad>' was not in the pre-trained models. Is it causing any issue? Thanks.

manjuke (Author) commented Jan 6, 2023

Is there any use of having a <pad> token (through the --spe_pad option) during fine-tuning? Is it required to have this token? What does it do? Thanks

titu1994 (Collaborator) commented Jan 6, 2023

No, you don't need spe_pad or any of the SPE special tokens. However, it should also not affect pyctcdecode's decoding.

manjuke (Author) commented Jan 6, 2023

Hi @titu1994,
I confirmed that the issue is with <pad> tokens in the vocabulary. I took the conformer-en-ctc-medium model, finetuned it on the same Tamil dataset without <pad> tokens, and tried decoding with pyctcdecode. It worked without any issues and produced the expected output. So please check whether this is a bug, or suggest a workaround to exclude <pad> symbols (for decoding with pyctcdecode) from models that are trained with <pad> symbols. Thanks.
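
A quick check that surfaces the mismatch (a hypothetical diagnostic, reusing names from the earlier snippets):

vocab = asr_model.decoder.vocabulary
print(len(vocab), logits.shape[-1])  # fine-tuned model with <pad>: 512 vs 513
print('<pad>' in vocab)              # True for the model that fails in pyctcdecode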

titu1994 (Collaborator) commented Jan 6, 2023

There's no workaround as such. It will probably need some fixes in the decoding framework in pyctcdecode.

manjuke (Author) commented Jan 7, 2023

Sure, thanks. Raised a bug with pyctcdecode at kensho-technologies/pyctcdecode#93.

manjuke (Author) commented Jan 9, 2023

Hi @titu1994,
Any idea about this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 62: invalid start byte

I am getting this error if I pass the "kenlm_model_path" argument to pyctcdecode's build_ctcdecoder() as below:
decoder = build_ctcdecoder(asr_model.decoder.vocabulary, kenlm_model_path=kenlm_model_file)
This error comes for both pretrained and fine-tuned models (without the <pad> token).

There is absolutely no error if I do not include kenlm_model_path (and pass only the vocabulary) as below:
decoder = build_ctcdecoder(asr_model.decoder.vocabulary)

Please suggest. Thanks

titu1994 (Collaborator) commented Jan 9, 2023

Dunno about that. Maybe you're passing in a binary file, not ARPA, or maybe they don't support KenLM + subword models. I'd ask on their GitHub.
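
(One thing worth trying, as a sketch: 'lm.arpa' is a placeholder for a plain-text word-level ARPA file; a byte like 0x80 early in the file usually means a binary model is being read as text.)

from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(
    list(asr_model.decoder.vocabulary),
    kenlm_model_path='lm.arpa',  # try the .arpa text format instead of a kenlm binary
)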

manjuke (Author) commented Jan 9, 2023

Sure, thanks a lot for your response.

github-actions bot (Contributor) commented Feb 9, 2023

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label Feb 9, 2023
github-actions bot (Contributor) commented Feb 16, 2023

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned Feb 16, 2023