
Mismatch between the dimensions of the log probabilities from the NeMo ASR transcribe function and the vocab size #5738

Closed
manjuke opened this issue Jan 5, 2023 · 17 comments
Labels: bug, stale

Comments

manjuke commented Jan 5, 2023

Hi All,
I am working on pyctcdecode integration with NeMo ASR models. However, there is a mismatch between the dimensions of the log-probabilities matrix output by the NeMo ASR transcribe function and the vocabulary size. Because of this I am unable to proceed further. Any suggestions will be very helpful.

asr_model1 = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_conformer_ctc_medium")

print(len(asr_model1.decoder.vocabulary))  # outputs 1024

logits = asr_model1.transcribe(["/data/manju/tamil/data/segWav/tvs_test_9976337304-in-Speaker_1-11.wav"], logprobs=True)[0]

print(logits.shape)  # outputs (36, 1025); I expected 1024 in place of 1025

Please suggest, @titu1994 @jbalam-nv. Thanks

manjuke added the bug label Jan 5, 2023
titu1994 (Collaborator) commented Jan 5, 2023

In NeMo, getting the length of the tokenizer gives only the length of the vocab. CTC and RNNT models need at least one additional token (the last token), which corresponds to the "blank" token used by the CTC/RNNT loss.

So 1025 is correct: index 1024 (0-based indexing) is the index of the blank token.
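
A minimal sketch of that invariant, reusing the snippet above ("audio.wav" is a placeholder path):

vocab = asr_model1.decoder.vocabulary
blank_id = len(vocab)  # blank is appended as the final index, here 1024
logits = asr_model1.transcribe(["audio.wav"], logprobs=True)[0]
assert logits.shape[-1] == len(vocab) + 1  # (time, vocab + blank)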

manjuke (Author) commented Jan 6, 2023

Thanks @titu1994. In that case, can you point me (if you are aware) to any resources on using pyctcdecode with a NeMo (Conformer-CTC) model? I have tried following the tutorials available on GitHub etc., but when I try those I get an error saying "the logits dimension should be same as vocab dimension", which stops me there. Any info on solving this would be of great help.

I am completely aware that NeMo has its own CTC decoder support (eval_beamsearch_ngram.py); in fact, I am using that too. But I just wanted to explore pyctcdecode.

Additionally, I just wanted to know: does the NeMo CTC decoder support hot-word boosting and beam pruning? Thanks a lot.
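
(For comparison, pyctcdecode exposes both at decode time; a sketch assuming a decoder already built from the model vocabulary as in the later snippets, with placeholder hot words:)

text = decoder.decode(
    logits,
    beam_width=64,           # beam pruning: keep at most 64 hypotheses
    hotwords=["conformer"],  # placeholder hot words to boost
    hotword_weight=10.0,
)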

titu1994 (Collaborator) commented Jan 6, 2023

Hmm, I think pyctcdecode has inbuilt NeMo support in its library examples. Could you take a look? It should be as simple as passing the vocab and the logprobs.

manjuke (Author) commented Jan 6, 2023

Yes, thanks @titu1994. That's precisely what I did. I followed https://pypi.org/project/pyctcdecode/ and a few others.

It throws this error when I run on the fine-tuned model: "ValueError: Input logits shape is (36, 513), but vocabulary is size 512. Need logits of shape: (time, vocabulary)".

However, it works very well (no errors) for pre-trained models like "stt_en_conformer_ctc_small":
import nemo.collections.asr as nemo_asr
from pyctcdecode import build_ctcdecoder

myFile = ['sample-in-Speaker_1-11.wav']
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name='stt_en_conformer_ctc_small')
logits = asr_model.transcribe(myFile, logprobs=True)[0]
print((logits.shape, len(asr_model.decoder.vocabulary)))
decoder = build_ctcdecoder(asr_model.decoder.vocabulary)
print(decoder.decode(logits))

But if I try to use my finetuned model as below, it throws the above error:
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(restore_path="<path of finetune model>")

Thanks

titu1994 (Collaborator) commented Jan 6, 2023

Hmm, I dunno why it would work for a pretrained model but not a finetuned one. What was the finetuned model trained on? And what is its vocab?

manjuke (Author) commented Jan 6, 2023

Thanks @titu1994.
The finetuned model was initialized from stt_en_conformer_ctc_medium and finetuned on Tamil data.
Its vocab was 512-dimensional, containing both English and Tamil (Unicode) sub-word tokens.
Thanks

manjuke (Author) commented Jan 6, 2023

And @titu1994, one difference I could see between the two vocabularies was the '<pad>' token: it was present in the finetuned model, while '<pad>' was not in the pre-trained models. Is it causing any issue? Thanks.

manjuke (Author) commented Jan 6, 2023

Is there any use of having a <pad> token (through the --spe_pad option) during fine-tuning? Is it required to have this token? What does it do? Thanks

titu1994 (Collaborator) commented Jan 6, 2023

No, you don't need spe_pad or any of the SPE special tokens. However, it should also not affect pyctcdecode's decoding.

manjuke (Author) commented Jan 6, 2023

Hi @titu1994,
I confirmed that the issue is with <pad> tokens in the vocabulary. I took the conformer-en-ctc-medium model, finetuned it on the same Tamil dataset without <pad> tokens, and tried decoding with pyctcdecode. It worked without any issues and produced the expected output. So please check whether this is a bug, or suggest a workaround to exclude <pad> symbols (for decoding with pyctcdecode) from models that are trained with <pad> symbols. Thanks.
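
A quick check that surfaces the mismatch (a hypothetical diagnostic, reusing names from the earlier snippets):

vocab = asr_model.decoder.vocabulary
print(len(vocab), logits.shape[-1])  # fine-tuned model with <pad>: 512 vs 513
print('<pad>' in vocab)              # True for the model that fails in pyctcdecode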

titu1994 (Collaborator) commented Jan 6, 2023

There's no workaround as such. It will probably need some fixes in the decoding framework in pyctcdecode.

manjuke (Author) commented Jan 7, 2023

Sure, thanks. Raised a bug with pyctcdecode at kensho-technologies/pyctcdecode#93.

manjuke (Author) commented Jan 9, 2023

Hi @titu1994,
Any idea about this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 62: invalid start byte

I am getting this error if I pass the "kenlm_model_path" argument to pyctcdecode's build_ctcdecoder() as below:
decoder = build_ctcdecoder(asr_model.decoder.vocabulary, kenlm_model_path=kenlm_model_file)
This error comes for both pretrained and fine-tuned models (without the <pad> token).

There is absolutely no error if I do not include kenlm_model_path (and pass only the vocabulary) as below:
decoder = build_ctcdecoder(asr_model.decoder.vocabulary)

Please suggest. Thanks

titu1994 (Collaborator) commented Jan 9, 2023

Dunno about that. Maybe you're passing in a binary file, not ARPA, or maybe they don't support KenLM + subword models. I'd ask on their GitHub.
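
(One thing worth trying, as a sketch: 'lm.arpa' is a placeholder for a plain-text word-level ARPA file; a byte like 0x80 early in the file usually means a binary model is being read as text.)

from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(
    list(asr_model.decoder.vocabulary),
    kenlm_model_path='lm.arpa',  # try the .arpa text format instead of a kenlm binary
)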

manjuke (Author) commented Jan 9, 2023

Sure, thanks a lot for your response.

github-actions bot (Contributor) commented Feb 9, 2023

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label Feb 9, 2023
github-actions bot (Contributor) commented Feb 16, 2023

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned Feb 16, 2023