I have some questions about RNNT loss. #3750
Comments
You can find my implementation at |
@csukuangfj Thank you. I phrased my question in a misleading way. What I'm curious about is why target_length + 1 needs to be passed as the third input to the RNNT loss. Looking at your code, I noticed that you wrote target length + 1 because it includes a blank label. Isn't the blank already included in n_class? (When setting n_class, I think it should be len(vocab) + 1, similar to CTC loss.) I don't quite understand |
You need to differentiate between The transcript of an utterance is converted to tokens. The target length is the number of tokens of the transcript. It is not |
So the number of classes should be len(vocab)? |
Great to hear it resolves your issue. |
@csukuangfj |
Hello,
I would like to ask a question that may be somewhat trivial.
The shape of the logits for the RNN-T loss is (batch, max_seq_len, max_target_len + 1, class).
Why is it max_target_len + 1 here?
Shouldn't the + 1 go on the number of classes instead (total vocab size plus one), since blank is included?
I don't understand this at all.
Is there anyone who can help?
https://pytorch.org/audio/main/generated/torchaudio.functional.rnnt_loss.html
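For anyone with the same question, a minimal pure-Python sketch of the RNN-T forward recursion may help show where the two "+ 1"s come from. This is not torchaudio's implementation, only an illustration under stated assumptions: the vocabulary, the blank id of 0, and the function name rnnt_neg_log_likelihood are all hypothetical. The target axis has U + 1 positions because the model can be in any of U + 1 states (0, 1, ..., U tokens emitted so far), while the class axis has len(vocab) + 1 entries because blank is one of the output classes. The two are independent.

```python
import math

NEG_INF = -math.inf


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_neg_log_likelihood(log_probs, targets, blank=0):
    """Forward algorithm for one utterance.

    log_probs[t][u][k] has shape T x (U + 1) x V, where U = len(targets)
    and V is the number of classes including blank.
    """
    T, U = len(log_probs), len(targets)
    # Position u means "u target tokens emitted so far", so u runs 0..U.
    assert all(len(frame) == U + 1 for frame in log_probs)

    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            stay = NEG_INF  # emit blank: advance one frame, same u
            if t > 0:
                stay = alpha[t - 1][u] + log_probs[t - 1][u][blank]
            emit = NEG_INF  # emit target token u - 1: same frame, advance u
            if u > 0:
                emit = alpha[t][u - 1] + log_probs[t][u - 1][targets[u - 1]]
            alpha[t][u] = log_add(stay, emit)
    # After all U tokens are emitted (row u = U), a final blank ends the path.
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])


vocab = ["a", "b", "c"]          # hypothetical vocabulary, blank not included
num_classes = len(vocab) + 1     # class axis: blank (id 0) plus the vocab
targets = [1, 2]                 # token ids for "a", "b"; length U = 2
T, U = 3, len(targets)

# Uniform log-probabilities with shape T x (U + 1) x num_classes.
uniform = math.log(1.0 / num_classes)
log_probs = [[[uniform] * num_classes for _ in range(U + 1)] for _ in range(T)]

loss = rnnt_neg_log_likelihood(log_probs, targets)
```

Note that the target length here is 2, not 3: the extra position on the third axis comes from the lattice, not from counting blank as a target token.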