
Your greedy decode implementation is wrong. #1

Closed
Duum opened this issue May 24, 2018 · 9 comments

Comments


Duum commented May 24, 2018

I think your transducer greedy decode implementation is wrong.
Here is my implementation in PyTorch.

HawkAaron (Owner) commented

It's based on the assumption that for each acoustic feature frame there is at most one corresponding label, so we just need to move one step up, then turn right.

I implemented the greedy decode several weeks ago the way you describe, but the PER on TIMIT was worse.

The decoding algorithm is still under development, so any comments are welcome.
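A minimal sketch of the constrained greedy decode described above: at each acoustic frame, emit at most one non-blank label ("one step up, then turn right"), then move to the next frame regardless. The `joint` function here is a hypothetical stand-in for a real RNN-T joint network; the toy version below just returns fixed scores for the demo.

```python
BLANK = 0

def greedy_decode_one_label_per_frame(enc_frames, joint):
    """enc_frames: sequence of encoder outputs.
    joint(frame, history) -> list of per-label scores (stand-in for the joint net)."""
    hyp = []
    for frame in enc_frames:
        scores = joint(frame, hyp)
        k = max(range(len(scores)), key=scores.__getitem__)
        if k != BLANK:       # emit at most one non-blank for this frame,
            hyp.append(k)    # then advance t either way
    return hyp

# toy joint: each "frame" already encodes the label it should score highest
toy_joint = lambda frame, hyp: [1.0 if i == frame else 0.0 for i in range(5)]
print(greedy_decode_one_label_per_frame([2, 0, 3, 0], toy_joint))  # → [2, 3]
```

Note the loop body runs exactly once per frame, so the output can never be longer than the input — which is the constraint Duum is objecting to.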


Duum commented May 24, 2018

But on my dataset, the CER of my greedy decode implementation is 15% lower than yours.

HawkAaron (Owner) commented

Which dataset did you use?

I'll check that again on TIMIT.


Duum commented May 24, 2018

In Alex Graves' paper, along a transducer path the frame advances only when the predicted label is null (blank); when the label is non-null, U increases while T stays and waits.
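A sketch of the decode described here, where t advances only on blank and a non-blank advances u while staying on the same frame. The `joint` callable is again a hypothetical stand-in for the joint network, and the per-frame emission cap is an assumption (not from the paper) to guard against a model that never predicts blank.

```python
BLANK = 0
MAX_SYMBOLS_PER_FRAME = 3  # assumption: safety cap, not from Graves' paper

def greedy_decode_graves(enc_frames, joint):
    """Graves-style greedy decode: stay on frame t until blank is predicted."""
    hyp = []
    for frame in enc_frames:
        emitted = 0
        while emitted < MAX_SYMBOLS_PER_FRAME:
            scores = joint(frame, hyp)
            k = max(range(len(scores)), key=scores.__getitem__)
            if k == BLANK:    # blank consumes the frame: t += 1
                break
            hyp.append(k)     # non-blank: u += 1, t stays
            emitted += 1
    return hyp

def make_toy_joint():
    # toy joint: each "frame" is a list of labels to emit in order, then blank
    state = {"frame": None, "i": 0}
    def joint(frame, hyp):
        if state["frame"] is not frame:
            state["frame"], state["i"] = frame, 0
        label = frame[state["i"]] if state["i"] < len(frame) else BLANK
        state["i"] += 1
        return [1.0 if k == label else 0.0 for k in range(5)]
    return joint

print(greedy_decode_graves([[2, 4], [], [3]], make_toy_joint()))  # → [2, 4, 3]
```

Unlike the one-label-per-frame variant, this decode can emit two labels from a single frame (the `[2, 4]` frame above), which matters exactly in the phone-boundary case Duum raises later in the thread.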


Duum commented May 24, 2018

On my private dataset, the RNN transducer is comparable to CTC only when the second method is used.

HawkAaron (Owner) commented

@Duum
Yes, when a non-null label is predicted, u moves forward one step and t stays. But as I said, at any frame t there will be at most one corresponding label, so if a non-null label is predicted at frame t, the decoder must move to the next frame after the next prediction. This is based on the physical meaning of speech features. According to the original transducer model definition, however, there can be more than one upward transition at any time t, which is meaningless in speech recognition.

I'll check your implementation and reply to you asap.

HawkAaron (Owner) commented

@Duum By the way, what do you mean by "the second method"? Have you ever tried beam search?


Duum commented May 24, 2018

The second method is my implementation of greedy decode; I haven't used beam search so far.
One frame corresponding to one label may be just your assumption. If a frame falls on the boundary between two phones, it is possible for two phones to occur in one frame.

HawkAaron (Owner) commented

@Duum Thanks for your comments, I'll check that.
