[Metrics] WER #52

Closed
justusschock opened this issue May 19, 2020 · 8 comments · Fixed by #383
Labels
enhancement New feature or request help wanted Extra attention is needed topic: Text
Comments

@justusschock
Member

Add WER cc @oplatek

Lightning-AI/pytorch-lightning#973 (comment)

@stas6626

stas6626 commented Oct 7, 2020

I can work on this issue, but I need help defining the interface. This metric originally operates on strings; is it okay if the metric takes a list of strings as input? It is possible to run it on tokens, but it would only work correctly with a character-based tokenizer (nowadays most people use BPE, etc.).

@stas6626

stas6626 commented Oct 7, 2020

@Borda what do you think?

@justusschock
Member Author

@stas6626 we already have BLEU working on strings, so this wouldn't be a problem. Why do you want to have a list of strings here? Shouldn't a predicted string and a ground-truth string be sufficient?

@justusschock
Member Author

@stas6626 How is it going there?

@Borda Borda transferred this issue from Lightning-AI/pytorch-lightning Mar 12, 2021
@Borda Borda added enhancement New feature or request help wanted Extra attention is needed labels Mar 17, 2021
@VinhLoiIT

I believe both CER (Character Error Rate) and WER (Word Error Rate) metrics are pretty commonly used in OCR problems and might be in audio/speech as well.

I implemented these metrics in Lightning before, but against an older version, so I think upgrading the code would not be a problem. However, it might rely on another library such as editdistance, and I don't know whether I am allowed to use an external library when making a contribution.

@janvainer

It could be interesting to use pytorch-edit-distance instead of editdistance. It should support CUDA operations.

@VinhLoiIT

I think that implementation has some redundant features, such as remove_blank and strip, which are not necessary, though the CUDA support is useful.

Moreover, from skimming the CUDA code, the current implementation requires both blank and space characters, so it does not cover the general use of this metric. For example:

  • In OCR, we also need the CER (Character Error Rate) metric, which is basically the same as WER except that it also keeps the space character as a token.
  • In OCR, splitting text into words should be done by a user-defined function that is not coupled with compute_wer.
  • Lastly, in my opinion, users of this metric should not have to know about the blank character, because the blank character is only used with CTC loss, which is not always the case.

I think that, in general, both CER and WER are basically computed using the Levenshtein distance, whose inputs only need to be a list of predicted tokens and the list of corresponding target tokens. Thus, the editdistance package would be just fine.
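
The point that CER and WER differ only in how text is split into tokens can be sketched as follows. This is an illustrative pure-Python version, not the editdistance package or any library's API; `levenshtein`, `error_rate`, `words`, and `chars` are hypothetical names, and the tokenizer is passed in by the caller so that splitting stays separate from the distance computation.

```python
from typing import Callable, List, Sequence


def levenshtein(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = cur
    return prev[-1]


def error_rate(pred: str, target: str,
               tokenize: Callable[[str], List[str]]) -> float:
    """Edit distance over tokens, normalized by the number of target tokens."""
    ref = tokenize(target)
    if not ref:
        raise ValueError("target has no tokens")
    return levenshtein(ref, tokenize(pred)) / len(ref)


def words(text: str) -> List[str]:
    """Word tokens: split on whitespace (assumed default for WER)."""
    return text.split()


def chars(text: str) -> List[str]:
    """Character tokens: every character, including spaces (for CER)."""
    return list(text)


wer_value = error_rate("hallo world", "hello world", tokenize=words)
cer_value = error_rate("hallo world", "hello world", tokenize=chars)
```

Here one wrong character costs one of two words under `words` but only one of eleven characters under `chars`. No blank character appears anywhere; any CTC-specific decoding would have to happen before the strings reach the metric.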

@carmocca
Contributor

I agree with you @VinhLoiIT

That said, I have found the package https://github.com/life4/textdistance to be better supported; see, for example, roy-ht/editdistance#46

If we agree on a package, we can move forward with the implementation.

@stale stale bot added the wontfix label Jun 1, 2021
@Lightning-AI Lightning-AI deleted a comment from stale bot Jun 1, 2021
@stale stale bot removed the wontfix label Jun 1, 2021
@SkafteNicki SkafteNicki added this to To do in Text via automation Jul 7, 2021
This was referenced Jul 15, 2021
@Lightning-AI Lightning-AI deleted a comment from stale bot Jul 16, 2021
@Lightning-AI Lightning-AI deleted a comment from stale bot Jul 16, 2021
@Borda Borda added this to the v0.5 milestone Jul 16, 2021
Text automation moved this from To do to Done Jul 24, 2021
6 participants