
WIP: Additional evaluators #184

Merged · 21 commits · May 6, 2024

Conversation

@Adamits (Collaborator) commented May 1, 2024

Addresses #183

I was working on evaluating with character error rate, and went down a small rabbit hole of how to support multiple eval metrics. I am very open to different proposals, especially given that I was not thinking about #175 when writing this (we may need to resolve the semantics of this, at the very least). Basically, the idea is that every eval metric should have some definition of a per-sample score (for accuracy this is binary), and we simply collect a list of these across samples, and again across batches, before reporting a micro-average.

There is some controller code, which may be a bit clunky right now, that reads the requested metrics from the command line and then initializes the corresponding evaluators on the model. Then, during validation, we just loop over the requested evaluators and aggregate their metrics.
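
To make that concrete, here is a rough sketch of the idea (class, attribute, and function names are illustrative only, not the actual yoyodyne API):

import numpy as np


class EvalItem:
    """Per-sample scores for one metric (illustrative, not the real class)."""

    def __init__(self, per_sample_scores):
        self.per_sample_scores = list(per_sample_scores)

    @property
    def metric(self) -> float:
        # Micro-average over every sample collected so far.
        return float(np.mean(self.per_sample_scores))


# During validation (pseudocode): loop over the requested evaluators,
# collect per-sample scores batch by batch, and report the micro-average:
#
# for name, evaluator in requested_evaluators.items():
#     results[name].per_sample_scores.extend(
#         evaluator.score_batch(predictions, targets)
#     )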

CER

In doing this, I also implemented a CEREvaluator. This is currently a bit rough---I just used torchmetrics for the actual metric, which I believe is implemented in pure Python. Additionally, I did not bother decoding gold/predicted tensors back into strings and instead convert the list of integers into a string in which each integer is whitespace-separated from the next. I think this works fine conceptually, but of course we cannot ensure then that this is really character error rate---it is essentially whatever-you-define-one-symbol-to-be error rate, which in my case happens to be character (or phoneme).
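
Concretely, the conversion described above amounts to something like this (a sketch; the ID values are made up):

from torchmetrics.text import CharErrorRate

cer_calc = CharErrorRate()


def ids_to_string(ids):
    # Each integer symbol ID becomes a whitespace-separated token,
    # e.g. [12, 7, 31] -> "12 7 31".
    return " ".join(str(i) for i in ids)


gold_ids = [12, 7, 31]    # hypothetical gold symbol IDs for one sample
predicted_ids = [12, 31]  # hypothetical predicted symbol IDs

score = cer_calc(ids_to_string(predicted_ids), ids_to_string(gold_ids))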

@kylebgorman I was hoping you might have strong opinions on how we should implement CER (we probably need to convert tensors to strings at each eval call, but if there are tricks such that we do not have to, that would be great).

Please also feel free to comment on the general framework here. Even critique with no alternative proposal would be very helpful: I basically just followed how we have been doing eval, but I am happy to rethink the system at a higher level.

@Adamits requested a review from kylebgorman May 1, 2024 14:52
@kylebgorman (Contributor) left a comment

Thanks for this.

I think what's done here makes sense.

I think for CER (and accuracy too) you should, if at all possible, work with the tensors, or with lists of integer labels if the data becomes ragged after you take care of the EOS stuff, instead of strings. This is less work and ought to save you some plumbing. So I would say the joining ought not to be necessary---you just need to nest the list some more. Is there a technical reason why that's not correct?

My question is, yes: how does this interact with #175? I think you're just adding evaluation metrics to the logging, essentially. Then you should be able to specify:

  1. as many --eval_metrics as are available and you want (default: just accuracy, other option is CER and you can have both)
  2. one --checkpoint_metric (default: accuracy, but why not add CER to that too in addition to loss?)
  3. one --reduceonplateau_metric (default: loss, other option accuracy)
  4. one --patience_metric (default: loss, other option accuracy)

Any issues there? I put some notes below on the implementation.

yoyodyne/evaluators.py (outdated review thread, resolved)
)

-    def __radd__(self, start_val: int) -> EvalItem:
+    def __radd__(self, start_int: int) -> EvalItem:
@kylebgorman (Contributor)

My understanding of modern Python's overloading is hazy on a good day, so can you confirm this is necessary and won't work without this overload? What happens if you do return NotImplemented instead?

@Adamits (Collaborator, author) commented May 1, 2024

Actually, I will have to think harder about how this works, but IIRC, basically, if you want to call sum([object1, object2]), Python first calls __radd__ on object1 to compute the leftmost summand, so we need to return something.

@Adamits (Collaborator, author)

Ok, what is actually happening here is that when we call sum(list), there is a second argument for the "start" term in the summation, which defaults to 0: sum(list, start=0). __radd__ is called because int + EvalItem is not defined.

So we could instead use sum like sum((v[metric_name] for v in validation_step_outputs), EvalItem(np.array([])))---though of course this needs to allocate an empty array, which is kind of a waste.

This is hacky though and makes me think we should reduce the list of per-batch eval metrics with a different function. Maybe I can just unroll the numpy vectors of metrics in the per_sample_metrics and call mean there... Though I did like combining them in a way that returns a new EvalItem.
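
A toy version of what is going on (EvalItem here is stripped down to the relevant bit, not the real class):

class EvalItem:
    def __init__(self, per_sample_scores):
        self.per_sample_scores = per_sample_scores

    def __add__(self, other):
        return EvalItem(self.per_sample_scores + other.per_sample_scores)

    def __radd__(self, start):
        # sum() computes 0 + first_item; int.__add__ returns NotImplemented
        # for an EvalItem, so Python falls back to first_item.__radd__(0).
        # Returning self lets sum() fold the remaining items via __add__.
        return self


total = sum([EvalItem([1.0, 0.0]), EvalItem([1.0])])
# total.per_sample_scores == [1.0, 0.0, 1.0]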

@kylebgorman (Contributor)

I think the empty-array solution is what I'd reach for, maybe?

@Adamits (Collaborator, author)

Ok. Before I do that---do you think there is a more intuitive Python function than sum/add for the operation we define here? Maybe extend, or perhaps something from itertools would be clearer, e.g. https://docs.python.org/3/library/itertools.html#itertools.chain.from_iterable?

@Adamits (Collaborator, author)

Another note here:

Allocating a Python list is much faster than allocating a numpy array, but numpy.mean is much faster than statistics.mean. I think lists win this tradeoff: we will call mean only once per eval epoch, whereas we will allocate lists/arrays a lot.
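
In other words, something like this (a sketch; the per-batch score lists are stand-ins):

import numpy as np

per_sample_scores = []  # plain list: cheap to build and extend per batch
for batch_scores in ([1.0, 0.0, 1.0], [1.0, 1.0]):  # stand-in for per-batch outputs
    per_sample_scores.extend(batch_scores)

# numpy.mean is called only once per eval epoch, which is where its speed matters.
epoch_metric = float(np.mean(per_sample_scores))  # 0.8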

    pad_idx: int,
) -> EvalItem:
    cer_calc = CharErrorRate()
    cers = []
@kylebgorman (Contributor)

I think I'd prefer this to work with a list of strings rather than joining. It is actually possible to have a space as a symbol (say, if I pass --target_sep '\t'), and that ought to break this.

@Adamits (Collaborator, author)

Good point. I think I did it this way because it's what torchmetrics.CharErrorRate seemed to expect. I can check their docs and try to avoid this.

@kylebgorman (Contributor)

Okay, actually I realize there is a semantic ambiguity here. I think we could have the following, in theory:

  • "symbol error rate" or SER which is done at the tensor level
  • actually join the targets using --target_sep (which has to be passed in for correctness) and then computing true CER

I suspect when there's a difference most people want the former.
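
A small example of the two diverging (illustrative; the symbols here are multi-character graphemes):

gold = ["th", "e"]  # joins to "the"
pred = ["t", "e"]   # joins to "te"

# Symbol error rate: 1 substitution ("th" -> "t") over 2 gold symbols = 0.5.
# True character error rate on the joined strings ("te" vs. "the"):
# 1 deletion ("h") over 3 gold characters, roughly 0.33.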

@Adamits (Collaborator, author) commented May 1, 2024

In the second bullet, the joining entails decoding into strings right? In that case I agree. We could have an option for both, though the naming could get confusing... E.g. I might ask for cer because I know what it is, not realizing that ser will be the same, but computed faster.

It seems very likely that if you're requesting CER (and SER won't do) you want to join on the empty string. Maybe we could just assume that and document it.

@Adamits (Collaborator, author)

Did you mean to edit my comment haha?

@kylebgorman (Contributor)

Oh god, no, sorry---I meant to quote-reply to it with that.

@Adamits (Collaborator, author) commented May 2, 2024

Just gonna repost the original down here:

Right! This is what I meant with:

it is essentially whatever-you-define-one-symbol-to-be error rate, which in my case happens to be character (or phoneme).

In the second bullet, the joining entails decoding into strings right? In that case I agree. We could have an option for both, though the naming could get confusing... E.g. I might ask for cer because I know what it is, not realizing that ser will be the same, but computed faster.

I do like the idea of getting metrics faster by leaving them as tensors, but I think there could be many cases where we might want to evaluate against strings, and for a developer, adding a new evaluator is probably easier if you can get strings rather than figuring out the right tensor logic. Maybe this is why, e.g., Hugging Face seems to decode into strings by default.

Another example eval that I need to implement is transliteration, where typically one word has several valid targets. Eval can then compute the number of top-k hypotheses that are in the list of targets, or do some fuzzy matching, or ranking, etc. This could all be fine in that a particular evaluator could just decode into strings if need be, but then we potentially have to decode the same preds into strings many times, and maybe the same golds as well!
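
For the transliteration case, the per-sample score might look something like this (purely illustrative; no such evaluator exists in this PR):

def topk_match_rate(hypotheses, valid_targets):
    """Fraction of the top-k decoded strings found among the valid targets.

    hypotheses: list of decoded strings for one source word.
    valid_targets: set of acceptable transliterations for that word.
    """
    return sum(hyp in valid_targets for hyp in hypotheses) / len(hypotheses)


# topk_match_rate(["akira", "akiira"], {"akira"}) == 0.5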

yoyodyne/evaluators.py (five outdated review threads, resolved)
@Adamits (Collaborator, author) commented May 1, 2024

Thanks for the comments.

I was writing you an email about related things, but might move that to here as it relates to some of your comments. Probably won't get back to addressing this until tomorrow though.

@Adamits (Collaborator, author) commented May 1, 2024

  1. as many --eval_metrics as are available and you want (default: just accuracy, other option is CER and you can have both)
  2. one --checkpoint_metric (default: accuracy, but why not add CER to that too in addition to loss?)
  3. one --reduceonplateau_metric (default: loss, other option accuracy)
  4. one --patience_metric (default: loss, other option accuracy)

Any issues there? I put some notes below on the implementation.

You're right, it should work fine!

@Adamits (Collaborator, author) commented May 2, 2024

I updated how we compute CER a bit and accidentally left it in an unintuitively named commit.

I just import the torchmetrics edit distance computation now (we could replace it with something faster) and then define a _compute_cer that calls it once. This avoids the unnecessary overhead of converting to strings and always expects a single comparison at a time.
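
Roughly like this (a sketch only; the real _compute_cer may handle padding/EOS differently):

from torchmetrics.functional.text.helper import _edit_distance


def _compute_cer(predicted_ids, gold_ids) -> float:
    # A single comparison: edit distance over integer symbol IDs,
    # normalized by the gold length (so strictly a symbol error rate).
    return _edit_distance(predicted_ids, gold_ids) / len(gold_ids)


# _compute_cer([12, 31], [12, 7, 31]) == 1 / 3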

@kylebgorman (Contributor)
I think numpy’s mean is fine. Anything to reduce complexity in EvalItem…

@kylebgorman (Contributor) left a comment

This looks pretty good to me (some tiny nits/suggestions), but maybe you want to play with the __add__ stuff more---up to you? (And if you want to move to numpy.mean, that's fine with me.)

yoyodyne/evaluators.py (four review threads, resolved)
@Adamits (Collaborator, author) commented May 3, 2024

Ok I think I have addressed the changes.

Later I can work on another PR to wire up an option for decoding symbols at validation. This will support e.g. actual CER if you have >character-sized symbols. I think it will also be nice from a developer standpoint: if I want to use yoyodyne for my task and need to implement a new eval, I can just ask for characters and implement it in a straightforward way rather than needing to take the time to figure out how to make pytorch tensors do what I want.

@kylebgorman (Contributor)
Ok I think I have addressed the changes.

Later I can work on another PR to wire up an option for decoding symbols at validation. This will support e.g. actual CER if you have >character-sized symbols. I think it will also be nice from a developer standpoint: if I want to use yoyodyne for my task and need to implement a new eval, I can just ask for characters and implement it in a straightforward way rather than needing to take the time to figure out how to make pytorch tensors do what I want.

My suggestion there is to just "".join(symbols) since it feels like separators ought not to count for any sensible application of CER?

@kylebgorman (Contributor) left a comment

LGTM. One last note. Let me know if you want me to just merge.

import torch
from torch.nn import functional

# from torchmetrics.text import CharErrorRate
@kylebgorman (Contributor)

Remove?

@Adamits (Collaborator, author)

Not sure why this was still there; I did not have it locally.

import torch
from torch.nn import functional

# from torchmetrics.text import CharErrorRate
from torchmetrics.functional.text.helper import _edit_distance
@kylebgorman (Contributor) commented May 3, 2024

It occurs to me that this is protected and we're using it anyway; we ought to, I don't know, alias it, or just use our own, or use a different third-party one? I have this one lying around, which uses a numpy array instead of Python lists, if you want it.

def _edit_distance(x: Iterable[Any], y: Iterable[Any]) -> int:
    idim = len(x) + 1
    jdim = len(y) + 1
    table = numpy.zeros((idim, jdim), dtype=numpy.uint16)
    table[1:, 0] = 1
    table[0, 1:] = 1
    for i in range(1, idim):
        for j in range(1, jdim):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1]
            else:
                c1 = table[i - 1][j]
                c2 = table[i][j - 1]
                c3 = table[i - 1][j - 1]
                table[i][j] = min(c1, c2, c3) + 1
    return int(table[-1][-1])

but don't hold the PR up for just that.

@Adamits (Collaborator, author)

I can test it really quick?

@kylebgorman (Contributor)

Sure. You can turn it up to uint32 if you expect strings longer than 65k.

@Adamits (Collaborator, author)

Hmm, I am getting a different SER value and I'm not sure why.

@kylebgorman (Contributor)

That's worrisome, if only because I've been using this basic implementation for a very long time and in many places...

@kylebgorman (Contributor) commented May 3, 2024

Their lines 340-343 are different from my:

table[1:, 0] = 1
table[0, 1:] = 1

And I think theirs might be right... that should be increasing, I thought?

@Adamits (Collaborator, author)

Sorry, have some meetings now but will return to this in an hour or two.

@Adamits (Collaborator, author) commented May 3, 2024

Ok as a sanity check, if I do this instead:

table = numpy.zeros((idim, jdim), dtype=numpy.uint16)
# table[1:, 0] = 1
# table[0, 1:] = 1
for i in range(idim):
    table[i][0] = i
for j in range(jdim):
    table[0][j] = j

I reproduce the results. Off the top of my head I think this one is right, and your implementation might be for LCS? I will try to find references...

Also, I assume

table = numpy.zeros((idim, jdim), dtype=numpy.uint16)
table[:, 0] = range(idim)
table[0, :] = range(jdim)

is faster.
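
For reference, the whole function with that corrected (Levenshtein) initialization folded in (a sketch; otherwise identical to the version above):

from typing import Any, Sequence

import numpy


def _edit_distance(x: Sequence[Any], y: Sequence[Any]) -> int:
    idim = len(x) + 1
    jdim = len(y) + 1
    table = numpy.zeros((idim, jdim), dtype=numpy.uint16)
    # Levenshtein initialization: deleting or inserting i symbols costs i.
    table[:, 0] = range(idim)
    table[0, :] = range(jdim)
    for i in range(1, idim):
        for j in range(1, jdim):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1]
            else:
                c1 = table[i - 1][j]
                c2 = table[i][j - 1]
                c3 = table[i - 1][j - 1]
                table[i][j] = min(c1, c2, c3) + 1
    # Cast so callers get a plain Python int rather than a numpy scalar.
    return int(table[-1][-1])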

@kylebgorman (Contributor) commented May 4, 2024

Yes, I think your diagnosis is correct. It also occurred to me that you could use empty rather than zeros that way. I put it into a gist:

https://gist.github.com/kylebgorman/14cc4e238bc6d80784df4511c29b3d55.

@Adamits (Collaborator, author) commented May 3, 2024

Ok I think I have addressed the changes.
Later I can work on another PR to wire up an option for decoding symbols at validation. This will support e.g. actual CER if you have >character-sized symbols. I think it will also be nice from a developer standpoint: if I want to use yoyodyne for my task and need to implement a new eval, I can just ask for characters and implement it in a straightforward way rather than needing to take the time to figure out how to make pytorch tensors do what I want.

My suggestion there is to just "".join(symbols) since it feels like separators ought not to count for any sensible application of CER?

Right, but we still need to get symbols from somewhere (currently we just turn ints into chars---but if an int represents several chars, we don't get CER), and I think it makes sense not to do this separately in each evaluator, but maybe it doesn't matter. Still, at the least it requires giving evaluators access to the index.

@kylebgorman (Contributor) commented May 3, 2024 via email

@kylebgorman (Contributor) left a comment

One quick note...

                c2 = table[i][j - 1]
                c3 = table[i - 1][j - 1]
                table[i][j] = min(c1, c2, c3) + 1
    return table[-1][-1]
@kylebgorman (Contributor)

Suggest you cast this to int before returning because we don't need the uint16 precision (or any other numpy weirdness). I don't know if it's actually important...

@Adamits (Collaborator, author)

Done. (Also updated README)

@kylebgorman (Contributor) left a comment

LGTM.

I made one minor change: I made _edit_distance a static method. Could you do a very quick check to make sure that doesn't break anything before I merge?

@Adamits (Collaborator, author) commented May 6, 2024

LGTM.

I made one minor change: I made _edit_distance a static method. Could you do a very quick check to make sure that doesn't break anything before I merge?

Thanks! Ran my test and it seems to work fine.

@kylebgorman merged commit ba8a516 into CUNY-CL:master May 6, 2024
5 checks passed
@Adamits mentioned this pull request Jun 4, 2024
@Adamits mentioned this pull request Jun 15, 2024