
WIP: Additional evaluators #184

Merged · 21 commits · May 6, 2024

Conversation

@Adamits (Collaborator) commented May 1, 2024

Addresses #183

I was working on evaluating with character error rate, and went down a small rabbit hole of how to support multiple eval metrics. I am very open to different proposals, especially given that I was not thinking about #175 when writing this (we may need to resolve the semantics of this, at the very least). Basically, the idea is that every eval metric should have some definition of a per-sample score (for accuracy this is binary), and we simply collect a list of these across samples, and again across batches, before reporting a micro-average.

There is some controller code, which may be a bit clunky right now, that reads the requested metrics from the command line and then initializes the corresponding evaluators on the model. Then, during validation, we just loop over the requested evaluators and aggregate their metrics.
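
To make that concrete, here is a rough sketch of the idea (class, attribute, and function names are illustrative only, not the actual yoyodyne API):

import numpy as np


class EvalItem:
    """Per-sample scores for one metric (illustrative, not the real class)."""

    def __init__(self, per_sample_scores):
        self.per_sample_scores = list(per_sample_scores)

    @property
    def metric(self) -> float:
        # Micro-average over every sample collected so far.
        return float(np.mean(self.per_sample_scores))


# During validation (pseudocode): loop over the requested evaluators,
# collect per-sample scores batch by batch, and report the micro-average:
#
# for name, evaluator in requested_evaluators.items():
#     results[name].per_sample_scores.extend(
#         evaluator.score_batch(predictions, targets)
#     )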

CER

In doing this, I also implemented a CEREvaluator. This is currently a bit rough---I just used torchmetrics for the actual metric, which I believe is implemented in pure Python. Additionally, I did not bother decoding gold/predicted tensors back into strings and instead convert the list of integers into a string in which each integer is whitespace-separated from the next. I think this works fine conceptually, but of course we cannot ensure then that this is really character error rate---it is essentially whatever-you-define-one-symbol-to-be error rate, which in my case happens to be character (or phoneme).
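
Concretely, the conversion described above amounts to something like this (a sketch; the ID values are made up):

from torchmetrics.text import CharErrorRate

cer_calc = CharErrorRate()


def ids_to_string(ids):
    # Each integer symbol ID becomes a whitespace-separated token,
    # e.g. [12, 7, 31] -> "12 7 31".
    return " ".join(str(i) for i in ids)


gold_ids = [12, 7, 31]    # hypothetical gold symbol IDs for one sample
predicted_ids = [12, 31]  # hypothetical predicted symbol IDs

score = cer_calc(ids_to_string(predicted_ids), ids_to_string(gold_ids))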

@kylebgorman I was hoping you might have strong opinions on how we should implement CER (we probably need to convert tensors to strings at each eval call, but if there are tricks such that we do not have to, that would be great).

Please also feel free to comment on the general framework here. Even critique with no alternative proposal would be very helpful: I basically just followed how we have been doing eval, but I am happy to rethink the system at a higher level.

@Adamits requested a review from kylebgorman May 1, 2024 14:52
@kylebgorman (Contributor) left a comment

Thanks for this.

I think what's done here makes sense.

I think for CER (and accuracy too) you should, if at all possible, work with the tensors, or with lists of integer labels if the data becomes ragged after you take care of the EOS stuff, instead of strings. This is less work and ought to save you some plumbing. So I would say the joining ought not to be necessary---you just need to nest the list some more. Is there a technical reason why that's not correct?

My question is, yes: how does this interact with #175? I think you're just adding evaluation metrics to the logging, essentially. Then you should be able to specify:

  1. as many --eval_metrics as are available and you want (default: just accuracy, other option is CER and you can have both)
  2. one --checkpoint_metric (default: accuracy, but why not add CER to that too in addition to loss?)
  3. one --reduceonplateau_metric (default: loss, other option accuracy)
  4. one --patience_metric (default: loss, other option accuracy)

Any issues there? I put some notes below on the implementation.

yoyodyne/evaluators.py (outdated review thread, resolved)
)

-    def __radd__(self, start_val: int) -> EvalItem:
+    def __radd__(self, start_int: int) -> EvalItem:
@kylebgorman (Contributor)

My understanding of modern Python's overloading is hazy on a good day, so can you confirm this is necessary and won't work without this overload? What happens if you do return NotImplemented instead?

@Adamits (Collaborator, author) commented May 1, 2024

Actually, I will have to think harder about how this works, but IIRC, basically, if you want to call sum([object1, object2]), Python first calls __radd__ on object1 to compute the leftmost summand, so we need to return something.

@Adamits (Collaborator, author)

Ok, what is actually happening here is that when we call sum(list), there is a second argument for the "start" term in the summation, which defaults to 0: sum(list, start=0). __radd__ is called because int + EvalItem is not defined.

So we could instead use sum like sum((v[metric_name] for v in validation_step_outputs), EvalItem(np.array([])))---though of course this needs to allocate an empty array, which is kind of a waste.

This is hacky though and makes me think we should reduce the list of per-batch eval metrics with a different function. Maybe I can just unroll the numpy vectors of metrics in the per_sample_metrics and call mean there... Though I did like combining them in a way that returns a new EvalItem.
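
A toy version of what is going on (EvalItem here is stripped down to the relevant bit, not the real class):

class EvalItem:
    def __init__(self, per_sample_scores):
        self.per_sample_scores = per_sample_scores

    def __add__(self, other):
        return EvalItem(self.per_sample_scores + other.per_sample_scores)

    def __radd__(self, start):
        # sum() computes 0 + first_item; int.__add__ returns NotImplemented
        # for an EvalItem, so Python falls back to first_item.__radd__(0).
        # Returning self lets sum() fold the remaining items via __add__.
        return self


total = sum([EvalItem([1.0, 0.0]), EvalItem([1.0])])
# total.per_sample_scores == [1.0, 0.0, 1.0]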

@kylebgorman (Contributor)

I think the empty-array solution is what I'd reach for, maybe?

@Adamits (Collaborator, author)

Ok. Before I do that---do you think there is a more intuitive Python function than sum/add for the operation we define here? Maybe extend, or perhaps something from itertools would be clearer, e.g. https://docs.python.org/3/library/itertools.html#itertools.chain.from_iterable?

@Adamits (Collaborator, author)

Another note here:

Allocating a Python list is much faster than allocating a numpy array, but numpy.mean is much faster than statistics.mean. I think lists win this tradeoff: we will call mean only once per eval epoch, whereas we will allocate lists/arrays a lot.
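
In other words, something like this (a sketch; the per-batch score lists are stand-ins):

import numpy as np

per_sample_scores = []  # plain list: cheap to build and extend per batch
for batch_scores in ([1.0, 0.0, 1.0], [1.0, 1.0]):  # stand-in for per-batch outputs
    per_sample_scores.extend(batch_scores)

# numpy.mean is called only once per eval epoch, which is where its speed matters.
epoch_metric = float(np.mean(per_sample_scores))  # 0.8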

    pad_idx: int,
) -> EvalItem:
    cer_calc = CharErrorRate()
    cers = []
@kylebgorman (Contributor)

I think I'd prefer this to work with a list of strings rather than joining. It is actually possible to have a space as a symbol (say, if I pass --target_sep '\t'), and that ought to break this.

@Adamits (Collaborator, author)

Good point. I think I did it this way because it's what torchmetrics.CharErrorRate seemed to expect. I can check their docs and try to avoid this.

@kylebgorman (Contributor)

Okay, actually I realize there is a semantic ambiguity here. I think we could have the following, in theory:

  • "symbol error rate" or SER which is done at the tensor level
  • actually join the targets using --target_sep (which has to be passed in for correctness) and then computing true CER

I suspect when there's a difference most people want the former.
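
A small example of the two diverging (illustrative; the symbols here are multi-character graphemes):

gold = ["th", "e"]  # joins to "the"
pred = ["t", "e"]   # joins to "te"

# Symbol error rate: 1 substitution ("th" -> "t") over 2 gold symbols = 0.5.
# True character error rate on the joined strings ("te" vs. "the"):
# 1 deletion ("h") over 3 gold characters, roughly 0.33.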

@Adamits (Collaborator, author) commented May 1, 2024

In the second bullet, the joining entails decoding into strings right? In that case I agree. We could have an option for both, though the naming could get confusing... E.g. I might ask for cer because I know what it is, not realizing that ser will be the same, but computed faster.

It seems very likely that if you're requesting CER (and SER won't do) you want to join on the empty string. Maybe we could just assume that and document it.

@Adamits (Collaborator, author)

Did you mean to edit my comment haha?

@kylebgorman (Contributor)

Oh god, no, sorry---I meant to quote-reply to it with that.

@Adamits (Collaborator, author) commented May 2, 2024

Just gonna repost the original down here:

Right! This is what I meant with:

it is essentially whatever-you-define-one-symbol-to-be error rate, which in my case happens to be character (or phoneme).

In the second bullet, the joining entails decoding into strings right? In that case I agree. We could have an option for both, though the naming could get confusing... E.g. I might ask for cer because I know what it is, not realizing that ser will be the same, but computed faster.

I do like the idea of getting metrics faster by leaving them as tensors, but I think there could be many cases where we might want to evaluate against strings, and for a developer, adding a new evaluator is probably easier if you can get strings rather than figuring out the right tensor logic. Maybe this is why, e.g., Hugging Face seems to decode into strings by default.

Another example eval that I need to implement is transliteration, where typically one word has several valid targets. Eval can then compute the number of top-k hypotheses that are in the list of targets, or do some fuzzy matching, or ranking, etc. This could all be fine in that a particular evaluator could just decode into strings if need be, but then we potentially have to decode the same preds into strings many times, and maybe the same golds as well!
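
For the transliteration case, the per-sample score might look something like this (purely illustrative; no such evaluator exists in this PR):

def topk_match_rate(hypotheses, valid_targets):
    """Fraction of the top-k decoded strings found among the valid targets.

    hypotheses: list of decoded strings for one source word.
    valid_targets: set of acceptable transliterations for that word.
    """
    return sum(hyp in valid_targets for hyp in hypotheses) / len(hypotheses)


# topk_match_rate(["akira", "akiira"], {"akira"}) == 0.5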

yoyodyne/evaluators.py (five outdated review threads, resolved)
@Adamits (Collaborator, author) commented May 1, 2024

Thanks for the comments.

I was writing you an email about related things, but might move that to here as it relates to some of your comments. Probably won't get back to addressing this until tomorrow though.

@Adamits (Collaborator, author) commented May 1, 2024

  1. as many --eval_metrics as are available and you want (default: just accuracy, other option is CER and you can have both)
  2. one --checkpoint_metric (default: accuracy, but why not add CER to that too in addition to loss?)
  3. one --reduceonplateau_metric (default: loss, other option accuracy)
  4. one --patience_metric (default: loss, other option accuracy)

Any issues there? I put some notes below on the implementation.

You're right, it should work fine!

@Adamits (Collaborator, author) commented May 2, 2024

I updated how we compute CER a bit and accidentally left it in an unintuitively named commit.

I just import the torchmetrics edit distance computation now (we could replace it with something faster) and then define a _compute_cer that calls it once. This avoids the unnecessary overhead of converting to strings and always expects a single comparison at a time.
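
Roughly like this (a sketch only; the real _compute_cer may handle padding/EOS differently):

from torchmetrics.functional.text.helper import _edit_distance


def _compute_cer(predicted_ids, gold_ids) -> float:
    # A single comparison: edit distance over integer symbol IDs,
    # normalized by the gold length (so strictly a symbol error rate).
    return _edit_distance(predicted_ids, gold_ids) / len(gold_ids)


# _compute_cer([12, 31], [12, 7, 31]) == 1 / 3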

@kylebgorman (Contributor)
I think numpy’s mean is fine. Anything to reduce complexity in EvalItem…

@kylebgorman (Contributor) left a comment

This looks pretty good to me (some tiny nits/suggestions), but maybe you want to play with the __add__ stuff more---up to you? (And if you want to move to numpy.mean, that's fine with me.)

yoyodyne/evaluators.py (four review threads, resolved)
@Adamits (Collaborator, author) commented May 3, 2024

Ok I think I have addressed the changes.

Later I can work on another PR to wire up an option for decoding symbols at validation. This will support e.g. actual CER if you have >character-sized symbols. I think it will also be nice from a developer standpoint: if I want to use yoyodyne for my task and need to implement a new eval, I can just ask for characters and implement it in a straightforward way rather than needing to take the time to figure out how to make pytorch tensors do what I want.

@kylebgorman (Contributor)
Ok I think I have addressed the changes.

Later I can work on another PR to wire up an option for decoding symbols at validation. This will support e.g. actual CER if you have >character-sized symbols. I think it will also be nice from a developer standpoint: if I want to use yoyodyne for my task and need to implement a new eval, I can just ask for characters and implement it in a straightforward way rather than needing to take the time to figure out how to make pytorch tensors do what I want.

My suggestion there is to just "".join(symbols) since it feels like separators ought not to count for any sensible application of CER?

@kylebgorman (Contributor) left a comment

LGTM. One last note. Let me know if you want me to just merge.

import torch
from torch.nn import functional

# from torchmetrics.text import CharErrorRate
@kylebgorman (Contributor)

Remove?

@Adamits (Collaborator, author)

Not sure why this was still there; I did not have it locally.

import torch
from torch.nn import functional

# from torchmetrics.text import CharErrorRate
from torchmetrics.functional.text.helper import _edit_distance
@kylebgorman (Contributor) commented May 3, 2024

It occurs to me that this is protected and we're using it anyway; we ought to, I don't know, alias it, or just use our own, or use a different third-party one? I have this one lying around, which uses a numpy array instead of Python lists, if you want it.

def _edit_distance(x: Iterable[Any], y: Iterable[Any]) -> int:
    idim = len(x) + 1
    jdim = len(y) + 1
    table = numpy.zeros((idim, jdim), dtype=numpy.uint16)
    table[1:, 0] = 1
    table[0, 1:] = 1
    for i in range(1, idim):
        for j in range(1, jdim):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1]
            else:
                c1 = table[i - 1][j]
                c2 = table[i][j - 1]
                c3 = table[i - 1][j - 1]
                table[i][j] = min(c1, c2, c3) + 1
    return int(table[-1][-1])

but don't hold the PR up for just that.

@Adamits (Collaborator, author)

I can test it really quick?

@kylebgorman (Contributor)

Sure. You can turn it up to uint32 if you expect strings longer than 65k.

@Adamits (Collaborator, author)

Hmm, I am getting a different SER value and I'm not sure why.

@kylebgorman (Contributor)

That's worrisome, if only because I've been using this basic implementation for a very long time and in many places...

@kylebgorman (Contributor) commented May 3, 2024

Their lines 340-343 are different from my:

table[1:, 0] = 1
table[0, 1:] = 1

And I think theirs might be right... that should be increasing, I thought?

@Adamits (Collaborator, author)

Sorry, have some meetings now but will return to this in an hour or two.

@Adamits (Collaborator, author) commented May 3, 2024

Ok as a sanity check, if I do this instead:

table = numpy.zeros((idim, jdim), dtype=numpy.uint16)
# table[1:, 0] = 1
# table[0, 1:] = 1
for i in range(idim):
    table[i][0] = i
for j in range(jdim):
    table[0][j] = j

I reproduce the results. Off the top of my head I think this one is right, and your implementation might be for LCS? I will try to find references...

Also, I assume

table = numpy.zeros((idim, jdim), dtype=numpy.uint16)
table[:, 0] = range(idim)
table[0, :] = range(jdim)

is faster.
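
For reference, the whole function with that corrected (Levenshtein) initialization folded in (a sketch; otherwise identical to the version above):

from typing import Any, Sequence

import numpy


def _edit_distance(x: Sequence[Any], y: Sequence[Any]) -> int:
    idim = len(x) + 1
    jdim = len(y) + 1
    table = numpy.zeros((idim, jdim), dtype=numpy.uint16)
    # Levenshtein initialization: deleting or inserting i symbols costs i.
    table[:, 0] = range(idim)
    table[0, :] = range(jdim)
    for i in range(1, idim):
        for j in range(1, jdim):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1]
            else:
                c1 = table[i - 1][j]
                c2 = table[i][j - 1]
                c3 = table[i - 1][j - 1]
                table[i][j] = min(c1, c2, c3) + 1
    # Cast so callers get a plain Python int rather than a numpy scalar.
    return int(table[-1][-1])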

@kylebgorman (Contributor) commented May 4, 2024

Yes, I think your diagnosis is correct. It also occurred to me that you could use empty rather than zeros that way. I put it into a gist:

https://gist.github.com/kylebgorman/14cc4e238bc6d80784df4511c29b3d55.

@Adamits (Collaborator, author) commented May 3, 2024

Ok I think I have addressed the changes.
Later I can work on another PR to wire up an option for decoding symbols at validation. This will support e.g. actual CER if you have >character-sized symbols. I think it will also be nice from a developer standpoint: if I want to use yoyodyne for my task and need to implement a new eval, I can just ask for characters and implement it in a straightforward way rather than needing to take the time to figure out how to make pytorch tensors do what I want.

My suggestion there is to just "".join(symbols) since it feels like separators ought not to count for any sensible application of CER?

Right, but we still need to get symbols from somewhere (currently we just turn ints into chars---but if an int represents several chars, we don't get CER), and I think it makes sense not to do this separately in each evaluator, but maybe it doesn't matter. Still, at the least it requires giving evaluators access to the index.

@kylebgorman (Contributor) commented May 3, 2024 via email

@kylebgorman (Contributor) left a comment

One quick note...

                c2 = table[i][j - 1]
                c3 = table[i - 1][j - 1]
                table[i][j] = min(c1, c2, c3) + 1
    return table[-1][-1]
@kylebgorman (Contributor)

Suggest you cast this to int before returning because we don't need the uint16 precision (or any other numpy weirdness). I don't know if it's actually important...

@Adamits (Collaborator, author)

Done. (Also updated README)

@kylebgorman (Contributor) left a comment

LGTM.

I made one minor change: I made _edit_distance a static method. Could you do a very quick check to make sure that doesn't break anything before I merge?

@Adamits (Collaborator, author) commented May 6, 2024

LGTM.

I made one minor change: I made _edit_distance a static method. Could you do a very quick check to make sure that doesn't break anything before I merge?

Thanks! Ran my test and it seems to work fine.

@kylebgorman merged commit ba8a516 into CUNY-CL:master May 6, 2024
5 checks passed
@Adamits mentioned this pull request Jun 4, 2024
@Adamits mentioned this pull request Jun 15, 2024