Closed
Labels
bug / fix, good first issue, topic: Text, v1.1.x
Description
🐛 Bug
Hi,
when using rouge_score with accumulate="best", the results depend on the order of the references. To my understanding, accumulate="best" should return the best F-score over all references.
Minimal example:
from torchmetrics.functional.text import rouge_score
preds = "a b c"
references = ["a b c", "c b a"]
references_rev = ["c b a", "a b c"]
print(rouge_score(preds, references, accumulate='best'))
print(rouge_score(preds, references_rev, accumulate='best'))
gives different results:
{'rouge1_fmeasure': tensor(1.), 'rouge1_precision': tensor(1.), 'rouge1_recall': tensor(1.), 'rouge2_fmeasure': tensor(1.), 'rouge2_precision': tensor(1.), 'rouge2_recall': tensor(1.), 'rougeL_fmeasure': tensor(1.), 'rougeL_precision': tensor(1.), 'rougeL_recall': tensor(1.), 'rougeLsum_fmeasure': tensor(1.), 'rougeLsum_precision': tensor(1.), 'rougeLsum_recall': tensor(1.)}
{'rouge1_fmeasure': tensor(1.), 'rouge1_precision': tensor(1.), 'rouge1_recall': tensor(1.), 'rouge2_fmeasure': tensor(0.), 'rouge2_precision': tensor(0.), 'rouge2_recall': tensor(0.), 'rougeL_fmeasure': tensor(0.3333), 'rougeL_precision': tensor(0.3333), 'rougeL_recall': tensor(0.3333), 'rougeLsum_fmeasure': tensor(0.3333), 'rougeLsum_precision': tensor(0.3333), 'rougeLsum_recall': tensor(0.3333)}
In the reversed case the returned numbers match scoring the prediction against "c b a" alone, so it looks as if only the first reference is considered. Did I misread the documentation, or is this a bug? accumulate='avg' works as expected.
Maybe the bug is at https://github.com/Lightning-AI/torchmetrics/blob/v1.1.0/src/torchmetrics/functional/text/rouge.py#L378, where there is a TODO comment.
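For what it's worth, the reduction I would expect accumulate='best' to perform (and what rouge-score's score_multi seems to do) is a per-key maximum over the per-reference scores, which is order-independent by construction. A minimal sketch with plain dicts; the helper name is mine, not the library's:
# Hypothetical helper sketching the expected "best" reduction for one ROUGE key:
# keep the scores of the reference with the highest F-measure.
def best_over_references(per_ref_scores):
    # max() does not depend on the list order (up to exact ties)
    return max(per_ref_scores, key=lambda s: s["fmeasure"])

scores = [
    {"fmeasure": 1.0, "precision": 1.0, "recall": 1.0},  # pred vs "a b c"
    {"fmeasure": 0.0, "precision": 0.0, "recall": 0.0},  # pred vs "c b a" (rouge2)
]
assert best_over_references(scores) == best_over_references(scores[::-1])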
I compared the results to the rouge-score package:
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=False)
preds = "a b c"
references = ["a b c", "c b a"]
references_rev = ["c b a", "a b c"]
print(scorer.score_multi(references, preds))
print(scorer.score_multi(references_rev, preds))
which gives the same results in both cases:
{'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rouge2': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}
{'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rouge2': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}
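As a workaround until this is fixed, the same behaviour can be approximated on top of torchmetrics by scoring each reference separately and then taking the per-key best by F-measure. This is only a sketch; best_per_key is my own helper, not part of the library:
from torchmetrics.functional.text import rouge_score

def best_per_key(pred, refs, rouge_keys=("rouge1", "rouge2", "rougeL", "rougeLsum")):
    # score each reference on its own, then keep, per ROUGE key, the values of
    # the reference with the highest F-measure for that key
    per_ref = [rouge_score(pred, ref, rouge_keys=rouge_keys) for ref in refs]
    out = {}
    for key in rouge_keys:
        best = max(per_ref, key=lambda d: d[f"{key}_fmeasure"])
        for part in ("fmeasure", "precision", "recall"):
            out[f"{key}_{part}"] = best[f"{key}_{part}"]
    return out

print(best_per_key("a b c", ["a b c", "c b a"]))
print(best_per_key("a b c", ["c b a", "a b c"]))  # same output regardless of order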
Environment
- TorchMetrics Version: 1.1.2
- Python Version: 3.10.12
- PyTorch Version: 2.0.1
Metadata
Assignees
stancld