
rouge_score with accumulate='best' gives mixed results #2148

@volksen

🐛 Bug

Hi,

when using rouge_score with accumulate="best", the results depend on the order of the references. To my understanding, accumulate="best" should return the best F-score over all references.

Minimal example:

from torchmetrics.functional.text import rouge_score

preds = "a b c"
references = ["a b c", "c b a"]
references_rev = ["c b a", "a b c"]

print(rouge_score(preds, references, accumulate='best'))
print(rouge_score(preds, references_rev, accumulate='best'))

gives different results. In the second case the rouge2 and rougeL scores correspond to the non-matching reference "c b a", even though the exact match "a b c" is in the list:

{'rouge1_fmeasure': tensor(1.), 'rouge1_precision': tensor(1.), 'rouge1_recall': tensor(1.), 'rouge2_fmeasure': tensor(1.), 'rouge2_precision': tensor(1.), 'rouge2_recall': tensor(1.), 'rougeL_fmeasure': tensor(1.), 'rougeL_precision': tensor(1.), 'rougeL_recall': tensor(1.), 'rougeLsum_fmeasure': tensor(1.), 'rougeLsum_precision': tensor(1.), 'rougeLsum_recall': tensor(1.)}
{'rouge1_fmeasure': tensor(1.), 'rouge1_precision': tensor(1.), 'rouge1_recall': tensor(1.), 'rouge2_fmeasure': tensor(0.), 'rouge2_precision': tensor(0.), 'rouge2_recall': tensor(0.), 'rougeL_fmeasure': tensor(0.3333), 'rougeL_precision': tensor(0.3333), 'rougeL_recall': tensor(0.3333), 'rougeLsum_fmeasure': tensor(0.3333), 'rougeLsum_precision': tensor(0.3333), 'rougeLsum_recall': tensor(0.3333)}

Did I misread the documentation, or is this a bug? accumulate='avg' works as expected.
Maybe the bug is in https://github.com/Lightning-AI/torchmetrics/blob/v1.1.0/src/torchmetrics/functional/text/rouge.py#L378, where there is a TODO comment.
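
For what it's worth, this is the behavior I would expect from accumulate='best' (a minimal sketch; best_per_key and the shape of the per-reference score dicts are my own names, not TorchMetrics internals): per ROUGE key, keep the reference with the highest f-measure, which is order-independent by construction.

def best_per_key(per_reference_scores, keys=("rouge1", "rouge2", "rougeL", "rougeLsum")):
    # per_reference_scores: one dict per reference, mapping e.g.
    # "rouge1_fmeasure" -> float. For each ROUGE key, keep the full
    # precision/recall/fmeasure triple of the reference with the highest
    # f-measure; taking a max cannot depend on the input order
    # (up to exact ties in the f-measure).
    best = {}
    for key in keys:
        winner = max(per_reference_scores, key=lambda s: s[f"{key}_fmeasure"])
        for part in ("precision", "recall", "fmeasure"):
            best[f"{key}_{part}"] = winner[f"{key}_{part}"]
    return best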

I compared the results to the rouge-score package:

from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=False)
preds = "a b c"
references = ["a b c", "c b a"]
references_rev = ["c b a", "a b c"]
print(scorer.score_multi(references, preds))
print(scorer.score_multi(references_rev, preds))

which gives the same results in both cases:

{'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rouge2': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}
{'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rouge2': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}
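
Until this is fixed, a possible workaround is to call rouge_score once per reference (with a single reference, accumulate does not matter) and reduce manually with the same per-key f-measure max. A sketch, not an official API:

from torchmetrics.functional.text import rouge_score

preds = "a b c"
references = ["c b a", "a b c"]

# Score against each reference separately, then keep, per ROUGE key,
# the scores of the reference with the highest f-measure.
per_ref = [rouge_score(preds, ref) for ref in references]
best = {}
for key in ("rouge1", "rouge2", "rougeL", "rougeLsum"):
    winner = max(per_ref, key=lambda s: s[f"{key}_fmeasure"])
    for part in ("precision", "recall", "fmeasure"):
        best[f"{key}_{part}"] = winner[f"{key}_{part}"]
print(best)  # identical for both reference orderings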

Environment

  • TorchMetrics version: 1.1.2
  • Python version: 3.10.12
  • PyTorch version: 2.0.1
