2.5. mt-bench changes by ErlisLushtaku · Pull Request #55 · OpenEuroLLM/JudgeArena

ErlisLushtaku · 2026-05-28T08:58:52Z

Summary

This PR pulls the MT-Bench-specific follow-up work out of the prompt preset PR and keeps it as a focused stacked cleanup.

It now does two things:

- refactors the non-delegated MT-Bench judging path to reuse shared pairwise helpers instead of duplicating batching, prompt grouping, answer swapping, and item construction
- adds MT-Bench-specific reproducibility/orchestration follow-ups that were originally mixed into PR 52, without bringing over the deferred truncation tracking or cache-key changes

Concretely, this PR:

- extracts shared MT-Bench pairwise judging helpers into judgearena/mt_bench/pairwise_judging.py
- keeps FastChat-specific verdict parsing and preset-specific score parsing separate, while reusing the same lower-level MT-Bench mechanics
- replaces loose shared item dicts with a typed MTBenchJudgeItem
- makes the two swap_mode="both" semantics explicit in code:
  - FastChat: conservative agreement
  - preset judging: append inverted swapped scores
- factors shared MT-Bench result finalization in mt_bench_utils.py
- aligns pre-generated or cached MT-Bench completions to the requested question order and raises a clear error if rows are missing
- writes MT-Bench run metadata, including input payload and prompt metadata, alongside saved artifacts
- lets run_mt_bench derive its default result folder and artifact name when callers do not pass them explicitly
- adds focused MT-Bench tests for both the FastChat and preset judging paths, plus coverage for completion alignment and metadata writing
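
For reviewers, the difference between the two swap_mode="both" semantics can be illustrated roughly as below. This is a minimal sketch, not the code in this PR: the function names, the "A"/"B"/"tie" verdict labels, and the assumption that preset scores lie in [0, 1] (so that inversion is `1 - score`) are all hypothetical and chosen only for illustration.

```python
from typing import Literal

Verdict = Literal["A", "B", "tie"]


def fastchat_conservative_verdict(original: Verdict, swapped: Verdict) -> Verdict:
    """Combine the verdicts from both answer orders by conservative agreement.

    `swapped` comes from the run where the two answers traded positions,
    so its "A" refers to the answer that was "B" in the original order.
    """
    # Map the swapped verdict back into the original labelling.
    remapped = {"A": "B", "B": "A", "tie": "tie"}[swapped]
    # Count a win only when both orders agree; otherwise fall back to a tie.
    return original if original == remapped else "tie"


def preset_scores_with_swap(original: list[float], swapped: list[float]) -> list[float]:
    """Append the swapped-order scores after inverting them.

    Assumption: scores lie in [0, 1], so a swapped-order score `s`
    corresponds to `1 - s` in the original orientation.
    """
    return original + [1.0 - s for s in swapped]
```

Under these assumptions, the FastChat path yields a single verdict per item and discards position-dependent wins, while the preset path keeps both measurements and simply doubles the number of scores per item.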

Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.
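
The fail-early alignment and metadata behaviour described above could look something like the following. This is a sketch under stated assumptions rather than the implementation in this PR: `align_completions`, `write_run_metadata`, the dict-based completion store, and the `run_metadata.json` file name are hypothetical names introduced here for illustration.

```python
import json
from pathlib import Path


def align_completions(question_ids: list[str], completions: dict[str, str]) -> list[str]:
    """Return completions in the order of `question_ids`, failing early on gaps."""
    missing = [qid for qid in question_ids if qid not in completions]
    if missing:
        # Report how many rows are missing and show a few of the ids.
        raise ValueError(
            f"Completions are missing {len(missing)} of {len(question_ids)} "
            f"requested questions, e.g. {missing[:5]}"
        )
    return [completions[qid] for qid in question_ids]


def write_run_metadata(result_dir: Path, metadata: dict) -> Path:
    """Write run metadata as JSON next to the saved artifacts."""
    result_dir.mkdir(parents=True, exist_ok=True)
    path = result_dir / "run_metadata.json"  # hypothetical file name
    # Sorted keys keep the file stable across runs, which eases diffing.
    path.write_text(json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8")
    return path
```

Raising before any judging starts means a stale cache or a partial baseline file surfaces as one explicit error instead of silently misaligned rows.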

refactor and add tests

ddde004

ErlisLushtaku changed the base branch from main to pr32-split-v2/02-prompt-presets-only May 28, 2026 08:59

This was referenced May 28, 2026

2. Add prompt presets and mt-bench changes #48

Closed

2. Add prompt presets #54

Open

ErlisLushtaku requested a review from kargibora May 28, 2026 09:04

ErlisLushtaku added 2 commits May 29, 2026 14:21

add mt-bench reproducibility metadata and completion alignment

76809e0

Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.

Remove unnecessary cases added

0d6bc76

This was referenced Jun 1, 2026

3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf #56

Closed

3.5. Add runtime robustness, judge-kwarg defaults, CLI/orchestration cleanup, point to judgearena hf #57

Open

ErlisLushtaku wants to merge 3 commits into pr32-split-v2/02-prompt-presets-only from pr32-split-v2/02.5-mt-bench-preset-judging
