2.5. mt-bench changes #55

Open
ErlisLushtaku wants to merge 3 commits into pr32-split-v2/02-prompt-presets-only from pr32-split-v2/02.5-mt-bench-preset-judging

ErlisLushtaku (Collaborator) commented May 28, 2026

Summary

This PR pulls the MT-Bench-specific follow-up work out of the prompt preset PR and keeps it as a focused stacked cleanup.

It now does two things:

  • refactors the non-delegated MT-Bench judging path to reuse shared pairwise helpers instead of duplicating batching, prompt grouping, answer swapping, and item construction
  • adds MT-Bench-specific reproducibility/orchestration follow-ups that were originally mixed into PR 52, without bringing over the deferred truncation tracking or cache-key changes

Concretely, this PR:

  • extracts shared MT-Bench pairwise judging helpers into judgearena/mt_bench/pairwise_judging.py
  • keeps FastChat-specific verdict parsing and preset-specific score parsing separate, while reusing the same lower-level MT-Bench mechanics
  • replaces loose shared item dicts with a typed MTBenchJudgeItem
  • makes the two swap_mode="both" semantics explicit in code:
    • FastChat: conservative agreement
    • preset judging: append inverted swapped scores
  • factors shared MT-Bench result finalization in mt_bench_utils.py
  • aligns pre-generated or cached MT-Bench completions to the requested question order and raises a clear error if rows are missing
  • writes MT-Bench run metadata, including input payload and prompt metadata, alongside saved artifacts
  • lets run_mt_bench derive its default result folder and artifact name when callers do not pass them explicitly
  • adds focused MT-Bench tests for both the FastChat and preset judging paths, plus coverage for completion alignment and metadata writing
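The two `swap_mode="both"` semantics listed above can be sketched roughly as follows. This is an illustration only, not the code in `judgearena/mt_bench/pairwise_judging.py`: `JudgeItemSketch` is a hypothetical stand-in for `MTBenchJudgeItem` (its real fields are not shown in this PR description), the helper names are invented, and the score inversion assumes a preference score in [0, 1] where `1 - score` flips the direction. The actual preset score scale may differ.

```python
from dataclasses import dataclass, replace
from typing import Literal

Verdict = Literal["A", "B", "tie"]


@dataclass(frozen=True)
class JudgeItemSketch:
    """Hypothetical stand-in for MTBenchJudgeItem; real fields may differ."""

    question_id: int
    turn: int
    answer_a: str
    answer_b: str
    swapped: bool = False


def swap_answers(item: JudgeItemSketch) -> JudgeItemSketch:
    """Return a copy with the two answers exchanged and the flag toggled."""
    return replace(
        item,
        answer_a=item.answer_b,
        answer_b=item.answer_a,
        swapped=not item.swapped,
    )


def conservative_agreement(original: Verdict, swapped: Verdict) -> Verdict:
    """FastChat-style rule: keep a winner only if both orderings agree.

    The swapped verdict refers to swapped positions, so it is mapped back
    to the original positions before comparing.
    """
    mapped_back: Verdict = {"A": "B", "B": "A", "tie": "tie"}[swapped]
    return original if original == mapped_back else "tie"


def append_inverted_scores(
    original_scores: list[float], swapped_scores: list[float]
) -> list[float]:
    """Preset-style rule (assumed scale): invert swapped scores, then append."""
    return list(original_scores) + [1.0 - score for score in swapped_scores]
```

The first rule collapses two judgments into one verdict and discards disagreements as ties, while the second keeps both judgments as separate data points. That difference is why the two paths are kept distinct rather than merged behind one flag.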

ErlisLushtaku changed the base branch from main to pr32-split-v2/02-prompt-presets-only on May 28, 2026 08:59
ErlisLushtaku requested a review from kargibora on May 28, 2026 09:04
Record MT-Bench run metadata alongside results and fail early when cached or baseline completions are missing question rows, so dedicated MT-Bench runs are easier to reproduce and debug.
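A minimal sketch of the fail-early alignment and metadata writing described in this commit, under stated assumptions: the function names, the dict-based completion lookup, the error wording, and the `run_metadata.json` file name are all hypothetical and do not reflect the actual helpers in `mt_bench_utils.py`.

```python
import json
from pathlib import Path
from typing import Any


def align_completions(
    question_ids: list[int], completions_by_id: dict[int, str]
) -> list[str]:
    """Return completions in the requested question order.

    Raises ValueError naming the missing question ids instead of silently
    producing a misaligned or shorter list.
    """
    missing = [qid for qid in question_ids if qid not in completions_by_id]
    if missing:
        raise ValueError(
            f"Missing completions for {len(missing)} question(s): {missing}"
        )
    return [completions_by_id[qid] for qid in question_ids]


def write_run_metadata(result_dir: Path, metadata: dict[str, Any]) -> Path:
    """Write run metadata as JSON next to the saved artifacts.

    The file name is an assumption made for this sketch.
    """
    result_dir.mkdir(parents=True, exist_ok=True)
    path = result_dir / "run_metadata.json"
    path.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return path
```

Checking for missing rows before judging starts means a stale cache surfaces as one explicit error rather than as shifted answer pairs in the results.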