Commits (36)
ba4220d
Add llamacpp dependency and update gitignore with generated directories
ErlisLushtaku Feb 14, 2026
d2a5a42
Add documentation for llamacpp in Readme
ErlisLushtaku Feb 14, 2026
a828adb
Document direnv usage for environment variables management
ErlisLushtaku Feb 15, 2026
0dcebf9
narrow down transformers dependency to fix version mismatch
ErlisLushtaku Feb 15, 2026
d60073b
Add max_model_len param for VLLM in order to prevent OOM errors
ErlisLushtaku Feb 15, 2026
38f63ee
Fix completion loading and EuroLLM-9B example
ErlisLushtaku Feb 15, 2026
6f5e0fc
Remove `direnv` documentation
ErlisLushtaku Feb 17, 2026
42ff2ae
Revert stylistic (formatting) changes and add more documentation for …
ErlisLushtaku Feb 17, 2026
8fcb032
Rename OPENJURY_EVAL_DATA to OPENJURY_DATA
ErlisLushtaku Feb 17, 2026
df958af
Merge main
ErlisLushtaku Feb 21, 2026
35856f2
Revert changes in gitignore
ErlisLushtaku Feb 21, 2026
6a11182
Handle models with max_position_embeddings when we pass max_model_len
ErlisLushtaku Feb 21, 2026
fecd3ed
Revert EuroLLM-9B-Instruct to EuroLLM-9B since there is a default cha…
ErlisLushtaku Feb 21, 2026
0b4eaec
fix tests
ErlisLushtaku Feb 22, 2026
29340b0
Change test github workflow to use uv instead of pip for a more robus…
ErlisLushtaku Feb 22, 2026
2c294f1
Move dev dependencies to dependency-group
ErlisLushtaku Feb 22, 2026
4be61bf
Revert comment removal
ErlisLushtaku Feb 22, 2026
51d2597
Add pre-commit hook
ErlisLushtaku Feb 22, 2026
8dee7b2
add project scripts and move slurmpilot to dev group
ErlisLushtaku Feb 23, 2026
fdc9410
fix LlamaCpp bug with ChatTemplate
ErlisLushtaku Mar 2, 2026
48c5373
Add MT-Bench multi-turn evaluation support
ErlisLushtaku Mar 2, 2026
648a9be
Merge branch 'main' into erlislushtaku/feat/add-mt-bench-support
ErlisLushtaku Mar 2, 2026
14f747e
fix result formatting
ErlisLushtaku Mar 2, 2026
e67ea79
remove double environment variable
ErlisLushtaku Mar 2, 2026
4089be8
remove accidental duplications
ErlisLushtaku Mar 2, 2026
03f5cce
Refactor
ErlisLushtaku Mar 4, 2026
8ffe3a6
Remove duplication between prompt templates
ErlisLushtaku Mar 4, 2026
b877f11
add temperature argument
ErlisLushtaku Mar 9, 2026
c2056b5
add option for making mt-bench consistent with the original one from …
ErlisLushtaku Mar 9, 2026
41cd15d
Merge branch 'main' into erlislushtaku/feat/add-mt-bench-support
ErlisLushtaku Mar 9, 2026
0ca66c5
remove redundant print statement
ErlisLushtaku Mar 10, 2026
a295305
move mt-bench logic from the entrypoint
ErlisLushtaku Mar 17, 2026
0fb9700
Remove stale unused entries for fastchat mode
ErlisLushtaku Mar 17, 2026
e5670ea
Merge origin/main into erlislushtaku/feat/add-mt-bench-support
ErlisLushtaku Mar 17, 2026
6dd78fd
Refactor mt-bench eval helpers into shared runtime module
ErlisLushtaku Mar 17, 2026
0094eea
move cli args and parsing to separate util to remove dependencies on …
ErlisLushtaku Mar 18, 2026
21 changes: 20 additions & 1 deletion README.md
@@ -22,7 +22,7 @@ Compared to other libraries, here is a breakdown of features:
| **Arena-Hard-Auto** | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| **Lighteval** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Evalchemy** | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| **OpenJury** | 🔜 | ✅ | ✅ | ✅ | ✅ | ✅ |
| **OpenJury** | | ✅ | ✅ | ✅ | ✅ | ✅ |
Collaborator: 💪
The table was compiled in Oct 2025; if some libraries have since implemented missing features, please open an issue
or send a PR and we will be happy to update it.
@@ -191,10 +191,29 @@ python openjury/generate_and_evaluate.py \

This override applies to all vLLM models in the run. For remote providers (OpenAI, Together, OpenRouter), the flag is ignored since they handle templates server-side.
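The ChatML fallback mentioned above can be illustrated with a plain-Python sketch. This is a hypothetical helper for illustration only, not OpenJury's actual implementation; it mirrors the message layout a ChatML chat template produces:

```python
def chatml_format(messages):
    """Render a list of {role, content} messages in ChatML style.

    Illustrative only: shows the structure a ChatML template yields,
    ending with an open assistant turn ready for generation.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = chatml_format(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
)
```

A custom `--chat_template` replaces this fallback with the Jinja2 template string you pass in.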

### MT-Bench (Multi-Turn Evaluation)

MT-Bench evaluates multi-turn conversation ability using 80 two-turn questions across 8 categories
(writing, roleplay, reasoning, math, coding, extraction, STEM, humanities).
It uses category-dependent judge prompts and reference answers for math/reasoning/coding.
Questions are automatically downloaded from the [LMSYS MT-Bench HuggingFace space](https://huggingface.co/spaces/lmsys/mt-bench).

```bash
uv run python openjury/generate_and_evaluate.py \
--dataset mt-bench \
--model_A VLLM/Qwen/Qwen2.5-7B-Instruct \
--model_B OpenRouter/openai/gpt-4o \
--judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
--n_instructions 10
```

Results include per-category and per-turn win rate breakdowns. Use `--swap_mode both` to correct for judge position bias.
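The per-category and per-turn breakdowns amount to grouping judge verdicts by a key and counting wins. A minimal sketch, assuming a hypothetical record shape (OpenJury's actual result schema may differ):

```python
from collections import defaultdict

# Hypothetical judge records for illustration; the real schema may differ.
records = [
    {"category": "math", "turn": 1, "winner": "A"},
    {"category": "math", "turn": 2, "winner": "B"},
    {"category": "writing", "turn": 1, "winner": "A"},
    {"category": "writing", "turn": 2, "winner": "A"},
]


def win_rate(records, key, model="A"):
    """Fraction of comparisons won by `model`, grouped by `key`."""
    wins, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        if r["winner"] == model:
            wins[r[key]] += 1
    return {k: wins[k] / totals[k] for k in totals}


per_category = win_rate(records, "category")  # {'math': 0.5, 'writing': 1.0}
per_turn = win_rate(records, "turn")          # {1: 1.0, 2: 0.5}
```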

## 📊 Supported Datasets

| Dataset | Description |
|-----------------------|------------------------------------------------------------------------------------------------|
| `mt-bench` | 80 multi-turn (2-turn) questions across 8 categories ([LMSYS MT-Bench](https://arxiv.org/abs/2306.05685)) |
| `alpaca-eval` | General instruction-following benchmark |
| `arena-hard` | More challenging evaluation suite |
| `m-arena-hard` | Translated version of Arena-Hard in 23 languages |
212 changes: 212 additions & 0 deletions openjury/config.py
@@ -0,0 +1,212 @@
"""CLI argument configuration for generation and evaluation entrypoints."""

import argparse
import json
from dataclasses import dataclass, field


@dataclass
class CliArgs:
    dataset: str
    model_A: str
    model_B: str
    judge_model: str

    n_instructions: int | None = None
    provide_explanation: bool = False
    swap_mode: str = "fixed"
    ignore_cache: bool = False
    use_tqdm: bool = False
    truncate_all_input_chars: int = 8192
    max_out_tokens_models: int = 32768
    max_out_tokens_judge: int = 32768
    max_model_len: int | None = None
    chat_template: str | None = None
    mt_bench_turns: str = "both"
    mt_bench_compatibility: str = "openjury"
    result_folder: str = "results"
    engine_kwargs: dict = field(default_factory=dict)

    def __post_init__(self):
        supported_modes = ["fixed", "both"]
        assert (
            self.swap_mode in supported_modes
        ), f"Only {supported_modes} modes are supported but got {self.swap_mode}."
        supported_mt_bench_modes = ["openjury", "fastchat"]
        assert (
            self.mt_bench_compatibility in supported_mt_bench_modes
        ), f"Only {supported_mt_bench_modes} are supported but got {self.mt_bench_compatibility}."

    @classmethod
    def parse_args(cls):
        parser = argparse.ArgumentParser(
            description="Generate completions and evaluate them with a judge",
        )
        parser.add_argument(
            "--dataset",
            required=True,
            help="The dataset to use. For instance `alpaca-eval`, `arena-hard`, `m-arena-hard-EU`, `mt-bench` for "
            "instruction tuning cases or `french-contexts`, `spanish-contexts` for base models.",
        )
        parser.add_argument(
            "--model_A",
            required=True,
            help="Name of the first LLM to compare, must be a valid choice for `generation_provider`",
        )
        parser.add_argument(
            "--model_B",
            required=True,
            help="Name of the second LLM to compare, must be a valid choice for `generation_provider`",
        )
        parser.add_argument(
            "--judge_model",
            required=True,
            help="Name of the judge LLM to use, for instance `Together/meta-llama/Meta-Llama-3-70B-Instruct-Turbo`, "
            "`VLLM/meta-llama/Meta-Llama-3-70B-Instruct-Turbo`, `LangChain/LocalPath` etc.",
        )
        parser.add_argument(
            "--n_instructions",
            type=int,
            required=False,
            help="If specified, evaluate only the first `n_instructions` instructions of the dataset.",
        )
        parser.add_argument(
            "--provide_explanation",
            action="store_true",
            help="If specified, the judge will provide an explanation before making a judgement. Does not necessarily "
            "improve the accuracy of the judge but enables some result interpretation.",
        )
        parser.add_argument(
            "--swap_mode",
            type=str,
            choices=["fixed", "both"],
            default="fixed",
            help="Model comparison order mode. 'fixed': always use model order A-B. 'both': correct for model order "
            "bias by evaluating each instruction twice, once as A-B and once as B-A, and average. This helps account "
            "for judge position bias. Default is 'fixed'.",
        )
        parser.add_argument(
            "--ignore_cache",
            action="store_true",
            help="If specified, ignore cache of previous completions.",
        )
        parser.add_argument(
            "--use_tqdm",
            action="store_true",
            help="If specified, use tqdm; does not work with all model providers, vLLM in particular.",
        )
        parser.add_argument(
            "--result_folder",
            type=str,
            required=False,
            default="results",
            help="The folder to save the results. Defaults to `results`. Evaluation results will be saved in"
            " `[result_folder]/[evaluation_name]`.",
        )
        parser.add_argument(
            "--truncate_all_input_chars",
            type=int,
            required=False,
            default=8192,
            help="Character-level truncation applied before tokenization: truncates each instruction "
            "before model A/B generation and truncates each completion before judge evaluation.",
        )
        parser.add_argument(
            "--max_out_tokens_models",
            type=int,
            required=False,
            default=32768,
            help=(
                "Generation token budget for each model A/B response. For VLLM, keep this <= "
                "--max_model_len (if provided)."
            ),
        )
        parser.add_argument(
            "--max_out_tokens_judge",
            type=int,
            required=False,
            default=32768,
            help=(
                "Generation token budget for the judge response (reasoning + scores). For "
                "VLLM, keep this <= --max_model_len (if provided)."
            ),
        )
        parser.add_argument(
            "--max_model_len",
            type=int,
            required=False,
            default=None,
            help=(
                "Optional total context window for VLLM models (prompt + generation). This is "
                "independent from --max_out_tokens_models/--max_out_tokens_judge, which only cap "
                "generated tokens. This is useful on smaller GPUs to avoid OOM."
            ),
        )
        parser.add_argument(
            "--chat_template",
            type=str,
            required=False,
            default=None,
            help="Jinja2 chat template string to use instead of the model's tokenizer template. "
            "If not provided, ChatML is used as fallback for models without a chat template.",
        )
        parser.add_argument(
            "--mt_bench_turns",
            type=str,
            choices=["both", "single", "multi"],
            default="both",
            help="Which MT-Bench turns to evaluate. 'single': only turn 1, "
            "'multi': only turn 2 (with full conversation context), "
            "'both' (default): evaluate both turns.",
        )
        parser.add_argument(
            "--mt_bench_compatibility",
            type=str,
            choices=["openjury", "fastchat"],
            default="openjury",
            help=(
                "MT-Bench evaluation/generation mode. "
                "'openjury' (default): OpenJury score_A/score_B prompt + softmax preference. "
                "'fastchat': use FastChat/MT-Bench pairwise prompts with [[A]]/[[B]]/[[C]] verdict parsing, "
                "conservative position-bias handling, judge temperature=0, and MT-Bench category temperatures."
            ),
        )
        parser.add_argument(
            "--engine_kwargs",
            type=str,
            required=False,
            default="{}",
            help=(
                "JSON dict of engine-specific kwargs forwarded to the underlying engine. "
                "Example for vLLM: '{\"tensor_parallel_size\": 2, \"gpu_memory_utilization\": 0.9}'."
            ),
        )
        args = parser.parse_args()

        try:
            engine_kwargs = (
                json.loads(args.engine_kwargs) if args.engine_kwargs else {}
            )
            if not isinstance(engine_kwargs, dict):
                raise ValueError("engine_kwargs must be a JSON object")
        except Exception as e:
            raise SystemExit(f"Failed to parse --engine_kwargs: {e}") from e

        return cls(
            dataset=args.dataset,
            model_A=args.model_A,
            model_B=args.model_B,
            judge_model=args.judge_model,
            n_instructions=args.n_instructions,
            provide_explanation=args.provide_explanation,
            swap_mode=args.swap_mode,
            ignore_cache=args.ignore_cache,
            use_tqdm=args.use_tqdm,
            truncate_all_input_chars=args.truncate_all_input_chars,
            max_out_tokens_models=args.max_out_tokens_models,
            max_out_tokens_judge=args.max_out_tokens_judge,
            max_model_len=args.max_model_len,
            chat_template=args.chat_template,
            mt_bench_turns=args.mt_bench_turns,
            mt_bench_compatibility=args.mt_bench_compatibility,
            result_folder=args.result_folder,
            engine_kwargs=engine_kwargs,
        )
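The `--engine_kwargs` parse-and-validate step in `CliArgs.parse_args` can be exercised in isolation. This sketch reproduces that logic as a standalone function (the function name is introduced here for illustration; it is not part of OpenJury's API):

```python
import json


def parse_engine_kwargs(raw: str) -> dict:
    """Parse a JSON dict from a CLI string, rejecting non-object payloads.

    Mirrors the validation in CliArgs.parse_args: empty string -> {},
    valid JSON object -> dict, anything else -> SystemExit.
    """
    try:
        kwargs = json.loads(raw) if raw else {}
        if not isinstance(kwargs, dict):
            raise ValueError("engine_kwargs must be a JSON object")
    except Exception as e:
        raise SystemExit(f"Failed to parse --engine_kwargs: {e}") from e
    return kwargs


ok = parse_engine_kwargs('{"tensor_parallel_size": 2}')  # {'tensor_parallel_size': 2}
```

Note that a JSON array such as `[1, 2]` parses successfully but is rejected by the `isinstance` check, so malformed input fails fast at startup rather than deep inside the engine.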