ProfBench introduces over 3000 expert-authored response–criterion pairs across 40 tasks spanning four professional domains in business and scientific research (Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA), enabling evaluation of open-ended, document-grounded professional tasks beyond exam-style QA or code/math-only settings. Even frontier models find ProfBench challenging: the best report generator, GPT-5-high, reaches only 65.9% overall, underscoring substantial headroom on realistic professional workflows that require synthesis and long-form analysis.
As part of ProfBench, we build a robust, affordable LLM-Judge by combining a Macro-F1 measure with a Bias Index to mitigate self-enhancement bias, achieving as low as <1% cross-provider bias while cutting evaluation cost by 2-3 orders of magnitude relative to prior rubric benchmarks (about $12 for a single run versus ~$300 for HealthBench and ~$8000 for PaperBench when all use OpenAI o3), improving fairness and accessibility.
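As a rough illustration, the BIAS-INDEX and MF1-BI numbers reported by the scoring script (see the example output further below) are consistent with taking the Bias Index as the spread between the highest- and lowest-scored report generators and subtracting it from the overall Macro-F1. The snippet below is a sketch under that assumption, not the scoring script's actual code:

# Sketch (assumption, not the scoring script's implementation): BIAS-INDEX as the
# spread of per-report-generator judge scores, MF1-BI as Macro-F1 minus that spread.
per_generator = {"o3": 12.2, "r1-0528": 14.2, "grok4": 10.2}  # example values from the output below
macro_f1 = 65.3                                               # "Overall" in the output below
bias_index = max(per_generator.values()) - min(per_generator.values())  # 4.0
mf1_bi = macro_f1 - bias_index                                          # 61.3
print(f"BIAS-INDEX: {bias_index:.1f}, MF1-BI: {mf1_bi:.1f}")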
git clone https://github.com/NVlabs/ProfBench
cd ProfBench
pip install -r requirements.txt
# optional: to use the Google GenAI library, install it after the other OS-specific prerequisites (via brew, apt-get, etc.)
pip install google-generativeai
python run_llm_judge_on_provided_reports.py -m meta-llama/llama-3.2-1b-instruct -ak <your_openrouter_apikey> # can also use openai
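The judge model is reached through OpenRouter's OpenAI-compatible API. The snippet below is a minimal sketch of what a single per-criterion judgment request could look like; the model name, prompt wording, and answer handling are illustrative assumptions, not the script's actual implementation:

# Minimal sketch (illustrative, not the repo's actual prompt or parsing logic):
# ask a judge model whether a report satisfies one rubric criterion via OpenRouter.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="<your_openrouter_apikey>")
prompt = (
    "Report:\n<report text>\n\n"
    "Criterion:\n<criterion text>\n\n"
    "Does the report satisfy the criterion? Answer Yes or No."
)
resp = client.chat.completions.create(
    model="meta-llama/llama-3.2-1b-instruct",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)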
python score_llm_judge.py <output_filename_of_prev_step>

This will give something like:
{
"Physics PhD": 66.5,
"Chemistry PhD": 60.3,
"Finance MBA": 61.4,
"Consulting MBA": 63.4,
"Extraction (recall)": 66.7,
"Reasoning": 63.8,
"Style": 54.3,
"Overall": 65.3,
"o3": 12.2,
"r1-0528": 14.2,
"grok4": 10.2,
"BIAS-INDEX": 4.0,
"MF1-BI": 61.3,
"prompt_tokens": 1633,
"completion_tokens": 1
}

python run_report_generation.py -m meta-llama/llama-3.2-1b-instruct -ak <your_openrouter_apikey> # can also use openai or google
python run_best_llm_judge_on_generated_reports.py -f <output_filename_of_prev_step> -ak <your_openrouter_apikey>
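The final scoring step aggregates the judge's per-criterion verdicts into the domain and category percentages shown in the output below (the actual aggregation lives in score_report_generation.py). As a rough, hypothetical illustration only, a weighted criterion-satisfaction score for one report might be computed like this:

# Hypothetical illustration only: score a report as the weight-normalised fraction
# of rubric criteria the judge marked as satisfied.
def rubric_score(judgements):
    # judgements: list of (weight, satisfied) pairs for one report's criteria
    total = sum(weight for weight, _ in judgements)
    earned = sum(weight for weight, satisfied in judgements if satisfied)
    return 100.0 * earned / total if total else 0.0

print(round(rubric_score([(3, True), (2, False), (1, True)]), 1))  # 66.7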
python score_report_generation.py <output_filename_of_prev_step>

This will give something like:
{
"Consulting MBA": 28.9,
"Finance MBA": 6.0,
"Physics PhD": 3.4,
"Chemistry PhD": 7.1,
"Overall": 11.4,
"Reasoning": 11.2,
"Extraction (recall)": 8.7,
"Style": 22.9,
"prompt_tokens": 475,
"completion_tokens": 3392,
"response_len_chars": 10014
}

If you found ProfBench helpful, please consider citing it:
@misc{wang2025profbenchmultidomainrubricsrequiring,
title={ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge},
author={Zhilin Wang and Jaehun Jung and Ximing Lu and Shizhe Diao and Ellie Evans and Jiaqi Zeng and Pavlo Molchanov and Yejin Choi and Jan Kautz and Yi Dong},
year={2025},
eprint={2510.18941},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.18941},
}