NVlabs/ProfBench

Overview

Paper | Data | Code

ProfBench introduces over 3,000 expert-authored response-criterion pairs spanning 40 tasks in four professional domains across business and scientific research: Physics PhD, Chemistry PhD, Finance MBA, and Consulting MBA. It enables evaluation of open-ended, document-grounded professional tasks beyond exam-style QA or code/math-only settings. Even frontier models find ProfBench challenging: the best report generator, GPT-5-high, reaches only 65.9% overall, underscoring substantial headroom on realistic professional workflows that require synthesis and long-form analysis.

As part of ProfBench, we construct a robust, affordable LLM-Judge by combining a Macro-F1 measure with a Bias Index that mitigates self-enhancement bias. This achieves cross-provider bias as low as <1% while reducing evaluation costs by 2-3 orders of magnitude compared to prior rubric benchmarks (a single run costs only $12, versus ~$300 for HealthBench and ~$8,000 for PaperBench when all use OpenAI o3), improving fairness and accessibility.

Installation

git clone https://github.com/NVlabs/ProfBench

cd ProfBench

pip install -r requirements.txt

# to use the Google GenAI library, install it after the other OS-specific prerequisites (via brew, apt-get, etc.)
pip install google-generativeai
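
Before running the scripts, it can help to confirm that your API key and model name work. Below is a minimal connectivity check, assuming the scripts talk to OpenRouter through an OpenAI-compatible endpoint; the base URL, environment variable, and test prompt are illustrative and not taken from the repository code.

```python
# Minimal sanity check of OpenRouter access via the OpenAI client.
# Assumes an OpenAI-compatible endpoint; adjust base_url/model as needed.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # OpenRouter's OpenAI-compatible API
    api_key=os.environ["OPENROUTER_API_KEY"],  # the key passed via -ak below
)

resp = client.chat.completions.create(
    model="meta-llama/llama-3.2-1b-instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)
```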

Running LLM Judge Evaluation

python run_llm_judge_on_provided_reports.py -m meta-llama/llama-3.2-1b-instruct -ak <your_openrouter_apikey> # can also use openai

python score_llm_judge.py <output_filename_of_prev_step>

This will produce output like the following:

{
    "Physics PhD": 66.5,
    "Chemistry PhD": 60.3,
    "Finance MBA": 61.4,
    "Consulting MBA": 63.4,
    "Extraction (recall)": 66.7,
    "Reasoning": 63.8,
    "Style": 54.3,
    "Overall": 65.3,
    "o3": 12.2,
    "r1-0528": 14.2,
    "grok4": 10.2,
    "BIAS-INDEX": 4.0,
    "MF1-BI": 61.3,
    "prompt_tokens": 1633,
    "completion_tokens": 1
}
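
In the sample output above, the per-model fields (o3, r1-0528, grok4) break the judge scores out by report generator. The values are consistent with BIAS-INDEX being the max-min spread of those per-generator scores (14.2 − 10.2 = 4.0) and MF1-BI being Overall minus BIAS-INDEX (65.3 − 4.0 = 61.3). The sketch below reproduces those two fields under that assumption; see score_llm_judge.py for the actual definitions.

```python
# Illustrative only: recomputes the summary fields from the sample output above,
# assuming BIAS-INDEX is the max-min spread over per-generator scores and
# MF1-BI = Overall - BIAS-INDEX. Check score_llm_judge.py for the real definitions.
per_generator = {"o3": 12.2, "r1-0528": 14.2, "grok4": 10.2}
overall_macro_f1 = 65.3  # "Overall" in the sample output

bias_index = max(per_generator.values()) - min(per_generator.values())  # 4.0
mf1_bi = overall_macro_f1 - bias_index                                  # 61.3

print(f"BIAS-INDEX: {bias_index:.1f}, MF1-BI: {mf1_bi:.1f}")
```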

Running Report Generation

python run_report_generation.py -m meta-llama/llama-3.2-1b-instruct -ak <your_openrouter_apikey>  # can also use openai or google

python run_best_llm_judge_on_generated_reports.py -f <output_filename_of_prev_step> -ak <your_openrouter_apikey>

python score_report_generation.py <output_filename_of_prev_step>

This will produce output like the following:

{
    "Consulting MBA": 28.9,
    "Finance MBA": 6.0,
    "Physics PhD": 3.4,
    "Chemistry PhD": 7.1,
    "Overall": 11.4,
    "Reasoning": 11.2,
    "Extraction (recall)": 8.7,
    "Style": 22.9,
    "prompt_tokens": 475,
    "completion_tokens": 3392,
    "response_len_chars": 10014
}
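
Each scoring script prints a summary dictionary like the ones above. If you save those dictionaries as JSON, a small helper makes it easy to compare per-domain scores across report-generation models; the saving step and file names below are assumptions, not part of the repository scripts.

```python
# Illustrative post-processing: tabulate per-domain report-generation scores
# for several models, assuming each summary dict was saved to a JSON file
# (the file names here are hypothetical).
import json

DOMAINS = ["Physics PhD", "Chemistry PhD", "Finance MBA", "Consulting MBA", "Overall"]

def load_summary(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

runs = {
    "llama-3.2-1b-instruct": load_summary("scores_llama-3.2-1b-instruct.json"),
}

for name, summary in runs.items():
    row = "  ".join(f"{d} {summary.get(d, float('nan')):5.1f}" for d in DOMAINS)
    print(f"{name:>24}  {row}")
```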

Citation

If you find ProfBench helpful, please consider citing the following:

@misc{wang2025profbenchmultidomainrubricsrequiring,
      title={ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge}, 
      author={Zhilin Wang and Jaehun Jung and Ximing Lu and Shizhe Diao and Ellie Evans and Jiaqi Zeng and Pavlo Molchanov and Yejin Choi and Jan Kautz and Yi Dong},
      year={2025},
      eprint={2510.18941},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.18941}, 
}
