Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluation. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pairwise ranking formats together with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at https://github.com/prometheus-eval/prometheus-eval.
Prometheus 2: An Open Source Language Model Specialized in Evaluating
Other Language Models, Seungone Kim+, N/A, arXiv'24
May 3, 2024