Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluation. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pairwise ranking formats together with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at https://github.com/prometheus-eval/prometheus-eval.
Prometheus 2: An Open Source Language Model Specialized in Evaluating
Other Language Models, Seungone Kim+, N/A, arXiv'24
May 3, 2024