Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, Lianmin Zheng+, N/A, arXiv'23 #903

AkihikoWatanabe · 2023-07-26T04:52:57Z

URL

https://arxiv.org/abs/2306.05685

Affiliations

Lianmin Zheng, N/A
Wei-Lin Chiang, N/A
Ying Sheng, N/A
Siyuan Zhuang, N/A
Zhanghao Wu, N/A
Yonghao Zhuang, N/A
Zi Lin, N/A
Zhuohan Li, N/A
Dacheng Li, N/A
Eric P. Xing, N/A
Hao Zhang, N/A
Joseph E. Gonzalez, N/A
Ion Stoica, N/A

Abstract

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. We will publicly release MT-bench questions, 3K expert votes, and 30K conversations with human preferences from Chatbot Arena.
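
The abstract names position bias but does not spell out the mitigation. One common remedy is to query the judge twice with the answer order swapped and to count a win only when both passes agree. The following is a minimal sketch under that assumption; `judge_with_swap` and the `judge` callable (returning "first", "second", or "tie") are hypothetical names for illustration, not the authors' API.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# -> "first", "second", or "tie". In practice this would wrap an LLM call.
Judge = Callable[[str, str, str], str]


def judge_with_swap(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie", counting a win only if it survives a position swap."""
    first_pass = judge(question, answer_a, answer_b)   # A shown first
    second_pass = judge(question, answer_b, answer_a)  # B shown first

    if first_pass == "first" and second_pass == "second":
        return "A"  # A preferred in both orders
    if first_pass == "second" and second_pass == "first":
        return "B"  # B preferred in both orders
    return "tie"    # inconsistent or tied verdicts are treated as a tie


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always prefers the first answer."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge that ignores position and prefers the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


biased_verdict = judge_with_swap(always_first, "q", "short", "a longer answer")
consistent_verdict = judge_with_swap(prefers_longer, "q", "short", "a longer answer")
```

With the toy judges above, the purely position-biased judge yields a tie, while the position-independent judge consistently picks answer B. Treating inconsistent verdicts as ties is a conservative design choice; other policies, such as randomizing the order, are possible.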

Translation (by gpt-3.5-turbo)

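Evaluating chat assistants based on large language models (LLMs) is difficult because of their broad capabilities and because existing benchmarks are inadequate for measuring human preferences.
To address this, we consider using strong LLMs as judges to evaluate these models on more open-ended questions.
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases and limited reasoning ability, and propose solutions to mitigate some of them.
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench (a multi-turn question set) and Chatbot Arena (a crowdsourced battle platform).
The results reveal that strong LLM judges such as GPT-4 match both controlled and crowdsourced human preferences well, achieving over 80% agreement. LLM-as-a-judge is therefore a scalable and explainable way to approximate human preferences, which are very expensive to obtain.
In addition, we show that our benchmark and traditional benchmarks are complementary by evaluating several variants of LLaMA and Vicuna.
We will release the MT-bench questions, 3,000 expert votes, and 30,000 conversations with human preferences from Chatbot Arena.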

Summary (by gpt-3.5-turbo)

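The paper proposes using large language models (LLMs) as judges to evaluate performance on open-ended questions. It proposes solutions to mitigate the limitations and problems of LLM judges, and verifies the agreement between LLM judges and human preferences on two benchmarks. The results show that strong LLM judges agree well with human preferences and can approximate them in a scalable, explainable way. It also shows that the new benchmark and traditional benchmarks are complementary, evaluating several model variants.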

AkihikoWatanabe · 2023-09-28T15:37:27Z

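The MT-Bench (MTBench) score is obtained by posing multi-turn questions to a model and having GPT-4 score the quality of its answers.
The paper also verifies the agreement between GPT-4's judgments and those of human experts, which exceeds 80%.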
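
As a rough illustration of what an agreement figure like this measures, the sketch below computes the fraction of comparisons on which a judge and a human pick the same outcome. The function name `agreement_rate` and the vote labels ("A", "B", "tie") are assumptions made for this example, and the votes are made-up toy data, not results from the paper. The paper's own protocol (for example, how ties are handled) may differ.

```python
from typing import Sequence


def agreement_rate(judge_votes: Sequence[str], human_votes: Sequence[str]) -> float:
    """Fraction of comparisons where the judge and the human chose the same outcome."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")
    if not judge_votes:
        raise ValueError("at least one comparison is required")
    # Count positions where both voters gave the identical label.
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


# Toy data for illustration only: five pairwise comparisons.
judge = ["A", "B", "A", "tie", "B"]
human = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge, human)
print(f"agreement: {rate:.0%}")  # agreement: 80%
```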

AkihikoWatanabe added the Pocket label Jul 26, 2023

AkihikoWatanabe changed the title from "a" to "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, Lianmin Zheng+, N/A, arXiv'23" Jul 26, 2023

AkihikoWatanabe added NLP LanguageModel Evaluation labels Oct 15, 2023

AkihikoWatanabe added the LLM-as-a-Judge label Apr 17, 2024
