URL

Affiliations

Abstract

Large Language Models (LLMs) are widely used as automated metrics to evaluate natural language generation tasks. However, the likelihood, a measure of how plausible an LLM finds a sentence, can vary with superficial differences in the sentence, such as word order and sentence structure. LLM-based evaluators may therefore exhibit a likelihood bias: overrating sentences with higher likelihoods while underrating those with lower likelihoods. In this paper, we investigate the presence and impact of likelihood bias in LLM-based evaluators. We also propose a method to mitigate the likelihood bias, which uses highly biased instances as few-shot examples for in-context learning. Our experiments on evaluating the data-to-text and grammatical error correction tasks reveal that several LLMs we test display a likelihood bias. Furthermore, our proposed method successfully mitigates this bias and also significantly improves evaluation performance (in terms of correlation with human scores).
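The mitigation step described in the abstract can be pictured roughly as below. This is a minimal, hypothetical Python sketch, assuming a development set where each sentence has an LLM log-likelihood, an LLM evaluator score, and a human score; the bias-scoring rule, the names `Instance`, `select_fewshot`, and `build_prompt`, and the prompt format are all illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: score each instance by how strongly the LLM evaluator's error
# tracks sentence likelihood, then use the most biased instances (with
# their human scores) as few-shot examples for in-context learning.
from dataclasses import dataclass


@dataclass
class Instance:
    sentence: str
    log_likelihood: float  # LLM log-likelihood of the sentence
    llm_score: float       # score the LLM evaluator assigned
    human_score: float     # gold human evaluation score


def select_fewshot(instances: list[Instance], k: int = 8) -> list[Instance]:
    """Pick the k instances where evaluator error most tracks likelihood.

    (centered log-likelihood) * (LLM score - human score) is positive
    both when a high-likelihood sentence is overrated and when a
    low-likelihood sentence is underrated, i.e. in the two directions
    of the likelihood bias the abstract describes.
    """
    mean_ll = sum(i.log_likelihood for i in instances) / len(instances)

    def bias(inst: Instance) -> float:
        return (inst.log_likelihood - mean_ll) * (inst.llm_score - inst.human_score)

    return sorted(instances, key=bias, reverse=True)[:k]


def build_prompt(fewshot: list[Instance], target: str) -> str:
    """Prepend the biased examples labeled with their *human* scores,
    so the evaluator can correct its bias in-context before scoring."""
    shots = [f"Sentence: {ex.sentence}\nScore: {ex.human_score}" for ex in fewshot]
    return "\n\n".join(shots + [f"Sentence: {target}\nScore:"])
```

The intent of using human scores as the few-shot labels is that the model sees exactly the cases where likelihood misled it, paired with the correct judgment; consult the paper for the actual bias metric and example-selection procedure.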
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)