URL

Affiliations

Abstract

Large Language Models (LLMs) are widely used as automated metrics to evaluate natural language generation tasks. However, the likelihood, a measure of how plausible an LLM finds a sentence, can vary with superficial differences in the sentence, such as word order and sentence structure. LLM-based evaluators may therefore exhibit a likelihood bias: overrating sentences with higher likelihoods while underrating those with lower likelihoods. In this paper, we investigate the presence and impact of likelihood bias in LLM-based evaluators. We also propose a method to mitigate the likelihood bias, which uses highly biased instances as few-shot examples for in-context learning. Our experiments on evaluating the data-to-text and grammatical error correction tasks reveal that several LLMs we test display a likelihood bias. Furthermore, our proposed method successfully mitigates this bias and also significantly improves evaluation performance (in terms of correlation with human scores).
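The mitigation step described in the abstract can be pictured roughly as below. This is a minimal, hypothetical Python sketch, assuming a development set where each sentence has an LLM log-likelihood, an LLM evaluator score, and a human score; the bias-scoring rule, the names `Instance`, `select_fewshot`, and `build_prompt`, and the prompt format are all illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: score each instance by how strongly the LLM evaluator's error
# tracks sentence likelihood, then use the most biased instances (with
# their human scores) as few-shot examples for in-context learning.
from dataclasses import dataclass


@dataclass
class Instance:
    sentence: str
    log_likelihood: float  # LLM log-likelihood of the sentence
    llm_score: float       # score the LLM evaluator assigned
    human_score: float     # gold human evaluation score


def select_fewshot(instances: list[Instance], k: int = 8) -> list[Instance]:
    """Pick the k instances where evaluator error most tracks likelihood.

    (centered log-likelihood) * (LLM score - human score) is positive
    both when a high-likelihood sentence is overrated and when a
    low-likelihood sentence is underrated, i.e. in the two directions
    of the likelihood bias the abstract describes.
    """
    mean_ll = sum(i.log_likelihood for i in instances) / len(instances)

    def bias(inst: Instance) -> float:
        return (inst.log_likelihood - mean_ll) * (inst.llm_score - inst.human_score)

    return sorted(instances, key=bias, reverse=True)[:k]


def build_prompt(fewshot: list[Instance], target: str) -> str:
    """Prepend the biased examples labeled with their *human* scores,
    so the evaluator can correct its bias in-context before scoring."""
    shots = [f"Sentence: {ex.sentence}\nScore: {ex.human_score}" for ex in fewshot]
    return "\n\n".join(shots + [f"Sentence: {target}\nScore:"])
```

The intent of using human scores as the few-shot labels is that the model sees exactly the cases where likelihood misled it, paired with the correct judgment; consult the paper for the actual bias metric and example-selection procedure.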
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)