INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback, Wenda Xu+, N/A, arXiv'23 #1224

AkihikoWatanabe · 2024-01-25T09:24:10Z

URL

https://arxiv.org/abs/2305.14282

Affiliations

Wenda Xu, N/A
Danqing Wang, N/A
Liangming Pan, N/A
Zhenqiao Song, N/A
Markus Freitag, N/A
William Yang Wang, N/A
Lei Li, N/A

Abstract

Automatically evaluating the quality of language generation is critical. Although recent learned metrics show high correlation with human judgement, these metrics can not explain their verdict or associate the scores with defects in generated text. To address this limitation, we present InstructScore, an explainable evaluation metric for text generation. By harnessing both explicit human instruction and the implicit knowledge of GPT-4, we fine-tune a text evaluation metric based on LLaMA, producing both a score for generated text and a human readable diagnostic report. We evaluate InstructScore on a variety of generation tasks, including translation, captioning, data-to-text and commonsense generation. Experiments show that our 7B model surpasses all other unsupervised metrics, including those based on 175B GPT-3 and GPT-4. Surprisingly, our InstructScore, even without direct supervision from human-rated data, achieves performance levels on par with state-of-the-art metrics like COMET22, which were fine-tuned on human ratings.
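The abstract says the metric produces both a score and a human-readable diagnostic report. As a rough illustration of how the two outputs could relate, here is a minimal sketch that derives a score from a list of error annotations. Everything in it is an assumption for illustration: the names `ErrorAnnotation`, `SEVERITY_PENALTY`, `score_from_report`, and `render_report` are hypothetical, and the severity weights (-5 for major, -1 for minor) follow a common MQM-style convention that the abstract itself does not state. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical severity weights (MQM-style convention).
# Assumption for illustration only; not stated in the abstract.
SEVERITY_PENALTY = {"major": -5, "minor": -1}


@dataclass
class ErrorAnnotation:
    """One entry of a diagnostic report (hypothetical structure)."""
    error_type: str   # e.g. "mistranslation"
    severity: str     # "major" or "minor"
    location: str     # the span of generated text the error refers to
    explanation: str  # free-text justification


def score_from_report(errors: List[ErrorAnnotation]) -> int:
    """Sum the penalties of all annotated errors; 0 means no errors found."""
    return sum(SEVERITY_PENALTY[e.severity] for e in errors)


def render_report(errors: List[ErrorAnnotation]) -> str:
    """Format the annotations as a plain-text diagnostic report."""
    if not errors:
        return "No errors found."
    lines = []
    for i, e in enumerate(errors, start=1):
        lines.append(
            f"Error {i}: {e.error_type} ({e.severity}) at '{e.location}': {e.explanation}"
        )
    return "\n".join(lines)


# Made-up example report for a machine-translated sentence.
report = [
    ErrorAnnotation("mistranslation", "major", "river bank",
                    "The source refers to a financial institution."),
    ErrorAnnotation("punctuation", "minor", ".",
                    "The source sentence is a question."),
]

print(render_report(report))
print(score_from_report(report))  # -6
```

With this design the score is fully determined by the report, so every point deducted can be traced back to a specific annotated defect.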

Translation (by gpt-3.5-turbo)

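Automatically evaluating the quality of language generation is important.
Recent learned metrics show high correlation with human judgment,
but these metrics cannot explain their verdicts or relate scores to defects in the generated text.
To address this limitation, we propose InstructScore, an explainable evaluation metric for text generation.
By leveraging both explicit human instructions and the implicit knowledge of GPT-4, we fine-tune a text evaluation metric based on LLaMA that produces both a score for the generated text and a human-readable diagnostic report.
We evaluate InstructScore on a variety of generation tasks, including translation, captioning, data-to-text, and commonsense generation.
In experiments, our 7B model outperforms other unsupervised metrics, including those based on 175B GPT-3 and GPT-4.
Surprisingly, even without direct supervision from human-rated data, InstructScore reaches performance on par with state-of-the-art metrics such as COMET22, which were fine-tuned on human ratings.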

Summary (by gpt-3.5-turbo)

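Evaluating the quality of automatic language generation calls for explainable metrics, but existing metrics cannot explain their verdicts or relate scores to defects. The paper therefore proposes a new metric, InstructScore, which uses human instructions and GPT-4's knowledge to produce both an evaluation of the text and a diagnostic report. InstructScore is evaluated on a variety of generation tasks and outperforms other metrics. Surprisingly, it reaches performance on par with state-of-the-art metrics without human-rated data.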

AkihikoWatanabe · 2024-01-25T09:25:04Z

A study arguing that traditional NLG performance metrics have low interpretability.

AkihikoWatanabe added the Pocket label Jan 25, 2024

AkihikoWatanabe changed the title あ INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback, Wenda Xu+, N/A, arXiv'23 Jan 25, 2024
