Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation, ACL'23 #869

AkihikoWatanabe · 2023-07-18T01:51:59Z

https://virtual2023.aclweb.org/paper_P3833.html#abstract

AkihikoWatanabe · 2023-07-18T05:44:42Z

Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale, and an in-depth analysis of human evaluation is lacking. Therefore, we address the shortcomings of existing summarization evaluation along the following axes: (1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units and allows for a high inter-annotator agreement. (2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems on three datasets. (3) We conduct a comparative study of four human evaluation protocols, underscoring potential confounding factors in evaluation setups. (4) We evaluate 50 automatic metrics and their variants using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. The metrics we benchmarked include recent methods based on large language models (LLMs), GPTScore and G-Eval. Furthermore, our findings have important implications for evaluating LLMs, as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
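
To make point (1) concrete, here is a minimal sketch of how an ACU-style score could be computed once an annotator has judged each unit. The abstract does not give a formula, so the recall-style ratio, the function name `acu_score`, and the example units are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence


def acu_score(matched: Sequence[bool]) -> float:
    """Fraction of reference ACUs judged present in a system summary.

    Assumption: a plain recall-style ratio; the abstract does not
    specify how ACU judgments are aggregated.
    """
    if not matched:
        raise ValueError("at least one ACU is required")
    return sum(matched) / len(matched)


# Hypothetical ACUs written from a reference summary, paired with one
# annotator's binary judgment of whether the system summary conveys them.
judgments = {
    "The company reported a loss.": True,
    "The loss was in the third quarter.": True,
    "The CEO resigned.": False,
    "The board named an interim CEO.": False,
}

score = acu_score(list(judgments.values()))
print(f"ACU score: {score:.2f}")
```

Because each judgment is a yes/no decision about one small fact, annotators have less room to disagree than with a single holistic rating, which is the intuition behind the high inter-annotator agreement claimed in the abstract.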

Translation (by gpt-3.5-turbo)

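Human evaluation is the foundation for evaluating both summarization systems and automatic metrics. However, existing studies of summarization evaluation have low inter-annotator agreement, insufficient scale, or lack a detailed analysis of human evaluation. We therefore address these shortcomings of existing summarization evaluation along the following axes: (1) We propose Atomic Content Units (ACUs), a summary salience protocol based on fine-grained semantic units, which achieves high inter-annotator agreement. (2) We build the Robust Summarization Evaluation (RoSE) benchmark, a large-scale human evaluation dataset containing 22,000 summary-level annotations for 28 top-performing systems on three datasets. (3) We compare four human evaluation protocols and reveal potential confounding factors in evaluation setups. (4) Using the collected human annotations, we evaluate 50 automatic metrics and their variants and show that our benchmark yields more statistically stable and significant results. The metrics we benchmarked include GPTScore and G-Eval, recent methods based on large language models (LLMs). Furthermore, our findings have important implications for evaluating LLMs: we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by annotators' prior, input-agnostic preferences, so more robust and targeted evaluation methods are needed.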

Summary (by gpt-3.5-turbo)

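Human evaluation is important for assessing summaries, but existing evaluation methods have problems. We therefore propose a new summary salience protocol and collect a large-scale human evaluation dataset. We also compare different evaluation protocols and evaluate automatic metrics. Our findings have important implications for evaluating large language models.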
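
As a rough illustration of what evaluating an automatic metric against human annotations can look like, the sketch below computes a system-level Kendall rank correlation between per-system human scores and metric scores. The choice of Kendall's tau-a, the function name `kendall_tau_a`, and all scores are assumptions for illustration; the abstract does not say which correlation measure the authors use.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-system averages: human scores vs. an automatic metric.
human = [0.42, 0.35, 0.51, 0.28]
metric = [0.30, 0.31, 0.40, 0.22]

tau = kendall_tau_a(human, metric)
print(f"system-level Kendall tau-a: {tau:.3f}")
```

With only a handful of systems, a single swapped pair moves the correlation noticeably, which is one reason a larger annotated benchmark can give more statistically stable comparisons between metrics.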

AkihikoWatanabe added the translation_required label Jul 18, 2023

AkihikoWatanabe changed the title ~~Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation~~ Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation, ACL'23 Jul 18, 2023

AkihikoWatanabe added translation_required and removed translation_required labels Jul 18, 2023

AkihikoWatanabe added DocumentSummarization NLP Evaluation Metrics Dataset labels Oct 21, 2023
