The performance of text summarization has been greatly boosted by pre-trained language models. A main concern with existing methods is that many generated summaries are not factually consistent with their source documents. To alleviate the problem, many efforts have focused on developing effective factuality evaluation metrics based on natural language inference, question answering, and syntactic dependency, among others. However, these approaches are limited by either their high computational complexity or the uncertainty introduced by multi-component pipelines, resulting in only partial agreement with human judgment. Most recently, large language models (LLMs) have shown excellent performance in not only text generation but also language comprehension. In this paper, we explore ChatGPT's ability to evaluate factual inconsistency under a zero-shot setting by examining it on both coarse-grained and fine-grained evaluation tasks, including binary entailment inference, summary ranking, and consistency rating. Experimental results indicate that ChatGPT generally outperforms previous evaluation metrics across the three tasks, indicating its great potential for factual inconsistency evaluation. However, a closer inspection of ChatGPT's output reveals certain limitations, including its preference for more lexically similar candidates, false reasoning, and inadequate understanding of instructions.
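The binary entailment inference task described above can be framed as a zero-shot prompt to an LLM. A minimal sketch follows; the prompt wording and the function name are illustrative assumptions, not the paper's exact prompt:

```python
def build_entailment_prompt(document: str, summary: str) -> str:
    """Format a zero-shot prompt asking an LLM to judge whether a
    candidate summary is factually consistent with its source
    document. The wording is illustrative, not the paper's prompt."""
    return (
        "Decide if the following summary is consistent with the "
        "corresponding article. Note that consistency means all "
        "information in the summary is supported by the article.\n\n"
        f"Article: {document}\n"
        f"Summary: {summary}\n"
        "Answer (yes or no):"
    )


# Example: a summary that contradicts the article on a factual detail.
prompt = build_entailment_prompt(
    "The city council approved the budget on Tuesday.",
    "The budget was approved on Wednesday.",
)
print(prompt)
```

In a zero-shot setting, the returned string would be sent to the model as-is (no in-context examples), and the model's yes/no answer read off as the consistency label.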
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)