The performance of text summarization has been greatly boosted by pre-trained language models. A main concern with existing methods is that many generated summaries are not factually consistent with their source documents. To alleviate the problem, many efforts have focused on developing effective factuality evaluation metrics based on natural language inference, question answering, and syntactic dependency, among others. However, these approaches are limited by either their high computational complexity or the uncertainty introduced by multi-component pipelines, resulting in only partial agreement with human judgment. Most recently, large language models (LLMs) have shown excellent performance in not only text generation but also language comprehension. In this paper, we explore ChatGPT's ability to evaluate factual inconsistency under a zero-shot setting by examining it on both coarse-grained and fine-grained evaluation tasks, including binary entailment inference, summary ranking, and consistency rating. Experimental results indicate that ChatGPT generally outperforms previous evaluation metrics across the three tasks, indicating its great potential for factual inconsistency evaluation. However, a closer inspection of ChatGPT's output reveals certain limitations, including its preference for more lexically similar candidates, false reasoning, and inadequate understanding of instructions.
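The binary entailment inference task described above can be framed as a zero-shot prompt to an LLM. A minimal sketch follows; the prompt wording and the function name are illustrative assumptions, not the paper's exact prompt:

```python
def build_entailment_prompt(document: str, summary: str) -> str:
    """Format a zero-shot prompt asking an LLM to judge whether a
    candidate summary is factually consistent with its source
    document. The wording is illustrative, not the paper's prompt."""
    return (
        "Decide if the following summary is consistent with the "
        "corresponding article. Note that consistency means all "
        "information in the summary is supported by the article.\n\n"
        f"Article: {document}\n"
        f"Summary: {summary}\n"
        "Answer (yes or no):"
    )


# Example: a summary that contradicts the article on a factual detail.
prompt = build_entailment_prompt(
    "The city council approved the budget on Tuesday.",
    "The budget was approved on Wednesday.",
)
print(prompt)
```

In a zero-shot setting, the returned string would be sent to the model as-is (no in-context examples), and the model's yes/no answer read off as the consistency label.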
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)