Benchmarking Large Language Models for News Summarization, Tianyi Zhang+, N/A, arXiv'23 #1304

AkihikoWatanabe · 2024-05-15T04:25:57Z

URL

https://arxiv.org/abs/2301.13848

Affiliations

Tianyi Zhang, N/A
Faisal Ladhak, N/A
Esin Durmus, N/A
Percy Liang, N/A
Kathleen McKeown, N/A
Tatsunori B. Hashimoto, N/A

Abstract

Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human-written summaries.

Translation (by gpt-3.5-turbo)

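Large language models (LLMs) have shown promise in automatic summarization, but the reasons for their success are not well understood.
By conducting a human evaluation of ten LLMs spanning different pretraining methods, prompts, and model scales, we obtained two important observations.
First, we found that instruction tuning, rather than model size, is the key to an LLM's zero-shot summarization capability.
Second, existing studies have been limited by low-quality references, which leads to underestimating human performance and to lower few-shot and finetuning performance.
To evaluate LLMs better, we perform a human evaluation over high-quality summaries collected from freelance writers.
Despite major stylistic differences (for example, the amount of paraphrasing), LLM summaries were judged to be on par with human-written summaries.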

Summary (by gpt-3.5-turbo)

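To understand why LLMs succeed, the authors conducted a human evaluation of ten LLMs spanning different pretraining methods, prompts, and model scales. The results show that instruction tuning, rather than model size, is the key to an LLM's zero-shot summarization capability. LLM summaries were also judged to be on par with human-written summaries.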

AkihikoWatanabe · 2024-05-15T04:26:58Z

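The authors had humans write high-quality summaries of news articles and also generated LLM-based summaries using gpt-3.5.
They then built a dataset in which annotators scored the quality of each summary.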

AkihikoWatanabe added the Pocket label May 15, 2024

AkihikoWatanabe changed the title from "a" to "Benchmarking Large Language Models for News Summarization, Tianyi Zhang+, N/A, arXiv'23" May 15, 2024

AkihikoWatanabe mentioned this issue May 15, 2024

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, Yang Liu+, N/A, EMNLP'23 #1223

Open

AkihikoWatanabe added DocumentSummarization NaturalLanguageGeneration NLP Dataset LanguageModel Annotation labels May 15, 2024
