The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system, including overall standard scores for better numerical comparability across tasks and models, and a unique self-contrast metric for automatically evaluating knowledge hallucination. We evaluate 21 open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.
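The abstract does not spell out how the "overall standard scores" are computed. As a minimal sketch, assuming a per-task z-score standardization across models (the paper's actual scheme may differ), the function below shows how raw metrics on different scales could be made numerically comparable; `standardize_scores` and the example data are hypothetical names introduced here for illustration.

```python
import statistics

def standardize_scores(raw_scores: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Convert per-task raw scores into z-scores across models.

    raw_scores maps task name -> {model name -> raw metric value}.
    Each task's scores are standardized to mean 0 and stdev 1 across
    models, so tasks measured on different metric ranges (e.g. F1 in
    [0, 1] vs. ROUGE in [0, 100]) become comparable and averageable.
    """
    standardized: dict[str, dict[str, float]] = {}
    for task, by_model in raw_scores.items():
        values = list(by_model.values())
        mean = statistics.fmean(values)
        stdev = statistics.pstdev(values) or 1.0  # guard against zero spread
        standardized[task] = {m: (v - mean) / stdev for m, v in by_model.items()}
    return standardized

# Hypothetical example: two tasks on different scales, three models.
raw = {
    "entity_typing_f1": {"model_a": 0.62, "model_b": 0.48, "model_c": 0.55},
    "summarization_rouge": {"model_a": 31.0, "model_b": 27.5, "model_c": 29.0},
}
print(standardize_scores(raw))
```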