The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system, including overall standard scores for better numerical comparability across tasks and models, and a unique self-contrast metric for automatically evaluating knowledge hallucination. We evaluate 21 open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.
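The abstract does not spell out how the "overall standard scores" are computed. As a minimal sketch, assuming a per-task z-score standardization across models (the paper's actual scheme may differ), the function below shows how raw metrics on different scales could be made numerically comparable; `standardize_scores` and the example data are hypothetical names introduced here for illustration.

```python
import statistics

def standardize_scores(raw_scores: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Convert per-task raw scores into z-scores across models.

    raw_scores maps task name -> {model name -> raw metric value}.
    Each task's scores are standardized to mean 0 and stdev 1 across
    models, so tasks measured on different metric ranges (e.g. F1 in
    [0, 1] vs. ROUGE in [0, 100]) become comparable and averageable.
    """
    standardized: dict[str, dict[str, float]] = {}
    for task, by_model in raw_scores.items():
        values = list(by_model.values())
        mean = statistics.fmean(values)
        stdev = statistics.pstdev(values) or 1.0  # guard against zero spread
        standardized[task] = {m: (v - mean) / stdev for m, v in by_model.items()}
    return standardized

# Hypothetical example: two tasks on different scales, three models.
raw = {
    "entity_typing_f1": {"model_a": 0.62, "model_b": 0.48, "model_c": 0.55},
    "summarization_rouge": {"model_a": 31.0, "model_b": 27.5, "model_c": 29.0},
}
print(standardize_scores(raw))
```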