Large Language Models (LLMs) have made significant strides in handling long sequences exceeding 32K tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their abilities in more nuanced, real-world scenarios. This study introduces a specialized benchmark (LongICLBench) focusing on long in-context learning within the realm of extreme-label classification. We meticulously selected six datasets with a label range spanning 28 to 174 classes, covering different input (few-shot demonstration) lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input and recognize the massive label space in order to make correct predictions. We evaluate 13 long-context LLMs on our benchmark. We find that the long-context LLMs perform relatively well on the less challenging tasks with shorter demonstration lengths by effectively utilizing the long context window. However, on the most challenging task, Discovery, with 174 labels, all the LLMs struggle to understand the task definition and thus reach performance close to zero. This suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences. Further analysis reveals a tendency among models to favor predictions for labels presented toward the end of the sequence; their ability to reason over multiple pieces of information in a long sequence remains to be improved. Our study shows that long-context understanding and reasoning are still challenging tasks for existing LLMs. We believe LongICLBench can serve as a more realistic evaluation for future long-context LLMs.
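As a rough illustration of the benchmark setup described above, the sketch below (not the authors' code; all function names, demonstration texts, and labels are illustrative assumptions) shows how few-shot demonstrations from a large label space might be concatenated into a single long in-context learning prompt. With hundreds of demonstrations covering 28 to 174 classes, such a prompt grows to the 2K–50K-token lengths the benchmark targets.

```python
# Hedged sketch of long in-context extreme-label classification
# prompt construction. Names and examples are hypothetical, not
# taken from LongICLBench itself.

def build_icl_prompt(demos, query_text):
    """Concatenate (text, label) demonstrations into one prompt.

    demos: list of (text, label) pairs drawn from a large label
    space; the model must read every demonstration to infer the
    full set of valid labels before answering the final query.
    """
    lines = [f"Text: {text}\nLabel: {label}" for text, label in demos]
    lines.append(f"Text: {query_text}\nLabel:")  # query left unanswered
    return "\n\n".join(lines)

# Tiny illustrative demonstration set (real runs use hundreds).
demos = [
    ("The movie was a delight from start to finish.", "joy"),
    ("I cannot believe they cancelled the show.", "anger"),
    ("The results were announced without warning.", "surprise"),
]
prompt = build_icl_prompt(demos, "Everything went wrong today.")
print(prompt.count("Label:"))  # one per demonstration plus the query slot -> 4
```

Note that the abstract's observation about models favoring labels near the end of the sequence suggests the ordering of `demos` itself can matter in such a setup.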