Large Language Models (LLMs) have made significant strides in handling long sequences exceeding 32K tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their abilities in more nuanced, real-world scenarios. This study introduces a specialized benchmark (LIConBench) focusing on long in-context learning within the realm of extreme-label classification. We meticulously selected six datasets with a label range spanning 28 to 174 classes, covering input (few-shot demonstration) lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input and recognize the massive label space in order to make correct predictions. We evaluate 13 long-context LLMs on our benchmark. We find that the long-context LLMs perform relatively well below a token length of 20K, and their performance benefits from utilizing the long context window. However, once the context exceeds 20K tokens, the performance of most LLMs except GPT-4 drops dramatically. This suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences. Further analysis revealed a tendency among models to favor predictions for labels presented toward the end of the sequence. Their ability to reason over multiple pieces of information in a long sequence is yet to be improved. Our study reveals that long-context understanding and reasoning is still a challenging task for existing LLMs. We believe LIConBench could serve as a more realistic evaluation for future long-context LLMs.
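The evaluation setup described above can be illustrated with a small sketch. This is a hypothetical helper, not the paper's actual code: it builds a long in-context learning prompt from few-shot demonstrations that together cover the full label space, so a model must read the entire context to recover the set of valid labels before classifying the query.

```python
def build_icl_prompt(demos, query):
    """Build an extreme-label ICL prompt (hypothetical format, for illustration).

    demos: list of (text, label) pairs; collectively they should cover
           the whole label space, which can push the prompt to tens of
           thousands of tokens for 100+ classes.
    query: the text to classify, appended last with an empty label slot.
    """
    blocks = [f"Text: {text}\nLabel: {label}" for text, label in demos]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)


# Toy demonstration set; a real LIConBench-style prompt would contain
# at least one example per class (28 to 174 classes).
demos = [
    ("The router keeps dropping packets every few minutes.", "networking"),
    ("My invoice total looks wrong for last month.", "billing"),
    ("The app crashes immediately on startup.", "software"),
]
prompt = build_icl_prompt(demos, "I was charged twice this month.")
```

The prompt ends with an open `Label:` slot, so an LLM completing it must pick from the labels it has seen earlier in the context; the abstract's finding about end-of-sequence bias suggests where each class's demonstrations sit in this prompt can itself affect accuracy.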