Large Language Models (LLMs) have made significant strides in handling long sequences exceeding 32K tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their abilities in more nuanced, real-world scenarios. This study introduces a specialized benchmark (LIConBench) focusing on long in-context learning within the realm of extreme-label classification. We meticulously selected six datasets with a label range spanning 28 to 174 classes, covering input (few-shot demonstration) lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input and recognize the massive label space in order to make correct predictions. We evaluate 13 long-context LLMs on our benchmark. We find that the long-context LLMs perform relatively well below a token length of 20K, and their performance benefits from utilizing the long context window. However, once the context exceeds 20K tokens, the performance of most LLMs except GPT-4 drops dramatically. This suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences. Further analysis revealed a tendency among models to favor predictions for labels presented toward the end of the sequence. Their ability to reason over multiple pieces of information in a long sequence is yet to be improved. Our study reveals that long-context understanding and reasoning is still a challenging task for existing LLMs. We believe LIConBench could serve as a more realistic evaluation for future long-context LLMs.
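The evaluation setup described above can be illustrated with a small sketch. This is a hypothetical helper, not the paper's actual code: it builds a long in-context learning prompt from few-shot demonstrations that together cover the full label space, so a model must read the entire context to recover the set of valid labels before classifying the query.

```python
def build_icl_prompt(demos, query):
    """Build an extreme-label ICL prompt (hypothetical format, for illustration).

    demos: list of (text, label) pairs; collectively they should cover
           the whole label space, which can push the prompt to tens of
           thousands of tokens for 100+ classes.
    query: the text to classify, appended last with an empty label slot.
    """
    blocks = [f"Text: {text}\nLabel: {label}" for text, label in demos]
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)


# Toy demonstration set; a real LIConBench-style prompt would contain
# at least one example per class (28 to 174 classes).
demos = [
    ("The router keeps dropping packets every few minutes.", "networking"),
    ("My invoice total looks wrong for last month.", "billing"),
    ("The app crashes immediately on startup.", "software"),
]
prompt = build_icl_prompt(demos, "I was charged twice this month.")
```

The prompt ends with an open `Label:` slot, so an LLM completing it must pick from the labels it has seen earlier in the context; the abstract's finding about end-of-sequence bias suggests where each class's demonstrations sit in this prompt can itself affect accuracy.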