- [2024/05/15] 🔥 Marathon is accepted by the ACL 2024 Main Conference.
Marathon is a new long-context multiple-choice benchmark, built mainly on LooGLE with some original data drawn from LongBench. Context lengths can reach 200K+ tokens. Marathon comprises six tasks: Comprehension and Reasoning, Multiple Information Retrieval, Timeline Reorder, Computation, Passage Retrieval, and Short Dependency Question Answering. Each test case consists of a long context, a question, and multiple candidate options; the Large Language Model (LLM) under test must select the correct answer from the given options based on the long context.
Marathon is also available as a Hugging Face dataset: Marathon.
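If you use the Hugging Face copy, it can be loaded with the `datasets` library. A minimal sketch, assuming the dataset loads directly and exposes a `test` split (both are assumptions; adjust to the actual configuration):

```python
# Minimal sketch: load Marathon from the Hugging Face Hub.
# The split name "test" is an assumption; adjust if the dataset differs.
from datasets import load_dataset

dataset = load_dataset("Lemoncoke/Marathon", split="test")
print(dataset[0]["question"])
```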
A test case looks as follows (a toy example; the context is truncated for display):
```json
{
    "id": "7",
    "type": "comprehension_and_reasoning",
    "context": "Early life. Picardo was born in Jerez de la Frontera, in the Province of Cádiz in Andalucía, Spain on 18 June 1919. His father was Alvaro Picardo de Celis and his mother's family name was Castellón. He had four brothers, one of whom died in infancy. His father died in 1929 when Picardo was ten years old. With his mother and his brothers he moved to Madrid, Spain. [Truncated for display purposes]",
    "question": "How many people were in Picardo's family when he was twelve?",
    "options": {
        "A": "five",
        "B": "eight",
        "C": "nine",
        "D": "ten"
    },
    "length": 268760
}
```
```python
import json

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


def build_prompt(item):
    """Assemble the evaluation prompt from a Marathon test case."""
    context = item["context"]
    question = item["question"]
    options = "\n".join(f"{key}. {value}" for key, value in item["options"].items())
    prompt = "According to the following context, answer the question.\n\n"
    prompt += f"Context:\n{context}\n\n"
    prompt += f"Question:\nBased on the description above, {question}\n\n"
    prompt += f"Options:\n{options}\n\n"
    prompt += 'Please answer this question in JSON format, for example {"option": "A"}.\n'
    prompt += "Answer:\n"
    return prompt


# Load a test case and build its prompt.
data_path = "marathon.json"
with open(data_path, "r") as f:
    data = json.load(f)
item = data[0]
prompt = build_prompt(item)

# Load the model and tokenizer.
model_name_or_path = "01-ai/Yi-34B-200K"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()

# If the prompt exceeds the context window, truncate it from the middle so
# that both the instructions at the head and the question at the tail survive.
# Note: some tokenizers report a huge sentinel for `model_max_length`;
# substitute the model's real context length if that is the case.
max_new_tokens = 32
max_length = tokenizer.model_max_length - max_new_tokens
tokenized_prompt = tokenizer(prompt, truncation=False, return_tensors="pt").input_ids[0]
if len(tokenized_prompt) > max_length:
    half = max_length // 2
    prompt = tokenizer.decode(tokenized_prompt[:half], skip_special_tokens=True) \
        + tokenizer.decode(tokenized_prompt[-half:], skip_special_tokens=True)

# Generate, then decode only the newly generated tokens.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
answer = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
```
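The prompt asks for a JSON-formatted reply, but the decoded output often includes extra text around it. Below is a minimal, hedged sketch of extracting the chosen option from `answer`; the regex and the bare-letter fallback are illustrative assumptions, not Marathon's official parser.

```python
import re

def extract_option(text):
    """Pull an option letter out of a model reply such as '{"option": "A"}'."""
    match = re.search(r'"option"\s*:\s*"([A-D])"', text)
    if match:
        return match.group(1)
    fallback = re.search(r'\b([A-D])\b', text)  # fall back to a bare letter
    return fallback.group(1) if fallback else None

print(extract_option(answer))
```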
- Methods (optimization methods):
  - 🏃 Vanilla
  - 💾 RAG (Retrieval-Augmented Generation; see the sketch after this list)
  - 📐 PC (LongLLMLingua Prompt Compression)
- Embedding Models:
  - 💿 OpenAI: text-embedding-ada-002
  - 🔍 Jina: Jina-Embedding-base
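As a rough illustration of the RAG method above: chunk the long context, embed the chunks and the question, and keep only the chunks most similar to the question as the model's context. This is a minimal sketch, not the exact leaderboard pipeline; `embed` is a hypothetical stand-in for either embedding model, and the chunk size and `k` are arbitrary.

```python
import numpy as np

def top_k_chunks(context, question, embed, chunk_size=512, k=5):
    """Return the k context chunks most similar to the question by cosine similarity."""
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    chunk_vecs = np.array([embed(c) for c in chunks])  # (n_chunks, dim)
    q_vec = np.asarray(embed(question))                # (dim,)
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-8
    )
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in sorted(best)]  # preserve document order
```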
Tag | Model | Parameters | Context Window | Method | Embedding | Avg. Accuracy ⬇️ |
---|---|---|---|---|---|---|
🏃 | GPT-4 | - | 128K | 🏃 Vanilla | - | 78.59 |
💾🔍 | Yi-chat | 34B | 200K | 💾 RAG | 🔍 Jina | 63.81 |
💾💿 | Yi-chat | 34B | 200K | 💾 RAG | 💿 OpenAI | 63.56 |
💾💿 | Tulu2-DPO | 70B | 8K | 💾 RAG | 💿 OpenAI | 61.97 |
💾🔍 | Tulu2-DPO | 70B | 8K | 💾 RAG | 🔍 Jina | 61.52 |
💾🔍 | Qwen | 14B | 8K | 💾 RAG | 🔍 Jina | 58.12 |
🏃 | ChatGPT | - | 16K | 🏃 Vanilla | - | 57.37 |
🏃 | Yi-chat | 34B | 200K | 🏃 Vanilla | - | 55.91 |
💾🔍 | Beluga2 | 70B | 4K | 💾 RAG | 🔍 Jina | 55.72 |
🏃 | ChatGLM3 | 6B | 32K | 🏃 Vanilla | - | 55.05 |
💾🔍 | Zephyr | 7B | 32K | 💾 RAG | 🔍 Jina | 53.79 |
💾💿 | Qwen | 14B | 8K | 💾 RAG | 💿 OpenAI | 53.46 |
📐 | Beluga2 | 70B | 4K | 📐 PC | - | 52.29 |
💾🔍 | Mistral | 7B | 32K | 💾 RAG | 🔍 Jina | 52.04 |
💾💿 | Alfred | 40B | 8K | 💾 RAG | 💿 OpenAI | 51.35 |
💾🔍 | Alfred | 40B | 8K | 💾 RAG | 🔍 Jina | 51.24 |
💾💿 | ChatGLM3 | 6B | 32K | 💾 RAG | 💿 OpenAI | 50.99 |
💾🔍 | ChatGLM3 | 6B | 32K | 💾 RAG | 🔍 Jina | 50.60 |
💾💿 | Mistral | 7B | 32K | 💾 RAG | 💿 OpenAI | 50.18 |
💾💿 | Zephyr | 7B | 32K | 💾 RAG | 💿 OpenAI | 49.63 |
🏃 | Beluga2 | 70B | 4K | 🏃 Vanilla | - | 49.51 |
📐 | Yi | 34B | 200K | 📐 PC | - | 48.66 |
💾💿 | Beluga2 | 70B | 4K | 💾 RAG | 💿 OpenAI | 48.24 |
📐 | ChatGLM3 | 6B | 32K | 📐 PC | - | 47.91 |
📐 | Tulu2-DPO | 70B | 8K | 📐 PC | - | 46.56 |
📐 | Qwen | 14B | 8K | 📐 PC | - | 44.12 |
🏃 | Mistral | 7B | 32K | 🏃 Vanilla | - | 39.81 |
🏃 | Qwen | 14B | 8K | 🏃 Vanilla | - | 39.27 |
📐 | Alfred | 40B | 8K | 📐 PC | - | 38.82 |
🏃 | Zephyr | 7B | 32K | 🏃 Vanilla | - | 37.97 |
🏃 | Tulu2-DPO | 70B | 8K | 🏃 Vanilla | - | 37.92 |
💾🔍 | Longchat | 13B | 16K | 💾 RAG | 🔍 Jina | 37.78 |
🏃 | Alfred | 40B | 8K | 🏃 Vanilla | - | 37.31 |
📐 | Mistral | 7B | 32K | 📐 PC | - | 37.01 |
🏃 | Longchat | 13B | 16K | 🏃 Vanilla | - | 35.87 |
📐 | Longchat | 13B | 16K | 📐 PC | - | 35.61 |
📐 | Zephyr | 7B | 32K | 📐 PC | - | 30.23 |
💾💿 | Longchat | 13B | 16K | 💾 RAG | 💿 OpenAI | 29.95 |
Welcome to the Marathon Race! Online evaluation is now available at https://openbenchmark.online/marathon.
Answer File Format
The answer file should be a JSON file containing a list of 1,530 dictionaries, one per test case. Each dictionary must include at least two fields: 'id' and 'answer'. Here is a sample answer file:
```json
[
    {
        "id": "0",
        "answer": "C"
    },
    {
        "id": "1",
        "answer": "B"
    },
    {
        "id": "2",
        "answer": "B"
    },
    ...
    {
        "id": "1529",
        "answer": "C"
    }
]
```
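A minimal sketch of producing such a file from a dictionary of predictions (the `predictions` variable and the file name are illustrative):

```python
import json

# Hypothetical predictions mapping test-case id -> chosen option.
predictions = {"0": "C", "1": "B", "2": "B"}  # ... 1530 entries in a full run

answers = [{"id": qid, "answer": option} for qid, option in predictions.items()]
with open("answers.json", "w") as f:
    json.dump(answers, f, indent=2)
```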
Results File Format
The results file is a JSON file that reports the accuracy of the LLM (Large Language Model) on each of the six Marathon tasks, as well as the average accuracy across tasks (the unweighted mean of the six task accuracies). Here is a sample results file:
```json
{
    "comprehension_and_reasoning": {
        "accuracy": 0.46218487394957986,
        "correct": 165,
        "total": 357
    },
    "multiple_information_retrieval": {
        "accuracy": 0.41935483870967744,
        "correct": 143,
        "total": 341
    },
    "timeline_reorder": {
        "accuracy": 0.2894736842105263,
        "correct": 44,
        "total": 152
    },
    "computation": {
        "accuracy": 0.23711340206185566,
        "correct": 23,
        "total": 97
    },
    "passage_retrieval": {
        "accuracy": 0.49666666666666665,
        "correct": 149,
        "total": 300
    },
    "shortdep_qa": {
        "accuracy": 0.4840989399293286,
        "correct": 137,
        "total": 283
    },
    "average": 0.39814873425460573
}
```
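A minimal sketch of computing a results file in this shape, assuming hypothetical `gold` and `pred` mappings; note that `average` is the unweighted mean of the per-task accuracies:

```python
from collections import defaultdict

def score(gold, pred):
    """gold: id -> (task, correct option); pred: id -> predicted option."""
    correct, total = defaultdict(int), defaultdict(int)
    for qid, (task, option) in gold.items():
        total[task] += 1
        if pred.get(qid) == option:
            correct[task] += 1
    results = {
        task: {
            "accuracy": correct[task] / total[task],
            "correct": correct[task],
            "total": total[task],
        }
        for task in total
    }
    # Unweighted mean over the task accuracies.
    results["average"] = sum(r["accuracy"] for r in results.values()) / len(total)
    return results
```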
If you find our work useful, please cite us.
```bibtex
@article{zhang2023marathon,
    title={Marathon: A Race Through the Realm of Long Context with Large Language Models},
    author={Zhang, Lei and Li, Yunshui and Liu, Ziqiang and Liu, Junhao and Yang, Jiaxi and Yang, Min},
    url={https://huggingface.co/datasets/Lemoncoke/Marathon},
    year={2023}
}
```
When citing our work, please also consider citing the original dataset papers:
```bibtex
@misc{li2023loogle,
    title={LooGLE: Can Long-Context Language Models Understand Long Contexts?},
    author={Li, Jiaqi and Wang, Mengmeng and Zheng, Zilong and Zhang, Muhan},
    url={https://github.com/bigai-nlco/LooGLE},
    year={2023}
}
```
```bibtex
@article{bai2023longbench,
    title={LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding},
    author={Bai, Yushi and Lv, Xin and Zhang, Jiajie and Lyu, Hongchang and Tang, Jiankai and Huang, Zhidian and Du, Zhengxiao and Liu, Xiao and Zeng, Aohan and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi},
    journal={arXiv preprint arXiv:2308.14508},
    year={2023}
}
```