
Marathon: A Race Through the Realm of Long Context with Large Language Models

Release

  • [2024/05/15] 🔥 Marathon has been accepted to the ACL 2024 Main Conference.

Dataset Summary

The Marathon benchmark is a new long-context multiple-choice benchmark, built mainly on LooGLE with some original data from LongBench. Context lengths can reach 200K+ tokens. Marathon comprises six tasks: Comprehension and Reasoning, Multiple Information Retrieval, Timeline Reorder, Computation, Passage Retrieval, and Short Dependency Question Answering. Each test case consists of a long context, a question, and multiple candidate options; the Large Language Model (LLM) under test must select the correct answer from the given options based on the long context.

HuggingFace Dataset

Marathon is also available as a Hugging Face dataset: Marathon.
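
Below is a minimal sketch of loading the dataset with the Hugging Face datasets library. The repository id Lemoncoke/Marathon is taken from the dataset link in the citation section below, and the "test" split name is an assumption that may need adjusting.

from datasets import load_dataset

# Repository id taken from the Hugging Face dataset link cited below;
# the split name "test" is an assumption and may differ.
dataset = load_dataset("Lemoncoke/Marathon", split="test")

print(len(dataset))            # number of test cases
print(dataset[0]["question"])  # inspect one item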

Data Instances

A toy example of a test case looks as follows:

{
  "id": "7",
  "type": "comprehension_and_reasoning",
  "context": " Early life. Picardo was born in Jerez de la Frontera, in the Province of Cádiz in Andalucía, Spain on 18 June 1919. His father was Alvaro Picardo de Celis and his mother's family name was Castellón. He had four brothers, one of whom died in infancy. His father died in 1929 when Picardo was ten years old. With his mother and his brothers he moved to Madrid, Spain. [Truncated for display purpose] ",
  "question": "How many people were in Picardo's family when he was twelve?",
  "options": {
    "A": "five",
    "B": "eight",
    "C": "nine",
    "D": "ten"
  },
  "length": 268760
}

Inference Example

import torch
import json

from transformers import AutoTokenizer, AutoModelForCausalLM


def build_prompt(item):
    # Assemble the long context, question, and candidate options into a single
    # prompt that asks the model to answer in JSON format.
    context = item['context']
    question = item['question']
    options = '\n'.join([key + '. ' + str(value) for key, value in item['options'].items()])

    prompt = f"According to following context answer question.\n\n"
    prompt += f"Context:\n{context}\n\n"
    prompt += f"Question:\nBased on the description above, {question}?\n\n"
    prompt += f"Options:\n{options}\n\n"
    prompt += "Please answer this question with JSON format, for example {\"option\":\"A\"}.\n"
    prompt += f"Answer:\n"

    return prompt


data_path = "marathon.json"

with open(data_path, "r") as f:
    data = json.load(f)

item = data[0]
prompt = build_prompt(item)

model_name_or_path = "01-ai/Yi-34B-200K"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map='auto',
).eval()

# If the prompt exceeds the model's context window, truncate it from the middle,
# keeping the beginning and the end of the long context.
max_new_tokens = 32
max_length = tokenizer.model_max_length - max_new_tokens
tokenized_prompt = tokenizer(prompt, truncation=False, return_tensors="pt").input_ids[0]
length = len(tokenized_prompt)
if length > max_length:
    half = max_length // 2
    prompt = tokenizer.decode(tokenized_prompt[:half], skip_special_tokens=True) \
        + tokenizer.decode(tokenized_prompt[-half:], skip_special_tokens=True)

# Move the tokenized prompt to the model's device before generation.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens, excluding the prompt.
answer = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
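
Because the prompt asks the model to answer in JSON form (e.g. {"option": "A"}), the chosen option still has to be extracted from the raw generation. The sketch below is illustrative rather than the official evaluation logic: the extract_option helper and its regular-expression fallback are assumptions, and it assumes options are labeled A-D as in the toy example above.

import re

def extract_option(generation):
    # Try to parse a JSON object such as {"option": "A"} from the generation.
    match = re.search(r'\{[^{}]*\}', generation)
    if match:
        try:
            return str(json.loads(match.group(0)).get("option", "")).strip().upper()
        except json.JSONDecodeError:
            pass
    # Fallback (assumption): take the first standalone option letter in the text.
    letters = re.findall(r'\b([A-D])\b', generation)
    return letters[0] if letters else ""

prediction = {"id": item["id"], "answer": extract_option(answer)}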

Leaderboard

  • Methods (optimizing methods):
    • 🏐 Vanilla
    • 🎾 RAG (Retrieval Augmented Generation)
    • 🏀 PC (LongLLMLingua Prompt Compression)
  • Embedding Models:
    • 🍿 OpenAI: text-embedding-ada-002
    • 🍔 Jina: Jina-Embedding-base

| Tag | Model | Parameters | Context Window | Method | Embedding | Avg. Accuracy ⬆️ |
| --- | --- | --- | --- | --- | --- | --- |
| 🏐 | GPT-4 | - | 128K | 🏐 Vanilla | - | 78.59 |
| 🎾🍔 | Yi-chat | 34B | 200K | 🎾 RAG | 🍔 Jina | 63.81 |
| 🎾🍿 | Yi-chat | 34B | 200K | 🎾 RAG | 🍿 OpenAI | 63.56 |
| 🎾🍿 | Tulu2-DPO | 70B | 8K | 🎾 RAG | 🍿 OpenAI | 61.97 |
| 🎾🍔 | Tulu2-DPO | 70B | 8K | 🎾 RAG | 🍔 Jina | 61.52 |
| 🎾🍔 | Qwen | 14B | 8K | 🎾 RAG | 🍔 Jina | 58.12 |
| 🏐 | ChatGPT | - | 16K | 🏐 Vanilla | - | 57.37 |
| 🏐 | Yi-chat | 34B | 200K | 🏐 Vanilla | - | 55.91 |
| 🎾🍔 | Beluga2 | 70B | 4K | 🎾 RAG | 🍔 Jina | 55.72 |
| 🏐 | ChatGLM3 | 6B | 32K | 🏐 Vanilla | - | 55.05 |
| 🎾🍔 | Zephyr | 7B | 32K | 🎾 RAG | 🍔 Jina | 53.79 |
| 🎾🍿 | Qwen | 14B | 8K | 🎾 RAG | 🍿 OpenAI | 53.46 |
| 🏀 | Beluga2 | 70B | 4K | 🏀 PC | - | 52.29 |
| 🎾🍔 | Mistral | 7B | 32K | 🎾 RAG | 🍔 Jina | 52.04 |
| 🎾🍿 | Alfred | 40B | 8K | 🎾 RAG | 🍿 OpenAI | 51.35 |
| 🎾🍔 | Alfred | 40B | 8K | 🎾 RAG | 🍔 Jina | 51.24 |
| 🎾🍿 | ChatGLM3 | 6B | 32K | 🎾 RAG | 🍿 OpenAI | 50.99 |
| 🎾🍔 | ChatGLM3 | 6B | 32K | 🎾 RAG | 🍔 Jina | 50.60 |
| 🎾🍿 | Mistral | 7B | 32K | 🎾 RAG | 🍿 OpenAI | 50.18 |
| 🎾🍿 | Zephyr | 7B | 32K | 🎾 RAG | 🍿 OpenAI | 49.63 |
| 🏐 | Beluga2 | 70B | 4K | 🏐 Vanilla | - | 49.51 |
| 🏀 | Yi | 34B | 200K | 🏀 PC | - | 48.66 |
| 🎾🍿 | Beluga2 | 70B | 4K | 🎾 RAG | 🍿 OpenAI | 48.24 |
| 🏀 | ChatGLM3 | 6B | 32K | 🏀 PC | - | 47.91 |
| 🏀 | Tulu2-DPO | 70B | 8K | 🏀 PC | - | 46.56 |
| 🏀 | Qwen | 14B | 8K | 🏀 PC | - | 44.12 |
| 🏐 | Mistral | 7B | 32K | 🏐 Vanilla | - | 39.81 |
| 🏐 | Qwen | 14B | 8K | 🏐 Vanilla | - | 39.27 |
| 🏀 | Alfred | 40B | 8K | 🏀 PC | - | 38.82 |
| 🏐 | Zephyr | 7B | 32K | 🏐 Vanilla | - | 37.97 |
| 🏐 | Tulu2-DPO | 7B | 8K | 🏐 Vanilla | - | 37.92 |
| 🎾🍔 | Longchat | 13B | 16K | 🎾 RAG | 🍔 Jina | 37.78 |
| 🏐 | Alfred | 40B | 8K | 🏐 Vanilla | - | 37.31 |
| 🏀 | Mistral | 7B | 32K | 🏀 PC | - | 37.01 |
| 🏐 | Longchat | 13B | 16K | 🏐 Vanilla | - | 35.87 |
| 🏀 | Longchat | 13B | 16K | 🏀 PC | - | 35.61 |
| 🏀 | Zephyr | 7B | 32K | 🏀 PC | - | 30.23 |
| 🎾🍿 | Longchat | 13B | 16K | 🎾 RAG | 🍿 OpenAI | 29.95 |

Online Evaluation

Welcome to the Marathon Race! Online evaluation is now available at https://openbenchmark.online/marathon.

Answer File Format

The answer file should be a JSON file containing a list of 1530 dictionaries, one per test case. Each dictionary must include at least two fields: 'id' and 'answer'. Here is a sample answer file:

[
  {
    "id": "0",
    "answer": "C"
  },
  {
    "id": "1",
    "answer": "B"
  },
  {
    "id": "2",
    "answer": "B"
  },
  ...
   {
    "id": "1529",
    "answer": "C"
  }
]
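
A minimal sketch of assembling such an answer file is shown below; predict stands in for whatever inference routine returns an option letter for a single test case (its name, and the answers.json output path, are illustrative assumptions).

import json

# Load the local Marathon test file and collect one {"id", "answer"} record per item.
with open("marathon.json", "r") as f:
    data = json.load(f)

predictions = []
for item in data:
    option = predict(item)  # hypothetical inference routine returning e.g. "C"
    predictions.append({"id": item["id"], "answer": option})

# Write the submission file for online evaluation.
with open("answers.json", "w") as f:
    json.dump(predictions, f, indent=2)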

Results File Format

The results file is a JSON file that reports the accuracy of the LLM (Large Language Model) on each of the six Marathon tasks, as well as the average accuracy across all tasks; the average is the unweighted mean of the six per-task accuracies. Here is a sample results file:

{
    "comprehension_and_reasoning": {
        "accuracy": 0.46218487394957986,
        "correct": 165,
        "total": 357
    },
    "multiple_information_retrieval": {
        "accuracy": 0.41935483870967744,
        "correct": 143,
        "total": 341
    },
    "timeline_reorder": {
        "accuracy": 0.2894736842105263,
        "correct": 44,
        "total": 152
    },
    "computation": {
        "accuracy": 0.23711340206185566,
        "correct": 23,
        "total": 97
    },
    "passage_retrieval": {
        "accuracy": 0.49666666666666665,
        "correct": 149,
        "total": 300
    },
    "shortdep_qa": {
        "accuracy": 0.4840989399293286,
        "correct": 137,
        "total": 283
    },
    "average": 0.39814873425460573
}
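
For reference, the sketch below shows how such a results file can be computed from an answer file, assuming access to a gold mapping from item id to the correct option and task type (the gold argument and its field names are assumptions). The reported average is the unweighted mean of the six per-task accuracies, which matches the figures above.

from collections import defaultdict

def score(predictions, gold):
    # predictions: list of {"id": ..., "answer": ...} dicts (answer file format).
    # gold: assumed mapping from id to {"type": task_name, "answer": correct_option}.
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred in predictions:
        reference = gold[pred["id"]]
        total[reference["type"]] += 1
        correct[reference["type"]] += int(pred["answer"] == reference["answer"])

    results = {
        task: {
            "accuracy": correct[task] / total[task],
            "correct": correct[task],
            "total": total[task],
        }
        for task in total
    }
    # Macro average: unweighted mean of the per-task accuracies.
    accuracies = [results[task]["accuracy"] for task in results]
    results["average"] = sum(accuracies) / len(accuracies)
    return results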

Citations

If you find our work useful, please cite us.

@article{zhang2023marathon,
  title={Marathon: A Race Through the Realm of Long Context with Large Language Models},
  author={Zhang, Lei and Li, Yunshui and Liu, Ziqiang and Liu, Junhao and Yang, Jiaxi and Yang, Min},
  url={https://huggingface.co/datasets/Lemoncoke/Marathon},
  year={2023}
}

When citing our work, please also consider citing the original dataset papers.

@misc{li2023loogle,
  title={Can Long-Context Language Models Understand Long Contexts?},
  author={Li, Jiaqi and Wang, Mengmeng and Zheng, Zilong and Zhang, Muhan},
  url={https://github.com/bigai-nlco/LooGLE},
  year={2023}
}

@article{bai2023longbench,
  title={LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding},
  author={Bai, Yushi and Lv, Xin and Zhang, Jiajie and Lyu, Hongchang and Tang, Jiankai and Huang, Zhidian and Du, Zhengxiao and Liu, Xiao and Zeng, Aohan and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi},
  journal={arXiv preprint arXiv:2308.14508},
  year={2023}
}

Star History

Star History Chart
