LiveBench

🏆 Leaderboard 💻 Data 📝 Paper

Leaderboard as of 25th June 2024:

[leaderboard image]

Update 25th June 2024: we removed a reasoning sub-task, house_traversal, because ambiguous parsing led to misleading results. We will replace it in a future release.

Introduction

Introducing LiveBench: a benchmark for LLMs designed with test set contamination and objective evaluation in mind.

LiveBench has the following properties:

  • LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses.
  • Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge.
  • LiveBench currently contains a set of 18 diverse tasks across 6 categories, and we will release new, harder tasks over time.

We will evaluate your model! Open an issue or email us at livebench.ai@gmail.com!

Installation Quickstart

Tested on Python 3.10

cd LiveBench
pip install torch packaging # These need to be installed prior to other dependencies.
pip install -e .

Note about fschat: the fschat package version on PyPI (i.e., lmsys/fastchat) is currently out of date, so we strongly recommend running pip uninstall fschat before the installation commands above; pip install -e . will then automatically install a more recent commit of fastchat.
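
For example, the sequence might look like this (a minimal sketch; the -y flag just skips pip's confirmation prompt):

pip uninstall -y fschat  # remove the stale PyPI release first
pip install -e .         # re-resolves dependencies, pulling the newer fastchat commit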

Note for CPU users: if installing on a CPU-only machine (e.g., to run API models only), you will need to manually remove flash-attn from the requirements list in pyproject.toml.
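
One way to do that from the command line (a sketch assuming GNU sed and that flash-attn appears on its own line in pyproject.toml; edit the file by hand if in doubt):

sed -i '/flash-attn/d' pyproject.toml  # drop the flash-attn requirement (on macOS, use: sed -i '' ...)
pip install -e .                       # then install as usual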

Our repo is adapted from FastChat's excellent llm_judge module, and it also contains code from LiveCodeBench and IFEval.

Usage

cd livebench

To generate model answers on LiveBench, run:

python gen_model_answer.py --model-path /path/to/Mistral-7B-Instruct-v0.2/ --model-id Mistral-7B-Instruct-v0.2 --dtype bfloat16 --bench-name live_bench

For API-based models, first set the appropriate API key and then run gen_api_answer.py. We currently support the following APIs: OpenAI, Anthropic, Mistral, Cohere, and Gemini. To run all of LiveBench on an api_model_name, run:

export OPENAI_API_KEY=<your_key>
export ANTHROPIC_API_KEY=<your_key>
export MISTRAL_API_KEY=<your_key>
export CO_API_KEY=<your_key>
export GEMINI_API_KEY=<your_key>
export DEEPSEEK_API_KEY=<your_key>
python gen_api_answer.py --model <api_model_name> --bench-name live_bench

To generate model answers with VLLM or other arbitrary APIs matching the OpenAI API format, run:

export LIVEBENCH_API_KEY=<your API key if needed. Usually not needed for VLLM>
python gen_api_answer.py --model <api_model_name> --bench-name live_bench --api-base <your endpoint. Often, for VLLM, this is http://localhost:8000/v1>
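
If you need to stand up such an endpoint yourself, one option is vLLM's OpenAI-compatible server (a sketch, not part of this repo; assumes vLLM is installed separately and uses Mistral-7B-Instruct-v0.2 purely as an example model):

# In a separate shell: serve the model behind an OpenAI-compatible API on port 8000
python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --port 8000

# Then point LiveBench at that endpoint
python gen_api_answer.py --model mistralai/Mistral-7B-Instruct-v0.2 --bench-name live_bench --api-base http://localhost:8000/v1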

To score the model outputs:

python gen_ground_truth_judgment.py --bench-name live_bench

To show all the results:

python show_livebench_results.py

You may want to run these commands on just some models. To run any of the above Python files (gen_model_answer.py, gen_api_answer.py, gen_ground_truth_judgment.py, or show_livebench_results.py) for specific models, use the following argument styles:

python gen_model_answer.py          --bench-name live_bench --model-path /path/to/Mistral-7B-Instruct-v0.2/ --model-id Mistral-7B-Instruct-v0.2 --dtype bfloat16 
python gen_api_answer.py            --bench-name live_bench --model gpt-4-turbo
python gen_ground_truth_judgment.py --bench-name live_bench --model-list Mistral-7B-Instruct-v0.2 Llama-2-7b-chat-hf claude-3-opus-20240229
python show_livebench_results.py    --bench-name live_bench --model-list Mistral-7B-Instruct-v0.2 Llama-2-7b-chat-hf claude-3-opus-20240229

Or, you may want to show results for a specific category or task of LiveBench by using the --bench-name argument. Here, we run show_livebench_results.py on just the web_of_lies_v2 task:

python show_livebench_results.py --bench-name live_bench/reasoning/web_of_lies_v2
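
A category-level path follows the same pattern; for example, restricting the results to the reasoning category should look like this (assuming the category prefix is accepted the same way task paths are):

python show_livebench_results.py --bench-name live_bench/reasoning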

To optionally download all model answers and judgments from the 34 models on the leaderboard, use

python download_leaderboard.py
python show_livebench_results.py # will now display the results for all models on the leaderboard

To optionally download question.jsonl files (for inspection), use

python download_questions.py
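
For a quick look at one of the downloaded files, something like the following works (the path below is illustrative; the actual location depends on where download_questions.py writes its output):

head -n 1 data/live_bench/reasoning/web_of_lies_v2/question.jsonl | python -m json.tool  # pretty-print the first question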

Data

The questions for each of the categories are available via the 💻 Data link above.

Also available are the model answers and the model judgments.

Documentation

Here we provide our dataset documentation; this information is also available in our paper.

Citation

@misc{livebench,
  author    = {White, Colin and Dooley, Samuel and Roberts, Manley and Pal, Arka and Feuer, Ben and Jain, Siddhartha and Shwartz-Ziv, Ravid and Jain, Neel and Saifullah, Khalid and Naidu, Siddartha and Hegde, Chinmay and LeCun, Yann and Goldstein, Tom and Neiswanger, Willie and Goldblum, Micah},
  title     = {LiveBench: A Challenging, Contamination-Free LLM Benchmark},
  url       = {https://livebench.ai},
  year      = {2024},
}
