<h1 align="center">🚀 RocketEval 🚀</h1>
<h3 align="center">Efficiently Evaluate LLMs using Google Colab</h3>

This notebook runs the full [RocketEval](https://github.com/Joinn99/RocketEval-ICLR) evaluation pipeline on a single Tesla T4 GPU in Google Colab.

RocketEval is an efficient automated evaluation framework for Large Language Models (LLMs) that uses a checklist-based grading approach. This notebook demonstrates how to:

1. Set up the evaluation environment on Google Colab
2. Run evaluations using a lightweight model (*Qwen2.5-0.5B-Instruct*) as the judge
3. Evaluate multiple LLM responses on the *MT-Bench* dataset
4. Generate evaluation scores and rankings efficiently with limited compute resources

The entire process is optimized to run on a single Tesla T4 GPU, making it accessible for users with basic GPU resources.
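
To make the checklist-based idea concrete, here is a minimal sketch. It is not RocketEval's actual implementation. It assumes the judge returns, for each checklist question, a probability that the answer is "yes", and that a response's score is the (optionally weighted) mean of those probabilities rescaled to a 1-10 range. The function name `checklist_score`, the weights, and the score range are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def checklist_score(
    yes_probs: Sequence[float],
    weights: Optional[Sequence[float]] = None,
    low: float = 1.0,
    high: float = 10.0,
) -> float:
    """Aggregate per-item 'yes' probabilities into one score (illustrative only)."""
    if not yes_probs:
        raise ValueError("checklist must contain at least one item")
    if weights is None:
        # Assumption: every checklist item counts equally by default.
        weights = [1.0] * len(yes_probs)
    if len(weights) != len(yes_probs):
        raise ValueError("weights and yes_probs must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean of the probabilities, in [0, 1].
    mean = sum(p * w for p, w in zip(yes_probs, weights)) / total
    # Rescale the mean to the [low, high] score range.
    return low + (high - low) * mean


# Four hypothetical checklist items for one response.
probs = [0.9, 0.8, 0.1, 0.6]
print(round(checklist_score(probs), 2))
```

Because each item is a short yes/no question, a small judge model only has to make many easy decisions rather than one hard holistic one, which is the intuition behind using a lightweight judge in this notebook.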

------

First, clone the GitHub repository and download the evaluation data from Hugging Face. The repository includes:

- Evaluation scripts and utilities
- Example benchmark datasets (MT-Bench, AlpacaEval, etc.)
- Model configurations for both API and local deployment
- Documentation and examples

In [1]:
# Clone the repo
!git clone https://github.com/Joinn99/RocketEval-ICLR.git
# Download the data
!git clone https://huggingface.co/datasets/Joinn/RocketEval && mv RocketEval RocketEval-ICLR/data

Cloning into 'RocketEval-ICLR'...
remote: Enumerating objects: 43, done.
remote: Counting objects: 100% (43/43), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 43 (delta 4), reused 43 (delta 4), pack-reused 0 (from 0)
Receiving objects: 100% (43/43), 40.25 KiB | 6.71 MiB/s, done.
Resolving deltas: 100% (4/4), done.
Cloning into 'RocketEval'...
remote: Enumerating objects: 607, done.
remote: Counting objects: 100% (604/604), done.
remote: Compressing objects: 100% (597/597), done.
remote: Total 607 (delta 92), reused 0 (delta 0), pack-reused 3 (from 1)
Receiving objects: 100% (607/607), 296.68 MiB | 5.94 MiB/s, done.
Resolving deltas: 100% (92/92), done.
Updating files: 100% (573/573), done.
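
Before moving on, it can help to confirm that both clones landed where the later cells expect them. The following is a small sketch, assuming it is run from the directory that contains `RocketEval-ICLR`. The helper `missing_paths` is hypothetical; the paths it checks (`data`, `config`, `src`, `requirements.txt`) are the ones this notebook refers to.

```python
from pathlib import Path
from typing import List


def missing_paths(root: str, expected: List[str]) -> List[str]:
    """Return the expected sub-paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]


# 'data' is created by the `mv` step above; the other entries are used by later cells.
missing = missing_paths(
    "RocketEval-ICLR", ["data", "config", "src", "requirements.txt"]
)
print("Missing:", missing or "nothing")
```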


------

Change into the repository directory and install the dependencies.

We need to:
1. Change to the `RocketEval-ICLR` directory
2. Install the required packages from `requirements.txt`

The requirements.txt file contains all the necessary Python packages including:
- `vllm` for local model deployment
- `openai` for API access
- `scikit-learn` for evaluation metrics
- Other utility packages

In [2]:
%cd RocketEval-ICLR
!pip install -r requirements.txt

/content/RocketEval-ICLR
Collecting datasets (from -r requirements.txt (line 2))
  Downloading datasets-3.3.1-py3-none-any.whl.metadata (19 kB)
Collecting vllm>=0.7.2 (from -r requirements.txt (line 3))
  Downloading vllm-0.7.2-cp38-abi3-manylinux1_x86_64.whl.metadata (12 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->-r requirements.txt (line 2))
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets->-r requirements.txt (line 2))
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets->-r requirements.txt (line 2))
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting blake3 (from vllm>=0.7.2->-r requirements.txt (line 3))
  Downloading blake3-1.0.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting fastapi!=0.113.*,!=0.114.0,>=0.107.0 (from vllm>=0.7.2->-r requirements.txt (line 3))
  D
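
After the installation finishes, a quick check can confirm that the key packages listed above are actually available. This sketch uses only the standard library (`importlib.metadata`); the helper name `installed_versions` is hypothetical, and the package names are the ones mentioned in this section.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["vllm", "openai", "scikit-learn"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any package is reported as missing, rerunning the installation cell (or restarting the Colab runtime and reinstalling) is usually the first thing to try.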

------

Next, we'll configure the evaluation parameters. This notebook demonstrates a minimal evaluation setup with the following key arguments:

- `dataset`: [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench/tree/main), a compact LLM evaluation dataset
- `generator`: [GPT-4o](https://chatgpt.com/) for checklist generation (this notebook uses pre-generated checklists, so no API call is made)
- `judge`: [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct), chosen for efficient evaluation on a Tesla T4 GPU
- `train_test`: Enables split evaluation using model sets from:
  - Training: `config/rankings/mt-bench_train.json`
  - Testing: `config/rankings/mt-bench_test.json`
- `mode`: "offline" to utilize the local vLLM framework
- `gpu_ids`: "0" for single GPU execution
- `offline_config`: vLLM engine configuration file path

In [3]:
# Arguments for the evaluation task
default_args="""--dataset mt-bench --generator gpt-4o --judge Qwen2.5-0.5B-Instruct --train_test --mode offline --gpu_ids 0 --offline_config config/offline/colab.yaml""".split()
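
To see how this argument string turns into named settings, the sketch below parses a similar string with a reduced `argparse` parser. It is only an illustration: the helper `parse_eval_args` is hypothetical and covers just the flags described above, while the full parser used by the notebook is defined in the next cell. `shlex.split` is used here instead of `str.split` so that quoted values would also be handled.

```python
import argparse
import shlex


def parse_eval_args(arg_string: str) -> argparse.Namespace:
    """Parse a subset of the evaluation flags (reduced, illustrative parser)."""
    parser = argparse.ArgumentParser(description="Reduced parser (illustrative)")
    parser.add_argument("--dataset", default="mt-bench")
    parser.add_argument("--generator", default="gpt-4o")
    parser.add_argument("--judge", default="gpt-4o")
    parser.add_argument("--train_test", action="store_true")
    parser.add_argument("--mode", choices=["api", "offline"])
    parser.add_argument("--gpu_ids", default="0")
    parser.add_argument("--offline_config", default="config/offline/default.yaml")
    return parser.parse_args(shlex.split(arg_string))


args = parse_eval_args(
    "--dataset mt-bench --judge Qwen2.5-0.5B-Instruct --train_test "
    "--mode offline --offline_config config/offline/colab.yaml"
)
# Flags that were omitted (here: --generator and --gpu_ids) fall back to defaults.
print(vars(args))
```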

------

Next, run the evaluation task.

We'll use the default arguments defined above to:
1. Load the MT-Bench dataset
2. Initialize the Qwen2.5-0.5B-Instruct judge model using vLLM
3. Run the evaluation pipeline including:
   - Checklist-based grading
   - Score aggregation
   - Model ranking

In [4]:
#@title Running Evaluation Task

import sys
import os

rocketeval_dir = os.path.join(os.path.abspath(os.curdir), "src")
sys.path.insert(0, rocketeval_dir)

import time
import openai
import logging
import argparse

from rich.logging import RichHandler
from rich.console import Console
from rich.markdown import Markdown

from rocketeval.data.data_loader import load_target_models
from rocketeval.task import checklist_task, judgment_task, ranking_task, score_task

logging.getLogger('RootLogger').setLevel(logging.INFO)

logging.basicConfig(
    level=logging.INFO,
    format="%(message)s",
    datefmt="[%X]",
    handlers=[RichHandler()],
    force=True
)

parser = argparse.ArgumentParser(description="RocketEval Task Runner")

# Data
parser.add_argument("--data_dir", default="data/", help="Data directory")
parser.add_argument("--config_dir", default="config/", help="Config directory")

# Model
parser.add_argument("--dataset", default="mt-bench", help="Dataset name")
parser.add_argument("--generator", default="gpt-4o", help="Generator model")
parser.add_argument("--judge", default="gpt-4o", help="Judge model")
parser.add_argument("--labeler", default="gpt-4o", help="Labeler judge that provides labels")
parser.add_argument("--train_test", action="store_true", help="Use specific train-test split")
parser.add_argument("--gen_checklist", action="store_true", help="Generate checklist")

# Running Mode
parser.add_argument("--mode", choices=["api", "offline"], help="Running mode, set to 'api' to use OpenAI API, set to 'offline' to use local models through vLLM")
parser.add_argument("--instant_api", action="store_true", help="Run using instant API.")
parser.add_argument("--api_parallel_size", type=int, default=1, help="Number of parallel API calls, adjust based on your API rate limit.")
parser.add_argument("--offline_config", default="config/offline/default.yaml", help="Path to vLLM config file")

# Others
parser.add_argument("--resume_from_task_id", default=None, help="Task ID")
parser.add_argument("--keep_batch_files", action="store_true", help="Keep batch files")
parser.add_argument("--gpu_ids", default="0", help="GPU IDs, split by comma")

args = parser.parse_args(default_args)
kwargs = vars(args)

if args.mode == "api":
    client = openai.OpenAI()
else:
    client = None
    # Offline mode stores batch files under <data_dir>/batch
    os.makedirs(os.path.join(args.data_dir, "batch"), exist_ok=True)

task_id = f"{args.dataset}_{args.judge}_{int(time.time())}" \
    if args.resume_from_task_id is None \
    else args.resume_from_task_id

train_model_names = load_target_models(
    data_dir=args.data_dir,
    config_dir=args.config_dir,
    dataset_name=args.dataset,
    split="train" if args.train_test else "full"
)

test_model_names = load_target_models(
    data_dir=args.data_dir,
    config_dir=args.config_dir,
    dataset_name=args.dataset,
    split="test" if args.train_test else "full"
)

logger = logging.getLogger("rich")

start_message = f"""[underline bold red on white blink]RocketEval[/]
[bold yellow on red blink] Task Information[/]
- Dataset: "{args.dataset}"
- Judge: "{args.judge}"
- Labeler: "{args.labeler}"
- Task ID: "{task_id}"
""".replace("\t", "")

logger.info(start_message, extra={"markup": True})



if args.gen_checklist:
    # I - Checklist Creation
    logger.info(
        "[bold yellow on red blink]I. Checklist Creation[/]", extra={"markup": True}
    )

    checklist_task(
        client=client,
        task_id=task_id,
        **kwargs
    )

    logger.info(
        f"[yellow]Checklist Creation completed.[/]\n\n",
        extra={"markup": True}
    )
else:
    logger.info(
        f"[bold yellow on red blink]Checklist Creation skipped.[/]", extra={"markup": True},
    )

# II - Judgment Creation
logger.info(
    "[bold yellow on red blink]II. Judgment Creation[/]", extra={"markup": True}
)

judgment_task(
    model_names=train_model_names + test_model_names,
    client=client,
    task_id=task_id,
    **kwargs
)

logger.info(
    f"[yellow]Judgment Creation completed.[/]\n\n",
    extra={"markup": True}
)


# III - Score Creation
logger.info(
    f"[bold yellow on red blink]III. Score Creation[/]",
    extra={"markup": True}
)

score_task(
    train_model_names=train_model_names,
    test_model_names=test_model_names,
    task_id=task_id,
    **kwargs
)

logger.info(
    f"[yellow]Score Creation completed.[/]\n\n",
    extra={"markup": True}
)


# IV - Ranking
logger.info(
    f"[bold yellow on red blink]IV. Ranking[/]",
    extra={"markup": True}
)

ranking = ranking_task(
    model_names=test_model_names,
    **kwargs
)

Console().print(Markdown(ranking.to_markdown()), justify="center")

logger.info(
    f"[yellow]Ranking completed.[/]\n\n",
    extra={"markup": True}
)


# Finish
logger.info(
    f"[bold yellow on red blink]RocketEval Completed[/]",
    extra={"markup": True}
)

INFO 02-19 06:24:54 __init__.py:190] Automatically detected platform cuda.


100%|██████████| 20/20 [00:01<00:00, 11.34it/s]


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


config.json:   0%|          | 0.00/659 [00:00<?, ?B/s]

INFO 02-19 06:25:25 config.py:542] This model supports multiple tasks: {'classify', 'reward', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
INFO 02-19 06:25:25 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2400, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-0.5B-Instruct, num_sc

tokenizer_config.json:   0%|          | 0.00/7.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

INFO 02-19 06:25:30 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-19 06:25:30 cuda.py:227] Using XFormers backend.
INFO 02-19 06:25:31 model_runner.py:1110] Starting to load model Qwen/Qwen2.5-0.5B-Instruct...
INFO 02-19 06:25:31 weight_utils.py:252] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

INFO 02-19 06:25:55 weight_utils.py:297] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-19 06:25:57 model_runner.py:1115] Loading model weights took 0.9277 GB
INFO 02-19 06:25:58 worker.py:267] Memory profiling takes 1.30 seconds
INFO 02-19 06:25:58 worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 02-19 06:25:58 worker.py:267] model weights take 0.93GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 2.77GiB; the rest of the memory reserved for KV Cache is 9.53GiB.
INFO 02-19 06:25:59 executor_base.py:110] # CUDA blocks: 52027, # CPU blocks: 21845
INFO 02-19 06:25:59 executor_base.py:115] Maximum concurrency for 2400 tokens per request: 346.85x
INFO 02-19 06:26:05 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_u

Capturing CUDA graph shapes: 100%|██████████| 67/67 [01:04<00:00,  1.04it/s]


INFO 02-19 06:27:09 model_runner.py:1562] Graph capturing finished in 64 secs, took 0.23 GiB
INFO 02-19 06:27:09 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 72.42 seconds
INFO 02-19 06:27:12 chat_utils.py:332] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.


Processed prompts: 100%|██████████| 24120/24120 [06:33<00:00, 61.22it/s, est. speed input: 51786.82 toks/s, output: 61.22 toks/s]
100%|██████████| 20/20 [00:00<00:00, 42.28it/s]


  0%|          | 0/10 [00:00<?, ?it/s]

100%|██████████| 10/10 [00:00<00:00, 202.24it/s]
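
The vLLM memory lines in the log above can be reproduced with simple arithmetic. The sketch below recomputes the KV-cache budget and the reported maximum concurrency from the logged numbers. The block size of 16 tokens is an assumption (it is vLLM's usual default and does not appear in the truncated log), and the helper names are hypothetical.

```python
def kv_cache_budget_gib(
    total_gib: float,
    utilization: float,
    weights_gib: float,
    non_torch_gib: float,
    activation_gib: float,
) -> float:
    """Memory left for the KV cache after the other consumers are subtracted."""
    usable = total_gib * utilization
    return usable - weights_gib - non_torch_gib - activation_gib


def max_concurrency(num_gpu_blocks: int, block_size: int, max_seq_len: int) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return num_gpu_blocks * block_size / max_seq_len


# Values copied from the log; block_size=16 is an assumed default.
budget = kv_cache_budget_gib(14.74, 0.90, 0.93, 0.05, 2.77)
concurrency = max_concurrency(num_gpu_blocks=52027, block_size=16, max_seq_len=2400)

print(f"KV cache budget: {budget:.2f} GiB")
print(f"Max concurrency: {concurrency:.2f}x")
```

With the logged values this gives roughly 9.52 GiB (the log reports 9.53 GiB because its inputs are rounded) and 346.85x, which matches the log. The large headroom is why a 0.5B judge can process thousands of prompts quickly on a T4.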


------

After deriving the scores and rankings, you can also export the data to the [LMSYS Chatbot Arena](https://lmarena.ai/) format for further analysis using the official [notebook](https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH).

In [5]:
from rocketeval.tools.export import chatbot_arena_match
from rocketeval.data.data_loader import load_target_models

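# Reload the test-split model names so this cell does not silently depend on the previous one
test_model_names = load_target_models(dataset_name="mt-bench", split="test")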
result = chatbot_arena_match(dataset_name="mt-bench", judge="Qwen2.5-0.5B-Instruct", model_names=test_model_names)
result.to_json("matches.jsonl", orient="records", lines=True)

result.head(5)

| | model_a | model_b | winner | judge | turn | anony | language | tstamp | conv_metadata | is_code | is_refusal | dedup_tag | category_tag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gpt-4o-mini | yi-large-preview | model_b | Qwen2.5-0.5B-Instruct | 1 | True | English | 580830 | {'sum_user_tokens': 8, 'sum_assistant_a_tokens... | False | False | {'high_freq': False, 'sampled': True} | {'if_v0.1': {'if': True, 'score': 4}, 'math_v0... |
| 1 | gpt-4o-mini | Meta-Llama3-70B-Instruct | tie | Qwen2.5-0.5B-Instruct | 1 | True | English | 580238 | {'sum_user_tokens': 8, 'sum_assistant_a_tokens... | False | False | {'high_freq': False, 'sampled': True} | {'if_v0.1': {'if': True, 'score': 4}, 'math_v0... |
| 2 | gpt-4 | gpt-4o-mini | model_b | Qwen2.5-0.5B-Instruct | 1 | True | English | 525352 | {'sum_user_tokens': 8, 'sum_assistant_a_tokens... | False | False | {'high_freq': False, 'sampled': True} | {'if_v0.1': {'if': True, 'score': 4}, 'math_v0... |
| 3 | Yi-1.5-34B-Chat | gpt-4o-mini | model_a | Qwen2.5-0.5B-Instruct | 1 | True | English | 250561 | {'sum_user_tokens': 8, 'sum_assistant_a_tokens... | False | False | {'high_freq': False, 'sampled': True} | {'if_v0.1': {'if': True, 'score': 4}, 'math_v0... |
| 4 | gpt-4o-mini | Meta-Llama-3-8B-Instruct | model_b | Qwen2.5-0.5B-Instruct | 1 | True | English | 61569 | {'sum_user_tokens': 8, 'sum_assistant_a_tokens... | False | False | {'high_freq': False, 'sampled': True} | {'if_v0.1': {'if': True, 'score': 4}, 'math_v0... |
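
As a quick first look at exported match records, the sketch below computes a simple per-model win rate, counting a tie as half a win for each side. This is not the Chatbot Arena rating method and not part of RocketEval; the helper `win_rates` is hypothetical. It relies only on the `model_a`, `model_b`, and `winner` columns shown above, and the sample data is the five rows from that preview.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def win_rates(matches: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Win rate per model; a tie counts as half a win for both sides."""
    points: Dict[str, float] = defaultdict(float)
    games: Dict[str, int] = defaultdict(int)
    for match in matches:
        a, b, winner = match["model_a"], match["model_b"], match["winner"]
        games[a] += 1
        games[b] += 1
        if winner == "model_a":
            points[a] += 1.0
        elif winner == "model_b":
            points[b] += 1.0
        else:
            # Anything else (e.g. "tie") is treated as a draw.
            points[a] += 0.5
            points[b] += 0.5
    return {model: points[model] / games[model] for model in games}


# The five preview rows above. To use the full export instead, read matches.jsonl
# line by line and json.loads() each line into a dict.
sample = [
    {"model_a": "gpt-4o-mini", "model_b": "yi-large-preview", "winner": "model_b"},
    {"model_a": "gpt-4o-mini", "model_b": "Meta-Llama3-70B-Instruct", "winner": "tie"},
    {"model_a": "gpt-4", "model_b": "gpt-4o-mini", "winner": "model_b"},
    {"model_a": "Yi-1.5-34B-Chat", "model_b": "gpt-4o-mini", "winner": "model_a"},
    {"model_a": "gpt-4o-mini", "model_b": "Meta-Llama-3-8B-Instruct", "winner": "model_b"},
]

for model, rate in sorted(win_rates(sample).items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rate:.2f}")
```

Five matches are far too few to draw conclusions; the point is only to show the shape of the data before handing it to a proper rating analysis.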


You can try more powerful judge models and add more test models to obtain a more reliable and comprehensive evaluation.

------

If you find this work useful in your research, please consider citing the following paper:
```bibtex
@inproceedings{wei2025rocketeval,
    title={RocketEval: Efficient automated {LLM} evaluation via grading checklist},
    author={Tianjun Wei and Wei Wen and Ruizhi Qiao and Xing Sun and Jianghong Ma},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=zJjzNj6QUe}
}
```