<h1 align="center">🚀 RocketEval 🚀</h1>
<h3 align="center">Efficiently Evaluate LLMs using Google Colab</h3>

This notebook runs a complete [RocketEval](https://github.com/Joinn99/RocketEval-ICLR) evaluation on a single Tesla T4 GPU in Google Colab.

RocketEval is an efficient automated evaluation framework for Large Language Models (LLMs) that uses a checklist-based grading approach. This notebook demonstrates how to:

1. Set up the evaluation environment on Google Colab
2. Run evaluations using a lightweight model (*Qwen2.5-0.5B-Instruct*) as the judge
3. Evaluate multiple LLM responses on the *MT-Bench* dataset
4. Generate evaluation scores and rankings efficiently with limited compute resources

The entire process is optimized to run on a single Tesla T4 GPU, making it accessible for users with basic GPU resources.

------

First, clone the GitHub repository and download the benchmark data from Hugging Face. The repository includes:

- Evaluation scripts and utilities
- Example benchmark datasets (MT-Bench, AlpacaEval, etc.)
- Model configurations for both API and local deployment
- Documentation and examples

In [None]:
# Clone the repo
!git clone https://github.com/Joinn99/RocketEval-ICLR.git
# Download the data
!git clone https://huggingface.co/datasets/Joinn/RocketEval && mv RocketEval RocketEval-ICLR/data

------

Next, change into the repository directory and install the required dependencies.

We need to:
1. Change to the RocketEval directory
2. Install the required packages from requirements.txt

The requirements.txt file contains all the necessary Python packages including:
- `vllm` for local model deployment
- `openai` for API access
- `scikit-learn` for evaluation metrics
- Other utility packages

In [None]:
%cd RocketEval-ICLR
!pip install -r requirements.txt
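
Since the rest of the notebook assumes a GPU runtime, it can help to confirm that one is attached before going further. The following is a minimal sketch, not part of RocketEval: the `gpu_names` helper is hypothetical and simply asks `nvidia-smi` (if it is on the path) for the names of the visible GPUs.

```python
import shutil
import subprocess

def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are found."""
    # nvidia-smi is absent on CPU-only runtimes
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

print(gpu_names() or "No GPU detected; select a GPU runtime in Colab.")
```

If the list comes back empty, switch the Colab runtime type to a GPU before running the evaluation cell.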

------

Next, we'll configure the evaluation parameters. This notebook demonstrates a minimal evaluation setup with the following key arguments:

- `dataset`: [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench/tree/main), a compact LLM evaluation dataset
- `generator`: [GPT-4o](https://chatgpt.com/) for checklist generation (we'll use pre-generated checklists)
- `judge`: [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct), chosen for efficient evaluation on a Tesla T4 GPU
- `train_test`: Enables split evaluation using model sets from:
  - Training: `config/rankings/mt-bench_train.json`
  - Testing: `config/rankings/mt-bench_test.json`
- `mode`: "offline" to utilize the local vLLM framework
- `gpu_ids`: "0" for single GPU execution
- `offline_config`: vLLM engine configuration file path

In [None]:
# Arguments for the evaluation task
default_args="""--dataset mt-bench --generator gpt-4o --judge Qwen2.5-0.5B-Instruct --train_test --mode offline --gpu_ids 0 --offline_config config/offline/colab.yaml""".split()
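
The string above is split on whitespace into an argv-style list, which the evaluation cell then passes to `argparse`. Here is a minimal sketch of that hand-off, using a reduced stand-in parser: `demo_parser` and its three options are illustrative only and do not replace the full parser defined later. `shlex.split` is shown as an alternative to `str.split` for cases where a value contains quoted spaces.

```python
import argparse
import shlex

# Reduced stand-in for the notebook's parser (illustrative only)
demo_parser = argparse.ArgumentParser(description="Argument parsing demo")
demo_parser.add_argument("--dataset", default="mt-bench")
demo_parser.add_argument("--judge", default="gpt-4o")
demo_parser.add_argument("--train_test", action="store_true")

arg_string = "--dataset mt-bench --judge Qwen2.5-0.5B-Instruct --train_test"

# str.split() is enough when no value contains whitespace;
# shlex.split() additionally honours shell-style quoting.
demo_args = demo_parser.parse_args(shlex.split(arg_string))

print(demo_args.dataset, demo_args.judge, demo_args.train_test)
```

Passing an explicit list to `parse_args` keeps the parser from reading the notebook kernel's own command-line arguments.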

------

Now we can run the evaluation task.

We'll use the default arguments defined above to:
1. Load the MT-Bench dataset
2. Initialize the Qwen2.5-0.5B-Instruct judge model using vLLM
3. Run the evaluation pipeline including:
   - Checklist-based grading
   - Score aggregation
   - Model ranking
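
As a rough mental model of the three pipeline stages listed above, here is a toy sketch. It is not RocketEval's implementation (that lives in `rocketeval.task`): the data, the function names, and the use of a plain mean for aggregation are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical judgments: for each model and each query, the judge's
# probability that each checklist item is satisfied ("yes").
judgments = {
    "model_a": [[0.9, 0.8, 0.7], [0.6, 0.9]],
    "model_b": [[0.4, 0.5, 0.3], [0.7, 0.2]],
}

def response_score(item_probs):
    """Aggregate checklist-item probabilities into one score (plain mean)."""
    return mean(item_probs)

def model_score(per_query_probs):
    """Average the per-response scores over all queries."""
    return mean(response_score(p) for p in per_query_probs)

toy_scores = {name: model_score(q) for name, q in judgments.items()}

# Rank models from highest to lowest aggregated score
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(toy_ranking)
```

The real pipeline may aggregate checklist judgments differently; the sketch only shows how item-level judgments can roll up into per-model scores and a ranking.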

In [None]:
#@title Running Evaluation Task

import sys
import os

rocketeval_dir = os.path.join(os.path.abspath(os.curdir), "src")
sys.path.insert(0, rocketeval_dir)

import time
import openai
import logging
import argparse

from rich.logging import RichHandler
from rich.console import Console
from rich.markdown import Markdown

from rocketeval.data.data_loader import load_target_models
from rocketeval.task import checklist_task, judgment_task, ranking_task, score_task

logging.getLogger('RootLogger').setLevel(logging.INFO)

logging.basicConfig(
    level=logging.INFO,
    format="%(message)s",
    datefmt="[%X]",
    handlers=[RichHandler()],
    force=True
)

parser = argparse.ArgumentParser(description="RocketEval Task Runner")

# Data
parser.add_argument("--data_dir", default="data/", help="Data directory")
parser.add_argument("--config_dir", default="config/", help="Config directory")

# Model
parser.add_argument("--dataset", default="mt-bench", help="Dataset name")
parser.add_argument("--generator", default="gpt-4o", help="Generator model")
parser.add_argument("--judge", default="gpt-4o", help="Judge model")
parser.add_argument("--labeler", default="gpt-4o", help="Labeler judge that provides labels")
parser.add_argument("--train_test", action="store_true", help="Use specific train-test split")
parser.add_argument("--gen_checklist", action="store_true", help="Generate checklist")

# Running Mode
parser.add_argument("--mode", choices=["api", "offline"], help="Running mode, set to 'api' to use OpenAI API, set to 'offline' to use local models through vLLM")
parser.add_argument("--instant_api", action="store_true", help="Run using instant API.")
parser.add_argument("--api_parallel_size", type=int, default=1, help="Number of parallel API calls, adjust based on your API rate limit.")
parser.add_argument("--offline_config", default="config/offline/default.yaml", help="Path to vLLM config file")

# Others
parser.add_argument("--resume_from_task_id", default=None, help="Task ID")
parser.add_argument("--keep_batch_files", action="store_true", help="Keep batch files")
parser.add_argument("--gpu_ids", default="0", help="GPU IDs, split by comma")

args = parser.parse_args(default_args)
kwargs = vars(args)

if args.mode == "api":
    client = openai.OpenAI()
else:
    client = None
    # Make sure the batch directory exists for offline runs
    os.makedirs(os.path.join(args.data_dir, "batch"), exist_ok=True)

# Reuse the given task ID when resuming; otherwise derive a new one
task_id = f"{args.dataset}_{args.judge}_{int(time.time())}" \
    if args.resume_from_task_id is None \
    else args.resume_from_task_id

train_model_names = load_target_models(
    data_dir=args.data_dir,
    config_dir=args.config_dir,
    dataset_name=args.dataset,
    split="train" if args.train_test else "full"
)

test_model_names = load_target_models(
    data_dir=args.data_dir,
    config_dir=args.config_dir,
    dataset_name=args.dataset,
    split="test" if args.train_test else "full"
)

logger = logging.getLogger("rich")

start_message = f"""[underline bold red on white blink]RocketEval[/]
[bold yellow on red blink] Task Information[/]
- Dataset: "{args.dataset}"
- Judge: "{args.judge}"
- Labeler: "{args.labeler}"
- Task ID: "{task_id}"
""".replace("\t", "")

logger.info(start_message, extra={"markup": True})


if args.gen_checklist:
    # I - Checklist Creation
    logger.info(
        "[bold yellow on red blink]I. Checklist Creation[/]", extra={"markup": True}
    )

    checklist_task(
        client=client,
        task_id=task_id,
        **kwargs
    )

    logger.info(
        f"[yellow]Checklist Creation completed.[/]\n\n",
        extra={"markup": True}
    )
else:
    logger.info(
        f"[bold yellow on red blink]Checklist Creation skipped.[/]", extra={"markup": True},
    )

# II - Judgment Creation
logger.info(
    "[bold yellow on red blink]II. Judgment Creation[/]", extra={"markup": True}
)

judgment_task(
    model_names=train_model_names + test_model_names,
    client=client,
    task_id=task_id,
    **kwargs
)

logger.info(
    f"[yellow]Judgment Creation completed.[/]\n\n",
    extra={"markup": True}
)


# III - Score Creation
logger.info(
    f"[bold yellow on red blink]III. Score Creation[/]",
    extra={"markup": True}
)

score_task(
    train_model_names=train_model_names,
    test_model_names=test_model_names,
    task_id=task_id,
    **kwargs
)

logger.info(
    f"[yellow]Score Creation completed.[/]\n\n",
    extra={"markup": True}
)


# IV - Ranking
logger.info(
    f"[bold yellow on red blink]IV. Ranking[/]",
    extra={"markup": True}
)

ranking = ranking_task(
    model_names=test_model_names,
    **kwargs
)

Console().print(Markdown(ranking.to_markdown()), justify="center")

logger.info(
    f"[yellow]Ranking completed.[/]\n\n",
    extra={"markup": True}
)


# Finish
logger.info(
    f"[bold yellow on red blink]RocketEval Completed[/]",
    extra={"markup": True}
)

------

After deriving the scores and rankings, you can also export the data to the [LMSYS Chatbot Arena](https://lmarena.ai/) format for further analysis using the official [notebook](https://colab.research.google.com/drive/1KdwokPjirkTmpO_P1WByFNFiqxWQquwH).

In [None]:
from rocketeval.tools.export import chatbot_arena_match
from rocketeval.data.data_loader import load_target_models

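# Reload the test-split model names so this cell does not depend on earlier state
test_model_names = load_target_models(dataset_name="mt-bench", split="test")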
result = chatbot_arena_match(dataset_name="mt-bench", judge="Qwen2.5-0.5B-Instruct", model_names=test_model_names)
result.to_json("matches.jsonl", orient="records", lines=True)

result.head(5)
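
The exported file is in JSON Lines format (one JSON object per line), so it can be read back with the standard library alone. Here is a minimal sketch: the `read_jsonl` helper is hypothetical and makes no assumption about which columns `chatbot_arena_match` produces.

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records

# Only attempt the read if the export cell has already written the file
if Path("matches.jsonl").exists():
    matches = read_jsonl("matches.jsonl")
    print(f"Loaded {len(matches)} match records")
```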

You can try more powerful judge models and increase the number of test models to obtain more comprehensive and reliable evaluation results.

------

If you find this work useful in your research, please consider citing the following paper:
```bibtex
@inproceedings{wei2025rocketeval,
    title={RocketEval: Efficient automated {LLM} evaluation via grading checklist},
    author={Tianjun Wei and Wei Wen and Ruizhi Qiao and Xing Sun and Jianghong Ma},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=zJjzNj6QUe}
}
```