Mind2Web 2

Mind2Web 2 is a benchmark for agentic search systems, featuring an Agent-as-a-Judge methodology for comprehensive, rigorous, and reliable assessment.

[Figure: Mind2Web 2 overview]

Mind2Web 2 features realistic and diverse long-horizon web search tasks and a novel Agent-as-a-Judge framework to evaluate complex, time-varying, and citation-backed answers.

🔗 Links

  • Paper (arXiv): https://arxiv.org/abs/2506.21506

🆕 Updates

  • 2025/06/26: The GitHub repo is live. The manuscript is now on arXiv.

⚙️ Environment Setup

Run the following commands to set up the Python environment:

# Create and activate conda environment
conda create -n mind2web2 python=3.11
conda activate mind2web2

# Install the package in development mode
pip install -e .

# Install browsers for Playwright
playwright install
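
After setup, a quick sanity check can confirm the install worked. A hedged sketch (it assumes the editable install pulled in Playwright as a Python dependency):

# Verify that the package and Playwright import cleanly
import mind2web2
import playwright
print("mind2web2 and playwright import OK")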

📁 Repo Structure

Mind2Web-2/
├── dataset/                  # Evaluation data and results
├── mind2web2/                # Main package
│   ├── api_tools/            # External API tools
│   ├── llm_client/           # LLM client implementations
│   ├── utils/                # Utility functions
│   ├── eval_runner.py        # Evaluation execution
│   ├── eval_toolkit.py       # Evaluation toolkit and utilities
│   ├── evaluator.py          # Core evaluation logic
│   └── verification_tree.py  # Rubric tree implementation
├── pyproject.toml            # Package configuration
└── README.md                 # This file

🚀 Run Evaluation

1. Prepare Your Data

Place your answer folder under the dataset/ directory. The folder should contain your agent's responses in Markdown format, as illustrated below.
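
For illustration, a hypothetical layout (the folder and file names are placeholders, not a required naming scheme):

dataset/
└── your_answer_folder/
    ├── task_0001.md
    └── task_0002.md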

2. Set up API Keys

Configure the necessary API keys for evaluation:

# Set up environment variables for OpenAI API
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"

# (Optional) Environment variables for Azure OpenAI
export AZURE_OPENAI_API_KEY="YOUR_AZURE_OPENAI_API_KEY"
export AZURE_OPENAI_ENDPOINT_URL="YOUR_AZURE_OPENAI_ENDPOINT_URL"
export AZURE_OPENAI_API_VERSION="2025-03-01-preview"

# Tool APIs for certain evaluation tasks
export GOOGLE_MAPS_API_KEY="YOUR_GOOGLE_MAPS_API_KEY"
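
Before launching an evaluation, it can help to fail fast on missing keys. A minimal sketch (the variable names come from the commands above; which keys are strictly required depends on your setup):

# Check that the API keys from the setup step are present in the environment
import os

required = ["OPENAI_API_KEY"]
optional = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT_URL",
            "AZURE_OPENAI_API_VERSION", "GOOGLE_MAPS_API_KEY"]

missing = [key for key in required if not os.environ.get(key)]
if missing:
    raise SystemExit(f"missing required API keys: {missing}")
for key in optional:
    if not os.environ.get(key):
        print(f"note: optional key {key} is not set")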

3. Precache Webpages

Before running the evaluation, we recommend precaching the webpages to reduce latency; loading webpages on the fly during evaluation is very inefficient.

Use the following script to precache:

# Coming Soon!
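
Until the official script is released, here is a rough sketch of the idea using Playwright's Python API; the URL list and cache directory below are hypothetical, not the repo's actual layout:

# Hedged sketch only -- not the official precaching script
from pathlib import Path
from playwright.sync_api import sync_playwright

urls = ["https://example.com"]       # hypothetical: URLs cited in an answer
cache_dir = Path("dataset/cache")    # hypothetical cache location
cache_dir.mkdir(parents=True, exist_ok=True)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    for i, url in enumerate(urls):
        page.goto(url, wait_until="networkidle")
        (cache_dir / f"{i}.html").write_text(page.content(), encoding="utf-8")
    browser.close()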

We also provide a lightweight script to fix errors in the precached webpages (for example, pages blocked by human-verification checks):

# Coming Soon!
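
In the same spirit, a hedged sketch of what such a fix-up pass might look like (the marker strings and cache layout are hypothetical):

# Flag cached pages that look blocked so they can be re-fetched
from pathlib import Path

BLOCK_MARKERS = ("verify you are human", "captcha", "access denied")

for page_file in Path("dataset/cache").glob("*.html"):
    text = page_file.read_text(encoding="utf-8", errors="ignore").lower()
    if any(marker in text for marker in BLOCK_MARKERS):
        print(f"possibly blocked, consider re-fetching: {page_file}")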

4. Run Evaluation

Execute the evaluation process:

# Coming Soon!

📝 Citation

If you find this work useful, please consider starring our repo and citing our paper:

@misc{gou2025mind2web2,
    title = {Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge}, 
    author = {Boyu Gou and Zanming Huang and Yuting Ning and Yu Gu and Michael Lin and Weijian Qi and Andrei Kopanev and Botao Yu and Bernal Jiménez Gutiérrez and Yiheng Shu and Chan Hee Song and Jiaman Wu and Shijie Chen and Hanane Nour Moussa and Tianshu Zhang and Jian Xie and Yifei Li and Tianci Xue and Zeyi Liao and Kai Zhang and Boyuan Zheng and Zhaowei Cai and Viktor Rozgic and Morteza Ziyadi and Huan Sun and Yu Su},
    year = {2025},
    eprint = {2506.21506},
    archivePrefix = {arXiv},
    primaryClass = {cs.AI}
}
