Mind2Web 2

Mind2Web 2 is a benchmark for agentic search systems, featuring an Agent-as-a-Judge methodology for comprehensive, rigorous, and reliable assessment.

[Figure: Mind2Web 2 overview]

Mind2Web 2 features realistic and diverse long-horizon web search tasks and a novel Agent-as-a-Judge framework to evaluate complex, time-varying, and citation-backed answers.

🔗 Links

  • Paper (arXiv): https://arxiv.org/abs/2506.21506

🆕 Updates

  • 2025/06/26: The GitHub repo is live. The manuscript is now on arXiv.

⚙️ Environment Setup

Run the following commands to set up the Python environment:

# Create and activate conda environment
conda create -n mind2web2 python=3.11
conda activate mind2web2

# Install the package in development mode
pip install -e .

# Install browsers for Playwright
playwright install
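
After setup, a quick sanity check can confirm the install worked. A hedged sketch (it assumes the editable install pulled in Playwright as a Python dependency):

# Verify that the package and Playwright import cleanly
import mind2web2
import playwright
print("mind2web2 and playwright import OK")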

📁 Repo Structure

Mind2Web-2/
├── dataset/                  # Evaluation data and results
├── mind2web2/                # Main package
│   ├── api_tools/            # External API tools
│   ├── llm_client/           # LLM client implementations
│   ├── utils/                # Utility functions
│   ├── eval_runner.py        # Evaluation execution
│   ├── eval_toolkit.py       # Evaluation toolkit and utilities
│   ├── evaluator.py          # Core evaluation logic
│   └── verification_tree.py  # Rubric tree implementation
├── pyproject.toml            # Package configuration
└── README.md                 # This file

🚀 Run Evaluation

1. Prepare Your Data

Place your answer folder under the dataset/ directory. The folder should contain your agent's responses in Markdown format, as illustrated below.
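
For illustration, a hypothetical layout (the folder and file names are placeholders, not a required naming scheme):

dataset/
└── your_answer_folder/
    ├── task_0001.md
    └── task_0002.md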

2. Set up API Keys

Configure the necessary API keys for evaluation:

# Set up environment variables for OpenAI API
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY"

# (Optional) Environment variables for Azure OpenAI
export AZURE_OPENAI_API_KEY="YOUR_AZURE_OPENAI_API_KEY"
export AZURE_OPENAI_ENDPOINT_URL="YOUR_AZURE_OPENAI_ENDPOINT_URL"
export AZURE_OPENAI_API_VERSION="2025-03-01-preview"

# Tool APIs for certain evaluation tasks
export GOOGLE_MAPS_API_KEY="YOUR_GOOGLE_MAPS_API_KEY"
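
Before launching an evaluation, it can help to fail fast on missing keys. A minimal sketch (the variable names come from the commands above; which keys are strictly required depends on your setup):

# Check that the API keys from the setup step are present in the environment
import os

required = ["OPENAI_API_KEY"]
optional = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT_URL",
            "AZURE_OPENAI_API_VERSION", "GOOGLE_MAPS_API_KEY"]

missing = [key for key in required if not os.environ.get(key)]
if missing:
    raise SystemExit(f"missing required API keys: {missing}")
for key in optional:
    if not os.environ.get(key):
        print(f"note: optional key {key} is not set")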

3. Precache Webpages

Before running the evaluation, we recommend precaching the webpages to reduce latency; loading webpages on the fly during evaluation is very inefficient.

Use the following script to precache:

# Coming Soon!
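
Until the official script is released, here is a rough sketch of the idea using Playwright's Python API; the URL list and cache directory below are hypothetical, not the repo's actual layout:

# Hedged sketch only -- not the official precaching script
from pathlib import Path
from playwright.sync_api import sync_playwright

urls = ["https://example.com"]       # hypothetical: URLs cited in an answer
cache_dir = Path("dataset/cache")    # hypothetical cache location
cache_dir.mkdir(parents=True, exist_ok=True)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    for i, url in enumerate(urls):
        page.goto(url, wait_until="networkidle")
        (cache_dir / f"{i}.html").write_text(page.content(), encoding="utf-8")
    browser.close()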

We also provide a lightweight script to fix errors in the precached webpages (for example, pages blocked by human-verification checks):

# Coming Soon!
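
In the same spirit, a hedged sketch of what such a fix-up pass might look like (the marker strings and cache layout are hypothetical):

# Flag cached pages that look blocked so they can be re-fetched
from pathlib import Path

BLOCK_MARKERS = ("verify you are human", "captcha", "access denied")

for page_file in Path("dataset/cache").glob("*.html"):
    text = page_file.read_text(encoding="utf-8", errors="ignore").lower()
    if any(marker in text for marker in BLOCK_MARKERS):
        print(f"possibly blocked, consider re-fetching: {page_file}")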

4. Run Evaluation

Execute the evaluation process:

# Coming Soon!

📝 Citation

If you find this work useful, please consider starring our repo and citing our paper:

@misc{gou2025mind2web2,
    title = {Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge}, 
    author = {Boyu Gou and Zanming Huang and Yuting Ning and Yu Gu and Michael Lin and Weijian Qi and Andrei Kopanev and Botao Yu and Bernal Jiménez Gutiérrez and Yiheng Shu and Chan Hee Song and Jiaman Wu and Shijie Chen and Hanane Nour Moussa and Tianshu Zhang and Jian Xie and Yifei Li and Tianci Xue and Zeyi Liao and Kai Zhang and Boyuan Zheng and Zhaowei Cai and Viktor Rozgic and Morteza Ziyadi and Huan Sun and Yu Su},
    year = {2025},
    eprint = {2506.21506},
    archivePrefix = {arXiv},
    primaryClass = {cs.AI}
}
