An enhanced version of WebVoyager with methodological improvements for web agent evaluations
EmergenceWebVoyager builds upon the original WebVoyager benchmark to address methodological limitations in web agent evaluations. This project implements principles outlined in our paper "Towards Methodological Consistency in WebAgent Evaluations" to provide a more rigorous, consistent, and reproducible framework for assessing web agent capabilities.
- Improved Methodological Framework: Standardized evaluation protocols that ensure consistent assessment across different web agents
- Task Assertions: Every task includes a set of explicit assertions that guide evaluators in judging whether the task was completed
- Dynamic Task Generation: Enhanced task instantiation process that creates time-appropriate benchmarks
- Structured Annotation System: Purpose-built annotation tool for consistent human evaluation
- Transparent Leaderboard: Open, verifiable results with video evidence of agent performance
- Leaderboard Home - View overall rankings and performance metrics
- Leaderboard Viewer - Detailed examination of individual agent performance
Check out the execution videos on the leaderboard pages linked above. These videos provide transparent evidence of agent performance and enable verification of evaluation results.
Got feedback or found a bug? Hop into our Discord!
EmergenceWebVoyager provides a comprehensive suite of web navigation tasks designed to evaluate web agents across various dimensions of capability. The benchmark addresses several methodological limitations identified in previous web agent evaluations:
- Temporal Relevance: Tasks that automatically update to remain contextually appropriate over time
- Reproducibility: Standardized testing protocol with complete task trajectories
- Comprehensive Metrics: Multi-dimensional evaluation beyond a binary success/failure outcome
- Transparent Evaluation: Open leaderboard with verifiable execution videos
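To make "multi-dimensional evaluation" concrete, here is a minimal sketch of how per-task results could be aggregated into several metrics instead of a single pass/fail bit. The record fields (`success`, `steps`, `assertions_passed`, `assertions_total`) are assumptions for illustration, not the benchmark's actual schema.

```python
# Aggregate hypothetical per-task run records into multi-dimensional metrics.
# Field names are illustrative assumptions, not the benchmark's real schema.

def score_runs(runs):
    """Return success rate, average step count, and assertion pass rate."""
    n = len(runs)
    return {
        "success_rate": sum(r["success"] for r in runs) / n,
        "avg_steps": sum(r["steps"] for r in runs) / n,
        "assertion_pass_rate": sum(
            r["assertions_passed"] / r["assertions_total"] for r in runs
        ) / n,
    }

runs = [
    {"success": True, "steps": 12, "assertions_passed": 3, "assertions_total": 3},
    {"success": False, "steps": 30, "assertions_passed": 1, "assertions_total": 3},
]
metrics = score_runs(runs)
```

Reporting several axes at once makes it visible when an agent technically succeeds but does so inefficiently, or fails while still satisfying most assertions.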
Each benchmark task consists of:
- A templated intent with dynamic instantiation parameters
- Clear success criteria in the form of assessment questions
- Reference answers that illustrate possible successful outcomes
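The three components above might look like the following for a single task. This is an illustrative sketch only: the field names and the `{{next_friday}}` placeholder are assumptions, not the benchmark's actual task schema.

```python
# Hypothetical task entry: a templated intent, assessment questions that act
# as success criteria, and reference answers. All field names are assumed.
task = {
    "template_id": "flight-search-001",  # hypothetical identifier
    "intent_template": "Find a flight from {origin} to {destination} on {date}.",
    "parameters": {"origin": "SFO", "destination": "JFK", "date": "{{next_friday}}"},
    "assessment_questions": [
        "Did the agent reach a results page for the requested route?",
        "Does the selected flight depart on the requested date?",
    ],
    "reference_answers": [
        "Any flight on the requested route and date is acceptable."
    ],
}
```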
```shell
# Clone the repository
git clone https://github.com/emergenceai/EmergenceWebVoyager.git
cd EmergenceWebVoyager

# Create and activate virtual environment
python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
```

The benchmark includes a task instantiation script that generates time-appropriate versions of the task templates:
```shell
cd tasks
python instantiate_tasks.py
```

This will create a new JSON file with instantiated tasks using current dates and context-appropriate parameters.
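The core idea behind time-appropriate instantiation can be sketched as follows; the actual `instantiate_tasks.py` may work differently, and the `{{next_friday}}` placeholder name is an assumption for illustration.

```python
# Sketch of dynamic task instantiation: replace a date placeholder in a task
# template with a date computed relative to "today", so re-generated tasks
# stay contextually appropriate. Placeholder name is a hypothetical example.
from datetime import date, timedelta

def next_friday(today=None):
    """Return the next Friday strictly after `today` (or today's date)."""
    today = today or date.today()
    return today + timedelta(days=(4 - today.weekday()) % 7 or 7)

def instantiate(template, today=None):
    """Fill the assumed {{next_friday}} placeholder with a concrete ISO date."""
    return template.replace("{{next_friday}}", next_friday(today).isoformat())

result = instantiate("Book a table for {{next_friday}}.", today=date(2024, 1, 1))
```

Regenerating tasks this way keeps intents like "book a flight next Friday" meaningful no matter when the benchmark is run.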
The benchmark includes a web-based annotation tool for consistent human evaluation:
```shell
cd AnnotationTool
python main.py
```

Visit http://localhost:8000 in your browser to access the annotation interface.
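A structured annotation tool is useful because it pins each human judgment to a specific assertion rather than a gut-feel verdict. The sketch below shows the kind of record such a tool could collect; every field name here is an assumption, not the tool's actual data format.

```python
# Hypothetical annotation record: one verdict per task assertion, plus notes.
# Field names are illustrative assumptions, not the tool's real format.
annotation = {
    "task_id": "flight-search-001",       # hypothetical task identifier
    "annotator": "annotator-1",
    "assertion_verdicts": [True, True, False],  # one per assessment question
    "notes": "Agent found the route but picked the wrong date.",
}

# A task passes overall only if every assertion is satisfied.
overall_pass = all(annotation["assertion_verdicts"])
```

Deriving the overall verdict from per-assertion verdicts keeps annotators consistent with each other and makes disagreements traceable to a specific criterion.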
- `tasks/` - Contains task templates and instantiation scripts
- `AnnotationTool/` - Web-based tool for human evaluation of agent performance
- `leaderboard/` - Interactive leaderboard interface for viewing results