An enhanced version of WebVoyager with methodological improvements for web agent evaluations
EmergenceWebVoyager builds upon the original WebVoyager benchmark to address methodological limitations in web agent evaluations. This project implements principles outlined in our paper "Towards Methodological Consistency in WebAgent Evaluations" to provide a more rigorous, consistent, and reproducible framework for assessing web agent capabilities.
- Improved Methodological Framework: Standardized evaluation protocols that ensure consistent assessment across different web agents
- Task Assertions: Every task includes a set of explicit assertions that guide evaluators in judging whether the task was completed
- Dynamic Task Generation: Enhanced task instantiation process that creates time-appropriate benchmarks
- Structured Annotation System: Purpose-built annotation tool for consistent human evaluation
- Transparent Leaderboard: Open, verifiable results with video evidence of agent performance
- Leaderboard Home - View overall rankings and performance metrics
- Leaderboard Viewer - Detailed examination of individual agent performance
Check out the execution videos on the leaderboard pages linked above. These videos provide transparent evidence of agent performance and enable verification of evaluation results.
Got feedback or found a bug? Hop into our Discord!
EmergenceWebVoyager provides a comprehensive suite of web navigation tasks designed to evaluate web agents across various dimensions of capability. The benchmark addresses several methodological limitations identified in previous web agent evaluations:
- Temporal Relevance: Tasks that automatically update to remain contextually appropriate over time
- Reproducibility: Standardized testing protocol with complete task trajectories
- Comprehensive Metrics: Multi-dimensional evaluation beyond a binary success/failure outcome
- Transparent Evaluation: Open leaderboard with verifiable execution videos
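To make "multi-dimensional evaluation" concrete, here is a minimal sketch of how per-task results could be aggregated into several metrics instead of a single pass/fail bit. The record fields (`success`, `steps`, `assertions_passed`, `assertions_total`) are assumptions for illustration, not the benchmark's actual schema.

```python
# Aggregate hypothetical per-task run records into multi-dimensional metrics.
# Field names are illustrative assumptions, not the benchmark's real schema.

def score_runs(runs):
    """Return success rate, average step count, and assertion pass rate."""
    n = len(runs)
    return {
        "success_rate": sum(r["success"] for r in runs) / n,
        "avg_steps": sum(r["steps"] for r in runs) / n,
        "assertion_pass_rate": sum(
            r["assertions_passed"] / r["assertions_total"] for r in runs
        ) / n,
    }

runs = [
    {"success": True, "steps": 12, "assertions_passed": 3, "assertions_total": 3},
    {"success": False, "steps": 30, "assertions_passed": 1, "assertions_total": 3},
]
metrics = score_runs(runs)
```

Reporting several axes at once makes it visible when an agent technically succeeds but does so inefficiently, or fails while still satisfying most assertions.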
Each benchmark task consists of:
- A templated intent with dynamic instantiation parameters
- Clear success criteria in the form of assessment questions
- Reference answers that illustrate possible successful outcomes
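The three components above might look like the following for a single task. This is an illustrative sketch only: the field names and the `{{next_friday}}` placeholder are assumptions, not the benchmark's actual task schema.

```python
# Hypothetical task entry: a templated intent, assessment questions that act
# as success criteria, and reference answers. All field names are assumed.
task = {
    "template_id": "flight-search-001",  # hypothetical identifier
    "intent_template": "Find a flight from {origin} to {destination} on {date}.",
    "parameters": {"origin": "SFO", "destination": "JFK", "date": "{{next_friday}}"},
    "assessment_questions": [
        "Did the agent reach a results page for the requested route?",
        "Does the selected flight depart on the requested date?",
    ],
    "reference_answers": [
        "Any flight on the requested route and date is acceptable."
    ],
}
```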
```shell
# Clone the repository
git clone https://github.com/emergenceai/EmergenceWebVoyager.git
cd EmergenceWebVoyager

# Create and activate virtual environment
python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
```

The benchmark includes a task instantiation script that generates time-appropriate versions of the task templates:
```shell
cd tasks
python instantiate_tasks.py
```

This will create a new JSON file with instantiated tasks using current dates and context-appropriate parameters.
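The core idea behind time-appropriate instantiation can be sketched as follows; the actual `instantiate_tasks.py` may work differently, and the `{{next_friday}}` placeholder name is an assumption for illustration.

```python
# Sketch of dynamic task instantiation: replace a date placeholder in a task
# template with a date computed relative to "today", so re-generated tasks
# stay contextually appropriate. Placeholder name is a hypothetical example.
from datetime import date, timedelta

def next_friday(today=None):
    """Return the next Friday strictly after `today` (or today's date)."""
    today = today or date.today()
    return today + timedelta(days=(4 - today.weekday()) % 7 or 7)

def instantiate(template, today=None):
    """Fill the assumed {{next_friday}} placeholder with a concrete ISO date."""
    return template.replace("{{next_friday}}", next_friday(today).isoformat())

result = instantiate("Book a table for {{next_friday}}.", today=date(2024, 1, 1))
```

Regenerating tasks this way keeps intents like "book a flight next Friday" meaningful no matter when the benchmark is run.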
The benchmark includes a web-based annotation tool for consistent human evaluation:
```shell
cd AnnotationTool
python main.py
```

Visit http://localhost:8000 in your browser to access the annotation interface.
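A structured annotation tool is useful because it pins each human judgment to a specific assertion rather than a gut-feel verdict. The sketch below shows the kind of record such a tool could collect; every field name here is an assumption, not the tool's actual data format.

```python
# Hypothetical annotation record: one verdict per task assertion, plus notes.
# Field names are illustrative assumptions, not the tool's real format.
annotation = {
    "task_id": "flight-search-001",       # hypothetical task identifier
    "annotator": "annotator-1",
    "assertion_verdicts": [True, True, False],  # one per assessment question
    "notes": "Agent found the route but picked the wrong date.",
}

# A task passes overall only if every assertion is satisfied.
overall_pass = all(annotation["assertion_verdicts"])
```

Deriving the overall verdict from per-assertion verdicts keeps annotators consistent with each other and makes disagreements traceable to a specific criterion.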
- `tasks/` - Contains task templates and instantiation scripts
- `AnnotationTool/` - Web-based tool for human evaluation of agent performance
- `leaderboard/` - Interactive leaderboard interface for viewing results