The real APIs integrated into this benchmark were carefully selected from the public-apis repository based on the following criteria:
- Reliability: APIs with high uptime and stable service
- Accessibility: Free or freemium APIs that don't require complex authentication
- Diversity: Coverage of different domains and functionality types
- Documentation: Well-documented APIs with clear parameter specifications
- Rate Limits: Reasonable rate limits suitable for benchmark evaluation
We expand the benchmark by incorporating real-world APIs alongside the existing simulated tools:
- Real API Tasks: Tasks that interact with actual public APIs to test agents' capabilities in real-world scenarios. The tasks are given in `task_library_realAPI.json`.
- Real API Tools: Integrated a curated selection of public APIs from the public-apis repository. The tools are given in `tool_registry_realAPI.json`.
The real APIs were carefully selected to provide diverse functionality while maintaining reliability and accessibility for benchmark evaluation.
This version introduces several key improvements compared to the previous version:
The task categories have been renamed for better clarity and semantic meaning:
- `single_tool_task` → `content_analysis`
- `two_step_task` → `batch_processing`
- `multi_step_processing` → `data_multistep_processing`
- `network_integration` → `api_data_retrieval`
- `complex_workflow` → `advanced_processing`
These new names better reflect the actual functionality and purpose of each task category, making the benchmark more intuitive for researchers and practitioners.
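Results or scripts keyed on the old category names can be migrated with a small lookup table. The following is a minimal sketch built from the renames listed above; the helper name `migrate_category` is hypothetical and not part of the benchmark code.

```python
# Old-to-new task category names, taken from the rename list above.
CATEGORY_RENAMES = {
    "single_tool_task": "content_analysis",
    "two_step_task": "batch_processing",
    "multi_step_processing": "data_multistep_processing",
    "network_integration": "api_data_retrieval",
    "complex_workflow": "advanced_processing",
}


def migrate_category(name: str) -> str:
    """Return the new category name (hypothetical helper).

    Old names are translated; names that are already new pass through
    unchanged; anything else is rejected so typos are not silently kept.
    """
    if name in CATEGORY_RENAMES:
        return CATEGORY_RENAMES[name]
    if name in CATEGORY_RENAMES.values():
        return name
    raise ValueError(f"unknown task category: {name!r}")
```

Raising on unknown names, rather than returning them unchanged, is a design choice for this sketch: it makes stale or misspelled categories visible during migration.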
- Tools → `tool_registry.json` (Paper "Tool Registry").
- Tasks (with four reference workflows) → `simulated_tasks_enhanced/task_library_enhanced_v3_{difficulty}_with_workflows.json` (Paper "Reference Workflows", "Workflow Prompt Generation", "MDP-BASED OPTIMAL WORKFLOW GENERATION").
- `biased` variants → supplemental analysis dataset versions (for robustness/sampling comparisons).

This repository contains the supplementary materials submitted with the paper:
- Tasks (task specifications and reference workflows): a set of JSON files under the `tasks/` directory.
- Tools (tool registry / probabilistic tool behavior): the `tool_registry.json` at the repository root.
The sections below explain how these files map to the paper and provide a unified naming scheme for easier review and reproduction.
- Tool Registry (Tools) → Paper “Benchmark Setup” → “Tool Registry”, consistent with “Benchmark Construction > Tool Library Generation”. Error models and dependencies are detailed in the paper and appendices.
- Tasks and Reference Workflows (Tasks + Workflows) → Paper “Benchmark Setup > Benchmark Data Organization” → “Task Specifications” and “Reference Workflows”. The four prompt variants correspond to “Workflow Prompt Generation” and “MDP-Based Optimal Workflow Generation”.
- `tool_registry.json`
  - Tool registry of APIs, including parameter templates, return schemas, and probabilistic error models (INVALID_INPUT, OPERATION_FAILED, TIMEOUT, CALCULATION_ERROR, OVERFLOW).
  - Corresponds to the “Tool Registry” implementation in the paper.
- `tasks/*.json`
  - Task sets, organized by difficulty and whether reference workflows are included:
    - Files with the suffix `_with_workflows.json`: extend task specifications with four prompt variants (Baseline / Chain-of-Thought / Optimal Workflow / Flawed Workflow).
    - Files without that suffix: contain only Task Specifications.
    - File names that include `biased`: indicate a biased sampling/configuration variant for supplemental analysis (does not affect the main definitions in the paper).
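One way to picture a probabilistic error model is as a per-call draw over the five error codes named above. The sketch below is illustrative only: the probability table, the `sample_outcome` helper, and the `"SUCCESS"` label are assumptions made for this example, not the schema or logic of `tool_registry.json`, which is specified in the paper and appendices.

```python
import random

# Error codes named in the tool registry description.
ERROR_CODES = (
    "INVALID_INPUT",
    "OPERATION_FAILED",
    "TIMEOUT",
    "CALCULATION_ERROR",
    "OVERFLOW",
)


def sample_outcome(error_probs: dict, rng: random.Random) -> str:
    """Draw one simulated call outcome (hypothetical helper).

    error_probs maps an error code to its assumed probability; the
    remaining probability mass is treated as a successful call.
    """
    unknown = set(error_probs) - set(ERROR_CODES)
    if unknown:
        raise ValueError(f"unknown error codes: {sorted(unknown)}")
    if sum(error_probs.values()) > 1.0 + 1e-9:
        raise ValueError("error probabilities must sum to at most 1")

    draw = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for code, prob in error_probs.items():
        cumulative += prob
        if draw < cumulative:
            return code
    return "SUCCESS"


# Illustrative probabilities only; a seeded generator keeps runs reproducible.
rng = random.Random(0)
outcomes = [sample_outcome({"TIMEOUT": 0.1, "INVALID_INPUT": 0.05}, rng) for _ in range(5)]
```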
- Tools → `tool_registry.json` (Paper “Tool Registry”).
- Tasks (without workflows) → `tasks/task_specs_v3_{difficulty}.json` (Paper “Task Specifications”).
- Tasks (with four reference workflows) → `tasks/task_specs_v3_{difficulty}_with_workflows.json` (Paper “Reference Workflows”, “Workflow Prompt Generation”, “MDP-BASED OPTIMAL WORKFLOW GENERATION”).
- `biased` variants → supplemental analysis dataset versions (for robustness/sampling comparisons).
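The naming scheme above can be expressed as a small path builder and classifier. This is a minimal sketch: the helper names `task_file` and `describe` are hypothetical, and the difficulty value `"easy"` in the usage lines is an assumed example, since the set of difficulty labels is not listed here.

```python
from pathlib import PurePosixPath


def task_file(difficulty: str, with_workflows: bool = False) -> PurePosixPath:
    """Build a task file path following the tasks/task_specs_v3_* pattern."""
    suffix = "_with_workflows" if with_workflows else ""
    return PurePosixPath("tasks") / f"task_specs_v3_{difficulty}{suffix}.json"


def describe(filename: str) -> dict:
    """Classify a task file from its name alone, without reading it."""
    name = PurePosixPath(filename).name
    return {
        # Files ending in _with_workflows.json carry the four prompt variants.
        "has_workflows": name.endswith("_with_workflows.json"),
        # Any file name containing "biased" is a supplemental variant.
        "biased": "biased" in name,
    }


# "easy" is an assumed difficulty label used only for illustration.
specs_path = task_file("easy")
workflows_path = task_file("easy", with_workflows=True)
```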