Pi3AI/TOBench

TOBench: A Task-Oriented Omni-Modal Agent Benchmark Harness for General Real-World MCP Tasks

TOBench is a benchmark that measures the tool-calling ability of omni-modal agents on general real-world tasks.

Testing Environment

Code Environment

Set up Python and core dependencies:

uv venv .venv --python 3.12
source .venv/bin/activate
export PYTHONPATH="$PWD:$PYTHONPATH"
uv pip install -r requirements.txt

MCP Environment

Install MCP runtime prerequisites (Node.js/npm and local setup):

bash ./scripts/env/mcp_env_install.sh

Verify MCP servers are available:

uv run python server_tools/check_installation.py

If any MCP server fails the check, rebuild it manually from the repository root:

cd server_tools/local_servers/asr_mcp_server
uv sync
cd ../../..

# If you encounter a 403 error, set edge-tts to the latest version in "server_tools/local_servers/edge_tts_mcp_server/pyproject.toml"
cd server_tools/local_servers/edge_tts_mcp_server
uv sync
cd ../../..

cd server_tools/local_servers/excel-mcp-server
uv sync
cd ../../..

cd server_tools/local_servers/Google-Search-MCP
npm install
npm run build
cd ../../..

cd server_tools/local_servers/image-processing-toolkits
uv sync
cd ../../..

cd server_tools/local_servers/mcp_weather_server
uv sync
cd ../../..

cd server_tools/local_servers/mcp-google-images-search
npm install
npm run build
cd ../../..

cd server_tools/local_servers/pdf_mcp_server-0.1.2
uv sync
cd ../../..

#cd server_tools/local_servers/seedream-image-mcp
#uv sync
#cd ../../..

cd server_tools/local_servers/servers/src/filesystem
npm install
npm run build
cd ../../../../..

cd server_tools/local_servers/video-audio-mcp
sudo apt-get update && sudo apt-get install -y libjpeg-dev zlib1g-dev
uv sync
cd ../../..

cd server_tools/local_servers/youtube-mcp-server
npm install
npm run build
cd ../../..

# Update the Node.js version
bash ./scripts/env/mcp_env_install.sh
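The manual steps above follow a pattern: Python-based servers run `uv sync`, while Node-based servers run `npm install` followed by `npm run build`. The sketch below plans those commands per directory. It assumes the build tool can be inferred from the presence of `pyproject.toml` or `package.json`, which the repository does not confirm; the function names `plan_build_commands` and `plan_all` are hypothetical. It only returns commands and does not execute anything.

```python
from pathlib import Path


def plan_build_commands(server_dir):
    """Return the build commands for one MCP server directory.

    Assumption: a pyproject.toml marks a uv-managed Python server and a
    package.json marks a Node server, mirroring the manual steps above.
    """
    server_dir = Path(server_dir)
    if (server_dir / "pyproject.toml").is_file():
        return [["uv", "sync"]]
    if (server_dir / "package.json").is_file():
        return [["npm", "install"], ["npm", "run", "build"]]
    return []  # unknown layout: nothing to run


def plan_all(local_servers_root):
    """Map each immediate subdirectory to its planned build commands."""
    root = Path(local_servers_root)
    return {
        child.name: plan_build_commands(child)
        for child in sorted(root.iterdir())
        if child.is_dir()
    }
```

Each planned command could then be run with `subprocess.run(cmd, cwd=server_dir, check=True)`. Note that the filesystem server lives deeper (`servers/src/filesystem`), so it would need to be passed to `plan_build_commands` directly.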

API Environment

Obtain the required Google and Seedream API keys and set them in "configs/token_key_session.py":

  1. Get Google platform API

Step 1: Google Cloud Console Setup

Step 2: Custom Search Engine Setup

  • Go to Google Custom Search Engine
  • Click "Add" to create a new search engine
  • Enter the sites you want to search (or leave blank for entire web)
  • Give your search engine a name
  • Click "Create"
  • Go to "Setup" → "Basics" and copy your "Search engine ID"

Step 3: Configure Search Engine (Optional)

  • Search the entire web: Leave "Sites to search" empty
  • Search specific sites: Add domains like github.com, stackoverflow.com
  • Advanced settings: Configure language, region, and other preferences
  2. Get Seedream API key

Register an account at https://console.volcengine.com/ and obtain the API key for doubao-seedream-4-0-250828.
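Before wiring the Google credentials into the benchmark, it can help to build a request against Google's public Custom Search JSON API by hand. The sketch below only constructs the request URL from an API key and the "Search engine ID" copied in Step 2. Whether the bundled MCP server calls this exact endpoint is an assumption, and `build_search_url` is a hypothetical helper, not part of the repository.

```python
from urllib.parse import urlencode

CUSTOM_SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, search_engine_id, query, num=5):
    """Build a Custom Search JSON API request URL.

    `key`, `cx`, `q`, and `num` are the API's query parameters; `cx` is
    the "Search engine ID" copied in Step 2.
    """
    if not api_key or not search_engine_id:
        raise ValueError("both the API key and the search engine ID are required")
    params = {"key": api_key, "cx": search_engine_id, "q": query, "num": num}
    return f"{CUSTOM_SEARCH_ENDPOINT}?{urlencode(params)}"
```

Fetching the resulting URL (for example with `urllib.request.urlopen`) should return JSON when both values are valid, which makes for a quick credentials check.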

Case Construction Pipeline

The entrypoint is pipeline/distill/main.py:

uv run python pipeline/distill/main.py

High-level workflow:

  • Read MCP details from pipeline/distill/config/mcp_details_summary.json.
  • Generate MCP brief to pipeline/distill/config/mcp_brief.json.
  • Generate domain candidates per category/subcategory and write pipeline/distill/config/domain.json.
  • Generate tasks (easy/medium/hard) and save outputs under pipeline/distill/outputs/<category>/<subcategory>/.

Key files:

  • pipeline/distill/config/category.json: category/subcategory seeds for domain generation.
  • pipeline/distill/config/mcp_details_summary.json: full MCP server/tool metadata.
  • pipeline/distill/config/mcp_brief.json: compact MCP metadata used by downstream steps.
  • pipeline/distill/config/domain.json: generated domain definitions used for task generation.
  • pipeline/distill/prompt.py: prompt templates for domain generation, task generation, and trajectory generation.
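Because each stage reads or writes a known file, a quick existence check shows how far a run has progressed. The following is a minimal sketch using only the paths listed above; the helper names `stage_status` and `task_output_dir` and the stage labels are hypothetical.

```python
from pathlib import Path

# Config files named in the workflow above, in the order they are read or produced.
STAGE_FILES = [
    ("mcp details (input)", "mcp_details_summary.json"),
    ("category seeds (input)", "category.json"),
    ("mcp brief (generated)", "mcp_brief.json"),
    ("domains (generated)", "domain.json"),
]


def stage_status(config_dir="pipeline/distill/config"):
    """Report which pipeline config files currently exist."""
    config_dir = Path(config_dir)
    return {label: (config_dir / name).is_file() for label, name in STAGE_FILES}


def task_output_dir(category, subcategory, outputs_root="pipeline/distill/outputs"):
    """Directory where generated tasks for one subcategory are saved."""
    return Path(outputs_root) / category / subcategory
```

Calling `stage_status()` from the repository root returns a label-to-boolean mapping; a missing generated file suggests the corresponding stage has not completed yet.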

Testing and Analysis

Download Dataset

Download the dataset from https://huggingface.co/datasets/AI-Safeguard/TOBenchURL (in a browser or by CLI command) and unzip it into the ./tasks directory.

Set Custom API

Before testing any model, configure OpenAI-compatible endpoints and API keys in configs/global_configs.py:

custom_llm_key=""
custom_url=""

user_llm_key=""
user_llm_url=""
user_llm="gemini-3-pro-preview"

judge_llm_key=""
judge_llm_url=""
judge_llm="gemini-3-pro-preview"

Notes:

  • custom_llm_key + custom_url: the API key and OpenAI-compatible URL of the model under test.
  • user_llm_* and judge_llm_*: required as well; they drive the simulated user and the judge.
  • Keep user_llm and judge_llm set to gemini-3-pro-preview (the default). Do not change them for this benchmark.

If your tasks require external services (for example, Google Search), configure the corresponding API keys in configs/token_key_session.py before running tests.
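Empty keys are easy to overlook until a run fails midway, so a pre-flight check can be useful. The sketch below validates a plain mapping that uses the field names shown above; `check_config` is a hypothetical helper, and loading the real values (for example via `vars(global_configs)` after `from configs import global_configs`) assumes the module is importable that way.

```python
REQUIRED_FIELDS = [
    "custom_llm_key", "custom_url",
    "user_llm_key", "user_llm_url", "user_llm",
    "judge_llm_key", "judge_llm_url", "judge_llm",
]
EXPECTED_SIMULATOR_MODEL = "gemini-3-pro-preview"


def check_config(cfg):
    """Return a list of human-readable problems found in a config mapping."""
    problems = [f"{name} is empty" for name in REQUIRED_FIELDS if not cfg.get(name)]
    # The benchmark asks that the user and judge models stay at the default.
    for name in ("user_llm", "judge_llm"):
        value = cfg.get(name)
        if value and value != EXPECTED_SIMULATOR_MODEL:
            problems.append(f"{name} should stay {EXPECTED_SIMULATOR_MODEL}, got {value}")
    return problems
```

An empty list means every field is filled in and both simulator models match the default.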

Task Testing

Run a single task:

bash scripts/run/run_test.sh \
  Customer_Service/Education-Pandas normal ./outputs \
  qwen3.5-plus qwen_official 100 scripts/run_config.json

run_test.sh arguments:

  • Customer_Service/Education-Pandas: task path under tasks/.
  • normal: run mode (normal or quickstart).
  • ./outputs: output root directory.
  • qwen3.5-plus: model short name.
  • qwen_official: provider name.
  • 100: max tool-call steps (single-turn limit).
  • scripts/run_config.json: evaluation/runtime config file.
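Since run_test.sh takes seven positional arguments, it is easy to swap two of them. One way to guard against that is to assemble the command from named parameters, as in this sketch; `build_run_test_cmd` is a hypothetical wrapper, and the defaults simply repeat the example values above.

```python
import shlex


def build_run_test_cmd(task, output_root, model, provider,
                       mode="normal", max_steps=100,
                       config="scripts/run_config.json"):
    """Assemble the positional arguments of run_test.sh in documented order."""
    if mode not in ("normal", "quickstart"):
        raise ValueError(f"unknown run mode: {mode}")
    return [
        "bash", "scripts/run/run_test.sh",
        task, mode, output_root,
        model, provider, str(max_steps), config,
    ]


cmd = build_run_test_cmd("Customer_Service/Education-Pandas", "./outputs",
                         "qwen3.5-plus", "qwen_official")
print(shlex.join(cmd))  # shell-ready form of the command
```

The list form can be passed to `subprocess.run(cmd, check=True)` from the repository root.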

Run batch tests in parallel:

bash scripts/run/run_parallel.sh \
  Customer_Service 2 normal ./outputs/qwen35 \
  qwen3.5-plus qwen_official 100 scripts/run_config.json 60

bash scripts/run/run_parallel.sh \
  Intelligent_Creation 2 normal ./outputs/qwen35 \
  qwen3.5-plus qwen_official 100 scripts/run_config.json 60

run_parallel.sh arguments:

  • Customer_Service / Intelligent_Creation: domain folder under tasks/.
  • 2: concurrency (number of tasks run simultaneously; choose it according to your machine's resources).
  • normal: run mode (normal).
  • ./outputs/qwen35: output root for this batch.
  • qwen3.5-plus: model short name.
  • qwen_official: provider name.
  • 100: max tool-call steps per task.
  • scripts/run_config.json: evaluation/runtime config file.
  • 60: timeout (minutes) per task.

Run both commands above to cover the two major categories (Customer_Service and Intelligent_Creation).
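Covering both categories amounts to issuing the same command twice with a different first argument. A minimal sketch of that loop follows; `build_parallel_cmd` is a hypothetical helper whose defaults repeat the example values, and the commands are only constructed here, not executed.

```python
def build_parallel_cmd(category, output_root, model, provider,
                       concurrency=2, max_steps=100,
                       config="scripts/run_config.json", timeout_min=60):
    """Assemble run_parallel.sh arguments in the documented order."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return [
        "bash", "scripts/run/run_parallel.sh",
        category, str(concurrency), "normal", output_root,
        model, provider, str(max_steps), config, str(timeout_min),
    ]


CATEGORIES = ["Customer_Service", "Intelligent_Creation"]
commands = [
    build_parallel_cmd(c, "./outputs/qwen35", "qwen3.5-plus", "qwen_official")
    for c in CATEGORIES
]
```

Running the two commands one after the other (for instance with `subprocess.run(cmd, check=True)`) keeps the effective concurrency at the configured value.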

After parallel testing, each category folder will contain an execution_report.txt file, for example:

  • outputs/sonnet46/Intelligent_Creation/execution_report.txt

The report records whether each task passed or failed (including exception cases).

If some tasks failed or ended abnormally, rerun them with:

bash scripts/run/rerun.sh

Then update/check execution_report.txt again and summarize final metrics.

Summarize benchmark results:

uv run python pipeline/analysis/output_statistics.py \
  --base_dir outputs/qwen35 \
  --output outputs/qwen35_stats.txt

This summary includes pass rate, average tokens per task, average tool calls, and other aggregate metrics.
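For reference, the aggregate metrics mentioned above reduce to simple ratios over per-task records. The sketch below illustrates the arithmetic only; the record keys (`passed`, `tokens`, `tool_calls`) are hypothetical and are not the format that output_statistics.py actually reads.

```python
def summarize(records):
    """Compute pass rate and per-task averages from per-task records.

    Each record is a dict with hypothetical keys: "passed" (bool),
    "tokens" (int), and "tool_calls" (int).
    """
    total = len(records)
    if total == 0:
        return {"tasks": 0, "pass_rate": 0.0, "avg_tokens": 0.0, "avg_tool_calls": 0.0}
    return {
        "tasks": total,
        "pass_rate": sum(1 for r in records if r["passed"]) / total,
        "avg_tokens": sum(r["tokens"] for r in records) / total,
        "avg_tool_calls": sum(r["tool_calls"] for r in records) / total,
    }
```

Averages are taken over all tasks, passed or failed; averaging over passed tasks only would be a different (and equally defensible) convention.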

Analysis

Analyze failed tasks and error categories in three steps.

  1. Extract failed-task trajectories from execution logs:
uv run python pipeline/analysis/extract_failed_logs.py \
  --base_dir ./outputs/qwen36 \
  --output_path ./results/qwen36_analysis.jsonl

extract_failed_logs.py arguments:

  • --base_dir: model output root that contains category folders and task run.log files.
  • --output_path: output JSONL path for extracted failed-task trajectories.
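To confirm that a base directory has the expected shape before extracting, the run.log files can be listed first. This sketch assumes the layout `<base_dir>/<category>/<task>/run.log`, inferred from the argument description above; `find_run_logs` is a hypothetical helper and does not reproduce what extract_failed_logs.py does internally.

```python
from pathlib import Path


def find_run_logs(base_dir):
    """Yield (category, task, log_path) for every run.log under base_dir.

    Assumed layout: <base_dir>/<category>/<task>/run.log.
    """
    base_dir = Path(base_dir)
    for log_path in sorted(base_dir.glob("*/*/run.log")):
        task_dir = log_path.parent
        yield task_dir.parent.name, task_dir.name, log_path
```

An empty result usually means `--base_dir` points one level too high or too low.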
  2. Run LLM-based failure analysis (sequential script):
bash pipeline/analysis/run_analysis_sequential.sh

run_analysis_sequential.sh behavior:

  • Reads <model>_analysis.jsonl from outputs/.
  • Calls pipeline/analysis/analyse_failed_logs.py for each enabled model.
  • Writes <model>_analysis_output.jsonl to outputs/.

analyse_failed_logs.py arguments:

  • --input_jsonl: extracted JSONL from step 1.
  • --output_jsonl: output JSONL containing structured error analysis.

Default analysis settings:

  • Prompt template: code/pipeline/analysis/prompt.py.
  • Analysis model: gemini-3-pro-preview (from global_configs.user_llm).
  • API endpoint/key: global_configs.user_llm_url and global_configs.user_llm_key.
  3. Summarize category/subcategory error statistics:
uv run python pipeline/analysis/error_statistics.py \
  --outputs_dir ./results \
  --models qwen36 \
  --save_json ./results/error_cat_summary.json

error_statistics.py arguments:

  • --outputs_dir: directory containing <model>_analysis_output.jsonl files.
  • --models: model prefixes to summarize.
  • --save_json: optional path to save aggregated JSON summary.
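The aggregation in this step can be pictured as counting one field across each model's JSONL file. Below is a minimal sketch under that assumption; the field name `error_category` is hypothetical and should be replaced with whatever key the analysis output actually uses, and `count_error_categories` is not part of the repository.

```python
import json
from collections import Counter
from pathlib import Path


def count_error_categories(outputs_dir, models, field="error_category"):
    """Count values of `field` per model across <model>_analysis_output.jsonl.

    `field` is a hypothetical key; adjust it to the real JSONL schema.
    """
    summary = {}
    for model in models:
        path = Path(outputs_dir) / f"{model}_analysis_output.jsonl"
        counts = Counter()
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                counts[record.get(field, "unknown")] += 1
        summary[model] = dict(counts)
    return summary
```

For example, `count_error_categories("./results", ["qwen36"])` would return one dictionary of counts per model prefix, with records lacking the field grouped under "unknown".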
