TOBench: A Task-Oriented Omni-Modal Agent Benchmark Harness for General Real-World MCP Tasks

TOBench is a benchmark for evaluating the tool-calling ability of omni-modal agents on general real-world tasks.

Testing Environment

Code Environment

Set up Python and core dependencies:

uv venv .venv --python 3.12
source .venv/bin/activate
export PYTHONPATH="$PWD:$PYTHONPATH"
uv pip install -r requirements.txt

MCP Environment

Install MCP runtime prerequisites (Node.js/npm and local setup):

bash ./scripts/env/mcp_env_install.sh

Verify MCP servers are available:

uv run python server_tools/check_installation.py
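If the check fails early, it can help to first confirm that the command-line tools used in this README are on PATH. The following is a minimal sketch, not part of the repository; the tool list (uv, node, npm) is an assumption based on the commands shown here and may be incomplete.

```python
import shutil

# Tools invoked directly in this README; the list is an assumption and may be incomplete.
REQUIRED_TOOLS = ("uv", "node", "npm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```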

If any MCP server fails the check, reinstall it manually:

cd server_tools/local_servers/asr_mcp_server
uv sync
cd ../../..

# If you encounter a 403 error, make sure the edge-tts version in "server_tools/local_servers/edge_tts_mcp_server/pyproject.toml" is the latest one.
cd server_tools/local_servers/edge_tts_mcp_server
uv sync
cd ../../..

cd server_tools/local_servers/excel-mcp-server
uv sync
cd ../../..

cd server_tools/local_servers/Google-Search-MCP
npm install
npm run build
cd ../../..

cd server_tools/local_servers/image-processing-toolkits
uv sync
cd ../../..

cd server_tools/local_servers/mcp_weather_server
uv sync
cd ../../..

cd server_tools/local_servers/mcp-google-images-search
npm install
npm run build
cd ../../..

cd server_tools/local_servers/pdf_mcp_server-0.1.2
uv sync
cd ../../..

#cd server_tools/local_servers/seedream-image-mcp
#uv sync
#cd ../../..

cd server_tools/local_servers/servers/src/filesystem
npm install
npm run build
cd ../../../../..

cd server_tools/local_servers/video-audio-mcp
sudo apt-get update && sudo apt-get install -y libjpeg-dev zlib1g-dev
uv sync
cd ../../..

cd server_tools/local_servers/youtube-mcp-server
npm install
npm run build
cd ../../..

# Update the Node.js version
bash ./scripts/env/mcp_env_install.sh
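The manual steps above repeat the same two patterns (uv sync for Python servers, npm install plus npm run build for Node servers). As a rough sketch, they could be driven from Python; passing cwd to subprocess.run avoids the cd bookkeeping. The helper names plan_commands and run_plan are hypothetical, and the directory lists are copied from the commands above (seedream-image-mcp is left out because it is commented out there).

```python
import subprocess
from pathlib import Path

SERVERS_ROOT = Path("server_tools/local_servers")

# Python-based servers: installed with `uv sync`.
# Note: video-audio-mcp also needs libjpeg-dev and zlib1g-dev installed first.
UV_SERVERS = [
    "asr_mcp_server",
    "edge_tts_mcp_server",
    "excel-mcp-server",
    "image-processing-toolkits",
    "mcp_weather_server",
    "pdf_mcp_server-0.1.2",
    "video-audio-mcp",
]

# Node-based servers: installed with `npm install` then `npm run build`.
NPM_SERVERS = [
    "Google-Search-MCP",
    "mcp-google-images-search",
    "servers/src/filesystem",
    "youtube-mcp-server",
]


def plan_commands(root=SERVERS_ROOT):
    """Return (working_directory, argv) pairs without executing anything."""
    plan = []
    for name in UV_SERVERS:
        plan.append((root / name, ["uv", "sync"]))
    for name in NPM_SERVERS:
        plan.append((root / name, ["npm", "install"]))
        plan.append((root / name, ["npm", "run", "build"]))
    return plan


def run_plan(plan):
    """Run each command in its directory, stopping at the first failure."""
    for cwd, argv in plan:
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling run_plan(plan_commands()) from the repository root would execute the whole list; printing the plan first is a cheap way to review it.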

API Environment

Obtain the required Google and Seedream API keys and set them in configs/token_key_session.py.

Get Google platform API

Step 1: Google Cloud Console Setup

Go to the Google Cloud Console
Create a new project or select an existing one
Enable the "Custom Search API" in https://console.cloud.google.com/apis/library
Go to "Credentials" → "Create Credentials" → "API Key" (https://console.cloud.google.com/apis/credentials)
Copy your API key

Step 2: Custom Search Engine Setup

Go to Google Custom Search Engine
Click "Add" to create a new search engine
Enter the sites you want to search (or leave blank for entire web)
Give your search engine a name
Click "Create"
Go to "Setup" → "Basics" and copy your "Search engine ID"

Step 3: Configure Search Engine (Optional)

Search the entire web: Leave "Sites to search" empty
Search specific sites: Add domains like github.com, stackoverflow.com
Advanced settings: Configure language, region, and other preferences
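To sanity-check the API key and search engine ID outside the benchmark, one option is to build a request against the Custom Search JSON API by hand. The sketch below only constructs the URL; build_search_url is a hypothetical helper, and the MCP servers in this repository may call the API differently.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public endpoint of the Google Custom Search JSON API.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, engine_id, query, num=5):
    """Build a Custom Search request URL.

    key: the API key from Step 1
    cx:  the search engine ID from Step 2
    q:   the search query
    num: number of results to return (the API accepts 1 to 10)
    """
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num}
    return f"{CSE_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder values; substitute your own key and engine ID.
    print(build_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "model context protocol"))
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return JSON results if both values are valid.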

Get Seedream API key

Register an account at https://console.volcengine.com/ and obtain an API key for doubao-seedream-4-0-250828.

Case Construction Pipeline

The entrypoint is pipeline/distill/main.py:

uv run python pipeline/distill/main.py

High-level workflow:

Read MCP details from pipeline/distill/config/mcp_details_summary.json.
Generate the MCP brief and write it to pipeline/distill/config/mcp_brief.json.
Generate domain candidates per category/subcategory and write pipeline/distill/config/domain.json.
Generate tasks (easy/medium/hard) and save outputs under pipeline/distill/outputs/<category>/<subcategory>/.

Key files:

pipeline/distill/config/category.json: category/subcategory seeds for domain generation.
pipeline/distill/config/mcp_details_summary.json: full MCP server/tool metadata.
pipeline/distill/config/mcp_brief.json: compact MCP metadata used by downstream steps.
pipeline/distill/config/domain.json: generated domain definitions used for task generation.
pipeline/distill/prompt.py: prompt templates for domain generation, task generation, and trajectory generation.
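The file layout described above can be expressed as a couple of small path helpers, for example to check the config directory before a run or to locate generated tasks afterwards. This is an illustrative sketch; task_output_dir and missing_configs are hypothetical names, and the category/subcategory values in the comment are placeholders.

```python
from pathlib import Path

DISTILL_ROOT = Path("pipeline/distill")

# Config files listed above. mcp_brief.json and domain.json are generated
# by the pipeline, so they may be absent before the first run.
CONFIG_FILES = [
    "category.json",
    "mcp_details_summary.json",
    "mcp_brief.json",
    "domain.json",
]


def task_output_dir(category, subcategory, root=DISTILL_ROOT):
    """Directory where generated tasks for one subcategory are saved."""
    return root / "outputs" / category / subcategory


def missing_configs(root=DISTILL_ROOT):
    """Return the config file names that do not exist under root/config."""
    return [name for name in CONFIG_FILES if not (root / "config" / name).is_file()]


# Example with placeholder names:
# task_output_dir("SomeCategory", "SomeSubcategory")
# -> pipeline/distill/outputs/SomeCategory/SomeSubcategory
```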

Testing and Analysis

Download Dataset

Download the dataset from https://huggingface.co/datasets/AI-Safeguard/TOBenchURL, either through the web page or with a CLI command, and unzip it into the ./tasks directory.

Set Custom API

Before testing any model, configure OpenAI-compatible endpoints and API keys in configs/global_configs.py:

custom_llm_key=""
custom_url=""

user_llm_key=""
user_llm_url=""
user_llm="gemini-3-pro-preview"

judge_llm_key=""
judge_llm_url=""
judge_llm="gemini-3-pro-preview"

Notes:

custom_llm_key + custom_url: API key and OpenAI-compatible endpoint URL of the model under test.
user_llm_* and judge_llm_* are also required for user/judge simulation.
Keep user_llm and judge_llm as gemini-3-pro-preview (default). Do not change them for this benchmark.

If your tasks require external services (for example, Google Search), configure the corresponding API keys in configs/token_key_session.py before running tests.
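A quick pre-flight check can catch empty keys before a long run. The sketch below is not part of the repository: config_problems is a hypothetical helper that takes the settings as a plain dict, and it assumes the variable names shown above are top-level names in configs/global_configs.py.

```python
# Settings that must be filled in before testing (names taken from the snippet above).
REQUIRED_FIELDS = (
    "custom_llm_key",
    "custom_url",
    "user_llm_key",
    "user_llm_url",
    "judge_llm_key",
    "judge_llm_url",
)

# Models that the notes above say should keep their default value.
FIXED_MODELS = {
    "user_llm": "gemini-3-pro-preview",
    "judge_llm": "gemini-3-pro-preview",
}


def config_problems(settings):
    """Return human-readable problems found in a settings mapping."""
    problems = [f"{name} is empty" for name in REQUIRED_FIELDS if not settings.get(name)]
    for name, expected in FIXED_MODELS.items():
        if settings.get(name) != expected:
            problems.append(f"{name} should stay {expected!r}")
    return problems
```

Assuming the config is importable as a module, a call such as config_problems(vars(global_configs)) after "from configs import global_configs" would report what is still missing.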

Task Testing

Run a single task:

bash scripts/run/run_test.sh \
  Customer_Service/Education-Pandas normal ./outputs \
  qwen3.5-plus qwen_official 100 scripts/run_config.json

run_test.sh arguments:

Customer_Service/Education-Pandas: task path under tasks/.
normal: run mode (normal or quickstart).
./outputs: output root directory.
qwen3.5-plus: model short name.
qwen_official: provider name.
100: max tool-call steps (single-turn limit).
scripts/run_config.json: evaluation/runtime config file.
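Because run_test.sh takes seven positional arguments, it is easy to misorder them. One way to keep them straight is a small wrapper that names each field and emits the argument list in the order documented above. SingleRun is a hypothetical class for illustration; its defaults simply mirror the example command.

```python
from dataclasses import dataclass


@dataclass
class SingleRun:
    task: str                               # task path under tasks/
    mode: str = "normal"                    # "normal" or "quickstart"
    output_root: str = "./outputs"          # output root directory
    model: str = "qwen3.5-plus"             # model short name
    provider: str = "qwen_official"         # provider name
    max_steps: int = 100                    # max tool-call steps
    config: str = "scripts/run_config.json" # evaluation/runtime config file

    def argv(self):
        """Return the command as an argument list, in the documented order."""
        if self.mode not in ("normal", "quickstart"):
            raise ValueError(f"unknown run mode: {self.mode}")
        return [
            "bash", "scripts/run/run_test.sh",
            self.task, self.mode, self.output_root,
            self.model, self.provider, str(self.max_steps), self.config,
        ]
```

The resulting list can be passed to subprocess.run(run.argv(), check=True) from the repository root.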

Run batch tests in parallel:

bash scripts/run/run_parallel.sh \
  Customer_Service 2 normal ./outputs/qwen35 \
  qwen3.5-plus qwen_official 100 scripts/run_config.json 60

bash scripts/run/run_parallel.sh \
  Intelligent_Creation 2 normal ./outputs/qwen35 \
  qwen3.5-plus qwen_official 100 scripts/run_config.json 60

run_parallel.sh arguments:

Customer_Service / Intelligent_Creation: domain folder under tasks/.
2: concurrency (number of tasks run simultaneously; choose a value your machine can handle).
normal: run mode (normal).
./outputs/qwen35: output root for this batch.
qwen3.5-plus: model short name.
qwen_official: provider name.
100: max tool-call steps per task.
scripts/run_config.json: evaluation/runtime config file.
60: timeout (minutes) per task.

Run both commands above to cover the two major categories (Customer_Service and Intelligent_Creation).
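The two batch commands differ only in the category name, so they can be generated in a loop. The following sketch builds both argument lists; parallel_argv is a hypothetical helper and the values are the ones from the example above.

```python
CATEGORIES = ("Customer_Service", "Intelligent_Creation")


def parallel_argv(category, concurrency, output_root, model, provider,
                  max_steps=100, config="scripts/run_config.json",
                  timeout_min=60, mode="normal"):
    """Build the run_parallel.sh argument list in the documented order."""
    return [
        "bash", "scripts/run/run_parallel.sh",
        category, str(concurrency), mode, output_root,
        model, provider, str(max_steps), config, str(timeout_min),
    ]


# One command per major category, sharing the same output root.
commands = [
    parallel_argv(category, 2, "./outputs/qwen35", "qwen3.5-plus", "qwen_official")
    for category in CATEGORIES
]

if __name__ == "__main__":
    for argv in commands:
        print(" ".join(argv))
```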

After parallel testing, each category folder will contain an execution_report.txt file, for example:

outputs/sonnet46/Intelligent_Creation/execution_report.txt

The report records whether each task passed or failed (including exception cases).

If some tasks failed or ended abnormally, rerun them with:

bash scripts/run/rerun.sh

Then check execution_report.txt again and summarize the final metrics.

Summarize benchmark results:

uv run python pipeline/analysis/output_statistics.py \
  --base_dir outputs/qwen35 \
  --output outputs/qwen35_stats.txt

This summary includes pass rate, average tokens per task, average tool calls, and other aggregate metrics.
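For readers who want to cross-check the numbers, the aggregate metrics are simple averages over tasks. The sketch below shows the arithmetic on hypothetical per-task records; the field names (passed, tokens, tool_calls) are assumptions for illustration and are not the format used by output_statistics.py.

```python
def summarize(runs):
    """Compute pass rate and per-task averages from a list of task records."""
    total = len(runs)
    if total == 0:
        return {"pass_rate": 0.0, "avg_tokens": 0.0, "avg_tool_calls": 0.0}
    return {
        # Fraction of tasks that passed.
        "pass_rate": sum(1 for r in runs if r["passed"]) / total,
        # Mean token usage per task.
        "avg_tokens": sum(r["tokens"] for r in runs) / total,
        # Mean number of tool calls per task.
        "avg_tool_calls": sum(r["tool_calls"] for r in runs) / total,
    }


# Hypothetical records, for illustration only.
example_runs = [
    {"passed": True, "tokens": 1000, "tool_calls": 4},
    {"passed": False, "tokens": 3000, "tool_calls": 8},
]
```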

Analysis

Analyze failed tasks and error categories in three steps.

Extract failed-task trajectories from execution logs:

uv run python pipeline/analysis/extract_failed_logs.py \
  --base_dir ./outputs/qwen36 \
  --output_path ./results/qwen36_analysis.jsonl

extract_failed_logs.py arguments:

--base_dir: model output root that contains category folders and task run.log files.
--output_path: output JSONL path for extracted failed-task trajectories.
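The extracted file is JSONL, meaning one JSON object per line, so it can be inspected with a few lines of Python. This is a generic reader sketch; parse_jsonl is a hypothetical helper and makes no assumption about which fields each record contains.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of text lines into a list of JSON objects, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Usage (path taken from the example command above):
# with open("./results/qwen36_analysis.jsonl", encoding="utf-8") as handle:
#     records = parse_jsonl(handle)
# print(len(records), "failed-task trajectories")
```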

Run LLM-based failure analysis (sequential script):

bash pipeline/analysis/run_analysis_sequential.sh

run_analysis_sequential.sh behavior:

Reads <model>_analysis.jsonl from outputs/.
Calls pipeline/analysis/analyse_failed_logs.py for each enabled model.
Writes <model>_analysis_output.jsonl to outputs/.

analyse_failed_logs.py arguments:

--input_jsonl: extracted JSONL from step 1.
--output_jsonl: output JSONL containing structured error analysis.

Default analysis settings:

Prompt template: code/pipeline/analysis/prompt.py.
Analysis model: gemini-3-pro-preview (from global_configs.user_llm).
API endpoint/key: global_configs.user_llm_url and global_configs.user_llm_key.

Summarize category/subcategory error statistics:

uv run python pipeline/analysis/error_statistics.py \
  --outputs_dir ./results \
  --models qwen36 \
  --save_json ./results/error_cat_summary.json

error_statistics.py arguments:

--outputs_dir: directory containing <model>_analysis_output.jsonl files.
--models: model prefixes to summarize.
--save_json: optional path to save aggregated JSON summary.
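Since error_statistics.py looks for <model>_analysis_output.jsonl files, a quick way to confirm the inputs are in place is to derive the expected paths from the model prefixes. A minimal sketch, with hypothetical helper names:

```python
from pathlib import Path


def analysis_output_paths(outputs_dir, models):
    """Expected analysis output file for each model prefix."""
    return [Path(outputs_dir) / f"{model}_analysis_output.jsonl" for model in models]


def missing_analysis_outputs(outputs_dir, models):
    """Return the expected files that do not exist yet."""
    return [p for p in analysis_output_paths(outputs_dir, models) if not p.is_file()]


# Example: analysis_output_paths("./results", ["qwen36"])
# -> [Path("results/qwen36_analysis_output.jsonl")]
```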
