TOBench is a benchmark that measures the tool-calling ability of omni-modal agents on general real-world tasks.
Set up Python and core dependencies:
uv venv .venv --python 3.12
source .venv/bin/activate
export PYTHONPATH="$PWD:$PYTHONPATH"
uv pip install -r requirements.txt
Install MCP runtime prerequisites (Node.js/npm and local setup):
bash ./scripts/env/mcp_env_install.sh
Verify that the MCP servers are available:
uv run python server_tools/check_installation.py
If any MCP server fails the check, you can fix it manually:
cd server_tools/local_servers/asr_mcp_server
uv sync
cd ../../..
# If you encounter a 403 error, make sure the edge-tts version in "server_tools/local_servers/edge_tts_mcp_server/pyproject.toml" is the latest one
cd server_tools/local_servers/edge_tts_mcp_server
uv sync
cd ../../..
cd server_tools/local_servers/excel-mcp-server
uv sync
cd ../../..
cd server_tools/local_servers/Google-Search-MCP
npm install
npm run build
cd ../../..
cd server_tools/local_servers/image-processing-toolkits
uv sync
cd ../../..
cd server_tools/local_servers/mcp_weather_server
uv sync
cd ../../..
cd server_tools/local_servers/mcp-google-images-search
npm install
npm run build
cd ../../..
cd server_tools/local_servers/pdf_mcp_server-0.1.2
uv sync
cd ../../..
#cd server_tools/local_servers/seedream-image-mcp
#uv sync
#cd ../../..
cd server_tools/local_servers/servers/src/filesystem
npm install
npm run build
cd ../../../../..
cd server_tools/local_servers/video-audio-mcp
sudo apt-get update && sudo apt-get install -y libjpeg-dev zlib1g-dev
uv sync
cd ../../..
cd server_tools/local_servers/youtube-mcp-server
npm install
npm run build
cd ../../..
# Update the Node.js version
bash ./scripts/env/mcp_env_install.sh
Get the necessary Google and Seedream API keys and set them in "configs/token_key_session.py":
- Get the Google API key and search engine ID
Step 1: Google Cloud Console Setup
- Go to the Google Cloud Console
- Create a new project or select an existing one
- Enable the "Custom Search API" in the API Library: https://console.cloud.google.com/apis/library
- Go to "Credentials" → "Create Credentials" → "API Key": https://console.cloud.google.com/apis/credentials
- Copy your API key
Step 2: Custom Search Engine Setup
- Go to Google Custom Search Engine
- Click "Add" to create a new search engine
- Enter the sites you want to search (or leave blank for entire web)
- Give your search engine a name
- Click "Create"
- Go to "Setup" → "Basics" and copy your "Search engine ID"
Step 3: Configure Search Engine (Optional)
- Search the entire web: Leave "Sites to search" empty
- Search specific sites: Add domains like github.com, stackoverflow.com
- Advanced settings: Configure language, region, and other preferences
- Get the Seedream API key
Register an account at https://console.volcengine.com/ and get the API key for doubao-seedream-4-0-250828.
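Before saving the Google credentials, it can help to confirm that the API key and search engine ID from Steps 1 and 2 work together. The sketch below only builds a request URL for the public Custom Search JSON API; the helper name build_search_url is hypothetical and not part of this repository. Opening the resulting URL in a browser or with curl should return JSON results if both values are valid.

```python
from urllib.parse import urlencode

# Public endpoint of the Google Custom Search JSON API.
CUSTOM_SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key: str, search_engine_id: str, query: str) -> str:
    """Build a test request URL (hypothetical helper, not part of TOBench)."""
    if not api_key or not search_engine_id:
        raise ValueError("both the API key and the search engine ID are required")
    # "key" is the API key from Step 1, "cx" is the search engine ID from Step 2.
    params = {"key": api_key, "cx": search_engine_id, "q": query}
    return f"{CUSTOM_SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder values; substitute your own credentials.
    print(build_search_url("YOUR_API_KEY", "YOUR_SEARCH_ENGINE_ID", "tool calling"))
```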
The entrypoint is pipeline/distill/main.py:
uv run python pipeline/distill/main.py
High-level workflow:
- Read MCP details from pipeline/distill/config/mcp_details_summary.json.
- Generate MCP brief to pipeline/distill/config/mcp_brief.json.
- Generate domain candidates per category/subcategory and write pipeline/distill/config/domain.json.
- Generate tasks (easy/medium/hard) and save outputs under pipeline/distill/outputs/<category>/<subcategory>/.
Key files:
- pipeline/distill/config/category.json: category/subcategory seeds for domain generation.
- pipeline/distill/config/mcp_details_summary.json: full MCP server/tool metadata.
- pipeline/distill/config/mcp_brief.json: compact MCP metadata used by downstream steps.
- pipeline/distill/config/domain.json: generated domain definitions used for task generation.
- pipeline/distill/prompt.py: prompt templates for domain generation, task generation, and trajectory generation.
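A quick pre-flight check can catch a missing or malformed config file before the pipeline starts. The sketch below is a hypothetical helper, not part of the repository. It assumes that category.json and mcp_details_summary.json are the two inputs that must exist beforehand, since the workflow above describes the other JSON files as generated.

```python
import json
from pathlib import Path

# Assumption: these two files are inputs; mcp_brief.json and domain.json are generated.
CONFIG_DIR = Path("pipeline/distill/config")
INPUT_FILES = ("category.json", "mcp_details_summary.json")


def check_config_files(config_dir=CONFIG_DIR, names=INPUT_FILES) -> dict:
    """Return a status per file: 'ok', 'missing', or 'invalid JSON'."""
    status = {}
    for name in names:
        path = Path(config_dir) / name
        if not path.is_file():
            status[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
            status[name] = "ok"
        except json.JSONDecodeError:
            status[name] = "invalid JSON"
    return status


if __name__ == "__main__":
    for name, state in check_config_files().items():
        print(f"{name}: {state}")
```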
Download the dataset from
https://huggingface.co/datasets/AI-Safeguard/TOBench
(or with a CLI command) and unzip it into the ./tasks directory.
Before testing any model, configure OpenAI-compatible endpoints and API keys in configs/global_configs.py:
custom_llm_key=""
custom_url=""
user_llm_key=""
user_llm_url=""
user_llm="gemini-3-pro-preview"
judge_llm_key=""
judge_llm_url=""
judge_llm="gemini-3-pro-preview"
Notes:
- custom_llm_key + custom_url: the endpoint of the model under test (OpenAI-compatible URL + API key).
- user_llm_* and judge_llm_* are also required for user/judge simulation.
- Keep user_llm and judge_llm as gemini-3-pro-preview (default). Do not change them for this benchmark.
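The notes above amount to two rules: every setting must be filled in, and the user/judge models must keep their defaults. A minimal sketch of that check follows; find_config_problems is a hypothetical helper that takes the settings as a plain dict, and nothing like it is confirmed to exist in the repository.

```python
# Setting names as they appear in configs/global_configs.py.
REQUIRED_SETTINGS = (
    "custom_llm_key", "custom_url",
    "user_llm_key", "user_llm_url", "user_llm",
    "judge_llm_key", "judge_llm_url", "judge_llm",
)
# The benchmark asks to keep these two at their default value.
FIXED_MODELS = {
    "user_llm": "gemini-3-pro-preview",
    "judge_llm": "gemini-3-pro-preview",
}


def find_config_problems(settings: dict) -> list:
    """Return human-readable problems; an empty list means the settings look fine."""
    problems = []
    for name in REQUIRED_SETTINGS:
        if not str(settings.get(name, "")).strip():
            problems.append(f"{name} is empty")
    for name, expected in FIXED_MODELS.items():
        value = settings.get(name)
        if value and value != expected:
            problems.append(f"{name} should stay {expected!r}, got {value!r}")
    return problems
```

Assuming configs/global_configs.py is importable as a module, something like find_config_problems(vars(global_configs)) could be run once before starting a test.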
If your tasks require external services (for example, Google Search), configure the corresponding API keys in configs/token_key_session.py before running tests.
Run a single task:
bash scripts/run/run_test.sh \
Customer_Service/Education-Pandas normal ./outputs \
qwen3.5-plus qwen_official 100 scripts/run_config.json
run_test.sh arguments:
- Customer_Service/Education-Pandas: task path under tasks/.
- normal: run mode (normal or quickstart).
- ./outputs: output root directory.
- qwen3.5-plus: model short name.
- qwen_official: provider name.
- 100: max tool-call steps (single-turn limit).
- scripts/run_config.json: evaluation/runtime config file.
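Because run_test.sh takes seven positional arguments, their order is easy to get wrong. The sketch below names each argument and assembles the command in the order listed above. The SingleRun class is hypothetical, and its defaults simply mirror the example command.

```python
import shlex
from dataclasses import dataclass


@dataclass
class SingleRun:
    """Named view of the positional arguments of scripts/run/run_test.sh."""
    task: str                                # task path under tasks/
    model: str                               # model short name
    provider: str                            # provider name
    mode: str = "normal"                     # "normal" or "quickstart"
    output_root: str = "./outputs"           # output root directory
    max_steps: int = 100                     # max tool-call steps
    config: str = "scripts/run_config.json"  # evaluation/runtime config file

    def argv(self) -> list:
        if self.mode not in ("normal", "quickstart"):
            raise ValueError(f"unknown run mode: {self.mode}")
        # Order matches the documented positional arguments.
        return [
            "bash", "scripts/run/run_test.sh",
            self.task, self.mode, self.output_root,
            self.model, self.provider, str(self.max_steps), self.config,
        ]


if __name__ == "__main__":
    run = SingleRun("Customer_Service/Education-Pandas", "qwen3.5-plus", "qwen_official")
    # Print the command for review; pass run.argv() to subprocess.run to execute it.
    print(shlex.join(run.argv()))
```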
Run batch tests in parallel:
bash scripts/run/run_parallel.sh \
Customer_Service 2 normal ./outputs/qwen35 \
qwen3.5-plus qwen_official 100 scripts/run_config.json 60
bash scripts/run/run_parallel.sh \
Intelligent_Creation 2 normal ./outputs/qwen35 \
qwen3.5-plus qwen_official 100 scripts/run_config.json 60
run_parallel.sh arguments:
- Customer_Service / Intelligent_Creation: category folder under tasks/.
- 2: concurrency (number of tasks running at the same time; choose it according to your machine's performance).
- normal: run mode (normal).
- ./outputs/qwen35: output root for this batch.
- qwen3.5-plus: model short name.
- qwen_official: provider name.
- 100: max tool-call steps per task.
- scripts/run_config.json: evaluation/runtime config file.
- 60: timeout (minutes) per task.
Run both commands above to cover the two major categories (Customer_Service and Intelligent_Creation).
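Both batch runs should use identical settings so that the two categories are comparable. One way to guarantee that is to generate both commands from a single set of parameters, as in the sketch below. The function parallel_commands is hypothetical; only the argument order is taken from the run_parallel.sh description above.

```python
CATEGORIES = ("Customer_Service", "Intelligent_Creation")


def parallel_commands(output_root, model, provider, concurrency=2,
                      max_steps=100, config="scripts/run_config.json",
                      timeout_minutes=60, mode="normal",
                      categories=CATEGORIES) -> list:
    """Return one run_parallel.sh argument list per category, sharing all settings."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return [
        ["bash", "scripts/run/run_parallel.sh",
         category, str(concurrency), mode, output_root,
         model, provider, str(max_steps), config, str(timeout_minutes)]
        for category in categories
    ]


if __name__ == "__main__":
    for cmd in parallel_commands("./outputs/qwen35", "qwen3.5-plus", "qwen_official"):
        # Printed for review only; run each with subprocess.run(cmd, check=True).
        print(" ".join(cmd))
```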
After parallel testing, each category folder will contain an execution_report.txt file, for example:
outputs/sonnet46/Intelligent_Creation/execution_report.txt
The report records whether each task passed or failed (including exception cases).
If some tasks failed or ended abnormally, rerun the selected tasks using:
bash scripts/run/rerun.sh
Then check the updated execution_report.txt again and summarize the final metrics.
Summarize benchmark results:
uv run python pipeline/analysis/output_statistics.py \
--base_dir outputs/qwen35 \
--output outputs/qwen35_stats.txt
This summary includes pass rate, average tokens per task, average tool calls, and other aggregate metrics.
Analyze failed tasks and error categories in three steps.
- Extract failed-task trajectories from execution logs:
uv run python pipeline/analysis/extract_failed_logs.py \
--base_dir ./outputs/qwen36 \
--output_path ./results/qwen36_analysis.jsonl
extract_failed_logs.py arguments:
- --base_dir: model output root that contains category folders and task run.log files.
- --output_path: output JSONL path for extracted failed-task trajectories.
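Before running the LLM-based analysis, it is worth confirming that the extracted file is well-formed JSONL, that is, one JSON value per line. The sketch below is a generic reader that makes no assumption about the fields inside each record, since the record schema is not documented here. The name read_jsonl is hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path) -> list:
    """Parse a JSONL file, skipping blank lines and reporting the first bad line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
    return records


if __name__ == "__main__":
    # Path taken from the example command above.
    failed = read_jsonl("./results/qwen36_analysis.jsonl")
    print(f"{len(failed)} failed-task records extracted")
```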
- Run LLM-based failure analysis (sequential script):
bash pipeline/analysis/run_analysis_sequential.sh
run_analysis_sequential.sh behavior:
- Reads <model>_analysis.jsonl from outputs/.
- Calls pipeline/analysis/analyse_failed_logs.py for each enabled model.
- Writes <model>_analysis_output.jsonl to outputs/.
analyse_failed_logs.py arguments:
- --input_jsonl: extracted JSONL from step 1.
- --output_jsonl: output JSONL containing structured error analysis.
Default analysis settings:
- Prompt template: code/pipeline/analysis/prompt.py.
- Analysis model: gemini-3-pro-preview (from global_configs.user_llm).
- API endpoint/key: global_configs.user_llm_url and global_configs.user_llm_key.
- Summarize category/subcategory error statistics:
uv run python pipeline/analysis/error_statistics.py \
--outputs_dir ./results \
--models qwen36 \
--save_json ./results/error_cat_summary.json
error_statistics.py arguments:
- --outputs_dir: directory containing <model>_analysis_output.jsonl files.
- --models: model prefixes to summarize.
- --save_json: optional path to save aggregated JSON summary.