Languages / 语言: English | 简体中文
Tool-use capability enables large language model (LLM) agents to interact with external systems through structured tool calls. However, existing work often uses inconsistent interaction representations, pays too little attention to the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. UniToolCall is a unified framework for tool learning that standardizes the full pipeline—from toolset construction and dataset generation to evaluation.
The framework curates a tool pool of 22k+ tools and a hybrid training corpus of 390k+ instances that combines 10 standardized public datasets with structurally controlled synthetic trajectories covering single-hop, multi-hop, single-turn, and multi-turn interactions. The synthetic trajectories explicitly model serial and parallel execution, and an Anchor Linkage mechanism enforces cross-turn dependencies in multi-turn interactions. We further convert 7 public benchmarks into a unified Query-Action-Observation-Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels.
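To make the representation concrete, a single-turn instance in a QAOA style might look like the sketch below. The field names and example tool (`get_forecast`) are illustrative assumptions, not the repository's actual schema; see the released datasets for the real format.

```python
# Hypothetical sketch of one single-turn instance in a Query-Action-
# Observation-Answer (QAOA) style. Field names are illustrative only and
# are NOT the repository's actual schema.
qaoa_instance = {
    "query": "What will the weather be in Paris tomorrow?",
    "actions": [
        # Structured tool calls; parallel calls would share one action step.
        {"name": "get_forecast", "arguments": {"city": "Paris", "days": 1}},
    ],
    "observations": [
        {"name": "get_forecast", "result": {"temp_c": 18, "condition": "cloudy"}},
    ],
    "answer": "Tomorrow in Paris it should be cloudy, around 18 °C.",
}

# A multi-turn conversation would be a list of such turns; the Anchor
# Linkage idea is that a later turn's arguments depend on an earlier
# turn's observation, creating a cross-turn dependency.
```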
Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, UniToolCall achieves 93.0% single-turn Strict Precision, outperforming Qwen3-32B by 20.3 points. The figure below summarizes evaluation results.
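As we read it, Strict Precision at the function-call level is the fraction of predicted calls that exactly match a gold call, with both the function name and the full argument set identical. The sketch below is our own simplification under that assumption, not the repository's evaluation code.

```python
import json

def _key(call: dict) -> tuple:
    # Canonicalize a call: name plus arguments serialized with sorted keys,
    # so argument order does not affect matching.
    return (call["name"], json.dumps(call["arguments"], sort_keys=True))

def strict_precision(predicted: list, gold: list) -> float:
    """Fraction of predicted calls that exactly match some gold call.

    Simplified reading of function-call-level Strict Precision; the
    repository's metric may differ in details (e.g. argument typing).
    """
    if not predicted:
        return 0.0
    gold_keys = {_key(g) for g in gold}
    return sum(_key(p) in gold_keys for p in predicted) / len(predicted)

pred = [{"name": "get_forecast", "arguments": {"city": "Paris", "days": 1}}]
gold = [{"name": "get_forecast", "arguments": {"days": 1, "city": "Paris"}}]
print(strict_precision(pred, gold))  # 1.0: same name, same arguments
```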
Training data has two parts: (1) public-converted data and (2) pipeline-generated data.

- Full public-converted release on Hugging Face: huggingface.co/datasets/EIT-NLP/UniToolCall
- Pipeline-generated datasets in this repository: `multi-hop_pipeline/data/`, `multi-turn_pipeline/data/`, and `single-hop_pipeline/data/`
- Tool list used to build training data: `tool_set/apis/toolset.json`
- Python >= 3.9 (see `pyproject.toml`).
- Third-party packages are listed in `requirements.txt` and mirrored under `[project]/dependencies` in `pyproject.toml`.

```bash
cd UniToolCall
pip install -r requirements.txt
pip install -e .
```

Scripts live under `test_set/scripts/metrics/`. Configure API keys via environment variables first.
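The exact environment-variable names the scripts expect are not documented here, so check the scripts under `test_set/scripts/metrics/`; the names below are placeholders. A small pre-flight check could look like:

```python
import os

# Placeholder variable names: check the scripts under test_set/scripts/
# metrics/ for the names they actually read; these are NOT confirmed
# by the repository.
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY"]

def missing_keys(required, env=None):
    """Return the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example: fail fast before launching a long generation run.
problems = missing_keys(REQUIRED_KEYS)
if problems:
    print("Set these before running:", ", ".join(problems))
```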
From the repository root:

```bash
cd test_set/scripts/metrics
python generate_with_qwen_server_list.py
```

Typical example:

```bash
# api1 = SiliconFlow; api2 = OpenAI; api3 = Anthropic; api4 = Gemini; server/sft = local vLLM
python generate_with_qwen_server_list.py --mode api1 \
    --inputfile /path/to/benchmark_json_dir \
    --outputfile /path/to/predictions_dir
```

```bash
cd test_set/scripts/metrics
python all_evaluation.py
```

Batch evaluation:

```bash
python all_evaluation.py \
    --inputfile /path/to/predictions_dir \
    --outputfile /path/to/eval_results_dir \
    --gtfile /path/to/gt_json_dir
```

Dataset-construction pipelines live under `multi-hop_pipeline/scripts/`, `multi-turn_pipeline/scripts/`, and `single-hop_pipeline/scripts/`; run from the relevant directory, e.g. `cd multi-hop_pipeline/scripts && python generate_via_api.py`.
| Path | Description |
|---|---|
| `multi-hop_pipeline/` | Multi-hop trajectory generation, QC, augmentation, standardization |
| `multi-turn_pipeline/` | Multi-turn generation |
| `single-hop_pipeline/` | Single-hop data utilities |
| `test_set/` | Benchmarks |
| `tool_set/` | Tool corpus |
| `train_set/` | Training-data preparation |
| `src/uni_toolcall/` | Shared package (paths, prompts, secrets) |
Pipeline folders (`*_pipeline/`) usually contain `prompts/` and `outputs/`. `multi-hop_pipeline/` may also include `db/` (e.g. `usage_stats.json` for sampling).
Under `train_set/scripts/` and `test_set/scripts/`, subfolders such as `convert/`, `analysis/`, `metrics/`, and `toollist/` group scripts by role.
See the LICENSE file in the repository root (Apache License 2.0).


