- [November 13, 2025]: 📄 Our paper is now available on arXiv and Hugging Face.
- [November 13, 2025]: 🚀 Our codebase is released. You can use LoopTool to construct dialogues tailored to your own toolset, and fine-tune a language model optimized for particular tools using RL algorithms. You can further refine performance by iteratively updating both the training data and the trained model.
Augmenting Large Language Models (LLMs) with external tools enables them to execute complex, multi-step tasks. However, tool learning is hampered by static synthetic data pipelines, in which data generation and model training are executed as two separate, non-interactive processes. This approach fails to adaptively focus on a model's specific weaknesses and allows noisy labels to persist, degrading training efficiency.
We introduce LoopTool, a fully automated, model-aware data evolution framework that closes this loop by tightly integrating data synthesis and model training. LoopTool iteratively refines both the data and the model through three synergistic modules: (1) Greedy Capability Probing (GCP) diagnoses the model's mastered and failed capabilities; (2) Judgement-Guided Label Verification (JGLV) uses an open-source judge model to find and correct annotation errors, progressively purifying the dataset; and (3) Error-Driven Data Expansion (EDDE) generates new, challenging samples based on identified failures. This closed-loop process operates within a cost-effective, open-source ecosystem, eliminating dependence on expensive closed-source APIs.
Experiments show that our 8B model trained with LoopTool significantly surpasses its 32B data generator and achieves new state-of-the-art results on the BFCL-v3 and ACEBench benchmarks for its scale. Our work demonstrates that closed-loop, self-refining data pipelines can dramatically enhance the tool-use capabilities of LLMs.
We compare LoopTool-8B and LoopTool-32B with various representative models on BFCL-v3 and ACEBench. We adopt the official evaluation scripts and report the average accuracy across categories. On both leaderboards, LoopTool-8B achieves SOTA performance among all 8B-scale open-source models and exceeds several larger counterparts. LoopTool-32B takes the top position on BFCL-v3; on ACEBench it shows the best performance among open-source models and ranks second overall, immediately behind GPT-4o.
Key Features:

- Seed Data Generation: We support customizable toolsets, using a multi-agent simulation framework to generate dialogue flows centered on a specific toolset. The overall architecture comprises four roles: Planner, User, Assistant, and Tool (see the sketch after this list).
- End-to-End RL Training with Verl: We support transforming tool-invocation dialogues into samples suited for reinforcement learning. By integrating the Verl library, we provide supervision over the model's tool-invocation steps.
- Iterative Data and Model Evolution: LoopTool iteratively refines both the data and the model through three synergistic modules: (1) Greedy Capability Probing (GCP) diagnoses the model's mastered and failed capabilities; (2) Judgement-Guided Label Verification (JGLV) uses an open-source judge model to find and correct annotation errors, progressively purifying the dataset; and (3) Error-Driven Data Expansion (EDDE) generates new, challenging samples based on identified failures.
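To make the four-role simulation concrete, here is a minimal, hypothetical sketch of one generation loop. The `chat` stub, role prompts, and control flow are illustrative assumptions, not the repo's actual API; the real orchestration lives in `dialog_generation/run.py`.

```python
# Minimal, hypothetical sketch of the Planner/User/Assistant/Tool loop.
# `chat` is a stub standing in for a call to the generator LLM (e.g., Qwen3-32B).
import json

def chat(instruction: str, history: list[dict]) -> str:
    """Stub: replace with a real LLM call via your preferred client."""
    return "Done."  # a reply without <tool_call>, so the demo loop terminates

def simulate_dialog(toolset: list[dict], max_turns: int = 6) -> list[dict]:
    # Planner: sketch a multi-step task that exercises the given toolset.
    plan = chat(f"Plan a multi-step task using these tools: {json.dumps(toolset)}", [])
    # User: turn the plan into a natural-language request.
    history = [{"role": "user", "content": chat(f"Ask the assistant to do: {plan}", [])}]
    for _ in range(max_turns):
        # Assistant: answer directly or emit a <tool_call> block.
        reply = chat("Act as the assistant; call tools when needed.", history)
        history.append({"role": "assistant", "content": reply})
        if "<tool_call>" not in reply:
            break  # no tool requested: the dialogue is complete
        # Tool: simulate the invoked tool's output and feed it back.
        result = chat(f"Simulate the tool output for: {reply}", history)
        history.append({"role": "tool", "content": result})
    return history
```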
# Create conda environment
conda create -n looptool python=3.10
conda activate looptool
# Install requirements
cd LoopTool
pip install -r requirements.txt

We recommend following the official Verl guidance to correctly install the verl library.
The scripts for seed data generation are organized in the dialog_generation folder (cd dialog_generation).
- The tools folder contains the complete set of tools. You may extend it with your own custom toolsets; we provide 12,000 tool-set descriptions scraped from ToolBench (a hedged example of a custom entry appears at the end of this section).
- Run the data generation:

python run.py --func data_crawl --raw_data_path ./dialogs/toolbench --tools_path tools/toolbench --used_models Qwen3-32B --is_en --use_plan --use_cot --thread_num 10 --crawl_num 1000

- Transform the dialogs into conversations:
# Please configure the file path properly.
python trans2conversation.py
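If you add custom tools, each entry needs a schema the generator can read. The exact format is defined by the files in tools/toolbench; the entry below follows the common JSON function-calling schema and is only a hedged illustration, so mirror an existing file rather than this guess.

```python
# Hypothetical custom tool entry in the common function-calling schema.
# Check an existing description under tools/toolbench for the authoritative format.
custom_tool = {
    "name": "get_weather",
    "description": "Query the current weather for a given city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Shanghai'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}
```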
The scripts of GRPO training are organized in the grpotool folder.

- Transform the conversation into GRPO training samples:
cd grpotool/dataset
# Please configure the file path properly.
python utils/conversation_transform_grpo_qwen.py
# (Optional) filter the training samples
python utils/filter_grpo_sample.py
# Please configure the file path properly.
python utils/json2parquet.py
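For reference, the final conversion step boils down to serializing the JSON samples into a Parquet file that verl's data loader can read. Below is a minimal sketch under assumed file names; the authoritative field layout is defined in utils/json2parquet.py.

```python
# Minimal sketch of the JSON-to-Parquet step; file names here are assumptions.
import json
import pandas as pd  # to_parquet requires pyarrow (or fastparquet)

with open("grpo_samples.json") as f:
    samples = json.load(f)  # a list of dicts, e.g. prompt / ground-truth pairs

pd.DataFrame(samples).to_parquet("train.parquet", index=False)
```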
- GRPO Training for Robust Tool Calls:

# Qwen3-8B (8 H800)
bash train_grpo_qwen.sh
# Qwen3-32B (32 H800)
bash multinode_qwen32b.sh
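The training scripts implement the supervision over tool-invocation steps mentioned in Key Features. As a hedged illustration only, a rule-based tool-call reward in the style of verl's custom reward interface might look like the sketch below; the function signature and exact matching rules are assumptions, not LoopTool's actual reward.

```python
# Hedged sketch of a rule-based tool-call reward; not LoopTool's actual reward.
import json
import re

TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 1.0 when every predicted tool call matches the reference, else 0.0."""
    try:
        pred = [json.loads(m) for m in TOOL_CALL.findall(solution_str)]
        ref = [json.loads(m) for m in TOOL_CALL.findall(ground_truth)]
    except json.JSONDecodeError:
        return 0.0  # malformed JSON inside <tool_call> tags
    if len(pred) != len(ref):
        return 0.0
    for p, r in zip(pred, ref):
        if p.get("name") != r.get("name") or p.get("arguments") != r.get("arguments"):
            return 0.0
    return 1.0
```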
The scripts of data iteration are organized in the dataloop folder. Please configure the file paths properly according to your dataset.

- Greedy Capability Probing (GCP) queries the fine-tuned model on the training corpus using greedy decoding, revealing mastered, borderline, and failure cases.
- Judgement-Guided Label Verification (JGLV) employs Qwen3-32B to compare each prediction against its reference label.
- Error-Driven Data Expansion (EDDE) transforms verified failure cases into new, structurally similar but contextually diverse challenging samples.
cd dataloop
# Greedy Capability Probing
python greedy_capability_prob.py
# Judgement-Guided Label Verification (JGLV)
python judgement_label_verification.py
# Error-Driven Data Expansion (EDDE)
python error_data_expansion.py
# (Optional) filter the newly generated samples
python filter_grpo_sample.py

Upon obtaining a new round of data, we load the checkpoint from the previous round's model and employ the new data to conduct GRPO reinforcement learning training.
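Putting the stages together, one evolution round amounts to running the three scripts and then resuming GRPO from the latest checkpoint. The glue below is conceptual pseudocode under assumed paths and round counts; the repo runs each stage as a standalone script as shown above.

```python
# Conceptual driver for LoopTool's evolution rounds; the paths, round count,
# and this wrapper itself are assumptions, not part of the repo.
import subprocess

def run(cmd: str) -> None:
    subprocess.run(cmd, shell=True, check=True)

for round_id in range(1, 4):  # e.g., three data-model evolution rounds
    run("python greedy_capability_prob.py")        # GCP: probe the current model
    run("python judgement_label_verification.py")  # JGLV: verify/correct labels
    run("python error_data_expansion.py")          # EDDE: expand failure cases
    run("python filter_grpo_sample.py")            # optional: filter new samples
    run("bash ../grpotool/train_grpo_qwen.sh")     # resume GRPO from last checkpoint
```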
We recommend following the official BFCL guidelines to configure the evaluation environment, and using the handler located in bfcl/qwentool.py as the Handler when evaluating our model.
In the ACEBench evaluation, we instruct the model to wrap tool calls in <tool_call> </tool_call> tags to align with the model's training format.
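Concretely, a conforming completion wraps a JSON object in the tags, in the style of Qwen's chat template; the tool name and arguments below are illustrative only.

```python
# Example of the expected output format: a JSON tool call wrapped in
# <tool_call> tags (tool name and arguments are just an example).
expected_output = (
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Shanghai"}}\n'
    "</tool_call>"
)
```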
We sincerely appreciate the contributions of the open-source community:
If you find this work helpful, please consider citing our paper:
@misc{zhang2025looptool,
title={LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls},
author={Kangning Zhang and Wenxiang Jiao and Kounianhua Du and Yuan Lu and Weiwen Liu and Weinan Zhang and Lei Zhang and Yong Yu},
year={2025},
eprint={2511.09148},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.09148},
}

This project is released under the MIT License.
For any questions or feedback, please reach out to us at zhangkangning@sjtu.edu.cn.



