ClawGym is a scalable framework for synthesizing data, training agents, and evaluating Claw-style personal agents in local, stateful workspaces. 👨‍💻
⚠️ Code, datasets, benchmark assets, and model checkpoints are currently under internal company review and will be released soon.
- [2026.05.01] We release Claw Series Part I: ClawGym, our first work on scalable data synthesis, training, and evaluation for Claw-style agents.
ClawGym is a scalable framework for building, training, and evaluating Claw-style personal agents across realistic local workspace environments.
ClawGym supports the full lifecycle of personal-agent development. It first constructs ClawGym-SynData, a diverse dataset of 13.5K executable tasks synthesized from persona-driven intents and skill-grounded operations. Each task is paired with a realistic mock workspace and hybrid verification mechanisms, enabling reproducible execution and automated evaluation.
Based on these synthesized tasks, we collect interaction trajectories through black-box rollouts and use them to train a family of capable Claw-style models, termed ClawGym-Agents. We further explore reinforcement learning (RL) through a lightweight sandbox-parallel pipeline that supports both Docker-based and Docker-free backends and learns directly from outcome rewards.
To support reliable evaluation, we build ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review.
ClawGym-SynData contains 13.5K executable Claw-style tasks. It combines two synthesis routes:
- Persona-driven synthesis: samples user profiles, scenario categories, and atomic operations to generate realistic workspace-grounded requests (sketched in code below).
- Skill-grounded synthesis: builds tasks from OpenClaw skills, using one primary skill with optional supporting skills to encourage multi-step workflows.
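
For intuition, here is a minimal sketch of the persona-driven route, assuming hypothetical persona/scenario/operation pools and a stubbed request composer; none of these names come from the released pipeline.

```python
import random

# Hypothetical pools; the real pipeline samples from 9 scenario categories,
# 43 subcategories, and 26 atomic operations (the names below are illustrative).
PERSONAS = ["freelance designer", "graduate student", "sales manager"]
SCENARIOS = ["file organization", "schedule management", "data analysis"]
ATOMIC_OPS = ["rename_files", "create_calendar_event", "summarize_csv"]

def synthesize_persona_task(rng: random.Random) -> dict:
    """Sample a persona, a scenario, and 1-2 atomic operations, then compose
    a workspace-grounded request (the LLM rewriting step is stubbed here)."""
    persona = rng.choice(PERSONAS)
    scenario = rng.choice(SCENARIOS)
    ops = rng.sample(ATOMIC_OPS, k=rng.randint(1, 2))
    return {
        "persona": persona,
        "scenario": scenario,
        "operations": ops,
        # The real pipeline would also attach mock workspace resources
        # and a task-specific verifier.
        "instruction": (f"As a {persona}, I need help with {scenario} "
                        f"involving {', '.join(ops)}."),
    }

print(synthesize_persona_task(random.Random(0)))
```
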
The task generation process covers 9 scenario categories, 43 subcategories, 7 operation categories, and 26 atomic operations. For skill-grounded synthesis, we annotate 16,837 collected skills across categories such as Data & APIs, Dev Tools, Workflows, Automation, Security, Prompts, MCP Tools, and others.
Each task is paired with lightweight mock resources and task-specific verifiers. Human-sampled quality analysis over 50 training tasks gives an overall score of 4.06 / 5, indicating good task coherence, resource consistency, and verifier quality.
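
As a rough illustration, a code-based verifier could be as simple as the sketch below; the workspace layout, expected artifact, and scoring split are assumptions for exposition, not the released interface.

```python
from pathlib import Path

def verify_task(workspace: Path) -> float:
    """Hypothetical code-based verifier: check that the agent produced the
    expected artifact in the mock workspace. Returns a score in [0, 1]."""
    report = workspace / "reports" / "summary.md"  # assumed expected output
    if not report.exists():
        return 0.0
    # Partial credit: artifact exists (0.5) plus a required section (0.5).
    return 0.5 + 0.5 * ("## Key Findings" in report.read_text())
```
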
ClawGym-Agents are trained from black-box OpenClaw rollouts on ClawGym-SynData. We collect 24.5K interaction trajectories using teacher rollouts from MiniMax-M2.5 and GLM-5.1, then filter trajectories by verifier scores.
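
A minimal sketch of that filtering step, assuming each trajectory record carries the score its rollout received from the task verifier (the field name and threshold are illustrative, not the released schema):

```python
def filter_by_verifier(trajectories: list[dict], min_score: float = 1.0) -> list[dict]:
    """Keep only trajectories whose rollout passed the task verifier.
    `verifier_score` and the default threshold are assumptions."""
    return [t for t in trajectories if t["verifier_score"] >= min_score]
```
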
The selected trajectories are long-horizon and tool-intensive:
| Avg. Rounds | Avg. Tokens | Avg. Tool Calls | Avg. Tool Types |
|---|---|---|---|
| 13.00 | 18.67K | 15.82 | 3.25 |
We perform multi-turn SFT on Qwen3-series backbones and obtain ClawGym-4B, ClawGym-8B, and ClawGym-30B-A3B. We also explore reinforcement learning (RL) through a lightweight sandbox-parallel pipeline.
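
To make the Docker-based vs. Docker-free distinction concrete, here is a hedged sketch of a pluggable sandbox backend; the class and method names are assumptions, not the released API.

```python
import subprocess
from abc import ABC, abstractmethod

class SandboxBackend(ABC):
    """Hypothetical execution backend for the sandbox-parallel RL pipeline."""

    @abstractmethod
    def run(self, command: list[str], workdir: str) -> subprocess.CompletedProcess:
        ...

class DockerBackend(SandboxBackend):
    """Isolate each rollout in a throwaway container."""

    def __init__(self, image: str = "python:3.11-slim"):
        self.image = image

    def run(self, command, workdir):
        return subprocess.run(
            ["docker", "run", "--rm", "-v", f"{workdir}:/work", "-w", "/work",
             self.image, *command],
            capture_output=True, text=True,
        )

class LocalBackend(SandboxBackend):
    """Docker-free fallback: run directly in a per-task scratch directory."""

    def run(self, command, workdir):
        return subprocess.run(command, cwd=workdir, capture_output=True, text=True)
```

Under this kind of abstraction, switching backends only changes which `run` implementation the rollout workers call.
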
ClawGym-Bench is a diagnostic benchmark of 200 instances for Claw-style agents. Each task contains a user instruction, mock workspace resources, and a task-specific verifier.
- 156 tasks use code-based verification.
- 44 tasks use hybrid verification, combining code checks with rubric-based judgment.
- Hybrid scoring uses 0.7 weight for code-based verification and 0.3 weight for rubric-based verification.
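
Concretely, assuming both verifiers return scores in [0, 1], the hybrid score is a fixed-weight combination:

```python
def hybrid_score(code_score: float, rubric_score: float) -> float:
    """Weighted hybrid verification, assuming both inputs lie in [0, 1]."""
    return 0.7 * code_score + 0.3 * rubric_score

# Example: code checks fully pass, the rubric judge awards 0.8.
print(hybrid_score(1.0, 0.8))  # 0.94
```
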
The benchmark is selected through difficulty-aware filtering and human-LLM review. It covers six workspace-grounded categories:
| Category | Product. & Collab. | Systems & Auto. | Analysis & Reason. | Content & Domain | Planning & Knowl. | Software Dev. |
|---|---|---|---|---|---|---|
| # Tasks | 44 | 42 | 35 | 28 | 26 | 25 |
We evaluate ClawGym-Agents on ClawGym-Bench and PinchBench. The main results show that training on ClawGym-SynData consistently improves compact open-weight backbones.
Our implementation builds upon the excellent codebases of slime, OpenClaw, OpenClaw-RL, PinchBench, OpenRLHF and Megatron-LM.
We sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly facilitated our research.
@misc{bai2026clawgymscalableframeworkbuilding,
title={ClawGym: A Scalable Framework for Building Effective Claw Agents},
author={Fei Bai and Huatong Song and Shuang Sun and Daixuan Cheng and Yike Yang and Chuan Hao and Renyuan Li and Feng Chang and Yuan Wei and Ran Tao and Bryan Dai and Jian Yang and Wayne Xin Zhao},
year={2026},
eprint={2604.26904},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.26904},
}

For any questions or feedback, please reach out to us at feibai@ruc.edu.cn, songhuatong123@ruc.edu.cn, or sunshuang@ruc.edu.cn.

