Rui Shao1,3,†, Ruize Gao2,†, Bin Xie1, Yixing Li1, Kaiwen Zhou4, Shuai Wang4, Weili Guan1,3, Gongwei Chen1,*
1 Harbin Institute of Technology, Shenzhen
2 National University of Singapore, CNRS@CREATE
3 Shenzhen Loop Area Institute
4 Huawei Noah’s Ark Lab
† Equal contribution * Corresponding author
Overview of trajectory synthesis paradigms. Compared with (a) existing methods, (b) HATS integrates hardness-driven exploration and alignment-guided refinement in a closed loop, producing high-quality trajectories with rich semantic coverage and strong instruction-execution alignment. (c) Experiments show HATS outperforms OS-Genesis by 100%↑ on AndroidWorld (22.60 vs. 11.30) and 215%↑ on WebArena (20.60 vs. 6.53).
Current GUI trajectory synthesis pipelines struggle with semantically ambiguous actions: interactions whose functional meaning depends on contextual, sequential, or visual cues. These actions are:
- Under-represented: Over 70% of collected traces collapse into trivial actions like "open menu" or "tap back"
- Poorly processed: Even when captured, they often lead to instruction-execution misalignment, introducing noisy supervision
Examples of semantically ambiguous actions include:
- (a) Identical icons triggering different functions depending on context
- (b) Operations requiring prerequisite steps to succeed
- (c) Visually similar elements leading to distinct outcomes
HATS consists of two cooperative modules unified through Hardness-Driven Monte Carlo Tree Search (HD-MCTS):
Problem with uniform exploration: Random walks oversample trivial actions and miss semantically challenging interactions.
Our solution: Replace random exploration with a hardness-aware policy that:
- Uses UCB-based selection to balance exploration and exploitation
- Prioritizes under-represented, semantically complex UI states
- Concentrates search effort on high-value, ambiguous actions
Problem with one-shot synthesis: Direct instruction generation produces vague descriptions that fail to replay consistently.
Our solution: Multi-round refinement process that:
- Synthesizes initial instruction from exploration trace
- Replays instruction to verify execution consistency
- Measures alignment using action-level reconstruction recall
- Refines instruction by injecting missing contextual cues
- Iterates until semantic alignment is achieved (R ≥ 0.7)
Only verified trajectories passing alignment checks are admitted to the training corpus.
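The refinement steps above can be sketched as a verify-and-refine loop. This is a simplified illustration under assumptions: `synthesize`, `replay`, and `refine` are hypothetical callables standing in for the model-backed steps, and recall is computed by naive set membership rather than the paper's action matcher.

```python
def refine_until_aligned(trace, synthesize, replay, refine,
                         threshold=0.7, max_rounds=5):
    """Keep an instruction only once its replayed actions reconstruct
    the exploration trace with recall >= threshold.

    `synthesize`, `replay`, and `refine` are hypothetical callables
    standing in for the model-backed steps described above.
    """
    instruction = synthesize(trace)
    recall = 0.0
    for _ in range(max_rounds):
        replayed = replay(instruction)
        # Action-level reconstruction recall: fraction of ground-truth
        # actions recovered by the replay.
        hits = sum(1 for action in trace if action in replayed)
        recall = hits / len(trace)
        if recall >= threshold:
            return instruction, recall  # admitted to the corpus
        # Inject missing context and try again.
        instruction = refine(instruction, trace, replayed)
    return None, recall  # rejected: never reached alignment
```

Returning `None` for trajectories that never reach the threshold mirrors the admission rule: only verified, aligned trajectories enter the training corpus.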
The two modules form a feedback cycle:
- Exploration → Refinement: Hardness-driven search supplies challenging trajectories for validation
- Refinement → Exploration: Misalignment signals are converted into hardness rewards that guide future exploration
This closed loop progressively enhances both the diversity (coverage of semantically ambiguous actions) and the fidelity (instruction-execution alignment) of the synthesized data.
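The refinement-to-exploration direction of the cycle can be sketched as converting a misalignment signal into a running hardness estimate per UI state. The exponential smoothing, the `decay` constant, and the `node_stats` mapping are illustrative assumptions, not the paper's stated update rule.

```python
def update_hardness(node_stats, state_id, recall, decay=0.9):
    """Convert a misalignment signal (1 - recall) into a smoothed
    hardness score for a UI state, so future search rounds prioritize
    states whose trajectories failed alignment checks.

    `node_stats` maps state ids to hardness in [0, 1]; all names and
    the smoothing scheme are illustrative.
    """
    signal = 1.0 - recall                 # low recall => hard state
    prev = node_stats.get(state_id, 0.0)
    node_stats[state_id] = decay * prev + (1.0 - decay) * signal
    return node_stats[state_id]
```

States that repeatedly produce misaligned instructions accumulate hardness, which is what steers the hardness-driven search back toward them in the next round.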
If you find HATS useful for your research, please cite our paper:
@inproceedings{shao2026hats,
title={HATS: Hardness-Aware Trajectory Synthesis for GUI Agents},
author={Shao, Rui and Gao, Ruize and Xie, Bin and Li, Yixing and Zhou, Kaiwen and Wang, Shuai and Guan, Weili and Chen, Gongwei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}





