
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents

Rui Shao1,3,†, Ruize Gao2,†, Bin Xie1, Yixing Li1, Kaiwen Zhou4, Shuai Wang4, Weili Guan1,3, Gongwei Chen1,*

1 Harbin Institute of Technology, Shenzhen    2 National University of Singapore, CNRS@CREATE
3 Shenzhen Loop Area Institute    4 Huawei Noah’s Ark Lab

† Equal contribution   * Corresponding author

Paper · Data · Model · Project Page

💡 Brief View

Overview of trajectory synthesis paradigms. Compared with (a) existing methods, (b) HATS integrates hardness-driven exploration and alignment-guided refinement in a closed loop, producing high-quality trajectories with rich semantic coverage and strong instruction-execution alignment. (c) Experiments show HATS outperforms OS-Genesis by 100%↑ on AndroidWorld (22.60 vs. 11.30) and 215%↑ on WebArena (20.60 vs. 6.53).


🔍 The Problem: Semantic-Ambiguous Actions

Current GUI trajectory synthesis pipelines struggle with semantic-ambiguous actions—interactions whose functional meaning depends on contextual, sequential, or visual cues. These actions are:

  • Under-represented: Over 70% of collected traces collapse into trivial actions like "open menu" or "tap back"
  • Poorly processed: When captured, they often lead to instruction-execution misalignment, introducing noisy supervision

Examples of semantic-ambiguous actions include:

  • (a) Identical icons triggering different functions depending on context
  • (b) Operations requiring prerequisite steps to succeed
  • (c) Visually similar elements leading to distinct outcomes

🏗️ HATS Framework

HATS consists of two cooperative modules unified through Hardness-Driven Monte Carlo Tree Search (HD-MCTS):

1️⃣ Hardness-Driven Exploration Module

Problem with uniform exploration: Random walks oversample trivial actions and miss semantically challenging interactions.

Our solution: Replace random exploration with a hardness-aware policy that:

  • Uses UCB-based selection to balance exploration and exploitation
  • Prioritizes under-represented, semantically complex UI states
  • Concentrates search effort on high-value, ambiguous actions
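The selection rule above can be sketched as a standard UCB1 score with an additive hardness bonus. This is an illustrative sketch, not the paper's exact formulation: the function names, the linear `beta * hardness` bonus, and the `hardness` field (assumed to be a score in [0, 1] for how semantically ambiguous an action is) are all assumptions.

```python
import math

def ucb_score(node_value, node_visits, parent_visits, hardness,
              c=1.4, beta=0.5):
    """UCB1 score plus a hardness bonus (illustrative, not the paper's
    exact formula). `hardness` in [0, 1] estimates semantic ambiguity;
    `beta` weights that bonus against the standard exploration term."""
    if node_visits == 0:
        return float("inf")  # always try unvisited actions first
    exploit = node_value / node_visits                       # mean value
    explore = c * math.sqrt(math.log(parent_visits) / node_visits)
    return exploit + explore + beta * hardness

def select_action(children, parent_visits):
    """Pick the child action with the highest hardness-aware UCB score."""
    return max(
        children,
        key=lambda ch: ucb_score(ch["value"], ch["visits"],
                                 parent_visits, ch["hardness"]),
    )
```

With this scoring, a rarely visited, high-hardness action can outrank a frequently rewarded trivial one, which is what steers search effort toward semantic-ambiguous interactions.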

2️⃣ Alignment-Guided Refinement Module

Problem with one-shot synthesis: Direct instruction generation produces vague descriptions that fail to replay consistently.

Our solution: Multi-round refinement process that:

  • Synthesizes initial instruction from exploration trace
  • Replays instruction to verify execution consistency
  • Measures alignment using action-level reconstruction recall
  • Refines instruction by injecting missing contextual cues
  • Iterates until semantic alignment is achieved (R ≥ 0.7)

Only verified trajectories passing alignment checks are admitted to the training corpus.
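The refinement loop can be sketched as follows. The action-matching rule (an order-preserving match via `difflib.SequenceMatcher`) and the `replay_fn` / `refine_fn` callbacks are assumptions for illustration; the paper's exact recall definition and refinement prompt are not reproduced here.

```python
from difflib import SequenceMatcher

def reconstruction_recall(reference, replay):
    """Share of reference actions reproduced, in order, during replay.
    Order-preserving matching via SequenceMatcher is an assumption,
    standing in for the paper's action-level matching rule."""
    if not reference:
        return 1.0
    blocks = SequenceMatcher(None, reference, replay).get_matching_blocks()
    matched = sum(b.size for b in blocks)
    return matched / len(reference)

def refine_until_aligned(instruction, reference, replay_fn, refine_fn,
                         threshold=0.7, max_rounds=5):
    """Replay -> measure recall -> refine, until R >= threshold.
    `replay_fn` executes the instruction and returns an action list;
    `refine_fn` injects missing contextual cues (both hypothetical)."""
    recall = 0.0
    for _ in range(max_rounds):
        recall = reconstruction_recall(reference, replay_fn(instruction))
        if recall >= threshold:
            return instruction, recall   # admitted to the training corpus
        instruction = refine_fn(instruction, recall)
    return None, recall                  # rejected: alignment check failed
```

Returning `None` on failure mirrors the admission rule above: only trajectories whose instructions replay with R ≥ 0.7 enter the corpus.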


🔄 Closed-Loop Integration

The two modules form a feedback cycle:

  1. Exploration → Refinement: Hardness-driven search supplies challenging trajectories for validation
  2. Refinement → Exploration: Misalignment signals are converted into hardness rewards that guide future exploration

This closed loop progressively enhances both diversity (coverage of semantic-ambiguous actions) and fidelity (instruction-execution alignment) of synthesized data.
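One simple way to realize step 2 is to fold replay recall back into the hardness estimates that the UCB policy consumes. The exponential-moving-average form and the `lr` rate below are assumptions for illustration, not the paper's update rule.

```python
def update_hardness(hardness, recall, lr=0.3):
    """Convert a replay misalignment signal into a hardness reward
    (illustrative EMA update, not the paper's exact rule).

    Low reconstruction recall means high misalignment, so the hardness
    estimate for states along that trajectory rises and HD-MCTS is
    steered back toward them; well-aligned trajectories decay it."""
    misalignment = 1.0 - recall
    return (1 - lr) * hardness + lr * misalignment
```

For example, a state on a trajectory that failed replay entirely (recall 0) has its hardness pulled up toward 1, while a state on a cleanly replayed trajectory (recall 1) has it pulled down toward 0.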


📊 Main Experimental Results

Main Results on AndroidWorld

Main Results on WebArena


🎓 Citation

If you find HATS useful for your research, please cite our paper:

@inproceedings{shao2026hats,
  title={HATS: Hardness-Aware Trajectory Synthesis for GUI Agents},
  author={Shao, Rui and Gao, Ruize and Xie, Bin and Li, Yixing and Zhou, Kaiwen and Wang, Shuai and Guan, Weili and Chen, Gongwei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
