View-guided Hierarchical VLA for Precise Tube PnP Task

Abstract

Vision-Language-Action (VLA) models have demonstrated remarkable potential in general-purpose robotic manipulation. However, their application to high-precision tasks remains constrained by limited input resolution and the inability to resolve visually ambiguous objects, such as semi-transparent test tubes. In this paper, we propose a novel View-guided Hierarchical VLA framework designed to bridge the gap between semantic understanding and high-precision execution in tube pick-and-place (PnP) tasks. Unlike existing hierarchical methods that focus primarily on long-horizon planning, our approach leverages a view-guided visual attention mechanism to enhance spatial precision. Specifically, we employ a high-level vision analyzer comprising a Vision-Language Model (VLM) for stage identification and a mixture of lightweight YOLO experts for localized focus. These experts, trained with data collected via a semi-supervised pipeline using SAM3, generate dynamic bounding boxes that direct the low-level policy (based on Pi0.5) to zoom in on critical interaction regions. By explicitly providing view-guided visual focus, we mitigate the resolution bottleneck and amplify feature representation for challenging transparent objects. Experimental results demonstrate that our framework significantly outperforms baseline VLAs in success rate for accurate tube manipulation, validating the effectiveness of integrating view guidance into hierarchical control.

Data Annotation

Data is annotated interactively with sam_ui (under sam3-gradio). The resulting annotations support the semi-supervised data collection and are used to train the YOLO networks.

Dataset Preparation

The scripts under scripts/transform convert raw data into the lerobot data format and extract the focus regions. The converted dataset can be used directly for OpenPI training.
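
To illustrate what extracting a focus region might involve, the sketch below crops a padded bounding box out of a camera frame and resizes the crop to a fixed size. It is not the code in scripts/transform: the function name `crop_focus_region`, the `margin` and `out_size` parameters, and the (x1, y1, x2, y2) pixel box layout are all assumptions made for this example, and nearest-neighbor resizing is used only to keep the sketch free of dependencies beyond NumPy.

```python
import numpy as np


def crop_focus_region(frame, box, margin=0.2, out_size=(224, 224)):
    """Crop a padded bounding box from `frame` and resize it.

    frame: H x W x C array. box: (x1, y1, x2, y2) in pixels (assumed layout).
    All names and defaults here are hypothetical, not taken from the repository.
    """
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box

    # Grow the box by `margin` of its size on every side to keep some context.
    pad_x = (x2 - x1) * margin
    pad_y = (y2 - y1) * margin

    # Clip to the image bounds and convert to integer pixel indices.
    x1 = int(max(0, np.floor(x1 - pad_x)))
    y1 = int(max(0, np.floor(y1 - pad_y)))
    x2 = int(min(w, np.ceil(x2 + pad_x)))
    y2 = int(min(h, np.ceil(y2 + pad_y)))
    if x2 <= x1 or y2 <= y1:
        raise ValueError("bounding box lies outside the frame")

    crop = frame[y1:y2, x1:x2]

    # Nearest-neighbor resize: map each output pixel to a source row/column.
    out_h, out_w = out_size
    rows = np.arange(out_h) * crop.shape[0] // out_h
    cols = np.arange(out_w) * crop.shape[1] // out_w
    return crop[rows[:, None], cols]
```

A call such as `crop_focus_region(frame, (300, 100, 400, 200))` returns a 224 x 224 view centered on the detected region. In practice an interpolating resize from an image library would likely give better inputs than nearest neighbor.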

Inference Configuration

dobot_tube_view_auto: Divides the task into stages with hand-written rules based on YOLO detections.
dobot_tube_view_staged: Divides the task into stages with a VLM and a prompt.
dobot_tube_view: Does not use stage division.
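
The relationship between the three configuration names and their stage-division strategy can be summarized as a small lookup. This is only a sketch for orientation: the `StageDivision` enum, the `STAGE_DIVISION_BY_CONFIG` table, and the `stage_division_for` function are hypothetical names introduced here and do not exist in the repository. Only the three configuration names and their described behavior come from this README.

```python
from enum import Enum


class StageDivision(Enum):
    """Hypothetical labels for how a configuration decides the current stage."""

    YOLO_RULES = "yolo_rules"  # hand-written rules over YOLO detections
    VLM_PROMPT = "vlm_prompt"  # a VLM queried with a prompt
    NONE = "none"              # no stage division


# Configuration names are from this README; the mapping itself is illustrative.
STAGE_DIVISION_BY_CONFIG = {
    "dobot_tube_view_auto": StageDivision.YOLO_RULES,
    "dobot_tube_view_staged": StageDivision.VLM_PROMPT,
    "dobot_tube_view": StageDivision.NONE,
}


def stage_division_for(config_name):
    """Return the stage-division strategy for a configuration name."""
    try:
        return STAGE_DIVISION_BY_CONFIG[config_name]
    except KeyError:
        known = ", ".join(sorted(STAGE_DIVISION_BY_CONFIG))
        raise ValueError(
            f"unknown config {config_name!r}; expected one of: {known}"
        ) from None
```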
