Astricaelus/VLAViewDependence

View-guided Hierarchical VLA for Precise Tube PnP Task

Abstract

Vision-Language-Action (VLA) models have demonstrated remarkable potential in general-purpose robotic manipulation. However, their application to high-precision tasks remains constrained by limited input resolution and the inability to resolve visually ambiguous objects, such as semi-transparent test tubes. In this paper, we propose a novel View-guided Hierarchical VLA framework designed to bridge the gap between semantic understanding and high-precision execution in tube pick-and-place (PnP) tasks. Unlike existing hierarchical methods that focus primarily on long-horizon planning, our approach leverages a view-guided visual attention mechanism to enhance spatial precision. Specifically, we employ a high-level vision analyzer comprising a Vision-Language Model (VLM) for stage identification and a mixture of lightweight YOLO experts for localized focus. These experts, trained with data collected via a semi-supervised pipeline using SAM3, generate dynamic bounding boxes that direct the low-level policy (based on Pi0.5) to zoom in on critical interaction regions. By explicitly providing view-guided visual focus, we mitigate the resolution bottleneck and amplify feature representation for challenging transparent objects. Experimental results demonstrate that our framework significantly outperforms baseline VLAs in success rate for accurate tube manipulation, validating the effectiveness of integrating view guidance into hierarchical control.

Data Annotation

Data is annotated interactively with sam_ui, located under sam3-gradio. The resulting annotations support semi-supervised data collection and are used to train the YOLO networks.
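As an illustration of how a segmentation mask could become a YOLO training label, the sketch below converts a binary mask into one line of the standard YOLO detection label format (`class x_center y_center width height`, all normalized to the image size). The function name `mask_to_yolo_line` and the assumption that masks arrive as 2D boolean NumPy arrays are hypothetical; the actual export path of sam_ui may differ.

```python
import numpy as np


def mask_to_yolo_line(mask, class_id):
    """Return a YOLO label line for the tight box around a binary mask.

    Hypothetical helper: `mask` is assumed to be a 2D array of shape
    (height, width) in which nonzero pixels belong to the object.
    Returns None when the mask is empty.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    h, w = mask.shape
    # Pixel-inclusive bounds, so add 1 to the max index for the far edge.
    x1, x2 = xs.min(), xs.max() + 1
    y1, y2 = ys.min(), ys.max() + 1
    cx = (x1 + x2) / 2 / w
    cy = (y1 + y2) / 2 / h
    bw = (x2 - x1) / w
    bh = (y2 - y1) / h
    return f"{class_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}"
```

Each returned line would go into the `.txt` label file that accompanies the image, one line per annotated object.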

Dataset Preparation

The scripts under scripts/transform convert raw data into the lerobot data format and extract the focus regions, so the output can be used directly for OpenPI training.
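A minimal sketch of what extracting a focus region might look like, assuming each frame is an `(H, W, C)` NumPy array and each box is given as `(x1, y1, x2, y2)` in pixels. The function name `crop_focus_region`, the `margin` padding, the square crop, and the nearest-neighbour resize are all illustrative choices, not the behavior of the repository's transform scripts.

```python
import numpy as np


def crop_focus_region(image, box, margin=0.2, out_size=224):
    """Crop a padded square around `box` and resize it to out_size x out_size.

    Hypothetical helper: `box` is (x1, y1, x2, y2) in pixel coordinates.
    The crop is clamped to the image, so it may be non-square near borders.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Use the longer box side, enlarged by `margin`, as the crop side.
    half = max(x2 - x1, y2 - y1) * (1.0 + margin) / 2.0
    left = int(max(0, np.floor(cx - half)))
    top = int(max(0, np.floor(cy - half)))
    right = int(min(w, np.ceil(cx + half)))
    bottom = int(min(h, np.ceil(cy + half)))
    if right <= left or bottom <= top:
        raise ValueError("box does not overlap the image")
    crop = image[top:bottom, left:right]
    # Nearest-neighbour resize via index sampling (no extra dependencies).
    rows = np.linspace(0, crop.shape[0] - 1, out_size).round().astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, out_size).round().astype(int)
    return crop[rows][:, cols]
```

In a real pipeline an interpolating resize (for example from OpenCV or Pillow) would likely be preferable; index sampling is used here only to keep the sketch dependency-free.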

Inference Configuration

  • dobot_tube_view_auto: Divides the task into stages with hand-written rules based on YOLO detections.
  • dobot_tube_view_staged: Divides the task into stages with a VLM and a prompt.
  • dobot_tube_view: Runs without stage division.
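To make the idea of rule-based stage division concrete, here is a hedged sketch of one possible rule: treat the tube as grasped once its box overlaps the gripper's box strongly enough. The stage names (`"approach"`, `"pick"`, `"place"`), the class labels (`"tube"`, `"gripper"`), the thresholds, and the `Detection` type are all hypothetical and are not taken from the dobot_tube_view_auto configuration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels
    score: float


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def infer_stage(detections, min_score=0.5, grasp_iou=0.3):
    """Pick a stage from detections using simple, hypothetical rules."""
    best = {}
    for d in detections:
        # Keep only the most confident detection per class.
        if d.score >= min_score and (d.label not in best or d.score > best[d.label].score):
            best[d.label] = d
    tube, gripper = best.get("tube"), best.get("gripper")
    if tube is None or gripper is None:
        return "approach"
    # Strong overlap is read as "tube is in the gripper".
    return "place" if iou(tube.box, gripper.box) >= grasp_iou else "pick"
```

A rule set like this is cheap to run at every control step, whereas a VLM-based division, as in dobot_tube_view_staged, moves the decision into a prompt instead of code.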
