Stable Language Guidance for Vision-Language-Action Models

ACL 2026 Main Conference


Zhihao Zhan¹, Yuhao Chen¹, Jiaying Zhou¹, Qinhan Lyu¹, Hao Liu¹, Keze Wang¹²³, Liang Lin¹²³, Guangrun Wang*¹²³

¹Sun Yat-sen University, ²Guangdong Key Laboratory of Big Data Analysis and Processing, ³X-Era AI Lab, ⁴Guangdong University of Technology.

*Corresponding author

✨ Abstract

Pipeline

Vision-Language-Action (VLA) models have demonstrated impressive capabilities in generalized robotic control; however, they remain notoriously brittle to linguistic perturbations. We identify a critical "modality collapse" phenomenon where strong visual priors overwhelm sparse linguistic signals, causing agents to overfit to specific instruction phrasings while ignoring the underlying semantic intent. To address this, we propose Residual Semantic Steering (RSS), a probabilistic framework that disentangles physical affordance from semantic execution. RSS introduces two theoretical innovations: (1) Monte Carlo Syntactic Integration, which approximates the true semantic posterior via dense, LLM-driven distributional expansion, and (2) Residual Affordance Steering, a dual-stream decoding mechanism that explicitly isolates the causal influence of language by subtracting the visual affordance prior. Theoretical analysis suggests that RSS effectively maximizes the mutual information between action and intent while suppressing visual distractors. Empirical results across diverse manipulation benchmarks demonstrate that RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
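As a rough intuition for the two ideas above, here is a minimal, self-contained sketch. It is our own illustrative code, not the released implementation: the function names, the exact combination rule, and the guidance weight `alpha` are all assumptions chosen to mirror the abstract's description, not the paper's equations.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector of action logits."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def residual_steering(logits_vl, logits_v, alpha=1.5):
    """Illustrative dual-stream decode (hypothetical names/rule):
    subtract the vision-only prior from the vision+language logits to
    isolate the language-attributable residual, then re-amplify that
    residual by a guidance weight alpha before normalizing."""
    residual = logits_vl - logits_v          # signal attributable to language
    return softmax(logits_v + alpha * residual)

def syntactic_integration(posteriors):
    """Illustrative Monte Carlo estimate of the semantic posterior:
    average the action posteriors obtained under several LLM-generated
    paraphrases of the same instruction."""
    return np.mean(np.stack(posteriors), axis=0)
```

Note that with `alpha = 1` the steering rule reduces to ordinary decoding from the vision+language logits; `alpha > 1` up-weights the language residual, which matches the stated goal of suppressing visual distractors.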

⚙️ Setup

uv

We manage Python dependencies with uv. If you haven't installed uv, please follow the uv installation instructions to set it up.

Run the following to set up the environment:

```shell
git clone --recurse-submodules git@github.com:Doo-mon/RSS.git

# Or, if you already cloned the repo:
git submodule update --init --recursive

GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
```

For more details, refer to the original openpi repository.

🚀 Training / Inference / Deployment

Caption Data Preparation

See /examples/libero_shortcut/convert_libero_caption_data_to_lerobot_intern.py, /examples/libero_shortcut/convert_libero_caption_data_to_lerobot_llava.py, or /examples/libero_shortcut/convert_libero_caption_data_to_lerobot_qwen.py.

Multiple Inference Settings

See /examples/libero_shortcut/main_{*}.py.

Commands

See run_train.sh, run_server_eval.sh, and run_local_eval.sh.

Citation

If you find our work useful, please consider citing:

```bibtex
@article{zhan2026stable,
  title={Stable Language Guidance for Vision-Language-Action Models},
  author={Zhan, Zhihao and Chen, Yuhao and Zhou, Jiaying and Lv, Qinhan and Liu, Hao and Wang, Keze and Lin, Liang and Wang, Guangrun},
  journal={arXiv preprint arXiv:2601.04052},
  year={2026}
}
```

Acknowledgements

We express our sincere gratitude to the developers of openpi for open-sourcing their codebase.

License

This project is licensed under the MIT License. See LICENSE for details.
