ACL 2026 Main Conference
Zhihao Zhan1, Yuhao Chen1, Jiaying Zhou1, Qinhan Lyu1, Hao Liu1, Keze Wang1,2,3, Liang Lin1,2,3, Guangrun Wang*1,2,3
1Sun Yat-sen University, 2Guangdong Key Laboratory of Big Data Analysis and Processing, 3X-Era AI Lab, 4Guangdong University of Technology.
*Corresponding author
Vision-Language-Action (VLA) models have demonstrated impressive capabilities in generalized robotic control; however, they remain notoriously brittle to linguistic perturbations. We identify a critical ``modality collapse'' phenomenon where strong visual priors overwhelm sparse linguistic signals, causing agents to overfit to specific instruction phrasings while ignoring the underlying semantic intent. To address this, we propose Residual Semantic Steering (RSS), a probabilistic framework that disentangles physical affordance from semantic execution. RSS introduces two theoretical innovations: (1) Monte Carlo Syntactic Integration, which approximates the true semantic posterior via dense, LLM-driven distributional expansion, and (2) Residual Affordance Steering, a dual-stream decoding mechanism that explicitly isolates the causal influence of language by subtracting the visual affordance prior. Theoretical analysis suggests that RSS effectively maximizes the mutual information between action and intent while suppressing visual distractors. Empirical results across diverse manipulation benchmarks demonstrate that RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
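As a rough illustration of the two mechanisms described in the abstract, the following NumPy sketch averages action logits over instruction paraphrases (a stand-in for Monte Carlo Syntactic Integration) and subtracts a language-free visual prior from the language-conditioned stream (a stand-in for Residual Affordance Steering). The function names, the guidance weight `alpha`, and the toy policy are all hypothetical; this is a conceptual sketch, not the repository's actual implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rss_decode(policy_logits, paraphrases, alpha=1.5):
    """Toy dual-stream decoding sketch (hypothetical, not the repo API).

    policy_logits(instruction_or_None) -> action logits (np.ndarray).
    paraphrases: list of instruction rewordings, e.g. from an LLM expansion.
    """
    # Monte Carlo Syntactic Integration: average logits over paraphrases
    # to approximate the semantic posterior rather than one phrasing.
    cond = np.mean([policy_logits(p) for p in paraphrases], axis=0)

    # Residual Affordance Steering: subtract the vision-only prior so the
    # residual isolates the causal influence of language.
    prior = policy_logits(None)  # language-free (visual affordance) stream
    steered = prior + alpha * (cond - prior)
    return softmax(steered)

# Toy policy over 3 discrete actions: language shifts mass toward action 2.
def toy_policy(instruction):
    base = np.array([2.0, 1.0, 0.5])             # strong visual prior
    if instruction is not None:
        base = base + np.array([0.0, 0.0, 1.8])  # linguistic evidence
    return base

probs = rss_decode(toy_policy, ["pick up the mug", "grab the cup"])
```

With `alpha > 1` the language residual is amplified, so the steered distribution favors the language-indicated action even though the visual prior alone prefers a different one.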
We manage Python dependencies with uv. If you haven't installed uv, please follow the uv installation instructions to set it up.
Run the following to set up the environment:
git clone --recurse-submodules git@github.com:Doo-mon/RSS.git
# Or if you already cloned the repo:
git submodule update --init --recursive
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .

For more details, refer to the original openpi repository.
Refers to /examples/libero_shortcut/convert_libero_caption_data_to_lerobot_intern.py, /examples/libero_shortcut/convert_libero_caption_data_to_lerobot_llava.py, or /examples/libero_shortcut/convert_libero_caption_data_to_lerobot_qwen.py.
Refers to /examples/libero_shortcut/main_{*}.py.
Refers to run_train.sh, run_server_eval.sh, and run_local_eval.sh.
If you find our work useful, please consider citing:
@article{zhan2026stable,
title={Stable Language Guidance for Vision-Language-Action Models},
author={Zhan, Zhihao and Chen, Yuhao and Zhou, Jiaying and Lv, Qinhan and Liu, Hao and Wang, Keze and Lin, Liang and Wang, Guangrun},
journal={arXiv preprint arXiv:2601.04052},
year={2026}
}

We express our sincere gratitude to the developers of openpi for open-sourcing their codebase.
This project is licensed under the MIT License. See LICENSE for details.
