Beyond Action Residuals: Steering Robot Manipulation Policies with Bottleneck Latent Reinforcement Learning (ZPRL)
Dongjie Yu*,1,2, Kun Lei*,2,3, Zhennan Jiang4, Jia Pan#,1, Huazhe Xu#,2,5
* Equal contribution # Corresponding authors
1 School of Computing and Data Science, HKU; 2 Shanghai Qi Zhi Institute; 3 Shanghai Jiao Tong University; 4 Institute of Automation, CAS; 5 IIIS, THU
TL;DR: ZPRL is an RL finetuning framework that perturbs bottleneck latents to steer robot manipulation policies, achieving efficient steering and smooth robot actions.
- Clone the repository.
git clone
cd ZPRL
- Create a virtual environment and install the required dependencies. We used mamba for fast environment management, but conda also works.
mamba env create -f ./conda_environment.yaml
mamba activate zprl
mamba install -c conda-forge mesalib glew glfw # necessary for GPU rendering
Then add export MUJOCO_GL=egl to your ~/.bashrc. If osmesa/glfw/glew issues persist during training, check this page for possible solutions.
- Check your ~/.bashrc and locate MAMBA_ROOT_PREFIX or CONDA_PREFIX. Then add either
export CPATH=$MAMBA_ROOT_PREFIX/include
or
export CPATH=$CONDA_PREFIX/include
to your ~/.bashrc, whichever matches your setup, if no such line exists yet.
- Install robosuite and robomimic. We use specific versions of both to leverage the Python bindings of mujoco, so we do not have to install mujoco-py or download additional files.
mamba activate zprl
cd /your/path/to/dependencies/
git clone https://github.com/ARISE-Initiative/robosuite.git
cd robosuite
git checkout v1.4.1 # version matters to reproduce the results
pip install -e .
cd /your/path/to/dependencies/
git clone https://github.com/ARISE-Initiative/robomimic.git
cd robomimic
git checkout 9273f9cc # commit matters to reproduce the results
pip install -e .
# some pre-setup to reduce warnings
python /your/path/to/dependencies/robomimic/robomimic/scripts/setup_macros.py
- We made a small patch to robosuite to correct its GPU rendering on the specified device and enable faster parallel simulation. Replace /your/path/to/dependencies/robosuite/robosuite/renderers/context/egl_context.py with this and replace /your/path/to/dependencies/robosuite/robosuite/utils/binding_utils.py with this. You may need to change the conversion_map in egl_context.py because the map varies across machines. (Credit to Baiye Cheng)
We have uploaded the datasets for training robomimic tasks (can, square, transport) here. After downloading the whole directory, you can either put it under ./data_local/ or create a soft link to it with ln -s /your/path/to/robomimicv030 ./data_local/. The data root directory looks like this:
robomimicv030
├── can
│   └── mh
│       └── image_v141_subset_abs.hdf5
├── square
│   └── mh
│       └── image_v141_subset_abs.hdf5
└── transport
    └── mh
        └── image_v141_subset_abs.hdf5
Each .hdf5 file contains 100 trajectories randomly sampled from the original Robomimic MH dataset, with image observations rendered following the scripts here and delta actions converted to absolute actions with this. Downloading the dataset we uploaded saves you all of these steps.
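Before launching training, it can help to confirm that the dataset directory matches the layout above. The following is a minimal sketch, not part of the repository: the helper name missing_dataset_files is hypothetical, and the root path assumes the directory was placed or linked under ./data_local/ as described.

```python
from pathlib import Path

# Task names and relative file path are taken from the directory tree above.
TASKS = ("can", "square", "transport")
REL_FILE = Path("mh") / "image_v141_subset_abs.hdf5"


def missing_dataset_files(root):
    """Return the expected .hdf5 paths that do not exist under `root`."""
    root = Path(root)
    expected = [root / task / REL_FILE for task in TASKS]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Assumed location: the dataset placed (or soft-linked) under ./data_local/.
    missing = missing_dataset_files("./data_local/robomimicv030")
    if missing:
        print("Missing dataset files:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected dataset files found.")
```

Path.is_file() follows symbolic links, so the check should behave the same whether the directory was copied or linked with ln -s.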
We found that the version of robomimic (and robosuite) used to generate the datasets should match the version used to train policies. Therefore, remember to run git checkout as shown above and do not mix environments.
- Activate the environment and log in to W&B to track experiments if you have not done so before.
mamba activate zprl
wandb login
- Launch training on cuda:0 with seed 0.
export MUJOCO_EGL_DEVICE_ID=0 # make robomimic envs run on cuda:0
python train.py \
--config-name=train_flow_match_vib_unet_image_workspace \
training.seed=0 \
task.dataset.seed=0 \
exp_name=unet_vib_default \
training.device=cuda:0 \
task=square_image_abs \
policy.vib_latent_dim=16 \
policy.vib_beta=0.0002 \
policy.vib_recon=0.01
We only change vib_latent_dim and vib_beta during offline training. Specifically, we set (16, 0.0002) for can and square and (32, 0.0001) for transport. You can also try other values to see how they affect performance.
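The per-task settings above can be kept in one place when launching several offline runs. The sketch below is illustrative only: the helper offline_overrides is hypothetical, and the task config names can_image_abs and transport_image_abs are assumptions extrapolated from square_image_abs, so check the actual config files before relying on them.

```python
# (vib_latent_dim, vib_beta) per task, as stated in the text above.
VIB_SETTINGS = {
    "can": (16, 0.0002),
    "square": (16, 0.0002),
    "transport": (32, 0.0001),
}


def offline_overrides(task, seed=0, device="cuda:0", exp_name="unet_vib_default"):
    """Build the train.py arguments for one task (hypothetical helper)."""
    latent_dim, beta = VIB_SETTINGS[task]
    return [
        "--config-name=train_flow_match_vib_unet_image_workspace",
        f"training.seed={seed}",
        f"task.dataset.seed={seed}",
        f"exp_name={exp_name}",
        f"training.device={device}",
        # Assumption: task configs are named "<task>_image_abs", as for square.
        f"task={task}_image_abs",
        f"policy.vib_latent_dim={latent_dim}",
        f"policy.vib_beta={beta}",
        "policy.vib_recon=0.01",
    ]


# Print the command line for the transport task.
print(" ".join(["python", "train.py", *offline_overrides("transport")]))
```

Keeping the pairs in a dictionary makes it harder to launch a run with the latent dimension of one task and the beta of another.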
After the training finishes, you will get a directory containing the results like this.
data/outputs/yyyy.mm.dd/hh.mm.ss_train_flow_match_vib_unet_image_square_image
├── .hydra
│   ├── config.yaml
│   ├── hydra.yaml
│   └── overrides.yaml
├── checkpoints
│   ├── epoch_0000-score_0.000.ckpt
│   └── latest.ckpt
├── logs.json.txt
├── media
│   ├── ...
│   └── train_1.mp4
└── train.log
You can evaluate the checkpoint again with a specified action chunk length (here we use 4) by
python eval_base.py \
-c path/to/offline/checkpoints/latest.ckpt \
-o data/eval/square/ \
-t 4 \
-d cuda:0
Then you will get a directory containing the results on 100 rollouts. You can edit eval_base.py to change the evaluation configuration.
data/eval/square
├── eval_log.json
└── media
    ├── test_100000.mp4
    ├── ...
    └── test_100009.mp4
The offline stack generally follows Diffusion Policy, whose authors have built an incredible code base and tutorial. Remember to check it out if you are interested.
After the offline training, you can try ZPRL starting from a given base policy with
python train.py \
--config-name=train_online_vib_robomimic_workspace \
training.seed=0 \
exp_name=zprl_default \
training.device=cuda:0 \
online_task=square_image_abs \
online_task.base_ckpt=path/to/offline/checkpoints/latest.ckpt
Note that 45 parallel environments will be running (20 for training and 25 for evaluation); this does not cause out-of-memory (OOM) errors on an RTX 4090. The structure of the resulting RL training directory under data/outputs/ is similar to that of offline IL.
An important hyperparameter in ZPRL is the scale of the z-perturbation ($\lambda$ in our paper). We recommend starting with a value such that $\mathrm{RMS}(\lambda\Delta z) \approx 0.1\,\mathrm{RMS}(z)$. We track these values during online RL as a reference for tuning.
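As a rough illustration of the recommendation above, the sketch below picks an initial $\lambda$ from a batch of latents and raw perturbations so that the RMS ratio equals 0.1. It is only a sketch under stated assumptions: the arrays are synthetic stand-ins, the helper names rms and initial_lambda are hypothetical, and the repository may compute and log these statistics differently.

```python
import numpy as np


def rms(x):
    """Root mean square over all elements of x."""
    x = np.asarray(x, dtype=np.float64)
    return float(np.sqrt(np.mean(np.square(x))))


def initial_lambda(z, delta_z, target_ratio=0.1):
    """Pick lam so that RMS(lam * delta_z) = target_ratio * RMS(z)."""
    return target_ratio * rms(z) / rms(delta_z)


# Synthetic stand-ins (assumption): a batch of 16-dimensional bottleneck
# latents z and unscaled perturbations delta_z from the RL policy.
rng = np.random.default_rng(0)
z = rng.normal(scale=2.0, size=(256, 16))
delta_z = rng.normal(scale=0.5, size=(256, 16))

lam = initial_lambda(z, delta_z)
ratio = rms(lam * delta_z) / rms(z)
print(f"lambda = {lam:.3f}, RMS ratio = {ratio:.3f}")
```

Because RMS is homogeneous in a positive scale factor, the ratio after scaling equals target_ratio up to floating-point error. In practice the statistics drift as the RL policy trains, which is why tracking both RMS values during training is useful.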
When you summarize the RL results, remember to multiply the steps logged in W&B by n_action_steps (the length of the action chunk) so that the actual number of environment steps is counted.
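The conversion is a single multiplication, shown here as a minimal sketch. The function name env_steps and the numbers are illustrative only; use the n_action_steps value from your own run.

```python
def env_steps(logged_steps, n_action_steps):
    """Convert steps logged per action chunk into environment steps."""
    return logged_steps * n_action_steps


# Illustrative values: 150,000 logged steps with an action chunk of length 4.
print(env_steps(150_000, 4))  # prints 600000
```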
You can evaluate the composed policy after online finetuning with
python eval_sum.py \
-c path/to/checkpoints/step_600000.ckpt \
-o path/to/eval_logs/ \
-d cuda:0
Our code base is built on the following repositories, and the structure of this README borrows a lot from DICE-RL. We thank the authors for open-sourcing their wonderful code and clear documentation.
- Diffusion Policy: our offline pipeline generally follows the Diffusion Policy workspace but replaces DDPM/DDIM with a rectified flow to reduce denoising steps during inference.
- Policy Decorator: our online workspace largely follows what Policy Decorator does. We make some optimizations (such as next-observation pre-encoding) to accelerate training.
- SOE: we borrow from SOE the idea of introducing a VIB module into imitation policies to realize in-manifold exploration.
If you find this repository useful, please consider citing our paper:
@misc{yu2026zprl,
title={Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning},
author={Dongjie Yu and Kun Lei and Zhennan Jiang and Jia Pan and Huazhe Xu},
year={2026},
eprint={2605.19919},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2605.19919},
}
Feel free to contact Dongjie Yu if you have any questions about the paper or the code base.