ManUtdMoon/ZPRL

Beyond Action Residuals: Steering Robot Manipulation Policies with Bottleneck Latent Reinforcement Learning (ZPRL)

Dongjie Yu*,1,2, Kun Lei*,2,3, Zhennan Jiang4, Jia Pan#,1, Huazhe Xu#,2,5

* Equal contribution # Corresponding authors

1 School of Computing and Data Science, HKU; 2 Shanghai Qi Zhi Institute; 3 Shanghai Jiao Tong University; 4 Institute of Automation, CAS; 5 IIIS, THU

TL;DR: ZPRL is an RL finetuning framework that perturbs bottleneck latents to steer robot manipulation policies, achieving efficient steering and smooth robot actions.

Installation

  1. Clone the repository.
git clone
cd ZPRL
  2. Create a virtual environment and install the required dependencies. We use mamba for fast environment management, but conda also works.
mamba env create -f ./conda_environment.yaml
mamba activate zprl
mamba install -c conda-forge mesalib glew glfw # necessary for GPU rendering

Then add export MUJOCO_GL=egl to your ~/.bashrc. If osmesa/glfw/glew issues still exist during training, check this page for possible solutions.

  3. Check your ~/.bashrc and locate MAMBA_ROOT_PREFIX or CONDA_PREFIX. Then add either
export CPATH=$MAMBA_ROOT_PREFIX/include

or

export CPATH=$CONDA_PREFIX/include

to your ~/.bashrc, whichever matches your setup, if the line is not already there.

  4. Install robosuite and robomimic. We pin specific versions of both so that they use the Python bindings of mujoco, which means we do not have to install mujoco-py or download extra files.
mamba activate zprl
cd /your/path/to/dependencies/
git clone https://github.com/ARISE-Initiative/robosuite.git
cd robosuite
git checkout v1.4.1 # version matters to reproduce the results
pip install -e .

cd /your/path/to/dependencies/
git clone https://github.com/ARISE-Initiative/robomimic.git
cd robomimic
git checkout 9273f9cc # commit matters to reproduce the results
pip install -e .

# some pre-setup to reduce warnings
python /your/path/to/dependencies/robomimic/robomimic/scripts/setup_macros.py
  5. We made a small patch to robosuite to correct its GPU rendering on the specified device and to enable faster parallel simulation. Replace /your/path/to/dependencies/robosuite/robosuite/renderers/context/egl_context.py with this and replace /your/path/to/dependencies/robosuite/robosuite/utils/binding_utils.py with this. You may need to change the conversion_map in egl_context.py because the map differs between machines. (Credit to Baiye Cheng)
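
The environment variables from the steps above are easy to forget when switching shells or machines. The following is a minimal sketch of a sanity check, not part of this repository: the helper name check_render_env is hypothetical, and it assumes a POSIX-style CPATH separated by colons.

```python
import os


def check_render_env(env=None):
    """Return a list of problems with the rendering-related variables.

    Hypothetical helper: it only inspects the variables named in the
    installation steps (MUJOCO_GL, CPATH, MAMBA_ROOT_PREFIX, CONDA_PREFIX).
    """
    env = os.environ if env is None else env
    problems = []

    # The README asks for MUJOCO_GL=egl in ~/.bashrc.
    if env.get("MUJOCO_GL") != "egl":
        problems.append("MUJOCO_GL should be set to 'egl'")

    # CPATH should contain <prefix>/include for mamba or conda.
    prefixes = [env[k] for k in ("MAMBA_ROOT_PREFIX", "CONDA_PREFIX") if env.get(k)]
    expected = [p.rstrip("/") + "/include" for p in prefixes]
    cpath_entries = env.get("CPATH", "").split(":")
    if not any(e in cpath_entries for e in expected):
        problems.append("CPATH should contain $MAMBA_ROOT_PREFIX/include or $CONDA_PREFIX/include")

    return problems


if __name__ == "__main__":
    for problem in check_render_env():
        print("WARNING:", problem)
```

Running the script inside the activated zprl environment prints one warning per missing setting and nothing if both are in place.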

Downloading datasets

We have uploaded the datasets for training robomimic tasks (can, square, transport) here. After downloading the whole directory, you can either put it under ./data_local/ or create a soft link to it with ln -s /your/path/to/robomimicv030 ./data_local/. The data root directory looks like this:

robomimicv030
├── can
│   └── mh
│       └── image_v141_subset_abs.hdf5
├── square
│   └── mh
│       └── image_v141_subset_abs.hdf5
└── transport
    └── mh
        └── image_v141_subset_abs.hdf5

Each .hdf5 file contains 100 trajectories randomly sampled from the original Robomimic MH dataset, with image observations rendered following the scripts here and delta actions converted to absolute actions with this. Downloading the dataset we uploaded saves you all of these steps.
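
To confirm that the downloaded directory matches the layout shown above before launching training, a small check along these lines may help. It is a sketch only: the function name missing_datasets is hypothetical, and the task and file names are copied from the tree above.

```python
from pathlib import Path

# Task folders and file name taken from the directory tree in this README.
TASKS = ("can", "square", "transport")
REL_PATH = Path("mh") / "image_v141_subset_abs.hdf5"


def missing_datasets(data_root):
    """Return the expected .hdf5 paths that do not exist under data_root."""
    root = Path(data_root)
    return [
        str(root / task / REL_PATH)
        for task in TASKS
        if not (root / task / REL_PATH).is_file()
    ]


if __name__ == "__main__":
    # Assumes the dataset was placed or linked at ./data_local/robomimicv030.
    for path in missing_datasets("./data_local/robomimicv030"):
        print("missing:", path)
```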

We found that the version of robomimic (and robosuite) used for generating datasets should match the version used for training policies. Therefore, remember to git checkout the versions given above and do not mix environments.

Offline Training

  1. Activate the environment and log in to W&B to track experiments if you have not done so before.
mamba activate zprl
wandb login
  2. Launch training on cuda:0 with seed 0.
export MUJOCO_EGL_DEVICE_ID=0 # make robomimic envs run on cuda:0
python train.py \
    --config-name=train_flow_match_vib_unet_image_workspace \
    training.seed=0 \
    task.dataset.seed=0 \
    exp_name=unet_vib_default \
    training.device=cuda:0 \
    task=square_image_abs \
    policy.vib_latent_dim=16 \
    policy.vib_beta=0.0002 \
    policy.vib_recon=0.01

We only tune vib_latent_dim and vib_beta during offline training. Specifically, we set (vib_latent_dim, vib_beta) to (16, 0.0002) for can and square and to (32, 0.0001) for transport. You can also try other values to see how they affect performance.
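
The per-task settings above can be kept in one place and turned into command-line overrides. The snippet below is an illustrative sketch: the helper vib_overrides is hypothetical, and the task config names can_image_abs and transport_image_abs are assumptions inferred from square_image_abs, so check the task configs in the repository before relying on them.

```python
# (vib_latent_dim, vib_beta) per task, as stated in this README.
VIB_SETTINGS = {
    "can": (16, 0.0002),
    "square": (16, 0.0002),
    "transport": (32, 0.0001),
}


def vib_overrides(task):
    """Build override strings for train.py for the given task.

    The "<task>_image_abs" naming is an assumption for can and transport.
    """
    latent_dim, beta = VIB_SETTINGS[task]
    return [
        f"task={task}_image_abs",
        f"policy.vib_latent_dim={latent_dim}",
        f"policy.vib_beta={beta}",
    ]


if __name__ == "__main__":
    print(" ".join(vib_overrides("transport")))
```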

After the training finishes, you will get a directory containing the results like this.

data/outputs/yyyy.mm.dd/hh.mm.ss_train_flow_match_vib_unet_image_square_image
├── .hydra
│   ├── config.yaml
│   ├── hydra.yaml
│   └── overrides.yaml
├── checkpoints
│   ├── epoch_0000-score_0.000.ckpt
│   └── latest.ckpt
├── logs.json.txt
├── media
│   ├── ...
│   └── train_1.mp4
└── train.log
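
Checkpoint file names in the tree above encode the epoch and an evaluation score. If you want to pick a checkpoint other than latest.ckpt, a sketch like the following can parse those names. The helper best_checkpoint is hypothetical, and it assumes that a higher score is better (for example, a success rate).

```python
import re
from pathlib import Path

# Matches names such as "epoch_0000-score_0.000.ckpt".
CKPT_PATTERN = re.compile(r"epoch_(\d+)-score_([0-9.]+)\.ckpt$")


def best_checkpoint(names):
    """Return the name with the highest score, breaking ties by later epoch.

    Names that do not follow the pattern (such as latest.ckpt) are ignored.
    Returns None if no name matches.
    """
    scored = []
    for name in names:
        match = CKPT_PATTERN.search(Path(name).name)
        if match:
            epoch, score = int(match.group(1)), float(match.group(2))
            scored.append((score, epoch, name))
    if not scored:
        return None
    return max(scored)[2]
```

For instance, passing the file names listed by Path("checkpoints").iterdir() (converted to strings) returns the best-scoring checkpoint under this assumption.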

You can evaluate the checkpoint again with a specified action chunk length (here we use 4) by running

python eval_base.py \
    -c path/to/offline/checkpoints/latest.ckpt \
    -o data/eval/square/ \
    -t 4 \
    -d cuda:0

Then you will get a directory containing the results on 100 rollouts. You can edit eval_base.py to change configurations of evaluation.

data/eval/square
├── eval_log.json
└── media
    ├── test_100000.mp4
    ├── ...
    └── test_100009.mp4
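
To skim the numbers in eval_log.json without opening the videos, something like the sketch below may be enough. It assumes only that the file holds a single JSON object at the top level; the helper name scalar_metrics is hypothetical, and no particular metric keys are assumed.

```python
import json


def scalar_metrics(path):
    """Load a JSON object and keep only its numeric top-level entries.

    Non-numeric values (for example paths to videos) are dropped.
    """
    with open(path) as f:
        log = json.load(f)
    return {
        key: value
        for key, value in log.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


if __name__ == "__main__":
    for key, value in sorted(scalar_metrics("data/eval/square/eval_log.json").items()):
        print(f"{key}: {value}")
```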

The offline stack generally follows Diffusion Policy, whose authors have built an incredible code base and tutorial. Remember to check it out if you are interested.

Online RL

After the offline training, you can try ZPRL starting from a given base policy with

python train.py \
    --config-name=train_online_vib_robomimic_workspace \
    training.seed=0 \
    exp_name=zprl_default \
    training.device=cuda:0 \
    online_task=square_image_abs \
    online_task.base_ckpt=path/to/offline/checkpoints/latest.ckpt

Note that 45 parallel environments will be running (20 for training and 25 for evaluation); this does not cause OOM on an RTX 4090. The structure of the resulting RL training directory under data/outputs/ is similar to that of offline IL.

An important hyperparameter in ZPRL is the scale of the z-perturbation ($\lambda$ in our paper). We recommend starting with a value such that $\mathrm{RMS}(\lambda\Delta z) \approx 0.1\,\mathrm{RMS}(z)$. We log these values during online RL as a reference for tuning.
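
As a worked illustration of this rule of thumb, the sketch below computes a starting $\lambda$ from sample values of $z$ and $\Delta z$. It is not the repository's code: the helpers rms and suggest_lambda are hypothetical, and it assumes that $z$ and $\Delta z$ are available as flat lists of floats (for example, collected from a few rollouts).

```python
import math


def rms(values):
    """Root mean square of a non-empty sequence of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def suggest_lambda(z, delta_z, ratio=0.1):
    """Return lambda such that RMS(lambda * delta_z) == ratio * RMS(z).

    Uses RMS(lambda * delta_z) = lambda * RMS(delta_z) for lambda >= 0.
    """
    denom = rms(delta_z)
    if denom == 0.0:
        raise ValueError("delta_z has zero RMS; cannot scale it")
    return ratio * rms(z) / denom


if __name__ == "__main__":
    z = [3.0, 4.0]          # illustrative latent values
    delta_z = [0.6, 0.8]    # illustrative unscaled perturbation
    lam = suggest_lambda(z, delta_z)
    print(lam, rms([lam * d for d in delta_z]), 0.1 * rms(z))
```

The tracked RMS values can then be compared against this target during training and $\lambda$ adjusted if the ratio drifts.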

When you summarize the RL results, remember to multiply the steps in W&B by n_action_steps (the length of the action chunk) so that the actual environment steps are counted.
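
The conversion is a single multiplication. The sketch below spells it out; the function name and the numbers in the example are illustrative only and are not results from the paper.

```python
def to_env_steps(logged_steps, n_action_steps):
    """Convert steps logged in W&B to environment steps.

    Each logged step corresponds to one action chunk of n_action_steps actions.
    """
    return logged_steps * n_action_steps


if __name__ == "__main__":
    # Illustrative numbers: 150,000 logged steps with chunks of length 4.
    print(to_env_steps(150_000, 4))
```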

You can evaluate the composed policy after online finetuning with

python eval_sum.py \
   -c path/to/checkpoints/step_600000.ckpt \
   -o path/to/eval_logs/ \
   -d cuda:0

Acknowledgements

Our code base is built on the following repositories, and the structure of this README borrows a lot from DICE-RL. We thank the authors for open-sourcing their wonderful code and clear documentation.

  • Diffusion Policy: our offline pipeline generally follows the Diffusion Policy workspace but replaces DDPM/DDIM with a rectified flow to reduce denoising steps during inference.
  • Policy Decorator: our online workspace basically follows that of Policy Decorator. We make some optimizations (such as pre-encoding the next observation) to accelerate training.
  • SOE: we borrow from SOE the idea of introducing a VIB module into imitation policies to realize in-manifold exploration.
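
For readers unfamiliar with the rectified flow mentioned in the Diffusion Policy item above, sampling amounts to integrating an ODE $dx/dt = v(x, t)$ from noise at $t = 0$ to an action at $t = 1$, typically with a few Euler steps. The following is a generic, minimal sketch of that idea and not the sampler used in this repository: euler_sample and toy_velocity are hypothetical names, and the toy velocity field stands in for a learned network.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target):
    """Stand-in for a learned field: points straight from x to target.

    Along a straight path x_t = (1 - t) * x0 + t * target, the velocity
    (target - x_t) / (1 - t) is constant and equal to target - x0.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    noise = [0.3, 0.7]
    sample = euler_sample(toy_velocity([1.0, -2.0]), noise, num_steps=4)
    print(sample)
```

Because the toy paths are perfectly straight, even a handful of Euler steps lands on the target, which is the intuition behind using fewer denoising steps with a rectified flow.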

Citation

If you find this repository useful, please consider citing our paper:

@misc{yu2026zprl,
      title={Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning},
      author={Dongjie Yu and Kun Lei and Zhennan Jiang and Jia Pan and Huazhe Xu},
      year={2026},
      eprint={2605.19919},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2605.19919},
}

Contact

Feel free to contact Dongjie Yu if you have any questions about the paper or the code base.
