PyTorch implementation of
Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation by
Abdulaziz Almuzairee, Rohan Patil, Dwait Bhatt, Henrik I. Christensen (UC San Diego)
We present a method called MAD. Using MAD, a reinforcement learning agent can merge multiple camera views to gain higher sample efficiency while still being able to act from any single camera view alone. MAD achieves this by
- encoding each camera view individually,
- merging the individual camera view features through summation to create a multi-camera view representation,
- framing each single-view representation as an augmentation of the multi-camera view representation during learning (a minimal sketch follows this list).
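For intuition, the merging step looks roughly like the following PyTorch sketch; encoder, the camera names, and the variable names here are illustrative placeholders rather than the exact interfaces used in this repository:

import torch

def merge_views(encoder, views):
    # views: dict mapping camera name -> batch of images, e.g. {"first": imgs_a, "third1": imgs_b}
    # 1. Encode each camera view individually with a shared encoder.
    features = {name: encoder(imgs) for name, imgs in views.items()}
    # 2. Merge the per-view features through summation into a multi-camera view representation.
    merged = torch.stack(list(features.values()), dim=0).sum(dim=0)
    # 3. During learning, each single-view feature is treated as an augmentation
    #    of the merged representation, so the agent can also act from any one camera.
    single_view_augmentations = list(features.values())
    return merged, single_view_augmentations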
If you find our work useful, please consider citing our paper:
@misc{almuzairee2025merging,
title={Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation},
author={Abdulaziz Almuzairee and Rohan Patil and Dwait Bhatt and Henrik I. Christensen},
year={2025},
eprint={2505.04619},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.04619},
}
System requirements:
- GPU: Minimum 11 GB RAM, supports CUDA 11.0 or later
- CPU RAM: 95 GB (Meta-World tasks), 135 GB (ManiSkill3 tasks) -- the replay buffer implementation is not optimized
- CPU Cores: 4
- Average Runtime: 4 hrs (Meta-World tasks), 6 hrs (ManiSkill3 tasks)
- Recommended Base Docker Image: nvidia/cudagl:11.3.0-base-ubuntu18.04
- Conda
If you're ready, clone this repo:
git clone https://github.com/aalmuzairee/mad.git
cd mad
Then, to run Meta-World, you need to install mujoco210. We provide a utility script that installs it:
apt update
apt-get install gcc libosmesa6-dev wget git -y
. ./extras/install_mw.sh
After installing MuJoCo, you can install the packages by running:
cd mad
conda env create -f environment.yaml
conda activate mad
Finally, you need to install gym==0.21 for Meta-World and set up the GPU links:
pip install pip==24.0
pip install gym==0.21
# some setups need the following symlink command
ln -s /usr/local/cuda /usr/local/nvidia
We provide examples of how to train below.
# Train MAD (Ours) with all three cameras and evaluate on each singular view and the combined view
python train.py agent=mad task=basketball
# Train MVD (Baseline) with two cameras and evaluate on the singular and combined cameras
python train.py agent=baselines.mvd task=hammer cameras=[first,third1] eval_cameras=[[first],[third1],[first,third1]]
# Train VIB (Baseline) with two cameras and evaluate only on the combined cameras
python train.py agent=baselines.vib task=soccer cameras=[first,third1] eval_cameras=[[first,third1]]
where the log output will look like:
eval S: 0 E: 0 R: 3.60 L: 100 T: 0:00:07 SPS: 246.49 SR: 0.00
train S: 100 E: 1 R: 6.42 L: 100 T: 0:00:08 SPS: 115.77
train S: 200 E: 2 R: 3.17 L: 100 T: 0:00:09 SPS: 115.18
with each letter corresponding to:
eval S: Steps E: Episode R: Episode Reward L: Episode Length T: Time Elapsed SPS: Steps Per Second SR: Success Rate
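If you want to post-process these console logs, a small parser along these lines should be enough (a sketch, assuming the log format shown above):

def parse_log_line(line):
    # Example: "eval S: 0 E: 0 R: 3.60 L: 100 T: 0:00:07 SPS: 246.49 SR: 0.00"
    tokens = line.split()
    record = {"mode": tokens[0]}  # "eval" or "train"
    # The remaining tokens come in "KEY: VALUE" pairs.
    for key, value in zip(tokens[1::2], tokens[2::2]):
        record[key.rstrip(":")] = value
    return record

print(parse_log_line("train S: 100 E: 1 R: 6.42 L: 100 T: 0:00:08 SPS: 115.77"))
# {'mode': 'train', 'S': '100', 'E': '1', 'R': '6.42', 'L': '100', 'T': '0:00:08', 'SPS': '115.77'}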
For logging, we recommend configuring Weights and Biases (wandb) in config.yaml to track training progress. We further provide limited logging in local CSV files. Please refer to config.yaml for a full list of options.
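As an example of inspecting the local CSV logs, something like the snippet below should work; the run directory and file name here are placeholders, so check config.yaml for where your runs are actually written:

import pandas as pd

# Placeholder path: substitute the run directory configured in config.yaml.
df = pd.read_csv("logs/example_run/eval.csv")
print(df.tail())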
There are four algorithms that you can choose from:
- mad: MAD (Almuzairee et al., 2025)
- baselines.mvd: MVD (Dunion et al., 2024)
- baselines.vib: VIB (Hsu et al., 2022)
- baselines.drq: DrQ (Kostrikov et al., 2020)
which can be selected by setting the agent variable in the config.yaml file or as a command-line argument like agent=mad.
We test on 20 visual RL tasks from Meta-World and ManiSkill3; you can find them listed in envs/__init__.py
As stated in the paper, we use three cameras:
- First Person: first
- Third Person A: third1
- Third Person B: third2
which can be set for training through the cameras option:
cameras=[first,third1,third2]
For evaluation, you can use any combination of cameras, but it must be a list of lists. One example, evaluating on all singular cameras and the combined cameras, would be:
eval_cameras=[[first],[third1],[third2],[first,third1,third2]]
Both can be set in the config.yaml file or as command-line arguments to train.py.
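If you want to evaluate on every possible camera combination instead of typing the list by hand, you can generate it programmatically, for example with Python's itertools (a sketch; formatting the result into the command-line override is left to your launch script):

from itertools import combinations

cameras = ["first", "third1", "third2"]
# Every non-empty subset of the available cameras: singles, pairs, and the full set.
eval_cameras = [list(combo)
                for r in range(1, len(cameras) + 1)
                for combo in combinations(cameras, r)]
print(eval_cameras)
# [['first'], ['third1'], ['third2'], ['first', 'third1'], ['first', 'third2'],
#  ['third1', 'third2'], ['first', 'third1', 'third2']]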
We further offer on-screen rendering scripts to help you visualize the environments. They can be found in extras/ and run using:
# visualize metaworld
python extras/visualize_mw.py
# visualize maniskill
python extras/visualize_ms.py
Q: Can I train with less CPU RAM?
A: If you set frame_stack=1 in the config, you will be able to train with 35 GB (Meta-World tasks) or 45 GB (ManiSkill3 tasks) of CPU RAM. We haven't verified the results, but we would expect a ~5-10% drop in final performance.
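For example, assuming the same override syntax as the training commands above, a lower-memory run would look like:
python train.py agent=mad task=basketball frame_stack=1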
This project is licensed under the MIT License - see the LICENSE file for details. Note that this repository relies on third-party code, which is subject to its own licenses.