
Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation

PyTorch implementation of

Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation by

Abdulaziz Almuzairee, Rohan Patil, Dwait Bhatt, Henrik I. Christensen (UC San Diego)



[Website] [Paper]


🎥 TL;DR

We propose a method called MAD. Using MAD, a reinforcement learning agent can merge multiple camera views to gain higher sample efficiency while still being able to function with any single camera view alone. MAD achieves this by

  1. encoding each camera view individually,
  2. merging the individual camera view features through summation to create a multi-camera view representation,
  3. framing all the single-view representations as augmentations of the multi-camera view representation during learning (a minimal sketch of this idea follows the list).
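To make the merging step concrete, here is a minimal, self-contained PyTorch sketch of steps 1 and 2. This is not the repository's actual implementation; the ViewMergeSketch class, the encoder architecture, and the feature dimension are illustrative assumptions.

import torch
import torch.nn as nn

class ViewMergeSketch(nn.Module):
    """Sketch of MAD-style merging: encode each camera view separately,
    then sum the per-view features into one multi-view representation."""

    def __init__(self, num_views=3, feature_dim=256):
        super().__init__()
        # One small (hypothetical) CNN encoder per camera view.
        self.encoders = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
                nn.Flatten(),
                nn.LazyLinear(feature_dim),
            )
            for _ in range(num_views)
        ])

    def forward(self, views):
        # views: one image batch per camera, each of shape (B, 3, H, W).
        features = [enc(v) for enc, v in zip(self.encoders, views)]
        # Summation merges the per-view features into a multi-view representation.
        return torch.stack(features, dim=0).sum(dim=0)

model = ViewMergeSketch(num_views=3)
views = [torch.randn(8, 3, 64, 64) for _ in range(3)]  # e.g. first, third1, third2
multi_view = model(views)                    # merged multi-view features, shape (8, 256)
single_view = model.encoders[0](views[0])    # a single view maps to the same feature space
print(multi_view.shape, single_view.shape)

During training, step 3 then treats single-view features like single_view above as augmentations of multi_view, so the downstream policy and value heads learn to act from either.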

Citation

If you find our work useful, please consider citing our paper:

@misc{almuzairee2025merging,
      title={Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation}, 
      author={Abdulaziz Almuzairee and Rohan Patil and Dwait Bhatt and Henrik I. Christensen},
      year={2025},
      eprint={2505.04619},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.04619}, 
}

Getting Started

βš™οΈ Installation

System requirements:

  • GPU: Minimum 11 GB RAM, supports CUDA 11.0 or later
  • CPU RAM: 95 GB (Meta-World tasks), 135 GB (ManiSkill3 tasks) -- replay buffer implementation is not optimized
  • CPU Cores: 4 cores
  • Average Runtime: 4 hrs (Meta-World tasks), 6 hrs (ManiSkill3 tasks)
  • Recommended Base Docker Image: nvidia/cudagl:11.3.0-base-ubuntu18.04
  • Conda

If you're ready, clone this repo:

git clone https://github.com/aalmuzairee/mad.git
cd mad

Then, to run Meta-World, you need to install mujoco210. We provide a utility script that installs it:

apt update
apt-get install gcc libosmesa6-dev wget git -y
. ./extras/install_mw.sh

After installing MuJoCo, you can install the packages by running:

cd mad
conda env create -f environment.yaml
conda activate mad

Finally, you need to install gym==0.21 for Meta-World and set up the GPU links:

pip install pip==24.0
pip install gym==0.21

# some setups need the following symlink command
ln -s /usr/local/cuda /usr/local/nvidia 

📜 Example usage

We provide examples of how to train below.

# Train MAD (Ours) with all three cameras, evaluating on each single camera and on all cameras combined
python train.py agent=mad task=basketball

# Train MVD (Baseline) with two cameras, evaluating on each single camera and on both cameras combined
python train.py agent=baselines.mvd task=hammer cameras=[first,third1] eval_cameras=[[first],[third1],[first,third1]]

# Train VIB (Baseline) with two cameras, evaluating only on both cameras combined
python train.py agent=baselines.vib task=soccer cameras=[first,third1] eval_cameras=[[first,third1]]

The log output will look like:

eval    S: 0            E: 0            R: 3.60         L: 100          T: 0:00:07      SPS: 246.49     SR: 0.00                                                                                                            
train   S: 100          E: 1            R: 6.42         L: 100          T: 0:00:08      SPS: 115.77
train   S: 200          E: 2            R: 3.17         L: 100          T: 0:00:09      SPS: 115.18 

where each abbreviation corresponds to:

eval    S: Steps        E: Episode      R: Episode Reward     L: Episode Length    T: Time Elapsed      SPS: Steps Per Second    SR: Success Rate   

For logging, we recommend configuring Weights & Biases (wandb) in config.yaml to track training progress. We also provide limited logging to local CSV files.


📖 Config options

Please refer to config.yaml for a full list of options.

Algorithms

There are four algorithms that you can choose from by setting the agent variable in the config.yaml file or as a command-line argument like agent=mad.

Tasks

We test on 20 visual RL tasks from Meta-World and ManiSkill; you can find them listed in envs/__init__.py.

Cameras

As stated in the paper, we use three cameras:

  • First Person: first
  • Third Person A: third1
  • Third Person B: third2

which can be set for training in cameras:

cameras=[first,third1,third2]

For evaluation, you can use any combination, but it must be a list of lists. One example, evaluating on each single camera and on all cameras combined, would be:

eval_cameras=[[first],[third1],[third2],[first,third1,third2]]

These can all be set in the config.yaml file or as command-line arguments to train.py.
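For instance, putting these camera options together on the command line (reusing the basketball task from the examples above):

python train.py agent=mad task=basketball cameras=[first,third1,third2] eval_cameras=[[first],[third1],[third2],[first,third1,third2]]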


✨ Visualization

We also offer on-screen rendering scripts that help you visualize the environments. They can be found in extras/ and can be run using:

# visualize metaworld
python extras/visualize_mw.py

# visualize maniskill
python extras/visualize_ms.py

❓ FAQ

Q: Can I train with lower CPU RAM?

A: If you set frame_stack=1 in the config, you will be able to train with 35 GB (Meta-World tasks) or 45 GB (ManiSkill3 tasks). We haven't verified the results, but we would expect a ~5-10% drop in final performance.
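For example, assuming frame_stack is exposed as a top-level config key that can be overridden on the command line like the other options above, the override would look like:

python train.py agent=mad task=basketball frame_stack=1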

License

This project is licensed under the MIT License - see the LICENSE file for details. Note that the repository relies on third-party code, which is subject to its own licenses.
