A simple testbed for robotic manipulation policies based on robomimic. All policies are rewritten in a simple, self-contained way. We may further expand it to the RoboCasa benchmark, which is also based on the robosuite simulator.
We also have policies trained and tested on the CALVIN benchmark, e.g., GR1-Training, which is the current SOTA on the hardest ABC->D task of CALVIN.
We also recommend other good frameworks / communities for robotics policy learning:
- HuggingFace's LeRobot, which currently has ACT, Diffusion Policy (only the simple PushT task), TDMPC, and VQ-BeT. LeRobot has a nice robot learning community on its Discord server.
- CleanDiffuser, which implements multiple diffusion algorithms for imitation learning and reinforcement learning. Our implementation of the diffusion algorithms differs from CleanDiffuser's, but we thank their team members for their help.
- Dr. Mu Yao organizes a nice robot learning community for Chinese researchers; see the DeepTimber website and Zhihu (知乎).
Please remember we build systems for you ヾ(^▽^*)). Feel free to ask @StarCycle if you have any questions!
[2024.11.1] Update the performance results on the PushT environment.
[2024.8.9] Several updates below. We are also merging the base Florence policy into HuggingFace LeRobot.
- Add Florence policy with the DiT diffusion action head from MDT, developed by the Intuitive Robots Lab at KIT.
- Switch from tensorboard to wandb.
- Heavily optimize training speed of Florence-series models.
- Support compilation.
[2024.7.30] Add Florence policy with an MLP action head & a diffusion transformer action head (from Cheng Chi's Diffusion Policy). Add the RT-1 policy.
[2024.7.16] Add transformer version of Diffusion Policy.
[2024.7.15] Initial release which only contains UNet version of Diffusion Policy.
Unified State and Action Space.
- All policies share the same data pre-processing pipeline and predict actions as 3D Cartesian translation + 6D rotation + gripper open/close. The 3D translation can be relative to the current gripper position (`abs_mode=False`) or in world coordinates (`abs_mode=True`).
- They perceive `obs_horizon` historical observations, generate `chunk_size` future actions, and execute `test_chunk_size` of the predicted actions. An example with `obs_horizon=3, chunk_size=4, test_chunk_size=2` (a code sketch follows after this list):
Policy sees: |o|o|o|
Policy predicts: | | |a|a|a|a|
Policy executes: | | |a|a|
- They use image input from both static and wrist cameras.
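Below is a toy sketch of this chunking scheme (pure NumPy, with a dummy policy standing in for a real one; none of these names are the repository's API):

```python
import numpy as np

# Toy illustration of the chunking scheme described above (not the repo's API).
obs_horizon, chunk_size, test_chunk_size = 3, 4, 2
action_dim = 3 + 6 + 1          # 3D translation + 6D rotation + gripper open/close

def dummy_policy(obs_window):
    """Stand-in for a policy: consumes obs_horizon observations, predicts chunk_size actions."""
    assert obs_window.shape[0] == obs_horizon
    return np.zeros((chunk_size, action_dim))

episode = [np.random.randn(96, 96, 3) for _ in range(20)]            # fake image observations
t = obs_horizon - 1
while t < len(episode):
    obs_window = np.stack(episode[t - obs_horizon + 1 : t + 1])      # last obs_horizon frames
    actions = dummy_policy(obs_window)                               # (chunk_size, action_dim)
    executed = actions[:test_chunk_size]                             # execute only the first test_chunk_size
    t += test_chunk_size                                             # then re-plan
```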
Multi-GPU training and simulation.
- We achieve multi-GPU / multi-machine training with HuggingFace accelerate.
- We achieve parallel simulation with the asynchronous vectorized environments provided by stable-baselines3. In practice, we train and evaluate the model on multiple GPUs; for each GPU training process, several parallel environments run on different CPU cores.
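A minimal sketch of this pattern, assuming HuggingFace accelerate and stable-baselines3's `SubprocVecEnv`; the network, dataset, and environment below are placeholders rather than the repository's classes:

```python
import torch
import gymnasium as gym
from accelerate import Accelerator
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env():
    # Stand-in for the repo's robomimic / PushT environment wrapper.
    return gym.make("CartPole-v1")

if __name__ == "__main__":
    # One accelerate process per GPU; `accelerate launch` spawns and coordinates them.
    accelerator = Accelerator()

    model = torch.nn.Linear(10, 2)                       # placeholder network
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    dataset = torch.utils.data.TensorDataset(torch.randn(64, 10))
    loader = torch.utils.data.DataLoader(dataset, batch_size=8)
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    # Each GPU training process owns several CPU simulators running in parallel.
    envs = SubprocVecEnv([make_env for _ in range(4)])
    obs = envs.reset()
    envs.close()
```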
Optimizing data loading pipeline and profiling.
- We implement a simple GPU data prefetching mechanism.
- Image preprocessing is performed on the GPU instead of the CPU.
- You can perform detailed profiling of the training pipeline by setting `do_profile=True` and check the trace log with `torch_tb_profiler`. Introduction to the PyTorch profiler.
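For reference, a minimal PyTorch profiler sketch similar in spirit to what `do_profile=True` enables; the schedule and log path below are illustrative, not the repository's defaults:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profile_log"),  # trace viewable with torch_tb_profiler
    record_shapes=True,
) as prof:
    for step in range(8):
        x = torch.randn(64, 512, device=device)
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # advance the profiler schedule once per training step
```

Then install `torch_tb_profiler` and run `tensorboard --logdir ./profile_log` to inspect the trace.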
Sorry...but you should tune the learning rate manually.
- We try new algorithms here, so we do not know in advance when an algorithm will converge. Thus, we use a simple constant learning rate scheduler with warmup. To get the best performance, you should set the learning rate manually: a high learning rate at the beginning and a lower learning rate at the end.
- Sometimes you need to freeze the visual encoder in the first training stage, and unfreeze it once the loss converges. This can be done by setting `freeze_vision_tower=<True/False>` in the script. A sketch of both tricks follows.
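A minimal sketch of both tricks, assuming placeholder module names (`vision_tower`, `head`) rather than the repository's actual attributes:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Placeholder two-part model: a visual encoder and an action head.
model = torch.nn.ModuleDict({
    "vision_tower": torch.nn.Linear(64, 64),
    "head": torch.nn.Linear(64, 10),
})

def make_optimizer_and_scheduler(lr, warmup_steps=1000):
    # Constant learning rate after a linear warmup.
    optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
    scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
    return optimizer, scheduler

# Stage 1: freeze the visual encoder and train with a relatively high learning rate.
for p in model["vision_tower"].parameters():
    p.requires_grad = False
optimizer, scheduler = make_optimizer_and_scheduler(lr=3e-4)

# ... train until the loss plateaus, then relaunch with new settings ...

# Stage 2: unfreeze the encoder and continue with a lower learning rate.
for p in model["vision_tower"].parameters():
    p.requires_grad = True
optimizer, scheduler = make_optimizer_and_scheduler(lr=1e-5)
```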
We implement the following algorithms:
Google's RT-1.
- Our implementation supports EfficientNet v1/v2, and you can directly load pretrained weights via the torchvision API. Google's implementation only supports EfficientNet v1.
- You should choose a text encoder from Sentence Transformers to generate text embeddings and send them to RT-1 (see the sketch after this list).
- Our implementation predicts multiple continuous actions (see above) instead of a single discrete action. We find this setting gives better performance.
- To get better performance, you should freeze the EfficientNet visual encoder in the first training stage and unfreeze it in the second stage.
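A minimal sketch of generating instruction embeddings with Sentence Transformers; the model name `all-MiniLM-L6-v2` is just one common choice, and the `rt1_policy` call is a placeholder, not the repository's actual interface:

```python
from sentence_transformers import SentenceTransformer

# Encode language instructions once; feed the embeddings to RT-1 as conditioning.
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")
instructions = ["pick up the square nut", "place it on the peg"]
text_emb = text_encoder.encode(instructions)   # numpy array, shape (2, 384) for this model
print(text_emb.shape)
# rt1_policy(images, text_emb)                 # placeholder call; the real interface lives in the repo
```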
Cheng Chi's Diffusion Policy (UNet / Transformer).
- Our architecture is a copy of Cheng Chi's network. We test it in our pipeline and it achieves the same performance. Note that Diffusion Policy trains two ResNet visual encoders (one per camera view) from scratch, so we never freeze the visual encoders.
- We also support predicting actions in epsilon / sample / v-space and other diffusion schedulers (see the sketch after this list). The `DiffusionPolicy` wrapper can easily adapt to different network designs.
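For reference, a minimal sketch of switching the prediction space with a HuggingFace diffusers scheduler; this mirrors the options exposed by our wrapper, but the names below are diffusers' own rather than the repository's:

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(
    num_train_timesteps=100,
    beta_schedule="squaredcos_cap_v2",
    prediction_type="epsilon",        # or "sample" / "v_prediction"
)

clean_actions = torch.randn(8, 16, 10)            # (batch, chunk_size, action_dim)
noise = torch.randn_like(clean_actions)
t = torch.randint(0, scheduler.config.num_train_timesteps, (8,))
noisy_actions = scheduler.add_noise(clean_actions, noise, t)

# The training target depends on the chosen prediction space.
if scheduler.config.prediction_type == "epsilon":
    target = noise
elif scheduler.config.prediction_type == "sample":
    target = clean_actions
else:                                              # "v_prediction"
    target = scheduler.get_velocity(clean_actions, noise, t)
```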
Florence policy, developed on Microsoft's Florence-2 VLM, which is trained on VQA, OCR, detection, and segmentation tasks with 900M images.
- We develop the policy on top of the pretrained model.
- Unlike OpenVLA and RT-2, Florence-2 is much smaller, with 0.23B (Florence-2-base) or 0.7B (Florence-2-large) parameters.
- Unlike OpenVLA and RT-2, which generate discrete actions, our Florence policy generates continuous actions with a linear action head, a diffusion transformer action head from Cheng Chi's Diffusion Policy, or a DiT action head from the MDT policy.
- The following figure illustrates the architecture of the Florence policy. We always freeze the DaViT visual encoder of Florence-2, which is good enough that unfreezing it does not improve the success rate. A sketch of loading Florence-2 and freezing its encoder follows.
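A sketch of loading Florence-2 and freezing its DaViT encoder; the `vision_tower` attribute name follows Microsoft's released modeling code, but double-check it against the checkpoint you download, and `model_path` is the folder from the installation section below:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "/path/to/downloaded/florence/folder"   # see "For Florence-based models" below
florence = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Freeze the DaViT image encoder; only the language/fusion layers and the action head train.
for p in florence.vision_tower.parameters():
    p.requires_grad = False

# An action head (linear / diffusion transformer / MDT DiT) is then attached on top of the
# fused vision-language features; see the Script/ configurations for the concrete variants.
```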
Square task with professional demos:
Policy | Success Rate | Model Size |
---|---|---|
RT-1 | 62% | 23.8M |
Diffusion Policy (UNet) | 88.5% | 329M |
Diffusion Policy (Transformer) | 90.5% | 31.5M |
Florence (linear head) | 88.5% | 270.8M |
Florence (diffusion head - MDT DiT) | 93.75% | 322.79M |
*The success rate is averaged over the 3 latest checkpoints. Each checkpoint is evaluated with 96 rollouts.
*For diffusion models, we save both the trained model and its exponential moving average (EMA) in each checkpoint.
PushT task:
Policy | Success Rate | Model Size |
---|---|---|
RT-1 | 52% | 23.8M |
Diffusion Policy (UNet) | 64.5% | 76M |
Florence (linear head) | 53% | 270.8M |
Florence (diffusion head - MDT DiT) | 64% | 322.79M |
*Each checkpoint is evaluated with 96 rollouts.
*A success in the PushT environment requires a final IoU > 95% (which is difficult to achieve at low rendering resolution). If you raise the resolution or reduce the threshold, the success rate will be much higher.
You can use GitHub mirror sites to avoid connection problems in some regions. Different simulators have different recommended Python versions, which are mentioned below.
conda create -n mimic python=3.x
conda activate mimic
apt install curl git libgl1-mesa-dev libgl1-mesa-glx libglew-dev libosmesa6-dev software-properties-common net-tools unzip vim virtualenv wget xpra xserver-xorg-dev libglfw3-dev patchelf cmake
git clone https://github.com/EDiRobotics/mimictest
cd mimictest
pip install -e .
Now, depending on the environment and model you want, please perform the following steps.
For Robomimic experiments.
The recommended Python version is 3.9. You need to install `robomimic` and `robosuite` via:
pip install robosuite@https://github.com/cheng-chi/robosuite/archive/277ab9588ad7a4f4b55cf75508b44aa67ec171f0.tar.gz
pip install robomimic
Recent robosuite versions have switched to DeepMind's MuJoCo 3 backend, but we still use the old version with MuJoCo 2.1. This is because the dataset was recorded in MuJoCo 2.1, whose dynamics differ slightly from MuJoCo 3.
You should also download the dataset, which contains `robomimic_image.zip` or `robomimic_lowdim.zip`, from the official link or HuggingFace. In this example, I use the hfd tool from HF-Mirror. You can set the environment variable `export HF_ENDPOINT=https://hf-mirror.com` to avoid connection problems in some regions.
apt install git-lfs aria2
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
./hfd.sh EDiRobotics/mimictest_data --dataset --tool aria2c -x 9
If you only want to download a subset of the data, e.g., the square task with image input:
./hfd.sh EDiRobotics/mimictest_data --dataset --tool aria2c -x 9 --include robomimic_image/square.zip
For the PushT experiment.
The recommended Python version is 3.10. You can install the environment via
pip install gym-pusht
Then you can download the PushT dataset from the official link.
For Florence-based models.
To use Florence-based models, you should download one of them from HuggingFace, for example:
./hfd.sh microsoft/Florence-2-base --model --tool aria2c -x 9
Then set `model_path` in the script, for example:
# in Script/FlorenceImage.py
model_path = "/path/to/downloaded/florence/folder"
You need to install Florence-specific dependencies, e.g., flash-attention. You can do this with:
pip install -e .[florence]
- You should first run `accelerate config` to set environment parameters (number of GPUs, precision, etc.). We recommend using `bf16`.
- Download and unzip the dataset mentioned above.
- Please check and modify the settings (e.g., train or eval, and the corresponding options) in the script you want to run, under the `Script` directory. Each script represents a configuration of an algorithm.
- Then run `accelerate launch Script/<the script you choose>.py`.
- `GLIBCXX_3.4.30' not found
ImportError: /opt/conda/envs/test/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /lib/x86_64-linux-gnu/libLLVM-15.so.1)
You can try `conda install -c conda-forge gcc=12.1`, a magical command that automatically installs some missing dependencies.
Also check this link.
- Compiling flash-attn takes too much time
You can download a pre-built wheel from the official releases instead of building it yourself. For example (choose a wheel that matches your system):
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.4cxx11abiTRUE-cp39-cp39-linux_x86_64.whl
pip install flash_attn-2.6.3+cu118torch2.4cxx11abiTRUE-cp39-cp39-linux_x86_64.whl
When installing PyTorch, make sure the torch CUDA version matches your CUDA driver version (e.g., 11.8).
- Cannot initialize a EGL device display
Cannot initialize a EGL device display. This likely means that your EGL driver does not support the PLATFORM_DEVICE extension, which is required for creating a headless rendering context.
You can try `conda install -c conda-forge gcc=12.1`.
- fatal error: GL/osmesa.h: No such file or directory
/tmp/pip-install-rsxccpmh/mujoco-py/mujoco_py/gl/osmesashim.c:1:23: fatal error: GL/osmesa.h: No such file or directory
compilation terminated.
error: command 'gcc' failed with exit status 1
You can try conda install -c conda-forge mesalib glew glfw
or check this link.
- cannot find -lGL
/home/ubuntu/anaconda3/compiler_compat/ld: cannot find -lGL
collect2: error: ld returned 1 exit status
error: command 'gcc' failed with exit status 1
You can try conda install -c conda-forge mesa-libgl-devel-cos7-x86_64
or check this link.
- SystemError: initialization of _internal failed without raising an exception
You can simply run `pip install -U numba` or check this link.
- ImportError: libGL.so.1: cannot open shared object file
apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
Or check this link.
- failed to EGL with glad
The core problem seems to be the lack of `libEGL.so.1`. You may try `apt-get update && apt-get install libegl1`. If other required packages are missing while installing libegl1, please install them as well.