Source code for **Meta-learning Parameterized Skills** (ICML 2023).

We propose a novel parameterized skill-learning algorithm that aims to learn transferable parameterized skills and synthesize them into a new action space that supports efficient learning in long-horizon tasks. Our approach leverages off-policy Meta-RL combined with a trajectory-centric smoothness term to learn a set of parameterized skills. The agent then uses these learned skills to construct a three-level hierarchical framework that models a Temporally-extended Parameterized Action Markov Decision Process. We empirically demonstrate that the proposed algorithm enables an agent to solve a set of highly difficult long-horizon (obstacle-course and robot manipulation) tasks.
This repo is built upon PEARL. We use pytorch-softdtw-cuda for the DTW implementation.

Requirements: PyTorch (we use version 1.7.1, but later versions should also work), MuJoCo 210, Gym, etc.
- Install the modified `metaworld`:

```
$ cd metaworld
$ pip install -e .
```
To learn the parameterized skills for the different coffee tasks, run:

```
python launch_experiment_ml1.py ./configs/[ENV_NAME].json
```

By default the code will use the GPU; to use the CPU instead, set `use_gpu=False` in the appropriate config file.
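As a rough sketch of where that flag lives in a config file (the nesting under `util_params` follows PEARL's config conventions and is an assumption here; compare against the actual files under `./configs/`):

```json
{
  "util_params": {
    "use_gpu": false
  }
}
```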
Output files will be written to `./output/[ENV]/[EXP NAME]`, where the experiment name is uniquely generated based on the date. The file `progress.csv` contains statistics logged over the course of training.
After learning all the coffee-related parameterized skills, run:

```
python eval_act_boundcoffee.py ./output/[ENV]/[EXP NAME].json ./output/[ENV] --num_trajs 50
```

to get the bounds of the learned new parameterized action spaces.
Then, to learn the high-level and mid-level control policies using the learned parameterized skills, modify the `coffee-full.json` file with your own paths to the learned skill-conditioned policies and the action-space bounds, and run:

```
python launch_experiment_coffeeplanning.py ./configs/coffee-full.json
```
To evaluate the overall policies, run:

```
python sim_policy_coffeeplanning.py ./output/[ENV]/[EXP NAME].json ./output/[ENV] --num_trajs 100
```

You can add `--video` to save the video.