OptionGAN: Learning Joint Reward-Policy Options using Generative Adversarial Inverse Reinforcement Learning

Reinforcement learning has shown promise in learning policies that can solve complex problems. However, manually specifying a good reward function can be difficult, especially for intricate tasks. Inverse reinforcement learning offers a useful paradigm to learn the underlying reward function directly from expert demonstrations. Yet in reality, the corpus of demonstrations may contain trajectories arising from a diverse set of underlying reward functions rather than a single one. Thus, in inverse reinforcement learning, it is useful to decompose the reward function accordingly. The options framework in reinforcement learning is specifically designed to decompose policies in a similar way. We therefore extend the options framework and propose a method to simultaneously recover reward options in addition to policy options. We leverage adversarial methods to learn joint reward-policy options using only observed expert states. We show that this approach works well in both simple and complex continuous control tasks and yields significant performance increases in one-shot transfer learning.
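To make the idea of "reward options" a little more concrete, here is a minimal sketch of a gated mixture of per-option discriminator scores, where a softmax gate over options weights each option's output for a given state. This is an illustration only, not the code in this repository: the linear scorers, the function names (`gated_score`, `softmax`, `sigmoid`, `dot`) and the toy weights are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(w, s):
    """Dot product of two equal-length lists."""
    return sum(wi * si for wi, si in zip(w, s))


def gated_score(state, gate_weights, option_weights):
    """Combine per-option scores with a softmax gate (hypothetical linear models).

    gate_weights and option_weights each hold one weight vector per option.
    Returns the gated score and the gate probabilities.
    """
    gate = softmax([dot(w, state) for w in gate_weights])
    scores = [sigmoid(dot(w, state)) for w in option_weights]
    return sum(g * d for g, d in zip(gate, scores)), gate


# Toy example with two options (assumed values, for illustration only).
state = [0.5, -1.0, 2.0]
gate_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
option_weights = [[0.2, 0.1, -0.3], [-0.5, 0.4, 0.1]]
score, gate = gated_score(state, gate_weights, option_weights)
```

Because the gate probabilities sum to one and each per-option score lies in (0, 1), the combined score also lies in (0, 1). In practice the scorers would be neural networks trained adversarially rather than fixed linear maps.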


Much of the code here is adapted from several existing sources, chosen for their best-performing results. We drew on some of the following repositories for inspiration.


To run experiments, use the run scripts in the main directory.


python Hopper-v1 /path/to/experts/mujoco/expert_rollouts_Hopper-v1.pickle --num_expert_rollouts 10 --timesteps_per_batch 25000 --d_num_epochs_per_step 3 --d_mutual_info_penalty 0.1 --n_iters 500 --num_options 2
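For readers who want to see how the arguments in the command above fit together, the following is a hypothetical `argparse` sketch that accepts the same two positional arguments and the same flags. It is not the repository's actual parser: the positional names `env_name` and `expert_rollouts_path`, the function name `build_parser`, and the argument types (inferred from the example values) are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring only the flags visible in the example command."""
    parser = argparse.ArgumentParser(
        description="Illustrative parser; not the repository's own."
    )
    # Positional names are assumed; the command only shows their values.
    parser.add_argument("env_name")
    parser.add_argument("expert_rollouts_path")
    # Types are inferred from the example values in the command.
    parser.add_argument("--num_expert_rollouts", type=int)
    parser.add_argument("--timesteps_per_batch", type=int)
    parser.add_argument("--d_num_epochs_per_step", type=int)
    parser.add_argument("--d_mutual_info_penalty", type=float)
    parser.add_argument("--n_iters", type=int)
    parser.add_argument("--num_options", type=int)
    return parser


# The same arguments as in the example command, passed as a list.
example_argv = [
    "Hopper-v1",
    "/path/to/experts/mujoco/expert_rollouts_Hopper-v1.pickle",
    "--num_expert_rollouts", "10",
    "--timesteps_per_batch", "25000",
    "--d_num_epochs_per_step", "3",
    "--d_mutual_info_penalty", "0.1",
    "--n_iters", "500",
    "--num_options", "2",
]
args = build_parser().parse_args(example_argv)
```

Parsing the example this way makes the grouping explicit: an environment name, a path to pickled expert rollouts, and six numeric settings, including the number of options.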


We're still cleaning up the code and merging in changes from private repositories for final publication, so if you run into anything unexpected, please let us know.


