
Deep Multi-Objective Reinforcement Learning

A Generalized Algorithm for Multi-Objective RL and Policy Adaptation, arXiv:1908.08342.


We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model that produces optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After this initial learning phase, our agent can quickly adapt to any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.
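As an illustrative sketch (not the repository's actual implementation), the generalized "envelope" Bellman target described above can be computed by scalarizing the next-state Q-vectors with the current preference and maximizing over both actions and a batch of sampled preferences:

```python
import numpy as np

def envelope_target(q_next, reward, omega, gamma=0.99):
    """Envelope Bellman target for one transition (simplified sketch).

    q_next: array (num_prefs, num_actions, reward_dim) -- multi-objective
            Q-values at the next state, evaluated under sampled preferences.
    reward: array (reward_dim,) -- the vector reward r(s, a).
    omega:  array (reward_dim,) -- the current linear preference.
    """
    # Scalarize every candidate Q-vector with the current preference ...
    scalarized = np.tensordot(q_next, omega, axes=([2], [0]))
    # ... then take the envelope: argmax over both sampled preferences
    # and actions.
    i, j = np.unravel_index(np.argmax(scalarized), scalarized.shape)
    # The target remains a vector: r + gamma * Q(s', a*, w*).
    return reward + gamma * q_next[i, j]
```

The key difference from scalar Q-learning is that the maximization runs over the sampled preferences as well as the actions, while the target itself stays a vector in reward space.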


We conduct experiments on two synthetic domains, Deep Sea Treasure (DST) and Fruit Tree Navigation (FTN), as well as two complex real domains, Task-Oriented Dialog Policy Learning (Dialog) and the SuperMario Game (SuperMario).


  • Example - train envelope MOQ-learning on FTN domain:
    python --env-name ft --method crl-envelope --model linear --gamma 0.99 --mem-size 4000 --batch-size 256 --lr 1e-3 --epsilon 0.5 --epsilon-decay --weight-num 32 --episode-num 5000 --optimizer Adam --save crl/envelope/saved/ --log crl/envelope/logs/ --update-freq 100 --beta 0.01 --name 0

The code for our envelope MOQ-learning algorithm is in synthetic/crl/envelope/; the neural network architectures are configurable in synthetic/crl/envelope/models. The two synthetic environments are under synthetic/envs.
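A minimal sketch of a preference-conditioned Q-network in PyTorch (class name, layer sizes, and architecture here are illustrative assumptions, not the configurable models shipped in synthetic/crl/envelope/models):

```python
import torch
import torch.nn as nn

class PreferenceConditionedQ(nn.Module):
    """Maps (state, preference) to one multi-objective Q-vector per action,
    so a single network covers the whole space of linear preferences."""

    def __init__(self, state_dim, reward_dim, num_actions, hidden=64):
        super().__init__()
        self.reward_dim = reward_dim
        self.num_actions = num_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + reward_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions * reward_dim),
        )

    def forward(self, state, omega):
        # Concatenate the state with the preference vector as input.
        x = torch.cat([state, omega], dim=-1)
        q = self.net(x)
        # Reshape to one Q-vector (length reward_dim) per action.
        return q.view(-1, self.num_actions, self.reward_dim)
```

Conditioning the network on the preference is what allows the agent, after training, to adapt to a new preference simply by changing the input omega rather than retraining.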


Code for Task-Oriented Dialog Policy Learning. The environment is modified from PyDial.

  • Example - train envelope MOQ-learning on Dialog domain:
    pydial train config/MORL/Train/envelope.cfg

The code for our envelope MOQ-learning algorithm is in pydial/policy/.


The multi-objective version of the SuperMario Game. The environment is modified from Kautenja/gym-super-mario-bros.

  • Example - train envelope MOQ-learning on SuperMario domain:
    python --env-id SuperMarioBros-v2 --use-cuda --use-gae --life-done --single-stage --training --standardization --num-worker 16 --sample-size 8 --beta 0.05 --name e3c_b05

The code for our envelope MOQ-learning algorithm is in multimario/. The two multi-objective versions of the environment are also in multimario/.
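In the multi-objective setting the environment emits a vector reward rather than a scalar. As a hedged illustration (the objective names below are assumptions for the sketch, not the exact reward components defined in multimario/), an agent with a fixed linear preference scalarizes that vector as follows:

```python
import numpy as np

# Hypothetical objective names, for illustration only; the actual
# multi-objective SuperMario environment defines its own reward vector.
OBJECTIVES = ["x_progress", "time_penalty", "deaths", "coins", "enemies"]

def scalarize(vector_reward, preference):
    """Project a vector reward onto a linear preference (dot product)."""
    vector_reward = np.asarray(vector_reward, dtype=float)
    preference = np.asarray(preference, dtype=float)
    assert vector_reward.shape == preference.shape
    return float(preference @ vector_reward)

# An agent that cares mostly about forward progress:
pref = np.array([0.6, 0.1, 0.1, 0.1, 0.1])
```

Varying the preference vector changes which behavior (speed-running, coin collecting, enemy stomping) the scalarized return favors, without changing the environment itself.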


@article{yang2019generalized,
  author    = {Runzhe Yang and
               Xingyuan Sun and
               Karthik Narasimhan},
  title     = {A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation},
  journal   = {CoRR},
  volume    = {abs/1908.08342},
  year      = {2019},
  url       = {},
  archivePrefix = {arXiv},
  eprint    = {1908.08342}
}
