Merge branch 'master' into HER
araffin committed Oct 18, 2018
2 parents 31a9183 + 65d6583 commit 3131bec
Showing 43 changed files with 464 additions and 214 deletions.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/issue-template.md
@@ -1,6 +1,6 @@
---
name: Issue Template
about: How to create and issue for this repository
about: How to create an issue for this repository

---

4 changes: 3 additions & 1 deletion .travis.yml
@@ -12,4 +12,6 @@ install:
- docker pull araffin/stable-baselines-cpu

script:
- docker run --env CODACY_PROJECT_TOKEN=$CODACY_PROJECT_TOKEN --rm --network host --ipc=host --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/stable-baselines-cpu bash -c "cd /root/code/stable-baselines/ && pytest --cov-config .coveragerc --cov-report term --cov-report xml --cov=. -v tests/ && python-codacy-coverage -r coverage.xml --token=$CODACY_PROJECT_TOKEN"
# For pull requests from fork, Codacy token is not available, leading to build failure
- 'if [ "$TRAVIS_PULL_REQUEST" != "false" ]; then docker run --rm --network host --ipc=host --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/stable-baselines-cpu bash -c "cd /root/code/stable-baselines/ && pytest --cov-config .coveragerc --cov-report term --cov=. -v tests/"; fi'
- 'if [ "$TRAVIS_PULL_REQUEST" = "false" ]; then docker run --env CODACY_PROJECT_TOKEN=$CODACY_PROJECT_TOKEN --rm --network host --ipc=host --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/stable-baselines-cpu bash -c "cd /root/code/stable-baselines/ && pytest --cov-config .coveragerc --cov-report term --cov-report xml --cov=. -v tests/ && python-codacy-coverage -r coverage.xml --token=$CODACY_PROJECT_TOKEN"; fi'
6 changes: 5 additions & 1 deletion README.md
@@ -125,7 +125,7 @@ All the following examples can be executed online using Google colab notebooks:
| ACER | :heavy_check_mark: | :heavy_check_mark: | :x: <sup>(5)</sup> | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: |
| ACKTR | :heavy_check_mark: | :heavy_check_mark: | :x: <sup>(5)</sup> | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: |
| DDPG | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| DeepQ | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :x: | :x: |
| DQN | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :x: | :x: |
| GAIL <sup>(2)</sup> | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: | :x: | :heavy_check_mark: <sup>(4)</sup> |
| HER <sup>(3)</sup> | :x: <sup>(5)</sup> | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| PPO1 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: <sup>(4)</sup> |
@@ -170,6 +170,10 @@ To cite this repository in publications:
}
```

## Maintainers

Stable-Baselines is currently maintained by [Ashley Hill](https://github.com/hill-a) (aka @hill-a) and [Antonin Raffin](https://github.com/araffin) (aka @araffin).

## How To Contribute

To anyone interested in making the baselines better, there is still some documentation that needs to be done.
11 changes: 10 additions & 1 deletion docs/guide/custom_policy.rst
@@ -3,7 +3,7 @@
Custom Policy Network
---------------------

Stable baselines provides default policy networks for images (CNNPolicies)
Stable baselines provides default policy networks (see :ref:`Policies <policies>`) for images (CNNPolicies)
and other types of input features (MlpPolicies).
However, you can also easily define a custom architecture for the policy (or value) network:

@@ -30,6 +30,15 @@ However, you can also easily define a custom architecture for the policy (or val
# Train the agent
model.learn(total_timesteps=100000)
del model
# When loading a model with a custom policy
# you MUST explicitly pass the policy when loading the saved model
model = A2C.load(policy=CustomPolicy)
.. warning::

When loading a model with a custom policy, you must pass the custom policy explicitly (cf. the previous example).
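
A minimal sketch of the full save/load round trip for the custom policy above (the save path ``"a2c_custom_policy"`` is a hypothetical placeholder):

.. code-block:: python

    # Save the trained model, then reload it later.
    # The custom policy class must be supplied again at load time,
    # since it cannot be recovered from the saved file alone.
    model.save("a2c_custom_policy")
    del model
    model = A2C.load("a2c_custom_policy", policy=CustomPolicy)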


You can also register your policy, to help with code simplicity: you can then refer to your custom policy using a string.

2 changes: 1 addition & 1 deletion docs/guide/examples.rst
@@ -40,7 +40,7 @@ In the following example, we will train, save and load an A2C model on the Lunar

.. note::
LunarLander requires the python package `box2d`.
You can install it using ``apt install swing`` and then ``pip install box2d box2d-kengz``
You can install it using ``apt install swig`` and then ``pip install box2d box2d-kengz``

.. code-block:: python
41 changes: 35 additions & 6 deletions docs/misc/changelog.rst
@@ -5,11 +5,44 @@ Changelog

For download links, please look at `Github release page <https://github.com/hill-a/stable-baselines/releases>`_.

Master version 2.0.0.rc0 (TO BE RELEASED SOON)
-----------------------------------------------
Pre Release 2.1.1.a0 (WIP)
--------------------------

- fixed MpiAdam synchronization issue in PPO1 (thanks to @brendenpetersen) issue #50


Release 2.1.0 (2018-10-2)
-------------------------

.. warning::

This version contains breaking changes for DQN policies; please read the full details

**Bug fixes + doc update**


- added patch fix for equal function using `gym.spaces.MultiDiscrete` and `gym.spaces.MultiBinary`
- fixes for DQN action_probability
- re-added double DQN + refactored DQN policies **breaking changes**
- replaced `async` with `async_eigen_decomp` in ACKTR/KFAC for python 3.7 compatibility
- removed action clipping for prediction of continuous actions (see issue #36)
- fixed NaN issue due to clipping the continuous action in the wrong place (issue #36)
- documentation was updated (policy + DDPG example hyperparameters)

Release 2.0.0 (2018-09-18)
--------------------------

.. warning::

This version contains breaking changes; please read the full details

**Tensorboard, refactoring and bug fixes**


- Renamed DeepQ to DQN **breaking changes**
- Renamed DeepQPolicy to DQNPolicy **breaking changes**
- fixed DDPG behavior **breaking changes**
- changed default policies for DDPG, so that DDPG now works correctly **breaking changes**
- added more documentation (some modules from common).
- added doc about using custom env
- added Tensorboard support for A2C, ACER, ACKTR, DDPG, DeepQ, PPO1, PPO2 and TRPO
@@ -20,8 +53,6 @@ Master version 2.0.0.rc0 (TO BE RELEASED SOON)
- fixed PPO1 and TRPO done values for recurrent policies
- fixed image normalization not occurring when using images
- updated VecEnv objects for the new Gym version
- changed default policies for DDPG, so that DDPG now works correctly
- fixed DDPG behavior
- added test for DDPG
- refactored DQN policies
- added registry for policies, can be passed as string to the agent
@@ -33,8 +64,6 @@ Master version 2.0.0.rc0 (TO BE RELEASED SOON)
- added assert in PPO2 for recurrent policies
- fixed predict function to handle both vectorized and unwrapped environment
- added input check to the predict function
- changed DeepQ to DQN **breaking changes**
- changed DeepQPolicy to DQNPolicy **breaking changes**
- refactored ActorCritic models to reduce code duplication
- refactored Off Policy models (to begin HER and replay_buffer refactoring)
- added tests for auto vectorization detection
2 changes: 1 addition & 1 deletion docs/modules/a2c.rst
@@ -62,7 +62,7 @@ Train a A2C agent on `CartPole-v1` using 4 processes.
del model # remove to demonstrate saving and loading
A2C.load("a2c_cartpole")
model = A2C.load("a2c_cartpole")
obs = env.reset()
while True:
5 changes: 2 additions & 3 deletions docs/modules/acer.rst
@@ -42,8 +42,7 @@ Example
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy, \
CnnPolicy, CnnLstmPolicy, CnnLnLstmPolicy
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import ACER
@@ -57,7 +56,7 @@ Example
del model # remove to demonstrate saving and loading
ACER.load("acer_cartpole")
model = ACER.load("acer_cartpole")
obs = env.reset()
while True:
5 changes: 2 additions & 3 deletions docs/modules/acktr.rst
@@ -43,8 +43,7 @@ Example
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy, \
CnnPolicy, CnnLstmPolicy, CnnLnLstmPolicy
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import ACKTR
@@ -58,7 +57,7 @@ Example
del model # remove to demonstrate saving and loading
ACKTR.load("acktr_cartpole")
model = ACKTR.load("acktr_cartpole")
obs = env.reset()
while True:
19 changes: 15 additions & 4 deletions docs/modules/ddpg.rst
@@ -13,6 +13,17 @@ DDPG
The DDPG model does not support ``stable_baselines.common.policies`` because it uses q-value instead
of value estimation, as a result it must use its own policy models (see :ref:`ddpg_policies`).


.. rubric:: Available Policies

.. autosummary::
:nosignatures:

MlpPolicy
LnMlpPolicy
CnnPolicy
LnCnnPolicy

Notes
-----

@@ -47,7 +58,7 @@ Example
import gym
import numpy as np
from stable_baselines.ddpg.policies import MlpPolicy, CnnPolicy
from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec
from stable_baselines import DDPG
@@ -58,15 +69,15 @@
# the noise objects for DDPG
n_actions = env.action_space.shape[-1]
param_noise = None
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=float(0.2) * np.ones(n_actions))
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))
model = DDPG(MlpPolicy, env, verbose=1, param_noise=param_noise, action_noise=action_noise)
model.learn(total_timesteps=25000)
model.learn(total_timesteps=400000)
model.save("ddpg_mountain")
del model # remove to demonstrate saving and loading
DDPG.load("ddpg_mountain")
model = DDPG.load("ddpg_mountain")
obs = env.reset()
while True:
14 changes: 12 additions & 2 deletions docs/modules/dqn.rst
@@ -14,6 +14,16 @@ and its extensions (Double-DQN, Dueling-DQN, Prioritized Experience Replay).
The DQN model does not support ``stable_baselines.common.policies``,
as a result it must use its own policy models (see :ref:`deepq_policies`).

.. rubric:: Available Policies

.. autosummary::
:nosignatures:

MlpPolicy
LnMlpPolicy
CnnPolicy
LnCnnPolicy

Notes
-----

@@ -46,7 +56,7 @@ Example
import gym
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.deepq.policies import MlpPolicy, CnnPolicy
from stable_baselines.deepq.policies import MlpPolicy
from stable_baselines import DQN
env = gym.make('CartPole-v1')
@@ -58,7 +68,7 @@
del model # remove to demonstrate saving and loading
DeepQ.load("deepq_cartpole")
model = DQN.load("deepq_cartpole")
obs = env.reset()
while True:
24 changes: 24 additions & 0 deletions docs/modules/policies.rst
@@ -5,6 +5,30 @@
Policy Networks
===============

Stable-baselines provides a set of default policies that can be used with most action spaces.
If you need more control over the policy architecture, you can also create a custom policy (see :ref:`custom_policy`).

.. note::

CnnPolicies are for images only. MlpPolicies are made for other types of features (e.g. robot joints).

.. warning::
For all algorithms (except DDPG), continuous actions are only clipped during training
(to avoid out-of-bound errors). However, you have to clip the action manually when using
the `predict()` method.
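
A minimal sketch of that manual clipping (assuming ``model``, ``env`` and ``obs`` come from a setup like the examples on the algorithm pages, with a continuous action space):

.. code-block:: python

    import numpy as np

    # predict() returns the raw, unclipped action for continuous action spaces
    action, _states = model.predict(obs)
    # clip it to the bounds of the action space before stepping the environment
    action = np.clip(action, env.action_space.low, env.action_space.high)
    obs, reward, done, info = env.step(action)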

.. rubric:: Available Policies

.. autosummary::
:nosignatures:

MlpPolicy
MlpLstmPolicy
MlpLnLstmPolicy
CnnPolicy
CnnLstmPolicy
CnnLnLstmPolicy


Base Classes
------------
8 changes: 4 additions & 4 deletions docs/modules/ppo1.rst
@@ -20,7 +20,8 @@ For that, ppo uses clipping to avoid too large update.
Notes
-----

- Original paper: https://arxiv.org/abs/1502.05477
- Original paper: https://arxiv.org/abs/1707.06347
- Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- ``mpirun -np 8 python -m stable_baselines.ppo1.run_atari`` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (``-h``) for more options.
- ``python -m stable_baselines.ppo1.run_mujoco`` runs the algorithm for 1M frames on a Mujoco environment.
@@ -52,8 +53,7 @@ Example
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy, \
CnnPolicy, CnnLstmPolicy, CnnLnLstmPolicy
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO1
@@ -66,7 +66,7 @@
del model # remove to demonstrate saving and loading
PPO1.load("ppo1_cartpole")
model = PPO1.load("ppo1_cartpole")
obs = env.reset()
while True:
11 changes: 6 additions & 5 deletions docs/modules/ppo2.rst
@@ -25,12 +25,13 @@ For that, ppo uses clipping to avoid too large update.
Notes
-----

- Original paper: https://arxiv.org/abs/1707.06347
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- ``python -m stable_baselines.ppo2.run_atari`` runs the algorithm for 40M
- Original paper: https://arxiv.org/abs/1707.06347
- Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- ``python -m stable_baselines.ppo2.run_atari`` runs the algorithm for 40M
frames = 10M timesteps on an Atari game. See help (``-h``) for more
options.
- ``python -m stable_baselines.ppo2.run_mujoco`` runs the algorithm for 1M
- ``python -m stable_baselines.ppo2.run_mujoco`` runs the algorithm for 1M
frames on a Mujoco environment.

Can I use?
@@ -73,7 +74,7 @@ Train a PPO agent on `CartPole-v1` using 4 processes.
del model # remove to demonstrate saving and loading
PPO2.load("ppo2_cartpole")
model = PPO2.load("ppo2_cartpole")
# Enjoy trained agent
obs = env.reset()
5 changes: 2 additions & 3 deletions docs/modules/trpo.rst
@@ -43,8 +43,7 @@ Example
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy, \
CnnPolicy, CnnLstmPolicy, CnnLnLstmPolicy
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import TRPO
@@ -57,7 +56,7 @@ Example
del model # remove to demonstrate saving and loading
TRPO.load("trpo_cartpole")
model = TRPO.load("trpo_cartpole")
obs = env.reset()
while True:
20 changes: 19 additions & 1 deletion stable_baselines/__init__.py
@@ -1,3 +1,6 @@
import gym
import numpy as np

from stable_baselines.a2c import A2C
from stable_baselines.acer import ACER
from stable_baselines.acktr import ACKTR
@@ -9,4 +12,19 @@
from stable_baselines.ppo2 import PPO2
from stable_baselines.trpo_mpi import TRPO

__version__ = "2.0.0.rc0"
__version__ = "2.1.1.a0"


# patch Gym spaces to add equality functions, if not implemented
# See https://github.com/openai/gym/issues/1171
if gym.spaces.MultiBinary.__eq__ == object.__eq__: # by default, all classes have the __eq__ function from object.
def _eq(self, other):
return self.n == other.n

gym.spaces.MultiBinary.__eq__ = _eq

if gym.spaces.MultiDiscrete.__eq__ == object.__eq__:
def _eq(self, other):
return np.all(self.nvec == other.nvec)

gym.spaces.MultiDiscrete.__eq__ = _eq
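
As a quick, hypothetical illustration of what this patch enables (not part of the commit itself): two structurally identical spaces now compare by content rather than by identity.

```python
import gym
import stable_baselines  # importing applies the __eq__ patch above

space_a = gym.spaces.MultiDiscrete([3, 3])
space_b = gym.spaces.MultiDiscrete([3, 3])

# Without the patch, object.__eq__ would compare identity and return False here
assert space_a == space_b
```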
