Merge branch 'master' into HER
araffin committed Oct 18, 2018
2 parents 31a9183 + 65d6583 commit 3131bec
Showing 43 changed files with 464 additions and 214 deletions.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/issue-template.md
@@ -1,6 +1,6 @@
---
name: Issue Template
about: How to create and issue for this repository
about: How to create an issue for this repository

---

4 changes: 3 additions & 1 deletion .travis.yml
@@ -12,4 +12,6 @@ install:
- docker pull araffin/stable-baselines-cpu

script:
- docker run --env CODACY_PROJECT_TOKEN=$CODACY_PROJECT_TOKEN --rm --network host --ipc=host --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/stable-baselines-cpu bash -c "cd /root/code/stable-baselines/ && pytest --cov-config .coveragerc --cov-report term --cov-report xml --cov=. -v tests/ && python-codacy-coverage -r coverage.xml --token=$CODACY_PROJECT_TOKEN"
# For pull requests from fork, Codacy token is not available, leading to build failure
- 'if [ "$TRAVIS_PULL_REQUEST" != "false" ]; then docker run --rm --network host --ipc=host --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/stable-baselines-cpu bash -c "cd /root/code/stable-baselines/ && pytest --cov-config .coveragerc --cov-report term --cov=. -v tests/"; fi'
- 'if [ "$TRAVIS_PULL_REQUEST" = "false" ]; then docker run --env CODACY_PROJECT_TOKEN=$CODACY_PROJECT_TOKEN --rm --network host --ipc=host --mount src=$(pwd),target=/root/code/stable-baselines,type=bind araffin/stable-baselines-cpu bash -c "cd /root/code/stable-baselines/ && pytest --cov-config .coveragerc --cov-report term --cov-report xml --cov=. -v tests/ && python-codacy-coverage -r coverage.xml --token=$CODACY_PROJECT_TOKEN"; fi'
6 changes: 5 additions & 1 deletion README.md
@@ -125,7 +125,7 @@ All the following examples can be executed online using Google colab notebooks:
| ACER | :heavy_check_mark: | :heavy_check_mark: | :x: <sup>(5)</sup> | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: |
| ACKTR | :heavy_check_mark: | :heavy_check_mark: | :x: <sup>(5)</sup> | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: |
| DDPG | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| DeepQ | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :x: | :x: |
| DQN | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :x: | :x: |
| GAIL <sup>(2)</sup> | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: | :x: | :heavy_check_mark: <sup>(4)</sup> |
| HER <sup>(3)</sup> | :x: <sup>(5)</sup> | :x: | :heavy_check_mark: | :x: | :x: | :x: | :x: |
| PPO1 | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: <sup>(4)</sup> |
@@ -170,6 +170,10 @@ To cite this repository in publications:
}
```

## Maintainers

Stable-Baselines is currently maintained by [Ashley Hill](https://github.com/hill-a) (aka @hill-a) and [Antonin Raffin](https://github.com/araffin) (aka @araffin).

## How To Contribute

To anyone interested in making the baselines better, there is still some documentation that needs to be done.
11 changes: 10 additions & 1 deletion docs/guide/custom_policy.rst
@@ -3,7 +3,7 @@
Custom Policy Network
---------------------

Stable baselines provides default policy networks for images (CNNPolicies)
Stable baselines provides default policy networks (see :ref:`Policies <policies>`) for images (CNNPolicies)
and other types of input features (MlpPolicies).
However, you can also easily define a custom architecture for the policy (or value) network:

@@ -30,6 +30,15 @@ However, you can also easily define a custom architecture for the policy (or val
# Train the agent
model.learn(total_timesteps=100000)
del model
# When loading a model with a custom policy
# you MUST explicitly pass the policy when loading the saved model
model = A2C.load(policy=CustomPolicy)
.. warning::

When loading a model with a custom policy, you must pass the custom policy explicitly (cf. the previous example).
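
A minimal sketch of the full save/load round trip for the custom policy above (the save path ``"a2c_custom_policy"`` is a hypothetical placeholder):

.. code-block:: python

    # Save the trained model, then reload it later.
    # The custom policy class must be supplied again at load time,
    # since it cannot be recovered from the saved file alone.
    model.save("a2c_custom_policy")
    del model
    model = A2C.load("a2c_custom_policy", policy=CustomPolicy)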


You can also register your policy, to help with code simplicity: you can then refer to your custom policy using a string.

2 changes: 1 addition & 1 deletion docs/guide/examples.rst
@@ -40,7 +40,7 @@ In the following example, we will train, save and load an A2C model on the Lunar

.. note::
LunarLander requires the python package `box2d`.
You can install it using ``apt install swing`` and then ``pip install box2d box2d-kengz``
You can install it using ``apt install swig`` and then ``pip install box2d box2d-kengz``

.. code-block:: python
41 changes: 35 additions & 6 deletions docs/misc/changelog.rst
@@ -5,11 +5,44 @@ Changelog

For download links, please look at `Github release page <https://github.com/hill-a/stable-baselines/releases>`_.

Master version 2.0.0.rc0 (TO BE RELEASED SOON)
-----------------------------------------------
Pre Release 2.1.1.a0 (WIP)
--------------------------

- fixed MpiAdam synchronization issue in PPO1 (thanks to @brendenpetersen) issue #50


Release 2.1.0 (2018-10-2)
-------------------------

.. warning::

This version contains breaking changes for DQN policies; please read the full details

**Bug fixes + doc update**


- added patch fix for equal function using `gym.spaces.MultiDiscrete` and `gym.spaces.MultiBinary`
- fixes for DQN action_probability
- re-added double DQN + refactored DQN policies **breaking changes**
- replaced `async` with `async_eigen_decomp` in ACKTR/KFAC for python 3.7 compatibility
- removed action clipping for prediction of continuous actions (see issue #36)
- fixed NaN issue due to clipping the continuous action in the wrong place (issue #36)
- documentation was updated (policy + DDPG example hyperparameters)

Release 2.0.0 (2018-09-18)
--------------------------

.. warning::

This version contains breaking changes; please read the full details

**Tensorboard, refactoring and bug fixes**


- Renamed DeepQ to DQN **breaking changes**
- Renamed DeepQPolicy to DQNPolicy **breaking changes**
- fixed DDPG behavior **breaking changes**
- changed default policies for DDPG, so that DDPG now works correctly **breaking changes**
- added more documentation (some modules from common).
- added doc about using custom env
- added Tensorboard support for A2C, ACER, ACKTR, DDPG, DeepQ, PPO1, PPO2 and TRPO
@@ -20,8 +53,6 @@ Master version 2.0.0.rc0 (TO BE RELEASED SOON)
- fixed PPO1 and TRPO done values for recurrent policies
- fixed image normalization not occurring when using images
- updated VecEnv objects for the new Gym version
- changed default policies for DDPG, so that DDPG now works correctly
- fixed DDPG behavior
- added test for DDPG
- refactored DQN policies
- added registry for policies, can be passed as string to the agent
@@ -33,8 +64,6 @@ Master version 2.0.0.rc0 (TO BE RELEASED SOON)
- added assert in PPO2 for recurrent policies
- fixed predict function to handle both vectorized and unwrapped environment
- added input check to the predict function
- changed DeepQ to DQN **breaking changes**
- changed DeepQPolicy to DQNPolicy **breaking changes**
- refactored ActorCritic models to reduce code duplication
- refactored Off Policy models (to begin HER and replay_buffer refactoring)
- added tests for auto vectorization detection
2 changes: 1 addition & 1 deletion docs/modules/a2c.rst
@@ -62,7 +62,7 @@ Train a A2C agent on `CartPole-v1` using 4 processes.
del model # remove to demonstrate saving and loading
A2C.load("a2c_cartpole")
model = A2C.load("a2c_cartpole")
obs = env.reset()
while True:
5 changes: 2 additions & 3 deletions docs/modules/acer.rst
@@ -42,8 +42,7 @@ Example
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy, \
CnnPolicy, CnnLstmPolicy, CnnLnLstmPolicy
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import ACER
@@ -57,7 +56,7 @@ Example
del model # remove to demonstrate saving and loading
ACER.load("acer_cartpole")
model = ACER.load("acer_cartpole")
obs = env.reset()
while True:
5 changes: 2 additions & 3 deletions docs/modules/acktr.rst
@@ -43,8 +43,7 @@ Example
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy, \
CnnPolicy, CnnLstmPolicy, CnnLnLstmPolicy
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import ACKTR
@@ -58,7 +57,7 @@ Example
del model # remove to demonstrate saving and loading
ACKTR.load("acktr_cartpole")
model = ACKTR.load("acktr_cartpole")
obs = env.reset()
while True:
19 changes: 15 additions & 4 deletions docs/modules/ddpg.rst
@@ -13,6 +13,17 @@ DDPG
The DDPG model does not support ``stable_baselines.common.policies`` because it uses q-value instead
of value estimation, as a result it must use its own policy models (see :ref:`ddpg_policies`).


.. rubric:: Available Policies

.. autosummary::
:nosignatures:

MlpPolicy
LnMlpPolicy
CnnPolicy
LnCnnPolicy

Notes
-----

@@ -47,7 +58,7 @@ Example
import gym
import numpy as np
from stable_baselines.ddpg.policies import MlpPolicy, CnnPolicy
from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise, AdaptiveParamNoiseSpec
from stable_baselines import DDPG
@@ -58,15 +69,15 @@
# the noise objects for DDPG
n_actions = env.action_space.shape[-1]
param_noise = None
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=float(0.2) * np.ones(n_actions))
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=float(0.5) * np.ones(n_actions))
model = DDPG(MlpPolicy, env, verbose=1, param_noise=param_noise, action_noise=action_noise)
model.learn(total_timesteps=25000)
model.learn(total_timesteps=400000)
model.save("ddpg_mountain")
del model # remove to demonstrate saving and loading
DDPG.load("ddpg_mountain")
model = DDPG.load("ddpg_mountain")
obs = env.reset()
while True:
14 changes: 12 additions & 2 deletions docs/modules/dqn.rst
@@ -14,6 +14,16 @@ and its extensions (Double-DQN, Dueling-DQN, Prioritized Experience Replay).
The DQN model does not support ``stable_baselines.common.policies``,
as a result it must use its own policy models (see :ref:`deepq_policies`).

.. rubric:: Available Policies

.. autosummary::
:nosignatures:

MlpPolicy
LnMlpPolicy
CnnPolicy
LnCnnPolicy

Notes
-----

@@ -46,7 +56,7 @@ Example
import gym
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.deepq.policies import MlpPolicy, CnnPolicy
from stable_baselines.deepq.policies import MlpPolicy
from stable_baselines import DQN
env = gym.make('CartPole-v1')
@@ -58,7 +68,7 @@
del model # remove to demonstrate saving and loading
DeepQ.load("deepq_cartpole")
model = DQN.load("deepq_cartpole")
obs = env.reset()
while True:
24 changes: 24 additions & 0 deletions docs/modules/policies.rst
@@ -5,6 +5,30 @@
Policy Networks
===============

Stable-baselines provides a set of default policies that can be used with most action spaces.
If you need more control over the policy architecture, you can also create a custom policy (see :ref:`custom_policy`).

.. note::

CnnPolicies are for images only. MlpPolicies are made for other types of features (e.g. robot joints).

.. warning::
For all algorithms (except DDPG), continuous actions are only clipped during training
(to avoid out-of-bound errors). However, you have to clip the action manually when using
the `predict()` method.
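
A minimal sketch of that manual clipping (assuming ``model``, ``env`` and ``obs`` come from a setup like the examples on the algorithm pages, with a continuous action space):

.. code-block:: python

    import numpy as np

    # predict() returns the raw, unclipped action for continuous action spaces
    action, _states = model.predict(obs)
    # clip it to the bounds of the action space before stepping the environment
    action = np.clip(action, env.action_space.low, env.action_space.high)
    obs, reward, done, info = env.step(action)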

.. rubric:: Available Policies

.. autosummary::
:nosignatures:

MlpPolicy
MlpLstmPolicy
MlpLnLstmPolicy
CnnPolicy
CnnLstmPolicy
CnnLnLstmPolicy


Base Classes
------------
8 changes: 4 additions & 4 deletions docs/modules/ppo1.rst
@@ -20,7 +20,8 @@ For that, ppo uses clipping to avoid too large update.
Notes
-----

- Original paper: https://arxiv.org/abs/1502.05477
- Original paper: https://arxiv.org/abs/1707.06347
- Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- ``mpirun -np 8 python -m stable_baselines.ppo1.run_atari`` runs the algorithm for 40M frames = 10M timesteps on an Atari game. See help (``-h``) for more options.
- ``python -m stable_baselines.ppo1.run_mujoco`` runs the algorithm for 1M frames on a Mujoco environment.
@@ -52,8 +53,7 @@ Example
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy, \
CnnPolicy, CnnLstmPolicy, CnnLnLstmPolicy
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO1
@@ -66,7 +66,7 @@
del model # remove to demonstrate saving and loading
PPO1.load("ppo1_cartpole")
model = PPO1.load("ppo1_cartpole")
obs = env.reset()
while True:
11 changes: 6 additions & 5 deletions docs/modules/ppo2.rst
@@ -25,12 +25,13 @@ For that, ppo uses clipping to avoid too large update.
Notes
-----

- Original paper: https://arxiv.org/abs/1707.06347
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- ``python -m stable_baselines.ppo2.run_atari`` runs the algorithm for 40M
- Original paper: https://arxiv.org/abs/1707.06347
- Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- ``python -m stable_baselines.ppo2.run_atari`` runs the algorithm for 40M
frames = 10M timesteps on an Atari game. See help (``-h``) for more
options.
- ``python -m stable_baselines.ppo2.run_mujoco`` runs the algorithm for 1M
- ``python -m stable_baselines.ppo2.run_mujoco`` runs the algorithm for 1M
frames on a Mujoco environment.

Can I use?
@@ -73,7 +74,7 @@ Train a PPO agent on `CartPole-v1` using 4 processes.
del model # remove to demonstrate saving and loading
PPO2.load("ppo2_cartpole")
model = PPO2.load("ppo2_cartpole")
# Enjoy trained agent
obs = env.reset()
5 changes: 2 additions & 3 deletions docs/modules/trpo.rst
@@ -43,8 +43,7 @@ Example
import gym
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy, \
CnnPolicy, CnnLstmPolicy, CnnLnLstmPolicy
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import TRPO
@@ -57,7 +56,7 @@ Example
del model # remove to demonstrate saving and loading
TRPO.load("trpo_cartpole")
model = TRPO.load("trpo_cartpole")
obs = env.reset()
while True:
20 changes: 19 additions & 1 deletion stable_baselines/__init__.py
@@ -1,3 +1,6 @@
import gym
import numpy as np

from stable_baselines.a2c import A2C
from stable_baselines.acer import ACER
from stable_baselines.acktr import ACKTR
@@ -9,4 +12,19 @@
from stable_baselines.ppo2 import PPO2
from stable_baselines.trpo_mpi import TRPO

__version__ = "2.0.0.rc0"
__version__ = "2.1.1.a0"


# patch Gym spaces to add equality functions, if not implemented
# See https://github.com/openai/gym/issues/1171
if gym.spaces.MultiBinary.__eq__ == object.__eq__: # by default, all classes have the __eq__ function from object.
def _eq(self, other):
return self.n == other.n

gym.spaces.MultiBinary.__eq__ = _eq

if gym.spaces.MultiDiscrete.__eq__ == object.__eq__:
def _eq(self, other):
return np.all(self.nvec == other.nvec)

gym.spaces.MultiDiscrete.__eq__ = _eq
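
As a quick, hypothetical illustration of what this patch enables (not part of the commit itself): two structurally identical spaces now compare by content rather than by identity.

```python
import gym
import stable_baselines  # importing applies the __eq__ patch above

space_a = gym.spaces.MultiDiscrete([3, 3])
space_b = gym.spaces.MultiDiscrete([3, 3])

# Without the patch, object.__eq__ would compare identity and return False here
assert space_a == space_b
```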
