Refactor common (#540)
* Move tensorflow layer definitions to a new file

* Move Scheduler from A2C utils to common schedules file

* Add missing definitions for legacy Scheduler

* Move tensorflow-related utilities to tf_util

* Move total_episode_reward_logger to tf_util

* Move get_by_index to ACER code (only used by ACER)

* Move EpisodeStats to ACER code (only used by ACER)

* Finish refactoring a2c/utils.py

* Refactor ppo2.py (move shared code elsewhere)

* Refactor replay buffer (moved out from deepq to common)

* Remove shared function from SAC and TD3 (get_vars)

* Remove unused code

* Move flatten_lists to common file

* Fix imports in tests

* Add missing import to ACER

* Fix ACER dtype error

* Rename replay_buffer -> buffers

* Remove unused import

* Fix import in a test

* Move orphan method to more social circles

* Move PPO1/TRPO seg_gen to commons

* Update changelog

* Move SAC/TD3 policy code under more suitable tf_layers

* Update to new traj_seg_gen

* Add list of what was moved where

Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>
Miffyli and araffin committed Feb 28, 2020
1 parent e601c36 commit a4efff0
Showing 30 changed files with 905 additions and 893 deletions.
22 changes: 22 additions & 0 deletions docs/misc/changelog.rst
@@ -17,6 +17,27 @@ Breaking Changes:
when ``return_episode_rewards`` is set to ``True`` (instead of ``n_steps``)
- Callbacks are now called after each ``env.step()`` for consistency (they were called every ``n_steps`` before
  in algorithms like ``A2C`` or ``PPO2``)
- Removed unused code in ``a2c/utils.py`` (``calc_entropy_softmax``, ``make_path``)
- **Refactoring, including removed files and moving functions.**

- Algorithms no longer import from each other, and ``common`` does not import from algorithms.
- ``a2c/utils.py`` removed and split into other files:

- common/tf_util.py: ``sample``, ``calc_entropy``, ``mse``, ``avg_norm``, ``total_episode_reward_logger``,
  ``q_explained_variance``, ``gradient_add``, ``check_shape``,
  ``seq_to_batch``, ``batch_to_seq``.
- common/tf_layers.py: ``conv``, ``linear``, ``lstm``, ``_ln``, ``lnlstm``, ``conv_to_fc``, ``ortho_init``.
- a2c/a2c.py: ``discount_with_dones``.
- acer/acer_simple.py: ``get_by_index``, ``EpisodeStats``.
- common/schedules.py: ``constant``, ``linear_schedule``, ``middle_drop``, ``double_linear_con``, ``double_middle_drop``,
``SCHEDULES``, ``Scheduler``.

- ``trpo_mpi/utils.py`` functions moved (``traj_segment_generator`` moved to ``common/runners.py``, ``flatten_lists`` to ``common/misc_util.py``).
- ``ppo2/ppo2.py`` functions moved (``safe_mean`` to ``common/math_util.py``, ``constfn`` and ``get_schedule_fn`` to ``common/schedules.py``).
- ``sac/policies.py`` function ``mlp`` moved to ``common/tf_layers.py``.
- ``sac/sac.py`` function ``get_vars`` removed (replaced with ``tf_util.get_trainable_vars``).
- ``deepq/replay_buffer.py`` renamed to ``common/buffers.py``.


New Features:
^^^^^^^^^^^^^
@@ -58,6 +79,7 @@ Others:
- Cleanup and refactoring in ``common/identity_env.py`` (@shwang)
- Added a Makefile to simplify common development tasks (build the doc, type check, run the tests)


Documentation:
^^^^^^^^^^^^^^
- Add dedicated page for callbacks
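As a quick orientation for downstream users, here is a minimal before/after import sketch based on the "what was moved where" list above. The old paths come from the removed imports in this commit; ``ReplayBuffer`` keeping its class name after the ``buffers`` rename is an assumption for illustration.

# Old locations (pre-refactor), as removed by this commit:
# from stable_baselines.a2c.utils import Scheduler, mse, total_episode_reward_logger
# from stable_baselines.ppo2.ppo2 import safe_mean
# from stable_baselines.deepq.replay_buffer import ReplayBuffer

# New locations (post-refactor), following the changelog list:
from stable_baselines.common.schedules import Scheduler
from stable_baselines.common.tf_util import mse, total_episode_reward_logger
from stable_baselines.common.math_util import safe_mean
from stable_baselines.common.buffers import ReplayBuffer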
22 changes: 20 additions & 2 deletions stable_baselines/a2c/a2c.py
@@ -8,8 +8,26 @@
from stable_baselines.common import explained_variance, tf_util, ActorCriticRLModel, SetVerbosity, TensorboardWriter
from stable_baselines.common.policies import ActorCriticPolicy, RecurrentActorCriticPolicy
from stable_baselines.common.runners import AbstractEnvRunner
from stable_baselines.a2c.utils import discount_with_dones, Scheduler, mse, total_episode_reward_logger
from stable_baselines.ppo2.ppo2 import safe_mean
from stable_baselines.common.schedules import Scheduler
from stable_baselines.common.tf_util import mse, total_episode_reward_logger
from stable_baselines.common.math_util import safe_mean


def discount_with_dones(rewards, dones, gamma):
    """
    Apply the discount value to the reward, where the environment is not done

    :param rewards: ([float]) The rewards
    :param dones: ([bool]) Whether an environment is done or not
    :param gamma: (float) The discount value
    :return: ([float]) The discounted rewards
    """
    discounted = []
    ret = 0  # Return: discounted reward
    for reward, done in zip(rewards[::-1], dones[::-1]):
        ret = reward + gamma * ret * (1. - done)  # fixed off by one bug
        discounted.append(ret)
    return discounted[::-1]


class A2C(ActorCriticRLModel):
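For intuition, a tiny worked example of the ``discount_with_dones`` helper inlined above (the numbers are illustrative only, not taken from the repository's tests):

rewards = [1.0, 1.0, 1.0]
dones = [False, False, True]
# Iterating backwards: the last step ends the episode, so its return is just
# its own reward; earlier steps accumulate the gamma-discounted return.
print(discount_with_dones(rewards, dones, gamma=0.99))
# [2.9701, 1.99, 1.0]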
