Refactor common (#540)
* Move tensorflow layer definitions to a new file

* Move Scheduler from A2C utils to common schedules file

* Add missing definitions for legacy Scheduler

* Move tensorflow-related utilities to tf_util

* Move total_episode_reward_logger to tf_util

* Move get_by_index to ACER code (only used by ACER)

* Move EpisodeStats to ACER code (only used by ACER)

* Finish refactoring a2c/utils.py

* Refactor ppo2.py (move shared code elsewhere)

* Refactor replay buffer (moved out from deepq to common)

* Remove shared function from SAC and TD3 (get_vars)

* Remove unused code

* Move flatten_lists to common file

* Fix imports in tests

* Add missing import to ACER

* Fix ACER dtype error

* Rename replay_buffer -> buffers

* Remove unused import

* Fix import in a test

* Move orphan method to more social circles

* Move PPO1/TRPO seg_gen to commons

* Update changelog

* Move SAC/TD3 policy code under more suitable tf_layers

* Update to new traj_seg_gen

* Add list of what was moved where

Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>
Miffyli and araffin committed Feb 28, 2020
1 parent e601c36 commit a4efff0
Showing 30 changed files with 905 additions and 893 deletions.
22 changes: 22 additions & 0 deletions docs/misc/changelog.rst
@@ -17,6 +17,27 @@ Breaking Changes:
when ``return_episode_rewards`` is set to ``True`` (instead of ``n_steps``)
- Callbacks are now called after each ``env.step()`` for consistency (they were called every ``n_steps`` before
  in algorithms like ``A2C`` or ``PPO2``)
- Removed unused code in ``a2c/utils.py`` (``calc_entropy_softmax``, ``make_path``)
- **Refactoring, including removed files and moving functions.**

- Algorithms no longer import from each other, and ``common`` does not import from algorithms.
- ``a2c/utils.py`` removed and split into other files:

- common/tf_util.py: ``sample``, ``calc_entropy``, ``mse``, ``avg_norm``, ``total_episode_reward_logger``,
  ``q_explained_variance``, ``gradient_add``, ``check_shape``,
  ``seq_to_batch``, ``batch_to_seq``.
- common/tf_layers.py: ``conv``, ``linear``, ``lstm``, ``_ln``, ``lnlstm``, ``conv_to_fc``, ``ortho_init``.
- a2c/a2c.py: ``discount_with_dones``.
- acer/acer_simple.py: ``get_by_index``, ``EpisodeStats``.
- common/schedules.py: ``constant``, ``linear_schedule``, ``middle_drop``, ``double_linear_con``, ``double_middle_drop``,
``SCHEDULES``, ``Scheduler``.

- ``trpo_mpi/utils.py`` functions moved (``traj_segment_generator`` moved to ``common/runners.py``, ``flatten_lists`` to ``common/misc_util.py``).
- ``ppo2/ppo2.py`` functions moved (``safe_mean`` to ``common/math_util.py``, ``constfn`` and ``get_schedule_fn`` to ``common/schedules.py``).
- ``sac/policies.py`` function ``mlp`` moved to ``common/tf_layers.py``.
- ``sac/sac.py`` function ``get_vars`` removed (replaced with ``tf_util.get_trainable_vars``).
- ``deepq/replay_buffer.py`` renamed to ``common/buffers.py``.


New Features:
^^^^^^^^^^^^^
@@ -58,6 +79,7 @@ Others:
- Cleanup and refactoring in ``common/identity_env.py`` (@shwang)
- Added a Makefile to simplify common development tasks (build the doc, type check, run the tests)


Documentation:
^^^^^^^^^^^^^^
- Add dedicated page for callbacks
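As a quick orientation for downstream users, here is a minimal before/after import sketch based on the "what was moved where" list above. The old paths come from the removed imports in this commit; ``ReplayBuffer`` keeping its class name after the ``buffers`` rename is an assumption for illustration.

# Old locations (pre-refactor), as removed by this commit:
# from stable_baselines.a2c.utils import Scheduler, mse, total_episode_reward_logger
# from stable_baselines.ppo2.ppo2 import safe_mean
# from stable_baselines.deepq.replay_buffer import ReplayBuffer

# New locations (post-refactor), following the changelog list:
from stable_baselines.common.schedules import Scheduler
from stable_baselines.common.tf_util import mse, total_episode_reward_logger
from stable_baselines.common.math_util import safe_mean
from stable_baselines.common.buffers import ReplayBuffer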
22 changes: 20 additions & 2 deletions stable_baselines/a2c/a2c.py
@@ -8,8 +8,26 @@
from stable_baselines.common import explained_variance, tf_util, ActorCriticRLModel, SetVerbosity, TensorboardWriter
from stable_baselines.common.policies import ActorCriticPolicy, RecurrentActorCriticPolicy
from stable_baselines.common.runners import AbstractEnvRunner
from stable_baselines.a2c.utils import discount_with_dones, Scheduler, mse, total_episode_reward_logger
from stable_baselines.ppo2.ppo2 import safe_mean
from stable_baselines.common.schedules import Scheduler
from stable_baselines.common.tf_util import mse, total_episode_reward_logger
from stable_baselines.common.math_util import safe_mean


def discount_with_dones(rewards, dones, gamma):
    """
    Apply the discount value to the reward, where the environment is not done

    :param rewards: ([float]) The rewards
    :param dones: ([bool]) Whether an environment is done or not
    :param gamma: (float) The discount value
    :return: ([float]) The discounted rewards
    """
    discounted = []
    ret = 0  # Return: discounted reward
    for reward, done in zip(rewards[::-1], dones[::-1]):
        ret = reward + gamma * ret * (1. - done)  # fixed off by one bug
        discounted.append(ret)
    return discounted[::-1]


class A2C(ActorCriticRLModel):
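For intuition, a tiny worked example of the ``discount_with_dones`` helper inlined above (the numbers are illustrative only, not taken from the repository's tests):

rewards = [1.0, 1.0, 1.0]
dones = [False, False, True]
# Iterating backwards: the last step ends the episode, so its return is just
# its own reward; earlier steps accumulate the gamma-discounted return.
print(discount_with_dones(rewards, dones, gamma=0.99))
# [2.9701, 1.99, 1.0]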
