Fix TRPO adv standardization when std is 0. (#723)
* Fix TRPO adv standardization when std is 0.

* Updated changelog.

* Use epsilon for std division in TRPO MPI advantage standardization.

Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>
richardwu and araffin committed Mar 7, 2020
1 parent ac46c37 commit 63b1885
Showing 2 changed files with 2 additions and 1 deletion.
docs/misc/changelog.rst — 1 change: 1 addition & 0 deletions
@@ -69,6 +69,7 @@ Bug Fixes:
 - Fixed a bug in ``BaseRLModel`` when seeding vectorized environments. (@NeoExtended)
 - Fixed ``num_timesteps`` computation to be consistent between algorithms (updated after ``env.step()``)
   Only ``TRPO`` and ``PPO1`` update it differently (after synchronization) because they rely on MPI
+- Fixed bug in ``TRPO`` with NaN standardized advantages (@richardwu)
 - Fixed partial minibatch computation in ExpertDataset (@richardwu)

 Deprecations:
stable_baselines/trpo_mpi/trpo_mpi.py — 2 changes: 1 addition & 1 deletion
@@ -340,7 +340,7 @@ def fisher_vector_product(vec):


                 vpredbefore = seg["vpred"]  # predicted value function before update
-                atarg = (atarg - atarg.mean()) / atarg.std()  # standardized advantage function estimate
+                atarg = (atarg - atarg.mean()) / (atarg.std() + 1e-8)  # standardized advantage function estimate

                 # true_rew is the reward without discount
                 if writer is not None:
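For context, here is a minimal sketch (not part of the commit; the array values are hypothetical) of why the epsilon matters: when every advantage estimate in a rollout is identical, atarg.std() is exactly 0, so the old expression divides 0 by 0 and fills the batch with NaNs that then poison the policy update.

    import numpy as np

    # Hypothetical repro: a rollout where all advantage estimates are equal,
    # e.g. constant rewards, so the standard deviation is zero.
    atarg = np.full(4, 0.5, dtype=np.float32)

    # Old behaviour: (0.5 - 0.5) / 0.0 is a 0/0 division, which yields NaN.
    old = (atarg - atarg.mean()) / atarg.std()
    print(old)  # [nan nan nan nan]

    # Fixed behaviour: the 1e-8 epsilon keeps the denominator positive,
    # so a constant advantage simply standardizes to zeros.
    new = (atarg - atarg.mean()) / (atarg.std() + 1e-8)
    print(new)  # [0. 0. 0. 0.]

Adding a small constant to the denominator is the standard guard when normalizing arrays that may be (near-)constant; it leaves typical batches essentially unchanged while making the degenerate case a harmless all-zeros result.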
