Fix TRPO adv standardization when std is 0. (#723)
* Fix TRPO adv standardization when std is 0.

* Updated changelog.

* Use epsilon for std division in TRPO MPI advantage standardization.

Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org>
richardwu and araffin committed Mar 7, 2020
1 parent ac46c37 commit 63b1885
Showing 2 changed files with 2 additions and 1 deletion.
docs/misc/changelog.rst — 1 change: 1 addition & 0 deletions
@@ -69,6 +69,7 @@ Bug Fixes:
 - Fixed a bug in ``BaseRLModel`` when seeding vectorized environments. (@NeoExtended)
 - Fixed ``num_timesteps`` computation to be consistent between algorithms (updated after ``env.step()``)
   Only ``TRPO`` and ``PPO1`` update it differently (after synchronization) because they rely on MPI
+- Fixed bug in ``TRPO`` with NaN standardized advantages (@richardwu)
 - Fixed partial minibatch computation in ExpertDataset (@richardwu)

 Deprecations:
stable_baselines/trpo_mpi/trpo_mpi.py — 2 changes: 1 addition & 1 deletion
@@ -340,7 +340,7 @@ def fisher_vector_product(vec):


                 vpredbefore = seg["vpred"]  # predicted value function before update
-                atarg = (atarg - atarg.mean()) / atarg.std()  # standardized advantage function estimate
+                atarg = (atarg - atarg.mean()) / (atarg.std() + 1e-8)  # standardized advantage function estimate

                 # true_rew is the reward without discount
                 if writer is not None:
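For context, here is a minimal sketch (not part of the commit; the array values are hypothetical) of why the epsilon matters: when every advantage estimate in a rollout is identical, atarg.std() is exactly 0, so the old expression divides 0 by 0 and fills the batch with NaNs that then poison the policy update.

    import numpy as np

    # Hypothetical repro: a rollout where all advantage estimates are equal,
    # e.g. constant rewards, so the standard deviation is zero.
    atarg = np.full(4, 0.5, dtype=np.float32)

    # Old behaviour: (0.5 - 0.5) / 0.0 is a 0/0 division, which yields NaN.
    old = (atarg - atarg.mean()) / atarg.std()
    print(old)  # [nan nan nan nan]

    # Fixed behaviour: the 1e-8 epsilon keeps the denominator positive,
    # so a constant advantage simply standardizes to zeros.
    new = (atarg - atarg.mean()) / (atarg.std() + 1e-8)
    print(new)  # [0. 0. 0. 0.]

Adding a small constant to the denominator is the standard guard when normalizing arrays that may be (near-)constant; it leaves typical batches essentially unchanged while making the degenerate case a harmless all-zeros result.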
