
[Question] What is the real intention for reward scaling with running variance of discounted rewards? #1165

Closed
MagiFeeney opened this issue Nov 10, 2022 · 2 comments
Labels: question (further information is requested)

Comments


MagiFeeney commented Nov 10, 2022

❓ Question

It confuses me that the running statistics of the discounted return are used to rescale a different quantity, the reward. This seems to be the default choice for PPO. Is there any intuition behind this choice? Why not just use the running variance of the rewards themselves? The best interpretation I can come up with is that its effect is much like learning rate annealing; is there something I'm missing?
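For reference, here is a minimal sketch of the scheme in question (a simplified paraphrase of what VecNormalize does, not the library's exact code; the class names and constants below are illustrative assumptions):

```python
import numpy as np

class RunningMeanStd:
    """Running mean/variance via the parallel (Chan et al.) update rule."""
    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def update(self, x):
        batch_mean, batch_var, batch_count = np.mean(x), np.var(x), x.size
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean += delta * batch_count / total
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.var = m2 / total
        self.count = total

class ReturnScaler:
    """Scales rewards by the std of a running estimate of the discounted return."""
    def __init__(self, gamma=0.99, epsilon=1e-8, clip=10.0):
        self.ret_rms = RunningMeanStd()
        self.ret = 0.0  # running discounted return, reset when an episode ends
        self.gamma, self.epsilon, self.clip = gamma, epsilon, clip

    def scale(self, reward):
        self.ret = self.ret * self.gamma + reward   # update the return estimate
        self.ret_rms.update(np.array([self.ret]))   # track its running variance
        # The reward is divided by the return's std (never mean-centered),
        # so its sign is preserved.
        return float(np.clip(reward / np.sqrt(self.ret_rms.var + self.epsilon),
                             -self.clip, self.clip))
```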

Checklist

  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • If code there is, it is minimal and working
  • If code there is, it is formatted using the markdown code blocks for both code and stack traces.
araffin (Member) commented Nov 10, 2022

Hello,

> I have checked that there is no similar issue in the repo

Probably a duplicate of #348 (comment).

> Why not just use the running variance of the rewards themselves?

You can also give that a try; I would be happy to see a comparison (though I don't think it will make a big difference, the main point is to scale the reward and the return to make learning the value function easier).

MagiFeeney (Author) commented

Those would be helpful, I will check them out! I am not confident enough to conclude that using the single reward would be worse, but I did override the normalize_reward function to use a separate RunningMeanStd (rew_rms) on the raw rewards; the results showed that it doesn't perform well, so I quickly dropped this choice.
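For clarity, the variant described above might look roughly like this (a hypothetical reconstruction, since the actual override was not posted in the issue; it reuses the RunningMeanStd helper from the earlier sketch):

```python
import numpy as np

class RawRewardScaler:
    """Hypothetical variant: scale by the running variance of the raw rewards
    themselves, instead of the discounted returns."""
    def __init__(self, epsilon=1e-8, clip=10.0):
        self.rew_rms = RunningMeanStd()  # RunningMeanStd as in the earlier sketch
        self.epsilon, self.clip = epsilon, clip

    def scale(self, reward):
        self.rew_rms.update(np.array([reward]))  # statistics of the reward itself
        return float(np.clip(reward / np.sqrt(self.rew_rms.var + self.epsilon),
                             -self.clip, self.clip))
```

The two schemes differ only in which quantity's variance is tracked. For rewards that are positively correlated over time, the variance of the discounted return is typically much larger than that of the per-step reward, so return-based scaling shrinks the rewards more aggressively, which may partly explain the performance gap observed above.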
