[Question] What is the real intention behind reward scaling with the running variance of discounted returns? #1165
❓ Question
It confuses me a lot that the statistics of the discounted return are used to rescale a different quantity, the raw reward. This seems to be the default choice for PPO. Is there any intuition for interpreting this choice? Why not just use the running variance of the individual rewards instead? As far as I can tell, its effect is a lot like learning rate annealing, or is there something I'm missing?
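
For reference, here is a minimal sketch of the scheme in question, loosely following what implementations such as OpenAI Baselines' `VecNormalize` do. The class and parameter names (`RewardScaler`, `RunningMeanStd`, `gamma`, `eps`) are illustrative, not any particular library's API:

```python
import numpy as np

class RunningMeanStd:
    """Tracks a running mean/variance with Welford-style batch updates."""
    def __init__(self):
        self.mean = 0.0
        self.var = 1.0
        self.count = 1e-4  # small prior to avoid division by zero early on

    def update(self, x):
        batch_mean = np.mean(x)
        batch_var = np.var(x)
        batch_count = len(x)
        delta = batch_mean - self.mean
        tot = self.count + batch_count
        self.mean += delta * batch_count / tot
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta**2 * self.count * batch_count / tot) / tot
        self.count = tot

class RewardScaler:
    """Scales each reward by the running std of the *discounted return*,
    not of the raw reward itself -- the choice this issue asks about."""
    def __init__(self, gamma=0.99, eps=1e-8):
        self.gamma = gamma
        self.eps = eps
        self.ret = 0.0  # running discounted return
        self.ret_rms = RunningMeanStd()

    def scale(self, reward):
        # Accumulate the discounted return, update its running variance,
        # then divide the raw reward by the return's running std.
        self.ret = self.gamma * self.ret + reward
        self.ret_rms.update([self.ret])
        return reward / np.sqrt(self.ret_rms.var + self.eps)

    def reset(self):
        self.ret = 0.0  # reset at episode boundaries
```

Note that in this sketch only the standard deviation of the running return is used; the mean is not subtracted, so the sign of each reward is preserved.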