
Some problems when running the example with OvercookedMultiEnv-v0 PPO #5

Closed
lixiyun98 opened this issue Jul 11, 2022 · 2 comments

Comments

@lixiyun98

Hello, I ran this framework with the PPO training command python3 trainer.py OvercookedMultiEnv-v0 PPO PPO --env-config '{"layout_name": "simple"}' --seed 10 --preset 1. However, in the training log I found that the value function loss increases as the reward increases, which confuses me; ep_rew_mean can reach 300 when total-timesteps is 500000. How can I solve this? It looks like a bug.

@ShaoZhang0115

We also encountered this problem. The loss, entropy loss, value loss, and policy gradient loss all increase over the course of training as the reward increases.
We have tried all the layouts, and the loss only decreases when the reward stays at 0, which is the behavior one would normally expect.
We checked the call into SB3 and the loss computation but found no obvious errors, and no errors or warnings were raised during training.
Could something be wrong with SB3 itself, or with how the log is printed?

@bsarkar321
Collaborator

I believe this behavior is expected and isn't an issue. SB3 (and every other PPO implementation I'm aware of) uses a mean-squared-error loss between the value function's predicted returns and the "true" returns computed from the rollout. When the policy is stochastic and its expected reward is increasing, the returns naturally have higher variance, so the MSE grows even though the value function is becoming more accurate.
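
Here's a minimal numerical sketch (plain NumPy, not SB3 code; the return statistics are made up purely for illustration) of why the MSE value loss can grow even when the value prediction exactly matches the expected return: for a perfect predictor, the MSE is simply the variance of the sampled returns.

```python
# Minimal sketch, not taken from SB3: the return means/standard deviations
# below are hypothetical, chosen only to illustrate the effect.
import numpy as np

rng = np.random.default_rng(0)

def mse_value_loss(mean_return, return_std, n=4096):
    # Sampled rollout returns under a stochastic policy.
    returns = rng.normal(mean_return, return_std, size=n)
    # A "perfect" value prediction: exactly the expected return.
    predicted = np.full(n, mean_return)
    # SB3-style value loss: mean squared error against rollout returns.
    return np.mean((returns - predicted) ** 2)

# Early in training: low reward, returns tightly clustered.
print(mse_value_loss(mean_return=10.0, return_std=5.0))    # ~25
# Later: higher reward, but a stochastic policy spreads the returns out,
# so the value loss is much larger even though the prediction is exact.
print(mse_value_loss(mean_return=300.0, return_std=60.0))  # ~3600
```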

In my experience, the value loss and policy loss reported by PPO (or most other RL algorithms) do not provide a strong signal for "learning." I've noticed this same behavior with single-player environments like CartPole, and also with different implementations of PPO (like CleanRL or Garage).
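
If you want to see the same effect outside this repository, here's a rough sketch using the standard SB3 API on CartPole (not this repo's trainer.py; hyperparameters are arbitrary, and you may need to swap gym for gymnasium depending on your SB3 version). As ep_rew_mean climbs, the reported value_loss typically climbs as well.

```python
# Rough sketch: plain SB3 PPO on CartPole to observe the same loss pattern.
import gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1, seed=10)
# Watch the printed logs: ep_rew_mean rises toward ~500 while the reported
# value_loss also tends to rise, matching the behavior described in this issue.
model.learn(total_timesteps=100_000)
```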
