Reward is much lower when using "--play" #1171
Solution: I also had the same issue, and I dug into the running process to find where the gap comes from. The step-wise reward that comes back from the outer (wrapped) environment during `--play` is not the raw game reward, so summing it gives a much smaller number than what is reported during training. The key is that when the entire episode ends, the true result is returned through the `info` dict: the Monitor wrapper writes an episode record into it. So my solution is, at every step, to directly check for the existence of the episode record in `info` and read the real episode reward from there when it appears. |
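A minimal sketch of that check. The `FakeVecEnv` class here is a made-up stand-in (not part of baselines) that mimics a normalizing wrapper around a Monitor; the only part taken from the real setup is the convention that the Monitor writes the true episode return under `info['episode']['r']` when an episode ends:

```python
# Sketch: read the true (unnormalized) episode return from the Monitor's
# info record instead of summing the normalized step rewards.
# FakeVecEnv is illustrative only; in baselines, info['episode'] is
# written by the Monitor wrapper when an episode finishes.

class FakeVecEnv:
    """Stand-in for a normalize-wrapped env: step rewards come back
    scaled down, but the Monitor-style record keeps the true return."""
    def __init__(self, true_rewards):
        self.true_rewards = list(true_rewards)
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, actions):
        r_true = self.true_rewards[self.t]
        self.t += 1
        done = self.t == len(self.true_rewards)
        info = {}
        if done:
            # Monitor records the sum of *raw* rewards for the episode.
            info['episode'] = {'r': sum(self.true_rewards), 'l': self.t}
        # The wrapper hands back a normalized (much smaller) step reward.
        return [0.0], [r_true * 0.01], [done], [info]

def play_episode(env):
    """Return (sum of step rewards, true episode return from info)."""
    env.reset()
    summed, true_return, done = 0.0, None, False
    while not done:
        obs, rewards, dones, infos = env.step([0])
        summed += rewards[0]
        done = dones[0]
        # The fix: check whether the episode record exists in info.
        ep = infos[0].get('episode')
        if ep is not None:
            true_return = ep['r']
    return summed, true_return

summed, true = play_episode(FakeVecEnv([10.0, 20.0, 30.0]))
print(summed)  # ~0.6: the "reward" a --play-style loop would report
print(true)    # 60.0: the actual episode return from the Monitor record
```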
Thanks @198808xc, I've tested your answer on some Mujoco environments and it behaves the same way there. During training, the reported reward mean is calculated from the episode records that the Monitor wrapper puts into `info['episode']`. Mujoco environments are additionally wrapped by a `VecNormalize`, which applies a running normalization plus clipping to the rewards, and this normalized reward is what comes out of the outer `env.step` during `--play`. |
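A rough sketch of what such a wrapper does to rewards (scale by a running estimate of the return's standard deviation, then clip), which is why the summed step rewards under `--play` come out much smaller than the Monitor's raw episode return. The class name and constants here are illustrative, not baselines' exact `VecNormalize` implementation:

```python
import math

class RunningRewardNormalizer:
    """Illustrative: divide each reward by a running std of the
    discounted return and clip the result, roughly what a
    VecNormalize-style wrapper does (not the exact baselines code)."""
    def __init__(self, gamma=0.99, clip=10.0, eps=1e-8):
        self.gamma, self.clip, self.eps = gamma, clip, eps
        self.ret = 0.0    # running discounted return
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0     # sum of squared deviations (Welford's method)

    def _update(self, x):
        # Welford's online update of mean and variance accumulator.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, reward):
        self.ret = self.ret * self.gamma + reward
        self._update(self.ret)
        var = self.m2 / max(self.count, 1)
        r = reward / math.sqrt(var + self.eps)
        return max(-self.clip, min(self.clip, r))

norm = RunningRewardNormalizer()
raw = [100.0] * 50                          # raw per-step rewards
scaled = [norm.normalize(r) for r in raw]   # what the wrapper emits
print(sum(raw))     # 5000.0 — the raw episode return Monitor records
print(sum(scaled))  # much smaller — what a --play loop would sum up
```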
I'm training models on Mujoco environments with the PPO2 algorithm on the tf2 branch of the project. During training, the reward slowly increases as expected. What is not expected is that, when training has finished (or when loading a previously trained model), the model does not seem to perform well: the reward is much lower than the one reported during the training phase.
As an example, I trained a model for 2e5 timesteps in the HalfCheetah-v2 environment, and during training the reward increased to between about 200 and 300. However, just after training finished and this model was being run, the reported reward was only a little more than 10. In this picture you can see the results:
These results were obtained by running:
python -m baselines.run --alg=ppo2 --env=HalfCheetah-v2 --network=mlp --num_timesteps=2e5 --log_path=logs\cheetah --play
Training different Mujoco environments, or training for more steps, doesn't change this. Or am I doing something wrong?