This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

Problems with PPO/ClippedPPO #87

Closed
KuenstlicheIntelligenz opened this issue Apr 9, 2018 · 10 comments
Comments

@KuenstlicheIntelligenz

KuenstlicheIntelligenz commented Apr 9, 2018

Hey Guys,

I'm having trouble with the likelihood ratio and NaN/inf values. I'm using entropy regularization for exploration, and the entropy gets quite low, so I think the distributions can't be compared anymore at some point.
My model learns up to a certain point and then the NaNs happen. Adding a small epsilon in the ratio avoids the NaNs, but then the reward curve just drops at some point and the model stops learning.
(The KL divergence also diverges.) I'm using my own environment, a feed-forward architecture, and I'm working on a continuous control problem.

I've already tried many things:

  • Optimizers: Adam (with different epsilons), RMSProp
  • Reducing the LR (that just postpones the crash)
  • Reducing the clipping (0.1) and the number of epochs, clipping the gradients
  • Changing the coefficients for the value loss, policy loss, and entropy
  • Changing the weight initializers and the network sizes (a bigger network postpones the problem)
  • Changing the activation function (ReLU, LReLU, SELU, tanh)

If I change the beta coefficient for the entropy, I get either an ever-increasing entropy, or it falls until the crash happens.
The agent learns pretty well until that point, so I suppose I haven't made an error in my implementation.
I may have made an error in how much I changed the parameters.

Any tips or ideas on this?
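
To make it concrete, here is a minimal sketch (PyTorch, not my actual code; the numbers and names are made up) of how the likelihood ratio turns the loss into inf/NaN once the policy std collapses:

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); exp() overflows to inf once the
    # log-prob difference gets large, e.g. after the policy std has collapsed.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Any inf/NaN in the ratio propagates into the loss and then, via
    # backward(), into the gradients and the weights.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Toy numbers: once the policy is near-deterministic, the log-prob gap
# explodes and exp() returns inf.
logp_old = torch.tensor([-1.0, -2.0, -500.0])
logp_new = torch.tensor([-1.1, -1.9, 200.0])
adv = torch.tensor([0.5, -0.3, -1.0])
print(clipped_surrogate(logp_new, logp_old, adv))  # -> inf loss
```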

@KuenstlicheIntelligenz
Author

KuenstlicheIntelligenz commented Apr 11, 2018

If anyone has a similar problem: I solved it with the right parameters and, as a quickfix, by clipping the logstd so it doesn't get too small. Also, there's a known issue with Adam where you have to choose a higher epsilon; I used RMSProp.

Update: That didn't solve it completely, it only postponed the crash.
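
For reference, the quickfix looks roughly like this (a PyTorch sketch, not the code I actually used; the bounds are illustrative):

```python
import torch
import torch.nn as nn

LOG_STD_MIN, LOG_STD_MAX = -5.0, 2.0  # illustrative bounds

class GaussianHead(nn.Module):
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, h):
        # Clamp log_std so the std can neither collapse towards 0 (near-Dirac
        # distribution, huge log-prob ratios) nor explode.
        log_std = torch.clamp(self.log_std, LOG_STD_MIN, LOG_STD_MAX)
        return torch.distributions.Normal(self.mu(h), log_std.exp())
```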

@ujjawalchugh97

I'm facing the same issue. No matter what hyperparameters I use for the clipped PPO algorithm, the surrogate loss, KL divergence, and entropy become NaN after some time. Any solution to this?

@ujjawalchugh97

@KuenstlicheIntelligenz did you figure out the cause/solution to the problem?

@MeiJuanLiu

@ujjawalchugh97, I'm facing the same issue too, so I'd like to know whether you have solved this problem. A smaller network (with the default layers) has no problem, but a bigger network produces NaN values after training for just a few epochs, and I have no idea how to solve it.

@dhruvramani

dhruvramani commented Jun 27, 2020

You can try this, maybe: https://stats.stackexchange.com/a/66621

EDIT: Didn't work for me. Most of the values became zero instead. If anyone finds a solution, please let me know.

@eflopez1

eflopez1 commented Dec 8, 2020

Hello. I don't use this library and am instead using Stable-Baselines. A similar issue came up for me where NaNs would appear after a few epochs of training on my custom environment and on larger neural networks. An interim solution has been to use smaller networks, but this is not ideal.

Recently I have seen success with large networks by setting the entropy coefficient to 0, as discussed here: https://stable-baselines.readthedocs.io/en/master/.

A quick survey of how this library is set up suggests it has a similar option, so perhaps try that? On a side note, I am actually here because I am trying to learn why setting the entropy coefficient to 0 solves the problem.

Hope this helps.
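
For reference, the entropy coefficient typically enters the combined loss like this (a generic sketch, not this library's or Stable-Baselines' actual code; the names and default values are illustrative):

```python
def ppo_loss(policy_loss, value_loss, entropy, vf_coef=0.5, ent_coef=0.01):
    # The entropy bonus is subtracted to encourage exploration; with
    # ent_coef = 0.0 that term (and its gradient) drops out entirely.
    return policy_loss + vf_coef * value_loss - ent_coef * entropy

print(ppo_loss(0.8, 1.5, 2.3))                 # entropy bonus active
print(ppo_loss(0.8, 1.5, 2.3, ent_coef=0.0))   # entropy term removed
```

In Stable-Baselines this corresponds to the ent_coef argument of PPO2.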

@KuenstlicheIntelligenz
Author

Sorry, guys, for the late answer. I had this problem in my master's thesis and remember solving it with clipping and by adding epsilons to avoid numbers that get too small or too large.

@richardrl

Hi @KuenstlicheIntelligenz, can you elaborate on where exactly you did the clipping? Or reference which part of the code it was added to?

Also, where did you add the epsilon?

I have been running into this exact same issue training a large model with PPO.

@KuenstlicheIntelligenz
Author

@richardrl
Sorry, I cannot. I heavily changed the original repo and didn't document it well due to time pressure. Also, I'm not on the project anymore and currently don't have the time to get back into it.

I think the problem was that the probability distributions collapsed to something close to Dirac distributions and could not be compared anymore. So you could say Q(x) would be near 0 for some x while P(x) had a high probability, which resulted in NaNs in the KL computation. Adding an epsilon to Q(x) would still result in a very high KL divergence, which led to exploding gradients that "destroyed" the model (the learning curve dropping).
I eventually solved the problem by making sure the KL divergence and the gradients did not get too big. I'm fairly sure I clipped the gradients right before applying them to the network weights.
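
Roughly, the two guards look like this (a PyTorch sketch, not my thesis code; the epsilon and the clipping norm are illustrative):

```python
import torch

EPS = 1e-6

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    # KL(P || Q) for diagonal Gaussians; the epsilon keeps the stds away from 0,
    # which is where the log and the 1/std_q^2 terms blow up.
    std_p = std_p.clamp(min=EPS)
    std_q = std_q.clamp(min=EPS)
    return (torch.log(std_q / std_p)
            + (std_p ** 2 + (mu_p - mu_q) ** 2) / (2 * std_q ** 2)
            - 0.5).sum(-1)

# ...inside the training loop, after loss.backward():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
# optimizer.step()
```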

@richardrl

Yeah, I "fixed" this problem in the sense that I don't get errors anymore. I had to use clipping in the logp computation (visualize log(x) and exp(x) and think about what's reasonable). I also added an epsilon in the ratio computation (the ratio is computed as exp(logp_new - logp_old)). Finally, I initialized the mean and variance of my distribution to be relatively small. This is important: if the variance is too high, you are guaranteed to NaN out when calculating the ratio.
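
Roughly along these lines (a PyTorch sketch, not my exact code; the bounds are illustrative):

```python
import torch

LOGP_MIN, LOGP_MAX = -20.0, 2.0   # keep log-probs in a sane range
LOG_RATIO_MAX = 10.0              # exp(10) ~ 2.2e4, still finite in float32

def safe_ratio(logp_new, logp_old):
    logp_new = logp_new.clamp(LOGP_MIN, LOGP_MAX)
    logp_old = logp_old.clamp(LOGP_MIN, LOGP_MAX)
    # Clamp the log-ratio before exponentiating so the ratio cannot overflow.
    return torch.exp((logp_new - logp_old).clamp(-LOG_RATIO_MAX, LOG_RATIO_MAX))
```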
