This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

Fix for issue 418 #451

Open
wants to merge 2 commits into master

Conversation

ghidellar

No description provided.

shadiendrawis self-requested a review June 29, 2020 11:51
@shadiendrawis
Collaborator

Hi,

Nice catch, do you happen to have some example where this happens and the change fixes it?
I've run into some issues in other contexts before where probabilities become NaN if the agent doesn't train correctly. My only issue with the change is that adding the epsilon to the output policy vector means that vector is no longer a probability vector, although apparently the TensorFlow distributions don't have a problem with that (probably because the epsilon is too small to matter in most cases). Is there any other place where you could add that epsilon where it makes more sense mathematically and still fixes the issue?

Regards,
Shadi
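
(A minimal NumPy sketch of the normalization point above; the epsilon value and the probability vector are made up for illustration and are not the ones used in the patch.)

import numpy as np

eps = 1e-15                          # illustrative value only
probs = np.array([0.5, 0.25, 0.25])  # a valid softmax output, sums to exactly 1
print(probs.sum())                   # 1.0
print((probs + eps).sum())           # slightly greater than 1 -> no longer a probability vector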

@ghidellar
Author

ghidellar commented Jun 29, 2020

Hi @shadiendrawis, basically the Categorical distribution created from the softmax output did not contain proper values (this would be caught right away by setting the tf.distributions.kl_divergence parameter allow_nan_stats to False). After that, all the values calculated from action_probs_wrt_policy are undefined (entropy, kl_divergence, surrogate_loss), and the action_values = self.get_prediction(curr_state) call in chooseAction contains NaN values. At this point I am thinking that we could perform a sanity check on self.action_probs_wrt_policy after it receives the log probs, but that does not sound very elegant to me, and the KL divergence would still not be well defined. I reproduced this on a run using just 2 discrete actions in a multi-feature continuous state space.
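
(For reference, a minimal NumPy sketch of one way the statistics described above become undefined: a softmax output that underflows to an exact zero makes the entropy NaN. The logit values are made up for illustration and this is not the repository's code.)

import numpy as np

def softmax(x):
    z = x - np.max(x)   # standard max-shift stabilization; does not prevent underflow to 0
    e = np.exp(z)
    return e / e.sum()

# Very spread-out logits (e.g. after a diverging update) underflow in float32.
logits = np.array([60.0, -60.0], dtype=np.float32)
probs = softmax(logits)                     # -> [1., 0.]
entropy = -(probs * np.log(probs)).sum()    # 0 * log(0) -> nan; the KL terms fail the same way
print(probs, entropy)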

@shadiendrawis
Collaborator

In that case, wouldn't it be more correct to try to fix the softmax outputs? How does adding an epsilon to the softmax outputs (which means they are no longer a probability vector) fix the problem?
It would be nice if you could provide an example that we can run to reproduce the problem locally, so we can better understand what is causing it.
The reasons I'm hesitant about this fix:

  1. I feel it masks the problem instead of fixing it since the self.action_probs_wrt_policy will still hold values that don't make sense.
  2. We've seen this NaN problem happen constantly before, only to disappear later once we found the right set of hyper-parameters for the problem.

@ghidellar
Author

Hi @shadiendrawis. Sorry for the delayed reply; I was only recently able to get back to this problem, and I followed your advice. Instead of adding the eps to the probs, I moved the fix so that the eps is added to the logits that feed the softmax layer.
I also found a corner-case scenario (probably due to particular data) that was still breaking the code. Moving the fix to the logits seems to handle that case as well, so it looks more solid.

self.policy_mean = tf.nn.softmax(policy_values + eps, name="policy")

Please let me know if this makes sense, since we are no longer messing with the probs.
