This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

Fix for issue 418 #451

Open
wants to merge 2 commits into master

Conversation

ghidellar

No description provided.

shadiendrawis self-requested a review June 29, 2020 11:51
@shadiendrawis
Collaborator

Hi,

Nice catch, do you happen to have some example where this happens and the change fixes it?
I've run into some issues in other contexts before where probabilities become NaN if the agent doesn't train correctly. My only issue with the change is that adding the epsilon to the output policy vector means that vector is no longer a probability vector, although apparently the TensorFlow distributions don't have a problem with that (probably because the epsilon is too small to matter in most cases). Is there any other place where you could add that epsilon where it makes more sense mathematically and still fixes the issue?

Regards,
Shadi
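
(A minimal NumPy sketch of the normalization point above; the epsilon value and the probability vector are made up for illustration and are not the ones used in the patch.)

import numpy as np

eps = 1e-15                          # illustrative value only
probs = np.array([0.5, 0.25, 0.25])  # a valid softmax output, sums to exactly 1
print(probs.sum())                   # 1.0
print((probs + eps).sum())           # slightly greater than 1 -> no longer a probability vector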

@ghidellar
Author

ghidellar commented Jun 29, 2020

Hi @shadiendrawis, basically the Categorical distribution created from the softmax output did not contain proper values (this would be caught right away by setting the tf.distributions.kl_divergence parameter allow_nan_stats to False). After that, all the values calculated from action_probs_wrt_policy are undefined (entropy, kl_divergence, surrogate_loss), and the action_values = self.get_prediction(curr_state) call in chooseAction contains NaN values. At this point I am thinking that we could perform a sanity check on self.action_probs_wrt_policy after it receives the log probs, but that does not sound very elegant to me, and the KL divergence would still not be well defined. I reproduced this on a run using just 2 discrete actions in a multi-feature continuous state space.
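
(For reference, a minimal NumPy sketch of one way the statistics described above become undefined: a softmax output that underflows to an exact zero makes the entropy NaN. The logit values are made up for illustration and this is not the repository's code.)

import numpy as np

def softmax(x):
    z = x - np.max(x)   # standard max-shift stabilization; does not prevent underflow to 0
    e = np.exp(z)
    return e / e.sum()

# Very spread-out logits (e.g. after a diverging update) underflow in float32.
logits = np.array([60.0, -60.0], dtype=np.float32)
probs = softmax(logits)                     # -> [1., 0.]
entropy = -(probs * np.log(probs)).sum()    # 0 * log(0) -> nan; the KL terms fail the same way
print(probs, entropy)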

@shadiendrawis
Collaborator

In that case, wouldn't it be more correct to try to fix the softmax outputs? How does adding an epsilon to the softmax outputs (which means they are no longer a probability vector) fix the problem?
It would be nice if you could provide an example that we can run to reproduce the problem locally, so we can better understand what is causing it.
The reasons I'm hesitant about this fix:

  1. I feel it masks the problem instead of fixing it since the self.action_probs_wrt_policy will still hold values that don't make sense.
  2. We've seen this NaN problem happen constantly before, only to disappear later once we found the right set of hyper-parameters for the problem.

@ghidellar
Author

Hi @shadiendrawis. Sorry for the delayed reply; I was only recently able to get back to this problem, and I followed your advice. Instead of adding the eps to the probs, I moved the fix so that the eps is added to the logits that feed the softmax layer.
I also found a corner-case scenario (probably due to particular data) that was still breaking the code. Moving the fix to the logits seems to handle that case as well, so it looks more solid.

self.policy_mean = tf.nn.softmax(policy_values + eps, name="policy")

Please let me know if this makes sense, since we are no longer messing with the probs.
