
Understanding of reward and loss function #33

Closed
tocab opened this issue Aug 21, 2017 · 7 comments


tocab commented Aug 21, 2017

Hello,

I don't understand the combination of the reward and the loss function. The labels which are given to the discriminator are defined as follows:

positive_labels = [[0, 1] for _ in positive_examples]
negative_labels = [[1, 0] for _ in negative_examples]

The reward is then always the second entry, corresponding to the positive (real) class:

ypred_for_auc = sess.run(discriminator.ypred_for_auc, feed)
ypred = np.array([item[1] for item in ypred_for_auc])

So a larger reward means the sample is classified as real.
But in the generator's loss function, the reward is multiplied into the loss:

self.g_loss = -tf.reduce_sum(
    tf.reduce_sum(
        tf.one_hot(tf.to_int32(tf.reshape(self.x, [-1])), self.num_emb, 1.0, 0.0) *
        tf.log(tf.clip_by_value(tf.reshape(self.g_predictions, [-1, self.num_emb]), 1e-20, 1.0)),
        1
    ) * tf.reshape(self.rewards, [-1])
)

I don't understand that: since the loss gets minimized, the rewards will be minimized too. So shouldn't item[0] be used for ypred instead?
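
For reference, here is a minimal NumPy sketch of the quantities I mean (toy values only; I'm assuming ypred_for_auc has shape [batch, 2] with columns ordered [fake, real], matching the labels above):

```python
import numpy as np

# Toy discriminator output: softmax over the two classes [fake, real]
# for three generated sequences (values are made up).
ypred_for_auc = np.array([[0.9, 0.1],   # D is confident this sample is fake
                          [0.4, 0.6],
                          [0.1, 0.9]])  # D is confident this sample is real

# item[1] is the probability of the positive (real) class, used as the reward.
reward = np.array([item[1] for item in ypred_for_auc])
print(reward)  # [0.1 0.6 0.9] -- larger means "looks more real" to the discriminator
```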


kunrenzhilu commented Aug 22, 2017

Why?

loss gets minimized, the rewards will be minimized too


zhengliz commented Apr 20, 2018

@tocab I agree with you. We are minimizing self.g_loss, which is equivalent to maximizing the whole expression inside -tf.reduce_sum(***). But *** is a product of a log-likelihood, which is < 0, and a reward in the range [0, 1]. Maximizing *** will simply push the reward to 0. Therefore, I believe that even though the loss is decreasing, the model is actually getting less reward, which is the opposite direction of what we want.
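
A quick numeric check of the signs (toy numbers only, not code from the repo):

```python
import numpy as np

# Toy per-sample log-likelihoods from the generator (always <= 0)
log_likelihood = np.log(np.array([0.2, 0.5, 0.8]))
# Toy rewards from the discriminator, i.e. P(real), in [0, 1]
reward = np.array([0.1, 0.6, 0.9])

product = log_likelihood * reward   # every term is <= 0
print(product)        # approximately [-0.16 -0.42 -0.20]
print(product.sum())  # the quantity inside -tf.reduce_sum(...) is <= 0
```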


luofuli commented Jun 17, 2018

@zhengliz @tocab I agree with you two. So has anyone replaced item[1] with item[0] for ypred?
More specifically,
change
ypred = np.array([item[1] for item in ypred_for_auc])
to
ypred = np.array([item[0] for item in ypred_for_auc])

LantaoYu (Owner) commented

Hi, the reward should be the likelihood of a generated sample being real.

The intuitive explanation is: in adversarial training, given a fixed (optimal) discriminator, the generator always learns to generate samples that can fool the discriminator. This means that if G generates a good example (i.e. one the discriminator classifies as real with high confidence), then G should adjust its parameters to assign this sequence a high density. In RL language, G needs to adjust its parameters to maximize the received reward, i.e. the probability of the generated sequences being real.

Back to your discussion: "Maximizing *** will simply push the reward to 0." It should be noted that when training the generator, the discriminator serves as a fixed environment and the reward is simply an external signal from that environment, which is not trainable. To be more specific, see this line: the reward for the generator is a placeholder, which is equivalent to a provided constant. So when you optimize G, the reward is fixed and serves as a signal, telling you which action is good or bad. For a good action that successfully fools the discriminator, you need to increase its probability in your distribution. Maximizing E[Q(s,a) * log p_\theta(a|s)] with respect to \theta does exactly this.
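
To make that concrete, here is a tiny self-contained sketch of the policy-gradient step (toy numbers, not the repository code), with the reward Q treated as a fixed constant:

```python
import numpy as np

# A categorical policy over 3 actions, parameterized by logits theta.
theta = np.zeros(3)

def probs(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Fixed rewards from the frozen discriminator/environment, one per action.
# These play the role of Q(s, a) and are NOT trainable.
Q = np.array([0.9, 0.5, 0.1])

# Policy gradient: E_{a~p}[Q(a) * grad log p_theta(a)],
# computed exactly here because the action space is tiny.
p = probs(theta)
grad = np.zeros(3)
for a in range(3):
    one_hot = np.eye(3)[a]
    grad += p[a] * Q[a] * (one_hot - p)  # gradient of log softmax w.r.t. the logits

theta += 1.0 * grad                      # one gradient-ascent step
print(probs(theta))  # probability mass moves toward the high-reward action
```

Because Q enters only as a constant weight, a good action (high Q) simply gets a larger push toward higher probability; the reward itself is never optimized.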


eduOS commented Jun 26, 2018

What I learned in computer vision scenarios includes the following.

I found this answer really helpful:

And so is this tutorial, which works through it:

Also, as per the paper’s suggestion, it’s better to maximize tf.reduce_mean(tf.log(D_fake)) instead of minimizing tf.reduce_mean(1 - tf.log(D_fake)) in the algorithm above.

The way I intuit the above is as follows.
I'd like to paraphrase the quote above as: maximizing tf.log(D_fake), which means maximizing the probability of the sample being real, is better than minimizing 1 - tf.log(D_fake), which means minimizing the probability of the sample being fake. From the perspective of the generator, either way lets it adjust its parameters to increase the likelihood of the sample being real. That is, if the discriminator already classifies the sample as real, the generator has less loss to reduce in TensorFlow (1 - tf.log(D_fake), as mentioned above) and hence a smaller gradient, and vice versa.

In this scenario:

I beg to differ and stand by changing item[1] to item[0], as @zhengliz said, which conflicts with what the author @LantaoYu replied. Let me paraphrase and analyse the author's reply:

in adversarial training, given a fixed (optimal) discriminator, the generator always learns to generate samples that can fool the discriminator. This means that if G generates a good example (i.e. one the discriminator classifies as real with high confidence), then G should adjust its parameters to assign this sequence a high density. In RL language, G needs to adjust its parameters to maximize the received reward, i.e. the probability of the generated sequences being real.

If I've correctly grasped the meaning of "In RL language, G needs to adjust its parameters to maximize the received reward, i.e. the probability of the generated sequences being real", it implies that G needs to optimize its trainable variables to get a higher reward (the probability of being real, in this case). But I can hardly be convinced that the model won't drive us in the opposite direction, because a larger reward scales the loss (the negative log-likelihood from the generator) down less, which amounts to penalizing the network with a comparatively larger loss than the same network with the same log-likelihood but a smaller reward between 0 and 1. I think it would be more reasonable if the model did the reverse, that is, if there were a positive correlation between the loss and the reward. In this implementation the correlation is negative: the more proper word, the one with the larger likelihood, is more likely to suffer from a bigger reward.

And subsequently the author wrote:

So when you optimize G, the reward is fixed and serves as a signal, telling you which action is good or bad. For a good action that successfully fools the discriminator, you need to increase its probability in your distribution. Maximizing E[Q(s,a) * log p_\theta(a|s)] with respect to \theta does exactly this.

@LantaoYu explained further that "For a good action that successfully fools the discriminator, you need to increase its probability in your distribution", but didn't articulate why "Maximizing E[Q(s,a) * log p_\theta(a|s)] with respect to \theta does exactly this". IIUC, E[Q(s,a) * log p_\theta(a|s)] stands for the mean of the product of the probability of the sample being real and the original loss term, i.e. the log-probability of the generated word corresponding to the target. But both in the paper and in this implementation the model is trying to minimize this, so how come maximizing E[Q(s,a) * log p_\theta(a|s)] amounts to minimizing function (2) in the paper? So how should I understand "maximize E[Q(s,a) * log p_\theta(a|s)]" correctly? Does maximizing E[Q(s,a) * log p_\theta(a|s)] amount to minimizing it?

Please point out anything that is wrong in my reasoning above. Thanks.


LantaoYu commented Jun 26, 2018

@eduOS Thanks for your comment! Let's discuss your points one by one.

First, about your recommended "this answer" and the image: it's just an explanation of how the original GAN works, and I don't see any contradiction here. The insight is that GAN is a good framework for optimizing the symmetric and smooth JS divergence, but only for continuous random variables, so we need to find out how to extend it to discrete sequence modeling.

Second, about your recommended tutorial and this quote

Also, as per the paper’s suggestion, it’s better to maximize tf.reduce_mean(tf.log(D_fake)) instead of minimizing tf.reduce_mean(1 - tf.log(D_fake)) in the algorithm above.

It should be noted that there is an error in this part of the tutorial, and hence also in your quote: maximizing tf.reduce_mean(tf.log(D_fake)) is equivalent to minimizing tf.reduce_mean(1 - tf.log(D_fake)), once you drop the reduce_mean operation and the constant 1. And if you look at the original paper, it says:
[screenshot of the relevant passage from the original GAN paper]
Its meaning is that "maximizing the likelihood of a fake sample being real is better than minimizing the likelihood of a fake sample being fake", and the reason is that the latter causes gradient vanishing and blocks optimization, nothing else. But the important thing is that we are always "maximizing the likelihood of a fake sample being real".
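
A toy numeric illustration of the gradient-vanishing point (illustrative numbers, not from the paper or the code):

```python
import numpy as np

D_fake = np.array([1e-4, 1e-2, 0.5])  # discriminator's P(real) for generated samples

# Saturating form: minimize log(1 - D_fake); gradient w.r.t. D_fake stays near -1
grad_saturating = -1.0 / (1.0 - D_fake)
# Non-saturating form: maximize log(D_fake); gradient w.r.t. D_fake is 1 / D_fake
grad_non_saturating = 1.0 / D_fake

print(grad_saturating)      # ~[-1, -1.01, -2]: weak signal when D confidently rejects fakes
print(grad_non_saturating)  # [10000, 100, 2]: strong signal exactly in that regime
```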

Third, about the "RL language" part. I don't quite understand what you mean by "penalizing", as it doesn't seem to be RL language. I think this part is pretty clear: in RL, the most important thing is to specify the reward, i.e. which actions are good and which are bad. As I discussed, in a GAN, when training G you always want it to generate samples that D thinks are real, so the reward is just the likelihood of a sample being real. Once we agree on this, the rest is just the RL policy-gradient derivation; I recommend David Silver's slides on policy gradients.

Fourth, about "E[Q(s,a) * log p_\theta(a|s)]" and "both in the paper and in this implementation the model is trying to minimize this". Please look at the code carefully. In this line, we define the loss of G as -E[Q(s,a) * log p_\theta(a|s)], and we are minimizing this loss. So we are minimizing the negative of the expectation, i.e. maximizing E[Q(s,a) * log p_\theta(a|s)].
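
To make this concrete, a toy check (illustrative only, treating the probability p assigned to a rewarded sequence as the trainable quantity itself) that gradient descent on -(Q * log p) increases p, and increases it faster when Q is larger:

```python
import numpy as np

def descend(Q, p=0.2, lr=0.1, steps=5):
    # Gradient descent on loss(p) = -(Q * log p); d(loss)/dp = -Q / p < 0,
    # so each step increases p (clipped to stay a valid probability).
    for _ in range(steps):
        p = np.clip(p - lr * (-Q / p), 1e-6, 1.0 - 1e-6)
    return p

print(descend(Q=0.9))  # ~1.0: a high-reward sequence is pushed strongly toward probability 1
print(descend(Q=0.1))  # ~0.38: a low-reward sequence is pushed up only slightly
```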

Again, thanks for your interest in my work. I do admit that SeqGAN has some limitations, such as high variance. Since it was done two years ago, I also recommend our latest paper and the code, which I believe is the state of the art.


eduOS commented Jun 29, 2018

@LantaoYu I've realized what I misunderstood. The larger the (negative log-likelihood * reward) term, the larger the gradient and the stronger the parameters are pushed in the right direction. I misinterpreted the combined loss.
