
Understanding of reward and loss function #33

Closed
tocab opened this issue Aug 21, 2017 · 7 comments


tocab commented Aug 21, 2017

Hello,

I don't understand the combination of the reward and the loss function. The labels which are given to the discriminator are defined as follows:

positive_labels = [[0, 1] for _ in positive_examples]
negative_labels = [[1, 0] for _ in negative_examples]

The reward is then always the second entry, corresponding to the positive (real) class:

ypred_for_auc = sess.run(discriminator.ypred_for_auc, feed)
ypred = np.array([item[1] for item in ypred_for_auc])

So a larger reward means the sample is classified as real.
But in the generator's loss function, the reward is multiplied into the loss:

self.g_loss = -tf.reduce_sum(
    tf.reduce_sum(
        tf.one_hot(tf.to_int32(tf.reshape(self.x, [-1])), self.num_emb, 1.0, 0.0) *
        tf.log(tf.clip_by_value(tf.reshape(self.g_predictions, [-1, self.num_emb]), 1e-20, 1.0)),
        1
    ) * tf.reshape(self.rewards, [-1])
)

I don't understand that: since the loss gets minimized, the rewards will be minimized too. So shouldn't item[0] be used for ypred instead?
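
For reference, here is a minimal NumPy sketch of the quantities I mean (toy values only; I'm assuming ypred_for_auc has shape [batch, 2] with columns ordered [fake, real], matching the labels above):

```python
import numpy as np

# Toy discriminator output: softmax over the two classes [fake, real]
# for three generated sequences (values are made up).
ypred_for_auc = np.array([[0.9, 0.1],   # D is confident this sample is fake
                          [0.4, 0.6],
                          [0.1, 0.9]])  # D is confident this sample is real

# item[1] is the probability of the positive (real) class, used as the reward.
reward = np.array([item[1] for item in ypred_for_auc])
print(reward)  # [0.1 0.6 0.9] -- larger means "looks more real" to the discriminator
```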


kunrenzhilu commented Aug 22, 2017

Why?

loss gets minimized, the rewards will be minimized too


zhengliz commented Apr 20, 2018

@tocab I agree with you. We are minimizing self.g_loss, which is equivalent to maximizing the whole expression inside -tf.reduce_sum(***). But *** is a product of a log-likelihood, which is < 0, and a reward in the range [0, 1]. Maximizing *** will simply push the reward to 0. Therefore, I believe that even though the loss is decreasing, the model is actually getting less reward, which is the opposite direction of what we want.
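
A quick numeric check of the signs (toy numbers only, not code from the repo):

```python
import numpy as np

# Toy per-sample log-likelihoods from the generator (always <= 0)
log_likelihood = np.log(np.array([0.2, 0.5, 0.8]))
# Toy rewards from the discriminator, i.e. P(real), in [0, 1]
reward = np.array([0.1, 0.6, 0.9])

product = log_likelihood * reward   # every term is <= 0
print(product)        # approximately [-0.16 -0.42 -0.20]
print(product.sum())  # the quantity inside -tf.reduce_sum(...) is <= 0
```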


luofuli commented Jun 17, 2018

@zhengliz @tocab I agree with you two. So has anyone replaced item[1] with item[0] for ypred?
More specifically,
change
ypred = np.array([item[1] for item in ypred_for_auc])
to
ypred = np.array([item[0] for item in ypred_for_auc])

LantaoYu (Owner) commented

Hi, the reward should be the likelihood of a generated sample being real.

The intuitive explanation is: in adversarial training, given a fixed (optimal) discriminator, the generator always learns to generate samples that can fool the discriminator. This means that if G generates a good example (i.e. one the discriminator classifies as real with high confidence), then G should adjust its parameters to assign this sequence a high density. In RL language, G needs to adjust its parameters to maximize the received reward, i.e. the probability of the generated sequences being real.

Back to your discussion: "Maximizing *** will simply push the reward to 0." It should be noted that when training the generator, the discriminator serves as a fixed environment and the reward is simply an external signal from that environment, which is not trainable. To be more specific, see this line: the reward for the generator is a placeholder, which is equivalent to a provided constant. So when you optimize G, the reward is fixed and serves as a signal, telling you which action is good or bad. For a good action that successfully fools the discriminator, you need to increase its probability in your distribution. Maximizing E[Q(s,a) * log p_\theta(a|s)] with respect to \theta does exactly this.
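
To make that concrete, here is a tiny self-contained sketch of the policy-gradient step (toy numbers, not the repository code), with the reward Q treated as a fixed constant:

```python
import numpy as np

# A categorical policy over 3 actions, parameterized by logits theta.
theta = np.zeros(3)

def probs(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Fixed rewards from the frozen discriminator/environment, one per action.
# These play the role of Q(s, a) and are NOT trainable.
Q = np.array([0.9, 0.5, 0.1])

# Policy gradient: E_{a~p}[Q(a) * grad log p_theta(a)],
# computed exactly here because the action space is tiny.
p = probs(theta)
grad = np.zeros(3)
for a in range(3):
    one_hot = np.eye(3)[a]
    grad += p[a] * Q[a] * (one_hot - p)  # gradient of log softmax w.r.t. the logits

theta += 1.0 * grad                      # one gradient-ascent step
print(probs(theta))  # probability mass moves toward the high-reward action
```

Because Q enters only as a constant weight, a good action (high Q) simply gets a larger push toward higher probability; the reward itself is never optimized.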


eduOS commented Jun 26, 2018

What I learned in computer vision scenarios includes the following.

I found this answer really helpful:

And so is this tutorial, which works through it:

Also, as per the paper’s suggestion, it’s better to maximize tf.reduce_mean(tf.log(D_fake)) instead of minimizing tf.reduce_mean(1 - tf.log(D_fake)) in the algorithm above.

The way I intuit the above is as follows.
I'd like to paraphrase the quote above as: maximizing tf.log(D_fake), which means maximizing the probability of the sample being real, is better than minimizing 1 - tf.log(D_fake), which means minimizing the probability of the sample being fake. From the perspective of the generator, either way lets it adjust its parameters to increase the likelihood of the sample being real. That is, if the discriminator already classifies the sample as real, the generator has less loss to reduce in TensorFlow (1 - tf.log(D_fake), as mentioned above) and hence a smaller gradient, and vice versa.

In this scenario:

I beg to differ and stand by changing item[1] to item[0], as @zhengliz said, which conflicts with what the author @LantaoYu replied. Let me paraphrase and analyse the author's reply:

in adversarial training, given a fixed (optimal) discriminator, the generator always learns to generate samples that can fool the discriminator. This means that if G generates a good example (i.e. one the discriminator classifies as real with high confidence), then G should adjust its parameters to assign this sequence a high density. In RL language, G needs to adjust its parameters to maximize the received reward, i.e. the probability of the generated sequences being real.

If I've correctly grasped the meaning of "In RL language, G needs to adjust its parameters to maximize the received reward, i.e. the probability of the generated sequences being real", it implies that G needs to optimize its trainable variables to get a higher reward (the probability of being real, in this case). But I can hardly be convinced that the model won't drive us in the opposite direction, because a larger reward scales the loss (the negative log-likelihood from the generator) down less, which amounts to penalizing the network with a comparatively larger loss than the same network with the same log-likelihood but a smaller reward between 0 and 1. I think it would be more reasonable if the model did the reverse, that is, if there were a positive correlation between the loss and the reward. In this implementation the correlation is negative: the more proper word, the one with the larger likelihood, is more likely to suffer from a bigger reward.

And subsequently the author wrote:

So when you optimize G, the reward is fixed and serves as a signal, telling you which action is good or bad. For a good action that successfully fools the discriminator, you need to increase its probability in your distribution. Maximizing E[Q(s,a) * log p_\theta(a|s)] with respect to \theta does exactly this.

@LantaoYu explained further that "For a good action that successfully fools the discriminator, you need to increase its probability in your distribution", but didn't articulate why "Maximizing E[Q(s,a) * log p_\theta(a|s)] with respect to \theta does exactly this". IIUC, E[Q(s,a) * log p_\theta(a|s)] stands for the mean of the product of the probability of the sample being real and the original loss term, i.e. the log-probability of the generated word corresponding to the target. But both in the paper and in this implementation the model is trying to minimize this, so how come maximizing E[Q(s,a) * log p_\theta(a|s)] amounts to minimizing function (2) in the paper? So how should I understand "maximize E[Q(s,a) * log p_\theta(a|s)]" correctly? Does maximizing E[Q(s,a) * log p_\theta(a|s)] amount to minimizing it?

Please point out anything that is wrong in my reasoning above. Thanks.


LantaoYu commented Jun 26, 2018

@eduOS Thanks for your comment! Let's discuss your points one by one.

First, about your recommended "this answer" and the image: it's just an explanation of how the original GAN works, and I don't see any contradiction here. The insight is that GAN is a good framework for optimizing the symmetric and smooth JS divergence, but only for continuous random variables, so we need to find out how to extend it to discrete sequence modeling.

Second, about your recommended tutorial and this quote

Also, as per the paper’s suggestion, it’s better to maximize tf.reduce_mean(tf.log(D_fake)) instead of minimizing tf.reduce_mean(1 - tf.log(D_fake)) in the algorithm above.

It should be noted that there is an error in this part of the tutorial, and hence also in your quote: maximizing tf.reduce_mean(tf.log(D_fake)) is equivalent to minimizing tf.reduce_mean(1 - tf.log(D_fake)), once you drop the reduce_mean operation and the constant 1. And if you look at the original paper, it says:
[screenshot of the relevant passage from the original GAN paper]
Its meaning is that "maximizing the likelihood of a fake sample being real is better than minimizing the likelihood of a fake sample being fake", and the reason is that the latter causes gradient vanishing and blocks optimization, nothing else. But the important thing is that we are always "maximizing the likelihood of a fake sample being real".
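
A toy numeric illustration of the gradient-vanishing point (illustrative numbers, not from the paper or the code):

```python
import numpy as np

D_fake = np.array([1e-4, 1e-2, 0.5])  # discriminator's P(real) for generated samples

# Saturating form: minimize log(1 - D_fake); gradient w.r.t. D_fake stays near -1
grad_saturating = -1.0 / (1.0 - D_fake)
# Non-saturating form: maximize log(D_fake); gradient w.r.t. D_fake is 1 / D_fake
grad_non_saturating = 1.0 / D_fake

print(grad_saturating)      # ~[-1, -1.01, -2]: weak signal when D confidently rejects fakes
print(grad_non_saturating)  # [10000, 100, 2]: strong signal exactly in that regime
```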

Third, about the "RL language" part. I don't quite understand what you mean by "penalizing", as it doesn't seem to be RL language. I think this part is pretty clear: in RL, the most important thing is to specify the reward, i.e. which actions are good and which are bad. As I discussed, in a GAN, when training G you always want it to generate samples that D thinks are real, so the reward is just the likelihood of a sample being real. Once we agree on this, the rest is just the RL policy-gradient derivation; I recommend David Silver's slides on policy gradients.

Fourth, about "E[Q(s,a) * log p_\theta(a|s)]" and "both in the paper and in this implementation the model is trying to minimize this". Please look at the code carefully. In this line, we define the loss of G as -E[Q(s,a) * log p_\theta(a|s)], and we are minimizing this loss. So we are minimizing the negative of the expectation, i.e. maximizing E[Q(s,a) * log p_\theta(a|s)].
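
To make this concrete, a toy check (illustrative only, treating the probability p assigned to a rewarded sequence as the trainable quantity itself) that gradient descent on -(Q * log p) increases p, and increases it faster when Q is larger:

```python
import numpy as np

def descend(Q, p=0.2, lr=0.1, steps=5):
    # Gradient descent on loss(p) = -(Q * log p); d(loss)/dp = -Q / p < 0,
    # so each step increases p (clipped to stay a valid probability).
    for _ in range(steps):
        p = np.clip(p - lr * (-Q / p), 1e-6, 1.0 - 1e-6)
    return p

print(descend(Q=0.9))  # ~1.0: a high-reward sequence is pushed strongly toward probability 1
print(descend(Q=0.1))  # ~0.38: a low-reward sequence is pushed up only slightly
```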

Again, thanks for your interest in my work. I do admit that SeqGAN has some limitations, such as high variance. Since it was done two years ago, I also recommend our latest paper and the code, which I believe is the state of the art.


eduOS commented Jun 29, 2018

@LantaoYu I've realized what I misunderstood. The larger the (negative log-likelihood * reward) term, the larger the gradient and the stronger the parameters are pushed in the right direction. I misinterpreted the combined loss.
