Understanding of reward and loss function #33
Comments
Why?
@tocab I agree with you. We are minimizing self.g_loss, which is equivalent to maximizing the whole expression inside the leading minus sign.
Hi, the reward should be the likelihood of a generated sample being real. The intuitive explanation is that, in adversarial training, given a fixed (optimal) discriminator, the generator always learns to generate samples that can fool the discriminator. That means if G generates a good example (i.e., one the discriminator classifies as real with high confidence), then G should adjust its parameters to assign this sequence a high density; in RL language, G needs to adjust its parameters to maximize the received reward, i.e., the probability of the generated sequence being real.

Back to your discussion: "Maximizing *** will simply push the reward to 0." It should be noted that when training the generator, the discriminator serves as a fixed environment and the reward is simply an external signal from that environment, which is not trainable. To be more specific, see this line: the reward for the generator is a placeholder, which is equivalent to a provided constant. So when you optimize G, the reward is fixed and serves as a signal telling you which actions are good and which are bad. For a good action that successfully fools the discriminator, you need to increase its probability in your distribution. Maximizing E [Q(s,a) * log(p_\theta(a|s))] with respect to \theta does exactly this.
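To make the "reward is a placeholder" point concrete, here is a minimal sketch of a REINFORCE-style generator loss in TensorFlow 1.x. It is not the repository's actual generator: the sizes, the stand-in softmax "generator", and the variable names are illustrative assumptions; only the shape of the loss mirrors the code discussed in this thread.

import tensorflow as tf  # TF 1.x API, matching the repository's style

# Illustrative sizes (assumptions, not the repo's actual configuration)
vocab_size, seq_len, batch_size = 100, 20, 4

x = tf.placeholder(tf.int32, [batch_size, seq_len])          # sampled token ids
rewards = tf.placeholder(tf.float32, [batch_size, seq_len])  # D's P(real), fed in as plain numbers

# A stand-in "generator": trainable logits turned into per-position softmax probabilities.
logits = tf.Variable(tf.zeros([batch_size, seq_len, vocab_size]))
g_predictions = tf.nn.softmax(logits)

# log p_theta(a|s) of the tokens that were actually sampled
log_p = tf.reduce_sum(
    tf.one_hot(tf.reshape(x, [-1]), vocab_size, 1.0, 0.0) *
    tf.log(tf.clip_by_value(tf.reshape(g_predictions, [-1, vocab_size]), 1e-20, 1.0)),
    1)

# g_loss = -E[Q(s,a) * log p_theta(a|s)]. Minimizing it raises log_p exactly where the
# reward is high; the reward itself enters through a placeholder, so no gradient reaches
# it and nothing "pushes the reward to 0".
g_loss = -tf.reduce_sum(log_p * tf.reshape(rewards, [-1]))
train_op = tf.train.AdamOptimizer(1e-3).minimize(g_loss)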
What I learned in computer vision scenarios includes the following. I found this answer to be really helpful: And so is this tutorial, which reads:
The way I intuit the above is as below. In this scenario:
If I've correctly grasped the meaning of the statement "in the RL language, G need to adjust parameters to maximize the received reward, i.e. the probability of generated sequences being real", it implies that G needs to optimize its trainable variables to get a higher reward (the probability of being real in this case, as supplemented). But I can hardly be convinced of that by the explanation alone. And subsequently the author wrote:
@LantaoYu explained further that "For a good action that successfully fool the discriminator, you need to increase its probability in your distribution", but didn't articulate why "Maximize E [Q(s,a) * log(p_\theta(a|s))] with respect to \theta does exactly this thing". IIUC, E [Q(s,a) * log(p_\theta(a|s))] stands for the mean of the product of the probability of the sample being real and the log-probability the generator assigns to that sample. Please help figure out anything that is wrong in my reasoning above. Thanks.
@eduOS Thanks for your comment! Let's discuss your points one by one. First, about your recommended "this answer" and the image: it's just an explanation of how the original GAN works, and I don't see any contradiction here. The insight is that GAN is a good framework for optimizing the symmetric and smooth JS divergence, but only for continuous random variables. So let's find out how to extend it to modeling discrete sequences. Second, about your recommended tutorial and this quote:
It should be noted that there is an error in this part of the tutorial, and hence also in your quote.

Third, about the "RL language" part: I don't quite understand what you mean by "penalizing", as it doesn't seem to be RL language. I think this part is pretty clear: in RL, the most important thing is to specify the reward, i.e., which actions are good and which are bad. As I discussed, in GAN training, when training G you always want it to generate samples that D thinks are real, so the reward is just the likelihood of a sample being real. Once we agree on this, the rest is just the RL policy gradient derivation; I recommend David Silver's slides on Policy Gradient.

Fourth, about "E [Q(s,a) * log(p_\theta(a|s))]" and "both in the paper and in this implementation the model is trying to minimize this": please look at the code carefully. In this line, we define the loss of G as "-E [Q(s,a) * log(p_\theta(a|s))]", and we are minimizing this loss. So we are minimizing the negative of the expectation, i.e., maximizing E [Q(s,a) * log(p_\theta(a|s))].

Again, thanks for your interest in my work. I do admit that there are some limitations of SeqGAN, such as high variance. Since it was done two years ago, I also recommend our latest paper and the code, which I believe is state-of-the-art.
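For completeness, the policy-gradient algebra behind that last point can be written out as follows. This is standard REINFORCE reasoning rather than a quote from the paper; the notation Q(s,a) and p_\theta(a|s) is the same as above, and Q is held fixed while optimizing \theta:

J(\theta) = \mathbb{E}_{a \sim p_\theta(\cdot\mid s)}\big[\,Q(s,a)\,\big] = \sum_a p_\theta(a\mid s)\,Q(s,a)

\nabla_\theta J(\theta) = \sum_a Q(s,a)\,\nabla_\theta p_\theta(a\mid s) = \mathbb{E}_{a \sim p_\theta}\big[\,Q(s,a)\,\nabla_\theta \log p_\theta(a\mid s)\,\big]

L_G(\theta) = -\,\mathbb{E}\big[\,Q(s,a)\,\log p_\theta(a\mid s)\,\big], \qquad \min_\theta L_G \;\Longleftrightarrow\; \max_\theta \mathbb{E}\big[\,Q(s,a)\,\log p_\theta(a\mid s)\,\big]

Since Q(s,a) does not depend on \theta, following this gradient increases \log p_\theta(a\mid s) most for the actions with the highest reward, which is exactly "increase the probability of good actions in your distribution".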
@LantaoYu I've realized what I misunderstood. The larger the (negative log-likelihood * reward) term, the larger the gradients and the more strongly the parameters are updated. I had misinterpreted the combined loss.
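A tiny numeric sanity check of that reading (the numbers are purely illustrative): for a single sampled token with generator probability p and fixed reward r, the per-sample loss is -r * log(p), and its gradient with respect to log p is -r, so samples the discriminator likes get pushed up proportionally harder.

import numpy as np

# One sampled token with generator probability p and a fixed reward r = D's P(real).
def per_sample_loss(p, r):
    return -r * np.log(p)

p = 0.2
for r in (0.1, 0.5, 0.9):
    # d(loss)/d(log p) = -r: the higher the reward, the stronger the push to raise p.
    print(f"reward={r:.1f}  loss={per_sample_loss(p, r):.3f}  d_loss/d_log_p={-r:.2f}")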
Hello,
I don't understand the combination of the reward and the loss function. The labels given to the discriminator are defined as follows:
positive_labels = [[0, 1] for _ in positive_examples]
negative_labels = [[1, 0] for _ in negative_examples]
The reward is then always the second entry, i.e. the positive label:
ypred_for_auc = sess.run(discriminator.ypred_for_auc, feed)
ypred = np.array([item[1] for item in ypred_for_auc])
So if the reward gets larger, the samples are being classified as the real class.
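To trace the indices, here is a small sketch with made-up numbers (only the item[1] selection mirrors the code above): with the label convention [0, 1] for real and [1, 0] for generated, column 1 of the discriminator's softmax output is its estimated probability of the "real" class, and that is what is used as the reward.

import numpy as np

# Hypothetical discriminator softmax outputs for three samples; with the label
# convention above, column 0 = P(generated) and column 1 = P(real).
ypred_for_auc = np.array([[0.9, 0.1],
                          [0.4, 0.6],
                          [0.2, 0.8]])

# item[1] picks P(real) per sample -- this is what gets fed back as the reward.
ypred = np.array([item[1] for item in ypred_for_auc])
print(ypred)  # [0.1 0.6 0.8]: higher means D is more convinced the sample is real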
But then, in the loss function of the generator, the reward is multiplied into the loss:
self.g_loss = -tf.reduce_sum(
    tf.reduce_sum(
        tf.one_hot(tf.to_int32(tf.reshape(self.x, [-1])), self.num_emb, 1.0, 0.0) *
        tf.log(tf.clip_by_value(tf.reshape(self.g_predictions, [-1, self.num_emb]), 1e-20, 1.0)),
        1
    ) * tf.reshape(self.rewards, [-1])
)
I don't understand that: since the loss gets minimized, won't the rewards be minimized too? So shouldn't item[0] be taken for ypred instead?