Why use the expected output in decoder training? #76

Closed
xiongma opened this issue Feb 25, 2019 · 10 comments

Comments

@xiongma

xiongma commented Feb 25, 2019

@Kyubyong I have a question: why is decoder_input used in decoder training? I think it will influence the model's output.

@xiongma

xiongma commented Feb 26, 2019 via email

@ty5491003

In the paper section 3.1:

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

A masking mechanism is used to prevent each position from seeing the later parts of decoder_input. As for why decoder_input is used at all, I think it is needed to calculate the loss value.
@policeme @moonlight1776
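The shifted decoder input and look-ahead mask described in the quoted passage can be sketched as follows (a toy illustration with assumed token ids, not code from this repo):

```python
import numpy as np

# Assumed special token ids for this toy example (not from the repo).
BOS, EOS = 1, 2

# The decoder input is the expected output shifted right by one position,
# with <s> prepended, so position i is predicted from tokens before i.
target = [5, 8, 3, EOS]              # expected output y
decoder_input = [BOS] + target[:-1]  # -> [1, 5, 8, 3]

# Look-ahead (causal) mask: True where attention is allowed, i.e. each
# position may attend only to itself and earlier positions.
T = len(decoder_input)
mask = np.tril(np.ones((T, T), dtype=bool))

print(decoder_input)       # [1, 5, 8, 3]
print(mask.astype(int))
```

With this mask, the model can be trained on all positions in parallel while each prediction still depends only on known outputs.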

@ywl0911

ywl0911 commented Mar 19, 2019

I am also confused about this question.

@xiongma

xiongma commented Mar 19, 2019

@ywl0911 if you figure it out, please contact me, thanks!

@ywl0911

ywl0911 commented Mar 19, 2019


@ty5491003
Hi, could you explain why the decoder uses decoder_input as its input? If that's the case, how can we run the test, since decoder_input is not available at test time?

@ty5491003

@ywl0911 You'll find the reason pretty funny when I tell you.
In the data-processing part of the test.py code:
test_batches, num_test_batches, num_test_samples = get_batch(hp.test1, hp.test1,
Look closely at the two arguments: the same hp.test1 is passed in twice.

@ywl0911

ywl0911 commented Mar 20, 2019


Hmm... so does that mean this part of the code is wrong? Shouldn't it be changed so that the decoder's output at the previous time step is fed in as the input at the next time step?

@ni1lloc

ni1lloc commented Mar 20, 2019


See Transformer.eval(), at model.py:152.
In the inference section, we feed ['<s>'] + y[0..t-1] into the model, and it returns y[0..t]; both have length t+1. This is repeated until every sentence in the batch has output '<pad>' (or the maximum length is reached).

The second input is unused for predictions. Just check the definitions of get_batch() at data_load.py:132 and input_fn() at data_load.py:92.
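That feed-the-prefix-back-in loop can be sketched like this (fake_model is a hypothetical stand-in for the real Transformer, and the token ids are assumed):

```python
# Toy sketch of the autoregressive loop described above. fake_model is a
# hypothetical stand-in for the real Transformer: given a decoder input of
# length t+1 it returns predictions y[0..t], also of length t+1.
BOS, PAD = 1, 0                          # assumed ids for '<s>' and '<pad>'
max_len = 10

def fake_model(decoder_input):
    canned = [7, 4, 9, PAD]              # pretend model predictions
    return canned[:len(decoder_input)]

ys = [BOS]                               # start from ['<s>']
preds = []
for _ in range(max_len):
    preds = fake_model(ys)               # returns y[0..t]
    ys = [BOS] + preds                   # next input: ['<s>'] + y[0..t]
    if preds[-1] == PAD:                 # stop once '<pad>' is produced
        break

print(preds)  # [7, 4, 9, 0]
```

The key point is that at test time the model only ever sees tokens it generated itself, so the second argument to get_batch() never influences the predictions.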

@trx14

trx14 commented Apr 12, 2019


But why isn't this autoregressive method used in training? Is it because it would be very slow?

@Pydataman

During training, the expected output is fed to the decoder to accelerate convergence; this technique is called teacher forcing.
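A minimal numpy sketch of what teacher forcing buys (all names and sizes here are assumed for illustration): the decoder consumes the shifted ground truth, so logits for every position come out of one parallel pass, and the loss compares them with the expected output.

```python
import numpy as np

# Teacher forcing: the decoder input is the ground truth shifted right, so
# predictions for all T positions are produced in a single parallel pass
# (instead of T sequential decoding steps) and compared with the target.
rng = np.random.default_rng(0)
vocab_size, T = 6, 4
target = np.array([5, 3, 4, 2])            # expected output y (toy ids)
logits = rng.normal(size=(T, vocab_size))  # one forward pass, all T steps

# Per-position softmax cross-entropy against the expected output:
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
loss = -np.log(probs[np.arange(T), target]).mean()
print(loss > 0.0)  # True
```

Besides speed, teacher forcing keeps early training stable: the model is never conditioned on its own (initially random) predictions.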

@xiongma xiongma closed this as completed Jun 26, 2019