
Improved feature inference #2103

Merged
10 commits merged into OpenNMT:master on Oct 1, 2021

Conversation

@anderleich (Contributor)

I've improved the feature inference pipeline to allow prior tokenization with joiners. Features can now be dumped to a file for debugging purposes during the vocabulary building phase.

It assumes source and features can be fed in two different ways:

  1. No previous tokenization:
SRC: this is a test.
FEATS: A A A B
RESULT: A A A B <null>
  2. Previously tokenized:
SRC: this is a test ■.
FEATS: A A A B A
RESULT: A A A B A

The second case might enable a more customized feature mapping, since features can be assigned per token rather than per whitespace-separated word.
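
Below is a minimal sketch of the alignment behaviour illustrated by the two cases above. It is only an illustration, not the PR's actual code; it assumes "■" is the joiner marker and "<null>" the filler feature.

```python
def infer_feats(tokens, feats, joiner="■", null_feat="<null>"):
    """Align word-level features with a possibly joiner-tokenized source."""
    if len(feats) == len(tokens):
        # Case 2: one feature per token was already provided.
        return list(feats)
    # Case 1: features were given per whitespace-separated word, so tokens
    # introduced by tokenization (carrying the joiner) get the filler feature.
    inferred, remaining = [], iter(feats)
    for tok in tokens:
        if tok.startswith(joiner) or tok.endswith(joiner):
            inferred.append(null_feat)
        else:
            inferred.append(next(remaining))
    return inferred

print(infer_feats("this is a test ■.".split(), "A A A B".split()))
# ['A', 'A', 'A', 'B', '<null>']
print(infer_feats("this is a test ■.".split(), "A A A B A".split()))
# ['A', 'A', 'A', 'B', 'A']
```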

@anderleich changed the title from "Improved feature inference transform" to "Improved feature inference" on Sep 21, 2021
@francoishernandez (Member)

Hey @anderleich,
Thanks for this new PR. Before going further, could you please pull the latest master and resolve the few conflicts? I had to make a few changes to get the lint checks to pass (#2106).

@anderleich marked this pull request as draft on September 22, 2021
@anderleich marked this pull request as ready for review on September 23, 2021
@anderleich (Contributor, Author)

I fixed some bugs in the code. It should be ready now.

@anderleich (Contributor, Author) commented Sep 23, 2021

When the feature merge type is concat, which options in the Transformer configuration should be changed?

rnn_size: 512
word_vec_size: 512
feat_vec_size: 16

I get the following error, since it expects tensors of size 512 rather than 512 + 16 = 528:

RuntimeError: Given normalized_shape=[512], expected input with shape [*, 512], but got input of size[197, 8, 528]

@anderleich (Contributor, Author)

I found that setting enc_rnn_size: 528 works as expected on the encoder side. In the case of the Transformer architecture the option name is not self-explanatory, though; I guess there is a legacy reason for this.

However, dec_rnn_size should be set to the default value 512. This is not possible currently due to the following check:

same_size = model_opt.enc_rnn_size == model_opt.dec_rnn_size

I guess we should remove that constraint.
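
For reference, a sketch of the configuration being discussed (illustrative values, not taken from the PR): with feat_merge: concat the encoder input becomes word_vec_size + feat_vec_size = 528, while the decoder side would stay at 512, which is exactly what the enc_rnn_size == dec_rnn_size check quoted above rejects.

```yaml
word_vec_size: 512
feat_vec_size: 16
feat_merge: concat
enc_rnn_size: 528   # 512 + 16, matches the concatenated encoder embeddings
dec_rnn_size: 512   # no source features on the decoder side
```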

@francoishernandez (Member)

This constraint is inherent to the Transformer architecture.
That's linked to what I mentioned on your first PR: when concatenating features to the embeddings, you need to reduce the size of the word embeddings so that the final input size matches the model size.
E.g. feat_vec_size 16 and rnn_size 512 --> word_vec_size 496.
This also means you can't share embeddings between encoder and decoder, since encoder-side embeddings will be 496 and decoder-side embeddings will be 512.
Or, in the case of your comment above, feat_vec_size 16 + word_vec_size 512 --> 528, which can't match an rnn_size of 512.
That's one of the cons of source features.
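
For reference, a config sketch of the approach described above (illustrative values, not taken from the PR): shrink the word embeddings so that word_vec_size + feat_vec_size equals the model size.

```yaml
rnn_size: 512        # Transformer model dimension (encoder and decoder)
word_vec_size: 496   # 496 + 16 = 512 after concatenating the features
feat_vec_size: 16
feat_merge: concat
# share_embeddings cannot be used here: encoder embeddings are 496-dim
# while decoder embeddings are 512-dim.
```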

@anderleich (Contributor, Author) commented Sep 29, 2021

Thanks! That's what I finally did. I guess we are ready to merge, aren't we?

@francoishernandez (Member) left a review comment


A few comments.
Also, isn't there anything to add to the docs/FAQ?

Resolved review comments: onmt/constants.py, onmt/inputters/corpus.py (two threads)
@anderleich (Contributor, Author)

I've tried fixing what you mentioned.

Note: I've also got changes ready for the server part, but I'll submit them in a new PR after this one is accepted.

@francoishernandez merged commit 990dcf6 into OpenNMT:master on Oct 1, 2021