
Improved feature inference #2103

Merged
10 commits merged into OpenNMT:master on Oct 1, 2021

Conversation

@anderleich (Contributor)

I've improved the feature inference pipeline to allow prior tokenization with joiners. Features can now be dumped to a file for debugging purposes during the vocabulary building phase.

It assumes source and features can be fed in two different ways:

  1. No previous tokenization:
SRC: this is a test.
FEATS: A A A B
RESULT: A A A B <null>
  2. Previously tokenized:
SRC: this is a test ■.
FEATS: A A A B A
RESULT: A A A B A

The second case might enable a more customized feature mapping, since features can be assigned per token rather than per whitespace-separated word.
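
Below is a minimal sketch of the alignment behaviour illustrated by the two cases above. It is only an illustration, not the PR's actual code; it assumes "■" is the joiner marker and "<null>" the filler feature.

```python
def infer_feats(tokens, feats, joiner="■", null_feat="<null>"):
    """Align word-level features with a possibly joiner-tokenized source."""
    if len(feats) == len(tokens):
        # Case 2: one feature per token was already provided.
        return list(feats)
    # Case 1: features were given per whitespace-separated word, so tokens
    # introduced by tokenization (carrying the joiner) get the filler feature.
    inferred, remaining = [], iter(feats)
    for tok in tokens:
        if tok.startswith(joiner) or tok.endswith(joiner):
            inferred.append(null_feat)
        else:
            inferred.append(next(remaining))
    return inferred

print(infer_feats("this is a test ■.".split(), "A A A B".split()))
# ['A', 'A', 'A', 'B', '<null>']
print(infer_feats("this is a test ■.".split(), "A A A B A".split()))
# ['A', 'A', 'A', 'B', 'A']
```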

@anderleich changed the title from "Improved feature inference transform" to "Improved feature inference" on Sep 21, 2021
@francoishernandez (Member)

Hey @anderleich,
Thanks for this new PR. Before going further, could you please pull the latest master and resolve the few conflicts? I had to make a few changes to get the lint checks to pass (#2106).

@anderleich marked this pull request as draft on September 22, 2021
@anderleich marked this pull request as ready for review on September 23, 2021
@anderleich (Contributor, Author)

I fixed some bugs in the code. It should be ready now.

@anderleich (Contributor, Author) commented Sep 23, 2021

When the feature merge type is concat, which options in the Transformer configuration should be changed?

rnn_size: 512
word_vec_size: 512
feat_vec_size: 16

I get the following error, since it expects tensors of size 512 rather than 512 + 16 = 528:

RuntimeError: Given normalized_shape=[512], expected input with shape [*, 512], but got input of size[197, 8, 528]

@anderleich (Contributor, Author)

I found that setting enc_rnn_size: 528 works as expected on the encoder side. In the case of the Transformer architecture the option name is not self-explanatory, though; I guess there is a legacy reason for this.

However, dec_rnn_size should be set to the default value 512. This is not possible currently due to the following check:

same_size = model_opt.enc_rnn_size == model_opt.dec_rnn_size

I guess we should remove that constraint.
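
For reference, a sketch of the configuration being discussed (illustrative values, not taken from the PR): with feat_merge: concat the encoder input becomes word_vec_size + feat_vec_size = 528, while the decoder side would stay at 512, which is exactly what the enc_rnn_size == dec_rnn_size check quoted above rejects.

```yaml
word_vec_size: 512
feat_vec_size: 16
feat_merge: concat
enc_rnn_size: 528   # 512 + 16, matches the concatenated encoder embeddings
dec_rnn_size: 512   # no source features on the decoder side
```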

@francoishernandez (Member)

This constraint is inherent to the Transformer architecture.
That's linked to what I mentioned on your first PR: when concatenating features to the embeddings, you need to reduce the size of the word embeddings so that the final input size matches the model size.
E.g. feat_vec_size 16 and rnn_size 512 --> word_vec_size 496.
This also means you can't share embeddings between encoder and decoder, since encoder-side embeddings will be 496 and decoder-side embeddings will be 512.
Or, in the case of your comment above, feat_vec_size 16 + word_vec_size 512 --> 528, which can't match an rnn_size of 512.
That's one of the cons of source features.
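
For reference, a config sketch of the approach described above (illustrative values, not taken from the PR): shrink the word embeddings so that word_vec_size + feat_vec_size equals the model size.

```yaml
rnn_size: 512        # Transformer model dimension (encoder and decoder)
word_vec_size: 496   # 496 + 16 = 512 after concatenating the features
feat_vec_size: 16
feat_merge: concat
# share_embeddings cannot be used here: encoder embeddings are 496-dim
# while decoder embeddings are 512-dim.
```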

@anderleich (Contributor, Author) commented Sep 29, 2021

Thanks! That's what I finally did. I guess we are ready to merge, aren't we?

@francoishernandez (Member) left a review comment


A few comments.
Also, isn't there anything to add to the docs/FAQ?

Resolved review comments: onmt/constants.py, onmt/inputters/corpus.py (two threads)
@anderleich (Contributor, Author)

I've tried fixing what you mentioned.

Note: I've also got changes ready for the server part, but I'll submit them in a new PR after this one is accepted.

@francoishernandez merged commit 990dcf6 into OpenNMT:master on Oct 1, 2021