
[WIP] Support target features #2227

Closed
wants to merge 62 commits

Conversation

anderleich
Contributor

This PR intends to add target features support to OpenNMT-py v3.0. All the code has been adapted for this new version.

Support for both source and target features has been refactored to simplify feature handling. Features are now appended to the textual data itself instead of being provided in a separate file, which also simplifies the way features are passed during inference and to the server. The special character `│` is used as the feature separator, as in previous versions of the OpenNMT framework. For instance:

 I│1│3 love│0│1 eating│0│1 pizza│0│1

I've also added a way to provide default values for features. This can be really useful when mixing task-specific data (with features) with general data that has not been annotated. Additionally, the filterfeats transform is no longer required, since features are now checked during corpus loading.
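To illustrate the format, here is a minimal sketch (not the PR's actual code; `parse_features` is a hypothetical helper) of how feature-annotated tokens could be split on the `│` separator, with defaults filled in for unannotated data:

```python
# Hypothetical sketch: split feature-annotated tokens on the "│" separator
# and fall back to default values for tokens without annotations.
SEP = "│"

def parse_features(line, n_feats, defaults=None):
    """Return (tokens, features), where features[i] holds one value per
    token for the i-th feature stream."""
    tokens, feats = [], [[] for _ in range(n_feats)]
    default_vals = defaults.split(SEP) if defaults else None
    for chunk in line.split():
        parts = chunk.split(SEP)
        tokens.append(parts[0])
        values = parts[1:]
        if not values and default_vals:
            values = default_vals  # unannotated token: use the defaults
        assert len(values) == n_feats, f"expected {n_feats} features: {chunk}"
        for i, value in enumerate(values):
            feats[i].append(value)
    return tokens, feats

tokens, feats = parse_features("I│1│3 love│0│1 eating│0│1 pizza│0│1", n_feats=2)
# tokens == ['I', 'love', 'eating', 'pizza']
# feats  == [['1', '0', '0', '0'], ['3', '1', '1', '1']]
```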

A YAML configuration file would look like this:

data:
    train:
        path_src: src_with_features.txt  #  I│1│3 love│0│1 eating│0│1 pizza│0│1
        path_tgt: tgt_with_features.txt  #  Me│1 gusta│0 comer│0 pizza│0
        transforms: [onmt_tokenize, inferfeats, filtertoolong]
    valid:
        path_src: src_with_features.txt
        path_tgt: tgt_with_features.txt
        transforms: [onmt_tokenize, inferfeats]

save_data: ./data
n_sample: -1

# Vocab opts
src_vocab: data.vocab.src
tgt_vocab: data.vocab.tgt
n_src_feats: 2
n_tgt_feats: 1
src_feats_defaults: "0│1"
tgt_feats_defaults: "1"
feat_merge: "sum"
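As a rough illustration of what `feat_merge: "sum"` means, here is a hedged, plain-Python sketch (not the actual OpenNMT-py implementation, which operates on embedding tensors): feature embeddings are added elementwise to the word embedding, so they must share the same dimensionality, whereas a concat-style merge would join them along the embedding dimension instead.

```python
# Hypothetical sketch of feat_merge "sum": feature embeddings are summed
# elementwise into the word embedding (all vectors must have equal length).
def merge_sum(word_emb, feat_embs):
    """word_emb: list[float]; feat_embs: list of list[float]."""
    merged = list(word_emb)
    for feat_emb in feat_embs:
        merged = [m + f for m, f in zip(merged, feat_emb)]
    return merged

merge_sum([1.0, 2.0], [[0.5, 0.5]])  # → [1.5, 2.5]
```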

For now, I've made the necessary changes in the code for vocabulary generation. That is, to make onmt_build_vocab work.
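Roughly speaking, making `onmt_build_vocab` feature-aware means counting a separate vocabulary per feature stream alongside the word vocabulary. A minimal sketch under that assumption (`count_feature_vocabs` is a hypothetical helper, not code from this PR):

```python
# Hypothetical sketch: build one vocabulary counter for words and one per
# feature stream from "│"-annotated lines, as vocab generation would need.
from collections import Counter

def count_feature_vocabs(lines, n_feats, sep="│"):
    word_counter = Counter()
    feat_counters = [Counter() for _ in range(n_feats)]
    for line in lines:
        for chunk in line.split():
            parts = chunk.split(sep)
            word_counter[parts[0]] += 1
            for i, value in enumerate(parts[1:]):
                feat_counters[i][value] += 1
    return word_counter, feat_counters
```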

@vince62s, do you think this is a good starting point?

@vince62s
Member

Hello @anderleich, yes, it's a good starting point, though it would be good to fix the lint and tests, and maybe add a test to show everything works fine.

@anderleich
Contributor Author

anderleich commented Oct 26, 2022

Yes, totally agree. I just wanted to make sure I was on the right track. For now, I've just made the necessary changes in the code for vocabulary generation. That is, to make onmt_build_vocab work. Once I get the training part done, I will add the necessary tests to ensure everything works as expected.

vince62s and others added 25 commits December 1, 2022 18:44
* fixed tensorboard logging
* added test / validation tests
* added tests in github actions
* fixed validation scoring
* added default value to choices
various fixes. see comment in PR.
* sources and refs tokens are recovered with vocab.lookup_index
* tests with dynamic scoring and copy are reactivated
* process transforms of buckets in batches rather than per example.
* pickable Vocab / v3.0.2
…tions (OpenNMT#2270)

* use native crossentropy
* doc
* fix transform bug
* empty transformed buckets are replaced by None (in build_vocab, the bucket is of size 1, so when the example is filtered, for instance, we can't reach the first instance of the empty list)
* detokenization in scoring_utils is done with apply_reverse
* added _detokenize to BPETransform
* handle empty TransformPipe by simply detokenizing with ' '.join()
* keep Label Smoothing for Validation (same as Train)
* fix save transforms
* various fixes along v3.0.3
@anderleich
Contributor Author

Closing this PR as the v3.0 branch was merged into master. I'll keep working on target features in a new one: #2289

anderleich closed this Jan 7, 2023