
[WIP] Support target features #2227

Closed
wants to merge 62 commits

Conversation

anderleich
Contributor

This PR intends to add target features support to OpenNMT-py v3.0. All the code has been adapted for this new version.

Support for both source and target features has been refactored to simplify feature handling. Features are now appended to the textual data itself instead of being provided in a separate file, which also simplifies the way features are passed during inference and to the server. The special character `│` is used as the feature separator, as in previous versions of the OpenNMT framework. For instance:

 I│1│3 love│0│1 eating│0│1 pizza│0│1

I've also added a way to provide default values for features. This can be really useful when mixing task-specific data (with features) with general data that has not been annotated. Additionally, the filterfeats transform is no longer required, since features are now checked during corpus loading.
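To illustrate the format, here is a minimal sketch (not the PR's actual code; `parse_features` is a hypothetical helper) of how feature-annotated tokens could be split on the `│` separator, with defaults filled in for unannotated data:

```python
# Hypothetical sketch: split feature-annotated tokens on the "│" separator
# and fall back to default values for tokens without annotations.
SEP = "│"

def parse_features(line, n_feats, defaults=None):
    """Return (tokens, features), where features[i] holds one value per
    token for the i-th feature stream."""
    tokens, feats = [], [[] for _ in range(n_feats)]
    default_vals = defaults.split(SEP) if defaults else None
    for chunk in line.split():
        parts = chunk.split(SEP)
        tokens.append(parts[0])
        values = parts[1:]
        if not values and default_vals:
            values = default_vals  # unannotated token: use the defaults
        assert len(values) == n_feats, f"expected {n_feats} features: {chunk}"
        for i, value in enumerate(values):
            feats[i].append(value)
    return tokens, feats

tokens, feats = parse_features("I│1│3 love│0│1 eating│0│1 pizza│0│1", n_feats=2)
# tokens == ['I', 'love', 'eating', 'pizza']
# feats  == [['1', '0', '0', '0'], ['3', '1', '1', '1']]
```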

A YAML configuration file would look like this:

data:
    train:
        path_src: src_with_features.txt  #  I│1│3 love│0│1 eating│0│1 pizza│0│1
        path_tgt: tgt_with_features.txt  #  Me│1 gusta│0 comer│0 pizza│0
        transforms: [onmt_tokenize, inferfeats, filtertoolong]
    valid:
        path_src: src_with_features.txt
        path_tgt: tgt_with_features.txt
        transforms: [onmt_tokenize, inferfeats]

save_data: ./data
n_sample: -1

# Vocab opts
src_vocab: data.vocab.src
tgt_vocab: data.vocab.tgt
n_src_feats: 2
n_tgt_feats: 1
src_feats_defaults: "0│1"
tgt_feats_defaults: "1"
feat_merge: "sum"
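As a rough illustration of what `feat_merge: "sum"` means, here is a hedged, plain-Python sketch (not the actual OpenNMT-py implementation, which operates on embedding tensors): feature embeddings are added elementwise to the word embedding, so they must share the same dimensionality, whereas a concat-style merge would join them along the embedding dimension instead.

```python
# Hypothetical sketch of feat_merge "sum": feature embeddings are summed
# elementwise into the word embedding (all vectors must have equal length).
def merge_sum(word_emb, feat_embs):
    """word_emb: list[float]; feat_embs: list of list[float]."""
    merged = list(word_emb)
    for feat_emb in feat_embs:
        merged = [m + f for m, f in zip(merged, feat_emb)]
    return merged

merge_sum([1.0, 2.0], [[0.5, 0.5]])  # → [1.5, 2.5]
```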

For now, I've made the necessary changes in the code for vocabulary generation. That is, to make onmt_build_vocab work.
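Roughly speaking, making `onmt_build_vocab` feature-aware means counting a separate vocabulary per feature stream alongside the word vocabulary. A minimal sketch under that assumption (`count_feature_vocabs` is a hypothetical helper, not code from this PR):

```python
# Hypothetical sketch: build one vocabulary counter for words and one per
# feature stream from "│"-annotated lines, as vocab generation would need.
from collections import Counter

def count_feature_vocabs(lines, n_feats, sep="│"):
    word_counter = Counter()
    feat_counters = [Counter() for _ in range(n_feats)]
    for line in lines:
        for chunk in line.split():
            parts = chunk.split(sep)
            word_counter[parts[0]] += 1
            for i, value in enumerate(parts[1:]):
                feat_counters[i][value] += 1
    return word_counter, feat_counters
```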

@vince62s, do you think this is a good starting point?

@vince62s
Member

Hello @anderleich, yes, it's a good starting point, though it would be good to fix the lint and tests, and maybe add a test to show everything works fine.

@anderleich
Contributor Author

anderleich commented Oct 26, 2022

Yes, totally agree. I just wanted to make sure I was on the right track. For now, I've just made the necessary changes in the code for vocabulary generation. That is, to make onmt_build_vocab work. Once I get the training part done, I will add the necessary tests to ensure everything works as expected.

vince62s and others added 25 commits December 1, 2022 18:44
* fixed tensorboard logging
* added test / validation tests
* added tests in github actions
* fixed validation scoring
* added default value to choices
various fixes. see comment in PR.
* sources and refs tokens are recovered with vocab.lookup_index
* tests with dynamic scoring and copy are reactivated
* process transforms of buckets in batches rather than per example.
* pickable Vocab / v3.0.2
…tions (OpenNMT#2270)

* use native crossentropy
* doc
* fix transform bug
* empty transformed buckets are replaced by None (in build_vocab, the bucket is of size 1, so when the example is filtered, for instance, we can't reach the first instance of the empty list)
* detokenization in scoring_utils is done with apply_reverse
* added _detokenize to BPETransform
* handle empty TransformPipe by simply detokenizing with ' '.join()
* keep Label Smoothing for Validation (same as Train)
* fix save transforms
* various fixes along v3.0.3
@anderleich
Contributor Author

Closing this PR as the v3.0 branch was merged into master. I'll keep working on target features in a new one: #2289

anderleich closed this Jan 7, 2023