en-ml Model and test data bug #4

Closed
santhoshtr opened this issue Oct 31, 2019 · 5 comments
Comments

@santhoshtr
Contributor

(Apologies if this is not the right repository to report this issue)

The data prepared for the Malayalam language has an issue: consistently there is a space before and after the virama ് (U+0D4D). It is a connecting character, and there should not be spaces around it.

Here is an example: https://object.pouta.csc.fi/OPUS-MT/eval/ml-en/Tatoeba.opus.bpe32k-bpe32k.mlen1.transformer.ml.en.test.txt

Here, the first word "ഒന ് നാം" should be "ഒന്നാം", which you get by removing the spaces around ്.
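
For illustration only (this is not part of any OPUS-MT script), a minimal post-processing fix for this kind of output would just collapse the spaces around the virama:

```python
import re

# Hypothetical cleanup, not part of the OPUS-MT pipeline:
# collapse " ് " (space, virama, space) back to the bare virama so it
# rejoins the surrounding consonants.
MALAYALAM_VIRAMA = "\u0D4D"

def remove_virama_spaces(text: str) -> str:
    return re.sub(r"\s*" + MALAYALAM_VIRAMA + r"\s*", MALAYALAM_VIRAMA, text)

print(remove_virama_spaces("ഒന ് നാം"))  # -> ഒന്നാം
```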

@santhoshtr
Contributor Author

On further testing, this issue is not limited to Malayalam; the Tamil (ta) virama character U+0BCD has the same problem. Example: https://object.pouta.csc.fi/OPUS-MT/eval/ta-en/Tatoeba.opus.bpe32k-bpe32k.taen1.transformer.ta.en.test.txt

Bengali (bn) has the same issue: https://object.pouta.csc.fi/OPUS-MT/models/bn-en/opus-2019-10-12.test.txt
Hindi too: https://object.pouta.csc.fi/OPUS-MT/eval/hi-en/newsdev2014.opus.bpe32k-bpe32k.hien1.transformer.hi.en.test.txt

I think some text preparation step has a bug.
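
To show the scope of the problem, here is a small hypothetical checker (not taken from the OPUS-MT code base) that flags test lines where any of these viramas has a space next to it; the codepoints are the standard Unicode viramas for the scripts mentioned above:

```python
import sys

# Hypothetical checker: report lines in a test file where a virama
# has a space on either side of it.
VIRAMAS = {
    "\u0D4D": "Malayalam",
    "\u0BCD": "Tamil",
    "\u09CD": "Bengali",
    "\u094D": "Devanagari (Hindi)",
}

def broken_virama_lines(path):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            for cp, script in VIRAMAS.items():
                if f" {cp}" in line or f"{cp} " in line:
                    yield lineno, script, line.rstrip("\n")

if __name__ == "__main__":
    for lineno, script, line in broken_virama_lines(sys.argv[1]):
        print(f"{lineno}\t{script}\t{line}")
```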

@jorgtied
Member

jorgtied commented Nov 6, 2019

This might be related to tokenization, which is not optimized for many languages. The new models will get rid of this and use SentencePiece for subword segmentation without prior tokenization. Hopefully, the issue will then go away.
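
As a rough sketch of that setup (using the `sentencepiece` Python bindings; `ml.model` is only a placeholder filename, not a released artifact), SentencePiece encodes raw text directly, so nothing ever inserts spaces around combining characters:

```python
import sentencepiece as spm

# Sketch only: "ml.model" stands in for whatever SentencePiece model the
# new pipeline ships. SentencePiece segments raw text, with no prior
# tokenization step that could separate combining marks.
sp = spm.SentencePieceProcessor(model_file="ml.model")

pieces = sp.encode("ഒന്നാം", out_type=str)
print(pieces)             # subword pieces; the virama stays attached
print(sp.decode(pieces))  # decoding reproduces the original string exactly
```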

@santhoshtr
Contributor Author

I noticed that the en-mr model, which was created using SentencePiece, does not have this issue.

@jorgtied
Member

I have uploaded development versions of additional translation models, including one for en-ml that uses SentencePiece segmentation. Does that resolve the tokenisation issue?
https://github.com/Helsinki-NLP/Opus-MT/tree/master/train/work-spm/models

@santhoshtr
Contributor Author

I took the model from https://object.pouta.csc.fi/OPUS-MT-dev/models/en-ml/opus-2019-12-19.zip and I can confirm that the issue is resolved. I have updated our experimental system http://opusmt.wmflabs.org/ with these models.
