en-ml Model and test data bug #4
Comments
On further testing, this issue is not limited to Malayalam. The Tamil (ta) virama character U+0BCD has the same issue. Example: https://object.pouta.csc.fi/OPUS-MT/eval/ta-en/Tatoeba.opus.bpe32k-bpe32k.taen1.transformer.ta.en.test.txt Bengali (bn) has the same issue: https://object.pouta.csc.fi/OPUS-MT/models/bn-en/opus-2019-10-12.test.txt I think some text preparation step has a bug.
This might be related to tokenization, which is not optimized for many languages. The new models will get rid of this and use SentencePiece for subword segmentation without prior tokenization. Hopefully, the issue will then go away.
I noticed that the en-mr model, which was created using SentencePiece, does not have this issue.
I have uploaded development versions of additional translation models, including one for en-ml that uses SentencePiece segmentation. Does that resolve the tokenisation issue?
I took the model from https://object.pouta.csc.fi/OPUS-MT-dev/models/en-ml/opus-2019-12-19.zip and I can confirm that the issue is resolved. I have updated our experimental system http://opusmt.wmflabs.org/ with these models.
(Apologies if this is not the right repository to report this issue)
The data prepared for the Malayalam language has an issue. There is consistently a space before and after the virama ് (U+0D4D). It is a connecting character, and there should be no space around it.
Here is an example https://object.pouta.csc.fi/OPUS-MT/eval/ml-en/Tatoeba.opus.bpe32k-bpe32k.mlen1.transformer.ml.en.test.txt
Here the first word "ഒന ് നാം" should be "ഒന്നാം" by removing spaces around ്.
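For anyone who wants to check their own copy of the data, here is a rough sketch of how the stray spaces could be detected and stripped. The function names and the regular expressions are my own illustration, not part of the OPUS-MT pipeline. It covers the virama code points mentioned in this thread (Malayalam U+0D4D, Tamil U+0BCD) plus Bengali U+09CD, which I am assuming is the affected Bengali character.

```python
import re

# Virama code points: Malayalam, Tamil, and (assumed) Bengali.
VIRAMAS = "\u0D4D\u0BCD\u09CD"

# A virama preceded by a space or tab is never valid, so this is
# a reasonably safe signal that a line is affected.
_DETECT = re.compile(rf"[ \t][{VIRAMAS}]")

# A virama together with any spaces or tabs on either side of it.
_SPACED = re.compile(rf"[ \t]*([{VIRAMAS}])[ \t]*")


def has_spaced_virama(text: str) -> bool:
    """Return True if any virama in `text` is preceded by whitespace."""
    return _DETECT.search(text) is not None


def join_virama(text: str) -> str:
    """Remove spaces and tabs directly around every virama (hypothetical helper)."""
    return _SPACED.sub(r"\1", text)


if __name__ == "__main__":
    # The first word from the example file, written with escapes.
    broken = "\u0D12\u0D28 \u0D4D \u0D28\u0D3E\u0D02"
    print(has_spaced_virama(broken))
    print(join_virama(broken))
```

One caveat: a word can legitimately end in a virama, and in that case the space after it is a real word boundary. Once the data has been split like this, that boundary cannot be told apart from the spurious space, so `join_virama` would merge the two words. It is better suited to inspecting the damage than to repairing training data; fixing the preparation step itself is the real solution.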