en-ml Model and test data bug #4

Closed
santhoshtr opened this issue Oct 31, 2019 · 5 comments
Comments

@santhoshtr
Contributor

(Apologies if this is not the right repository to report this issue)

The data prepared for the Malayalam language has an issue: consistently there is a space before and after the virama ് (U+0D4D). It is a connecting character, and there should not be spaces around it.

Here is an example: https://object.pouta.csc.fi/OPUS-MT/eval/ml-en/Tatoeba.opus.bpe32k-bpe32k.mlen1.transformer.ml.en.test.txt

Here, the first word "ഒന ് നാം" should be "ഒന്നാം", which you get by removing the spaces around ്.
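
For illustration only (this is not part of any OPUS-MT script), a minimal post-processing fix for this kind of output would just collapse the spaces around the virama:

```python
import re

# Hypothetical cleanup, not part of the OPUS-MT pipeline:
# collapse " ് " (space, virama, space) back to the bare virama so it
# rejoins the surrounding consonants.
MALAYALAM_VIRAMA = "\u0D4D"

def remove_virama_spaces(text: str) -> str:
    return re.sub(r"\s*" + MALAYALAM_VIRAMA + r"\s*", MALAYALAM_VIRAMA, text)

print(remove_virama_spaces("ഒന ് നാം"))  # -> ഒന്നാം
```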

@santhoshtr
Contributor Author

On further testing, this issue is not limited to Malayalam; the Tamil (ta) virama character U+0BCD has the same problem. Example: https://object.pouta.csc.fi/OPUS-MT/eval/ta-en/Tatoeba.opus.bpe32k-bpe32k.taen1.transformer.ta.en.test.txt

Bengali (bn) has the same issue: https://object.pouta.csc.fi/OPUS-MT/models/bn-en/opus-2019-10-12.test.txt
Hindi too: https://object.pouta.csc.fi/OPUS-MT/eval/hi-en/newsdev2014.opus.bpe32k-bpe32k.hien1.transformer.hi.en.test.txt

I think some text preparation step has a bug.
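
To show the scope of the problem, here is a small hypothetical checker (not taken from the OPUS-MT code base) that flags test lines where any of these viramas has a space next to it; the codepoints are the standard Unicode viramas for the scripts mentioned above:

```python
import sys

# Hypothetical checker: report lines in a test file where a virama
# has a space on either side of it.
VIRAMAS = {
    "\u0D4D": "Malayalam",
    "\u0BCD": "Tamil",
    "\u09CD": "Bengali",
    "\u094D": "Devanagari (Hindi)",
}

def broken_virama_lines(path):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            for cp, script in VIRAMAS.items():
                if f" {cp}" in line or f"{cp} " in line:
                    yield lineno, script, line.rstrip("\n")

if __name__ == "__main__":
    for lineno, script, line in broken_virama_lines(sys.argv[1]):
        print(f"{lineno}\t{script}\t{line}")
```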

@jorgtied
Member

jorgtied commented Nov 6, 2019

This might be related to tokenization, which is not optimized for many languages. The new models will get rid of this and use SentencePiece for subword segmentation without prior tokenization. Hopefully, the issue will then go away.
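
As a rough sketch of that setup (using the `sentencepiece` Python bindings; `ml.model` is only a placeholder filename, not a released artifact), SentencePiece encodes raw text directly, so nothing ever inserts spaces around combining characters:

```python
import sentencepiece as spm

# Sketch only: "ml.model" stands in for whatever SentencePiece model the
# new pipeline ships. SentencePiece segments raw text, with no prior
# tokenization step that could separate combining marks.
sp = spm.SentencePieceProcessor(model_file="ml.model")

pieces = sp.encode("ഒന്നാം", out_type=str)
print(pieces)             # subword pieces; the virama stays attached
print(sp.decode(pieces))  # decoding reproduces the original string exactly
```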

@santhoshtr
Contributor Author

I noticed that the en-mr model, which was created using SentencePiece, does not have this issue.

@jorgtied
Member

I have uploaded development versions of additional translation models, including one for en-ml that uses SentencePiece segmentation. Does that resolve the tokenisation issue?
https://github.com/Helsinki-NLP/Opus-MT/tree/master/train/work-spm/models

@santhoshtr
Contributor Author

I took the model from https://object.pouta.csc.fi/OPUS-MT-dev/models/en-ml/opus-2019-12-19.zip and I can confirm that the issue is resolved. I have updated our experimental system http://opusmt.wmflabs.org/ with these models.
