
improve translation of punctuation #268

Merged 1 commit into LibreTranslate:main on May 30, 2022

Conversation

@dingedi (Collaborator) commented May 30, 2022

This should improve some translations.

Example: https://libretranslate.com/?source=fr&target=en&q=bonjour

"bonjour" in French was translated as "Hello." in English. The previous fix corrected the unwanted capitalization; this fix corrects the punctuation, so "bonjour" in French will now be translated as "hello" in English.

@pierotofy (Member) commented:

Nice! I wonder if long-term there's a way to encode this logic into the language model, but for now this should help with languages like French.

@pierotofy merged commit ef8ccc2 into LibreTranslate:main on May 30, 2022
@dingedi (Collaborator, Author) commented May 31, 2022

I don't know if we can add logic in the models (ping @PJ-Finlay). I think we just need to improve the models with more and better datasets.

This seems to be a problem with many languages:

  • "hello" -> "Hola." (en->es)
  • "hello" -> "Hallo." (en->nl)
  • "hello" -> "Olá." (en->pt)
  • "hello" -> "Pronto?" (en->it)
  • "hello" -> "Hallo" (en->de) no punctuation problem but a capital letter is added

@PJ-Finlay (Contributor) commented:

The models map Unicode text in the source language to Unicode text in the target language, so they could be improved to handle capitalization better. This could be done by training on data that exhibits the desired capitalization behavior.

We could either add new data to the LibreTranslate Community Datasets or filter existing data the way this commit does. I've moved away from doing automated data modifications in the Argos Translate training scripts because it's slow. If we wrote a set of filters/rules, we could process existing data to create a new dataset and then train on that.
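As a rough sketch of that filter/rule idea (file names and layout are hypothetical; it assumes parallel data stored as aligned source/target line pairs):

```python
# Offline dataset filter sketch: normalize target-side punctuation and
# capitalization to mirror the source, so the trained model learns to
# preserve the source formatting.
TERMINAL_PUNCT = ".!?"

def normalize_pair(src: str, tgt: str) -> tuple[str, str]:
    src, tgt = src.strip(), tgt.strip()
    # If the source line has no terminal punctuation, strip it from the target.
    if src and tgt and src[-1] not in TERMINAL_PUNCT:
        tgt = tgt.rstrip(TERMINAL_PUNCT)
    # If the source starts lowercase, lowercase the target's first letter too.
    if src and tgt and src[0].islower():
        tgt = tgt[0].lower() + tgt[1:]
    return src, tgt

with open("data.fr") as fsrc, open("data.en") as ftgt, \
     open("clean.fr", "w") as osrc, open("clean.en", "w") as otgt:
    for s, t in zip(fsrc, ftgt):
        s, t = normalize_pair(s, t)
        if s and t:
            print(s, file=osrc)
            print(t, file=otgt)
```

Running something like this once, offline, keeps the training scripts themselves fast, per the point above about avoiding automated data modifications at training time.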

@SethFalco (Member) commented Jun 26, 2022

@PJ-Finlay

Is there documentation listing the sources Argos Translate or LibreTranslate use to download or create the datasets the model(s) are trained on?

I'm curious what data is there and how large it is.

I don't know if it'll amount to much, but we could consider exporting and including all approved strings from the public/hosted instance of Weblate. If that'd be useful, I'd happily look into how to dump this.

It mostly consists of open-source projects, so we should be free to use it.

Then the script and dump can be saved somewhere publicly accessible. (And the strings can be grouped by license if that's a concern, so you can choose which strings to use by license.)

We could hit other Weblate instances as well, namely Fedora's, which also hosts many projects.

I think it would be feasible to write a bash script which would do something like:
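A minimal sketch of the idea (in Python with requests rather than bash, purely for readability; the endpoint paths and field names follow Weblate's REST API, but treat them as assumptions to verify against the docs, and note that anonymous requests are rate-limited):

```python
import os
import requests

BASE = "https://hosted.weblate.org/api"

def paginate(url):
    # Weblate list endpoints return {"results": [...], "next": url-or-null}.
    while url:
        page = requests.get(url, timeout=30).json()
        yield from page["results"]
        url = page.get("next")

os.makedirs("dump", exist_ok=True)
for project in paginate(f"{BASE}/projects/"):
    for component in paginate(f"{BASE}/projects/{project['slug']}/components/"):
        slug = f"{project['slug']}/{component['slug']}"
        for translation in paginate(f"{BASE}/components/{slug}/translations/"):
            lang = translation["language_code"]
            # Download the raw translation file for this language.
            resp = requests.get(f"{BASE}/translations/{slug}/{lang}/file/", timeout=30)
            out = f"dump/{project['slug']}.{component['slug']}.{lang}"
            with open(out, "wb") as f:
                f.write(resp.content)
```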


Across all licenses and languages, and regardless of if the translation was approved or not:

  • hosted.weblate.org has 12,372,575 translated strings.
  • translate.fedoraproject.org has 2,459,960 translated strings.
  • translate.jellyfin.org has 156,359 translated strings.

There are other large instances too, like the openSUSE Weblate instance, but I can't check metrics for that one since I don't have an account and don't want to make one right now. ^-^'

https://weblate.org/en/discover/

@PJ-Finlay (Contributor) commented:

The data mostly comes from Opus currently, but if we scrape other sources or build new datasets, we could use them to train future models.
