
improve translation of punctuation #268

Merged 1 commit into LibreTranslate:main on May 30, 2022

Conversation

@dingedi (Collaborator) commented May 30, 2022

This should improve some translations.

Example: https://libretranslate.com/?source=fr&target=en&q=bonjour

"bonjour" in French was translated as "Hello." in English. The previous fix corrected the unwanted capitalization; this fix corrects the punctuation, so "bonjour" in French will now be translated as "hello" in English.

@pierotofy (Member) commented:

Nice! I wonder if long-term there's a way to encode this logic into the language model, but for now this should help with languages like French.

@pierotofy merged commit ef8ccc2 into LibreTranslate:main on May 30, 2022
@dingedi (Collaborator, Author) commented May 31, 2022

I don't know if we can add logic in the models (ping @PJ-Finlay). I think we just need to improve the models with more and better datasets.

This seems to be a problem with many languages:

  • "hello" -> "Hola." (en->es)
  • "hello" -> "Hallo." (en->nl)
  • "hello" -> "Olá." (en->pt)
  • "hello" -> "Pronto?" (en->it)
  • "hello" -> "Hallo" (en->de) no punctuation problem but a capital letter is added

@PJ-Finlay (Contributor) commented:

The models map Unicode text in the source language to Unicode text in the target language, so they could be improved to handle capitalization better. This could be done by training on data that exhibits the desired capitalization behavior.

We could either add new data to the LibreTranslate Community Datasets or filter existing data the way this commit does. I've moved away from doing automated data modifications in the Argos Translate training scripts because it's slow. If we wrote a set of filters/rules, we could process existing data to create a new dataset and then train on that.
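As a rough sketch of that filter/rule idea (file names and layout are hypothetical; it assumes parallel data stored as aligned source/target line pairs):

```python
# Offline dataset filter sketch: normalize target-side punctuation and
# capitalization to mirror the source, so the trained model learns to
# preserve the source formatting.
TERMINAL_PUNCT = ".!?"

def normalize_pair(src: str, tgt: str) -> tuple[str, str]:
    src, tgt = src.strip(), tgt.strip()
    # If the source line has no terminal punctuation, strip it from the target.
    if src and tgt and src[-1] not in TERMINAL_PUNCT:
        tgt = tgt.rstrip(TERMINAL_PUNCT)
    # If the source starts lowercase, lowercase the target's first letter too.
    if src and tgt and src[0].islower():
        tgt = tgt[0].lower() + tgt[1:]
    return src, tgt

with open("data.fr") as fsrc, open("data.en") as ftgt, \
     open("clean.fr", "w") as osrc, open("clean.en", "w") as otgt:
    for s, t in zip(fsrc, ftgt):
        s, t = normalize_pair(s, t)
        if s and t:
            print(s, file=osrc)
            print(t, file=otgt)
```

Running something like this once, offline, keeps the training scripts themselves fast, per the point above about avoiding automated data modifications at training time.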

@SethFalco (Member) commented Jun 26, 2022

@PJ-Finlay

Is there documentation listing the sources Argos Translate or LibreTranslate use to download or create the datasets the model(s) are trained on?

I'm curious what data is there and how large it is.

I don't know if it'll amount to much, but we could consider exporting and including all approved strings from the public/hosted instance of Weblate. If that'd be useful, I'd happily look into how to dump this.

It mostly consists of open-source projects, so we should be free to use it.

Then the script and dump can be saved somewhere publicly accessible. (And the strings can be grouped by license if that's a concern, so you can choose which strings to use by license.)

We could hit other Weblate instances as well, namely Fedora's, which also hosts many projects.

I think it would be feasible to write a bash script which would do something like:
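A minimal sketch of the idea (in Python with requests rather than bash, purely for readability; the endpoint paths and field names follow Weblate's REST API, but treat them as assumptions to verify against the docs, and note that anonymous requests are rate-limited):

```python
import os
import requests

BASE = "https://hosted.weblate.org/api"

def paginate(url):
    # Weblate list endpoints return {"results": [...], "next": url-or-null}.
    while url:
        page = requests.get(url, timeout=30).json()
        yield from page["results"]
        url = page.get("next")

os.makedirs("dump", exist_ok=True)
for project in paginate(f"{BASE}/projects/"):
    for component in paginate(f"{BASE}/projects/{project['slug']}/components/"):
        slug = f"{project['slug']}/{component['slug']}"
        for translation in paginate(f"{BASE}/components/{slug}/translations/"):
            lang = translation["language_code"]
            # Download the raw translation file for this language.
            resp = requests.get(f"{BASE}/translations/{slug}/{lang}/file/", timeout=30)
            out = f"dump/{project['slug']}.{component['slug']}.{lang}"
            with open(out, "wb") as f:
                f.write(resp.content)
```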


Across all licenses and languages, and regardless of if the translation was approved or not:

  • hosted.weblate.org has 12,372,575 translated strings.
  • translate.fedoraproject.org has 2,459,960 translated strings.
  • translate.jellyfin.org has 156,359 translated strings.

There are other large instances too, like the openSUSE Weblate instance, but I can't check metrics for that one since I don't have an account and don't want to make one right now. ^-^'

https://weblate.org/en/discover/

@PJ-Finlay (Contributor) commented:

The data mostly comes from Opus currently, but if we scrape other sources or build new datasets, we could use them to train future models.
