improve translation of punctuation #268
Nice! I wonder if long-term there's a way to encode this logic into the language model, but for now this should help with languages like French.
I don't know if we can add logic to the models (ping @PJ-Finlay). I think we just need to improve the models with more and better datasets; this seems to be a problem with many languages.
The models map Unicode in the source language to Unicode in the target language, so they could be improved to handle capitalization better. This could be done by training on data that handles capitalization with the desired behavior. We could either add new data to the LibreTranslate Community Datasets or filter existing data like this commit is doing. I've moved away from doing automated data modifications in the Argos Translate training scripts because it's slow. If we wrote a set of filters/rules, we could process existing data to create a new dataset and then train on that data.
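A rough sketch of what one such filter could look like, in Python. This is not code from Argos Translate or LibreTranslate; the names `TERMINAL`, `is_consistent`, and `filter_pairs` are made up for illustration, and the rule itself (keep a pair only when source and target agree on initial capitalization and on whether they end with sentence punctuation) is just one assumption about what "desired behavior" might mean.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical set of sentence-final marks; a real filter would likely need more.
TERMINAL = ".!?"


def is_consistent(source: str, target: str) -> bool:
    """Return True when both sides agree on capitalization and final punctuation."""
    source, target = source.strip(), target.strip()
    if not source or not target:
        return False
    same_punctuation = (source[-1] in TERMINAL) == (target[-1] in TERMINAL)
    same_case = source[0].isupper() == target[0].isupper()
    return same_punctuation and same_case


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield only the (source, target) pairs that pass the consistency rule."""
    for source, target in pairs:
        if is_consistent(source, target):
            yield source.strip(), target.strip()


if __name__ == "__main__":
    sample = [("bonjour", "hello"), ("bonjour", "Hello."), ("Bonjour.", "Hello.")]
    for pair in filter_pairs(sample):
        print(pair)
```

Note that this rule only makes sense for language pairs where both scripts have letter case (French/English, for example); for other pairs the capitalization check would need to be dropped or replaced.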
Is there documentation on the sources Argos Translate or LibreTranslate use to download/create the datasets the model(s) are trained on? I'm curious what data is there and how large it is. I don't know if it'll be much, but we could consider exporting and including all approved strings from the public/hosted instance of Weblate. If that'd be useful, I'd happily look into how to dump this. It mostly consists of open-source projects, so we should be free to use it. The script and dump can then be saved somewhere publicly accessible. (And the strings can be separated/grouped by license if that's a concern, so you can choose which strings you want by license.) We could hit other Weblate instances as well, namely Fedora's, which also has many projects. I think it would be feasible to write a bash script which would do something like:
Across all licenses and languages, and regardless of whether the translation was approved or not:
There are other large instances too, like the openSUSE Weblate instance, but I can't check metrics for that one since I don't have an account and don't want to make one right now. ^-^'
The data mostly comes from Opus currently, but if we scrape other sources or build new datasets we could use them for training future models.
This should improve some translations.
Example: https://libretranslate.com/?source=fr&target=en&q=bonjour
"bonjour" in French is translated as "Hello." in English. The previous fix corrected the uppercase letters; this fix corrects the punctuation.
"bonjour" in French will now be translated as "hello" in English.