[add] translatepy v2.0 #15

Merged
merged 14 commits into from Jun 14, 2021

Conversation

@ZhymabekRoman
Contributor

ZhymabekRoman commented Jun 10, 2021

New features:
- Exception raising
- Proxy support (partially done, still needs refinement; WIP)
- Better class management, with base classes
- Full code refactoring
- New Bing Translate implementation
And more.

WIP*:
- Fully implement the text-to-speech function
- Convert ISO 639 to CSV
- Implement the supported_languages method

*WIP: work in progress

@Animenosekai
Owner

Just cloned your fork, I'll check how it works!

@Animenosekai
Owner

@ZhymabekRoman What do you think of my commit?

@Animenosekai
Owner

Animenosekai commented Jun 10, 2021

Also, I thought about adding a better shell version (with something like inquirer) but that would add a dependency

@Animenosekai
Owner

Animenosekai commented Jun 10, 2021

I really like that we can now add more translators very easily, and even open this up to some sort of plugin system, since users can add whatever BaseTranslator-derived class they want.
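The plugin idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `BaseTranslator` here is a simplified stand-in, not the PR's real interface, and `OfflineTranslate`, `ReverseTranslate` and `Aggregator` are hypothetical names.

```python
class BaseTranslator:
    """Simplified stand-in for the base class (assumption, not the PR's real interface)."""

    def translate(self, text, destination_language):
        raise NotImplementedError(f"{type(self).__name__} cannot translate")


class OfflineTranslate(BaseTranslator):
    """Hypothetical service that never works: it keeps the raising default."""


class ReverseTranslate(BaseTranslator):
    """Hypothetical user-supplied 'translator' that simply reverses the text."""

    def translate(self, text, destination_language):
        return text[::-1]


class Aggregator:
    """Tries each registered translator class in order until one succeeds."""

    def __init__(self, services):
        self.services = [service() for service in services]

    def translate(self, text, destination_language):
        for service in self.services:
            try:
                return service.translate(text, destination_language)
            except NotImplementedError:
                continue  # fall through to the next service
        raise RuntimeError("No service could translate the text")


translator = Aggregator([OfflineTranslate, ReverseTranslate])
print(translator.translate("hello", "en"))  # olleh
```

Because the aggregator only relies on the shared method name, any user-defined subclass can be dropped into the list.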

Also, I think we need to return the detected language rather than "auto". Without it, the user has to make a second request in cases like translating some text but showing the translation only when the source language differs from the target language.
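A small sketch of that use case, assuming a result object that carries the detected source language. `DemoResult` and its field names are hypothetical and the values are made-up sample data.

```python
from collections import namedtuple

# Hypothetical, simplified result object; the field names are assumptions
DemoResult = namedtuple("DemoResult", "source source_language destination_language result")


def needs_display(result):
    """Return True when the text was actually written in another language."""
    # This comparison is only possible if source_language holds the detected
    # language (e.g. "fr") instead of the placeholder "auto".
    return result.source_language != result.destination_language


already_english = DemoResult("Hello", "en", "en", "Hello")
from_french = DemoResult("Bonjour", "fr", "en", "Hello")

for item in (already_english, from_french):
    if needs_display(item):
        print(f"{item.source} -> {item.result}")  # only the French input is shown
```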

@ZhymabekRoman
Contributor Author

What do you think of my commit?

Thank you, great work.

I thought about adding a better shell version (with something like inquirer) but that would add a dependency

I think we can use the click library: https://click.palletsprojects.com/en/8.0.x/

@Animenosekai
Owner

I think we can use the click library: https://click.palletsprojects.com/en/8.0.x/

What's cool about inquirer is that you can make a nice-looking list for choosing, for example, what the user wants to do (transliterate, translate, etc.)

@ZhymabekRoman
Contributor Author

Do you think it would be a good idea to use CSV to store ISO 639 values? Wouldn't it be better to use JSON instead of CSV?

The new CSV table solves a few problems:

  1. It makes it easier to change and access the data. The data looks very structured.
  2. The table also acts as a list of the languages supported by each service

image

I also added values to the list that are not in the official ISO 639 standard. For example, Yandex Translator can translate text into emoji, and only Yandex supports translation of Latin Kazakh and Cyrillic Uzbek

@Animenosekai
Owner

Do you think it would be a good idea to use CSV to store ISO 639 values? Wouldn't it be better to use JSON instead of CSV?

The new CSV table solves a few problems:

  1. It makes it easier to change and access the data. The data looks very structured.
  2. The table also acts as a list of the languages supported by each service

image

I also added values to the list that are not in the official ISO 639 standard. For example, Yandex Translator can translate text into emoji, and only Yandex supports translation of Latin Kazakh and Cyrillic Uzbek

I guess we could keep the data in CSV and convert it to a Python dict, so that it is a native Python object and there is no file I/O when launching translatepy
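For reference, the conversion step itself is small with the standard `csv` module. This is a minimal sketch: the column names and the two sample rows are assumptions for illustration, not the layout of the real table.

```python
import csv
import io

# Tiny stand-in for the CSV table (column names are assumptions)
SAMPLE_CSV = """name,alpha2,alpha3
English,en,eng
French,fr,fra
"""


def load_languages(csv_text):
    """Parse the CSV text into a dict keyed by the alpha2 code."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["alpha2"]: row for row in reader}


LANGUAGES = load_languages(SAMPLE_CSV)
print(LANGUAGES["fr"]["name"])  # French
```

Dumping such a dict into a generated `.py` file once, at development time, is what would avoid reading the CSV on every launch.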

I've never really worked with CSV, but if you are familiar with it, why not.

Also, we should add the translations into the other languages

@Animenosekai
Owner

@ZhymabekRoman I just added the interactive interface to translatepy!

@ZhymabekRoman
Contributor Author

I just added the interactive interface to translatepy!

Nice! I fixed some bugs

@ZhymabekRoman
Contributor Author

ZhymabekRoman commented Jun 12, 2021

So, I've reworked the ISO 639 data storage mechanism a little bit. All the information is stored in a CSV table, which is very easy to edit and the data looks structured. All of the ISO 639 data and the languages supported by the services were compiled from scratch from public sources.

And by the way, GitHub can display CSV in a browser: https://github.com/ZhymabekRoman/translate/blob/main/playground/iso639.csv

I also wrote a script that converts the CSV into Python code, as named tuples. The script is at translate/playground/export_csv_iso639_table.py; it generates the file iso639_table.py, which should be placed in the translate/translatepy/utils/ folder.
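To give an idea of what such an export step might look like, here is a minimal sketch. It is not the actual export_csv_iso639_table.py; the function name `export_table`, the column names and the sample rows are assumptions.

```python
import csv
import io

# Tiny stand-in for iso639.csv (column names are assumptions)
SAMPLE_CSV = """name,alpha2,alpha3,bing
English,en,eng,en
Chinese Simplified,,,zh-Hans
"""


def export_table(csv_text):
    """Return Python source code defining one named tuple per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = list(reader.fieldnames)
    lines = [
        "from collections import namedtuple",
        "",
        f"Language = namedtuple('Language', {fields!r})",
        "",
        "LANGUAGES = [",
    ]
    for row in reader:
        # repr() keeps quotes and non-ASCII text valid in the generated source
        args = ", ".join(f"{field}={row[field]!r}" for field in fields)
        lines.append(f"    Language({args}),")
    lines.append("]")
    return "\n".join(lines) + "\n"


source = export_table(SAMPLE_CSV)

# Execute the generated source to check that it is valid Python
namespace = {}
exec(source, namespace)
print(namespace["LANGUAGES"][1].bing)  # zh-Hans
```

In a real script the returned string would be written to the generated module instead of being executed in place.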

And by the way, I accidentally deleted the playground folder and all the scripts inside it. If any of those scripts were needed, restore them from the Git history.

Here are a couple of examples of the changes. Previously, the Language class did not provide information about languages that are not in the ISO 639 list but are supported by the services. Now you can get full information about a language if you know the code the translation service uses for it. For example, take Chinese Simplified, which Bing supports. The code Bing uses for it is zh-Hans. Let's get information about the language:

>>> Language.by_bing("zh-Hans")
Language(name='Chinese Simplified', alpha2='', alpha3='', in_foreign_languages={'sw': 'Kichina Kilichorahisishwa', 'ne': 'चिनियाँ सरलीकृत', 'sq': 'Kineze E Thjeshtuar', 'ht': 'Chinwa Senplifye', 'nl': 'Vereenvoudigd Chinees', 'be': 'Кітайскі спрошчаны', 'ga': 'Síneach A Simplithe', 'ba': 'Ябай ҡытай', 'ta': 'சீன எளிமைப்படுத்தப்பட்ட', 'mg': 'Sinoa Notsorina', 'pa': 'ਚੀਨੀ ਸਧਾਰਨ', 'gd': 'Sìnis Shimplichte', 'fi': 'Kiina, Yksinkertaistettu', 'ky': 'Кытай жөнөкөйлөштүрүлгөн', 'ar': 'الصينية المبسطة', 'he': 'סינית פשוטה', 'lt': 'Kinų Supaprastinta', 'uz': 'Xitoy Soddalashtirilgan', 'pl': 'Chiński Uproszczony', 'mi': 'Hainamana Ngāwari', 'ms': 'Cina', 'sv': 'Kinesiska, Förenklad', 'uk': 'Китайський спрощений', 'pt': 'Chinês Simplificado', 'vi': 'Trung Quốc Đơn Giản', 'hu': 'Egyszerűsített Kínai', 'cy': 'Tsieineaidd Simplified', 'gu': 'ચિની સરળીકૃત', 'eo': 'Ĉina Simpligita', 'km': 'ចិនសាមញ្ញ', 'no': 'Kinesisk (Forenklet)', 'bg': 'Китайски опростен', 'es': 'Chino Simplificado', 'cv': 'Китай ансатлатнӑ', 'et': 'Hiina Lihtsustatud', 'ja': '簡体字中国語', 'da': 'Forenklet Kinesisk', 'bn': 'সরলীকৃত চীনা', 'it': 'Cinese semplificato', 'en': 'Chinese Simplified', 'ca': 'Xinès Simplificat', 'th': 'ภาษาจีนประยุกต์ Name', 'tl': 'Pinasimpleng Tsino', 'la': 'Seres Simpliciores', 'te': 'సరళీకృత చైనీస్', 'tt': 'Кытайча', 'ko': '중국어 간체', 'xh': 'Isitshayina Esenziwe Lula', 'ml': 'Chinese Simplified', 'sl': 'Poenostavljeno Kitajsko', 'af': 'Vereenvoudigde Sjinees', 'fa': 'چینی ساده شده', 'tg': 'Чин соддакардашудаи', 'hy': 'Չինական պարզեցված', 'hi': 'चीनी सरलीकृत', 'my': 'တရုတ်ရိုးရှင်း', 'el': 'Κινέζικα Απλοποιημένα', 'id': 'Cina Disederhanakan', 'ka': 'ჩინური გამარტივებული', 'mk': 'Кинески-Поедноставен', 'cs': 'Zjednodušená Čínština', 'is': 'Kínverska Einfaldað', 'lo': 'ພາສາຈີນແບບງ່າຍ(ຈີນກາງ)', 'eu': 'Txinera Erraztua', 'mr': 'सरलीकृत चीनी', 'jv': 'Chinese Simplified', 'sr': 'Кинески', 'bs': 'Kineski Pojednostavljeni', 'kn': 'ಚೀನೀ ಸರಳೀಕೃತ', 'ru': 'Китайский упрощенный', 'zh': '简体中文', 'gl': 
'Chinés Simplificado', 'si': 'සරල චීන', 'ro': 'Chineză Simplificată', 'su': 'Cina Saderhana', 'fr': 'Chinois Simplifié', 'ur': 'آسان کردہ چینی', 'sk': 'Zjednodušená Čínština', 'lb': 'Chinesesch (Vereinfacht)', 'hr': 'Kineski pojednostavljeni', 'am': 'ቻይንኛ ቀላል', 'yi': 'כינעזיש סימפּלאַפייד', 'mn': 'Хятадын Хялбаршуулсан', 'de': 'Chinesisch (Vereinfacht)', 'kk': 'Қытай жеңілдетілген', 'mt': 'Ċiniż Simplifikata', 'lv': 'Ķīniešu Vienkāršotā', 'tr': 'Basitleştirilmiş Çince', 'zu': 'Isi-Chinese Esenziwe Lula', 'az': 'Basitleştirilmiş çin'}, yandex='', google='zh-cn', bing='zh-Hans', reverso='', deepl='')

Great, we have full information about the language. Let's also try to get information about emoji, which only Yandex supports and which is not in the official ISO 639 list:

>>> Language.by_yandex("emj")
Language(name='Emoji', alpha2='', alpha3='', in_foreign_languages={'sw': 'Emoji', 'ne': 'Emoji', 'sq': 'Emoji', 'ht': 'Anoji', 'nl': 'Emoji', 'be': 'Emoji', 'ga': 'Emoji', 'ba': 'Эмодзи', 'ta': 'ஈமோஜி', 'mg': 'Emoji', 'pa': 'Emoji', 'gd': 'Emoji', 'fi': 'Emoji', 'ky': 'Климаты мелүүн.', 'ar': 'رموز تعبيرية', 'he': 'Emoji', 'lt': 'Emoji', 'uz': 'Emoji', 'pl': 'Emoji', 'mi': 'Whakapā', 'ms': 'Smiley', 'sv': 'Emoji', 'uk': 'Емодзі', 'pt': 'Emoji', 'vi': 'Xúc', 'hu': 'Emoji', 'cy': 'Emoji', 'gu': 'ઇમોજી', 'eo': 'Emoji', 'km': 'អារម្មណ៍', 'no': 'Emoji', 'bg': 'Емоджи', 'es': 'Emoji', 'cv': 'Эмодзи', 'et': 'Emoji', 'ja': '絵文字', 'da': 'Emoji', 'bn': 'ইমোজি', 'it': 'Emoji', 'en': 'Emoji', 'ca': 'L"ús d"emoji', 'th': 'Emoji', 'tl': 'Mga Emoji', 'la': 'Emoji', 'te': 'Emoji', 'tt': 'Эмодзи', 'ko': '이모티콘', 'xh': 'Emoji', 'ml': 'Fast in malayalam', 'sl': 'Emoji', 'af': 'Emoji', 'fa': 'شکلک', 'tg': 'Эмодзи', 'hy': 'Էմոձի', 'hi': 'इमोजी', 'my': 'စိတ္၀င္စားစရာ', 'el': 'Emoji', 'id': 'Emoji', 'ka': 'ემოჯი', 'mk': 'Emoji', 'cs': 'Smajlík', 'is': 'Emoji', 'lo': 'ສັນຍາລັກ', 'eu': 'Emoji', 'mr': 'ईमोजी', 'jv': 'Emoji', 'sr': 'Емоји', 'bs': 'Emoji', 'kn': 'ಎಮೊಜಿಯನ್ನು', 'ru': 'Эмодзи', 'zh': '表情符号', 'gl': 'Emoji', 'si': 'එමොජි', 'ro': 'Emoji', 'su': 'Emoji', 'fr': 'Emoji', 'ur': 'Emoji', 'sk': 'Emoji', 'lb': 'Emoji', 'hr': 'Emoji', 'am': 'አዳዲስ', 'yi': 'עמאָדזשי', 'mn': 'Эможи', 'de': 'Emoji', 'kk': 'Эмодзи', 'mt': 'Emoji', 'lv': 'Emocijzīme', 'tr': 'Emoji', 'zu': 'Emoji', 'az': 'Emoji'}, yandex='emj', google='', bing='', reverso='', deepl='')
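A lookup by service code like the two above could work along these lines. This is a rough sketch, not the PR's implementation: `Language` is reduced to a named tuple, `by_service` is a hypothetical helper, and the three rows reuse the codes shown in this thread.

```python
from collections import namedtuple

# Simplified stand-in for the PR's Language class (assumption)
Language = namedtuple("Language", "name alpha2 alpha3 yandex google bing")

TABLE = [
    Language("English", "en", "eng", "en", "en", "en"),
    Language("Chinese Simplified", "", "", "", "zh-cn", "zh-Hans"),
    Language("Emoji", "", "", "emj", "", ""),
]

SERVICES = ("yandex", "google", "bing")

# One index per service: {service: {service_code: Language}}
# Empty codes are skipped because the service does not support that language.
INDEX = {
    service: {getattr(lang, service): lang for lang in TABLE if getattr(lang, service)}
    for service in SERVICES
}


def by_service(service, code):
    """Find a language by the code a given service uses for it."""
    try:
        return INDEX[service][code]
    except KeyError:
        raise ValueError(f"{service} has no language with code {code!r}") from None


print(by_service("bing", "zh-Hans").name)  # Chinese Simplified
print(by_service("yandex", "emj").name)    # Emoji
```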

This also changes language detection. Previously, the language code returned by the API was passed through as-is:

>>> from translatepy.translators import YandexTranslate
>>> dl = YandexTranslate()
>>> dl.language("Hello, how are you?")
LanguageResult(service=Yandex, source=Hello, how are you?, result=en)

Now it returns a Language object:

>>> from translatepy.translators import YandexTranslate
>>> dl = YandexTranslate()
>>> dl.language("Hello, how are you?")
LanguageResult(service=Yandex, source=Hello, how are you?, result=Language(name='English', alpha2='en', alpha3='eng', in_foreign_languages={'sw': 'Kiingereza', 'ne': 'नेपाली', 'sq': 'Anglisht', 'ht': 'Angle', 'nl': 'Engels', 'be': 'Англійскі', 'ga': 'Béarla', 'ba': 'Инглиз', 'ta': 'தமிழ்', 'mg': 'Malagasy', 'pa': 'ਅੰਗਰੇਜ਼ੀ', 'gd': 'Gaelic', 'fi': 'Englanti', 'ky': 'Кайнатылган.', 'ar': 'English', 'he': 'אנגלית', 'lt': 'Anglų', 'uz': 'Www uzbekona uz joni', 'pl': 'Angielski', 'mi': 'Maori', 'ms': 'Bahasa inggeris', 'sv': 'Engelsk', 'uk': 'Англійський', 'pt': 'Inglês', 'vi': 'Tiếng anh', 'hu': 'Angol', 'cy': 'Saesneg', 'gu': 'અંગ્રેજી', 'eo': 'La angla', 'km': 'គ្លេស', 'no': 'Engelsk', 'bg': 'Английски', 'es': 'Ingl', 'cv': 'Акӑлчанла', 'et': 'Inglise', 'ja': '英語', 'da': 'Engelsk', 'bn': 'বাংলা সেক্স ভিডিও', 'it': 'Inglese', 'en': 'English', 'ca': 'Anglès', 'th': 'ภาษาอังกฤษ', 'tl': 'Ingles', 'la': 'Anglorum', 'te': 'తెలుగు', 'tt': 'Инглизчә', 'ko': '영어', 'xh': 'Isixhosa', 'ml': 'മലയാളം', 'sl': 'Slovenian', 'af': 'Engels', 'fa': 'انگلیسی', 'tg': 'English', 'hy': 'Անգլերեն', 'hi': 'अंग्रेजी', 'my': 'အဂၤလိပ္စာ', 'el': 'Αγγλική', 'id': 'Inggris-US-sdh', 'ka': 'ინგლისური', 'mk': 'Англиски', 'cs': 'Anglický', 'is': 'Enska', 'lo': 'ອັງກິດ', 'eu': 'Euskara', 'mr': 'एचडी', 'jv': 'Inggris', 'sr': 'Енглески', 'bs': 'Engleski', 'kn': 'ಕನ್ನಡ', 'ru': 'Английский', 'zh': '中文', 'gl': 'Inglés', 'si': 'ඉංග්රීසි', 'ro': 'Română', 'su': 'Basa inggris', 'fr': 'Anglais', 'ur': 'انگریزی', 'sk': 'Anglický', 'lb': 'Englischsprachig', 'hr': 'Engleski', 'am': 'አማርኛ', 'yi': 'ענגליש', 'mn': 'Англи хэл', 'de': 'Englischsprachig', 'kk': 'Ағылшын', 'mt': 'Malti', 'lv': 'Angļu', 'tr': 'İngilizce', 'zu': 'Isizulu', 'az': 'İngilis dili'}, yandex='en', google='en', bing='en', reverso='en', deepl='EN'))

@ZhymabekRoman
Contributor Author

ZhymabekRoman commented Jun 12, 2021

Lmao, I just remembered that we could use named tuples instead of creating separate result model classes

For example:

from collections import namedtuple
TranslationResult = namedtuple("TranslationResult", "service source source_language destination_language result")
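A short usage sketch of that named tuple (the field values are made-up sample data):

```python
from collections import namedtuple

TranslationResult = namedtuple(
    "TranslationResult",
    "service source source_language destination_language result",
)

result = TranslationResult(
    service="Yandex",
    source="Hello",
    source_language="en",
    destination_language="fr",
    result="Bonjour",
)

print(result.result)                # Bonjour
print(result._asdict()["service"])  # Yandex

# Named tuples are immutable: _replace returns a new object
updated = result._replace(result="Salut")
print(updated.result, result.result)  # Salut Bonjour
```

The trade-off is that a named tuple gives field access and a readable repr for free, while a regular class leaves more room for custom methods and formatting.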

@Animenosekai
Owner

Lmao, I just remembered that we could use named tuples instead of creating separate result model classes

For example:

from collections import namedtuple
TranslationResult = namedtuple("TranslationResult", "service source source_language destination_language result")

I mean, using classes isn't that bad either lol

Also, I looked at the script creating the Python version of the CSV: Did everything work while doing so much translation with Yandex?

(Also, if you replaced all of the translations with Yandex's, we could merge them with the previous data, which was generated with Google Translate, to get more data, and check the similarity between the two to improve accuracy.)
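One possible way to do that similarity check, sketched with the standard library's `difflib`. The `merge` helper, the 0.9 threshold and the Google-side sample values are assumptions; the Yandex-side values are taken from the English entry shown earlier in this thread.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge(google, yandex, threshold=0.9):
    """Keep names both sources agree on; flag the rest for manual review."""
    merged, flagged = {}, []
    for code in sorted(set(google) | set(yandex)):
        g, y = google.get(code), yandex.get(code)
        if g is None or y is None:
            merged[code] = g or y  # only one source has this entry
        elif similarity(g, y) >= threshold:
            merged[code] = g
        else:
            flagged.append(code)
    return merged, flagged


# Names of "English" in a few languages
google = {"fr": "Anglais", "es": "Inglés", "ro": "Engleză"}  # illustrative values
yandex = {"fr": "Anglais", "es": "Ingl", "ro": "Română"}     # from the output above

merged, flagged = merge(google, yandex)
print(merged)   # {'fr': 'Anglais'}
print(flagged)  # ['es', 'ro']
```

Entries that end up in `flagged` would still need a human decision; the ratio only shows that the two sources disagree, not which one is right.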

@ZhymabekRoman ZhymabekRoman marked this pull request as ready for review June 14, 2021 11:00
@Animenosekai
Owner

@ZhymabekRoman Do you think that we should keep the _translate, _transliterate, _spellcheck, etc. methods abstract?

We could just make them normal methods that raise an exception by default, so that we don't need to override them just to raise an exception in each translator class.

I think, though, that we should keep _language_normalize and _language_denormalize abstract, as they are always needed.
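A minimal sketch of that split, using the standard `abc` module. The method signatures, the `UnsupportedMethod` exception and the `UppercaseTranslate` class are assumptions for illustration, not the PR's actual code.

```python
from abc import ABC, abstractmethod


class UnsupportedMethod(Exception):
    """Hypothetical exception for features a service does not offer."""


class BaseTranslator(ABC):
    # Optional features: plain methods that raise by default, so a
    # translator only overrides what its service actually supports.
    def _translate(self, text, destination_language, source_language):
        raise UnsupportedMethod(f"{type(self).__name__} cannot translate")

    def _spellcheck(self, text, source_language):
        raise UnsupportedMethod(f"{type(self).__name__} cannot spellcheck")

    # Required for every service: still abstract
    @abstractmethod
    def _language_normalize(self, language):
        ...

    @abstractmethod
    def _language_denormalize(self, language_code):
        ...


class UppercaseTranslate(BaseTranslator):
    """Hypothetical service that supports translation only."""

    def _translate(self, text, destination_language, source_language):
        return text.upper()

    def _language_normalize(self, language):
        return language

    def _language_denormalize(self, language_code):
        return language_code


service = UppercaseTranslate()
print(service._translate("hello", "en", "auto"))  # HELLO
try:
    service._spellcheck("helo", "en")
except UnsupportedMethod as error:
    print(error)  # UppercaseTranslate cannot spellcheck
```

A subclass that forgets either abstract method cannot be instantiated at all, which keeps the two language-mapping hooks mandatory.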

@ZhymabekRoman
Contributor Author

Did everything work while doing so much translation with Yandex?

Yes, I made more than 100,000 requests and everything worked fine

And I think the PR is ready. I don't know why the tests fail, but it works fine in the Python interactive shell

@ZhymabekRoman
Contributor Author

@ZhymabekRoman Do you think that we should keep the _translate, _transliterate, _spellcheck, etc. methods abstract?

We could just make them normal methods that raise an exception by default, so that we don't need to override them just to raise an exception in each translator class.

I think, though, that we should keep _language_normalize and _language_denormalize abstract, as they are always needed.

Hmmm, yeah, I think that's a great idea

@Animenosekai
Owner

And I think the PR is ready. I don't know why the tests fail, but it works fine in the Python interactive shell

Yea, I think that I'll merge it and we'll continue the small changes on the main branch

@Animenosekai Animenosekai merged commit 418cb4f into Animenosekai:main Jun 14, 2021