Missing spaces between tags when using translate_html #71

Closed
thanhtoan1196 opened this issue Oct 29, 2022 · 2 comments


thanhtoan1196 commented Oct 29, 2022

Missing spaces between tags when using translate_html

Code

from translatepy import Translator
print(Translator().translate_html("<p>I am a student and <strong>you are a teacher</strong></p>", "de"))

Current:

<p>Ich bin Student und<strong>du bist ein Lehrer</strong></p>

Expected:

<p>Ich bin Student und <strong>du bist ein Lehrer</strong></p>
@Animenosekai
Owner

Thanks for reaching out!

I could indeed reproduce your problem.

I just checked the code, and we do not seem to strip the spaces intentionally. The stripping might be done by any of the translators that translatepy calls.

Because we don't expect all the translators to support HTML translation, we need to split the HTML into its components, translate each one separately, and reassemble everything at the end.

A side effect is that each component is treated independently, so any cleaning (stripping spaces, for example) is applied to every component.

<p>I am a student and <strong>you are a teacher</strong>, incredible</p>
~~~^^^^^^^^^^^^^^^^^^^~~~~~~~~^^^^^^^^^^^^^^^^^~~~~~~~~~^^^^^^^^^^^^~~~~
           1                          2                       3

These are 3 separate components, each of which is translated on its own.
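
To make the effect concrete, here is a minimal sketch of that split/translate/reassemble flow. It is not translatepy's actual implementation: `fake_translate`, `PHRASES`, and `translate_components` are hypothetical names, and the "translator" is just a phrase table that strips whitespace the way an external service might.

```python
import re

# Hypothetical stand-in for a translation backend (NOT a real translatepy
# call): it looks phrases up in a fixed table and strips surrounding
# whitespace, as an external service might do.
PHRASES = {
    "I am a student and": "Ich bin Student und",
    "you are a teacher": "du bist ein Lehrer",
}


def fake_translate(text: str) -> str:
    stripped = text.strip()
    return PHRASES.get(stripped, stripped)


TAG = re.compile(r"(<[^>]+>)")


def translate_components(html: str, translate=fake_translate) -> str:
    # Splitting on a capturing group keeps the tags in the resulting list.
    parts = TAG.split(html)
    out = []
    for part in parts:
        if TAG.fullmatch(part) or not part.strip():
            out.append(part)  # tags and empty/whitespace-only chunks pass through
        else:
            out.append(translate(part))  # each text component is translated alone
    return "".join(out)


result = translate_components(
    "<p>I am a student and <strong>you are a teacher</strong></p>"
)
print(result)
```

With this stand-in, the space before `<strong>` disappears because the first component is cleaned without any knowledge of what follows it.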

Now, the problem is that we don't know what kind of cleaning the translators do, and different components might even be translated by different translators.

For languages with a different structure, the translator might add or remove specific symbols that carry meaning in the target language.

The order of symbols in a single phrase might also need to be different.

Now, suppose we introduce a basic check before translating to see whether we need to re-add spaces after the translation:

...
if tail_space_before_translation and not result.endswith(" "):
    result += " "
...
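
Spelled out a bit more, such a check could look like the sketch below. The helper name `translate_keeping_edges` and the phrase-table `fake_translate` are assumptions made for illustration, not existing translatepy functions.

```python
def translate_keeping_edges(text, translate):
    """Translate `text`, then put back the edge whitespace the source had."""
    core = text.strip()
    if not core:
        return text  # nothing to translate, keep whitespace-only chunks as-is
    head = text[: len(text) - len(text.lstrip())]  # leading whitespace
    tail = text[len(text.rstrip()):]               # trailing whitespace
    return head + translate(core).strip() + tail


# Hypothetical phrase table standing in for a real translation backend.
PHRASES = {
    "I am a student and": "Ich bin Student und",
    "Je suis un étudiant": "僕は生徒で",
}


def fake_translate(text):
    return PHRASES.get(text, text)


german = translate_keeping_edges("I am a student and ", fake_translate)
japanese = translate_keeping_edges("Je suis un étudiant ", fake_translate)
print(repr(german))
print(repr(japanese))
```

The German result gets its trailing space back, but the Japanese result gets one too, because the helper restores whitespace regardless of the target language.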

It might work for translations into Latin-script languages, but the translator might have deleted the spaces for a reason:

(I will use my native languages for simplicity.)

<p>Je suis un étudiant <strong>et vous êtes un professeur</strong></p>

should be translated into Japanese as

<p>僕は生徒で<strong>あなたは先生です</strong></p>

Notice that we removed the space, because Japanese usually doesn't put spaces between words.

The same behavior shows up when translating with translatepy:

>>> from translatepy import Translate
>>> t = Translate()
>>> r = t.translate_html("<p>Je suis un étudiant <strong>et vous êtes un professeur</strong></p>", "Japanese")
>>> r
'<p>私は学生です<strong>そして、あなたは先生です</strong></p>'

(which is a weird translation because of the component separation, but that's another topic)

I would need to come up with a better algorithm to translate HTML content without losing the context (both language-wise and HTML-wise), but I guess that would require complex NLP.

If you have any ideas, I would welcome them.

If you have any questions or issues, feel free to ask!

Oh, and sorry for being a bit inactive lately, but schoolwork has been much busier than what I had before...

@Animenosekai
Owner

Closing this for now, since it's been a while since this got any activity.

I partly continued this discussion in #93 if you are interested.

Feel free to reply if you want to reopen it!
