Specification and workflow needed: How to translate/localize links within the Qubes (doc) website? #3547

Open
tokideveloper opened this Issue Feb 6, 2018 · 32 comments


tokideveloper commented Feb 6, 2018

In order to specify a translation workflow/guidelines, we need to specify how to translate/localize links within the Qubes OS (doc) website. In this specific issue, I would like to discuss ways to do so.

Here are some key questions (checked if solved):

  • How to translate links (without a fragment) in general?
  • How to deal with fragments (*) in links?
  • Which language code ("English", "en", "en-US", "eng" etc.) to use to distinguish the languages? (**)
  • How to automate the translation/l10n of links as well as possible?
  • Use relative links instead of absolute ones?

(*) A fragment is the part after a hash sign ("#"), here: leading to a specific header on the linked page.
(**) "en" seems to be the currently used one. See the redirect_from lists in the YAML front matters in the Markdown files.


Related issues:

#2824
#1452
#1333

tokideveloper Feb 6, 2018

Concerning automated link translation, see this idea.

Any comments?


@andrewdavidwong andrewdavidwong added this to the Documentation/website milestone Feb 7, 2018

andrewdavidwong Feb 7, 2018

Member

Use relative links instead of absolute ones?

I recently updated the documentation guidelines on this point:
https://www.qubes-os.org/doc/doc-guidelines/#markdown-conventions

(In short: Yes, please use relative instead of absolute paths.)

tokideveloper Feb 7, 2018

Use relative links instead of absolute ones?

I recently updated the documentation guidelines on this point:
https://www.qubes-os.org/doc/doc-guidelines/#markdown-conventions

(In short: Yes, please use relative instead of absolute paths.)

I'm not sure we are talking about the same thing. Maybe I used the term "relative link" ambiguously.

By "relative link" I mean a "relative path" (not a URL), in the sense that the path does not begin with a slash /, like local paths on a Linux machine. In my understanding, however, URLs are always absolute.

For example, while https://www.qubes-os.org/doc/doc-guidelines/ and /doc/doc-guidelines/ are absolute paths following my definition, the paths ../, ../../intro/ and intro/ are relative ones. (Let's say that these relative links exist on the page /doc/doc-guidelines/ then they would lead to /doc, /intro and /doc/doc-guidelines/intro respectively. See my prototype.)
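
As an illustration, the resolution of these relative paths can be checked with Python's standard urllib.parse module. This is only a sketch: the helper name resolve is hypothetical, and the base page is the /doc/doc-guidelines/ page from the example.

```python
from urllib.parse import urljoin, urlsplit

# Page on which the relative links are assumed to appear.
BASE = "https://www.qubes-os.org/doc/doc-guidelines/"


def resolve(link: str, base: str = BASE) -> str:
    """Return the site path that `link` resolves to when used on `base`."""
    return urlsplit(urljoin(base, link)).path


# Relative paths depend on the page they appear on; a path with a
# leading slash resolves to the same target from every page.
for link in ("../", "../../intro/", "intro/", "/doc/doc-guidelines/"):
    print(f"{link!r:24} -> {resolve(link)}")
```

The trailing slash of the base page matters here: without it, the last path segment would be treated as a file name and dropped before the relative path is applied.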

andrewdavidwong Feb 8, 2018

Member

Oh, I see. Yes, I think we're talking about two different things. My main concern is to avoid https://www.qubes-os.org/doc/ in favor of /doc/, since the former prevents easy navigation on a locally-served copy of the website.

tokideveloper Feb 8, 2018

Oh, I see. Yes, I think we're talking about two different things. My main concern is to avoid https://www.qubes-os.org/doc/ in favor of /doc/, since the former prevents easy navigation on a locally-served copy of the website.

I see. Thank you.

So, now I want to discuss the use of

  • relative paths (../doc-guidelines),
  • absolute paths (/doc/doc-guidelines) and
  • "prefixed paths" ({{ page.langprefix }}/doc/doc-guidelines).

Relative Paths

Advantages
  • No absolute prefix needed. Thus, no prefix to adapt. Thus, no explicit localization needed (besides fragments?).
Disadvantages
  • All paths in all the canonical files have to be converted first.
  • It is harder to see where a relative path points to. Thus, rather error-prone.
  • When copying parts of an existing page to another page, all the relative paths have to be checked.

Absolute Paths

Advantages
  • Easy to see where an absolute path points to.
  • Robust when moving/copying (parts of) pages.
  • No conversion of the existing paths needed.
Disadvantages
  • They have to be localized manually. Automated localization could be hard, too.

"Prefixed Paths"

Advantages
  • Easy to see where a "prefixed path" points to.
  • Robust when moving/copying (parts of) pages.
  • When converting existing paths, only the language-dependent ones have to be prefixed.
  • Localization can be automated quite easily since only the YAML front matters need to be localized. Thus, much less error-prone and more generic.
Disadvantages
  • Prefixing of existing paths needed, plus extending the YAML front matter (*).

(*) I tried to set a variable langprefix within the Liquid code of my langswitch prototype, hoping that the variable would exist when printing the {{ content }}, but it does not seem to work.

Hint: When I tried out "prefixed paths", some strange behaviour appeared (paths with a literal leading slash in the source MD file became relative ones in the generated HTML files). So, one should first test "prefixed paths" with every way of creating links.

andrewdavidwong Feb 9, 2018

Member

Why would absolute paths have to be localized manually when the others don't?

tokideveloper Feb 9, 2018

Why would absolute paths have to be localized manually when the others don't?

Let's say that, for example, the page /doc/doc-guidelines/ shall link to /doc/.

If this is done by the absolute path /doc/ then translators have to translate it to /de-DE/doc/.

By contrast, a relative path like ../, pointing to /doc/, stays ../ in the translated version. Thus, no "translation" is needed.

Also, the "prefixed path" {{ page.langprefix }}/doc/ does not need to be "translated" (it's still {{ page.langprefix }}/doc/ in the translated version). However, the prefix {{ page.langprefix }} must already exist in the canonical version (and therefore has to be inserted, but only once for all translations). In addition, the value for page.langprefix must be set in the YAML front matter (in this example to the value /de-DE), but this can easily be done by an awk script or something.

Thus, both relative and "prefixed" paths don't need an explicit translation. They are already translated implicitly.
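
A minimal sketch of the "awk script or something" mentioned above, written in Python instead. It assumes that every page starts with a YAML front matter block delimited by two --- lines; the function names set_langprefix and expand are hypothetical, and expand only mimics what Liquid would do with {{ page.langprefix }} for illustration.

```python
import re

# The link as it would appear in the canonical page and in every translation.
LINK = "{{ page.langprefix }}/doc/"


def set_langprefix(page: str, prefix: str) -> str:
    """Insert or update the `langprefix` key in the front matter of `page`."""
    lines = page.split("\n")
    stripped = [line.strip() for line in lines]
    if stripped[0] != "---" or "---" not in stripped[1:]:
        raise ValueError("page has no complete YAML front matter")
    end = stripped.index("---", 1)  # index of the closing delimiter
    entry = f"langprefix: {prefix}"
    for i in range(1, end):
        if lines[i].startswith("langprefix:"):
            lines[i] = entry  # key already present: update it
            break
    else:
        lines.insert(end, entry)  # key missing: add it before the closing ---
    return "\n".join(lines)


def expand(text: str, prefix: str) -> str:
    """Illustration only: replace {{ page.langprefix }} the way Liquid would."""
    return re.sub(r"\{\{\s*page\.langprefix\s*\}\}", lambda _m: prefix, text)


page = "---\nlayout: doc\npermalink: /doc/doc-guidelines/\n---\nSee " + LINK + "\n"
print(set_langprefix(page, "/de-DE"))
print(expand(LINK, "/de-DE"))
```

Only the front matter differs between languages; the link text itself is identical in the canonical and the translated file.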

andrewdavidwong Feb 10, 2018

Member

Any reason we can't just do a recursive find-and-replace? Something like:

$ find . -type f -print0 | xargs -0 sed -i 's#/doc/#/de-DE/doc/#g'
tokideveloper Feb 14, 2018

Any reason we can't just do a recursive find-and-replace? Something like:

$ find . -type f -print0 | xargs -0 sed -i 's#/doc/#/de-DE/doc/#g'

I think it's hard to decide whether a string is a link or not if you don't use a MD/HTML/YAML parser.

But even if we used an appropriate parser, there could be corner cases where it's still hard to decide.

Let's say there are these lines:

<a href="/">To the root directory of the canonical/English/official version.</a>
...
<a href="/">To the root directory of the localized version in your language.</a>
...
<img src="/to/the/language-independent/logo.png">
...
Use `[here I am][/somewhere/in/the/repo]` to create a labeled link.
...
<a href="http://example.org/doc/">To the doc's root directory on another planet.</a>

The slashes must be interpreted differently, depending on the context, and thus, they could need different translations.
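
To make the concern concrete, here is a small sketch that applies the Python equivalent of the sed substitution to the example lines above. The function name naive_localize is hypothetical.

```python
LINES = [
    '<a href="/">To the root directory of the canonical/English/official version.</a>',
    '<a href="/">To the root directory of the localized version in your language.</a>',
    '<img src="/to/the/language-independent/logo.png">',
    'Use `[here I am][/somewhere/in/the/repo]` to create a labeled link.',
    '<a href="http://example.org/doc/">To the doc\'s root directory on another planet.</a>',
]


def naive_localize(line: str, prefix: str = "/de-DE") -> str:
    """Same effect on one line as sed 's#/doc/#/de-DE/doc/#g'."""
    return line.replace("/doc/", f"{prefix}/doc/")


# Collect the lines that the plain text substitution modifies.
changed = [line for line in LINES if naive_localize(line) != line]
for line in changed:
    print(naive_localize(line))
```

On these lines the substitution changes only the external example.org link, which should stay as it is, and leaves the second href="/" untouched although it is meant to point to the localized root.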

andrewdavidwong Feb 15, 2018

Member

Why not simply run different commands on .md and .html files, or do the recursive find-and-replace only on the .md files (which are the vast majority), then manually edit the .html files?

tokideveloper Feb 15, 2018

Why not simply run different commands on .md and .html files, or do the recursive find-and-replace only on the .md files (which are the vast majority), then manually edit the .html files?

Okay, I see that the vast majority should be handled automatically while some corner cases should be inspected manually. So, what about this compromise:

  1. Get a list of all existing permalinks.
  2. Copy all files in the repo. Do the next two steps only on the copies.
  3. Automatically prefix all (permalink) paths in all files with a unique placeholder.
  4. Manually check the placeholders to be in the correct place and nowhere else.
  5. Upload the temporary copy to Transifex.
  6. Automatically replace the placeholders on a temporary copy with the language-dependent path prefix /de-DE etc.
  7. On future changes, do the above steps only on the differences.

In more detail:

(1) First, we get a list of all existing permalinks (like /, /doc/ etc.):

cd REPO
grep -re 'permalink: ' . | grep --invert-match -e './_config.yml' | cut -f 2 -d' ' | grep -e '^/' | sort

(2) Then we copy the files of the canonical version to a dedicated directory, let's call it new_lang_prefixed_DATETIME where DATETIME is the current date and time.

(3) There, into all files, we automatically insert a (hopefully) unique prefix like %LangPrefix% in front of all permalink strings that look like translatable paths, depending on the file format (HTML, MD, YAML etc.), for example:

  • [/doc/] to [%LangPrefix%/doc/] in MD files,
  • (/) to (%LangPrefix%/) in MD files,
  • permalink: /doc/anti-evil-maid/ to permalink: %LangPrefix%/doc/anti-evil-maid/ in the YAML front matters,
  • href="/doc/" to href="%LangPrefix%/doc/" in HTML files and
  • src="/" to src="%LangPrefix%/" in HTML files.

This way, at least all paths should be covered. Hopefully, we won't miss any path.
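
Step (3) could be sketched roughly as follows. The regular expressions cover only the five contexts listed above, and for simplicity the sketch applies all of them to any text instead of choosing them per file format; the names CONTEXTS and add_placeholders are hypothetical. A path is prefixed only if it (without its fragment) is one of the permalinks collected in step (1).

```python
import re

PLACEHOLDER = "%LangPrefix%"

# One pattern per context from the list above. Each captures the text before
# the path, the path itself (starting with a slash), and the closing delimiter.
CONTEXTS = [
    re.compile(r'(\[)(/[^\]\s]*)(\])'),                       # [/doc/] in MD
    re.compile(r'(\()(/[^)\s]*)(\))'),                        # (/doc/) in MD
    re.compile(r'(^permalink:[ \t]*)(/\S*)([ \t]*$)', re.M),  # YAML front matter
    re.compile(r'((?:href|src)=")(/[^"]*)(")'),               # HTML attributes
]


def add_placeholders(text: str, permalinks: set) -> str:
    """Prefix every known permalink path in `text` with the placeholder."""

    def prefix(match: re.Match) -> str:
        before, path, after = match.groups()
        target = path.split("#", 1)[0]  # ignore a fragment when comparing
        if target in permalinks:
            return f"{before}{PLACEHOLDER}{path}{after}"
        return match.group(0)  # not a known permalink: leave untouched

    for pattern in CONTEXTS:
        text = pattern.sub(prefix, text)
    return text


sample = (
    "---\npermalink: /doc/anti-evil-maid/\n---\n"
    'See [the docs](/doc/) and [home](/), or <a href="/doc/">here</a>.\n'
    '<img src="/attachment/logo.png">\n'
)
known = {"/", "/doc/", "/doc/anti-evil-maid/"}
print(add_placeholders(sample, known))
```

Because a prefixed path no longer starts with a slash, running the function a second time changes nothing, so it is safe to re-run on files that were already processed.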

(4) In the next step, we manually check every occurrence of %LangPrefix% to confirm that it should become /de-DE etc. in the final files. Where the check fails, we replace the prefix %LangPrefix% with %NoLangPrefix%.

(5) Upload the files to Transifex and tell the translators not to translate these special prefixes.

(6) Then we automatically go through all translation languages and all translated files and modify them by replacing all occurrences of %LangPrefix% with /de-DE etc. and %NoLangPrefix% with the empty string.
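
Step (6) is a plain string replacement. A minimal sketch, with the hypothetical function name resolve_placeholders and an invented translated line as input:

```python
def resolve_placeholders(text: str, lang_prefix: str) -> str:
    """Replace %LangPrefix% with the language prefix and drop %NoLangPrefix%."""
    # "%NoLangPrefix%" does not contain "%LangPrefix%" as a substring,
    # so the order of the two replacements does not matter.
    return text.replace("%LangPrefix%", lang_prefix).replace("%NoLangPrefix%", "")


translated = (
    'Siehe [Doku](%LangPrefix%/doc/) und '
    '<img src="%NoLangPrefix%/attachment/logo.png">.'
)
for prefix in ("/de-DE", "/fr-FR"):
    print(resolve_placeholders(translated, prefix))
```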

(7) In the future, when some of the canonical files change, we copy only the modified files to a new new_lang_prefixed_DATETIME directory and repeat the steps described above only on the differences from the most recent new_lang_prefixed_DATETIME directory (via an appropriate use of the diff tool, for example). This way, we reduce the effort and focus only on the changes.

Of course, obsolete new_lang_prefixed_DATETIME directories may be removed. The directories might be useful if a new translation language appears since the newest version of a path-prefixed and manually inspected file should be uploaded. So, the directories would work as a cache.
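
For step (7), one possible way to limit the manual check to the differences is sketched below with Python's difflib instead of the diff tool. The function name lines_to_review is hypothetical; it compares the already reviewed copy of a file with a freshly prefixed one and returns only new or changed lines that contain a placeholder.

```python
import difflib


def lines_to_review(reviewed: str, regenerated: str) -> list:
    """Return new or changed lines of `regenerated` that contain %LangPrefix%."""
    old = reviewed.splitlines()
    new = regenerated.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    pending = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            pending.extend(line for line in new[j1:j2] if "%LangPrefix%" in line)
    return pending


reviewed = "A [x](%LangPrefix%/doc/)\nB plain text\n"
regenerated = reviewed + "C [y](%LangPrefix%/intro/)\n"
print(lines_to_review(reviewed, regenerated))
```

One caveat of this sketch: a line that was manually changed to %NoLangPrefix% in the reviewed copy differs from its regenerated version and would therefore be reported again.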

EDIT: I swapped steps 5 and 6 to be able to upload only language-independent versions.

tokideveloper Feb 15, 2018

Of course, obsolete new_lang_prefixed_DATETIME directories may be removed. The directories might be useful if a new translation language appears since the newest version of a path-prefixed and manually inspected file should be uploaded. So, the directories would work as a cache.

Another method could be to overwrite the files in new_lang_prefixed_DATETIME with newer versions, rather than storing new versions in their own directories. Thus, only one new_lang_prefixed_DATETIME directory is needed, making the suffix _DATETIME superfluous.

tokideveloper Feb 16, 2018

In the algorithm above, I forgot the redirect-from links. So, whenever it's about permalinks then all redirect-from links must be considered, too.
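
A sketch of how both kinds of paths could be collected from one file, as an alternative to the grep pipeline in step (1). It assumes a front matter with a permalink key and a redirect_from list written one item per line; the function name front_matter_paths and the sample paths are hypothetical, and it deliberately avoids a YAML parser to stay within the standard library.

```python
def front_matter_paths(page: str) -> list:
    """Return the permalink and redirect_from paths declared in `page`."""
    lines = page.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no front matter at all
    paths = []
    in_redirects = False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # end of the front matter; the body is ignored
        if stripped.startswith("permalink:"):
            paths.append(stripped.split(":", 1)[1].strip())
            in_redirects = False
        elif stripped.startswith("redirect_from:"):
            in_redirects = True
            inline = stripped.split(":", 1)[1].strip()
            if inline:  # single value written on the same line
                paths.append(inline)
        elif in_redirects and stripped.startswith("- "):
            paths.append(stripped[2:].strip())
        else:
            in_redirects = False
    return [path for path in paths if path.startswith("/")]


page = (
    "---\nlayout: doc\npermalink: /doc/doc-guidelines/\n"
    "redirect_from:\n- /en/doc/doc-guidelines/\n- /doc/DocStyle/\n---\n"
    "Body text mentioning permalink: /not/front/matter/\n"
)
print(front_matter_paths(page))
```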

andrewdavidwong Feb 17, 2018

Member

Okay, I see that the vast majority should be handled automatically while some corner cases should be inspected manually. So, what about this compromise: [...]

It sounds like this procedure would be something the localization team (including you) performs. If it doesn't entail any changes to the canonical English documentation, the details of the procedure for accomplishing the agreed-upon end result are up to you.

tokideveloper Feb 17, 2018

If it doesn't entail any changes to the canonical English documentation, the details of the procedure for accomplishing the agreed-upon end result are up to you.

Okay, thank you! Of course, we'll try to minimize possible impacts on the canonical English documentation. But some things are not yet clear to me:

  • Where (directory and/or repo) can we put all our folders (languages, doc etc.) and files (content, layout etc.) concerning translations?
  • When it's about going live, the canonical English documentation should insert a language switch which contains links that are labeled with translated words and pointing to unofficial (i.e. translated) pages. Thus, (a) some minor adjustments on the canonical documentation and (b) some trust in translators etc. seem to be necessary. How to handle this?

andrewdavidwong Feb 18, 2018

Member

Where (directory and/or repo) can we put all our folders (languages, doc etc.) and files (content, layout etc.) concerning translations?

@marmarek is going to make (a) separate submodule(s) for the actual translated content (#2925).

The "unverified translation" warning layouts are trickier. For example, we can't allow unverified translations of the warning itself, since a malicious translator could alter the warning such that it's no longer about the translation being unverified. So, those will probably have to stay in the main repo.

When it's about going live, the canonical English documentation should insert a language switch which contains links that are labeled with translated words and pointing to unofficial (i.e. translated) pages. Thus, (a) some minor adjustments on the canonical documentation and (b) some trust in translators etc. seem to be necessary. How to handle this?

I think this is what #2930 is about.

tokideveloper Feb 20, 2018

The "unverified translation" warning layouts are trickier. For example, we can't allow unverified translations of the warning itself, since a malicious translator could alter the warning such that it's no longer about the translation being unverified. So, those will probably have to stay in the main repo.

I see and agree. But how can we verify that a translation of the warning is correct? Here is a spontaneous idea: we enter the translated warning into several machine translation services, let each one translate the string into all languages we know well enough, and then check the translations for plausibility.

andrewdavidwong Feb 21, 2018

Member

Sounds good to me. Similarly, given how short the warning is, we could try to have multiple (hopefully) independent human translators translate (or verify) it for each language.

tokideveloper Mar 19, 2018

How to deal with fragments (*) in links?

Let me explain why it is problematic. The main concerns are the headings which get IDs created by the Markdown processor.

Let's say a translator wants to translate a link with fragment /file/#good-morning pointing to the heading Good Morning! in the document /file/.

To know how to translate it correctly, for example into German, the translator has to do several steps:

  1. Find the file the link/URL of the fragment is pointing to. (That is, in the list of files in Transifex, find the MD file with the given permalink /file/ in the YAML header.)
  2. In that file, look for the correct heading of the target of the fragment: Good Morning!.
  3. Look for the translation of Good Morning!, which is Guten Morgen!. If it's not yet translated then translate it first.
  4. Transform a copy of that translation (Guten Morgen! to guten-morgen) to match the ID the heading Guten Morgen! will have after processing the MD file to an HTML file.
  5. Enter that transformed result (guten-morgen) as the translated fragment. The resulting link is /de-DE/file/#guten-morgen (note that inserting /de-DE is another problem not discussed in this post).

(Note that step 2 and subsequent ones are different if there is no heading but any HTML element with that ID.)

These steps are cumbersome, error-prone and inconvenient. Also, if someone changes a header again then all related links/URLs have to be found and adapted again.
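
The transformation in step 4 can be approximated as follows. This is a simplified sketch of the kind of rule a Markdown processor might apply (drop punctuation, lowercase, turn spaces into dashes), restricted to ASCII; the actual processor used for the website may behave differently, and the function name heading_id is hypothetical.

```python
import re


def heading_id(heading: str) -> str:
    """Approximate the ID a Markdown processor might derive from a heading."""
    # Assumption: only ASCII letters, digits, spaces and dashes are kept.
    kept = re.sub(r"[^A-Za-z0-9 \-]", "", heading)
    # Assumption: everything before the first letter is dropped.
    kept = re.sub(r"^[^A-Za-z]+", "", kept)
    slug = re.sub(r"[^A-Za-z0-9]", "-", kept).lower()
    return slug or "section"  # fallback when nothing is left


for heading in ("Good Morning!", "Guten Morgen!"):
    print(heading, "->", heading_id(heading))
```

A translator would have to reproduce this rule by hand for every fragment, including whatever the real processor does with umlauts or non-Latin scripts, which is part of why the manual procedure is error-prone.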

To deal with it in a better way, I suggest the following solution. The translator does NOT translate any fragments. Instead, a machine inserts additional empty anchors into the headings in the resulting HTML files. The IDs of these new anchors match the IDs of the appropriate headings in the canonical version.

Following the example:

  1. Let the heading in the (MD-processed) canonical HTML file be <h3 id="good-morning">Good Morning!</h3>.
  2. Let the heading in the (MD-processed) translated HTML file be <h3 id="guten-morgen">Guten Morgen!</h3>.
  3. Add the ID good-morning from step 1 to a new anchor within the heading in step 2: <h3 id="guten-morgen"><a id="good-morning"></a>Guten Morgen!</h3>.

(Note: Skip step 3 if both IDs in the result would be equal.)

This way, the fragments given in the canonical files will also work with(in) the translated files. Thus, /de-DE/file/#good-morning (and /de-DE/file/#guten-morgen) will work.
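
A minimal sketch of the anchor insertion described above. It makes two assumptions that are not settled in this post: headings in the canonical and the translated HTML file are paired by their position, and every heading has the form <hN id="...">. The function name add_canonical_anchors is hypothetical, and a real implementation would probably use an HTML parser rather than a regular expression.

```python
import re

HEADING = re.compile(r'<(h[1-6]) id="([^"]*)">')


def add_canonical_anchors(canonical_html: str, translated_html: str) -> str:
    """Insert <a id="canonical-id"></a> into each translated heading."""
    canonical_ids = [m.group(2) for m in HEADING.finditer(canonical_html)]
    translated_count = len(HEADING.findall(translated_html))
    if len(canonical_ids) != translated_count:
        raise ValueError("heading counts differ; cannot pair headings by position")
    ids = iter(canonical_ids)

    def insert(match: re.Match) -> str:
        canonical_id = next(ids)
        if canonical_id == match.group(2):
            return match.group(0)  # IDs are equal: skip step 3
        return f'{match.group(0)}<a id="{canonical_id}"></a>'

    return HEADING.sub(insert, translated_html)


canonical = '<h3 id="good-morning">Good Morning!</h3>\n<h3 id="faq">FAQ</h3>'
translated = '<h3 id="guten-morgen">Guten Morgen!</h3>\n<h3 id="faq">FAQ</h3>'
print(add_canonical_anchors(canonical, translated))
```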

marmarek Mar 22, 2018

Member

Hmm, it looks like applying such fixups in the md files wouldn't work, which means translated offline documentation will be slightly limited. IMO it would be desirable to come back to the idea of having all changes applied in the md files (maybe with some layout changes for that?). But we can go back to this later.

tokideveloper Mar 24, 2018

Just before I forget it: Another reason for fixups after Jekyll-execution is the translation of redirecting pages.

But this could also be done by a separate Jekyll run that uses a dedicated (i.e. language-dependent) customized redirect template /_layouts/redirect.html.

tokideveloper Mar 24, 2018

How to translate links (without a fragment) in general?

It's quite late for this important question. So, here we go:

The URL path of a translated page shall get a language-(region?-)dependent top-level directory, and the rest of the URL shall remain the same as for the canonical version.

Example: The German version of https://www.qubes-os.org/doc/contributing/ shall be https://www.qubes-os.org/de/doc/contributing/ or https://www.qubes-os.org/de-DE/doc/contributing/, depending on the language code we want to use.

Also see this post.
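As a small illustration of this rule, here is a Python sketch that puts a language directory in front of the path of a URL. The function name `localize_url` is invented for this example, and which code to pass (de or de-DE) is left open here.

```python
from urllib.parse import urlsplit, urlunsplit


def localize_url(url: str, lang_code: str) -> str:
    """Prefix the URL path with a language directory, keeping everything else."""
    parts = urlsplit(url)
    # Make sure exactly one slash separates the language code and the old path.
    old_path = parts.path if parts.path.startswith("/") else "/" + parts.path
    return urlunsplit(parts._replace(path=f"/{lang_code}{old_path}"))


print(localize_url("https://www.qubes-os.org/doc/contributing/", "de-DE"))
print(localize_url("/file/#good-morning", "de-DE"))
```

Because only the path component is touched, a fragment such as #good-morning is carried over unchanged.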

tokideveloper Mar 31, 2018

Which language code ("English", "en", "en-US", "eng" etc.) to use to differ the languages?

Currently, en is used in redirections to the canonical version. It's a language code without a specified region.

Instead, I would prefer the format LANGUAGE-REGION as listed in this ISO table (beside region-less codes). Pros are:

  • It's clear which variety to use (e.g. either British or American English (en-GB or en-US)). Note that currently, e.g. "color" and "colour" coexist in the documentation.
  • It's very unlikely that a top directory will be created in the canonical version that collides with the code. E.g. bg, meaning "background" or such, could also be the name of a top directory in the canonical version, colliding with bg for "Bulgarian". Contrarily, bg-BG is probably not "background-BackGround" or such.
  • The set of these languages/varieties is larger than without a region code.
  • It's future-proof in case that people would beg for their region-specified language down the road.
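The collision argument can be checked mechanically. The following Python sketch assumes a simplified LANGUAGE-REGION shape (two or three lowercase letters, a hyphen, two uppercase letters), which is not a full validator for any standard; the directory names in the example are hypothetical.

```python
import re

# Simplified assumption about the LANGUAGE-REGION format, e.g. "de-DE".
LANGUAGE_REGION = re.compile(r"[a-z]{2,3}-[A-Z]{2}")


def is_language_region(code: str) -> bool:
    """Return True if the code has the assumed LANGUAGE-REGION shape."""
    return LANGUAGE_REGION.fullmatch(code) is not None


def colliding_codes(codes, top_directories):
    """Return the language codes that are also names of top directories."""
    return sorted(set(codes) & set(top_directories))


# Hypothetical top directories of the canonical site.
top_dirs = ["doc", "news", "bg"]
print(colliding_codes(["bg", "de"], top_dirs))        # region-less codes
print(colliding_codes(["bg-BG", "de-DE"], top_dirs))  # LANGUAGE-REGION codes
```

A check like this could run whenever a new top directory or a new language is added, whichever code format is chosen in the end.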

One downside is that we would have to add redirections from (or permalinks to?) the en-US versions (the canonical version is written in American English, isn't it?) in the YAML front matters. Also note that Wikipedia seems to be fine with region-less language codes for its subdomains.

How to deal with the permalink URLs of the canonical version? I see two main ways:

  • We don't touch them (i.e. don't add an en-US top directory),
  • we add an en-US top directory.

While the first one will

  • keep things simple for the canonical version and
  • mark the canonical version as the canonical version better,

the latter has the advantage that all paths would start with a language code, making them consistent. I'm open to both options.

What do you think?

andrewdavidwong Mar 31, 2018

Member

Definitely this one:

We don't touch them (i.e. don't add an en-US top directory),

No language or region code for the canonical URLs. (And this is not elitism about English, BTW. I would say the same thing if the documentation were in any other language.)

There are good reasons that no major website has language or region codes in any of their canonical URLs. However, if anyone can provide a counterexample (of a major website that does this), I'd be interested to see it.

Other than that, sounds good to me.

tokideveloper Mar 31, 2018

Definitely this one:

We don't touch them (i.e. don't add an en-US top directory),

No language or region code for the canonical URLs. (And this is not elitism about English, BTW. I would say the same thing if the documentation were in any other language.)

There are good reasons that no major website has language or region codes in any of their canonical URLs. However, if anyone can provide a counterexample (of a major website that does this), I'd be interested to see it.

Entering the URL to the website of Mozilla https://www.mozilla.org/ redirects to https://www.mozilla.org/de/ for me.

Entering https://www.mozilla.org/en/ redirects to https://www.mozilla.org/en-US/ in my case.

There is also a language switch on the bottom offering other languages.

It seems that they use both LANGUAGE-REGION and LANGUAGE codes, mixed. The only rule I can see there is: if there are at least two translations into the same language but for different regions, then use LANGUAGE-REGION. (Otherwise, use LANGUAGE-REGION or LANGUAGE.)

There are also codes which aren't in the mentioned list, e.g. Frysk (fy-NL). Don't know where it's from.

EDIT: Interestingly, when I visit https://www.mozilla.org/de/ using the text web browser elinks then I can see a list of links on top of the page. These links point to the available languages. The two top-most links are:

A "canonical" link on Wikipedia also points to the German version in my case. So, maybe we don't really understand "canonical"? END OF EDIT.

andrewdavidwong Mar 31, 2018

Member

Interesting. I agree that this is a good counterexample, and I agree that what you describe in your edit is puzzling. I think both approaches are reasonable. In our case, it might still make sense to leave the canonical English version without a language code, since there's no way our localization will be as thorough as Mozilla's anytime soon.

marmarek Mar 31, 2018

Member

👍 for keeping canonical version without language code - if nothing else, to clearly mark it as canonical one.

marmarek Mar 31, 2018

Member

As for language codes with or without region - indeed, adding a region code seems reasonable.

tokideveloper Apr 1, 2018

Thank you both Andrew and Marek!

Let's summarize it:

  • For the official version (I'll call it "official" now rather than "canonical"): No language or region code (thus: no new top directory in the URL).
  • For translated versions: Always a top directory in the URL containing the language code together with a region code formatted as LANGUAGE-REGION.

However, for internal processing purposes only, I suggest to use en for the official version. Reasons:

  • en is currently used in the redirection paths. So, I'll just use an existing name and won't create an additional one.
  • en is neither en-US nor en-GB and thus fits our current "needs" of using an "almost-English" language due to the lack of native speakers.
  • en doesn't steal either en-US or en-GB and thus could be adapted in the future in case we get enough native speakers.
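Inside scripts, the summary above could boil down to a tiny mapping like the one in this Python sketch. The names `OFFICIAL`, `url_prefix` and `page_url` are made up for illustration; only the rule itself (no prefix for the official version, /LANGUAGE-REGION for translations, en as internal-only name) comes from this thread.

```python
# Internal-only code for the official version; it never shows up in URLs.
OFFICIAL = "en"


def url_prefix(internal_code: str) -> str:
    """Return the top directory for a language, or nothing for the official one."""
    return "" if internal_code == OFFICIAL else f"/{internal_code}"


def page_url(internal_code: str, permalink: str) -> str:
    """Build the site-relative URL of a page from its canonical permalink."""
    return url_prefix(internal_code) + permalink


print(page_url("en", "/doc/contributing/"))
print(page_url("de-DE", "/doc/contributing/"))
```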

andrewdavidwong Apr 1, 2018

Member

However, for internal processing purposes only, I suggest to use en for the official version.

I guess it depends on what practical effects this will have on our workflow. If it only happens inside of scripts (i.e., documentation contributors and maintainers don't have to change anything), then I'm on board.

tokideveloper Jun 2, 2018

However, for internal processing purposes only, I suggest to use en for the official version.

I guess it depends on what practical effects this will have on our workflow. If it only happens inside of scripts (i.e., documentation contributors and maintainers don't have to change anything), then I'm on board.

@andrewdavidwong I see. I'm not sure yet but we'll see.

tokideveloper Jun 2, 2018

@andrewdavidwong, @marmarek

I reviewed my algorithm shown in a previous post. Here are my outcomes:

  • The handling of copies (steps 1, 5 and 6 (in a sense)) should be discussed in another thread.
  • The handling of differences (step 7) is not really explained. I have thought about it now, and the result is that I have to adapt the algorithm to get it to work right. So, here is the new version of how to treat a Markdown file (without mentioning the copy thing):
  1. Get a list of all existing permalinks and redirect_from links as listed in the YAML front matters of all files.
  2. Automatically, in the file, prefix all paths that are in that list with the placeholder %UndecidedLangPrefix%. The resulting state of the file may be called "UndecidedVersion".
  3. If available, apply the patch Decision.patch generated during step 7 of the last run. Rejected hunks may be ignored or even deleted.
  4. If there is still an %UndecidedLangPrefix% placeholder within the file then notify a person responsible to do this:
    1. Replace all occurrences of %UndecidedLangPrefix% with %LangPrefix% if the concerned links have to be translated (most frequent case).
    2. Replace all occurrences of %UndecidedLangPrefix% with %NoLangPrefix% if the concerned links must not be translated (probably seldom).
  5. Check that there is no %UndecidedLangPrefix% in the file. If there is one then go back to step 4.
  6. The current state of the file may be called "DecidedVersion".
  7. Save the difference from "UndecidedVersion" to "DecidedVersion" as a patch called Decision.patch.
  8. Upload the file to Transifex and tell the translators not to touch the placeholders.
  9. Download a translated version of that file from Transifex. Let's say it's in German.
  10. Replace all occurrences of %LangPrefix% EDIT and %ExtraLangPrefix% END EDIT with /de-DE.
  11. Replace all occurrences of %NoLangPrefix% with the empty string.
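The automatic parts of the list (step 2 and steps 10 and 11) might look roughly like this in Python. It is a sketch under assumptions: only Markdown inline links of the form `[text](/path/)` are detected, the list of known paths from step 1 is passed in ready-made, and the function names are hypothetical. The manual decision in step 4 is not modelled.

```python
import re

UNDECIDED = "%UndecidedLangPrefix%"


def mark_known_paths(text: str, known_paths) -> str:
    """Step 2: prefix every known path used as an inline link target."""
    if not known_paths:
        return text
    # Try longer paths first so that /doc/a/ is not cut short by /doc/.
    ordered = sorted(known_paths, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    # Assumption: a link target starts right after "(" and ends at ")" or "#".
    pattern = re.compile(r"(?<=\()(" + alternatives + r")(?=[)#])")
    return pattern.sub(lambda m: UNDECIDED + m.group(1), text)


def resolve_placeholders(text: str, lang_code: str) -> str:
    """Steps 10 and 11: turn the decided placeholders into real prefixes."""
    for placeholder in ("%LangPrefix%", "%ExtraLangPrefix%"):
        text = text.replace(placeholder, f"/{lang_code}")
    return text.replace("%NoLangPrefix%", "")


source = "See [intro](/doc/intro/) and [top](/doc/intro/#top)."
print(mark_known_paths(source, ["/doc/intro/"]))
decided = "See [intro](%LangPrefix%/doc/intro/) and [raw](%NoLangPrefix%/doc/x/)."
print(resolve_placeholders(decided, "de-DE"))
```

Links written in other forms (reference-style links, raw HTML) would be missed by this pattern; those are the cases the %ExtraLangPrefix% label is meant to cover.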

By using the patch Decision.patch, we'll save time in the next runs since only these spots of %UndecidedLangPrefix% must be adapted where the patch couldn't be applied.
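Creating Decision.patch (step 7) can be sketched with Python's standard difflib module. The function name `make_decision_patch` is invented here; applying the patch in step 3 is not covered, since that would be left to an external tool such as patch.

```python
import difflib


def make_decision_patch(undecided: str, decided: str) -> str:
    """Return a unified diff from the UndecidedVersion to the DecidedVersion."""
    diff = difflib.unified_diff(
        undecided.splitlines(keepends=True),
        decided.splitlines(keepends=True),
        fromfile="UndecidedVersion",
        tofile="DecidedVersion",
    )
    return "".join(diff)


undecided = "See [intro](%UndecidedLangPrefix%/doc/intro/).\n"
decided = "See [intro](%LangPrefix%/doc/intro/).\n"
print(make_decision_patch(undecided, decided))
```

If nothing had to be decided, the result is an empty string, so a script could skip writing the patch file in that case.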

EDIT As an additional step between 5 and 6 or between 7 and 8: Where necessary, add %ExtraLangPrefix% labels in front of all paths to translate that erroneously have not been detected. Save it as a patch and apply that patch in an earlier step in future runs. END EDIT

Of course, already existing sub-strings in the original files that are equal to the placeholders have to be escaped/treated specially.
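One possible way to do that escaping, as a Python sketch: before step 2, rewrite any literal placeholder text into a form that the later replacements cannot match, and restore it after step 11. The `%Literal(...)%` notation and both function names are assumptions made up for this example, not an agreed convention.

```python
PLACEHOLDER_NAMES = (
    "UndecidedLangPrefix",
    "LangPrefix",
    "NoLangPrefix",
    "ExtraLangPrefix",
)


def escape_literals(text: str) -> str:
    """Protect placeholder-like text that already exists in the original file."""
    for name in PLACEHOLDER_NAMES:
        # Hypothetical escaped form; it contains no real placeholder.
        text = text.replace(f"%{name}%", f"%Literal({name})%")
    return text


def unescape_literals(text: str) -> str:
    """Restore the protected text after all placeholders have been resolved."""
    for name in PLACEHOLDER_NAMES:
        text = text.replace(f"%Literal({name})%", f"%{name}%")
    return text


original = "This page explains the %LangPrefix% placeholder."
escaped = escape_literals(original)
print(escaped)
print(unescape_literals(escaped) == original)
```

A file that already contains the escaped form itself would still need special care, so this only moves the problem to a much less likely string.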

If a demo example is needed then I'll write and post one.
