Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace ICU ParagraphLayout with something else? #6922

Open
matthijskooijman opened this issue Sep 30, 2018 · 21 comments
Open

Replace ICU ParagraphLayout with something else? #6922

matthijskooijman opened this issue Sep 30, 2018 · 21 comments
Labels
Milestone

Comments

@matthijskooijman
Copy link
Contributor

@matthijskooijman matthijskooijman commented Sep 30, 2018

OpenTTD currently uses the ICU ParagraphLayout API for doing
internationalized wordwrapping. This usage is a problematic, since
ICU deprecated and removed their layout API for being buggy and
unmaintained. The ParagraphLayout API that OpenTTD uses is still present
in ICU, but it relies on the removed Layout API. This gap can be be
filled using the icu-le-hb library, which implements the layout API on
top of the Harfbuzz library.

However, it seems icu-le-hb is more of a proof of concept and not really
maintained either. Additionally, because ParagraphLayout is in ICU and
icu-le-hb depends on ICU, this gives an undesirable circular dependency,
which is problematic for at least Debian, since it complicates updating
to new ICU versions,.

See https://bugs.debian.org/894159 for more info on this, in particular
https://bugs.debian.org/894159#45 and onwards. The Debian version of
this issue is https://bugs.debian.org/897233

It seems it would be beneficial to move away from the ICU
ParagraphLayout API, so it can be removed from Debian (OpenTTD is the
only user in Debian), but also because it has indeed proven to be buggy
in the past (OpenTTD talks about ICU-related crashes, though this was
before the icu-le-hb implementation).

In the above linked bug, the following was written:

Another solution is of course to disable ParagraphLayout. László also
asked if OpenTTD, being the only user of this API, could migrate to
another solution. I've discussed this with OpenTTD upstream yesterday,
and they were already aware of the layout API removal and have been
casually looking at Harfbuzz and Pango as a replacement, but they do not
see an easy solution yet. ParagraphLayout seems to fit their usecase
quite neatly: they need internationalized word-wrapping of text (e.g.
also supporting right-to-left locales). Harfbuzz does not seem to offer
that, and Pango seems heavy-handed (and might not be easy to adapt to
OpenTTD's SDL renderer, and might not be portable enough).
This is bad to read. I had the hope there's an easy solution and/or the
replacement might be already in the making.

Neither me or upstream has much experience in this field, perhaps you
have a different suggestion for an alternative?
I don't know any other alternative. Only OpenTTD uses the Paragraph Layout
API and it makes me wonder what other solution the other projects use? I
may think other games like Lincity-NG[3] also need internationalized text
placement and/or LibreOffice still need to handle this as well. Do these
have an alternative solution?

Since the above was written, the ICU-in-Debian folks have asked about
any progress on the OpenTTD side of things, since they really like us to
remove this dependency from OpenTTD.

This has been discussed on IRC on a few occasions. Some observations:

  • The ICU layout stuff was removed in ICU 58.
  • Pango does layouting, but does not handle sorting, collation, etc. So
    when using Pango for layout, we'd need ICU anyway. Pango is fairly
    heavy-handed, depending on (I think) glib and gdk rendering as well.
  • IIRC Harfbuzz handles rendering of glyphs and lines, but not
    directionality detection and multi-line layouting.
  • icu-le-hb seems to be mostly unmaintained, and might be meant more of
    an example implementation than a production library. I believe it is
    mostly an API translation layer from HB to ICU Layout, without (much)
    code that actually does something.
  • ICU seems to be generally buggy too, so if we can stop depending on
    (parts of) it, that could be a feature. It seems we'll still need it
    for collating, but it could be that the ParagraphLayout code is the
    most buggy part (wishful thinking)?
  • Vendoring icu-le-hb in OpenTTD does not seem to work, since we depend
    on ICU ParagraphLayout which depends on ICU Layout (now provided by
    icu-le-hb). Vendoring both icu-le-hb and the ParagraphLayout code
    from ICU could solve this.
  • Taking the code from icu-le-hb and ICU ParagraphLayout and creating a
    new library from that might be an interesting approach. The code can
    then probably be cleaned up from there, since the ICU Layout API is
    no longer required, and can be removed internally. It might also not
    be needed to preserve the ParagraphLayout API exactly, if we feel
    that should be cleaned up (to be closer to the HB API for example),
    we can also do that.
  • It might be interesting to see what other software does (e.g.
    LibreOffice can probably do this thing as well). Does everyone use
    Pango? Or do they implement their own word-wrapping on top of HB
    layouting?

Did anyone make any progress on this subject in the last few months? @LordAro, @TrueBrain?

@nielsmh
Copy link
Contributor

@nielsmh nielsmh commented Sep 30, 2018

I would strongly considering the option of "just winging it". Of all the languages OpenTTD is currently translated into, all except three use regular ASCII spaces to separate words and indicate acceptable line break points.

The exceptions are the two Chinese langauges, and Japanese, and for those three you can more or less insert line breaks at any point. A slightly more correct algorithm would avoid inserting breaks before punctuation characters and after opening brackets, but I think those can be looked up in the Unicode database which should be included in ICU.

The full Unicode algorithm is rather long. Implementing it in full should be outside the scope of OpenTTD, but implementing a heavily simplified version might be okay?

@nielsmh nielsmh added the ICU label Sep 30, 2018
@matthijskooijman
Copy link
Contributor Author

@matthijskooijman matthijskooijman commented Oct 4, 2018

IIRC the ASCII-space approach is already implemented when compiling without ICU, so adapting that with a minimal approach for non-ASCII locales might be the easiest way to implement your suggestion. OTOH, in this case we want ICU support, but not ICU-layout support, so that might require more invasive changes. Hm.

Also, I'm not entirely sure if word-breaking is the only (or hardest) part of ICU-layout that needs to be replaced.

I guess more investigation in the current code is needed, but I can't really justify the time needed for that...

@matthijskooijman
Copy link
Contributor Author

@matthijskooijman matthijskooijman commented Oct 4, 2018

OTOH, in this case we want ICU support, but not ICU-layout support, so that might require more invasive changes.

Just had a look at the code, it seems that there's already a split between WITH_ICU_SORT and WITH_ICU_LAYOUT. But without ICU layout, the FallbackParagraphLayout class is used, which apparently supports no RTL layouting either. It does already do space-based line breaking (in NextLine) and in absence of spaces breaks whenever the line is full (which would suffice for your "you you can more or less insert line breaks at any point" proposal). It already seems to support font-switching as well (e.g. drawing a line consisting of multiple "runs" with different fonts), which seems to be handled externally to the layouter and passed on the constructor through runs:

FallbackParagraphLayout::FallbackParagraphLayout(WChar *buffer, int length, FontMap &runs) : buffer_begin(buffer), buffer(buffer), runs(runs)

If we take this route, I wonder if we still need Harfbuzz? These (ancient) Harfbuzz docs suggest Harfbuzz only does shaping of text (given a run of text, a font, language, etc. decide what glyphs to use and how to position them).

I believe the shaping in the fallback layouter is fairly simple and consists of simply summing the character widths (and I suspect the glyph selection is also a simple direct mapping of Unicode code points to glyphs in the font). I'm not sure if this would be sufficient or would really need something like Harfbuzz (which also poses the question how this interacts with the fonts we use, which I suspect are custom OpenTTD bitmap fonts, or do we use freetype stuff?).

@michicc
Copy link
Member

@michicc michicc commented Oct 4, 2018

Shaping of text is important for various languages/writing systems, most visible for Arabic, but also some indian and asian scripts. Also, proper handling of mixed RTL/LTR content (e.g. English NewGRF texts embedded in arabic) is somewhat complicated.

Aditionally, even for latin-based scripts, there are some things which need combining diacritical glyphs.

For the Windows Uniscribe implementation I took the approach to handle the Bitmap font with the fallback layouter, as our bitmap font doesn't include anything that needs complex glyph shaping. If you select a language that needs more glyphs, we try to switch to a FreeType font anyway.

@frosch123
Copy link
Member

@frosch123 frosch123 commented Oct 7, 2018

I don't think anyone will work on a replacement as long as the current one is in place.

So, for debian I recommend to configure with "--without-icu-layout".
That way we may generate some interest/demand in actually supporting complicated languages.

@matthijskooijman
Copy link
Contributor Author

@matthijskooijman matthijskooijman commented Nov 17, 2018

I tried running without ICU layout and compared the results with the language set to Arabic. I was actually expecting the layout and word wrapping to become worse, but it actually seems the entire glyph selection changed as well. E.g., this is the settings screen without ICU layout:

openttd 1 8 0_003

And here's the settings screen with ICU layout:

openttd 1 8 0_004

For easy comparison, here's a closeup. Without:

image

With:

image

I don't read Arabic, so it looks to me like the text is completely different (and if it is correct with ICU layout, it would be likely that it is wrong without ICU layout). Perhaps the text is essentially the same, but some characters are intended to kerned or otherwise combined, changing their appearance?

I suspect that we are now no longer using ICU for any layouting, while I believe we could still use ICU for glyph selection and maybe some other stuff too?

Any Arabic speakers reading along here that can tell me how the text in the two versions compares? Is the second one incorrect, or just a bit less elegant? Does it even say the same thing?

Another thing is that on startup, OpenTTD now shows an error message saying there is no support for RTL-languages (which might not be entirely correct, there is at least some support) and suggesting to compile with ICU enabled (which might be counterproductive).

Summarizing: I'm not quite ready to drop ICU layout support for the Debian packages just yet, this needs a bit more investigation (but I am intending to drop it, as suggested by Frosch to perhaps force some progress, or at least make the life of the ICU Debian maintainers a bit easier).

@khaledhosny
Copy link

@khaledhosny khaledhosny commented Nov 17, 2018

The first image is completely broken (the text is left-to-right instead of right-to-left, and the glyphs has the wrong contextual shapes), the second one is mostly correct but the mixed Arabic and English text seems to be incorrectly re-ordered (the order of the Arabic and English text makes no sense, which likely means bidirectional text is incorrectly set up).

This is surprising, since the ICU layout version should be the correct, which makes me wonder if the source text is correct (can you kink to it) or whether some thing is broken with ICU layout or how it is used.

@matthijskooijman
Copy link
Contributor Author

@matthijskooijman matthijskooijman commented Nov 17, 2018

This is surprising, since the ICU layout version should be the correct, which makes me wonder if the source text is correct (can you kink to it) or whether some thing is broken with ICU layout or how it is used.

Oh, I swapped the images, w00ps. I just updated thet comment to fix the labeling (the images are still in the same order, so your comment still applies).

So indeed, disabling ICU layout completely breaks RTL languages, then. That's a bit more severe than I had hoped, but perhaps fixing glyph selection / shaping and adding RTL detection would not be too hard.

I also compared the Chinese (Traditional) language, and as @nielsmh predicted that only has some small word-wrapping differences, but otherwise looks identical.

Looking at the ICU API, it seems it still offers some useful building blocks that could be used to implement the missing piecies:

  • BreakIterator for finding linebreak positions and character separation points (for editing).
  • Bidirectional algorithm, which is an implementation of the Unicode bidi algorithm, figuring out what characters are LTR and RTL.
  • Shaping for Arabic text. I wonder if other languages also need shaping?

An alternative for shaping could be to use Harfbuzz, which AFAIU supports (only) shaping. It also seems to suggest shaping is needed for more than arabic (also syriac, indic, thai, ...), but perhaps that's what ICU means with "arabic languages"?

@matthijskooijman
Copy link
Contributor Author

@matthijskooijman matthijskooijman commented Nov 17, 2018

I also checked Hebrew, the other RTL language supported by OpenTTD now, which also has the character order reversed (but the glyphs seem to be the same, so it probably requires no shaping).

@nielsmh
Copy link
Contributor

@nielsmh nielsmh commented Nov 17, 2018

Arabic languages are quite specifically the languages using the same family of writing systems. There's some differences in orthography and alphabet but I believe they are all supported by a single shaping engine and Unicode block. The ICU u_shapeArabic function is not really suited for GUI applications, what it does is basically replace "natural" Unicode code points in the text with "compatibility" code points representing pre-shaped Arabic text. As they write in the documentation linked, it's mainly intended for character terminals and similar where a full Unicode text shaping system isn't practical, but Arabic text is still required.

Indian languages like Devanagari (easily identified by their "hanging baseline") are a separate beast, and while they don't require the bi-directional algorithm, they instead require different, complex reordering of glyphs. They depend on tables in the font data for this, and for combining multiple code points into single glyphs. (The latter should be very similar to OpenType automatic ligatures for Latin alphabet text.)

The conclusion must be, if only ICU is available then u_shapeArabic could be used for Arabic, but if a full Harfbuzz-based shaping engine is available then that should be used instead. The game should probably prevent the user from choosing a UI language without layout support too.

@matthijskooijman
Copy link
Contributor Author

@matthijskooijman matthijskooijman commented Nov 17, 2018

So, it sounds like adding harfbuzz as a dependency for shaping makes sense. We'll need something else for the bidirectionality detection then - The Harfbuzz docs suggest using ICU or fribidi. For linebreaking, the Harfbuzz docs also suggest using the ICU break iterator API.

I'm about to upload a version to Debian that compiles without the ICU layout API. After that, I probably won't find the time to actually implement Harfbuzz in OpenTTD. Any takers for this project?

@khaledhosny
Copy link

@khaledhosny khaledhosny commented Nov 17, 2018

I haven’t checked what ICU Layout does exactly, but ICU + HarfBuzz should have all the API one needs to replicate its functionality, so I think it isn’t a bad idea for OpenTTD to just copy it, replace the uses of the deprecated API with direct HarfBuzz calls and always use it. ICU shaping is limited and supports only a subset of Arabic and in general is of limited use, HarfBuzz should be used instead.

@LordAro
Copy link
Member

@LordAro LordAro commented Nov 17, 2018

I had my concerns about Harfbuzz (& Pango) previously, as they pull in a lot of extra dependencies via glib. However, now that ICU is Linux-only (well, not Windows or OSX) anyway, it's probably less of an issue

@nielsmh
Copy link
Contributor

@nielsmh nielsmh commented Nov 17, 2018

Should we perhaps flag UI languages that require shaping in a way that prevents them from being used in builds that lack complex text support?

An additional issue, tangential to this, is: What about user-supplied text, what is the primary reading direction for that? Things such as company names, signs, and chat messages. Just assume the UI language directionality, or somehow extend user-supplied strings with input language used? For instance, a German player downloading a savegame from a Saudi-Arabian player, that might cause some labels to render wrong.

@khaledhosny
Copy link

@khaledhosny khaledhosny commented Nov 17, 2018

I had my concerns about Harfbuzz (& Pango) previously, as they pull in a lot of extra dependencies via glib.

HarfBuzz does not require GLib (Pango does), it has it an optional dependency for Unicode functions, alternative dependency for this is ICU so projects using ICU naturally use it for Unicode functions in HarfBuzz.

@LordAro
Copy link
Member

@LordAro LordAro commented Nov 17, 2018

https://packages.debian.org/stretch/libharfbuzz0b

Doesn't seem to be what Debian (or Arch) thinks...

@matthijskooijman
Copy link
Contributor Author

@matthijskooijman matthijskooijman commented Nov 17, 2018

What about user-supplied text, what is the primary reading direction for that?

AFAIU, the direction can be derived from the individual characters. Currently, mixing RTL and LTR text in a single string is, AFAIK already used (for things like "OpenTTD" or "readme.txt" inside a translated string, see the screenshots a few comments back) and supported (using ICU, possibly through ICU layout to figure out what parts are RTL and what parts are LTR). So, that should work out, I believe.

@matthijskooijman
Copy link
Contributor Author

@matthijskooijman matthijskooijman commented Nov 17, 2018

@LordAro, I suspect you can choose to compile HB with or without glib integration. Debian uses with glib, but since it is easy to install it, there's only a bigger install footprint for using HB on Debian, but if you need to compile everything yourself, you can just leave it out.

@stale
Copy link

@stale stale bot commented Jan 24, 2019

This issue has been automatically marked as stale because it has not had any activity in the last two months.
If you believe the issue is still relevant, please test on the latest nightly and report back.
It will be closed if no further activity occurs within 7 days.
Thank you for your contributions.

@stale stale bot added the stale label Jan 24, 2019
@LordAro LordAro removed the stale label Jan 24, 2019
@LordAro LordAro added this to the 1.9.0 milestone Jan 24, 2019
@TrueBrain
Copy link
Member

@TrueBrain TrueBrain commented Mar 3, 2019

Although we all agree this needs a solution, it is not happening for 1.9. Another attempt in 1.10?

Sorry, I hope this is not too much of an issue :(

@TrueBrain TrueBrain added the pinned label Mar 3, 2019
@TrueBrain TrueBrain modified the milestones: 1.9.0, 1.10.0 Mar 3, 2019
@LordAro LordAro modified the milestones: 1.10.0, 1.11.0 Jan 12, 2020
@LordAro
Copy link
Member

@LordAro LordAro commented Jan 12, 2020

Aaand bumping again! Turns out rewriting the text rendering engine is quite a lot of work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.