Replace ICU ParagraphLayout with something else? #6922

matthijskooijman · 2018-09-30T10:31:15Z

OpenTTD currently uses the ICU ParagraphLayout API for doing
internationalized word wrapping. This usage is problematic, since
ICU deprecated and removed their layout API for being buggy and
unmaintained. The ParagraphLayout API that OpenTTD uses is still present
in ICU, but it relies on the removed Layout API. This gap can be
filled using the icu-le-hb library, which implements the layout API on
top of the Harfbuzz library.

However, it seems icu-le-hb is more of a proof of concept and not really
maintained either. Additionally, because ParagraphLayout is in ICU and
icu-le-hb depends on ICU, this gives an undesirable circular dependency,
which is problematic for at least Debian, since it complicates updating
to new ICU versions.

See https://bugs.debian.org/894159 for more info on this, in particular
https://bugs.debian.org/894159#45 and onwards. The Debian version of
this issue is https://bugs.debian.org/897233

It seems it would be beneficial to move away from the ICU
ParagraphLayout API, so it can be removed from Debian (OpenTTD is the
only user in Debian), but also because it has indeed proven to be buggy
in the past (OpenTTD talks about ICU-related crashes, though this was
before the icu-le-hb implementation).

In the above linked bug, the following was written:

Another solution is of course to disable ParagraphLayout. László also
asked if OpenTTD, being the only user of this API, could migrate to
another solution. I've discussed this with OpenTTD upstream yesterday,
and they were already aware of the layout API removal and have been
casually looking at Harfbuzz and Pango as a replacement, but they do not
see an easy solution yet. ParagraphLayout seems to fit their usecase
quite neatly: they need internationalized word-wrapping of text (e.g.
also supporting right-to-left locales). Harfbuzz does not seem to offer
that, and Pango seems heavy-handed (and might not be easy to adapt to
OpenTTD's SDL renderer, and might not be portable enough).
This is bad to read. I had the hope there's an easy solution and/or the
replacement might be already in the making.

Neither me or upstream has much experience in this field, perhaps you
have a different suggestion for an alternative?
I don't know any other alternative. Only OpenTTD uses the Paragraph Layout
API and it makes me wonder what other solution the other projects use? I
may think other games like Lincity-NG[3] also need internationalized text
placement and/or LibreOffice still need to handle this as well. Do these
have an alternative solution?

Since the above was written, the ICU-in-Debian folks have asked about
any progress on the OpenTTD side of things, since they would really like
us to remove this dependency from OpenTTD.

This has been discussed on IRC on a few occasions. Some observations:

- The ICU layout stuff was removed in ICU 58.
- Pango does layouting, but does not handle sorting, collation, etc. So
  when using Pango for layout, we'd need ICU anyway. Pango is fairly
  heavy-handed, depending on (I think) glib and gdk rendering as well.
- IIRC Harfbuzz handles rendering of glyphs and lines, but not
  directionality detection and multi-line layouting.
- icu-le-hb seems to be mostly unmaintained, and might be meant more as
  an example implementation than a production library. I believe it is
  mostly an API translation layer from HB to ICU Layout, without (much)
  code that actually does something.
- ICU seems to be generally buggy too, so if we can stop depending on
  (parts of) it, that could be a feature. It seems we'll still need it
  for collating, but it could be that the ParagraphLayout code is the
  most buggy part (wishful thinking)?
- Vendoring icu-le-hb in OpenTTD does not seem to work, since we depend
  on ICU ParagraphLayout which depends on ICU Layout (now provided by
  icu-le-hb). Vendoring both icu-le-hb and the ParagraphLayout code
  from ICU could solve this.
- Taking the code from icu-le-hb and ICU ParagraphLayout and creating a
  new library from that might be an interesting approach. The code can
  then probably be cleaned up from there, since the ICU Layout API is
  no longer required, and can be removed internally. It might also not
  be needed to preserve the ParagraphLayout API exactly; if we feel
  it should be cleaned up (to be closer to the HB API for example),
  we can do that as well.
- It might be interesting to see what other software does (e.g.
  LibreOffice can probably do this thing as well). Does everyone use
  Pango? Or do they implement their own word-wrapping on top of HB
  layouting?

Did anyone make any progress on this subject in the last few months? @LordAro, @TrueBrain?

nielsmh · 2018-09-30T16:57:44Z

I would strongly consider the option of "just winging it". Of all the languages OpenTTD is currently translated into, all except three use regular ASCII spaces to separate words and indicate acceptable line break points.

The exceptions are the two Chinese languages, and Japanese, and for those three you can more or less insert line breaks at any point. A slightly more correct algorithm would avoid inserting breaks before punctuation characters and after opening brackets, but I think those can be looked up in the Unicode database which should be included in ICU.

The full Unicode algorithm is rather long. Implementing it in full should be outside the scope of OpenTTD, but implementing a heavily simplified version might be okay?
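
As a rough illustration of the "heavily simplified version" idea, here is a minimal C++ sketch. It is not OpenTTD code: the function names (`IsCjk`, `IsNoBreakBefore`, `IsNoBreakAfter`, `FindBreakOpportunities`) are made up for this example, and the hard-coded code point ranges and punctuation lists are a small illustrative subset, not the real Unicode line-breaking classes. A real implementation would presumably look these properties up in the Unicode data instead.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Rough test for CJK text. Assumption: these three ranges are only an
// illustrative subset of the characters that allow breaks on either side.
static bool IsCjk(char32_t c)
{
    return (c >= 0x3000 && c <= 0x30FF)   // CJK punctuation, Hiragana, Katakana
        || (c >= 0x4E00 && c <= 0x9FFF)   // CJK Unified Ideographs
        || (c >= 0xFF00 && c <= 0xFFEF);  // Halfwidth and Fullwidth Forms
}

// Closing punctuation that should not start a line (illustrative list).
static bool IsNoBreakBefore(char32_t c)
{
    switch (c) {
        case 0x3001: case 0x3002:              // ideographic comma, full stop
        case 0x300D: case 0x300F: case 0x3011: // closing CJK brackets
        case 0xFF01: case 0xFF09: case 0xFF0C: case 0xFF1F: // fullwidth ! ) , ?
            return true;
        default:
            return false;
    }
}

// Opening brackets that should not end a line (illustrative list).
static bool IsNoBreakAfter(char32_t c)
{
    switch (c) {
        case 0x300C: case 0x300E: case 0x3010: // opening CJK brackets
        case 0xFF08:                           // fullwidth (
            return true;
        default:
            return false;
    }
}

// Returns every index i such that a new line may start at text[i].
std::vector<std::size_t> FindBreakOpportunities(const std::u32string &text)
{
    std::vector<std::size_t> result;
    for (std::size_t i = 1; i < text.size(); i++) {
        char32_t prev = text[i - 1];
        char32_t next = text[i];
        if (next == U' ') continue;        // do not start a line with a space
        if (prev == U' ') {                // ASCII space: break after it
            result.push_back(i);
            continue;
        }
        // Between CJK characters, break anywhere except before closing
        // punctuation or after an opening bracket.
        if ((IsCjk(prev) || IsCjk(next)) && !IsNoBreakBefore(next) && !IsNoBreakAfter(prev)) {
            result.push_back(i);
        }
    }
    return result;
}
```

For `U"ab cd"` this reports a single opportunity at index 3, the start of the second word. A layouter would then pick the last opportunity that still fits on the line.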

matthijskooijman · 2018-10-04T13:29:45Z

IIRC the ASCII-space approach is already implemented when compiling without ICU, so adapting that with a minimal approach for non-ASCII locales might be the easiest way to implement your suggestion. OTOH, in this case we want ICU support, but not ICU-layout support, so that might require more invasive changes. Hm.

Also, I'm not entirely sure if word-breaking is the only (or hardest) part of ICU-layout that needs to be replaced.

I guess more investigation in the current code is needed, but I can't really justify the time needed for that...

matthijskooijman · 2018-10-04T13:57:46Z

OTOH, in this case we want ICU support, but not ICU-layout support, so that might require more invasive changes.

Just had a look at the code, it seems that there's already a split between WITH_ICU_SORT and WITH_ICU_LAYOUT. But without ICU layout, the FallbackParagraphLayout class is used, which apparently supports no RTL layouting either. It does already do space-based line breaking (in NextLine) and in absence of spaces breaks whenever the line is full (which would suffice for your "you can more or less insert line breaks at any point" proposal). It already seems to support font-switching as well (e.g. drawing a line consisting of multiple "runs" with different fonts), which seems to be handled externally to the layouter and passed to the constructor through runs:

OpenTTD/src/gfx_layout.cpp, line 466 in 9cf999b:

    FallbackParagraphLayout::FallbackParagraphLayout(WChar *buffer, int length, FontMap &runs) : buffer_begin(buffer), buffer(buffer), runs(runs)

If we take this route, I wonder if we still need Harfbuzz? These (ancient) Harfbuzz docs suggest Harfbuzz only does shaping of text (given a run of text, a font, language, etc. decide what glyphs to use and how to position them).

I believe the shaping in the fallback layouter is fairly simple and consists of simply summing the character widths (and I suspect the glyph selection is also a simple direct mapping of Unicode code points to glyphs in the font). I'm not sure if this would be sufficient or would really need something like Harfbuzz (which also poses the question how this interacts with the fonts we use, which I suspect are custom OpenTTD bitmap fonts, or do we use freetype stuff?).
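
To make the "summing the character widths" idea concrete, here is a minimal C++ sketch. It is not the actual fallback layouter: `AdvanceFn`, `DemoAdvance`, `MeasureRun` and `FitCount` are hypothetical names, and the widths returned by `DemoAdvance` are invented for the example. It assumes one glyph per code point, which is exactly the simplification that breaks down for scripts needing contextual shaping.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical callback: advance width in pixels of the glyph that the
// font maps directly to one code point.
using AdvanceFn = std::function<int(char32_t)>;

// Invented widths for demonstration only: 'i' is narrow, everything else wide.
int DemoAdvance(char32_t c)
{
    return c == U'i' ? 3 : 6;
}

// Width of a run when every code point becomes exactly one glyph and the
// glyphs are placed one after another. No ligatures, contextual forms,
// kerning or reordering are applied.
int MeasureRun(const std::u32string &run, const AdvanceFn &advance)
{
    assert(advance); // a width source is required
    int width = 0;
    for (char32_t c : run) width += advance(c);
    return width;
}

// Number of leading code points of `run` that fit within `max_width` pixels.
std::size_t FitCount(const std::u32string &run, int max_width, const AdvanceFn &advance)
{
    assert(advance);
    int width = 0;
    std::size_t count = 0;
    for (char32_t c : run) {
        width += advance(c);
        if (width > max_width) break;
        count++;
    }
    return count;
}
```

With the demo widths, `MeasureRun(U"wide", DemoAdvance)` is 6 + 3 + 6 + 6 = 21 pixels, and `FitCount(U"wide", 10, DemoAdvance)` is 2. A shaping engine would replace the per-code-point lookup with glyph selection and positioning that depends on the neighbouring characters.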

michicc · 2018-10-04T18:43:19Z

Shaping of text is important for various languages/writing systems, most visibly for Arabic, but also for some Indian and Asian scripts. Also, proper handling of mixed RTL/LTR content (e.g. English NewGRF texts embedded in Arabic) is somewhat complicated.

Additionally, even for Latin-based scripts, there are some things which need combining diacritical glyphs.

For the Windows Uniscribe implementation I took the approach to handle the Bitmap font with the fallback layouter, as our bitmap font doesn't include anything that needs complex glyph shaping. If you select a language that needs more glyphs, we try to switch to a FreeType font anyway.

frosch123 · 2018-10-07T16:30:32Z

I don't think anyone will work on a replacement as long as the current one is in place.

So, for Debian I recommend configuring with "--without-icu-layout".
That way we may generate some interest/demand in actually supporting complicated languages.

matthijskooijman · 2018-11-17T00:06:09Z

I tried running without ICU layout and compared the results with the language set to Arabic. I was actually expecting the layout and word wrapping to become worse, but it actually seems the entire glyph selection changed as well. E.g., this is the settings screen without ICU layout:

And here's the settings screen with ICU layout:

For easy comparison, here's a closeup. Without:

With:

I don't read Arabic, so it looks to me like the text is completely different (and if it is correct with ICU layout, it would be likely that it is wrong without ICU layout). Perhaps the text is essentially the same, but some characters are intended to be kerned or otherwise combined, changing their appearance?

I suspect that we are now no longer using ICU for any layouting, while I believe we could still use ICU for glyph selection and maybe some other stuff too?

Any Arabic speakers reading along here that can tell me how the text in the two versions compares? Is the second one incorrect, or just a bit less elegant? Does it even say the same thing?

Another thing is that on startup, OpenTTD now shows an error message saying there is no support for RTL-languages (which might not be entirely correct, there is at least some support) and suggesting to compile with ICU enabled (which might be counterproductive).

Summarizing: I'm not quite ready to drop ICU layout support for the Debian packages just yet, this needs a bit more investigation (but I am intending to drop it, as suggested by Frosch to perhaps force some progress, or at least make the life of the ICU Debian maintainers a bit easier).

khaledhosny · 2018-11-17T04:23:40Z

The first image is completely broken (the text is left-to-right instead of right-to-left, and the glyphs have the wrong contextual shapes), the second one is mostly correct but the mixed Arabic and English text seems to be incorrectly re-ordered (the order of the Arabic and English text makes no sense, which likely means bidirectional text is incorrectly set up).

This is surprising, since the ICU layout version should be the correct one, which makes me wonder if the source text is correct (can you link to it) or whether something is broken with ICU layout or how it is used.

matthijskooijman · 2018-11-17T09:12:08Z

This is surprising, since the ICU layout version should be the correct one, which makes me wonder if the source text is correct (can you link to it) or whether something is broken with ICU layout or how it is used.

Oh, I swapped the images, w00ps. I just updated that comment to fix the labeling (the images are still in the same order, so your comment still applies).

So indeed, disabling ICU layout completely breaks RTL languages, then. That's a bit more severe than I had hoped, but perhaps fixing glyph selection / shaping and adding RTL detection would not be too hard.

I also compared the Chinese (Traditional) language, and as @nielsmh predicted that only has some small word-wrapping differences, but otherwise looks identical.

Looking at the ICU API, it seems it still offers some useful building blocks that could be used to implement the missing pieces:

- BreakIterator for finding linebreak positions and character separation points (for editing).
- Bidirectional algorithm, which is an implementation of the Unicode bidi algorithm, figuring out what characters are LTR and RTL.
- Shaping for Arabic text. I wonder if other languages also need shaping?

An alternative for shaping could be to use Harfbuzz, which AFAIU supports (only) shaping. It also seems to suggest shaping is needed for more than Arabic (also Syriac, Indic, Thai, ...), but perhaps that's what ICU means with "Arabic languages"?

matthijskooijman · 2018-11-17T09:21:36Z

I also checked Hebrew, the other RTL language supported by OpenTTD now, which also has the character order reversed (but the glyphs seem to be the same, so it probably requires no shaping).

nielsmh · 2018-11-17T09:45:28Z

Arabic languages are quite specifically the languages using the same family of writing systems. There are some differences in orthography and alphabet but I believe they are all supported by a single shaping engine and Unicode block. The ICU u_shapeArabic function is not really suited for GUI applications, what it does is basically replace "natural" Unicode code points in the text with "compatibility" code points representing pre-shaped Arabic text. As they write in the documentation linked, it's mainly intended for character terminals and similar where a full Unicode text shaping system isn't practical, but Arabic text is still required.

Indic scripts like Devanagari (easily identified by their "hanging baseline") are a separate beast, and while they don't require the bi-directional algorithm, they instead require different, complex reordering of glyphs. They depend on tables in the font data for this, and for combining multiple code points into single glyphs. (The latter should be very similar to OpenType automatic ligatures for Latin alphabet text.)

The conclusion must be, if only ICU is available then u_shapeArabic could be used for Arabic, but if a full Harfbuzz-based shaping engine is available then that should be used instead. The game should probably prevent the user from choosing a UI language without layout support too.

matthijskooijman · 2018-11-17T18:07:55Z

So, it sounds like adding Harfbuzz as a dependency for shaping makes sense. We'll need something else for the bidirectionality detection then: the Harfbuzz docs suggest using ICU or fribidi. For linebreaking, the Harfbuzz docs also suggest using the ICU break iterator API.

I'm about to upload a version to Debian that compiles without the ICU layout API. After that, I probably won't find the time to actually implement Harfbuzz in OpenTTD. Any takers for this project?

khaledhosny · 2018-11-17T18:19:52Z

I haven’t checked what ICU Layout does exactly, but ICU + HarfBuzz should have all the API one needs to replicate its functionality, so I think it isn’t a bad idea for OpenTTD to just copy it, replace the uses of the deprecated API with direct HarfBuzz calls and always use it. ICU shaping is limited and supports only a subset of Arabic and in general is of limited use, HarfBuzz should be used instead.

LordAro · 2018-11-17T18:27:04Z

I had my concerns about Harfbuzz (& Pango) previously, as they pull in a lot of extra dependencies via glib. However, now that ICU is Linux-only (well, not Windows or OSX) anyway, it's probably less of an issue.

nielsmh · 2018-11-17T18:27:15Z

Should we perhaps flag UI languages that require shaping in a way that prevents them from being used in builds that lack complex text support?

An additional issue, tangential to this, is: What about user-supplied text, what is the primary reading direction for that? Things such as company names, signs, and chat messages. Just assume the UI language directionality, or somehow extend user-supplied strings with input language used? For instance, a German player downloading a savegame from a Saudi-Arabian player, that might cause some labels to render wrong.

khaledhosny · 2018-11-17T18:50:59Z

I had my concerns about Harfbuzz (& Pango) previously, as they pull in a lot of extra dependencies via glib.

HarfBuzz does not require GLib (Pango does); it has it as an optional dependency for Unicode functions. The alternative dependency for this is ICU, so projects using ICU naturally use it for Unicode functions in HarfBuzz.

LordAro · 2018-11-17T18:57:02Z

https://packages.debian.org/stretch/libharfbuzz0b

Doesn't seem to be what Debian (or Arch) thinks...

matthijskooijman · 2018-11-17T18:57:02Z

What about user-supplied text, what is the primary reading direction for that?

AFAIU, the direction can be derived from the individual characters. Currently, mixing RTL and LTR text in a single string is, AFAIK, already used (for things like "OpenTTD" or "readme.txt" inside a translated string, see the screenshots a few comments back) and supported (using ICU, possibly through ICU layout to figure out what parts are RTL and what parts are LTR). So, that should work out, I believe.
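
One common way to derive a base direction from the characters themselves is the "first strong character" rule of the Unicode Bidirectional Algorithm (rules P2 and P3): scan the string and take the direction of the first character with a strong direction. The C++ sketch below illustrates only that rule. The names (`Direction`, `IsStrongRtl`, `IsStrongLtr`, `BaseDirection`) are hypothetical, and the block ranges are a coarse stand-in for the real per-character bidi class data: whole blocks are treated as strong RTL and only Latin letters as strong LTR, which is an assumption made to keep the example short.

```cpp
#include <string>

enum class Direction { LTR, RTL, Neutral };

// Coarse approximation: treat entire RTL script blocks as strongly RTL.
// The real bidi classes are per character (digits and combining marks in
// these blocks are not strong), so this is an assumption of the sketch.
static bool IsStrongRtl(char32_t c)
{
    return (c >= 0x0590 && c <= 0x08FF)   // Hebrew, Arabic, Syriac, Thaana, ...
        || (c >= 0xFB1D && c <= 0xFDFF)   // Hebrew and Arabic presentation forms
        || (c >= 0xFE70 && c <= 0xFEFC);  // Arabic Presentation Forms-B
}

// Coarse approximation: only Latin letters count as strongly LTR here.
static bool IsStrongLtr(char32_t c)
{
    return (c >= U'A' && c <= U'Z')
        || (c >= U'a' && c <= U'z')
        || (c >= 0x00C0 && c <= 0x024F);  // Latin-1 letters, Latin Extended-A/B
}

// Direction of the first strong character, or Neutral if there is none
// (digits, spaces and punctuation are skipped).
Direction BaseDirection(const std::u32string &text)
{
    for (char32_t c : text) {
        if (IsStrongRtl(c)) return Direction::RTL;
        if (IsStrongLtr(c)) return Direction::LTR;
    }
    return Direction::Neutral;
}
```

Under these assumptions a company name typed in Hebrew or Arabic would get an RTL base direction regardless of the UI language, while "readme.txt" stays LTR. Resolving the order of mixed runs inside one string still needs the full bidi algorithm, for example from ICU or FriBidi as mentioned elsewhere in this thread.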

matthijskooijman · 2018-11-17T18:58:34Z

@LordAro, I suspect you can choose to compile HB with or without glib integration. Debian uses with glib, but since it is easy to install it, there's only a bigger install footprint for using HB on Debian, but if you need to compile everything yourself, you can just leave it out.

stale · 2019-01-24T21:05:16Z

This issue has been automatically marked as stale because it has not had any activity in the last two months.
If you believe the issue is still relevant, please test on the latest nightly and report back.
It will be closed if no further activity occurs within 7 days.
Thank you for your contributions.

TrueBrain · 2019-03-03T11:18:34Z

Although we all agree this needs a solution, it is not happening for 1.9. Another attempt in 1.10?

Sorry, I hope this is not too much of an issue :(

LordAro · 2020-01-12T20:06:59Z

Aaand bumping again! Turns out rewriting the text rendering engine is quite a lot of work

Gymnasiast · 2020-12-22T12:33:20Z

I don't know if it helps since we don't understand BiDi that well, but for OpenRCT2 we used FriBidi to reorder Arabic. This was implemented in OpenRCT2/OpenRCT2#12837.

srl295 · 2022-03-03T17:18:35Z

oh hi

it seems icu-le-hb is more of a proof of concept and not really maintained either

or it's just done and there's not much more to do with it.

Additionally, because ParagraphLayout is in ICU and
icu-le-hb depends on ICU, this gives an undesirable circular dependency,
which is problematic for at least Debian

It's painful for everyone. I have https://unicode-org.atlassian.net/browse/ICU-6829 filed with the at-that-time ideas. But I'm no longer funded for ICU. Pull requests welcome if anyone wants to help out.

nielsmh added the "component: ICU" label (Issue caused by problems with ICU layout) Sep 30, 2018

LordAro mentioned this issue Jan 4, 2019

The Debian and Ubuntu builds all depend on shockingly out of date libicu and libpng packages. #7008

Closed

stale bot added the "stale" label (Stale issues) Jan 24, 2019

LordAro removed the "stale" label (Stale issues) Jan 24, 2019

LordAro added this to the 1.9.0 milestone Jan 24, 2019

TrueBrain added the pinned label Mar 3, 2019

TrueBrain modified the milestones: 1.9.0, 1.10.0 Mar 3, 2019

LordAro modified the milestones: 1.10.0, 1.11.0 Jan 12, 2020

TrueBrain removed the pinned label Jan 3, 2021

TrueBrain removed this from the 1.11.0 milestone Jan 5, 2021

TrueBrain mentioned this issue Apr 30, 2023

Feature: [Linux] RTL support with modern versions of ICU #10747

Merged

6 tasks

TrueBrain closed this as completed in #10747 May 1, 2023

glx22 mentioned this issue Jan 13, 2024

[Bug]: Font rendering is ugly on linux #11765

Closed
