Use ICU for UTF-8 conversion and add support for unicode uppercasing #7385

Fusxfaranto · 2018-04-06T19:46:58Z

Fixes #6938. Unfortunately, this adds a great deal of complexity, even with only supporting the non-special cases for UTF-8 uppercasing. Still, I don't think there's a significantly better approach here. Notably, a large table is needed, which I decided to stick in localisation/ConversionTables.cpp (is there a better place for it?).

Fusxfaranto · 2018-04-06T19:50:00Z

src/openrct2/core/String.cpp

+                uint32 upperWord;
+                if ((*src & 0xF0) == 0xF0)
+                {
+                    word = ((uint8)src[0] << 24) + ((uint8)src[1] << 16) + ((uint8)src[2] << 8) + (uint8)src[3];


An invalid UTF-8 string could cause an out-of-bounds access here. Is this a possible input? I assumed not, but if it is, I should definitely add a null check for each index.
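For illustration, a minimal sketch of the per-byte check discussed above, assuming the input is a NUL-terminated string. The names `Utf8SequenceLength` and `PackUtf8Sequence` are hypothetical and are not taken from the PR. For a four-byte sequence the packed word has the same layout as the shifts in the hunk. It is not a full UTF-8 validator (overlong encodings are not rejected).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Expected length of a UTF-8 sequence given its lead byte, or 0 if the byte
// cannot start a sequence (a continuation byte or an invalid lead byte).
static size_t Utf8SequenceLength(uint8_t lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Packs the raw bytes of the sequence at src into *word, most significant
// byte first. Returns the number of bytes consumed, or 0 if the sequence is
// malformed or cut short by the NUL terminator.
static size_t PackUtf8Sequence(const char* src, uint32_t* word)
{
    assert(src != nullptr && word != nullptr);
    const size_t length = Utf8SequenceLength(static_cast<uint8_t>(src[0]));
    if (length == 0) return 0;

    uint32_t packed = static_cast<uint8_t>(src[0]);
    for (size_t i = 1; i < length; i++)
    {
        const uint8_t byte = static_cast<uint8_t>(src[i]);
        // A NUL terminator fails this test as well, so the loop stops before
        // reading past the end of the string.
        if ((byte & 0xC0) != 0x80) return 0;
        packed = (packed << 8) | byte;
    }
    *word = packed;
    return length;
}
```

A caller would advance `src` by the returned length and treat 0 as "skip or reject this byte", depending on how invalid input should be handled.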

AaronVanGeffen · 2018-04-07T16:16:17Z

Thank you for your PR. I'm sure it was an interesting issue to work on.

Personally, I think we should really start thinking about using ICU for unicode conversions and transliterations instead of building our own implementation for this sort of thing. We can drop a lot of our conversion tables at that point, too.

Gymnasiast · 2018-04-09T07:20:05Z

I second that. ICU is also required for proper support for Arabic and other RTL languages.

janisozaur · 2018-04-09T07:21:45Z

A vote for ICU from me too. The only thing that's left now is for someone to implement it.

IntelOrca · 2018-04-09T07:42:23Z

LCMapStringEx can be used on Windows to do unicode transformations.

IntelOrca · 2018-04-13T10:03:47Z

@Fusxfaranto What platform / OS are you developing on?

janisozaur · 2018-04-13T11:49:21Z

@IntelOrca

Linux/gcc

IntelOrca · 2018-04-13T16:02:51Z

Ok cool, would you be able to try moving our current tables to instead use libicu?

Broxzier · 2018-04-13T17:11:15Z

test/tests/StringTest.cpp

+
+TEST_F(StringTest, ToUpper_Basic)
+{
+    utf8 *actual = (utf8*)alloca(20 * sizeof(utf8));


Any reason for using alloca over utf8 actual[20]? The compiler knows the size already.

Fusxfaranto · 2018-04-13

Nope, good catch.

Fusxfaranto · 2018-04-13T17:43:56Z

Sure, I'll give it a shot this weekend.

IntelOrca · 2018-04-13T17:45:16Z

@Fusxfaranto Thanks, you only need to do it for non-Windows platforms. I will write the Windows side of the code afterwards using the native Windows API.

Fusxfaranto · 2018-04-13T17:54:01Z

Sounds good, someone else will need to verify the changes on OSX though. Also, should I update this PR, or make a new one?

IntelOrca · 2018-04-13T18:02:47Z

@Fusxfaranto You can update this existing one.

Fusxfaranto · 2018-04-14T20:48:13Z

I attempted to start on this, but it appears I don't have enough competence in CMake to actually add the ICU dependency (just using find_package(ICU 55.0 REQUIRED) fails with a mysterious error message that I can't manage to decipher). If someone is willing to figure out most of the setup, I'm happy to do the implementation, but otherwise I don't think I can make much progress.

janisozaur · 2018-04-14T20:56:13Z

You can leave the ICU handling in CMake to me, for the initial PR you can use whatever works on your system, which I imagine would be to add -licuuc -licudata -licui18n to libopenrct2, as based on .pc files.

Fusxfaranto · 2018-04-16T02:14:12Z

Didn't manage to get the tests to link correctly, so I haven't actually run them yet, but I figured I could leave that to @janisozaur.

Gymnasiast · 2018-04-25T09:32:46Z

Could you rebase this PR to develop?

AaronVanGeffen · 2018-05-06T12:57:43Z

Pitched in on some of the work required. I have raised the minimum cmake version to 3.7, as that version introduces a FindICU module. cmake isn't playing nice yet, though, so this requires some more attention. @janisozaur, perhaps you could have a look?

As we are introducing a dependency on icu, I think this is a good time to get rid of the iconv dependency in win1252_to_utf8 (Localisation.cpp) as well. The icu package provides nicer C++ abstractions which we should leverage.

AaronVanGeffen · 2018-05-06T13:26:27Z

src/openrct2/core/String.cpp

+        icu::UnicodeString convertString(src.data(), codepage);
+        std::string result;
+
+        if (dstCodePage == CODE_PAGE::CP_UTF8)


Note: this isn't converting from UTF-8 to any other code page yet.

IntelOrca · 2018-05-06

Is it not possible for you to feed the same code page IDs to ICU?

AaronVanGeffen · 2018-05-06

The UnicodeString class does not seem to have such a constructor, no, unfortunately… It looks like going from a UnicodeString to any other encoding has to be done with a bunch of C-style functions instead, too. Less than ideal.

IntelOrca · 2018-05-06

Are there any functions in libicu to convert from number to name?

Fusxfaranto · 2018-05-06

Not as far as I'm aware. I think the best option will be something like sprintf(s, "windows-%s", codepage), but I'm not sure that works for everything, so I'll have to check that.

AaronVanGeffen · 2018-05-06

ucnv_getStandard seems to be the function we're looking for.

Fusxfaranto · 2018-05-06

Hmm, are you sure? I was under the impression that that function took an ICU-internal index.
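A minimal sketch of the name-building idea from the thread above, assuming the code page is held as an integer (hence `%d` rather than `%s`). `ConverterNameForCodePage` is a hypothetical helper, and the value 65001 for UTF-8 is the Windows code page number, assumed here to match the project's own constant. Whether ICU recognises every resulting `windows-N` name would still need checking, as the comment says.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: builds a converter name such as "windows-1252" from a
// numeric code page. UTF-8 is special-cased because it has no "windows-N" name.
static std::string ConverterNameForCodePage(int codePage)
{
    assert(codePage > 0);
    if (codePage == 65001) // Windows code page number for UTF-8 (assumed)
    {
        return "UTF-8";
    }

    char buffer[32];
    std::snprintf(buffer, sizeof(buffer), "windows-%d", codePage);
    return buffer;
}
```

The resulting string is the kind of value that could be passed wherever a converter name is expected, with a fallback for names the library rejects.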

ccoors · 2018-05-14T19:35:11Z

Can another example be added to the string test? The single char 'ﬁ' should uppercase to two chars, 'F' and 'I'.
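To make the expectation concrete, here is a small standard-library sketch of one-to-many uppercasing. It is not ICU and not the PR's implementation: `ToUpperSketch` and its two-entry table are hypothetical, covering only ASCII plus the ligature 'ﬁ' (U+FB01) and 'ß' (U+00DF) to show that the output can differ in length from the input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

struct SpecialCase
{
    const char* lower; // UTF-8 bytes of the lowercase form
    const char* upper; // UTF-8 bytes of the full uppercase mapping
};

// Illustrative subset only; a real implementation would defer to a library.
static const SpecialCase kSpecialCases[] = {
    { "\xEF\xAC\x81", "FI" }, // U+FB01 LATIN SMALL LIGATURE FI
    { "\xC3\x9F", "SS" },     // U+00DF LATIN SMALL LETTER SHARP S
};

static std::string ToUpperSketch(const std::string& input)
{
    std::string result;
    size_t i = 0;
    while (i < input.size())
    {
        bool matched = false;
        for (const SpecialCase& entry : kSpecialCases)
        {
            const size_t length = std::strlen(entry.lower);
            assert(length > 0); // an empty entry would never advance i
            if (input.compare(i, length, entry.lower) == 0)
            {
                result += entry.upper;
                i += length;
                matched = true;
                break;
            }
        }
        if (matched) continue;

        // ASCII letters are shifted; every other byte is copied unchanged.
        const unsigned char byte = static_cast<unsigned char>(input[i]);
        result += (byte >= 'a' && byte <= 'z') ? static_cast<char>(byte - 'a' + 'A') : input[i];
        i++;
    }
    return result;
}
```

Because the three-byte ligature becomes two ASCII bytes, a fixed-size destination buffer sized from the input is not a safe assumption in general.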

AaronVanGeffen · 2018-05-14T20:21:21Z

Some things left to do for this PR:

* Implement String::ToUpper for Windows. @IntelOrca suggests using LCMapStringEx.
* Test for any regressions. Check for performance regressions, too.
* Travis needs to install libicu before building. The iconv dependency can be dropped. (@janisozaur, could you tackle this?)

Needs external dependencies update, and not blocking for this PR imo:

* Android needs libicu added to its dependencies, or an equivalent library should be used.

AaronVanGeffen · 2018-05-17T19:48:46Z

It looks like all tests are passing with the latest amendments. From my perspective, all that's left for this PR is to get ICU bundled in the Xcode dependency package, if the version macOS is shipping isn't usable. I'm hoping @LRFLEW will be able to do this.

As @janisozaur mentioned, the Android dependencies will need to be updated as well. This will be tackled along with #6771, and is therefore not blocking for this PR.

AaronVanGeffen · 2018-05-22T17:11:27Z

I've spent a good hour fiddling with Xcode to get the dependencies sorted. To do this, I used binaries from Homebrew, whose identity paths I adjusted using install_name_tool.

* Update SDL2 to version 2.0.8.
* Introduce dependency on ICU, bundling version 61.1.
* Introduce FreeType2 dependency now that we have adopted SDL2_ttf in our own sources.
* Drop dependency on external SDL2_ttf.

CI appears to be happy, too. However, I'd really appreciate it if someone other than me could try an Xcode build of the game on this branch.

Gymnasiast · 2018-05-23

Tested this on macOS. It compiles correctly and banners correctly display upper case text.

Fusxfaranto force-pushed the develop branch from d70af4c to 619943e Compare April 16, 2018 02:11

AaronVanGeffen added the "pending rebase" label (PR needs to be rebased) Apr 26, 2018

AaronVanGeffen force-pushed the develop branch from 619943e to 6282a27 Compare May 6, 2018 12:53

AaronVanGeffen added the "work in progress" label and removed the "pending rebase" label May 6, 2018

AaronVanGeffen force-pushed the develop branch from f258920 to e92dee2 Compare May 6, 2018 18:07

AaronVanGeffen force-pushed the develop branch 2 times, most recently from b631c6a to 1394352 Compare May 14, 2018 18:58

AaronVanGeffen force-pushed the develop branch from 5ac46f4 to 1888e64 Compare May 14, 2018 19:44

AaronVanGeffen force-pushed the develop branch from 9c9a2aa to 7a252ea Compare May 17, 2018 19:27

AaronVanGeffen and others added 17 commits May 22, 2018 17:37

Use ICU for converting strings to UTF-8 instead of our own tables.

ee8bf9b

Co-authored-by: Fusxfaranto <fusxfaranto@gmail.com>

Split off GetIcuCodePage to its own function.

ea80f0e

Allow converting strings between code pages in both directions.

f29b42c

Co-authored-by: Fusxfaranto <fusxfaranto@gmail.com>

Implement ICU support for uppercasing, with tests.

a91dd6a

Remove dependency on iconv.

392459f

Add windows implementation for ToUpper

386ab1b

Change ICU variant of String::ToUpper for string_view argument.

8e919d2

Use Windows API for utf8-utf16 conversions

085d855

Improve Windows implementation of ToUpper

6109a9b

Replace non-Windows versions of ToUtf8 and ToUtf16 with ICU calls.

13e3528

Split ToUpper tests into more granular subtests.

4c67c0e

Do not require ICU on MinGW and MSVC targets.

bf1fd99

Fix ToUpper tests on Windows

71a2cb4

LCMapStringEx does not unfold ligatures if there is no uppercase equivalent.

Update compilation instructions.

7e7e042

Fix MinGW compilation.

bab66d1

Add explicit -m32 CXXFLAGS for 32-bit job on Travis

5360921

This fixes cmake's find_package() for cross-compilation

Fix testpaint.

6f9226a

AaronVanGeffen force-pushed the develop branch from cca2a4a to 08578a9 Compare May 22, 2018 15:47

AaronVanGeffen removed the Xcode Fix Required label May 22, 2018

AaronVanGeffen added 3 commits May 22, 2018 19:51

Fix String::ToUtf8 and String::ToUtf16 on Linux.

3fd7590

wchar_t typically uses UTF-32 codepoints on Linux, unlike Windows, which uses UTF-16.
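A short sketch of why the width of `wchar_t` matters here, using only the standard library. `EncodeUtf16` and `ToWide` are hypothetical names, not the functions changed in this commit. The sketch assumes a 4-byte `wchar_t` holds one code point per element and a 2-byte `wchar_t` holds UTF-16 code units, and it does not validate that its input excludes surrogate values.

```cpp
#include <cassert>
#include <string>

// Encodes one code point as UTF-16: a single unit inside the Basic
// Multilingual Plane, a surrogate pair above it.
static std::u16string EncodeUtf16(char32_t codePoint)
{
    assert(codePoint <= 0x10FFFF);
    std::u16string units;
    if (codePoint < 0x10000)
    {
        units.push_back(static_cast<char16_t>(codePoint));
    }
    else
    {
        const char32_t offset = codePoint - 0x10000;
        units.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
        units.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
    }
    return units;
}

// Builds a std::wstring from code points, choosing the encoding from the
// width of wchar_t on the current platform.
static std::wstring ToWide(const std::u32string& codePoints)
{
    std::wstring result;
    for (char32_t codePoint : codePoints)
    {
        if (sizeof(wchar_t) >= 4)
        {
            // Wide enough for a whole code point (typical on Linux).
            result.push_back(static_cast<wchar_t>(codePoint));
        }
        else
        {
            // 16-bit wchar_t (as on Windows): store UTF-16 code units.
            for (char16_t unit : EncodeUtf16(codePoint))
            {
                result.push_back(static_cast<wchar_t>(unit));
            }
        }
    }
    return result;
}
```

Code that treats every `std::wstring` as UTF-16 will therefore mis-handle characters outside the Basic Multilingual Plane on platforms with a 4-byte `wchar_t`, and vice versa.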

Rewrite CodePageFromUTF8 as CodePageFromUnicode.

b836ad0

Update Xcode project dependencies to v17.

1363f35

* Update SDL2 to version 2.0.8.
* Introduce dependency on ICU, bundling version 61.1.
* Introduce FreeType2 dependency now that we have adopted SDL2_ttf in our own sources.
* Drop dependency on external SDL2_ttf.

AaronVanGeffen force-pushed the develop branch from d0bd3c3 to 1363f35 Compare May 22, 2018 17:52

AaronVanGeffen added the pending review label May 22, 2018

Gymnasiast approved these changes May 23, 2018

AaronVanGeffen removed the pending review label May 23, 2018

AaronVanGeffen merged commit 77b09b3 into OpenRCT2:develop May 23, 2018
