Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use ICU for UTF-8 conversion and add support for unicode uppercasing #7385

Merged
merged 20 commits into from
May 23, 2018

Conversation

Fusxfaranto
Copy link
Contributor

Fixes #6938. Unfortunately, this adds a great deal of complexity, even with only supporting the non-special cases for UTF-8 uppercasing. Still, I don't think there's a significantly better approach here. Notably, a large table is needed, which I decided to stick in localisation/ConversionTables.cpp (is there a better place for it?).

uint32 upperWord;
if ((*src & 0xF0) == 0xF0)
{
word = ((uint8)src[0] << 24) + ((uint8)src[1] << 16) + ((uint8)src[2] << 8) + (uint8)src[3];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An invalid UTF-8 string could cause an out-of-bounds access here, is this a possible input? I assumed not, but if it's a possible situation, I should definitely add a null check for each index.

@AaronVanGeffen
Copy link
Member

Thank you for your PR. I'm sure it was an interesting issue to work on.

Personally, I think we should really start thinking about using ICU for unicode conversions and transliterations instead of building our own implementation for this sort of thing. We can drop a lot of our conversion tables at that point, too.

@Gymnasiast
Copy link
Member

I second that. ICU is also required for proper support for Arabic and other RTL languages.

@janisozaur
Copy link
Member

A vote for ICU from me too. The only thing that's left now is for someone to implement it.

@IntelOrca
Copy link
Contributor

LCMapStringEx can be used on Windows to do unicode transformations.

@IntelOrca
Copy link
Contributor

@Fusxfaranto What platform / OS are you developing on?

@janisozaur
Copy link
Member

@IntelOrca

Linux/gcc

@IntelOrca
Copy link
Contributor

Ok cool, would you be able to try moving our current tables to instead use libicu?


TEST_F(StringTest, ToUpper_Basic)
{
utf8 *actual = (utf8*)alloca(20 * sizeof(utf8));
Copy link
Member

@Broxzier Broxzier Apr 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for using alloca over utf8 actual[20]? The compiler knows the size already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, good catch.

@Fusxfaranto
Copy link
Contributor Author

Sure, I'll give it a shot this weekend.

@IntelOrca
Copy link
Contributor

@Fusxfaranto Thanks, you only need to do it for non-Windows platforms. I will write the Windows side of the code afterwards using the native Windows API.

@Fusxfaranto
Copy link
Contributor Author

Sounds good, someone else will need to verify the changes on OSX though. Also, should I update this PR, or make a new one?

@IntelOrca
Copy link
Contributor

@Fusxfaranto You can update this existing one.

@Fusxfaranto
Copy link
Contributor Author

I attempted to start on this, but it appears I don't have enough competence in CMake to actually add the ICU dependency (just using find_package(ICU 55.0 REQUIRED) fails with a mysterious error message that I can't manage to decipher). If someone is willing to figure out most of the setup, I'm happy to do the implementation, but otherwise I don't think I can make much progress.

@janisozaur
Copy link
Member

You can leave the ICU handling in CMake to me, for the initial PR you can use whatever works on your system, which I imagine would be to add -licuuc -licudata -licui18n to libopenrct2, as based on .pc files.

@Fusxfaranto
Copy link
Contributor Author

Didn't manage to get the tests to link correctly, so I haven't actually run them yet, but I figured I could leave that to @janisozaur.

@Gymnasiast
Copy link
Member

Could you rebase this PR to develop?

@AaronVanGeffen AaronVanGeffen added the pending rebase PR needs to be rebased. label Apr 26, 2018
@AaronVanGeffen AaronVanGeffen added work in progress and removed pending rebase PR needs to be rebased. labels May 6, 2018
@AaronVanGeffen
Copy link
Member

Pitched in on some of the work required. I have raised the minimum cmake version to 3.7, as that version introduces a FindICU module. cmake isn't playing nice yet, though, so this requires some more attention. @janisozaur, perhaps you could have a look?

As we are introducing a dependency on icu, I think this is a good time to get rid of the iconv dependency in win1252_to_utf8 (Localisation.cpp) as well. The icu package provides nicer C++ abstractions which we should leverage.

icu::UnicodeString convertString(src.data(), codepage);
std::string result;

if (dstCodePage == CODE_PAGE::CP_UTF8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this isn't converting from UTF-8 to any other code page yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible for you to feed the same code page IDs to ICU?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UnicodeString class does not seem to have such a constructor, no, unfortunately… It looks like going from a UnicodeString to any other encoding has to be done with a bunch of C-style functions instead, too. Less than ideal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any functions in libicu to convert from number to name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not as far as I'm aware. I think the best option will be something like sprintf(s, "windows-%s", codepage), but I'm not sure that works for everything, so I'll have to check that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucnv_getStandard seems to be the function we're looking for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, are you sure? I was under the impression that that function took an ICU-internal index.

@ccoors
Copy link
Member

ccoors commented May 14, 2018

Can another example be added to the string test? The single char 'fi' should uppercase to two chars, 'F' and 'I'.

@AaronVanGeffen
Copy link
Member

AaronVanGeffen commented May 14, 2018

Some things left to do for this PR:

  • Implement String::ToUpper for Windows. @IntelOrca suggests using LCMapStringEx.
  • Test for any regressions. Check for performance regressions, too.
  • Travis needs to install libicu before building. The iconv dependency can be dropped. (@janisozaur, could you tackle this?)

Needs external dependencies update, and not blocking for this PR imo:

  • Android needs libicu added to its dependencies, or an equivalent library should be used.

@AaronVanGeffen
Copy link
Member

AaronVanGeffen commented May 17, 2018

It looks like all tests are passing with the latest amendments. From my perspective, all that's left for this PR is to get ICU bundled in the Xcode dependency package, if the version macOS is shipping isn't usable. I'm hoping @LRFLEW will be able to do this.

As @janisozaur mentioned, the Android dependencies will need to be updated as well. This will be tackled along with #6771, and is therefore not blocking for this PR.

@AaronVanGeffen
Copy link
Member

I've spent a good hour fiddling with Xcode to get the dependencies sorted. To do this, I used binaries from Homebrew, whose identity paths I adjusted using install_name_tool.

  • Update SDL2 to version 2.0.8.
  • Introduce dependency on ICU, bundling version 61.1.
  • Introduce FreeType2 dependency now that we have adopted SDL2_ttf in our own sources.
  • Drop dependency on external SDL2_ttf.

CI appears to be happy, too. However, I'd really appreciate it if someone other than me could try an Xcode build of the game on this branch.

wchar_t typically uses UTF-32 codepoints on Linux, unlike Windows, which uses UTF-16.
* Update SDL2 to version 2.0.8.
* Introduce dependency on ICU, bundling version 61.1.
* Introduce FreeType2 dependency now that we have adopted SDL2_ttf in our own sources.
* Drop dependency on external SDL2_ttf.
Copy link
Member

@Gymnasiast Gymnasiast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this on macOS. It compiles correctly and banners correctly display upper case text.

@AaronVanGeffen AaronVanGeffen merged commit 77b09b3 into OpenRCT2:develop May 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants