-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ICU for UTF-8 conversion and add support for unicode uppercasing #7385
Conversation
src/openrct2/core/String.cpp
Outdated
uint32 upperWord; | ||
if ((*src & 0xF0) == 0xF0) | ||
{ | ||
word = ((uint8)src[0] << 24) + ((uint8)src[1] << 16) + ((uint8)src[2] << 8) + (uint8)src[3]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An invalid UTF-8 string could cause an out-of-bounds access here, is this a possible input? I assumed not, but if it's a possible situation, I should definitely add a null check for each index.
Thank you for your PR. I'm sure it was an interesting issue to work on. Personally, I think we should really start thinking about using ICU for unicode conversions and transliterations instead of building our own implementation for this sort of thing. We can drop a lot of our conversion tables at that point, too. |
I second that. ICU is also required for proper support for Arabic and other RTL languages. |
A vote for ICU from me too. The only thing that's left now is for someone to implement it. |
LCMapStringEx can be used on Windows to do unicode transformations. |
@Fusxfaranto What platform / OS are you developing on? |
Ok cool, would you be able to try moving our current tables to instead use libicu? |
test/tests/StringTest.cpp
Outdated
|
||
TEST_F(StringTest, ToUpper_Basic) | ||
{ | ||
utf8 *actual = (utf8*)alloca(20 * sizeof(utf8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for using alloca
over utf8 actual[20]
? The compiler knows the size already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, good catch.
Sure, I'll give it a shot this weekend. |
@Fusxfaranto Thanks, you only need to do it for non-Windows platforms. I will write the Windows side of the code afterwards using the native Windows API. |
Sounds good, someone else will need to verify the changes on OSX though. Also, should I update this PR, or make a new one? |
@Fusxfaranto You can update this existing one. |
I attempted to start on this, but it appears I don't have enough competence in CMake to actually add the ICU dependency (just using |
You can leave the ICU handling in CMake to me, for the initial PR you can use whatever works on your system, which I imagine would be to add |
Didn't manage to get the tests to link correctly, so I haven't actually run them yet, but I figured I could leave that to @janisozaur. |
Could you rebase this PR to |
Pitched in on some of the work required. I have raised the minimum cmake version to 3.7, as that version introduces a FindICU module. As we are introducing a dependency on icu, I think this is a good time to get rid of the iconv dependency in |
src/openrct2/core/String.cpp
Outdated
icu::UnicodeString convertString(src.data(), codepage); | ||
std::string result; | ||
|
||
if (dstCodePage == CODE_PAGE::CP_UTF8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this isn't converting from UTF-8 to any other code page yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not possible for you to feed the same code page IDs to ICU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UnicodeString class does not seem to have such a constructor, no, unfortunately… It looks like going from a UnicodeString to any other encoding has to be done with a bunch of C-style functions instead, too. Less than ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any functions in libicu to convert from number to name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not as far as I'm aware. I think the best option will be something like sprintf(s, "windows-%s", codepage)
, but I'm not sure that works for everything, so I'll have to check that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ucnv_getStandard seems to be the function we're looking for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, are you sure? I was under the impression that that function took an ICU-internal index.
b631c6a
to
1394352
Compare
Can another example be added to the string test? The single char 'fi' should uppercase to two chars, 'F' and 'I'. |
Some things left to do for this PR:
Needs external dependencies update, and not blocking for this PR imo:
|
It looks like all tests are passing with the latest amendments. From my perspective, all that's left for this PR is to get ICU bundled in the Xcode dependency package, if the version macOS is shipping isn't usable. I'm hoping @LRFLEW will be able to do this. As @janisozaur mentioned, the Android dependencies will need to be updated as well. This will be tackled along with #6771, and is therefore not blocking for this PR. |
Co-authored-by: Fusxfaranto <fusxfaranto@gmail.com>
Co-authored-by: Fusxfaranto <fusxfaranto@gmail.com>
LCMapStringEx does not unfold ligatures if there is no uppercase equivalent.
This fixes cmake's find_package() for cross-compilation
I've spent a good hour fiddling with Xcode to get the dependencies sorted. To do this, I used binaries from Homebrew, whose identity paths I adjusted using
CI appears to be happy, too. However, I'd really appreciate it if someone other than me could try an Xcode build of the game on this branch. |
wchar_t typically uses UTF-32 codepoints on Linux, unlike Windows, which uses UTF-16.
* Update SDL2 to version 2.0.8. * Introduce dependency on ICU, bundling version 61.1. * Introduce FreeType2 dependency now that we have adopted SDL2_ttf in our own sources. * Drop dependency on external SDL2_ttf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested this on macOS. It compiles correctly and banners correctly display upper case text.
Fixes #6938. Unfortunately, this adds a great deal of complexity, even with only supporting the non-special cases for UTF-8 uppercasing. Still, I don't think there's a significantly better approach here. Notably, a large table is needed, which I decided to stick in
localisation/ConversionTables.cpp
(is there a better place for it?).