Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Unicode support in the console #1379

Merged
merged 22 commits into from
Jul 10, 2023
Merged

Conversation

dgelessus
Copy link
Contributor

@dgelessus dgelessus commented Apr 16, 2023

This implements full Unicode support in the console engine internals (fixes #479). The in-game console itself only has limited Unicode support for now:

  • A few string fields are still limited to 8-bit Latin-1 characters, namely the input line and display buffer.
  • plDebugText, which the console uses for text rendering, still only supports ASCII characters (not even Latin-1). Non-ASCII characters are not visible in the input and output.

Non-ASCII characters can be typed into the input, and even though they are invisible, they will be passed through correctly to Python and/or the console engine (unlike before). Cut/copy/paste also work correctly with non-ASCII characters.

So for example you can type or paste this command to bind the Ü key (semicolon on US keyboards):

Keyboard.BindConsoleCmd Ü Graphics.Show.Physicals

And you can run this in the Python console and get the correct output:

>>> print(ascii("Ü"))
'\xdc'

The previous implementation used very "C-style" string management (not in a good way). I've reimplemented most of it using ST::string, which should make the code easier to follow. The diff is quite messy though - apologies in advance to whoever decides to review this. You might want to skip over pfConsoleCommands.cpp and the other command implementations at first - they mostly contain boring char * -> ST::string changes.

Obviously I didn't test every single console command - only a few common ones and those where Unicode is obviously relevant.

@@ -230,7 +230,7 @@ bool hsStream::AtEnd()

bool hsStream::IsTokenSeparator(char c)
{
return (isspace(c) || c==',' || c=='=');
return (isspace(static_cast<unsigned char>(c)) || c==',' || c=='=');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?? ctype functions (isspace, isalnum, etc) take an int, which both a signed and unsigned implementation of char should automatically and safely promote to... Why all the casts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the problem. char is signed and gets sign-extended to int, but the ctype functions are only defined for unsigned char values and EOF. See also e. g. https://nullprogram.com/blog/2023/02/11/#character-classification-and-mapping

MSVCRT has a debug assertion ensuring that you don't pass any other negative values. That started failing once I typed non-ASCII characters into the console, so I went ahead and fixed the issue for all ctype calls (there weren't all that many).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that make sense... Although perhaps it makes more sense to test for that specific case, since characters outside the 0-127 range will always be UTF-8 multibyte sequence bytes when coming from ST::strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean exactly... something like c < 0x80 && isspace(c)? I don't think that's necessary if c is already char-sized. We don't use locales, so all the is functions will return false for all bytes outside the ASCII range.

(I did use c < 0x80 checks where c is a larger type like wchar_t or uint32_t, because just a cast isn't sufficient there.)

{
if (name.empty())
{
if (statusStr)
*statusStr = "Object name is nil";
statusStr = "Object name is nil";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These would be a good place to use ST_LITERAL or the _st literal suffix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yep. I used it more consistently elsewhere, but got lazy when I saw the 6000-line file 🙃

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, added ST_LITERAL in the relevant places. I didn't adjust every single PrintString call though - that would be quite a bit of work and not really worth it. (This is the dev console - I don't think we need to micro-optimize performance and memory usage that much.)


memset( fHistory, 0, sizeof( fHistory ) );
for (size_t i = 0; i < kNumHistoryTypes; i++) {
fHistory[i] = {};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't mind requiring string_theory >= 3.4....

Suggested change
fHistory[i] = {};
fHistory[i].clear();

Copy link
Contributor Author

@dgelessus dgelessus Apr 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Plasma already uses ST::string::clear() elsewhere, despite the CMakeLists.txt only saying string_theory 2.0 - that should probably be updated...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, I remember now - fHistory[i] is actually a struct. This line initializes/clears all fields in the struct, not just the ST::string, so clear() doesn't work here.

Sources/Plasma/FeatureLib/pfConsole/pfConsole.cpp Outdated Show resolved Hide resolved
Sources/Plasma/FeatureLib/pfConsole/pfConsole.cpp Outdated Show resolved Hide resolved
Comment on lines +133 to +135
// Supposedly it's *possible* to use a non-trivial class in a union,
// but it seems complex and hard to do correctly,
// so let's just put this in its own field.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of this code could probably be replaced with std::variant, but that's a very different PR.

Sources/Plasma/FeatureLib/pfConsole/pfConsoleCommands.cpp Outdated Show resolved Hide resolved
Sources/Plasma/FeatureLib/pfConsoleCore/pfConsoleCmd.cpp Outdated Show resolved Hide resolved
Sources/Plasma/FeatureLib/pfConsoleCore/pfConsoleCmd.cpp Outdated Show resolved Hide resolved
Sources/Plasma/FeatureLib/pfConsoleCore/pfConsoleCmd.cpp Outdated Show resolved Hide resolved
Sources/Plasma/FeatureLib/pfConsoleCore/pfConsoleCmd.cpp Outdated Show resolved Hide resolved
Comment on lines 518 to 525
ST::string name = cmd->GetName() + ' ';
pfConsoleCmdGroup* group = cmd->GetParent();
while (group != nullptr && group != pfConsoleCmdGroup::GetBaseGroup()) {
name = group->GetName() + '.' + name;
group = group->GetParent();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably use ST::string_stream to avoid extra allocations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't easily implemented with ST::string_stream, because that only allows appending to the end, but the string here is built from back to front. I suppose the strings could be collected into a std::vector and then joined at the end, to avoid at least some intermediate allocations.

Copy link
Member

@Hoikas Hoikas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, but I'd like to ensure that all of @zrax's items are cleared up before proceeding.

dgelessus and others added 22 commits July 1, 2023 13:53
Out-of-range values, such as non-ASCII *signed* chars, cause undefined
behavior according to the standard (and an assertion failure in the case
of MSVCRT).
This is enough to make it not throw exceptions instantly when running a
console command with non-ASCII characters, but some things probably
still break with non-ASCII characters (and the console still can't
actually render them).
This should fix any remaining encoding mess-ups on the output end. Not
that it really matters, because the console still only displays ASCII...
This allows typing non-ASCII characters into the console and gets them
through unharmed to Python (not just Plasma console commands), although
they are still invisible in both the input and output.
Multiple places in the code already use ST::string::clear(), which was
added in string_theory 3.4.
Co-authored-by: Adam Johnson <AdamJohnso@gmail.com>
@dgelessus
Copy link
Contributor Author

Rebased to fix a conflict.

I think I had addressed/fixed all the things from the reviews - let me know if I missed anything.

@Hoikas
Copy link
Member

Hoikas commented Jul 8, 2023

Barring any objections, I intend to merge this tonight.

@Hoikas
Copy link
Member

Hoikas commented Jul 10, 2023

A couple of days late, but here we go...

@Hoikas Hoikas merged commit bc3b31b into H-uru:master Jul 10, 2023
14 checks passed
@dgelessus dgelessus deleted the pfconsole_unicode branch November 29, 2023 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ST::stringify the console
3 participants