-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Unicode support in the console #1379
Conversation
@@ -230,7 +230,7 @@ bool hsStream::AtEnd() | |||
|
|||
bool hsStream::IsTokenSeparator(char c) | |||
{ | |||
return (isspace(c) || c==',' || c=='='); | |||
return (isspace(static_cast<unsigned char>(c)) || c==',' || c=='='); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?? ctype functions (isspace
, isalnum
, etc) take an int
, which both a signed and unsigned implementation of char
should automatically and safely promote to... Why all the casts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the problem. char
is signed and gets sign-extended to int
, but the ctype functions are only defined for unsigned char
values and EOF
. See also e. g. https://nullprogram.com/blog/2023/02/11/#character-classification-and-mapping
MSVCRT has a debug assertion ensuring that you don't pass any other negative values. That started failing once I typed non-ASCII characters into the console, so I went ahead and fixed the issue for all ctype calls (there weren't all that many).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that make sense... Although perhaps it makes more sense to test for that specific case, since characters outside the 0-127 range will always be UTF-8 multibyte sequence bytes when coming from ST::strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean exactly... something like c < 0x80 && isspace(c)
? I don't think that's necessary if c
is already char
-sized. We don't use locales, so all the is
functions will return false
for all bytes outside the ASCII range.
(I did use c < 0x80
checks where c
is a larger type like wchar_t
or uint32_t
, because just a cast isn't sufficient there.)
{ | ||
if (name.empty()) | ||
{ | ||
if (statusStr) | ||
*statusStr = "Object name is nil"; | ||
statusStr = "Object name is nil"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These would be a good place to use ST_LITERAL
or the _st
literal suffix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yep. I used it more consistently elsewhere, but got lazy when I saw the 6000-line file 🙃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, added ST_LITERAL
in the relevant places. I didn't adjust every single PrintString
call though - that would be quite a bit of work and not really worth it. (This is the dev console - I don't think we need to micro-optimize performance and memory usage that much.)
|
||
memset( fHistory, 0, sizeof( fHistory ) ); | ||
for (size_t i = 0; i < kNumHistoryTypes; i++) { | ||
fHistory[i] = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't mind requiring string_theory >= 3.4....
fHistory[i] = {}; | |
fHistory[i].clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, Plasma already uses ST::string::clear()
elsewhere, despite the CMakeLists.txt only saying string_theory 2.0 - that should probably be updated...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait, I remember now - fHistory[i]
is actually a struct. This line initializes/clears all fields in the struct, not just the ST::string
, so clear()
doesn't work here.
// Supposedly it's *possible* to use a non-trivial class in a union, | ||
// but it seems complex and hard to do correctly, | ||
// so let's just put this in its own field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of this code could probably be replaced with std::variant
, but that's a very different PR.
ST::string name = cmd->GetName() + ' '; | ||
pfConsoleCmdGroup* group = cmd->GetParent(); | ||
while (group != nullptr && group != pfConsoleCmdGroup::GetBaseGroup()) { | ||
name = group->GetName() + '.' + name; | ||
group = group->GetParent(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably use ST::string_stream
to avoid extra allocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't easily implemented with ST::string_stream
, because that only allows appending to the end, but the string here is built from back to front. I suppose the strings could be collected into a std::vector
and then joined at the end, to avoid at least some intermediate allocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM, but I'd like to ensure that all of @zrax's items are cleared up before proceeding.
Out-of-range values, such as non-ASCII *signed* chars, cause undefined behavior according to the standard (and an assertion failure in the case of MSVCRT).
This is enough to make it not throw exceptions instantly when running a console command with non-ASCII characters, but some things probably still break with non-ASCII characters (and the console still can't actually render them).
This should fix any remaining encoding mess-ups on the output end. Not that it really matters, because the console still only displays ASCII...
This allows typing non-ASCII characters into the console and gets them through unharmed to Python (not just Plasma console commands), although they are still invisible in both the input and output.
Multiple places in the code already use ST::string::clear(), which was added in string_theory 3.4.
Co-authored-by: Adam Johnson <AdamJohnso@gmail.com>
dc49f67
to
f6e6687
Compare
Rebased to fix a conflict. I think I had addressed/fixed all the things from the reviews - let me know if I missed anything. |
Barring any objections, I intend to merge this tonight. |
A couple of days late, but here we go... |
This implements full Unicode support in the console engine internals (fixes #479). The in-game console itself only has limited Unicode support for now:
plDebugText
, which the console uses for text rendering, still only supports ASCII characters (not even Latin-1). Non-ASCII characters are not visible in the input and output.Non-ASCII characters can be typed into the input, and even though they are invisible, they will be passed through correctly to Python and/or the console engine (unlike before). Cut/copy/paste also work correctly with non-ASCII characters.
So for example you can type or paste this command to bind the Ü key (semicolon on US keyboards):
And you can run this in the Python console and get the correct output:
The previous implementation used very "C-style" string management (not in a good way). I've reimplemented most of it using
ST::string
, which should make the code easier to follow. The diff is quite messy though - apologies in advance to whoever decides to review this. You might want to skip over pfConsoleCommands.cpp and the other command implementations at first - they mostly contain boringchar *
->ST::string
changes.Obviously I didn't test every single console command - only a few common ones and those where Unicode is obviously relevant.