Win64 unicode #332

nicolas-cellier-aka-nice · 2018-12-31T17:50:45Z

We can now at least compile the whole VM and main plugins with -DUNICODE

…me/path Note: the Microsoft Windows API mostly uses the W version (for enabling internationalized image name/path), while the image uses UTF8-encoded byte strings for communication with the VM (this is best for compatibility with Unix/Mac). The idea here is that the implementation maintains both the UTF8 and the UTF16 version of the path/name. The appropriate macros returning a TCHAR * are also provided. This is in order to support the generic versions using TCHAR, which are normally used to ease the transition to UNICODE. Note about string length: no effort has been made so far to support long path names for image and VM. The path is limited to MAX_PATH in UTF16. UTF8 may consume more characters (code units) than UTF16 (not necessarily more bytes). Thus, the ASCII version has been made longer (MAX_PATH_UTF8) in order to avoid an even more restrictive limit.

OK, OK, they are compatible with 16 bits windows ;) But these are not recommended any more especially if we want to go toward UNICODE paths everywhere.

This is for the generated temporary file $$squeak$$.bmp. The path to the temporary directory could be non-ASCII, so let's be more robust to UNICODE.

See https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/control87-controlfp-control87-2?view=vs-2017 Setting these flags triggers an assertion failure when compiled with MSVC in debug mode. Assertion failure: `(mask&~(_MCW_DN|_MCW_EM|_MCW_RC))==0`

Due to this, roundUpToPage is truncating addresses > 2^32. With MSVC, minAddress is > 2^32, and it then takes ages to get a VirtualAlloc > minAddress (several minutes).

using `toUnicode` does not do the right thing: it promotes each UTF8 byte code to short... ... which can hardly work beyond ASCII.

toUnicode just expands each byte to a short and reinterprets the result as UTF16, regardless of the original encoding. This works well for pure ASCII, but is not really defined for other encodings. We should prefer using UTF8 by default for all image-VM string transfers.
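To make the difference concrete, here is a minimal portable sketch. `widen_bytes` imitates what a byte-expanding helper does, while `utf8_to_utf16_bmp` is a hypothetical, deliberately small decoder (code points up to U+FFFF only, no overlong or surrogate checks). Neither is the VM's code; on Windows the real conversion would presumably go through MultiByteToWideChar().

```c
#include <stddef.h>
#include <stdint.h>

/* Imitation of byte widening: every UTF-8 byte becomes one 16-bit unit. */
static size_t widen_bytes(const unsigned char *src, size_t n, uint16_t *dst)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    return n;
}

/* Hypothetical minimal decoder: UTF-8 to UTF-16 for code points <= U+FFFF.
   Returns the number of 16-bit units written, or (size_t)-1 when the input
   is truncated, malformed, needs 4 bytes, or does not fit in cap units.
   It does not reject overlong forms or surrogate code points. */
static size_t utf8_to_utf16_bmp(const unsigned char *src, size_t n,
                                uint16_t *dst, size_t cap)
{
    size_t i = 0, out = 0;
    while (i < n) {
        unsigned char b = src[i];
        uint32_t cp;
        size_t extra;
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else return (size_t)-1;
        if (extra > n - 1 - i)
            return (size_t)-1;               /* sequence cut short */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = src[i + k];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;           /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (out >= cap)
            return (size_t)-1;
        dst[out++] = (uint16_t)cp;
        i += extra + 1;
    }
    return out;
}
```

For the two bytes 0xC3 0xA9 (the UTF-8 encoding of "é"), widening yields the two units 0x00C3 0x00A9, whereas decoding yields the single unit 0x00E9.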

Why get rid of toUnicode, fromUnicode, fromSqueak and fromSqueak2?
- we'd rather use UTF8 everywhere.
- and it's the last usage remaining!
Now the clients must use a TEXT() macro also for the format. In the future, we could possibly translate the VM messages... Get rid of the wsprintf variant, which has no character limit and might be prone to buffer overrun. An alternative for supporting both ASCII and WIDE is the _tcs* family in <tchar.h>. See https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/vsprintf-vsprintf-l-vswprintf-vswprintf-l-vswprintf-l?view=vs-2017 In order to avoid buffer overrun, prefer a vsnprintf variant. Note: I did also update the prototype in sqWin32SpurAlloc, but not the contents. It is dead code, and only there for testing purposes. We'd better remove it!
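A bounded formatter along the lines suggested above could look like the following sketch. `format_bounded` is a hypothetical name, not a function from the VM, and the sketch relies on C99 `vsnprintf` semantics (always NUL-terminates when the size is non-zero). Older MSVC `_vsnprintf` does not guarantee termination, so this would need care there.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: format into buf without writing past size bytes.
   Returns 1 when the whole message fit, 0 when it was truncated or when
   formatting failed. Relies on C99 vsnprintf semantics. */
static int format_bounded(char *buf, size_t size, const char *fmt, ...)
{
    va_list args;
    int n;

    if (size == 0)
        return 0;
    va_start(args, fmt);
    n = vsnprintf(buf, size, fmt, args);
    va_end(args);
    if (n < 0) {                 /* encoding or output error */
        buf[0] = '\0';
        return 0;
    }
    return (size_t)n < size;     /* n counts what would have been written */
}
```

With an 8-byte buffer, formatting "abc" fits and returns 1, while a longer message is cut to 7 characters plus the terminator and returns 0.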

For UNICODE compatibility,
- every String coming from the image to the VM should be interpreted as UTF8, and converted to a wide String via MultiByteToWideChar()
- every String going to the image from the VM should be converted from a wide String to UTF8 via WideCharToMultiByte()
See: https://docs.microsoft.com/en-us/windows/desktop/api/stringapiset/nf-stringapiset-multibytetowidechar https://docs.microsoft.com/en-us/windows/desktop/api/stringapiset/nf-stringapiset-widechartomultibyte Side note: there is also a _tcsrchr in <tchar.h>, at least since Visual Studio 2015. See https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/strrchr-wcsrchr-mbsrchr-mbsrchr-l?view=vs-2017 <tchar.h> is also supported by mingw, so if ever we need lstrrchr again, we'll use that.

Reminder: even in WIN64, _WIN32 is defined, so the comment was a bit misleading anyway.

…lusive options Therefore, when we define `WIN32_FILE_SUPPORT` we must also define `NO_STD_FILE_SUPPORT`. Until now, this knowledge was in build.win*/common/Makefile.* (and a legacy MSVC project in platforms/win32/misc/Squeak.dsp - we could trim things down a bit!!!). Since it is easy to forget the second define, let's put that knowledge in sqPlatformSpecific.h
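The kind of guard this describes might look like the sketch below. The exact wording in sqPlatformSpecific.h is not shown here, so treat this as an assumption. The first `#define` only simulates what the Makefile would pass with -D, and `std_file_support_disabled` is a hypothetical helper added so the effect can be observed.

```c
/* Simulates the build passing -DWIN32_FILE_SUPPORT (illustration only). */
#define WIN32_FILE_SUPPORT 1

/* Sketch of a header guard: selecting the Win32 file primitives
   implies disabling the stdio-based ones, so nobody has to remember
   the second define. */
#if defined(WIN32_FILE_SUPPORT) && !defined(NO_STD_FILE_SUPPORT)
# define NO_STD_FILE_SUPPORT 1
#endif

/* Hypothetical helper: reports which configuration was compiled in. */
static int std_file_support_disabled(void)
{
#ifdef NO_STD_FILE_SUPPORT
    return 1;
#else
    return 0;
#endif
}
```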

Even if buf were re-interpreted as WCHAR* when -DUNICODE, strcmp wouldn't do the right thing (it would stop at the first ASCII character, because its high 8 bits are zero!). We thus use the TCHAR*-dedicated `_tcscmp`. If ever we want to switch to a UTF16 (WCHAR*) windowClassName, then it will be `wcscmp`.
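The early stop is easy to reproduce without Windows. In this sketch the byte array spells "Squeak" in UTF-16LE, which is the in-memory layout of a WCHAR string on Windows. The function names are made up for the illustration.

```c
#include <string.h>
#include <wchar.h>

/* "Squeak" as UTF-16LE bytes, followed by a two-byte terminator. */
static const unsigned char wide_bytes[] = {
    'S', 0, 'q', 0, 'u', 0, 'e', 0, 'a', 0, 'k', 0, 0, 0
};

/* Narrow view of the wide data: strlen stops at the zero high byte
   of the very first character. */
static size_t narrow_length(void)
{
    return strlen((const char *)wide_bytes);
}

/* strcmp on the same bytes therefore only ever sees "S". */
static int narrow_matches(const char *name)
{
    return strcmp((const char *)wide_bytes, name) == 0;
}

/* Comparing wide strings needs the wide function. */
static int wide_matches(const wchar_t *a, const wchar_t *b)
{
    return wcscmp(a, b) == 0;
}
```

Here `narrow_length()` is 1, and the narrow comparison matches "S" but not "Squeak".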

That's the limit of using compiler warnings: we only focus on the sections we compile... BTW, ifdef NewspeakVM, OK, but what do Pharo people think about it?

Choose the UNICODE variant, because error messages are presumably localized and may use non-ASCII characters. The alternative would be to use `_ftprintf(stderr,TEXT("%s"),gai_strerror(gaiError))` and let -DUNICODE decide...

The low level `RegQueryValueEx` deals with a char*, but here char* just means some un-interpreted bytes, not a string! If we compile with `-DUNICODE`, the un-interpreted bytes will contain a WCHAR*. And if we want to properly NULL terminate this WCHAR*, then we need 2 null bytes, not 1!

Rationale: there's no urgency in providing localized UNICODE info... That's IP addresses, etc... Server names could possibly be UNICODE, but this seems to be a real mess! The short term goal is to enable compilation with -DUNICODE. For the longer term, we'll see later.

We now interpret iconPath as UTF-8 encoded. We convert it to WideChar and call the W version. TODO: for now, do not deal with UNC long filenames...

though, vfprintf does not, so reconcile by using _vftprintf from <tchar.h>

so let's use appropriate TCHAR* functions/macros

…simply sprintf. We have to test #ifdef UNICODE and, if so, convert to UTF8

But there is no need for UNICODE in VM_VERSION_TEXT. Revert to plain ASCII and rename it VM_VERSION_VERBOSE

Let iniName be WCHAR unconditionally. Get manufacturer and model into a UTF16 buffer, then convert them to UTF8. While at it, protect against buffer overrun: strcat -> wcsNcat

We want to generate a UTF8 report (do we really?). So provide a `RegLookupUTF8String`, so as to make all the queries with the W variant, then convert all information to UTF8. While at it, use snprintf instead of sprintf and strncat instead of strcat

There was a mixture of TCHAR* and char* which could not work with -DUNICODE. Who knows, the TempPath might be localized, so go UNICODE...

I'm very sorry to throw away all this historical knowledge, but frankly YAGNI! Even if we are a museum, we can't expose all our art in permanent collections! Note that the minimal version is already set to XP (via WINVER:=-D_WIN32_WINNT=0x0501 -DWINVER=0x0501 in Makefile.tools). So we're keeping this stuff for nothing, and now it gets in our way to UNICODE.

These are the last two obstacles that generate compiler warnings with -DUNICODE. No compiler warning does not mean that UNICODE is OK and ready to go. It means that we have at least eliminated all the trivial TCHAR*/char*/WCHAR* mismatches (and there were a bunch of them!!!). Also, the code is like a battlefield, with lots of different ad hoc recipes that would deserve a more uniform approach (refactoring). But small steps!

To reach the current stage, we have to overhaul printCommandLine(). It can't answer a TCHAR* when sqMain expects a char*argv[]... So, like what is done in WinMain through getCommandLineW, we first produce a wide command line, then convert it to UTF8. There is currently no provision for conversion failure: I don't know how to report and exit when the VM is run as a service. Note that command line parsing then works the other way around: some of the arguments are converted back to UTF16. I call this style knitting coding: knit one, purl one. Maybe a deeper change would be to go WCHAR* all the way down, but let's defer this decision. Small steps!

Note that another source of problems is RegQueryValueExW. When we query a WCHAR* string, it is not necessarily NULL terminated. But every answer is handled as raw un-interpreted bytes (char *), whose byte length is answered in dwSize. That's not the WCHAR character length! So buffer[dwSize] = 0 is not the right thing:
- if buffer is declared WCHAR*, this would write the NULL well beyond the real string length (and possibly overrun the buffer);
- if buffer is declared char*, this would not guarantee a terminating NULL WCHAR (we need 2 zero bytes to make a NULL WCHAR).
Usage of this function will require a careful review, the price of low level API...

Also note that we seem to compile with flag -DNO_SERVICE right now. Why? I don't have enough knowledge in this area and lack tests/examples. In these conditions, it's not easy to test the modifications! Any help will be greatly appreciated. The good part is that it won't break current VM usage... You know what I think of dead code, but half-alive Frankenstein code scares me as well ;)
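As a sketch of the termination issue with RegQueryValueExW described above: the byte count has to be divided by the element size before it can be used as an index, and the capacity has to be checked before writing. `terminate_utf16` is a hypothetical helper, with `uint16_t` standing in for WCHAR so the example stays portable. Its `byte_len` parameter plays the role of dwSize.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. buf has room for cap_units 16-bit units and
   currently holds byte_len BYTES of UTF-16 data (the role dwSize plays).
   Ensures the data ends with a full 16-bit zero.
   Returns the string length in units, excluding the terminator,
   or (size_t)-1 when there is no room to terminate. */
static size_t terminate_utf16(uint16_t *buf, size_t cap_units, size_t byte_len)
{
    size_t units = byte_len / sizeof(uint16_t);   /* bytes -> elements */

    if (units > cap_units)
        return (size_t)-1;
    if (units > 0 && buf[units - 1] == 0)
        return units - 1;                         /* already terminated */
    if (units == cap_units)
        return (size_t)-1;                        /* no slot left */
    buf[units] = 0;        /* one 16-bit store writes both zero bytes */
    return units;
}
```

With 6 bytes of data ("abc"), the terminator lands at index 3, not at index 6 as `buffer[dwSize] = 0` would do on a wide buffer.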

wcsncat does not care about the write limit! (buffer overrun). It only specifies the maximum number of characters to read from source... This way, we pay one more Shlemiel the painter run, and write hyper convoluted code. Nice!

Like wcsncat, the ugly and correct usage is strncat( dest, src, sizeof(dest) - 1 - strlen(dest) );

Here, I don't want to redefine error() to take a TCHAR*, so the simplest alternative is to switch to MessageBoxA. But vmLogDirA may contain UNICODE, and UTF8-encoded characters may be mangled in the MessageBox. The right way is to switch to the W variant unconditionally, and interpret msg as UTF8...

The joy of 0-based indices... [skip travis]

nicolas-cellier-aka-nice · 2019-01-02T14:01:57Z

platforms/win32/vm/sqWin32Main.c

-    if (_strnicmp(keyName, "\\registry\\machine\\", 18) == 0) {
-      memcpy(keyName, keyName+18, strlen(keyName)-17);
+    if (_wcsnicmp(keyName, L"\\registry\\machine\\", 18) == 0) {
+      memmove(keyName, keyName+18*sizeof(WCHAR), (wcslen(keyName)-17)*sizeof(WCHAR));


Err: keyName is WCHAR*, keyName+18 skips 18 characters, keyName+18*sizeof(WCHAR) skips 36 characters!

Fixed in 3e51616

Note that keyName was declared char * in the original code. But now it is WCHAR*, so keyName+18 already does the right thing (skip 18 characters), while the modified code would now skip 36! Note that the original code was using memcpy for this case of overlapping memory, which is BAD and potentially UB. [skip travis]
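For reference, a portable sketch of the corrected prefix stripping. It uses `wchar_t` and the standard, case-sensitive `wcsncmp` where the VM code uses WCHAR and `_wcsnicmp`. `strip_registry_prefix` is a name made up for the example.

```c
#include <string.h>
#include <wchar.h>

#define REG_PREFIX     L"\\registry\\machine\\"
#define REG_PREFIX_LEN 18   /* characters, not bytes */

/* Removes the prefix in place when present. Pointer arithmetic on a
   wchar_t* already advances in characters, so only the BYTE COUNT passed
   to memmove is scaled by sizeof(wchar_t). memmove is used because source
   and destination overlap. */
static void strip_registry_prefix(wchar_t *keyName)
{
    if (wcsncmp(keyName, REG_PREFIX, REG_PREFIX_LEN) == 0) {
        const wchar_t *rest = keyName + REG_PREFIX_LEN;
        /* +1 so the terminating zero moves along with the text */
        memmove(keyName, rest, (wcslen(rest) + 1) * sizeof(wchar_t));
    }
}
```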

nicolas-cellier-aka-nice added 30 commits December 30, 2018 18:29

DropPlugin: modernize OpenFile/_lwrite/_lclose API

8bdd0b6

OK, OK, they are compatible with 16 bits windows ;) But these are not recommended any more especially if we want to go toward UNICODE paths everywhere.

Drop plugin: let tempPathName be Wide

cb32e41

This is for the generated temporary file $$squeak$$.bmp. The path to the temporary directory could be non-ASCII, so let's be more robust to UNICODE.

Fix 2 potential buffer overrun in sqWin32Service.c

85e0883

The pageSize and pageMask are too short on WIN64

49eff31

Due to this, roundUpToPage is truncating addresses > 2^32. With MSVC, minAddress is > 2^32, and it then takes ages to get a VirtualAlloc > minAddress (several minutes).
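The truncation can be reproduced in isolation. The sketch below assumes the mask was held in a 32-bit variable and that rounding is done by masking; the actual declarations in the VM are not shown here. Function names are illustrative.

```c
#include <stdint.h>

/* Page mask kept in 32 bits: when ANDed with a 64-bit address the mask is
   zero-extended, so every address bit above bit 31 is cleared. */
static uint64_t round_up_narrow(uint64_t addr, uint32_t page_size)
{
    uint32_t page_mask = ~(page_size - 1);        /* 0xFFFFF000 for 4 KiB */
    return (addr + page_size - 1) & page_mask;
}

/* Same computation with a pointer-sized mask keeps the high bits. */
static uint64_t round_up_wide(uint64_t addr, uint64_t page_size)
{
    uint64_t page_mask = ~(page_size - 1);
    return (addr + page_size - 1) & page_mask;
}
```

For the address 0x100000001 and 4 KiB pages, the narrow version answers 0x1000 while the wide version answers 0x100001000.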

Use the eventually true UNICODE imageNameT if -DUNICODE

627bc5e

using `toUnicode` does not do the right thing: it promotes each UTF8 byte code to short... ... which can hardly work beyond ASCII.

#if 0 ? YAGNI ! we now use builtin _WIN32 _WIN64 anyway

2024d43

Reminder: even in WIN64, _WIN32 is defined, so the comment was a bit misleading anyway.

which SetUpPreferences()? There is no SetUpPreferences()!

e1e83f7

RegisterWindowMessage takes a TCHAR *

8638522

Tu quoque NewspeakVM, frustra TEXT macro...

84a8d17

That's the limit of using compiler warnings: we only focus on the sections we compile... BTW, ifdef NewspeakVM, OK, but what do Pharo people think about it?

gai_strerror returns a TCHAR*, we cannot simply fprintf it...

7d3264e

Choose the UNICODE variant, because error messages are presumably localized and may use non-ASCII characters. The alternative would be to use `_ftprintf(stderr,TEXT("%s"),gai_strerror(gaiError))` and let -DUNICODE decide...

iconPath is char*, LoadImage expects a TCHAR*

475d84c

We now interpret iconPath as UTF-8 encoded. We convert it to WideChar and call the W version. TODO: for now, do not deal with UNC long filenames...

And account for the fact that iconPath is not NULL-TERMINATED!

ef245b6

DPRINTF must take a TCHAR*fmt because wvsprintf does!

07ff6a6

though, vfprintf does not, so reconcile by using _vftprintf from <tchar.h>

NOTIFYICONDATA.szTip maybe a WCHAR* if -DUNICODE

dfe4d09

so let's use appropriate TCHAR* functions/macros

_DISPLAY_DEVICE.DeviceString may be a WCHAR* if -DUNICODE, we cannot …

db33158

…simply sprintf. We have to test #ifdef UNICODE and, if so, convert to UTF8

VM_VERSION_TEXT is a TCHAR*, we cannot simply fprintf

46bd992

But there is no need for UNICODE in VM_VERSION_TEXT. Revert to plain ASCII and rename it VM_VERSION_VERBOSE

iniName, manufacturer and model may be WCHAR* if -DUNICODE

c5f207c

Let iniName be WCHAR unconditionally. Get manufacturer and model into a UTF16 buffer, then convert them to UTF8. While at it, protect against buffer overrun: strcat -> wcsNcat

Make stderrName and stdoutName be WCHAR*

8b14fbf

There was a mixture of TCHAR* and char* which could not work with -DUNICODE. Who knows, the TempPath might be localized, so go UNICODE...

nicolas-cellier-aka-nice added 5 commits January 1, 2019 15:38

Fix 3 potential buffer overrun

e5cd4fb

Fix my own confusion about wcsncat

4ded318

wcsncat does not care about the write limit! (buffer overrun). It only specifies the maximum number of characters to read from source... This way, we pay one more Shlemiel the painter run, and write hyper convoluted code. Nice!

Fix: Shlemiel the painter needs to paint strncat too

af19ed7

Like wcsncat, the ugly and correct usage is strncat( dest, src, sizeof(dest) - 1 - strlen(dest) );
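Spelled out for both the narrow and the wide case, a small sketch. The third argument is the room left in the destination, not its total size. `NAME_CAP` and the function names are illustrative, and an explicit capacity constant replaces `sizeof(dest)` because an array passed as a parameter decays to a pointer.

```c
#include <string.h>
#include <wchar.h>

#define NAME_CAP 8   /* capacity in elements, terminator included */

/* Precondition for both: dest is terminated and shorter than NAME_CAP.
   The extra strlen/wcslen pass is the price of these APIs. */
static void append_narrow(char *dest, const char *src)
{
    strncat(dest, src, NAME_CAP - 1 - strlen(dest));
}

static void append_wide(wchar_t *dest, const wchar_t *src)
{
    wcsncat(dest, src, NAME_CAP - 1 - wcslen(dest));
}
```

Appending "defghij" to "abc" in an 8-element buffer leaves "abcdefg": the source is cut so that the terminator still fits.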

Fix another potential Buffer overrun in sqWin32MIDI.c

252e2a8

The joy of 0-based indices... [skip travis]

nicolas-cellier-aka-nice commented Jan 2, 2019

nicolas-cellier-aka-nice merged commit 7136c67 into Cog Jan 2, 2019

nicolas-cellier-aka-nice deleted the WIN64_UNICODE branch January 2, 2019 14:54

tesonep added a commit to tesonep/opensmalltalk-vm that referenced this pull request Sep 1, 2021

Merge pull request OpenSmalltalk#332 from pharo-project/feat/adding-c…

2e3d3fa

…alling-convention-flag Exposing the ABI selection to the image
