Win64 unicode #332

…me/path Note: the Microsoft windows API mostly uses the W version (for enabling internationalized image name/path) the image uses UTF8 encoded bytes string for communication with the VM (this is best for compatibility with Unix/Mac) The idea here is that the implementation maintains both versions of the UTF8 and UTF16 path/name The appropriate macro returning a TCHAR * are also provided. This is in order to support the generic version using TCHAR, which are normally used to ease transition to UNICODE. Note about string length: No effort has been made so far to support long path names for image and VM. The path is limited to MAX_PATH in UTF16. UTF8 can eventually consume more characters than UTF16 (not necessarily more bytes). Thus, the ASCII version has been made longer (MAX_PATH_UTF8) in order to avoid an even more restrictive limit.

OK, OK, they are compatible with 16 bits windows ;) But these are not recommended any more especially if we want to go toward UNICODE paths everywhere.

This is for the generated temporary file $$squeak$$.bmp The path to temporary directory could be non ASCII, so let's be more robust to UNICODE.

See https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/control87-controlfp-control87-2?view=vs-2017 Setting these flags triggers an assertion failure when compiled with MSVC in debug mode Assertion failure (mask&~(_MCW_DN|_MCW_EM|_MCW_RC))==0

Due to this, roundUpToPage is truncating addresses > 2^32 With MSVC, minAddress is > 2^32, and it then takes ages to generate a VirtualAlloc > minAddress (several minutes).

using `toUnicode` does not do the right thing: it promotes each UTF8 byte code to short... ... which can hardly work beyond ASCII.

toUnicode is just expanding bytes to short, and re-interpret them UTF16, not regarding original encoding. This works well for pure ASCII, but is not really defined for other encodings. We should prefer using UTF8 by default for all the image-VM string transfer.

Why to get rid of toUnicode fromUnicode fromSqueak fromSqueak2? - we'd rather use UTF8 everywhere. - and it's the last usage remaining! Now the clients must use a TEXT() macro also for the format. In the future, we could eventually translate the VM messages... Get rid of wsprintf variant which has no character limit and might be prone to buffer overrun. An alternative for supporting both ASCII and WIDE is the _tcs* family in <tchar.h> See https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/vsprintf-vsprintf-l-vswprintf-vswprintf-l-vswprintf-l?view=vs-2017 In order to avoid buffer overrun, prefer a vsnprintf variant Note: I did also update the prototype in sqWin32SpurAlloc, but not the contents. It is dead code, and only there for testing purposes. We'd better remove it!

For UNICODE compatibility, - every String coming from image to the VM should better be interpreted UTF8, and converted to wide String via MultiByteToWideChar() - every String going to the image from the VM should better be converted from Wide string to UTF8 via WideCharToMultiByte() See: https://docs.microsoft.com/en-us/windows/desktop/api/stringapiset/nf-stringapiset-multibytetowidechar https://docs.microsoft.com/en-us/windows/desktop/api/stringapiset/nf-stringapiset-widechartomultibyte Side note: there is also a _tcsrrchr in <tchar.h> at least since visual studio 2015 See https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/strrchr-wcsrchr-mbsrchr-mbsrchr-l?view=vs-2017 <tchar.h> is also supported by mingw so if ever we need lstrrchr again, we'll use that.

Reminder: even in WIN64, _WIN32 is defined, so the comment was a bit misleading anyway.

…lusive options Therefore, when we define `WIN32_FILE_SUPPORT` we must also define `NO_STD_FILE_SUPPORT` Until now, this knowledge was in build.win*/common/Makefile.* (and a legacy MSVC project in plaftorms/win32/misc/Squeak.dsp - on pourrait réduire la voilure !!!) Since it is easy to forget the second define, lets extend the knowledge in the sqPlatformSpecific.h

Even if buf were re-interpreted as WCHAR* when -DUNICODE, strcmp wouldn't do the right thing (it will stop at first ASCII because high 8 bits will be zero!) We thus use the TCHAR*dedicated `_tcscmp`. If ever we want to switch to UTF16 (WCHAR*) windowClassName, then it will be `wcscmp`.

That's the limit of using compiler warnings: we only focus on the sections we compile... BTW, ifdef NewspeakVM, OK, but what do Pharo people think about it?

Choose the UNICODE variant, because error messages are presumably localized an may use non ASCII characters The alternative would be to use `_ftprintf(stderr,TEXT("%s"),gai_strerror(gaiError))` and let -DUNICODE decide...

The low level `RegQueryValueEx` deals with a char*, but here char* just means some un-interpreted bytes, not a string! If we compile with `-DUNICODE` the un-interpreted bytes will contain a WCHAR* And if we want to properly NULL terminate this WCHAR*, then we need 2 null bytes, not 1!

Rationale: there's no urge in providing localized UNICODE info... That's IP addresses, etc... Eventually, server names could be UNICODE but this seems to be a real mess! The short term goal is to enable compilation with -DUNICODE For longer term, we'll see later.

We now interpret iconPath as UTF-8 encoded We convert it to WideChar and call the W version. TODO: for now, do not deal with UNC long filenames...

though, vfprintf does not, so reconcile by using _vftprintf from <tchar.h>

so let's use appropriate TCHAR* functions/macros

…simply sprintf We have to test #ifdef UNICODE, and if so, convert to UTF8

But there is no need for UNICODE in VM_VERSION_TEXT Revert to plain ASCII and rename it VM_VERSION_VERBOSE

Let iniName be WCHAR unconditionally. Get manufacturer and model into a UTF16 buffer, then convert them to UTF8 while at it, protect buffer overrun strcat - > wcsNcat

We want to generate an UTF8 report (do we really?). So provide a `RegLookupUTF8String`, so as to make all the queries with W variant, then convert all information to UTF8 While at it, use snprintf instead of sprintf and strncat instead of strcat

There was a mixture of TCHAR* and char* which could not work with -DUNICODE Who knows, the TempPath might be localized, so go UNICODE...

I'm very sorry to dilapidate all this historical knowledge, but frankly YAGNI! Even if we are a museum, we can't expose all our art in permanent collections! Note that minimal version is already set to XP (via WINVER:=-D_WIN32_WINNT=0x0501 -DWINVER=0x0501 in Makefile.tools) So we're keeping this stuf for nothing, and now it gets in our way to UNICODE.

These are the last two obstacles that generate compiler warnings with -DUNICODE No compiler warning does not mean that UNICODE is OK and ready to go. It means that we have at least eliminated all the trivial TCHAR*/char*/WCHAR* mismatch (and there were a bunch of them!!!) Also the code is like a battlefield with lot of different ad hoc recipes, that would deserve more uniform approach (refactoring) But small steps! To reach the current stage, we have to overhaul printCommandLine() It can't answer a TCHAR* when sqMain expects a char*argv[]... So, like what is done in WinMain thru getCommandLineW, we first produce a Wide command line, then convert to UTF8. There is currently no provision for conversion failure: I don't know how to report and exit when the VM is ran as a service. Note that command line parsing is then working the other way around: some of the arguments are converted back to UTF16. I call this style tricotage coding: une maille à l'endroit, une maille à l'envers. Maybe a deeper change will be to WCHAR* all the way down, but let's differ this decision. Small steps! Note that another source of problems is RegQueryValueExW. When we query a WCHAR* string, it is not necessarily NULL terminated. But every answer is handled as raw un-interpretd-bytes (char *), whose byte length is answered in dwSize. That's not the WCHAR character length! So buffer[dwSize] = 0 is not the right thing: - if buffer is declared WVCHAR*, this would write the NULL well beyond the real string length (and eventually overrun the buffer); - if buffer is declared char*, this would not guaranty a terminating NULL WCHAR (we need 2 zero bytes to make a NULL WCHAR). Usage of this function will require a careful review, the price of low level API... Also note that we seem to compile with flag -DNO_SERVICE right now. Why? I have not enough knowledge in this area and lack tests/examples. In that conditions, it's not easy to test the modifications! Any help will be greatly appreciated. The good part is that It won't break current VM usage... You know what I think of dead-code, but half alive Frankenstein code scares me as well ;)

wcsncat does not care about the write limit! (buffer overrun). It only specifies the maximum number of characters to read from source... This way, we pay one more Shlemiel the painter run, and write hyper convoluted code. Nice!

Like wcsncat, the ugly and correct usage is strncat( dest, src, sizeof(dest) - 1 - strlen(dest) );

Here, i don't want to redefine error() to take a TCHAR* So the simplest alternative is to switch to MessageBoxA But vmLogDirA may contain UNICODE, and UTF8 encoded character may be mangled in the MessageBox. The right way is to switch to W variant unconditionnally, and interpret msg as UTF8...

The joy of 0-based indices... [skip travis]

Note that keyName was declared char * in the original code. But now it is WCHAR*, so keyName+18 already does the right thing (skip 18 char) while modified code would now skip 36! Note that original code was using memcpy for this case of overlapping memory which is BAD and eventually UB. [skip travis]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Win64 unicode #332

Win64 unicode #332

Commits on Dec 30, 2018

Commits on Dec 31, 2018

Commits on Jan 1, 2019

Commits on Jan 2, 2019