Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Win64 unicode #332

Merged
merged 36 commits into from Jan 2, 2019
Merged

Win64 unicode #332

merged 36 commits into from Jan 2, 2019

Conversation

nicolas-cellier-aka-nice
Copy link
Contributor

We can now at least compile the whole VM and main plugins with -DUNICODE

…me/path

Note:
the Microsoft windows API mostly uses the W version (for enabling internationalized image name/path)
the image uses UTF8 encoded bytes string for communication with the VM (this is best for compatibility with Unix/Mac)
The idea here is that the implementation maintains both versions of the UTF8 and UTF16 path/name

The appropriate macro returning a TCHAR * are also provided.
This is in order to support the generic version using TCHAR, which are normally used to ease transition to UNICODE.

Note about string length:
No effort has been made so far to support long path names for image and VM.
The path is limited to MAX_PATH in UTF16.
UTF8 can eventually consume more characters than UTF16 (not necessarily more bytes).
Thus, the ASCII version has been made longer (MAX_PATH_UTF8) in order to avoid an even more restrictive limit.
OK, OK, they are compatible with 16 bits windows ;)
But these are not recommended any more especially if we want to go toward UNICODE paths everywhere.
This is for the generated temporary file $$squeak$$.bmp
The path to temporary directory could be non ASCII, so let's be more robust to UNICODE.
See https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/control87-controlfp-control87-2?view=vs-2017
Setting these flags triggers an assertion failure when compiled with MSVC in debug mode

    Assertion failure (mask&~(_MCW_DN|_MCW_EM|_MCW_RC))==0
Due to this, roundUpToPage is truncating addresses > 2^32
With MSVC, minAddress is > 2^32, and it then takes ages to generate a VirtualAlloc > minAddress (several minutes).
using `toUnicode` does not do the right thing: it promotes each UTF8 byte code to short...
... which can hardly work beyond ASCII.
toUnicode is just expanding bytes to short, and re-interpret them UTF16, not regarding original encoding.
This works well for pure ASCII, but is not really defined for other encodings.
We should prefer using UTF8 by default for all the image-VM string transfer.
Why to get rid of toUnicode fromUnicode fromSqueak fromSqueak2?

- we'd rather use UTF8 everywhere.
- and it's the last usage remaining!

Now the clients must use a TEXT() macro also for the format. In the future, we could eventually translate the VM messages...

Get rid of wsprintf variant which has no character limit and might be prone to buffer overrun.
An alternative for supporting both ASCII and WIDE is the _tcs* family in <tchar.h>
See https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/vsprintf-vsprintf-l-vswprintf-vswprintf-l-vswprintf-l?view=vs-2017
In order to avoid buffer overrun, prefer a vsnprintf variant

Note: I did also update the prototype in sqWin32SpurAlloc, but not the contents.
It is dead code, and only there for testing purposes.
We'd better remove it!
For UNICODE compatibility,

- every String coming from image to the VM should better be interpreted UTF8, and converted to wide String via MultiByteToWideChar()
- every String going to the image from the VM should better be converted from Wide string to UTF8 via WideCharToMultiByte()

See:
https://docs.microsoft.com/en-us/windows/desktop/api/stringapiset/nf-stringapiset-multibytetowidechar
https://docs.microsoft.com/en-us/windows/desktop/api/stringapiset/nf-stringapiset-widechartomultibyte

Side note: there is also a _tcsrrchr in <tchar.h> at least since visual studio 2015
See
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/strrchr-wcsrchr-mbsrchr-mbsrchr-l?view=vs-2017
<tchar.h> is also supported by mingw so if ever we need lstrrchr again, we'll use that.
Reminder: even in WIN64, _WIN32 is defined, so the comment was a bit misleading anyway.
…lusive options

Therefore, when we define `WIN32_FILE_SUPPORT` we must also define `NO_STD_FILE_SUPPORT`
Until now, this knowledge was in build.win*/common/Makefile.*
(and a legacy MSVC project in plaftorms/win32/misc/Squeak.dsp - on pourrait réduire la voilure !!!)
Since it is easy to forget the second define, lets extend the knowledge in the sqPlatformSpecific.h
Even if buf were re-interpreted as WCHAR* when -DUNICODE,
strcmp wouldn't do the right thing (it will stop at first ASCII because high 8 bits will be zero!)
We thus use the TCHAR*dedicated `_tcscmp`.
If ever we want to switch to UTF16 (WCHAR*) windowClassName, then it will be `wcscmp`.
That's the limit of using compiler warnings: we only focus on the sections we compile...
BTW, ifdef NewspeakVM, OK, but what do Pharo people think about it?
Choose the UNICODE variant, because error messages are presumably localized an may use non ASCII characters
The alternative would be to use `_ftprintf(stderr,TEXT("%s"),gai_strerror(gaiError))` and let -DUNICODE decide...
The low level `RegQueryValueEx` deals with a char*, but here char* just means some un-interpreted bytes, not a string!
If we compile with `-DUNICODE` the un-interpreted bytes will contain a WCHAR*
And if we want to properly NULL terminate this WCHAR*, then we need 2 null bytes, not 1!
Rationale: there's no urge in providing localized UNICODE info...
That's IP addresses, etc...
Eventually, server names could be UNICODE but this seems to be a real mess!
The short term goal is to enable compilation with -DUNICODE
For longer term, we'll see later.
We now interpret iconPath as UTF-8 encoded
We convert it to WideChar and call the W version.
TODO: for now, do not deal with UNC long filenames...
though, vfprintf does not, so reconcile by using _vftprintf from <tchar.h>
so let's use appropriate TCHAR* functions/macros
…simply sprintf

We have to test #ifdef UNICODE, and if so, convert to UTF8
But there is no need for UNICODE in VM_VERSION_TEXT
Revert to plain ASCII and rename it VM_VERSION_VERBOSE
Let iniName be WCHAR unconditionally.
Get manufacturer and model into a UTF16 buffer, then convert them to UTF8
while at it, protect buffer overrun strcat - > wcsNcat
We want to generate an UTF8 report (do we really?).
So provide a `RegLookupUTF8String`, so as to make all the queries with W variant,
then convert all information to UTF8

While at it, use snprintf instead of sprintf and strncat instead of strcat
There was a mixture of TCHAR* and char* which could not work with -DUNICODE
Who knows, the TempPath might be localized, so go UNICODE...
I'm very sorry to dilapidate all this historical knowledge, but frankly YAGNI!
Even if we are a museum, we can't expose all our art in permanent collections!
Note that minimal version is already set to XP (via WINVER:=-D_WIN32_WINNT=0x0501 -DWINVER=0x0501 in Makefile.tools)
So we're keeping this stuf for nothing, and now it gets in our way to UNICODE.
These are the last two obstacles that generate compiler warnings with -DUNICODE
No compiler warning does not mean that UNICODE is OK and ready to go.
It means that we have at least eliminated all the trivial TCHAR*/char*/WCHAR* mismatch (and there were a bunch of them!!!)
Also the code is like a battlefield with lot of different ad hoc recipes, that would deserve more uniform approach (refactoring)
But small steps!

To reach the current stage, we have to overhaul printCommandLine()
It can't answer a TCHAR* when sqMain expects a char*argv[]...
So, like what is done in WinMain thru getCommandLineW, we first produce a Wide command line, then convert to UTF8.
There is currently no provision for conversion failure: I don't know how to report and exit when the VM is ran as a service.

Note that command line parsing is then working the other way around: some of the arguments are converted back to UTF16.
I call this style tricotage coding: une maille à l'endroit, une maille à l'envers.
Maybe a deeper change will be to WCHAR* all the way down, but let's differ this decision. Small steps!

Note that another source of problems is RegQueryValueExW.
When we query a WCHAR* string, it is not necessarily NULL terminated.
But every answer is handled as raw un-interpretd-bytes (char *), whose byte length is answered in dwSize.
That's not the WCHAR character length! So buffer[dwSize] = 0 is not the right thing:

- if buffer is declared WVCHAR*, this would write the NULL well beyond the real string length (and eventually overrun the buffer);
- if buffer is declared char*, this would not guaranty a terminating NULL WCHAR (we need 2 zero bytes to make a NULL WCHAR).

Usage of this function will require a careful review, the price of low level API...

Also note that we seem to compile with flag -DNO_SERVICE right now. Why?
I have not enough knowledge in this area and lack tests/examples.
In that conditions, it's not easy to test the modifications! Any help will be greatly appreciated.
The good part is that It won't break current VM usage...
You know what I think of dead-code, but half alive Frankenstein code scares me as well ;)
wcsncat does not care about the write limit! (buffer overrun).
It only specifies the maximum number of characters to read from source...

This way, we pay one more Shlemiel the painter run, and write hyper convoluted code.
Nice!
Like wcsncat, the ugly and correct usage is

    strncat( dest, src, sizeof(dest)  - 1 - strlen(dest) );
Here, i don't want to redefine error() to take a TCHAR*
So the simplest alternative is to switch to MessageBoxA
But vmLogDirA may contain UNICODE, and UTF8 encoded character may be mangled in the MessageBox.

The right way is to switch to W variant unconditionnally, and interpret msg as UTF8...
The joy of 0-based indices...
[skip travis]
if (_strnicmp(keyName, "\\registry\\machine\\", 18) == 0) {
memcpy(keyName, keyName+18, strlen(keyName)-17);
if (_wcsnicmp(keyName, L"\\registry\\machine\\", 18) == 0) {
memmove(keyName, keyName+18*sizeof(WCHAR), (wcslen(keyName)-17)*sizeof(WCHAR));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err: keyName is WCHAR*, keyName+18 skips 18 characters, keyName+18*sizeof(WCHAR) skips 36 characters!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3e51616

Note that keyName was declared char * in the original code.
But now it is WCHAR*, so keyName+18 already does the right thing (skip 18 char) while modified code would now skip 36!
Note that original code was using memcpy for this case of overlapping memory which is BAD and eventually UB.
[skip travis]
@nicolas-cellier-aka-nice nicolas-cellier-aka-nice deleted the WIN64_UNICODE branch January 2, 2019 14:54
tesonep added a commit to tesonep/opensmalltalk-vm that referenced this pull request Sep 1, 2021
…alling-convention-flag

Exposing the ABI selection to the image
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants