support unicode characters in command line arguments for windows #96

Closed
forworldm opened this issue Nov 4, 2021 · 4 comments

@forworldm

On Windows, argv is encoded in the ANSI code page. Your code seems to assume it is UTF-8 and converts it to wide characters when calling system functions.

@forworldm
Author

wpath = path_to_windows(path);

file->fd = CreateFile(filename, access_flags, share_mode, NULL, creation_mode,

Your code seems to be mixing these two encodings?

@AgentD
Owner

AgentD commented Dec 5, 2021

It took a little bit longer than expected, but I finally got around to looking into this (and also got stuck on another bug along the way during testing). I hope that this will be fixed soon-ish in a new release with primarily Windows fixes. There are now commits on master and fixes-1.1.0 that try to address this issue, but I'm afraid that it will require a little more research, review and testing.

A wrapper for the main() function was added that obtains the actual UTF-16 command line and converts it to UTF-8 before running the real main() function. The libsquashfs Windows port has been modified to automatically convert the filename argument from UTF-8 to UTF-16 internally, and use the wide-char API. A feature flag is used to retain the existing code-page-random behavior, if desired. The libfstream code (primarily used for processing tar files with transparent decompression) has also been fixed. The directory scanning code already uses the wide-char API.
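To illustrate the wrapper idea described above, here is a minimal sketch, not the code from the actual commits. The names `utf16_to_utf8` and `call_with_utf8_args` are made up for this example. On Windows one would normally let `WideCharToMultiByte` with `CP_UTF8` do the conversion; it is written out by hand here only so the conversion step can be compiled and checked on any platform. The Windows-only part assumes linking against shell32 for `CommandLineToArgvW`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: convert a NUL-terminated UTF-16 string into a newly
 * allocated UTF-8 string. Unpaired surrogates become U+FFFD.
 * Returns NULL if the allocation fails. */
static char *utf16_to_utf8(const uint16_t *in)
{
	size_t len = 0, i, o = 0;
	char *out;

	assert(in != NULL);
	while (in[len] != 0)
		++len;

	/* One UTF-16 code unit never needs more than 3 UTF-8 bytes
	 * (a surrogate pair is 2 units and becomes 4 bytes). */
	out = malloc(len * 3 + 1);
	if (out == NULL)
		return NULL;

	for (i = 0; i < len; ++i) {
		uint32_t cp = in[i];

		if (cp >= 0xD800 && cp <= 0xDBFF) {
			if (i + 1 < len && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
				cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00u);
				++i;
			} else {
				cp = 0xFFFD; /* high surrogate without a partner */
			}
		} else if (cp >= 0xDC00 && cp <= 0xDFFF) {
			cp = 0xFFFD; /* stray low surrogate */
		}

		if (cp < 0x80) {
			out[o++] = (char)cp;
		} else if (cp < 0x800) {
			out[o++] = (char)(0xC0 | (cp >> 6));
			out[o++] = (char)(0x80 | (cp & 0x3F));
		} else if (cp < 0x10000) {
			out[o++] = (char)(0xE0 | (cp >> 12));
			out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
			out[o++] = (char)(0x80 | (cp & 0x3F));
		} else {
			out[o++] = (char)(0xF0 | (cp >> 18));
			out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
			out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
			out[o++] = (char)(0x80 | (cp & 0x3F));
		}
	}

	out[o] = '\0';
	return out;
}

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h> /* CommandLineToArgvW, needs shell32 at link time */

/* Hypothetical wrapper: fetch the real UTF-16 command line, convert every
 * argument to UTF-8, then hand over to the actual entry point. */
static int call_with_utf8_args(int (*real_main)(int, char **))
{
	int argc, i, ret = EXIT_FAILURE;
	wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), &argc);
	char **argv;

	if (wargv == NULL)
		return EXIT_FAILURE;

	argv = calloc((size_t)argc + 1, sizeof(*argv));
	if (argv != NULL) {
		for (i = 0; i < argc; ++i) {
			/* wchar_t is a 16 bit type on Windows */
			argv[i] = utf16_to_utf8((const uint16_t *)wargv[i]);
			if (argv[i] == NULL)
				break;
		}
		if (i == argc)
			ret = real_main(argc, argv);
		for (i = 0; i < argc; ++i)
			free(argv[i]);
		free(argv);
	}

	LocalFree(wargv);
	return ret;
}
#endif
```

The reverse direction (UTF-8 file name to UTF-16 before calling the wide-char file API) would follow the same pattern with the roles of the two encodings swapped.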

This was sufficient that I could use the command line tools for accessing files/archives with German and Chinese names when running some quick tests.

Input files (i.e. the gensquashfs pack file) are interpreted as being UTF-8 encoded. This might be a problem, since plain text files on Windows could easily be code-page-random or UTF-16. Furthermore, the strings in an archive could in theory be anything, not necessarily UTF-8, which might also have to be addressed.
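As a rough illustration of how such input could at least be sanity-checked, the following sketch sniffs a buffer for a byte order mark and otherwise tests whether it is well-formed UTF-8. This is not something the tools are known to do; `guess_encoding`, `is_valid_utf8` and the `ENC_*` constants are names invented for this example, and a failed check can only tell that the data is not UTF-8, not which code page it actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
enum text_encoding { ENC_UTF8, ENC_UTF16, ENC_UNKNOWN };

/* Returns 1 if the buffer is well-formed UTF-8 (no truncated or overlong
 * sequences, no surrogates, nothing above U+10FFFF), 0 otherwise. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = s[i];
		uint32_t cp;
		size_t n, k;

		if (c < 0x80) {
			++i;
			continue;
		} else if ((c & 0xE0) == 0xC0) {
			n = 1; cp = c & 0x1F;
		} else if ((c & 0xF0) == 0xE0) {
			n = 2; cp = c & 0x0F;
		} else if ((c & 0xF8) == 0xF0) {
			n = 3; cp = c & 0x07;
		} else {
			return 0; /* stray continuation byte or invalid lead byte */
		}

		if (len - i <= n)
			return 0; /* sequence is cut off at the end of the buffer */

		for (k = 1; k <= n; ++k) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}

		if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
		    (n == 3 && cp < 0x10000))
			return 0; /* overlong encoding */
		if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
			return 0;

		i += n + 1;
	}
	return 1;
}

/* Look for a BOM first, then fall back to a UTF-8 validity check. */
static enum text_encoding guess_encoding(const unsigned char *s, size_t len)
{
	assert(s != NULL || len == 0);

	if (len >= 2 && ((s[0] == 0xFF && s[1] == 0xFE) ||
			 (s[0] == 0xFE && s[1] == 0xFF)))
		return ENC_UTF16;
	if (len >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF)
		return ENC_UTF8;

	return is_valid_utf8(s, len) ? ENC_UTF8 : ENC_UNKNOWN;
}
```

Pure ASCII input passes the check as well, which is harmless since ASCII is a subset of UTF-8. UTF-16 text without a BOM would be reported as unknown here.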

@AgentD AgentD self-assigned this Dec 5, 2021
@forworldm
Author

Thanks for your work. I can create an archive file with a non-ASCII directory name now. However, the tool will print garbled text if a file name contains non-ASCII characters. One possible solution is to call SetConsoleOutputCP(CP_UTF8) in the main function.

@AgentD
Owner

AgentD commented Mar 11, 2022

Hi,

first of all, sorry for the long delay. While I was preoccupied with work/personal issues for much longer than I had initially hoped, I did occasionally find some time to look into this and test several approaches on a Windows 7 VM.

Sadly, the suggested drop-in solution doesn't seem to work. Using SetConsoleOutputCP still causes individual code units to be sent to the console. Apparently printf/fputs internally use the ANSI version of the underlying API and simply interpret the UTF-8 multi-byte sequences as Latin-1 (I guess?) and they end up themselves converted to UTF-8.

Trying to do _setmode(_fileno(stdout), _O_U8TEXT); causes printf and friends to trigger an assert. As the MSDN page says, they do not support output to a "Unicode stream".

I tried another approach to use pre-processor magic to redirect the stdio functions to Windows specific, custom implementations (generate a finished string for the printf ones) and then convert it to UTF-16 and use the wide-char versions. This strangely worked for German Umlaut characters, but Chinese text magically disappeared. Also, if it had worked, this would result in UTF-16 files when redirecting the output to a file or a pipe. Particularly rdsquashfs -d is supposed to generate output that gensquashfs can use as a manifest file.

I modified this approach and instead added a hacky check whether the target stream is stdout or stderr, directly get the handles using GetStdHandle, check if it is a console using GetFileType, then convert to UTF-16 and use WriteConsoleW. If it isn't a console, the original (presumed) UTF-8 is kept, so redirecting to a file or pipe causes the output to remain unmodified. This is what ultimately ended up in commit 6447b19.
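The console check described above could look roughly like the following sketch. It is not the code from commit 6447b19; `write_utf8` is a name chosen for this example, and the string is assumed to be already fully formatted UTF-8. On non-Windows systems the function simply falls through to `fputs`, which is also the path taken on Windows when the output is redirected.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: write a UTF-8 string to a stdio stream. If the stream
 * is stdout/stderr and attached to a console, convert to UTF-16 and use
 * WriteConsoleW. Otherwise the bytes are passed through untouched. */
static int write_utf8(FILE *fp, const char *str)
{
	HANDLE h = INVALID_HANDLE_VALUE;
	wchar_t *wstr;
	DWORD written;
	BOOL ok;
	int wlen;

	assert(fp != NULL && str != NULL);

	if (fp == stdout)
		h = GetStdHandle(STD_OUTPUT_HANDLE);
	else if (fp == stderr)
		h = GetStdHandle(STD_ERROR_HANDLE);

	/* FILE_TYPE_CHAR is reported for consoles (and other character
	 * devices); files and pipes take the pass-through path. */
	if (h == INVALID_HANDLE_VALUE || h == NULL ||
	    GetFileType(h) != FILE_TYPE_CHAR)
		return fputs(str, fp) == EOF ? -1 : 0;

	/* With a length of -1 the result counts the terminating NUL. */
	wlen = MultiByteToWideChar(CP_UTF8, 0, str, -1, NULL, 0);
	if (wlen <= 0)
		return -1;

	wstr = malloc((size_t)wlen * sizeof(*wstr));
	if (wstr == NULL)
		return -1;
	MultiByteToWideChar(CP_UTF8, 0, str, -1, wstr, wlen);

	/* Flush first so earlier buffered stdio output keeps its order. */
	fflush(fp);
	ok = WriteConsoleW(h, wstr, (DWORD)(wlen - 1), &written, NULL);
	free(wstr);
	return ok ? 0 : -1;
}
#else
/* Elsewhere the terminal is assumed to take UTF-8 directly. */
static int write_utf8(FILE *fp, const char *str)
{
	assert(fp != NULL && str != NULL);
	return fputs(str, fp) == EOF ? -1 : 0;
}
#endif
```

A printf-style front end would format into a buffer first (for example with `vsnprintf`) and then pass the finished string to a helper like this one.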

Alternatively, I also tried changing the codepage to UTF-8, not converting the strings at all and using WriteConsoleA instead. This worked for both German and Chinese text, but broke line wrapping behavior on the console for some reason.

The approach in 6447b19 has worked most reliably so far, but is still not perfect. In the Windows 7 VM, printing Chinese text causes a weird indentation to be added in front of every printed line (I guess this is caused by a different font being switched to?). Also, when manually setting the codepage to UTF-8 (by running chcp 65001), I can see continuation characters again, but mapped somewhere into the CJK range. I guess at the end of the day, Windows is just not an ideal platform to write CLI programs for.

@AgentD AgentD closed this as completed Mar 11, 2022
2 participants