unzip-like Unicode support? #125

Open
dgcampea opened this issue Sep 27, 2021 · 3 comments

Comments

@dgcampea

dgcampea commented Sep 27, 2021

If the filenames in a zip file are stored in an encoding different from the one the current system uses, accessing them through zziplib yields mojibake or invalid byte sequences.
unzip extracts them correctly without any extra options, but with unzip -UU -O UTF8 it produces the same mojibake seen with zziplib. Inspecting the zip with a hex editor, it looks like the archive actually stores two filenames: one in UTF-8 and another in the original system encoding.

Is there anything that can be done in zziplib to pick the UTF-8 filename instead?

@gdraheim
Owner

zziplib does scan the zip directory ahead of time, so it does not seem very complicated to guide the processing function to do a re-encoding. It was simply not needed so far - and I don't have time to implement such a thing.

@dgcampea
Author

Without delving into encoding conversion (e.g. via the ICU library), and relying only on the Unicode Path extra field (header ID 0x7075), which is already in UTF-8 when present: if this were implemented, should the zziplib API be duplicated as UTF-8-aware functions?
Or would compile-time macros that toggle support for the Unicode Path extra field be a better option?
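
For reference, a minimal sketch of what reading that extra field could look like. It is not zziplib code: the helper names (`find_unicode_path`, `unicode_path_is_current`, `crc32_bytes`) are made up for illustration, and it assumes the caller already has the raw extra-field bytes of a directory entry. The record layout used here (version byte 1, CRC-32 of the legacy header name, then the UTF-8 name) follows the Info-ZIP Unicode Path description in the ZIP APPNOTE.

```c
#include <stddef.h>
#include <stdint.h>

/* Little-endian readers; ZIP stores all multi-byte fields little-endian. */
static uint16_t rd16(const unsigned char *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Plain bitwise CRC-32 (the polynomial ZIP uses), kept small rather than fast. */
static uint32_t crc32_bytes(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Hypothetical helper: walk the (id, size, data) records of an extra-field
 * buffer and look for header ID 0x7075.
 * Returns a pointer to the UTF-8 name (NOT NUL-terminated) and fills in
 * *name_len and *name_crc, or returns NULL if the record is absent or malformed.
 */
static const unsigned char *
find_unicode_path(const unsigned char *extra, size_t extra_len,
                  size_t *name_len, uint32_t *name_crc)
{
    size_t pos = 0;
    while (extra_len - pos >= 4) {
        uint16_t id = rd16(extra + pos);
        uint16_t size = rd16(extra + pos + 2);
        pos += 4;
        if (size > extra_len - pos)
            return NULL;                 /* record runs past the buffer */
        if (id == 0x7075) {
            if (size < 5 || extra[pos] != 1)
                return NULL;             /* too short or unknown version */
            *name_crc = rd32(extra + pos + 1);
            *name_len = (size_t)size - 5;
            return extra + pos + 5;
        }
        pos += size;
    }
    return NULL;
}

/* The UTF-8 name should only be trusted if the stored CRC still matches
 * the legacy name from the header; otherwise the entry was renamed by a
 * tool that did not update the extra field. */
static int unicode_path_is_current(const unsigned char *legacy_name,
                                   size_t legacy_len, uint32_t name_crc)
{
    return crc32_bytes(legacy_name, legacy_len) == name_crc;
}
```

The returned name is a slice of the extra-field buffer, so a caller would have to copy and NUL-terminate it before handing it to any string API.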

@gdraheim
Owner

Well, if you really want to be correct then you need to consider that file system functions expect file name arguments to be in the encoding used on that file system. Just disregarding the encoding was an easy approach. Since unix-ish systems switched to UTF-8 more than a decade ago, it did work as expected.

I don't know of any existing file API that differentiates between native and UTF-8 encoding - instead we see that a derived API based on wchar_t has been developed, and with POSIX mbstowcs and friends there is a standard API which can do the conversion (without ICU libs). I guess that is too much of an effort to be put into zziplib.

So maybe a compile-time switch is the only thing that could be put into the current design - forcing the arguments of the functions to be UTF-8 no matter what the operating system uses elsewhere. The testsuite could be adapted to that, handing over arguments in UTF-8 as well. Depending on the project it may help - but it would not go out into standard packages in shared libraries.
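
To illustrate the mbstowcs route mentioned above, here is a small sketch. The helper name `to_wide` is made up, and it relies on the POSIX behavior that a NULL destination makes mbstowcs return the required length. The conversion uses whatever encoding the current LC_CTYPE locale selects, so a real program would normally call setlocale(LC_CTYPE, "") first.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert a NUL-terminated multibyte string in the
 * current locale's encoding into a newly allocated wide string.
 * Returns NULL on an invalid byte sequence or allocation failure;
 * the caller frees the result.
 */
static wchar_t *to_wide(const char *mbs)
{
    /* First pass: ask only for the number of wide characters needed. */
    size_t n = mbstowcs(NULL, mbs, 0);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *wide = malloc((n + 1) * sizeof *wide);
    if (wide == NULL)
        return NULL;

    /* Second pass: convert, including the terminating L'\0'. */
    if (mbstowcs(wide, mbs, n + 1) == (size_t)-1) {
        free(wide);
        return NULL;
    }
    return wide;
}
```

This only converts between the locale encoding and wchar_t; it does not by itself know which encoding a zip entry name was written in.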
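
A rough sketch of how such a compile-time switch could select the name, purely as an illustration. The macro `EXAMPLE_UTF8_NAMES` and the function `entry_name` are hypothetical; zziplib defines nothing like them today, and the sketch assumes the directory scan has already produced both names as NUL-terminated strings.

```c
#include <stddef.h>

/* Hypothetical build-time switch; defaults to on for this example. */
#ifndef EXAMPLE_UTF8_NAMES
#define EXAMPLE_UTF8_NAMES 1
#endif

/*
 * Pick the name to expose for a directory entry.
 * legacy_name: the name from the zip header, in whatever encoding the
 *              archiver used.
 * utf8_name:   the name from the Unicode Path extra field, or NULL if the
 *              entry has none.
 */
static const char *entry_name(const char *legacy_name, const char *utf8_name)
{
#if EXAMPLE_UTF8_NAMES
    if (utf8_name != NULL)
        return utf8_name;
#else
    (void)utf8_name;   /* switch off: always keep the old behavior */
#endif
    return legacy_name;
}
```

With the switch off the function always returns the legacy name, so existing behavior would stay unchanged for builds that do not opt in.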
