unzip-like Unicode support? #125

dgcampea · 2021-09-27T10:22:17Z

If the filenames in a zip file are stored in an encoding different from the one the current system uses, accessing them through zziplib results in mojibake or invalid byte sequences.
unzip extracts them fine without any extra options, but unzip -UU -O UTF8 produces the same mojibake seen with zziplib. Inspecting the zip with a hex editor, it looks like the archive actually stores two filenames per entry: one in UTF-8 and another in the original system encoding.

Is there anything that can be done in zziplib to pick the UTF8 filename instead?

gdraheim · 2021-09-27T10:59:23Z

zziplib does scan the zip directory ahead of time, so it should not be very complicated to guide the processing function to do a re-encoding. It was simply not needed so far - and I don't have time to implement such a thing.

dgcampea · 2021-09-29T16:47:33Z

Without delving into encoding conversion (via the ICU library), and relying only on the Unicode Path extra field (subfield 0x7075), which is already in UTF-8 when present: if this were implemented, should the zziplib API be duplicated into UTF-8-aware functions?
Or would compile-time macros that toggle support for the Unicode Path extra field be a better option?
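
For reference, a minimal sketch of how the 0x7075 subfield could be located in an entry's extra-field data. It assumes the Info-ZIP Unicode Path layout (little-endian header ID and size, then a 1-byte version, a 4-byte CRC-32 of the legacy filename, and the UTF-8 name). `find_unicode_path` and `rd16` are hypothetical helpers, not part of the zziplib API, and the sketch does not verify the CRC-32, which a real implementation should do before trusting the UTF-8 name.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 16-bit value. */
static uint16_t rd16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Hypothetical helper (not a zziplib function): scan the extra-field
 * block of a zip entry for subfield 0x7075 and return a pointer to the
 * UTF-8 name inside it, or NULL if it is absent or malformed.
 * The name is NOT NUL-terminated; its length is stored in *name_len.
 */
static const unsigned char *
find_unicode_path(const unsigned char *extra, size_t extra_len,
                  size_t *name_len)
{
    size_t pos = 0;

    while (extra_len - pos >= 4) {
        uint16_t id   = rd16(extra + pos);
        uint16_t size = rd16(extra + pos + 2);
        pos += 4;

        if (size > extra_len - pos)
            return NULL;                /* truncated subfield */

        if (id == 0x7075) {
            /* 1 byte version + 4 bytes CRC-32 of the legacy name */
            if (size < 5 || extra[pos] != 1)
                return NULL;
            *name_len = (size_t)size - 5;
            return extra + pos + 5;
        }
        pos += size;                    /* skip unrelated subfield */
    }
    return NULL;
}
```

A caller would fall back to the legacy filename whenever this returns NULL.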

gdraheim · 2021-09-30T10:43:11Z

Well, if you really want to be correct then you need to consider that file system functions expect the arguments for file names to be in the encoding that is used on that file system. Just disregarding the encoding was an easy approach. Since unix-ish systems switched to UTF8 more than a decade ago, it did work as expected.

I don't know of any existing file API that differentiates between native and utf8 encoding - instead we see that a derived API based on wchar_t has been developed, and with POSIX mbstowcs and friends there is a standard API which can do the conversion (without ICU libs). I guess that is too much of an effort to be put into zziplib.

So maybe a compile-switch is the only thing that could be put into the current design - forcing the arguments of the functions to be utf8 no matter what the operating system uses elsewhere. The testsuite could be adapted to that, handing over arguments in utf8 as well. Depending on the project it may help - but it would not go out into standard packages in shared libraries.
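
To illustrate the mbstowcs route, here is a minimal sketch of converting a file name from the locale's multibyte encoding to wchar_t. `native_to_wide` is a hypothetical helper, not something zziplib provides, and the sketch relies on the POSIX behaviour where a NULL destination makes mbstowcs return the required length.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert a name in the current locale's multibyte
 * encoding to a newly allocated wide string. Returns NULL on an invalid
 * byte sequence or allocation failure; the caller frees the result.
 */
static wchar_t *native_to_wide(const char *name)
{
    /* First pass: with a NULL destination, POSIX mbstowcs returns the
       number of wide characters needed (excluding the terminator). */
    size_t n = mbstowcs(NULL, name, 0);
    if (n == (size_t)-1)
        return NULL;                    /* invalid multibyte sequence */

    wchar_t *wide = malloc((n + 1) * sizeof *wide);
    if (wide == NULL)
        return NULL;

    /* Second pass: convert, including the terminating null. */
    if (mbstowcs(wide, name, n + 1) == (size_t)-1) {
        free(wide);
        return NULL;
    }
    return wide;
}
```

The conversion follows the LC_CTYPE category of the current locale, so a program would normally call setlocale(LC_CTYPE, "") once at startup before using it.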
