unzip non-UTF-8 archives #112

mixtur · 2021-12-14T16:11:49Z

What can't you do right now?
In Russia, file names inside zip files are often encoded with cp866. Such filenames are currently decoded incorrectly in fflate. The best I can do is

  new TextDecoder('cp866').decode(strToU8(file.name))

but it produces correct characters interleaved with some gibberish.
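
One likely cause of the interleaved gibberish, assuming fflate decodes names without the UTF-8 flag as Latin-1 (one character per byte): `strToU8` re-encodes the string as UTF-8 by default, so every byte above 0x7F turns into two bytes before `TextDecoder` sees it. A minimal sketch that maps each character back to a single byte instead; `recoverName` is a hypothetical helper, not part of fflate:

```typescript
// Hypothetical helper: undo a Latin-1 decode and re-decode with the real code page.
// Assumes every character in `name` has a code point <= 0xFF.
function recoverName(name: string, encoding: string): string {
  const bytes = new Uint8Array(name.length);
  for (let i = 0; i < name.length; i++) {
    bytes[i] = name.charCodeAt(i) & 0xff; // one character -> one original byte
  }
  return new TextDecoder(encoding).decode(bytes);
}

// "Привет" stored as cp866 bytes, then mis-decoded as Latin-1.
const misdecoded = String.fromCharCode(0x8f, 0xe0, 0xa8, 0xa2, 0xa5, 0xe2);
console.log(recoverName(misdecoded, 'cp866'));
```

If the Latin-1 assumption holds, passing the Latin-1 flag as in `strToU8(file.name, true)` should produce the same bytes.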

An optimal solution
Either provide the raw name in UnzipFile

{
    name: string,//as it is decoded now
    rawName: {
        bytes: Uint8Array,
        isUTF8: boolean
    },
    ondata: AsyncFlateStreamHandler,
    ...
}

, or make it possible to provide an encoding for entries marked as not utf-8.

const unzip = new Unzip();
unzip.setFallbackEncoding('cp866'); // proposed method
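
Both proposals come down to the same decision per entry: trust the UTF-8 flag when it is set, otherwise use a caller-supplied encoding. A rough sketch of that decision, where `decodeEntryName` and its parameters are hypothetical names rather than fflate API; the only fixed fact is that bit 11 (0x800) of the ZIP general purpose bit flag marks a UTF-8 name:

```typescript
// Bit 11 of the ZIP general purpose bit flag marks the name as UTF-8.
const UTF8_FLAG = 0x800;

// Hypothetical helper, not fflate API: decode an entry name, using
// `fallback` only when the entry is not flagged as UTF-8.
function decodeEntryName(bytes: Uint8Array, flags: number, fallback: string): string {
  const encoding = (flags & UTF8_FLAG) !== 0 ? 'utf-8' : fallback;
  return new TextDecoder(encoding).decode(bytes);
}

// "Привет" as cp866 bytes, entry not flagged as UTF-8.
const cp866Name = new Uint8Array([0x8f, 0xe0, 0xa8, 0xa2, 0xa5, 0xe2]);
console.log(decodeEntryName(cp866Name, 0, 'cp866'));

// The same name as UTF-8 bytes, entry flagged as UTF-8: the fallback is ignored.
const utf8Name = new TextEncoder().encode('Привет');
console.log(decodeEntryName(utf8Name, UTF8_FLAG, 'cp866'));
```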

(How) is this done by other libraries?
jszip also fails to decode it correctly.

There is unzip -O cp866 in Ubuntu starting from some version. Before that version, I believe they had a hack that used cp866 automatically when it saw a Russian locale in the OS.
A browser equivalent for that hack would be navigator.language == 'ru-RU' if you are willing to use that approach.
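
A rough sketch of what such a locale-based guess could look like. The function name and the lookup table are illustrative assumptions, not something fflate or unzip ships. Matching on the primary language subtag also covers a bare `ru`, which a strict comparison with `'ru-RU'` would miss:

```typescript
// Illustrative table only: primary language subtag -> likely legacy ZIP encoding.
const localeEncodings = new Map<string, string>([
  ['ru', 'cp866'],
]);

// Hypothetical helper: returns undefined when there is no guess for the locale.
function guessFallbackEncoding(locale: string): string | undefined {
  const primary = locale.toLowerCase().split('-')[0] ?? '';
  return localeEncodings.get(primary);
}

// In a browser this would be called as guessFallbackEncoding(navigator.language).
console.log(guessFallbackEncoding('ru-RU'));
```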

mixtur · 2021-12-14T16:33:33Z

There is also this craziness https://unix.stackexchange.com/a/364344

101arrowz · 2021-12-14T17:48:18Z

Interesting, I've never had to deal with locales so this could be a challenge, but I'll try to implement this when I get the chance. Since it's a larger change it could take a while.

mixtur · 2021-12-17T15:13:49Z

https://github.com/vlm/zip-fix-filename-encoding/blob/master/src/runzip.c this might help a little bit. They are trying to guess an encoding by character frequencies there.
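
To illustrate the general idea of guessing by content, here is a much cruder heuristic than real character-frequency analysis, and not a port of the linked C program: decode the bytes with each candidate encoding and keep the one that yields the most letters from the basic Russian alphabet. Both function names are hypothetical:

```typescript
// Count characters in the basic Russian alphabet range (U+0410..U+044F).
function russianLetterCount(text: string): number {
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0x410 && code <= 0x44f) count++;
  }
  return count;
}

// Try each candidate encoding and keep the one whose decoding looks most Russian.
// Returns an empty string when no candidates are given.
function guessEncoding(bytes: Uint8Array, candidates: string[]): string {
  let best = '';
  let bestScore = -1;
  for (const encoding of candidates) {
    const score = russianLetterCount(new TextDecoder(encoding).decode(bytes));
    if (score > bestScore) {
      best = encoding;
      bestScore = score;
    }
  }
  return best;
}

// "Привет" as cp866 bytes; windows-1251 decodes the same bytes to fewer Russian letters.
const sample = new Uint8Array([0x8f, 0xe0, 0xa8, 0xa2, 0xa5, 0xe2]);
console.log(guessEncoding(sample, ['windows-1251', 'cp866']));
```

Short names give such a heuristic very little to work with, so it can easily guess wrong.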

Also there are some test files that might be useful https://github.com/Stuk/jszip/tree/master/test/ref
In particular, local_encoding_in_name.zip has Russian filenames inside. I think it is encoded with cp866, according to the jszip tests.

I was probably wrong about jszip in the first comment. Apparently they handle it somehow (or at least they have tests for it), yet for some of my files jszip produces wrong file names. And I definitely have cp866.

101arrowz · 2022-01-20T03:41:21Z

I've found that Unicode filenames in general are annoying to deal with. You need to set { os: 3 } (Unix) for it to work with the default zip included in many Linux distributions. But then you also need to set permissions with { attrs: 0o644 << 16 } because attrs defaults to 0 (which is fine for DOS, os = 0). I'll need extra time and/or help investigating an ergonomic solution for all these issues without bloating bundle size.
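
A small sketch of the options described above. The `os` and `attrs` names and values come from this comment; `unixAttrs` is a hypothetical helper added only to make the bit layout explicit (Unix permission bits sit in the upper 16 bits of the external file attributes):

```typescript
// Shift a Unix mode into the upper 16 bits of the ZIP external file attributes.
function unixAttrs(mode: number): number {
  return (mode << 16) >>> 0; // >>> 0 keeps the result an unsigned 32-bit value
}

// os 3 marks the entry as created on Unix; 0o644 is rw-r--r--.
const unixFileOptions = { os: 3, attrs: unixAttrs(0o644) };
console.log(unixFileOptions.attrs === 0o644 << 16);
```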

costacosta · 2022-01-26T06:49:06Z

JSZip punts the issue to the end user. https://github.com/Stuk/jszip/blob/master/documentation/api_jszip/load_async.md#decodefilename-option

101arrowz · 2022-01-26T07:08:16Z

That actually seems like a decent solution. I'm assuming no sane person would use cp866 for creating new ZIP files, so just decode support might work OK with a similar setting in UnzipOptions. I'll look into it.
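
For reference, a callback in the style of JSZip's `decodeFileName` takes the raw name bytes and returns a string. A minimal sketch of such a callback for cp866, assuming the bytes arrive as a `Uint8Array`; the `DecodeFileName` type name is made up here, and an equivalent fflate setting is only a proposal at this point:

```typescript
// Shape of a user-supplied name decoder: raw name bytes in, string out.
type DecodeFileName = (bytes: Uint8Array) => string;

// Decoder for archives whose names are known to be cp866.
const decodeCp866: DecodeFileName = (bytes) => new TextDecoder('cp866').decode(bytes);

// With JSZip this would be passed as: JSZip.loadAsync(data, { decodeFileName: decodeCp866 })
console.log(decodeCp866(new Uint8Array([0x8f, 0xe0, 0xa8, 0xa2, 0xa5, 0xe2])));
```

The library stays small this way, since the choice of encoding and any decoding tables remain in user code.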

mixtur · 2022-01-26T08:38:44Z

I think it is tempting for archiver authors to use one-byte encodings to save a few more bytes. So the problem is not going away any time soon.

But yeah, putting the problem on the user is fine by me too.

mixtur · 2023-11-13T14:04:49Z

👀

101arrowz · 2023-11-24T00:51:02Z

I still want to fix this issue but I can't promise I'll get to it in v0.8.2. I will give it another honest attempt though.

ShenHongFei · 2023-11-24T02:24:01Z

I tried changing it like this and it solved the problem.

usage:

kichikuou · 2024-04-11T12:17:25Z

I'd like to clarify that this is not just a problem for Russian archives.

Even in Windows 11, Explorer's built-in "Compress to ZIP file" creates non-UTF-8 archives, i.e., ZIPs with filenames encoded in the current Windows code page. As far as I know, there are plenty of archives around with non-UTF-8 Japanese filenames, and the situation is probably similar for other languages.

Punting the issue to the user may be the only realistic solution. Below is a screenshot of The Unarchiver on Mac asking the user for the encoding of the filename, and there are so many possible encodings. There are five different encodings for Japanese alone (four of which are slightly different variants of the Shift_JIS encoding). I think it is beyond the scope of this library to deal with this complexity, or to try to do anything clever to guess the encoding.

hyvyys · 2024-07-17T13:30:50Z

Does this issue only occur with non-UTF-8 archives? I'm having this issue with a zip created in macOS:

Curaçao.txt.zip

Curaçao.txt

becomes

CuracÌ§ao.txt

And ChatGPT has me believe that macOS uses UTF-8 for filenames inside zips.
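
A plausible explanation, not confirmed in this thread: the name is stored as UTF-8 in decomposed form ("c" followed by the combining cedilla U+0327), the entry is not flagged as UTF-8, and the reader therefore turns each byte into one character. The sketch below reproduces the mangled name under that assumption and reverses it:

```typescript
// Decomposed form: "ç" (U+00E7) becomes "c" + U+0327 (combining cedilla).
const decomposed = 'Cura\u00e7ao.txt'.normalize('NFD');
const utf8Bytes = new TextEncoder().encode(decomposed);

// Reading the UTF-8 bytes one byte per character (Latin-1) gives the mangled name.
const mangled = Array.from(utf8Bytes, (b) => String.fromCharCode(b)).join('');
console.log(mangled);

// Reversing the Latin-1 step, decoding as UTF-8, then composing restores the name.
const restoredBytes = new Uint8Array(mangled.length);
for (let i = 0; i < mangled.length; i++) {
  restoredBytes[i] = mangled.charCodeAt(i);
}
const restored = new TextDecoder('utf-8').decode(restoredBytes).normalize('NFC');
console.log(restored);
```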

mixtur changed the title ~~unzip Russian files~~ unzip non-UTF-8 archives Apr 11, 2024
