Folder Names with Umlaut are not Supported #79

vanthome · 2014-01-13T10:38:32Z

With this code, the folder path in the generated archive ends up as ASCII control chars:

var zip = new JSZip();
zip.file("Hello.txt", "Hello World\n");
var img = zip.folder("öäü");
img.file("smile.gif", imgData, {base64: true});
var content = zip.generate();
location.href="data:application/zip;base64,"+content;


dduponchel · 2014-01-14T00:14:29Z

I reproduced the issue with unzip on Linux and, after diving into the source code of unzip, I think I understand what's happening. JSZip sets the flag saying "the path is in utf8 !" but doesn't add any extra field, and unzip expects an extra field. I will re-read "APPENDIX D - Language Encoding (EFS)" of the zip specs tomorrow, and hopefully fix this bug :)
To be sure that I fix your bug: do you use unzip or another archive manager?

vanthome · 2014-01-14T05:44:17Z

Ok, great. I use Ark on KDE to open the Archive.

dduponchel · 2014-01-18T19:06:26Z

I spent some time in the source code of unzip (v6.0) to understand what is really going on, and my first conclusion was (almost) wrong :-) This will be long and technical, but I need to share this madness with the world. Feel free to skip to the tl;dr.

On Linux

... with unzip :

in fileio.c:2243, in the "translate the Zip entry filename" part, there is a call to the macro Ext_ASCII_TO_Native
in unzpriv.h:3005, the macro uses the version/platform to convert the file name
in fileio.c:2306, when a unicode path extra field is present, the filename is overwritten with its unicode version

The comment block on this macro is really helpful :

/* Convert filename (and file comment string) into "internal" charset.
 * This macro assumes that Zip entry filenames are coded in OEM (IBM DOS)
 * codepage when made on
 *  -> DOS (this includes 16-bit Windows 3.1)  (FS_FAT_)
 *  -> OS/2                                    (FS_HPFS_)
 *  -> Win95/WinNT with Nico Mak's WinZip      (FS_NTFS_ && hostver == "5.0")
 * EXCEPTIONS:
 *  PKZIP for Windows 2.5, 2.6, and 4.0 flag their entries as "FS_FAT_", but
 *  the filename stored in the local header is coded in Windows ANSI (CP 1252
 *  resp. ISO 8859-1 on US and western Europe locale settings).
 *  Likewise, PKZIP for UNIX 2.51 flags its entries as "FS_FAT_", but the
 *  filenames stored in BOTH the local and the central header are coded
 *  in the local system's codepage (usually ANSI codings like ISO 8859-1).
 *
 * All other ports are assumed to code zip entry filenames in ISO 8859-1.
 */

JSZip generates zip files with the DOS flag, so unzip expects the IBM 437 code page. The UTF8 flag (G.pInfo->GPFIsUTF8) doesn't seem to be used to build the final file name unless the unicode path extra field is also present.
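To make the effect concrete, here is a minimal sketch (not code from unzip or JSZip) of what happens when the UTF-8 bytes of a name are decoded as IBM 437. The helper names `decodeCp437` and `misreadUtf8AsCp437` are made up for this illustration, and the sketch assumes a runtime with a global `TextEncoder` (recent Node.js or a browser).

```javascript
// Upper half (0x80-0xFF) of IBM code page 437, one character per byte value.
const CP437_HIGH =
  "ÇüéâäàåçêëèïîìÄÅ" +
  "ÉæÆôöòûùÿÖÜ¢£¥₧ƒ" +
  "áíóúñÑªº¿⌐¬½¼¡«»" +
  "░▒▓│┤╡╢╖╕╣║╗╝╜╛┐" +
  "└┴┬├─┼╞╟╚╔╩╦╠═╬╧" +
  "╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀" +
  "αßΓπΣσµτΦΘΩδ∞φε∩" +
  "≡±≥≤⌠⌡÷≈°∙·√ⁿ²■\u00a0";

// Decode raw bytes the way a reader that assumes CP437 would.
function decodeCp437(bytes) {
  let out = "";
  for (const b of bytes) {
    out += b < 0x80 ? String.fromCharCode(b) : CP437_HIGH[b - 0x80];
  }
  return out;
}

// Hypothetical helper: a UTF-8 encoded name as seen through CP437.
function misreadUtf8AsCp437(name) {
  return decodeCp437(new TextEncoder().encode(name));
}

console.log(misreadUtf8AsCp437("öäü"));         // ├╢├ñ├╝
console.log(misreadUtf8AsCp437("I ♥ you.txt")); // I ΓÖÑ you.txt
```

Each non-ASCII character becomes two or three unrelated glyphs, which matches the garbage paths seen after extraction.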

I suspect that other archive managers make the same guesses about the encoding, so adding the unicode path extra field is the easy way to be sure that the path is read correctly.
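For reference, a rough sketch of how such a field can be built, following the layout the zip specification gives for the Info-ZIP Unicode Path Extra Field (header ID 0x7075, data size, version 1, CRC-32 of the name stored in the header, then the UTF-8 name). This is not the actual JSZip patch; `buildUnicodePathExtraField` and `crc32` are names invented for the example, and a global `TextEncoder` is assumed.

```javascript
// CRC-32 lookup table (reflected polynomial 0xEDB88320, the one zip uses).
const CRC_TABLE = new Uint32Array(256).map((_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  return c >>> 0;
});

function crc32(bytes) {
  let crc = 0xffffffff;
  for (const b of bytes) crc = CRC_TABLE[(crc ^ b) & 0xff] ^ (crc >>> 8);
  return (crc ^ 0xffffffff) >>> 0;
}

// headerNameBytes: the file name bytes exactly as written in the zip header.
// unicodeName: the real path, stored as UTF-8 inside the extra field.
function buildUnicodePathExtraField(headerNameBytes, unicodeName) {
  const utf8Name = new TextEncoder().encode(unicodeName);
  const dataSize = 1 + 4 + utf8Name.length; // version + NameCRC32 + UnicodeName
  const field = new Uint8Array(4 + dataSize);
  const view = new DataView(field.buffer);
  view.setUint16(0, 0x7075, true);                 // header ID "up", little-endian
  view.setUint16(2, dataSize, true);               // size of the data that follows
  view.setUint8(4, 1);                             // version of this extra field
  view.setUint32(5, crc32(headerNameBytes), true); // CRC-32 of the header name
  field.set(utf8Name, 9);                          // the UTF-8 path itself
  return field;
}

// Example: the folder from the report, with the same UTF-8 bytes in the header.
const name = "öäü/";
const headerBytes = new TextEncoder().encode(name);
const extra = buildUnicodePathExtraField(headerBytes, name);
console.log(extra.length); // 4 + 1 + 4 + 7 = 16 bytes
```

The CRC lets a reader detect that the header name was changed by a tool that did not update the extra field, in which case the unicode name should be ignored.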

If we still have issues with other managers, we will change the "version made by" field to UNIX, for example (but without any extra info, unzip will break the file permissions, so this needs some development and testing). Changing it to NTFS doesn't seem to be a good idea: Windows itself, winrar, winzip, etc. use DOS (more on that below), and the specification says "10 - Windows NTFS" while the unzip code says #define FS_NTFS_ 11...
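As a reading aid, here is a small sketch of how the two header fields discussed here can be interpreted. It only relies on what the zip specification states: the upper byte of "version made by" is the host system (0 for MS-DOS/FAT, 3 for UNIX, 10 for Windows NTFS), the lower byte is the spec version times ten, and bit 11 of the general purpose flags is the UTF-8 (EFS) flag. The function name `describeEntry` and the sample values are assumptions for illustration, not values taken from JSZip.

```javascript
// Host system codes from the zip specification (only the ones discussed here).
const HOST_SYSTEMS = { 0: "MS-DOS / FAT", 3: "UNIX", 10: "Windows NTFS" };
const UTF8_FLAG = 0x0800; // general purpose bit 11: name and comment are UTF-8

// versionMadeBy and generalPurposeFlags are the 16-bit values from the header,
// already read as little-endian integers.
function describeEntry(versionMadeBy, generalPurposeFlags) {
  const host = versionMadeBy >> 8;
  const spec = versionMadeBy & 0xff;
  return {
    host: HOST_SYSTEMS[host] || "other (" + host + ")",
    specVersion: Math.floor(spec / 10) + "." + (spec % 10),
    utf8Flag: (generalPurposeFlags & UTF8_FLAG) !== 0,
  };
}

// Illustrative values: a DOS-flagged entry with the UTF-8 bit set,
// then a UNIX-flagged entry without it.
console.log(describeEntry(0x0014, 0x0800));
console.log(describeEntry(0x0314, 0x0000));
```

A reader like unzip picks its filename conversion from the host byte, which is why an entry flagged as DOS gets treated as IBM 437 regardless of the UTF-8 bit.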

On Windows

I also tested on Windows (Windows 7) with the default compressed folders feature. Windows generated a zip file flagged as DOS, using the IBM 437 code page. Of course, I was on an NTFS partition with unicode filenames. If I use characters that are not in this code page, I get a nice:

'C:\♥.txt' cannot be compressed because it includes characters that cannot be used in a compressed folder, such as ♥. You should rename this file or directory.

This post on superuser.com gives a well-explained answer (the links in this post are broken but you can find them on archive.org, here or here for example).

This also means that without a <local code page used in Windows> to utf8 converter, JSZip won't correctly read zip files generated by the default Windows compressed folders feature if they contain non-ascii file names. And <local code page used in Windows> seems to range from the IBM 437 code page to the Japanese Shift-JIS code page.

We could/should go the same way as winrar when generating a zip file: when the path contains non-ascii characters, replace them with _ in the header but set the correct path in the extra field. I will test it on several archive managers, but that would be a nice fallback on Windows: on non-compatible managers, "I ♥ you.txt" would be displayed as "I _ you.txt" instead of "I GÖÑ you.txt".
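The fallback idea can be sketched in a few lines. This is only an illustration of the approach described above, not what JSZip ships; `asciiFallback` and `needsUnicodeExtraField` are hypothetical names.

```javascript
// Replace every non-ASCII code point with "_" for the name stored in the header.
// The "u" flag makes the regex work on code points, so a character outside the
// BMP (an emoji, for example) becomes a single "_" rather than two.
function asciiFallback(path) {
  return path.replace(/[^\x00-\x7F]/gu, "_");
}

// The real path only needs to go in the extra field when something was replaced.
function needsUnicodeExtraField(path) {
  return asciiFallback(path) !== path;
}

console.log(asciiFallback("I ♥ you.txt"));          // I _ you.txt
console.log(asciiFallback("öäü/smile.gif"));        // ___/smile.gif
console.log(needsUnicodeExtraField("Hello.txt"));   // false
```

One thing to keep in mind with this approach: two different paths can collapse to the same fallback name ("ä.txt" and "ö.txt" both become "_.txt"), so managers that ignore the extra field would see duplicate entries.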

TL;DR

If you wish to use unicode in file names:

On Windows, please install (or get your users to install) an external archive manager; the default one is a mess. I will test _ as a fallback for unicode characters for the worst-case scenario.
On Linux, the fix is coming :)

This patch sets the unicode path extra field. unzip needs at least one extra field to correctly handle a unicode path, so using the path is as good as any other information. This could improve the situation with other archive managers too. This field is usually used without the utf8 flag, with a non-unicode path in the header (winrar, winzip). That helps (a bit) with the messy default compressed folders feature of Windows, but breaks on p7zip, which doesn't look for the unicode path extra field. So for now, UTF-8 everywhere! Fix Stuk#79.

vanthome · 2014-01-20T09:40:15Z

Ok, great, MANY thanks. I hope for a release soon...

dduponchel mentioned this issue Jan 19, 2014

Add support for the unicode path extra field #82

Merged

dduponchel closed this as completed in #82 Feb 1, 2014
