
Files double unpacked due to different UTF-8 normalizations #1633

Open
Safihre opened this issue Oct 9, 2020 · 5 comments

Safihre (Member) commented Oct 9, 2020

For a few days, the macOS tests on Travis have been failing: for some reason, test_download_unicode_made_on_windows was unpacking the resulting file twice.
After much debugging, I found out that the string frènch_german_demö printed in the logs is actually represented in two different ways. This caused the unpacker not to detect that a set was already unpacked, basically because:

>>> "frènch_german_demö" == "frènch_german_demö"
False

The reason is that they are obtained from two different sources:
Output of os.listdir:

b'fre\xcc\x80nch_german_demo\xcc\x88'

Output of the par2 files:

b'fr\xc3\xa8nch_german_dem\xc3\xb6'

https://stackoverflow.com/a/26733055
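For reference, the two byte strings above really are the same name in two different Unicode normalization forms. A quick check with Python's unicodedata module confirms it (is_normalized requires Python 3.8+):

```python
import unicodedata

# The two byte sequences from above: os.listdir output vs. par2 output
nfd = b'fre\xcc\x80nch_german_demo\xcc\x88'.decode("utf-8")  # decomposed (NFD)
nfc = b'fr\xc3\xa8nch_german_dem\xc3\xb6'.decode("utf-8")    # precomposed (NFC)

print(nfd == nfc)                              # False: different code points
print(unicodedata.is_normalized("NFD", nfd))   # True
print(unicodedata.is_normalized("NFC", nfc))   # True
```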

Now I just need to find a way to fix this...

puzzledsab (Contributor) commented

I'm not sure where to apply it, so I haven't tested, but maybe this will help:

#!/usr/bin/python

import unicodedata as ud

# The same name in two byte encodings: decomposed (NFD) and precomposed (NFC)
str1 = b'fre\xcc\x80nch_german_demo\xcc\x88'
str2 = b'fr\xc3\xa8nch_german_dem\xc3\xb6'

str1u = str1.decode("utf-8")
str2u = str2.decode("utf-8")

print(str1 == str2)
print(str1u == str2u)
print(ud.normalize('NFC', str1u) == str2u)
print(str1u == ud.normalize('NFC', str2u))
print(ud.normalize('NFC', str1u) == ud.normalize('NFC', str2u))

False
False
True
False
True

Safihre (Member, Author) commented Feb 15, 2021

I did try that: I made a Unicode-friendly os.listdir wrapper, but it didn't work.
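(The wrapper itself isn't shown in the thread; a minimal sketch of the idea, with the hypothetical name listdir_nfc, might look like this:)

```python
import os
import unicodedata

def listdir_nfc(path):
    """Hypothetical os.listdir wrapper: normalize entries to NFC so that
    names read from the filesystem (NFD on macOS) compare equal to names
    parsed from par2 files (typically NFC)."""
    return [unicodedata.normalize("NFC", name) for name in os.listdir(path)]
```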

sanderjo (Contributor) commented Feb 15, 2021

And how about this: hard-encode to pure ASCII?

>>> ud.normalize('NFKD', str1u).encode('ascii','ignore')
b'french_german_demo'

>>> ud.normalize('NFKD', str2u).encode('ascii','ignore')
b'french_german_demo'


>>> ud.normalize('NFKD', str1u).encode('ascii','ignore') == ud.normalize('NFKD', str2u).encode('ascii','ignore')
True

>>> ud.normalize('NFC', str1u).encode('ascii','ignore') == ud.normalize('NFC', str2u).encode('ascii','ignore')
True

Not nice if there are Unicode chars without an ASCII equivalent, but maybe acceptable?

>>> ud.normalize('NFKD', "blabla 你好,世界 tadada").encode('ascii','ignore')
b'blabla , tadada'

Safihre (Member, Author) commented Feb 15, 2021

@sanderjo Indeed that fails for the Chinese-download test.
So since this problem is so rare, I haven't given it much priority. The result is just a double unpack, which isn't that bad.

Actually, I did work on a solution where we don't rely so much on reading the output of unrar, but instead rely more on the output of the rarfile module: https://github.com/sabnzbd/sabnzbd/tree/bugfix/handle_du
Will see if I can merge this for 3.3.0.

BrianMSheldon commented

From the Unicode Consortium Normalization FAQ:

Q: Which forms of normalization should I support?

A: The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns (see UTR #36). NFD and NFKD are most useful for internal processing.

I personally use NFKD for internal comparisons and NFKC for output.
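Following that advice, a normalization-insensitive comparison for internal use could look like this (same_name is a hypothetical helper, not SABnzbd code):

```python
import unicodedata

def same_name(a, b):
    # Compare two strings regardless of their Unicode normalization form,
    # using NFKD for internal comparison as suggested above.
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

print(same_name("fre\u0300nch_german_demo\u0308", "fr\u00e8nch_german_dem\u00f6"))  # True
```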
