
Files double unpacked due to different UTF-8 normalizations #1633

Open
Safihre opened this issue Oct 9, 2020 · 5 comments

Safihre (Member) commented Oct 9, 2020

For a few days, the macOS tests on Travis have been failing: for some reason, test_download_unicode_made_on_windows was unpacking the resulting file twice.
After much debugging, I found out that the string frènch_german_demö printed in the logs is actually represented in two different ways. This caused the unpacker not to detect that a set was already unpacked, basically because:

>>> "frènch_german_demö" == "frènch_german_demö"
False

The reason is that they are obtained from two different sources:
Output of os.listdir:

b'fre\xcc\x80nch_german_demo\xcc\x88'

Output of the par2 files:

b'fr\xc3\xa8nch_german_dem\xc3\xb6'

https://stackoverflow.com/a/26733055
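For reference, the two byte strings above really are the same name in two different Unicode normalization forms. A quick check with Python's unicodedata module confirms it (is_normalized requires Python 3.8+):

```python
import unicodedata

# The two byte sequences from above: os.listdir output vs. par2 output
nfd = b'fre\xcc\x80nch_german_demo\xcc\x88'.decode("utf-8")  # decomposed (NFD)
nfc = b'fr\xc3\xa8nch_german_dem\xc3\xb6'.decode("utf-8")    # precomposed (NFC)

print(nfd == nfc)                              # False: different code points
print(unicodedata.is_normalized("NFD", nfd))   # True
print(unicodedata.is_normalized("NFC", nfc))   # True
```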

Now I just need to find a way to fix this...

puzzledsab (Contributor) commented

I'm not sure where to apply it, so I haven't tested, but maybe this will help:

#!/usr/bin/python

import unicodedata as ud

# The same name in two byte encodings: decomposed (NFD) and precomposed (NFC)
str1 = b'fre\xcc\x80nch_german_demo\xcc\x88'
str2 = b'fr\xc3\xa8nch_german_dem\xc3\xb6'

str1u = str1.decode("utf-8")
str2u = str2.decode("utf-8")

print(str1 == str2)
print(str1u == str2u)
print(ud.normalize('NFC', str1u) == str2u)
print(str1u == ud.normalize('NFC', str2u))
print(ud.normalize('NFC', str1u) == ud.normalize('NFC', str2u))

False
False
True
False
True

Safihre (Member, Author) commented Feb 15, 2021

I did try that: I made a Unicode-friendly os.listdir wrapper, but it didn't work.
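(The wrapper itself isn't shown in the thread; a minimal sketch of the idea, with the hypothetical name listdir_nfc, might look like this:)

```python
import os
import unicodedata

def listdir_nfc(path):
    """Hypothetical os.listdir wrapper: normalize entries to NFC so that
    names read from the filesystem (NFD on macOS) compare equal to names
    parsed from par2 files (typically NFC)."""
    return [unicodedata.normalize("NFC", name) for name in os.listdir(path)]
```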

sanderjo (Contributor) commented Feb 15, 2021

And how about this: hard-encode to pure ASCII?

>>> ud.normalize('NFKD', str1u).encode('ascii','ignore')
b'french_german_demo'

>>> ud.normalize('NFKD', str2u).encode('ascii','ignore')
b'french_german_demo'


>>> ud.normalize('NFKD', str1u).encode('ascii','ignore') == ud.normalize('NFKD', str2u).encode('ascii','ignore')
True

>>> ud.normalize('NFC', str1u).encode('ascii','ignore') == ud.normalize('NFC', str2u).encode('ascii','ignore')
True

Not nice if there are Unicode chars without an ASCII equivalent, but maybe acceptable?

>>> ud.normalize('NFKD', "blabla 你好,世界 tadada").encode('ascii','ignore')
b'blabla , tadada'

Safihre (Member, Author) commented Feb 15, 2021

@sanderjo Indeed that fails for the Chinese-download test.
So since this problem is so rare, I haven't given it much priority. The result is just a double unpack, which isn't that bad.

Actually, I did work on a solution where we don't rely so much on reading the output of unrar, but instead rely more on the output of the rarfile module: https://github.com/sabnzbd/sabnzbd/tree/bugfix/handle_du
Will see if I can merge this for 3.3.0.

BrianMSheldon commented

From the Unicode Consortium Normalization FAQ:

Q: Which forms of normalization should I support?

A: The choice of which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns (see UTR #36). NFD and NFKD are most useful for internal processing.

I personally use NFKD for internal comparisons and NFKC for output.
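Following that advice, a normalization-insensitive comparison for internal use could look like this (same_name is a hypothetical helper, not SABnzbd code):

```python
import unicodedata

def same_name(a, b):
    # Compare two strings regardless of their Unicode normalization form,
    # using NFKD for internal comparison as suggested above.
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

print(same_name("fre\u0300nch_german_demo\u0308", "fr\u00e8nch_german_dem\u00f6"))  # True
```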
