-
Notifications
You must be signed in to change notification settings - Fork 5
Normalize filesystemencoding codec name #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This allows to avoid weird cases where the codec may not be properly recognized. Link: aboutcode-org/scancode-toolkit#688 Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
|
@pjdelport Would you mind to review and merge this and publish a new Pypi version? Alternatively I can use my own advanced and patched build in ScanCode ... but it would be nice if this is part of the upstream code. Thanks!
|
|
Many thanks @pombredanne for working on this! |
|
Thanks for contributing this! Are there any potential ill or compatibility-breaking side effects to normalising the codec name like this? |
|
Thanks for contributing this! Are there any potential ill or compatibility-breaking side effects to normalising the codec name earlier like this? |
|
@pjdelport you wrote:
Actually IMHO to the contrary. @Haypo original Python code (from which the future surrogatescape codec you depend on is derived and AFAIK from which the early fsencode/fsdecode comes from) vstinner/misc@a5f90a0#diff-b500f48c9778753dc97ebc453352f351R127 ensures that the encoding is normalized early. And the bug is really all about the lack of such early normalization when there are some weird unset or non-normalized FS encodings... |
|
@pjdelport I would go even as far as saying there could be a bug in Python 3.x that is lurking in fsencode/decode in this case... but I did not test this on Python 3 so this is all speculative. ;) |
|
@pombredanne: Okay, I've gone through the discussion and context in aboutcode-org/scancode-toolkit#688 and aboutcode-org/scancode-toolkit#752, and in particular the tracebacks like: However, as far as I can tell, the If the above is the issue, I'm not sure how the codec name normalisation in this PR would help: it should have the same result either way. Am I understanding the problem right, or is there something else to it? |
|
@pjdelport Thank you for diving in these tickets of ours! You have a good point ... yet, I need to dig a bit more: How would Python 3.6 deal with this? when the fsencoding is 'ascii'? it would still need to be able to fsencode back to ascii something which is the same that we are trying to do here. With backports.os on Python2 I get this: So you are right that the patch in this PR does not fix the issue at all! Yet, the backport code does not handle the critical case of always being able to fsencode/fsdecode from bytes or unicode that Python3 handles nicely. |
|
So it looks like the 3.6 behavior might be to fall back to |
|
So my recap:
Does this make sense to you? If so I can submit a new patch. |
|
Closing this in favor of #7 |
This allows to avoid weird cases where the codec may not be properly
recognized.
Link: aboutcode-org/scancode-toolkit#688
Signed-off-by: Philippe Ombredanne pombredanne@nexb.com