@pombredanne

This avoids odd cases where the codec may not be properly recognized.
Link: aboutcode-org/scancode-toolkit#688

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

@pombredanne
Author

pombredanne commented Sep 9, 2017

@pjdelport Would you mind reviewing and merging this, then publishing a new PyPI version? Alternatively, I can use my own patched build in ScanCode, but it would be nice to have this in the upstream code. Thanks!
I confirmed that this fixes the case (aboutcode-org/scancode-toolkit#688) reported by @sschuberth, where no proper locale may be defined on Linux and sys.getfilesystemencoding() returns 'ANSI_X3.4-1968'.

'ANSI_X3.4-1968' is really ascii once normalized but makes the encode/decode choke:

>>> import codecs
>>> c = codecs.lookup('ANSI_X3.4-1968')
>>> c.name
'ascii'
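
A minimal sketch of that normalization, assuming Python 3 and a hypothetical helper named `normalized_codec_name` (it is not part of backports.os). `codecs.lookup` resolves aliases, so the name reported by the system collapses to the canonical codec name:

```python
import codecs


def normalized_codec_name(encoding):
    """Hypothetical helper: return the canonical codec name for an alias."""
    return codecs.lookup(encoding).name


print(normalized_codec_name('ANSI_X3.4-1968'))  # ascii
print(normalized_codec_name('UTF8'))            # utf-8
```

An unknown name makes `codecs.lookup` raise `LookupError`, so a caller would probably want to handle that case explicitly.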

@sschuberth

Many thanks @pombredanne for working on this!

@PiDelport
Owner

Thanks for contributing this!

Are there any potential ill or compatibility-breaking side effects to normalising the codec name earlier like this?

@pombredanne
Author

@pjdelport you wrote:

Are there any potential ill or compatibility-breaking side effects to normalising the codec name earlier like this?

Actually, IMHO, quite the contrary. @Haypo's original Python code (from which the future surrogateescape codec you depend on is derived and, AFAIK, from which the early fsencode/fsdecode come), vstinner/misc@a5f90a0#diff-b500f48c9778753dc97ebc453352f351R127, ensures that the encoding is normalized early. The bug is really all about the lack of such early normalization when the FS encoding is unset or not normalized.

@pombredanne
Author

@pjdelport I would even go as far as saying there could be a bug lurking in Python 3.x's fsencode/fsdecode in this case, but I did not test this on Python 3, so this is all speculative. ;)

@PiDelport
Owner

@pombredanne: Okay, I've gone through the discussion and context in aboutcode-org/scancode-toolkit#688 and aboutcode-org/scancode-toolkit#752, and in particular the tracebacks like:

Traceback (most recent call last):
  …
  File "…/python2.7/site-packages/backports/os.py", line 133, in fsencode
    return filename.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character u'\udce2' in position 19: ordinal not in range(128)

However, as far as I can tell, the backports.os code is behaving as expected here: the input path is non-ASCII-compatible Unicode, while the desired FS encoding is set to ASCII, so it has to raise UnicodeEncodeError like this. The fix should be to change the environment to use a Unicode-compatible encoding.

If the above is the issue, I'm not sure how the codec name normalisation in this PR would help: it should have the same result either way.

Am I understanding the problem right, or is there something else to it?
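
The point about normalisation making no difference can be illustrated with a short sketch. It assumes Python 3 and the default strict error handler (the traceback above is from Python 2), and `strict_encode_error` is a hypothetical helper. Both spellings resolve to the same codec, so both fail the same way on a lone surrogate:

```python
# Lone surrogates, as produced when decoding non-ASCII bytes with surrogateescape.
text = 'snow \udce2\udc98\udc83'


def strict_encode_error(encoding):
    """Hypothetical helper: return the error raised by a strict encode, or None."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc
    return None


for name in ('ANSI_X3.4-1968', 'ascii'):
    err = strict_encode_error(name)
    # The error reports the canonical codec name whichever alias was requested.
    print(name, '->', type(err).__name__, err.encoding)
```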

@pombredanne
Author

@pjdelport Thank you for diving into these tickets of ours! You have a good point, yet I need to dig a bit more: how would Python 3.6 deal with this when the FS encoding is 'ascii'? It would still need to be able to fsencode something back to ASCII, which is the same thing we are trying to do here.
For example:

$ pyenv local 3.6.1
$ export LANGUAGE=
$ export LANG=
$ locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
$ python
Python 3.6.1 (default, Apr 19 2017, 16:20:51) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding()
'ascii'
>>> import os
>>> bfile = b'snow \xe2\x98\x83'
>>> ufile = u'snow \udce2\udc98\udc83'
>>> os.fsdecode(os.fsencode(bfile))
'snow \udce2\udc98\udc83'
>>> os.fsdecode(os.fsencode(ufile))
'snow \udce2\udc98\udc83'
>>> os.fsencode(os.fsdecode(ufile))
b'snow \xe2\x98\x83'
>>> os.fsencode(os.fsdecode(bfile))
b'snow \xe2\x98\x83'

With backports.os on Python 2, I get this:

>>> import sys
>>> sys.getfilesystemencoding()
'ANSI_X3.4-1968'
>>> import os
>>> import backports.os as osb
>>> bfile = b'snow \xe2\x98\x83'
>>> ufile = u'snow \udce2\udc98\udc83'
>>> osb.fsdecode(osb.fsencode(bfile))
u'snow \udce2\udc98\udc83'
>>> osb.fsdecode(osb.fsencode(ufile))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pombreda/w421/scancode-toolkit-master/local/lib/python2.7/site-packages/backports/os.py", line 133, in fsencode
    return filename.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character u'\udce2' in position 5: ordinal not in range(128)
>>> osb.fsencode(osb.fsdecode(ufile))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pombreda/w421/scancode-toolkit-master/local/lib/python2.7/site-packages/backports/os.py", line 133, in fsencode
    return filename.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character u'\udce2' in position 5: ordinal not in range(128)
>>> osb.fsencode(osb.fsdecode(bfile))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pombreda/w421/scancode-toolkit-master/local/lib/python2.7/site-packages/backports/os.py", line 133, in fsencode
    return filename.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character u'\udce2' in position 5: ordinal not in range(128)

So you are right that the patch in this PR does not fix the issue at all!

Yet the backport does not handle the critical case that Python 3 handles nicely: always being able to fsencode/fsdecode from either bytes or unicode.
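
For reference, Python 3 documents os.fsencode/os.fsdecode as using the surrogateescape error handler on POSIX systems. The sketch below reproduces the round trip from the 3.6 session with an explicit 'ascii' encoding, so it does not depend on the locale. `fs_decode` and `fs_encode` are hypothetical helpers, not the backports.os functions:

```python
def fs_decode(raw, encoding='ascii'):
    """Hypothetical helper: bytes -> str, undecodable bytes become lone surrogates."""
    return raw.decode(encoding, 'surrogateescape')


def fs_encode(text, encoding='ascii'):
    """Hypothetical helper: str -> bytes, lone surrogates become the original bytes."""
    return text.encode(encoding, 'surrogateescape')


bfile = b'snow \xe2\x98\x83'
ufile = fs_decode(bfile)

# ascii() keeps the output printable whatever the terminal encoding is.
print(ascii(ufile))      # 'snow \udce2\udc98\udc83'
print(fs_encode(ufile))  # b'snow \xe2\x98\x83'
```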

@pombredanne
Author

pombredanne commented Sep 18, 2017

So it looks like the 3.6 behavior might be to fall back to sys.getdefaultencoding() when sys.getfilesystemencoding() is 'ascii'. At least that is consistent with the behavior observed above.
I also made a test by forcing the encoding in backports.os to be 'utf-8' if the codecs.lookup name is 'ascii'. This forces the code to go down your _HACK_AROUND_PY2_UTF8 path, and it round-trips OK without errors.

@pombredanne
Author

So my recap:

  1. This patch here is indeed completely useless!
  2. The supposedly rare case where no locale is defined makes this backport fail, and it turns out to be not so rare when using containers/Docker.
  3. Python 3.6 os.fsencode/fsdecode handle this case of an ASCII FS encoding fine, quite likely because of a fallback to the default encoding.
  4. This backport does not, and a fix would be to pretend the FS is UTF-8 encoded when it is ascii/bytes, so that things can always be encoded and decoded.

Does this make sense to you? If so I can submit a new patch.
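
A rough sketch of the idea in point 4, written for Python 3 so it can be run as-is. `effective_fs_encoding` is a hypothetical helper and not the backports.os code; the actual fix, if any, lives in the follow-up patch:

```python
import codecs
import sys


def effective_fs_encoding(reported=None):
    """Hypothetical helper: treat an ASCII filesystem encoding as UTF-8."""
    reported = reported or sys.getfilesystemencoding()
    name = codecs.lookup(reported).name  # normalize aliases such as ANSI_X3.4-1968
    return 'utf-8' if name == 'ascii' else name


encoding = effective_fs_encoding('ANSI_X3.4-1968')
print(encoding)  # utf-8

# With surrogateescape, both valid UTF-8 and arbitrary bytes survive a round trip.
for raw in (b'snow \xe2\x98\x83', b'bad \xff byte'):
    text = raw.decode(encoding, 'surrogateescape')
    assert text.encode(encoding, 'surrogateescape') == raw
```

The trade-off is that the reported encoding is deliberately ignored when it is ASCII, which is an assumption about what callers want rather than something the platform guarantees.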

@pombredanne
Author

Closing this in favor of #7
