Normalize filesystemencoding codec name #4

pombredanne · 2017-09-08T19:32:48Z

This avoids odd cases where the codec name may not be properly
recognized.
Link: aboutcode-org/scancode-toolkit#688

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>


pombredanne · 2017-09-09T05:37:47Z

@pjdelport Would you mind reviewing and merging this, and publishing a new PyPI version? Alternatively, I can use my own patched build in ScanCode, but it would be nice if this were part of the upstream code. Thanks!
I confirmed that this fixes the case (aboutcode-org/scancode-toolkit#688) reported by @sschuberth, where no proper locale may be defined on Linux and sys.getfilesystemencoding() returns 'ANSI_X3.4-1968'.

'ANSI_X3.4-1968' is really ascii once normalized but makes the encode/decode choke:

>>> import codecs
>>> c = codecs.lookup('ANSI_X3.4-1968')
>>> c.name
'ascii'
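
As a minimal sketch of the normalization discussed here (not the actual patch in this PR), the alias lookup can be wrapped in a small helper. The name `normalize_codec_name` is hypothetical; only `codecs.lookup` and its `.name` attribute come from the standard library.

```python
import codecs


def normalize_codec_name(name):
    """Return the canonical codec name registered for an encoding alias.

    Hypothetical helper for illustration; raises LookupError for unknown names.
    """
    return codecs.lookup(name).name


# The locale-derived alias resolves to the plain ASCII codec.
print(normalize_codec_name('ANSI_X3.4-1968'))  # ascii
print(normalize_codec_name('UTF8'))            # utf-8
```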

sschuberth · 2017-09-09T05:45:23Z

Many thanks @pombredanne for working on this!

PiDelport · 2017-09-10T20:29:48Z

Thanks for contributing this!

Are there any potential ill or compatibility-breaking side effects to normalising the codec name like this?

PiDelport · 2017-09-10T20:31:56Z

Thanks for contributing this!

Are there any potential ill or compatibility-breaking side effects to normalising the codec name earlier like this?

pombredanne · 2017-09-13T04:43:30Z

@pjdelport you wrote:

Are there any potential ill or compatibility-breaking side effects to normalising the codec name earlier like this?

Actually, IMHO, quite the contrary. @Haypo's original Python code (from which the future surrogateescape codec you depend on is derived and, AFAIK, from which the early fsencode/fsdecode come) vstinner/misc@a5f90a0#diff-b500f48c9778753dc97ebc453352f351R127 ensures that the encoding is normalized early. The bug is really about the lack of such early normalization when the FS encoding is unset or not normalized.

pombredanne · 2017-09-13T04:47:15Z

@pjdelport I would go even as far as saying there could be a bug in Python 3.x that is lurking in fsencode/decode in this case... but I did not test this on Python 3 so this is all speculative. ;)

PiDelport · 2017-09-17T20:07:15Z

@pombredanne: Okay, I've gone through the discussion and context in aboutcode-org/scancode-toolkit#688 and aboutcode-org/scancode-toolkit#752, and in particular the tracebacks like:

Traceback (most recent call last):
  …
  File "…/python2.7/site-packages/backports/os.py", line 133, in fsencode
    return filename.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character u'\udce2' in position 19: ordinal not in range(128)

However, as far as I can tell, the backports.os code is behaving as expected here: the input path is non-ASCII-compatible Unicode, while the desired FS encoding is set to ASCII, so it has to raise UnicodeEncodeError like this. The fix should be to change the environment to use a Unicode-compatible encoding.

If the above is the issue, I'm not sure how the codec name normalisation in this PR would help: it should have the same result either way.

Am I understanding the problem right, or is there something else to it?
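
A small sketch of the point above, assuming nothing beyond the standard library: a strict encode of non-ASCII text fails in the same way whether the codec is named by its alias or by its canonical name. The helper `try_encode` is hypothetical and exists only for this illustration.

```python
def try_encode(text, encoding):
    """Encode text strictly; return the bytes, or the error class name on failure."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


path = u'snow \u2603'  # contains a non-ASCII character

# Both spellings resolve to the same codec, so both fail identically.
results = [try_encode(path, name) for name in ('ANSI_X3.4-1968', 'ascii')]
print(results)
```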

pombredanne · 2017-09-18T16:23:51Z

@pjdelport Thank you for diving into these tickets of ours! You have a good point, yet I need to dig a bit more: how would Python 3.6 deal with this when the FS encoding is 'ascii'? It would still need to fsencode back to ASCII, which is the same thing we are trying to do here.
For example:

$ pyenv local 3.6.1
$ export LANGUAGE=
$ export LANG=
$ locale
LANG=
LANGUAGE=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
$ python
Python 3.6.1 (default, Apr 19 2017, 16:20:51) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding()
'ascii'
>>> import os
>>> bfile = b'snow \xe2\x98\x83'
>>> ufile = u'snow \udce2\udc98\udc83'
>>> os.fsdecode(os.fsencode(bfile))
'snow \udce2\udc98\udc83'
>>> os.fsdecode(os.fsencode(ufile))
'snow \udce2\udc98\udc83'
>>> os.fsencode(os.fsdecode(ufile))
b'snow \xe2\x98\x83'
>>> os.fsencode(os.fsdecode(bfile))
b'snow \xe2\x98\x83'

With backports.os on Python2 I get this:

>>> import sys
>>> sys.getfilesystemencoding()
'ANSI_X3.4-1968'
>>> import os
>>> import backports.os as osb
>>> bfile = b'snow \xe2\x98\x83'
>>> ufile = u'snow \udce2\udc98\udc83'
>>> osb.fsdecode(osb.fsencode(bfile))
u'snow \udce2\udc98\udc83'
>>> osb.fsdecode(osb.fsencode(ufile))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pombreda/w421/scancode-toolkit-master/local/lib/python2.7/site-packages/backports/os.py", line 133, in fsencode
    return filename.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character u'\udce2' in position 5: ordinal not in range(128)
>>> osb.fsencode(osb.fsdecode(ufile))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pombreda/w421/scancode-toolkit-master/local/lib/python2.7/site-packages/backports/os.py", line 133, in fsencode
    return filename.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character u'\udce2' in position 5: ordinal not in range(128)
>>> osb.fsencode(osb.fsdecode(bfile))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pombreda/w421/scancode-toolkit-master/local/lib/python2.7/site-packages/backports/os.py", line 133, in fsencode
    return filename.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character u'\udce2' in position 5: ordinal not in range(128)

So you are right that the patch in this PR does not fix the issue at all!

Yet, the backport code does not handle the critical case of always being able to fsencode/fsdecode from bytes or unicode that Python3 handles nicely.
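
For reference, on POSIX Python 3's os.fsencode and os.fsdecode use the surrogateescape error handler, which is enough to round-trip arbitrary bytes through an ASCII filesystem encoding. The sketch below is Python 3 only and passes the codec and error handler explicitly instead of calling os.fsencode, so that it does not depend on the locale of the machine running it.

```python
raw = b'snow \xe2\x98\x83'

# Undecodable bytes 0x80-0xFF become lone surrogates U+DC80-U+DCFF.
text = raw.decode('ascii', 'surrogateescape')

# Encoding with the same handler maps those surrogates back to the original bytes.
roundtrip = text.encode('ascii', 'surrogateescape')

print(ascii(text))
print(roundtrip == raw)
```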

pombredanne · 2017-09-18T22:12:39Z

So it looks like the 3.6 behavior might be to fall back to sys.getdefaultencoding() when the sys.getfilesystemencoding() is 'ascii'. At least that's the behavior observed above.
I also ran a test that forces the encoding in backports.os to 'utf-8' if the codecs.lookup name is 'ascii'. This forces the code down your _HACK_AROUND_PY2_UTF8 path, and it round-trips without errors.

pombredanne · 2017-09-18T22:19:35Z

So my recap:

- This patch is indeed completely useless!
- The rare case where no locale is defined makes this backport fail, and it seems to be not so rare after all when using containers/Docker.
- Python 3.6 os.fsencode/fsdecode handles this case of an ASCII FS encoding all right, quite likely because of a fallback to the default encoding in this case.
- This backport does not, and a fix would then be to pretend the FS is UTF-8 encoded when it is ascii/bytes, to ensure things can be decoded and encoded all right.

Does this make sense to you? If so I can submit a new patch.
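
A rough sketch of the fix proposed in the recap, under the assumption that it amounts to substituting UTF-8 whenever the reported filesystem encoding normalizes to ASCII. The function `effective_fs_encoding` is hypothetical and is not the code that was submitted in #7 or merged in #8.

```python
import codecs


def effective_fs_encoding(reported):
    """Return the encoding to use for paths, given the reported FS encoding.

    Hypothetical helper: treats any alias of ASCII as UTF-8, and otherwise
    returns the canonical codec name.
    """
    name = codecs.lookup(reported).name
    return 'utf-8' if name == 'ascii' else name


encoding = effective_fs_encoding('ANSI_X3.4-1968')
encoded = u'snow \u2603'.encode(encoding)
print(encoding)
print(encoded == b'snow \xe2\x98\x83')
```

In real code the argument would come from sys.getfilesystemencoding(); it is passed in explicitly here so the sketch behaves the same on any machine.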

pombredanne · 2017-09-19T19:17:30Z

Closing this in favor of #7

Normalize filesystemencoding codec name

9f8efee


PiDelport added the enhancement label Sep 10, 2017

sschuberth mentioned this pull request Sep 13, 2017

UnicodeDecodeError backtrace during scan aboutcode-org/scancode-toolkit#688

Closed

pombredanne mentioned this pull request Sep 19, 2017

Ensure encoding is Unicode UTF-8 when no locale is defined #7

Closed

pombredanne closed this Sep 19, 2017

PiDelport mentioned this pull request Sep 19, 2017

Hack around Python 2 ASCII encoding bug / incompatibility #8

Merged

PiDelport added invalid and removed enhancement labels Sep 20, 2017
