Ensure encoding is Unicode UTF-8 when no locale is defined #7

pombredanne · 2017-09-19T16:11:17Z

This fixes an issue when no locale is defined on Linux
and the filesystem encoding is ASCII. By forcing the encoding used
for filesystem-related encode/decode to UTF-8 in this case, we can use
proper surrogateescape encoding and marshal bytes to unicode and back
mostly the same way Python 3 does it.
See Normalize filesystemencoding codec name #4 for a detailed discussion, and UnicodeDecodeError backtrace during scan aboutcode-org/scancode-toolkit#688 for some of the issues that motivated this patch.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
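
For context, here is a minimal sketch of the round trip the description refers to, written for Python 3 where `surrogateescape` is built in. The helper names `fs_decode` and `fs_encode` and the constant `FS_ENCODING` are hypothetical; they are not the functions changed in this patch.

```python
# Hypothetical helpers for illustration, not the backports.os implementation.
FS_ENCODING = "utf-8"  # assumption: the encoding forced when no locale is set


def fs_decode(raw: bytes) -> str:
    """Decode a filename; undecodable bytes become lone surrogates U+DC80..U+DCFF."""
    return raw.decode(FS_ENCODING, "surrogateescape")


def fs_encode(name: str) -> bytes:
    """Encode a filename; escaped surrogates turn back into the original bytes."""
    return name.encode(FS_ENCODING, "surrogateescape")


raw = b"caf\xe9.txt"           # Latin-1 bytes, not valid UTF-8
name = fs_decode(raw)          # 'caf\udce9.txt'
assert fs_encode(name) == raw  # the original bytes survive the round trip
```

The point of the error handler is that decoding never fails and never loses information, so arbitrary filenames can be passed through text APIs and written back unchanged.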

pombredanne · 2017-09-19T16:12:08Z

@sschuberth ping, FYI.

codecov · 2017-09-19T16:15:11Z

Codecov Report

Merging #7 into master will decrease coverage by 69.36%.
The diff coverage is 61.29%.

@@             Coverage Diff             @@
##           master       #7       +/-   ##
===========================================
- Coverage   95.83%   26.47%   -69.37%     
===========================================
  Files           5        5               
  Lines         168      680      +512     
  Branches       26      115       +89     
===========================================
+ Hits          161      180       +19     
- Misses          4      490      +486     
- Partials        3       10        +7

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/backports/os.py | `13.58% <28.57%> (-80.01%)` | ⬇️ |
| tests/test_os.py | `75.67% <70.83%> (-10.04%)` | ⬇️ |
| tests/test_helpers.py | `95.83% <0%> (-4.17%)` | ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 848c092...6024baa.

pombredanne · 2017-09-19T16:36:51Z

@pjdelport Not sure why the code coverage is going down, since I added more tests. It is probably a quirk in Codecov combined with the fact that the tests without a locale (ASCII filesystem encoding) run in a subprocess.

PiDelport · 2017-09-19T23:37:30Z

@pombredanne: Thank you for this update!

I investigated this further, and I think I found the root problem: it actually seems to be a bug (or at least an incompatibility) in how Python 2's ASCII codec calls the surrogateescape error handler.

This bug can be reproduced independently of locale:

On Python 2, surrogateescape encodes surrogates for low bytes to ASCII as expected; however, as soon as it encodes surrogates for high bytes, the ASCII codec seems to disregard the result of surrogateescape, and fails anyway:

$ python2.7 -c "import backports.os; print(repr(u'\udc7f'.encode('ascii', 'surrogateescape')))"
'\x7f'

$ python2.7 -c "import backports.os; print(repr(u'\udc80'.encode('ascii', 'surrogateescape')))"
UnicodeEncodeError: 'ascii' codec can't encode character u'\udc80' in position 0: ordinal not in range(128)

(I added instrumentation to the latter case to verify that surrogateescape is actually being called, and is correctly returning \x80.)

On Python 3, by contrast, the ASCII codec just uses the result of surrogateescape as expected:

$ python3.6 -c "print(repr(u'\udc7f'.encode('ascii', 'surrogateescape')))"
b'\x7f'

$ python3.6 -c "print(repr(u'\udc80'.encode('ascii', 'surrogateescape')))"
b'\x80'
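
The Python 3 behaviour that the backport is expected to match can be checked over the whole high-byte range. This is a small sketch for Python 3 only; it does not import `backports.os` and does not reproduce the Python 2 failure, and the name `high_byte_roundtrip_ok` is only illustrative.

```python
# Python 3 only: every escaped high byte survives an ASCII round trip.
for value in range(0x80, 0x100):
    raw = bytes([value])
    text = raw.decode("ascii", "surrogateescape")
    # Each undecodable byte 0xNN maps to the lone surrogate U+DCNN.
    assert text == chr(0xDC00 + value)
    assert text.encode("ascii", "surrogateescape") == raw

high_byte_roundtrip_ok = True
```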

PiDelport · 2017-09-19T23:50:28Z

Given the above, I don't think this PR is the right fix: it avoids the bug, but it does so by re-interpreting a filesystem encoding of ascii as utf-8, which changes what gets encoded and decoded compared to Python 3 with an ascii filesystem encoding.

However, I implemented a fix for the bug described above in #8: can you check that out, and see if it works for you?
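
To illustrate the concern about re-interpreting `ascii` as `utf-8`, here is a short Python 3 sketch (the variable names are only illustrative). It shows that the same bytes decode to different text under the two codecs, even though each round trip is lossless on its own.

```python
raw = b"\xc3\xa9"  # the UTF-8 encoding of U+00E9

as_ascii = raw.decode("ascii", "surrogateescape")
as_utf8 = raw.decode("utf-8", "surrogateescape")

# ASCII escapes both bytes; UTF-8 decodes them as a single character.
assert as_ascii == "\udcc3\udca9"
assert as_utf8 == "\u00e9"

# Both round trips restore the original bytes ...
assert as_ascii.encode("ascii", "surrogateescape") == raw
assert as_utf8.encode("utf-8", "surrogateescape") == raw

# ... but the text forms differ and are not interchangeable:
# U+00E9 is not an escaped byte, so the ASCII codec cannot encode it.
try:
    as_utf8.encode("ascii", "surrogateescape")
    mixed_ok = True
except UnicodeEncodeError:
    mixed_ok = False
```

Code that decodes with one codec and encodes with the other would therefore see different strings, or fail outright, which is the behavioural difference from Python 3 described above.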

pombredanne · 2017-09-20T04:42:43Z

@pjdelport You are my new unicode star!

sschuberth · 2017-09-20T06:18:29Z

Thanks all! So, I believe we're good to close this unmerged as @pombredanne confirmed that #8 works.

pombredanne · 2017-09-20T15:02:42Z

Closing as invalid in favor of #8

pombredanne mentioned this pull request Sep 19, 2017

Normalize filesystemencoding codec name #4

Closed

pombredanne added a commit to aboutcode-org/scancode-toolkit that referenced this pull request Sep 19, 2017

Update backports.os for ASCII FS encoding

4ba35c1

* see PiDelport/backports.os#7 for details Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

PiDelport mentioned this pull request Sep 19, 2017

Hack around Python 2 ASCII encoding bug / incompatibility #8

Merged

pombredanne closed this Sep 20, 2017

PiDelport added the invalid label Sep 20, 2017

pombredanne added a commit to aboutcode-org/scancode-toolkit that referenced this pull request Sep 22, 2017

Update backports.os for ASCII FS encoding

cd9befe

* see PiDelport/backports.os#7 for details Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
