-
Notifications
You must be signed in to change notification settings - Fork 5
Ensure encoding is Unicode UTF-8 when no locale is defined #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* this fixes an issue when no locale is defined on Linux and the filesystem encoding is ASCII. By forcing the encoding used for filesystem-realted encode/decode in this case we can use proper surrogatescape encoding and marshall bytes to unicode and back mostly the same way Python 3 does it. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
|
@sschuberth ping, FYI. |
Codecov Report
@@ Coverage Diff @@
## master #7 +/- ##
===========================================
- Coverage 95.83% 26.47% -69.37%
===========================================
Files 5 5
Lines 168 680 +512
Branches 26 115 +89
===========================================
+ Hits 161 180 +19
- Misses 4 490 +486
- Partials 3 10 +7
Continue to review full report at Codecov.
|
|
@pjdelport not sure why the code coverage is going down... since I added more tests. Probably a quirk in codecov and the fact that test without a locale/ascii fs encoding run in a subprocess. |
* see PiDelport/backports.os#7 for details Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
|
@pombredanne: Thank you for this update! I investigated this further, and I think I found the root problem: it actually seems to be a bug (or at least an incompatibility) in how Python 2's ASCII codec calls the This bug can be reproduced independently of locale: On Python 2, $ python2.7 -c "import backports.os; print(repr(u'\udc7f'.encode('ascii', 'surrogateescape')))"
'\x7f'
$ python2.7 -c "import backports.os; print(repr(u'\udc80'.encode('ascii', 'surrogateescape')))"
UnicodeEncodeError: 'ascii' codec can't encode character u'\udc80' in position 0: ordinal not in range(128)(I added instrumentation to the latter case to verify that On Python 3, by contrast, the ASCII codec just uses the result of $ python3.6 -c "print(repr(u'\udc7f'.encode('ascii', 'surrogateescape')))"
b'\x7f'
$ python3.6 -c "print(repr(u'\udc80'.encode('ascii', 'surrogateescape')))"
b'\x80' |
|
Given the above, I don't think this PR is the right fix: it avoids the bug, but it does so by re-interpreting a filesystem encoding of However, I implemented a fix for the bug described above in #8: can you check that out, and see if it works for you? |
|
@pjdelport You are my new unicode star! |
|
Thanks all! So, I believe we're good to close this unmerged as @pombredanne confirmed that #8 works. |
|
Closing as invalid in favor of #8 |
* see PiDelport/backports.os#7 for details Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
this fixes an issue when no locale is defined on Linux
and the filesystem encoding is ASCII. By forcing the encoding used
for filesystem-realted encode/decode in this case we can use
proper surrogatescape encoding and marshall bytes to unicode and back
mostly the same way Python 3 does it.
see Normalize filesystemencoding codec name #4 for a detailed discussion and UnicodeDecodeError backtrace during scan aboutcode-org/scancode-toolkit#688 for some of the issues that motivated this patch
Signed-off-by: Philippe Ombredanne pombredanne@nexb.com