Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs.listdir and UnicodeError #120

Closed
ReimarBauer opened this issue Dec 19, 2017 · 29 comments
Closed

fs.listdir and UnicodeError #120

ReimarBauer opened this issue Dec 19, 2017 · 29 comments

Comments

@ReimarBauer
Copy link

On my system for whatever reason I have a file whith wrong encoding in the / dir.
Always if I want
for item in sorted(self.fs.listdir(_sel_dir)):

I have to encapsulate this by an exception for UnicodeDecodeError. I would prefer to not crash but just ignore this file.

(I am still looking on why that file anyway is there)

@willmcgugan
Copy link
Member

Sounds like a bug. What filesystem? A traceback would be great.

@pombredanne
Copy link
Contributor

pombredanne commented Dec 19, 2017

This is likely to be a error that may happen in general on Linux and Unix: the path cannot be guaranteed to be Unicode-decodable, as this is an unspecified byte string, with some (possibly unknown) encoding.
I have met these issues on a regular basis with https://github.com/nexB/scancode-toolkit and I am considering switching to PyFilesystem. This would be a show stopper to me :|

@pombredanne
Copy link
Contributor

To get some feel for the problem of FS encoding (at least on Python 2) see nexB/scancode-toolkit#688

@pombredanne
Copy link
Contributor

So as reported in scancode by @dengste $ touch test/foo$'\261'bar is all you need to make this fail
You can list this, but you no longer return unicode:

>>> f.listdir(u'.')
[u'foobaz', 'foo\xb1bar']

That's a problem with Python 2 only.

@pombredanne
Copy link
Contributor

Python3 uses surrogate pair encoding and one way to emulate this on Python2 is this https://github.com/pjdelport/backports.os by @pjdelport

@pombredanne
Copy link
Contributor

FWIW, you are not alone there, @jaraco 's path.py has the same issue: jaraco/path#130

@pombredanne
Copy link
Contributor

Here is a more complete snippet:

$ mkdir test
$ touch test/foo$'\261'bar
$ touch test/foobaz
$ pip install fs
$ python2
>>> import fs
>>> f=fs.open_fs(u'test')
>>> for x in f.listdir(u'.'):
...  print repr(x)
...  unicode(x)
... 
u'foobaz'
u'foobaz'
'foo\xb1bar'
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb1 in position 3: ordinal not in range(128)

@pombredanne
Copy link
Contributor

@willmcgugan one issue is that your fsencode/decode https://github.com/PyFilesystem/pyfilesystem2/blob/master/fs/_fscompat.py#L5 may not be as involved as @pjdelport 's https://github.com/pjdelport/backports.os/blob/master/src/backports/os.py The backport of @pjdelport one works on Python2 flawlessly for me and we have tested it on 100+ million files so far.

@pombredanne
Copy link
Contributor

Now the key is that os.listdir is borked on Python2. It used there:

... and only there. So IMHO the fix could be to dabble around there. But then the the fsencode/fsdecode dance to ensure correctness on Linux/Unix may require to be done in many other places: not sure.

@pombredanne
Copy link
Contributor

pombredanne commented Dec 19, 2017

FWIW, @benhoyt scandir package does not fare better on Python2 see benhoyt/scandir#86

>>> list(scandir.scandir('test'))
[<DirEntry 'foo\xb1bar2'>, <DirEntry 'foobaz'>, <DirEntry 'foo\xb1bar'>]
>>> list(scandir.scandir(u'test'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pombreda/tmp/fs/tmp/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 8: invalid start byte

@pombredanne
Copy link
Contributor

@willmcgugan I could take a crack at this as this is a blocker for me. Now, how to best deal with this?
Do you assume that you have unicode internally throughout? The issue is that for this to work then the paths should be fsencoded back to bytes when on Linux/Unix and this at the boundaries of whenever paths are needed. I mean things cannot work out by assuming Linux/Unix can work with unicode. So the assumption that unicode is everywhere breaks.

@willmcgugan
Copy link
Member

@pombredanne Would be happy to accept a PR. This would be something I intend to look at, but couldn't say when I'll have the time.

Paths have to be unicode in the Pyfilesystem api. So the fix would have to be at the boundaries.

I'd be interested to know if the scandir code is similarly affected.

Feel free to email me if you have any questions.

@pombredanne
Copy link
Contributor

@willmcgugan scandir code is affected the same way on Python2

Now the fix is rather engaged, as essentially getsyspath() should return plain bytes when on *nix. Which breaks a legion of things. Alternatively it could take a flag arg like native=True/False that would effectively honor the bytes/unicode hiatus when on *nix.
In any case this is serious heart surgery

@pombredanne
Copy link
Contributor

shrikes: the problem is the "boundaries" are large. For instance, should listdir always return unicode or bytes on *nix and unicode elsewhere? I would consider listdir as a boundary.

@pombredanne
Copy link
Contributor

So this eventually means touching most everything is osfs and fixing a large number of tests

@willmcgugan
Copy link
Member

listdir in the FS interface should definitely return unicode. It's a guarantee made by the api which tries to isolate the developer from precisely this kind of real world nastiness. The boundary points would have to be internal where OSFS calls listdir and scandir

getsyspath could be an exception as it only needs to be a 'path understood by the OS'. So I'd consider allowing that to return bytes, but I'm not sure what that would break off hand.

Maybe a getnativepath would be warranted? Would much rather add to the api than risk break anything.

pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 20, 2017
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 20, 2017
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 20, 2017
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 20, 2017
The approach is that unicode is used everywhere unless when on *nix
and that real access to files is needed. In this case the patch is
encoded to bytes using the filesystem encoding.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 20, 2017
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Copy link
Contributor

@ReimarBauer do you mind testing if the code in #121 from this branch works for you?

@pombredanne
Copy link
Contributor

re

listdir in the FS interface should definitely return unicode. It's a guarantee made by the api which tries to isolate the developer from precisely this kind of real world nastiness. The boundary points would have to be internal where OSFS calls listdir and scandir

The reality is that even on Python3, you cannot use anything realiably that comes as unicode from the os/os.path modules on *nix: you need to fsencode these otherwise this will fail on the cases highlighted here, so shielding users from path semantics with Unicode cannot work as general rule.

pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 21, 2017
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 21, 2017
Instead I added doc to explain that fsencode how can be used
if needed.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 21, 2017
 * Avoid code duplication with a new _get_validated_syspath() method
 * Remove as_bytes arg from getsyspath PyFilesystem#120

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 21, 2017
 * This was mistakenly left over
 * Remove as_bytes arg from getsyspath PyFilesystem#120

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 21, 2017
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 21, 2017
Following @willmcgugan in PyFilesystem#121 this is:
- removing and/or shortcuts and
- does not override path arg variables 

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 22, 2017
 * I had somehow introduced a regression with the previous commit

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@ReimarBauer
Copy link
Author

just got back to this and try to test this :)

@ReimarBauer
Copy link
Author

ReimarBauer commented Feb 15, 2018

The above and I think also the 2.0.18 have the same behaviour for my problem currently. I guess there are different / more issues related too.

names = self.fs.listdir(_sel_dir)

this makes a list with the content of e.g.
names = [u'usr', u'mnt', u'lib64', u'sbin', u'dev', u'proc', '\x01\xa1']

The further processing of this list makes then problems. e.g.

sorted(names)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa1 in position 1: ordinal not in range(128)

My current workaround for this is. (this fork and also 2.0.18)

names = self.fs.listdir(_sel_dir)
for item in names:
    _item = fs.path.combine(_sel_dir, item)
    try:
        self.fs.isdir(_item)
    except TypeError:
        names.remove(item)

This means listdir returns something and isdir cannot handle it.

@ReimarBauer
Copy link
Author

ReimarBauer commented Apr 15, 2018

I got a hint by @appleonkel on the PythonCamp for using from backports.os import fsdecode

name = fsdecode(name)

e.g.
appleonkel/scandir@84110a2

@pombredanne
Copy link
Contributor

@ReimarBauer this is already something that I integrated in my WIP branch 810ee9b#diff-97766fdc3eaf0f62e76fe6d51fff1be2R8

FWIW, there is a bit more to it than just handling this in scandir (or os.listdir)

@ReimarBauer
Copy link
Author

@pombredanne Great! Looking forward :)

@dstromberg
Copy link

I'm seeing this too. I"m considering adding pyfilesystem to http://stromberg.dnsalias.org/~strombrg/backshift/ (a filesystem backup tool), but this bug blocks that.

The error I'm getting in a rudimentary REPL test:

import fs.sshfs
import fs
my_fs = fs.open_fs("ssh://localhost/directory/with/weird/filename/in/it")
for path in my_fs.walk.files():
... print(path)

...many files listed correctly, but then:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 2: invalid continuation byte

@willmcgugan
Copy link
Member

Work in progress to fix this #120

@willmcgugan
Copy link
Member

@ReimarBauer @dstromberg @pombredanne There is a work in progress effort to address this issue. Please give 2.0.22a0 a try, and let me know if that fixes it.

@ReimarBauer
Copy link
Author

thx @willmcgugan
I look soon on it

@dstromberg
Copy link

dstromberg commented May 7, 2018 via email

@althonos
Copy link
Member

althonos commented May 8, 2018

@dstromberg : fs.sshfs is not a part of the core PyFilesystem library. If the fix Will came with works as intended, I'll adapt it to fs.sshfs later. You can open an issue there if you want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants