fs.listdir and UnicodeError #120

ReimarBauer · 2017-12-19T09:10:53Z

On my system, for whatever reason, I have a file with a wrong encoding in the / dir.
Whenever I run
for item in sorted(self.fs.listdir(_sel_dir)):

I have to wrap this in an exception handler for UnicodeDecodeError. I would prefer that it not crash but just ignore this file.

(I am still looking into why that file is there at all.)

willmcgugan · 2017-12-19T09:51:25Z

Sounds like a bug. What filesystem? A traceback would be great.

pombredanne · 2017-12-19T12:00:02Z

This is likely an error that can happen in general on Linux and Unix: a path cannot be guaranteed to be Unicode-decodable, as it is an arbitrary byte string with some (possibly unknown) encoding.
I have met these issues on a regular basis with https://github.com/nexB/scancode-toolkit and I am considering switching to PyFilesystem. This would be a show stopper to me :|

pombredanne · 2017-12-19T12:24:04Z

To get some feel for the problem of FS encoding (at least on Python 2) see nexB/scancode-toolkit#688

pombredanne · 2017-12-19T13:08:19Z

So as reported in scancode by @dengste $ touch test/foo$'\261'bar is all you need to make this fail
You can list this, but you no longer return unicode:

>>> f.listdir(u'.')
[u'foobaz', 'foo\xb1bar']

That's a problem with Python 2 only.

pombredanne · 2017-12-19T13:10:34Z

Python3 uses the surrogateescape error handler (PEP 383) for this, and one way to emulate it on Python2 is https://github.com/pjdelport/backports.os by @pjdelport
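To make the Python 3 behaviour concrete, here is a minimal sketch of the round trip. It assumes a UTF-8 filesystem encoding and spells out the `surrogateescape` handler explicitly instead of calling `os.fsdecode`, whose encoding depends on platform and locale. The variable names are illustrative only.

```python
# Python 3 only: the byte name created by the `touch` command above.
raw_name = b"foo\xb1bar"

# Undecodable bytes become lone surrogates (0xb1 -> U+DCB1) instead of raising.
text_name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", "surrogateescape")

print(repr(text_name))            # 'foo\udcb1bar'
print(round_tripped == raw_name)  # True
```

The decoded name is a valid `str` object, so it can be stored, compared, and sorted, yet it still maps back to the exact bytes the kernel expects.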

pombredanne · 2017-12-19T13:12:24Z

FWIW, you are not alone there, @jaraco 's path.py has the same issue: jaraco/path#130

pombredanne · 2017-12-19T13:16:56Z

Here is a more complete snippet:

$ mkdir test
$ touch test/foo$'\261'bar
$ touch test/foobaz
$ pip install fs
$ python2
>>> import fs
>>> f=fs.open_fs(u'test')
>>> for x in f.listdir(u'.'):
...  print repr(x)
...  unicode(x)
... 
u'foobaz'
u'foobaz'
'foo\xb1bar'
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb1 in position 3: ordinal not in range(128)

pombredanne · 2017-12-19T14:36:38Z

@willmcgugan one issue is that your fsencode/fsdecode https://github.com/PyFilesystem/pyfilesystem2/blob/master/fs/_fscompat.py#L5 may not be as thorough as @pjdelport's https://github.com/pjdelport/backports.os/blob/master/src/backports/os.py That backport works flawlessly on Python2 for me, and we have tested it on 100+ million files so far.

pombredanne · 2017-12-19T15:12:06Z

Now the key is that os.listdir is borked on Python2. It is used here:

pyfilesystem2/fs/osfs.py, line 267 in b8cc82b:

names = os.listdir(sys_path)

pyfilesystem2/fs/osfs.py, line 455 in b8cc82b:

for entry_name in os.listdir(sys_path):

... and only there. So IMHO the fix could be to dabble around there. But then the fsencode/fsdecode dance to ensure correctness on Linux/Unix may need to be done in many other places: not sure.
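A rough sketch of what "dabbling around there" could look like: list the directory as bytes and decode every name at the boundary, so callers never see a mix of bytes and text. This is not the code in osfs.py; `listdir_text` is a hypothetical name, and the sketch is written for Python 3 (on Python 2, `backports.os` provides `fsencode`/`fsdecode` with the same intent).

```python
import os


def listdir_text(sys_path):
    """List sys_path and return text names only (hypothetical helper)."""
    # Passing a bytes path makes os.listdir return bytes names.
    raw_names = os.listdir(os.fsencode(sys_path))
    # Decode at the boundary; on POSIX, os.fsdecode uses surrogateescape,
    # so undecodable bytes are preserved instead of raising.
    return [os.fsdecode(raw_name) for raw_name in raw_names]
```

On Python 3, `os.listdir` called with a text path already behaves this way; the explicit version only shows where the conversion happens.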

pombredanne · 2017-12-19T16:58:27Z

FWIW, @benhoyt's scandir package does not fare better on Python2, see benhoyt/scandir#86

>>> list(scandir.scandir('test'))
[<DirEntry 'foo\xb1bar2'>, <DirEntry 'foobaz'>, <DirEntry 'foo\xb1bar'>]
>>> list(scandir.scandir(u'test'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pombreda/tmp/fs/tmp/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 8: invalid start byte

pombredanne · 2017-12-20T08:45:44Z

@willmcgugan I could take a crack at this as this is a blocker for me. Now, how to best deal with this?
Do you assume that you have unicode internally throughout? The issue is that for this to work then the paths should be fsencoded back to bytes when on Linux/Unix and this at the boundaries of whenever paths are needed. I mean things cannot work out by assuming Linux/Unix can work with unicode. So the assumption that unicode is everywhere breaks.

willmcgugan · 2017-12-20T09:37:09Z

@pombredanne Would be happy to accept a PR. This would be something I intend to look at, but couldn't say when I'll have the time.

Paths have to be unicode in the Pyfilesystem api. So the fix would have to be at the boundaries.

I'd be interested to know if the scandir code is similarly affected.

Feel free to email me if you have any questions.

pombredanne · 2017-12-20T10:14:44Z

@willmcgugan scandir code is affected the same way on Python2

Now the fix is rather involved, as essentially getsyspath() should return plain bytes when on *nix, which breaks a legion of things. Alternatively it could take a flag arg like native=True/False that would effectively honor the bytes/unicode divide when on *nix.
In any case this is serious heart surgery

pombredanne · 2017-12-20T11:15:47Z

shrikes: the problem is the "boundaries" are large. For instance, should listdir always return unicode or bytes on *nix and unicode elsewhere? I would consider listdir as a boundary.

pombredanne · 2017-12-20T11:16:37Z

So this eventually means touching almost everything in osfs and fixing a large number of tests

willmcgugan · 2017-12-20T11:25:53Z

listdir in the FS interface should definitely return unicode. It's a guarantee made by the api which tries to isolate the developer from precisely this kind of real world nastiness. The boundary points would have to be internal where OSFS calls listdir and scandir

getsyspath could be an exception as it only needs to be a 'path understood by the OS'. So I'd consider allowing that to return bytes, but I'm not sure what that would break off hand.

Maybe a getnativepath would be warranted? Would much rather add to the api than risk breaking anything.
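For illustration only, a getnativepath-style helper might look roughly like the sketch below. The name `get_native_path` and its behaviour are assumptions made for this example, not an existing PyFilesystem API.

```python
import os


def get_native_path(sys_path):
    """Hypothetical helper: return sys_path in the form the OS uses natively."""
    if os.name == "posix":
        # POSIX paths are byte strings; surrogate-escaped characters map
        # back to the original undecodable bytes.
        return os.fsencode(sys_path)
    # Elsewhere (e.g. Windows) the native form is already text.
    return sys_path
```

Keeping this separate from getsyspath would leave the existing unicode guarantee untouched while still giving callers a path the OS accepts.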

The approach is that unicode is used everywhere, except on *nix when real access to files is needed. In that case the path is encoded to bytes using the filesystem encoding. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne · 2017-12-20T17:43:44Z

@ReimarBauer do you mind testing if the code in #121 from this branch works for you?

pombredanne · 2017-12-20T17:46:05Z

re

listdir in the FS interface should definitely return unicode. It's a guarantee made by the api which tries to isolate the developer from precisely this kind of real world nastiness. The boundary points would have to be internal where OSFS calls listdir and scandir

The reality is that even on Python3, you cannot reliably use anything that comes as unicode from the os/os.path modules on *nix: you need to fsencode these, otherwise this will fail on the cases highlighted here. So shielding users from path semantics with Unicode cannot work as a general rule.
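A small Python 3 sketch of the failure mode described here, assuming UTF-8: a name decoded with `surrogateescape` is a valid `str`, but encoding it strictly (as happens when it is written to a UTF-8 stream) fails. `display_name` is a hypothetical helper that trades exactness for printability.

```python
def display_name(name):
    """Return a printable form of a possibly surrogate-escaped name."""
    # Recover the original bytes, then decode again, replacing bad bytes.
    raw = name.encode("utf-8", "surrogateescape")
    return raw.decode("utf-8", "replace")


name = "foo\udcb1bar"  # what Python 3 returns for the bytes b"foo\xb1bar"

try:
    name.encode("utf-8")  # strict encoding rejects the lone surrogate
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(strict_ok)                 # False
print(repr(display_name(name)))  # 'foo\ufffdbar'
```

The printable form is lossy, so it is only suitable for display; file access still needs the original name or its fsencoded bytes.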

Instead I added doc to explain how fsencode can be used if needed. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

* Avoid code duplication with a new _get_validated_syspath() method
* Remove as_bytes arg from getsyspath PyFilesystem#120
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

* This was mistakenly left over
* Remove as_bytes arg from getsyspath PyFilesystem#120
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

Following @willmcgugan in PyFilesystem#121 this is:
- removing and/or shortcuts and
- does not override path arg variables
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

* I had somehow introduced a regression with the previous commit
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

ReimarBauer · 2018-02-15T13:21:12Z

Just got back to this and am trying to test it :)

ReimarBauer · 2018-02-15T15:53:22Z

The branch above and, I think, 2.0.18 as well currently show the same behaviour for my problem. I guess there are other, related issues too.

names = self.fs.listdir(_sel_dir)

This produces a list with contents such as:
names = [u'usr', u'mnt', u'lib64', u'sbin', u'dev', u'proc', '\x01\xa1']

Further processing of this list then causes problems, e.g.

sorted(names)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa1 in position 1: ordinal not in range(128)

My current workaround for this (with this fork and also 2.0.18) is:

names = self.fs.listdir(_sel_dir)
# Iterate over a copy so that removing entries does not skip items.
for item in list(names):
    _item = fs.path.combine(_sel_dir, item)
    try:
        self.fs.isdir(_item)
    except TypeError:
        # isdir cannot handle the undecoded byte-string name; drop it.
        names.remove(item)

This means listdir returns something that isdir cannot handle.
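As an alternative to dropping the entry, the mixed list can be normalized to text before sorting. The sketch below is Python 3 and assumes UTF-8 with `surrogateescape`; `to_text` is a hypothetical helper, and the list mirrors the one shown above.

```python
def to_text(name, encoding="utf-8"):
    """Return name as text, decoding bytes without raising (hypothetical)."""
    if isinstance(name, bytes):
        return name.decode(encoding, "surrogateescape")
    return name


names = ["usr", "mnt", "lib64", "sbin", "dev", "proc", b"\x01\xa1"]

# Every element is text after normalization, so comparison is well defined.
sorted_names = sorted(to_text(name) for name in names)
print(sorted_names)
```

The odd file stays in the listing, and its original bytes can still be recovered by encoding with the same error handler.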

ReimarBauer · 2018-04-15T07:09:59Z

I got a hint from @appleonkel at the PythonCamp to use fsdecode:

from backports.os import fsdecode

name = fsdecode(name)

e.g.
appleonkel/scandir@84110a2

pombredanne · 2018-04-15T12:09:39Z

@ReimarBauer this is already something that I integrated in my WIP branch 810ee9b#diff-97766fdc3eaf0f62e76fe6d51fff1be2R8

FWIW, there is a bit more to it than just handling this in scandir (or os.listdir)

ReimarBauer · 2018-04-15T17:56:58Z

@pombredanne Great! Looking forward :)

dstromberg · 2018-04-25T17:06:51Z

I'm seeing this too. I'm considering adding pyfilesystem to http://stromberg.dnsalias.org/~strombrg/backshift/ (a filesystem backup tool), but this bug blocks that.

The error I'm getting in a rudimentary REPL test:

>>> import fs.sshfs
>>> import fs
>>> my_fs = fs.open_fs("ssh://localhost/directory/with/weird/filename/in/it")
>>> for path in my_fs.walk.files():
...     print(path)

...many files listed correctly, but then:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 2: invalid continuation byte

willmcgugan · 2018-05-02T16:16:09Z

Work in progress to fix this #120

willmcgugan · 2018-05-04T14:41:07Z

@ReimarBauer @dstromberg @pombredanne There is a work in progress effort to address this issue. Please give 2.0.22a0 a try, and let me know if that fixes it.

ReimarBauer · 2018-05-07T07:17:23Z

thx @willmcgugan
I will look at it soon.

dstromberg · 2018-05-07T21:06:47Z

Hi folks.

I just tried walking with osfs:// using 2.0.20, and got no errors. ssh:// gives an error with both 2.0.20 and 2.0.20a0.

BTW, I'm also getting an error on a symlink that causes itself to be retraversed. IOW: ./c/d/2 -> ..
It seems to be trying to traverse forever. This happens with osfs - I haven't tried it with ssh yet.

I'm trying to use pyfilesystem2 to walk a directory hierarchy I created for testing backshift: http://stromberg.dnsalias.org/~strombrg/backshift/

The code I'm testing pyfilesystem2 with looks like:

#!./bin/python3

"""List a couple of test directories, to see if pyfilesystem2 can deal with non-unicode filenames and self-referential symlinks."""

import fs
import fs.sshfs


def list_files(filesys):
    """List files in filesys."""
    for path in filesys.walk.files():
        print(type(path), path)


def main():
    """List a test directory."""
    filesys = fs.open_fs('ssh://localhost/home/dstromberg/src/home-svn/backshift/trunk/tests/50-encoding-2.6-3.1')
    list_files(filesys)
    print()
    filesys = fs.open_fs('ssh://localhost/home/dstromberg/src/home-svn/backshift/trunk/tests/57-symlinks')
    list_files(filesys)


main()

Thanks!

-- Dan Stromberg
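The self-referential symlink problem mentioned above is separate from the encoding issue. As a point of comparison, here is a minimal standard-library sketch (not PyFilesystem code) of a walk that follows symlinks but visits each real directory only once, by tracking device and inode numbers. `walk_files_once` is a hypothetical name, and the approach assumes a POSIX-style filesystem where `st_ino` is meaningful.

```python
import os


def walk_files_once(root):
    """Yield file paths under root, visiting each real directory once."""
    seen = set()
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        info = os.stat(dirpath)
        key = (info.st_dev, info.st_ino)
        if key in seen:
            # Already visited via another path (e.g. a symlink to a parent):
            # prune in place so os.walk does not descend any further.
            dirnames[:] = []
            continue
        seen.add(key)
        for filename in filenames:
            yield os.path.join(dirpath, filename)
```

With a layout like ./c/d/2 -> .. the walk reaches c/d/2, recognizes it as the already-visited c, and stops there instead of recursing indefinitely.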

althonos · 2018-05-08T09:03:03Z

@dstromberg: fs.sshfs is not a part of the core PyFilesystem library. If the fix Will came up with works as intended, I'll adapt it to fs.sshfs later. You can open an issue there if you want.

pombredanne mentioned this issue Dec 19, 2017

Allow a plugin to provide more than one option nexB/scancode-toolkit#787

Closed

willmcgugan added accepted bug labels Dec 19, 2017

pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 20, 2017

Use @pjdelport backports.os on Py2 PyFilesystem#120

810ee9b

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 20, 2017

Add new tests for non-unicode bytes paths PyFilesystem#120

b3c22c7

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 20, 2017

Add optional "as_bytes" arg PyFilesystem#120

ce6ae08

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne mentioned this issue Dec 20, 2017

Add support for non-unicode, bytes-only paths on Linux and *nix #121

Closed

pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 20, 2017

Ensure tests pass on Python3 PyFilesystem#120

116c5d6

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 21, 2017

fsencode only on *nix and Python2 PyFilesystem#120

257cc1f

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 21, 2017

Remove as_bytes arg from getsyspath PyFilesystem#120

6d5181b

Instead I added doc to explain how fsencode can be used if needed. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 21, 2017

Remove as_bytes arg PyFilesystem#120

a48d134

* This was mistakenly left over * Remove as_bytes arg from getsyspath PyFilesystem#120 Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 21, 2017

Pin backports.os with correct Python version PyFilesystem#120

3504187

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

pombredanne added a commit to pombredanne/pyfilesystem2 that referenced this issue Dec 22, 2017

Cleanup scandir bytes vs unicode handling PyFilesystem#120

b7d9e87

* I had somehow introduced a regression with the previous commit Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>

willmcgugan mentioned this issue May 2, 2018

handle broken unicode paths #167

Merged

willmcgugan closed this as completed in #167 May 12, 2018

aidanheerdegen mentioned this issue Jul 4, 2018

Not enough people know about PyFilesystem #177

Open
