Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for non-unicode, bytes-only paths on Linux and *nix #121

Closed
wants to merge 14 commits into from

Conversation

pombredanne
Copy link
Contributor

This is a fix for #120
Signed-off-by: Philippe Ombredanne pombredanne@nexb.com

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
The approach is that unicode is used everywhere unless when on *nix
and that real access to files is needed. In this case the patch is
encoded to bytes using the filesystem encoding.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.9%) to 99.072% when pulling 7c7886f on pombredanne:unicode into ee3e335 on PyFilesystem:master.

@pombredanne pombredanne changed the title [WPI] Add support for non-unicode, bytes-only paths on Linux and *nix [WIP] Add support for non-unicode, bytes-only paths on Linux and *nix Dec 20, 2017
fs/base.py Outdated
@@ -599,14 +599,16 @@ def getsize(self, path):
size = self.getdetails(path).size
return size

def getsyspath(self, path):
def getsyspath(self, path, as_bytes=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not overly keen on overloading getsyspath like this.

How about leaving getsyspath as returning unicode, and add a getnativepath which may return any format (presumably bytes for *nix).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willmcgugan your call. I had started doing so but then I found it heavy: to have another method, another error, etc. And since your docstring for getsyspath states:

        A system path is one recognized by the OS, that may be used
        outside of PyFilesystem (in an application or a shell for
        example). This method will get the corresponding system path
        that would be referenced by ``path``.

... I find it really hard to have another method with different docstring... since on *nix the real system path is bytes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what would happen if you were to get a syspath with as_bytes=False where the path couldn't be decoded as unicode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as_bytes=False is the default so you still always get unicode out. With as_bytes=True you will always get bytes out using the filesystem encoding. There is no case I can think of where there would be any possible issue (unless if you have no filesystem encoding at all.... but that is the source of so many other possible errors and is a gross misconfiguration of the OS that's it is not worth catering to)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but again, your call if you think that would break some API. I think not since this is only additive with a default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need to give this one some thought!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note also that strictly speaking the conversion to bytes on *nix is only required on Python2. Python3 handles this fine internally (and the os.path and related stdlib have been sprinkled with plenty of os.fsencode/fsdecode as needed) .... So I could make the test for both *nix and py2 ... now the need to get bytes is still valid on Py2 and Py3 as a unicode fsdecoded path such as in #120 cannot be used in a *nix shell outside of Python unless it is fsencoded first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit only deals with bytes internally on Python 2 and *nix. Everywhere else, this is unicode throughout. We still need of course the ability to fsencode some unicode, but frankly this could be left out entirely of the getsyspath method. Instead we could update the doc and stating that getsyspath always return Unicode and when and why this could be a problem, and how to fsencode the received unicode for use outside if needed.

This could be a happy middle ground.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pombredanne Sounds good. But could you drop the as_bytes then? Calling this method with the new parameter will break with most of the other filesystems and all of the external filesystems.

I'm still favouring the getnativepath which may return bytes or unicode, but leaving getsyspath as returning (surrogate encoded) unicode. The default getnativepath could just return the same value as getsyspath

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still favouring the getnativepath which may return bytes or unicode, but leaving getsyspath as returning (surrogate encoded) unicode. The default getnativepath could just return the same value as getsyspath

Do you want this at all? I thing this is no longer needed as the doc explains that getsystempath is always unicode and that some use cases may need fsencoding to be usable

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@coveralls
Copy link

coveralls commented Dec 20, 2017

Coverage Status

Coverage decreased (-0.05%) to 99.902% when pulling 116c5d6 on pombredanne:unicode into ee3e335 on PyFilesystem:master.

@pombredanne
Copy link
Contributor Author

pombredanne commented Dec 20, 2017

@willmcgugan BTW, coveralls complains for bits that can be tested only when on macOS or Windows. #122 is needed to deal with this.... b ut then coverall is crass at merging reports from multiple runs last I checked, while codecode was OK there.

@pombredanne pombredanne changed the title [WIP] Add support for non-unicode, bytes-only paths on Linux and *nix Add support for non-unicode, bytes-only paths on Linux and *nix Dec 20, 2017
@pombredanne
Copy link
Contributor Author

Sounds like this is decently passing the tests now and is ready for your review.

@pombredanne
Copy link
Contributor Author

as a fun side note: with this patch you could then claim that you are the only path abstraction library that works on all OS and FS encodings on Py2 and Py3. There are none anywhere else. I searched hard.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@coveralls
Copy link

coveralls commented Dec 21, 2017

Coverage Status

Coverage decreased (-0.02%) to 99.927% when pulling 257cc1f on pombredanne:unicode into ee3e335 on PyFilesystem:master.

Instead I added doc to explain that fsencode how can be used
if needed.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
 * Avoid code duplication with a new _get_validated_syspath() method
 * Remove as_bytes arg from getsyspath PyFilesystem#120

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@coveralls
Copy link

coveralls commented Dec 21, 2017

Coverage Status

Coverage decreased (-0.02%) to 99.926% when pulling 0ebc631 on pombredanne:unicode into ee3e335 on PyFilesystem:master.

@pombredanne
Copy link
Contributor Author

@willmcgugan With the latest commits, there is no change in API needed at all, and the internal use of fsencoded bytes is limited to *nix and Py2.

Copy link
Member

@willmcgugan willmcgugan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Hope you don't mind my pickiness.

setup.py Outdated
@@ -37,7 +37,8 @@
install_requires=REQUIREMENTS,
extras_require={
"scandir :python_version < '3.5'": ['scandir~=1.5'],
":python_version < '3.4'": ['enum34~=1.1.6']
":python_version < '3.4'": ['enum34~=1.1.6'],
":python_version < '3.0'": ['backports.os'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fsencode was introduced in Python3.2. Could you pin the backports.os module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

requirements.txt Outdated
@@ -3,3 +3,4 @@ enum34==1.1.6 ; python_version < '3.4'
pytz
setuptools
six==1.10.0
backports.os
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you pin this, and add the Python version syntax...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

fs/base.py Outdated
@@ -599,14 +599,16 @@ def getsize(self, path):
size = self.getdetails(path).size
return size

def getsyspath(self, path):
def getsyspath(self, path, as_bytes=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pombredanne Sounds good. But could you drop the as_bytes then? Calling this method with the new parameter will break with most of the other filesystems and all of the external filesystems.

I'm still favouring the getnativepath which may return bytes or unicode, but leaving getsyspath as returning (surrogate encoded) unicode. The default getnativepath could just return the same value as getsyspath

fs/osfs.py Outdated
return names

def makedir(self, path, permissions=None, recreate=False):
self.check()
mode = Permissions.get_mode(permissions)
path = path and fsdecode(path) or path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a convention we use that you don't modify a path variable once its passed in. The reason is that when you raise an error containing the path, the error message will contain the same path that the developer called the method with.

Could you replace this pattern with path = fsdecode(path) if path else path ? It's more familiar for developers who came to Python recently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can use a _path instead alright. But the and/or is more pythonic than if/else in my book. I can admit though that it may not be obvious ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

 * This was mistakenly left over
 * Remove as_bytes arg from getsyspath PyFilesystem#120

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@coveralls
Copy link

coveralls commented Dec 21, 2017

Coverage Status

Coverage decreased (-0.02%) to 99.926% when pulling a48d134 on pombredanne:unicode into ee3e335 on PyFilesystem:master.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@coveralls
Copy link

coveralls commented Dec 21, 2017

Coverage Status

Coverage decreased (-0.02%) to 99.926% when pulling 3504187 on pombredanne:unicode into ee3e335 on PyFilesystem:master.

Following @willmcgugan in PyFilesystem#121 this is:
- removing and/or shortcuts and
- does not override path arg variables 

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Copy link
Contributor Author

Thanks. Hope you don't mind my pickiness.

I can be much more picky than this ;)

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.8%) to 99.143% when pulling 77c0548 on pombredanne:unicode into ee3e335 on PyFilesystem:master.

@pombredanne
Copy link
Contributor Author

@willmcgugan The latest should have all your comments/feedback applied.

@pombredanne
Copy link
Contributor Author

@willmcgugan Some thing went south on the Py3 tests. Let me review this

 * I had somehow introduced a regression with the previous commit

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@coveralls
Copy link

coveralls commented Dec 22, 2017

Coverage Status

Coverage increased (+0.0002%) to 99.951% when pulling b7d9e87 on pombredanne:unicode into ee3e335 on PyFilesystem:master.

@pombredanne
Copy link
Contributor Author

I cleaned up the regression I had introduced. We should be all good now.

@pombredanne
Copy link
Contributor Author

pombredanne commented Dec 22, 2017

Unless you have some strong objections left about this PR, I will build a local wheel until you have made a release and start using this in ScanCode. A pypi release soon enough would be welcomed!

@pombredanne
Copy link
Contributor Author

Hold on merging this. There are plenty of issues left to test and fix on *nix with non-decodable paths. I am playing with beautiful crashes so far.

@willmcgugan
Copy link
Member

@pombredanne I think I can live without the getnativepath for now. I'll consider adding it myself later.

No rush, will be AFK for a few days over xmas.

@willmcgugan
Copy link
Member

@pombredanne Did you resolve the crashes? Is it ready for review?

@willmcgugan
Copy link
Member

The code has moved on a bit. And I've attempted to tackle this issue in #167

@willmcgugan
Copy link
Member

willmcgugan commented May 12, 2018

Closing in favour of #167

Thanks for taking the lead on this one @pombredanne

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants