Add support for non-unicode, bytes-only paths on Linux and *nix #121
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
The approach is that unicode is used everywhere, except on *nix when real access to files is needed. In that case the path is encoded to bytes using the filesystem encoding. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
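A minimal sketch of the approach described in this commit message, not the PR's actual code: paths stay unicode until real file access is needed, and only then are they encoded with the filesystem encoding. The helper name `to_access_path` and its `posix` parameter are hypothetical; `os.fsencode` is the Python 3 standard library call (on Python 2 this PR relies on `backports.os` instead).

```python
import os


def to_access_path(path, posix=(os.name == "posix")):
    """Return the form of ``path`` to hand to the OS (hypothetical helper)."""
    if posix:
        # On *nix the real path is bytes: encode with the filesystem encoding.
        return os.fsencode(path)
    # Elsewhere, keep unicode throughout.
    return path
```

The `posix` flag only exists so the two branches can be exercised on any platform; real code would presumably decide once from the platform and Python version.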
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
fs/base.py
Outdated
@@ -599,14 +599,16 @@ def getsize(self, path):
         size = self.getdetails(path).size
         return size

-    def getsyspath(self, path):
+    def getsyspath(self, path, as_bytes=False):
I'm not overly keen on overloading `getsyspath` like this.
How about leaving `getsyspath` as returning unicode, and adding a `getnativepath` which may return any format (presumably bytes for *nix)?
@willmcgugan your call. I had started doing so but then I found it heavy: to have another method, another error, etc. And since your docstring for `getsyspath` states:
A system path is one recognized by the OS, that may be used
outside of PyFilesystem (in an application or a shell for
example). This method will get the corresponding system path
that would be referenced by ``path``.
... I find it really hard to have another method with a different docstring, since on *nix the real system path is bytes.
So what would happen if you were to get a syspath with `as_bytes=False` where the path couldn't be decoded as unicode?
`as_bytes=False` is the default, so you still always get `unicode` out. With `as_bytes=True` you will always get `bytes` out, using the filesystem encoding. There is no case I can think of where there would be any possible issue (unless you have no filesystem encoding at all, but that is the source of so many other possible errors, and such a gross misconfiguration of the OS, that it is not worth catering to).
But again, your call if you think that would break some API. I think not, since this is only additive with a default.
Will need to give this one some thought!
Note also that, strictly speaking, the conversion to bytes on *nix is only required on Python 2. Python 3 handles this fine internally (`os.path` and the related stdlib have been sprinkled with plenty of `os.fsencode`/`os.fsdecode` as needed), so I could make the test for both *nix and Py2.
That said, the need to get bytes is still valid on both Py2 and Py3, as a unicode fsdecoded path such as in #120 cannot be used in a *nix shell outside of Python unless it is fsencoded first.
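To illustrate why a decoded path may need re-encoding before it is used outside Python, here is a small sketch. It assumes a UTF-8 filesystem encoding and spells out the `surrogateescape` error handler explicitly so the result does not depend on the machine; on *nix with Python 3, `os.fsdecode` and `os.fsencode` apply the same handler with the actual filesystem encoding. The file name is made up.

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8 (made-up name)

# Decoding maps the undecodable byte to a lone surrogate instead of failing.
text = raw.decode("utf-8", "surrogateescape")
assert text == "caf\udce9.txt"

# The surrogate is not real text, so a strict encode refuses it ...
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ... but encoding with the same handler restores the original bytes.
roundtrip = text.encode("utf-8", "surrogateescape")
```

This is why the unicode value alone is not enough for a shell: only the re-encoded bytes name the file on disk.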
The latest commit only deals with bytes internally on Python 2 and *nix. Everywhere else, this is unicode throughout. We still need, of course, the ability to fsencode some unicode, but frankly this could be left out of the `getsyspath` method entirely. Instead we could update the doc, stating that `getsyspath` always returns unicode, when and why this could be a problem, and how to `fsencode` the received unicode for use outside if needed.
This could be a happy middle ground.
@pombredanne Sounds good. But could you drop the `as_bytes` then? Calling this method with the new parameter will break with most of the other filesystems and all of the external filesystems.
I'm still favouring the `getnativepath` which may return bytes or unicode, but leaving `getsyspath` as returning (surrogate encoded) unicode. The default `getnativepath` could just return the same value as `getsyspath`.
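The alternative proposed here could look roughly like the sketch below. This is only an illustration of the idea, not code from the PR or the library: the class names and the `/srv/data/` root are hypothetical, and `getnativepath` is the method name suggested in this thread, not an existing API.

```python
import os


class BaseFSSketch(object):
    """Hypothetical stand-in for a filesystem base class."""

    def getsyspath(self, path):
        # Always unicode (possibly surrogate-escaped).
        return u"/srv/data/" + path.lstrip(u"/")

    def getnativepath(self, path):
        # Default: same value as getsyspath, so existing filesystems
        # keep working without any change.
        return self.getsyspath(path)


class PosixFSSketch(BaseFSSketch):
    """Hypothetical *nix filesystem where the native path is bytes."""

    def getnativepath(self, path):
        return os.fsencode(self.getsyspath(path))
```

With this shape, the signature of `getsyspath` is untouched, so external filesystems that override it are unaffected; only filesystems that care about bytes override `getnativepath`.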
> I'm still favouring the `getnativepath` which may return bytes or unicode, but leaving `getsyspath` as returning (surrogate encoded) unicode. The default `getnativepath` could just return the same value as `getsyspath`

Do you want this at all? I think this is no longer needed, as the doc explains that `getsyspath` is always unicode and that some use cases may need fsencoding to be usable.
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@willmcgugan BTW, coveralls complains about bits that can be tested only when on macOS or Windows. #122 is needed to deal with this... but then coveralls was crass at merging reports from multiple runs last I checked, while codecov was OK there.
Sounds like this is decently passing the tests now and is ready for your review.
As a fun side note: with this patch you could then claim that you are the only path abstraction library that works on all OS and FS encodings on Py2 and Py3. There are none anywhere else. I searched hard.
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Instead I added doc to explain how fsencode can be used if needed. Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
* Avoid code duplication with a new _get_validated_syspath() method
* Remove as_bytes arg from getsyspath PyFilesystem#120
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@willmcgugan With the latest commits, there is no change in API needed at all, and the internal use of fsencoded bytes is limited to *nix and Py2.
Thanks. Hope you don't mind my pickiness.
setup.py
Outdated
@@ -37,7 +37,8 @@
     install_requires=REQUIREMENTS,
     extras_require={
         "scandir :python_version < '3.5'": ['scandir~=1.5'],
-        ":python_version < '3.4'": ['enum34~=1.1.6']
+        ":python_version < '3.4'": ['enum34~=1.1.6'],
+        ":python_version < '3.0'": ['backports.os'],
I think `fsencode` was introduced in Python 3.2. Could you pin the `backports.os` module?
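For context, a dependency like this is typically consumed through a guarded import. The sketch below is an assumption about how that might look, not the PR's code; it assumes `backports.os` exposes `fsencode` and `fsdecode` under those names, and the file name is made up.

```python
try:
    # Python 3.2+ ships these in the standard library.
    from os import fsencode, fsdecode
except ImportError:
    # Python 2: fall back to the backport (needs the backports.os package).
    from backports.os import fsencode, fsdecode

encoded = fsencode(u"data.txt")  # unicode -> bytes
decoded = fsdecode(encoded)      # bytes -> unicode
```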
Fixed
requirements.txt
Outdated
@@ -3,3 +3,4 @@ enum34==1.1.6 ; python_version < '3.4'
 pytz
 setuptools
 six==1.10.0
+backports.os
Could you pin this, and add the Python version syntax...
Fixed
fs/osfs.py
Outdated
         return names

     def makedir(self, path, permissions=None, recreate=False):
         self.check()
         mode = Permissions.get_mode(permissions)
+        path = path and fsdecode(path) or path
There's a convention we use that you don't modify a `path` variable once it's passed in. The reason is that when you raise an error containing the path, the error message will contain the same path that the developer called the method with.
Could you replace this pattern with `path = fsdecode(path) if path else path`? It's more familiar for developers who came to Python recently.
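A small sketch of the two points in this comment, under stated assumptions: the function, its `existing` parameter, and the error class are hypothetical and only stand in for the real `makedir` and the library's own errors. The converted value goes into a separate `_path` name, and the conditional expression replaces the `and`/`or` idiom.

```python
import os


class DirectoryExistsSketch(Exception):
    """Hypothetical error type, standing in for the library's real error."""


def makedir_sketch(path, existing=()):
    # Leave `path` exactly as the caller passed it; derive a private name.
    _path = os.fsdecode(path) if path else path
    if _path in existing:
        # The message quotes the caller's original argument, not `_path`.
        raise DirectoryExistsSketch("directory exists: {!r}".format(path))
    return _path
```

For example, calling `makedir_sketch(b"foo", existing=("foo",))` raises an error whose message shows `b'foo'`, the value the caller actually supplied.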
I can use a _path instead alright. But the and/or is more pythonic than if/else in my book. I can admit though that it may not be obvious ;)
Fixed
* This was mistakenly left over
* Remove as_bytes arg from getsyspath PyFilesystem#120
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Following @willmcgugan in PyFilesystem#121 this is:
- removing and/or shortcuts and
- does not override path arg variables
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
I can be much more picky than this ;)
@willmcgugan The latest should have all your comments/feedback applied.
@willmcgugan Something went south on the Py3 tests. Let me review this.
* I had somehow introduced a regression with the previous commit
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
I cleaned up the regression I had introduced. We should be all good now.
Unless you have some strong objections left about this PR, I will build a local wheel until you have made a release and start using this in ScanCode. A pypi release soon enough would be welcomed!
Hold on merging this. There are plenty of issues left to test and fix on *nix with non-decodable paths. I am playing with beautiful crashes so far.
@pombredanne I think I can live without the No rush, will be AFK for a few days over xmas.
@pombredanne Did you resolve the crashes? Is it ready for review?
The code has moved on a bit. And I've attempted to tackle this issue in #167
Closing in favour of #167. Thanks for taking the lead on this one @pombredanne
This is a fix for #120
Signed-off-by: Philippe Ombredanne pombredanne@nexb.com