Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

isfile error on too long file names #39774

Open
GiggleLiu opened this issue Feb 21, 2021 · 27 comments
Open

isfile error on too long file names #39774

GiggleLiu opened this issue Feb 21, 2021 · 27 comments
Assignees
Labels
domain:docs This change adds or pertains to documentation domain:filesystem Underlying file system and functions that use it

Comments

@GiggleLiu
Copy link
Contributor

The maximum filename length is 255, isfile should return false rather than raising an error when the file name is too long.

julia> isfile(randstring(256))
ERROR: IOError: stat("yDdEhZcxL943psM9mDEXZJhZwXH0cnrNCR8Ps1MrdDB8I8EDaN1WDyL40PQlcEra20zt1CyDOEv4cR7BlsIyJr0hBhoFl1kP2kQBJvTcdsTjVUx65g2kgCfER3adn3xQkIVA80mYfXvQBNBxNg9rLfsOh6TjGXApOLVRBBRBWsGc6dYwFRWZgDyRhkjNBecthZ7Pgg0EQFnLPRrlXpMdBaPoV0fDF8iNfhguOdfGpwjAD7U8ABJSW2fFUeHmUoMr"): name too long (ENAMETOOLONG)
Stacktrace:
 [1] uv_error
   @ ./libuv.jl:97 [inlined]
 [2] stat(path::String)
   @ Base.Filesystem ./stat.jl:69
 [3] isfile(path::String)
   @ Base.Filesystem ./stat.jl:311
 [4] top-level scope
   @ REPL[20]:1

The same applies for isdir

relevant code:

@eval ($f)(path...) = ($f)(stat(path...))

Julia version: 1.6.0-rc1

@kimikage
Copy link
Contributor

The maximum filename length is 255

It's actually system dependent. I believe that silently returning false can cause inexplicable bugs.

@mgkuhn
Copy link
Contributor

mgkuhn commented Feb 26, 2021

I suspect this issue isn't really about the maximum filename length on a particular file system, but about the question whether/when isfile (and friends isdir, islink, etc.) should raise an exception in Julia.

The POSIX stat system call can fail for various reasons, including EACCES, EIO, ELOOP, ENAMETOOLONG, ENOENT, ENOTDIR.

The docstrings of the Julia functions isfile and stat are currently completely silent on what result which of these errors lead to.

This may be a problem because the equivalent functions in other languages do not raise an exception, and are in fact frequently used to check if a file exists.

For comparison: the Perl/Bash equivalent -f does not raise an exception if called with a gigabyte long filename, i.e. both ENOENT and ENAMETOOLONG result in false:

$ perl -E 'say -f ("a" x 1e9)'

In Julia, ENOENT results in false and ENAMETOOLONG in IOError: stat: name too long (ENAMETOOLONG). At a minimum, this should be documented. I can see arguments that more errnos than just ENOENT should cause isfile to return false.

@mgkuhn mgkuhn added domain:docs This change adds or pertains to documentation domain:filesystem Underlying file system and functions that use it labels Feb 26, 2021
@GiggleLiu
Copy link
Contributor Author

GiggleLiu commented Feb 26, 2021

I tend to agree with @mgkuhn . I recall most of my using cases are about using isfile to avoid exceptions. It is not expected to return an error just because the name is too long. Just to add one API to fix: ispath .

Also, another important factor is: is there an easy way to check if the filename is too long?

@mgkuhn
Copy link
Contributor

mgkuhn commented Feb 26, 2021

Looking at the source of stat, this just calls jl_stat which calls uv_fs_stat, none of which say anything about exceptions that they can raise.

@mgkuhn
Copy link
Contributor

mgkuhn commented Feb 26, 2021

POSIX offers the function pathconf to query path-specific limits on the maximum filename and path length permitted by a file system.

Libuv has uv__fs_pathmax_size() to query _PC_PATH_MAX that way, but apparently no equivalent yet for _PC_NAME_MAX. Since Julia tends to do all file I/O through libuv, I think libuv would have to be extended first before equivalent functions could be added to Julia. Or is there some package that just imports all POSIX functions (for users interested in such more low-level queries and less concerned about Win32 portability)?

I don't know a Win32 function to query maximum filename length, and limits there tend to be be measured in UTF-16 words, and not in UTF-8 bytes. So "255" can mean different things!

@GiggleLiu
Copy link
Contributor Author

GiggleLiu commented Feb 26, 2021

I just tried to attack Pluto server (locally) by inputing a too long filename in the path, Pluto is lucky and this error does not break down the server. Not sure every web application can be so lucky under such an attack. 😏

If you input a short filename, looks reasonable

image

If you input a too long filename, it might confuse Pluto a bit

image

Too many people are assuming isfile not raising an error.

@mgkuhn
Copy link
Contributor

mgkuhn commented Feb 26, 2021

is there an easy way to check if the filename is too long?

Not sure if that qualifies as “easy”:

istoolong(filename) = try
    stat(filename)
    false
catch e
    if isa(e, Base.IOError)
        e.code == -36
    else
        rethrow()
    end
end

istoolong("a"^800)

@GiggleLiu
Copy link
Contributor Author

is there an easy way to check if the filename is too long?

Not sure if that qualifies as “easy”:

istoolong(filename) = try
    stat(filename)
    false
catch e
    if isa(e, Base.IOError)
        e.code == -36
    else
        rethrow()
    end
end

istoolong("a"^800)

This is good. I just want to know if people want to have a safe_isfile, how can they implement one. Thanks

@mgkuhn
Copy link
Contributor

mgkuhn commented Feb 26, 2021

So the question is: in base/stat.jl should the functions

ispath
isfifo
ischardev
isdir
isblockdev
isfile
issocket
issetuid
issetgid
issticky
islink

catch exceptions that stat/lstat can raise and return false instead?

If yes: all exceptions that stat/lstat can raise, or only some? Which?

Which exceptions can stat/lstat raise at present?

@nstiurca
Copy link
Contributor

nstiurca commented Jul 7, 2023

Bump.

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Jul 8, 2023

EDIT: Below is trivia of MAX_PATH, an issue I want fixed, and then below on NAME_MAX, that I can confirm is 255, practically speaking.

https://insanecoding.blogspot.com/2007/11/pathmax-simply-isnt.html

The flaw is that PATH_MAX simply isn't. [..]
Since a path can be longer than PATH_MAX, the define is useless, writing code based off of it is wrong, and the functions that require it are broken.

An exception to this is Windows. It doesn't allow any paths to be created larger than 260 characters. [That's also wrong.]

It sounds strange that such a small limit was chosen, considering that FAT has no such limit imposed, and NTFS allows paths to be 32768 characters long.

Can't we assume at least 32768-long paths allowed on all filesystems/operating systems we want Julia to support?

#ifndef JL_PATH_MAX // many platforms don't have a max path, we define one anyways
#define JL_PATH_MAX 1024

https://www.gnu.org/software/libc/manual/html_node/Limits-for-Files.html

Macro: int PATH_MAX
The uniform system limit (if any) for the length of an entire file name (that is, the argument given to system calls such as open), including the terminating null character.

Portability Note: The GNU C Library does not enforce this limit even if PATH_MAX is defined.

EDIT: I misread the issue here about path length, something I very much want to fix to not be limited to 256/260, 1024 or whatever, which seem possible even on Windows (without the registry hack), but to stay on topic:

The maximum filename length is 255

Seemingly never longer, but sometimes shorter (all exceptions we can Ignore, i.e. 14 in System V, and 8.3 in FAT, i.e. VFAT ok?):

https://www.baeldung.com/linux/bash-filename-limit

BTRFS | 255 bytes
exFAT | 255 UTF-16 characters
ext2 | 255 bytes
[..]
FAT32 | 8.3 (255 UCS-2 code units with VFAT LFNs)
NTFS | 255 characters

And:

https://discussions.apple.com/thread/555198

Under OS 9 a maximum file name length was 32 characters while under OS X it is 255 characters. However, some applications don't allow file names in the save dialogs that are longer than 32 characters. That is a problem with those applications, not the system.

and:

For example, {NAME_MAX} frequently changes between System V and BSD-based file systems; System V uses a maximum of 14, BSD 255

https://stackoverflow.com/questions/69385044/is-a-posix-name-max-value-of-only-14-as-expected-in-mac-os

@elextr
Copy link

elextr commented Jul 9, 2023

The obligatory Wikipedia summary post https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits

@nstiurca
Copy link
Contributor

nstiurca commented Jul 9, 2023

I recently ran into an issue where I got hit with ENAMETOOLONG for a path that is 180 characters. I'm on Linux Mint, btrfs filesystem, which should support <255 filename length just fine. My string did contain special characters, and it wasn't an absolute path. So I'm guessing somewhere or other, my string ballooned to >255 characters? Point is, simple length checks seems brittle. Why not just rely on the error code, and handling it in a way that doesn't surprise users?

@elextr
Copy link

elextr commented Jul 10, 2023

To my thinking the path/segment length issue is moot.

isfile() simply answers the question, "does this path resolve to an existing plain file". If the path is illegal for the specific filesystem then clearly the file cannot exist so it should return false.

The reason for the failure is not relevant to the purpose of isfile(), and the fact that it uses a library call that throws an IOError is an implementation detail that should not be allowed to leak.

Instead of trying to test if the path is legal which is "complicated", just try to stat it and swallow all errors from the filesystem and return false.

As noted above a number of other similar functions probably should be made consistent.

@mgkuhn mgkuhn self-assigned this Jul 10, 2023
@mgkuhn
Copy link
Contributor

mgkuhn commented Jul 10, 2023

I'll have a look. There seem to be additional problems with error handling in stat-related APIs, e.g.

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, ivybridge)
  Threads: 1 on 8 virtual cores

julia> stat("/nonexistantfile")
StatStruct for "/nonexistantfile"
   size: 0 bytes
 device: 0
  inode: 0
   mode: 0o000000 (----------)
  nlink: 0
    uid: 0 (root)
    gid: 0 (root)
   rdev: 0
  blksz: 0
 blocks: 0
  mtime: 1970-01-01T01:00:00+0100 (19548 days ago)
  ctime: 1970-01-01T01:00:00+0100 (19548 days ago)

doesn't look like a robust error-handling mechanism either.

Shouldn't stat instead return nothing if the requested file does not exist? (Changing that would not seem to be a breaking change, as the current docstring says nothing about error behaviour.)

@nstiurca
Copy link
Contributor

nstiurca commented Jul 12, 2023

Shouldn't stat instead return nothing if the requested file does not exist? (Changing that would not seem to be a breaking change, as the current docstring says nothing about error behaviour.)

Wow, that's very surprising. Probably worth raising as a separate bug.

@nstiurca
Copy link
Contributor

To my thinking the path/segment length issue is moot.

isfile() simply answers the question, "does this path resolve to an existing plain file". If the path is illegal for the specific filesystem then clearly the file cannot exist so it should return false.

The reason for the failure is not relevant to the purpose of isfile(), and the fact that it uses a library call that throws an IOError is an implementation detail that should not be allowed to leak.

Instead of trying to test if the path is legal which is "complicated", just try to stat it and swallow all errors from the filesystem and return false.

As noted above a number of other similar functions probably should be made consistent.

I mostly agree, except for the case of permission issues. Given a folder /foo which I cannot access, and a file therein called bar, I cannot stat /foo/bar even though it is in fact a file. So really the return type should be Boolean|Missing

@elextr
Copy link

elextr commented Jul 12, 2023

Given a folder /foo which I cannot access, and a file therein called bar, I cannot stat /foo/bar even though it is in fact a file.

The permissions you describe are set so you are not supposed to be able to see the file. It would be a security leak if isfile() indicated the existence (by returning anything other than false) when you are not permitted to see the file.

@nstiurca
Copy link
Contributor

I understand the security implications, but I don't think returning false from Julia's isfile() really helps. For one thing, the user can stat the file directly to get the error code in question without worrying about what Julia does with it. For another, the error code only says that some directory in the path of the given string does not give the user execute permissions. The user already knows that directory exists (because it was able to list its parent). Ie, If the path in question is /foo/bar/biz, the user can list the foo directory, then the user will know that bar exists. But if the bar directory does not grant the user permissions to list it, the user gains no info about whether biz file exists; the error just indicates that some directory is not accessible by the user. So basically missing is the correct return value, because Julia is unable to confirm or deny the existence of the file, only that it looks like a path as far as it can tell.

@elextr
Copy link

elextr commented Jul 15, 2023

True, nothing that uses stat() can be more secure than stat() is, so yes bar must not be searchable by the user, and stat() will return EACCESS.

What about all the other reasons that the file cannot be resolved?

On Linux stat() (Windows error codes are different IIUC) EACCESS , ELOOP, ENAMETOOLONG, ENOENT, ENOTDIR all relate to all components of the pathname, but Linux stat() gives no information which component, isfile() would have to walk the path to find which, so the status of the file is indeterminate unless that is done, so IIUC your argument all stat() errors should return missing. But in all cases as far as the user is concerned the path does not resolve to a file, so isfile() can also legitimately return false.

As you say, if the user wants more information they can use stat directly, so the only reason for isfile() existing is to just to make simple usage simple, eg:

if isfile(path)
   do_something(path) # ignoring potential race, its "simple" after all
end

But returning missing will mean that simple usage may now throw TypeError from the if IIUC.

@nstiurca
Copy link
Contributor

nstiurca commented Jul 17, 2023

your argument all stat() errors should return missing.

Not what I was saying. I only meant that stat returning an empty result for files that don't exist seems like a bug. Sorry for being unclear.

As for isfile and friends, I was only suggesting EACCESS map to missing. ELOOP, ENAMETOOLONG, ENOENT, and ENOTDIR all seem like they would map to false IMO.

The other one I'm unsure about is EIO:

An error occurred while reading from the file system.

It's a bit unclear what that means, but it sounds like it could be raised even if the path does in fact refer to a file. Eg, if it's on a network mount that's not responding, it might or might not be a file and we just don't know.

As for your point about ending up with a TypeError... yeah that's not great. I guess the three options are: return missing, return false, or throw an exception (eg, rethrow the IOError). I imagine most of the time users would want to just get a false back from the function, but it could cause confusion in some cases.

@elextr
Copy link

elextr commented Jul 18, 2023

My question then is why the special pleading for EACCESS? What is the general use-case for it for simplified functions like isfile()?

@nstiurca
Copy link
Contributor

nstiurca commented Jul 18, 2023

If you return false for an EACCESS issue, it is inaccurate and potentially misleading. Consider:

daemon_pid_fname = "..."
isfile(daemon_pid_fname) || spawn_daemon() |> persist_pid
...

If we mask the EACCESS issue in isfile(), then we won't notice it until we try to persist the PID of a new daemon (which would presumably throw some IOError during some open() call inside persist_pid()), then in this case we'd end up with an extra daemon.

@elextr
Copy link

elextr commented Jul 19, 2023

But why is EACCESS special, any error from the stat() inside isfile() could cause the same problem, a dangling link, a name too long (the OP), and intermittent NFS error etc.

And as my comment in the example alluded to, there is a race in your example, another process (or thread) could create the daemon between the test and the spawn. Simple isfile() isn't the way to handle that, it should be performed properly using locking.

IMHO complicating the simple to incorrectly handle the complex is the wrong thing to do.

@nstiurca
Copy link
Contributor

Yeah, ironically my example had misleading characteristics. The concurrency/race condition is really not the point though. The point is misleading the caller that the file does not exist, acting on that false knowledge, but the file did actually exist--we just couldn't access it. If we were discussing is_file_which_does_not_throw_errors() then sure all IOErrors should be treated the same.

The point remains that if something like ENAMETOOLONG is thrown, we know for sure that the input argument cannot possibly be a file; but if we get EIO or EACCESS then we really don't know. Making assumptions or otherwise pretending to know could be dangerous.

So to recap the options:

  1. Keep things as is. Those who continue to get tripped up from time to time will hopefully find this thread and in particular @mgkuhn 's workaround above.
  2. Return false if and only if we can be sure that the argument does not refer to a file.
  3. Return false if we have direct evidence it's not a file, or we are not sure.
  4. Have multiple versions of the functions, one version without any error handling like the current version, and another version which will do the Right Thing^TM in the vast majority of cases for the vast majority if users. Have good documentation on these subtleties.

I'm in favor of options 2 (preferred) or 4, but mostly I'm just very against option 1.

@elextr
Copy link

elextr commented Jul 19, 2023

isfile(path) is one of a number of predicates which are convenience queries on the result of a stat() call. They are all defined as "Return true if ..., false otherwise".

Since all they do is test the return of stat() their only reason to exist is to be convenient. They are all defined as:

($f)(path...)  = ($f)(stat(path...))

The OP was that isfile() unexpectedly threw on name too long, and its clear that the other convenience queries on stat() will do the same given their equivalent definitions.

The OP commented "I recall most of my using cases are about using isfile to avoid exceptions.", a reasonable expectation since no exceptions are documented for stat() or the queries, and I suspect most uses of the queries is to avoid the complications of stat() in simple cases. Since all the queries are racy in complex cases, they should only be used in simple use-cases anyway. For simple uses avoiding exceptions is reasonable.

@mgkuhn to answer your question "If yes: all exceptions that stat/lstat can raise, or only some? Which?"

AFAICT UV error codes are converted to exceptions here for any code other than ENOENT, EINVAL, and ENOTDIR. It will also throw if it detects an inconsistency in the stat() return, success but no type.

Those conditions seem arbitrary IMO, eg why does EINVAL not throw, but ENAMETOOLONG does, both seem to me to be input parameter errors, and both ENOTDIR and ENAMETOOLONG both seem to be path errors but ENOTDIR does not throw. Since these tests are explicit perhaps someone can comment on why it is that way.

Perhaps a @stat_nothrow_call is needed that does not convert UV error codes to exceptions so predicates like isfile() can avoid throwing and return false when they cannot confirm the truth of their query. Rather than changing the basic stat(), which should however be documented as to what conditions can throw and what the reasoning for that is.

@nstiurca ok, understand your point, but isfile() and the other queries are defined as "Return true if (path is a regular file), false otherwise." so you are requesting a feature change (or new functions) that return true if the test is definitely successful, false if definitely not successful, and missing if not sure, is a different question to that raised by the OP and probably should be a separate feature request.

Also returning missing reintroduces the risk of TypeError throws from if when many use-cases of the current predicates are to avoid throwing.

@nstiurca
Copy link
Contributor

@elextr Soooo... you support option 3 from my last comment as your preferred resolution? It's better than before, particularly if well documented. I can live with that.

As for "feature change"... we're dealing with undefined behavior (the definition doesn't mention exceptions or cases where the call is uncertain. As such, fleshing out the definition isn't technically a breaking change even if it's a change. Point being we can get this change into the next point release if we can get to some consensus on what's the right definition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:docs This change adds or pertains to documentation domain:filesystem Underlying file system and functions that use it
Projects
None yet
Development

No branches or pull requests

6 participants