
Compare performance vs std::fs::read_dir #120

Open
joshtriplett opened this issue Apr 28, 2019 · 19 comments

@joshtriplett

I recently wrote a short test program using walkdir to count 6M files in a single directory. I then wrote the same test program using std::fs::read_dir, and found a surprisingly large performance difference.

To create the test data:

mkdir testdir
(cd testdir ; seq 6000000 | xargs touch)

Using walkdir:

fn main() {
    let dir = &std::env::args().nth(1).expect("Usage: countdir dirname");
    println!("{}: {} files", dir, walkdir::WalkDir::new(dir).min_depth(1).max_depth(1).into_iter().count());
}

Performance:

$ time target/release/countdir testdir
testdir: 6000000 files

real	0m3.476s
user	0m1.308s
sys	0m2.168s

Using std::fs::read_dir:

fn main() {
    let dir = &std::env::args().nth(1).expect("Usage: countdir dirname");
    println!("{}: {} files", dir, std::fs::read_dir(dir).expect("read_dir").count());
}
$ time target/release/countdir testdir
testdir: 6000000 files

real	0m2.716s
user	0m0.580s
sys	0m2.136s

Note that the two programs use almost the same system time, and strace shows them making almost exactly the same syscalls other than an extra lstat (which makes sense, as walkdir uses std::fs::read_dir underneath). However, walkdir uses a lot more user time.

This seems worth investigating, to try to reduce the overhead.

@BurntSushi
Owner

BurntSushi commented Apr 28, 2019

I can reproduce this:

$ hyperfine './target/release/walkdir-120 /tmp/large-testdir/'   # std::fs::read_dir version
Benchmark #1: ./target/release/walkdir-120 /tmp/large-testdir/
  Time (mean ± σ):      1.843 s ±  0.050 s    [User: 499.9 ms, System: 1341.8 ms]
  Range (min … max):    1.764 s …  1.938 s    10 runs

$ hyperfine './target/release/walkdir-120 /tmp/large-testdir/'   # walkdir version
Benchmark #1: ./target/release/walkdir-120 /tmp/large-testdir/
  Time (mean ± σ):      2.398 s ±  0.053 s    [User: 1.024 s, System: 1.372 s]
  Range (min … max):    2.306 s …  2.463 s    10 runs

walkdir does have some overhead. You can see all the different branching going
on, for example, in the following routines:

  • Pushing a new entry onto the stack: fn push(&mut self, dent: &DirEntry) -> Result<()>
  • Popping an entry: fn pop(&mut self)
  • Handling an entry: fn handle_entry(
  • The top-level iterator entry point: fn next(&mut self) -> Option<Result<DirEntry>>
  • Creating a walkdir::DirEntry from a std::fs::DirEntry:

    walkdir/src/lib.rs, lines 1170 to 1184 in 705d06c:

    #[cfg(unix)]
    fn from_entry(depth: usize, ent: &fs::DirEntry) -> Result<DirEntry> {
        use std::os::unix::fs::DirEntryExt;
        let ty = ent.file_type().map_err(|err| {
            Error::from_path(depth, ent.path(), err)
        })?;
        Ok(DirEntry {
            path: ent.path(),
            ty: ty,
            follow_link: false,
            depth: depth,
            ino: ent.ino(),
        })
    }

Notably, that last routine has a subtle but important performance impact in
workloads like this. Namely, it calls std::fs::DirEntry::path, which allocates
a new PathBuf by joining the path given to std::fs::read_dir with the file
name returned by the call to readdir. That's an allocation that cannot really
be avoided when building a recursive directory iterator on top of the standard
library APIs. Indeed, if we change your program to do just a little bit more
work by looking at the full path:

fn main() {
    let dir = &std::env::args().nth(1).expect("Usage: countdir dirname");
    let count = std::fs::read_dir(dir)
        .expect("read_dir")
        .map(|dent| dent.unwrap().path())
        .count();
    println!("{}: {} files", dir, count);
}

Then the performance difference almost evaporates:

$ hyperfine './target/release/walkdir-120 /tmp/large-testdir/'
Benchmark #1: ./target/release/walkdir-120 /tmp/large-testdir/
  Time (mean ± σ):      2.198 s ±  0.086 s    [User: 845.2 ms, System: 1350.8 ms]
  Range (min … max):    2.081 s …  2.309 s    10 runs

This is close enough that the rest of it might just be a result of the
additional branching and extra function calls that come from walkdir
supporting a lot more features than std::fs::read_dir.

It would in theory be possible to specialize for the particular use case of
reading a single directory with no recursion, but the PathBuf allocation
would still need to be made, since walkdir's DirEntry::path method returns a
&Path, which means each DirEntry has to own its full path. (In the common
case this saves walkdir's users additional allocations, and walkdir's
reliance on the standard library for directory traversal requires the
allocation anyway.)


With all that said, I've been thinking about adapting some of the ideas from
this post to speed up directory traversal on Linux:
https://github.com/romkatv/gitstatus/blob/master/docs/listdir.md
As a bonus, I think it would also fix #23 for dealing with long file paths
(by simply not dealing with file paths at all). The problem with this approach
is that walkdir would need to roll its own directory traversal handling on at
least Linux (probably most Unixes, actually), which is kind of a bummer. But
it's not that much work. This would also likely motivate some API changes,
e.g., an alternative API for completely amortizing any and all allocation
related to any particular directory entry.

So I'll leave this issue open to track that work; I don't think this is
likely to be fixed via some other avenue.

@BurntSushi
Owner

I've made some progress on attempting to fix this (and other bugs, such as #23), but I've kind of burned myself out on it. In particular, it essentially requires re-rolling all of the platform specific directory traversal APIs (completely avoiding std's fs::read_dir). Moreover, it also looks like I'll need to re-roll all of the Metadata crap that's in std too, if I want to be able to use the faster fstatat (which takes a parent file descriptor and a child file name) instead of the slower stat (which takes a full file path). Re-rolling all of this is in general required for amortizing allocation, which is simply not possible using std's APIs. From initial experiments, it seems like this could get us a 20-30% speed up, which is nothing to sneeze at. (In addition to fixing long file path bugs.)

My WIP is here, with the meat of it in the os submodule.

I'll get back around to this eventually, but I need to work on something else first. :-)

@rabite0

rabite0 commented Feb 17, 2020

This is absolutely fantastic!
In my quest for fast loading of directories I realized that the stdlib has too much overhead, and even reaching for readdir and friends is far from optimal, since AFAIK glibc uses a very small buffer and fetches very few dirents per system call.
So I ended up rolling my own using getdents64.

These are not the interfaces you are interested in.

So they say, but when loading a directory with 1M files it ends up being ~30% faster (950 ms -> 600 ms). I'd say that's absolutely what I'm interested in ;).

The API in your WIP code is exactly what I was looking for. It exposes the raw system call data without any extra allocation or expensive conversions and verification as far as I can see. I'm only worried about this part:

    /// Create a new cursor with the specified capacity. The capacity given
    /// should be in bytes, and must be a multiple of the alignment of a raw
    /// directory entry.
    fn with_capacity(capacity: usize) -> DirEntryCursor {
        // TODO: It would be nice to expose a way to control the capacity to
        // the caller, but we'd really like the capacity to be a multiple of
        // the alignment. (Technically, the only restriction is that
        // the capacity and the alignment have a least common multiple that
        // doesn't overflow `usize::MAX`. But requiring the size to be a
        // multiple of alignment just seems like good sense in this case.)

I'm a bit worried about the fact that the buffer size is not part of the public API. When testing on my system (running hunter), it's actually faster with a 4 KB buffer than with a 16 KB buffer (if my calculation is correct). With the 16 KB buffer it's just a bit faster than using readdir. This is mostly because I copy the buffer and process the dirents in another thread, while the main thread does another blocking call to getdents64. That doesn't work if the buffer is so large that all dirents fit into it and only one system call is made.
Of course, this depends heavily on the size of the file names and what kind of processing is done with the dirents, but for optimal performance I think it's necessary to expose this. I don't think it's weird at all; if you're dealing with this low-level stuff for performance, fine-tuning parameters like that is just what I'm looking for.

I think exposing the capacity in terms of a "maximum, but not guaranteed, number of dirents" makes the most sense, since that's what the syscall does, too. As you wrote in the comment, the actual size of the dirents is unpredictable, and the best you can do is "guess" the average length.
An alternative approach could be not to adjust the capacity of the buffer, but to limit the number of bytes/dirents read per syscall. That way the buffer could be filled syscall by syscall, and the caller could do something else with a copy of the buffer while getdents64 is blocked on IO. Maybe just a

fn read_dirents(&mut self, max: usize)

or something like that.

@BurntSushi
Owner

BurntSushi commented Feb 17, 2020

@rabite0 Thanks for the thoughts. But yeah, that detail is probably what I am least worried about. I actually managed to burn myself out on this for a second time a few months ago. Getting the API correct in a way that both minimizes mistakes and provides the best performance possible is extremely challenging. Certainly, I have a new appreciation for why std puts an Arc<DirFd> inside of every DirEntry.

Most/all of the low level stuff is done, including almost completely re-rolling std::fs::Metadata, which was quite annoying. Right now, I'm stuck trying to figure out how to design the higher level APIs. The main problem I'm trying to deal with is how to harmonize "set a limit on the number of open file descriptors" with "make stat calls efficient by using fstatat" and "avoid allocation/copying as much as possible." Once you throw the "oh yeah, this should all be cross platform" wrench in there, you wind up with crap like this.

Overall, I've reached a point where I am not only writing a significant amount of new code, but much of that new code involves unsafe and hairy platform specific details. In the grand scheme of things, this is all in the service of a modest performance boost in fairly extreme cases. That has really soured the idea for me, and I actually haven't made up my mind on whether I'm going to follow through with it. I still plan on "finishing" it, but I don't know whether I'll scrap it all or not.

@rabite0

rabite0 commented Feb 17, 2020

I can imagine that writing a cross platform API that's fast, correct, and easy to use, especially with Windows in the mix, is quite the effort. From what I can see in your code, even on *nix systems it's more of a mess than I'd have thought, and it's a mess that has to be abstracted away somehow; just researching and testing all that crap sounds like quite the horror. It's true that this herculean task, just to see the numbers go down under ridiculous conditions, might be seen as a vanity project of sorts. I mean, millions of files in a directory is not a realistic scenario, other than creating it just to test performance. On the other hand it's a bit like chasing the dragon: seeing the numbers go down is quite the dopamine hit.

I wonder if it would make sense to release the low-level APIs as a separate crate? I think that would be quite useful by itself and if it's mostly done you'd have some low hanging fruit to push out.

This might be cheesy, but thank you for your hard work, you're a legend, don't stress out over this too much. I can understand if you don't think it's worth all the hours you have to put in. There's only so much time in a day. But the dragon is quite attractive indeed. :)

@BurntSushi
Owner

On the other hand it's a bit like chasing the dragon, seeing the numbers go down is quite the dopamine hit.

Indeed. That's pretty much what drives me forward. :-)

I wonder if it would make sense to release the low-level APIs as a separate crate? I think that would be quite useful by itself and if it's mostly done you'd have some low hanging fruit to push out.

My thinking was to release it as part of the walkdir crate in an os submodule. That way, all the guts are there for folks who want them. I could put this out as a separate crate, but 1) I'm trying really hard not to let crates sprawl out, especially for core ecosystem crates like walkdir, and 2) I would be hesitant to do this until I "proved" out the API by finishing the higher level abstraction.

Certainly I wouldn't stop someone else from taking my work so far and moving forward with it. I have no idea when I'll finish it. :-)

This might be cheesy, but thank you for your hard work, you're a legend, don't stress out over this too much. I can understand if you don't think it's worth all the hours you have to put in. There's only so much time in a day. But the dragon is quite attractive indeed. :)

Thanks. :-) <3

@rabite0

rabite0 commented Feb 18, 2020

Certainly I wouldn't stop someone else from taking my work so far and moving forward with it. I have no idea when I'll finish it. :-)

Sounds like a plan. I already "adopted" a few crates along the way, and I was actually thinking about maybe just stealing your code for the Linux getdents implementation as-is (it looks really great) instead of writing my own. I'll probably get into it when I start adding high speed code paths for BSD and other OSes.
Might as well release it as a crate at that point (not "official", and with proper attribution, of course). There ought to be a nerd who finds it useful ;)

@tbillington

Hope I'm not being too naive with this, but is io_uring on Linux a consideration? I've seen some benchmarks that are hard to ignore. No idea how well its model fits in, though; I haven't used it.

@BurntSushi
Owner

I don't really know much about io_uring, sorry. I thought it was for async stuff? My guess is that it requires a new kernel anyway, so even if it made sense, walkdir would still need to provide a working implementation for older kernels.

@tbillington

Yes, true, it requires Linux 5 at least, I think, and some features require even more recent versions. So yes, you'd need a second impl for older kernels.

Walkdir operates in a predictable way, so I assume the async aspect of io_uring could be leveraged to "queue up" subsequent iteration entries while the user's application logic is running. That would add quite a bit of complexity though, and I have no idea if it's worth it :)

End rambling

@BurntSushi
Owner

@tbillington That can already be done with threads. I don't think async is necessary for that. I suspect that's a large reason why the parallel traverser in the ignore crate is faster than the single threaded one here in walkdir.

@Ekleog

Ekleog commented May 11, 2020

Seeing as what appears to be blocking this (as well as #23) is the need to re-write everything without using libstd, which requires quite a lot of work, have you considered using the openat crate, which already implements all these APIs?

Though, it's totally possible that I just misunderstand #120 (comment) and think your issue is actually something else :)

@BurntSushi
Owner

I don't see how that crate helps, sorry. It doesn't really do any of the hard parts. There is some code overlap, but it's pretty small and uninteresting. Basically, all openat could do is help with the walkdir::os::unix module. As I said in my comment above, all the low level stuff is done.

If you want to talk more about this, I would encourage you to review the code on my branch: https://github.com/BurntSushi/walkdir/tree/ag/sys/src

@joshtriplett
Author

The main problem I'm trying to deal with is how to harmonize "set a limit on the number of open file descriptors" with "make stat calls efficient by using fstatat" and "avoid allocation/copying as much as possible."

I don't know if this helps at all with complexity, but personally, I would have no objections whatsoever if "limit number of open file descriptors" was simply incompatible with the fastest traversal algorithm. Would eliminating that requirement substantially simplify this algorithm?

@BurntSushi
Owner

Marginally, maybe. But it should be supported somewhere.

@martinetd

(Quite off-topic, but since it was brought up in this thread: io_uring is for async, yes, but if you're thinking of threads it would already be worth using here; the point is that the kernel manages some background threads for you, and your single main thread avoids the syscalls, so less overhead = faster IOs. (It's generally recommended to have one ring per thread, if you really want multiple threads on top of that.)
But there's just one little problem: io_uring does not have a getdents (or equivalent) call. It looks like someone tried, and that appeared to help according to a cover letter (24 -> 10 mins with a cold cache, or 5 -> 3 mins cached); but regardless of whether that was a fair comparison, the latest version got shot down pretty badly by Al Viro, for practical reasons justifying that we can't really just read from any directory offset, and this all died out, because uring pretty much only makes sense with pread-like interfaces.)

Anyway, I came here for openat and... wow, cross platform is hard. I have a few opinions on an interface I'd like to use myself (exposing directory FDs and just the last component's d_name in a callback, plus per-directory user state if callers want to build a full path up to there themselves), but that's not something that would be practical for general use anyway, and after reading most of this I don't think it's worth bothering.
I might just give an ultra-specific version a shot for a time comparison, but I'll probably be just as well off with the current version...
Thanks for all the efforts @BurntSushi and everyone else!

@Sewer56

Sewer56 commented Feb 7, 2024

I don't see how that crate helps, sorry. It doesn't really do any of the hard parts. There is some code overlap, but it's pretty small and uninteresting. Basically, all openat could do is help with the walkdir::os::unix module. As I said in my comment above, all the low level stuff is done.

If you want to talk more about this, I would encourage you to review the code on my branch: https://github.com/BurntSushi/walkdir/tree/ag/sys/src

I had a quick look here because I happened to stumble upon this issue while looking for a performant way to iterate directories.

Anyway, one thing does come to mind.

Is there any reason you can't use NtQueryDirectoryFile for the Windows implementation? Kind of curious; I'm new-ish to Rust, but I was surprised std doesn't seem to use it either.

It's the de-facto standard for querying the file system in the fastest way possible on Windows. It returns multiple items at once, at various detail levels (specified by FileInformationClass). Don't be fooled by the fact that it's part of NtDll; this is very much used in production on a massive scale, for example in Chromium and the .NET Runtime.

@ChrisDenton

Std uses GetFileInformationByHandleEx in its implementation of std::fs::remove_dir_all because there's no compelling reason to use an Nt function when a Win32 one does the same job. std::fs::read_dir indeed still uses FindFirstFileW. It would make sense to unify these implementations, but nobody has done so yet.

Probably more relevant to this issue is that there is a std::fs::Dir type in the works (but not yet implemented). This will allow directory iteration based on a handle and should be optimized for that. I suspect other things in std will eventually be rewritten in terms of that.

@BurntSushi
Owner

Yeah I'm pretty excited about Dir in std. :)
