preserve order of input files in output without sacrificing parallelism #152

Open
jikamens opened this Issue Oct 5, 2016 · 27 comments

@jikamens commented Oct 5, 2016

There should be a way to tell rg to preserve the order of input files specified on the command line in the output.

E.g., if I'm searching a bunch of files which are in chronological order, I want the output to be in chronological order.

@BurntSushi (Owner) commented Oct 5, 2016

You can do this today, but you have to give up parallelism. If you run rg -j1 PATTERN FILE FILE ... then the output should be deterministic and in the order specified. In general, when recursively searching a directory, the order is whatever the file system serves up, but with -j1 it should at least be deterministic.

@jikamens commented Oct 6, 2016

Yeah, so I don't want to give up parallelism, obviously.

GNU parallel lets me be parallel and preserve the order of output with --keep. I'm proposing that rg should provide similar functionality. Otherwise rg -j# isn't a drop-in replacement for grepping a bunch of files with GNU parallel, which I would really like it to be. ;-)
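
For concreteness, the invocation being described looks roughly like this (-k is short for --keep-order; the file names are placeholders):

$ parallel -k rg PATTERN ::: a.log b.log c.log

Each job runs rg on one file; parallel runs the jobs concurrently but buffers each job's output and emits it in argument order.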

@BurntSushi (Owner) commented Oct 6, 2016

I get what you're saying. Here are my thoughts on the matter:

  1. If deterministic ordering + parallelism were possible without incurring additional costs, rg would do that by default. (Because I'd really like deterministic ordering too.) Perhaps I am wrong and this is possible (or perhaps possible with costs too small to measure), and if so, we should do that.
  2. If not (1), then hiding this behind another flag seems like bad UX, but maybe that's the best we can do.
  3. Requiring a single-threaded search when you absolutely need deterministic output doesn't seem unreasonable to me. For example, rg can sometimes beat or be competitive with ag using a single thread even when ag runs with multiple threads.

All of these things combined mean I'm not particularly motivated to work on this, but I grant it's something I'd be willing to explore given unbounded resources.

@jikamens commented Oct 6, 2016

You can implement ordering + parallelism with no additional cost; all rg has to do is read output from the threads in the order that they are launched. This means, of course, that some of the threads will block waiting to be read because their output buffers will fill up, but the performance of such an implementation would be no worse than -j1, and in most cases it would be significantly better.

You can improve the performance simply by making the output buffers larger so that they block less.
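
A minimal sketch of this scheme (illustrative only, not ripgrep's actual code; search() is a stand-in for the real per-file search):

use std::thread;

// Stand-in for searching one file and formatting its matches.
fn search(path: &'static str) -> String {
    format!("results for {}\n", path)
}

fn main() {
    let files = ["a.log", "b.log", "c.log"];
    // Launch one worker per file, in input order.
    let handles: Vec<_> = files
        .iter()
        .map(|&f| thread::spawn(move || search(f)))
        .collect();
    // Drain results in launch order. A worker that finishes early just
    // holds its output in memory until its turn -- that buffering (and
    // the blocking once buffers are bounded) is the cost in question.
    for h in handles {
        print!("{}", h.join().unwrap());
    }
}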

"You just have to put up with your search running eight times slower on your eight-core machine," is not really a reasonable answer, especially since I don't have to put up with it, since as I pointed out, there are other tools that will allow me to search and preserve ordering in the output.

I frankly get the impression that you are rationalizing this idea being not worthwhile because the internal implementation of rg is not conducive to it. That's putting the cart before the horse. If it's too hard to do this right now in rg, fine, that's a reasonable answer. But to then try to hand-wave away the value of the idea because it's hard to do is a step too far.

@BurntSushi (Owner) commented Oct 6, 2016

> You can implement ordering + parallelism with no additional cost; all rg has to do is read output from the threads in the order that they are launched. This means, of course, that some of the threads will block waiting to be read because their output buffers will fill up

Unless I'm missing something, it sounds like the additional cost there is both memory and time.

> but the performance of such an implementation would be no worse than -j1, and in most cases it would be significantly better.

But it sounds like it will be slower than -j8 on an eight core system, which doesn't quite satisfy "no additional cost." With that said, it's plausible that there is no measurable cost, but it's hard to know that without trying an experiment.

Comparing your suggestion to -j1 is interesting, but that's not the standard we need to live up to in order to make this the default behavior. Simply being better than -j1 is good enough to put it behind a flag, but it's not good enough to make it the default. I know you requested putting it behind a flag, but I'd like to see it be default behavior, and this is the frame of mind I have when inquiring about costs.

"You just have to put up with your search running eight times slower on your eight-core machine," is not really a reasonable answer, especially since I don't have to put up with it, since as I pointed out, there are other tools that will allow me to search and preserve ordering in the output.

Of course it's a reasonable answer. This is about time preferences and priorities. Two reasonable people can disagree on how important this particular issue is. I happen to think it's not that important. You clearly do. That's OK.

If GNU parallel lets you achieve this already, then it sounds like you can use ripgrep with GNU parallel to get everything you want except for the convenience of using one tool instead of two.

> I frankly get the impression that you are rationalizing this idea being not worthwhile because the internal implementation of rg is not conducive to it.

I'm sorry I gave you that impression because that wasn't my intent. I think we should try to avoid reading into others' motivations and give the benefit of the doubt that I'm saying what I mean. Without an assumption of good faith here, it's hard to have a productive conversation. (To be frank, this comment by you is putting me on the defensive and making it hard for me to respond.)

To respond though, I don't think rg's current implementation really has anything to do with this. The multithreaded component of rg is pretty small (look at src/main.rs), and it wouldn't take much effort to rewrite it. In fact, I plan on rewriting at least a portion of it soon, because I'd like to make directory iteration itself parallelized. I suspect this may make deterministic output even harder. The best I can do is say that I'll give this issue more thought when I parallelize the directory iterator.

@jikamens commented Oct 6, 2016

Point taken. Thanks for the thoughtful response.

@jikamens commented Oct 6, 2016

> If GNU parallel lets you achieve this already, then it sounds like you can use ripgrep with GNU parallel to get everything you want except for the convenience of using one tool instead of two.

Unfortunately, I frequently get "Out of memory (os error 12)" from ripgrep when I try to use "rg -j1" with GNU parallel.

@BurntSushi (Owner) commented Oct 6, 2016

@jikamens Well... that's bad. :P If you'd be willing to file a bug, I'd be happy to look into it. I'm not a user of GNU parallel, so I think I'd need at least the full command you're running. If you can reproduce the issue on data we can both access, that's even better, but I understand if that's not possible.

@jikamens commented Oct 6, 2016

I've just told you all that I'm able to tell you, so you can file the bug yourself if you want. ;-)

Seriously, the reason I haven't filed a bug is that the data I'm searching is very large and proprietary, and so are the search strings I'm using, so all I'd be able to say in the bug is, "Yeah, I'm getting Out of memory (os error 12) but I can't tell you what I'm doing to provoke it." If I have time and am able to construct a test case that I can share, I'll open an issue.

@BurntSushi changed the title from "Should be a way to preserve order of input files in output" to "preserve order of input files in output without sacrificing parallelism" on Oct 11, 2016

@ryanberckmans commented Feb 8, 2017

@BurntSushi on ripgrep 0.4.0,

       --sort-files
              Sort results by file path.  Note that this currently disables 
              all parallelism and runs search in a single thread.

It would be great if --sort-files with parallelism were the default. Sorting results by file path is a big benefit for me. Let me know if there's a better place for this feedback :).

Thanks for all your hard work! ripgrep is a joy to use. We have type coupling between microservices so I am often searching for FooRequestType and rely on file path order to parse results by service.

@BurntSushi (Owner) commented Feb 8, 2017

@ryanberckmans There's no debate that --sort-files with parallelism would be better than --sort-files without parallelism from an end user's perspective. I mean, it's not like I just arbitrarily chose to disable parallelism for the hell of it.

Glad you're enjoying ripgrep. :-)

@BatmanAoD commented Mar 30, 2017

Since the documentation states that search time is guaranteed to be linear, perhaps we'd get more consistency (without guaranteeing full determinism of output) by ordering the files to search from largest to smallest by default? This of course would require completing the directory traversal before even starting a search, but I'm not sure how big a burden that is, either in terms of programming effort to make the modification or runtime memory usage to store and traverse the file list. (I'm guessing that --sort-files does not have this limitation, since you don't need to inspect child directories before starting to search through files in a parent directory.)

Obviously, to guarantee consistency, there's no way to escape the fact that sometimes the output will "hang" while waiting for the next thread to complete its search.

@BurntSushi (Owner) commented Mar 30, 2017

Storing the entire tree in memory isn't a path I'm excited about pursuing. It's not only a huge code change, but it will introduce additional latency and a memory requirement that is proportional to the size of the tree.

@BatmanAoD commented Mar 30, 2017

That makes sense.

Does ripgrep currently search all files in a directory before going into subdirectories? If so, simply sorting the files in each directory by size (largest first) after getting the listing from the file system seems like it would probably be an improvement. (I'm guessing that getting the total size of each child directory is probably not worthwhile if exploring the entire file tree isn't an option.)

@BurntSushi (Owner) commented Mar 30, 2017

@BatmanAoD The directory traversal itself is parallelized. It's a key reason why ripgrep is so fast even when it needs to process gitignore files. Regardless, standard directory traversal is depth-first, not breadth-first, precisely because of memory constraints. (Large but flat directories are more common than exceedingly deep directories.)

@BatmanAoD commented Mar 30, 2017

Right, but there's still an ordering of tasks available for execution (in parallel), based on which files haven't been searched yet and which folders haven't been explored yet. And even in a depth-first search, the full list of entries in a single directory is still retrieved from the file system before the children are actually searched (or added to the work queue). So I'm suggesting that in between the call to fs::read_dir (or, probably, after applying ignore rules) and the actual insertion of directory entries into the work queue, the contents of the directory are sorted against each other in descending size (possibly treating all directory entries as "larger" than file entries). (I'm not suggesting any kind of sorting between files or directories with different parent directories.)

...but now that I've written that out, I think the obvious question I should have asked earlier is: what kind of work queue does ripgrep actually use? If it's simply FIFO, perhaps switching to a priority queue based on size would be the best way to ensure that larger files are searched first when possible?

@BurntSushi (Owner) commented Mar 31, 2017

> And even in a depth-first search, the full list of entries in a single directory is still retrieved from the file system before the children are actually searched (or added to the work queue).

Nope. :-) In depth first search, the only things you need to store in memory are:

  1. The current entry (which is roughly equivalent to the dirent structure specified in man 3 readdir on Linux).
  2. A stack of the directories one has descended into (which is usually on the heap in a non-toy implementation). Each entry in the stack is an open file descriptor that reads the contents of a directory as a stream. (e.g., The return value of man 3 opendir.)

With that said, for a directory tree of depth N, this requires one to have N open file descriptors. On Linux, the default file descriptor limit is low enough that it's feasible to hit this, so most non-toy implementations will cap the maximum number of open file descriptors. Once the maximum is reached, the file descriptor at the top of the stack is exhausted (by reading the rest of its stream and storing it into memory) and then closed, which frees up a file descriptor.
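
A rough sketch of that streaming walk (a sketch only, not walkdir's actual code), ignoring the file descriptor cap:

use std::fs;
use std::io;
use std::path::PathBuf;

// Depth-first walk that holds only the current entry plus a stack of
// open directory streams; each ReadDir is an open handle read lazily.
fn walk(root: PathBuf) -> io::Result<()> {
    let mut stack = vec![fs::read_dir(root)?];
    while let Some(dir) = stack.last_mut() {
        match dir.next() {
            Some(entry) => {
                let path = entry?.path();
                if path.is_dir() {
                    stack.push(fs::read_dir(path)?); // descend
                } else {
                    println!("{}", path.display()); // "visit" the file
                }
            }
            // Stream exhausted: pop, dropping (closing) the handle.
            None => drop(stack.pop()),
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    walk(PathBuf::from("."))
}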

At least, this is roughly how walkdir works, which is what ripgrep uses for its single threaded search. In prior versions, ripgrep also used walkdir for the parallel search, and it worked quite simply. Here's some pseudo code:

workq := Queue::new();
main thread:
    let mut workers = vec![];
    for i in number_of_cpus() {
        workers.push(Worker::new(workq.clone()));
    }
    for entry in WalkDir::new("./") {
        if should_search(&entry) {
            workq <- entry
        }
    }
    workq.close()
    for worker in workers {
        worker.join().unwrap();
    }

worker thread:
    for entry in workq {
        let results = do_search(entry);
        let stdout = io::stdout().lock();
        print(stdout, results)
    }

And that was pretty much it. The walkdir crate itself can be made to sort the contents of a directory, which would give us a deterministic order in the main thread. In this scheme, I believe it's possible to impose a total ordering in the main thread by assigning a sequence number to each entry. So if walkdir itself yields entries in a deterministic order, then we can tag each entry so that each entry knows the order in which it should be printed.
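
With today's walkdir, that sorting looks roughly like this (assuming walkdir 2.x's sort_by; error handling elided):

use walkdir::WalkDir;

fn main() {
    // Yield each directory's entries in file-name order, making the
    // traversal order in the main thread deterministic.
    let walker = WalkDir::new("./").sort_by(|a, b| a.file_name().cmp(b.file_name()));
    for entry in walker {
        println!("{}", entry.unwrap().path().display());
    }
}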

The worker thread could then be made to "block" until it was the entry's turn to be written, which could be done by checking the sequence id of the most recently printed entry. (There's probably a bit more to this, and perhaps latency could be reduced by introducing a ring queue and another worker that is only responsible for printing. This last worker would need to do a bit of windowing not unlike how TCP works.)
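
A sketch of that sequence-id scheme with a dedicated printer (illustrative only; the worker "results" here are canned strings):

use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::sync::mpsc;
use std::thread;

fn main() {
    // Each worker tags its result with the sequence id its entry was
    // assigned when the (deterministic) walker yielded it.
    let (tx, rx) = mpsc::channel::<(u64, String)>();
    for seq in 0..8u64 {
        let tx = tx.clone();
        thread::spawn(move || {
            let results = format!("results for entry {}\n", seq); // do_search
            tx.send((seq, results)).unwrap();
        });
    }
    drop(tx); // the printer's loop ends once every worker hangs up

    // Printer: park out-of-order results in a min-heap and release them
    // only when the next expected sequence id is at the top.
    let mut next = 0u64;
    let mut pending = BinaryHeap::new();
    for result in rx {
        pending.push(Reverse(result));
        while matches!(pending.peek(), Some(Reverse((s, _))) if *s == next) {
            let Reverse((_, out)) = pending.pop().unwrap();
            print!("{}", out);
            next += 1;
        }
    }
}

Note that the heap is exactly the buffer that can grow without bound if entries with early sequence ids are slow to finish.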

I'd also like to take this time to note that "sorting by file size" is itself a non-starter because it would slow ripgrep down a bit. Namely, ripgrep would need to issue a stat call for each file.

With that said, ripgrep no longer uses this simplistic approach. The core problem with it is that it forces the processing of all filter logic to be single threaded. Given the size of some gitignore files, this can end up being a substantial portion of ripgrep's runtime. Thus, a while ago, I set out to parallelize directory traversal itself. It still uses a work queue, but the fundamental difference is that there are no longer separate producers or consumers. Instead, every producer is, itself, a consumer. The reason for this is that if you're parallelizing directory traversal, then you end up reading the contents of multiple directories simultaneously, so there's no single point of truth about when it terminates. Indeed, termination is complicated.

The big picture problem is that this form of parallelism makes determinism much harder than the simpler work queue formulation that ripgrep used to have. In particular, each item in a directory is sent along a queue, which means there is no point at which you could sort the files you're searching and realistically expect that to result in more deterministic output. Namely, even if you collected all of the entries of a directory and sorted them before sending them to the queue, you would in fact be racing with the N other threads that are doing the same.

To solve the deterministic output problem while simultaneously using parallelism, I think you need to do one of these things:

  1. In addition to the single threaded and multithreaded search ripgrep has today, add a third mechanism that looks like the more traditional work queue style and implement deterministic output using that. You'll give up parallel processing of ripgrep's filters, but you'll still get parallel search. (I think this is inevitably what we'll do.)
  2. Learn how ripgrep does parallel search right now and figure out how to make its output deterministic without imposing significant costs. (I don't think this is possible. I think you could assign sequence ids to each entry in a way that imposes a total ordering across all entries, but the traversal fans out so quickly that I fear this would essentially be equivalent to "buffering all output." Incidentally, this will work in a lot of cases, but it will also mean that ripgrep's memory usage will spike. I am proud of ripgrep's current memory usage.)
  3. Come up with a completely different approach to parallelizing search. I've talked with a few folks about this, and they suggested a work stealing approach might also work, but I don't know if that makes sorting the output any easier. (This is also much more complex. They cited the "goroutine scheduler in the Go runtime" as the place to look.)
  4. Change the requirements. e.g., Convince me that buffering all of ripgrep's output would be OK.

@BurntSushi (Owner) commented Mar 31, 2017

> Come up with a completely different approach to parallelizing search.

If you look at my pseudo code above, you might wonder why one couldn't just move the should_search filter into the worker. Indeed, this could solve a lot of the problems. The problem there is that you wind up with a lot of synchronization overhead associated with maintaining the filters. (The filters are expensive to construct, so you need to cache them.) In particular, checking a filter may actually involve ascending the directory stack to check filters associated with parent directories. (For example, to correctly match .gitignore files.)

So if you go down this route, you basically need to move construction of the filters themselves into the workers as well. But if you do this, you either need to be willing to pay for constructing each filter N times (once for each of the N threads), or you need to do some type of synchronization so that each filter is built once. If you pick the former, then you're right back to where you started: you're not actually parallelizing a good chunk of the filter process, although you are parallelizing the filter match process, which is a small win. If you pick the latter, then at some point you're going to ask up to N workers to work on M files from the same directory, which means all of those workers will need to sit and wait until the filter for that directory is built. Which, of course, also destroys parallelism.
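
To make that "build each filter exactly once" synchronization point concrete, a hypothetical sketch (these types are illustrative, not ripgrep's API):

use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Arc, Mutex, OnceLock};

struct Filter; // stand-in for a directory's compiled ignore rules

fn build_filter(dir: &Path) -> Filter {
    // Expensive: read and compile .gitignore and friends for `dir`.
    let _ = dir;
    Filter
}

type Cache = Mutex<HashMap<PathBuf, Arc<OnceLock<Filter>>>>;

// The map lookup is brief, but every worker needing this directory's
// filter blocks inside get_or_init() until the first caller finishes
// building it -- the "sit and wait" stall described above.
fn filter_for(cache: &Cache, dir: &Path) -> Arc<OnceLock<Filter>> {
    let cell = cache
        .lock()
        .unwrap()
        .entry(dir.to_path_buf())
        .or_default()
        .clone();
    cell.get_or_init(|| build_filter(dir));
    cell
}

fn main() {
    let cache = Cache::default();
    // Many workers would call this concurrently for files in "src/".
    let filter = filter_for(&cache, Path::new("src"));
    assert!(filter.get().is_some());
}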

If you really tumble down the rabbit hole, then maybe there is a smarter way to prioritize which entries get searched based on which filters are available, but then you've probably given up determinism again.

If you get this far, then you're back to door number 3 in my previous comment.

@BatmanAoD commented Mar 31, 2017

Okay, I think I understand much better now what you meant by "the directory traversal itself is parallelized" and why that presents an essentially intractable problem for obtaining determinism via the sort of minor tweaks I was trying to propose. And at the risk of making my complete ignorance of the issues at play even more obvious (😆), I didn't even realize that directories could be read as streams in Linux, let alone in any sort of cross-platform application (at least without circumventing the OS filesystem API and reading/interpreting the raw bits).

...and, looking back, I'm not even sure why I suggested that largest-file-first would be valuable, even heuristically; I think my idea was that this would prioritize actual regex-matching over output printing, but I'm pretty sure that's exactly the reverse of what you'd want.

Based on the very high-level explanation of Go-routine scheduling and stealing here, I do not immediately see any reason why that form of scheduling would make output-sorting any easier.

@BatmanAoD commented Mar 31, 2017

Anyway, thanks for the detailed explanation! That's quite helpful in understanding what's going on.

@bmalehorn (Contributor) commented Apr 16, 2017

Hi all.

I also want this feature for ripgrep, as it's one of my most missed features from ag. I've read over this whole thread and I think I have a good idea of the problem, and I have an idea on how we can fix it.

Priority Queue Solution

Suppose you're searching this directory:

 1.txt
 2/2.txt
 3.txt

Let's say we sort the output of readdir.
Currently, that would make it search like this:

queue                         searched
-----                         --------
["1.txt", "2/", "3.txt"]
["2/", "3.txt"]               1.txt
["3.txt", "2/2.txt"]          1.txt
["2/2.txt"]                   1.txt, 3.txt
[]                            1.txt, 3.txt, 2/2.txt

I propose we use a priority queue, ordered by path name:

priority queue                searched
--------------                --------
["1.txt", "2/", "3.txt"]
["2/", "3.txt"]               1.txt
["2/2.txt", "3.txt"]          1.txt
["3.txt"]                     1.txt, 2/2.txt
[]                            1.txt, 2/2.txt, 3.txt

Notes about this approach:

  • Relatively small change. Swap queue for priority queue.
  • Won't guarantee ordering with >1 thread.
  • This approach is only a heuristic that gets us much closer to sorted output.
  • Hopefully this heuristic will be good enough.
  • Might degrade performance - trading lockless queue for locked priority queue.
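
A tiny illustration of the ordering such a heap imposes (a sketch using std's BinaryHeap; the real change would live inside the parallel walker):

use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn main() {
    // Min-heap on path names: pop() always yields the lexicographically
    // smallest pending entry.
    let mut workq = BinaryHeap::new();
    for p in ["1.txt", "2/", "3.txt"] {
        workq.push(Reverse(p.to_string()));
    }
    assert_eq!(workq.pop().unwrap().0, "1.txt");
    // Popping "2/" and expanding it pushes its children back in...
    assert_eq!(workq.pop().unwrap().0, "2/");
    workq.push(Reverse("2/2.txt".to_string()));
    // ...so "2/2.txt" still comes out ahead of "3.txt".
    assert_eq!(workq.pop().unwrap().0, "2/2.txt");
    assert_eq!(workq.pop().unwrap().0, "3.txt");
}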

Questions for @BurntSushi:

Does this make sense? Do you agree that this will mostly sort the output? Or do you see some critical flaw I didn't consider?

Is this something you'd be willing to merge? Assuming everything pans out - files are usually sorted and there's no major performance change.

@BurntSushi (Owner) commented Apr 16, 2017

@bmalehorn did you happen to read my most recent comments on this? Namely, I don't see how this helps the matter. Remember, in the current implementation, the consumer of the queue is also its producer.

Also, I'm pretty sure ag doesn't have this feature. Perhaps you could elaborate on that point?

@BurntSushi (Owner) commented Apr 16, 2017

The other issue I see is that your queue example seems to indicate a breadth-first search. But most recursive directory iterators use depth-first search (including ripgrep), which kind of makes the priority queue idea moot.

@bmalehorn (Contributor) commented Apr 16, 2017

> Also, I'm pretty sure ag doesn't have this feature. Perhaps you could elaborate on that point?

ag will search in whatever order readdir(2) emits. On OS X this is sorted order. On Linux, you can sort the output of readdir with a small patch.

> But most recursive directory iterators use depth-first search (including ripgrep).

Actually, I don't think ripgrep uses depth first search.

walk.rs uses a crossbeam::sync::MsQueue, which is a queue. I tried adding some print statements, and it indeed seems to be doing breadth first search.

I tried swapping it out for a stack, and sorting readdir:

$ rg -l PM_RESUME
Documentation/dev-tools/sparse.rst
Documentation/translations/zh_CN/sparse.txt
arch/x86/kernel/apm_32.c
drivers/ide/ide-pm.c
drivers/input/mouse/cyapa.h
drivers/mtd/maps/pcmciamtd.c
drivers/net/wireless/intersil/hostap/hostap_cs.c
drivers/usb/mtu3/mtu3_hw_regs.h
include/uapi/linux/apm_bios.h
include/linux/ide.h

The result is that output is often sorted. However, it can get out of order if, for example:

  1. Documentation is popped by thread 1
  2. arch is popped by thread 2
  3. children of Documentation pushed by thread 1
  4. children of arch are pushed by thread 2

Now you will interleave Documentation and arch. I got such an out-of-order case by running cd ripgrep && rg -l ack.

You can view the commit here.

Even if you don't want to use my half-assed sorting approach, I'd still argue that ripgrep should use a stack instead of a queue. The stack won't grow nearly as large as the queue.

@BurntSushi (Owner) commented Apr 16, 2017

> ag will search in whatever order readdir(2) emits. On OS X this is sorted order. On Linux, you can sort the output of readdir with a small patch.

Uh... Maybe in single threaded mode, but certainly not when search is parallelized. ag has non-deterministic output order, just like ripgrep.

> walk.rs uses a crossbeam::sync::MsQueue, which is a queue. I tried adding some print statements, and it indeed seems to be doing breadth first search.

I see. I was thinking about walkdir, sorry, which is depth first.

> You can view the commit here.
>
> Even if you don't want to use my half-assed sorting approach, I'd still argue that ripgrep should use a stack instead of a queue. The stack won't grow nearly as large as the queue.

The sorting approach is probably something that would need good benchmarking before getting merged, particularly on very large directories. If ripgrep were used on a directory with many files, then it wouldn't even start searching until the entire list is in memory.

If using a stack ends up with a more consistent order in practice, then I'd be fine with that!

bmalehorn added a commit to bmalehorn/ripgrep that referenced this issue Apr 16, 2017

walk.rs: queue -> stack
Change the pool of "files to search" from a queue to a stack. This causes
ripgrep to approximate depth-first search instead of breadth-first
search. This dramatically reduces the size of the pool, since most
directories are much more "wide" than they are "deep".

As a result, ripgrep uses ~45% less peak memory:

Before:

    /usr/bin/time -f '%M' rg PM_RESUME > /dev/null
    ...
    16240

After:

    /usr/bin/time -f '%M' rg PM_RESUME > /dev/null
    8876

Throughput improves a tiny bit:

Before:

    linux_literal (pattern: PM_RESUME)
    ----------------------------------
    rg (ignore)*  0.376 +/- 0.003 (lines: 16)*

After:

    linux_literal (pattern: PM_RESUME)
    ----------------------------------
    rg (ignore)*  0.371 +/- 0.004 (lines: 16)*

Related to #152.
@bmalehorn (Contributor) commented Apr 16, 2017

> ag has non-deterministic output order, just like ripgrep.

To be clear, they're both non-deterministic. But there are a few "levels" of how non-deterministic their outputs are:

  1. "completely ordered": No matter what the files will be in sorted order. Example: git grep
  2. "start searching in order": Files will commence being searched in order. So "1.txt" will get passed to a worker thread before "2.txt". But if the match in "1.txt" is 1 GB into the file but the match in "2.txt" is 1 KB into the file, "2.txt" will probably print first. Example: ag
  3. "unordered": No attempt to order outputs. Example: ripgrep

In practice, 2 is good enough. I've used this patched version of ag for months and have never observed files out of order.

"Priority queue" and "stack + sort readdir" are somewhere between 2 and 3.


I've made a PR for switching queue -> stack because it has some performance gains. It doesn't attempt to put the files in order.


> The sorting approach is probably something that would need good benchmarking before getting merged.

Yeah I agree. I'm going to play around with sorting and priority queues and see what I find. Hopefully, we can either get sorting as the default, or make --sort-files compatible with (some) multithreading.

PS. Thanks for being such a responsive maintainer!

@bmalehorn (Contributor) commented Apr 25, 2017

I've created two builds: priority queue (my creation), and single producer (earlier suggestion). In practice, both sort results. Benchmark:

linux_literal (pattern: PM_RESUME)
----------------------------------
rg (master)*           0.423 +/- 0.037 (lines: 16)*
rg (priority queue)    0.431 +/- 0.039 (lines: 16)
rg (single producer)   0.590 +/- 0.018 (lines: 16)
rg (--sort-files)      0.954 +/- 0.034 (lines: 16)

In my opinion, the minor performance loss of priority queue is worth it to get sorted files. But you might disagree. Thoughts? Something else you want me to measure?
