Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slowdown with many files on the command line, compared to no arguments #136

Closed
oconnor663 opened this issue Sep 29, 2016 · 13 comments · Fixed by #233
Closed

slowdown with many files on the command line, compared to no arguments #136

oconnor663 opened this issue Sep 29, 2016 · 13 comments · Fixed by #233
Labels
bug A bug.

Comments

@oconnor663
Copy link
Contributor

oconnor663 commented Sep 29, 2016

Example using the https://github.com/keybase/client repo:

$ ls go/**/*.go | wc -l
1352

$ time rg blahblahblah go/       
rg blahblahblah go/  0.03s user 0.01s system 210% cpu 0.017 total

# <<< this is the slow one >>>
$ time rg blahblahblah go/**/*.go
rg blahblahblah go/**/*.go  0.14s user 0.01s system 104% cpu 0.137 total

$ time grep blahblahblah -r go/  
grep --color=auto blahblahblah -r go/  0.01s user 0.01s system 92% cpu 0.014 total

$ time grep blahblahblah go/**/*.go
grep --color=auto blahblahblah -r go/**/*.go  0.01s user 0.01s system 93% cpu 0.018 total

It looks like rg is slower when given a (long) list of files on the command line, than it is when it has to find those files itself. grep doesn't have this problem, so I don't think it's some confounder like my shell taking a long time to expand the glob.

@BurntSushi
Copy link
Owner

I think this is a dupe. On mobile.

On Sep 29, 2016 5:10 PM, "oconnor663" notifications@github.com wrote:

Example using the https://github.com/keybase/client repo:

$ ls go/*/.go | wc -l
1352

$ time rg blahblahblah go/
rg blahblahblah go/ 0.03s user 0.01s system 210% cpu 0.017 total

<<< this is the slow one >>>

$ time rg blahblahblah go//*.go
rg blahblahblah go/
/*.go 0.14s user 0.01s system 104% cpu 0.137 total

$ time grep blahblahblah -r go/
grep --color=auto blahblahblah -r go/ 0.01s user 0.01s system 92% cpu 0.014 total

$ time grep blahblahblah -r go//*.go
grep --color=auto blahblahblah -r go/
/*.go 0.01s user 0.01s system 93% cpu 0.018 total

It looks like rg is slower when given a (long) list of files on the
command line, than it is when it has to find those files itself. grep
doesn't have this problem, so I don't think it's some confounder like my
shell taking a long time to expand the glob.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#136, or mute the thread
https://github.com/notifications/unsubscribe-auth/AAb34ju30o2BwZj2r7Gzif2cpz5Ge9ZZks5qvCksgaJpZM4KKe9M
.

@BurntSushi
Copy link
Owner

Yeah, this is probably a dupe of #44. Your example is actually more embarrassing, because the positional arguments are files, not directories, which makes the reprocessing of parent ignore files even more strange.

I have some grander plans in mind on refactoring the ignore code, so fixing this will probably get lumped in with that. For now, I think we can just track #44.

Thanks for the report!

@oconnor663
Copy link
Contributor Author

oconnor663 commented Sep 30, 2016

@BurntSushi I should've mentioned before, --no-ignore and --no-ignore-parent don't affect the slowdown here, unlike in #44. It might be a different cause?

@BurntSushi
Copy link
Owner

Oh. Neat. I didn't check that. I'll look into it.

On Sep 29, 2016 21:10, "oconnor663" notifications@github.com wrote:

@BurntSushi https://github.com/BurntSushi I should've mentioned before,
--no-ignore and --no-ignore-parent don't affect the slowdown here,
unlike in #44 #44. It might
be a different cause?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#136 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAb34pM7Jix-s-RuaECmFq1CPmPQhFCsks5qvGFtgaJpZM4KKe9M
.

@BurntSushi BurntSushi reopened this Sep 30, 2016
@BurntSushi
Copy link
Owner

You're right, something different is going on than what's in #44.

@oconnor663
Copy link
Contributor Author

cc @patrickxb who actually found this.

@BurntSushi BurntSushi added the bug A bug. label Oct 5, 2016
@BurntSushi
Copy link
Owner

Sadly (and embarrassingly), the blame for this can be placed squarely on my implementation of docopt. Virtually all of the time is spent in matching the argv string against the docopt matcher. My guess is that I could re-implement this to go faster, but I don't have the energy for a rewrite of docopt.

I think this means I should admit defeat switch to clap. :-) cc @kbknapp

@kbknapp
Copy link
Contributor

kbknapp commented Oct 29, 2016

@BurntSushi if "admitting defeat" means having spent all your time writing some of the best parsers and libraries known to Rust (perhaps any language), and helping ensure the Rust ecosystem/users maintains or even exceeds current levels of quality and professionalism....well then I aspire to one day "admit defeat" as well 😉

As for this issue, I'd like to do a build using clap just to compare the results. I'll post the results in the next day or so once I have time to make the build and do the tests.

@BurntSushi
Copy link
Owner

BurntSushi commented Oct 29, 2016

@kbknapp Haha aww shucks. FWIW, clap is unbelievably well maintained. I hadn't looked at it for a while, but when I checked it the other day I was completely blown away.

To give you more context: you probably won't notice much of a difference in standard usage. The problem appears specifically when there's a long list of positional file arguments. Docopt struggles with this because it uses a weird backtracking algorithm to match argv against its matcher. When argv is large, there's a significant slow down. I doubt clap will even notice it at all.

Converting ripgrep over to clap seems like a largeish undertaking, so please don't kill yourself over it! (In the sense that there's a lot of options, not in the sense that Docopt weirdness is spread throughout ripgrep.) I do kind of want to do it myself as a forcing function to get more familiar with clap, but I won't stop you. :-) I will note that this is kind of high on my priority list because it's really easy to provoke this.

@oconnor663
Copy link
Contributor Author

Woah. Maybe there's some hack we could stick on the front of the whole thing (either docopt, or ripgrep) to handle the specific case of a ton of positional args, that wouldn't need a complete rewrite? I've never looked at docopt internals though.

@BurntSushi
Copy link
Owner

I'm sure there's probably something. I'm just not feeling up to digging around in Docopt internals. Everything about it is too complex. (Which is my failing.)

@kbknapp
Copy link
Contributor

kbknapp commented Oct 29, 2016

If you'd like to do the converting that's cool with me, I'll just standby for any questions/suggestions!

Instead of doing a full re-write I just a simple test (called lof for lots-o-files) to see if it would in fact make a difference:

$ find ~/.config -print | wc -l
4291

$ time lof-docopt ~/.config/**/*
0.52s user 0.11s system 99% cpu 0.629 total

$ time lof-clap ~/.config/**/*
0.05s user 0.02s system 97% cpu 0.075 total

I know it's not super scientific or perfect, but gives a rough estimate. Here's what I used:

Docopt:

const USAGE: &'static str = "
Lots of Files

Usage:
  lof <files>...

";

#[derive(RustcDecodable)]
struct Args {
    arg_files: Vec<String>,
}

fn main() {
    let args: Args = Docopt::new(USAGE)
        .and_then(|d| d.decode())
        .unwrap_or_else(|e| e.exit());
}

And clap:

fn main() {
    let m = App::new("lof")
        .arg(Arg::with_name("files")
            .multiple(true))
        .get_matches();
}

I also did version where I looped through and printed out the values both as Strings and OsStrings, but the results were basically no different.

@BurntSushi
Copy link
Owner

OK, so I finally have ripgrep wired up to clap and I can confirm this issue in particular gets fixed. My test case with GNU grep:

[andrew@Cheetah linux] time sh ../cmd.grep
./include/linux/ide.h:  REQ_TYPE_ATA_PM_RESUME, /* resume request */
./include/linux/ide.h:   (rq)->cmd_type == REQ_TYPE_ATA_PM_RESUME)
./include/linux/ide.h: * State information carried for REQ_TYPE_ATA_PM_SUSPEND and REQ_TYPE_ATA_PM_RESUME

real    0m0.182s
user    0m0.120s
sys     0m0.043s

With current ripgrep master:

[andrew@Cheetah linux] time sh ../cmd.rg 
include/linux/ide.h
        REQ_TYPE_ATA_PM_RESUME, /* resume request */
         (rq)->cmd_type == REQ_TYPE_ATA_PM_RESUME)
 * State information carried for REQ_TYPE_ATA_PM_SUSPEND and REQ_TYPE_ATA_PM_RESUME

real    0m1.025s
user    0m0.783s
sys     0m0.240s

and finally with ripgrep wired up to clap:

[andrew@Cheetah linux] time sh ../cmd.rg 
include/linux/ide.h
        REQ_TYPE_ATA_PM_RESUME, /* resume request */
         (rq)->cmd_type == REQ_TYPE_ATA_PM_RESUME)
 * State information carried for REQ_TYPE_ATA_PM_SUSPEND and REQ_TYPE_ATA_PM_RESUME

real    0m0.106s
user    0m0.067s
sys     0m0.037s

Everything is run single-threaded on 3,458 files explicitly given on the command line. This test case was derived from the following xargs command in the linux repo (notice how gosh dang slow ripgrep was):

$ time find ./ -type f | xargs grep PM_RESUME | wc -l
16

real    0m1.170s
user    0m0.833s
sys     0m0.483s
$ time find ./ -type f | xargs rg-master -j1 PM_RESUME | wc -l
16

real    0m17.562s
user    0m13.960s
sys     0m3.753s
$ time find ./ -type f | xargs rg -j1 PM_RESUME | wc -l
16

real    0m1.136s
user    0m0.630s
sys     0m0.670s

Yay clap!

BurntSushi added a commit that referenced this issue Nov 13, 2016
There were two important reasons for the switch:

1. Performance. Docopt does poorly when the argv becomes large, which is
   a reasonable common use case for search tools. (e.g., use with xargs)
2. Better failure modes. Clap knows a lot more about how a particular
   argv might be invalid, and can therefore provide much clearer error
   messages.

While both were important, (1) made it urgent.

Note that since Clap requires at least Rust 1.11, this will in turn
increase the minimum Rust version supported by ripgrep from Rust 1.9 to
Rust 1.11. It is therefore a breaking change, so the soonest release of
ripgrep with Clap will have to be 0.3.

There is also at least one subtle breaking change in real usage.
Previous to this commit, this used to work:

    rg -e -foo

Where this would cause ripgrep to search for the string `-foo`. Clap
currently has problems supporting this use case
(see: clap-rs/clap#742),
but it can be worked around by using this instead:

    rg -e [-]foo

or even

    rg [-]foo

and this still works:

    rg -- -foo

This commit also adds Bash, Fish and PowerShell completion files to the
release, fixes a bug that prevented ripgrep from working on file
paths containing invalid UTF-8 and shows short descriptions in the
output of `-h` but longer descriptions in the output of `--help`.

Fixes #136, #189, #210, #230
BurntSushi added a commit that referenced this issue Nov 18, 2016
There were two important reasons for the switch:

1. Performance. Docopt does poorly when the argv becomes large, which is
   a reasonable common use case for search tools. (e.g., use with xargs)
2. Better failure modes. Clap knows a lot more about how a particular
   argv might be invalid, and can therefore provide much clearer error
   messages.

While both were important, (1) made it urgent.

Note that since Clap requires at least Rust 1.11, this will in turn
increase the minimum Rust version supported by ripgrep from Rust 1.9 to
Rust 1.11. It is therefore a breaking change, so the soonest release of
ripgrep with Clap will have to be 0.3.

There is also at least one subtle breaking change in real usage.
Previous to this commit, this used to work:

    rg -e -foo

Where this would cause ripgrep to search for the string `-foo`. Clap
currently has problems supporting this use case
(see: clap-rs/clap#742),
but it can be worked around by using this instead:

    rg -e [-]foo

or even

    rg [-]foo

and this still works:

    rg -- -foo

This commit also adds Bash, Fish and PowerShell completion files to the
release, fixes a bug that prevented ripgrep from working on file
paths containing invalid UTF-8 and shows short descriptions in the
output of `-h` but longer descriptions in the output of `--help`.

Fixes #136, Fixes #189, Fixes #210, Fixes #230
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants