Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ripgrep is slower then grep when searching for e-mail regex #1746

Closed
MaxG87 opened this issue Nov 27, 2020 · 2 comments
Closed

ripgrep is slower then grep when searching for e-mail regex #1746

MaxG87 opened this issue Nov 27, 2020 · 2 comments
Labels
bug A bug. rollup A PR that has been merged with many others in a rollup.

Comments

@MaxG87
Copy link

MaxG87 commented Nov 27, 2020

What version of ripgrep are you using?

rg --version
ripgrep 12.1.1 (rev a6d05475fb)
-SIMD -AVX (compiled)
+SIMD +AVX (runtime

How did you install ripgrep?

I run a locally compiled version.

What operating system are you using ripgrep on?

Ubuntu 20.04.

Describe your bug.

Searching for an e-mail pattern is faster using grep than rg. I experience this in my local Maildir and in the Linux kernels linux/fs/ subfolder.

What are the steps to reproduce the behavior?

I use the Linux kernels linux/fs (rev 85a2c56cb4454c73f56d3099d96942e7919b292f) subfolder as an example for a folder we both have access too. In practice I would like to run the commands in my Maildir.

Pattern without match (~25ms vs ~135ms)
cd /path/to/linux/fs
pattern="[A-Z0-9._%+-]*lin[A-Z0-9._%+-]*tor[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
# some grep warm up runs
time grep -rhioE "$pattern" | len
0
grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn,.idea,.tox} -rhioE   0,02s user 0,01s system 89% cpu 0,025 total
wc -l  0,00s user 0,00s system 4% cpu 0,024 total
# some ripgrep warm up runs
time rg -ioI "$pattern" | len
0
/tmp/rg -ioI "$pattern"  0,34s user 0,01s system 252% cpu 0,137 total
wc -l  0,00s user 0,00s system 1% cpu 0,136 total
Pattern with some matches (~50ms vs ~60ms)
cd /path/to/linux/fs
pattern="[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
# some grep warm up runs
time grep -rhioE "$pattern" | len
10
grep --color=auto --exclude-dir={.bzr,CVS,.git,.hg,.svn,.idea,.tox} -rhioE   0,02s user 0,03s system 98% cpu 0,048 total
wc -l  0,00s user 0,00s system 5% cpu 0,047 total
# some ripgrep warm up runs
time rg -ioI "$pattern" | len
10
/tmp/rg -ioI "$pattern"  0,16s user 0,03s system 294% cpu 0,065 total
wc -l  0,00s user 0,00s system 4% cpu 0,065 total

What is the expected behavior?

I would expect ripgrep to be faster than grep.

@BurntSushi
Copy link
Owner

I cannot seem to reproduce this easily... But read on.

For your second example, I'd consider those times close enough (especially on such a small corpus) that they are effectively the same. But in any case, ripgrep is faster for me:

$ hyperfine 'rg -i "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"'
Benchmark #1: rg -i "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
  Time (mean ± σ):      19.5 ms ±   1.4 ms    [User: 115.1 ms, System: 22.6 ms]
  Range (min … max):    16.2 ms …  23.5 ms    115 runs

$ hyperfine 'grep -riE "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"'
Benchmark #1: grep -riE "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
  Time (mean ± σ):      46.3 ms ±   1.1 ms    [User: 25.3 ms, System: 20.7 ms]
  Range (min … max):    43.6 ms …  48.7 ms    58 runs

Now, your first example is more interesting. That's a substantial difference. On my machine, ripgrep runs a hair faster, but they're effectively tied:

$ hyperfine --ignore-failure 'rg -i "[A-Z0-9._%+-]*lin[A-Z0-9._%+-]*tor[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"'
Benchmark #1: rg -i "[A-Z0-9._%+-]*lin[A-Z0-9._%+-]*tor[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
  Time (mean ± σ):      40.1 ms ±   1.6 ms    [User: 364.8 ms, System: 22.6 ms]
  Range (min … max):    36.7 ms …  44.4 ms    65 runs

  Warning: Ignoring non-zero exit code.

$ hyperfine --ignore-failure 'grep -riE "[A-Z0-9._%+-]*lin[A-Z0-9._%+-]*tor[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"'
Benchmark #1: grep -riE "[A-Z0-9._%+-]*lin[A-Z0-9._%+-]*tor[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
  Time (mean ± σ):      46.8 ms ±   1.4 ms    [User: 24.0 ms, System: 22.3 ms]
  Range (min … max):    43.9 ms …  50.7 ms    59 runs

  Warning: Ignoring non-zero exit code.

Do you perhaps have a ripgrep config file that is running ripgrep single threaded? If I force ripgrep into a single thread, then it gets closer to your reproduction:

$ hyperfine 'rg -i -j1 "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"'
Benchmark #1: rg -i -j1 "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
  Time (mean ± σ):      91.7 ms ±   5.4 ms    [User: 71.1 ms, System: 20.3 ms]
  Range (min … max):    75.7 ms …  98.1 ms    29 runs

And if I have ripgrep search the same set of files as grep, then a real difference can be observed:

$ hyperfine 'rg -i -j1 -uuu "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"'
Benchmark #1: rg -i -j1 -uuu "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}"
  Time (mean ± σ):     151.0 ms ±  10.4 ms    [User: 129.2 ms, System: 21.3 ms]
  Range (min … max):   140.5 ms … 168.4 ms    17 runs

OK, let's see if this is a regex engine problem or a directory traversal problem. Easiest way to do that is to see if we can reproduce the performance difference on a single file. So I've just concatenated all of the files in linux/fs into a single file. And to make differences more pronounced (and eliminate the effects of noise), I've made the file 10x bigger:

$ find ./ -type f -print0 | xargs -0 cat > /tmp/linux-fs-all
$ for ((i=0;i<10;i++)); do cat /tmp/linux-fs-all; done > /tmp/linux-fs-all.10x

The first thing to note is that there is some binary data:

$ grep -iE "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /tmp/linux-fs-all.10x
grep: /tmp/linux-fs-all.10x: binary file matches

So let's use the -a/--text flag and see if we can re-create the performance difference on a single file:

$ time grep -aciE "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /tmp/linux-fs-all.10x
100

real    0.295
user    0.199
sys     0.096
maxmem  7 MB
faults  0

$ time rg -aci "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /tmp/linux-fs-all.10x
100

real    1.249
user    1.231
sys     0.017
maxmem  580 MB
faults  0

Great, got it. Running it under perf shows the total time split between Teddy (a fast literal prefilter) and the regex engine. My first thought is whether this is a case where ripgrep's more aggressive literal optimizations result in it being slower than grep. To figure that out, we need to find 1) which literals ripgrep is searching for and 2) whether they have a lot of matches or not. We can do (1) by running ripgrep with the --trace flag:

$ rg -aci "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /tmp/linux-fs-all.10x --trace
...
DEBUG|grep_regex::literal|crates/regex/src/literal.rs:109: required literals found: [Cut(VID), Cut(vID), Cut(ViD), Cut(viD), Cut(VId), Cut(vId), Cut(Vid), Cut(vid), Cut(@), Cut(.)]
DEBUG|grep_regex::matcher|crates/regex/src/matcher.rs:50: extracted fast line regex: (?-u:VID|vID|ViD|viD|VId|vId|Vid|vid|@|\x2e)
...

Yeah... That doesn't look so good. Let's see how often the "fast line regex" matches:

$ time rg -c '(?-u:VID|vID|ViD|viD|VId|vId|Vid|vid|@|\x2e)' /tmp/linux-fs-all.10x
4489830

real    0.722
user    0.701
sys     0.020
maxmem  580 MB
faults  0

Yikes. That is almost certainly the problem. The killers are the @ and . literals that were extracted, because they make up the vast majority of all false positives. (0x2E is ASCII for ..) If we remove them from the literal search, the number of matches drops precipitously:

$ time rg -c '(?-u:VID|vID|ViD|viD|VId|vId|Vid|vid)' /tmp/linux-fs-all.10x
24320

real    0.157
user    0.139
sys     0.017
maxmem  580 MB
faults  0

For grins, observe how ripgrep is actually faster than GNU grep on these specific queries:

$ time grep -cE '(?-u:VID|vID|ViD|viD|VId|vId|Vid|vid|@|\.)' /tmp/linux-fs-all.10x
4773860

real    1.098
user    0.990
sys     0.107
maxmem  7 MB
faults  0

$ time grep -cE '(?-u:VID|vID|ViD|viD|VId|vId|Vid|vid)' /tmp/linux-fs-all.10x
22320

real    0.963
user    0.895
sys     0.067
maxmem  7 MB
faults  0

The problem here is that ripgrep's heuristic optimization is sub-optimal in this case. These sorts of things are difficult to fix, because fixing it for one case might actually result in making other cases slower. One possible fix would be to just remove @ and . from the set of literals we search. And indeed this is a correct transformation in this case. And removing things like that is probably a smart thing to do anyway, since they are very common. Well, at least . is. @ is common in some corpora and not others. We can see what performance could be like by replacing @ and . with \S to defeat the literal detection:

$ time rg -aci "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*\S[a-z]+\S[a-z]{2,}" /tmp/linux-fs-all.10x
10760

real    0.144
user    0.100
sys     0.043
maxmem  581 MB
faults  0

Obviously it doesn't produce the same results, but if you put it together into a pipeline...

$ time rg -ai "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*\S[a-z]+\S[a-z]{2,}" /tmp/linux-fs-all.10x | rg -aci '[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}'
100

real    0.168
user    0.134
sys     0.034
maxmem  579 MB
faults  0

Unfortunately, while removing . and @ from the literal detection is conceptually simple, the code for extracting literals is not really flexible enough to do this easily. I've been meaning to rewrite it for years, but it's pretty thorny.

Anyway, good find and thank you for the great report!

@BurntSushi BurntSushi added the bug A bug. label Nov 27, 2020
@BurntSushi
Copy link
Owner

This should be improved quite a bit in the ripgrep 14 release. It isn't still quite ideal, but better:

[andrew@duff fs]$ time rg -aci "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /dev/shm/linux-fs-all.10x
110

real    0.248
user    0.227
sys     0.020
maxmem  625 MB
faults  0
[andrew@duff fs]$ time rg-13.0.0 -aci "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /dev/shm/linux-fs-all.10x
110

real    0.662
user    0.628
sys     0.033
maxmem  624 MB
faults  0
[andrew@duff fs]$ time rg -aci "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /dev/shm/linux-fs-all.10x --no-mmap
110

real    0.106
user    0.043
sys     0.063
maxmem  15 MB
faults  0
[andrew@duff fs]$ time rg-13.0.0 -aci "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /dev/shm/linux-fs-all.10x --no-mmap
110

real    0.678
user    0.608
sys     0.070
maxmem  15 MB
faults  0
[andrew@duff fs]$ time grep -aciE "[A-Z0-9._%+-]*vid[A-Z0-9._%+-]*@[a-z]+\.[a-z]{2,}" /dev/shm/linux-fs-all.10x
110

real    0.166
user    0.116
sys     0.050
maxmem  15 MB
faults  0

The interesting thing to note above is that --no-mmap actually runs quite a bit faster than --mmap in the upcoming release. Explaining this is tricky, but I'll try:

  1. At time of writing, ripgrep master still tries to detect inner literals from the regex pattern (as it has always done). But, the regex engine has also grown more sophisticated inner literal detection, and ripgrep's inner literal extractor only runs if the regex engine's own detector fails. (ripgrep's is more flexible because of line oriented search.)
  2. In this case, the regex engine thinks its inner literal extraction is good. Indeed, it is, and it extracts the same literals as the ripgrep's extractor. That is, (?i:vid). Thus, ripgrep defers to the regex engine and doesn't try to do its own optimization. This is usually fine.
  3. Even the regex engine extracts the same literals, in order for the optimization to work in the context of a general regex engine, it has to put up some safeguards to avoid worst case quadratic behavior. Those safeguards end up activating in this benchmark which forces ripgrep to, in some cases, fall back to a non-literal accelerated search.
  4. The non-mmap version runs faster because ripgrep searches smaller buffers. So by a twist of fate, it spends less time in the non-accelerated regex engine than it does in the mmap version.

The main issue preventing improvement here is unfortunately abstraction boundaries. ripgrep can't really know exactly what the regex engine plans to do and how it might be sub-optimal here. The regex engine could expose something about its plan, but I'm hesitant to do such things because it is an implementation detail.

Probably the next step here is figuring out how to push ripgrep's more flexible inner literal optimization down into the regex engine so that everyone who uses the regex engine can benefit from it when doing line oriented search.

But I'm calling this good enough for now.

@BurntSushi BurntSushi added the rollup A PR that has been merged with many others in a rollup. label Nov 25, 2023
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Nov 28, 2023
14.0.2 (2023-11-27)
===================
This is a patch release with a few small bug fixes.

Bug fixes:

* [BUG #2654](BurntSushi/ripgrep#2654):
  Fix `deb` release sha256 sum file.
* [BUG #2658](BurntSushi/ripgrep#2658):
  Fix partial regression in the behavior of `--null-data --line-regexp`.
* [BUG #2659](BurntSushi/ripgrep#2659):
  Fix Fish shell completions.
* [BUG #2662](BurntSushi/ripgrep#2662):
  Fix typo in documentation for `-i/--ignore-case`.


14.0.1 (2023-11-26)
===================
This a patch release meant to fix `cargo install ripgrep` on Windows.

Bug fixes:

* [BUG #2653](BurntSushi/ripgrep#2653):
  Include `pkg/windows/Manifest.xml` in crate package.


14.0.0 (2023-11-26)
===================
ripgrep 14 is a new major version release of ripgrep that has some new
features, performance improvements and a lot of bug fixes.

The headlining feature in this release is hyperlink support. In this release,
they are an opt-in feature but may change to an opt-out feature in the future.
To enable them, try passing `--hyperlink-format default`. If you use [VS Code],
then try passing `--hyperlink-format vscode`. Please [report your experience
with hyperlinks][report-hyperlinks], positive or negative.

[VS Code]: https://code.visualstudio.com/
[report-hyperlinks]: BurntSushi/ripgrep#2611

Another headlining development in this release is that it contains a rewrite
of its regex engine. You generally shouldn't notice any changes, except for
some searches may get faster. You can read more about the [regex engine rewrite
on my blog][regex-internals]. Please [report your performance improvements or
regressions that you notice][report-perf].

[report-perf]: BurntSushi/ripgrep#2652

Finally, ripgrep switched the library it uses for argument parsing. Users
should not notice a difference in most cases (error messages have changed
somewhat), but flag overrides should generally be more consistent. For example,
things like `--no-ignore --ignore-vcs` work as one would expect (disables all
filtering related to ignore rules except for rules found in version control
systems such as `git`).

[regex-internals]: https://blog.burntsushi.net/regex-internals/

**BREAKING CHANGES**:

* `rg -C1 -A2` used to be equivalent to `rg -A2`, but now it is equivalent to
  `rg -B1 -A2`. That is, `-A` and `-B` no longer completely override `-C`.
  Instead, they only partially override `-C`.

Build process changes:

* ripgrep's shell completions and man page are now created by running ripgrep
with a new `--generate` flag. For example, `rg --generate man` will write a
man page in `roff` format on stdout. The release archives have not changed.
* The optional build dependency on `asciidoc` or `asciidoctor` has been
dropped. Previously, it was used to produce ripgrep's man page. ripgrep now
owns this process itself by writing `roff` directly.

Performance improvements:

* [PERF #1746](BurntSushi/ripgrep#1746):
  Make some cases with inner literals faster.
* [PERF #1760](BurntSushi/ripgrep#1760):
  Make most searches with `\b` look-arounds (among others) much faster.
* [PERF #2591](BurntSushi/ripgrep#2591):
  Parallel directory traversal now uses work stealing for faster searches.
* [PERF #2642](BurntSushi/ripgrep#2642):
  Parallel directory traversal has some contention reduced.

Feature enhancements:

* Added or improved file type filtering for Ada, DITA, Elixir, Fuchsia, Gentoo,
  Gradle, GraphQL, Markdown, Prolog, Raku, TypeScript, USD, V
* [FEATURE #665](BurntSushi/ripgrep#665):
  Add a new `--hyperlink-format` flag that turns file paths into hyperlinks.
* [FEATURE #1709](BurntSushi/ripgrep#1709):
  Improve documentation of ripgrep's behavior when stdout is a tty.
* [FEATURE #1737](BurntSushi/ripgrep#1737):
  Provide binaries for Apple silicon.
* [FEATURE #1790](BurntSushi/ripgrep#1790):
  Add new `--stop-on-nonmatch` flag.
* [FEATURE #1814](BurntSushi/ripgrep#1814):
  Flags are now categorized in `-h/--help` output and ripgrep's man page.
* [FEATURE #1838](BurntSushi/ripgrep#1838):
  An error is shown when searching for NUL bytes with binary detection enabled.
* [FEATURE #2195](BurntSushi/ripgrep#2195):
  When `extra-verbose` mode is enabled in zsh, show extra file type info.
* [FEATURE #2298](BurntSushi/ripgrep#2298):
  Add instructions for installing ripgrep using `cargo binstall`.
* [FEATURE #2409](BurntSushi/ripgrep#2409):
  Added installation instructions for `winget`.
* [FEATURE #2425](BurntSushi/ripgrep#2425):
  Shell completions (and man page) can be created via `rg --generate`.
* [FEATURE #2524](BurntSushi/ripgrep#2524):
  The `--debug` flag now indicates whether stdin or `./` is being searched.
* [FEATURE #2643](BurntSushi/ripgrep#2643):
  Make `-d` a short flag for `--max-depth`.
* [FEATURE #2645](BurntSushi/ripgrep#2645):
  The `--version` output will now also contain PCRE2 availability information.

Bug fixes:

* [BUG #884](BurntSushi/ripgrep#884):
  Don't error when `-v/--invert-match` is used multiple times.
* [BUG #1275](BurntSushi/ripgrep#1275):
  Fix bug with `\b` assertion in the regex engine.
* [BUG #1376](BurntSushi/ripgrep#1376):
  Using `--no-ignore --ignore-vcs` now works as one would expect.
* [BUG #1622](BurntSushi/ripgrep#1622):
  Add note about error messages to `-z/--search-zip` documentation.
* [BUG #1648](BurntSushi/ripgrep#1648):
  Fix bug where sometimes short flags with values, e.g., `-M 900`, would fail.
* [BUG #1701](BurntSushi/ripgrep#1701):
  Fix bug where some flags could not be repeated.
* [BUG #1757](BurntSushi/ripgrep#1757):
  Fix bug when searching a sub-directory didn't have ignores applied correctly.
* [BUG #1891](BurntSushi/ripgrep#1891):
  Fix bug when using `-w` with a regex that can match the empty string.
* [BUG #1911](BurntSushi/ripgrep#1911):
  Disable mmap searching in all non-64-bit environments.
* [BUG #1966](BurntSushi/ripgrep#1966):
  Fix bug where ripgrep can panic when printing to stderr.
* [BUG #2046](BurntSushi/ripgrep#2046):
  Clarify that `--pre` can accept any kind of path in the documentation.
* [BUG #2108](BurntSushi/ripgrep#2108):
  Improve docs for `-r/--replace` syntax.
* [BUG #2198](BurntSushi/ripgrep#2198):
  Fix bug where `--no-ignore-dot` would not ignore `.rgignore`.
* [BUG #2201](BurntSushi/ripgrep#2201):
  Improve docs for `-r/--replace` flag.
* [BUG #2288](BurntSushi/ripgrep#2288):
  `-A` and `-B` now only each partially override `-C`.
* [BUG #2236](BurntSushi/ripgrep#2236):
  Fix gitignore parsing bug where a trailing `\/` resulted in an error.
* [BUG #2243](BurntSushi/ripgrep#2243):
  Fix `--sort` flag for values other than `path`.
* [BUG #2246](BurntSushi/ripgrep#2246):
  Add note in `--debug` logs when binary files are ignored.
* [BUG #2337](BurntSushi/ripgrep#2337):
  Improve docs to mention that `--stats` is always implied by `--json`.
* [BUG #2381](BurntSushi/ripgrep#2381):
  Make `-p/--pretty` override flags like `--no-line-number`.
* [BUG #2392](BurntSushi/ripgrep#2392):
  Improve global git config parsing of the `excludesFile` field.
* [BUG #2418](BurntSushi/ripgrep#2418):
  Clarify sorting semantics of `--sort=path`.
* [BUG #2458](BurntSushi/ripgrep#2458):
  Make `--trim` run before `-M/--max-columns` takes effect.
* [BUG #2479](BurntSushi/ripgrep#2479):
  Add documentation about `.ignore`/`.rgignore` files in parent directories.
* [BUG #2480](BurntSushi/ripgrep#2480):
  Fix bug when using inline regex flags with `-e/--regexp`.
* [BUG #2505](BurntSushi/ripgrep#2505):
  Improve docs for `--vimgrep` by mentioning footguns and some work-arounds.
* [BUG #2519](BurntSushi/ripgrep#2519):
  Fix incorrect default value in documentation for `--field-match-separator`.
* [BUG #2523](BurntSushi/ripgrep#2523):
  Make executable searching take `.com` into account on Windows.
* [BUG #2574](BurntSushi/ripgrep#2574):
  Fix bug in `-w/--word-regexp` that would result in incorrect match offsets.
* [BUG #2623](BurntSushi/ripgrep#2623):
  Fix a number of bugs with the `-w/--word-regexp` flag.
* [BUG #2636](BurntSushi/ripgrep#2636):
  Strip release binaries for macOS.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug. rollup A PR that has been merged with many others in a rollup.
Projects
None yet
Development

No branches or pull requests

2 participants