Poor performance with Searcher::search_slice #1513
This is a pretty amazing bug! Firstly, thank you for the easy reproduction. Very much appreciated. Secondly, there is an easy fix for you: enable line terminators when building your matcher, e.g.,

```rust
let matcher = grep_regex::RegexMatcherBuilder::new()
    .line_terminator(Some(b'\n'))
    .build(r"\bfoo(bar|baz)")
    .unwrap();
```

With that change, the benchmarks improve dramatically. Before:
After:
OK... So aside from the missed line terminator optimization, the actual reason you're seeing a performance difference is quite interesting, and is the result of confounding factors. First and foremost, the DFA is quitting even though I claimed in the regex performance docs that the DFA should be able to handle Unicode word boundaries on pure ASCII input. The problem is the result of another optimization that shrinks the alphabet of the DFA into equivalence classes, which is almost always much smaller than the full 256-symbol every-byte alphabet. It turns out that in this case, a character like `|` lands in the same equivalence class as non-ASCII bytes, so when the DFA sees the `|` it gives up exactly as if it had seen real non-ASCII input.

Now that doesn't explain everything, since that should impact both `search_slice` and `search_reader` equally. In the case of `search_reader`, the haystack is searched incrementally block by block, so only the search of the block that actually contains the `|` falls back to the slower engine; `search_slice` hands the entire haystack to a single search, so the whole thing pays the slower-engine cost. We can test this explanation by simply putting a `|` at the start of the haystack too, so that `search_reader` hits it in its first block.
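The equivalence-class idea can be illustrated with a toy sketch. This is not the regex crate's actual algorithm (which derives the classes from the compiled program); the `byte_classes` helper below is hypothetical, and only shows how the DFA alphabet becomes much smaller than 256 symbols:

```rust
use std::collections::{HashMap, HashSet};

// Toy illustration of the equivalence-class optimization: give every
// distinct byte that occurs in the pattern its own class, and lump all
// remaining bytes (including every non-ASCII byte) into class 0.
fn byte_classes(pattern: &[u8]) -> [u8; 256] {
    let mut classes = [0u8; 256];
    let mut ids: HashMap<u8, u8> = HashMap::new();
    for &b in pattern {
        let next = ids.len() as u8 + 1;
        classes[b as usize] = *ids.entry(b).or_insert(next);
    }
    classes
}

fn main() {
    let classes = byte_classes(b"foo(bar|baz)");
    let distinct: HashSet<u8> = classes.iter().copied().collect();
    // Only 10 classes instead of a 256-symbol alphabet, so the DFA's
    // transition tables shrink dramatically.
    println!("distinct classes: {}", distinct.len()); // prints "distinct classes: 10"
}
```

The downside is the one described above: when the real engine merges bytes into classes, a benign byte such as `|` can end up sharing a class with bytes the DFA must give up on.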
And benchmarking:
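The slice-versus-reader gap can also be modeled with a back-of-the-envelope calculation. This is a toy model, not the actual engine: it assumes that any search unit containing a quit byte is handled entirely by the slower fallback engine, and it assumes (hypothetically) 64 KiB reader blocks:

```rust
// Toy model: once the DFA quits on a '|', assume the whole search unit
// (one block for a reader, the entire haystack for a slice) is handled
// by the slower fallback engine.
fn slow_bytes(haystack: &[u8], block_size: usize) -> usize {
    haystack
        .chunks(block_size)
        .map(|block| if block.contains(&b'|') { block.len() } else { 0 })
        .sum()
}

fn main() {
    // 1 MiB of pure ASCII with a single '|' near the end, as in the repro.
    let mut haystack = vec![b'a'; 1 << 20];
    let len = haystack.len();
    haystack[len - 10] = b'|';

    // search_slice: one search over the whole haystack pays the full cost.
    println!("slice:  {} slow bytes", slow_bytes(&haystack, len));
    // search_reader: only the one block containing '|' pays it.
    println!("reader: {} slow bytes", slow_bytes(&haystack, 64 * 1024));
}
```

Under this model the slice search spends the slow engine on 16x as many bytes as the reader, which is the right shape for an "orders of magnitude" gap once real constant factors are involved.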
I've filed a bug against the regex crate for the byte class problem.
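For contrast, the reason the `(?-u:\b)` workaround from the report sidesteps all of this: an ASCII word boundary is a pure byte-level test that a DFA can track directly in its states, while Unicode `\b` needs Unicode word-character tables the DFA doesn't have, which is why it only attempts `\b` on input it believes is pure ASCII. A minimal sketch of the byte-level test (hypothetical helper names, not the engine's code):

```rust
// An ASCII "word byte" is [0-9A-Za-z_]; no Unicode tables required.
fn is_ascii_word_byte(b: u8) -> bool {
    b == b'_' || b.is_ascii_alphanumeric()
}

// A position is a word boundary when exactly one of the adjacent
// positions is a word byte.
fn ascii_word_boundaries(hay: &[u8]) -> Vec<usize> {
    let mut out = Vec::new();
    for i in 0..=hay.len() {
        let before = i > 0 && is_ascii_word_byte(hay[i - 1]);
        let after = hay.get(i).map_or(false, |&b| is_ascii_word_byte(b));
        if before != after {
            out.push(i);
        }
    }
    out
}

fn main() {
    // prints "[0, 3, 4, 7]"
    println!("{:?}", ascii_word_boundaries(b"foo|bar"));
}
```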
The `line_terminator` tip works for me as well. Without the `line_terminator`:

And with the `line_terminator`:
What version of ripgrep are you using?
I'm using the grep crate (0.2.4); the master branch (50d2047) was also tested.
What operating system are you using ripgrep on?
Arch Linux
Describe your question, feature request, or bug.
Disclaimer: this does seem to require `\b` (it doesn't happen with `(?-u:\b)`), which is known to have poor performance under some conditions involving Unicode, though this behaviour happens for files and regexes that are purely ASCII (characters like `|` are fairly common).

Under some conditions,
`Searcher::search_slice` (without an encoding / BOM sniffing, i.e. with `SliceByLine`) is slower than `Searcher::search_reader` (`ReadByLine`) by up to orders of magnitude, even though usually the opposite is true (as I would expect, because `search_reader` adds an unnecessary transcoding step and needs to do some copying).

If this is a bug, what are the steps to reproduce the behavior?
I've managed to produce this fairly minimal example to reproduce it:
https://github.com/lllusion3418/rust_grep_issue_bench
$ rustup override set nightly
$ cargo bench

and/or

$ cargo run --release
It (obviously) also happens for some more real-life files/regexes which I'm not really comfortable sharing here; they are ~20 kB HTML files with some `|` characters near the end. It also seems to happen for `/usr/share/dict/american-english` on my system (2x difference).

If this is a bug, what is the actual behavior?
If this is a bug, what is the expected behavior?
I would expect `Searcher::search_slice` to be at least as fast as `Searcher::search_reader`. If the line `slice.extend("|".as_bytes());` is removed, it behaves like this:

Thanks for creating ripgrep and even nicely splitting it into crates that others can easily use; the tool where I originally encountered this problem would probably be 100x as long and complex if it wasn't for the grep crate.
On a system which doesn't have nearly enough RAM (2 GB) to cache every file, I wanted to search through ~130,000 files with ~200 regexes and save the results (in JSON format) separately, but for performance reasons I read and decode (from windows-1252) every file only once instead of just running `rg` for each of the ~200 regexes; combined with rayon it works quite well in only ~150 LOC.