-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Very bad performance and returning the wrong results for a set of words to search #2518
Comments
Can you please provide a link to download |
Without commenting on correctness, your performance expectations are definitely wrong here. Adding one word can indeed slow things down by quite a bit. It really depends how common the word is. If adding a word results in a big increase in the number of matches, then I would expect that to slow things down. But I really need a full reproduction to say what's going on in this specific case. And also, why not say which word you removed that caused it to go faster? |
Apparently this is where the data comes from: http://mattmahoney.net/dc/textdata.html That link is waaaaaaay better than the one you gave. It took quite some time to find it. |
Yup. There is a bug here. Interestingly, even small changes to the words, like chaging It seems like what's happening is that something is triggering a bug that causes ripgrep to match at every position. That explains the slowdown and the large number of matches reported. This looks like another literal optimization bug. |
Here's a minimal reproduction as a Rust program using the regex engine directly: fn main() {
let needles = vec![
"aA", "bA", "cA", "dA", "eA", "fA", "gA", "hA", "iA", "jA", "kA", "lA",
"mA", "nA", "oA", "pA", "qA", "rA", "sA", "tA", "uA", "vA", "wA", "xA",
"yA", "zA",
];
let pattern = needles.join("|");
let re = regex::Regex::new(&pattern).unwrap();
let hay = "FUBAR";
assert_eq!(0, re.find_iter(hay).count());
} That assert should pass, but it panics because This is indeed a literal optimization bug. Sigh. I also don't have any great work arounds for you either. The bug happens as a result of using a sequence of literals with 26 different start bytes. It doesn't happen in every such case, because this particular bug also requires a suffix literal searcher to be extracted. One way to push ripgrep out of this bug is to add a pattern that can never match that also suppresses the suffix literal optimizer. For example, |
I filed a bug against the regex library as well: rust-lang/regex#999 |
Sorry to bother you again, I'm sure you're busy updating ripgrep. But this looks important. I was alerted to this article https://www.genivia.com/ugrep.html was mentioned in the Reddit post https://www.reddit.com/r/opensource/comments/160zla6/ugrep_ultra_fast_grep_with_tui_unicode_support/ Is this serious bug actually fixed in all ripgrep installs and all Rust regex library installs? I agree that this could be a security vulnerability, especially when it is more wide-spread than ripgrep (the Rust regex library.) I don't see a CVE on this issue or an advisory mention this issue. Other regex libraries have CVEs when such a bug was found. |
Can you provide links?
It's fixed on master: fc0d9b9 |
Searching for pcre cve gives hits, but I have not checked other libraries, just spent about a minute finding one like this: CVE-2015-8393 I'm not a security specialist. This stuff is out of my league. |
That CVE is quite sparse on the details, but it reads to me like it's talking about the But yeah you made a statement of fact
So I just assumed you actually had specific CVEs in mind that were relevant here. I'm not aware of any standing policy by any regex engine that treats match bugs as security vulnerabilities all on their own. That doesn't mean that treating match bugs as security problems is wrong, I'm just not aware of any precedent there. |
What version of ripgrep are you using?
ripgrep 13.0.0
How did you install ripgrep?
Macports
What operating system are you using ripgrep on?
Apple M1 Pro
MacOS 12.6.3
Describe your bug.
Very surprised this happens since we trusted ripgrep to return correct matches, but it doesn't.
Slow down is observed and wrong results are produced when running ripgrep with option
-o
against enwik8.txt when searching for this set of wordsThe search results are wrong and search takes 36 times as long compared to searching the same set of words minus one word removed or added or a different set of words. This also appears to happen with other combinations of 32 words and other files.
What are the steps to reproduce the behavior?
Download enwik8.txt and run
rg -on -f words.txt enwik8.txt
.What is the actual behavior?
Slowdown and wrong results:
What is the expected behavior?
A normal run would be similar to this, with these words and with
TEST
added (33 words to search):The text was updated successfully, but these errors were encountered: