Very bad performance and returning the wrong results for a set of words to search #2518

burntbitterbal · 2023-05-25T14:07:22Z

What version of ripgrep are you using?

ripgrep 13.0.0

How did you install ripgrep?

Macports

What operating system are you using ripgrep on?

Apple M1 Pro
MacOS 12.6.3

Describe your bug.

Very surprised this happens since we trusted ripgrep to return correct matches, but it doesn't.

Slow down is observed and wrong results are produced when running ripgrep with option -o against enwik8.txt when searching for this set of words

inquisitorial
Diablotins
Tolkien
Louverture
overpower
dhibh
basses
meteors
rumour
gimmors
FiringSquad
USNS
Berlina
uker
WKNR
Avia
Achish
Halys
sunnyfuerteventura
MedImmune
Oshikawa
Boliwia
Rfrisbie
ethnologue
On
thermally
Wiesenthal
Gesalec
punjabjustice
Zahuri
interpro
Deveson

The search results are wrong and search takes 36 times as long compared to searching the same set of words minus one word removed or added or a different set of words. This also appears to happen with other combinations of 32 words and other files.

What are the steps to reproduce the behavior?

Download enwik8.txt and run rg -on -f words.txt enwik8.txt.

What is the actual behavior?

Slowdown and wrong results:

/usr/bin/time rg -on -f words  enwik8.txt | wc
        7.70 real         7.53 user         0.13 sys
 100000000 100000000 802126412

What is the expected behavior?

A normal run would be similar to this, with these words and with TEST added (33 words to search):

/usr/bin/time rg -on -f words  enwik8.txt | wc
        0.21 real         0.20 user         0.01 sys
   14186   14186  145933

The text was updated successfully, but these errors were encountered:

BurntSushi · 2023-05-25T14:12:19Z

Can you please provide a link to download enwik8.txt? I followed the link you gave, but I see no download instructions.

BurntSushi · 2023-05-25T14:14:47Z

The search results are wrong and search takes 36 times as long compared to searching the same set of words minus one word removed or added or a different set of words.

Without commenting on correctness, your performance expectations are definitely wrong here. Adding one word can indeed slow things down by quite a bit. It really depends how common the word is. If adding a word results in a big increase in the number of matches, then I would expect that to slow things down.

But I really need a full reproduction to say what's going on in this specific case. And also, why not say which word you removed that caused it to go faster?

BurntSushi · 2023-05-25T14:22:19Z

Apparently this is where the data comes from: http://mattmahoney.net/dc/textdata.html

That link is waaaaaaay better than the one you gave. It took quite some time to find it.

BurntSushi · 2023-05-25T14:35:22Z

Yup. There is a bug here. Interestingly, even small changes to the words, like chaging Tolkien to tolkien make ripgrep produce the correct result.

It seems like what's happening is that something is triggering a bug that causes ripgrep to match at every position. That explains the slowdown and the large number of matches reported.

This looks like another literal optimization bug.

BurntSushi · 2023-05-25T15:14:23Z

Here's a minimal reproduction as a Rust program using the regex engine directly:

fn main() {
    let needles = vec![
        "aA", "bA", "cA", "dA", "eA", "fA", "gA", "hA", "iA", "jA", "kA", "lA",
        "mA", "nA", "oA", "pA", "qA", "rA", "sA", "tA", "uA", "vA", "wA", "xA",
        "yA", "zA",
    ];
    let pattern = needles.join("|");
    let re = regex::Regex::new(&pattern).unwrap();
    let hay = "FUBAR";
    assert_eq!(0, re.find_iter(hay).count());
}

That assert should pass, but it panics because find_iter found 6 matches even though there are zero.

This is indeed a literal optimization bug. Sigh.

I also don't have any great work arounds for you either. The bug happens as a result of using a sequence of literals with 26 different start bytes. It doesn't happen in every such case, because this particular bug also requires a suffix literal searcher to be extracted. One way to push ripgrep out of this bug is to add a pattern that can never match that also suppresses the suffix literal optimizer. For example, ZQZQZQZQZQZQ\w.

BurntSushi · 2023-05-25T15:28:18Z

I filed a bug against the regex library as well: rust-lang/regex#999

burntbitterbal · 2023-08-27T23:37:28Z

Sorry to bother you again, I'm sure you're busy updating ripgrep. But this looks important.

I was alerted to this article https://www.genivia.com/ugrep.html was mentioned in the Reddit post https://www.reddit.com/r/opensource/comments/160zla6/ugrep_ultra_fast_grep_with_tui_unicode_support/

Is this serious bug actually fixed in all ripgrep installs and all Rust regex library installs?

I agree that this could be a security vulnerability, especially when it is more wide-spread than ripgrep (the Rust regex library.) I don't see a CVE on this issue or an advisory mention this issue. Other regex libraries have CVEs when such a bug was found.

BurntSushi · 2023-08-27T23:40:11Z

Other regex libraries have CVEs when such a bug was found.

Can you provide links?

Is this serious bug actually fixed in all ripgrep installs and all Rust regex library installs?

It's fixed on master: fc0d9b9

burntbitterbal · 2023-08-28T01:34:05Z

Searching for pcre cve gives hits, but I have not checked other libraries, just spent about a minute finding one like this:

CVE-2015-8393
pcregrep in PCRE before 8.38 mishandles the -q option for binary files, which might allow remote attackers to obtain sensitive information via a crafted file, as demonstrated by a CGI script that sends stdout data to a client.

I'm not a security specialist. This stuff is out of my league.

BurntSushi · 2023-08-28T01:40:19Z

That CVE is quite sparse on the details, but it reads to me like it's talking about the --quiet flag in combination with specific types of binary files. That's not applicable to this bug, which can only be triggered by a certain class of regexes. If your attacker can already change the regex to change what matches are emitted, then the threat model of "produces matches when it shouldn't" doesn't seem relevant.

But yeah you made a statement of fact

Other regex libraries have CVEs when such a bug was found.

So I just assumed you actually had specific CVEs in mind that were relevant here.

I'm not aware of any standing policy by any regex engine that treats match bugs as security vulnerabilities all on their own. That doesn't mean that treating match bugs as security problems is wrong, I'm just not aware of any precedent there.

BurntSushi added the bug A bug. label May 25, 2023

BurntSushi closed this as completed in fc0d9b9 May 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Very bad performance and returning the wrong results for a set of words to search #2518

Very bad performance and returning the wrong results for a set of words to search #2518

burntbitterbal commented May 25, 2023

BurntSushi commented May 25, 2023

BurntSushi commented May 25, 2023

BurntSushi commented May 25, 2023

BurntSushi commented May 25, 2023

BurntSushi commented May 25, 2023

BurntSushi commented May 25, 2023

burntbitterbal commented Aug 27, 2023

BurntSushi commented Aug 27, 2023

burntbitterbal commented Aug 28, 2023

BurntSushi commented Aug 28, 2023 •

edited

Loading

Very bad performance and returning the wrong results for a set of words to search #2518

Very bad performance and returning the wrong results for a set of words to search #2518

Comments

burntbitterbal commented May 25, 2023

What version of ripgrep are you using?

How did you install ripgrep?

What operating system are you using ripgrep on?

Describe your bug.

What are the steps to reproduce the behavior?

What is the actual behavior?

What is the expected behavior?

BurntSushi commented May 25, 2023

BurntSushi commented May 25, 2023

BurntSushi commented May 25, 2023

BurntSushi commented May 25, 2023

BurntSushi commented May 25, 2023

BurntSushi commented May 25, 2023

burntbitterbal commented Aug 27, 2023

BurntSushi commented Aug 27, 2023

burntbitterbal commented Aug 28, 2023

BurntSushi commented Aug 28, 2023 • edited Loading

BurntSushi commented Aug 28, 2023 •

edited

Loading