
Faster (10x or more) matching of patterns with leading "wildcards" (preview) #288

Closed
genivia-inc opened this issue Aug 28, 2023 · 6 comments
Labels: enhancement (New feature or request)

genivia-inc commented Aug 28, 2023

Ugrep runs fast on almost all search patterns, but not all: there is room for improvement for patterns with "leading wildcards".

I'm excited to work on this to improve general regex pattern matching with efficient DFAs. I will post progress in this issue thread. Benchmarks will show how effective this approach is.

Leading wildcards like .+foo, or more specific ones like \w+foo, are not yet optimized in ugrep, in contrast to all other regex pattern search cases. I put this on the back burner and wanted to address it earlier, but didn't have the time until now. So at the moment, a pattern that matches just about anything up front won't give you a speed boost.

However, if you use the ug option -P (Perl matching), performance is good. Simple cases like [a-z]*foo are also fast, but [a-z]+foo is slow again. This slowdown is caused by backtracking.

Let me explain why, and how this will be addressed in a future ugrep release.

The regex [a-z]*z is converted to a DFA with a back-edge to the start state on one character, which the matcher optimizes to avoid backtracking:

[DFA diagram of [a-z]*z with a back-edge to the start state]

The regex [a-z]+z is converted to a DFA without such a back-edge to the start state. The matcher therefore doesn't recognize a back-edge that restarts the pattern, which causes backtracking:

[DFA diagram of [a-z]+z without a back-edge to the start state]

FYI, these images are automatically generated with the RE/flex script reflex/tests/test.sh (RE/flex is the regex library used by ugrep).

The matcher's logic doesn't yet recognize the opportunity to advance forward without backtracking.
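To make the difference concrete, here is a minimal, self-contained C++ sketch of the two DFAs as transition tables (the state numbering and table layout are my own illustration, not ugrep's internal representation):

```cpp
#include <cstdio>

// Minimal sketch of the two DFAs as transition tables; state numbering
// is illustrative, not ugrep's internal representation.
//
// [a-z]*z: start state 0, accepting state 1
//   0 --[a-y]--> 0   back-edge to the START state: the matcher can simply
//   0 --  z  --> 1   restart the pattern here without backtracking
//   1 --[a-y]--> 0
//   1 --  z  --> 1
//
// [a-z]+z: start state 0, accepting state 2
//   0 --[a-z]--> 1
//   1 --[a-y]--> 1   back-edge to state 1, NOT to the start state: the
//   1 --  z  --> 2   matcher doesn't recognize it as a pattern restart,
//   2 --[a-y]--> 1   so it backtracks on failure
//   2 --  z  --> 2

// run the DFA; table[state][0] = next state on [a-y], [1] = next on z
static int run(const int table[][2], int accept, const char *s)
{
  int state = 0, last = -1;
  for (int i = 0; s[i] >= 'a' && s[i] <= 'z'; ++i)
  {
    state = table[state][s[i] == 'z'];
    if (state == accept)
      last = i; // remember the end of the longest match so far
  }
  return last; // offset of the last matching char, or -1
}

int main()
{
  static const int star[][2] = { { 0, 1 }, { 0, 1 } };           // [a-z]*z
  static const int plus[][2] = { { 1, 1 }, { 1, 2 }, { 1, 2 } }; // [a-z]+z
  printf("[a-z]*z on \"abcz\" ends at %d\n", run(star, 1, "abcz")); // 3
  printf("[a-z]+z on \"abcz\" ends at %d\n", run(plus, 2, "abcz")); // 3
}
```

Both tables accept the same strings; the difference that matters for search is only where the back-edges point.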

Most of the work on ugrep so far went into faster pattern matching algorithms that also have a positive impact on pattern search. Because part of this approach is new and no existing libraries can do this, one cannot expect everything to be done and completed right away in a new tool/library like ugrep based on RE/flex. Rome wasn't built in a day...

To address this, note that avoiding backtracking has always been a primary optimization in regex libraries; for example, PCRE2 and RE2, with their NFA/DFA Perl-compatible regex matching, also attempt to limit backtracking. Secondly, other optimizations are possible. Pattern search is typically optimized by looking for the less common parts of the pattern in the input: we want to avoid searching for \w+, which matches almost anything, when foo is clearly the part to look for in the regex \w+foo.
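To illustrate the fragment-first idea, here is a sketch of the general technique (not ugrep's actual implementation): find the rare fragment foo with std::string::find, then scan backwards over \w characters to recover a \w+foo match. For simplicity it reports a match, not necessarily the POSIX leftmost-longest one:

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Sketch of fragment-first search for \w+foo (not ugrep's implementation):
// locate the rare fragment "foo", then verify by scanning backwards over
// \w characters to find the start of the match.
static bool find_w_plus_foo(const std::string& text, size_t& begin, size_t& end)
{
  size_t pos = 0;
  while ((pos = text.find("foo", pos)) != std::string::npos)
  {
    size_t start = pos;
    // scan backwards over word characters preceding "foo"
    while (start > 0 &&
           (std::isalnum((unsigned char)text[start - 1]) || text[start - 1] == '_'))
      --start;
    if (start < pos) // at least one \w character precedes "foo": a match
    {
      begin = start;
      end = pos + 3;
      return true;
    }
    ++pos; // no \w before this "foo", try the next occurrence
  }
  return false;
}

int main()
{
  std::string text = "a barfoo here";
  size_t b, e;
  if (find_w_plus_foo(text, b, e))
    printf("match: %s\n", text.substr(b, e - b).c_str()); // prints "barfoo"
}
```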

As you can see, conceptually this is straightforward. The regex-to-DFA conversion and the Bloom filters that ugrep already uses to speed up searching will work just as before. It is just a matter of finding the time to implement it.

I'm always cautious not to rush into optimizations that have a nonzero chance of breaking ugrep. Testing regex pattern search is very time consuming, despite the additional sets of unit tests I have deployed for it. That's why this hasn't been done yet.

Adding new features (e.g. Boolean search, fuzzy search, file indexing, options -ABC with -o, and other additions) without negatively impacting search performance is also tricky, which has put additional constraints on how quickly this and other optimization opportunities could be addressed.

This optimization and the others mentioned will be included in a future release, with benchmarks. ugrep's optimized pattern search already looks for less frequent combinations, but not yet for patterns with up-front wildcards like \w+foo.

In the meantime, when you see someone use an example like \w+foo to show that ugrep is not fast for this pattern, remember that this is only a temporary problem.


genivia-inc commented Dec 8, 2023

The next update, v5.0, will include the planned new regex engine I am working on to speed this up properly. It speeds up matching of patterns with "wildcards" at the start of a pattern, and should also speed up other patterns.

genivia-inc self-assigned this Dec 20, 2023
genivia-inc commented:

I'm picking up the pace on this improvement; I got distracted by other work, including work on ugrep's new features. This improvement will optimize more than just patterns like \w+foo. The idea is to find a minimal set of patterns in the DFA constructed from the regex pattern. This minimal set of patterns is then used to prime the pattern search/match predictor.

That will be a sweet addition to ugrep to further speed it up.

To give you an example of what I'm talking about, consider [a-z]+(foo|bar)|baz, which has the DFA:

[DFA diagram of [a-z]+(foo|bar)|baz]
Note that the following transitions are on single characters associated with foo, bar and baz, which we can find with a breadth-first search over the DFA graph by following forward edges:

3: a -> 4
3: b -> 5
3: f -> 6
2: b -> 5
2: f -> 6

Therefore, a match can be predicted when we find an a, b, or f that starts one of the fragments ar, az, bar, baz, or foo. The first two are redundant and can be merged with the others. The start of these patterns is essentially formed by a digraph s-t cut that also ignores back-edges to states before the cut. When a match is predicted based on these minimal patterns, we then look back to find the full match. Some regex engines do this in various (simple) ways, but I'm interested in algorithms and optimality. Note that a graph-theoretic min-cut forms an optimal set of edges to prime the predict matcher with, but computing a min-cut is too expensive for large DFA graphs. Hence, we must resort to linear-time graph algorithms and heuristics to determine cuts that yield a minimal set of patterns.
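As a rough sketch of this kind of traversal, here is my own toy version with a fixed cut depth (ugrep's actual cut heuristics are more involved): a breadth-first search that follows forward edges only, ignores edges to already-visited states, and collects the characters on the edges that cross the chosen frontier:

```cpp
#include <cstdio>
#include <map>
#include <queue>
#include <set>
#include <utility>
#include <vector>

// Toy DFA: state -> list of (char, target state) transitions.
using DFA = std::map<int, std::vector<std::pair<char, int>>>;

// BFS over forward edges; characters on edges that cross the cut frontier
// prime the match predictor. A fixed cut depth stands in for ugrep's more
// involved cut heuristics.
static std::set<char> frontier_chars(const DFA& dfa, int start, int cut_depth)
{
  std::set<char> chars;
  std::map<int, int> depth{ { start, 0 } };
  std::queue<int> queue;
  queue.push(start);
  while (!queue.empty())
  {
    int s = queue.front();
    queue.pop();
    auto it = dfa.find(s);
    if (it == dfa.end())
      continue;
    for (auto [c, t] : it->second)
    {
      if (depth.count(t))        // back-edge or cross-edge: ignore
        continue;
      if (depth[s] == cut_depth) // this edge crosses the cut:
      {
        chars.insert(c);         // its character primes the predictor
        continue;                // don't expand beyond the cut
      }
      depth[t] = depth[s] + 1;
      queue.push(t);
    }
  }
  return chars;
}

int main()
{
  // toy DFA for [ab]+cd: 0 --a,b--> 1, 1 --a,b--> 1 (back-edges),
  // 1 --c--> 2, 2 --d--> 3 (accepting)
  DFA dfa = {
    { 0, { { 'a', 1 }, { 'b', 1 } } },
    { 1, { { 'a', 1 }, { 'b', 1 }, { 'c', 2 } } },
    { 2, { { 'd', 3 } } },
  };
  for (char c : frontier_chars(dfa, 0, 1))
    printf("predict on '%c'\n", c); // prints: predict on 'c'
}
```

For [ab]+cd this collects the single character c, so the predictor effectively looks for the cd fragment instead of chasing the [ab]+ loop.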

genivia-inc commented:

A quick update.

The O(|E|+|V|)-time algorithm that I came up with is not trivial, nor is it something I found elsewhere or expect other regex libraries to use (but I could be wrong; there is so much out there that is not documented). It operates on the DFA to find a small set of "important forward edges" that form an s-t cut, from which the match predictors are primed. I should write this up sometime. The match predictors perform much better and will speed up searching. The algorithm and implementation are now reasonably complete, except for profiling, tweaking, and cleanup. Additional testing and evaluation of edge cases is important and takes some time, because I have to make sure there are no bugs in this optimization.
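ugrep's actual match predictor is more elaborate, but the priming idea can be sketched as follows, assuming the cut produced a set of short string fragments: hash the first two bytes of each fragment into a small Bloom-filter-like bitmap, then scan the input and only run the full matcher where the bitmap fires (false positives are possible, misses are not):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Bloom-filter-like predictor sketch, primed from fragments produced by
// the DFA cut; ugrep's real predictor is more elaborate than this.
struct Predictor
{
  uint8_t bits[256] = { 0 }; // 2048-bucket bitmap

  static unsigned hash(unsigned char a, unsigned char b)
  {
    return (a * 31u + b) & 0x7ffu;
  }

  void prime(const std::vector<std::string>& fragments)
  {
    for (const std::string& f : fragments)
      if (f.size() >= 2)
      {
        unsigned h = hash(f[0], f[1]);
        bits[h >> 3] |= 1 << (h & 7);
      }
  }

  bool fires(unsigned char a, unsigned char b) const
  {
    unsigned h = hash(a, b);
    return (bits[h >> 3] >> (h & 7)) & 1;
  }
};

int main()
{
  Predictor p;
  p.prime({ "foo", "bar", "baz" }); // fragments found by the DFA cut
  const char *text = "xyzzy bar xyzzy";
  for (size_t i = 0; text[i] != '\0' && text[i + 1] != '\0'; ++i)
    if (p.fires(text[i], text[i + 1])) // cheap test per position
      printf("candidate at %zu, verify with the full matcher\n", i);
}
```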


genivia-inc commented Feb 4, 2024

Preliminary performance results look encouraging. Searching is 10x faster already for some typical examples, even without additional fine-tuning that I still need to do.

Old:

ugrep -wc '[a-z]+ing' enwik8 --stats
118643
Searched 1 file in 0.932 seconds: 1 matching (100%)

New:

ugrep -wc '[a-z]+ing' enwik8 --stats
118643
Searched 1 file in 0.075 seconds: 1 matching (100%)

This is almost as fast as ugrep -c ing, which searches for the simple string ing.

Now, searching [a-z]+ing does not look impressive to optimize, right? After all, it contains the string fragment ing. But this is only a simplistic example to illustrate the performance impact. The method is generic and handles any regex by extracting a sub-pattern that is more efficient to match. It does this by analyzing the DFA, as I've shown earlier, to essentially make an s-t cut that minimizes edges. Subsequent edges are strategically cut to form a subgraph that serves as a very efficient engine to search for sub-pattern matches. For this simple example, the following DFA graph (generated with RE/flex Graphviz output) illustrates the parts that are cut after a breadth-first search and a subsequent mark-sweep phase, cutting edges and states in O(|V|+|E|) time. Note that loop means there are back-edges, in this case in [a-z]+ before ing, that overlap with ing.

[DFA diagram of [a-z]+ing showing the parts cut by the BFS and mark-sweep phases]
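Loosely, the mark-sweep step can be pictured as follows. This is my own simplification, assuming the breadth-first search already chose the cut edges: mark everything reachable from the cut-edge targets while ignoring back-edges to states before the cut, then sweep away unmarked states and edges, leaving the sub-pattern automaton that drives the search:

```cpp
#include <cstdio>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Toy DFA: state -> list of (char, target state) transitions.
using DFA = std::map<int, std::vector<std::pair<char, int>>>;

// Mark-sweep simplification: mark states reachable from the cut targets,
// ignoring back-edges to states on the start side of the cut, then keep
// only marked states and the edges between them.
static DFA sweep(const DFA& dfa, const std::vector<int>& cut_targets,
                 const std::set<int>& before_cut)
{
  std::set<int> marked;
  std::vector<int> stack(cut_targets.begin(), cut_targets.end());
  while (!stack.empty()) // mark phase
  {
    int s = stack.back();
    stack.pop_back();
    if (before_cut.count(s) || !marked.insert(s).second)
      continue; // back-edge to before the cut, or already marked
    auto it = dfa.find(s);
    if (it != dfa.end())
      for (auto [c, t] : it->second)
        stack.push_back(t);
  }
  DFA sub; // sweep phase
  for (const auto& [s, edges] : dfa)
    if (marked.count(s))
      for (auto [c, t] : edges)
        if (marked.count(t))
          sub[s].push_back({ c, t });
  return sub;
}

int main()
{
  // same toy DFA for [ab]+cd as before; the cut edge is 1 --c--> 2
  DFA dfa = {
    { 0, { { 'a', 1 }, { 'b', 1 } } },
    { 1, { { 'a', 1 }, { 'b', 1 }, { 'c', 2 } } },
    { 2, { { 'd', 3 } } },
  };
  DFA sub = sweep(dfa, { 2 }, { 0, 1 });
  for (const auto& [s, edges] : sub)
    for (auto [c, t] : edges)
      printf("%d --%c--> %d\n", s, c, t); // prints: 2 --d--> 3
}
```

In the [a-z]+ing example, the [a-z]+ part falls before the cut and is swept away, so what remains is essentially an engine that searches for ing.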

A set of example patterns that are optimized this way and are much faster to search:

([f-i]y|(by|py)r|1y3|4y6|7y9|uyw|xyz|kym)zz
(y|(bay|pay)r|1by3|4cy6|7dy9|ueyw|xfyz|kgym)
(z|(bay|pay)r|1by3|4cy6|7dy9|ueyw|xfyz|kgym)
([aA01]|yy)(b\b)?[cC3]d
([aA01]|yyy)(b\b)?[cC3]d
([aA012]|xyz)(b\b)?[cC3]d
([aA0123]|uvwxyz)(b\b)?[cC1][dD]
([aA012]|xyz)(b\b)?[cC3][dD]|uvw
([aA012]|y\ny)(b\b)?[cC3]d
([aA01234]|\nxyz)(b\b)?[cC3]d
([aA012]|y\ny)(bb)?[cC3]d
([aA012]|x\ny)*(bb)?[cC3]d
([aA012]|xyz)(bb)?[cC3]d
[ab]*abb
[ab]+abb
[abcd]+abbe
[a-z]+(foo|bar)
[a-z]+(foo|bar)|baz
[0-9]+(foo|bar)
[0-9]+(foo|bar)|baz
[a-z0-9]+(foo|bar)|baz
.*foo
.*fx*oo
.(foo|bar)
.*(foo|bar)
.+(foo|bar)
\w+(foo|bar)
.(fx*oo|bx*ar)
.(fx*oo|bx*ar|baz)
.(fx*oo|bx*ar|\n*baz)
.*(fx*oo|bx*ar|\n*baz)
\w*foo
\w*(foo|bar)
\w*(foo|bar)|baz
\w+foo
\w+(foo|bar)
\w+(foo|bar)|baz
[^ ]*foo
[^ ]+foo

genivia-inc changed the title Faster matching of patterns with leading "wildcards" (preview) Faster (10x or more) matching of patterns with leading "wildcards" (preview) Feb 4, 2024
genivia-inc commented:

Version 5.0 will be released very soon. This effort took a bit longer than I anticipated, since I had to put other things aside for a while to find enough time to work on it daily. Three versions of the algorithm were implemented and tested. Eventually I settled on the final one, which works well for many regex patterns. The algorithm that optimizes match prediction is generic and does not depend on the machine architecture it runs on. A lot of time was spent on performance testing and tuning, then rinse and repeat. There is always an opportunity to work on this further if I find additional tweaks worthwhile, but I don't want to hold off any longer on releasing this update.

genivia-inc added a commit that referenced this issue Feb 16, 2024
faster: implement #288
new feature #342
new feature #349
fix #350
fix #355
genivia-inc commented:

Implemented.

Updated benchmark results are posted with four new test cases to cover the updated algorithm: https://github.com/Genivia/ugrep-benchmarks

A lot more test cases were evaluated and tested on our side, some using regex patterns much more complex than those in these benchmarks.

genivia-inc unpinned this issue Feb 16, 2024