Faster (10x or more) matching of patterns with leading "wildcards" (preview) #288
The next update v5.0 will include the planned new regex engine I am working on to speed this up properly. It speeds up matching patterns with "wildcards" at the start of a pattern, but should also speed up other patterns.
A quick update. The algorithm with O(|E|+|V|) time complexity that I came up with is not trivial, nor something I found elsewhere or expect other regex libraries to use (but I could be wrong; there is so much stuff out there that is not documented). It operates on the DFA to find a small set of "important forward edges" that form an s-t cut, from which the match predictors are primed. I should write this up sometime. The match predictors perform much better and will speed up searching. The algorithm and implementation are now reasonably complete, except for profiling, tweaking and cleanup. Additional testing and evaluation of edge cases is important and takes some time, because I have to make sure there are no bugs with this optimization.
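The exact algorithm described above isn't published, but the s-t cut idea itself can be illustrated generically. The sketch below (hypothetical Python, not ugrep's implementation) uses the graph-theoretic fact that for any state partition (S, T) with the start state in S and all accepting states in T, every accepted string crosses an S-to-T edge, so the characters on those cut edges can prime a match predictor. The toy DFA for `[a-z]+z` is hand-built for illustration:

```python
import string

def cut_edges(dfa, S):
    """Transitions (state, char, next_state) crossing from S into its complement."""
    return [(u, c, v)
            for u, trans in dfa.items()
            for c, v in trans.items()
            if u in S and v not in S]

letters = string.ascii_lowercase

# Hand-built DFA for the regex [a-z]+z; state 2 is the accepting state.
# State 1 means "one or more letters consumed", state 2 means "ends in z".
dfa = {
    0: {c: 1 for c in letters},
    1: {**{c: 1 for c in letters}, 'z': 2},
    2: {**{c: 1 for c in letters}, 'z': 2},
}

# A shallow cut (S = {0}) crosses on every letter -- useless as a predictor:
print(sorted({c for _, c, _ in cut_edges(dfa, {0})}) == list(letters))  # True
# A deeper cut (S = {0, 1}) crosses only on 'z': every match must contain 'z'.
print(sorted({c for _, c, _ in cut_edges(dfa, {0, 1})}))  # ['z']
```

The contrast between the two cuts is the point: which cut you pick determines how discriminating the primed predictor is, which is presumably why finding the "important" edges matters.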
Preliminary performance results look encouraging. Searching is 10x faster already for some typical examples, even without additional fine-tuning that I still need to do.

Old: [benchmark screenshot omitted]

New: [benchmark screenshot omitted]

This is almost as fast as searching […]. Now, searching […]. A set of example patterns that are optimized this way and are much faster to search: […]
Version 5.0 will be released very soon. This effort took a bit longer than I anticipated; I had to find enough time to work on it on a daily basis by putting other things aside for a while. Three versions of the algorithm were implemented and tested. Eventually I settled on the final one, which works well for many regex patterns. The algorithm to optimize the match prediction is generic and does not depend on the machine architecture it runs on. A lot of time was spent on performance testing and tuning, then rinse and repeat. There is always an opportunity to work on this further if I find additional tweaks worthwhile. But I don't want to hold off any longer on releasing this update.
Implemented. Updated benchmark results are posted with four new test cases to cover the updated algorithm: https://github.com/Genivia/ugrep-benchmarks. A lot more test cases were evaluated and tested on our side; some use complex regex expressions (much more complex than in these benchmarks).
Ugrep runs fast on almost all search patterns, but not all. There is room for improvement for patterns with "leading wildcards".
I'm excited to work on this to improve general regex pattern matching with efficient DFAs. I will post progress in this issue thread. Benchmarks will show how effective this approach is.
Leading wildcards like `.+foo`, or more specific ones like `\w+foo`, are not optimized at all in ugrep yet, in contrast to all other regex pattern search cases. I've put this on the back burner and wanted to address it earlier, but didn't have the time until now. So at the moment, using a pattern that matches about anything up front won't give you a speed boost yet.

However, if you use ug option `-P` then the performance is good. Also, simple cases like `[a-z]*foo` are faster, but `[a-z]+foo` is slow again. This is caused by backtracking. Let me explain why, and how this is addressed in a future ugrep release.
The regex `[a-z]*z` is converted to a DFA with a back-edge to the start state on one char, which is optimized by the matcher to avoid backtracking:

[DFA diagram omitted]

The regex `[a-z]+z` is converted to a DFA without such a back-edge to the start state, which therefore isn't recognized by the matcher as a back-edge that restarts the pattern, causing backtracking:

[DFA diagram omitted]

FYI, these images are automatically generated with the RE/flex (used by ugrep) script reflex/tests/test.sh.
The matcher's logic doesn't yet recognize the opportunity to advance forward without backtracking.
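The cost difference can be made concrete with a toy comparison (illustrative Python, not ugrep's matcher): a single-pass search for `[a-z]*z`, whose DFA loops back to the start state, versus a naive restarting search for `[a-z]+z`, which retries a fresh match at every input position. Counting character inspections shows linear versus quadratic work on a worst-case input:

```python
def search_one_pass(text):
    """Single-pass search for [a-z]*z. Every non-accepting transition
    loops back to the start state, so the matcher only ever advances the
    input pointer; a mismatch never forces it to back up."""
    steps = 0
    for ch in text:
        steps += 1
        if ch == 'z':           # the only transition leaving the loop
            return True, steps
    return False, steps

def search_backtracking(text):
    """Naive search for [a-z]+z: on failure, back up and retry at i+1."""
    steps = 0
    for i in range(len(text)):
        j = i
        while j < len(text) and 'a' <= text[j] <= 'z':
            steps += 1
            if text[j] == 'z' and j > i:   # at least one [a-z] before the z
                return True, steps
            j += 1
    return False, steps

text = 'a' * 100  # worst case: all letters, no 'z'
print(search_one_pass(text)[1], search_backtracking(text)[1])  # 100 5050
```

On 100 letters with no match, the one-pass search inspects each character once (100 steps) while the restarting search does 100 + 99 + ... + 1 = 5050 inspections, which is the backtracking penalty described above.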
Most of the work on ugrep so far went into faster pattern match algorithms that have a positive impact on pattern search. Because part of this approach is new and no libraries exist that do this, one cannot expect everything to be done and completed right away in a new tool/library like ugrep based on RE/flex. Rome wasn't built in a day...
To address this: avoiding backtracking has always been the primary optimization in regex libraries; e.g. PCRE2 and RE2 (NFA/DFA Perl-compatible regex matchers) also attempt to limit backtracking. Secondly, there are other optimizations possible. The way pattern search is typically optimized is to look for the less common part of the pattern in the input. So we want to avoid searching for `\w+`, which matches a lot, when `foo` is clearly the part to look for when we search with the regex `\w+foo`.

As you can see, it's conceptually a rather simple thing to do. Updating the regex pattern conversion to DFA and the Bloom filters will work just like before, as ugrep does now to speed up searching. It is just a matter of time to implement it.
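The "search for the rare part first" idea can be sketched in a few lines. This is a hypothetical helper (`find_wplus_foo` is my name, not ugrep code): scan for the literal `foo`, and only where it occurs, verify and extend the `\w+` prefix to its leftmost start:

```python
import re

def find_wplus_foo(text):
    r"""Find the first match of \w+foo by anchoring on the literal 'foo'."""
    start = 0
    while (i := text.find('foo', start)) != -1:
        # \w+ requires at least one word character before the literal
        if i > 0 and re.match(r'\w', text[i - 1]):
            # extend the \w+ run as far left as it goes (leftmost match start)
            j = i
            while j > 0 and re.match(r'\w', text[j - 1]):
                j -= 1
            return text[j:i + 3]
        start = i + 1
    return None

print(find_wplus_foo("see barfoo here"))  # -> "barfoo"
print(find_wplus_foo("plain foo only"))   # -> None
```

The literal scan (`str.find` here, SIMD/Bloom-filter accelerated in a real engine) touches far fewer positions than running `\w+` forward from every offset, which is exactly the effect described above.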
I'm always cautious not to rush into optimizations that have a nonzero chance of breaking ugrep. Testing regex pattern search is very time-consuming, despite the additional sets of unit tests I have deployed to do so. That's why it hasn't been done yet.
Also, adding new features (e.g. Boolean search, fuzzy search, file indexing, options `-ABC` with `-o`, and other new additions) without negatively impacting search performance is tricky, which has put additional constraints on what could be done immediately to address this and other opportunities for optimization.

This optimization and the other optimizations mentioned will be included in a future release with benchmarks. Optimized search in ugrep already looks for less frequent combinations, but not for patterns with up-front wildcards like `\w+foo`.
In the meantime, when you see someone using examples like `\w+foo` to show that ugrep is not that fast for this pattern, remember that this is only a temporary problem.