Add Subset Regex to Benchmarks #13
Comments
Please link the paper.
That benchmark is in this repo. But it's not part of the curated set. Please see the contributing guidelines for proposing a change to the curated set.
The paper is here. If it's already part of the repo, I'm not sure whether it should be part of the curated set; I just thought it was an interesting case that might be worth testing.
All righty, now that I have hands on a keyboard, I took a look at this. I defined a new set of benchmarks based on this pattern and varied it a bit. I was particularly interested to see how perf changes depending on two factors: whether Unicode mode is enabled and the size of the bounded repeat. Unicode matters here because it changes the meaning of
The first thing I did was compare the current release of the
The patterns used are as follows:
Here, we can see that the latest version of the
Because of the delicate interaction with the suffix literal optimization, I added another benchmark using the pattern
In this case, there are no literal optimizations. Performance is unchanged with the
For the regex crate, one can largely "fix" this problem by giving the lazy DFA more cache. Potentially a lot more cache. It's a legitimate work-around, but the cost is more memory.
Now let's compare with some other engines. Given the .NET folks were the ones who published the paper, let's start there (this is .NET 8):
This is a rather good showing for .NET. The gap has been closed between regex 1.7 and regex 1.10, but still, the
OK, let's zoom out a bit and get a broader picture of things:
I was not expecting that!
So yeah, I had hoped that the
This adds benchmarks with the pattern `[a-q][^u-z]{13}x` and an assortment of variants. It is a particularly gnarly regex that some engines choke on and others do quite well on. I'm considering adding it to the curated set given how interesting the results are. Ref #13
Here are the benchmark definitions: https://github.com/BurntSushi/rebar/blob/f9a4f5c9efda069e7986a262efe8649aa78c0933/benchmarks/definitions/reported/i13-subset-regex.toml
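For readers wondering why `[a-q][^u-z]{13}x` is hard on DFA-based engines, here is a rough Python sketch of the subset construction for an unanchored search. It is a simplified model, not how the regex crate or any other engine here is implemented. The function name `count_dfa_states` and the four input classes are assumptions made for illustration, and match flags are ignored.

```python
from collections import deque

def count_dfa_states(n):
    """Count reachable sets of live threads for an unanchored search of
    [a-q][^u-z]{n}x. A thread is the number of characters it has consumed
    (1..n+1). Illustrative model only; match flags are not tracked."""
    def step(state, cls):
        # 'X' is the literal x, 'U' is the rest of u-z. Neither is in
        # [a-q] or [^u-z], so every in-progress thread ends here.
        if cls in ("X", "U"):
            return frozenset()
        # 'A' ([a-q]) and 'B' (other chars in [^u-z]) advance threads that
        # are still inside the repeat; a thread at n+1 needed x and dies.
        advanced = {k + 1 for k in state if k <= n}
        if cls == "A":
            advanced.add(1)  # a new match attempt can start on [a-q]
        return frozenset(advanced)

    start = frozenset()
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for cls in "ABXU":
            nxt = step(state, cls)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)

for n in (1, 3, 13):
    print(n, count_dfa_states(n))
```

In this model the thread set behaves like an (n+1)-bit shift register, so the count is 2^(n+1): 4, 16, and 16384 for the three sizes printed. That growth is consistent with a lazy DFA needing a lot more cache as the bounded repeat gets larger.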
From Microsoft's paper on their new derivative-based regex engine (`RegexOptions.NonBacktracking`):
It might be worthwhile including this case in the benchmarks.