RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?) #14475
From warren.hyde@amd.com
Created by warren.hyde@amd.com

Certain types of (poorly-written) regular expressions have become

Specifically, consider this (good, previous behavior) example:

5.16.1> $ perl5.16.1 -Mre=debug -e '"00000000: 0c" =~ /.*:\s*ab/i'

(Using case-insensitive matching disallows the optimizer from looking

Now, compare that with the work done by Perl since 5.18:

5.18.2> $ perl5.18.2 -Mre=debug -e '"00000000: 0c" =~ /.*:\s*ab/i'

Why does it walk forward through the string trying to guess the start of

Since 5.18 (or perhaps 5.17.x), the performance of this regular expression

You can see how badly this affects things with the following comparison:

5.16.1> $ seq -f "%01000.0f: 0c" 1000 | /usr/bin/time perl5.16.1 -ne '/.*:\s*ab/i'

5.18.2> $ seq -f "%01000.0f: 0c" 1000 | /usr/bin/time perl5.18.2 -ne '/.*:\s*ab/i'

Spending almost 4.5 seconds matching this regular expression against 1000
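For readers without `seq` or `/usr/bin/time` to hand, the comparison in the report can be approximated in one script. This is only a sketch: it builds the same 1000 input lines with `sprintf` (the same format string the report passes to `seq`) and times the match with `Time::HiRes`. The variable names are illustrative, and absolute timings will depend on the machine and the perl being run.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Same input as: seq -f "%01000.0f: 0c" 1000
# Each line is a 1000-digit zero-padded number followed by ": 0c".
my @lines = map { sprintf("%01000.0f: 0c", $_) } 1 .. 1000;

# Time the pattern from the report against every line.
my $start   = Time::HiRes::time();
my $matches = grep { /.*:\s*ab/i } @lines;    # count of matching lines
my $elapsed = Time::HiRes::time() - $start;

printf "%d lines, %d matches, %.3f s\n", scalar(@lines), $matches, $elapsed;
```

Run it under each perl you want to compare, for example `perl5.16.1 script.pl` and `perl5.18.2 script.pl`. None of the lines contains "ab", so every match attempt has to fail, which is the worst case for backtracking.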
From @demerphq
On 5 February 2015 at 16:46, via RT <perlbug-followup@perl.org> wrote:

Thank you for the report. FWIW, I consider this a very serious,

As you say, this pattern was previously fast, and is not now.

I suspect there is an issue with handling the implicit SBOL/MBOL as

The reason we see this from a pattern with no "^" in it like we do

The idea of this optimization is to avoid quadratic backtracking for

Depending on whether /s was used or not .* either matches everything,

In the case you show (with no /s modifier) it should be at the

Indeed. That is a rather diplomatic way of putting it. In conversation

We need to bisect to find this[1], but it would not surprise me if

IMO we must fix this before the next release of Perl. Patterns of this

Yves
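The implicit anchoring mentioned above can be checked from the outside. The sketch below assumes only documented regex semantics: without /s, `.*` cannot cross a newline, so the leftmost match of a pattern starting with `.*` always begins at the start of a line, and adding an explicit `^` with /m should not change the result. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Illustrative inputs: a non-match, a match, a multi-line match, no colon.
my @samples = (
    "00000000: 0c",
    "00000000: ab",
    "x\n1234:  AB tail",
    "no colon here",
);

my %spans;
for my $s (@samples) {
    # Record "start-end" offsets of the match, or "no match".
    my $plain    = $s =~ /.*:\s*ab/i   ? join("-", $-[0], $+[0]) : "no match";
    my $anchored = $s =~ /^.*:\s*ab/mi ? join("-", $-[0], $+[0]) : "no match";
    $spans{$s} = [$plain, $anchored];

    (my $shown = $s) =~ s/\n/\\n/g;    # make newlines visible in the output
    printf "%-22s plain=%-9s anchored=%s\n", $shown, $plain, $anchored;
}
```

Both columns should agree for every input. That equivalence is what lets the engine stop retrying the match at every character offset within a line.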
The RT System itself - Status changed from 'new' to 'open'
From @cpansprout
On Sun Feb 08 18:48:21 2015, demerphq wrote:

$ ../perl.git/Porting/bisect.pl --target=miniperl --start=v5.16.0 --end=v5.18.0 -e '$_ = join "", map sprintf("%01000.0f: 0c", $_), 1..50; $t = time; /.*:\s*ab/i; die if time - $t > 1'

3465e1f is the first bad commit

regcomp.c: Optimize EXACTFish nodes without folds to EXACT

--
Father Chrysostomos
From @demerphq
On 9 February 2015 at 05:16, Father Chrysostomos via RT

Thanks, but I am pretty sure this is a false positive. I think a

Yves
From @cpansprout
On Sun Feb 08 20:54:52 2015, demerphq wrote:

Same result with:

$ ../perl.git/Porting/bisect.pl --start=v5.16.0 --end=v5.18.0 -e '$_= readpipe(q|./perl -Ilib -Mre=debug -e '\''"%01000.0f: 0c" =~ /.*:\s*ab/i'\'' 2>&1|) ; warn $_ =~ tr/\n//; die if $_=~ tr/\n// > 50'

3465e1f again.

--
Father Chrysostomos
From @iabyn
On Mon, Feb 09, 2015 at 03:47:45AM +0100, demerphq wrote:

I'm digging...
From @iabyn
On Mon, Feb 09, 2015 at 10:55:57AM +0000, Dave Mitchell wrote:

I've dug. It's actually a long-standing issue with /.*/ patterns which are intuit-able. Karl's optimisation just made some /.*.../i patterns intuitable too.

In the following,

$s = ('0' x 200_000) . '::: 0c';

all the non-//i ones are quadratic in all perls since 5.8-ish.

The following commit which I've just pushed makes all 8 of the above run in millisecond time again.

commit 0fa70a0
simpify and speed up /.*.../ handling
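One way to see whether a given perl shows the quadratic behavior is to time the match at several input lengths. The sketch below makes some assumptions. The eight patterns referred to above are not reproduced on this page, so it uses the pattern from the original report as a stand-in. `make_subject` is a hypothetical helper that builds a string of the same shape as `$s` at smaller sizes. The sizes are arbitrary, chosen to finish quickly even on an affected perl.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical helper: same shape as the test string above, variable length.
sub make_subject {
    my ($n) = @_;
    return ('0' x $n) . '::: 0c';
}

my @timings;
for my $n (5_000, 10_000, 20_000) {
    my $s     = make_subject($n);
    my $start = Time::HiRes::time();
    my $matched = $s =~ /.*:\s*ab/i ? 1 : 0;    # stand-in pattern from the report
    push @timings, [$n, $matched, Time::HiRes::time() - $start];
}

printf "n=%6d matched=%d %.4f s\n", @$_ for @timings;
```

If the time roughly quadruples each time n doubles, the match is behaving quadratically. If it roughly doubles, or stays too small to measure, it is not.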
From warren.hyde@amd.com
Excellent diagnosis and response, as usual, from the Perl community. Many thanks to Dave and the others who took a look at this.

Another uninformed question is how Perl's regex engine winds up in PCRE, and whether that would also be affected?

Cheers,

-----Original Message-----
On Mon, Feb 09, 2015 at 10:55:57AM +0000, Dave Mitchell wrote:

I've dug. It's actually a long-standing issue with /.*/ patterns which are intuit-able. Karl's optimisation just made some /.*.../i patterns intuitable too.

In the following,

$s = ('0' x 200_000) . '::: 0c';

all the non-//i ones are quadratic in all perls since 5.8-ish.

The following commit which I've just pushed makes all 8 of the above run in millisecond time again.

commit 0fa70a0
simpify and speed up /.*.../ handling
From @mauke
On 10.02.2015 at 16:11, Hyde, Warren wrote:

It doesn't.

AFAIK perl and PCRE don't share any code.
@iabyn - Status changed from 'open' to 'pending release'
From warren.hyde@amd.com
Dave,

A pattern that starts /.*/ has a fake MBOL or SBOL flag added, along

Thanks again for the fix (to appear in 5.22, I assume). Will this also cover the MINMOD case of a leading /.*?.../, which I also see is quadratic in 5.18.2? I didn't see any test code for that situation in your fix. Patterns like these are extremely common in a "TWiki Formatted Search", for example.
From @demerphq
On 12 February 2015 at 01:32, Hyde, Warren <Warren.Hyde@amd.com> wrote:

Why are they common? Can you give us more context?

Yves
From warren.hyde@amd.com
Yves,

TWiki documents the extraction of text from a topic using a regular expression here:

http://twiki.org/cgi-bin/view/TWiki/FormattedSearch#Extract_some_text_from_a_topic_u

The regex is defined as needing to match the entire document (or line) as follows:

I realize this is not an ideal situation, but if Perl takes a long time backtracking through a leading dot-star (even with minimal matching), this may cause a request to time out in the browser. That's what I meant by "common": lots of things document this as a specific use case, and TWiki came to mind because I recalled having stumbled across this before.

My question was whether the fix as implemented also covers a leading .*? as well as a leading .*, since I didn't think to include this case in the original perlbug submission.

Cheers,
From @demerphq
On 13 February 2015 at 00:38, Hyde, Warren <Warren.Hyde@amd.com> wrote:

It sounds to me like TWiki is going to automatically turn

/PAT/

into

/^(?:PAT)$/m

And if it doesn't it could. :-)

Anyway, I don't see any reason not to enable the same optimisation for

yves
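The /PAT/ to /^(?:PAT)$/m rewrite suggested above is small enough to sketch. Nothing here is taken from TWiki's source: `anchor_to_line` is a hypothetical helper, and the pattern and sample text are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a user-supplied pattern so that it must match
# one whole line. The (?:...) group keeps any alternation in the user's
# pattern inside the anchors.
sub anchor_to_line {
    my ($pat) = @_;
    return qr/^(?:$pat)$/m;
}

# A "match the entire line" pattern with a leading minimal dot-star.
my $re   = anchor_to_line('.*?name:\s*(\w+).*');
my $text = "id: 7\nname: alice (admin)\n";

my ($name) = $text =~ $re;
print "captured: $name\n";
```

With the explicit `^` in place, the engine only has to try the match at the start of each line, so the result does not depend on the implicit-anchor optimization discussed in this ticket.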
From @iabyn
On Thu, Feb 12, 2015 at 04:38:01PM +0000, Hyde, Warren wrote:

Yes, it also covered /.*?/. I've added some extra speed tests with
From @khwilliamson
Thanks for submitting this ticket.

The issue should be resolved with the release today of Perl v5.22, available at http://www.perl.org/get.html

@khwilliamson - Status changed from 'pending release' to 'resolved'
Migrated from rt.perl.org#123743 (status was 'resolved')
Searchable as RT123743$