-
Notifications
You must be signed in to change notification settings - Fork 558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
regex optimiser wrongly rejects certain matches involving embedded comments #12755
Comments
From @nwc10Created by @nwc10It looks like the regex optimiser is wrongly concluding "can never match" Pathological patterns that put a embedded comment between an atom and its Not a regression - has been present since embedded comments were added in $ Still present in blead. The smoking gun is the debugging output Compare the pattern /.b(?#Comment){2}c/ which enters the engine and matches: $ ./perl -Dr -le 'print "abbc" =~ /.b(?#Comment){2}c/ ? "Y" : "n"' EXECUTING... Guessing start of match in sv for REx ".b(?#Comment){2}c" against "abbc" with /ab(?#Comment){2}c/ which is (wrongly) rejected by the optimiser, and $ ./perl -Dr -le 'print "abbc" =~ /ab(?#Comment){2}c/ ? "Y" : "n"' EXECUTING... String shorter than min possible regex match Note, multiline comments, using /x, are not affected by this bug: $ ./perl -le 'print "abbc" =~ /ab#Comment' -e '{2}c/x ? "Y" : "n"' Nicholas Clark Perl Info
|
From @nwc10On Tue, Feb 05, 2013 at 01:26:30AM -0800, Nicholas Clark wrote:
OK, seems that it's more ugly, and affects more things. Or maybe this is a $ ./perl -Dr -le 'print "abbc" =~ /a(?{"b"}){2}c/ ? "Y" : "n"' EXECUTING... Guessing start of match in sv for REx "a(?{%"b%"}){2}c" against "abbc" There is no fixed string "ac" in the pattern. Bad optimiser, no cookie. Nicholas Clark |
From @demerphqOn 5 February 2013 10:59, Nicholas Clark <nick@ccl4.org> wrote:
Bad programmer misread manual, no cookie. :-) The (?{...}) is a zero width pattern. It always matches, and the So the pattern is identical to: /ac/ And is correctly rejected by the optimizer. Perhaps you meant (??{ ... }) which is quite different. Also, you must be testing on an old perl no? I dont recall seeing that Yves |
The RT System itself - Status changed from 'new' to 'open' |
From @demerphqOn 5 February 2013 10:26, Nicholas Clark <perlbug-followup@perl.org> wrote:
I consider this a documentation bug and not a feature bug. You simply Yves -- |
From @nwc10On Tue, Feb 05, 2013 at 02:16:52PM +0100, demerphq wrote:
I disagree. In that it should not matter whether the first symbol of my Also, as best I can tell from the *implementation* of comments, the intent Nicholas Clark |
From @arcNicholas Clark <perlbug-followup@perl.org> wrote:
This seems very wrong — the pattern's been compiled as if the initial $ ./miniperl -Dr -le 'print "abbc" =~ /ab(?#rem)*c/ ? "Y" : "n"' -- |
From @demerphqOn 5 February 2013 14:25, Nicholas Clark <nick@ccl4.org> wrote:u
I have just looked at the output. The cause of this behavior is not What happens is that the (?#....) is parsed out, the {2} then applies Which if you remove the (?#...) looks like: /(?:ab){2}c/ At which point the optimizer quite correctly expects this pattern to In the case with the leading dot, the parsing step which merges 'a' Anyway, none of this is an optimizer bug, its either a doc bug, or a Also notice that you happen to see the optimiser kick in, which is Take a look at the debug output: $ ./perl -Ilib -Mre=Debug,ALL -le 'print "abbc" =~ /ab(?#Comment){2}c/
$ ./perl -Ilib -Mre=Debug,ALL -le 'print "abbc" =~ /.b(?#Comment){2}c/
You didn't look hard enough then. Sorry.
If so its been broken for a long time. Afaik S_nextchar is only called This is a farily typical piece of code IMO: case '<': /* (?<...) */ No "nextchar" there. And IMO, what /some/ of the nextchars suggest as being legal, such as We know that regex comments were a quick hack added by Larry when Yves -- |
From @demerphqOn 5 February 2013 14:57, Aaron Crane <arc@cpan.org> wrote:
This is byproduct of how we parse strings into EXACT nodes. When we About the only thing I can see here likely to change is that we should The docpatch should probably say "You cannot put a comment in between cheers, |
From @demerphqOn 5 February 2013 15:02, demerphq <demerphq@gmail.com> wrote:
Also, I'd like to point out that fixing this without modifying the IMO comments in insane places in regexes is not worth the price. Yves -- |
From @khwilliamsonOn 02/05/2013 07:10 AM, demerphq wrote:
+1 |
From @rjbs* demerphq <demerphq@gmail.com> [2013-02-05T08:58:19]
Thanks, this was a really interesting message. -- |
From @rjbs* demerphq <demerphq@gmail.com> [2013-02-05T09:10:42]
I think it's fine to document that this won't work. I wonder whether we can On the other hand, the way in which this fails is pretty lame to allow to -- |
From @demerphqOn 6 February 2013 01:04, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
My inclination is that we should just allow quantifiers on a (?#...) block. Then we could warn something like "quantifier on comment, maybe you
Unfortunately tho the existing docs on the subject are pretty sparse.
Heh. :-) I haven't checked the code yet to see what is involved, but I imagine Cheers, -- |
From @hvdsdemerphq <demerphq@gmail.com> wrote: It is not a useful thing to use in any context; I'd prefer just to make Allowing it with a warning only makes sense to me if we want to be able Hugo |
From @demerphqOn 6 February 2013 01:01, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
You are welcome. Very happy to be of service. :-) BTW, I have started investigating a fix. The offending code is as follows: case '#': /* (?#...) */ The documentation for this construct is: "(?#text)" The docs do not document that you cannot put a quantifier on such a Actually the behavior of this construct is quite weird. I would argue $ ./perl -Ilib -Mre=debug -le'my I would also argue that this is a bug: $ ./perl -Ilib -Mre=debug -le'my $x="(?#foo\0bar)"; /$x/' I also think that this is weird: $ ./perl -Ilib -Mre=debug -le'my $x="(?#foo\0bar)"; /$x/' Anyway, this construct has all the fingerprints of "poorly thought cheers, -- |
From @demerphqOn 6 February 2013 08:44, <hv@crypt.org> wrote:
Thanks for the analysis, I agree. If I can figure it out at all I will cheers, |
From @demerphqOn 6 February 2013 09:53, demerphq <demerphq@gmail.com> wrote:
The tests for this feature are: t/re/re_tests:^a(?#xxx){3}c aaac y $& aaac Which means someone thought that putting a comment in between the Btw, according to blame these tests are from Ilya and not Larry. c277df4 t/op/re_tests (Ilya Zakharevich 1997-11-15 19:29:39 I checked, and the feature was added by Larry as part of the 5.000 The docs from that commit are: +=item (?#text) As far as I can tell there were no tests for the feature at that time. Anyway, I do not think it is a feasible or reasonable to fix this to Yves -- |
From @arcRicardo Signes <perl.p5p@rjbs.manxome.org> wrote:
I'm not entirely convinced. We already allow a comment between an $ perl -Mre=debug -wE '"" =~ /ab# I think that means we have three options, in the abstract: 1. We preserve this inconsistency between /x comments and (?#) comments 2. We start forbidding (whitespace and?) comments between an atom and 3. We start permitting (?#) comments between an atom and its quantifier I believe option 2 risks breaking existing code. I'm happy to look at doing option 3, but I probably won't have chance
If we can detect the situation well enough to issue a warning,
My initial response is to agree that it's crazy, but on reflection, I $line =~ m{ at which point it's not much of a stretch for someone to write that like this: $line =~ m{ which has whitespace and a comment before the "one or more times" quantifier. -- |
From Eirik-Berg.Hanssen@allverden.noOn Wed, Feb 6, 2013 at 10:08 AM, demerphq <demerphq@gmail.com> wrote:
The way you describe it, and given my ignorance of the implementation, eirik@bluebird[10:38:44]~$ perl say "/x:\t", "food" =~ say "(?#):\t", "food" =~ /fo(?#then){2}(?#twice)d/ ? "match" : "miss"; say "both:\t", "food" =~ Couldn't these be expected – and made? – to parse identically? Eirik |
From @demerphqOn 6 February 2013 10:43, Eirik Berg Hanssen
IMO not very easily no. You have to understand how the parser works, but basically the code we However the code that handles (?#...) is the same code that handles cheers, -- |
From @demerphqOn 6 February 2013 10:31, Aaron Crane <arc@cpan.org> wrote:
Yep. But that is handled by the same code that handles EXACT. See the mail I just sent to Eirik for more details but also take a
I dont believe that 3 is a good idea, 2 is unnecessary. I think that 1 Basically the fix I want to make is: treat a quantifier following a IMO the only thing this might break is broken code already, and we
Nope I dont think so. And I think the code to do so would be horrible,
Yeah, this is no problem
This is also not a problem, and I have no intention to change it. The problem is strictly the use of (?#...) in between an atom and its On the other hand /x with # style escapes is widely used, is not a Commentary: $ ./perl -Ilib -Mre=Debug,ALL -le 'print "abbc" =~ /.b(?#Comment){2}c/
Anyway, if you can see an easy fix then of course go for it. But I Yves -- |
From @nwc10On Wed, Feb 06, 2013 at 11:34:50AM +0100, demerphq wrote:
Specifically, embedded comments.
That might be as simple as treating embedded comments the same way as $ perl -le '/ab(?i)+c/' (Might be a bit more tricky to get an error that says "comment" for comments
Yes. Right now, I feel that the most bonkers thing is that the (mis)parsing
It wouldn't surprise me that it's much easier to detect the problem (after
Agree. In that I feel that it's perfectly reasonable to document that (?#)
Yes, I'm not going to object to Aaron having fun grovelling in the code Nicholas Clark |
From @demerphqOn 6 February 2013 12:10, Nicholas Clark <nick@ccl4.org> wrote:
Exactly.
Ah good catch. I guess that is also there for things like /(|)*/
Indeed. I also have a suspicion that an exhaustive search would find I suspect there is deeper fu involved here that has not been remarked /(?#$foo)/ where $foo is not expanded is very strange, and cant be accounted for
Thanks. That is a much more succinct description of what I was trying
Especially if he proves me wrong. :-) (secretly this thread is all Yves -- |
From @rjbs* demerphq <demerphq@gmail.com> [2013-02-06T03:54:55]
Agreed, too, thanks Hugo. -- |
From @demerphqOn 7 February 2013 00:58, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
A patch to implement this has been pushed as: Yves -- |
From @tonycozOn Thu Feb 07 22:11:16 2013, demerphq wrote:
Did this change smoke cleanly for you? I don't see anything at http://perl.develop-help.com/?b=smoke-me%2Fregex_comment_quantifier_fix that looks like it was caused by this patch. This branch no longer rebases cleanly on blead. Tony |
From @tonycozOn Wed Feb 06 15:59:37 2013, perl.p5p@rjbs.manxome.org wrote:
Karl deprecated at least some of this behaviour in 000947a. Is the intent to ban it entirely in 5.20, or wait longer? Tony |
From @khwilliamsonOn 11/14/2013 04:04 PM, Tony Cook via RT wrote:
Some releases ago we agreed to having all deprecations be at least 2 |
From @demerphqShould I investigate a new patch? On 10 July 2013 08:43, Tony Cook via RT <perlbug-followup@perl.org> wrote:
-- |
From @tonycozOn Fri Nov 15 01:21:14 2013, demerphq wrote:
It can wait, since it won't be in until 5.22. https://rt-archive.perl.org/perl5/Ticket/Display.html?id=116639#txn-1268303 Tony |
From @jkeenanOn Sun Nov 24 14:36:08 2013, tonyc wrote:
Would it be possible to get an update on the status of this ticket (as it's marked as a blocker for 5.22)? Thank you very much. -- |
From @demerphqOn 7 Mar 2015 03:26, "James E Keenan via RT" <perlbug-followup@perl.org>
As I recall I wrote my patch right before 5.20 was released and we were in I've complained in the past about how code freeze isnt necessary and causes We are now in another code freeze, again done suboptimally, and this change We should ditch the code freeze, and simply create a new prerelease branch Anyway, if ricardo wants to make an exception here its his call. I assume |
From perl5-porters@perl.orgYves Orton wrote:
All it takes is for someone to create a post-5.22 branch and rebase |
From @demerphqOn 11 March 2015 at 00:08, Father Chrysostomos <sprout@cpan.org> wrote:
IMO That is the wrong way around. When we create a release candidate we should create the release At that point that branch is "owned" by the release manager, and only Anything that is required for release can be applied to either blead Freezing blead just ignores the capabilities of git to merge and Yves -- |
From @khwilliamsonFixed by 45a2623346a3b7028c030eb1dc738723b216d2dc |
@khwilliamson - Status changed from 'open' to 'pending release' |
From @khwilliamsonI may have closed this prematurely. I had not read the extensive commentary on this when I closed it, only the original report. So I had forgotten the controversy over what should happen. To recap what has happened in blead: It turns out that no one (including me) thought about nextchr()'s behavior when the pattern is UTF-8 encoded. It did a simple ++ of the parse position, which is the wrong thing to do when the character is a multi-byte character. It would point to the 2nd byte of that, hence the tests it did after the increment for white space under /x would fail for white space that was multi-byte. When I tried to write tests after fixing that, I discovered that nothing I came up with would reliably fail. And valgrind showed that there reads outside the buffer of garbage data. That led to me fixing a bunch of nextchr calls, and that led to making all such stuff uniform. And that led to this bug being fixed. But do we really want a (?#...) comment between a character and its quantifier? I can see both sides of the issue, so am now bringing it up to discussion again. blead is now in a state where it would be easy to add the ability to choose which places allow (?#...) and which forbid it, but allow white space and regular # comments, both only under /x. We could allow (?#...) only under /x in such cases if we choose. It's easy to change it to do any of this, and I'm willing to do the work, once a decision has been made as to what to do. My only stance on this is that I think (but am convince-able the other way) that under /x, anywhere there is a # comment, should also allow a (?#...) comment |
From @demerphqOn 14 October 2015 at 00:42, Karl Williamson via RT
IMO no. Yves -- |
From @AbigailOn Tue, Oct 13, 2015 at 03:42:29PM -0700, Karl Williamson via RT wrote:
I'd vote yes. For consistency. See below.
I agree. And I'd throw whitespace in it as well: anywhere where we Ignorable whitespace between a character and its quantifier(s) is allowed: $ perl -wE 'say "aa" =~ /^a + +$/x' A comment there is also allowed: $ perl -wE 'say "aa" =~ /^a # Foo If we have different rules for whitespace and (?#), it won't be easy Abigail |
From @ikegamiOn Wed, Oct 14, 2015 at 4:05 AM, demerphq <demerphq@gmail.com> wrote:
So the follow question would be: What should happen when someone does that? |
From @demerphqOn 15 October 2015 at 02:29, Eric Brine <ikegami@adaelis.com> wrote:
Throw an error, modifier illegal on zero width match. Yves -- |
From @rjbs* Abigail <abigail@abigail.be> [2015-10-14T07:43:05]
I wish I'd read your testing out of this before going through and doing it Anyway, I agree. While I'd rather nobody actually write this, I think that's a -- |
From @khwilliamsonOn 10/17/2015 07:53 PM, Ricardo Signes wrote:
Just to make sure everyone understands. Currently (?#...) comments are allowed even when there is no /x. We I don't have an opinion on this. |
From @rjbs* Karl Williamson <public@khwilliamson.com> [2015-10-18T00:16:44]
Thanks, I was confused.
I'm not strongly opinionated on this, but: I think that I would find it useful If you want to put comments into a regular expression, you have two That is: always allow (?#...) in those places where space and comments become -- |
From @demerphqOn 19 October 2015 at 04:50, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
I really dont like this. A) it complicates the regex engine, and B) it so for instance to *me*: a{1,10} is a single token. a(?#whatever){1,10} is two "tokens". I think expecting people to grok that /a {1,10}/x is the same as /a{1,10}/x is IMO just asking for trouble and for people to misunderstand what is going on. Related, would you expect \(?#is this a \w match or not)w to be the same as \w ? Its not, but IMO if you think that a(?#foo){1,10} should be the same Of course, it wouldnt work, as \( would mean that (?#whatever) is not So this idea just increases inconsistency in our regexes for IMO Yves -- |
From @AbigailOn Mon, Oct 19, 2015 at 11:33:55AM +0200, demerphq wrote:
To me, it isn't. To Perl, it isn't either, as /a {1,10}/x matches up to 10 a's: $ perl -wE '"aaaaaaaaaa" =~ /a {1,10}/x; say $& // "UNDEF"' And it's not that /x just removes any whitespace it finds, as it $ perl -wE '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"' (In this case, it tries to match the literal string "a{1,10}" -- one Compiling /a {1,10}/x shows the tokes perl think there are: 1: CURLY {1,10} (5)
I don't see any benefit of suddenly disallowing a space (under /x) Abigail |
From @khwilliamsonOn 10/19/2015 08:26 AM, Abigail wrote:
Note that this now give a warning, even without the '-w' command line $ blead -E '"aaaaaaaaaa" =~ /a { 1,10}/x; say $& // "UNDEF"' |
From @AbigailOn Mon, Oct 19, 2015 at 08:33:51AM -0600, Karl Williamson wrote:
Yes, I know. Is that useful? I'd wager that of all the people who write /a { 1,10}/x more than 99.9% of want it to behave like /a {1,10}/x and none of them actually want to match the literal string "a{1,10}" Abigail (who has put a space after the , in /x{n,m}/ way too often). |
From @ikegamiOn Mon, Oct 19, 2015 at 5:33 AM, demerphq <demerphq@gmail.com> wrote:
Whether whitespace is allowed within or not, exports would not consider
Still the same 6 tokens. |
From @khwilliamsonOn 10/19/2015 09:07 AM, Abigail wrote:
I think the warning is useful, so they aren't silently deceived. It was In 5.24, this will not compile. In 5.26, it will behave as you and I |
From @AbigailOn Mon, Oct 19, 2015 at 09:16:43AM -0600, Karl Williamson wrote:
Excellent. Abigail |
From @khwilliamsonThank you for submitting this report. You have helped make Perl better. Perl 5.24.0 may be downloaded via https://metacpan.org/release/RJBS/perl-5.24.0 |
@khwilliamson - Status changed from 'pending release' to 'resolved' |
Migrated from rt.perl.org#116639 (status was 'resolved')
Searchable as RT116639$
The text was updated successfully, but these errors were encountered: