Regex loop for \K in lookbehind #14638
Comments
From ph10@hermes.cam.ac.ukCreated by ph10@hermes.cam.ac.ukThe command perl -e 'while("aaaa" =~ /(?<=\Ka)/g){};' puts Perl into an infinite loop. Without the \K in the assertion, it Regards, -- Perl Info
|
From @jkeenanOn Tue Apr 07 09:46:45 2015, ph10@hermes.cam.ac.uk wrote:
A little diagnostic: ##### while ("aaaa" =~ /(?<=a)/g) { say ''; # Below: Infinite loop The top part produces: ##### The bottom part loops forever, producing: #####
Can you provide a link to that fix? Thank you very much. -- |
The RT System itself - Status changed from 'new' to 'open' |
From @khwilliamsonI bisected this, and it appears that this bug has been there from the very first, so this is not a regression: ee9b8ea is the first bad commit Add Regexp::Keep \K functionality to regex engine as well as add \v and \V, cleanup and more docs for regatom() -- |
From @khwilliamsonAlso, -Dr output looks like EXECUTING... Matching REx "(?<=\Ka)" against "aaaa" ad infinitum |
From ph10@hermes.cam.ac.ukOn Mon, 27 Apr 2015, James E Keenan via RT wrote:
Not very easily (I'd have to check the SVN manual to figure out how to However, even after matching a non-empty string, there is still one The code now checks for this case and advances the /g loop by one Regards, -- |
From @demerphqOn 28 April 2015 at 13:14, <ph10@hermes.cam.ac.uk> wrote:
Sounds like we have the same problem in Perl. Karl, if you dont see a way to fix this then let me know and I will pick it up. Cheers, -- |
From @khwilliamsonOn 04/28/2015 06:40 AM, demerphq wrote:
I haven't ever looked at the \K code; I was hoping someone else would |
From @demerphqOn 28 April 2015 at 22:28, Karl Williamson <public@khwilliamson.com> wrote:
Ok, then I will try to find time to address this. cheers, -- |
From @khwilliamsonOn 4/28/2015 3:26 PM, demerphq wrote:
Note that this is a 5.23 item, so there isn't a real urgency |
From @tonycozOn Tue Apr 28 14:27:34 2015, demerphq wrote:
What's the intended behaviour of \K in a look-(ahead|behind)? My first thught was that it should be ignored, something like: --- a/regcomp.c (with some extra code needed to handle look-aheads.) Or should it control the region matched, so \K in the example would set $& to "a", Tony |
From @ap* Tony Cook via RT <perlbug-followup@perl.org> [2015-05-27 07:05]:
It seems to me like it can only really have reasonable semantics at the So I would be most inclined to treat it as a compile error, at the least I have, FWIW, used it plenty but never used it inside a group. Anyone? Regards, |
From @tonycozOn Tue May 26 22:03:58 2015, tonyc wrote:
Which was broken in at least one way. Attached is a "better" patch, depending on how much I've managed Tony |
From @tonycoz0001-prevent-K-working-in-lookahead-behind-assertions-and.patchFrom 40f4dee6393db72c33d89117b6e092e4e349366c Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Thu, 28 May 2015 16:03:50 +1000
Subject: [PATCH] prevent \K working in lookahead/behind assertions (and warn)
this is probably incorrect, since I'm clueless about the regexp
engine
It may be that in_lookbehind and in_lookahead can be combined, since
in_lookbehind appears to be only used to maintain its own value.
---
pod/perldiag.pod | 5 +++++
regcomp.c | 51 +++++++++++++++++++++++++++++++++++++-----------
t/lib/warnings/regcomp | 12 ++++++++++++
t/re/pat_advanced.t | 14 +++++++++++++
4 files changed, 71 insertions(+), 11 deletions(-)
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index 93ae13b..fb00745 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -6556,6 +6556,11 @@ about the /d modifier.
(W misc) You have a \E in a double-quotish string without a C<\U>,
C<\L> or C<\Q> preceding it.
+=item Useless use of \K in lookbehind/lookahead in regex; marked by S<<-- HERE> in m/%s/
+
+(W regexp) Your regular expression used C<\K> in a lookhead or
+lookbehind assertion, where is has no effect.
+
=item Useless use of greediness modifier '%c' in regex; marked by S<<-- HERE> in m/%s/
(W regexp) You specified something like these:
diff --git a/regcomp.c b/regcomp.c
index 712c8ed7..945778d 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -177,6 +177,7 @@ struct RExC_state_t {
through */
U32 study_chunk_recursed_bytes; /* bytes in bitmap */
I32 in_lookbehind;
+ I32 in_lookahead;
I32 contains_locale;
I32 contains_i;
I32 override_recoding;
@@ -255,6 +256,7 @@ struct RExC_state_t {
#define RExC_study_chunk_recursed_bytes \
(pRExC_state->study_chunk_recursed_bytes)
#define RExC_in_lookbehind (pRExC_state->in_lookbehind)
+#define RExC_in_lookahead (pRExC_state->in_lookahead)
#define RExC_contains_locale (pRExC_state->contains_locale)
#define RExC_contains_i (pRExC_state->contains_i)
#define RExC_override_recoding (pRExC_state->override_recoding)
@@ -6633,6 +6635,7 @@ Perl_re_op_compile(pTHX_ SV ** const patternp, int pat_count,
RExC_seen = 0;
RExC_maxlen = 0;
RExC_in_lookbehind = 0;
+ RExC_in_lookahead = 0;
RExC_seen_zerolen = *exp == '^' ? -1 : 0;
RExC_extralen = 0;
RExC_override_recoding = 0;
@@ -9782,6 +9785,12 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
*flagp = 0; /* Tentatively. */
+ if (RExC_in_lookbehind) {
+ RExC_in_lookbehind++;
+ }
+ if (RExC_in_lookahead) {
+ RExC_in_lookahead++;
+ }
/* Make an OPEN node, if parenthesized. */
if (paren) {
@@ -10055,9 +10064,11 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
RExC_seen |= REG_LOOKBEHIND_SEEN;
RExC_in_lookbehind++;
RExC_parse++;
- /* FALLTHROUGH */
+ RExC_seen_zerolen++;
+ break;
case '=': /* (?=...) */
- RExC_seen_zerolen++;
+ RExC_in_lookahead++;
+ RExC_seen_zerolen++;
break;
case '!': /* (?!...) */
RExC_seen_zerolen++;
@@ -10685,6 +10696,9 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
if (RExC_in_lookbehind) {
RExC_in_lookbehind--;
}
+ if (RExC_in_lookahead) {
+ RExC_in_lookahead--;
+ }
if (after_freeze > RExC_npar)
RExC_npar = after_freeze;
return(ret);
@@ -11787,15 +11801,30 @@ S_regatom(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
*flagp |= SIMPLE;
goto finish_meta_pat;
case 'K':
- RExC_seen_zerolen++;
- ret = reg_node(pRExC_state, KEEPS);
- *flagp |= SIMPLE;
- /* XXX:dmq : disabling in-place substitution seems to
- * be necessary here to avoid cases of memory corruption, as
- * with: C<$_="x" x 80; s/x\K/y/> -- rgs
- */
- RExC_seen |= REG_LOOKBEHIND_SEEN;
- goto finish_meta_pat;
+ if (!RExC_in_lookbehind && !RExC_in_lookahead) {
+ RExC_seen_zerolen++;
+ ret = reg_node(pRExC_state, KEEPS);
+ *flagp |= SIMPLE;
+ /* XXX:dmq : disabling in-place substitution seems to
+ * be necessary here to avoid cases of memory corruption, as
+ * with: C<$_="x" x 80; s/x\K/y/> -- rgs
+ */
+ RExC_seen |= REG_LOOKBEHIND_SEEN;
+ }
+ else {
+ if (PASS2) {
+ /* adjust offset so <-- points at the K */
+ ++RExC_parse;
+ ckWARNreg(RExC_parse, "Useless use of \\K in lookbehind/lookahead");
+ --RExC_parse;
+ }
+ /* originally I did goto tryagain here, but that failed
+ * with an Internal urp when a ) immediately followed the \K.
+ * So return something, even if it's NOTHING.
+ */
+ ret = reg_node(pRExC_state, NOTHING);
+ }
+ goto finish_meta_pat;
case 'Z':
ret = reg_node(pRExC_state, SEOL);
*flagp |= SIMPLE;
diff --git a/t/lib/warnings/regcomp b/t/lib/warnings/regcomp
index b9943a0..9d569cd 100644
--- a/t/lib/warnings/regcomp
+++ b/t/lib/warnings/regcomp
@@ -36,3 +36,15 @@ $a = qr/[\c,]/;
EXPECT
"\c," is more clearly written simply as "l" at - line 9.
"\c," is more clearly written simply as "l" at - line 10.
+########
+# regcomp.c - \K in assertion
+use warnings;
+$x = "aaaa";
+$x =~ /(?<=\Ka)/;
+$x =~ /(?=a\Ka)aa/;
+no warnings 'regexp';
+$x =~ /(?<=\Ka)/;
+$x =~ /(?=a\Ka)aa/;
+EXPECT
+Useless use of \K in lookbehind/lookahead in regex; marked by <-- HERE in m/(?<=\K <-- HERE a)/ at - line 4.
+Useless use of \K in lookbehind/lookahead in regex; marked by <-- HERE in m/(?=a\K <-- HERE a)aa/ at - line 5.
diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index 891bb66..d8d5823 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -1498,6 +1498,20 @@ sub run_tests {
$x = "abcde";
$x =~ s/(.)\K/$1/g;
is($x, "aabbccddee", $message);
+
+ no warnings 'regexp';
+ $x = "aaaa";
+ $x =~ /(?<=\Ka)/;
+ is($&, "", "\\K in lookbehind meaningless");
+
+ $x =~ /(?<=(a)\K)/;
+ is($&, "", "\\K in lookbehind meaningless (with nesting)");
+
+ $x =~ /(?=a\Ka)aa/;
+ is($&, "aa", "\\K in lookahead meaningless");
+
+ $x =~ /(?=(a)\Ka)aa/;
+ is($&, "aa", "\\K in lookahead meaningless (nesting)");
}
{
--
1.7.10.4
|
From @TuxOn Wed, 27 May 2015 23:05:34 -0700, "Tony Cook via RT"
that looks like inconsistent indentation -- |
From @jkeenanOn Wed May 27 23:17:14 2015, hmbrand wrote:
Contributors to this RT: Do we have any more ideas on how to proceed further? Thank you very much. |
From @hvdsOn Wed Dec 09 15:49:38 2015, jkeenan wrote:
I think someone with regexp-engine knowledge needs to review the patch; ideally that would be Yves. If he doesn't have time, I can try to do this - it'll take a while, since I'm not yet familiar with the use or implementation of \K, but I should have some free time over the Christmas period. Hugo |
From @demerphqWhile I have not tested it the patch looks fine to me. I have to admit Yves On 28 May 2015 at 08:05, Tony Cook via RT <perlbug-followup@perl.org> wrote:
-- |
From @khwilliamsonOn Thu Dec 10 03:37:21 2015, demerphq wrote:
Tony, It looks to me like you can apply this patch, with a possible watchdog-timer test as well |
From @tonycozOn Sat, 06 Aug 2016 16:44:36 -0700, khw wrote:
Applied as 105c827, I haven't added a watchdog test, since the code doesn't get to that point anymore (the tests don't check against a string that would loop) Tony |
@tonycoz - Status changed from 'open' to 'pending release' |
It was disallowed because: - it was breaking things (GH Perl#14638) - at the time, no-one had the tuits and knowledge to select non-buggy semantics and implement them.
It was disallowed because: - it was breaking things (GH #14638) - at the time, no-one had the tuits and knowledge to select non-buggy semantics and implement them.
Migrated from rt.perl.org#124256 (status was 'pending release')
Searchable as RT124256$
The text was updated successfully, but these errors were encountered: