Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regular Expressions Not Unsetting $1 Vars When Backtracking #2161

Closed
p5pRT opened this issue Jul 1, 2000 · 7 comments
Closed

Regular Expressions Not Unsetting $1 Vars When Backtracking #2161

p5pRT opened this issue Jul 1, 2000 · 7 comments

Comments

@p5pRT
Copy link

@p5pRT p5pRT commented Jul 1, 2000

Migrated from rt.perl.org#3455 (status was 'resolved')

Searchable as RT3455$

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Jul 1, 2000

From mikelambert@home.com

Created by mikelambert@home.com

When perl matches using (), and then backtracks over that code, it does not
unset the $1 or $2 var associated with the match. In 99% of cases, this does
not matter since if the regex fails, all of $1 and $2 are unset, and if it
succeeds, the () is USUALLY run through again, causing it to be reset.

Here are some test cases which produce incorrect results​:
"abac" =~ /^([ab]*?)(b)?(c)$/

The execution of this is pretty simple, but the (b)? is the part that is
causing the problem. It goes through, matching "a", and then stashes the b
in the $2. When it goes to match "c" and fails, it backtracks all the way to
$1, and keeps appending to $1 until it gets to the end of "aba". It then
proceeds to try to match (b)?, and fails. However, this failure does NOT
unset $2, and so the previous $2 is still set as "b". Then it correctly
finds "c", and concludes it's search. If I then 'print join " ", $1,$2,$3',
I get "aba b c", as the old "b" is left around because it was never unset.

This behavior is stemming from the fact that when the ? is outside of the
parenthesis, and it does not match, it does not do anything to the $2 var
for those parenthesis. This can lead to behavior where $2 is undefined and
$3 is defined​:

"ac" =~ /^(a)(b)?(c)$/

It's not unsetting $2 when the (b)? match fails, but rather the undef of $2
is hodling over from the start of the match when it undefined all of the $1
variables.

This bug can manifest itself in more complex scenarios where a failed
attempt to match along one branch, which later succeeds on another branch,
can leave a mark behind on the $1 vars.

Two ways to fix this, are to unset $1 vars upon backtracking if the
parenthesis are followed by a * or ?, since either of those can cause a
subsequent failed optional match to not reset the $1 variable.

The other is to set all failed attempts at remembering data into $1 vars to
"" when they fail, rather than leaving them as in, in the case of (...)? or
(...)*

I've successfully duplicated the bug in perl 5.005_03 on unix, and on perl
5.6.0 on windows.

Please let me know the status of this bug, and what develops.

Thank you,
Mike Lambert

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl v5.6.0:

Configured by gsar at Fri Mar 24 12:36:06 2000.

Summary of my perl5 (revision 5 version 6 subversion 0) configuration:
  Platform:
    osname=MSWin32, osvers=4.0, archname=MSWin32-x86-multi-thread
    uname=''
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    usethreads=undef use5005threads=undef useithreads=define
usemultiplicity=define
    useperlio=undef d_sfio=undef uselargefiles=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef
  Compiler:
    cc='cl.exe', optimize='-O1 -MD -DNDEBUG', gccversion=
    cppflags='-DWIN32'
    ccflags
='-O1 -MD -DNDEBUG -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT  -DPERL_
IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DPERL_MSVCRT_READFIX'
    stdchar='char', d_stdstdio=define, usevfork=false
    intsize=4, longsize=4, ptrsize=4, doublesize=8
    d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=4
    alignbytes=8, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='link', ldflags
'-nologo -nodefaultlib -release  -libpath:"C:\Perl\lib\CORE"  -machine:x86'
    libpth="C:\Perl\lib\CORE"
    libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib
comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib  netapi32.lib
uuid.lib wsock32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib
msvcrt.lib
    libc=C:\Perl\lib\CORE\PerlCRT.lib, so=dll, useshrplib=yes,
libperl=perl56.lib
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ',
ddlflags='-dll -nologo -nodefaultlib -release  -libpath:"C:\Perl\lib\CORE"  
-machine:x86'

Locally applied patches:
    ACTIVEPERL_LOCAL_PATCHES_ENTRY


@INC for perl v5.6.0:
    C:/Perl/lib
    C:/Perl/site/lib
    .


Environment for perl v5.6.0:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)

PATH=C:\JDK1.3\BIN;C:\PERL\BIN;C:\WINDOWS;C:\WINDOWS\COMMAND;C:\WINDOWS\COMM
AND;C:\WINDOWS\SYSTEM
    PERL_BADLANG (unset)
    SHELL (unset)




@p5pRT
Copy link
Author

@p5pRT p5pRT commented Jul 1, 2000

From @vanstyn

In <000701bfe384$ee4ee7c0$6701a8c0@​mongo>, "Mike Lambert" writes​:
:Here are some test cases which produce incorrect results​:
:"abac" =~ /^([ab]*?)(b)?(c)$/

I can confirm that this incorrectly leaves $2 set to "b" in 5.005 and
later; it is correctly undef in 5.004_05 and earlier.

Hugo

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Jul 11, 2000

From @vanstyn

In <000701bfe384$ee4ee7c0$6701a8c0@​mongo>, "Mike Lambert" writes​:
:This is a bug report for perl from mikelambert@​home.com,
:generated with the help of perlbug 1.28 running under perl v5.6.0.
:
:-----------------------------------------------------------------
:[Please enter your report here]
:
:When perl matches using (), and then backtracks over that code, it does not
:unset the $1 or $2 var associated with the match. In 99% of cases, this does
:not matter since if the regex fails, all of $1 and $2 are unset, and if it
:succeeds, the () is USUALLY run through again, causing it to be reset.
:
:Here are some test cases which produce incorrect results​:
:"abac" =~ /^([ab]*?)(b)?(c)$/

The patch below fixes it. All tests (including the new one) pass here.

Hugo

Inline Patch
--- regexec.c.old	Mon May  8 23:30:51 2000
+++ regexec.c	Tue Jul 11 12:36:38 2000
@@ -2964,6 +2964,8 @@
 	    else
 		c1 = c2 = -1000;
 	    PL_reginput = locinput;
+	    if (paren)
+		PL_regendp[paren] = -1;
 	    if (minmod) {
 		CHECKPOINT lastcp;
 		minmod = 0;
--- t/op/re_tests.old	Fri Jan  7 04:16:23 2000
+++ t/op/re_tests	Tue Jul 11 12:30:34 2000
@@ -750,3 +750,4 @@
 ^([a-z]:)	C:/	n	-	-
 '^\S\s+aa$'m	\nx aa	y	-	-
 (^|a)b	ab	y	-	-
+^([ab]*?)(b)?(c)$	abac	y	-$2-	--
@p5pRT
Copy link
Author

@p5pRT p5pRT commented Jul 11, 2000

From @vanstyn

In <200007111144.MAA04446@​crypt.compulink.co.uk>, I write​:
:In <000701bfe384$ee4ee7c0$6701a8c0@​mongo>, "Mike Lambert" writes​:
:​:This is a bug report for perl from mikelambert@​home.com,
:​:generated with the help of perlbug 1.28 running under perl v5.6.0.
:​:
:​:-----------------------------------------------------------------
:​:[Please enter your report here]
:​:
:​:When perl matches using (), and then backtracks over that code, it does not
:​:unset the $1 or $2 var associated with the match. In 99% of cases, this does
:​:not matter since if the regex fails, all of $1 and $2 are unset, and if it
:​:succeeds, the () is USUALLY run through again, causing it to be reset.
:​:
:​:Here are some test cases which produce incorrect results​:
:​:"abac" =~ /^([ab]*?)(b)?(c)$/
:
:The patch below fixes it. All tests (including the new one) pass here.

Here's the parallel patch for 5.005_03.

Hugo

Inline Patch
--- regexec.c.old	Sat Mar 27 17:56:09 1999
+++ regexec.c	Tue Jul 11 12:54:25 2000
@@ -1484,6 +1484,8 @@
 	    else
 		c1 = c2 = -1000;
 	    PL_reginput = locinput;
+	    if (paren)
+		PL_regendp[paren] = NULL;
 	    if (minmod) {
 		CHECKPOINT lastcp;
 		minmod = 0;
--- t/op/re_tests.old	Fri Oct 30 01:28:52 1998
+++ t/op/re_tests	Tue Jul 11 12:56:04 2000
@@ -489,3 +489,4 @@
 (^|x)(c)	ca	y	$2	c
 a*abc?xyz+pqr{3}ab{2,}xy{4,5}pq{0,6}AB{0,}zz	x	n	-	-
 round\(((?>[^()]+))\)	_I(round(xs * sz),1)	y	$1	xs * sz
+^([ab]*?)(b)?(c)$	abac	y	-$2-	--
@p5pRT
Copy link
Author

@p5pRT p5pRT commented Jul 13, 2000

From [Unknown Contact. See original ticket]

Hugo wrote​:

In <000701bfe384$ee4ee7c0$6701a8c0@​mongo>, "Mike Lambert" writes​:
:
:Here are some test cases which produce incorrect results​:
:"abac" =~ /^([ab]*?)(b)?(c)$/

The patch below fixes it. All tests (including the new one) pass here.

Great! But I was hoping it would fix this too​:

$ ./perl -e 'print "not ok​:$1​:$2\n" if "abcab" =~ /(\w)?(abc)\1b/'
not ok​:a​:abc

Should I file a separate bug report?

@p5pRT
Copy link
Author

@p5pRT p5pRT commented Jul 13, 2000

From @vanstyn

In <396E86B2.31C0AE48@​home.com>, Rick Delaney writes​:
:Hugo wrote​:
:>
:> In <000701bfe384$ee4ee7c0$6701a8c0@​mongo>, "Mike Lambert" writes​:
:> :
:> :Here are some test cases which produce incorrect results​:
:> :"abac" =~ /^([ab]*?)(b)?(c)$/
:>
:> The patch below fixes it. All tests (including the new one) pass here.
:
:Great! But I was hoping it would fix this too​:
:
:$ ./perl -e 'print "not ok​:$1​:$2\n" if "abcab" =~ /(\w)?(abc)\1b/'
:not ok​:a​:abc
:
:Should I file a separate bug report?

No, I did it wrong. The backreference needs to be invalidated deeper
in than I thought. Here's a new 5.6.0 patch to apply instead of the
previous version; I'll do an equivalent for 5.005_03 tomorrow, if you
don't have any other little test cases to surprise me with. :)

I'd welcome patches to add more comprehensive tests of this type of
regexp.

Hugo

Inline Patch
--- regexec.c.old	Mon May  8 23:30:51 2000
+++ regexec.c	Fri Jul 14 04:06:49 2000
@@ -221,6 +221,22 @@
 
 #define regcpblow(cp) LEAVE_SCOPE(cp)
 
+#define TRYPAREN(paren, n, input) {				\
+    if (paren) {						\
+	if (n) {						\
+	    PL_regstartp[paren] = HOPc(input, -1) - PL_bostr;	\
+	    PL_regendp[paren] = input - PL_bostr;		\
+	}							\
+	else							\
+	    PL_regendp[paren] = -1;				\
+    }								\
+    if (regmatch(next))						\
+	sayYES;							\
+    if (paren && n)						\
+	PL_regendp[paren] = -1;					\
+}
+
+
 /*
  * pregexec and friends
  */
@@ -2998,16 +3014,7 @@
 				sayNO;
 			}
 			/* PL_reginput == locinput now */
-			if (paren) {
-			    if (ln) {
-				PL_regstartp[paren] = HOPc(locinput, -1) - PL_bostr;
-				PL_regendp[paren] = locinput - PL_bostr;
-			    }
-			    else
-				PL_regendp[paren] = -1;
-			}
-			if (regmatch(next))
-			    sayYES;
+			TRYPAREN(paren, ln, locinput);
 			PL_reginput = locinput;	/* Could be reset... */
 			REGCP_UNWIND;
 			/* Couldn't or didn't -- move forward. */
@@ -3021,16 +3028,7 @@
 			UCHARAT(PL_reginput) == c1 ||
 			UCHARAT(PL_reginput) == c2)
 		    {
-			if (paren) {
-			    if (n) {
-				PL_regstartp[paren] = HOPc(PL_reginput, -1) - PL_bostr;
-				PL_regendp[paren] = PL_reginput - PL_bostr;
-			    }
-			    else
-				PL_regendp[paren] = -1;
-			}
-			if (regmatch(next))
-			    sayYES;
+			TRYPAREN(paren, n, PL_reginput);
 			REGCP_UNWIND;
 		    }
 		    /* Couldn't or didn't -- move forward. */
@@ -3064,16 +3062,7 @@
 			    UCHARAT(PL_reginput) == c1 ||
 			    UCHARAT(PL_reginput) == c2)
 			    {
-				if (paren && n) {
-				    if (n) {
-					PL_regstartp[paren] = HOPc(PL_reginput, -1) - PL_bostr;
-					PL_regendp[paren] = PL_reginput - PL_bostr;
-				    }
-				    else
-					PL_regendp[paren] = -1;
-				}
-				if (regmatch(next))
-				    sayYES;
+				TRYPAREN(paren, n, PL_reginput);
 				REGCP_UNWIND;
 			    }
 			/* Couldn't or didn't -- back up. */
@@ -3088,8 +3077,7 @@
 			    UCHARAT(PL_reginput) == c1 ||
 			    UCHARAT(PL_reginput) == c2)
 			    {
-				if (regmatch(next))
-				    sayYES;
+				TRYPAREN(paren, n, PL_reginput);
 				REGCP_UNWIND;
 			    }
 			/* Couldn't or didn't -- back up. */
--- t/op/re_tests.old	Fri Jul 14 04:14:36 2000
+++ t/op/re_tests	Fri Jul 14 03:47:13 2000
@@ -750,3 +750,5 @@
 ^([a-z]:)	C:/	n	-	-
 '^\S\s+aa$'m	\nx aa	y	-	-
 (^|a)b	ab	y	-	-
+^([ab]*?)(b)?(c)$	abac	y	-$2-	--
+(\w)?(abc)\1b	abcab	n	-	-
@p5pRT
Copy link
Author

@p5pRT p5pRT commented Jul 13, 2000

From @jhi

No, I did it wrong. The backreference needs to be invalidated deeper
in than I thought. Here's a new 5.6.0 patch to apply instead of the
previous version; I'll do an equivalent for 5.005_03 tomorrow, if you
don't have any other little test cases to surprise me with. :)

Thanks, I replaced change #4337 (the first version of this patch)
with this new one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant