Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/g failure with zero-width patterns #9831

Closed
p5pRT opened this issue Aug 16, 2009 · 18 comments
Closed

/g failure with zero-width patterns #9831

p5pRT opened this issue Aug 16, 2009 · 18 comments
Labels

Comments

@p5pRT
Copy link
Collaborator

@p5pRT p5pRT commented Aug 16, 2009

Migrated from rt.perl.org#68564 (status was 'resolved')

Searchable as RT68564$

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 16, 2009

From @ikegami

Created by @ikegami

A regression was introduced into 5.10.0 concerning /g and zero-width
patterns. The demo speaks for itself​:

c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
abc
bc
c

c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
abc

I expect the 5.8 behaviour. 5.6.x and 5.8.x behave like 5.8.9. I've been
told 5.10.1-RC1 behaves like 5.10.0. I don't know about blead.

Eric "ikegami" Brine

Perl Info

Flags:
    category=core
    severity=low

Site configuration information for perl 5.10.0:

Configured by SYSTEM at Wed Sep  3 13:16:08 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread
    uname=''
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
-D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE
-DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS
-DUSE_PERLIO -DPERL_MSVCRT_READFIX',
    optimize='-MD -Zi -DNDEBUG -O1',
    cppflags='-DWIN32'
    ccversion='12.00.8804', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64',
lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf
-libpath:"C:\progs\perl5100\lib\CORE"  -machine:x86'
    libpth=\lib
    libs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib
comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib  netapi32.lib
uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib
msvcrt.lib
    perllibs=  oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib
comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib  netapi32.lib
uuid.lib ws2_32.lib mpr.lib winmm.lib  version.lib odbc32.lib odbccp32.lib
msvcrt.lib
    libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl510.lib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug
-opt:ref,icf  -libpath:"C:\progs\perl5100\lib\CORE"  -machine:x86'

Locally applied patches:
    ACTIVEPERL_LOCAL_PATCHES_ENTRY
    33741 avoids segfaults invoking S_raise_signal() (on Linux)
    33763 Win32 process ids can have more than 16 bits
    32809 Load 'loadable object' with non-default file extension
    32728 64-bit fix for Time::Local


@INC for perl 5.10.0:
    c:/progs/perl5100/site/lib
    c:/progs/perl5100/lib
    .


Environment for perl 5.10.0:
    HOME (unset)
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)

PATH=c:\bin;c:\progs\perl5100\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\WBEM
    PERL_BADLANG (unset)
    SHELL (unset)

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 16, 2009

From @hvds

"Eric Brine" (via RT) <perlbug-followup@​perl.org> wrote​:
:A regression was introduced into 5.10.0 concerning /g and zero-width
:patterns. The demo speaks for itself​:
:
:>c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
:abc
:bc
:c
:
:>c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
:abc

With -Dr the first difference between the two cases is that the failing
case discovers optimizations "stclass NSPACE plus minlen 0" whereas the
other finds only "minlen 0".

The failure case searches a second time for a match starting from offset 0,
finds the /\S+/ as before, and then bails with​:
  Match possible, but length=0 is smaller than requested=1, failing!
  Contradicts stclass... [regexec_flags]
  Match failed
instead of advancing to offset 1 and trying again.

I'll look further to see if there's an opportunity to improve stclass
support to cope with the case, or whether we must instead revert to
either not detecting or throwing away the stclass in this scenario.

Hugo

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 16, 2009

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 17, 2009

From @demerphq

2009/8/16 <hv@​crypt.org>​:

"Eric Brine" (via RT) <perlbug-followup@​perl.org> wrote​:
:A regression was introduced into 5.10.0 concerning /g and zero-width
:patterns. The demo speaks for itself​:
:
:>c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
:abc
:bc
:c
:
:>c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
:abc

With -Dr the first difference between the two cases is that the failing
case discovers optimizations "stclass NSPACE plus minlen 0" whereas the
other finds only "minlen 0".

The failure case searches a second time for a match starting from offset 0,
finds the /\S+/ as before, and then bails with​:
 Match possible, but length=0 is smaller than requested=1, failing!
 Contradicts stclass... [regexec_flags]
 Match failed
instead of advancing to offset 1 and trying again.

I'll look further to see if there's an opportunity to improve stclass
support to cope with the case, or whether we must instead revert to
either not detecting or throwing away the stclass in this scenario.

This rings a bell for me. Is this really a regression or is it a bug fix?

Why do we assume the first is correct?

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 17, 2009

From @iabyn

On Mon, Aug 17, 2009 at 04​:36​:08PM +0200, demerphq wrote​:

2009/8/16 <hv@​crypt.org>​:

"Eric Brine" (via RT) <perlbug-followup@​perl.org> wrote​:
:A regression was introduced into 5.10.0 concerning /g and zero-width
:patterns. The demo speaks for itself​:
:
:>c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
:abc
:bc
:c
:
:>c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
:abc

With -Dr the first difference between the two cases is that the failing
case discovers optimizations "stclass NSPACE plus minlen 0" whereas the
other finds only "minlen 0".

The failure case searches a second time for a match starting from offset 0,
finds the /\S+/ as before, and then bails with​:
 Match possible, but length=0 is smaller than requested=1, failing!
 Contradicts stclass... [regexec_flags]
 Match failed
instead of advancing to offset 1 and trying again.

I'll look further to see if there's an opportunity to improve stclass
support to cope with the case, or whether we must instead revert to
either not detecting or throwing away the stclass in this scenario.

This rings a bell for me. Is this really a regression or is it a bug fix?

Why do we assume the first is correct?

Well, I'd expect
a) a look-ahead assertion not to affect pos (so apart from causing
additional possible match failures, it should kind of be invisible
b) which means that the first match should leave pos unaffected,
c) which following the 'if pos unchanged bump it by 1 to avoid infinite
loops' rule, means that I would expect the pattern to be tried against
<abc >, <bc >, <c >, < > in turn (and fail on the fourth match).

--
The Enterprise is involved in a bizarre time-warp experience which is in
some way unconnected with the Late 20th Century.
  -- Things That Never Happen in "Star Trek" #14

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 17, 2009

From Eirik-Berg.Hanssen@allverden.no

demerphq <demerphq@​gmail.com> writes​:

2009/8/16 <hv@​crypt.org>​:

"Eric Brine" (via RT) <perlbug-followup@​perl.org> wrote​:
:A regression was introduced into 5.10.0 concerning /g and zero-width
:patterns. The demo speaks for itself​:
:
:>c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
:abc
:bc
:c
:
:>c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
:abc

This rings a bell for me. Is this really a regression or is it a bug fix?

Why do we assume the first is correct?

  Because, since the match is zero-length, we expect[1] pos() to be
advanced by 1 before trying to match again. Instead, it seems to be
advanced by the length of the captures or something ...

sidhekin@​blackbox[18​:07​:50]$ 588perl -le '$x=" abc xyz ";
print pos $x, " $1 ", length($&) while $x =
/(?=(\S+))/g'
1 abc 0
2 bc 0
3 c 0
5 xyz 0
6 yz 0
7 z 0
sidhekin@​blackbox[18​:07​:56]$ 5100perl -le '$x=" abc xyz ";
print pos $x, " $1 ", length($&) while $x =
/(?=(\S+))/g'
1 abc 0
5 xyz 0
sidhekin@​blackbox[18​:07​:58]~$

[1]​: Disclaimer​: Our expectations may well be the problem. :-\

Eirik
--
Machine. Unexpectedly, I'd invented a time
  - Alan Moore

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 17, 2009

From @ikegami

On Mon, Aug 17, 2009 at 10​:36 AM, demerphq <demerphq@​gmail.com> wrote​:

"Eric Brine" (via RT) <perlbug-followup@​perl.org> wrote​:
:>c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
:abc
:bc
:c
:
:>c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
:abc

This rings a bell for me. Is this really a regression or is it a bug fix?

Why do we assume the first is correct?

Well something is broken. With 5.10.0 (debian lenny)​:

$ perl -wle'print for "abc " =~ /(?=(\S+))/g'
abc

$ perl -wle'print for "abc " =~ /(?{})(?=(\S+))/g'
abc
bc
c

$ perl -wle'print for "abc " =~ /x*(?=(\S+))/g'
abc
bc
c

These should all return the same.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Aug 17, 2009

From @Abigail

On Mon, Aug 17, 2009 at 12​:37​:45PM -0400, Eric Brine wrote​:

On Mon, Aug 17, 2009 at 10​:36 AM, demerphq <demerphq@​gmail.com> wrote​:

"Eric Brine" (via RT) <perlbug-followup@​perl.org> wrote​:
:>c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
:abc
:bc
:c
:
:>c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
:abc

This rings a bell for me. Is this really a regression or is it a bug fix?

Why do we assume the first is correct?

Well something is broken. With 5.10.0 (debian lenny)​:

$ perl -wle'print for "abc " =~ /(?=(\S+))/g'
abc

$ perl -wle'print for "abc " =~ /(?{})(?=(\S+))/g'
abc
bc
c

$ perl -wle'print for "abc " =~ /x*(?=(\S+))/g'
abc
bc
c

These should all return the same.

Something related (broken only in 5.10, fixed in 5.10.1-RC)​:

  $ /opt/perl/5.8.8/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g'
  abc
  bc
  c
  $ /opt/perl/5.10.0/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g'
  $
  $ /opt/perl/5.10.1-RC1/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g'
  abc
  bc
  c

Not sure what blead does - the last pull I did didn't build.

Abigail

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 10, 2010

From @cpansprout

On Sun Aug 16 01​:15​:58 2009, ikegami@​adaelis.com wrote​:

A regression was introduced into 5.10.0 concerning /g and zero-width
patterns. The demo speaks for itself​:

c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
abc
bc
c

c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
abc

This was broken by commit 07be1b8,
which extended stclass optimisations to (?=).

I tried following the code paths for + and for {1,} (which are meant to
be identical, but only {1,} was working). I noticed they diverged as a
result of + having PREGf_SKIP set.

So I fixed it by not setting PREGf_SKIP if the + is inside a (?=).

I really don’t understand this code, and would much appreciate any
feedback as to whether this fix (or ‘fix’) will break anything else.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 10, 2010

From @cpansprout

From​: Father Chrysostomos <sprout@​cpan.org>

[perl #68564] /g failure with zero-width patterns

To make "abc " =~ /(?=(\S+))/g work again (it should return "abc",
"bc", "c" on each successive match, respectively; it was returning
just "abc"), this commit disables the PREGf_SKIP optimisation on a
PLUS if it is inside an IFMATCH (?=).

Inline Patch
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index b478273..bb1af44 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -698,6 +698,13 @@ Stringifying a scalar containing -0.0 no longer has the affect of turning
 false into true
 L<[perl #45133]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=45133>.
 
+=item *
+
+Using a regular expression lookhead assertion containing a subpattern
+quantified with a C<+> inside a global pattern (i.e.,
+C<"abc " =~ /(?=(\S+))/g>) now works again, as it did in 5.8.x
+L<[perl #68564]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=68564>.
+
 =back
 
 =head1 Known Problems
diff --git a/regcomp.c b/regcomp.c
index d6f3523..c3867e9 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -4595,6 +4595,7 @@ reStudy:
 	I32 last_close = 0; /* pointed to by data */
         regnode *first= scan;
         regnode *first_next= regnext(first);
+        bool lookahead = FALSE;
 	
 	/*
 	 * Skip introductions and multiplicators >= 1
@@ -4611,7 +4612,7 @@ reStudy:
 	       /* An OR of *one* alternative - should not happen now. */
 	    (OP(first) == BRANCH && OP(first_next) != BRANCH) ||
 	    /* for now we can't handle lookbehind IFMATCH*/
-	    (OP(first) == IFMATCH && !first->flags) || 
+	    (OP(first) == IFMATCH && !first->flags && (lookahead = 1)) || 
 	    (OP(first) == PLUS) ||
 	    (OP(first) == MINMOD) ||
 	       /* An {n,m} with n>0 */
@@ -4699,7 +4700,8 @@ reStudy:
 	    goto again;
 	}
 	if (sawplus && (!sawopen || !RExC_sawback)
-	    && !(RExC_seen & REG_SEEN_EVAL)) /* May examine pos and $& */
+	    && !(RExC_seen & REG_SEEN_EVAL)
+	    && !lookahead) /* May examine pos and $& */
 	    /* x+ must match at the 1st pos of run of x's */
 	    r->intflags |= PREGf_SKIP;
 
diff --git a/t/re/pat_rt_report.t b/t/re/pat_rt_report.t
index 33b6f7c..cf176ec 100644
--- a/t/re/pat_rt_report.t
+++ b/t/re/pat_rt_report.t
@@ -21,7 +21,7 @@ BEGIN {
 }
 
 
-plan tests => 2510;  # Update this when adding/deleting tests.
+plan tests => 2511;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -1181,6 +1181,14 @@ sub run_tests {
 
         iseq($first, $second);
     }    
+
+    {
+	local $BugId = 68564;	# minimal CURLYM limited to 32767 matches
+	iseq
+	  join("-", "abc " =~ /(?=(\S+))/g),
+	 "abc-bc-b",
+	 "[perl #68564] stclass optimisation does not break + inside (?=)";
+    }
 } # End of sub run_tests
 
 1;
@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 11, 2010

From @nwc10

On Mon, Aug 17, 2009 at 10​:55​:09PM +0200, Abigail wrote​:

Something related (broken only in 5.10, fixed in 5.10.1-RC)​:

$ /opt/perl/5.8.8/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g'
abc
bc
c
$ /opt/perl/5.10.0/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g'
$
$ /opt/perl/5.10.1-RC1/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g'
abc
bc
c

Not sure what blead does - the last pull I did didn't build.

I bisected with

#!/bin/sh
git clean -dxf
touch .patchnum
touch .sha1
touch unpushed.h
# If you can use ccache, add -Dcc=ccache\ gcc -Dld=gcc to the Configure line
# if Encode is not needed for the test, you can speed up the bisect by
# excluding it from the runs with -Dnoextensions=Encode
sh Configure -des -Dusedevel -Uusethreads -Doptimize="-g" -Dcc=ccache\ gcc -Dld=gcc -Dnoextensions=IPC/SysV\ Encode\ DB_File
test -f config.sh || exit 125
# Correct makefile for newer GNU gcc
perl -ni -we 'print unless /<(?​:built-in|command)/' makefile x2p/makefile
# if you just need miniperl, replace test_prep with miniperl
make -j5 miniperl
[ -x ./miniperl ] || exit 125
./miniperl -wle 'push @​a, $_ for "abc" =~ /(?=)(?=(\S+))/g; exit !!@​a'
ret=$?
[ $ret -gt 127 ] && ret=127
git clean -dxf
exit $ret

and found that this was fixed by

commit 89c6a13
Author​: Ævar Arnfjörð Bjarmason <avar@​cpan.org>
Date​: Thu Apr 10 00​:38​:52 2008 +0000

  Re​: [perl #52672] regexp failure​: (?=) turns into OPFAIL
  From​: "Ævar Arnfjörð Bjarmason" <avarab@​gmail.com>
  Message-ID​: <51dd1af80804091738r15d37763lf900d59f8bcc5e81@​mail.gmail.com>
 
  p4raw-id​: //depot/perl@​33667

Inline Patch
diff --git a/regcomp.c b/regcomp.c
index 05b54ee..07d5535 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -5691,6 +5691,8 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth)
                 RExC_seen |= REG_SEEN_LOOKBEHIND;
 		RExC_parse++;
 	    case '=':           /* (?=...) */
+		RExC_seen_zerolen++;
+			break;
 	    case '!':           /* (?!...) */
 		RExC_seen_zerolen++;
 	        if (*RExC_parse == ')') {
diff --git a/t/op/re_tests b/t/op/re_tests
index 3df3745..6894458 100644
--- a/t/op/re_tests
+++ b/t/op/re_tests
@@ -414,6 +414,7 @@ a[-]?c	ac	y	$&	ac
 '(abc)\1'i	ABCABC	y	$1	ABC
 '([a-c]*)\1'i	ABCABC	y	$1	ABC
 a(?!b).	abad	y	$&	ad
+(?=)a	a	y	$&	a
 a(?=d).	abad	y	$&	ad
 a(?=c|d).	abad	y	$&	ad
 a(?:b|c|d)(.)	ace	y	$1	e



I don't know if that test is equivalent to your code. Should your code be added as an additional test?

Nicholas Clark

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 11, 2010

From @khwilliamson

Father Chrysostomos via RT wrote​:

On Sun Aug 16 01​:15​:58 2009, ikegami@​adaelis.com wrote​:

A regression was introduced into 5.10.0 concerning /g and zero-width
patterns. The demo speaks for itself​:

c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
abc
bc
c

c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
abc

This was broken by commit 07be1b8,
which extended stclass optimisations to (?=).

I tried following the code paths for + and for {1,} (which are meant to
be identical, but only {1,} was working). I noticed they diverged as a
result of + having PREGf_SKIP set.

So I fixed it by not setting PREGf_SKIP if the + is inside a (?=).

I really don’t understand this code, and would much appreciate any
feedback as to whether this fix (or ‘fix’) will break anything else.

I'm afraid I don't know this code either.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 12, 2010

From @ikegami

On Mon, Oct 11, 2010 at 10​:39 AM, Nicholas Clark <nick@​ccl4.org> wrote​:

I don't know if that test is equivalent to your code.
Should your code be added as an additional test?

I don't see why not. Especially since it still fails for blead.

Patch with test attached.

- Eric

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 12, 2010

From @ikegami

0001-Test-for-RT-68564-perl-wle-print-for-abc-S-g.patch
From 4508e9547d2d335f019b9cdfb4b48881749a46be Mon Sep 17 00:00:00 2001
From: Eric Brine <ikegami@adaelis.com>
Date: Mon, 11 Oct 2010 17:57:03 -0700
Subject: [PATCH] Test for RT##68564: perl -wle"print for 'abc' =~ /(?=(\S+))/g"

---
 t/re/pat.t |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/t/re/pat.t b/t/re/pat.t
index c007880..b1d43f7 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@@ -202,6 +202,15 @@ sub run_tests {
     }
 
     {
+        local $Message = "Test zero-width match with /g";
+        local $" = ":";
+        $_ = "abc";
+        my @words = 'abc' =~ /(?=(\S+))/g;
+        my $exp   = "abc:cb:c";
+        iseq "@words", $exp;
+    }
+
+    {
         $_ = "abcdefghi";
 
         my $pat1 = 'def';
-- 
1.7.1.1

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 12, 2010

From @ikegami

On Mon, Oct 11, 2010 at 9​:23 PM, Eric Brine <ikegami@​adaelis.com> wrote​:

On Mon, Oct 11, 2010 at 10​:39 AM, Nicholas Clark <nick@​ccl4.org> wrote​:

I don't know if that test is equivalent to your code.
Should your code be added as an additional test?

I don't see why not. Especially since it still fails for blead.

Patch with test attached.

- Eric

Forgot to increment number of tests. Updated patch attached.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Oct 12, 2010

From @ikegami

From e81434cbf81a1bb93004b920bb35c502c51bf182 Mon Sep 17 00​:00​:00 2001
From​: Eric Brine <ikegami@​adaelis.com>
Date​: Mon, 11 Oct 2010 18​:43​:01 -0700
Subject​: [PATCH] Test for RT##68564​: perl -wle"print for 'abc' =~ /(?=(\S+))/g"


t/re/pat.t | 11 ++++++++++-
1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/t/re/pat.t b/t/re/pat.t
index c007880..c226b62 100644
--- a/t/re/pat.t
+++ b/t/re/pat.t
@​@​ -23,7 +23,7 @​@​ BEGIN {
}

-plan tests => 398; # Update this when adding/deleting tests.
+plan tests => 399; # Update this when adding/deleting tests.

run_tests() unless caller;

@​@​ -202,6 +202,15 @​@​ sub run_tests {
  }

  {
+ local $Message = "Test zero-width match with /g";
+ local $" = "​:";
+ $_ = "abc";
+ my @​words = 'abc' =~ /(?=(\S+))/g;
+ my $exp = "abc​:cb​:c";
+ iseq "@​words", $exp;
+ }
+
+ {
  $_ = "abcdefghi";

  my $pat1 = 'def';
--
1.7.1.1

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Nov 3, 2010

From @demerphq

On 10 October 2010 23​:27, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

On Sun Aug 16 01​:15​:58 2009, ikegami@​adaelis.com wrote​:

A regression was introduced into 5.10.0 concerning /g and zero-width
patterns. The demo speaks for itself​:

c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
abc
bc
c

c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g"
abc

This was broken by commit 07be1b8,
which extended stclass optimisations to (?=).

I tried following the code paths for + and for {1,} (which are meant to
be identical, but only {1,} was working). I noticed they diverged as a
result of + having PREGf_SKIP set.

So I fixed it by not setting PREGf_SKIP if the + is inside a (?=).

I really don’t understand this code, and would much appreciate any
feedback as to whether this fix (or ‘fix’) will break anything else.

I have looked into this more deeply and applied a modified version of
your patch, which IMO was more or less exactly correct.

Here is the commit message I wrote​:

commit e7f38d0
Author​: Yves Orton <demerphq@​gmail.com>
Date​: Wed Nov 3 10​:23​:00 2010 +0100

  fix 68564​: /g failure with zero-width patterns

  This is based on a patch by Father Chrysostomos <sprout@​cpan.org>

  The start class optimisation has two modes, "try every valid start
  position" (doevery) and "flip flop mode" (!doevery) where it trys
  only the first valid start position in a sequence.

  Consider /(\d+)X/ and the string "123456Y", now we know that if we fail
  to match X after matching "123456" then we will also fail to match after
  "23456" (assuming no evil tricks are in place, which disable the
  optimisation anyway), so we know we can skip forward until the check
  /fails/ and only then start looking for a real match. This is flip-flop
  mode.

  Now consider the case with zero-width lookahead under /g​: /(?=(\d+)X)/.
  In this case we have an additional failure mode, that is failure when
  we match a zero-width string twice at the same pos(). So now, the
  "flip-flop" logic breaks as it /is/ possible that we could match at
  "23456" when we couldn't match at "123456" because of the zero-length
  twice at the same pos() rule. For instance​:

  print $1 for "123"=~/(?=(\d+))/g

  should first match "123". Since $& is zero length, pos() is not
  incremented. We then match again, successfully, except that the match
  is rejected despite technical-success because its $& is also zero
  length and pos() has not advanced. If the flip-flop mode is enabled
  we wont retry until we find a failing character first.

  The point here is that it makes perfect sense to disable the
  "flip-flop" mode optimisation when the start class is inside
  a lookahead as it really doesnt apply.

IMO your patch was quite right, although I had to dig fairly deep to
understand why.

Thanks for the patch.

BTW, I am a bit curious if there are any other flaws in the flip-flop logic.
I tried reasonably hard to make it fail without the zero-width
lookahead, and was
unable to find a failure case, but I still kinda feel like there might
be some interesting
edge case

Cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Nov 3, 2010

@cpansprout - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this Nov 3, 2010
@p5pRT p5pRT added the Severity Low label Oct 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant
You can’t perform that action at this time.