Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regular expression with lookahead does not match, while it should. #13954

Closed
p5pRT opened this issue Jun 24, 2014 · 11 comments
Labels

Comments

@p5pRT
Copy link
Collaborator

@p5pRT p5pRT commented Jun 24, 2014

Migrated from rt.perl.org#122171 (status was 'resolved')

Searchable as RT122171$

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 24, 2014

From dietmar.schindler@manroland-web.com

This is a bug report for perl from dietmar.schindler@​manroland-web.com,
generated with the help of perlbug 1.40 running under perl 5.20.0.


I observed the following bug in Perl versions 5.10.1, 5.18.2 and 5.20.0, while
versions 5.6.2 and 5.8.7 behaved correctly.
Regular expressions similar to /(?=.*P)P/ do not match certain text which they
should match. Example​:
  " P" =~ /(?=.*P)P/
yields no match, i. e.
  print " P" =~ /(?=.*P)P/
prints nothing instead of 1.



Flags​:
  category=core
  severity=high


Site configuration information for perl 5.20.0​:

Configured by strawberry-perl at Fri May 30 08​:28​:40 2014.

Summary of my perl5 (revision 5 version 20 subversion 0) configuration​:

  Platform​:
  osname=MSWin32, osvers=6.2, archname=MSWin32-x64-multi-thread
  uname='Win32 strawberry-perl 5.20.0.1 #1 Fri May 30 08​:27​:00 2014 x64'
  config_args='undef'
  hint=recommended, useposix=true, d_sigaction=undef
  useithreads=define, usemultiplicity=define
  use64bitint=define, use64bitall=undef, uselongdouble=undef
  usemymalloc=n, bincompat5005=undef
  Compiler​:
  cc='gcc', ccflags =' -s -O2 -DWIN32 -DWIN64 -DCONSERVATIVE -DPERL_TEXTMODE_SCRIPTS -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -fwrapv -fno-strict-aliasing -mms-bitfields',
  optimize='-s -O2',
  cppflags='-DWIN32'
  ccversion='', gccversion='4.8.2', gccosandvers=''
  intsize=4, longsize=4, ptrsize=8, doublesize=8, byteorder=12345678
  d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
  ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='long long', lseeksize=8
  alignbytes=8, prototype=define
  Linker and Libraries​:
  ld='g++', ldflags ='-s -L"C​:\STRAWB1\perl\lib\CORE" -L"C​:\STRAWB1\c\lib"'
  libpth=C​:\STRAWB1\c\lib C​:\STRAWB1\c\x86_64-w64-mingw32\lib C​:\STRAWB1\c\lib\gcc\x86_64-w64-mingw32\4.8.2
  libs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
  perllibs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
  libc=, so=dll, useshrplib=true, libperl=libperl520.a
  gnulibc_version=''
  Dynamic Linking​:
  dlsrc=dl_win32.xs, dlext=xs.dll, d_dlsymun=undef, ccdlflags=' '
  cccdlflags=' ', lddlflags='-mdll -s -L"C​:\STRAWB
1\perl\lib\CORE" -L"C​:\STRAWB~1\c\lib"'


@​INC for perl 5.20.0​:
  C​:/Strawberry/perl/site/lib
  C​:/Strawberry/perl/vendor/lib
  C​:/Strawberry/perl/lib
  .


Environment for perl 5.20.0​:
  HOME (unset)
  LANG (unset)
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=C​:\Program Files (x86)\AMD APP\bin\x86_64;C​:\Program Files (x86)\AMD APP\bin\x86;C​:\oracle\ora11g_64bit\bin;C​:\oracle\ora11g_32bit\bin;C​:\Windows\system32;C​:\Windows;C​:\Windows\System32\Wbem;C​:\Windows\System32\WindowsPowerShell\v1.0\;C​:\Program Files (x86)\ATI Technologies\ATI.ACE\Core-Static;C​:\Program Files\Microsoft Network Monitor 3\;C​:\Strawberry\c\bin;C​:\Strawberry\perl\site\bin;C​:\Strawberry\perl\bin
  PERL_BADLANG (unset)
  SHELL (unset)
________________________________
manroland web systems GmbH -- Managing Director​: Joern Gossé
Registered Office​: Augsburg -- Trade Register​: AG Augsburg -- HRB-No.​: 26816 -- VAT​: DE281389840

Confidentiality note​:
This eMail and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you are not the intended recipient, you are hereby notified that any use or dissemination of this communication is strictly prohibited. If you have received this eMail in error, then please delete this eMail.

! Please consider your environmental responsibility before printing this eMail !
________________________________

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 24, 2014

From @jkeenan

On Tue Jun 24 00​:45​:27 2014, dietmar.schindler@​manroland-web.com wrote​:

This is a bug report for perl from dietmar.schindler@​manroland-
web.com,
generated with the help of perlbug 1.40 running under perl 5.20.0.

-----------------------------------------------------------------
I observed the following bug in Perl versions 5.10.1, 5.18.2 and
5.20.0, while
versions 5.6.2 and 5.8.7 behaved correctly.
Regular expressions similar to /(?=.*P)P/ do not match certain text
which they
should match. Example​:
" P" =~ /(?=.*P)P/
yields no match, i. e.
print " P" =~ /(?=.*P)P/
prints nothing instead of 1.
--------------------------------------------------------------

I don't believe this is a bug in Perl 5.

According to 'perldoc perlre' in blead​: "(?=pattern)" is used as "A zero-width positive look-ahead assertion."

If we are to look ahead, we have to have something in the pattern *from which we are looking ahead*. I don't see anything like that in your pattern. Did you perhaps want a positive look-behind pattern "(?<=pattern)"?

Thank you very much.
Jim Keenan

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 24, 2014

The RT System itself - Status changed from 'new' to 'open'

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 24, 2014

From Eirik-Berg.Hanssen@allverden.no

On Tue, Jun 24, 2014 at 10​:52 PM, James E Keenan via RT <
perlbug-followup@​perl.org> wrote​:

I don't believe this is a bug in Perl 5.

According to 'perldoc perlre' in blead​: "(?=pattern)" is used as "A
zero-width positive look-ahead assertion."

If we are to look ahead, we have to have something in the pattern *from
which we are looking ahead*. I don't see anything like that in your
pattern. Did you perhaps want a positive look-behind pattern
"(?<=pattern)"?

  Huh? It's supposed to be looking ahead from the starting position, which
will advance on backtracking, since this pattern is not anchored. Observe​:

$ perl -wE '"foo bar" =~ /(?=[bo])(?{ say pos })(?!)/'
1
2
4
$

  Observe also​:

$ perl -E 'say " P" =~ /(?=P)P/'
1
$ perl -E 'say " P" =~ /(?=.*P)P/'

$

  Why shouldn't .* match the empty string? I'm pretty sure this is a
genuine bug in Perl 5.

Eirik

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 24, 2014

From @khwilliamson

On 06/24/2014 03​:39 PM, Eirik Berg Hanssen wrote​:

On Tue, Jun 24, 2014 at 10​:52 PM, James E Keenan via RT
<perlbug-followup@​perl.org <mailto​:perlbug-followup@​perl.org>> wrote​:

I don't believe this is a bug in Perl 5\.

According to 'perldoc perlre' in blead&#8203;:  "\(?=pattern\)" is used as "A
zero\-width positive look\-ahead assertion\."

If we are to look ahead\, we have to have something in the pattern
\*from which we are looking ahead\*\.  I don't see anything like that
in your pattern\.  Did you perhaps want a positive look\-behind
pattern "\(?\<=pattern\)"?

Huh? It's supposed to be looking ahead from the starting position,
which will advance on backtracking, since this pattern is not anchored.
Observe​:

$ perl -wE '"foo bar" =~ /(?=[bo])(?{ say pos })(?!)/'
1
2
4
$

Observe also​:

$ perl -E 'say " P" =~ /(?=P)P/'
1
$ perl -E 'say " P" =~ /(?=.*P)P/'

$

Why shouldn't .* match the empty string? I'm pretty sure this is a
genuine bug in Perl 5.

Eirik

The commit that changed the behavior from the original to the current
was this​:
07be1b8 is the first bad commit
commit 07be1b8
Author​: Yves Orton <demerphq@​gmail.com>
Date​: Fri Jun 9 02​:56​:37 2006 +0200

  Re​: [PATCH] Better version of the Aho-Corasick patch and lots of
benchmarks.
  Message-ID​:
<9b18b3110606081556t779de698r82f361d82a05fbc8@​mail.gmail.com>

  (with tweaks)

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 25, 2014

From @demerphq

On 25 June 2014 00​:28, Karl Williamson <public@​khwilliamson.com> wrote​:

On 06/24/2014 03​:39 PM, Eirik Berg Hanssen wrote​:

On Tue, Jun 24, 2014 at 10​:52 PM, James E Keenan via RT
<perlbug-followup@​perl.org <mailto​:perlbug-followup@​perl.org>> wrote​:

I don't believe this is a bug in Perl 5\.

According to 'perldoc perlre' in blead&#8203;:  "\(?=pattern\)" is used as "A
zero\-width positive look\-ahead assertion\."

If we are to look ahead\, we have to have something in the pattern
\*from which we are looking ahead\*\.  I don't see anything like that
in your pattern\.  Did you perhaps want a positive look\-behind
pattern "\(?\<=pattern\)"?

Huh? It's supposed to be looking ahead from the starting position,
which will advance on backtracking, since this pattern is not anchored.
Observe​:

$ perl -wE '"foo bar" =~ /(?=[bo])(?{ say pos })(?!)/'
1
2
4
$

Observe also​:

$ perl -E 'say " P" =~ /(?=P)P/'
1
$ perl -E 'say " P" =~ /(?=.*P)P/'

$

Why shouldn't .* match the empty string? I'm pretty sure this is a
genuine bug in Perl 5.

Eirik

The commit that changed the behavior from the original to the current was
this​:
07be1b8 is the first bad commit
commit 07be1b8
Author​: Yves Orton <demerphq@​gmail.com>
Date​: Fri Jun 9 02​:56​:37 2006 +0200

Re&#8203;: \[PATCH\] Better version of the Aho\-Corasick patch and lots of

benchmarks.
Message-ID​: <9b18b3110606081556t779de698r8
2f361d82a05fbc8@​mail.gmail.com>

\(with tweaks\)

That surprises me given this​:

$ ./perl -Ilib -Mre=debug -le'" P" =~ /(?=.*P)P/'
Compiling REx "(?=.*P)P"
Final program​:
  1​: IFMATCH[0] (9)
  3​: STAR (5)
  4​: REG_ANY (0)
  5​: EXACT <P> (7)
  7​: SUCCEED (0)
  8​: TAIL (9)
  9​: EXACT <P> (11)
  11​: END (0)
anchored "P" at 0 (checking anchored) anchored(MBOL) implicit minlen 1
Matching REx "(?=.*P)P" against " P"
Intuit​: trying to determine minimum start position...
  Did not find anchored substr "P"...
Match rejected by optimizer
Freeing REx​: "(?=.*P)P"

Why did it decide it is anchored? It shouldn't have decided that.

If you bisected maybe this is a false positive.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 25, 2014

From @demerphq

On 25 June 2014 08​:36, demerphq <demerphq@​gmail.com> wrote​:

On 25 June 2014 00​:28, Karl Williamson <public@​khwilliamson.com> wrote​:

The commit that changed the behavior from the original to the current was
this​:
07be1b8 is the first bad commit
commit 07be1b8
Author​: Yves Orton <demerphq@​gmail.com>
Date​: Fri Jun 9 02​:56​:37 2006 +0200

Re&#8203;: \[PATCH\] Better version of the Aho\-Corasick patch and lots of

benchmarks.
Message-ID​: <9b18b3110606081556t779de698r8
2f361d82a05fbc8@​mail.gmail.com>

\(with tweaks\)

That surprises me given this​:

$ ./perl -Ilib -Mre=debug -le'" P" =~ /(?=.*P)P/'
Compiling REx "(?=.*P)P"
Final program​:
1​: IFMATCH[0] (9)
3​: STAR (5)
4​: REG_ANY (0)
5​: EXACT <P> (7)
7​: SUCCEED (0)
8​: TAIL (9)
9​: EXACT <P> (11)
11​: END (0)
anchored "P" at 0 (checking anchored) anchored(MBOL) implicit minlen 1
Matching REx "(?=.*P)P" against " P"
Intuit​: trying to determine minimum start position...
Did not find anchored substr "P"...
Match rejected by optimizer
Freeing REx​: "(?=.*P)P"

Why did it decide it is anchored? It shouldn't have decided that.

If you bisected maybe this is a false positive.

Or maybe not, I see I did actually touch a bit of logic related to this in
that patch. (Oh the bad old days before git.)

Anyway, it is clearly the optimiser going wrong, as you can see below.
Maybe Dave M has some ideas?

If I find some time to debug further I will let you know what I find.

Yves

$ ./perl -Ilib -Mre=Debug,ALL -le'" P" =~ /(?=.*P)P/'
Assembling pattern from 1 elements
Compiling REx "(?=.*P)P"
Starting first pass (sizing)

(?=.*P)P< | 1| reg
  | | brnc
  | | piec
  | | atom
?=.*P)P< | | reg
.*P)P< | | brnc
  | | piec
  | | atom
P)P< | 2| inst - STAR
  | 3| piec
  | | atom
)P< | 6| inst - IFMATCH
P< | 9| piec
  | | atom
Required size 11 nodes
Starting second pass (creation)
(?=.*P)P< | 1| reg
  | | brnc
  | | piec
  | | atom
?=.*P)P< | | reg
.*P)P< | | brnc
  | | piec
  | | atom
P)P< | 2| inst - STAR
  | 3| piec
  | | atom
)P< | 5| tail~ STAR (1) -> EXACT
  | 6| lsbr~ tying lastbr STAR (1) to ender
SUCCEED (5) offset 4
  | | tail~ STAR (1)
  | | ~ EXACT <P> (3) -> SUCCEED
  | | inst - IFMATCH
  | 9| tsdy~ IFMATCH[0] (1) -> END
  | | ~ attach to TAIL (8) offset to 7
P< | | piec
  | | atom
< | 11| tail~ IFMATCH[0] (1)
  | | ~ TAIL (8) -> EXACT
  | 12| lsbr~ tying lastbr IFMATCH[0] (1) to ender END (11)
offset 10
  | | tail~ IFMATCH[0] (1)
  | | ~ TAIL (8)
  | | ~ EXACT <P> (9) -> END
first​:> 3​: STAR (5)
first​:> 4​: REG_ANY (0)
first at 4
RExC_seen​: REG_UNBOUNDED_QUANTIFIER_SEEN
study_chunk stopparen=-1 depth=0 recursed_depth=0
Peep​:Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' 0​:0/0 *Fixed​:'' @​ 0
Float​: '' @​ 0/0
Peep> 1​: IFMATCH[0] (8)
  study_chunk stopparen=-1 depth=1 recursed_depth=0
  Peep​:Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0
  Peep> 3​: STAR (5)
  study_chunk stopparen=-1 depth=1 recursed_depth=0
  Peep​:Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 INF
  Peep> 5​: EXACT <P> (7)
  join> 5​: EXACT <P> (7)
  study_chunk stopparen=-1 depth=1 recursed_depth=0
  Peep​:Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 INF
  Peep> 7​: SUCCEED (0)
  pre-fin​:Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 INF
  post-fin​:Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 INF
study_chunk stopparen=-1 depth=0 recursed_depth=0
Peep​:Pos​:0/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'' 0​:0/0 *Fixed​:'' @​ 0
Float​: '' @​ 0/0
Peep> 9​: EXACT <P> (11)
  join> 9​: EXACT <P> (11)
pre-fin​:Pos​:1/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'P' 1​:0/0 *Fixed​:'' @​ 0
Float​: '' @​ 0/0
post-fin​:Pos​:1/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'P' 1​:0/0 *Fixed​:'' @​ 0
Float​: '' @​ 0/0
commit​: Pos​:1/0 Flags​: 0x0 Whilem_c​: 0 Lcp​: 0 Last​:'P' -1​:0/0 *Fixed​:'P' @​
0 Float​: '' @​ 0/0
minlen​: 1 r->minlen​:0 maxlen​:0
RExC_seen​: REG_UNBOUNDED_QUANTIFIER_SEEN
Final program​:
  1​: IFMATCH[0] (9)
  3​: STAR (5)
  4​: REG_ANY (0)
  5​: EXACT <P> (7)
  7​: SUCCEED (0)
  8​: TAIL (9)
  9​: EXACT <P> (11)
  11​: END (0)
anchored "P" at 0 (checking anchored) anchored(MBOL) implicit minlen 1
r->extflags​: IS_ANCHORED UNBOUNDED_QUANTIFIER_SEEN USE_INTUIT_NOML
USE_INTUIT_ML
r->intflags​: IMPLICIT ANCH_MBOL
Matching REx "(?=.*P)P" against " P"
Intuit​: trying to determine minimum start position...
  substrs[0]​: min=0 max=0 end shift=0 useful=100 utf8=0 [PVMG("P"\0)]
  substrs[2]​: min=0 max=0 end shift=0 useful=100 utf8=0 [PVMG("P"\0)]
  At restart​: rx_origin=0 Check offset min​: 0 Start shift​: 0 End shift 0
Real end Shift​: 0
  fbm_instr len=1 str=< >
  Did not find anchored substr "P"...
Match rejected by optimizer
Freeing REx​: "(?=.*P)P"

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 25, 2014

From @demerphq

On 25 June 2014 08​:43, demerphq <demerphq@​gmail.com> wrote​:

On 25 June 2014 08​:36, demerphq <demerphq@​gmail.com> wrote​:

On 25 June 2014 00​:28, Karl Williamson <public@​khwilliamson.com> wrote​:

The commit that changed the behavior from the original to the current
was this​:
07be1b8 is the first bad commit
commit 07be1b8
Author​: Yves Orton <demerphq@​gmail.com>
Date​: Fri Jun 9 02​:56​:37 2006 +0200

Re&#8203;: \[PATCH\] Better version of the Aho\-Corasick patch and lots of

benchmarks.
Message-ID​: <9b18b3110606081556t779de698r8
2f361d82a05fbc8@​mail.gmail.com>

\(with tweaks\)

...

If you bisected maybe this is a false positive.

Or maybe not, I see I did actually touch a bit of logic related to this
in that patch. (Oh the bad old days before git.)

I pushed a fix. Since I am "retired" right now I did not push tests.
Someone else can do that.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 25, 2014

From @wolfsage

On Wed, Jun 25, 2014 at 3​:33 AM, demerphq <demerphq@​gmail.com> wrote​:

On 25 June 2014 08​:43, demerphq <demerphq@​gmail.com> wrote​:

I pushed a fix. Since I am "retired" right now I did not push tests. Someone
else can do that.

Thanks. I've added a small test case in 06ab3c1. More would be good though.

-- Matthew Horsfall (alh)

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 27, 2014

From @khwilliamson

Fixed by c445c5b
with test added by 06ab3c1
--
Karl Williamson

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 27, 2014

@khwilliamson - Status changed from 'open' to 'resolved'

@p5pRT p5pRT closed this Jun 27, 2014
@p5pRT p5pRT added the Severity Low label Oct 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant
You can’t perform that action at this time.