Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?) #14475

Closed
p5pRT opened this issue Feb 5, 2015 · 18 comments
Closed

RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?) #14475

p5pRT opened this issue Feb 5, 2015 · 18 comments

Comments

@p5pRT
Copy link
Collaborator

@p5pRT p5pRT commented Feb 5, 2015

Migrated from rt.perl.org#123743 (status was 'resolved')

Searchable as RT123743$

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 5, 2015

From warren.hyde@amd.com

Created by warren.hyde@amd.com

Certain types of (poorly-written) regular expressions have become
painfully slow since the refactoring of regexec.c at some point in
the 5.17 versions. I believe this to be a bug in the behavior, as
the RegEx engine is doing much more (wasted) work trying to match
certain patterns that previous Perl versions (through 5.16.1) had
no problems with.

Specifically, consider this (good, previous behavior) example​:

5.16.1> $ perl5.16.1 -Mre=debug -e '"00000000​: 0c" =~ /.*​:\s*ab/i'
5.16.1> Compiling REx ".*​:\s*ab"
5.16.1> Final program​:
5.16.1> 1​: STAR (3)
5.16.1> 2​: REG_ANY (0)
5.16.1> 3​: EXACTF <​:> (5)
5.16.1> 5​: STAR (7)
5.16.1> 6​: SPACE (0)
5.16.1> 7​: EXACTF <ab> (9)
5.16.1> 9​: END (0)
5.16.1> anchored(MBOL) implicit minlen 3
5.16.1> Matching REx ".*​:\s*ab" against "00000000​: 0c"
5.16.1> 0 <> <00000000​: > | 1​:STAR(3)
5.16.1> REG_ANY can match 12 times out of 2147483647...
5.16.1> 8 <00000000> <​: 0c> | 3​: EXACTF <​:>(5)
5.16.1> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.16.1> SPACE can match 1 times out of 2147483647...
5.16.1> failed...
5.16.1> failed...
5.16.1> Match failed
5.16.1> Freeing REx​: ".*​:\s*ab"

(Using case-insensitive matching disallows the optimizer from looking
for the fixed text "ab".)

Now, compare that with the work done by Perl since 5.18​:

5.18.2> $ perl5.18.2 -Mre=debug -e '"00000000​: 0c" =~ /.*​:\s*ab/i'
5.18.2> Compiling REx ".*​:\s*ab"
5.18.2> Final program​:
5.18.2> 1​: STAR (3)
5.18.2> 2​: REG_ANY (0)
5.18.2> 3​: EXACT <​:> (5)
5.18.2> 5​: STAR (7)
5.18.2> 6​: POSIXD[\s] (0)
5.18.2> 7​: EXACTF <ab> (9)
5.18.2> 9​: END (0)
5.18.2> floating "​:" at 0..2147483647 (checking floating) anchored(MBOL) implicit minlen 3
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "00000000​: 0c"
5.18.2> Found floating substr "​:" at offset 8...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> Matching REx ".*​:\s*ab" against "00000000​: 0c"
5.18.2> 0 <> <00000000​: > | 1​:STAR(3)
5.18.2> REG_ANY can match 12 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "0000000​: 0c"
5.18.2> Found floating substr "​:" at offset 7...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 1 <0> <0000000​: 0> | 1​:STAR(3)
5.18.2> REG_ANY can match 11 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "000000​: 0c"
5.18.2> Found floating substr "​:" at offset 6...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 2 <00> <000000​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 10 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "00000​: 0c"
5.18.2> Found floating substr "​:" at offset 5...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 3 <000> <00000​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 9 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "0000​: 0c"
5.18.2> Found floating substr "​:" at offset 4...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 4 <0000> <0000​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 8 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "000​: 0c"
5.18.2> Found floating substr "​:" at offset 3...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 5 <00000> <000​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 7 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "00​: 0c"
5.18.2> Found floating substr "​:" at offset 2...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 6 <000000> <00​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 6 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "0​: 0c"
5.18.2> Found floating substr "​:" at offset 1...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 7 <0000000> <0​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 5 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "​: 0c"
5.18.2> Found floating substr "​:" at offset 0...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 8 <00000000> <​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 4 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against " 0c"
5.18.2> Did not find floating substr "​:"...
5.18.2> Match rejected by optimizer
5.18.2> Match failed
5.18.2> Freeing REx​: ".*​:\s*ab"

Why does it walk forward through the string trying to guess the start of
the match again when all prior Perls know to give up after the first
fail? This new behavior is causing massive wasted processing for a
certain class of regular expression, namely the naiive use of ".*",
which (rather unfortulately) shows up very often in the real world.

Since 5.18 (or perhaps 5.17.x), the performance of this regular expression
degrades geometrically with the length of the line before the colon,
which caused a big efficiency problem. I know it's not a good example
of a regex, but this presumes the coder knows how the internals of regex
matching works.

You can see how badly this affects things with the following comparison​:

5.16.1> $ seq -f "%01000.0f​: 0c" 1000 | /usr/bin/time perl5.16.1 -ne '/.*​:\s*ab/i'
5.16.1> 0.00user 0.00system 0​:00.01elapsed 0%CPU (0avgtext+0avgdata 7488maxresident)k
5.16.1> 0inputs+0outputs (0major+515minor)pagefaults 0swaps

5.18.2> $ seq -f "%01000.0f​: 0c" 1000 | /usr/bin/time perl5.18.2 -ne '/.*​:\s*ab/i'
5.18.2> 4.31user 0.00system 0​:04.41elapsed 97%CPU (0avgtext+0avgdata 7296maxresident)k
5.18.2> 0inputs+0outputs (0major+503minor)pagefaults 0swaps

Spending almost 4.5 seconds matching this regular expression against 1000
lines of text is a big step back in the performance of the flagship text
processing language. Something is amiss when backtracking ".*" now.

Perl Info

Flags:
    category=core
    severity=medium

Site configuration information for perl 5.18.2:

Configured by pandora at Tue May  6 14:41:42 EDT 2014.

Summary of my perl5 (revision 5 version 18 subversion 2) configuration:
   
  Platform:
    osname=linux, osvers=2.6.18-308.8.2.el5, archname=x86_64-linux-thread-multi
    uname='linux atlvbuild-el5-64 2.6.18-308.8.2.el5 #1 smp tue may 29 11:54:17 edt 2012 x86_64 x86_64 x86_64 gnulinux '
    config_args='-des -Dcc=gcc -Dccflags=-fPIC -Dcflags=-fPIC -Uinstallusrbinperl -Dperladmin=pandora-admin@mpdtxmail -Dprefix=/tool/pandora64/.package/perl-5.18.2-gcc481 -Dstartperl=#!/tool/pandora64/.package/perl-5.18.2-gcc481/bin/perl -Dlibpth=/tool/pandora64/lib /lib64 /usr/lib64 -Duseshrplib -Duseithreads -Dusethreads -Duselargefiles -Duse64bitint -Dd_semctl_semun -Di_db -Ui_ndbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Ubincompat5005 -Uversiononly -Dpager=/tool/pandora64/bin/less -isr'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -fPIC -fno-strict-aliasing -pipe -fstack-protector -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fPIC -fno-strict-aliasing -pipe -fstack-protector'
    ccversion='', gccversion='4.8.1', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=undef, longdblsize=
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -fstack-protector'
    libpth=/tool/pandora64/lib /lib64 /usr/lib64
    libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.5.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/tool/pandora64/.package/perl-5.18.2-gcc481/lib/5.18.2/x86_64-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -fstack-protector'

Locally applied patches:
    


@INC for perl 5.18.2:
    /proj/verif_release_ro/p4w/current
    /proj/verif_release_ro/p4_mkwa/current
    /proj/verif_release_ro/siteconfig/current/lib/perl
    /proj/verif_release_ro/error_reporting/25/lib/perl
    /proj/verif_release_ro/cbwa_bootcore/current/lib/perl
    /tool/pandora64/.package/perl-5.18.2-gcc481/lib/site_perl/5.18.2/x86_64-linux-thread-multi
    /tool/pandora64/.package/perl-5.18.2-gcc481/lib/site_perl/5.18.2
    /tool/pandora64/.package/perl-5.18.2-gcc481/lib/5.18.2/x86_64-linux-thread-multi
    /tool/pandora64/.package/perl-5.18.2-gcc481/lib/5.18.2
    .


Environment for perl 5.18.2:
    HOME=/home/whyde
    LANG=C
    LANGUAGE (unset)
    LC_COLLATE=C
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/tool/pandora64/.package/perl-5.18.2-gcc481/bin:/tool/pandora64/.package/perl-5.18.1/bin:/proj/verif_release_ro/p4w/current:/proj/verif_release_ro/p4_mkwa/current:/tool/pandora64/.package/subversion-1.6.3/bin:/proj/verif_release_ro/svn_ssh/31/bin:/tool/pandora64/.package/tcltk-8.4.6-a/bin:/tool/pandora64/.package/perl-5.8.8/bin:/proj/verif_release_ro/siteconfig/current/bin:/tool/pandora64/bin:/tool/pandora/bin:/proj/verif_release_ro/cbwa_bootcore/current/bin:/bin:/usr/bin:/usr/lib64/qt-3.3/bin:/usr/X11R6/bin:/usr/kerberos/bin:/sbin:/usr/sbin:/tool/lsf/bin:/home/whyde/bin:/usr/NX/bin:/usr/NX/bin:/proj/soc_infra/bin
    PERL5LIB=/proj/verif_release_ro/p4w/current:/proj/verif_release_ro/p4_mkwa/current:/proj/verif_release_ro/siteconfig/current/lib/perl:/proj/verif_release_ro/error_reporting/25/lib/perl:/proj/verif_release_ro/cbwa_bootcore/current/lib/perl
    PERL_BADLANG (unset)
    PERL_HOME=/tool/pandora64/.package/perl-5.18.2-gcc481
    SHELL=/tool/pandora/binlatest/tcsh

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 9, 2015

From @demerphq

On 5 February 2015 at 16​:46, via RT <perlbug-followup@​perl.org> wrote​:

# New Ticket Created by
# Please include the string​: [perl #123743]
# in the subject line of all future correspondence about this issue.
# <URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=123743 >

This is a bug report for perl from warren.hyde@​amd.com,
generated with the help of perlbug 1.39 running under perl 5.18.2.

-----------------------------------------------------------------
[Please describe your issue here]

Certain types of (poorly-written) regular expressions have become
painfully slow since the refactoring of regexec.c at some point in
the 5.17 versions. I believe this to be a bug in the behavior, as
the RegEx engine is doing much more (wasted) work trying to match
certain patterns that previous Perl versions (through 5.16.1) had
no problems with.

Thank you for the report. FWIW, I consider this a very serious,
release-blocker level regression.

Specifically, consider this (good, previous behavior) example​:

5.16.1> $ perl5.16.1 -Mre=debug -e '"00000000​: 0c" =~ /.*​:\s*ab/i'
5.16.1> Compiling REx ".*​:\s*ab"
5.16.1> Final program​:
5.16.1> 1​: STAR (3)
5.16.1> 2​: REG_ANY (0)
5.16.1> 3​: EXACTF <​:> (5)
5.16.1> 5​: STAR (7)
5.16.1> 6​: SPACE (0)
5.16.1> 7​: EXACTF <ab> (9)
5.16.1> 9​: END (0)
5.16.1> anchored(MBOL) implicit minlen 3
5.16.1> Matching REx ".*​:\s*ab" against "00000000​: 0c"
5.16.1> 0 <> <00000000​: > | 1​:STAR(3)
5.16.1> REG_ANY can match 12 times out of 2147483647...
5.16.1> 8 <00000000> <​: 0c> | 3​: EXACTF <​:>(5)
5.16.1> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.16.1> SPACE can match 1 times out of 2147483647...
5.16.1> failed...
5.16.1> failed...
5.16.1> Match failed
5.16.1> Freeing REx​: ".*​:\s*ab"

(Using case-insensitive matching disallows the optimizer from looking
for the fixed text "ab".)

Now, compare that with the work done by Perl since 5.18​:

5.18.2> $ perl5.18.2 -Mre=debug -e '"00000000​: 0c" =~ /.*​:\s*ab/i'
5.18.2> Compiling REx ".*​:\s*ab"
5.18.2> Final program​:
5.18.2> 1​: STAR (3)
5.18.2> 2​: REG_ANY (0)
5.18.2> 3​: EXACT <​:> (5)
5.18.2> 5​: STAR (7)
5.18.2> 6​: POSIXD[\s] (0)
5.18.2> 7​: EXACTF <ab> (9)
5.18.2> 9​: END (0)
5.18.2> floating "​:" at 0..2147483647 (checking floating) anchored(MBOL) implicit minlen 3
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "00000000​: 0c"
5.18.2> Found floating substr "​:" at offset 8...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> Matching REx ".*​:\s*ab" against "00000000​: 0c"
5.18.2> 0 <> <00000000​: > | 1​:STAR(3)
5.18.2> REG_ANY can match 12 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "0000000​: 0c"
5.18.2> Found floating substr "​:" at offset 7...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 1 <0> <0000000​: 0> | 1​:STAR(3)
5.18.2> REG_ANY can match 11 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "000000​: 0c"
5.18.2> Found floating substr "​:" at offset 6...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 2 <00> <000000​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 10 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "00000​: 0c"
5.18.2> Found floating substr "​:" at offset 5...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 3 <000> <00000​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 9 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "0000​: 0c"
5.18.2> Found floating substr "​:" at offset 4...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 4 <0000> <0000​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 8 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "000​: 0c"
5.18.2> Found floating substr "​:" at offset 3...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 5 <00000> <000​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 7 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "00​: 0c"
5.18.2> Found floating substr "​:" at offset 2...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 6 <000000> <00​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 6 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "0​: 0c"
5.18.2> Found floating substr "​:" at offset 1...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 7 <0000000> <0​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 5 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against "​: 0c"
5.18.2> Found floating substr "​:" at offset 0...
5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0
5.18.2> 8 <00000000> <​: 0c> | 1​:STAR(3)
5.18.2> REG_ANY can match 4 times out of 2147483647...
5.18.2> 8 <00000000> <​: 0c> | 3​: EXACT <​:>(5)
5.18.2> 9 <00000000​:> < 0c> | 5​: STAR(7)
5.18.2> POSIXD[\s] can match 1 times out of 2147483647...
5.18.2> failed...
5.18.2> failed...
5.18.2> Guessing start of match in sv for REx ".*​:\s*ab" against " 0c"
5.18.2> Did not find floating substr "​:"...
5.18.2> Match rejected by optimizer
5.18.2> Match failed
5.18.2> Freeing REx​: ".*​:\s*ab"

Why does it walk forward through the string trying to guess the start of
the match again when all prior Perls know to give up after the first
fail? This new behavior is causing massive wasted processing for a
certain class of regular expression, namely the naiive use of ".*",
which (rather unfortulately) shows up very often in the real world.

Since 5.18 (or perhaps 5.17.x), the performance of this regular expression
degrades geometrically with the length of the line before the colon,
which caused a big efficiency problem. I know it's not a good example
of a regex, but this presumes the coder knows how the internals of regex
matching works.

As you say, this pattern was previously fast, and is not now.

I suspect there is an issue with handling the implicit SBOL/MBOL as
you can see here​:

5.18.2> Position at offset 0 does not contradict /^/m...
5.18.2> Guessed​: match at offset 0

The reason we see this from a pattern with no "^" in it like we do
here is that we automatically prepend a SBOL or MBOL to patterns
starting with .* (depending on whether it was /s) so it does not go
quadratic due to advancing the start pointer needlessly. It appears
that in 5.18 we broke this somehow.

The idea of this optimization is to avoid quadratic backtracking for
unanchored patterns starting with .*, which is exactly what is
occurring here. (It also seems like an *anchored* pattern could now be
quadratic but I have not checked).

Depending on whether /s was used or not .* either matches everything,
or it matches everything but newlines. This means that a pattern
starting with .* can and is (albeit apparently with no impact in 5.18)
implicitly anchored either at the beginning of the string, or at the
beginning of a line, since advancing the start pointer cannot make the
pattern match (unless it is advanced to after a newline on a non /s
pattern). The reason it cannot is that if it could it would have
matched without advancing the start pointer and we would not have
needed to advance the start pointer at all.

In the case you show (with no /s modifier) it should be at the
beginning of each line in the string being matched, which in your case
is equivalent to the beginning of the string. So we should see the
start class logic fire twice, once to say "yes we are the start of
the string" and once later to say "there are no other newlines in this
pattern", after which we should stop matching. It appears that this
last step no longer happens.

You can see how badly this affects things with the following comparison​:

5.16.1> $ seq -f "%01000.0f​: 0c" 1000 | /usr/bin/time perl5.16.1 -ne '/.*​:\s*ab/i'
5.16.1> 0.00user 0.00system 0​:00.01elapsed 0%CPU (0avgtext+0avgdata 7488maxresident)k
5.16.1> 0inputs+0outputs (0major+515minor)pagefaults 0swaps

5.18.2> $ seq -f "%01000.0f​: 0c" 1000 | /usr/bin/time perl5.18.2 -ne '/.*​:\s*ab/i'
5.18.2> 4.31user 0.00system 0​:04.41elapsed 97%CPU (0avgtext+0avgdata 7296maxresident)k
5.18.2> 0inputs+0outputs (0major+503minor)pagefaults 0swaps

Spending almost 4.5 seconds matching this regular expression against 1000
lines of text is a big step back in the performance of the flagship text
processing language.

Indeed. That is a rather diplomatic way of putting it. In conversation
I probably would use much harsher language. I imagine you did too. :-)

Something is amiss when backtracking ".*" now.

We need to bisect to find this[1], but it would not surprise me if
this was fallout from Dave M's work on cleaning up start class
optimizations in the regex engine. Alternatively it might be my fault
when I cleaned up logic related to SBOL/MBOL. My memory fails me as to
exactly when these two changes were made so without more digging I
can't be sure. A last possibility is that we have somehow broken the
super-linear cache, but I discount that one as I don't think it fires
in your case at all.

IMO we must fix this before the next release of Perl. Patterns of this
form, probably *because* of the optimization that appears to be
broken, are all to common. We should have noticed this regression, and
definitely should fix it ASAP. I will try to find some time to dig,
but my schedule is tight and I encourage others to dig as well.

Yves
[1] And I am on a new laptop where it will take me a while to get
setup for bisects and in general Perl dev.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 9, 2015

The RT System itself - Status changed from 'new' to 'open'

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 9, 2015

From @cpansprout

On Sun Feb 08 18​:48​:21 2015, demerphq wrote​:

We need to bisect to find this

$ ../perl.git/Porting/bisect.pl --target=miniperl --start=v5.16.0 --end=v5.18.0 -e '$_ = join "", map sprintf("%01000.0f​: 0c", $_), 1..50; $t = time; /.*​:\s*ab/i; die if time - $t > 1'
...

3465e1f is the first bad commit
commit 3465e1f
Author​: Karl Williamson <public@​khwilliamson.com>
Date​: Thu Oct 11 12​:15​:53 2012 -0600

  regcomp.c​: Optimize EXACTFish nodes without folds to EXACT

--

Father Chrysostomos

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 9, 2015

From @demerphq

On 9 February 2015 at 05​:16, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

On Sun Feb 08 18​:48​:21 2015, demerphq wrote​:

We need to bisect to find this

$ ../perl.git/Porting/bisect.pl --target=miniperl --start=v5.16.0 --end=v5.18.0 -e '$_ = join "", map sprintf("%01000.0f​: 0c", $_), 1..50; $t = time; /.*​:\s*ab/i; die if time - $t > 1'
...

Thanks, but I am pretty sure this is a false positive. I think a
better bisect would re 'debug' and count how many lines the OP's
example outputs. If it is more than say 50 then you have found the
bug.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 9, 2015

From @cpansprout

On Sun Feb 08 20​:54​:52 2015, demerphq wrote​:

On 9 February 2015 at 05​:16, Father Chrysostomos via RT
<perlbug-followup@​perl.org> wrote​:

On Sun Feb 08 18​:48​:21 2015, demerphq wrote​:

We need to bisect to find this

$ ../perl.git/Porting/bisect.pl --target=miniperl --start=v5.16.0
--end=v5.18.0 -e '$_ = join "", map sprintf("%01000.0f​: 0c", $_),
1..50; $t = time; /.*​:\s*ab/i; die if time - $t > 1'
...

Thanks, but I am pretty sure this is a false positive. I think a
better bisect would re 'debug' and count how many lines the OP's
example outputs. If it is more than say 50 then you have found the
bug.

Same result with​:

$ ../perl.git/Porting/bisect.pl --start=v5.16.0 --end=v5.18.0 -e '$_= readpipe(q|./perl -Ilib -Mre=debug -e '\''"%01000.0f​: 0c" =~ /.*​:\s*ab/i'\'' 2>&1|) ; warn $_ =~ tr/\n//; die if $_=~ tr/\n// > 50'

3465e1f again.

--

Father Chrysostomos

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 9, 2015

From @iabyn

On Mon, Feb 09, 2015 at 03​:47​:45AM +0100, demerphq wrote​:

definitely should fix it ASAP. I will try to find some time to dig,
but my schedule is tight and I encourage others to dig as well.

I'm digging...

--
"Emacs isn't a bad OS once you get used to it.
It just lacks a decent editor."

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 10, 2015

From @iabyn

On Mon, Feb 09, 2015 at 10​:55​:57AM +0000, Dave Mitchell wrote​:

On Mon, Feb 09, 2015 at 03​:47​:45AM +0100, demerphq wrote​:

definitely should fix it ASAP. I will try to find some time to dig,
but my schedule is tight and I encourage others to dig as well.

I'm digging...

I've dug. It's actually a long-standing issue with /.*/ patterns which
are intuit-able. Karls optimisation just made some /.*.../i patterns
intuitable too.

In the following,

  $s = ('0' x 200_000) . '​::​: 0c';
  ok ($s !~ /.*​::​:\s*ab/, 'PREGf_IMPLICIT');
  ok ($s !~ /.*​::​:\s*ab/i, 'PREGf_IMPLICIT/i');
  ok ($s !~ /.*​::​:\s*ab/m, 'PREGf_IMPLICIT/m');
  ok ($s !~ /.*​::​:\s*ab/mi, 'PREGf_IMPLICIT/mi');
  ok ($s !~ /.*​::​:\s*ab/s, 'PREGf_IMPLICIT/s');
  ok ($s !~ /.*​::​:\s*ab/si, 'PREGf_IMPLICIT/si');
  ok ($s !~ /.*​::​:\s*ab/ms, 'PREGf_IMPLICIT/ms');
  ok ($s !~ /.*​::​:\s*ab/msi,'PREGf_IMPLICIT/msi');

all the non-//i ones are quadratic in all perls since 5.8-ish.
The //i ones additionally went quadratic in 5.18.0 and later.

The following commit which I've just pushed makes all 8 of the above run
in millisecond time again.

commit 0fa70a0
Author​: David Mitchell <davem@​iabyn.com>
AuthorDate​: Tue Feb 10 12​:17​:51 2015 +0000
Commit​: David Mitchell <davem@​iabyn.com>
CommitDate​: Tue Feb 10 12​:37​:10 2015 +0000

  simpify and speed up /.*.../ handling
 
  See RT ##123743.
 
  A pattern that starts /.*/ has a fake MBOL or SBOL flag added, along
  with PREGf_IMPLICIT.
 
  The idea is that, with /.*.../s, if the NFA don't match when started at
  pos 0, then it's not going to match if started at any other position
  either; while /.*.../ won't match at any other start position up until
  the next \n.
 
  However, the branch in regexec() that implemented this was a bit a mess
  (like much in the perl core, it had gradually accreted), and caused
  intuit-enabled /.*.../ and /.*...patterns to go quadratic.
 
  The branch looked roughly like​:
 
  if (anchored) {
  if (regtry(s)) goto success;
  if (can_intuit) {
  while (s < end) {
  s = intuit(s+1);
  if (!s) goto fail;
  if (regtry(s)) goto success;
  }
  }
  else {
  while (s < end) {
  s = skip_to_next_newline(s);
  if (regtry(s)) goto success;
  }
  }
  }
 
  The problem is that in the presence of a .* at the start of the pattern,
  intuit() will always return either NULL on failure, or the start position,
  rather than any later position. So the can_intuit branch above calls
  regtry() on every character position.
 
  This commit fixes this by changing the structure of the code to be like
  this, where it only tries things on newline boundaries​:
 
  if (anchored) {
  if (regtry(s)) goto success;
  while (1) {
  s = skip_to_next_newline(s);
  if (can_intuit) {
  s = intuit(s+1);
  if (!s) goto fail;
  }
  if (regtry(s)) goto success;
  }
  }
 
  This makes the code a lot simpler, and mostly avoids quadratic behaviour
  (you can still get it with a string consisting mainly of newlines).

--
Nothing ventured, nothing lost.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 10, 2015

From warren.hyde@amd.com

Excellent diagnosis and response, as usual, from the Perl community. Much thanks to Dave and the others who took a look at this. Another uninformed question is how Perl's regex engine winds up in PCRE, and whether that would also be affected?

Cheers,
--
Warren Hyde

-----Original Message-----
From​: Dave Mitchell via RT [mailto​:perlbug-followup@​perl.org]
Sent​: Tuesday, February 10, 2015 8​:52 AM
To​: Hyde, Warren
Subject​: Re​: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)

On Mon, Feb 09, 2015 at 10​:55​:57AM +0000, Dave Mitchell wrote​:

On Mon, Feb 09, 2015 at 03​:47​:45AM +0100, demerphq wrote​:

definitely should fix it ASAP. I will try to find some time to dig,
but my schedule is tight and I encourage others to dig as well.

I'm digging...

I've dug. It's actually a long-standing issue with /.*/ patterns which are intuit-able. Karls optimisation just made some /.*.../i patterns intuitable too.

In the following,

  $s = ('0' x 200_000) . '​::​: 0c';
  ok ($s !~ /.*​::​:\s*ab/, 'PREGf_IMPLICIT');
  ok ($s !~ /.*​::​:\s*ab/i, 'PREGf_IMPLICIT/i');
  ok ($s !~ /.*​::​:\s*ab/m, 'PREGf_IMPLICIT/m');
  ok ($s !~ /.*​::​:\s*ab/mi, 'PREGf_IMPLICIT/mi');
  ok ($s !~ /.*​::​:\s*ab/s, 'PREGf_IMPLICIT/s');
  ok ($s !~ /.*​::​:\s*ab/si, 'PREGf_IMPLICIT/si');
  ok ($s !~ /.*​::​:\s*ab/ms, 'PREGf_IMPLICIT/ms');
  ok ($s !~ /.*​::​:\s*ab/msi,'PREGf_IMPLICIT/msi');

all the non-//i ones are quadratic in all perls since 5.8-ish.
The //i ones additionally went quadratic in 5.18.0 and later.

The following commit which I've just pushed makes all 8 of the above run in millisecond time again.

commit 0fa70a0
Author​: David Mitchell <davem@​iabyn.com>
AuthorDate​: Tue Feb 10 12​:17​:51 2015 +0000
Commit​: David Mitchell <davem@​iabyn.com>
CommitDate​: Tue Feb 10 12​:37​:10 2015 +0000

  simpify and speed up /.*.../ handling
 
  See RT ##123743.
 
  A pattern that starts /.*/ has a fake MBOL or SBOL flag added, along
  with PREGf_IMPLICIT.
 
  The idea is that, with /.*.../s, if the NFA don't match when started at
  pos 0, then it's not going to match if started at any other position
  either; while /.*.../ won't match at any other start position up until
  the next \n.
 
  However, the branch in regexec() that implemented this was a bit a mess
  (like much in the perl core, it had gradually accreted), and caused
  intuit-enabled /.*.../ and /.*...patterns to go quadratic.
 
  The branch looked roughly like​:
 
  if (anchored) {
  if (regtry(s)) goto success;
  if (can_intuit) {
  while (s < end) {
  s = intuit(s+1);
  if (!s) goto fail;
  if (regtry(s)) goto success;
  }
  }
  else {
  while (s < end) {
  s = skip_to_next_newline(s);
  if (regtry(s)) goto success;
  }
  }
  }
 
  The problem is that in the presence of a .* at the start of the pattern,
  intuit() will always return either NULL on failure, or the start position,
  rather than any later position. So the can_intuit branch above calls
  regtry() on every character position.
 
  This commit fixes this by changing the structure of the code to be like
  this, where it only tries things on newline boundaries​:
 
  if (anchored) {
  if (regtry(s)) goto success;
  while (1) {
  s = skip_to_next_newline(s);
  if (can_intuit) {
  s = intuit(s+1);
  if (!s) goto fail;
  }
  if (regtry(s)) goto success;
  }
  }
 
  This makes the code a lot simpler, and mostly avoids quadratic behaviour
  (you can still get it with a string consisting mainly of newlines).

--
Nothing ventured, nothing lost.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 10, 2015

From @mauke

Am 10.02.2015 um 16​:11 schrieb Hyde, Warren​:

Excellent diagnosis and response, as usual, from the Perl community.
Much thanks to Dave and the others who took a look at this. Another
uninformed question is how Perl's regex engine winds up in PCRE,

It doesn't.

and whether that would also be affected?

AFAIK perl and PCRE don't share any code.

--
Lukas Mai <plokinom@​gmail.com>

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 11, 2015

@iabyn - Status changed from 'open' to 'pending release'

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 12, 2015

From warren.hyde@amd.com

Dave,

  A pattern that starts /.*/ has a fake MBOL or SBOL flag added, along
  with PREGf_IMPLICIT.

Thanks again for the fix (to appear in 5.22, I assume). Will this also cover the MINMOD case of a leading /.*?.../, which I also see is quadratic in 5.18.2? I didn't see any test code for that situation in your fix below.

Patterns like these are extremely common in a "TWiki Formatted Search", for example.
 
Cheers,
--
Warren Hyde

-----Original Message-----
From​: Dave Mitchell via RT [mailto​:perlbug-followup@​perl.org]
Sent​: Tuesday, February 10, 2015 8​:52 AM
To​: Hyde, Warren
Subject​: Re​: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)

On Mon, Feb 09, 2015 at 10​:55​:57AM +0000, Dave Mitchell wrote​:

On Mon, Feb 09, 2015 at 03​:47​:45AM +0100, demerphq wrote​:

definitely should fix it ASAP. I will try to find some time to dig,
but my schedule is tight and I encourage others to dig as well.

I'm digging...

I've dug. It's actually a long-standing issue with /.*/ patterns which are intuit-able. Karls optimisation just made some /.*.../i patterns intuitable too.

In the following,

  $s = ('0' x 200_000) . '​::​: 0c';
  ok ($s !~ /.*​::​:\s*ab/, 'PREGf_IMPLICIT');
  ok ($s !~ /.*​::​:\s*ab/i, 'PREGf_IMPLICIT/i');
  ok ($s !~ /.*​::​:\s*ab/m, 'PREGf_IMPLICIT/m');
  ok ($s !~ /.*​::​:\s*ab/mi, 'PREGf_IMPLICIT/mi');
  ok ($s !~ /.*​::​:\s*ab/s, 'PREGf_IMPLICIT/s');
  ok ($s !~ /.*​::​:\s*ab/si, 'PREGf_IMPLICIT/si');
  ok ($s !~ /.*​::​:\s*ab/ms, 'PREGf_IMPLICIT/ms');
  ok ($s !~ /.*​::​:\s*ab/msi,'PREGf_IMPLICIT/msi');

all the non-//i ones are quadratic in all perls since 5.8-ish.
The //i ones additionally went quadratic in 5.18.0 and later.

The following commit which I've just pushed makes all 8 of the above run in millisecond time again.

commit 0fa70a0
Author​: David Mitchell <davem@​iabyn.com>
AuthorDate​: Tue Feb 10 12​:17​:51 2015 +0000
Commit​: David Mitchell <davem@​iabyn.com>
CommitDate​: Tue Feb 10 12​:37​:10 2015 +0000

  simpify and speed up /.*.../ handling
 
  See RT ##123743.
 
  A pattern that starts /.*/ has a fake MBOL or SBOL flag added, along
  with PREGf_IMPLICIT.
 
  The idea is that, with /.*.../s, if the NFA don't match when started at
  pos 0, then it's not going to match if started at any other position
  either; while /.*.../ won't match at any other start position up until
  the next \n.
 
  However, the branch in regexec() that implemented this was a bit a mess
  (like much in the perl core, it had gradually accreted), and caused
  intuit-enabled /.*.../ and /.*...patterns to go quadratic.
 
  The branch looked roughly like​:
 
  if (anchored) {
  if (regtry(s)) goto success;
  if (can_intuit) {
  while (s < end) {
  s = intuit(s+1);
  if (!s) goto fail;
  if (regtry(s)) goto success;
  }
  }
  else {
  while (s < end) {
  s = skip_to_next_newline(s);
  if (regtry(s)) goto success;
  }
  }
  }
 
  The problem is that in the presence of a .* at the start of the pattern,
  intuit() will always return either NULL on failure, or the start position,
  rather than any later position. So the can_intuit branch above calls
  regtry() on every character position.
 
  This commit fixes this by changing the structure of the code to be like
  this, where it only tries things on newline boundaries​:
 
  if (anchored) {
  if (regtry(s)) goto success;
  while (1) {
  s = skip_to_next_newline(s);
  if (can_intuit) {
  s = intuit(s+1);
  if (!s) goto fail;
  }
  if (regtry(s)) goto success;
  }
  }
 
  This makes the code a lot simpler, and mostly avoids quadratic behaviour
  (you can still get it with a string consisting mainly of newlines).

--
Nothing ventured, nothing lost.

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 12, 2015

From @demerphq

On 12 February 2015 at 01​:32, Hyde, Warren <Warren.Hyde@​amd.com> wrote​:

Dave,

A pattern that starts /\.\*/ has a fake MBOL or SBOL flag added\, along
with PREGf\_IMPLICIT\.

Thanks again for the fix (to appear in 5.22, I assume). Will this also cover the MINMOD case of a leading /.*?.../, which I also see is quadratic in 5.18.2? I didn't see any test code for that situation in your fix below.

Patterns like these are extremely common in a "TWiki Formatted Search", for example.

Why are they common? Can you give us more context?

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 12, 2015

From warren.hyde@amd.com

Yves,

TWiki documents the extraction of text from a topic using a regular expression here​:

http​://twiki.org/cgi-bin/view/TWiki/FormattedSearch#Extract_some_text_from_a_topic_u

The regex is defined as needing to match the entire document (or line) as follows​:

$pattern(reg-exp) A regular expression pattern to extract some text from a topic (does not search meta data; use $formfield instead). In case of a multiple="on" search, the pattern is applied to the line found in each search hit.
• Specify a RegularExpression that covers the whole text (topic or line), which typically starts with .*, and must end in .*
• Put text you want to keep in parenthesis, like $pattern(.*?(from here.*?to here).*)
• Example​: $pattern(.*?\*.*?Email\​:\s*([^\n\r]+).*) extracts the e-mail address from a bullet of format * Email​: ...
• This example has non-greedy .*? patterns to scan for the first occurance of the Email bullet; use greedy .* patterns to scan for the last occurance
• Limitation​: Do not use .*) inside the pattern, e.g. $pattern(.*foo(.*)bar.*) does not work, but $pattern(.*foo(.*?)bar.*) does
• Note​: Make sure that the integrity of a web page is not compromised; for example, if you include an HTML table make sure to include everything including the table end tag

I realize this is not an ideal situation, but if Perl takes a long time backtracking through leading dot-star (even with minimal matching), this may cause a request to timeout in the browser.

That's what I meant by "common", because lots of things document this as a specific use-case, and TWiki came to mind because I recalled having stumbled across this before.

My question was whether the fix as implemented also covered leading .*? as well as leading .*, since I didn't think to include this case in the original perlbug submission.

Cheers,
--
Warren Hyde

-----Original Message-----
From​: yves orton via RT [mailto​:perlbug-followup@​perl.org]
Sent​: Thursday, February 12, 2015 10​:21 AM
To​: Hyde, Warren
Subject​: Re​: [perl #123743] RegEx ".*" Backtracking slow since 5.18 (maybe 5.17.?)

On 12 February 2015 at 01​:32, Hyde, Warren <Warren.Hyde@​amd.com> wrote​:

Dave,

A pattern that starts /\.\*/ has a fake MBOL or SBOL flag added\, along
with PREGf\_IMPLICIT\.

Thanks again for the fix (to appear in 5.22, I assume). Will this also cover the MINMOD case of a leading /.*?.../, which I also see is quadratic in 5.18.2? I didn't see any test code for that situation in your fix below.

Patterns like these are extremely common in a "TWiki Formatted Search", for example.

Why are they common? Can you give us more context?

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 13, 2015

From @demerphq

On 13 February 2015 at 00​:38, Hyde, Warren <Warren.Hyde@​amd.com> wrote​:

Yves,

TWiki documents the extraction of text from a topic using a regular expression here​:

http​://twiki.org/cgi-bin/view/TWiki/FormattedSearch#Extract_some_text_from_a_topic_u

The regex is defined as needing to match the entire document (or line) as follows​:

$pattern(reg-exp) A regular expression pattern to extract some text from a topic (does not search meta data; use $formfield instead). In case of a multiple="on" search, the pattern is applied to the line found in each search hit.
• Specify a RegularExpression that covers the whole text (topic or line), which typically starts with .*, and must end in .*
• Put text you want to keep in parenthesis, like $pattern(.*?(from here.*?to here).*)
• Example​: $pattern(.*?\*.*?Email\​:\s*([^\n\r]+).*) extracts the e-mail address from a bullet of format * Email​: ...
• This example has non-greedy .*? patterns to scan for the first occurance of the Email bullet; use greedy .* patterns to scan for the last occurance
• Limitation​: Do not use .*) inside the pattern, e.g. $pattern(.*foo(.*)bar.*) does not work, but $pattern(.*foo(.*?)bar.*) does
• Note​: Make sure that the integrity of a web page is not compromised; for example, if you include an HTML table make sure to include everything including the table end tag

I realize this is not an ideal situation, but if Perl takes a long time backtracking through leading dot-star (even with minimal matching), this may cause a request to timeout in the browser.

That's what I meant by "common", because lots of things document this as a specific use-case, and TWiki came to mind because I recalled having stumbled across this before.

My question was whether the fix as implemented also covered leading .*? as well as leading .*, since I didn't think to include this case in the original perlbug submission.

It sounds to me like Twiki is going to automatically turn

/PAT/

into

/^(?​:PAT)$/m

And if it doesn't it could. :-)

Anyway, I dont see any reason not to enable the same optimisation for
the minmod case. I was just curious what the issue was with Twiki.

yves

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 13, 2015

From @iabyn

On Thu, Feb 12, 2015 at 04​:38​:01PM +0000, Hyde, Warren wrote​:

My question was whether the fix as implemented also covered leading .*?
as well as leading .*, since I didn't think to include this case in the
original perlbug submission.

Yes, it also covered /.*?/. I've added some extra speed tests with
3e33f67.

--
This is a great day for France!
  -- Nixon at Charles De Gaulle's funeral

@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 2, 2015

From @khwilliamson

Thanks for submitting this ticket

The issue should be resolved with the release today of Perl v5.22, available at http​://www.perl.org/get.html
If you find that the problem persists, feel free to reopen this ticket

--
Karl Williamson for the Perl 5 porters team

@p5pRT p5pRT closed this Jun 2, 2015
@p5pRT
Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 2, 2015

@khwilliamson - Status changed from 'pending release' to 'resolved'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
1 participant
You can’t perform that action at this time.