Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regexp with recursive subpatterns matches incorrectly #18096

Closed
t-a-k opened this issue Aug 28, 2020 · 9 comments
Closed

Regexp with recursive subpatterns matches incorrectly #18096

t-a-k opened this issue Aug 28, 2020 · 9 comments

Comments

@t-a-k
Copy link
Contributor

t-a-k commented Aug 28, 2020

Description

Regular expression with recursive subpatterns ((?PARNO)) matches incorrectly.

Steps to Reproduce

perl -wE '"a + b + (c + d)" =~ /^((\w|\((\s)*(?1)(?3)*\))(?:(?3)*\+(?3)*(?2))*)(?3)*\+/ or die; say $1'

Expected behavior

In this regular expression, $1 is expected to match a expression-like string with balanced parenthesis, and whole pattern is to match partial expression without the last (parenthesized in this example) term (a + b +).
So this one-liner should print a + b, but actually prints a + b + (c .

Perl configuration

Summary of my perl5 (revision 5 version 33 subversion 1) configuration:
  Commit id: eab86acb0400b955ce0cd22c53d66e84ec8f3b7d
  Platform:
    osname=linux
    osvers=4.19.0-9-686-pae
    archname=i686-linux
    uname='linux x41 4.19.0-9-686-pae #1 smp debian 4.19.118-2+deb10u1 (2020-06-07) i686 gnulinux '
    config_args='-d -Dusedevel'
    hint=recommended
    useposix=true
    d_sigaction=define
    useithreads=undef
    usemultiplicity=undef
    use64bitint=undef
    use64bitall=undef
    uselongdouble=undef
    usemymalloc=n
    default_inc_excludes_dot=define
  Compiler:
    cc='cc'
    ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'
    optimize='-O2'
    cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include'
    ccversion=''
    gccversion='8.3.0'
    gccosandvers=''
    intsize=4
    longsize=4
    ptrsize=4
    doublesize=8
    byteorder=1234
    doublekind=3
    d_longlong=define
    longlongsize=8
    d_longdbl=define
    longdblsize=12
    longdblkind=3
    ivtype='long'
    ivsize=4
    nvtype='double'
    nvsize=8
    Off_t='off_t'
    lseeksize=8
    alignbytes=4
    prototype=define
  Linker and Libraries:
    ld='cc'
    ldflags =' -fstack-protector-strong -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib/gcc/i686-linux-gnu/8/include-fixed /usr/include/i386-linux-gnu /usr/lib /lib/i386-linux-gnu /lib/../lib /usr/lib/i386-linux-gnu /usr/lib/../lib /lib /lib64 /usr/lib64
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=libc-2.28.so
    so=so
    useshrplib=false
    libperl=libperl.a
    gnulibc_version='2.28'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs
    dlext=so
    d_dlsymun=undef
    ccdlflags='-Wl,-E'
    cccdlflags='-fPIC'
    lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector-strong'


Characteristics of this binary (from libperl): 
  Compile-time options:
    HAS_TIMES
    PERLIO_LAYERS
    PERL_COPY_ON_WRITE
    PERL_DONT_CREATE_GVSV
    PERL_MALLOC_WRAP
    PERL_OP_PARENT
    PERL_PRESERVE_IVUV
    PERL_USE_DEVEL
    USE_LARGE_FILES
    USE_LOCALE
    USE_LOCALE_COLLATE
    USE_LOCALE_CTYPE
    USE_LOCALE_NUMERIC
    USE_LOCALE_TIME
    USE_PERLIO
    USE_PERL_ATOF
  Built under linux
  Compiled at Aug 28 2020 19:07:39
  @INC:
    /usr/local/lib/perl5/site_perl/5.33.1/i686-linux
    /usr/local/lib/perl5/site_perl/5.33.1
    /usr/local/lib/perl5/5.33.1/i686-linux
    /usr/local/lib/perl5/5.33.1
@richardleach
Copy link
Contributor

Had 5 mins to quickly check older release behaviour: v5.20.0 prints the desired a + b, but v5.22.0 and stable releases thereafter (and blead) print a + b + (c.

Will try to bisect tonight, unless someone beats me to it.

@richardleach
Copy link
Contributor

richardleach commented Aug 28, 2020

@demerphq - bisect.pl says a51d618a82a7057c3aabb600a7a8691d27f44a34 is the commit that changed the behaviour

commit a51d618a82a7057c3aabb600a7a8691d27f44a34
Author: Yves Orton <demerphq@gmail.com>
Date:   Fri Sep 19 19:57:34 2014 +0200

    rt 122283 - do not recurse into GOSUB/GOSTART when not SCF_DO_SUBSTR
    
    See also comments in patch. A complex regex "grammar" like that in
    RT 122283 causes perl to take literally forever, and exhaust all
    memory during the pattern optimization phase.
    
    Unfortunately I could not track down exacty why this occured, but
    it was very clear that the excessive recursion was unnecessary and
    excessive. By simply eliminating the unncessary recursion performance
    goes back to being acceptable.
    
    I have not thought of a good way to test this change, so this patch
    does not include any tests. Perhaps we can test it using alarm, but
    I will follow up on that later.

:100644 100644 be9c184db9d24f30d2a45637173671070ffd2159 3e8c42eaf7d9e04d2260da8cd0e322d8525fecde M	regcomp.c

@t-a-k
Copy link
Contributor Author

t-a-k commented Sep 1, 2020

In addition to the case in the first comment, the same regexp also fails incorrectly.

perl -wE '"a + (b + c) + d" =~ /^((\w|\((\s)*(?1)(?3)*\))(?:(?3)*\+(?3)*(?2))*)(?3)*\+/ or die; say $1'

This should print a + (b + c), but actually dies.

According to -Mre=debug output, in the both case the repetition (?:(?3)*\+(?3)*(?2))* seems to backtrack into wrong point. Especially, in the "a + (b + c) + d" case it seems to step over the beginning of the target string.

@richardleach
Copy link
Contributor

This should print a + (b + c), but actually dies.

I checked just to be sure and immediately prior to a51d618, it did print a + (b + c), then started dying with that commit.

@hvds
Copy link
Contributor

hvds commented Sep 14, 2020

Oh, this is wrongly optimizing the central part of the regexp (?:(?3)*\+(?3)*(?2))* to CURLYM:

  34:   CURLYM[0]{0,INFTY} (57)
  36:     CURLYX[2]{0,INFTY} (42)
  38:       GOSUB3[-27:11] (41)
  41:     WHILEM[2/6] (0)
  42:     NOTHING (43)
  43:     EXACT <+> (45)
  45:     CURLYX[2]{0,INFTY} (51)
  47:       GOSUB3[-36:11] (50)
  50:     WHILEM[3/6] (0)
  51:     NOTHING (52)
  52:     GOSUB2[-48:4] (55)
  55:     SUCCEED (0)
  56:   NOTHING (57)

.. but CURLYM is supposed to be used only in cases where each iteration of the content to be iterated has the same width. When that is not the case, it will do the wrong thing.

The specific check that should be stopping it from trying to use CURLYM here is regcomp.c:5685:

                      && !deltanext     /* atom is fixed width */

.. so the problem is that deltanext (the difference between min and max possible widths of a match) has been left as zero - I expect that we are treating GOSUBs not recursed into as a result of the bisected commit as matching zero-width.

I think the fix needs to be that GOSUBs not recursed into should be treated as matching widths in the range (0, \inf), with the potential risk that in some cases this may reject a valid pattern as infinite; the patch below appears to fix it, but I'm uncertain if there's some additional bookkeeping required (eg setting is_inf or similar).

--- a/regcomp.c
+++ b/regcomp.c
@@ -5213,6 +5213,7 @@ S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp,
                      * but this doesn't make us mismatch, just try a bit
                      * harder than we should.
                      * */
+                    delta = OPTIMIZE_INFTY;
                     scan= regnext(scan);
                     continue;
                 }

If anyone can confirm the validity of this, I can add testcases and update the preceding comments to be a bit more accurate.

@t-a-k
Copy link
Contributor Author

t-a-k commented Sep 14, 2020

+                    delta = OPTIMIZE_INFTY;

This seems to work for me.

Below is the testcase I have tested (just rewrites of one-line tests above):

--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -2023,6 +2023,8 @@ AB\s+\x{100}	AB \x{100}X	y	-	-
 /(?iaax:A? \K +)/	African_Feh	c	-	\\K + is forbidden - matches null string many times in regex
 /(?iaa:A?\K+)/	African_Feh	c	-	\\K+ is forbidden - matches null string many times in regex
 /(?iaa:A?\K*)/	African_Feh	c	-	\\K* is forbidden - matches null string many times in regex
+^((\w|\((\s)*(?1)(?3)*\))(?:(?3)*\+(?3)*(?2))*)(?3)*\+	a + b + (c + d)	y	$1	a + b		# [GH #18096]
+^((\w|\((\s)*(?1)(?3)*\))(?:(?3)*\+(?3)*(?2))*)(?3)*\+	a + (b) + c	y	$1	a + (b)		# [GH #18096]
 # Keep these lines at the end of the file
 # pat	string	y/n/etc	expr	expected-expr	skip-reason	comment
 # vim: softtabstop=0 noexpandtab

@hvds
Copy link
Contributor

hvds commented Sep 14, 2020

This seems to work for me.

Thanks, in the absence of any confirmation from others I'll have a go tomorrow at tracing through what follows to see if I can verify the need or lack of need for additional bookkeeping.

@hvds
Copy link
Contributor

hvds commented Sep 15, 2020

I've now created a PR #18138 for this.

Looking at the code branch where we do recurse, I see it marks the same case with is_inf = is_inf_internal = 1, so I've gone with that rather than delta = OPTIMIZE_INFTY, and confirmed it gives the same result.

I also tweaked your testcase to replace the parens with angle brackets, just to make the patterns marginally easier to read.

@hvds
Copy link
Contributor

hvds commented Sep 22, 2020

That PR now applied as commit f4cd5e2.

@hvds hvds closed this as completed Sep 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants