Skip to content

Commit

Permalink
regcomp.c - Resolve issues clearing buffers in CURLYX (MAJOR-CHANGE)
Browse files Browse the repository at this point in the history
CURLYX doesn't reset capture buffers properly. It is possible
for multiple buffers to be defined at once with values from
different iterations of the loop, which doesn't make sense really.

An example is this:

  "foobarfoo"=~/((foo)|(bar))+/

after this matches $1 should equal $2 and $3 should be undefined,
or $1 should equal $3 and $2 should be undefined. Prior to this
patch this would not be the case.

The solution that this patches uses is to introduce a form of
"layered transactional storage" for paren data. The existing
pair of start/end data for capture data is extended with a
start_new/end_new pair. When the vast majority of our code wants
to check if a given capture buffer is defined they first check
"start_new/end_new", if either is -1 then they fall back to
whatever is in start/end.

When a capture buffer is CLOSEd the data is written into the
start_new/end_new pair instead of the start/end pair. When a CURLYX
loop is executing and has matched something (at least one "A" in
/A*B/ -- thus actually in WHILEM) it "commits" the start_new/end_new
data by writing it into start/end. When we begin a new iteration of
the loop we clear the start_new/end_new pairs that are contained by
the loop, by setting them to -1. If the loop fails then we roll back
as we used to. If the loop succeeds we continue. When we hit an END
block we commit everything.

Consider the example above. We start off with everything set to -1.

 $1 = (-1,-1):(-1,-1)
 $2 = (-1,-1):(-1,-1)
 $3 = (-1,-1):(-1,-1)

In the first iteration we have matched "foo" and end up with this:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):( 0, 3)
 $3 = (-1,-1):(-1,-1)

We commit the results of $2 and $3, and then clear the new data in
the beginning of the next loop:

 $1 = (-1,-1):( 0, 3)
 $2 = ( 0, 3):(-1,-1)
 $3 = (-1,-1):(-1,-1)

We then match "bar":

 $1 = (-1,-1):( 0, 3)
 $2 = ( 0, 3):(-1,-1)
 $3 = (-1,-1):( 3, 7)

and then commit the result and clear the new data:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):(-1,-1)
 $3 = ( 3, 7):(-1,-1)

and then we match "foo" again:

 $1 = (-1,-1):( 0, 3)
 $2 = (-1,-1):( 7,10)
 $3 = ( 3, 7):(-1,-1)

And we then commit. We do a regcppush here as normal.

 $1 = (-1,-1):( 0, 3)
 $2 = ( 7,10):( 7,10)
 $3 = (-1,-1):(-1,-1)

We then clear it again, but since we don't match when we regcppop
we store the buffers back to the above layout. When we finally
hit the END buffer we also do a commit as well on all buffers, including
the 0th (for the full match).

Fixes GH Issue #18865, and adds tests for it and other things.
  • Loading branch information
demerphq committed Jan 26, 2023
1 parent 4ef62b4 commit 0156ef9
Show file tree
Hide file tree
Showing 12 changed files with 235 additions and 58 deletions.
8 changes: 4 additions & 4 deletions pod/perldebguts.pod
Expand Up @@ -752,13 +752,13 @@ will be lost.
PLUS node Match this (simple) thing 1 or more times:
/A{1,}B/ where A is width 1 char

CURLY sv 2 Match this (simple) thing {n,m} times:
CURLY sv 4 Match this (simple) thing {n,m} times:
/A{m,n}B/ where A is width 1 char
CURLYN no 2 Capture next-after-this simple thing:
CURLYN no 4 Capture next-after-this simple thing:
/(A){m,n}B/ where A is width 1 char
CURLYM no 2 Capture this medium-complex thing {n,m}
CURLYM no 4 Capture this medium-complex thing {n,m}
times: /(A){m,n}B/ where A is fixed-length
CURLYX sv 2 Match/Capture this complex thing {n,m}
CURLYX sv 4 Match/Capture this complex thing {n,m}
times.

# This terminator creates a loop structure for CURLYX
Expand Down
8 changes: 6 additions & 2 deletions pp_ctl.c
Expand Up @@ -381,9 +381,9 @@ Perl_rxres_save(pTHX_ void **rsp, REGEXP *rx)

if (!p || p[1] < RX_NPARENS(rx)) {
#ifdef PERL_ANY_COW
i = 7 + (RX_NPARENS(rx)+1) * 2;
i = 7 + (RX_NPARENS(rx)+1) * 4;
#else
i = 6 + (RX_NPARENS(rx)+1) * 2;
i = 6 + (RX_NPARENS(rx)+1) * 4;
#endif
if (!p)
Newx(p, i, UV);
Expand All @@ -409,6 +409,8 @@ Perl_rxres_save(pTHX_ void **rsp, REGEXP *rx)
for (i = 0; i <= RX_NPARENS(rx); ++i) {
*p++ = (UV)RX_OFFSp(rx)[i].start;
*p++ = (UV)RX_OFFSp(rx)[i].end;
*p++ = (UV)RX_OFFSp(rx)[i].start_new;
*p++ = (UV)RX_OFFSp(rx)[i].end_new;
}
}

Expand Down Expand Up @@ -440,6 +442,8 @@ S_rxres_restore(pTHX_ void **rsp, REGEXP *rx)
for (i = 0; i <= RX_NPARENS(rx); ++i) {
RX_OFFSp(rx)[i].start = (I32)(*p++);
RX_OFFSp(rx)[i].end = (I32)(*p++);
RX_OFFSp(rx)[i].start_new = (I32)(*p++);
RX_OFFSp(rx)[i].end_new = (I32)(*p++);
}
}

Expand Down
13 changes: 13 additions & 0 deletions regcomp.c
Expand Up @@ -4612,6 +4612,7 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
const char * const origparse = RExC_parse;
I32 min;
I32 max = REG_INFTY;
I32 npar_before = RExC_npar-1;

/* Save the original in case we change the emitted regop to a FAIL. */
const regnode_offset orig_emit = RExC_emit;
Expand All @@ -4627,6 +4628,7 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
RETURN_FAIL_ON_RESTART_OR_FLAGS(flags, flagp, TRYAGAIN);
FAIL2("panic: regatom returned failure, flags=%#" UVxf, (UV) flags);
}
I32 npar_after = RExC_npar-1;

op = *RExC_parse;
switch (op) {
Expand Down Expand Up @@ -4792,6 +4794,17 @@ S_regpiece(pTHX_ RExC_state_t *pRExC_state, I32 *flagp, U32 depth)
ARG1_SET(REGNODE_p(ret), min);
ARG2_SET(REGNODE_p(ret), max);

/* if we had a npar_after then we need to increment npar_before,
* we want to track the range of parens we need to reset each iteration
*/
if (npar_after!=npar_before) {
ARG3_SET(REGNODE_p(ret), (U16)npar_before+1);
ARG4_SET(REGNODE_p(ret), (U16)npar_after);
} else {
ARG3_SET(REGNODE_p(ret), 0);
ARG4_SET(REGNODE_p(ret), 0);
}

done_main_op:

/* Process any greediness modifiers */
Expand Down
24 changes: 23 additions & 1 deletion regcomp.h
Expand Up @@ -217,7 +217,10 @@ struct regnode_2L {
I32 arg2;
};

/* 'Two field' -- Two 32 bit signed args */
/* 'Two field' -- Two 32 bit signed args.
* First fields must match regnode. Currently unused except to
* facilitate regnode_4 behavior. Not simplifying that as this
* node type could still be useful for other regops. */
struct regnode_2 {
U8 flags;
U8 type;
Expand All @@ -226,6 +229,19 @@ struct regnode_2 {
I32 arg2;
};

/* 'Four field' -- Two 32 bit signed args, Two 16 bit unsigned args
* Used for CURLY and CURLYX node types to track min/max and
* first_paren/last_paren. First fields must match regnode_2 */
struct regnode_4 {
U8 flags;
U8 type;
U16 next_off;
I32 arg1;
I32 arg2;
U16 arg3;
U16 arg4;
};

#define REGNODE_BBM_BITMAP_LEN \
/* 6 info bits requires 64 bits; 5 => 32 */ \
((1 << (UTF_CONTINUATION_BYTE_INFO_BITS)) / CHARBITS)
Expand Down Expand Up @@ -347,11 +363,15 @@ struct regnode_ssc {
#define ARGp(p) ARGp_VALUE_inline(p)
#define ARG1(p) ARG_VALUE(ARG1_LOC(p))
#define ARG2(p) ARG_VALUE(ARG2_LOC(p))
#define ARG3(p) ARG_VALUE(ARG3_LOC(p))
#define ARG4(p) ARG_VALUE(ARG4_LOC(p))
#define ARG2L(p) ARG_VALUE(ARG2L_LOC(p))

#define ARG_SET(p, val) ARG__SET(ARG_LOC(p), (val))
#define ARG1_SET(p, val) ARG__SET(ARG1_LOC(p), (val))
#define ARG2_SET(p, val) ARG__SET(ARG2_LOC(p), (val))
#define ARG3_SET(p, val) ARG__SET(ARG3_LOC(p), (val))
#define ARG4_SET(p, val) ARG__SET(ARG4_LOC(p), (val))
#define ARG2L_SET(p, val) ARG__SET(ARG2L_LOC(p), (val))
#define ARGp_SET(p, val) ARGp_SET_inline((p),(val))

Expand Down Expand Up @@ -437,6 +457,8 @@ struct regnode_ssc {
#define ARGp_BYTES_LOC(p) (((struct regnode_p *)p)->arg1_sv_ptr_bytes)
#define ARG1_LOC(p) (((struct regnode_2 *)p)->arg1)
#define ARG2_LOC(p) (((struct regnode_2 *)p)->arg2)
#define ARG3_LOC(p) (((struct regnode_4 *)p)->arg3)
#define ARG4_LOC(p) (((struct regnode_4 *)p)->arg4)
#define ARG2L_LOC(p) (((struct regnode_2L *)p)->arg2)


Expand Down
8 changes: 4 additions & 4 deletions regcomp.sym
Expand Up @@ -217,10 +217,10 @@ TAIL NOTHING, no ; Match empty string. Can jump here from outsi
STAR STAR, node 0 V ; Match this (simple) thing 0 or more times: /A{0,}B/ where A is width 1 char
PLUS PLUS, node 0 V ; Match this (simple) thing 1 or more times: /A{1,}B/ where A is width 1 char

CURLY CURLY, sv 2 V ; Match this (simple) thing {n,m} times: /A{m,n}B/ where A is width 1 char
CURLYN CURLY, no 2 V ; Capture next-after-this simple thing: /(A){m,n}B/ where A is width 1 char
CURLYM CURLY, no 2 V ; Capture this medium-complex thing {n,m} times: /(A){m,n}B/ where A is fixed-length
CURLYX CURLY, sv 2 V ; Match/Capture this complex thing {n,m} times.
CURLY CURLY, sv 4 V ; Match this (simple) thing {n,m} times: /A{m,n}B/ where A is width 1 char
CURLYN CURLY, no 4 V ; Capture next-after-this simple thing: /(A){m,n}B/ where A is width 1 char
CURLYM CURLY, no 4 V ; Capture this medium-complex thing {n,m} times: /(A){m,n}B/ where A is fixed-length
CURLYX CURLY, sv 4 V ; Match/Capture this complex thing {n,m} times.

#*This terminator creates a loop structure for CURLYX
WHILEM WHILEM, no 0 V ; Do curly processing and see if rest matches.
Expand Down
2 changes: 2 additions & 0 deletions regcomp_debug.c
Expand Up @@ -464,6 +464,8 @@ Perl_regprop(pTHX_ const regexp *prog, SV *sv, const regnode *o, const regmatch_
}
} else if (k == CURLY) {
U32 lo = ARG1(o), hi = ARG2(o);
if (ARG3(o) || ARG4(o))
Perl_sv_catpvf(aTHX_ sv, "<%d:%d>", ARG3(o),ARG4(o)); /* paren before, paren after */
if (op == CURLYM || op == CURLYN || op == CURLYX)
Perl_sv_catpvf(aTHX_ sv, "[%d]", o->flags); /* Parenth number */
Perl_sv_catpvf(aTHX_ sv, "{%u,", (unsigned) lo);
Expand Down

0 comments on commit 0156ef9

Please sign in to comment.