Skip to content

Commit

Permalink
regcomp.c etc - rework branch reset so it works properly
Browse files Browse the repository at this point in the history
Branch reset was hacked in without much thought about how it might interact
with other features. Over time we added named capture and recursive patterns
with GOSUB, but I guess because branch reset is somewhat esoteric we didnt
notice the accumulating issues related to it.

The main problem was my original hack used a fairly simple device to give
multiple OPEN/CLOSE opcodes the same target buffer id. When it was introduced
this was fine. When GOSUB was added later however, we overlooked at that this
broke a key part of the book-keeping for GOSUB.

A GOSUB regop needs to know where to jump to, and which close paren to stop
at. However the structure of the regexp program can change from the time the
regop is created. This means we keep track of every OPEN/CLOSE regop we
encounter during parsing, and when something is inserted into the middle of
the program we make sure to move the offsets we store for the OPEN/CLOSE data.
This is essentially keyed and scaled to the number of parens we have seen.
When branch reset is used however the number of OPEN/CLOSE regops is more than
the number of logical buffers we have seen, and we only move one of the
OPEN/CLOSE buffers that is in the branch reset. Which of course breaks things.

Another issues with branch reset is that it creates weird artifacts like this:
/(?|(?<a>a)|(?<b>b))(?&a)(?&b)/ where the (?&b) actually maps to the (?<a>a)
capture buffer because they both have the same id. Another case is that you
cannot check if $+{b} matched and $+{a} did not, because conceptually they
were the same buffer under the hood.

These bugs are now fixed. The "aliasing" of capture buffers to each other is
now done virtually, and under the hood each capture buffer is distinct. We
introduce the concept of a "logical parno" which is the user visible capture
buffer id, and keep it distinct from the true capture buffer id. Most of the
internal logic uses the "true parno" for its business, so a bunch of problems
go away, and we keep maps from logical to physical parnos, and vice versa,
along with a map that gives use the "next physical parno with the same
logical parno". Thus we can quickly skip through the physical capture buffers
to find the one that matched. This means we also have to introduce a
logical_total_parens as well, to complement the already existing total_parens.
The latter refers to the true number of capture buffers. The former represents
the logical number visible to the user.

It is helpful to consider the following table:

  Logical:    $1      $2     $3       $2     $3     $4     $2     $5
  Physical:    1       2      3        4      5      6      7      8
  Next:        0       4      5        7      0      0      0      0
  Pattern:   /(pre)(?|(?<a>a)(?<b>b)|(?<c>c)(?<d>d)(?<e>e)|(?<f>))(post)/

The names are mapped to physical buffers. So $+{b} will show what is in
physical buffer 3. But $3 will show whichever of buffer 3 or 5 matched.
Similarly @{^CAPTURE} will contain 5 elements, not 8. But %+ will contain all
6 named buffers.

Since the need to map these values is rare, we only store these maps when they
are needed and branch reset has been used, when they are NULL it is assumed
that physical and logical buffers are identical.

Currently the way this change is implemented will likely break plug in regexp
engines because they will be missing the new logical_total_parens field at
the very least. Given that the perl internals code is somewhat poorly
abstracted from the regexp engine, with parts of the abstraction leaking out,
I think this is acceptable. If we want to make plug in regexp engines work
properly IMO we need to add some more hooks that they need to implement than
we currently do. For instance mg.c does more work than it should. Given there
are only a few plug in regexp engines and that it is specialized work, I
think this is acceptable. We can work with the authors to refine the API
properly later.
  • Loading branch information
demerphq committed Jan 12, 2023
1 parent 17419a8 commit fe5492d
Show file tree
Hide file tree
Showing 11 changed files with 413 additions and 82 deletions.
8 changes: 7 additions & 1 deletion embed.fnc
Original file line number Diff line number Diff line change
Expand Up @@ -2811,8 +2811,14 @@ Cp |SV * |reg_named_buff_all \
|const U32 flags

: FIXME - is anything in re using this now?
EXp |void |reg_numbered_buff_fetch_flags \
|NN REGEXP * const re \
|const I32 paren \
|NULLOK SV * const sv \
|U32 flags
: FIXME - is anything in re using this now?
EXp |void |reg_numbered_buff_fetch \
|NN REGEXP * const rx \
|NN REGEXP * const re \
|const I32 paren \
|NULLOK SV * const sv
: FIXME - is anything in re using this now?
Expand Down
1 change: 1 addition & 0 deletions embed.h
Original file line number Diff line number Diff line change
Expand Up @@ -1688,6 +1688,7 @@
# define reg_named_buff(a,b,c,d) Perl_reg_named_buff(aTHX_ a,b,c,d)
# define reg_named_buff_iter(a,b,c) Perl_reg_named_buff_iter(aTHX_ a,b,c)
# define reg_numbered_buff_fetch(a,b,c) Perl_reg_numbered_buff_fetch(aTHX_ a,b,c)
# define reg_numbered_buff_fetch_flags(a,b,c,d) Perl_reg_numbered_buff_fetch_flags(aTHX_ a,b,c,d)
# define reg_numbered_buff_length(a,b,c) Perl_reg_numbered_buff_length(aTHX_ a,b,c)
# define reg_numbered_buff_store(a,b,c) Perl_reg_numbered_buff_store(aTHX_ a,b,c)
# define reg_qr_package(a) Perl_reg_qr_package(aTHX_ a)
Expand Down
52 changes: 32 additions & 20 deletions mg.c
Original file line number Diff line number Diff line change
Expand Up @@ -638,13 +638,15 @@ Perl_magic_regdata_cnt(pTHX_ SV *sv, MAGIC *mg)
const SSize_t n = (SSize_t)mg->mg_obj;
if (n == '+') { /* @+ */
/* return the number possible */
return RX_NPARENS(rx);
return RX_LOGICAL_NPARENS(rx);
} else { /* @- @^CAPTURE @{^CAPTURE} */
I32 paren = RX_LASTPAREN(rx);

/* return the last filled */
while ( paren >= 0 && !RX_OFFS_VALID(rx,paren) )
paren--;
if (paren && RX_PARNO_TO_LOGICAL(rx))
paren = RX_PARNO_TO_LOGICAL(rx)[paren];
if (n == '-') {
/* @- */
return (U32)paren;
Expand All @@ -665,32 +667,35 @@ int
Perl_magic_regdatum_get(pTHX_ SV *sv, MAGIC *mg)
{
PERL_ARGS_ASSERT_MAGIC_REGDATUM_GET;

if (PL_curpm) {
REGEXP * const rx = PM_GETRE(PL_curpm);
if (rx) {
const SSize_t n = (SSize_t)mg->mg_obj;
/* @{^CAPTURE} does not contain $&, so we need to increment by 1 */
const I32 paren = mg->mg_len
+ (n == '\003' ? 1 : 0);
SSize_t s;
SSize_t t;
if (paren < 0)
return 0;
if (paren <= (I32)RX_NPARENS(rx) &&
((s = RX_OFFS_START(rx,paren)) != -1) &&
((t = RX_OFFS_END(rx,paren)) != -1))
REGEXP * const rx = PL_curpm ? PM_GETRE(PL_curpm) : NULL;

if (rx) {
const SSize_t n = (SSize_t)mg->mg_obj;
/* @{^CAPTURE} does not contain $&, so we need to increment by 1 */
const I32 paren = mg->mg_len
+ (n == '\003' ? 1 : 0);
SSize_t s;
SSize_t t;
if (paren < 0)
return 0;
if (n != '+' && n != '-') {
CALLREG_NUMBUF_FETCH(rx,paren,sv);
return 0;
}
if (paren <= (I32)RX_LOGICAL_NPARENS(rx)) {
I32 true_paren = RX_LOGICAL_TO_PARNO(rx)
? RX_LOGICAL_TO_PARNO(rx)[paren]
: paren;
do {
if (((s = RX_OFFS_START(rx,true_paren)) != -1) &&
((t = RX_OFFS_END(rx,true_paren)) != -1))
{
SSize_t i;

if (n == '+') /* @+ */
i = t;
else if (n == '-') /* @- */
i = s;
else { /* @^CAPTURE @{^CAPTURE} */
CALLREG_NUMBUF_FETCH(rx,paren,sv);
return 0;
}

if (RX_MATCH_UTF8(rx)) {
const char * const b = RX_SUBBEG(rx);
Expand All @@ -703,6 +708,11 @@ Perl_magic_regdatum_get(pTHX_ SV *sv, MAGIC *mg)
sv_setuv(sv, i);
return 0;
}
if (RX_PARNO_TO_LOGICAL_NEXT(rx))
true_paren = RX_PARNO_TO_LOGICAL_NEXT(rx)[true_paren];
else
break;
} while (true_paren);
}
}
sv_set_undef(sv);
Expand Down Expand Up @@ -1095,6 +1105,8 @@ Perl_magic_get(pTHX_ SV *sv, MAGIC *mg)
case '\016': /* ^N */
if (PL_curpm && (rx = PM_GETRE(PL_curpm))) {
paren = RX_LASTCLOSEPAREN(rx);
if (RX_PARNO_TO_LOGICAL(rx))
paren = RX_PARNO_TO_LOGICAL(rx)[paren];
if (paren)
goto do_numbuf_fetch;
}
Expand Down
9 changes: 7 additions & 2 deletions proto.h

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit fe5492d

Please sign in to comment.